API Reference

Deploy Preprocess in your ingestion pipeline

Preprocess splits documents into optimal chunks of text for use in language model tasks. To learn more about the solution, check out what we are building and why.

Get your API key

Sign up at app.preprocess.co, buy a credit package, and get your API key.

The steps of preprocessing

Preprocessing is time-intensive, so the API is asynchronous. The response to the ingest API call confirms that the document has been received correctly. When the chunking is completed, the result is sent to the webhook you provided. If you cannot set up a webhook, you can retrieve the result by polling the GET API instead.

Every ingestion follows these steps:

  1. The ingest API is called with a document
  2. Preprocess converts and splits the document into optimal chunks
  3. If a webhook was provided, it is called with the chunking result. The result is also available to download via the GET API

A webhook lets you build an asynchronous flow: when the process is completed, your endpoint receives a POST call with the result of the ingestion. You can also retrieve the result via the GET API for the following 24 hours, even if you set a webhook parameter.

The result consists of a list of textual chunks ready to be indexed or embedded depending on your needs.
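
As a rough illustration of the webhook side, the sketch below is a minimal receiver built on Python's standard library. It assumes the POST body is JSON with the same shape as the GET result shown in the quick start (a success flag and a data.chunks list). The payload format, the port, and the extract_chunks helper are assumptions made for illustration and are not part of the documented API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def extract_chunks(payload) -> list:
    """Return the chunk list from a result payload, or [] if it is missing.

    Assumption: the webhook body mirrors the GET result,
    i.e. {"success": true, "data": {"chunks": [...]}}.
    """
    if not isinstance(payload, dict) or not payload.get("success"):
        return []
    data = payload.get("data") or {}
    return list(data.get("chunks") or [])


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and decode the JSON body sent when the ingestion completes.
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        chunks = extract_chunks(payload)
        print(f"received {len(chunks)} chunks")  # index or embed them here
        # Acknowledge receipt with an empty 200 response.
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    # Hypothetical port; expose this endpoint at the URL you register as the webhook.
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling serve() starts the receiver. In production you would typically run the handler behind your existing web framework and verify that the request really comes from Preprocess.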

Webhook Setup


Polling Setup


Quick start

Start the ingestion

curl --request POST \
     --url 'https://chunk.ing/?repeat_title=true' \
     --header 'Content-Type: multipart/form-data' \
     --header 'x-api-key: your-api-key' \
     --form 'file=@/your_file.ext'
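
If you would rather issue the same request from Python without extra dependencies, the following sketch assembles the multipart upload with the standard library. The URL, the file field, and the x-api-key header come from the curl call above. The build_ingest_request helper and the application/octet-stream part type are illustrative assumptions. The function only builds the request, because the shape of the ingest response is not described in this section.

```python
import uuid
import urllib.request
from pathlib import Path

INGEST_URL = "https://chunk.ing/?repeat_title=true"  # same URL as the curl call


def build_ingest_request(filepath: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the multipart POST that uploads one file."""
    path = Path(filepath)
    boundary = uuid.uuid4().hex
    # Hand-assembled multipart/form-data body with a single "file" field.
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",  # assumed part type
        path.read_bytes(),
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        INGEST_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": f"multipart/form-data; boundary={boundary}",
            "x-api-key": api_key,
        },
    )
```

To send it, pass the returned object to urllib.request.urlopen and read the response body.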

Get the results

curl --request GET \
     --url 'https://chunk.ing/get_result?id=process_id' \
     --header 'accept: application/json' \
     --header 'x-api-key: your-api-key'

Once the ingestion is finished (success = true), the response contains the list of chunks, ready to be indexed or embedded.

{
  "status": "OK",
  "success": true,
  "message": "The file has been chunked successfully.",
  "info": {
    "file": {
      "name": string
    }
  },
  "data": {
    "process": {
      "id": string
    },
    "detected_language": string,
    "chunks": [
      "first chunk …",
      "second chunk …",
      "…",
      "…"
    ]
  }
}
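
For setups without a webhook, a polling loop around the GET call might look like the sketch below. It assumes the endpoint returns JSON with success set to false while the ingestion is still running, which this section does not confirm. The poll_chunks and fetch_result helpers, the interval, and the attempt limit are illustrative choices.

```python
import json
import time
import urllib.parse
import urllib.request

RESULT_URL = "https://chunk.ing/get_result"  # endpoint from the GET call above


def fetch_result(process_id: str, api_key: str) -> dict:
    """Call the GET endpoint once and decode the JSON body."""
    url = f"{RESULT_URL}?{urllib.parse.urlencode({'id': process_id})}"
    req = urllib.request.Request(
        url, headers={"accept": "application/json", "x-api-key": api_key}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def poll_chunks(process_id, api_key, fetch=fetch_result, interval=5.0, max_attempts=60):
    """Poll until the result reports success, then return data.chunks.

    Assumption: an unfinished ingestion yields a JSON body without success = true.
    """
    for _ in range(max_attempts):
        result = fetch(process_id, api_key)
        if result.get("success"):
            return result["data"]["chunks"]
        time.sleep(interval)
    raise TimeoutError(f"no result for process {process_id} after {max_attempts} attempts")
```

The fetch parameter exists so the loop can be exercised with a stub. Note that this sketch does not distinguish a failed ingestion from one that is still in progress, so a real client should also inspect the status and message fields.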

Python SDK

If you prefer, you can use our Python SDK, a wrapper around the API. See the SDK documentation for details.

from pypreprocess import Preprocess

# Replace with your own API key
YOUR_API_KEY = "your-api-key"

# Initialize the SDK with a file
preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/for/file")

# Chunk the file
preprocess.chunk()
preprocess.wait()

# Get the result
result = preprocess.result()
for chunk in result.data['chunks']:
    # Use the chunks, e.g. index or embed them
    print(chunk)