API Reference

Deploy Preprocess in your ingestion pipeline

Preprocess splits documents into optimal chunks of text for use in language model tasks. To learn more about the solution, see what we are building and why.

Get your API key

Sign up at app.preprocess.co, buy a credit package, and get your API key.

The steps of preprocessing

Preprocessing is time-intensive, so the API is asynchronous. The response to the parse API call confirms that the document was received correctly; when chunking completes, the result is sent to the webhook you provided. If you cannot set up a webhook, you can retrieve the result via the GET API instead.

Every file parsing follows these steps:

  1. The parse API is called with a document
  2. Preprocess converts and splits the document into optimal chunks
  3. If a webhook was provided, it is called with the chunking result; the result is also available to download via the GET API

A webhook lets you build an asynchronous flow: when processing completes, your endpoint receives a POST call containing the result of the parsing. The result also remains retrievable via the GET API for the following 24 hours, even if you set the webhook parameter.

The result consists of a list of textual chunks ready to be indexed or embedded depending on your needs.
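The webhook flow described above can be sketched with Python's standard library. This is a minimal illustration, not the official integration: it assumes the webhook POST body is JSON with the same shape as the GET result shown in the quick start (a `success` flag and `data.chunks`), which this page does not explicitly confirm. The names `chunks_from_payload`, `PreprocessWebhook`, and `serve` are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def chunks_from_payload(raw_body: bytes) -> list:
    """Return the chunk list from a webhook body, or [] if it is unusable.

    Assumption: the body is JSON shaped like the GET result
    ({"success": ..., "data": {"chunks": [...]}}).
    """
    try:
        payload = json.loads(raw_body)
    except ValueError:
        return []
    if not isinstance(payload, dict) or not payload.get("success"):
        return []
    return payload.get("data", {}).get("chunks", [])


class PreprocessWebhook(BaseHTTPRequestHandler):
    """Hypothetical endpoint that receives the POST call from Preprocess."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        chunks = chunks_from_payload(self.rfile.read(length))
        # Index or embed the chunks here; this sketch only reports the count.
        print(f"received {len(chunks)} chunks")
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Block forever, answering webhook calls on the given port."""
    HTTPServer(("", port), PreprocessWebhook).serve_forever()
```

To try it, call `serve()` on a publicly reachable host and pass that URL as the webhook parameter of the parse call.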

Quick start

Start the parsing

curl --request POST \
     --url 'https://chunk.ing/?repeat_title=true' \
     --header 'Content-Type: multipart/form-data' \
     --header 'x-api-key: your-api-key' \
     --form 'file=@/your_file.ext'
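For pipelines that would rather not shell out to curl, the same upload can be approximated in Python with only the standard library. Treat this as a sketch under assumptions: the endpoint, the `repeat_title` query parameter, the `x-api-key` header, and the `file` form field come from the curl call above, while the helper names `build_multipart` and `start_parsing` are hypothetical, and the response is assumed to be JSON (its exact shape is not documented on this page).

```python
import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, content: bytes):
    """Encode one file as a multipart/form-data body.

    Returns (body, content_type); the content type carries the boundary.
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def start_parsing(path: str, api_key: str, repeat_title: bool = True) -> dict:
    """POST a document to the parse API and return the decoded response.

    Assumption: the confirmation response is JSON.
    """
    file_path = Path(path)
    body, content_type = build_multipart(
        "file", file_path.name, file_path.read_bytes()
    )
    flag = "true" if repeat_title else "false"
    request = urllib.request.Request(
        f"https://chunk.ing/?repeat_title={flag}",
        data=body,
        method="POST",
        headers={"Content-Type": content_type, "x-api-key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A call such as `start_parsing("/your_file.ext", "your-api-key")` mirrors the curl command; note that `urlopen` raises `urllib.error.HTTPError` on non-2xx responses.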

Get the results

curl --request GET \
     --url 'https://chunk.ing/get_result?id=process_id' \
     --header 'accept: application/json' \
     --header 'x-api-key: your-api-key'

Once the parsing is finished (success = true), the response contains the list of chunks, ready to be indexed or embedded.

{
  "status": "OK",
  "success": true,
  "message": "The file has been chunked successfully.",
  "info": {
    "file": {
      "name": string
    }
  },
  "data": {
    "process": {
      "id": string
    },
    "detected_language": string,
    "chunks": [
      "first chunk …",
      "second chunk …",
      "…",
      "…"
    ]
  }
}
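Without a webhook, a client has to call the GET API repeatedly until `success` is true. The following sketch shows one way to do that in Python. The URL and headers are taken from the curl call above; the function names, the retry count, and the delay are hypothetical, and the code assumes that an unfinished job still returns a JSON body whose `success` is not true (this page does not describe the in-progress response).

```python
import json
import time
import urllib.parse
import urllib.request


def fetch_result(process_id: str, api_key: str) -> dict:
    """Call the GET API once and return the decoded JSON response."""
    query = urllib.parse.urlencode({"id": process_id})
    request = urllib.request.Request(
        f"https://chunk.ing/get_result?{query}",
        headers={"accept": "application/json", "x-api-key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_chunks(fetch, attempts: int = 30, delay: float = 10.0) -> list:
    """Poll `fetch` until the response reports success, then return the chunks.

    `fetch` is any zero-argument callable returning the response dict,
    which keeps the loop independent of the network layer.
    """
    for attempt in range(attempts):
        result = fetch()
        if result.get("success"):
            return result["data"]["chunks"]
        if attempt < attempts - 1:
            time.sleep(delay)  # wait before asking again
    raise TimeoutError("result not ready after polling")
```

Typical use would be `chunks = wait_for_chunks(lambda: fetch_result("process_id", "your-api-key"))`. Keep in mind that results are only retrievable for 24 hours.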

Python SDK

If you prefer, you can use our Python SDK, a wrapper around the API. See the SDK documentation for details.

from pypreprocess import Preprocess

# Initialize the SDK with your API key and the path of the file to chunk
preprocess = Preprocess(api_key="your-api-key", filepath="path/for/file")

# Chunk the file
preprocess.chunk()
preprocess.wait()

# Get the result
result = preprocess.result()
for chunk in result.data['chunks']:
    # Use the chunks, e.g. index or embed each one
    print(chunk)