Deploy Preprocess in your ingestion pipeline
Preprocess splits documents into optimal chunks of text for use in language model tasks. If you want to learn more about the solution, check out what we are building and why.
Get your API key
Sign up at app.preprocess.co, buy a credit package, and get your API key here.
The steps of preprocessing
Preprocessing is time-intensive, so the API is asynchronous. The response to the parse API call confirms that the document was received correctly; when chunking is complete, the result is sent to the webhook you provided. If you cannot set up a webhook, you can retrieve the result via the GET API instead.
Every file parsing follows these steps:
- The parse API is called with a document
- Preprocess converts and splits the document into optimal chunks
- If a webhook was provided, it is called with the chunking result, which is also available to download via the GET API
A webhook lets you implement an asynchronous flow: when processing completes, your endpoint receives a POST call containing the result of the parsing. You can also retrieve the result via the GET API for the following 24 hours, even if you set a webhook parameter.
The result consists of a list of textual chunks ready to be indexed or embedded depending on your needs.
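As a rough illustration of the webhook side, the sketch below receives the POST call and pulls out the chunks using only the Python standard library. It assumes the webhook body is JSON with the same shape as the GET result shown in the quick start (a success flag and data.chunks); that is an assumption, not something confirmed here. The names chunks_from_payload, WebhookHandler and serve, and the port 8000, are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def chunks_from_payload(body: bytes) -> list:
    """Return the chunks from a result payload, or [] if parsing did not succeed.

    Assumes the payload carries "success" and "data" -> "chunks",
    like the GET result documented in the quick start.
    """
    payload = json.loads(body)
    if not payload.get("success"):
        return []
    return payload.get("data", {}).get("chunks", [])


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        chunks = chunks_from_payload(self.rfile.read(length))
        # Hypothetical hand-off: replace with your indexing or embedding step.
        print(f"received {len(chunks)} chunks")
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the webhook receiver until interrupted (port is an arbitrary choice)."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling serve() starts the receiver locally; in production you would typically put the same handler logic behind your existing web framework.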
Quick start
Start the parsing
curl --request POST \
--url 'https://chunk.ing/?repeat_title=true' \
--header 'Content-Type: multipart/form-data' \
--header 'x-api-key: your-api-key' \
--form 'file=@/your_file.ext'
Get the results
curl --request GET \
--url 'https://chunk.ing/get_result?id=process_id' \
--header 'accept: application/json' \
--header 'x-api-key: your-api-key'
Once the parsing is finished (success = true), you get the list of chunks, ready to be indexed or embedded.
{
  "status": "OK",
  "success": true,
  "message": "The file has been chunked successfully.",
  "info": {
    "file": {
      "name": string
    }
  },
  "data": {
    "process": {
      "id": string
    },
    "detected_language": string,
    "chunks": [
      "first chunk …",
      "second chunk …",
      "…",
      "…"
    ]
  }
}
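If you are not using a webhook, retrieving the result amounts to calling the GET API until success is true. The sketch below shows one way to do that with the Python standard library. The endpoint and headers are taken from the curl example above; the function names, the 5-second interval, and the attempt limit are hypothetical choices. It also assumes that an unfinished job answers with a JSON body whose success is false rather than with an HTTP error, which is not confirmed here.

```python
import json
import time
import urllib.parse
import urllib.request

RESULT_URL = "https://chunk.ing/get_result"


def fetch_result(process_id: str, api_key: str) -> dict:
    """Call the GET API once and return the decoded JSON body."""
    query = urllib.parse.urlencode({"id": process_id})
    request = urllib.request.Request(
        f"{RESULT_URL}?{query}",
        headers={"accept": "application/json", "x-api-key": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_chunks(fetch, interval=5.0, max_attempts=60, sleep=time.sleep):
    """Call fetch() until it reports success, then return the chunks.

    fetch is any zero-argument callable returning the result as a dict;
    interval and max_attempts are arbitrary defaults, not documented limits.
    """
    for _ in range(max_attempts):
        result = fetch()
        if result.get("success"):
            return result["data"]["chunks"]
        sleep(interval)
    raise TimeoutError("chunking did not finish in time")
```

A typical call would be wait_for_chunks(lambda: fetch_result("process_id", "your-api-key")). Passing the fetch step as a callable keeps the retry loop independent of the network, so it can be exercised with canned responses.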
Python SDK
If you prefer, you can use our Python SDK, a wrapper around the API. Check the documentation here.
from pypreprocess import Preprocess

# Placeholder: replace with your own API key
YOUR_API_KEY = "your-api-key"

# Initialize the SDK with a file
preprocess = Preprocess(api_key=YOUR_API_KEY, filepath="path/for/file")
# Chunk the file
preprocess.chunk()
preprocess.wait()
# Get the result
result = preprocess.result()
for chunk in result.data['chunks']:
    # Use the chunks
    print(chunk)