Use optimal chunks for downstream language tasks
When the Preprocess API or SDK finishes splitting your document, it provides a structured output that can be easily integrated into your downstream applications. Understanding how to access and handle this output is crucial, whether you are using a simple Python script, integrating into a vector database, or connecting with retrieval-augmented generation (RAG) tools.
What’s in the Output?
At the end of the chunking process, you receive:
- Process ID and Metadata: Details about the chunking process, such as the process_id and the name of the file processed.
- Detected Language: The language of the document, either identified automatically or as provided by you.
- Chunks: A list of textual segments representing your original document, divided based on the document’s structure and semantics. These chunks are designed to be ready for indexing in vector stores, fine-tuning, or direct LLM queries without further processing.
Depending on your approach, you’ll either receive the result directly via a webhook or by polling for completion using the SDK or API endpoints.
Receiving the Output by Webhook
If you provided a webhook URL when you started the chunking process, the Preprocess API will send a POST request to that URL once chunking is complete. The JSON payload will look like:
{
  "status": "OK",
  "success": true,
  "message": "The file has been chunked successfully.",
  "info": {
    "process": {"id": "your_process_id_here"},
    "file": {"name": "your_uploaded_file.ext"}
  },
  "data": {
    "detected_language": "en",
    "chunks": [
      "This is the first chunk...",
      "This is the second chunk...",
      "...and so forth."
    ]
  }
}
You can then parse this JSON in your application to store or further process the chunks.
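As a minimal sketch, a webhook receiver might look like the following. It assumes a Flask application and an endpoint path of /preprocess-webhook; both are illustrative choices, not requirements of Preprocess:
from flask import Flask, request

app = Flask(__name__)

@app.route("/preprocess-webhook", methods=["POST"])
def preprocess_webhook():
    payload = request.get_json()
    # Pull the identifiers and chunks out of the payload shown above
    process_id = payload["info"]["process"]["id"]
    chunks = payload["data"]["chunks"]
    # Store or forward the chunks here (vector database, queue, etc.)
    return {"received": len(chunks)}, 200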
Polling for Results via the SDK
If you prefer not to use a webhook, you can poll the API for results using the Python SDK. After starting the chunking:
from pypreprocess import Preprocess
preprocess = Preprocess(api_key="YOUR_API_KEY", filepath="path/to/your/file.pdf")
preprocess.chunk()
# Wait for the process to complete
result = preprocess.wait()
# Access chunks
chunks = result.data["chunks"]
Now, chunks is a list of text segments. You can store them in a vector database, pass them to a Large Language Model (LLM), or feed them into other tools.
Polling for Results via the API (Without the SDK)
If you’re not using the Python SDK, you can still poll for results via the API:
- Start the process using POST https://chunk.ing/?webhook=... (or without a webhook).
- Keep track of the returned process_id.
- Poll POST https://chunk.ing/get_result?id=your_process_id until the chunks are available.
Once the result is ready, you’ll receive a JSON payload similar to the one posted to the webhook. Extract the chunks and proceed as needed.
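As a rough sketch, polling with the requests library might look like this. The authentication header name and the completion check are assumptions; consult the API reference for the exact contract:
import time
import requests

process_id = "your_process_id_here"
headers = {"x-api-key": "YOUR_API_KEY"}  # header name is an assumption; check the API reference

while True:
    response = requests.post(
        "https://chunk.ing/get_result",
        params={"id": process_id},
        headers=headers,
    )
    payload = response.json()
    chunks = payload.get("data", {}).get("chunks")
    if chunks:  # assumes chunks appear in the payload only once processing is done
        break
    time.sleep(5)  # wait a few seconds before asking again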
Downstream Use Cases
Once you have the chunks, how you use them depends on your application’s needs:
- Storage in Vector Databases: If you’re using a vector database, you can embed each chunk using your chosen embedding model and store them as individual records. Later, query the database to find the most relevant chunks (see the sketch after this list).
- Direct Queries to LLMs: Send a chunk directly to your LLM as context. For example:
chunk = chunks[0]
# Send to LLM (pseudo-code)
response = llm.generate(prompt=f"Answer the question based on this context:\n{chunk}")
- Integrating with LangChain: LangChain can wrap chunks in its Document structure:
from langchain.schema import Document
docs = [Document(page_content=c) for c in chunks]
# docs can now be passed to LangChain embeddings or vector stores
- Integrating with LlamaIndex: In LlamaIndex, you can directly insert chunks into an index:
from llama_index import VectorStoreIndex, Document
index = VectorStoreIndex([])
for c in chunks:
    index.insert(Document(text=c))
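For the vector database case above, here is a minimal sketch assuming ChromaDB as the store with its built-in default embedding function; the collection name and query text are illustrative:
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="preprocess_chunks")
# Store each chunk as an individual record; ChromaDB embeds them with its default model
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
)
# Later, retrieve the chunks most relevant to a query
results = collection.query(query_texts=["What does the contract say about renewals?"], n_results=3)
relevant_chunks = results["documents"][0]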
By following these approaches, you can seamlessly handle the output of Preprocess and ready it for integration with various tools and frameworks. More detailed instructions on deeper integrations and advanced workflows are covered in their respective sections.