Build an ingestion pipeline that maximizes RAG performance
Preprocess converts and splits complex documents into optimal chunks of text via a simple API.
We handle preprocessing complexities, so you can focus on what matters.
Why
Poor document chunking significantly degrades RAG results.
The most common chunking method splits documents on a fixed word count, regardless of where sentences, paragraphs, or sections begin and end.
As a result, chunks mix unrelated content and feed the LLM irrelevant information, which leads to poor RAG performance and more hallucinations.
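To make the problem concrete, here is a minimal sketch of fixed word count splitting. The function name `split_fixed_words` and the sample text are illustrative assumptions, not part of Preprocess.

```python
def split_fixed_words(text: str, words_per_chunk: int) -> list[str]:
    """Naive chunking: cut every `words_per_chunk` words, ignoring layout."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


# Hypothetical sample: two short sections, each with a title and one sentence.
sample = (
    "Refund policy. Refunds are issued within 30 days. "
    "Shipping. Orders ship in 2 business days."
)

chunks = split_fixed_words(sample, words_per_chunk=6)
for chunk in chunks:
    print(repr(chunk))
```

The second chunk comes out as `'30 days. Shipping. Orders ship in'`: it starts with the tail of the refund sentence and ends halfway through the shipping sentence, so neither topic is retrievable as a whole.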
How we can help
Our API accurately parses long, complex documents to create RAG-ready data with unmatched precision.
It splits files into optimal text chunks following the original document layout and content semantics, ensuring that each portion is perfectly crafted for embedding, indexing, and retrieval.
This guarantees that LLMs receive the highest quality data to answer user queries accurately.
What's in it for you
- Up to 10x increase in RAG performance.
- Reduce operational costs: no more resources involved in developing and maintaining in-house chunking solutions.
- Accelerate time-to-market for new features and create new revenue streams.
How we do that
Upload your documents (PDFs, Office files, HTML, plain text) via the API or the Python SDK.
Preprocess parses the document following titles, sections, paragraphs, tables, images, and lists.
Receive the document optimally chunked, ready for indexing and importing into a vector database.
curl -X POST "https://chunk.ing" \
-H "x-api-key: your_api_key" \
-H "Content-Type: multipart/form-data" \
-F "file=@/your_file.ext"
from pypreprocess import Preprocess
preprocess = Preprocess(filepath="path/to/file", api_key="your_api_key")
preprocess.chunk()
preprocess.wait()
result = preprocess.result()
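Before the chunks can be imported into a vector database, each one usually needs an id and some metadata. The sketch below shows one way to do that, assuming the chunks are available as a plain list of strings. The exact shape of the SDK result is not described here, so the helper `to_records`, its field names, and the sample list are all hypothetical.

```python
import hashlib


def to_records(chunks: list[str], source: str) -> list[dict]:
    """Turn text chunks into records with stable ids and basic metadata."""
    records = []
    for position, text in enumerate(chunks):
        # Hash source + position + text so that re-ingesting the same
        # file produces the same ids (useful for idempotent upserts).
        key = f"{source}:{position}:{text}".encode("utf-8")
        digest = hashlib.sha256(key).hexdigest()
        records.append(
            {
                "id": digest[:16],
                "source": source,
                "position": position,
                "text": text,
            }
        )
    return records


# Stand-in for the chunks returned by the API (hypothetical sample data).
example_chunks = ["Introduction ...", "Pricing ..."]
records = to_records(example_chunks, source="path/to/file")
print(len(records), records[0]["position"], records[1]["position"])
```

Deterministic ids are a design choice rather than a requirement: they let you re-run ingestion on an updated file and overwrite existing entries instead of duplicating them.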
Features
Feature | Plan |
---|---|
Intelligent Chunking | All plans |
PDF/MS Office/Open Office/HTML/Plain text Support | All plans |
Scanned PDF Support | All plans |
Parallel tasks | All plans |
Image text extraction | All plans |
Image export | Enterprise |
Chunk boundaries | Enterprise |
Zero Data Retention Agreements | Enterprise |
On-Prem Deployments | Enterprise |
Custom Processing Pipelines | Enterprise |
How to get started
Visit preprocess.co and sign up via email to get access to the free playground environment and to your API key.
To deploy the solution, check out Getting started.
Support
For any issue, feedback, or request for information, please write to us at [email protected]
Pricing
We don't charge monthly fees. You buy a package of credits and, once the credits run out, you can purchase another package.
- 10,000 credits -> $300
- 50,000 credits -> $1,250
- 250,000 credits -> $5,000
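The effective price per credit drops as the package grows. A short sketch of the arithmetic, using only the three packages listed above (the dictionary and function names are illustrative):

```python
# Credits per package -> package price in USD, as listed above.
PACKAGES = {10_000: 300, 50_000: 1_250, 250_000: 5_000}


def price_per_credit(credits: int) -> float:
    """Return the unit price in USD for one of the listed packages."""
    return PACKAGES[credits] / credits


for credits in PACKAGES:
    print(f"{credits:>7,} credits: ${price_per_credit(credits):.3f} per credit")
```

This works out to $0.030, $0.025, and $0.020 per credit respectively.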
How credits work
We identify three types of files: Document, Presentation, and Spreadsheet.
When a file is parsed, we determine its type based on its extension and content:
- PDF files can be either Document or Presentation.
- Word and Writer files are classified as Document.
- PowerPoint and Impress files are classified as Presentation.
- Excel and Calc files are classified as Spreadsheet.
- .txt files are classified as Document.
- .html files are classified as Document.
For Document and Presentation types, 1 credit = 1 page processed.
For Spreadsheet types, 1 credit = 1 sheet processed.
.html and .txt files are converted to PDF in advance, and pages are calculated based on the converted file.
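The rules above can be turned into a rough cost estimate before uploading. This is a sketch under stated assumptions: the extension sets below are a guess at which extensions map to each application (the list of supported extensions is not given here), and `estimate_credits` is a hypothetical helper, not part of the SDK.

```python
from pathlib import Path

# Assumed extension mapping. Document and Presentation files are both
# billed per page, so PDFs need no further distinction for estimation.
PER_PAGE = {".pdf", ".doc", ".docx", ".odt", ".ppt", ".pptx", ".odp", ".txt", ".html"}
PER_SHEET = {".xls", ".xlsx", ".ods"}


def estimate_credits(filename: str, pages: int = 0, sheets: int = 0) -> int:
    """Estimate credits: 1 per page, or 1 per sheet for spreadsheets.

    For .txt and .html files, `pages` is the page count of the PDF
    they are converted to, which is only known after conversion.
    """
    extension = Path(filename).suffix.lower()
    if extension in PER_SHEET:
        return sheets
    if extension in PER_PAGE:
        return pages
    raise ValueError(f"Unrecognized file type: {extension!r}")


print(estimate_credits("report.pdf", pages=12))
print(estimate_credits("budget.xlsx", sheets=3))
```

A 12-page PDF is estimated at 12 credits and a 3-sheet workbook at 3 credits. For .txt and .html inputs the estimate is only as good as your guess of the converted page count.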