Build an ingestion pipeline that maximizes RAG performance

Preprocess converts and splits complex documents (PDF, Word, PowerPoint, Excel, HTML, and plain text) into optimal chunks of text via a simple API.
We handle preprocessing complexities, so you can focus on what matters.

Get Started here

Why

Poor document chunking degrades retrieval quality and, in turn, the answers an LLM produces.
The most common chunking method splits documents on a fixed word count, regardless of where sentences, paragraphs, or sections begin and end.
As a result, the LLM is fed irrelevant or truncated context, which leads to poor RAG performance and more hallucinations.
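
To make the problem concrete, here is a minimal sketch of fixed word count chunking. The function name `chunk_by_word_count` and the sample text are illustrative assumptions, not part of Preprocess. It shows how a fixed window cuts through sentences and mixes unrelated topics in one chunk.

```python
def chunk_by_word_count(text: str, words_per_chunk: int) -> list[str]:
    """Split text into chunks of at most words_per_chunk words, ignoring layout."""
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


# Hypothetical sample: two unrelated policy sections in one document.
sample = (
    "Refund policy. Refunds are issued within 30 days of purchase. "
    "Shipping policy. Orders ship within 2 business days."
)

chunks = chunk_by_word_count(sample, words_per_chunk=8)
for chunk in chunks:
    print(repr(chunk))
```

The second chunk starts in the middle of the refund sentence and ends in the middle of the shipping sentence, so a query about either policy retrieves text about both.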

How we can help

Our API accurately parses long, complex documents to create RAG-ready data with unmatched precision.
It splits files into optimal text chunks following the original document layout and content semantics, ensuring that each portion is perfectly crafted for embedding, indexing, and retrieval.
This guarantees that LLMs receive the highest quality data to answer user queries accurately.

What's in it for you

Up to 10x increase in RAG performance.
Reduce operational costs: no more resources involved in developing and maintaining in-house chunking solutions.
Accelerate time-to-market for new features and create new revenue streams.

How we do that

Upload documents (PDFs, Office files, HTML, plain text) via the API or the Python SDK.
Preprocess parses the document following titles, sections, paragraphs, tables, images, and lists.
Receive the document optimally chunked, ready for indexing and importing into a vector database.

curl -X POST "https://chunk.ing" \
-H "x-api-key: your_api_key" \
-H "Content-Type: multipart/form-data" \
-F "file=@/your_file.ext"

from pypreprocess import Preprocess

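# Replace the placeholder with the API key from your Preprocess account.
YOUR_API_KEY = "your_api_key"

preprocess = Preprocess(filepath="path/to/file", api_key=YOUR_API_KEY)
preprocess.chunk()
preprocess.wait()
result = preprocess.result()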

Features

Feature	Plan
Intelligent Chunking	All plans
PDF/MS Office/Open Office/HTML/Plain text Support	All plans
Scanned PDF Support	All plans
Parallel tasks	All plans
Image text extraction	All plans
Image export	Enterprise
Chunk boundaries	Enterprise
Zero Data Retention Agreements	Enterprise
On-Prem Deployments	Enterprise
Custom Processing Pipelines	Enterprise

How to get started

Visit preprocess.co and sign up via email to get access to the free playground environment and to your API key.

To deploy the solution, check out Getting started.

Support

For any issues, feedback, or requests for information, please write to us at support@preprocess.co

Pricing

We don't have monthly fees. You buy a package of credits, and once the credits run out you can purchase another package.

10,000 credits -> $300
50,000 credits -> $1,250
250,000 credits -> $5,000
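
As a quick comparison of the packages, the sketch below computes the effective price per credit from the figures listed above. The names `PACKAGES` and `price_per_credit` are illustrative, not part of any Preprocess tooling.

```python
# Credit packages as listed above: (credits, price in USD).
PACKAGES = [(10_000, 300), (50_000, 1250), (250_000, 5000)]


def price_per_credit(credits: int, price_usd: float) -> float:
    """Return the price of a single credit in USD for a given package."""
    if credits <= 0:
        raise ValueError("credits must be positive")
    return price_usd / credits


for credits, price in PACKAGES:
    unit_price = price_per_credit(credits, price)
    print(f"{credits:>7,} credits: ${unit_price:.3f} per credit")
```

Larger packages lower the unit price: $0.030, $0.025, and $0.020 per credit respectively.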

How credits work

We identify 3 types of files: Document, Presentation, and Spreadsheet.
When a file is parsed, we determine its type based on its extension and content:

PDF files can be either Document or Presentation.

Word and Writer files are classified as Document.

PowerPoint and Impress files are classified as Presentation.

Excel and Calc files are classified as Spreadsheet.

.txt files are classified as Document.

.html files are classified as Document.

For Document and Presentation types, 1 credit = 1 page processed.
For Spreadsheet types, 1 credit = 1 sheet processed.
.html and .txt files are converted to PDF in advance, and pages are calculated based on the converted file.
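
The rules above can be sketched as a rough credit estimator. This is an illustration under assumptions, not the service's actual logic: the extension list is our guess at the common extensions for the formats named above, the function names are hypothetical, and the real classification also inspects file content. The caller supplies the page or sheet counts; for .html and .txt files that means the page count of the converted PDF.

```python
from pathlib import Path

# Assumed extensions for the formats named above (not an official list).
TYPE_BY_EXTENSION = {
    ".doc": "Document", ".docx": "Document", ".odt": "Document",
    ".txt": "Document", ".html": "Document",
    ".ppt": "Presentation", ".pptx": "Presentation", ".odp": "Presentation",
    ".xls": "Spreadsheet", ".xlsx": "Spreadsheet", ".ods": "Spreadsheet",
}


def classify(filename: str) -> str:
    """Guess the file type from the extension alone."""
    ext = Path(filename).suffix.lower()
    if ext == ".pdf":
        # A PDF can be either type depending on content; both cost 1 credit per page.
        return "Document or Presentation"
    if ext not in TYPE_BY_EXTENSION:
        raise ValueError(f"unsupported extension: {ext!r}")
    return TYPE_BY_EXTENSION[ext]


def estimate_credits(files: dict[str, int]) -> int:
    """files maps a filename to its page count (sheet count for spreadsheets)."""
    total = 0
    for name, units in files.items():
        classify(name)  # raises ValueError for unsupported files
        total += units  # 1 credit per page or per sheet
    return total


# Hypothetical batch of files with their page or sheet counts.
batch = {"report.pdf": 42, "slides.pptx": 15, "budget.xlsx": 3, "notes.txt": 2}
print(estimate_credits(batch))
```

For this batch the estimate is 62 credits: 42 + 15 pages, 3 sheets, and 2 converted pages.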
