Build an ingestion pipeline that maximizes RAG performance

Preprocess converts and splits complex documents into optimal chunks of text via a simple API.
We handle preprocessing complexities, so you can focus on what matters.

Why

Poor document chunking degrades retrieval quality and, with it, the quality of generated answers.
The most common chunking method splits documents on a fixed word count, regardless of where sentences, paragraphs, or sections begin and end.
As a result, chunks mix unrelated content and feed the LLM irrelevant information, resulting in poor RAG performance and increased hallucinations.
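
To make the problem concrete, here is a minimal sketch of fixed word-count chunking. The function name `chunk_by_word_count` and the sample text are made up for illustration; this is the naive baseline described above, not how Preprocess works.

```python
def chunk_by_word_count(text: str, words_per_chunk: int) -> list[str]:
    """Split text into chunks of at most `words_per_chunk` words, ignoring layout."""
    if words_per_chunk <= 0:
        raise ValueError("words_per_chunk must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]


# Illustrative text with two unrelated sections.
sample = (
    "Refund policy. Refunds are issued within 30 days. "
    "Shipping policy. Orders ship within 2 business days."
)
chunks = chunk_by_word_count(sample, words_per_chunk=6)
for chunk in chunks:
    print(repr(chunk))
```

The second chunk comes out as "30 days. Shipping policy. Orders ship": it holds the tail of the refund section and the head of the shipping section, so a query about either topic retrieves text about the other.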

How we can help

Our API accurately parses long, complex documents to create RAG-ready data with unmatched precision.
It splits files into optimal text chunks following the original document layout and content semantics, ensuring that each portion is perfectly crafted for embedding, indexing, and retrieval.
This guarantees that LLMs receive the highest quality data to answer user queries accurately.

What's in it for you

  • Up to 10x increase in RAG performance.
  • Reduce operational costs: no more resources spent on developing and maintaining in-house chunking solutions.
  • Accelerate time-to-market for new features and create new revenue streams.

How we do that

Upload your documents (PDFs, Office files, HTML, plain text) via the API or the Python SDK.
Preprocess parses each document following its titles, sections, paragraphs, tables, images, and lists.
You receive the document optimally chunked, ready for indexing and importing into a vector database.

Via the API (curl):

curl -X POST "https://chunk.ing" \
-H "x-api-key: your_api_key" \
-H "Content-Type: multipart/form-data" \
-F "file=@/your_file.ext"

Via the Python SDK:

from pypreprocess import Preprocess

# Create a client for one file, start chunking, wait for completion, then fetch the result.
preprocess = Preprocess(filepath="path/to/file", api_key="your_api_key")
preprocess.chunk()
preprocess.wait()
result = preprocess.result()
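
If you prefer not to shell out to curl or install the SDK, the same upload can be sketched with the Python standard library alone. This is a hedged illustration: only the endpoint, the `x-api-key` header, and the `file` form field come from the curl example. The helper name `build_upload_request` is hypothetical, and response handling is left out because the response format is not described here.

```python
import mimetypes
import urllib.request
import uuid
from pathlib import Path

API_URL = "https://chunk.ing"  # endpoint taken from the curl example


def build_upload_request(filepath: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST with the file in a "file" field."""
    path = Path(filepath)
    boundary = uuid.uuid4().hex  # random delimiter between multipart sections
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{path.name}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    body = head + path.read_bytes() + tail
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "x-api-key": api_key,
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# To send it (assumption: the service answers a plain POST as in the curl example):
# request = build_upload_request("path/to/file.pdf", "your_api_key")
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Building the request separately from sending it keeps the upload easy to inspect and test without network access.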

Features

| Feature | Plan |
| --- | --- |
| Intelligent Chunking | All plans |
| PDF/MS Office/Open Office/HTML/Plain text Support | All plans |
| Scanned PDF Support | All plans |
| Parallel tasks | All plans |
| Image text extraction | All plans |
| Image export | Enterprise |
| Chunk boundaries | Enterprise |
| Zero Data Retention Agreements | Enterprise |
| On-Prem Deployments | Enterprise |
| Custom Processing Pipelines | Enterprise |

How to get started

Visit preprocess.co and sign up via email to get access to the free playground environment and to your API key.

To deploy the solution, see the Getting started guide.

Support

For any issues, feedback, or questions, please write to us at [email protected]

Pricing

We don't charge monthly fees. You buy a package of credits, and once the credits run out you can purchase another package.

  • 10,000 credits: $300
  • 50,000 credits: $1,250
  • 250,000 credits: $5,000
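
To compare packages, it can help to work out the effective price per credit. The short sketch below does that arithmetic from the listed prices; the names `PACKAGES` and `price_per_credit` are illustrative only.

```python
# Package sizes (credits) mapped to prices in USD, as listed above.
PACKAGES = {10_000: 300, 50_000: 1_250, 250_000: 5_000}


def price_per_credit(credits: int) -> float:
    """Return the USD price of a single credit for the given package size."""
    return PACKAGES[credits] / credits


for credits in PACKAGES:
    print(f"{credits:>7,} credits: ${price_per_credit(credits):.3f} per credit")
```

This works out to $0.030, $0.025, and $0.020 per credit for the three packages.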

How credits work

We identify three types of files: Document, Presentation, and Spreadsheet.

When a file is parsed, we determine its type based on the extension and content:

  • PDF files can be either Document or Presentation.
  • Word and Writer files are classified as Document.
  • PowerPoint and Impress files are classified as Presentation.
  • Excel and Calc files are classified as Spreadsheet.
  • .txt files are classified as Document.
  • .html files are classified as Document.

For Document and Presentation types, 1 credit = 1 page processed.

For Spreadsheet types, 1 credit = 1 sheet processed.

.html and .txt files are converted to PDF in advance, and pages are calculated based on the converted file.
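
The credit rules above can be sketched as a small estimator. This is an illustration under stated assumptions: the extension sets and the function name `estimate_credits` are hypothetical, the service also inspects file content, and for .html and .txt files the page count is that of the converted PDF, which you may not know in advance.

```python
from pathlib import Path

# Hypothetical extension lists for illustration; the real classification also uses content.
# A PDF may be a Document or a Presentation, but both cost 1 credit per page.
PAGE_BASED_EXTS = {".pdf", ".doc", ".docx", ".odt", ".txt", ".html", ".ppt", ".pptx", ".odp"}
SPREADSHEET_EXTS = {".xls", ".xlsx", ".ods"}


def estimate_credits(filename: str, pages: int = 0, sheets: int = 0) -> int:
    """Estimate credits: 1 per page for Document/Presentation, 1 per sheet for Spreadsheet."""
    ext = Path(filename).suffix.lower()
    if ext in SPREADSHEET_EXTS:
        return sheets
    if ext in PAGE_BASED_EXTS:
        return pages
    raise ValueError(f"unsupported file extension: {ext!r}")


# Example batch: a 42-page PDF, a 15-slide deck, and a workbook with 3 sheets.
batch = [
    ("report.pdf", {"pages": 42}),
    ("deck.pptx", {"pages": 15}),
    ("budget.xlsx", {"sheets": 3}),
]
total = sum(estimate_credits(name, **counts) for name, counts in batch)
print(total)  # 60
```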