API Reference

Tailoring Document Chunking to Your Needs

When you use the Preprocess API to parse and chunk documents, query parameters let you control how chunks are built and what they contain. Each parameter below addresses a different use case, so you can combine them to match your documents and your downstream processing.

webhook

Specifies the URL to which the API sends a POST request once document chunking is complete.

Use Case

Enables asynchronous processing and integration with other systems or workflows, reducing the need for manual intervention.

In a document management system, using a webhook ensures seamless integration with subsequent processes like indexing or embedding.

--form 'webhook=https://your_webhook.url'

Possible Values: A valid URL where the API can send notifications upon completion.
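
To make the asynchronous flow concrete, here is a minimal sketch of a receiver for the completion notification, written with the Python standard library. This page does not describe the body of the POST request, so the sketch assumes it may be JSON and falls back to plain text otherwise. The names parse_notification, ChunkingWebhookHandler, make_server and the received list are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # notification bodies collected for later processing


def parse_notification(raw):
    """Decode a notification body.

    Assumption: the body may be JSON. The payload format is not documented
    here, so anything that is not valid JSON is kept as text.
    """
    try:
        return json.loads(raw)
    except ValueError:
        return raw.decode("utf-8", errors="replace")


class ChunkingWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the sender.
        length = int(self.headers.get("Content-Length", 0))
        received.append(parse_notification(self.rfile.read(length)))
        # Acknowledge receipt with an empty 200 response.
        self.send_response(200)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the default per-request logging


def make_server(port=8000):
    """Create a local server (hypothetical port) that uses the handler above."""
    return HTTPServer(("127.0.0.1", port), ChunkingWebhookHandler)


# To run the receiver: make_server(8000).serve_forever()
```

In a real deployment the receiver would need to be reachable at the public URL passed in the webhook parameter, and it would hand the result to the next step, such as indexing or embedding.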

language

Specifies the language of the document using ISO 639-1 language codes to aid accurate parsing.

Use Case

Ensures accurate handling and segmentation of text in specific languages, improving the reliability of downstream processing.

When parsing documents, specifying the language ensures consistent treatment of text segments.

--form 'language=en'

Possible Values: ISO 639-1 language codes (e.g., "en" for English, "fr" for French).

merge

Controls whether short chunks are merged with adjacent ones. When enabled, merging aims to bring each chunk as close to 512 tokens as possible while preserving the logical structure of the chunks.

Use Case

Maintains document structure and reduces the number of chunks needed for processing, ensuring efficient handling of lengthy documents.

In legal document processing, merging short sections ensures that each chunk remains manageable yet comprehensive, facilitating efficient embedding and search.

--form 'merge=true'

Possible Values: true or false, default false
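
The effect of merging can be illustrated with a short sketch. This is not the algorithm the Preprocess API uses, which is not described here. It is a simple greedy strategy that joins adjacent chunks while the combined size stays within a token budget, and it assumes whitespace-separated words as a stand-in for tokens. The function name merge_short_chunks is hypothetical.

```python
def merge_short_chunks(chunks, max_tokens=512):
    """Greedily merge adjacent chunks while staying within max_tokens.

    Assumption: tokens are approximated by whitespace-separated words.
    The tokenizer used by the API may count differently.
    """
    def count_tokens(text):
        return len(text.split())

    merged = []
    for chunk in chunks:
        # Join with the previous chunk only if the result fits the budget.
        if merged and count_tokens(merged[-1]) + count_tokens(chunk) <= max_tokens:
            merged[-1] = merged[-1] + "\n" + chunk
        else:
            merged.append(chunk)
    return merged


short_sections = ["Clause 1 text", "Clause 2 text", "Clause 3 has more text"]
print(merge_short_chunks(short_sections, max_tokens=6))
```

Because only neighbors are merged, the original order of sections is preserved, which is what keeps the merged chunks aligned with the document's structure.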

repeat_title

Determines whether the title of parent paragraphs or sections should be repeated in each chunk.

Use Case

Provides clear context within each chunk, enhancing embedding, retrieval and understanding of document content.

Repeating section titles in technical manuals helps identify the topic and content of each parsed chunk.

--form 'repeat_title=true'

Possible Values: true or false, default false

repeat_table_header

Specifies whether table headers are repeated in every chunk that contains part of a table.

Use Case

Enhances readability and understanding of segmented table data by ensuring consistent presentation.

Repeating table headers in financial reports allows LLMs to better understand table structure across segmented chunks.

--form 'repeat_table_header=true'

Possible Values: true or false, default false
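
The following sketch shows why repeating the header matters when a table is split across chunks. It is only an illustration of the idea, not the API's implementation: the function split_table, the pipe-separated row format, and the rows_per_chunk parameter are all assumptions made for the example.

```python
def split_table(header, rows, rows_per_chunk, repeat_header=True):
    """Split table rows into chunks, optionally repeating the header row.

    Without repetition, only the first chunk carries the column names.
    """
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        part = rows[start:start + rows_per_chunk]
        include_header = repeat_header or start == 0
        lines = ([header] if include_header else []) + part
        chunks.append("\n".join(lines))
    return chunks


header = "Quarter | Revenue"
rows = ["Q1 | 10", "Q2 | 12", "Q3 | 15"]
for chunk in split_table(header, rows, rows_per_chunk=2):
    print(chunk)
    print("---")
```

With repeat_header=False the second chunk would contain only "Q3 | 15", leaving a reader or an LLM without the column names needed to interpret the values.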

table_output_format

Defines the output format for tables within parsed document chunks.

Use Case

Provides flexibility in how table data is provided downstream, catering to different system requirements or user preferences.

Integrating parsed data into a database might require table data in a specific format like Markdown or HTML for consistent rendering or analysis.

--form 'table_output_format=markdown'

Possible Values: text, html, or markdown, default text

keep_header

Determines whether the content of headers (e.g., page numbers, document titles, section titles) should be retained in the parsed chunks.

Use Case

Allows for customized handling of header content, useful when headers contain critical contextual information that needs to be preserved.

Setting keep_header=false removes distracting header content from parsed chunks, focusing only on the main body text.

--form 'keep_header=false'

Possible Values: true or false, default true

smart_header

When keep_header=true, controls whether only relevant headers are included in parsed chunks, with non-essential header information ignored.

Use Case

Enhances parsing precision by focusing on headers that contribute directly to document content, improving overall data relevance.

When smart_header=true, only section or paragraph titles that are part of the body content are retained in parsed chunks, increasing precision and reliability.

--form 'smart_header=true'

Possible Values: true or false, default true
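
Since smart_header only applies when keep_header=true, the two parameters combine into three outcomes. The sketch below spells out that precedence as read from the descriptions above. The function header_mode and its return labels are hypothetical and exist only to make the combinations explicit.

```python
def header_mode(keep_header=True, smart_header=True):
    """Summarize how keep_header and smart_header combine.

    Defaults mirror the documented defaults (both true).
    """
    if not keep_header:
        return "none"      # header content is removed; smart_header has no effect
    if smart_header:
        return "relevant"  # only headers that belong to the body content are kept
    return "all"           # every header is kept, including page numbers and the like


for keep in (True, False):
    for smart in (True, False):
        print(f"keep_header={keep}, smart_header={smart}: {header_mode(keep, smart)}")
```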

keep_footer

Specifies whether the content of footers (e.g., page numbers, footnotes) should be included in the parsed chunks.

Use Case

Offers flexibility in handling footer content, useful when footnotes are integral to document context or analysis.

Setting keep_footer=true ensures that all document elements, including footnotes, are captured in parsed chunks, maintaining document completeness.

--form 'keep_footer=true'

Possible Values: true or false, default false

image_text

Controls whether text contained within images should be extracted and included in the parsed chunks.

Use Case

Enhances document comprehension by including text from images, useful in documents where images contain critical textual information.

Setting image_text=true ensures that textual content embedded within charts or diagrams is captured in parsed chunks, providing comprehensive document analysis.

--form 'image_text=true'

Possible Values: true or false, default false


Practical Application

Consider parsing a technical document that includes headers, footers, and embedded images. Here’s how you might configure the API call to suit your needs:

curl --location \
--request POST 'https://chunk.ing/?webhook=https://your_webhook.url' \
--header 'Content-Type: multipart/form-data' \
--header 'x-api-key: your_api_key' \
--form 'file=@"/your_file.pdf"' \
--form 'repeat_title=true' \
--form 'table_output_format=html' \
--form 'keep_header=true' \
--form 'smart_header=true' \
--form 'image_text=true'

In this example:

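  • Titles are repeated for each chunk (repeat_title=true) to give each chunk its own context.
  • Tables are returned as HTML (table_output_format=html).
  • Header content is kept (keep_header=true), but only relevant headers are retained (smart_header=true), enhancing accuracy.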
  • Text from images is extracted (image_text=true), providing a comprehensive view of the document's content.
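
If you assemble requests from code rather than from the command line, a small helper can validate the options before sending them. The sketch below builds the form fields for the parameters documented on this page, using the defaults listed above. The helper build_form_fields is hypothetical, and omitting parameters that are left at their default is a design choice of this sketch rather than a requirement of the API.

```python
# Boolean parameters and their documented defaults.
BOOLEAN_DEFAULTS = {
    "merge": False,
    "repeat_title": False,
    "repeat_table_header": False,
    "keep_header": True,
    "smart_header": True,
    "keep_footer": False,
    "image_text": False,
}
TABLE_FORMATS = ("text", "html", "markdown")


def build_form_fields(table_output_format="text", language=None, webhook=None, **flags):
    """Return a dict of form fields, leaving out values equal to the defaults."""
    unknown = set(flags) - set(BOOLEAN_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    if table_output_format not in TABLE_FORMATS:
        raise ValueError(f"table_output_format must be one of {TABLE_FORMATS}")

    fields = {}
    for name, default in BOOLEAN_DEFAULTS.items():
        value = flags.get(name, default)
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be True or False")
        if value != default:
            fields[name] = "true" if value else "false"
    if table_output_format != "text":
        fields["table_output_format"] = table_output_format
    if language is not None:
        fields["language"] = language
    if webhook is not None:
        fields["webhook"] = webhook
    return fields


print(build_form_fields(repeat_title=True, table_output_format="html",
                        keep_header=True, smart_header=True, image_text=True))
```

For the call shown above, keep_header and smart_header are already at their defaults, so the helper emits only repeat_title, image_text, and table_output_format. The resulting dict can be passed as form data to the HTTP client of your choice, together with the file and the x-api-key header.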

By adjusting these parameters to match your documents and processing goals, you can tune how the Preprocess API parses and segments them.