Tailoring Document Chunking to Your Needs
When you use the Preprocess API to parse and chunk documents, query parameters let you control how the chunks are produced. Each parameter below addresses a different use case, so you can tune the output to your documents and to the downstream task.
webhook
Specifies the URL to which the API sends a POST request once document chunking is complete.
Use Case
Enables asynchronous processing and integration with other systems or workflows, reducing the need for manual intervention.
In a document management system, using a webhook ensures seamless integration with subsequent processes like indexing or embedding.
--form 'webhook=https://your_webhook.url'
Possible Values: A valid URL where the API can send notifications upon completion.
language
Specifies the language of the document using ISO 639-1 language codes to aid accurate parsing.
Use Case
Ensures accurate handling and segmentation of text in specific languages, improving the reliability of downstream processing.
When parsing documents, specifying the language ensures consistent treatment of text segments.
--form 'language=en'
Possible Values: ISO 639-1 language codes (e.g., "en" for English, "fr" for French).
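A malformed language code is easy to catch before the request is sent. The following is a minimal sketch, assuming a hypothetical helper named is_iso639_1_shape that is not part of the Preprocess API. It checks only the shape of the value (exactly two lowercase ASCII letters), not whether the code is actually assigned or supported by the service.

```shell
# Hypothetical helper, not part of the Preprocess API.
# Succeeds when the argument looks like an ISO 639-1 code:
# exactly two lowercase ASCII letters (e.g. "en", "fr").
# The letters are listed explicitly so the check does not depend on locale.
is_iso639_1_shape() {
  case "$1" in
    [abcdefghijklmnopqrstuvwxyz][abcdefghijklmnopqrstuvwxyz]) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_iso639_1_shape en` succeeds, while `is_iso639_1_shape eng` and `is_iso639_1_shape EN` fail.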
merge
Controls whether short chunks are merged with adjacent ones so that each chunk comes as close to 512 tokens as possible while the logical structure of the document is preserved.
Use Case
Maintains document structure and reduces the number of chunks needed for processing, ensuring efficient handling of lengthy documents.
In legal document processing, merging short sections ensures that each chunk remains manageable yet comprehensive, facilitating efficient embedding and search.
--form 'merge=true'
Possible Values: true or false, default false
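To build intuition for what merging does, here is a rough local sketch of a greedy merge. It is not the service's algorithm, which this document does not describe. The function name merge_short_chunks is hypothetical, chunks are assumed to arrive one per line, and whitespace-separated words stand in for tokens.

```shell
# Illustration only: greedy merge of adjacent chunks up to a size budget.
# Assumptions: one chunk per input line, words used as a stand-in for tokens.
# $1 = maximum words per merged chunk (must be a positive integer).
merge_short_chunks() {
  [ "$1" -gt 0 ] 2>/dev/null || { echo "error: max must be a positive integer" >&2; return 1; }
  awk -v max="$1" '
    # Flush the buffer when adding this chunk would exceed the budget.
    cnt > 0 && cnt + NF > max { print buf; buf = ""; cnt = 0 }
    # Append non-empty chunks to the buffer, keeping their original order.
    NF > 0 { buf = (cnt > 0) ? (buf " " $0) : $0; cnt += NF }
    # Emit whatever is left at the end of input.
    END { if (cnt > 0) print buf }
  '
}
```

With a budget of 4 words, the input chunks "a b", "c", "d e f" and "g" come out as "a b c" and "d e f g". A chunk that is already larger than the budget passes through unchanged, since this sketch only merges and never splits.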
repeat_title
Determines whether the title of parent paragraphs or sections should be repeated in each chunk.
Use Case
Provides clear context within each chunk, enhancing embedding, retrieval and understanding of document content.
Repeating section titles in technical manuals helps identify the topic and content of each parsed chunk.
--form 'repeat_title=true'
Possible Values: true or false, default false
repeat_table_header
Specifies whether a table's header row should be repeated in every chunk that contains part of that table.
Use Case
Enhances readability and understanding of segmented table data by ensuring consistent presentation.
Repeating table headers in financial reports allows LLMs to better understand table structure across segmented chunks.
--form 'repeat_table_header=true'
Possible Values: true or false, default false
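To picture what repeating the header changes, the sketch below splits a plain-text table into fixed-size row groups and prints the header at the top of each group. It illustrates the idea only: the function name split_table_with_header, the "---" separator between groups, and splitting by row count are all assumptions, not the API's actual behavior.

```shell
# Illustration only: repeat a table header in every group of rows.
# Assumptions: the first input line is the header, one row per line.
# $1 = rows per group (must be a positive integer).
split_table_with_header() {
  [ "$1" -gt 0 ] 2>/dev/null || { echo "error: rows per group must be a positive integer" >&2; return 1; }
  awk -v n="$1" '
    # Remember the header row and do not print it yet.
    NR == 1 { header = $0; next }
    # At the start of each group, print a separator (except before the
    # first group) followed by the header.
    (NR - 2) % n == 0 { if (NR > 2) print "---"; print header }
    # Print the data row itself.
    { print }
  '
}
```

Given a header line followed by three rows and a group size of 2, the output contains the header, two rows, a "---" separator, the header again, and the remaining row. Without the repeated header, the second group would contain values with no column names.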
table_output_format
Defines the output format for tables within parsed document chunks.
Use Case
Provides flexibility in how table data is provided downstream, catering to different system requirements or user preferences.
Integrating parsed data into a database might require table data in a specific format like Markdown or HTML for consistent rendering or analysis.
--form 'table_output_format=markdown'
Possible Values: text, html, or markdown, default text
keep_header
Determines whether the content of headers (e.g., page numbers, document titles, section titles) should be retained in the parsed chunks.
Use Case
Allows for customized handling of header content, useful when headers contain critical contextual information that needs to be preserved.
Setting keep_header=false removes distracting header content from parsed chunks, focusing only on the main body text.
--form 'keep_header=false'
Possible Values: true or false, default true
smart_header
When keep_header=true, controls whether only relevant headers are included in parsed chunks and non-essential header information is ignored.
Use Case
Enhances parsing precision by focusing on headers that contribute directly to document content, improving overall data relevance.
When smart_header=true, only section or paragraph titles that are part of the body content are retained in parsed chunks, increasing precision and reliability.
--form 'smart_header=true'
Possible Values: true or false, default true
keep_footer
Specifies whether the content of footers (e.g., page numbers, footnotes) should be included in the parsed chunks.
Use Case
Offers flexibility in handling footer content, useful when footnotes are integral to document context or analysis.
Setting keep_footer=true ensures that all document elements, including footnotes, are captured in parsed chunks, maintaining document completeness.
--form 'keep_footer=true'
Possible Values: true or false, default false
image_text
Controls whether text contained within images should be extracted and included in the parsed chunks.
Use Case
Enhances document comprehension by including text from images, useful in documents where images contain critical textual information.
Setting image_text=true ensures that textual content embedded within charts or diagrams is captured in parsed chunks, providing comprehensive document analysis.
--form 'image_text=true'
Possible Values: true or false, default false
Practical Application
Consider parsing a technical document that includes headers, footers, and embedded images. Here’s how you might configure the API call to suit your needs:
curl --location \
--request POST 'https://chunk.ing/?webhook=https://your_webhook.url' \
--header 'Content-Type: multipart/form-data' \
--header 'x-api-key: your_api_key' \
--form 'file=@"/your_file.pdf"' \
--form 'repeat_title=true' \
--form 'table_output_format=html' \
--form 'keep_header=true' \
--form 'smart_header=true' \
--form 'image_text=true'
In this example:
- Titles are repeated for each chunk (repeat_title=true) to enrich the meaning of each chunk.
- Tables are returned as HTML (table_output_format=html).
- Headers are kept (keep_header=true), but only the relevant ones are retained (smart_header=true), enhancing accuracy.
- Text from images is extracted (image_text=true), providing a comprehensive view of the document's content.
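When the set of options varies between calls, it can help to validate them before they reach curl. The following is a minimal sketch, assuming a hypothetical helper named chunk_form_args. It accepts only the parameter names and values listed on this page and prints the matching curl arguments, one per line. It sends nothing itself, and the service may accept values this sketch rejects.

```shell
# Hypothetical helper, not part of the Preprocess API.
# Validates key=value options against the parameters listed on this page and
# prints the corresponding curl arguments, one per line. Nothing is printed
# if any option is invalid. The body runs in a subshell "( ... )" so its
# variables do not leak into the caller.
chunk_form_args() (
  # First pass: validate every option.
  for opt in "$@"; do
    key="${opt%%=*}"
    value="${opt#*=}"
    case "$key" in
      merge|repeat_title|repeat_table_header|keep_header|smart_header|keep_footer|image_text)
        case "$value" in
          true|false) ;;
          *) echo "error: $key expects true or false, got '$value'" >&2; return 1 ;;
        esac ;;
      table_output_format)
        case "$value" in
          text|html|markdown) ;;
          *) echo "error: $key expects text, html, or markdown, got '$value'" >&2; return 1 ;;
        esac ;;
      webhook|language) ;;  # free-form values are not checked in this sketch
      *) echo "error: unknown parameter '$key'" >&2; return 1 ;;
    esac
  done
  # Second pass: emit one "--form" line and one "key=value" line per option.
  for opt in "$@"; do
    printf '%s\n%s\n' '--form' "$opt"
  done
)
```

For instance, `chunk_form_args repeat_title=true table_output_format=html` prints four lines (`--form`, `repeat_title=true`, `--form`, `table_output_format=html`), while `chunk_form_args merge=yes` prints an error and fails. In bash, the output can be read into an array with `mapfile -t` and appended to the curl command shown above.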
By adjusting these parameters based on your specific document characteristics and processing goals, you can optimize the parsing and segmentation of documents effectively with the Preprocess API.