Use Preprocess with LangChain
This guide will show you how to incorporate Preprocess into a LangChain-based ingestion pipeline. By using Preprocess to chunk your documents into high-quality segments, you can enhance the downstream performance of LangChain’s various tools, including vector store indexing, QA over documents, and retrieval-augmented generation (RAG).
Why Integrate Preprocess with LangChain?
LangChain simplifies building applications around language models by providing abstraction layers for loading data, transforming text, and incorporating vector databases or traditional retrieval methods. However, effective ingestion pipelines often rely on high-quality, contextually relevant chunks. Preprocess offers a unique approach to chunking documents, resulting in chunks that preserve the logical flow and formatting of the original source.
By using Preprocess before feeding data to LangChain:
- You ensure your text chunks are semantically meaningful.
- You improve the accuracy of embeddings, similarity search, and retrieval tasks.
- You reduce the need for brute-force token-based chunk splitting.
When to Start the Ingestion
Typically, you’ll want to run Preprocess on your documents before loading them into LangChain’s data loaders or vector stores. Consider the following pipeline:
- Source Your Data: Obtain documents from a file system, a web resource, or an internal database.
- Preprocess the Document: Use Preprocess to split the document into logical chunks.
- Load into LangChain: Treat the Preprocess results as a data source and load the chunks into LangChain as `Document` objects.
- Index the Results: Insert the processed chunks into a vector store or use another indexing method (such as TF-IDF), enabling semantic search and retrieval.
- Use in Your Application: With the data properly chunked and indexed, you can now build question-answering flows, chat interfaces, or any other LLM-driven application.
Example Workflow
Step 1: Chunking the Document with Preprocess
First, use the Preprocess Python SDK to upload a file and start the chunking process, then wait for the chunked result.
from pypreprocess import Preprocess
# Replace with your actual API key
API_KEY = "YOUR_API_KEY"
FILEPATH = "path/to/your_document.pdf"
# Initialize with a local file
preprocess = Preprocess(api_key=API_KEY, filepath=FILEPATH)
# Set any options you need (e.g., merge short paragraphs)
preprocess.set_options({"merge": True, "repeat_title": True})
# Start the chunking process
preprocess.chunk()
# Wait for the chunking to finish and retrieve the result
result = preprocess.wait()
# Access the chunks
chunks = result.data['chunks']
At this point, `chunks` is a list of text segments, each representing a meaningful unit from the original document.
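If you plan to load the chunks later (for example, after receiving them via a webhook), you can persist them now in the same JSON shape used by the `JSONLoader` example in Step 2. This is a minimal sketch; the file name result.json is just a placeholder:
import json
# Save the chunks in the {"data": {"chunks": [...]}} shape used later in this guide
with open("result.json", "w") as f:
    json.dump({"data": {"chunks": chunks}}, f)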
Step 2: Loading Chunks into LangChain
LangChain provides various `DocumentLoaders` for loading documents into its pipeline. Here, since we already have a list of chunks, we can create `Document` objects directly:
from langchain.schema import Document
documents = [Document(page_content=chunk) for chunk in chunks]
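If you want to keep track of where each chunk came from, you can also attach metadata to each `Document`. This is an optional sketch that reuses the FILEPATH variable from Step 1; the metadata keys (source, chunk_index) are illustrative choices, not fields required by LangChain:
from langchain.schema import Document
# Attach simple provenance metadata to every chunk (keys are illustrative)
documents = [
    Document(page_content=chunk, metadata={"source": FILEPATH, "chunk_index": i})
    for i, chunk in enumerate(chunks)
]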
If you previously saved the Preprocess output to a JSON file, you could also use LangChain’s `JSONLoader`:
from langchain.document_loaders import JSONLoader
# Assuming result.json contains {"data": {"chunks": ["chunk1", "chunk2", ...]}}
loader = JSONLoader(file_path="result.json", jq_schema='.data.chunks[]')
documents = loader.load()
(The JSONLoader approach is useful if you receive the result via a webhook or have stored it for later use.)
Step 3: Indexing the Documents
Now that your documents are properly chunked, you can choose how to index them. LangChain supports multiple vector stores and retrieval techniques.
For example, using a vector database like Chroma:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
embeddings = OpenAIEmbeddings(openai_api_key="YOUR_OPENAI_API_KEY")
vectorstore = Chroma.from_documents(documents, embeddings)
You could also choose a different vector store, whether local or hosted by a third party. Once indexed, you can query the documents semantically to find the most relevant chunks.
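For instance, a quick semantic lookup against the indexed chunks might look like this (the query string is only an illustration):
# Return the chunks most similar to the query
relevant_chunks = vectorstore.similarity_search("What are the main conclusions?", k=3)
for doc in relevant_chunks:
    print(doc.page_content[:200])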
Step 4: Performing Retrieval and QA
With your vector store ready, you can easily set up a retrieval chain or a question-answering chain:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
retriever = vectorstore.as_retriever()
llm = ChatOpenAI(openai_api_key="YOUR_OPENAI_API_KEY")
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
answer = qa_chain.run("What are the main conclusions of the document?")
print(answer)
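If you also want to see which Preprocess chunks an answer was based on, `RetrievalQA` can return its source documents. A minimal variation on the chain above, using the same llm and retriever:
# Build the chain so it also returns the retrieved source chunks
qa_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)
result = qa_with_sources({"query": "What are the main conclusions of the document?"})
print(result["result"])
for doc in result["source_documents"]:
    print(doc.page_content[:200])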
In this way, Preprocess and LangChain work together: Preprocess ensures each document is divided into meaningful chunks, and LangChain leverages these chunks to provide accurate, contextually aware answers.
Other Possibilities
LangChain isn’t limited to vector databases:
- TF-IDF Indexing: Instead of embedding-based retrieval, you can use a `RetrievalQA` chain with a TF-IDF retriever. Since you already have pre-chunked text from Preprocess, this step becomes simpler (see the sketch after this list).
- Hybrid Retrieval: Combine traditional keyword-based search with semantic embeddings. The well-structured chunks from Preprocess improve both modes of retrieval.
- Complex Pipelines: Integrate multiple steps, such as cleaning data, applying custom text transformations, or performing entity linking. Preprocess ensures your input data is in the best possible shape before it enters your pipeline.
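As a sketch of the TF-IDF option above, LangChain’s `TFIDFRetriever` can be built directly from the same documents and plugged into the same kind of `RetrievalQA` chain. This assumes the classic langchain package layout used elsewhere in this guide and requires scikit-learn to be installed:
from langchain.retrievers import TFIDFRetriever
from langchain.chains import RetrievalQA
# Build a keyword-style retriever over the same Preprocess chunks
tfidf_retriever = TFIDFRetriever.from_documents(documents)
# Reuse the llm from Step 4, swapping in the TF-IDF retriever for the vector store
tfidf_qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=tfidf_retriever)
print(tfidf_qa.run("What are the main conclusions of the document?"))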
Summary
Integrating Preprocess with LangChain enables you to start your ingestion pipeline with high-quality document chunks. This leads to better search, retrieval, and QA performance. By following the steps above—chunking your document first with Preprocess and then leveraging LangChain’s loaders, vector stores, and chains—you can build sophisticated, high-performing applications around large language models.