Use Preprocess with LangChain
This guide will show you how to incorporate Preprocess into a LangChain-based ingestion pipeline. By using Preprocess to chunk your documents into high-quality segments, you can enhance the downstream performance of LangChain’s various tools, including vector store indexing, QA over documents, and retrieval-augmented generation (RAG).
Why Integrate Preprocess with LangChain?
LangChain simplifies building applications around language models by providing abstraction layers for loading data, transforming text, and incorporating vector databases or traditional retrieval methods. However, effective ingestion pipelines often rely on high-quality, contextually relevant chunks. Preprocess offers a unique approach to chunking documents, resulting in chunks that preserve the logical flow and formatting of the original source.
By using Preprocess before feeding data to LangChain:
- You ensure your text chunks are semantically meaningful.
- You improve the accuracy of embeddings, similarity search, and retrieval tasks.
- You reduce the need for brute-force token-based chunk splitting.
When to Start the Ingestion
Typically, you’ll want to run Preprocess on your documents before loading them into LangChain’s data loaders or vector stores. Consider the following pipeline:
- Source Your Data: Obtain documents from a file system, a web resource, or an internal database.
- Preprocess the Document: Use Preprocess to split the document into logical chunks.
- Load into LangChain: Treat the Preprocess results as a data source and load the chunks into LangChain as `Document` objects.
- Index the Results: Insert the processed chunks into a vector store or use another indexing method (such as TF-IDF), enabling semantic search and retrieval.
- Use in Your Application: With the data properly chunked and indexed, you can now build question-answering flows, chat interfaces, or any other LLM-driven application.
Example Workflow
Step 1: Chunking the Document with Preprocess
First, use the Preprocess Python SDK to upload a file and start the chunking process, then wait for the chunked result.
from pypreprocess import Preprocess
# Replace with your actual API key
API_KEY = "YOUR_API_KEY"
FILEPATH = "path/to/your_document.pdf"
# Initialize with a local file
preprocess = Preprocess(api_key=API_KEY, filepath=FILEPATH)
# Set any options you need (e.g., merge short paragraphs)
preprocess.set_options({"merge": True, "repeat_title": True})
# Start the chunking process
preprocess.chunk()
# Wait for the chunking to finish and retrieve the result
result = preprocess.wait()
# Access the chunks
chunks = result.data['chunks']
At this point, `chunks` is a list of text segments, each representing a meaningful unit from the original document.
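If you plan to load the chunks later (for example, after receiving them via a webhook), you can persist them now in the same JSON shape used by the `JSONLoader` example in Step 2. This is a minimal sketch; the file name result.json is just a placeholder:
import json
# Save the chunks in the {"data": {"chunks": [...]}} shape used later in this guide
with open("result.json", "w") as f:
    json.dump({"data": {"chunks": chunks}}, f)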
Step 2: Loading Chunks into LangChain
LangChain provides various `DocumentLoaders` for loading documents into its pipeline. Here, since we already have a list of chunks, we can create `Document` objects directly:
from langchain.schema import Document
documents = [Document(page_content=chunk) for chunk in chunks]
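If you want to keep track of where each chunk came from, you can also attach metadata to each `Document`. This is an optional sketch that reuses the FILEPATH variable from Step 1; the metadata keys (source, chunk_index) are illustrative choices, not fields required by LangChain:
from langchain.schema import Document
# Attach simple provenance metadata to every chunk (keys are illustrative)
documents = [
    Document(page_content=chunk, metadata={"source": FILEPATH, "chunk_index": i})
    for i, chunk in enumerate(chunks)
]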
If you previously saved the Preprocess output to a JSON file, you could also use LangChain’s `JSONLoader`:
from langchain.document_loaders import JSONLoader
# Assuming result.json contains {"data": {"chunks": ["chunk1", "chunk2", ...]}}
loader = JSONLoader(file_path="result.json", jq_schema='.data.chunks[]')
documents = loader.load()
(The JSONLoader approach is useful if you receive the result via a webhook or have stored it for later use.)
Step 3: Indexing the Documents
Now that your documents are properly chunked, you can choose how to index them. LangChain supports multiple vector stores and retrieval techniques.
For example, using a vector database like Chroma:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
embeddings = OpenAIEmbeddings(openai_api_key="YOUR_OPENAI_API_KEY")
vectorstore = Chroma.from_documents(documents, embeddings)
You could also choose a different vector store, whether local or hosted by a third party. Once indexed, you can query the documents semantically to find the most relevant chunks.
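For instance, a quick semantic lookup against the indexed chunks might look like this (the query string is only an illustration):
# Return the chunks most similar to the query
relevant_chunks = vectorstore.similarity_search("What are the main conclusions?", k=3)
for doc in relevant_chunks:
    print(doc.page_content[:200])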
Step 4: Performing Retrieval and QA
With your vector store ready, you can easily set up a retrieval chain or a question-answering chain:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
retriever = vectorstore.as_retriever()
llm = ChatOpenAI(openai_api_key="YOUR_OPENAI_API_KEY")
qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
answer = qa_chain.run("What are the main conclusions of the document?")
print(answer)
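If you also want to see which Preprocess chunks an answer was based on, `RetrievalQA` can return its source documents. A minimal variation on the chain above, using the same llm and retriever:
# Build the chain so it also returns the retrieved source chunks
qa_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
)
result = qa_with_sources({"query": "What are the main conclusions of the document?"})
print(result["result"])
for doc in result["source_documents"]:
    print(doc.page_content[:200])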
In this way, Preprocess and LangChain work together: Preprocess ensures each document is divided into meaningful chunks, and LangChain leverages these chunks to provide accurate, contextually aware answers.
Other Possibilities
LangChain isn’t limited to vector databases:
- TF-IDF Indexing: Instead of embedding-based retrieval, you can use a `RetrievalQA` chain with a TF-IDF retriever. Since you already have pre-chunked text from Preprocess, this step becomes simpler (see the sketch after this list).
- Hybrid Retrieval: Combine traditional keyword-based search with semantic embeddings. The well-structured chunks from Preprocess improve both modes of retrieval.
- Complex Pipelines: Integrate multiple steps, such as cleaning data, applying custom text transformations, or performing entity linking. Preprocess ensures your input data is in the best possible shape before it enters your pipeline.
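As a sketch of the TF-IDF option above, LangChain’s `TFIDFRetriever` can be built directly from the same documents and plugged into the same kind of `RetrievalQA` chain. This assumes the classic langchain package layout used elsewhere in this guide and requires scikit-learn to be installed:
from langchain.retrievers import TFIDFRetriever
from langchain.chains import RetrievalQA
# Build a keyword-style retriever over the same Preprocess chunks
tfidf_retriever = TFIDFRetriever.from_documents(documents)
# Reuse the llm from Step 4, swapping in the TF-IDF retriever for the vector store
tfidf_qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=tfidf_retriever)
print(tfidf_qa.run("What are the main conclusions of the document?"))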
Summary
Integrating Preprocess with LangChain enables you to start your ingestion pipeline with high-quality document chunks. This leads to better search, retrieval, and QA performance. By following the steps above—chunking your document first with Preprocess and then leveraging LangChain’s loaders, vector stores, and chains—you can build sophisticated, high-performing applications around large language models.