Use Preprocess with LlamaIndex
Integrating Preprocess with LlamaIndex allows you to streamline the ingestion and indexing of your documents. LlamaIndex is a data framework for large language models (LLMs) that helps you ingest, structure, and query your data effectively. By combining Preprocess’s semantic chunking capabilities with LlamaIndex’s data structures, you can build powerful applications for search, retrieval-augmented generation (RAG), and more.
When and Why to Use Preprocess with LlamaIndex
When to Start Ingestion:
- Before passing documents into LlamaIndex, you should run them through Preprocess to split them into semantically coherent chunks.
- This is useful when you have documents in various formats (PDFs, Word, HTML, etc.) and you need consistent, context-aware segments for indexing.
- You might trigger the ingestion process after receiving new documents from a data source, on a schedule (e.g., nightly), or as part of a pipeline whenever content is uploaded.
Why Preprocess Before LlamaIndex:
- LlamaIndex works best when fed with documents that are already divided into “chunks” suitable for language models.
- Preprocess ensures chunks respect the document’s natural structure, helping the LLM produce more accurate and contextually relevant results.
- By using Preprocess first, you avoid simplistic splitting strategies (e.g., fixed token windows), improving the quality and coherency of retrieved information.
Common Use Cases
Once you have your preprocessed chunks, LlamaIndex can help you:
- Build a Vector Store Index: Insert the chunks into a vector database (like Chroma or Pinecone) to perform semantic search and retrieval.
- Hybrid Retrieval Methods: Combine vector-based and traditional keyword-based retrieval for better coverage and accuracy.
- RAG Pipelines: Enhance LLM queries with relevant context pulled from your indexed chunks, enabling retrieval-augmented generation.
Prerequisites
- A valid Preprocess API Key.
- The pypreprocess Python SDK or the Preprocess API integration set up.
- LlamaIndex installed (e.g., `pip install llama-index`).
Example Workflow
- Chunk the Document Using Preprocess

Start by using the Preprocess SDK to upload and chunk the document:

```python
from pypreprocess import Preprocess

# Initialize the SDK with your API key and file path
preprocess = Preprocess(api_key="YOUR_API_KEY", filepath="path/to/your_file.pdf")

# Optional: set options to refine how the document is chunked
preprocess.set_options({"merge": True, "repeat_title": True})

# Start the chunking process
preprocess.chunk()

# Wait until the chunking is complete
result = preprocess.wait()

# Retrieve the chunks
chunks = result.data['chunks']
```
At this point, `chunks` is a list of text segments that respect the document’s structure and semantics.
- Load Chunks into LlamaIndex

With LlamaIndex, you can transform these chunks into `Document` objects and then build an index. For example:

```python
from llama_index import Document, VectorStoreIndex

# Convert chunks into LlamaIndex Document objects
documents = [Document(text=chunk) for chunk in chunks]

# Build a vector index (default: in-memory)
index = VectorStoreIndex.from_documents(documents)
```
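Optionally, you can attach metadata (such as the source filename) to each `Document` so that query results can later be filtered or traced back to their source. This is an illustrative variation rather than a required step; the metadata keys used here are assumptions:

```python
from llama_index import Document

source_file = "path/to/your_file.pdf"  # the same file that was passed to Preprocess

documents = [
    Document(text=chunk, metadata={"source": source_file, "chunk_id": i})
    for i, chunk in enumerate(chunks)
]
```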
Alternatively, you can integrate a vector database (like Pinecone or Chroma) for persistence:
```python
# Example with Chroma
import chromadb
from llama_index import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

documents = [Document(text=chunk) for chunk in chunks]

# Initialize a persistent Chroma collection and wrap it in a LlamaIndex vector store
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_collection")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Build the index on top of the Chroma-backed storage context
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```
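Because the vectors now live in Chroma, a separate process can later reopen the same collection without re-chunking or re-embedding the documents. A minimal sketch, assuming the same `./chroma_db` path and collection name as above:

```python
import chromadb
from llama_index import VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# Reconnect to the existing collection
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_collection")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Rebuild the index object directly from the stored vectors
index = VectorStoreIndex.from_vector_store(vector_store)
```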
- Perform Retrieval and Queries

Once your index is built, you can use LlamaIndex’s query engines to retrieve relevant context and answer questions:

```python
# Build a query engine on top of the index
query_engine = index.as_query_engine()

# Query your data
response = query_engine.query("What are the main topics discussed in the third section?")
print(response)
```
LlamaIndex will use the underlying vector store to find the best matching chunks and help the LLM generate an answer.
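If you only need the matching chunks themselves, for example to build your own prompt or to inspect what was retrieved, you can use a retriever directly instead of a full query engine. A minimal sketch; the `similarity_top_k` value is an arbitrary choice:

```python
# Retrieve the top-matching chunks without asking the LLM to synthesize an answer
retriever = index.as_retriever(similarity_top_k=3)
results = retriever.retrieve("What are the main topics discussed in the third section?")

for node_with_score in results:
    print(node_with_score.score, node_with_score.node.get_content()[:200])
```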
- Integration in Larger Pipelines

You can incorporate these steps into larger ingestion pipelines (a minimal end-to-end sketch follows this list):
- Data Sources: Fetch documents from cloud storage, a CMS, or a local repository.
- Chunking (Preprocess): Run each new document through Preprocess to produce high-quality chunks.
- Indexing (LlamaIndex): Insert the chunks into a vector or keyword index.
- Usage: The resulting index can be connected to a frontend application, chatbot, or RAG system to provide context-aware responses.
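As an illustration of how these stages can fit together, here is a minimal sketch of an ingestion function. The Preprocess and LlamaIndex calls mirror the snippets above; the `chunk_file` and `ingest_folder` helper names and the folder-scanning trigger are assumptions for this example, not part of either SDK:

```python
from pathlib import Path
from typing import List

from pypreprocess import Preprocess
from llama_index import Document, VectorStoreIndex


def chunk_file(api_key: str, filepath: str) -> List[str]:
    """Chunk one document with Preprocess (same calls as in the workflow above)."""
    preprocess = Preprocess(api_key=api_key, filepath=filepath)
    preprocess.chunk()
    result = preprocess.wait()
    return result.data["chunks"]


def ingest_folder(api_key: str, folder: str) -> VectorStoreIndex:
    """Illustrative pipeline: chunk every PDF in a folder and build a single index."""
    documents = []
    for path in Path(folder).glob("*.pdf"):
        for chunk in chunk_file(api_key, str(path)):
            documents.append(Document(text=chunk, metadata={"source": str(path)}))
    return VectorStoreIndex.from_documents(documents)


# The resulting index can then back a chatbot or RAG frontend, e.g.:
# query_engine = ingest_folder("YOUR_API_KEY", "docs/inbox").as_query_engine()
```

The same structure works with a persistent vector store: swap the in-memory `from_documents` call for the Chroma-backed storage context shown earlier.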
Summary
By integrating Preprocess’s semantic chunking with LlamaIndex’s indexing and retrieval capabilities, you gain a powerful toolkit for managing, searching, and querying your textual data. The process typically involves:
- Chunking the document using Preprocess.
- Loading the chunks into LlamaIndex as documents.
- Creating an index and performing retrieval and queries.
- Optionally integrating these steps into more complex pipelines and data workflows.
This integration ensures that your language model tasks have a robust and contextually rich foundation.