Use Preprocess with LlamaIndex

Integrating Preprocess with LlamaIndex allows you to streamline the ingestion and indexing of your documents. LlamaIndex is a data framework for large language models (LLMs) that helps you ingest, structure, and query your data effectively. By combining Preprocess’s semantic chunking capabilities with LlamaIndex’s data structures, you can build powerful applications for search, retrieval-augmented generation (RAG), and more.

When and Why to Use Preprocess with LlamaIndex

When to Start Ingestion:

  • Before passing documents into LlamaIndex, you should run them through Preprocess to split them into semantically coherent chunks.
  • This is useful when you have documents in various formats (PDFs, Word, HTML, etc.) and you need consistent, context-aware segments for indexing.
  • You might trigger the ingestion process after receiving new documents from a data source, on a schedule (e.g., nightly), or as part of a pipeline whenever content is uploaded (see the sketch below).
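
For example, the trigger can be as simple as a function that your upload handler or scheduled job calls for each new file. A minimal sketch (handle_new_document is a hypothetical name; the chunking call itself is covered in the workflow below):

    from pypreprocess import Preprocess

    def handle_new_document(filepath: str) -> list:
        """Chunk a newly received file with Preprocess and return its chunks.

        Call this from an upload handler, a nightly job, or a pipeline step.
        """
        preprocess = Preprocess(api_key="YOUR_API_KEY", filepath=filepath)
        preprocess.chunk()
        return preprocess.wait().data["chunks"]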

Why Preprocess Before LlamaIndex:

  • LlamaIndex works best when fed with documents that are already divided into “chunks” suitable for language models.
  • Preprocess ensures chunks respect the document’s natural structure, helping the LLM produce more accurate and contextually relevant results.
  • By using Preprocess first, you avoid simplistic splitting strategies (e.g., fixed token windows), improving the quality and coherency of retrieved information.

Common Use Cases

Once you have your preprocessed chunks, LlamaIndex can help you:

  • Build a Vector Store Index: Insert the chunks into a vector database (like Chroma or Pinecone) to perform semantic search and retrieval.
  • Hybrid Retrieval Methods: Combine vector-based and traditional keyword-based retrieval for better coverage and accuracy.
  • RAG Pipelines: Enhance LLM queries with relevant context pulled from your indexed chunks, enabling retrieval-augmented generation.

Prerequisites

  • A valid Preprocess API Key.
  • The pypreprocess Python SDK installed, or an equivalent Preprocess API integration set up.
  • LlamaIndex installed (e.g., pip install llama-index).

Example Workflow

  1. Chunk the Document Using Preprocess
    Start by using the Preprocess SDK to upload and chunk the document:

    from pypreprocess import Preprocess
    
    # Initialize the SDK with your API key and file path
    preprocess = Preprocess(api_key="YOUR_API_KEY", filepath="path/to/your_file.pdf")
    
    # Optional: Set options to refine how the document is chunked
    preprocess.set_options({"merge": True, "repeat_title": True})
    
    # Start the chunking process
    preprocess.chunk()
    
    # Wait until the chunking is complete
    result = preprocess.wait()
    
    # Retrieve the chunks
    chunks = result.data['chunks']
    

    At this point, chunks is a list of text segments that respect the document’s structure and semantics.
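
    For a quick sanity check before indexing, you can inspect the list directly (a trivial example, assuming each chunk is a plain string):

    print(f"Received {len(chunks)} chunks")
    print(chunks[0][:200])  # preview the first chunk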

  2. Load Chunks into LlamaIndex

    With LlamaIndex, you can transform these chunks into Document objects and then build an index. For example:

    from llama_index import Document, VectorStoreIndex
    # Note: in llama-index >= 0.10 these imports live under llama_index.core

    # Convert chunks into LlamaIndex Document objects
    documents = [Document(text=chunk) for chunk in chunks]

    # Build a vector index (default: in-memory)
    index = VectorStoreIndex.from_documents(documents)
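
    By default this index lives in memory. If you are not yet using an external vector database, you can also persist it to disk and reload it later; a small sketch (the ./storage path is just an example):

    # Persist the index so the documents don't have to be re-embedded next time
    index.storage_context.persist(persist_dir="./storage")

    # Later: reload the persisted index
    from llama_index import StorageContext, load_index_from_storage
    storage_context = StorageContext.from_defaults(persist_dir="./storage")
    index = load_index_from_storage(storage_context)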
    

    Alternatively, you can integrate a vector database (like Pinecone or Chroma) for persistence:

    # Example with Chroma (requires the chromadb package)
    import chromadb
    from llama_index import Document, StorageContext, VectorStoreIndex
    from llama_index.vector_stores import ChromaVectorStore

    documents = [Document(text=chunk) for chunk in chunks]

    # Create (or open) a Chroma collection and wrap it in a LlamaIndex vector store
    chroma_client = chromadb.PersistentClient(path="./chroma_db")
    chroma_collection = chroma_client.get_or_create_collection("my_collection")
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

    # Build the index on top of the Chroma-backed storage context
    storage_context = StorageContext.from_defaults(vector_store=vector_store)
    index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
    
  3. Perform Retrieval and Queries

    Once your index is built, you can use LlamaIndex’s query engines to retrieve relevant context and answer questions:

    query_engine = index.as_query_engine()
    
    # Query your data
    response = query_engine.query("What are the main topics discussed in the third section?")
    print(response)
    

    LlamaIndex will use the underlying vector store to find the best matching chunks and help the LLM generate an answer.
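
    If you only need the raw matching chunks (for example, to assemble your own prompt), you can use the index as a retriever instead of a full query engine. A small sketch; the similarity_top_k value is just an example:

    # Retrieve the top matching chunks without generating an answer
    retriever = index.as_retriever(similarity_top_k=3)
    nodes = retriever.retrieve("What are the main topics discussed in the third section?")

    for node_with_score in nodes:
        print(node_with_score.score, node_with_score.node.get_content()[:100])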

  4. Integration in Larger Pipelines

You can incorporate these steps into larger ingestion pipelines (a minimal end-to-end sketch follows the list below):

  • Data Sources: Fetch documents from cloud storage, a CMS, or a local repository.
  • Chunking (Preprocess): Run each new document through Preprocess to produce high-quality chunks.
  • Indexing (LlamaIndex): Insert the chunks into a vector or keyword index.
  • Usage: The resulting index can be connected to a frontend application, chatbot, or RAG system to provide context-aware responses.
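
As a rough illustration, here is a minimal end-to-end sketch tying these stages together. The ./incoming folder, the ingest_file helper, and the API key placeholder are hypothetical; swap in your own data source and a persistent vector store for production use:

    from pathlib import Path

    from pypreprocess import Preprocess
    from llama_index import Document, VectorStoreIndex

    def ingest_file(filepath: str, api_key: str) -> list:
        """Chunk a single document with Preprocess and return its chunks."""
        preprocess = Preprocess(api_key=api_key, filepath=filepath)
        preprocess.chunk()
        return preprocess.wait().data["chunks"]

    # Data source: pick up newly received files (a local folder, as an example)
    documents = []
    for path in Path("./incoming").glob("*.pdf"):
        for chunk in ingest_file(str(path), api_key="YOUR_API_KEY"):
            documents.append(Document(text=chunk))

    # Indexing: build the index (use Chroma or Pinecone, as above, for persistence)
    index = VectorStoreIndex.from_documents(documents)

    # Usage: expose a query engine to your application, chatbot, or RAG system
    query_engine = index.as_query_engine()
    print(query_engine.query("Summarize the newly ingested documents."))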

Summary

By integrating Preprocess’s semantic chunking with LlamaIndex’s indexing and retrieval capabilities, you gain a powerful toolkit for managing, searching, and querying your textual data. The process typically involves:

  1. Chunking the document using Preprocess.
  2. Loading the chunks into LlamaIndex as documents.
  3. Creating an index and performing retrieval and queries.
  4. Optionally integrating these steps into more complex pipelines and data workflows.

This integration ensures that your language model tasks have a robust and contextually rich foundation.