A production-grade context compression system for intelligent document analysis and question answering with comprehensive PDF processing capabilities.
Context Compressor is an advanced system that intelligently compresses and analyzes PDF documents using state-of-the-art techniques including Cross-Document Compression (CDC), Single-Document Mode (SDM), and multimodal processing. It achieves 90-95% token reduction while maintaining answer quality.
- Cross-Document Compression (CDC): Intelligent selection across multiple documents with document caps
- Single-Document Mode (SDM): Optimized for single-document analysis with relaxed constraints
- Auto-Router: Automatically switches between CDC and SDM based on content characteristics
- Multimodal Processing: Extracts and processes text, images, and tables from PDFs
- LLM Integration: Supports OpenRouter and local models with citation handling
- Performance Optimization: Caching, batching, and memory management
- Interactive Analysis: Terminal-based interface for real-time document analysis
git clone <repository>
cd context-compressor
pip install -e .
# Set your OpenRouter API key
$env:OPENROUTER_API_KEY="your_api_key_here"
# Run interactive analysis (choose one):
python run_interactive.py
# OR
python main.py # Then select option 1
The system guides you through a step-by-step process:
Step 1: PDF Selection
- Enter the path to your PDF file
- System validates the file exists and is accessible
Step 2: Question Input
- Type your question about the document
- Examples: "What are the main findings?", "How does the methodology work?"
- Type 'quit' to exit the system
Step 3: Parameter Configuration
- Compression Parameters: Control how content is selected and compressed
- LLM Parameters: Control answer generation and model selection
- API Configuration: Set up authentication for language models
Step 4: Analysis & Results
- System processes your request and displays comprehensive results
- Includes answer, citations, compression stats, and performance metrics
- Option to ask follow-up questions or start over
INTERACTIVE PDF QUESTION-ANSWERING SYSTEM
============================================================
Enter PDF file path (default: 2502.15840v1.pdf): topology_paper.pdf
PDF found: topology_paper.pdf
------------------------------------------------------------
Enter your question (or 'quit' to exit): What are the key topological concepts discussed?
============================================================
CONFIGURE ANALYSIS PARAMETERS
============================================================
COMPRESSION PARAMETERS:
Token budget for compression (default: 800): 1000
MMR lambda (diversity vs relevance) (default: 0.7): 0.8
Document cap (max chunks per doc) (default: 8): 10
Top M candidates for selection (default: 200): 250
Use auto-router for CDC/SDM (default: Y): Y
LLM PARAMETERS:
Model ID (default: qwen/qwen3-8b:free): anthropic/claude-3.5-sonnet
Max tokens to generate (default: 1024): 1200
Generation temperature (default: 0.1): 0.1
OpenRouter API key (or press Enter to use environment variable):
Analyzing: What are the key topological concepts discussed?
Please wait...
from context_compressor.compressor import ContextCompressor
from context_compressor.schemas import CompressionRequest
# Initialize compressor
compressor = ContextCompressor()
# Create request
request = CompressionRequest(
    q="What are the main findings?",
    B=800,  # Token budget
    candidates=[...],  # Your document chunks
    params={"lambda_": 0.7, "doc_cap": 8},
)
# Compress
response = compressor.compress(request)
print(response.context)
- Extracts text, images, and tables from PDF documents
- Supports vector graphics and raster images
- Handles complex document layouts
- Provides content diagnosis and error handling
- What they are: Segmented pieces of text extracted from PDF documents
- How they're created: The extractor breaks down PDF content into meaningful units:
  - Paragraphs or sections of text
  - Sentences grouped together
  - Logical content blocks (headers, body text, captions)
  - Page-based segments with location information
- What each chunk contains:
{
  "text": "The study found a 25% improvement in accuracy...",
  "tokens": 25,
  "section": "Results",
  "page": 14,
  "id": "chunk_001",
  "doc_id": "paper_2024"
}
- Why they matter:
  - Enable selective compression (only relevant chunks are kept)
  - Allow for 90-95% token reduction while maintaining quality
  - Provide traceable citations back to specific document sections
  - Enable parallel processing and memory optimization
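As a rough illustration, a chunk record like the one above could be modeled as a small dataclass. The `Chunk` class and the `format_citation` helper are hypothetical names used only for this sketch and are not part of the documented API; only the field names come from the example record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One extracted unit of text; fields mirror the example record."""
    id: str
    doc_id: str
    text: str
    tokens: int
    section: str
    page: int


def format_citation(chunk: Chunk) -> str:
    """Build a human-readable pointer back to the chunk's source location."""
    return f"[{chunk.doc_id}, {chunk.section}, p. {chunk.page}]"


chunk = Chunk(
    id="chunk_001",
    doc_id="paper_2024",
    text="The study found a 25% improvement in accuracy...",
    tokens=25,
    section="Results",
    page=14,
)
print(format_citation(chunk))  # [paper_2024, Results, p. 14]
```

Keeping `doc_id`, `section`, and `page` on every chunk is what makes a citation traceable after compression has discarded most of the document.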
- Main compression orchestrator
- Implements CDC/SDM logic with auto-routing
- Manages token budgets and constraints
- Provides comprehensive statistics
- Handles multimodal content compression
- Balances text, image, and table content
- Optimizes for different content types
- Provides modality-specific analysis
- Analyzes content characteristics
- Calculates top1_doc_frac and entropy
- Automatically selects CDC or SDM mode
- Optimizes for query type and document structure
- Purpose: Multi-document scenarios with document diversity
- Features: Document caps, section constraints, cross-document selection
- Use Case: Research papers, multi-source analysis, comparative studies
- Purpose: Single-document deep analysis
- Features: Relaxed document constraints, focused selection
- Use Case: Detailed paper analysis, book chapters, technical documents
- Algorithm: Greedy selection balancing relevance and diversity
- Parameters: lambda_ controls diversity vs relevance trade-off
- Benefits: Reduces redundancy while maintaining coverage
Here's how the system processes your document and question:
Stage 1: Document Extraction
PDF Document → MultimodalExtractor → 300+ Document Chunks
Stage 2: Initial Ranking
All Chunks → BM25 + Dense Similarity Scoring → Top M Candidates (200 by default)
Stage 3: MMR Selection
Top M Candidates → MMR Algorithm (λ=0.7) → Selected Chunks (respecting Document Cap)
Stage 4: Context Assembly
Selected Chunks → Token Budget Check (800 tokens) → Final Compressed Context
Stage 5: LLM Generation
Compressed Context + Question → Language Model → Answer with Citations
Key Parameters in Action:
- Top M (200): Limits initial candidates for performance
- MMR Lambda (0.7): Balances relevance vs diversity in selection
- Document Cap (8): Ensures representation across sections
- Token Budget (800): Final limit on context size
- Auto-Router: Chooses CDC vs SDM based on document structure
When using python interactive_pdf_qa.py, you'll be prompted to configure these parameters:
Token Budget (100-2000, default: 800)
- What it does: Sets the maximum number of tokens allowed in the compressed context sent to the LLM
- Technical details: This is the hard limit on how much content the system can include in the final compressed context. The system will stop adding chunks once this limit is reached, even if more relevant content exists.
- Lower values (400-600): Faster processing, less detailed answers, lower cost, may miss important context
- Higher values (1000-1200): More comprehensive answers, slower processing, higher cost, includes more context
- Recommendation: Start with 800, increase for complex questions, decrease for quick summaries
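The hard-limit behavior described above can be sketched as a simple greedy packer. This is a minimal illustration, not the compressor's actual code: `pack_to_budget` is a hypothetical helper, and stopping at the first chunk that does not fit (rather than skipping it and trying smaller ones) is an assumption.

```python
def pack_to_budget(ranked_chunks, budget=800):
    """Add chunks in ranked order; stop at the first one that would exceed the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        if used + chunk["tokens"] > budget:
            break  # hard limit: stop even if more relevant content remains
        selected.append(chunk)
        used += chunk["tokens"]
    return selected, used


ranked = [
    {"id": "chunk_001", "tokens": 300},
    {"id": "chunk_002", "tokens": 400},
    {"id": "chunk_003", "tokens": 200},  # would push the total to 900
    {"id": "chunk_004", "tokens": 50},
]
selected, used = pack_to_budget(ranked, budget=800)
print([c["id"] for c in selected], used)  # ['chunk_001', 'chunk_002'] 700
```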
MMR Lambda (0.0-1.0, default: 0.7)
- What it does: Controls the balance between relevance and diversity in the Maximal Marginal Relevance algorithm
- Technical details: MMR uses this formula: score = λ × relevance + (1-λ) × diversity. Lambda controls the trade-off:
  - λ = 0.0: Only diversity matters (maximally diverse selection)
  - λ = 1.0: Only relevance matters (most relevant content only)
  - λ = 0.7: Balanced approach (70% relevance, 30% diversity)
- Lower values (0.3-0.5): Prioritize diverse content, covers more topics but may include less relevant info, less redundancy
- Higher values (0.8-1.0): Focus on most relevant content, may miss important context, more redundancy
- Recommendation: 0.7 provides good balance, use 0.9 for focused questions, 0.5 for broad analysis
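To make the formula concrete, here is a minimal greedy MMR sketch in plain Python. It assumes that "diversity" means one minus the highest cosine similarity to any chunk already selected; that definition, and the names `cosine` and `mmr_select`, are assumptions for illustration and may differ from the project's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(relevance, embeddings, k, lam=0.7):
    """Greedily pick k indices maximizing lam * relevance + (1 - lam) * diversity."""
    selected = []
    remaining = list(range(len(relevance)))
    while remaining and len(selected) < k:

        def score(i):
            if selected:
                # Assumed diversity: distance from the closest already-selected chunk.
                diversity = 1.0 - max(cosine(embeddings[i], embeddings[j]) for j in selected)
            else:
                diversity = 1.0
            return lam * relevance[i] + (1.0 - lam) * diversity

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


relevance = [0.9, 0.85, 0.5]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # chunks 0 and 1 are near-duplicates
print(mmr_select(relevance, embeddings, k=2, lam=1.0))  # [0, 1]
print(mmr_select(relevance, embeddings, k=2, lam=0.5))  # [0, 2]
```

With λ = 1.0 the near-duplicate chunk is picked because only relevance counts; lowering λ lets the less relevant but different chunk win.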
Document Cap (1-20, default: 8)
- What it does: Maximum number of text chunks that can be selected from each document section (e.g., Introduction, Methods, Results)
- Technical details: This prevents the system from selecting too many chunks from a single section, ensuring representation across the entire document. For example, if set to 8, the system can select at most 8 chunks from the "Introduction" section, 8 from "Methods", etc.
- Lower values (3-5): More focused selection, faster processing, may miss important details in long sections
- Higher values (12-15): More comprehensive coverage, includes more context, may include redundant information
- Recommendation: 8 works well for most documents, increase for complex papers with long sections, decrease for simple documents
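A cap like this can be enforced with a per-group counter while walking the ranked list. The sketch below is illustrative only: `apply_cap` is a hypothetical helper, and the `key` argument is an assumption that lets the same code cap by `section` or by `doc_id`.

```python
from collections import Counter


def apply_cap(ranked_chunks, cap=8, key="section"):
    """Keep chunks in ranked order, allowing at most `cap` per group."""
    counts = Counter()
    kept = []
    for chunk in ranked_chunks:
        group = chunk[key]
        if counts[group] < cap:
            kept.append(chunk)
            counts[group] += 1
    return kept


ranked = [
    {"id": "c1", "section": "Introduction"},
    {"id": "c2", "section": "Introduction"},
    {"id": "c3", "section": "Introduction"},  # dropped when cap=2
    {"id": "c4", "section": "Methods"},
]
print([c["id"] for c in apply_cap(ranked, cap=2)])  # ['c1', 'c2', 'c4']
```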
Top M Candidates (50-500, default: 200)
- What it does: Number of initial candidates considered in the first stage of selection before applying MMR
- Technical details: The system first ranks all document chunks by relevance (using BM25 + dense similarity scores), then takes the top M candidates for MMR selection. This is a performance optimization - instead of running MMR on all 1000+ chunks, it runs on the top 200 most relevant ones.
- Lower values (100-150): Faster processing, may miss some relevant content that wasn't in the top candidates
- Higher values (300-400): More thorough analysis, slower processing, considers more candidates
- Recommendation: 200 is optimal for most cases, increase for large documents with many sections, decrease for memory constraints
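The first-stage cut can be pictured as score fusion followed by a top-M selection. How the system actually combines BM25 and dense scores is not documented here, so the min-max normalization, the `alpha` weight, and the function names below are all assumptions made for this sketch.

```python
import heapq


def min_max(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def top_m(bm25_scores, dense_scores, m=200, alpha=0.5):
    """Return indices of the m highest fused scores, best first."""
    bm25_n, dense_n = min_max(bm25_scores), min_max(dense_scores)
    fused = [alpha * b + (1.0 - alpha) * d for b, d in zip(bm25_n, dense_n)]
    return heapq.nlargest(m, range(len(fused)), key=fused.__getitem__)


bm25 = [10.0, 2.0, 6.0, 0.0]
dense = [0.3, 0.9, 0.5, 0.1]
print(top_m(bm25, dense, m=2))  # [0, 1]
```

Only the indices returned here would be passed on to MMR, which is why a smaller M speeds up selection at the risk of cutting relevant chunks early.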
Auto-Router (Y/N, default: Y)
- What it does: Automatically chooses between Cross-Document Compression (CDC) and Single-Document Mode (SDM) based on content analysis
- Technical details: The router analyzes the document structure and calculates:
  - top1_doc_frac: Fraction of top candidates from the same document (1.0 = single document)
  - entropy: Diversity measure of candidate distribution across documents
  - If top1_doc_frac > 0.8 and entropy < 0.3, it switches to SDM mode
- Yes: System analyzes content and picks the best mode automatically, optimizes for your specific document
- No: Forces CDC mode (useful for multi-document scenarios or when you want consistent behavior)
- Recommendation: Keep enabled unless you have specific requirements or want to force CDC mode
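The routing rule can be sketched from the candidates' document IDs alone. The thresholds come from the description above; the entropy definition is an assumption (Shannon entropy normalized to the range 0 to 1 by the log of the number of documents), and `route` is a hypothetical function name.

```python
import math
from collections import Counter


def route(doc_ids, frac_threshold=0.8, entropy_threshold=0.3):
    """Return (mode, top1_doc_frac, entropy) for a non-empty list of candidate doc ids."""
    if not doc_ids:
        raise ValueError("doc_ids must not be empty")
    counts = Counter(doc_ids)
    total = len(doc_ids)
    top1_doc_frac = max(counts.values()) / total
    if len(counts) > 1:
        # Assumed measure: Shannon entropy divided by its maximum, log(number of documents).
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        entropy /= math.log(len(counts))
    else:
        entropy = 0.0
    single_doc = top1_doc_frac > frac_threshold and entropy < entropy_threshold
    return ("SDM" if single_doc else "CDC"), top1_doc_frac, entropy


print(route(["paper_2024"] * 10)[0])                  # SDM
print(route(["paper_a"] * 5 + ["paper_b"] * 5)[0])    # CDC
```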
Model ID (default: "qwen/qwen3-8b:free")
- What it does: Specifies which language model to use for generating answers
- Free models: "qwen/qwen3-8b:free", "meta-llama/llama-3.1-8b:free"
- Paid models: "openai/gpt-4o", "anthropic/claude-3.5-sonnet", "google/gemini-pro"
- Recommendation: Start with free models, upgrade to paid for better quality
Max Tokens (50-2000, default: 1024)
- What it does: Maximum length of the generated answer
- Lower values (300-500): Concise answers, faster generation
- Higher values (1500-2000): Detailed explanations, longer generation time
- Recommendation: 1024 provides good detail, adjust based on question complexity
Temperature (0.0-1.0, default: 0.1)
- What it does: Controls how random token sampling is, trading varied, more creative phrasing against consistent, factual answers
- Lower values (0.0-0.2): More factual, consistent answers
- Higher values (0.7-1.0): More creative, varied responses
- Recommendation: 0.1 for academic/research questions, 0.3-0.5 for creative analysis
OpenRouter API Key
- What it does: Authentication for accessing language models
- Required: Yes (unless using local models)
- How to get: Sign up at openrouter.ai and generate an API key
- Security: Can use environment variable OPENROUTER_API_KEY instead of entering directly
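A minimal sketch of the fallback described above, where a key typed at the prompt wins and an empty entry falls back to the environment. `resolve_api_key` is a hypothetical helper, not a function from this project.

```python
import os


def resolve_api_key(entered: str = "") -> str:
    """Prefer a key typed at the prompt; otherwise read OPENROUTER_API_KEY."""
    key = entered.strip() or os.environ.get("OPENROUTER_API_KEY", "")
    if not key:
        raise RuntimeError("Set OPENROUTER_API_KEY or enter a key at the prompt.")
    return key
```

Reading the key from the environment keeps it out of shell history and terminal scrollback.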
Token Budget: 400-600
Top M Candidates: 100-150
Max Tokens: 512
Model: qwen/qwen3-8b:free
Token Budget: 1000-1200
Top M Candidates: 300-400
Max Tokens: 1500
Model: openai/gpt-4o
Temperature: 0.1
Token Budget: 600
Top M Candidates: 100
Document Cap: 5
Max Tokens: 800
Token Budget: 1000
MMR Lambda: 0.8
Document Cap: 10
Temperature: 0.1
Model: anthropic/claude-3.5-sonnet
The system provides detailed analysis including:
- Original vs compressed token counts
- Token reduction percentage
- Candidates processed vs selected
- Compression mode used (CDC/SDM)
- Total processing time breakdown
- Extraction, compression, and LLM timing
- Percentage allocation across components
- First 8 lines of compressed context
- Shows exactly what content was selected
- Enables quality verification
- Overall efficiency metric
- Token efficiency percentage
- Time efficiency (normalized to 1 minute)
COMPREHENSIVE ANALYSIS RESULTS
================================================================================
QUESTION:
What are the main findings about model performance?
ANSWER:
The main findings highlight that Large Language Models (LLMs) exhibit significant
challenges in maintaining long-term coherence when managing straightforward,
long-running tasks...
COMPRESSION ANALYSIS:
Method: CDC (Cross-Document Compression)
Mode: single_doc
Original tokens: 7,469
Compressed tokens: 387
Token reduction: 94.8%
Candidates processed: 308
Candidates selected: 321
Selection ratio: 104.2%
PERFORMANCE METRICS:
Total processing time: 15.19s
PDF extraction time: 6.05s (39.8%)
Compression time: 0.15s (1.0%)
LLM generation time: 8.99s (59.2%)
EFFICIENCY SCORE:
Overall efficiency: 0.237 (higher is better)
Token efficiency: 94.8%
Time efficiency: 74.7% (normalized to 1 minute)
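The token and time figures in this sample are consistent with the simple formulas below. These are inferred from the numbers shown, not taken from the source code, and the function names are hypothetical. The formula behind "Overall efficiency" is not stated, so it is not reconstructed here.

```python
def token_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of tokens removed by compression."""
    return 1.0 - compressed_tokens / original_tokens


def time_efficiency(total_seconds: float, window_seconds: float = 60.0) -> float:
    """Share of the one-minute window left unused, floored at zero."""
    return max(0.0, 1.0 - total_seconds / window_seconds)


def time_breakdown(stages: dict) -> dict:
    """Percentage of total time spent in each stage."""
    total = sum(stages.values())
    return {name: 100.0 * seconds / total for name, seconds in stages.items()}


print(f"{token_reduction(7469, 387):.1%}")  # 94.8%
print(f"{time_efficiency(15.19):.1%}")      # 74.7%
shares = time_breakdown({"extraction": 6.05, "compression": 0.15, "llm": 8.99})
print({name: round(pct, 1) for name, pct in shares.items()})
```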
The system includes a comprehensive fine-tuning framework:
- AnchorExtractor: Extracts key information (entities, numbers, keyphrases)
- OracleCreator: Creates optimal selections using set-cover algorithms
- TrainingDataGenerator: Generates synthetic training data
- FineTuner: Trains bi-encoder and cross-encoder models
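To illustrate the set-cover idea behind oracle selection, here is a greedy sketch: repeatedly pick the chunk that covers the most anchors not yet covered. Greedy set cover is a standard approximation and does not guarantee a minimal selection; the algorithm OracleCreator actually uses is not specified here, and `greedy_set_cover` with its inputs is hypothetical.

```python
def greedy_set_cover(anchors, chunk_anchors):
    """Pick chunk ids until every coverable anchor is covered.

    anchors: iterable of anchor strings that the answer depends on.
    chunk_anchors: dict mapping chunk id -> set of anchors found in that chunk.
    Returns (chosen chunk ids in pick order, anchors no chunk could cover).
    """
    uncovered = set(anchors)
    chosen = []
    while uncovered and chunk_anchors:
        # Choose the chunk covering the most still-uncovered anchors.
        best = max(chunk_anchors, key=lambda cid: len(chunk_anchors[cid] & uncovered))
        gain = chunk_anchors[best] & uncovered
        if not gain:
            break  # remaining anchors appear in no chunk
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


anchors = {"BM25", "25%", "MMR", "entropy"}
chunk_anchors = {
    "c1": {"BM25", "MMR"},
    "c2": {"MMR"},
    "c3": {"25%", "entropy", "MMR"},
}
print(greedy_set_cover(anchors, chunk_anchors))  # (['c3', 'c1'], set())
```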
from context_compressor.fine_tuning import FineTuner
# Initialize fine-tuner
fine_tuner = FineTuner()
# Train bi-encoder
fine_tuner.train_bi_encoder(training_data, model_name="custom-bi-encoder")
# Train cross-encoder
fine_tuner.train_cross_encoder(training_data, model_name="custom-cross-encoder")
For production deployment, use the integrated QA system:
# Start the server (choose one):
python run_api.py
# OR
python main.py # Then select option 2
# Access API endpoints
curl -X POST "http://localhost:8000/qa" \
  -F "pdf_file=@document.pdf" \
  -F "request={\"question\":\"What are the main findings?\"}"
- Model Caching: Reuses loaded models across requests
- Embedding Cache: Stores computed embeddings
- Result Cache: Caches compression results for similar queries
- Automatic Cleanup: Releases unused resources
- Batch Processing: Efficient candidate processing
- Memory Monitoring: Tracks and optimizes memory usage
- Token Reduction: 90-95% typical compression
- Processing Time: 10-20 seconds for full pipeline
- Memory Usage: Optimized for production deployment
- Accuracy: Maintains answer quality despite compression
# Run interactive analysis
python run_interactive.py
# Test specific components
python -m pytest tests/
# Performance testing
python -m pytest tests/test_performance.py
# Or use the main menu
python main.py # Then select option 3
MIT License - see LICENSE file for details.