A FastAPI service for searching college articulation PDFs using the Pinecone vector database. It returns raw search results intended for use as a tool by Gemini or other LLMs.
- 📄 Automatic PDF text extraction and chunking
- 🔍 Semantic search using Pinecone with automatic embeddings (llama-text-embed-v2)
- 🚀 FastAPI REST endpoints
- 📊 Source citations and similarity scores for each chunk
- 🧩 Tool-ready: Returns structured data perfect for LLM integration
- Install dependencies:
pip install -r requirements.txt
- Set environment variables:
Create a .env file with:
PINECONE_API_KEY=your_pinecone_api_key_here
- Run the API:
python main.py
Or with uvicorn directly:
uvicorn main:app --reload --host 0.0.0.0 --port 8000
Health check endpoint.
Detailed health check with index statistics.
Ingest PDF(s) into Pinecone vector database.
Request body (optional):
{
"pdf_path": "pdfs/AA_UF.pdf" // Optional: specific PDF path. If omitted, ingests all PDFs in pdfs/ folder
}
Response:
{
"message": "Successfully ingested 1 PDF(s) with 150 total chunks",
"ingested_count": 150,
"pdfs_processed": ["AA_UF"]
}
Search the college PDF vector database and return relevant chunks.
Request body:
{
"query": "What are the articulation agreements for University of Florida?",
"top_k": 5 // Optional: number of chunks to retrieve (default: 5)
}
Response:
{
"query": "What are the articulation agreements for University of Florida?",
"chunks": [
{
"text": "The University of Florida has articulation agreements...",
"source": "AA_UF",
"score": 0.92,
"id": "AA_UF_0"
},
{
"text": "Transfer credits from community colleges...",
"source": "AA_UF",
"score": 0.88,
"id": "AA_UF_1"
}
],
"total_results": 2,
"context": "[Source: AA_UF]\nThe University of Florida has articulation agreements...\n\n---\n\n[Source: AA_UF]\nTransfer credits from community colleges..."
}
Get statistics about the Pinecone index.
- Ingest all PDFs:
curl -X POST http://localhost:8000/ingest
- Search the vector database:
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{"query": "What are the articulation agreements for Florida universities?", "top_k": 5}'
This API is designed to be used as a tool by Gemini. The response format provides:
- chunks: Array of relevant document chunks with metadata
- context: Pre-formatted combined context string ready for LLM consumption
- sources: Source PDF names for citation
- scores: Similarity scores for relevance ranking
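As a rough illustration of how a caller might use these fields, the sketch below builds a /search request with the standard library and rebuilds a context string from the returned chunks. The `[Source: ...]` layout and the `---` separator are copied from the example response above; the function names and the `API_URL` constant are hypothetical and are not part of this project's code.

```python
import json
from urllib import request

# Assumed local address, taken from the curl examples in this README.
API_URL = "http://localhost:8000/search"


def build_search_request(query: str, top_k: int = 5) -> request.Request:
    """Build (but do not send) the POST request for the /search endpoint."""
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query: str, top_k: int = 5) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with request.urlopen(build_search_request(query, top_k)) as resp:
        return json.load(resp)


def build_context(chunks: list[dict]) -> str:
    """Join chunks in the same layout as the `context` field shown above."""
    parts = [f"[Source: {c['source']}]\n{c['text']}" for c in chunks]
    return "\n\n---\n\n".join(parts)
```

In practice the server already returns `context`, so `build_context` is only useful when a caller wants to filter chunks (for example by `score`) before handing them to the LLM.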
Example Gemini function calling:
# Gemini can call this API and use the returned chunks/context
# to generate answers based on the retrieved documents
- Pinecone: Vector database with automatic embeddings using llama-text-embed-v2
- PyPDF2: PDF text extraction
- FastAPI: REST API framework
- The index is automatically created if it doesn't exist
- PDFs are split into 1000-character chunks with a 200-character overlap between consecutive chunks
- Each chunk includes metadata with the source PDF name
- The API automatically handles embedding generation through Pinecone's integrated model
- The API returns raw search results only; it does not call an LLM itself (it is designed to be used as a tool)
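The chunking described in the notes above could look roughly like the following. This is a minimal sketch, assuming a plain sliding window over characters; the project's actual splitter may handle boundaries differently. `chunk_text` and `build_records` are hypothetical names, and the record layout simply mirrors the `id`, `text`, and `source` fields seen in the example search response.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts 800 characters after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


def build_records(source: str, text: str) -> list[dict]:
    """Attach an id and the source PDF name to every chunk (assumed record shape)."""
    return [
        {"id": f"{source}_{i}", "text": chunk, "source": source}
        for i, chunk in enumerate(chunk_text(text))
    ]
```

With the default sizes, a 2000-character document yields three chunks starting at offsets 0, 800, and 1600, so the last 200 characters of one chunk reappear at the start of the next.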