College Articulation PDF Vector Database API

A FastAPI service for searching college articulation PDFs using a Pinecone vector database. It returns raw search results, designed to be used as a tool by Gemini or other LLMs.

Features

📄 Automatic PDF text extraction and chunking
🔍 Semantic search using Pinecone with automatic embeddings (llama-text-embed-v2)
🚀 FastAPI REST endpoints
📊 Source citations and similarity scores for each chunk
🧩 Tool-ready: Returns structured data perfect for LLM integration

Setup

Install dependencies:

pip install -r requirements.txt

Set environment variables: Create a .env file with:

PINECONE_API_KEY=your_pinecone_api_key_here

Run the API:

python main.py

Or with uvicorn directly:

uvicorn main:app --reload --host 0.0.0.0 --port 8000

API Endpoints

`GET /`

Health check endpoint.

`GET /health`

Detailed health check with index statistics.

`POST /ingest`

Ingest PDF(s) into the Pinecone vector database.

Request body (optional). Set pdf_path to ingest a specific PDF; if it is omitted, all PDFs in the pdfs/ folder are ingested:

{
  "pdf_path": "pdfs/AA_UF.pdf"
}

Response:

{
  "message": "Successfully ingested 1 PDF(s) with 150 total chunks",
  "ingested_count": 150,
  "pdfs_processed": ["AA_UF"]
}

`POST /search` or `POST /query`

Search the college PDF vector database and return relevant chunks.

Request body. top_k is optional and sets the number of chunks to retrieve (default: 5):

{
  "query": "What are the articulation agreements for University of Florida?",
  "top_k": 5
}

Response:

{
  "query": "What are the articulation agreements for University of Florida?",
  "chunks": [
    {
      "text": "The University of Florida has articulation agreements...",
      "source": "AA_UF",
      "score": 0.92,
      "id": "AA_UF_0"
    },
    {
      "text": "Transfer credits from community colleges...",
      "source": "AA_UF",
      "score": 0.88,
      "id": "AA_UF_1"
    }
  ],
  "total_results": 2,
  "context": "[Source: AA_UF]\nThe University of Florida has articulation agreements...\n\n---\n\n[Source: AA_UF]\nTransfer credits from community colleges..."
}
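The context field in the sample response appears to be the chunks joined with a "---" separator, each prefixed by its source. A minimal sketch of that formatting follows, assuming the layout shown in the sample; the function name build_context and the SEPARATOR constant are hypothetical and not taken from main.py.

```python
from typing import Dict, List

# Separator inferred from the sample response above (an assumption).
SEPARATOR = "\n\n---\n\n"


def build_context(chunks: List[Dict]) -> str:
    """Join chunks into one string, labelling each with its source PDF."""
    parts = [f"[Source: {chunk['source']}]\n{chunk['text']}" for chunk in chunks]
    return SEPARATOR.join(parts)


# The two chunks from the sample response.
chunks = [
    {
        "text": "The University of Florida has articulation agreements...",
        "source": "AA_UF",
        "score": 0.92,
        "id": "AA_UF_0",
    },
    {
        "text": "Transfer credits from community colleges...",
        "source": "AA_UF",
        "score": 0.88,
        "id": "AA_UF_1",
    },
]
context = build_context(chunks)
print(context)
```

A client that only needs a prompt-ready string can use context directly; one that needs per-chunk scores or ids should read chunks instead.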

`GET /stats`

Get statistics about the Pinecone index.

Usage Example

Ingest all PDFs:

curl -X POST http://localhost:8000/ingest

Search the vector database:

curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the articulation agreements for Florida universities?", "top_k": 5}'

Using as a Gemini Tool

This API is designed to be used as a tool by Gemini. The response format provides:

chunks: Array of relevant document chunks with metadata
context: Pre-formatted combined context string ready for LLM consumption
source (per chunk): Source PDF name for citation
score (per chunk): Similarity score for relevance ranking
total_results: Number of chunks returned

Example Gemini function calling:

# Gemini can call this API and use the returned chunks/context
# to generate answers based on the retrieved documents
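
One way to expose the API to a model is to wrap POST /search in a plain Python function. The sketch below uses only the standard library and assumes the service runs at the local address from the usage example; the names API_BASE_URL, build_search_request, and search_articulation_pdfs are hypothetical. How the function is registered as a tool depends on the Gemini SDK in use and is not shown.

```python
import json
import urllib.request

# Assumption: the API is running locally, as in the curl examples.
API_BASE_URL = "http://localhost:8000"


def build_search_request(query: str, top_k: int = 5) -> urllib.request.Request:
    """Build the POST /search request without sending it."""
    payload = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE_URL}/search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search_articulation_pdfs(query: str, top_k: int = 5) -> dict:
    """Send the search request and return the parsed JSON response."""
    request = build_search_request(query, top_k)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

With the server running, search_articulation_pdfs("UF transfer requirements") would return the dictionary described under POST /search, and the model can cite the source of each chunk in its answer.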

Architecture

Pinecone: Vector database with automatic embeddings using llama-text-embed-v2
PyPDF2: PDF text extraction
FastAPI: REST API framework

Notes

The index is automatically created if it doesn't exist
PDFs are split into 1000-character chunks with a 200-character overlap between consecutive chunks
Each chunk includes metadata with the source PDF name
The API automatically handles embedding generation through Pinecone's integrated model
The API returns raw search results with no LLM integration (it is designed to be used as a tool)
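
The chunking note above can be illustrated with a fixed-size sliding window. This is a minimal sketch assuming simple character-based slicing; the function name chunk_text is hypothetical, and the actual logic in main.py may differ (for example, by splitting on sentence or page boundaries).

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into fixed-size chunks that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # each chunk starts 800 characters after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "".join(chr(97 + i % 26) for i in range(2500))
pieces = chunk_text(sample)
print([len(piece) for piece in pieces])
```

The overlap means text that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some characters twice.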
