What is OpenWork AI?
OpenWork AI is an evidence-first medical research synthesis platform built specifically for healthcare professionals. Instead of acting like a chatbot that guesses answers, OpenWork behaves like a focused research assistant: it searches high-quality medical sources, gathers the most relevant evidence, and then produces a clear, citation-rich summary.
The platform is not a diagnostic tool, treatment recommendation system, or clinical decision support engine. It does not tell clinicians what to do. Its only job is to help them read, understand, and synthesize medical literature faster so they can make their own informed decisions. Every key statement in the final answer is backed by inline references, so clinicians can immediately verify where the information came from.
OpenWork currently pulls evidence from multiple trusted sources, including PubMed, PubMed Central, European PMC, DailyMed, and Indian clinical practice guidelines, with a strict focus on medical journals and regulatory documents. The goal is simple: reduce hours of manual literature review to under a minute, without compromising on evidence quality or transparency.
Inspiration and Problem
This project comes directly from observing how the clinicians around me work. Several of my friends and family members are practicing doctors (including a pediatrician and a gynecologist in India), and I saw how much time they spend outside their regular working hours just trying to keep up with the literature.
A typical pattern looked like this:
- Long clinical hours.
- Then, late-night sessions reading multiple papers to answer a single question.
- Constant pressure to stay updated on new trials, guidelines, and safety information.
In many Indian hospitals and clinics, there is no dedicated research assistant or librarian. Doctors often do this work alone, manually searching PubMed, jumping between PDFs, guidelines, and drug labels. For complex questions, it can easily take several hours just to collect and synthesize high-quality evidence.
The core question that inspired OpenWork was:
“What if a system could do the heavy lifting of literature search, filtering, and synthesis, while still keeping clinicians fully in control of the final judgment?”
OpenWork is my attempt to build that system: an AI-powered, multi-agent research pipeline that always starts from evidence and never pretends to be a doctor.
How I Built It
Overall Architecture
OpenWork is built as a 7‑agent orchestration system on top of Gemini 3.0 models, with a Next.js frontend for the UI and a Python-based backend orchestrator to coordinate retrieval, reranking, and synthesis.
At a high level, the flow is:
- User asks a medical research question.
- Agent 1 interprets the query, expands abbreviations, and generates smarter search variants.
- Agent 2 and its sub‑agents run parallel searches across multiple sources.
- Agent 3 normalizes and deduplicates the evidence.
- Agent 4 (BGE reranker) ranks the most relevant chunks.
- Agent 5 checks for gaps or missing evidence (e.g., recency, safety data).
- Agent 6 synthesizes an answer with inline citations.
- Agent 7 acts as a verification gate to reduce hallucinations and enforce grounding.
All agents are powered by Gemini 3.0, using gemini-3.0-flash-preview where speed is critical and gemini-3.0-pro-exp where deeper reasoning and synthesis are required.
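To make this flow concrete, here is a minimal sketch of how the orchestrator chains the agents. Every function name below is a simplified placeholder for the corresponding agent, not the exact production code.

```python
import asyncio

# Each function is a stand-in for one agent in the pipeline.
async def answer_question(question: str) -> dict:
    plan = await interpret_query(question)            # Agent 1: expand + plan searches
    raw_evidence = await retrieve_all_sources(plan)   # Agent 2: parallel retrieval (2.1-2.5)
    chunks = normalize_and_dedupe(raw_evidence)       # Agent 3: unified schema + dedup
    top_chunks = rerank(question, chunks, top_k=10)   # Agent 4: BGE reranker
    top_chunks = await fill_evidence_gaps(question, top_chunks)  # Agent 5: feedback loop
    draft = await synthesize(question, top_chunks)    # Agent 6: cited synthesis
    return await verify(draft, top_chunks)            # Agent 7: grounding gate
```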
The 7-Agent System

Agent 1 – Query Intelligence
This agent analyzes the raw user query and transforms it into something that is search-ready. It:
- Expands abbreviations and medical shorthand.
- Extracts key entities (disease, drug, population, outcome, etc.).
- Generates multiple, specialized search variants for different sources.
This step makes a huge difference because clinicians often ask questions in natural language, and databases like PubMed respond better to more structured queries.
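As a rough illustration of this step, here is a sketch of the kind of call Agent 1 makes, assuming the google-genai SDK on Vertex AI; the project settings, prompt wording, and output fields are placeholders rather than the production versions.

```python
import json

from google import genai
from google.genai import types

# Placeholders: swap in your own GCP project and location.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

PROMPT = """You are a medical search strategist. Given the clinician's question,
expand all abbreviations and return JSON with:
  "entities": the disease, drug, population, and outcome mentioned,
  "pubmed_queries": a few structured PubMed query strings,
  "guideline_queries": plain-language variants for guideline vector search.

Question: """


def expand_query(question: str) -> dict:
    response = client.models.generate_content(
        model="gemini-3.0-flash-preview",  # the fast model is enough for query planning
        contents=PROMPT + question,
        config=types.GenerateContentConfig(response_mime_type="application/json"),
    )
    return json.loads(response.text)
```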
Agent 2 – Multi‑Source Retrieval Coordinator (with 2.1–2.5 sub‑agents)
This is a Python async orchestrator that fans out the expanded queries into parallel sub‑agents:
- 2.1 Guidelines Retriever – Vector search across Indian clinical practice guidelines, powered by Gemini 3 Flash + vector embeddings.
- 2.2 PubMed Intelligence – Builds advanced PubMed queries (with MeSH-like term expansion) using Gemini.
- 2.3 Full‑Text Fetcher – Pulls full text from PMC and other open-access PDFs, then structures them into chunks.
- 2.4 DailyMed Retriever – Extracts dosing, safety, black-box warnings, and pharmacology from FDA drug labels.
- 2.5 Tavily/Recent Literature Search – On-demand web search to plug in very recent publications not yet deeply indexed elsewhere.
All of these run concurrently, so the system can fetch a broad and deep evidence set quickly.
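The fan-out itself is plain asyncio. A minimal sketch, where each sub-agent's fetch function is a placeholder for the real connector:

```python
import asyncio


async def retrieve_all_sources(plan: dict) -> list:
    tasks = [
        fetch_guidelines(plan["guideline_queries"]),   # 2.1 guideline vector search
        fetch_pubmed(plan["pubmed_queries"]),          # 2.2 PubMed queries
        fetch_fulltext(plan["pubmed_queries"]),        # 2.3 PMC / open-access full text
        fetch_dailymed(plan["entities"].get("drug")),  # 2.4 FDA drug labels
        fetch_recent_web(plan["pubmed_queries"]),      # 2.5 Tavily recent-literature search
    ]
    # return_exceptions=True so one failing source never sinks the whole batch
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]
```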
Agent 3 – Evidence Normalizer
Once all sources respond, the formats are messy: XML, HTML, JSON, PDF text, etc. Agent 3:
- Normalizes everything into a unified internal schema.
- Cleans and chunks the text.
- Deduplicates overlapping content across sources and queries.
This makes later ranking and synthesis much more reliable.
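A minimal sketch of what that unified schema and deduplication can look like; the field names are illustrative.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class EvidenceChunk:
    source: str     # "pubmed", "pmc", "dailymed", "guideline", "web"
    source_id: str  # e.g. PMID, label set id, guideline document id
    title: str
    text: str
    url: str


def dedupe(chunks: list[EvidenceChunk]) -> list[EvidenceChunk]:
    seen, unique = set(), []
    for chunk in chunks:
        # Normalize whitespace and case so near-identical chunks collapse to one.
        key = hashlib.sha1(" ".join(chunk.text.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```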
Agent 4 – BGE Reranker
To avoid drowning in noise, I integrated the BAAI/bge-reranker-v2-m3 model. It:
- Takes the user query and ~100+ evidence chunks as input.
- Scores each chunk for semantic relevance.
- Selects the top 10 or so highest-confidence evidence segments.
This reranking layer dramatically improves the precision of what the synthesis engine sees, making the final answer more grounded and focused.
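A sketch of that reranking step, assuming the FlagEmbedding package and the EvidenceChunk shape from the Agent 3 sketch; the top_k value and score normalization are tuning choices, not fixed requirements.

```python
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)


def rerank(query: str, chunks: list, top_k: int = 10) -> list:
    pairs = [[query, chunk.text] for chunk in chunks]
    scores = reranker.compute_score(pairs, normalize=True)  # sigmoid-scaled relevance
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```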
Agent 5 – Evidence Gap Analyzer
Using gemini-3.0-pro-exp, this agent looks at the current evidence set and asks:
- Is there enough recent data?
- Are key study types missing (e.g., RCTs vs reviews)?
- Is there safety or regulatory information missing?
If it detects gaps, it can trigger another call to sub‑agents like 2.5 (recent web search) to fill those holes. This gives the pipeline a feedback loop rather than a one‑shot retrieval.
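A simplified sketch of that feedback loop; the gap flags and the re-query helpers are hypothetical stand-ins for the real Agent 5 output and sub-agent calls.

```python
async def fill_evidence_gaps(question: str, chunks: list) -> list:
    # Agent 5 (gemini-3.0-pro-exp) is asked to return flags such as
    # {"needs_recent": true, "needs_safety": false}; analyze_gaps and the
    # fetchers below are placeholders for the real calls.
    gaps = await analyze_gaps(question, chunks)
    if gaps.get("needs_recent"):
        chunks += await fetch_recent_web(question)  # re-invoke sub-agent 2.5
    if gaps.get("needs_safety"):
        chunks += await fetch_dailymed(question)    # re-invoke sub-agent 2.4
    return chunks
```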
Agent 6 – Synthesis Engine
This is where the answer is actually written. Depending on the complexity of the question, it uses:
- gemini-3.0-flash-preview for faster, simpler summaries.
- gemini-3.0-pro-exp for complex, nuanced questions.
Key design rules:
- Every factual statement must be linked to at least one evidence chunk.
- Inline citations like [1], [2], [3] map directly to actual sources (PubMed IDs, guideline documents, FDA labels, etc.).
- Contradictions in the literature must be surfaced, not hidden. The agent is instructed to explicitly describe conflicting results and levels of evidence.
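A sketch of how the evidence can be numbered before synthesis so that [n] citations map back to concrete sources; the prompt wording is illustrative, and the chunk fields are reused from the Agent 3 sketch.

```python
def build_synthesis_prompt(question: str, chunks: list) -> str:
    # Number every chunk so the model can only cite evidence it was given.
    evidence_block = "\n\n".join(
        f"[{i}] {c.title} ({c.source}: {c.source_id})\n{c.text}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the research question using ONLY the numbered evidence below.\n"
        "Cite every factual statement inline as [n]. If sources disagree, "
        "describe the disagreement explicitly; do not hide it.\n\n"
        f"Question: {question}\n\nEvidence:\n{evidence_block}"
    )
```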
Agent 7 – Verification Gate
The final answer is passed through a separate verification step (again with Gemini 3.0). This agent:
- Checks that each claim is grounded in one of the retrieved evidence chunks.
- Flags potential hallucinations or unsupported extrapolations.
- Ensures the citations map to real references.
Only after passing this gate is the answer returned to the user.
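A sketch of the LLM half of that gate, again assuming the google-genai SDK with placeholder project settings; the verdict schema is illustrative.

```python
import json

from google import genai
from google.genai import types

client = genai.Client(vertexai=True, project="my-project", location="us-central1")  # placeholders


def verify_grounding(answer: str, numbered_evidence: str) -> dict:
    response = client.models.generate_content(
        model="gemini-3.0-flash-preview",
        contents=(
            "For each cited sentence in ANSWER, decide whether it is fully supported "
            "by the numbered EVIDENCE. Return JSON like "
            '{"supported": true, "unsupported_claims": []}.\n\n'
            f"EVIDENCE:\n{numbered_evidence}\n\nANSWER:\n{answer}"
        ),
        config=types.GenerateContentConfig(response_mime_type="application/json"),
    )
    # The pipeline only releases the answer if "supported" comes back true.
    return json.loads(response.text)
```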
Frontend and Infrastructure
- Frontend: Next.js 14 with TypeScript and a clean, professional UI tailored for clinicians (fast search, readable typography, structured evidence sections).
- Backend / Orchestration: Python async services coordinating all 7 agents and handling source connectors.
- Data Layer: 46+ evidence connectors (PubMed, PMC, DailyMed, Indian guidelines, etc.), vector search for guideline content, and unified citation parsing.
Challenges I Faced
Building OpenWork was not straightforward. Some of the main challenges were:
Designing a truly multi‑agent pipeline (not just one big prompt)
Initially, I tried to do everything in a single LLM call: query understanding, searching, and synthesis. It failed quickly: the model hallucinated sources, mixed up evidence, and performed poorly on complex questions. I had to break the system down into clearly separated agents, each with a very narrow responsibility, and orchestrate them carefully.
Getting reliable, working citations
One of the hardest problems was ensuring that every inline citation:
- Actually corresponds to a real source.
- Has a working link.
- Matches the claim that references it.
I went through multiple iterations where citations looked good on the surface, but links were broken, PMIDs were mismatched, or the text didn’t fully support the claim. I had to:
- Implement strict mapping between evidence chunks and citations.
- Add checks in the verification agent to validate grounding.
- Iterate on prompt design and response schemas until citation reliability improved.
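As one concrete example of that strict mapping, here is a simplified sketch of the deterministic citation check, reusing the EvidenceChunk fields from the Agent 3 sketch; the exact rules in production are more involved.

```python
import re


def check_citations(answer: str, chunks: list) -> list[str]:
    """Return a list of problems; an empty list means the citation layer passed."""
    problems = []
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    if not cited:
        problems.append("answer contains no inline citations")
    for n in cited:
        if not 1 <= n <= len(chunks):
            problems.append(f"citation [{n}] has no matching evidence chunk")
        elif not chunks[n - 1].url.startswith("http"):
            problems.append(f"citation [{n}] ({chunks[n - 1].source_id}) has no usable link")
    return problems
```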
Integrating the BGE reranker correctly
Plugging in the BGE reranker wasn’t just about calling a model. I had to:
- Decide the right chunk size and overlap.
- Balance recall (enough evidence) vs precision (not overwhelming the model).
- Tune the number of top‑k chunks that go into the synthesis step.
Several early versions either missed critical evidence or overwhelmed the LLM with too much context. Tuning this balance was a lot of trial and error.
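A sketch of the word-level chunking that feeds the reranker; the window and overlap sizes here are the kind of values I tuned, not magic numbers.

```python
def chunk_text(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        start += size - overlap  # overlapping windows so claims aren't cut mid-sentence
    return chunks
```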
Handling failures and edge cases in external APIs
Real-world APIs like PubMed or PDF retrieval are not always clean:
- Timeouts.
- Broken PDFs.
- Inconsistent HTML structures.
I had to build retries, fallbacks, and robust parsing so the system could still function even when one or two sources failed or returned messy data.
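A sketch of the kind of retry wrapper this led to; timeout, attempt count, and backoff values are illustrative.

```python
import asyncio


async def fetch_with_retries(fetch, *args, attempts: int = 3, base_delay: float = 1.0):
    """Call an async source connector, retrying with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(fetch(*args), timeout=20)
        except Exception:
            if attempt == attempts - 1:
                return []  # degrade gracefully so the rest of the pipeline continues
            await asyncio.sleep(base_delay * 2 ** attempt)  # exponential backoff
```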
Aligning strict safety and scope boundaries
I wanted to be very clear that OpenWork is not a diagnostic or treatment tool. That meant:
- Designing prompts and UI copy that reinforce this boundary.
- Avoiding language like “you should treat with X.”
- Keeping the system focused on summarizing evidence, not giving clinical orders.
Accomplishments I’m Proud Of
The most meaningful validation came from real clinicians using the tool.
After deploying a development version, I shared the demo URL with my friends and my sister, who are working health professionals (including a pediatrician and a gynecologist). They tested OpenWork with their own real-world questions.
Some of the highlights:
- They were able to get high-quality, citation-backed summaries in under a minute for questions that would normally take them hours of manual reading.
- They liked that the system stayed inside the research domain and didn’t try to “act like a doctor.”
- They emphasized how something like this could be especially impactful in India, where many health professionals are overloaded and don’t have protected research time.
Their feedback was clear: this isn’t just a cool demo; it has the potential to free up time and cognitive load for clinicians, especially in resource-constrained environments. Hearing that from them was one of the biggest accomplishments for me in this project.
On the technical side, I’m proud that:
- The 7‑agent architecture is not just a diagram; it’s actually implemented and working.
- Every claim in the final answer is traceable to real evidence.
- The system is fast enough to be practical, thanks to careful use of Gemini 3 Flash and Pro, plus async orchestration and reranking.
What I Learned
This project taught me a lot on multiple levels:
Multi‑agent systems are about discipline, not just creativity
A good multi‑agent architecture is less about making the model “smart” and more about making the pipeline strict and predictable. The more I narrowed each agent’s responsibility, the better the overall system became.
Evidence-first design changes how you think about LLMs
When you force yourself to ground every claim in actual documents, you naturally:
- Rethink retrieval quality.
- Care more about reranking and normalization.
- Treat the LLM as a reasoning layer on top of evidence, not a source of truth.
This mindset significantly reduced hallucinations and made the system much more trustworthy.
Iteration through failure is unavoidable
I failed multiple times:
- Early versions hallucinated references.
- Some pipelines returned beautifully written but poorly grounded answers.
- API integration broke more times than I’d like to admit.
Each failure forced me to ask: “Where exactly did this go wrong in the pipeline?” and then refine that specific agent, prompt, or connector. Over time, this led to a much more robust and debuggable architecture.
Real-user feedback is more valuable than perfect code
Showing the tool early to real doctors completely changed my priorities. Instead of chasing fancy features, I focused on:
- Speed.
- Clarity of answers.
- Reliability of citations.
- Limiting scope to medical research, not general-purpose chat.
Their reactions and suggestions helped me align the project with actual clinical needs rather than just building a technically impressive demo.
Final Thoughts
OpenWork AI is still a work in progress, but it already proves that AI can meaningfully support medical research workflows when used in the right way: as an evidence-first assistant, not as a replacement for professional judgment.
For me, this project is not just a hackathon submission. It is a stepping stone toward building scalable, trustworthy research infrastructure for health professionals in India—so that they can spend less time on manual literature triage and more time on what actually matters: taking care of patients.
Built With
- async
- baai/bge-reranker-v2-m3
- dailymed
- european-pmc
- gemini-3.0-flash
- gemini-3.0-pro
- gemini-3.0-pro-experimental
- google-cloud
- google-cloud-firestore
- indian-clinical-practice-guidelines-(firestore-+-gcs)
- javascript
- next.js-14
- node.js
- pubmed-api
- pubmed-central-(pmc)
- python
- python-asyncio
- react
- tailwind-css
- tavily-search-api
- typescript
- vertex-ai
- xml/html/json/pdf-parsers