Inspiration

With the rise of AI tools, misinformation is everywhere, and it’s increasingly difficult to know what to trust. Large language models often sound confident even when they’re completely wrong, and most users have no way of knowing when that happens. We felt that trust in AI should come not from blind faith but from verification. We wanted to build a system that doesn’t just generate answers but also checks them using multiple independent models. Bastinel was born from that idea: an AI that actively protects users from hallucinations by validating its own outputs and showing how reliable they really are. Our goal was to make a tool that anyone could use to quickly spot misinformation and understand what’s true.

What it does

Bastinel takes the prompt provided by the user and runs it through a structured, multi-model verification pipeline. Instead of relying on a single LLM, Bastinel uses two separate generator models that each create their own answer independently. Then, two "judging" models analyze those answers, check their factual grounding, and score them for accuracy using carefully designed prompts. The system averages the judge scores into a final reliability score that is simple for users to understand: a high score indicates the answer is trustworthy, while a low score warns the user that the information is probably unreliable and likely hallucinated. By showing the score, the judges’ reasoning, and the comparison between models, Bastinel makes AI behavior transparent, helping users see not just what the AI says, but how confident it actually deserves to be.
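Concretely, the result surfaced to the user can be sketched like this (the field names and example question are illustrative, not Bastinel’s actual schema):

```python
# Illustrative shape of a Bastinel result; field names are assumptions.
result = {
    "question": "When was the Eiffel Tower completed?",
    "generator_answers": ["1889", "It opened in 1889."],
    "judge_verdicts": [
        {"score": 92, "reasoning": "Both answers match the historical record."},
        {"score": 88, "reasoning": "Dates agree; phrasing differs only in detail."},
    ],
}

# Final reliability score = average of the judge scores.
scores = [v["score"] for v in result["judge_verdicts"]]
result["score"] = sum(scores) / len(scores)
print(result["score"])  # 90.0
```

A high average like this would be shown to the user as "trustworthy"; a low one would carry a warning instead.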

How we built it

We automated our development workflow using a GitHub Actions pipeline. Our infrastructure runs on AWS, and we make all LLM calls through OpenRouter. Bastinel uses four models per query: two generator models that each produce a response independently, and two judge models that evaluate the generators and score their accuracy. Using careful prompt engineering, each judge compares the two generator outputs, checks them for factual correctness, assigns a score, and returns its reasoning. We then average the judge scores and output a final reliability metric. Inspired by the structure of Bloom filters, we embraced the idea of combining multiple independent checks into a unified confidence score.
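The pipeline shape can be sketched as below. The model slugs and judge prompt are illustrative placeholders, not our exact configuration, and the actual OpenRouter HTTP call is injected as `call_model` so the orchestration logic stays testable:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Model slugs are illustrative placeholders, not our exact configuration.
GENERATORS = ["google/gemini-flash-1.5", "anthropic/claude-3-haiku"]
JUDGES = ["openai/gpt-4o-mini", "meta-llama/llama-3-70b-instruct"]

def judge_prompt(question, answer_a, answer_b):
    """Ask a judge to compare both generator answers and return JSON."""
    return (
        "Two independent answers to a question are given below. Check each "
        "for factual grounding, then reply with JSON of the form "
        '{"score": <0-100>, "reasoning": "<why>"}.\n'
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
    )

def verify(question, call_model):
    """Run the four-model pipeline.

    `call_model(model_slug, prompt) -> str` is injected so the real
    OpenRouter request (or a stub during testing) can be swapped in.
    """
    # 1. Both generators answer in parallel with no shared context,
    #    which is what removed the inherited-bias problem.
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda m: call_model(m, question), GENERATORS))
    # 2. Each judge sees both answers and scores their accuracy.
    verdicts = [
        json.loads(call_model(j, judge_prompt(question, answers[0], answers[1])))
        for j in JUDGES
    ]
    # 3. Average the judge scores into the final reliability metric.
    score = sum(v["score"] for v in verdicts) / len(verdicts)
    return {"answers": answers, "verdicts": verdicts, "score": score}
```

Keeping the generator calls in a parallel stage, rather than a chain, is the design decision that guarantees model independence.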

The frontend was built using Framer Motion, TailwindCSS, and ShadCN components to provide smooth, clean, modern interactions.

Challenges we ran into

  • Algorithm redesign: Our initial approach used Gemini to generate the response and then prompted all the other models in a single chain. This created bias because the other models inherited Gemini’s context. We had to completely re-architect the pipeline so that each model works independently.
  • Infrastructure overhaul: We originally planned to run judging on AWS Bedrock, but we had to migrate to OpenRouter due to model availability and flexibility constraints.
  • Balancing accuracy vs efficiency: Using four models dramatically improves reliability but significantly increases token usage and latency.
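The cost asymmetry is easy to see with a back-of-the-envelope token count; all numbers below are made up purely for illustration:

```python
# Illustrative back-of-the-envelope numbers, not measured values.
PROMPT_TOKENS = 200      # user prompt sent to each generator
ANSWER_TOKENS = 400      # typical generator answer
JUDGE_OVERHEAD = 150     # judge instructions wrapped around the answers

single_model = PROMPT_TOKENS + ANSWER_TOKENS

# Four-model pipeline: two generators, then two judges that each
# re-read the prompt plus BOTH generator answers.
generators = 2 * (PROMPT_TOKENS + ANSWER_TOKENS)
judges = 2 * (JUDGE_OVERHEAD + PROMPT_TOKENS + 2 * ANSWER_TOKENS)
pipeline = generators + judges

print(pipeline / single_model)  # roughly 5.8x the tokens of one call
```

Latency scales better than tokens do: with each stage parallelized, the wall-clock cost is two sequential model calls rather than four.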

Accomplishments that we're proud of

  • We built a tool with real-world applications that could genuinely help people navigate misinformation and trust AI systems more.
  • We developed a clean frontend and a fully modular, multi-model verification pipeline.
  • We created detailed pseudo test cases to validate our scoring logic and maintain consistent reliability across models.
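Those pseudo test cases looked roughly like this; the helper names and the 50-point threshold are illustrative, not our production values:

```python
def aggregate(judge_scores):
    """Average judge scores into one reliability score (illustrative helper)."""
    return sum(judge_scores) / len(judge_scores)

def label(score, threshold=50.0):
    """Map a reliability score to the verdict shown in the UI."""
    return "trustworthy" if score >= threshold else "probably unreliable"

# Judges agree the answer is well grounded -> high score, no warning.
assert label(aggregate([90, 84])) == "trustworthy"
# Judges agree the answer is hallucinated -> low score, warning shown.
assert label(aggregate([12, 20])) == "probably unreliable"
# Judges disagree sharply -> the average lands near the threshold,
# exactly the edge case these tests forced us to reason about.
assert label(aggregate([95, 10])) == "trustworthy"
```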

What we learned

  • Working with AWS and OpenRouter at scale
  • Architecting multi-model evaluation pipelines
  • Building polished frontends using Framer Motion and ShadCN
  • Prompt engineering for judge models
  • Managing token cost, latency, and model independence

What's next for Bastinel

  • User authentication and personalized dashboards
  • Query history and analytics
  • Ability for users to select which models act as judges
  • Improving token efficiency across the pipeline
  • Further reducing latency while maintaining accuracy
  • Potential browser extension for real-time AI fact-checking
