Inspiration

We’ve all experienced Mechanic Anxiety, that sinking feeling when your car engine starts making a strange knocking sound, or your AC unit starts buzzing. You know you need a professional diagnosis, but you also fear the bill.

Is it a loose screw (500 PKR fix) or a catastrophic rod failure (50,000 PKR fix)?

We realized that while mechanics rely on years of ear-training to diagnose faults, Artificial Intelligence can now see and hear with superhuman precision. We wanted to build a tool that democratizes this expertise, giving every car owner, homeowner, and technician a senior diagnostic engineer in their pocket.

SonicFix was born from a simple question: What if your phone could tell you exactly what’s wrong, just by listening?

What it does

SonicFix is a multimodal diagnostic assistant that fuses Visual and Acoustic data to identify mechanical failures in real-time.

  1. Visual Context: The user snaps a photo of the machine (e.g., a car engine, a washing machine, or an industrial compressor). This grounds the AI, preventing it from guessing blindly.
  2. Acoustic Analysis: The user records the sound of the machine running.
  3. Smart Filtering: On-device models filter out background noise (speech, traffic) to ensure only mechanical sounds are analyzed.
  4. Instant Diagnosis: The app returns a detailed report identifying the specific fault (e.g., "Worn Serpentine Belt"), the severity level, actionable repair steps, and even an estimated repair cost tailored to the local market (Pakistan 2026).
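For illustration, here is the rough shape of a diagnosis report; the field names and values below are placeholders chosen for this write-up, not the exact production schema.

# Illustrative shape of a SonicFix diagnosis (keys and values are placeholders)
diagnosis = {
    "fault": "Worn Serpentine Belt",
    "severity": "Medium",
    "repair_steps": [
        "Inspect the belt tensioner for play",
        "Replace the serpentine belt",
    ],
    "estimated_cost_pkr": {"min": 2500, "max": 6000},   # local-market estimate
}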

How we built it

We built SonicFix using a Flutter frontend for cross-platform performance and a robust Serverless Python Backend on Firebase Cloud Functions.
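For context, an HTTPS function in the Python Firebase Functions SDK is wired up roughly like this; a minimal sketch where the handler name and body are simplified stand-ins, not our production code:

# Minimal sketch of the serverless entry point (firebase-functions, Python)
from firebase_functions import https_fn

@https_fn.on_request()
def diagnose(req: https_fn.Request) -> https_fn.Response:
    # Real pipeline: decode the uploaded photo + audio, run the YAMNet gate,
    # then pass everything to the Gemini fusion step described below.
    return https_fn.Response("ok")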

The core innovation is our "Fusion Pipeline":

  1. Signal Pre-processing (The Gatekeeper): We integrated YAMNet (from TensorFlow Hub) as a first line of defense. Before wasting expensive API tokens, YAMNet analyzes the raw audio waveform to classify the sound source.
# YAMNet tells us whether the audio is mechanical before we spend Gemini API tokens.
# NON_MECHANICAL_BLACKLIST holds YAMNet class names such as "Speech" or "Silence".
if yamnet_data["primary_sound"] in NON_MECHANICAL_BLACKLIST:
    flag_for_review()        # e.g. prompt the user to re-record closer to the machine
else:
    proceed_to_fusion()      # forward image + audio + YAMNet tag to Gemini
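For reference, the gate itself only takes a few lines. This is a rough sketch assuming the public TF Hub YAMNet model and a 16 kHz mono waveform; variable names are illustrative:

# Sketch: classify one second of 16 kHz mono audio with YAMNet from TensorFlow Hub
import csv
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

yamnet = hub.load("https://tfhub.dev/google/yamnet/1")

# YAMNet ships a CSV mapping class indices to human-readable names
with tf.io.gfile.GFile(yamnet.class_map_path().numpy().decode("utf-8")) as f:
    class_names = [row["display_name"] for row in csv.DictReader(f)]

waveform = np.zeros(16000, dtype=np.float32)   # stand-in for real 16 kHz mono audio
scores, embeddings, spectrogram = yamnet(waveform)
primary_sound = class_names[scores.numpy().mean(axis=0).argmax()]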

  2. Multimodal Fusion (The Brain): We use the Gemini 3 Flash API to perform true multimodal reasoning. We don't just send text; we inject the Image, the Audio, and the YAMNet Classification Tag into a single prompt. This allows Gemini 3 to correlate visual signs of wear (e.g., rust on a pulley) with specific audio frequencies (e.g., a high-pitched squeal), achieving accuracy that single-mode models cannot match.
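A condensed sketch of that fusion call, using the google-generativeai Python SDK; the prompt wording, file paths, and variable names here are illustrative rather than our exact production prompt:

# Sketch: one multimodal request carrying the image, the audio and the YAMNet tag
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-3-flash-preview")   # preferred model in our cascade

image_bytes = open("engine.jpg", "rb").read()   # photo of the machine (illustrative path)
audio_bytes = open("engine.wav", "rb").read()   # recording of it running (illustrative path)
yamnet_tag = "Engine"                           # expert signal from the YAMNet gate

response = model.generate_content([
    "Role: You are SonicFix, a Senior Mechanical Diagnostics AI. "
    f"YAMNet classified this recording as '{yamnet_tag}'. "
    "Correlate visible wear in the photo with the sound and return a JSON diagnosis.",
    {"mime_type": "image/jpeg", "data": image_bytes},
    {"mime_type": "audio/wav", "data": audio_bytes},
])
print(response.text)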

  3. Resilient Architecture (The Fallback): Since Gemini 3 is in Preview, we engineered a production-grade Fallback Cascade. Our system prioritizes gemini-3-flash-preview for its reasoning power but automatically degrades to gemini-3.0-pro or gemini-1.5-flash if the preview API returns a 503 Service Unavailable error.
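A simplified sketch of what that cascade looks like, assuming the SDK surfaces 503s as google.api_core.exceptions.ServiceUnavailable (retries, backoff, and logging omitted):

# Sketch: prefer the preview model, degrade gracefully when it returns a 503
import google.generativeai as genai
from google.api_core import exceptions as gexc

MODEL_CASCADE = ["gemini-3-flash-preview", "gemini-3.0-pro", "gemini-1.5-flash"]

def generate_with_fallback(parts):
    last_error = None
    for model_name in MODEL_CASCADE:
        try:
            return genai.GenerativeModel(model_name).generate_content(parts)
        except gexc.ServiceUnavailable as err:   # 503 Service Unavailable
            last_error = err                     # try the next model in the cascade
    raise last_error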

Challenges we ran into

  • 503 Service Errors: Being on the bleeding edge of Gemini 3 Preview meant dealing with stability issues. Our initial deploys failed frequently with "Service Unavailable." This forced us to implement the Fallback Hierarchy strategy, which turned a weakness into one of our strongest architectural features.

  • Audio Sampling Rates: YAMNet strictly requires 16kHz mono audio. We had to write a custom ensure_sample_rate() pre-processor using scipy to normalize audio from different mobile devices before analysis.
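A condensed version of that pre-processor, sketched with scipy.signal.resample; stereo-to-mono conversion and file I/O are left out here:

# Sketch: normalize any incoming waveform to the 16 kHz rate YAMNet expects
import numpy as np
import scipy.signal

def ensure_sample_rate(waveform: np.ndarray, original_rate: int,
                       desired_rate: int = 16000):
    """Resample the waveform if the recording device used a different rate."""
    if original_rate != desired_rate:
        target_length = int(round(len(waveform) * desired_rate / original_rate))
        waveform = scipy.signal.resample(waveform, target_length)
    return desired_rate, waveform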

Accomplishments that we're proud of

  • True Multimodal Integration: We aren't just sending text to a chatbot. We successfully fused Vision (Image) and Hearing (Audio) into a single inference pass using Gemini 3.
  • The YAMNet Integration: Successfully implementing a TensorFlow Hub model inside a serverless function to act as an "Expert Signal" for the LLM was a major technical win.
  • Regional Pricing Logic: We successfully prompted the model to understand the specific economic context of Pakistan (PKR), making the tool genuinely useful for our local target audience rather than just giving generic dollar estimates.

What we learned

  • Edge vs. Cloud Balance: We learned that "Cloud-only" isn't always best. Using YAMNet as a lightweight filter saves massive amounts of compute time by rejecting non-mechanical audio early.
  • The Power of Prompt Engineering: We discovered that giving the model a "Role" (Role: You are SonicFix, a Senior Mechanical Diagnostics AI) drastically improved the quality and structure of the JSON output compared to generic prompts.
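For example, a role-framed prompt in that spirit looks roughly like this; the wording and JSON keys are illustrative, not the exact production template:

# Sketch: role framing plus an explicit JSON contract for the model's answer
PROMPT_TEMPLATE = """Role: You are SonicFix, a Senior Mechanical Diagnostics AI.
YAMNet classified the attached recording as '{yamnet_tag}'.
Correlate the attached photo with the sound and answer ONLY with JSON of the form:
{{"fault": str, "severity": "Low|Medium|High", "repair_steps": [str],
  "estimated_cost_pkr": {{"min": int, "max": int}}}}
All cost estimates must reflect the Pakistani market (PKR), not generic USD figures."""

prompt = PROMPT_TEMPLATE.format(yamnet_tag="Engine")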

What's next for SonicFix

  • Real-Time AR Overlay: Using Gemini 3's video capabilities to overlay repair instructions directly onto the engine block through the phone camera.
  • OBD-II Integration: Connecting via Bluetooth to the car's computer to combine its sensor data and fault codes with our audio-visual analysis for far greater diagnostic certainty.
  • Enterprise API: Offering our "Audio-Visual Diagnostic" endpoint to insurance companies for automated claim verification.

Built With

Flutter · Python · Firebase Cloud Functions · TensorFlow Hub (YAMNet) · SciPy · Gemini API