Inspiration

The web is humanity's greatest library, but for millions with motor disabilities, the simple act of clicking a link is a monumental barrier. I was inspired by the question: How can we build a more helpful and accessible web for everyone? I saw the launch of Chrome's built-in AI not just as a new tool for developers, but as a revolutionary opportunity to create deeply personal and private assistive technologies without expensive, specialized hardware. My vision for Nutshell was to build a tool that could literally open up the digital world to those who have been left behind, giving them the power to explore, learn, and connect with complete independence.

This project is my answer. It's built on the belief that technology should empower, and that the best AI is the kind that runs securely on your own device, working for you and you alone.

What it does

Nutshell is a Chrome Extension that transforms the browsing experience by leveraging Chrome's new built-in AI APIs and advanced camera-based interaction. It moves beyond simple summarization to create a comprehensive, hands-free navigation system:

  1. AI-Powered Summaries: To reduce the physical burden of navigating confusing websites, Nutshell provides instant, on-device summaries. By simply looking at a link, the user gets a concise preview, ensuring every interaction is meaningful. This works across various content types:
  • Regular Web Pages: Uses the Summarizer API or Prompt API to generate concise summaries from extracted article content

  • YouTube Videos: Intelligently intercepts and processes video captions via XHR interception, then uses the Prompt API to generate structured summaries from the video's transcript and description

  2. Complete Hands-Free Navigation: This is the heart of Nutshell's mission. Using a standard webcam and the Human.js library for on-device computer vision, it offers full browser control:
  • Head-Tracked Cursor: Smooth and precise cursor control is mapped to the user's head movements using:

  • One-Euro Filter ($1€$ filter): A real-time signal filtering algorithm that eliminates jitter while maintaining responsiveness. The filter's behavior is governed by adaptive cutoff frequencies based on movement velocity

  • Head Calibration System: Users calibrate by positioning their head at five points (center, left, right, up, down), creating a personalized control space that adapts to their natural range of motion

  • Intelligent Movement Mapping: Combines head translation (position) and rotation (pitch/yaw) with different gains for center vs. edge regions, providing both precision and range
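The dual-gain mapping could be sketched roughly as follows (a minimal illustration, not the extension's actual code; the real version blends both translation and rotation from Human.js, and all names and constants here are assumptions):

```javascript
// Sketch: map a calibrated head pose (each axis normalized to [-1, 1] by the
// five-point calibration) to screen coordinates, with a lower gain near the
// center for precision and a higher gain toward the edges for range.
const CENTER_GAIN = 0.6;   // finer control near the calibrated center
const EDGE_GAIN = 1.4;     // more reach toward the calibrated extremes
const CENTER_RADIUS = 0.4; // boundary between the two regions (normalized)

function poseToCursor(pose, screen) {
  const map = (v) => {
    const gain = Math.abs(v) < CENTER_RADIUS ? CENTER_GAIN : EDGE_GAIN;
    return Math.max(-1, Math.min(1, v * gain)); // clamp to the control space
  };
  return {
    x: ((map(pose.x) + 1) / 2) * screen.width,
    y: ((map(pose.y) + 1) / 2) * screen.height,
  };
}
```

A real implementation would blend the two gains smoothly at the region boundary rather than switching abruptly.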

  • Mouth-Open Click: A voluntary mouth-opening gesture triggers a 'left-click', providing an intuitive and low-effort way to interact. This includes:

  • Calibration System: Users calibrate by keeping their mouth closed for samples, then opening wide, establishing personalized thresholds

  • Cooldown Protection: 800ms cooldown prevents accidental multiple clicks from sustained mouth opening

  • Real-time Mouth Aspect Ratio Detection: Uses facial landmark detection to measure mouth opening

  • Dwell-Based Interaction: When hovering over links or UI elements:

  • Visual Feedback: A growing ring indicator shows dwell progress

  • Configurable Timing: Default 600ms dwell time, adjustable in settings

  • Magnetic Snapping: Cursor automatically snaps to nearby interactive elements within 45px radius for easier targeting
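The magnetic-snap step could look something like this (an illustrative helper, not the shipped code; the real extension works against live DOM bounding rects):

```javascript
// Sketch: snap the cursor to the center of the nearest interactive element
// when it is within SNAP_RADIUS pixels; otherwise leave the cursor alone.
const SNAP_RADIUS = 45;

function magneticSnap(cursor, targets) {
  let best = null;
  let bestDist = SNAP_RADIUS;
  for (const t of targets) {
    const cx = t.x + t.width / 2;
    const cy = t.y + t.height / 2;
    const d = Math.hypot(cursor.x - cx, cursor.y - cy);
    if (d <= bestDist) {
      bestDist = d;
      best = { x: cx, y: cy, target: t };
    }
  }
  return best ?? { x: cursor.x, y: cursor.y, target: null };
}
```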

  • Effortless Navigation: Dedicated screen zones for browser control:

  • Scroll Zones: Looking at the top/bottom 180px of the screen triggers smooth scrolling with visual feedback gradients

  • Browser Navigation: Left edge (80px) for back navigation, right edge (80px) for forward navigation with 400ms dwell requirement and purple/orange visual indicators
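The zone layout above can be sketched as a simple lookup (band sizes are taken from this write-up; the function name is illustrative):

```javascript
// Sketch: classify a cursor position into a browser-control zone.
// Edge navigation bands take precedence over the scroll bands at corners.
const SCROLL_BAND = 180; // px at the top and bottom of the viewport
const NAV_BAND = 80;     // px at the left and right edges

function zoneAt(x, y, viewportWidth, viewportHeight) {
  if (x < NAV_BAND) return "back";
  if (x > viewportWidth - NAV_BAND) return "forward";
  if (y < SCROLL_BAND) return "scroll-up";
  if (y > viewportHeight - SCROLL_BAND) return "scroll-down";
  return "content";
}
```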

How I built it

I engineered Nutshell as a fully client-side application, ensuring maximum privacy and performance.

  • Multimodal AI Interaction: The core of the project combines two input streams: the user's gaze (via head-tracking) and Chrome's AI. This creates a practical, multimodal AI application. I used the powerful, open-source Human.js library to perform real-time face and head-pose tracking directly in the browser, with custom modifications for head translation tracking and facial landmark detection.

  • On-Device AI with Gemini Nano: To generate summaries, Nutshell uses Chrome's built-in AI APIs:

  • The Summarizer API is used for quick, high-quality article summaries with streaming support for real-time updates

  • The Prompt API provides flexibility with custom prompts. For example, my custom prompt for YouTube video summaries instructs the model to act as an expert analyst, creating a structured summary from the video's transcript and description

  • Intelligent Content Preparation: Getting good results from an AI requires good input. I developed specialized content extraction for different platforms:

  • YouTube Captions: The extension injects a script into YouTube.com that intercepts XHR requests for caption data. It captures both JSON3 (newer) and XML (older) caption formats, parses timestamps and text segments, and makes them available to the content script via a secure messaging API

  • Smart Truncation: For long content, the system intelligently preserves the beginning, middle, and end segments to fit within the AI's context window while maintaining narrative coherence
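The begin/middle/end strategy could be sketched like this (a simplified illustration; the real extraction would prefer sentence boundaries over raw character offsets):

```javascript
// Sketch: keep the beginning, middle, and end of a long text, joined by an
// ellipsis marker, so the result fits within maxChars.
function smartTruncate(text, maxChars) {
  if (text.length <= maxChars) return text;
  const sep = "\n[...]\n";
  const part = Math.floor((maxChars - 2 * sep.length) / 3);
  const midStart = Math.floor(text.length / 2 - part / 2);
  return (
    text.slice(0, part) + sep +
    text.slice(midStart, midStart + part) + sep +
    text.slice(text.length - part)
  );
}
```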

  • Advanced Computer Vision Pipeline:

  • Real-time Processing: Uses requestVideoFrameCallback for optimal frame timing and skips frames when detection is in progress to maintain performance

  • One-Euro Filter Implementation: Applied to both X and Y coordinates independently with configurable parameters (min cutoff: 0.4, beta: 0.0025, derivative cutoff: 1.0)

  • Adaptive Smoothing: Different lerp factors for center (0.06) vs. edge (0.10) movements to balance precision and responsiveness

  • Face Detection: Tracks face presence and score to ensure reliable head tracking before enabling cursor control

Challenges I ran into

1. Achieving Smooth, Non-Jittery Cursor Control:

This was a major hurdle. Raw head pose data from computer vision is inherently noisy. To solve this, I implemented a One-Euro Filter ($1€$ filter), a classic algorithm in human-computer interaction for filtering noisy signals in real-time. The filter's behavior is governed by the following equations, where I tuned the cutoff frequency ($f_c$) and beta ($\beta$) to create a responsive yet stable cursor:

$$f_c = f_{c_{\min}} + \beta\,|\dot{\hat{x}}|$$

$$\tau = \frac{1}{2\pi f_c}$$

$$\alpha = \frac{1}{1 + \frac{\tau}{dt}}$$

The cutoff frequency $f_c$ adapts to the filtered movement velocity $\dot{\hat{x}}$: slow movements get a low cutoff (more smoothing, less jitter), while fast movements raise the cutoff (less smoothing, less lag).
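A compact implementation of the One-Euro filter, following Casiez et al.'s formulation with the parameter values quoted elsewhere in this write-up (the class layout is a sketch, not the extension's exact code):

```javascript
// One-Euro filter: an exponential smoother whose cutoff frequency adapts to
// the (itself filtered) signal velocity. One instance per axis.
class OneEuroFilter {
  constructor(minCutoff = 0.4, beta = 0.0025, dCutoff = 1.0) {
    this.minCutoff = minCutoff; // f_c_min: smoothing at rest
    this.beta = beta;           // velocity sensitivity
    this.dCutoff = dCutoff;     // cutoff for the derivative filter
    this.xPrev = null;
    this.dxPrev = 0;
  }

  static alpha(cutoff, dt) {
    const tau = 1 / (2 * Math.PI * cutoff);
    return 1 / (1 + tau / dt);
  }

  filter(x, dt) {
    if (this.xPrev === null) {
      this.xPrev = x; // first sample passes through unfiltered
      return x;
    }
    // Low-pass the derivative, then let it drive the adaptive cutoff.
    const dx = (x - this.xPrev) / dt;
    const aD = OneEuroFilter.alpha(this.dCutoff, dt);
    const dxHat = aD * dx + (1 - aD) * this.dxPrev;
    const cutoff = this.minCutoff + this.beta * Math.abs(dxHat);
    const a = OneEuroFilter.alpha(cutoff, dt);
    const xHat = a * x + (1 - a) * this.xPrev;
    this.xPrev = xHat;
    this.dxPrev = dxHat;
    return xHat;
  }
}
```

In the extension, one filter runs per cursor axis at the camera's frame interval (`dt` ≈ 1/60 s at 60 fps).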

2. Distinguishing Intentional vs. Involuntary Actions:

For mouth-open clicking, I needed to prevent false positives from natural mouth movements like talking or yawning. The solution involved:

  • Personalized calibration that learns each user's baseline mouth closure and maximum comfortable opening

  • A threshold calculation that sits between these two extremes

  • An 800ms cooldown period to prevent repeated clicks from sustained mouth opening
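Those three pieces combine into a small detector, sketched here with illustrative names (`mar` is the mouth aspect ratio from the facial landmarks; the midpoint ratio and factory shape are assumptions):

```javascript
// Sketch: a calibrated mouth-open click detector. The threshold sits between
// the user's closed baseline and maximum comfortable opening, and a cooldown
// suppresses repeat clicks from a mouth held open.
const COOLDOWN_MS = 800;

function makeClickDetector(closedMar, openMar, ratio = 0.5) {
  const threshold = closedMar + (openMar - closedMar) * ratio;
  let lastClickMs = -Infinity;
  return function detect(mar, nowMs) {
    if (mar >= threshold && nowMs - lastClickMs >= COOLDOWN_MS) {
      lastClickMs = nowMs;
      return true; // fire one left-click
    }
    return false;
  };
}
```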

3. YouTube Caption Extraction:

YouTube doesn't expose captions through a public API. I solved this by:

  • Injecting a script into the page context (not the extension context) to intercept XMLHttpRequest

  • Monitoring all network requests for caption endpoints (timedtext or caption)

  • Parsing both JSON3 and XML caption formats

  • Implementing a secure postMessage bridge to transfer caption data from page context to extension context
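The JSON3 parsing step could look roughly like this (the `events`/`segs`/`utf8` field names match the timedtext payloads observed in practice, but this is a reverse-engineered, undocumented format, so treat the shape as an assumption):

```javascript
// Sketch: flatten a YouTube JSON3 caption payload into timed text segments.
function parseJson3Captions(json) {
  const data = typeof json === "string" ? JSON.parse(json) : json;
  const segments = [];
  for (const ev of data.events ?? []) {
    if (!ev.segs) continue; // some events carry only styling/window info
    const text = ev.segs.map((s) => s.utf8 ?? "").join("").trim();
    if (text) segments.push({ startMs: ev.tStartMs ?? 0, text });
  }
  return segments;
}
```

The resulting `{startMs, text}` list is what gets joined into a transcript and handed to the Prompt API.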

4. Streaming Updates Without Flickering:

When AI summaries stream in character by character, naive implementations cause tooltip flickering and repositioning. I solved this by:

  • Tracking which URL's content is currently displayed in the tooltip

  • Only accepting streaming updates for the exact URL currently being processed

  • Canceling pending hide timeouts when new content arrives

  • Implementing a request token system for YouTube to prevent stale updates
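The URL-and-token guard behind those last three points could be sketched as follows (a minimal illustration; the real tooltip code additionally manages hide timeouts and DOM updates):

```javascript
// Sketch: a guard that only accepts streaming chunks belonging to the most
// recently started request, so stale updates never reach the tooltip.
function makeStreamGuard() {
  let activeUrl = null;
  let token = 0;
  return {
    // Call when a new hover starts streaming; returns this request's token.
    begin(url) {
      activeUrl = url;
      return ++token;
    },
    // Call per streamed chunk; false means the chunk is stale and is dropped.
    accept(url, requestToken) {
      return url === activeUrl && requestToken === token;
    },
  };
}
```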

Accomplishments that I'm proud of

  • A Truly Private Assistive Tool: By running 100% on-device, Nutshell offers life-changing accessibility without ever sending a user's camera feed, browsing data, or content to the cloud. This aligns perfectly with the privacy-first ethos of on-device AI.

  • Pixel-Precise Navigation: The combination of head-tracking, the One-Euro filter, and magnetic snapping is so effective that users can accurately navigate dense websites like Wikipedia, hovering over specific inline links to get summaries without losing their place.

  • Bridging the Web-Native Accessibility Gap: Nutshell brings capabilities traditionally found only in expensive, dedicated operating system software (like eye-gaze systems costing thousands of dollars) directly into the open web, using standard device hardware.

  • Intelligent Content Summarization for Any Link: The custom logic for handling YouTube transcripts demonstrates a deeper, more practical application of AI beyond just summarizing simple articles. The YouTube caption extraction alone required solving several non-trivial engineering challenges.

  • Sophisticated Computer Vision Pipeline: Successfully implementing head pose tracking with translation detection, One-Euro filtering, adaptive smoothing, and calibration creates a production-quality interaction system from open-source components.

  • Streaming AI Integration: Real-time streaming of AI-generated summaries provides immediate feedback to users, with sophisticated state management to prevent race conditions and stale updates.

What I learned

This project proved to me that on-device AI is a paradigm shift for accessibility. Chrome's new APIs empower individual developers to build powerful, privacy-first assistive technologies that previously required massive resources or expensive specialized hardware.

Key technical learnings:

  • Signal Processing Matters: Raw computer vision data requires sophisticated filtering and smoothing to create usable interfaces

  • Context-Specific AI Prompts: Different content types (articles, videos, threads) benefit from specialized prompting strategies

  • Multi-Strategy Robustness: Implementing fallback strategies creates more reliable systems

  • User Calibration is Essential: What works for one user's physiology doesn't work for another—personalized calibration is key to accessibility

  • Privacy by Architecture: Building with on-device AI from the start creates fundamentally more private systems than retrofitting privacy into cloud-based solutions

It also highlighted that the most "helpful" AI applications are often those that integrate seamlessly into a user's workflow, solving practical, real-world problems with elegance and respect for the user.

What's next for Nutshell

Nutshell has a clear path to becoming an even more robust assistive tool. Next steps include:

  • Enhanced Click Alternatives:

  • Eye blink detection for more discreet clicking

  • Customizable dwell times per interaction type

  • Voice command integration for complex actions

  • Advanced Navigation:

  • Customizable gestures mapping specific head movements to actions like "copy," "paste," or "close tab"

  • Smart scrolling with variable speed based on gaze position

  • Tab and window management via head gestures

  • Improved AI Features:

  • Multi-turn conversations with content (ask questions about summarized pages)

  • Smart content highlighting based on AI-identified key points

  • Personalized summary styles based on user preferences

  • Platform Expansion:

  • Support for more specialized websites (LinkedIn, GitHub, documentation sites)

  • Mobile browser support as APIs become available

  • Integration with screen readers for users with visual + motor impairments

  • Distribution:

  • Chrome Web Store publication to reach users who need it most

  • User testing with individuals who have motor impairments

  • Documentation and tutorial videos for onboarding
