Inspiration
Access to elite executive presence and communication coaching has traditionally been a privilege reserved for those who can afford expensive private tutors. Growing up in Nigeria, I witnessed brilliant minds held back simply because they lacked the opportunity to master formal public speaking. My mission for ConnectPro was to democratize high-level coaching—aligned with SDG 4: Quality Education—by putting a world-class "Coach Marcus" in the pocket of every student and professional, regardless of their background.
How ElevenLabs was used
We utilized the full spectrum of the ElevenLabs ecosystem to create an immersive, multi-sensory coaching environment:
- Conversational AI Agent: We embedded a custom-trained ElevenLabs Agent Widget (supporting >30 languages) that acts as "Coach Marcus," capable of holding real-time conversations about public speaking strategies and answering student queries instantly using Gemini 3 Flash Preview.
- Text-to-Speech (TTS) API: We drove the Rachel voice model from our Python backend to deliver high-fidelity, encouraging feedback and "pre-game" motivational talks, giving the application a consistent human personality (see the sketch after this list).
- Sound Effects Generation: We implemented the Sound Effects API to dynamically generate audience reactions (e.g., "enthusiastic applause" vs. "polite clapping") based on the user's real-time performance score, gamifying the learning experience.
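As a rough illustration of the TTS and sound-effects calls above, here is a minimal sketch using the ElevenLabs Python SDK. The voice ID shown is Rachel's public ID, and the 80-point threshold and reaction prompts are illustrative choices, not our exact production values; method names reflect the SDK at the time of writing.

```python
import os

from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])
RACHEL_VOICE_ID = "21m00Tcm4TlvDq8ikWAM"  # Rachel's public voice ID

def coach_feedback(text: str) -> bytes:
    """Synthesize encouraging Coach Marcus feedback in the Rachel voice."""
    chunks = client.text_to_speech.convert(
        voice_id=RACHEL_VOICE_ID,
        model_id="eleven_multilingual_v2",
        text=text,
    )
    return b"".join(chunks)

def audience_reaction(delivery_score: float) -> bytes:
    """Generate an audience sound effect keyed to the live performance score."""
    prompt = (
        "enthusiastic applause from a large conference audience"
        if delivery_score >= 80  # illustrative threshold
        else "polite, scattered clapping in a small meeting room"
    )
    chunks = client.text_to_sound_effects.convert(text=prompt, duration_seconds=4)
    return b"".join(chunks)
```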
ENV files
Our .env file is included in the additional zipped submission named 'env_archDiagram_presentation_elevenlabsProf', together with our architecture diagram, ElevenLabs agent widget proofs, and presentation file.
What it does
ConnectPro is an AI-powered executive suite that transforms how people practice presentation skills through a holistic feedback loop:
- AI Scripting: Uses Gemini 2.0 Flash to craft powerful scripts complete with stage directions (e.g., [Pause], [Make Eye Contact]).
- Live Practice: Students read aloud while the app tracks them using a Hybrid Speech Engine, highlighting words in real-time as they are spoken.
- Executive Metrics: The app analyzes the speech stream to count filler words ("um", "ah") and calculates a live "Delivery Score" (see the sketch after this list).
- Dynamic Audio Environment: Uses ElevenLabs to simulate a real conference room—providing immediate applause for good points or murmurs for poor delivery.
- Multimodal Coaching: After the session, the AI analyzes the audio recording (via Google Cloud Storage) to provide deep insights on tone, pace, and confidence, delivered verbally by the Coach Marcus voice.
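To make the metrics above concrete, here is a minimal sketch of filler-word counting and a live score. The filler list and the penalty weight are illustrative assumptions, not ConnectPro's exact formula.

```python
import re

FILLER_WORDS = {"um", "uh", "ah", "er", "hmm"}

def live_metrics(transcript: str) -> tuple[int, float]:
    """Count filler words and derive a 0-100 "Delivery Score".

    The 5-points-per-1%-filler-rate penalty is an illustrative
    assumption, not the production scoring formula.
    """
    words = re.findall(r"[a-z']+", transcript.lower())
    fillers = sum(1 for word in words if word in FILLER_WORDS)
    filler_rate = fillers / max(len(words), 1)
    score = max(0.0, 100.0 - filler_rate * 500)
    return fillers, score

# Example: live_metrics("Um, so, uh, our revenue grew") -> (2, ...)
```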
How we built it
We engineered a state-of-the-art multimodal architecture on Google Cloud:
- The Brain: Vertex AI (Gemini 2.0 Flash-Exp) for high-EQ script generation and multimodal performance assessment.
- The Voice: ElevenLabs API (TTS & Sound Gen) and Conversational Agents for the auditory interface.
- The Input: A dual-layer system using the Web Speech API for zero-latency UI feedback and Google Cloud Speech-to-Text for high-accuracy transcription (see the transcription sketch after this list).
- The Memory: Google Cloud Storage (GCS) to persist audio sessions for deep analysis.
- Deployment: Containerized Python (Flask) backend running on Google Cloud Run.
- Identity: Google OAuth 2.0 for secure, personalized session access.
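The transcription layer mentioned above, sketched with the google-cloud-speech Python client; the encoding, sample rate, and language are assumptions about our recording setup rather than confirmed settings.

```python
from google.cloud import speech

def transcribe_session(gcs_uri: str) -> str:
    """Transcribe a recorded practice session persisted in GCS.

    long_running_recognize handles sessions longer than a minute.
    """
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,  # assumed PCM/WAV
        sample_rate_hertz=16000,
        language_code="en-US",
        enable_automatic_punctuation=True,
    )
    audio = speech.RecognitionAudio(uri=gcs_uri)
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=120)
    return " ".join(r.alternatives[0].transcript for r in response.results)
```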
Challenges we ran into
The most significant technical hurdle was achieving "Zero-Latency" visual feedback while performing heavy AI processing. We solved this by architecting a hybrid system: the browser handles the immediate "green highlighting" of spoken words, while Google Cloud STT and Gemini 2.0 process the deep executive scoring in the background asynchronously. We also had to manage complex state transitions to ensure the ElevenLabs Audio (Coach voice) didn't overlap with the student's recording or the generated sound effects.
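A simplified version of the guard against overlapping audio might look like the following. The states and allowed transitions are an illustrative reconstruction of the idea, not our exact production logic.

```python
from enum import Enum, auto

class AudioState(Enum):
    IDLE = auto()
    COACH_SPEAKING = auto()     # ElevenLabs coach voice
    STUDENT_RECORDING = auto()  # microphone capture
    SFX_PLAYING = auto()        # generated audience reaction

# Illustrative rule set: the coach voice may only start from IDLE, so it can
# never overlap the student's recording or a sound effect.
ALLOWED = {
    AudioState.IDLE: {AudioState.COACH_SPEAKING, AudioState.STUDENT_RECORDING},
    AudioState.COACH_SPEAKING: {AudioState.IDLE},
    AudioState.STUDENT_RECORDING: {AudioState.SFX_PLAYING, AudioState.IDLE},
    AudioState.SFX_PLAYING: {AudioState.STUDENT_RECORDING, AudioState.IDLE},
}

class AudioController:
    def __init__(self) -> None:
        self.state = AudioState.IDLE

    def request(self, target: AudioState) -> bool:
        """Grant the transition only if legal; callers queue and retry otherwise."""
        if target in ALLOWED[self.state]:
            self.state = target
            return True
        return False
```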
Accomplishments that we're proud of
We successfully created a "Personality-Driven" AI. By combining ElevenLabs’ emotive human-like voices with Gemini’s reasoning, "Coach Marcus" feels like a genuine mentor rather than a robot. We are also proud of our Resilient Voice Architecture; we wrote fallback logic that automatically switches to Google Cloud Text-to-Speech if the primary voice API encounters network issues, ensuring the coaching session is never interrupted.
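A condensed sketch of that fallback logic, assuming the elevenlabs and google-cloud-texttospeech Python clients; the voice choices and error handling are simplified for illustration.

```python
import os

from elevenlabs.client import ElevenLabs
from google.cloud import texttospeech

def speak(text: str) -> bytes:
    """Prefer ElevenLabs; fall back to Google Cloud TTS so coaching never stops."""
    try:
        client = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])
        chunks = client.text_to_speech.convert(
            voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
            model_id="eleven_multilingual_v2",
            text=text,
        )
        return b"".join(chunks)
    except Exception:  # production code should catch specific network/API errors
        gcp = texttospeech.TextToSpeechClient()
        response = gcp.synthesize_speech(
            input=texttospeech.SynthesisInput(text=text),
            voice=texttospeech.VoiceSelectionParams(
                language_code="en-US",
                ssml_gender=texttospeech.SsmlVoiceGender.FEMALE,
            ),
            audio_config=texttospeech.AudioConfig(
                audio_encoding=texttospeech.AudioEncoding.MP3
            ),
        )
        return response.audio_content
```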
What we learned
We discovered that Multimodal AI (Gemini 2.0) is a game-changer for soft skills education. By sending the actual audio files from GCS to the model (rather than just text transcripts), we learned that AI can provide feedback not just on what was said, but how it was delivered—capturing the nuances of "Executive Presence" that standard text-based analysis misses.
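In sketch form, sending the GCS audio directly to Gemini via the Vertex AI SDK looks roughly like this; the project ID, location, and prompt wording are placeholders.

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="your-gcp-project", location="us-central1")  # placeholders
model = GenerativeModel("gemini-2.0-flash-exp")

def assess_delivery(gcs_uri: str) -> str:
    """Send raw session audio (not a transcript) for tone/pace/confidence feedback."""
    audio = Part.from_uri(gcs_uri, mime_type="audio/wav")
    prompt = (
        "You are an executive-presence coach. Assess this recording for tone, "
        "pace, and confidence, then give three specific improvements."
    )
    return model.generate_content([audio, prompt]).text
```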
What's next for ConnectPro
The next step is to implement ElevenLabs Outbound Calling, allowing Coach Marcus to call users for "Surprise" impromptu speaking drills to test their readiness. We also plan to expand the multimodal logic to analyze the student's facial expressions and body language via webcam to detect confidence cues.
Built With
- elevenlabs-api
- google-maps
- google-speech-to-text
- google-vertexai
