Every great DJ set tells a story. You start in a smoky jazz lounge, the tempo rises, and suddenly you're racing down a neon-lit highway at midnight. That emotional arc—the narrative of a mix—is something experienced DJs craft intuitively, but it has never been something you could simply describe in words and let a machine execute.
We asked ourselves: what if you could type a cinematic vibe prompt like "From a rainy jazz club to a high-speed neon chase" and have an AI build the transition for you?
Traditional DJ software lets you sort by BPM and key, but it knows nothing about mood. Streaming algorithms recommend songs you already like, but they cannot sculpt a journey across emotional landscapes. We wanted to bridge that gap—combining the semantic understanding of large language models with the mathematical rigor of vector similarity search over real audio features—to create a tool that mixes music the way a filmmaker scores a scene.
What We Learned
Audio feature extraction is surprisingly deep. Using librosa to compute MFCCs (Mel-Frequency Cepstral Coefficients), chroma features, and tempo estimation taught us how much musical information can be compressed into a 512-dimensional vector. Choosing the right number of MFCC coefficients and the right segment duration (we settled on 30 seconds) directly impacts how vibe-aware the embeddings are.
Vector databases make similarity search trivial—once your embeddings are good. Actian VectorAI gave us sub-second cosine similarity queries across our entire track library. The real challenge was not the database; it was ensuring the vectors we fed it actually captured perceptual similarity rather than just spectral fingerprints.
Prompt deconstruction is an under-explored UX pattern. Wrapping Sphinx AI as a CLI to decompose a free-text vibe prompt into structured attributes (BPM range, musical key trajectory, mood vectors) showed us that LLM reasoning can serve as semantic middleware between human intent and numerical queries.
Crossfading in the browser is non-trivial. Managing two HTMLAudioElement instances, smoothly interpolating volumes over a configurable duration, then swapping deck references without audio glitches required careful state management in React using useRef and useCallback.
Real-time telemetry transforms a demo into an experience. Our "Brain Panel"—a cyberpunk-styled live log feed built with Framer Motion—lets the audience see the AI think in real time. Showing Sphinx reasoning steps and Actian distance scores side-by-side turned a backend process into a visual spectacle.
How We Built It
AuraMix is a full-stack application with four cooperating layers:
1. Audio Ingestion Pipeline (backend/ingest.py)
- Scan a user-specified directory recursively for .mp3 files.
- For each track, load the first 30 seconds with librosa at 22,050 Hz.
- Extract metadata:
- BPM via librosa.beat.beat_track
- Musical key via a chroma-CQT argmax heuristic
- Genre via a BPM-bracket heuristic (Ballad under 70, R&B under 100, Rock/Pop under 120, Dance under 140, EDM 140 or higher)
- Compute 20 MFCCs, flatten, and pad or truncate to a 512-dimensional embedding vector.
- Batch upsert all vectors and payloads into an Actian VectorAI collection called track_embeddings using the CortexClient Python SDK over gRPC with cosine distance.
- Optionally write a local fallback_db.json for offline or demo use.
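The flatten-and-pad embedding step and the BPM-bracket genre heuristic from the pipeline above can be sketched in a few lines. The function names and the synthetic MFCC matrix are illustrative stand-ins; in the real pipeline the matrix comes from librosa.feature.mfcc over the 30-second clip.

```python
import numpy as np

EMBED_DIM = 512  # target vector size for the Actian VectorAI collection

def mfcc_to_embedding(mfcc: np.ndarray, dim: int = EMBED_DIM) -> np.ndarray:
    """Flatten an (n_mfcc, frames) matrix, then pad or truncate to a fixed length."""
    flat = mfcc.flatten().astype(np.float32)
    if flat.size >= dim:
        return flat[:dim]
    return np.pad(flat, (0, dim - flat.size))

def genre_from_bpm(bpm: float) -> str:
    """BPM-bracket heuristic described above."""
    if bpm < 70:
        return "Ballad"
    if bpm < 100:
        return "R&B"
    if bpm < 120:
        return "Rock/Pop"
    if bpm < 140:
        return "Dance"
    return "EDM"

# Stand-in for librosa.feature.mfcc output: 20 coefficients x 40 frames.
fake_mfcc = np.random.default_rng(0).normal(size=(20, 40))
vec = mfcc_to_embedding(fake_mfcc)
print(vec.shape, genre_from_bpm(128))  # → (512,) Dance
```

Note the trade-off encoded here: 20 coefficients over 40 frames yields 800 values, so the vector is truncated; shorter clips would instead be zero-padded, which is where the silent-padding noise mentioned later comes from.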
2. Intelligence Backend (backend/main.py — FastAPI)
- POST /api/transition — the core endpoint:
- Calls Sphinx AI via subprocess to deconstruct the vibe prompt into BPM, key, and three mood vectors.
- Generates a deterministic 512-dimensional query vector seeded by a hash of the prompt and vibe weight.
- Performs a top-k cosine search (k = 3) against Actian VectorAI; falls back to a local Euclidean-distance scan over fallback_db.json if the database is unreachable.
- Logs the full reasoning chain and results to Neon DB (PostgreSQL) for history and auditability.
- POST /api/ingest — triggers background ingestion with real-time progress polling via GET /api/ingest/status.
- GET /api/history — retrieves the last 50 mix events from Neon DB.
- GET /api/audio?path=... — streams .mp3 files to the browser via FileResponse.
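A minimal sketch of the deterministic query vector and the offline Euclidean fallback used by the transition endpoint; the helper names and the exact seeding scheme are assumptions for illustration, not the production code.

```python
import hashlib
import numpy as np

def query_vector(prompt: str, vibe_weight: float, dim: int = 512) -> np.ndarray:
    """Deterministic 512-d vector: same prompt + weight always yields the same query."""
    digest = hashlib.sha256(f"{prompt}|{vibe_weight}".encode()).digest()
    seed = int.from_bytes(digest[:4], "big")
    rng = np.random.default_rng(seed)
    return rng.normal(size=dim).astype(np.float32)

def fallback_search(query: np.ndarray, tracks: dict, k: int = 3):
    """Manual Euclidean scan over local vectors when Actian is unreachable."""
    dists = [(tid, float(np.linalg.norm(query - vec))) for tid, vec in tracks.items()]
    return sorted(dists, key=lambda pair: pair[1])[:k]
```

Seeding from a hash keeps the demo reproducible: re-running the same vibe prompt returns the same neighbors, which matters when showing distance scores live in the Brain Panel.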
3. Frontend (Next.js + Tailwind + Framer Motion)
- A dual-deck DJ interface with animated spinning vinyl, per-deck play and pause, real-time volume bars, and an equalizer visualizer.
- A vibe prompt bar with a Dark to Energy slider (0–100) and an Execute button.
- A smooth crossfader: 40-step linear interpolation over 4 seconds, swapping audio element references at the end so the new track becomes Deck A.
- A Music Folder input with a Link and Ingest button and a live progress bar.
- A Neon DB Mix History panel showing past prompts and matched tracks.
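The 40-step crossfade above amounts to a simple linear volume schedule. It is sketched here in Python for clarity; the actual implementation lives in React, driving two HTMLAudioElement volumes from a timer.

```python
def crossfade_schedule(steps: int = 40, duration_s: float = 4.0):
    """Per-step targets: (incoming deck volume, outgoing deck volume, seconds to wait)."""
    dt = duration_s / steps  # 0.1 s between volume updates
    return [(i / steps, 1 - i / steps, dt) for i in range(1, steps + 1)]

sched = crossfade_schedule()
# The final step leaves the incoming track at full volume and the old one
# silent, after which the deck references are swapped so it becomes Deck A.
```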
4. The Brain Panel (BrainPanel.tsx)
A cyberpunk-themed, auto-scrolling telemetry feed. Each log entry is color-coded by source:
- Cyan — Sphinx AI reasoning steps
- Emerald — Actian VectorAI distance results
- Fuchsia — System events (ingestion, errors)
Entries animate in with Framer Motion springs and include expandable JSON metadata.
Infrastructure
- Actian VectorAI runs as a Docker container (williamimoh/actian-vectorai-db:1.0b) exposing gRPC on port 50051.
- Neon DB provides serverless PostgreSQL for the mix_history table.
- Sphinx AI is invoked as a local CLI binary wrapped with Python's subprocess module.
Challenges We Faced
Bridging text semantics and audio embeddings. Our vibe prompts are natural language, but our track embeddings are raw MFCC features. Ideally we would use a cross-modal model to project text and audio into a shared latent space. Under hackathon time constraints we used a deterministic hash-seeded random vector as a proxy, which still produces consistent, prompt-sensitive results—but closing this gap is our top priority for future work.
Actian VectorAI beta SDK quirks. The CortexClient beta does not return full payloads in search results, requiring a second client.get() call per result. We also encountered duplicate ID collisions when re-ingesting, which we solved by dropping and recreating the collection on every ingest and using sequential integer IDs.
Browser autoplay restrictions. Modern browsers block audio.play() unless it originates from a user gesture. We had to restructure our crossfade logic so that the initial play always chains from a click event, and subsequent automated fades ride on that same user-activation context.
State management during crossfades. Swapping two HTMLAudioElement references mid-fade while simultaneously updating React state for volumes, play and pause icons, and deck labels required precise coordination between useRef, useState, and setInterval—a single missed cleanup could leave ghost audio playing.
Embedding dimensionality and quality trade-offs. We experimented with different numbers of MFCCs (13 vs 20) and different flatten and pad strategies. Too few coefficients lost timbral nuance; too many introduced noise from silent padding. We settled on 20 MFCCs truncated or padded to 512 dimensions as the best balance for our track library size.
Fallback resilience under demo conditions. Hackathon Wi-Fi is unpredictable. We built a full offline fallback path: if Actian is unreachable, the backend computes manual Euclidean distances over a local fallback_db.json, ensuring the demo never hard-fails regardless of network conditions.
