Multimodal AI researcher obsessed with how machines perceive, remember, and generate the world. Based in Mountain View, CA.
PhD from UMD, focused on diffusion model memorization: built memorization evals for diffusion models and CSD, a widely used style-similarity metric. Also built evals for video understanding: CinePile (a long-video QA benchmark, Best Paper at CVPR 2024 SynCV) and ARGUS (a hallucination/omission eval for dense captions). (Friends call me the "Evals Shill" for a reason.)
Before academia: did SGD in industry for a while in India. IIT Madras alum; founded a Fashion AI startup that was way too early to the party.
Open to collabs on generative modeling (evals + post-training). Hit me up: gowthami [dot] somepalli [at] gmail.com
// featured writing
Latent Scaffolding: Z-Image Is Secretly an I2I Model
A simple architectural splice unlocks zero-shot image-to-image variations with no training.
Latent Scaffolding, Part 2: Token Dropout for Diverse Image Variations
Vision-only token dropout solves mode collapse: hunting attention sinks and finding two orthogonal knobs for diversity.