Inspiration
The inspiration behind SightSync stemmed from a deep understanding of the challenges visually impaired individuals face in their daily lives. AXA proposed a challenge calling for creative and innovative use of LLMs and image and video models, so we envisioned a solution that leverages cutting-edge open-source models to create a virtual assistant capable of providing valuable assistance to the blind community.
What it does
SightSync is a comprehensive virtual assistant designed for the visually impaired. It harnesses the power of several open-source models to perform crucial tasks seamlessly, assisting app users by delivering natural, accurate verbal descriptions of their surroundings. The key functionalities include:
- Text to Speech (TTS): Utilizes FastPitch to convert text information into clear and natural-sounding speech.
- Speech to Text (STT): Employs Distil-Whisper to transcribe spoken words into text, enhancing communication and interaction.
- Image Captioning: Leverages CogVLM for image captioning, enabling the identification and understanding of visual content.
- Object Detection: Integrates Grounding Dino to recognize objects within images, providing detailed descriptions.
- Language Understanding: Uses Zephyr to extract user intent from queries, categorizing them into general descriptions, item location, and more (a rough sketch of this routing follows at the end of this section).
These models are hosted on-prem, so we avoid third-party APIs entirely and rely only on our own, ensuring data security and user privacy. The backend integrates these models into a unified API that forms the backbone of SightSync.
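To make the intent routing concrete, here is a minimal sketch, assuming a locally hosted Zephyr checkpoint from Hugging Face, of how a transcribed query could be mapped to one of the supported intents. The model ID, intent labels, and prompt wording are illustrative assumptions rather than our exact production code.

```python
# Minimal sketch: classify a user query into an intent with Zephyr.
# Model ID, label set, and prompt are illustrative assumptions.
from transformers import pipeline

INTENTS = ["general_description", "item_location", "read_text"]  # assumed labels

generator = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",  # assumed Zephyr checkpoint
    device_map="auto",
)

def extract_intent(query: str) -> str:
    """Ask the LLM to pick exactly one intent label for the transcribed query."""
    messages = [
        {"role": "system",
         "content": "Classify the user's request into exactly one of: "
                    + ", ".join(INTENTS) + ". Reply with the label only."},
        {"role": "user", "content": query},
    ]
    # Format the messages with Zephyr's chat template before generation.
    prompt = generator.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    out = generator(prompt, max_new_tokens=10, do_sample=False,
                    return_full_text=False)[0]["generated_text"]
    reply = out.strip().lower()
    # Fall back to a general description if the reply is not a known label.
    return reply if reply in INTENTS else "general_description"

print(extract_intent("Where did I leave my keys?"))  # expected: item_location
```

The predicted label then decides which downstream model (captioning, object detection, etc.) handles the request.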
How we built it
Building SightSync involved a meticulous process of model selection and API development. We chose models based on their performance, with a focus on achieving a balance between speed and accuracy. The choice of Distil-Whisper, FastPitch, CogVLM, Grounding Dino, and Zephyr was made strategically to address specific needs while ensuring efficient processing.
The API endpoints were crafted around each model, with special attention given to CogVLM, considering its weight in the project. MLOps principles were crucial in developing a robust infrastructure that could handle the diverse set of tasks performed by SightSync.
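As a flavour of what "one endpoint per model" looks like, below is a minimal sketch of a speech-to-text route wrapping Distil-Whisper behind FastAPI. The route path, model ID, and response shape are assumptions for illustration; the real service exposes a similar route for each of the other models.

```python
# Minimal sketch of a speech-to-text endpoint wrapping Distil-Whisper.
# Route name, model ID, and response format are illustrative assumptions.
from fastapi import FastAPI, UploadFile
from transformers import pipeline

app = FastAPI(title="SightSync API (sketch)")

# Load the ASR model once at startup so every request reuses the same weights.
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",  # assumed Distil-Whisper checkpoint
)

@app.post("/stt")
async def speech_to_text(audio: UploadFile):
    """Transcribe an uploaded audio clip and return the text."""
    audio_bytes = await audio.read()
    result = asr(audio_bytes)  # raw bytes are decoded with ffmpeg under the hood
    return {"text": result["text"]}
```

Loading the model at startup rather than per request keeps latency down, which matters for the speed/accuracy balance mentioned above.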
Challenges we ran into
The primary challenges revolved around the integration of CogVLM, given its novelty and sparse documentation. Creating API endpoints that effectively communicated with each model required meticulous attention to detail. Balancing the need for fast processing while maintaining acceptable accuracy posed a unique set of challenges that we successfully navigated.
Accomplishments that we're proud of
Overcoming the challenges of model integration and creating a seamless API that unifies diverse models are significant accomplishments for the SightSync team. The commitment to user impact and the incorporation of state-of-the-art models showcase our dedication to creating a valuable tool for the visually impaired community.
What we learned
SightSync provided us with invaluable insights into the field of MLOps. Navigating the integration of novel models, particularly with limited documentation, enhanced our problem-solving skills. We gained a deep understanding of optimizing model selection for specific use cases, considering both efficiency and accuracy.
What's next for SightSync
The future of SightSync is promising. The extensibility of the project is evident in the capabilities of Zephyr, which can be further harnessed to accommodate new user intents. We envision expanding the virtual assistant's functionality to cater to additional task-specific models. Continuous improvement and the integration of emerging technologies will remain at the forefront, ensuring that SightSync continues to make a meaningful impact on the daily lives of those it aims to assist.
Built With
- cogvlm
- fastpitch
- groundingdino
- imagecaptioning
- item-location
- llm
- objectlocation
- stt
- tts
- whisper
- zephyr