DAVE | Devpost

Inspiration

According to the World Health Organization, 2.2 billion people around the world live with a near or distance vision impairment. That is over a quarter of the planet's population, and vision impairment profoundly affects how people learn, walk, and work. Our team knows its impact personally - three of our four members need contacts or glasses in daily life. Born out of our shared interest in benefitting society, we developed DAVE - Digital Assistant for Vision Enhancement - to ease the burden of living with a vision impairment.

What it does

- DAVE mounts a camera on a pair of glasses, above the user's left eye, to take pictures of the user's surroundings
- The user can summon the assistant by saying "DAVE"
  - The user can ask DAVE to describe the image of the world around them
  - Alternatively, the user can ask DAVE questions about their surroundings
- The camera takes a picture from its perspective
- The user's speech is transcribed to text and combined with the image to prompt the model
- The model's response is spoken back to the user
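The summon-and-ask steps above can be sketched in a few lines. This is only an illustration, not the code we shipped: it assumes the transcript arrives as a plain string, and the names `extract_request`, `build_prompt`, and the wording of both prompts are hypothetical.

```python
import re

WAKE_WORD = "dave"
# Hypothetical default prompt for when the user only asks for a description.
DESCRIBE_PROMPT = "Describe the scene in this image for a person who cannot see it."


def extract_request(transcript, wake_word=WAKE_WORD):
    """Return the text spoken after the wake word, or None if it was not said."""
    pattern = rf"\b{re.escape(wake_word)}\b[\s,.:!?]*"
    match = re.search(pattern, transcript, flags=re.IGNORECASE)
    if match is None:
        return None
    return transcript[match.end():].strip()


def build_prompt(request):
    """Pick a general description or pass the user's question through."""
    if not request or re.search(r"\bdescribe\b", request, flags=re.IGNORECASE):
        return DESCRIBE_PROMPT
    # Hypothetical wording; the real prompt would need tuning for the model.
    return f"Answer briefly, using the image: {request}"
```

Matching on word boundaries keeps words that merely contain the wake word from summoning the assistant by accident.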

How we built it

We started with Python, a language we are all familiar with, and used it for both the frontend and backend, which kept the integration between the two sides consistent. To respond to both images and text, we used a multimodal model, specifically LLaVA, and for automatic speech recognition (translating the user's speech into text) we used OpenAI's Whisper model. We did consider the recently released Distil-Whisper model, as it promised to be six times faster and 49% smaller than Whisper, but since we could access Whisper through OpenAI's API and use even fewer computational resources of our own, we proceeded with OpenAI's option. Google Text-to-Speech (gTTS) verbalized the responses. We read images from the webcam through OpenCV and took input from its microphone through the SpeechRecognition package. Accessing LLaVA required an innovative web-scraping approach enabled by Selenium. We created the user interface with Tkinter.
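As a rough sketch of how these pieces fit together (not our actual code), one turn of the assistant could be wired up as below. The function names, file paths, and the idea of passing each stage in as a callable are assumptions made for illustration; the OpenCV and gTTS calls are imported lazily so the wiring can be read and run without either package installed.

```python
def capture_frame(path="frame.jpg", camera_index=0):
    """Grab one frame from the webcam with OpenCV and save it as an image."""
    import cv2  # lazy import: only needed when a real camera is used

    camera = cv2.VideoCapture(camera_index)
    try:
        ok, frame = camera.read()
    finally:
        camera.release()
    if not ok:
        raise RuntimeError("could not read a frame from the camera")
    cv2.imwrite(path, frame)
    return path


def synthesize_speech(text, path="reply.mp3"):
    """Turn the model's reply into an MP3 file with gTTS."""
    from gtts import gTTS  # lazy import, same reason as above

    gTTS(text=text, lang="en").save(path)
    return path


def run_turn(capture, transcribe, ask_model, speak):
    """One request/response cycle; every stage is a caller-supplied callable."""
    image_path = capture()                    # e.g. capture_frame
    question = transcribe()                   # e.g. microphone audio sent to Whisper
    answer = ask_model(image_path, question)  # e.g. the LLaVA web page
    speak(answer)                             # e.g. synthesize_speech plus playback
    return answer
```

Keeping each stage behind a plain callable also makes it possible to swap in stand-ins while testing, for example a fixed image path and a canned question.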

Seeing as we had an even split between frontend and backend people, we worked in pairs on the interface and the models through the night and offered assistance to each other.

Challenges we ran into

When we started the project, our biggest challenge was choosing a large language model. We wanted a multimodal model, one capable of accepting both images and text as input. Our first thought was to use OpenAI's APIs, as GPT-4V accepts images and demonstrates strong performance. However, when we scoured the forums to learn whether GPT-4V was accessible through the API, we found it was not supported. This led us to pivot to an open-source alternative, LLaVA, which demonstrated strong efficacy as well. Problems ensued when we tried to set it up through Hugging Face, resulting in KeyErrors that we could not resolve despite following multiple issues on the model's GitHub page. This led us to a third option - Bard - which seemed promising, as Google provided a model capable of accepting images. However, Bard needed to be prompted in a specific way to use an image's context, and it had no API; when we tried to use cookies with a repository acting as an unofficial API, we walked away with nothing but 429 errors. After these crushing defeats, we realized we could circumvent all of these problems by sending the prompts directly to LLaVA's Gradio-built webpage and web-scraping the responses, which would act in place of an API.
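One detail of scraping a page in place of an API is knowing when the answer has finished appearing. The sketch below shows one possible way to do that and is not necessarily what we ran: it assumes the response text can be read repeatedly through a caller-supplied `read_text` function (with Selenium, this could wrap an element's `.text` property) and treats the response as complete once the text stops changing. The function name and the default values are hypothetical.

```python
import time


def wait_for_stable_text(read_text, stable_polls=3, poll_interval=0.5,
                         timeout=60.0, sleep=time.sleep):
    """Poll read_text() until it returns the same non-empty text
    stable_polls times in a row, then return that text."""
    deadline = time.monotonic() + timeout
    last, repeats = "", 0
    while time.monotonic() < deadline:
        current = read_text()
        if current and current == last:
            repeats += 1
            if repeats >= stable_polls:
                return current
        else:
            # Text changed (or is still empty): restart the count.
            last, repeats = current, (1 if current else 0)
        sleep(poll_interval)
    raise TimeoutError("response text did not settle before the timeout")
```

The timeout matters because a page that never responds would otherwise leave the assistant silent indefinitely.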

After we established communications with LLaVA, our woes continued: to speed up the conversation, we needed to parse the response while it was still being generated. This was critical, as waiting for the response to finish before outputting it as speech added several seconds of lag. In our initial attempt, the text-to-speech voice would pause awkwardly in certain sections, skip ahead, or backtrack to earlier parts of the response. We first needed to line up these pauses in a way that made sense, specifically at the end of each sentence. As the dialogue mismatches persisted, we carefully reviewed how our code iterated through the response and handled responses separately depending on whether DAVE was describing surroundings or answering a question.
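The sentence-by-sentence idea can be illustrated with a small helper. This is a minimal sketch rather than our exact implementation: it assumes each read of the page returns the whole response so far and that the text only ever grows, and the class name `SentenceStreamer` is hypothetical.

```python
import re

# A sentence ends at ., ! or ? followed by whitespace. Requiring the
# whitespace avoids cutting at a period that may still be mid-token.
SENTENCE_END = re.compile(r"[.!?]+(?=\s)")


class SentenceStreamer:
    """Turn growing snapshots of a response into sentences, each emitted once."""

    def __init__(self):
        self.consumed = 0  # characters already handed to text-to-speech

    def feed(self, snapshot):
        """Return the sentences completed since the previous call."""
        sentences = []
        for match in SENTENCE_END.finditer(snapshot, self.consumed):
            sentence = snapshot[self.consumed:match.end()].strip()
            if sentence:
                sentences.append(sentence)
            self.consumed = match.end()
        return sentences

    def flush(self, snapshot):
        """Return whatever is left once the response has finished."""
        rest = snapshot[self.consumed:].strip()
        self.consumed = len(snapshot)
        return [rest] if rest else []
```

Tracking how many characters have already been spoken is what prevents the skipping and backtracking described above: each sentence is handed to text-to-speech exactly once, and the final sentence is only spoken when `flush` is called at the end of the response.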

Originally, we intended to use a React frontend. However, since we were unable to get the bytes from the React frontend to the Python backend, we decided to switch to a Python frontend made possible by Tkinter.

Whisper's inaccuracies introduced further instability. In louder settings, such as the main HackRPI hall, the model would pick up conversations from other groups, which hurt our testing. We needed to move to a quieter area, specifically an unused sleep room, to ensure the speech recognition ran as intended, and as a result we noticed clear improvements in accuracy.

Accomplishments that we're proud of

We are proud of how we were able to combine the diverse technologies we had available into an agent capable of improving the welfare of the visually impaired. Despite the challenges, it was especially rewarding to combine programs in a way that let them communicate with each other.

One thing we were thrilled by was the web-scraper we built to enable the use of LLaVA despite the challenges we faced in setting it up with Hugging Face. It was a stroke of genius by a teammate, who figured out a clever approach to getting what we needed, effectively removing any need to set up a model locally or depend on an API. We were also pleasantly surprised by how well LLaVA responded to prompts, whether with detailed descriptions or to-the-point answers to questions.

What we learned

It was our first time developing a workflow that incorporated speech, text, and images. Prior to this, some of us had experience in one or two of these fields, and combining all three posed a unique opportunity. We capitalized on this, and in doing so, we each learned more about the types of technologies available for us to use, be they APIs, models, or packages.

Individually, we each improved our team-building and communication skills, as we were determined to strike a balance between members' roles. This allowed us to collaborate more efficiently and produce a minimum viable product within the time limit. When we met, we had an ambitious idea to create an assistant like those found in fiction, and in diving into this project, we gained a deep understanding of the innovative mindset involved in the process. With how we split the work, each of us understood how our components fit together to create the end result.

What's next for DAVE

DAVE holds immense potential for the visually impaired. Streamlining the design will allow for a less awkward package that people will not think twice about wearing. This could be combined with a pair of earbuds with noise cancellation and/or a transparency mode, to allow for fewer disturbances while the assistant converses with the user. Improvements would not be limited to aesthetics, as we could include more technologies, such as a camera on an arm that performs gaze estimation to aid in object recognition. Alternatively, we could develop a mobile application that offers this powerful assistant with the amenities and convenience of a smartphone, including front and rear cameras.

We believe DAVE can go beyond helping blind people - it can help everyone. A portable AI assistant with a human-level view of the world can help people identify hazardous conditions like flooding or wildfires, look up products online, and recognize ingredients that go against dietary restrictions. Vision for information and speech for control are two domains with immense potential. Vision allows rich scenes to be explored and analyzed for patterns. Speech provides the ultimate convenience for users, as they merely need to vocalize their commands. Combining both in one package allows for the unparalleled convenience of hands-free use, and adding eye-tracking to estimate where the user is looking can take this further by providing information about the objects that catch the user's focus.

The potential of this project expands beyond a subset of the global populace - it can impact everyone. A project like this changes the world.

Built With

llava
opencv
python
selenium
speechrecognition
text-to-speech
tkinter
webcam
webscraping
whisper

Submitted to

HackRPI X
- Winner #1 Best Hack

Created by

I primarily focused on developing machine learning applications for the projects and managing the distribution of teamwork.

Aryash
I worked on the backend and machine learning components. I researched and recommended multimodal models like LLaVA, engineered prompts, implemented ASR capabilities with Whisper and SpeechRecognition, and investigated text-to-speech capabilities with gTTS.

Zachary Fernandes
Recent M.S. Computer Science graduate from RPI excited about AI, ML, Data, and their many applications - open to work and collaborations
I worked on the front end

Drew Bhavsar
Versatile developer with abilities ranging from JavaScript, Java, C++, to COBOL and JCL
Shamar Samuels