Inspiration
As the use of large language models (LLMs) continues to grow across industries, B2B companies face significant challenges in deploying these powerful tools while ensuring compliance with data privacy regulations. Traditional data anonymization methods often strip away essential context, reducing the effectiveness of LLM-based products and services. This is particularly problematic in sectors with heightened privacy concerns, where building customer trust is paramount. Fear of violating regulations and potential data misuse can slow down innovation and limit the potential of LLMs. DataMask aims to address these challenges by providing a context-aware data anonymization layer for LLMs.
What It Does
DataMask is an API privacy layer that utilizes locally hosted language models to identify and replace sensitive personal information within datasets. By assessing the context and importance of the redacted data, the system generates contextually appropriate alternatives that maintain the semantic integrity and analytical value of the original text. This innovative approach ensures data privacy while preserving the usefulness of the dataset for analysis and insights.
How We Built It
The main goal of DataMask is to anonymize sensitive data in a way that retains its usefulness for analysis, without compromising individual privacy. Here's how we built it:
Locally Hosted Language Models: We use language models hosted on a local server to scan and process datasets. This ensures enhanced security as the data does not need to be sent to external servers. The initial task of these models is to accurately identify PII within the dataset.
Context Preservation: After identifying the PII, the system assesses the importance of the redacted information within its context. The language model determines why specific information is crucial and how it contributes to the overall meaning of the text. This step is critical to ensure that replacing the PII does not alter the fundamental insights or value derived from the data.
Synonym Generation: We use Gemma, an open-weight model from Google built on the same research and technology as its Gemini models, to generate contextually appropriate alternatives for the identified PII. In our setup, Gemma produces non-identifiable alternatives that maintain the semantic integrity of the original text.
Data Integration: The newly generated, anonymized terms or phrases replace the original PII in the dataset. This step requires careful integration to maintain the logical flow and readability of the data, ensuring that the dataset remains useful for analysis.
Analysis of Anonymized Data: The final dataset, now devoid of PII but still contextually intact, can be safely analyzed using various tools without risking privacy breaches. This allows researchers and analysts to work with the data more freely, without concerns over privacy violations.
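The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the DataMask implementation: the regex patterns stand in for the locally hosted detection model, and `placeholder_surrogate` stands in for the Gemma call that would generate a context-aware replacement. All names here (`PII_PATTERNS`, `detect_pii`, `placeholder_surrogate`, `anonymize`) are hypothetical.

```python
import re
from typing import Callable, Dict, List, Tuple

# Assumption: simple regexes stand in for the local PII-detection model.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def detect_pii(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) spans for every pattern match, sorted by position."""
    spans = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def placeholder_surrogate(label: str, index: int, context: str) -> str:
    """Hypothetical stand-in for the LLM call that would use `context`
    to produce a realistic, non-identifying replacement."""
    return f"[{label}_{index}]"


def anonymize(
    text: str,
    generate: Callable[[str, int, str], str] = placeholder_surrogate,
) -> Tuple[str, Dict[str, str]]:
    """Replace each detected PII value with a surrogate.

    The same original value always maps to the same surrogate, so
    references to one entity stay consistent across the text.
    """
    mapping: Dict[str, str] = {}
    counts: Dict[str, int] = {}
    pieces: List[str] = []
    cursor = 0
    for start, end, label in detect_pii(text):
        if start < cursor:
            continue  # skip spans that overlap one already replaced
        original = text[start:end]
        if original not in mapping:
            counts[label] = counts.get(label, 0) + 1
            mapping[original] = generate(label, counts[label], text)
        pieces.append(text[cursor:start])
        pieces.append(mapping[original])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


sample = "Email jane.doe@example.com or call 555-123-4567. Again: jane.doe@example.com"
masked, mapping = anonymize(sample)
print(masked)
```

Keeping the original-to-surrogate mapping on the local machine is one way to preserve consistency across a document while ensuring the raw values never leave the server.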
Tech Stack
- Gemma 7B (open-weight model from Google, built on the same research as Gemini)
- OpenLLM (to expose the local LLM as an API)
- Gemini 1.5 Pro (for batch function calling)
- Presidio API (for traditional PII masking)
- Anvil (for the demo UI, built on top of Google Colab)
- BentoML (for Gemma deployment)
Challenges We Ran Into
- Determining who our target customer would be (B2B or B2C)
- Identifying important contextual elements without letting the private data reach Gemini
- Getting the LLM to return structured data from a natural-language prompt
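For the structured-output challenge, one common approach is to ask the model for JSON and then defensively extract and validate it, since models often wrap the payload in extra prose. The sketch below assumes a hypothetical schema in which each item carries a `text` and a `label` key; the function name and schema are illustrative, not taken from the project.

```python
import json
import re
from typing import Any, Dict, List

# Assumption: the prompt asks for a JSON array of {"text": ..., "label": ...} objects.
REQUIRED_KEYS = {"text", "label"}


def extract_json_array(raw: str) -> List[Dict[str, Any]]:
    """Pull the first-to-last bracketed span out of a model reply and validate it.

    Raises ValueError if no array is present or the JSON is malformed
    (json.JSONDecodeError is a subclass of ValueError).
    """
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON array found in model output")
    items = json.loads(match.group(0))
    valid = []
    for item in items:
        # Keep only well-formed entries; silently drop anything else.
        if isinstance(item, dict) and REQUIRED_KEYS.issubset(item):
            valid.append({"text": str(item["text"]), "label": str(item["label"])})
    return valid


reply = (
    "Sure, here you go:\n"
    '[{"text": "Jane Doe", "label": "PERSON"}, {"note": "missing keys"}]\n'
    "Hope that helps."
)
print(extract_json_array(reply))
```

In practice a failed parse would typically trigger a retry with a stricter prompt rather than an immediate error.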
Accomplishments That We're Proud Of
We integrated a lot of new tools to implement a complex model framework that addresses a major issue affecting companies building on top of LLMs. We built a product with real market value, providing a full application solution to a growing but unmet need.
What We Learned
We learned how to integrate many of the new and developing AI tools to create a production-ready application.
What's Next for DataMask
- Add customization features designed for different niches beyond therapy, such as law, medicine, and research
- Convert to Django API and deploy to the cloud to bring it into full production
DataMask demonstrates the potential for context-aware data anonymization in enabling the safe and effective use of LLMs across industries. By preserving privacy while maintaining data utility, DataMask empowers B2B companies to innovate with confidence.
Built With
- anvil
- bentoml
- gemini
- gemma
- openllm
- presidio