Inspiration
Our team noticed that Large Language Models (LLMs) struggle with domain-specific questions. The catch is that making an LLM accurate in a specific domain requires feeding it a large amount of fine-tuning data. Gathering that much data for bigger models not only costs significant manpower but can also be environmentally detrimental, since training is an energy-intensive process with a substantial carbon footprint. We therefore propose Synthetic Tuning, a new fine-tuning method that vastly improves the efficiency of adapting LLMs to domain-specific questions.
What it does
Synthetic Tuning fine-tunes a larger LLM to take a small sample dataset and generate synthetic data, which is then used to fine-tune a smaller LLM. The larger LLM, called the synthetic model (SM), is trained only once, using the Together.ai API. The SM is given a use case and a sample dataset, and it generates a large amount of synthetic data from them. The use case is reused later as the purpose of the variable model, and the sample dataset must be large enough for the SM to learn the associations needed to generate synthetic data. The synthetic data from the SM is then used to fine-tune an untuned smaller LLM for the use case specified during synthetic data generation. This smaller model, called the variable model (VM), is fine-tuned for every new use case.
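The sketch below illustrates roughly how the generation step could look with the Together Python SDK's chat-completions endpoint. The model identifier, use case, seed examples, prompt format, and output file name are illustrative assumptions, not the exact artifacts from our project.

```python
# Minimal sketch of the synthetic-tuning generation step, assuming the
# Together Python SDK. All names below (model id, use case, seed data,
# file names) are hypothetical placeholders.
import json
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

USE_CASE = "answering patient-facing questions about medication dosages"  # assumed use case
SAMPLE_DATA = [  # small seed dataset the synthetic model (SM) learns associations from
    {"question": "What is the usual adult dose of ibuprofen?",
     "answer": "200-400 mg every 4-6 hours; do not exceed 1200 mg/day without medical advice."},
]
SYNTHETIC_MODEL = "your-org/llama-2-7b-synthetic-model"  # hypothetical fine-tuned SM on Together

def generate_synthetic_examples(n_examples: int = 100) -> list[dict]:
    """Ask the SM to expand the seed data into new question/answer pairs."""
    prompt = (
        f"Use case: {USE_CASE}\n"
        f"Seed examples: {json.dumps(SAMPLE_DATA)}\n"
        f"Generate {n_examples} new question/answer pairs, one JSON object per line."
    )
    response = client.chat.completions.create(
        model=SYNTHETIC_MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# The generated pairs become the fine-tuning file for the variable model (VM).
with open("variable_model_training.jsonl", "w") as f:
    for example in generate_synthetic_examples():
        f.write(json.dumps(example) + "\n")
```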
Challenges we ran into
One of the main challenges was diversifying the data fed into the synthetic model so that it would be applicable to a wide range of use cases. We addressed this by pulling in pre-existing data scraped from large databases available on the web that corresponded to the particular use cases we wanted to test (Health, Finance, and Consumer Goods).
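One way this pooling could be done is sketched below: seed records from each domain are tagged with their use case, interleaved, and written out as a single training file for the SM. The file names and JSONL schema are assumptions for illustration only.

```python
# Illustrative sketch of pooling scraped seed data from several domains so the
# synthetic model sees diverse use cases during its one-time fine-tune.
# File names and record schema are hypothetical.
import json
import random

DOMAIN_FILES = {
    "health": "scraped_health.jsonl",
    "finance": "scraped_finance.jsonl",
    "consumer_goods": "scraped_consumer_goods.jsonl",
}

pooled = []
for domain, path in DOMAIN_FILES.items():
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            record["use_case"] = domain  # tag each example with its domain
            pooled.append(record)

random.shuffle(pooled)  # interleave domains so no single use case dominates

with open("synthetic_model_training.jsonl", "w") as f:
    for record in pooled:
        f.write(json.dumps(record) + "\n")
```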
Accomplishments that we're proud of
One thing we are proud of is that we were able to effectively fine-tune Llama 7B to synthesize data for subsequent fine-tuning. We also identified at least three use cases where our approach would be useful, involving medical, financial, and product data. Next steps for improving the pipeline include building a more powerful synthetic model and improving the tuning data passed to each variable model.
Built With
- ai
- api
- llm
- ml
- natural-language-processing
- numpy
- python
- together