Pocoya/BabyLM

BabyLM

Instructions to run

Download dataset

The Nordisk familjebok dataset can be downloaded with the download.sh script. Due to high server load, some files may fail to download. If this happens, delete the files that failed to download and rerun the script.
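A minimal sketch of the retry step, assuming (not verified against download.sh) that a failed download leaves an empty file behind. The temp-dir setup only simulates a download directory for illustration:

```shell
# Simulate a download directory with one good and one failed (zero-byte) file.
tmp=$(mktemp -d)
echo "entry text" > "$tmp/ok.txt"      # completed download
touch "$tmp/failed.txt"                # zero-byte file, simulates a failure

# Delete zero-byte files so the next run of download.sh refetches them.
find "$tmp" -type f -size 0 -print -delete
```

After the cleanup, rerunning download.sh should fetch only the removed files again.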

Split the dataset

Run the split.py script to split the dataset.

Clean the dataset

Run the clean.py script to clean the dataset.

Create training and test dataset

Run the dataset_organizer.py script to create a training and test dataset.

Train the models

The train.sh script creates a virtual environment, installs the dependencies, trains the tokenizer, the teacher models, and the student model, and then evaluates the performance of the models.
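The steps above can be sketched as one ordered command list. The script names come from the Files section below; any flags or paths beyond the bare invocations are assumptions, so the sketch only prints the pipeline rather than executing it:

```shell
# Ordered pipeline from the README; bash/python invocations are assumed.
PIPELINE="bash download.sh
python split.py
python clean.py
python dataset_organizer.py
bash train.sh"

# Print each step in order without running anything.
i=1
echo "$PIPELINE" | while read -r cmd; do
  echo "step $i: $cmd"
  i=$((i+1))
done
```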

Files

  • clean.py - Cleans the entries in the encyclopedia
  • dalaj-ged-superlim_test.jsonl - File containing the evaluation dataset
  • dataset_organizer.py - Merges encyclopedia entries into a train and test dataset
  • dataset.py - File to handle loading of dataset during training
  • download.sh - Script to download the raw dataset
  • *.yaml - Configuration files that set the model parameters. Link or copy the config you want to use to config.yaml
  • README.md - This file, contains information about this repository
  • requirements.txt - Libraries utilized by this project
  • split.py - Extracts the entries from the dictionary
  • test.py - Evaluates the model
  • tokenizer.py - Creates a tokenizer
  • train.sh - Script that trains the model, see above
  • train_student - Trains the student model; the teacher models to use are selected by editing the file
  • train_teacher - Trains a teacher model; config.yaml needs to be present in this directory
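Since the README says the chosen *.yaml must be linked or copied to config.yaml, here is a hypothetical selection step. The file name small.yaml and its contents are made up for illustration; the temp dir stands in for the repository root:

```shell
# Work in a scratch directory standing in for the repo root.
tmp=$(mktemp -d); cd "$tmp"

# A made-up configuration file (real configs ship with the repo).
printf 'hidden_size: 256\n' > small.yaml

# Select it as the active config via a symlink (copying also works).
ln -sf small.yaml config.yaml
cat config.yaml
```

A symlink keeps a single source of truth: editing small.yaml immediately changes what train_teacher reads.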
