The Nordisk familjebok dataset can be downloaded using the download.sh script.
Due to high server load, some files may fail to download. If this happens, delete the files that failed to download and rerun the script.
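The cleanup step can be scripted; a minimal sketch, assuming a failed download leaves a zero-byte file behind (adjust the size test if partial files are non-empty):

```shell
# Delete zero-byte files left behind by failed downloads
# (assumption: a failed download leaves an empty file),
# then rerun the download script: ./download.sh
find . -maxdepth 1 -type f -size 0 -delete
```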
Run the split.py script to split the dataset.
Run the clean.py script to clean the dataset.
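The README does not specify what cleaning entails; as an illustrative sketch only (the `clean_entry` function and its rules are assumptions, not clean.py's actual logic), a cleaner might normalize whitespace and drop empty entries:

```python
import re

def clean_entry(text: str) -> str:
    # Collapse runs of whitespace into single spaces and trim the ends.
    # Illustrative assumption -- not the actual clean.py behaviour.
    return re.sub(r"\s+", " ", text).strip()

raw_entries = ["  Nordisk   familjebok ", "", "Uggleupplagan\n1904"]
cleaned = [clean_entry(e) for e in raw_entries if e.strip()]
# cleaned == ["Nordisk familjebok", "Uggleupplagan 1904"]
```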
Run the dataset_organizer.py script to create the training and test datasets.
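The train/test organization can be sketched as a shuffled partition of the entries; the function below, its 10% test fraction, and the fixed seed are illustrative assumptions rather than dataset_organizer.py's actual behaviour:

```python
import random

def train_test_split(entries, test_fraction=0.1, seed=0):
    # Shuffle deterministically, then carve off the test portion.
    # The ratio and seed are illustrative assumptions.
    items = list(entries)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

train, test = train_test_split([f"entry-{i}" for i in range(100)])
# len(train) == 90, len(test) == 10
```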
The script train.sh creates a virtual environment, installs dependencies, trains the tokenizer, trains the teacher models, trains the student model, and then evaluates the performance of the models.
clean.py - Cleans the entries in the encyclopedia
dalaj-ged-superlim_test.jsonl - File containing the evaluation dataset
dataset_organizer.py - Merges encyclopedia entries into a train and test dataset
dataset.py - Handles loading of the dataset during training
download.sh - Script to download the raw dataset
*.yaml - Configuration files to choose model parameters. The config that you want to use should be linked or copied to config.yaml
README.md - This file; contains information about this repository
requirements.txt - Libraries used by this project
split.py - Extracts the entries from the dictionary
test.py - Evaluates the model
tokenizer.py - Creates a tokenizer
train.sh - Script that trains the model, see above
train_student - Trains the student model; the teacher models used are selected by modifying the file
train_teacher - Trains a teacher model; config.yaml needs to be present in this directory