The Nordisk familjebok dataset can be downloaded using the download.sh script.
Due to high server load, some files may fail to download. If this happens, delete the files that failed to download and rerun the script.
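The cleanup step can be scripted; a minimal sketch, assuming a failed download leaves a zero-byte file behind (adjust the size test if partial files are non-empty):

```shell
# Delete zero-byte files left behind by failed downloads
# (assumption: a failed download leaves an empty file),
# then rerun the download script: ./download.sh
find . -maxdepth 1 -type f -size 0 -delete
```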
Run the split.py script to split the dataset.
Run the clean.py script to clean the dataset.
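The README does not specify what cleaning entails; as an illustrative sketch only (the `clean_entry` function and its rules are assumptions, not clean.py's actual logic), a cleaner might normalize whitespace and drop empty entries:

```python
import re

def clean_entry(text: str) -> str:
    # Collapse runs of whitespace into single spaces and trim the ends.
    # Illustrative assumption -- not the actual clean.py behaviour.
    return re.sub(r"\s+", " ", text).strip()

raw_entries = ["  Nordisk   familjebok ", "", "Uggleupplagan\n1904"]
cleaned = [clean_entry(e) for e in raw_entries if e.strip()]
# cleaned == ["Nordisk familjebok", "Uggleupplagan 1904"]
```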
Run the dataset_organizer.py script to create the training and test datasets.
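The train/test organization can be sketched as a shuffled partition of the entries; the function below, its 10% test fraction, and the fixed seed are illustrative assumptions rather than dataset_organizer.py's actual behaviour:

```python
import random

def train_test_split(entries, test_fraction=0.1, seed=0):
    # Shuffle deterministically, then carve off the test portion.
    # The ratio and seed are illustrative assumptions.
    items = list(entries)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

train, test = train_test_split([f"entry-{i}" for i in range(100)])
# len(train) == 90, len(test) == 10
```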
The script train.sh creates a virtual environment, installs dependencies, trains the tokenizer, trains the teacher models, trains the student model, and then evaluates the performance of the models.
clean.py - Cleans the entries in the encyclopedia
dalaj-ged-superlim_test.jsonl - File containing the evaluation dataset
dataset_organizer.py - Merges encyclopedia entries into a train and test dataset
dataset.py - Handles loading of the dataset during training
download.sh - Script to download the raw dataset
*.yaml - Configuration files to choose model parameters. The config that you want to use should be linked or copied to config.yaml
README.md - This file; contains information about this repository
requirements.txt - Libraries used by this project
split.py - Extracts the entries from the dictionary
test.py - Evaluates the model
tokenizer.py - Creates a tokenizer
train.sh - Script that trains the model, see above
train_student - Trains the student model; the teacher models used are selected by modifying the file
train_teacher - Trains a teacher model; config.yaml needs to be present in this directory