TrainVerify

Introduction

Overview: TrainVerify is a verification tool to ensure parallelization equivalence in distributed model training. It guarantees that the parallelized model is arithmetically equivalent to its original single-device version, thereby eliminating errors such as incorrect tensor transformations or faulty collective communication introduced during parallelization.

Workflow: TrainVerify takes the execution plans (execplans) of both the original (single-device) and parallelized (multi-device) models, each enriched with lineages. It then transforms these into symbolic SSA DAGs (single-static assignment directed acyclic graphs). Such dual graphs are partitioned into independent stages using lineage information. Each stage undergoes parallel execution, where shape reduction is applied, followed by symbolic verification using Z3. If all stages pass, TrainVerify guarantees end-to-end equivalence.

Implementation: TrainVerify is implemented with modular interfaces. It defines a general graph interface and a solver interface, supporting a registry of allowable operators. The nnScaler backend constructs SSA DAGs that conform to TrainVerify’s requirements. A high-level view of the implementation is shown below:

Prepare for your Verification

💻 Hardware Requirements

Most GPU machine could run below verification of example MLP model. To verify large models such as Llama3(405B) and DeepSeek-V3(671B), we recommend to run TrainVerify on a machine with at least 32 CPU (virtual) cores and 1TB memory.

Installation

Clone the verification branch from nnscaler repo:

cd TrainVerify
git clone https://github.com/microsoft/nnscaler.git -b verification

Create conda environment.

conda env create -f conda-environment.yml
conda activate TrainVerify

Run demo verification for an example MLP model gen_model/model/mlp.py parallelization.
1. Capture model graphs for verification
```
cd Verdict;
mkdir gen_model/mgeners;
bash scripts/gen_mlp_demo.sh
```
2. Perform verification
```
mkdir data/logs;
bash scripts/demo_mlp.sh 
```
  The scripts/demo_mlp.sh essentially runs the following command:
```
python main.py \
    --sm gen_model/mgeners/mlp_default_dp1_pp1_tp1_nm1_gbs128_ly2_h0_hi128_sq0.pkl \ 
    --pm gen_model/mgeners/mlp_default_dp2_pp2_tp2_nm2_gbs128_ly2_h0_hi128_sq0.pkl \
    --seed 0 \
    --time \
    --max_ser_proc 30 \
    --max_vrf_proc 30 \
    --loglevel INFO \
    --no_cache_nodes \
    --no_cache_stages \
    |& tee -a data/logs/mlp_default_dp2_pp2_tp2_nm2_gbs128_ly2_h0_hi128_sq0.txt
```
  Command interpretation: main.py is the entry of TrainVerify. --sm and --pm sepcify the paths of single-device model's and parallelized model's execution plan respectively. --seed sets z3 random seed. --time activates timer. --max_ser_proc and --max_vrf_proc set the multiprocessing pool size for building SSA DAGs, and parallel stage execution respectively. --loglevel sets logger level. --no_cache_nodes and --no_cache_stages ignore any cached data and run verification from scratch. |& tee -a ... writes logs to a file for inspection convenience.
  
  👀 Expected Output: The program should print the following message or similar at the bottom of the output. Indicating successful execution of all stages, as well as the verified end-to-end equivalence.
```
PID: ... - ✅ SUCCESS 
Stats(success=True, ... )
```
  A failed run would print PID: ... - ❌ FAIL.
Once the demo runs successfully, can proceed to verify your own models. We will continue to provide additional example model verifications in the future.

Citation

@article{lu2025trainverify,
  title={TrainVerify: Equivalence-Based Verification for Distributed LLM Training},
  author={Lu, Yunchi and Miao, Youshan and Tan, Cheng and Huang, Peng and Zhu, Yi and Zhang, Xian and Yang, Fan},  
  booktitle={Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles (SOSP'25)},
  year={2025}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit Contributor License Agreements.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.github/workflows		.github/workflows
Verdict		Verdict
docs/assets		docs/assets
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
SUPPORT.md		SUPPORT.md
conda-environment.yml		conda-environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TrainVerify

Introduction

Prepare for your Verification

💻 Hardware Requirements

Installation

Citation

Contributing

Trademarks

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TrainVerify

Introduction

Prepare for your Verification

💻 Hardware Requirements

Installation

Citation

Contributing

Trademarks

About

Resources

License

Code of conduct

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages