🤗 HuggingFace • 📄 Paper • 📝 Blog Post
LLMs routinely generate how-to instructions for everything from real-world goals (e.g., filing taxes, cooking recipes) to plans for AI agents, but improving this capability is challenging: outputs can sound fluent while describing steps that don't actually work, surface-level metrics miss critical mistakes like omitted prerequisites or contradictory instructions, and manual verification doesn't scale.
How2Everything closes this gap with a practical loop: mine real procedures from the web → benchmark LLM outputs → detect critical failures (missing steps, wrong order, omissions) → use that signal to train better models.
- ⛏️ How2Mine — a multi-stage pipeline that mines structured procedures (goal + resources + steps) from web documents; running it on ~1M pages yields 351K procedures across 14 topics.
- 🎯 How2Bench — a 7K-example evaluation benchmark balanced across topics, with:
  - How2Score — an LLM-as-a-judge protocol that checks whether a generated procedure contains any critical failure that would prevent achieving the goal.
  - How2Judge — an open 8B judge (distilled from GPT-5) that achieves 80.5% agreement with human annotators, enabling low-cost, reproducible evaluation.
- 🚀 How2Train — the remaining mined procedures used as RL training data; using How2Score as a reward improves How2Bench performance by >10 points across three models without regressions on 12 standard benchmarks.
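At its core, using How2Score as an RL reward means mapping a judge verdict onto a scalar signal. Below is a minimal, illustrative sketch of that idea; the `VERDICT:` output convention and the function name are assumptions for illustration, not the actual prompt format or API used by How2Judge.

```python
# Hypothetical sketch of a How2Score-style binary reward. The real judge
# prompt and output format may differ; this only illustrates mapping a
# pass/fail verdict from a judge model onto an RL reward.
import re

def how2score_reward(judge_output: str) -> float:
    """Return 1.0 if the judge found no critical failure, else 0.0.

    Assumes the judge ends its reasoning with a line like
    'VERDICT: PASS' or 'VERDICT: FAIL' (an illustrative convention).
    """
    match = re.search(r"VERDICT:\s*(PASS|FAIL)", judge_output, re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable judge output earns no reward
    return 1.0 if match.group(1).upper() == "PASS" else 0.0
```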
Requires Python >= 3.11 and uv.
```shell
uv venv && uv sync
```

API calls to hosted providers are routed through lm-deluge, which handles rate limiting, retries, and provider-specific request translation (OpenAI, Anthropic, Gemini, etc.). Local model inference uses vLLM.
After installation, each component can be accessed via the h2e CLI:
```shell
# Mine procedures from documents (requires OPENAI_API_KEY or other provider key)
uv run h2e mine run --config examples/mine/configs/openai_sync.yaml

# Evaluate a model on How2Bench (uses How2Judge via vLLM by default)
uv run h2e bench run --config examples/bench/configs/official_benchmark.yaml

# Deduplicate training data against the test set
uv run python examples/train/dedup_against_test.py \
    --train-path hf://how2everything/how2train_rl_100k?split=train \
    --test-path hf://how2everything/how2bench?split=train \
    --output-path data/train_deduped.jsonl
```

See each component's README for full details:
- ⛏️ How2Mine — mine procedures from your own documents
- 🎯 How2Bench — evaluate models and reproduce the leaderboard
- 🚀 How2Train — prepare training data and run RL with open-instruct
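The `hf://<repo_id>?split=<split>` paths accepted by the dedup command above follow a simple URI convention. A minimal sketch of parsing them, assuming a `train` default split (illustrative only, not the script's actual implementation):

```python
# Illustrative parser for hf:// dataset paths of the form
# hf://<repo_id>?split=<split>, as used in the dedup example.
from urllib.parse import parse_qs, urlparse

def parse_hf_path(path: str) -> tuple[str, str]:
    """Split an hf:// URI into (repo_id, split), defaulting split to 'train'."""
    parsed = urlparse(path)
    repo_id = f"{parsed.netloc}{parsed.path}"  # e.g. "how2everything/how2bench"
    split = parse_qs(parsed.query).get("split", ["train"])[0]
    return repo_id, split
```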
All artifacts are available in the How2Everything HuggingFace collection.
| Artifact | Description | Link |
|---|---|---|
| How2Judge | Open 8B judge model | how2everything/how2judge |
| How2Mine | 351K procedures mined from 980K web docs | how2everything/how2mine |
| How2Bench | 7K evaluation benchmark | how2everything/how2bench |
| How2Train | Training set (deduped against How2Bench via dedup_against_test.py) | how2everything/how2train |
| WildChat labeled | WildChat labeled by OpenAI query type classifier | how2everything/WildChat-4.8M |
| lmsys-chat labeled | lmsys-chat labeled by OpenAI query type classifier | how2everything/lmsys-chat-1m |
```bibtex
@misc{chang2026how2everythingminingwebhowto,
  title={How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs},
  author={Yapei Chang and Kyle Lo and Mohit Iyyer and Luca Soldaini},
  year={2026},
  eprint={2602.08808},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.08808},
}
```