🤗 HuggingFace • 📄 Paper • 📝 Blog Post
LLMs routinely generate how-to instructions for everything from real-world goals (e.g., filing taxes, cooking recipes) to plans for AI agents, but improving this capability is challenging: outputs can sound fluent while describing steps that don't actually work, surface-level metrics miss critical mistakes like omitted prerequisites or contradictory instructions, and manual verification doesn't scale.
How2Everything closes this gap with a practical loop: mine real procedures from the web → benchmark LLM outputs → detect critical failures (missing steps, wrong order, omissions) → use that signal to train better models.
- ⛏️ How2Mine — a multi-stage pipeline that mines structured procedures (goal + resources + steps) from web documents; running it on ~1M pages yields 351K procedures across 14 topics.
- 🎯 How2Bench — a 7K-example evaluation benchmark balanced across topics, with:
  - How2Score — an LLM-as-a-judge protocol that checks whether a generated procedure contains any critical failure that would prevent achieving the goal.
  - How2Judge — an open 8B judge (distilled from GPT-5) that achieves 80.5% agreement with human annotators, enabling low-cost, reproducible evaluation.
- 🚀 How2Train — the remaining mined procedures used as RL training data; using How2Score as a reward improves How2Bench performance by >10 points across three models without regressions on 12 standard benchmarks.
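At its core, using How2Score as an RL reward means mapping a judge verdict onto a scalar signal. Below is a minimal, illustrative sketch of that idea; the `VERDICT:` output convention and the function name are assumptions for illustration, not the actual prompt format or API used by How2Judge.

```python
# Hypothetical sketch of a How2Score-style binary reward. The real judge
# prompt and output format may differ; this only illustrates mapping a
# pass/fail verdict from a judge model onto an RL reward.
import re

def how2score_reward(judge_output: str) -> float:
    """Return 1.0 if the judge found no critical failure, else 0.0.

    Assumes the judge ends its reasoning with a line like
    'VERDICT: PASS' or 'VERDICT: FAIL' (an illustrative convention).
    """
    match = re.search(r"VERDICT:\s*(PASS|FAIL)", judge_output, re.IGNORECASE)
    if match is None:
        return 0.0  # unparseable judge output earns no reward
    return 1.0 if match.group(1).upper() == "PASS" else 0.0
```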
Requires Python >= 3.11 and uv.
```shell
uv venv && uv sync
```

API calls to hosted providers are routed through lm-deluge, which handles rate limiting, retries, and provider-specific request translation (OpenAI, Anthropic, Gemini, etc.). Local model inference uses vLLM.
After installation, each component can be accessed via the h2e CLI:
```shell
# Mine procedures from documents (requires OPENAI_API_KEY or other provider key)
uv run h2e mine run --config examples/mine/configs/openai_sync.yaml

# Evaluate a model on How2Bench (uses How2Judge via vLLM by default)
uv run h2e bench run --config examples/bench/configs/official_benchmark.yaml

# Deduplicate training data against the test set
uv run python examples/train/dedup_against_test.py \
    --train-path hf://how2everything/how2train_rl_100k?split=train \
    --test-path hf://how2everything/how2bench?split=train \
    --output-path data/train_deduped.jsonl
```

See each component's README for full details:
- ⛏️ How2Mine — mine procedures from your own documents
- 🎯 How2Bench — evaluate models and reproduce the leaderboard
- 🚀 How2Train — prepare training data and run RL with open-instruct
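The `hf://<repo_id>?split=<split>` paths accepted by the dedup command above follow a simple URI convention. A minimal sketch of parsing them, assuming a `train` default split (illustrative only, not the script's actual implementation):

```python
# Illustrative parser for hf:// dataset paths of the form
# hf://<repo_id>?split=<split>, as used in the dedup example.
from urllib.parse import parse_qs, urlparse

def parse_hf_path(path: str) -> tuple[str, str]:
    """Split an hf:// URI into (repo_id, split), defaulting split to 'train'."""
    parsed = urlparse(path)
    repo_id = f"{parsed.netloc}{parsed.path}"  # e.g. "how2everything/how2bench"
    split = parse_qs(parsed.query).get("split", ["train"])[0]
    return repo_id, split
```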
All artifacts are available in the How2Everything HuggingFace collection.
| Artifact | Description | Link |
|---|---|---|
| How2Judge | Open 8B judge model | how2everything/how2judge |
| How2Mine | 351K procedures mined from 980K web docs | how2everything/how2mine |
| How2Bench | 7K evaluation benchmark | how2everything/how2bench |
| How2Train | Training set (deduped against How2Bench via dedup_against_test.py) | how2everything/how2train |
| WildChat labeled | WildChat labeled by OpenAI query type classifier | how2everything/WildChat-4.8M |
| lmsys-chat labeled | lmsys-chat labeled by OpenAI query type classifier | how2everything/lmsys-chat-1m |
```bibtex
@misc{chang2026how2everythingminingwebhowto,
  title={How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs},
  author={Yapei Chang and Kyle Lo and Mohit Iyyer and Luca Soldaini},
  year={2026},
  eprint={2602.08808},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.08808},
}
```