lilakk/how2everything
🤗 HuggingFace · 📄 Paper · 📝 Blog Post

Overview


LLMs constantly produce instructions for everything, from diverse real-world goals (e.g., filing taxes, cooking recipes) to plans for AI agents, but improving this capability is challenging. Outputs can sound fluent while describing steps that don't actually work; surface-level metrics miss critical mistakes like omitted prerequisites or contradictory instructions; and manual verification doesn't scale.

How2Everything closes this gap with a practical loop: mine real procedures from the web → benchmark LLM outputs → detect critical failures (missing steps, wrong order, omissions) → use that signal to train better models.

  • ⛏️ How2Mine — a multi-stage pipeline that mines structured procedures (goal + resources + steps) from web documents; running it on ~1M pages yields 351K procedures across 14 topics.
  • 🎯 How2Bench — a 7K-example evaluation benchmark balanced across topics, with:
    • How2Score — an LLM-as-a-judge protocol that checks whether a generated procedure contains any critical failure that would prevent achieving the goal.
    • How2Judge — an open 8B judge (distilled from GPT-5) that achieves 80.5% agreement with human annotators, enabling low-cost, reproducible evaluation.
  • 🚀 How2Train — the remaining mined procedures used as RL training data; using How2Score as a reward improves How2Bench performance by >10 points across three models without regressions on 12 standard benchmarks.
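The reward idea behind How2Train can be sketched in a few lines: a judge's verdict on whether a generated procedure contains a critical failure is collapsed into a scalar reward for RL. This is an illustrative sketch only — the verdict phrasing and reward values below are assumptions, not the repo's actual judge-output format.

```python
# Illustrative sketch (not the repo's implementation): mapping a
# How2Score-style judge verdict to a scalar RL reward. The verdict
# strings and reward values are assumptions for illustration.

def verdict_to_reward(judge_output: str) -> float:
    """Return 1.0 if the judge found no critical failure, else 0.0."""
    verdict = judge_output.strip().lower()
    return 1.0 if "no critical failure" in verdict else 0.0

rewards = [verdict_to_reward(v) for v in [
    "No critical failure: all steps are present and correctly ordered.",
    "Critical failure: the recipe omits preheating the oven.",
]]
```

In practice any RL algorithm that accepts per-sample scalar rewards can consume this signal; the paper reports >10-point How2Bench gains across three models with this setup.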

Installation

Requires Python >= 3.11 and uv.

uv venv && uv sync

API calls to hosted models are routed through lm-deluge, which handles rate limiting, retries, and provider-specific request translation (OpenAI, Anthropic, Gemini, etc.). Local model inference uses vLLM.

Quickstart

After installation, each component can be accessed via the h2e CLI:

# Mine procedures from documents (requires OPENAI_API_KEY or other provider key)
uv run h2e mine run --config examples/mine/configs/openai_sync.yaml

# Evaluate a model on How2Bench (uses How2Judge via vLLM by default)
uv run h2e bench run --config examples/bench/configs/official_benchmark.yaml

# Deduplicate training data against the test set
uv run python examples/train/dedup_against_test.py \
    --train-path hf://how2everything/how2train_rl_100k?split=train \
    --test-path hf://how2everything/how2bench?split=train \
    --output-path data/train_deduped.jsonl
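The dedup step above can be sketched as a filter that drops any training example whose goal also appears in the test set after normalization. This is a sketch of the idea only — the actual dedup_against_test.py may use a stronger matching criterion, and the "goal" field name is an assumption.

```python
# Minimal sketch of deduplicating training data against a test set by
# normalized goal text. Illustrative only: the real script may match on
# different fields or use fuzzier similarity.

def normalize(goal: str) -> str:
    """Lowercase and collapse whitespace so near-identical goals match."""
    return " ".join(goal.lower().split())

def dedup(train: list[dict], test: list[dict]) -> list[dict]:
    test_goals = {normalize(ex["goal"]) for ex in test}
    return [ex for ex in train if normalize(ex["goal"]) not in test_goals]

train = [{"goal": "File taxes  online"}, {"goal": "Bake sourdough bread"}]
test = [{"goal": "file taxes online"}]
deduped = dedup(train, test)  # only the sourdough example survives
```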

See each component's README for full details:

  • ⛏️ How2Mine — mine procedures from your own documents
  • 🎯 How2Bench — evaluate models and reproduce the leaderboard
  • 🚀 How2Train — prepare training data and run RL with open-instruct

Released Artifacts

All artifacts are available in the How2Everything HuggingFace collection.

| Artifact | Description | Link |
| --- | --- | --- |
| How2Judge | Open 8B judge model | how2everything/how2judge |
| How2Mine | 351K procedures mined from 980K web docs | how2everything/how2mine |
| How2Bench | 7K-example evaluation benchmark | how2everything/how2bench |
| How2Train | Training set (deduped against How2Bench via dedup_against_test.py) | how2everything/how2train |
| WildChat labeled | WildChat labeled by the OpenAI query type classifier | how2everything/WildChat-4.8M |
| lmsys-chat labeled | lmsys-chat labeled by the OpenAI query type classifier | how2everything/lmsys-chat-1m |

Citation

@misc{chang2026how2everythingminingwebhowto,
      title={How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs}, 
      author={Yapei Chang and Kyle Lo and Mohit Iyyer and Luca Soldaini},
      year={2026},
      eprint={2602.08808},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.08808}, 
}

About

Official code for "How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs"
