ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval

Introduction

We propose ReasonEmbed, a new text embedding model for reasoning-intensive document retrieval, built on innovations in how synthetic data is generated and used. Our work makes the following technical contributions.

We design a novel data synthesis method, called ReMixer.
We introduce a self-adaptive training method tailored for our synthetic data, termed Redapter.
We implement ReasonEmbed on multiple LLM backbones of varying sizes, which achieve state-of-the-art (SOTA) performance on reasoning-intensive document retrieval tasks. Notably, our model built on Qwen3-4B reaches an nDCG@10 score of 37.1 on the BRIGHT benchmark, which already surpasses all existing text embedding models, while the Qwen3-8B based variant improves the score to 38.1. Moreover, on the R2MED benchmark, ReasonEmbed-Qwen3-8B attains an nDCG@10 score of 43.18, surpassing all existing models by a large margin and setting a new SOTA.
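
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how nDCG@10 can be computed for a single query. It is not the evaluation code used for the reported numbers: the function names, document ids, and labels are hypothetical, and linear gain with a log2 discount is assumed. The scores in this README appear to be on a 0 to 100 scale, so the value returned here would be multiplied by 100 and averaged over queries.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (log2 rank discount)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids ordered by retrieval score, best first.
    relevance: dict mapping document id -> graded relevance label;
               ids missing from the dict are treated as non-relevant (0).
    """
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(gains, k) / ideal_dcg


# Toy example with hypothetical ids: two relevant documents, ranked 2nd and 4th.
ranking = ["d3", "d1", "d7", "d2"]
labels = {"d1": 1, "d2": 1}
score = ndcg_at_k(ranking, labels)
```

A ranking that places every relevant document first yields 1.0; pushing relevant documents lower in the list reduces the score through the logarithmic discount.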

For more details, please refer to our paper.

Performance

BRIGHT

Evaluation results of ReasonEmbed on the BRIGHT benchmark:

R2MED

Evaluation results of ReasonEmbed on the R2MED benchmark:

Open-Source Resources

Models

Embedder

Resource Type	Name	Link	Release Date	Comments
Model	ReasonEmbed-Qwen3-8B (BGE-Reasoner-Embed-0928)	🤗reason-embed-qwen3-8b-0928	Oct 11, 2025	nDCG@10 = 38.1 on BRIGHT using original queries; fine-tuned on Qwen/Qwen3-8B with our synthetic dataset using the novel RI-InfoNCE loss; submission to BRIGHT leaderboard
Model	ReasonEmbed-Basic-Qwen3-8B	🤗reason-embed-basic-qwen3-8b-0928	Oct 11, 2025	nDCG@10 = 37.1 on BRIGHT using original queries; fine-tuned on Qwen/Qwen3-8B with our synthetic dataset using the basic InfoNCE loss
Model	ReasonEmbed-Qwen3-4B	🤗reason-embed-qwen3-4b-0928	Oct 11, 2025	nDCG@10 = 37.1 on BRIGHT using original queries; fine-tuned on Qwen/Qwen3-4B with our synthetic dataset using the novel RI-InfoNCE loss
Model	ReasonEmbed-Basic-Qwen3-4B	🤗reason-embed-basic-qwen3-4b-0928	Oct 11, 2025	nDCG@10 = 35.3 on BRIGHT using original queries; fine-tuned on Qwen/Qwen3-4B with our synthetic dataset using the basic InfoNCE loss
Model	ReasonEmbed-Llama-3.1-8B	🤗reason-embed-llama-3.1-8b-0928	Oct 11, 2025	nDCG@10 = 36.2 on BRIGHT using original queries; fine-tuned on meta-llama/Llama-3.1-8B with our synthetic dataset using the novel RI-InfoNCE loss
Model	ReasonEmbed-Basic-Llama-3.1-8B	🤗reason-embed-basic-llama-3.1-8b-0928	Oct 11, 2025	nDCG@10 = 34.9 on BRIGHT using original queries; fine-tuned on meta-llama/Llama-3.1-8B with our synthetic dataset using the basic InfoNCE loss
Model	BGE-Reasoner-Embed-0821	-	-	nDCG@10 = 32.5 on BRIGHT using original queries; will not be released due to its suboptimal performance compared to BGE-Reasoner-Embed-0923; submission to BRIGHT leaderboard
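
The "Basic" variants above are trained with the basic InfoNCE loss. As a rough illustration only, the sketch below computes that loss for one query from precomputed query-document similarity scores. The function name, the example scores, and the temperature value are assumptions rather than settings taken from our paper, and the sketch does not cover the RI-InfoNCE loss, which is not defined in this README.

```python
import math


def info_nce_loss(scores, positive_index=0, temperature=0.02):
    """Basic InfoNCE loss for a single query.

    scores: similarity scores between the query and each candidate document
            (one positive plus any number of negatives).
    positive_index: position of the positive document in `scores`.
    temperature: softmax temperature (assumed value, for illustration only).
    """
    logits = [s / temperature for s in scores]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denominator = max_logit + math.log(
        sum(math.exp(logit - max_logit) for logit in logits)
    )
    # Negative log-softmax probability assigned to the positive document.
    return log_denominator - logits[positive_index]


# Hypothetical cosine similarities: the positive first, then three negatives.
example_scores = [0.82, 0.40, 0.35, 0.10]
loss = info_nce_loss(example_scores)
```

The loss approaches zero when the positive document scores well above every negative, and equals the log of the number of candidates when all scores are tied.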

Annotator

Resource Type	Name	Link	Release Date	Comments
Model	ReasonEmbed-Annotator-Qwen3-8B	🤗 (TBA) reason-embed-annotator-qwen3-8b-0928	-	Distilled relevance annotator based on Qwen/Qwen3-8B used in our paper

Evaluation

Search Results

Resource Type	Name	Link	Release Date	Comments
Search Results	search results of ReasonEmbed on BRIGHT	🤗	Oct 11, 2025	-

Code and Evaluation Scripts

Resource Type	Name	Link	Release Date	Comments
Evaluation Code and Scripts	BRIGHT evaluation code	code & scripts	Oct 10, 2025	integrated with FlagEmbedding
Evaluation Code and Scripts	evaluation script of BM25 on BRIGHT	code & scripts	Oct 10, 2025	implemented using FlagEmbedding framework

Data and Synthesis Code

Data

Resource Type	Name	Link	Release Date	Comments
Training Data	ReasonEmbed-Data-0928	🤗reason-embed-data-0928	Oct 11, 2025	used for training all ReasonEmbed models in our paper

Code and Scripts

Resource Type	Name	Link	Release Date	Comments
Data Synthesis Code and Scripts	ReMixer	(TBA)	-	to be released

Training Code

Resource Type	Name	Link	Release Date	Comments
Training Code and Scripts	Redapter	(TBA)	-	to be released

Citation

If you find this repository useful, please consider giving it a star ⭐ and citing our paper:

@article{chen2025reasonembed,
  title={ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval},
  author={Chen, Jianlyu and Lan, Junwei and Li, Chaofan and Lian, Defu and Liu, Zheng},
  journal={arXiv preprint arXiv:2510.08252},
  year={2025}
}
