Inspiration

Reproducibility in research is often harder than it should be, not because the ideas are unclear, but because the operational details are scattered. Many existing research repos contain undocumented dependencies, ambiguous experiment commands, missing datasets, environment mismatches, or small inconsistencies between the paper and the repo.

In our own research (and in conversations with other researchers), we’ve repeatedly seen similar issues: a paper looks promising, the repo exists, but reproducing the results requires hours (or days) of detective work. We wanted to treat research reproduction like a software engineering workflow problem rather than simply an academic one.

RepliCAT was inspired by this. What if reproducing a paper felt more like running CI/CD on a codebase (structured, verified, and easy to use) instead of manually reverse-engineering instructions?

What it does

RepliCAT is an agentic system that turns "I want to reproduce this paper" into a concrete, verifiable replication plan, and all it needs is a paper title, URL, file upload, or citation.

Given these inputs, RepliCAT can:

  1. Understand the paper and repository, generating all the code a researcher needs to reproduce its experiments
  2. Generate a step-by-step replication plan (e.g. cloning the original research repo, environment setup, dataset and artifact acquisition, and exact experiment commands)
  3. Verify the plan against the paper and codebase, and estimate feasibility: required hardware (CPU/GPU), tools and APIs, estimated cost, and estimated time
  4. Allow a human to interrupt and reprompt the model at any point in the multi-step agentic architecture
  5. Produce a structured replication report with rubric-based scores (Scientific Methodology, which focuses on statistical rigor and flags data poisoning and cherry-picked results; Transparency; Fact / Source Reliability; Repeatability; and Clarity), along with a log of any code modifications made to the official research repo; a minimal sketch of this report schema follows the list
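
To make these outputs concrete, here is a minimal sketch of what the rubric-based report schema looks like. The class and field names below are illustrative rather than the exact models in our codebase:

```python
# Minimal sketch of a rubric-based replication report schema.
# Class and field names are illustrative, not the exact RepliCAT models.
from typing import List

from pydantic import BaseModel, Field


class RubricScores(BaseModel):
    # Each rubric dimension is scored on a 0-10 scale (the range is an assumption).
    scientific_methodology: int = Field(ge=0, le=10)
    transparency: int = Field(ge=0, le=10)
    fact_source_reliability: int = Field(ge=0, le=10)
    repeatability: int = Field(ge=0, le=10)
    clarity: int = Field(ge=0, le=10)


class CodeModification(BaseModel):
    # Every deviation from the official research repo is logged.
    file_path: str
    reason: str
    diff: str


class ReplicationReport(BaseModel):
    paper_title: str
    repo_url: str
    scores: RubricScores
    code_modifications: List[CodeModification] = []
    notes: str = ""
```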

How we built it

RepliCAT is an agent-driven system built on two distinct stacks: the Agent Stack and the Tech Stack. The Agent Stack is driven by a Lab Supervisor, which orchestrates a multi-stage agent pipeline with typed, machine-checkable outputs. The Agent Stack is powered by agents built with Anthropic's Claude SDK and executes inside Modal Sandbox environments.

Project Image 1

Agent Stack

Stage 1: Review

A ResearchAgent performs paper analysis, codebase analysis, and fact-checking, condensing its findings against a structured rubric into a Markdown summary and a set of reproducibility scores.

Stage 2: Plan + Verify + Gate

Given the work done by the ResearchAgent, the ReplicationAgent produces a runnable "ReplicationPlan". A separate VerifierAgent checks the validity of the code, the process, and the alignment with the paper and repo. This stage also generates a feasibility estimate (required tools, cost, and time), allowing a human to stay in the loop and approve the code changes as necessary.
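
For illustration, a ReplicationPlan along these lines can be expressed as a small set of typed models. The names and fields below are a simplified sketch, not our exact code:

```python
# Illustrative sketch of a runnable replication plan with a human approval gate.
# Names are hypothetical; the real RepliCAT models may differ.
from typing import List, Optional

from pydantic import BaseModel


class PlanStep(BaseModel):
    description: str                         # e.g. "Clone the official repo"
    command: str                             # the exact shell command to run
    expected_artifact: Optional[str] = None  # file or metric the step should produce


class FeasibilityEstimate(BaseModel):
    hardware: str                            # e.g. "1x A100 GPU" or "CPU only"
    tools_and_apis: List[str]
    estimated_cost_usd: float
    estimated_time_hours: float


class ReplicationPlan(BaseModel):
    steps: List[PlanStep]
    feasibility: FeasibilityEstimate
    verified: bool = False                   # set by the VerifierAgent after cross-checking
    human_approved: bool = False             # execution is blocked until a human flips this
```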

Stage 3: Replication Director

A ReplicationDirector agent spawns child agents that execute experiments either locally (in Docker) or remotely (in Modal sandbox backends), depending on the complexity of the research experiment.
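
As a rough sketch of that backend selection for a single plan step (the Docker image name is hypothetical and the Modal calls are approximate, not our exact code):

```python
# Sketch of how a child agent might pick an execution backend for one plan step.
# The Docker image name is hypothetical and the Modal API usage is approximate.
import subprocess

import modal


def run_step(command: str, needs_gpu: bool) -> None:
    if not needs_gpu:
        # Lightweight steps run locally in a throwaway Docker container.
        subprocess.run(
            ["docker", "run", "--rm", "replicat-env:latest", "bash", "-lc", command],
            check=True,
        )
    else:
        # Heavier steps are dispatched to a remote Modal sandbox.
        app = modal.App.lookup("replicat", create_if_missing=True)
        sandbox = modal.Sandbox.create("bash", "-lc", command, app=app, gpu="A100")
        sandbox.wait()
```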

Stage 4: Report

The final stage generates a structured "Replication Report" in both JSON and Markdown, covering all of the RepliCAT metrics.
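
Rendering the two formats from the typed report is straightforward. The snippet below is a simplified sketch that assumes the ReplicationReport model sketched earlier and Pydantic v2 method names:

```python
# Simplified sketch of emitting the final report in both formats.
# Assumes the ReplicationReport model sketched above and Pydantic v2.
def write_report(report: "ReplicationReport", stem: str = "replication_report") -> None:
    # JSON for the dashboard and database.
    with open(f"{stem}.json", "w") as f:
        f.write(report.model_dump_json(indent=2))

    # Markdown for human readers.
    s = report.scores
    lines = [
        f"# Replication Report: {report.paper_title}",
        "",
        "| Metric | Score |",
        "| --- | --- |",
        f"| Scientific Methodology | {s.scientific_methodology}/10 |",
        f"| Transparency | {s.transparency}/10 |",
        f"| Fact / Source Reliability | {s.fact_source_reliability}/10 |",
        f"| Repeatability | {s.repeatability}/10 |",
        f"| Clarity | {s.clarity}/10 |",
    ]
    with open(f"{stem}.md", "w") as f:
        f.write("\n".join(lines) + "\n")
```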

All inter-agent communication is enforced through typed Pydantic models to ensure structured outputs and reduce parsing errors (with additional processing as necessary). Prompts are centralized for consistency.
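
As an illustration of that enforcement, a raw agent response can be validated against its expected schema before being handed to the next agent. The helper below is a simplified sketch; in the real pipeline, validation errors trigger a re-prompt:

```python
# Sketch of coercing a raw agent response into a typed message before it is
# passed to the next agent. Pydantic v2 method names; retry behaviour simplified.
from pydantic import BaseModel, ValidationError


def parse_agent_output(raw_json: str, schema: type[BaseModel]) -> BaseModel:
    """Validate an agent's JSON output against the expected Pydantic schema."""
    try:
        return schema.model_validate_json(raw_json)
    except ValidationError as err:
        # In the real pipeline the agent is re-prompted with the validation
        # errors; here we simply surface them.
        raise ValueError(f"Agent output failed schema validation:\n{err}") from err
```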

Project Image 2

Tech Stack

The Agent Stack was integrated into a larger Tech Stack consisting of a traditional front-end and back-end as well as a database, all making extensive use of Cloudflare's Developer Platform.

The front-end was designed to serve two main functions.

  1. To display important information such as reproducibility and clarity metrics, detailed replication steps, artifacts, and other crucial components of a solid scientific process. As a community, researchers can use the RepliCAT dashboard as a resource for collecting and standardizing analyses of the validity and value of research across a wide range of fields.
  2. To offer an interface where users can upload papers of their choice for detailed analysis by the agentic RepliCAT system. This allows researchers to continually grow the catalog of information and meta-information covering both the most niche and the most popular research fields.

The main job of the backend was to act as the bridge between all data streams, tying together the Agent Stack, front-end, and database. For the backend, we made extensive use of Cloudflare Workers: they handled the initial interaction with the Modal Sandboxes to kick off agent runs, and they acted as the intermediary for streaming status and progress information between the front-end and the Agent Stack during an active paper review. We also used Cloudflare's D1 SQL database to store the growing list of research with RepliCAT analyses.
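
For example, the Agent Stack reports progress back to a Worker endpoint, which relays it to the front-end. The sketch below uses a placeholder URL and hypothetical payload fields, not our actual API contract:

```python
# Sketch of the Agent Stack reporting progress to a Cloudflare Worker endpoint.
# The URL and payload fields are placeholders, not the actual API contract.
import requests

WORKER_STATUS_ENDPOINT = "https://replicat.example.workers.dev/api/status"  # placeholder


def report_progress(run_id: str, stage: str, message: str, percent: int) -> None:
    requests.post(
        WORKER_STATUS_ENDPOINT,
        json={
            "run_id": run_id,
            "stage": stage,      # e.g. "review", "plan", "execute", "report"
            "message": message,
            "percent": percent,
        },
        timeout=10,
    )
```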

Challenges we ran into

Minimal modification policy:
We enforced a strict rule: never modify the original repository unless explicitly required, and log every change that is made. This balances two extremes: not deviating too far from the original premise and repo of the research paper, while still making enough changes to successfully reproduce the experiments.
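
In practice, a logged modification can be captured as a unified diff plus a justification before the file is touched. The helper below is a simplified sketch of that idea (standard-library difflib, file handling simplified):

```python
# Simplified sketch of logging a code modification as a unified diff
# (plus a justification) before the change is written to disk.
import difflib
from pathlib import Path


def log_modification(path: Path, new_text: str, reason: str, change_log: list) -> None:
    old_text = path.read_text() if path.exists() else ""
    diff = "".join(
        difflib.unified_diff(
            old_text.splitlines(keepends=True),
            new_text.splitlines(keepends=True),
            fromfile=f"original/{path.name}",
            tofile=f"modified/{path.name}",
        )
    )
    # Every change is recorded with its justification before being applied.
    change_log.append({"file": str(path), "reason": reason, "diff": diff})
    path.write_text(new_text)
```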

Plan validity vs. hallucination risk:
LLM-generated plans can drift from the official research repo structure. The VerifierAgent cross-checks these plans against real files and documentation.

Execution safety:
Running arbitrary research code can be risky and expensive. We designed a mandatory human approval gate and a structured command preview before any execution.

Computation limitations:
We had limited compute and could not reproduce every paper within the given time constraints, so we built a minifier that finds the minimum experiment needed to demonstrate the quality of a paper.

Accomplishments that we're proud of

  1. Designed a full multi-step, multi-agent pipeline using agentic AI, the Cloudflare Developer Platform, and sandboxed execution environments to perform much-needed analysis of research quality across fields
  2. Created a database and dashboard for collecting and compiling a growing list of audited research
  3. Compiled an extensive rubric for assessing research replication, transparency, clarity, and quality

What we learned

  1. We learned how to design multi-step agentic systems that coordinate a wide range of tasks toward a single goal: analyzing research quality
  2. We learned how to use tools such as Modal, the Cloudflare Developer Platform, and the Claude SDK

What's next for RepliCAT: Multi-Agent Framework for Reproducing Research

Our long-term vision is to make reproducibility and replicability measurable, comparable, and automated (starting from just a paper and its repository). Through the RepliCAT platform, and given both time and compute resources, we hope to create an extensive database of research, both big and small, that allows for detailed analysis of the findings of one paper, patterns in research at a specific conference, or the trends of an entire research field. We would love to see RepliCAT become an important tool for improving ease of research replication and a symbol for promoting ethical research with reproducibility, clarity, reliability, and transparency.
