Environmentcuda_kernel_fused_softmax
Training
Reward0.94
Loss0.031
Rollouts12.4k
Epoch34/50

The Gym Where AI Models Get Stronger

RL training and evaluation infrastructure for coding agents. Built-ins, imported benchmarks, SWE-bench Pro, Terminal-Bench 2.0, mixed-environment reward routing, and thin DAPO/TRL/verl integrations in one stack.

pip install deepgymv0.3.0 live
deepgym hero background
Train stronger models.
RL
Agent Connected
Train stronger models.
Loading...
deepgym hero
The Problem

Most teams build this from scratch

Code model teams often spend months building sandbox infrastructure, test harnesses, and evaluation pipelines. Then they maintain it indefinitely. We handle the infra so you can focus on training.

deepgym dashboard
Episodes/day
Latency
low
Parallel
scalable
Isolation
full

Sandbox Setup

Weeks of Docker config, security hardening, and resource management before you can run a single episode.

Test Harness Maintenance

Test suites drift, edge cases multiply, and reward signals degrade. Keeping evaluation reliable is a full-time job.

Scaling Bottleneck

Running millions of parallel episodes needs orchestration, monitoring, and failover. Most teams cap at hundreds.

Scaling Bottleneck

Running millions of parallel episodes needs orchestration, monitoring, and failover. Most teams cap at hundreds.

Product

Three ways to use DeepGym

Training loops, evaluation benchmarks, and community sharing.

Train

Run RL training loops with verifiable rewards. Thin adapters for DAPO, TRL, verl, and OpenRLHF, plus per-test-case and shaped reward breakdowns.

from deepgym.integrations.dapo import make_dapo_reward_fn
reward_fn = make_dapo_reward_fn(env)
scores = reward_fn(completions=batch)

Evaluate

Benchmark against built-ins plus imported HumanEval, MBPP, BigCodeBench, EvalPlus, SWE-bench Pro, and Terminal-Bench 2.0 tasks through one API.

$ deepgym run swebench_pro
✓ patch applies cleanly
✓ fail_to_pass tests fixed
→ pass_fraction: 0.83
same reward API for terminal + coding tasks

Share

Push environments and results to HuggingFace Hub. Load community environments. Publish leaderboard datasets.

push_environment_to_hub(env, "org/coin-change")
✓ pushed to HuggingFace Hub
push_results_to_hub(results)
✓ leaderboard updated
Integrations

Plugs into your existing stack

First-class integrations with the frameworks you already use. One-line setup for training, evaluation, and sharing.

TRL / GRPOTrainer

Drop-in reward function for HuggingFace TRL. One line to add verifiable code execution rewards to your GRPO training loop.

reward_fn = make_trl_reward_fn(env)

DAPO

Thin DAPO reward and config helpers for verl-style recipes without reimplementing the trainer layer.

reward_fn = make_dapo_reward_fn(env)

verl (ByteDance)

Compatible compute_score function that plugs into verl training pipelines.

score_fn = make_verl_compute_score()

OpenRLHF

FastAPI reward server endpoint. Deploy as a sidecar and point OpenRLHF at it.

router = create_openrlhf_router()

HuggingFace Hub

Push and pull environments as HF datasets. Share evaluation results as leaderboard datasets.

push_environment_to_hub(env, repo_id)

lm-eval Harness

Register DeepGym environments as lm-eval tasks. Run them from the lm-eval CLI alongside other benchmarks.

lm_eval --tasks deepgym_coin_change

Gymnasium API

Standard Gymnasium-compatible interface. reset(), step(), render(). Works with any RL framework that speaks Gym.

obs, reward, done, info = env.step(action)
Capabilities

Everything you need to train and eval code models

From sandbox execution to adversarial testing, multi-turn agents to community sharing.

Sandboxed Execution

Every environment runs real code in Daytona containers. Full OS-level isolation with network restrictions and resource limits. Auto-fallback to local mode for development.

mode: daytona | local | auto
isolation: full
network: restricted
escape: blocked

Adversarial Testing

Built-in reward hack detection with 5+ attack strategies. Probes for empty solutions, hardcoded results, and pattern exploits. RL-based exploit discovery finds novel attacks.

$ deepgym adversarial coin_change
✓ empty solution: blocked
✓ hardcoded output: blocked
✓ pattern exploit: blocked
△ 1 edge case found

Multi-Turn Agents

Step-by-step agent interaction with intermediate rewards. Record full trajectories. Safe mode restricts execution to Python only.

runner = MultiTurnRunner()
trajectory, result = runner.run(env, agent)
→ 4 steps, score: 0.92

Computer-Use & Tool-Use

Beyond code: browser interaction, screenshot verification, file system tasks, API requests, and data pipelines. Full GUI agent support.

Environment types
codingcomputer-usetool-use
screenshotclicktypescrollbash

Per-Test-Case Rewards

Fine-grained reward signals with per-case scoring, input summaries, and error traces. Shape rewards beyond binary pass/fail.

cases:
test_0: 1.0 coins=[1,2,5]
test_1: 1.0 coins=[2]
test_2: 0.0 coins=[]
score: 0.67

Rich Environment Library

24 built-in environments across core families, plus imported HumanEval and MBPP tasks, repo-level SWE-bench Pro patches, and Terminal-Bench 2.0 shell workflows.

Built-in envs24
Importable2,350+
Repo / terminalnative
HumanEvalMBPPSWE-bench ProTerminal-Bench 2.0

CLI & Web UI

Full CLI for running, evaluating, and creating environments. Browser-based debugging UI for interactive testing with real-time feedback.

$ deepgym run coin_change
$ deepgym eval medium
$ deepgym web --port 8080
$ deepgym serve --host 0.0.0.0

FastAPI Server

REST API with OpenAPI docs. Run single episodes, batch scoring, and full evaluation suites. API key authentication for production.

POST /v1/run
POST /v1/run-batch
POST /v1/eval
GET /v1/environments

Async & Batch

AsyncDeepGym with semaphore-based concurrency, strict per-sample routing, and mixed benchmark batches for smarter training runs across task types.

mixed = MixedEnvironment([...])
batch = dg.run_batch(
mixed, completions, environment_name=[...]
)
Train stronger models.
Train stronger models.

Start Training with Verifiable Rewards

Sandboxed code execution, benchmark-backed repo and terminal tasks, and reward signals that plug into DAPO, TRL, verl, and OpenRLHF.

pip install deepgymv0.3.0