The Gym Where AI Models Get Stronger
RL training and evaluation infrastructure for coding agents. Built-ins, imported benchmarks, SWE-bench Pro, Terminal-Bench 2.0, mixed-environment reward routing, and thin DAPO/TRL/verl integrations in one stack.


Most teams build this from scratch
Code model teams often spend months building sandbox infrastructure, test harnesses, and evaluation pipelines. Then they maintain it indefinitely. We handle the infra so you can focus on training.
Sandbox Setup
Weeks of Docker config, security hardening, and resource management before you can run a single episode.
Test Harness Maintenance
Test suites drift, edge cases multiply, and reward signals degrade. Keeping evaluation reliable is a full-time job.
Scaling Bottleneck
Running millions of parallel episodes requires orchestration, monitoring, and failover. Most teams cap out at hundreds.
Three ways to use DeepGym
Training loops, evaluation benchmarks, and community sharing.
Train
Run RL training loops with verifiable rewards. Thin adapters for DAPO, TRL, verl, and OpenRLHF, plus per-test-case and shaped reward breakdowns.
Evaluate
Benchmark against built-ins plus imported HumanEval, MBPP, BigCodeBench, EvalPlus, SWE-bench Pro, and Terminal-Bench 2.0 tasks through one API.
Share
Push environments and results to HuggingFace Hub. Load community environments. Publish leaderboard datasets.
Plugs into your existing stack
First-class integrations with the frameworks you already use. One-line setup for training, evaluation, and sharing.
TRL / GRPOTrainer
Drop-in reward function for HuggingFace TRL. One line to add verifiable code execution rewards to your GRPO training loop.
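A minimal sketch of what such a reward callable could look like, in the shape TRL's GRPOTrainer expects (a list of completions in, a list of floats out). The `make_code_reward` helper and the test-case format are illustrative assumptions, not DeepGym's actual API:

```python
# Hedged sketch: a verifiable-code reward callable for GRPO-style training.
# Helper names and the test-case format are illustrative, not DeepGym's API.

def make_code_reward(test_cases):
    """Build a reward function that runs each completion against test cases."""
    def code_reward(completions, **kwargs):
        rewards = []
        for code in completions:
            namespace = {}
            try:
                exec(code, namespace)          # run the model's solution
                passed = sum(
                    1 for inp, expected in test_cases
                    if namespace["solution"](inp) == expected
                )
                rewards.append(passed / len(test_cases))
            except Exception:
                rewards.append(0.0)            # non-running code scores zero
        return rewards
    return code_reward

# Wiring into TRL would then look roughly like:
#   trainer = GRPOTrainer(model=..., reward_funcs=[code_reward], ...)
```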
DAPO
Thin DAPO reward and config helpers for verl-style recipes without reimplementing the trainer layer.
verl (ByteDance)
Compatible compute_score function that plugs into verl training pipelines.
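A sketch of the shape such a hook takes: verl's custom reward function receives the data source, the model's solution string, and the ground truth, and returns a scalar. The exact-match scoring rule here is illustrative:

```python
# Hedged sketch of a verl-compatible compute_score hook.
# The exact-match scoring logic is an illustrative stand-in.

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    """Score a solution by exact-match against the expected output."""
    # Strip whitespace so formatting noise doesn't flip the reward.
    return 1.0 if solution_str.strip() == str(ground_truth).strip() else 0.0
```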
OpenRLHF
FastAPI reward server endpoint. Deploy as a sidecar and point OpenRLHF at it.
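The handler logic behind such an endpoint can be sketched as a plain function: a batch of sequences in, one scalar per sequence out. The payload keys and the scoring rule here are assumptions, not DeepGym's actual schema:

```python
# Hedged sketch of reward-server handler logic for a remote reward model.
# Payload keys ("query", "rewards") and the scoring rule are assumptions.

def score_batch(payload):
    """Score each query: here, reward completions that define a function."""
    rewards = [1.0 if "def " in q else 0.0 for q in payload["query"]]
    return {"rewards": rewards}

# A FastAPI sidecar would wrap this as roughly:
#   app = FastAPI()
#   @app.post("/reward")
#   def reward(payload: dict):
#       return score_batch(payload)
# then point OpenRLHF's remote reward URL at that route.
```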
HuggingFace Hub
Push and pull environments as HF datasets. Share evaluation results as leaderboard datasets.
lm-eval Harness
Register DeepGym environments as lm-eval tasks. Run them from the lm-eval CLI alongside other benchmarks.
Gymnasium API
Standard Gymnasium-compatible interface. reset(), step(), render(). Works with any RL framework that speaks Gym.
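Reduced to a toy, the Gym-style surface looks like this; the class and observation strings are illustrative, not DeepGym's actual environments:

```python
# Hedged sketch: a Gymnasium-style reset()/step() environment where the
# agent submits code as actions. Names and observations are illustrative.

class ToyCodeEnv:
    def __init__(self, max_turns=3):
        self.max_turns = max_turns

    def reset(self):
        self.turn = 0
        observation = "Write a function `add(a, b)` that returns a + b."
        return observation, {}                 # (obs, info), Gym-style

    def step(self, action):
        self.turn += 1
        namespace = {}
        try:
            exec(action, namespace)            # run the submitted code
            reward = 1.0 if namespace["add"](2, 3) == 5 else 0.0
        except Exception:
            reward = 0.0
        terminated = reward == 1.0
        truncated = self.turn >= self.max_turns
        return "tests ran", reward, terminated, truncated, {}

# Standard rollout loop, as with any Gym-compatible framework:
env = ToyCodeEnv()
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(
    "def add(a, b):\n    return a + b"
)
```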
Everything you need to train and eval code models
From sandbox execution to adversarial testing, multi-turn agents to community sharing.
Sandboxed Execution
Every environment runs real code in Daytona containers. Full OS-level isolation with network restrictions and resource limits. Auto-fallback to local mode for development.
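The local-fallback idea can be sketched as running untrusted code in a fresh subprocess with a wall-clock timeout; this is a development convenience only, not OS-level isolation, and the function name is illustrative:

```python
# Hedged sketch of a local-mode fallback: execute a code string in a fresh
# interpreter with a timeout. A dev convenience, not container isolation.

import subprocess
import sys

def run_locally(code, timeout=5.0):
    """Execute a code string in a subprocess; capture stdout and exit status."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return {"stdout": proc.stdout, "returncode": proc.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "returncode": None}   # treat timeouts as failure
```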
Adversarial Testing
Built-in reward hack detection with 5+ attack strategies. Probes for empty solutions, hardcoded results, and pattern exploits. RL-based exploit discovery finds novel attacks.
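Two of the probes described above can be sketched like this: flag solutions with no real code, and flag solutions that embed every expected output as a literal instead of computing it. The function names and heuristics are illustrative:

```python
# Hedged sketch of two reward-hack probes: empty-solution detection and
# hardcoded-result detection. Heuristics and names are illustrative.

def probe_empty(solution: str) -> bool:
    """True if the solution has no real code beyond scaffolding."""
    lines = [l.strip() for l in solution.splitlines()]
    # Ignore blanks, comments, and def/class headers; only bodies count.
    code = [l for l in lines if l and not l.startswith(("#", "def ", "class "))]
    return all(l in ("pass", "...", "return None") for l in code)

def probe_hardcoded(solution: str, expected_outputs) -> bool:
    """True if every expected output appears verbatim as a literal."""
    return all(repr(out) in solution for out in expected_outputs)
```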
Multi-Turn Agents
Step-by-step agent interaction with intermediate rewards. Record full trajectories. Safe mode restricts execution to Python only.
Computer-Use & Tool-Use
Beyond code: browser interaction, screenshot verification, file system tasks, API requests, and data pipelines. Full GUI agent support.
Per-Test-Case Rewards
Fine-grained reward signals with per-case scoring, input summaries, and error traces. Shape rewards beyond binary pass/fail.
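Shaping beyond binary pass/fail can be sketched as a per-case breakdown plus a fractional scalar; the field names here are illustrative, not DeepGym's actual schema:

```python
# Hedged sketch of per-test-case reward shaping: a per-case breakdown
# (input, pass/fail, error trace) plus a fractional reward. Field names
# are illustrative.

def shaped_reward(solution_fn, test_cases):
    """Run each case; report per-case results and a fractional reward."""
    cases = []
    for inp, expected in test_cases:
        try:
            got = solution_fn(inp)
            cases.append({"input": inp, "passed": got == expected, "error": None})
        except Exception as e:
            cases.append({"input": inp, "passed": False, "error": repr(e)})
    passed = sum(c["passed"] for c in cases)
    return {"reward": passed / len(test_cases), "cases": cases}
```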
Rich Environment Library
24 built-in environments across core families, plus imported HumanEval and MBPP tasks, repo-level SWE-bench Pro patches, and Terminal-Bench 2.0 shell workflows.
CLI & Web UI
Full CLI for running, evaluating, and creating environments. Browser-based debugging UI for interactive testing with real-time feedback.
FastAPI Server
REST API with OpenAPI docs. Run single episodes, batch scoring, and full evaluation suites. API key authentication for production.
Async & Batch
AsyncDeepGym with semaphore-based concurrency, strict per-sample routing, and mixed benchmark batches for smarter training runs across task types.
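The semaphore-bounded concurrency pattern named above can be sketched with stdlib asyncio; the episode runner here is a stub standing in for sandboxed execution:

```python
# Hedged sketch of semaphore-bounded batch evaluation. run_episode is a
# stub; in practice it would dispatch a sandboxed episode.

import asyncio

async def run_episode(sample):
    await asyncio.sleep(0)              # stand-in for sandboxed execution
    return {"sample": sample, "reward": 1.0}

async def evaluate_batch(samples, max_concurrency=8):
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(sample):
        async with sem:                 # at most max_concurrency in flight
            return await run_episode(sample)

    return await asyncio.gather(*(bounded(s) for s in samples))

results = asyncio.run(evaluate_batch(list(range(20)), max_concurrency=4))
```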
Start Training with Verifiable Rewards
Sandboxed code execution, benchmark-backed repo and terminal tasks, and reward signals that plug into DAPO, TRL, verl, and OpenRLHF.