Our Methodology

The Full Enterprise RL Environment

Current RL environments test whether a model can edit code and pass a test. Our environments test whether a model can operate as an enterprise software engineer — navigating requirements, making architectural decisions, and delivering within organizational constraints.

01 — The Gap

What exists today vs. what enterprises need.

Existing RL environments do a great job of training models on coding tasks. We take it further — adding the full enterprise context so models learn to operate as software engineers within real organizational constraints.

Standard RL Environments

The foundation
  • Single Dockerfile with a focused codebase
  • Clear instruction file describing the task
  • Model edits code, runs tests, pass/fail
  • Great for training core coding ability
  • Efficient to produce and evaluate at scale
  • Well-suited for general-purpose model improvement
  • Trains models to write code effectively

Full Enterprise RL Environments

Where we add value
  • Multi-service architecture with real infrastructure
  • Complete SDLC artifacts: PRDs, designs, stories, discussions
  • Model navigates ambiguity, makes decisions, implements
  • Business context drives every technical choice
  • Team discussions, review comments, architectural debates
  • Regulatory and compliance verification built into tests
  • Trains models to be enterprise software engineers
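As a rough sketch of what "multi-service architecture with real infrastructure" means in practice, an environment's docker-compose.yml might wire the services together like this (service names are taken from the anatomy example below; images and settings are illustrative placeholders, not a published spec):

```yaml
# Illustrative sketch only — images and settings are placeholders.
services:
  payment-api:            # the main service the model edits
    build: ./services/payment-api
    depends_on: [postgres, kafka, redis]
  fraud-detection:        # downstream consumer the change must not break
    build: ./services/fraud-detection
    depends_on: [kafka]
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: example
  kafka:
    image: bitnami/kafka:latest
  redis:
    image: redis:7
```

The point is that a change to one service is verified against its real neighbors, not in isolation.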

02 — The 8 Layers

Every environment contains the full picture.

A real enterprise project is not just code. It's requirements, debates, decisions, constraints, reviews, and organizational reality. Our environments capture all of it.

01 Business Context

  • PRD
  • Business case
  • Regulatory requirements brief
  • Stakeholder map
  • Success metrics
  • Competitive context

02 Discovery & Scoping

  • Technical scoping doc
  • Architecture Decision Records
  • Team discussion threads
  • Meeting notes
  • Risk assessment
  • Vendor evaluation
  • Dependency map

03 Planning & Design

  • Epics & stories
  • Acceptance criteria
  • Technical design doc
  • API contracts (OpenAPI)
  • Data model / ERD
  • Sequence diagrams
  • Implementation plan
  • Security threat model

04 The Codebase

  • Multi-service repo
  • Infrastructure as Code
  • CI/CD pipeline configs
  • Database migrations
  • Environment configs
  • Existing technical debt
  • Existing tests (some flaky)

05 Review & Quality

  • PR review history
  • Code style guide
  • Security review checklist
  • Architecture board feedback
  • Performance benchmarks
  • Quality gates

06 Testing

  • Test plan
  • Test cases (unit, integration, E2E)
  • UAT scenarios
  • Regression suite
  • Performance test criteria
  • Test data specs
  • Known bugs / tech debt

07 Operations

  • Deployment runbook
  • Rollback plan
  • Monitoring dashboards
  • Alerting rules
  • On-call procedures
  • SLA definitions

08 Organizational Context

  • Team structure
  • Approval workflows
  • Change management process
  • Cross-team dependencies
  • Institutional knowledge
  • Escalation paths

03 — Anatomy

What an environment looks like on disk.

payment-gateway-migration/
  context/
    business/
      prd.md                        # Product requirements
      regulatory-brief.md           # Compliance requirements driving this
      stakeholders.md               # Who owns what, who approves
    discovery/
      scoping-doc.md                # Technical feasibility analysis
      adr-001-event-sourcing.md     # Architecture decision records
      adr-002-kafka-vs-rabbitmq.md
      team-discussion-thread.md     # Sanitized Slack/Teams thread
      risk-assessment.md
    planning/
      epics.md                      # Jira-style epics with acceptance criteria
      stories.md                    # User stories with story points
      technical-design.md           # Component architecture, data flow
      api-contract.yaml             # OpenAPI spec
      data-model.sql                # Schema design
      implementation-plan.md        # Phased rollout, feature flags
    reviews/
      pr-review-history.md          # Past PR reviews showing team standards
      security-checklist.md         # Security team's review template
      style-guide.md                # Code conventions
    operations/
      deployment-runbook.md         # Step-by-step prod deploy
      monitoring.md                 # What to watch, alerting rules
      sla.md                        # 99.95% uptime, 200ms p95
  environment/
    Dockerfile                      # Multi-service sandbox
    docker-compose.yml              # Services + Kafka + Postgres + Redis
    services/
      payment-api/                  # The main service codebase
      fraud-detection/              # Downstream consumer
      settlement/                   # Batch settlement service
      legacy-gateway/               # The system being replaced
    infra/
      terraform/                    # IaC configs
      ci-cd/                        # Pipeline definitions
    db/
      migrations/                   # Existing schema migrations
      seed-data.sql                 # Synthetic test data
  tests/
    test.sh                         # Test harness entry point
    test_functional.py              # Does it work correctly?
    test_regression.py              # Did anything else break?
    test_compliance.py              # Audit trail, PII masking, regulatory format
    test_performance.py             # Latency, throughput, no degradation
    test_security.py                # Auth, encryption, input validation
  task.toml                         # Metadata: difficulty, timeouts, resources
  instruction.md                    # What the model sees
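The task.toml metadata file above might look something like this (a minimal sketch — the field names are assumptions for illustration, not a published schema):

```toml
# Illustrative only — keys and values are assumptions, not a fixed schema.
[task]
id = "payment-gateway-migration"
difficulty = "hard"
entry_point = "stories-to-implementation"   # which SDLC stage the task starts at

[limits]
timeout_seconds = 3600
memory_mb = 8192
cpus = 4

[verification]
tests = ["test_functional.py", "test_regression.py", "test_compliance.py"]
```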

04 — Multi-Skill Verification

Test the model at any stage of the SDLC.

Because the environment contains the full project lifecycle, we can set the task entry point at any stage — and verify the model's output against what actually matters at that stage.

Task Type | What the Model Does | Verification
----------|---------------------|-------------
Requirements → Design | Given PRD + team discussions, produce a technical design document | All requirements addressed, architecture is sound, edge cases covered
Design → Stories | Given design doc, break into epics and stories with acceptance criteria | Completeness, dependency ordering, story sizing, no gaps
Stories → Implementation | Given stories + existing codebase, implement the feature | Tests pass, code review standards met, no regression, compliance checks
Code → Review | Given a PR with planted issues, provide a thorough review | Catches bugs, security issues, style violations; suggests improvements
Incident → Fix | Given an incident report + codebase, find root cause and fix | Fix resolves the issue, no new problems, includes post-mortem
Migration → Delivery | Given legacy system + target spec, plan and execute migration | Functional parity, backward compatibility, no data loss, performance maintained
Full SDLC | Given just the PRD, produce everything through to working code | Multi-stage verification at each phase of delivery
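To make the compliance-verification idea concrete, here is a minimal sketch of the kind of check a test_compliance.py might run. Everything here is illustrative: the helper name, the log format, and the regex are assumptions, not the actual harness.

```python
import re

# Illustrative check: no unmasked 16-digit card number (PAN) may appear in
# application logs. Real environments would also verify audit-trail format,
# PII field masking, and regulator-specific report layouts.
PAN_PATTERN = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")

def line_is_compliant(log_line: str) -> bool:
    """Return True if the line contains no unmasked card number."""
    return PAN_PATTERN.search(log_line) is None

# A masked entry passes; a raw PAN fails the check.
assert line_is_compliant("charge ok card=****-****-****-4242")
assert not line_is_compliant("charge ok card=4111-1111-1111-4242")
```

Checks like this are what make "Regulatory and compliance verification built into tests" machine-gradable rather than a matter of reviewer judgment.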

05 — Our Approach

Built top-down from real projects.

We don't imagine what enterprise environments should look like. We take real projects we've delivered, strip all identifying and sensitive information, and what remains becomes the environment. The artifacts are authentic because they came from reality.

01

Select Project

Pick a completed enterprise engagement — a migration, integration, or platform build that represents a common pattern.

02

Inventory Artifacts

Catalog every artifact: PRDs, design docs, Jira exports, discussion threads, code, reviews, test plans, runbooks.

03

Sanitize

Strip all PII, PHI, client names, and proprietary business data. Replace them with synthetic equivalents that preserve the shape and complexity of the original.

04

Generalize

Abstract client-specific details into industry patterns. "Acme Bank" instead of the real name, synthetic data with the same structure.

05

Package & Verify

Build Docker environment, write verification scripts for each SDLC stage, validate that the environment tests what matters.

06 — Our Differentiation

What sets our environments apart.

Deep Domain Experience

Years of hands-on enterprise implementation give us an intuitive understanding of how these projects actually unfold — the change management processes, the DBA pushback, the compliance requirements. That lived experience shapes every environment we build.

Rooted in Real Work

Our environments are derived from real project patterns, not imagined from scratch. The messy discussion thread where someone says "this won't work because the batch job locks the table" — that texture comes from having been there.

Full SDLC Verification

We go beyond testing whether code passes a unit test. Our verification spans the entire delivery lifecycle — requirements understanding, architectural decisions, code quality, compliance, and operational readiness.

Compounding Value

Each environment teaches models how enterprises actually work, which makes them better at helping enterprises — exactly what AI labs want to offer their customers. Better environments lead to better models, and better models lead to stronger demand.