Build Strategy

Top-Down vs Bottom-Up

Two approaches to building enterprise RL environments. One starts from reality and strips away what's sensitive. The other starts from imagination and tries to add realism. The difference matters.

01 — Two Approaches

Starting from reality vs. starting from scratch.

Top-Down

Recommended — start here

Take a real project you've already delivered. Strip all PII, PHI, client names, and proprietary data. What remains — the PRDs, stories, Slack debates, code, reviews, test plans — becomes the environment. It's real because it came from reality.

Pros

  • Artifacts are authentically messy — real debates, real pushback, real constraints
  • Contains the "unknown unknowns" — edge cases that are hard to invent from scratch
  • Much faster to produce at scale — the content already exists; you're editing, not creating
  • Genuinely out-of-distribution — these patterns don't exist in public training data
  • Higher value to AI labs because the texture is real, not synthetic

Cons

  • Requires access to past project artifacts (Jira, Confluence, Slack, repos)
  • Sanitization is non-trivial — must be thorough to avoid IP/privacy violations
  • Needs legal review to confirm sanitized output is safe to commercialize
  • Some projects may not survive sanitization — if the domain details ARE the interesting part, removing them removes the value
  • Client contracts may restrict even sanitized derivative use

Bottom-Up

Use to fill gaps

Start from scratch. Domain experts and engineers design what an enterprise environment should look like and craft every artifact — original PRDs, synthetic code, authored team discussions — drawing on their collective experience of how enterprises operate.

Pros

  • Can start immediately without needing access to past project data
  • Full creative control over what the environment tests
  • Zero risk of accidentally leaking real client information
  • Can target known model weaknesses by design
  • No legal review needed — everything is original creation

Cons

  • Artifacts can feel too polished — may lack the organic texture of real projects
  • Harder to capture "unknown unknowns" — the unexpected workarounds and institutional knowledge
  • More time per environment — creating from scratch takes more effort than sanitizing existing work
  • Scaling is slower — each environment requires significant creative investment
  • AI-assisted content generation can produce patterns already in training data, which may reduce out-of-distribution value

02 — Top-Down Process

From real project to packaged environment.

The top-down process takes a completed enterprise engagement and systematically transforms it into a sellable RL training environment.

Top-Down Pipeline

5 Steps
01
Select Project
Pick a completed engagement that represents a common enterprise pattern. Ideally one with good artifact retention — Jira history, Confluence pages, Slack exports, git repos, design docs.
02
Inventory Artifacts
Catalog everything: PRDs, technical scoping docs, ADRs, Jira exports, Slack/Teams threads, design documents, API specs, code repositories, PR reviews, test plans, deployment runbooks, incident reports.
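As a rough illustration of the cataloging step, a short script can group on-disk artifacts by type. The extension map below is invented, and a real inventory would also cover Jira, Confluence, and Slack exports that never touch the filesystem:

```python
from collections import defaultdict
from pathlib import Path

# Hypothetical extension-to-category map; adjust per project.
CATEGORIES = {
    ".md": "docs", ".json": "exports", ".py": "code",
    ".yaml": "infra", ".yml": "infra",
}

def inventory(root: str) -> dict[str, list[str]]:
    """Group every file under `root` by artifact category."""
    catalog = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            catalog[CATEGORIES.get(path.suffix, "other")].append(str(path))
    return dict(catalog)

print({category: len(files) for category, files in inventory(".").items()})
```

The output of this pass becomes the checklist for the sanitization step: every cataloged file must be accounted for before packaging.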
03
Sanitize
Remove all PII (names, emails, SSNs), PHI (medical records, patient data), and client identifiers (company names, project codes, internal URLs, proprietary business logic). Replace each with synthetic equivalents that preserve shape and complexity.
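A minimal sketch of the mechanical part of this step, assuming a regex-based first pass. The patterns and the "AcmeCorp" placeholder are invented; real sanitization also needs NER tooling, allow-lists, and manual review, since regexes alone will miss plenty:

```python
import re

# Hypothetical patterns; a production blocklist would be far broader.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CLIENT": re.compile(r"\bAcmeCorp\b"),  # stand-in client identifier
}

def sanitize(text: str) -> str:
    """Replace each match with a typed placeholder so reviewers can
    see what kind of data was removed."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(sanitize("Ping jane.doe@acmecorp.com about AcmeCorp ticket 123-45-6789"))
```

Typed placeholders (rather than blank deletions) make the later independent review faster: a reviewer can see at a glance what class of data sat where.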
04
Generalize
Abstract client-specific details into industry patterns. "Acme Bank" instead of the real name. Generated account numbers in the same format. Synthetic data with identical structure. The patterns are real; the specifics are not.
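The "same format, different specifics" replacement described above can be sketched as a character-class substitution: digits map to random digits, letters to random letters, and punctuation stays put, so the synthetic value keeps the original's shape and complexity. A seeded generator keeps the output reproducible across builds:

```python
import random

def synthesize_like(value: str, seed: int = 0) -> str:
    """Produce a synthetic value with the same shape as `value`:
    digits stay digits, letters stay letters (case preserved),
    punctuation and separators are untouched."""
    rng = random.Random(seed)
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(str(rng.randrange(10)))
        elif ch.isalpha():
            pool = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" if ch.isupper() else "abcdefghijklmnopqrstuvwxyz"
            out.append(rng.choice(pool))
        else:
            out.append(ch)
    return "".join(out)

print(synthesize_like("GB29-NWBK-6016"))  # same shape, different specifics
```

The same seed always yields the same replacement, which matters when one account number appears in the PRD, the Slack export, and a test fixture and must stay consistent across all three.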
05
Package & Verify
Build Docker environment with all artifacts organized by SDLC phase. Write verification scripts for each stage. Independent review to confirm nothing identifiable remains. Validate that the environment actually tests what matters.
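One piece of the verification step can be automated as a blocklist scan over the packaged artifacts. The patterns below are invented examples; an empty result only means this necessarily incomplete automated check passed, not that the independent human review can be skipped:

```python
import re

# Hypothetical blocklist: identifiers that must never appear in any artifact.
BLOCKLIST = [
    re.compile(r"real-client-name", re.IGNORECASE),  # placeholder for actual names
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # SSN-shaped strings
]

def verify_artifacts(artifacts: dict[str, str]) -> list[str]:
    """Return a list of 'path: pattern' findings; an empty list means
    this automated check passed."""
    findings = []
    for path, text in artifacts.items():
        for pattern in BLOCKLIST:
            if pattern.search(text):
                findings.append(f"{path}: matches {pattern.pattern}")
    return findings

sample = {"docs/prd.md": "Acme Bank migration plan", "ops/runbook.md": "contact 123-45-6789"}
print(verify_artifacts(sample))
```

Run as a CI gate on the packaged environment, this turns "nothing identifiable remains" from a hope into a repeatable check that blocks the build on any hit.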

03 — Bottom-Up Process

Building from domain expertise when no project exists.

Use bottom-up for scenarios you know exist in the market but haven't delivered as projects. The quality depends entirely on the depth of domain knowledge applied.

Bottom-Up Pipeline

6 Steps
01
Define Scenario
Domain experts describe the enterprise pattern from experience: "At banks like X, migrating Y typically involves Z." No specific client, just the pattern observed across multiple engagements.
02
Draft Artifacts
Write the PRD, scoping doc, team discussion threads, architecture decisions. Domain experts drive content; engineers handle structure. AI tools help generate volume but experts review for authenticity.
03
Build Codebase
Scaffold the multi-service architecture, database schemas, infrastructure configs. Include realistic technical debt, existing tests (some flaky), and the kind of workarounds that real codebases accumulate.
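To make the "some flaky" point concrete, here is a sketch of the kind of deliberately flaky test one might plant in a synthetic codebase. The function name and the 120 ms deadline are invented; the point is a test whose pass rate depends on a simulated race, not a test that is simply broken:

```python
import random

# A deliberately flaky "existing test": it passes only when simulated
# network latency beats a deadline the original authors hard-coded.
def fetch_with_timeout(rng: random.Random) -> bool:
    latency_ms = rng.uniform(50, 150)  # simulated network latency
    return latency_ms < 120            # hard-coded deadline: passes most runs

# Run it across 100 seeded "CI runs" to show it is flaky, not broken.
passes = sum(fetch_with_timeout(random.Random(seed)) for seed in range(100))
print(f"{passes}/100 simulated runs pass")
```

An agent working in the environment then has to diagnose whether a red build is its own regression or this pre-existing flake, which is exactly the judgment real enterprise work demands.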
04
Add Organizational Context
Create team structure docs, approval workflows, change management process, cross-team dependencies. This is where domain knowledge matters most — generic org context is obvious and low-value.
05
Expert Review
Domain experts review every artifact for authenticity: "A real DBA wouldn't approve this schema," "This Slack thread is too polite — real teams push back harder." Iterate until it passes the smell test.
06
Package & Verify
Same packaging as top-down: Docker environment, verification scripts for each SDLC stage, metadata, documentation.

04 — Recommendation

Start top-down. Fill gaps bottom-up.


Start with top-down for the first 10–20 environments. Real projects give you authentic artifacts fast. The messy reality is the product — the Slack thread where someone says "this won't work because the batch job locks the table" is worth more than a perfectly crafted synthetic scenario.

Then use bottom-up to fill specific gaps — scenarios you know exist in the market but haven't delivered as projects. These bottom-up environments will be informed by patterns you've already seen across the top-down ones, so they'll be more realistic than if you started bottom-up from day one.

The key question for the team: Do we have a completed project — ideally in financial services — where we still have access to the artifacts (Jira, Confluence, Slack, repo, design docs)? That becomes our first environment.

05 — The Product Vision

A machine that turns past work into a new revenue stream.

The end state is a repeatable process — a pipeline that takes a real enterprise project and produces a packaged RL training environment that AI labs will buy.

Input

Completed enterprise project with full artifact history (PRDs, Jira, Slack, code, reviews, tests, runbooks)

Our Pipeline

Inventory, sanitize PII/PHI/IP, generalize, build Docker sandbox, write verification scripts

Output

Packaged RL environment with 8 SDLC layers, multi-stage verification, ready to sell to AI labs