RL Environment Reference: Investment Banking

Enterprise Data Glossary Platform

An organization-wide system that indexes every data attribute across code, databases, manual entries, Confluence, and third-party systems. Built for investment banks where senior regulatory, technology, and compliance leadership need unified visibility into data ownership, lineage, and PII — and where that data spans wealth management, trading, fund services, prime brokerage, securities lending, and customer onboarding.

The Problem

Data fragmentation at scale.

In a large investment bank, data attributes live in hundreds of places, owned by dozens of teams, with no single source of truth. Regulators demand answers the bank cannot provide without weeks of manual archaeology.

01 Data Scattered Everywhere

Attributes defined in code, databases, Confluence pages, Excel sheets, emails, manual processes. No index.

02 Unclear Ownership

Who owns "client risk score"? Nobody knows. Original owner left three years ago. Data still flowing.

03 Duplicate Definitions

"Trade date" defined seven ways across trading, settlement, accounting, and reporting systems.

04 PII Blindness

Regulators ask "where is customer PII processed?" The answer takes 3 months and is incomplete.

05 Broken Lineage

Can't trace how a regulatory report field was calculated. Source systems, transformations, and rules are opaque.

06 Stale Documentation

Confluence says one thing, code does another. Documentation drifted months ago; nobody updates it.

07 No Discovery

A new engineer asks "does this attribute already exist?" and has no way to find out. Duplicate work proliferates.

08 External Data Sharing

Publishing reference data to clients, counterparties, or regulators requires manual extraction each time.

Users

Three distinct personas, one platform.

Senior Executive Leadership

Regulatory · Technology · Compliance

Needs a clear view of data ownership, lineage, and risk across the organization. Makes decisions on audits, remediation priorities, and regulatory responses.

Needs

  • Executive dashboards on data health
  • PII heatmaps by business unit
  • Regulatory readiness reports
  • Ownership gap analysis
  • Drill-down from summary to detail

Technical Team

Platform Engineers · Developers · Data Engineers

Builds systems that produce or consume data. Needs to publish attributes to the glossary as part of the development lifecycle — not as a separate manual task.

Needs

  • Jenkins / GitHub Actions integration
  • Annotation-based publishing (in-code metadata)
  • Automated DB schema scanning
  • CI/CD-integrated validation
  • REST / GraphQL APIs

Data Owners

Business SMEs · Product Owners · Ops Leads

Own the business meaning of data for their function (e.g., Portfolio Accounting SOPs, fund NAV definitions). Update manually-sourced data, approve changes, transition ownership.

Needs

  • Simple form-based UI for updates
  • Bulk upload from Excel
  • Ownership transfer workflow
  • Approval queues for changes
  • Business glossary maintenance

Architecture

System landscape at a glance.

The platform is a layered architecture: ingestion channels feed a core metadata engine; processing modules enrich and validate; consumption interfaces surface insights to different users.

Data Glossary Platform — System Architecture
Ingestion: Code Scanner · DB Scanner · Confluence Crawler · Jenkins Plugin · API SDK · Manual UI · Bulk Excel Upload
↓
Core Engine: Metadata Store (Graph DB) · Business Glossary · Lineage Graph · Ownership Registry
↓
Processing: PII Classifier (ML) · Duplicate Detector · Lineage Reconciler · IB Domain Ontology · Quality Scorer
↓
Workflow: Approval Engine · Ownership Transfer · Change Request Queue · Audit Trail
↓
Consumption: Executive Dashboard · Search & Discovery · Lineage Viewer · PII Heatmap · Publishing API
↓
External: Client Data Feeds · Regulatory Submissions · Counterparty Exchange · Partner Publishing

Platform Modules

8 modules that solve the full problem.


01 Automated Data Cataloging (Code & DB Scanners)
Technical Team

Data attributes are auto-discovered from source code, database schemas, and running systems. Engineers don't separately "catalog" their data — the platform catalogs itself as code is written and deployed.

In-Code Annotations (Java / Python / TypeScript)

Engineers annotate fields directly in their code. The scanner extracts these during the build process.

// Java example: Portfolio Accounting service
import java.time.LocalDate;

// Annotated field (enclosing class body omitted for brevity)
@DataAttribute(
    name = "trade_settlement_date",
    description = "T+2 settlement date for equity trades per DTCC rules",
    owner = "portfolio.accounting@bank.com",
    domain = "TRADE_SETTLEMENT",
    pii = false,
    lineage = { "trade_capture.execution_date", "calendar.settlement_rules" }
)
private LocalDate settlementDate;
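
The heading lists Python alongside Java, but only a Java example is given. A rough Python counterpart could carry the same metadata on dataclass fields and let a scanner read it back at build time. This is a sketch under that assumption: `data_attribute`, `SettlementRecord`, and `scan` are hypothetical names for illustration, not the Glossary SDK's actual API.

```python
from dataclasses import dataclass, field, fields
from datetime import date


def data_attribute(**meta):
    """Hypothetical helper: attach glossary metadata to a dataclass field."""
    return field(default=None, metadata={"glossary": meta})


@dataclass
class SettlementRecord:
    settlement_date: date = data_attribute(
        name="trade_settlement_date",
        owner="portfolio.accounting@bank.com",
        domain="TRADE_SETTLEMENT",
        pii=False,
        lineage=["trade_capture.execution_date", "calendar.settlement_rules"],
    )
    internal_note: str = ""  # no glossary metadata, so the scanner skips it


def scan(cls):
    """Collect glossary metadata from every annotated field of a dataclass."""
    return [f.metadata["glossary"] for f in fields(cls) if "glossary" in f.metadata]
```

Calling `scan(SettlementRecord)` yields one metadata dictionary per annotated field, which a build step could then send to the glossary.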

Jenkins / CI-CD Pipeline Integration

Build stage — glossary-plugin scans source code for @DataAttribute annotations
Validation stage — checks for duplicate attribute names, missing ownership, PII without classification
Publish stage — attributes pushed to Glossary API; delta from previous build shown in PR
Fail-fast — build fails if new PII attribute is defined without owner; forces compliance upfront
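
The validation and fail-fast stages above can be sketched as a single pass over the scanned attributes. The dictionary shape, the `pii_class` key, and the function names are assumptions made for this example. The text only states that a PII attribute without an owner fails the build, so the sketch treats every other finding as non-blocking, which is also an assumption.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    attribute: str
    message: str
    blocking: bool


def validate(attributes, published_names=()):
    """Check scanned attributes for duplicates, missing owners, unclassified PII."""
    findings = []
    seen = set(published_names)
    for attr in attributes:
        name = attr["name"]
        is_pii = bool(attr.get("pii"))
        if name in seen:
            findings.append(Finding(name, "duplicate attribute name", False))
        seen.add(name)
        if not attr.get("owner"):
            # The one rule the text says fails the build: PII without an owner.
            findings.append(Finding(name, "missing owner", blocking=is_pii))
        if is_pii and not attr.get("pii_class"):
            findings.append(Finding(name, "PII without classification", False))
    return findings


def build_should_fail(findings):
    """Fail fast when any finding is marked blocking."""
    return any(f.blocking for f in findings)
```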

Database Schema Scanner

Scheduled scan — agents connect to target databases (Oracle, Postgres, SQL Server, Snowflake, DB2)
Metadata extraction — tables, columns, data types, nullability, foreign keys, indexes
Sampling — reads sample rows (anonymized) to improve classification accuracy
Change detection — compares against previous scan; flags new columns, dropped tables, type changes
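
Change detection between two scans amounts to a dictionary diff. The sketch below assumes each scan is stored as `{table: {column: data_type}}`; that snapshot shape and the change labels are illustrative, not the platform's actual storage format.

```python
def diff_schemas(previous, current):
    """Compare two scans shaped as {table: {column: data_type}}."""
    changes = []
    for table in sorted(previous.keys() - current.keys()):
        changes.append(("dropped_table", table))
    for table in sorted(current):
        old_cols = previous.get(table, {})  # empty for a brand-new table
        new_cols = current[table]
        for col in sorted(new_cols.keys() - old_cols.keys()):
            changes.append(("new_column", f"{table}.{col}"))
        for col in sorted(new_cols.keys() & old_cols.keys()):
            if old_cols[col] != new_cols[col]:
                detail = f"{table}.{col}: {old_cols[col]} -> {new_cols[col]}"
                changes.append(("type_change", detail))
    return changes
```

Columns of a newly created table are reported as new columns, since the text does not list new tables as a separate change type.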

Confluence / Wiki Crawler

Crawls configured Confluence spaces on a schedule
LLM extraction of data attribute definitions from narrative text
Links documentation to corresponding code/DB attributes
Flags drift between documented definition and actual implementation

Systems & Integrations

Glossary SDK (Java, Python, TS, C#) · Jenkins Plugin · GitHub Actions · GitLab CI · Azure DevOps · Bitbucket Pipelines · Confluence API · JDBC / ODBC Connectors

02 Manual Data Entry Interface
Data Owners

Not all data is in code. Fund NAV definitions, trading rules, counterparty reference data, and operational SOPs often live in Excel or team heads. Data owners need a simple, non-technical UI to maintain this data.

Entry Surfaces

Web form UI — guided forms with required fields, auto-suggestions from IB domain ontology, validation
Bulk Excel upload — downloadable template; uploaded file validated; errors shown inline; partial loads supported
Excel add-in — edit directly in Excel; changes sync to the platform on save
Mobile-friendly UI — senior data owners can approve changes on the go

Guided Data Capture

Domain-aware prompts — if owner selects "Fund Services" domain, system offers templates for NAV, AUM, subscriptions
Relationship suggestions — "This looks similar to 'fund.nav_daily' in Fund Accounting. Is it the same?"
Inline PII detection — real-time warning if entered value patterns match PII (SSN, email, phone)
Auto-save drafts — work-in-progress never lost; multi-session editing
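
Inline PII detection on entered values could start from simple pattern matching. The regular expressions below are deliberately simplistic, cover US-style SSN and phone formats only, and are assumptions for illustration. A production classifier would need broader formats and would have to handle false positives.

```python
import re

# Illustrative patterns only; real-world formats vary far more than this.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "phone": re.compile(
        r"(?<!\d)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}(?!\d)"
    ),
}


def detect_pii(text):
    """Return the sorted names of every PII pattern found in the text."""
    return sorted(kind for kind, pattern in PII_PATTERNS.items() if pattern.search(text))
```

A form field handler could call `detect_pii` on each keystroke or on blur and show a warning when the result is non-empty.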

Required Metadata

Attribute name · Business definition · Data type · Allowed values · Source system(s) · Owner · Domain · PII classification · Regulatory tags · Update cadence

03 Ownership & Change Management Workflow
Admins & Owners

People leave, teams reorganize, business responsibilities shift. Without an ownership transition workflow, data becomes orphaned and governance falls apart.

Ownership Transfer Workflow

Initiation — outgoing owner, admin, or manager initiates transfer of N attributes
Proposed successor — new owner proposed; system validates they have relevant domain access
Acceptance — proposed owner must explicitly accept; receives email + in-app notification
Handover brief — system auto-generates summary: definitions, downstream consumers, recent changes, open issues
Manager approval — both outgoing and incoming managers sign off
Effective date — transfer takes effect; audit log preserves history; consumers notified
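
The six steps above read naturally as a linear state machine with one gate at manager approval. The sketch below is one way to model that. The class, step identifiers, and role names are hypothetical, and successor validation and notifications are left out.

```python
from dataclasses import dataclass, field

STEPS = [
    "initiation",
    "proposed_successor",
    "acceptance",
    "handover_brief",
    "manager_approval",
    "effective",
]
MANAGER_ROLES = {"outgoing_manager", "incoming_manager"}


@dataclass
class Transfer:
    attributes: list
    outgoing: str
    incoming: str
    step: str = "initiation"
    signoffs: set = field(default_factory=set)
    audit_log: list = field(default_factory=list)

    def sign_off(self, role):
        if role not in MANAGER_ROLES:
            raise ValueError(f"unknown manager role: {role}")
        self.signoffs.add(role)

    def advance(self, actor):
        """Complete the current step and move to the next one."""
        if self.step == "effective":
            raise ValueError("transfer is already effective")
        if self.step == "manager_approval" and self.signoffs != MANAGER_ROLES:
            raise PermissionError("both managers must sign off")
        self.step = STEPS[STEPS.index(self.step) + 1]
        self.audit_log.append((actor, self.step))  # history is preserved
        return self.step
```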

Orphaned Data Workflow

System detects orphaned attributes (owner left, no successor, no activity for N days)
Orphan report generated weekly; escalated to domain head
Domain head assigns interim owner; 30-day SLA for permanent reassignment
If SLA breached: escalation to divisional CTO / CDO
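
The orphan rules can be expressed as two small predicates. The sketch assumes the three detection conditions (owner left, no successor, no activity for N days) must all hold at once, which the text does not state explicitly. The field names are also illustrative.

```python
from datetime import date


def is_orphaned(attr, today, inactivity_days):
    """Assumed reading: owner left AND no successor AND inactive for N days."""
    inactive = (today - attr["last_activity"]).days >= inactivity_days
    return bool(attr["owner_left"] and attr.get("successor") is None and inactive)


def escalation_target(interim_assigned_on, today, sla_days=30):
    """Escalate past the domain head once the reassignment SLA is breached."""
    overdue = (today - interim_assigned_on).days > sla_days
    return "divisional CTO / CDO" if overdue else "domain head"
```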

Change Request Workflow

Any user can propose a change to an attribute (definition, classification, lineage)
Owner receives notification; approves, rejects, or requests clarification
For high-impact changes (PII classification, regulatory attribute), compliance review required
Approved changes create new version; downstream consumers notified; old version preserved for audit
04 Data Hygiene Engine
Platform Automation

The most important module — without automated hygiene, the glossary rots within a year. This is what makes the platform maintain its value over time.

Duplicate Detection

Exact match — same attribute name in same domain; flag immediately
Semantic similarity — ML model compares definitions; "trade_date" vs "execution_date" vs "transaction_date"
Value-pattern match — sampled values follow identical patterns (e.g., two attributes both holding ISINs)
Resolution workflow — owners of potential duplicates collaborate to merge, alias, or declare distinct
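
A minimal sketch of the first two checks follows. It uses plain string similarity from Python's `difflib` as a stand-in for the ML model the text describes, so it captures wording overlap rather than true semantic similarity. The 0.8 threshold and the attribute dictionary shape are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_duplicates(attributes, threshold=0.8):
    """Flag attribute pairs that match exactly or have near-identical definitions."""
    flagged = []
    for a, b in combinations(attributes, 2):
        if a["name"] == b["name"] and a["domain"] == b["domain"]:
            flagged.append((a["name"], b["name"], "exact"))
            continue
        # Stand-in for the ML similarity model: character-level overlap.
        score = SequenceMatcher(
            None, a["definition"].lower(), b["definition"].lower()
        ).ratio()
        if score >= threshold:
            flagged.append((a["name"], b["name"], "similar"))
    return flagged
```

Flagged pairs would then enter the resolution workflow, where owners decide whether to merge, alias, or declare the attributes distinct.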

Broken Lineage Detection

Attribute declared lineage to source that no longer exists (deleted table, renamed column)
Chain interruptions: A → B → C, but B has no evidence of producing C
Orphaned downstream consumers: something depends on attribute that was deprecated
Weekly broken-lineage report; ticketed to owners with 14-day SLA
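
Two of these checks, a declared source that no longer exists and a consumer that depends on a deprecated attribute, need only the metadata graph. The sketch below assumes a hypothetical in-memory shape of `{name: {"sources": [...], "deprecated": bool}}`. Chain interruptions require runtime evidence and are not covered.

```python
def broken_lineage(attributes):
    """Report lineage links that point at missing or deprecated attributes."""
    issues = []
    for name, attr in sorted(attributes.items()):
        for source in attr.get("sources", []):
            if source not in attributes:
                issues.append((name, f"missing source {source}"))
            elif attributes[source].get("deprecated"):
                issues.append((name, f"depends on deprecated {source}"))
    return issues
```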

Unclear Ownership Detection

Owner email address bounces
Owner hasn't logged into platform for 90 days
Group ownership with no named escalation point
Ownership claimed by role that no longer exists in HR system

Data Quality Scoring

Every attribute gets a quality score (0-100) based on:

Completeness of metadata · Ownership clarity · Lineage completeness · Documentation quality · PII classification · Downstream usage · Freshness of updates · Dispute history
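
The text names the eight factors but not how they combine. One plausible reading is a weighted average, sketched below with equal weights and each factor rated between 0 and 1. The factor keys, the weights, and the rounding are all assumptions.

```python
FACTORS = [
    "metadata_completeness",
    "ownership_clarity",
    "lineage_completeness",
    "documentation_quality",
    "pii_classification",
    "downstream_usage",
    "update_freshness",
    "dispute_history",
]


def quality_score(factor_scores, weights=None):
    """Weighted average of per-factor ratings in [0, 1], scaled to 0-100."""
    weights = weights or {factor: 1.0 for factor in FACTORS}
    for factor, value in factor_scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{factor} must be between 0 and 1")
    total_weight = sum(weights[factor] for factor in FACTORS)
    weighted = sum(weights[f] * factor_scores.get(f, 0.0) for f in FACTORS)
    return round(100 * weighted / total_weight)
```

Factors missing from `factor_scores` count as zero, so an attribute with no metadata at all scores 0.
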
05 Lineage Visualization & PII Heatmap
Executive + Compliance

Interactive graph visualization of how data flows through the organization. PII is visually highlighted at every node so compliance can quickly assess exposure.

Example: Trade Settlement Lineage

[Lineage diagram. Nodes shown: order_management.client_order, trade_capture.execution, clearing.matched_trade, settlement.settled_trade; client.account_id (appears twice), clearing.counterparty_ref, settlement.beneficiary_id; market_data.price_feed, trade_capture.notional, risk.position_exposure]

Lineage Features

Upstream view — "Where does this attribute come from?" Trace back to originating system
Downstream view — "Who uses this attribute?" See all consumers including reports, regulatory filings
Column-level lineage — not just table-to-table; specific column transformations
Transformation detail — click an edge to see the SQL or code that produces the derivation
Impact analysis — "If I change this, what breaks?" Cascade view with affected systems
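
Upstream view, downstream view, and impact analysis are all graph traversals over lineage edges. The sketch below uses node names from the trade settlement example, but the edges between them are assumed for illustration, since the source does not spell them out.

```python
from collections import deque

# Hypothetical edges (source, target); only the node names come from the example.
EDGES = [
    ("order_management.client_order", "trade_capture.execution"),
    ("trade_capture.execution", "clearing.matched_trade"),
    ("clearing.matched_trade", "settlement.settled_trade"),
    ("market_data.price_feed", "trade_capture.notional"),
    ("trade_capture.notional", "risk.position_exposure"),
]


def reachable(start, edges, downstream=True):
    """Breadth-first walk: everything fed by `start`, or feeding it if upstream."""
    graph = {}
    for src, dst in edges:
        key, value = (src, dst) if downstream else (dst, src)
        graph.setdefault(key, []).append(value)
    seen, queue = set(), deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Impact analysis for a proposed change is then the downstream set of the attribute being changed.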

PII Heatmap

Organization-wide map of where PII is stored, processed, and transmitted
Classification levels: SSN/TIN, name, email, phone, address, DOB, account numbers, biometric
Filter by jurisdiction: GDPR scope, CCPA scope, India DPDP, Singapore PDPA
Risk scoring by PII volume × system criticality × access controls
Regulatory report export for audits (e.g., Article 30 records under GDPR)
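
The risk formula (PII volume × system criticality × access controls) can be sketched as a simple product. The 1 to 5 rating scale is an assumption, and so is the direction of the access-control factor: here a higher value means weaker controls, so that all three factors raise the score.

```python
def pii_risk(volume, criticality, access_control_weakness):
    """Multiply three ratings, each assumed to be on a 1-5 scale."""
    for rating in (volume, criticality, access_control_weakness):
        if not 1 <= rating <= 5:
            raise ValueError("ratings must be between 1 and 5")
    return volume * criticality * access_control_weakness


def heatmap(systems):
    """Rank systems by risk, highest first. systems: {name: (vol, crit, weak)}."""
    scored = [(name, pii_risk(*ratings)) for name, ratings in systems.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```
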
06 Investment Banking Domain Knowledge
Core Intelligence

The differentiator. The platform ships with built-in ontologies for investment banking domains. When scanning code or databases, it automatically recognizes attributes based on domain patterns — not just generic metadata extraction.

Wealth Management

Client lifecycle, portfolio management, financial planning, advisory

client_id ssn_tin client_name aum risk_tolerance investment_objective portfolio_id asset_allocation holdings benchmark suitability_score ips_version beneficiary_info fee_schedule

Equity Trading

Order management, execution, allocation, reporting

order_id ticker isin cusip sedol side order_type quantity limit_price fill_price venue execution_id commission settlement_date corporate_action

Derivatives

Options, futures, swaps, structured products

contract_id underlying strike expiry option_type notional premium delta gamma vega theta implied_vol cva xva isda_agreement csa_terms

Fund Services

NAV calculation, transfer agency, fund accounting, distribution

fund_id nav_per_unit total_assets aum subscription redemption distribution management_fee performance_fee expense_ratio unitholder_id valuation_date cutoff_time

Prime Brokerage

Margin, financing, securities lending, consolidated reporting

margin_account initial_margin variation_margin excess_equity buying_power financing_rate hypothecation_flag rehypo_limit concentration_risk stress_test_result

Securities Lending

Loans, collateral, recalls, corporate actions on loaned securities

loan_id lender borrower loaned_quantity rebate_rate fee_rate collateral_type collateral_value haircut recall_date indemnification

Customer & Account Onboarding

KYC, AML, CIP, documentation, approvals

customer_id legal_name dob tax_id address citizenship tax_residency pep_status sanctions_screen_result kyc_risk_tier cdd_edd_status source_of_wealth beneficial_owner fatca_status crs_classification

Post-Trade & Settlement

Clearing, settlement, reconciliation, regulatory reporting

trade_id settlement_date trade_date clearing_house settlement_cycle cash_account security_account failed_trade_flag reconciliation_break cat_report_id mifir_report_status emir_uti

How Domain Knowledge Powers Auto-Discovery

Scanner encounters column exec_px in trading database
Ontology matches pattern: equity trading → execution → fill_price alias
Auto-classification: domain = Equity Trading, standard_name = fill_price, PII = false
Suggested owner = team registered for Equity Trading domain; requires confirmation
Linked to existing canonical definition of fill_price in glossary
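
The walkthrough above can be sketched as an alias lookup against the ontology. Only the `exec_px` to `fill_price` mapping comes from the text. The other aliases, the owner address, and the dictionary layout are hypothetical.

```python
ONTOLOGY = {
    "Equity Trading": {
        "fill_price": {"aliases": {"exec_px", "fill_px"}, "pii": False},
    },
    "Customer & Account Onboarding": {
        "tax_id": {"aliases": {"tin", "taxid"}, "pii": True},
    },
}
DOMAIN_OWNERS = {"Equity Trading": "equity-trading-data@bank.example"}


def classify(column):
    """Map a raw column name to its domain, standard name, and PII flag."""
    key = column.lower()
    for domain, attrs in ONTOLOGY.items():
        for standard_name, spec in attrs.items():
            if key == standard_name or key in spec["aliases"]:
                return {
                    "domain": domain,
                    "standard_name": standard_name,
                    "pii": spec["pii"],
                    "suggested_owner": DOMAIN_OWNERS.get(domain),
                    "owner_confirmed": False,  # suggestion still needs confirmation
                }
    return None  # no ontology match; falls back to manual classification
```
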
07 Executive Dashboards & Search
Senior Leadership

What senior executives actually see. The platform synthesizes the technical metadata into business-meaningful views for regulatory, compliance, and technology leadership.

Regulatory Readiness Dashboard

% of regulatory-impact attributes with verified ownership
% with complete lineage to source
Open audit findings by severity
Time-to-respond benchmark for regulatory data requests
Coverage by regulation: GDPR, CCPA, MiFIR, EMIR, CAT, SOX, DPDP

Data Health Scorecard (by Business Unit)

Wealth Management: 87% coverage, 12 orphans, 34 duplicates pending resolution
Equity Trading: 93% coverage, 3 orphans, lineage 91% complete
Fund Services: 76% coverage, flag: significant manual data not yet cataloged
Trend analysis: quarter-over-quarter improvement/regression

Natural Language Search (LLM-powered)

Users search in plain English; LLM interprets against the metadata graph:

# Example queries
"Where do we process EU client PII for derivatives trading?"
# Returns: 14 systems, 47 attributes, ownership matrix, GDPR Art. 30 record
"Who owns NAV calculation for Alternative Funds?"
# Returns: Fund Accounting team, specific owner, last updated 3 days ago
"Show me orphaned attributes in Prime Brokerage"
# Returns: 8 attributes, last owners, suggested new owners
"What regulatory reports use trade_settlement_date?"
# Returns: MiFIR Transaction Reporting, CAT, internal Trade Blotter

08 Publishing & External Data Exchange
Technical + Compliance

Reference data, regulatory reports, and client feeds all need to be published to external parties. The platform manages publishing as a governed, auditable function.

Internal Publishing

Downstream teams subscribe to attribute "channels" (e.g., "wealth.client_reference")
Changes publish as events to Kafka topics; downstream systems consume in real-time
Schema evolution managed; breaking changes require consumer sign-off before release
Versioned APIs (v1, v2) with deprecation schedules

External Publishing (Regulators, Clients, Counterparties)

Regulatory submissions — MiFIR, EMIR, CAT, FR Y-14 — data mapped from glossary to regulatory schemas
Client reference data — daily position feeds, consolidated reports, tax packages; mapped to client-specific formats
Counterparty exchange — trade confirmations, collateral statements via SWIFT, FpML, FIX
Partner integrations — API-based sharing of reference data with fund administrators, custodians

Governance Controls

PII attributes blocked from external publishing unless explicitly approved
Data contracts define exactly what each consumer can access
Complete audit trail: who accessed what, when, via what channel
Right-to-erasure workflow cascades through all external consumers
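
The first two controls, blocking PII by default and enforcing data contracts, reduce to a filter applied before anything leaves the platform. The sketch below assumes a contract is a consumer name plus a set of attribute names and that PII approvals are recorded per consumer and attribute. Both shapes are illustrative.

```python
def publishable(attributes, contract, pii_approvals):
    """Split attributes into those a consumer may receive and those blocked."""
    allowed, blocked = [], []
    for attr in attributes:
        name = attr["name"]
        if name not in contract["attributes"]:
            blocked.append((name, "not in data contract"))
        elif attr["pii"] and (contract["consumer"], name) not in pii_approvals:
            blocked.append((name, "PII not approved for this consumer"))
        else:
            allowed.append(name)
    return allowed, blocked
```

Both lists would be written to the audit trail, so a reviewer can see what was withheld and why.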

Protocols & Standards Supported

REST API · GraphQL · Kafka Streams · SWIFT MT/MX · FpML (Derivatives) · FIX Protocol · ISO 20022 · SFTP Batch · OData · Parquet / Iceberg Tables

RL Environment Value

Why this system makes a premium training environment.

Building this platform is exactly the kind of multi-year, multi-team enterprise engineering effort that no synthetic RL environment can authentically reproduce. It spans every layer of the stack and every role in the organization.

DEPTH

Multi-Layer Stack

Scanners in multiple languages, graph DB, ML classifiers, workflow engines, dashboards, publishing APIs. Every layer has its own complexity.

BREADTH

Cross-Domain Knowledge

Requires deep understanding of wealth, trading, derivatives, fund services, prime brokerage — each with its own vocabulary, regulations, and edge cases.

HUMAN

Workflow Intricacy

Ownership transitions, approval chains, exception handling, orphan remediation. Real organizational friction captured authentically.

AMBIGUITY

Fuzzy Problems

Duplicate detection across domains, semantic similarity, PII boundary cases, lineage reconciliation. Requires judgment, not just rule-following.