Bias Audit Dashboard
B2B AI tool for trust and safety teams at UK media companies. Detects bias across 6 categories with SHAP explainability, a live moderation simulator, and a full audit trail — built for Online Safety Act 2023, Ofcom Broadcasting Code, and BBC Editorial Guidelines compliance.
Executive Summary
Built an end-to-end AI bias detection system for UK media trust and safety teams. The hybrid architecture routes content through a tiered pipeline: inputs of 2 words or fewer are auto-approved at zero cost, 4–15 word inputs go to Claude API for semantic classification, and longer inputs run through a TF-IDF + XGBoost classifier with SHAP word-level highlights. The system achieves F1 0.90 across 6 bias categories and includes a live comment moderation simulator that demonstrates the full human-in-the-loop workflow.
F1 Score
Across all 6 bias categories
Routing Architecture
Cost-optimised classification pipeline
Training Examples
Synthetic UK media content
Bias Categories
Demographic, gender, racial, religious, geographic, neutral
Problem Statement
UK media companies — broadcasters, publishers, streaming platforms — produce and moderate thousands of pieces of content daily. Trust and safety teams are responsible for ensuring that content does not systematically disadvantage groups based on age, gender, race, nationality, religion, sexuality, or geography.
Today this work is done manually. A reviewer reads content, applies their own judgement, and makes a call. This process has three critical failure modes: inconsistency (two reviewers assess the same content differently with no shared rubric), scale (manual review cannot handle 50,000 items per week), and blind spots (human reviewers have their own biases — without a structured detection layer, systematic bias in content can go undetected for months).
The result: media companies face regulatory risk under Ofcom and the Online Safety Act 2023, reputational damage, and advertiser pressure — without the tooling to demonstrate they are actively auditing for bias.
"The Online Safety Act 2023 places a duty of care on platforms to protect users from harmful content. Ofcom can fine companies up to £18 million or 10% of global annual turnover for failures. The BBC Editorial Guidelines require content to treat audiences with respect and avoid unjustified harm — covering exactly the demographic, gender, racial, religious, and geographic bias categories this system detects. A trust and safety team that cannot demonstrate structured, auditable bias review is exposed — regardless of how good their intentions are."
Solution Overview
A hybrid AI system that gives trust and safety analysts a structured, explainable, auditable bias detection layer — so they can make faster, more consistent, and more defensible content moderation decisions at scale. The human always decides. The system never acts autonomously.
Tiered Routing Architecture
Three-tier classification pipeline: Tier 1 (≤2 words) auto-approves at zero cost. Tier 2 (4–15 words) routes to Claude API for semantic classification — handles short text that TF-IDF cannot reliably classify. Tier 3 (>15 words) runs the full XGBoost + SHAP pipeline. Cost and accuracy matched to input complexity.
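The tier boundaries above can be sketched as a simple word-count router. This is an illustrative reconstruction, not the project's actual code — note that the published spec leaves 3-word inputs unassigned, so this sketch sends them to Tier 2 as an assumption.

```python
def route(text: str) -> str:
    """Route content to a cost/accuracy tier by word count.

    Boundaries follow the published spec: <=2 words auto-approve,
    4-15 words go to the Claude API, >15 words run XGBoost + SHAP.
    ASSUMPTION: 3-word inputs sit between the published tiers; this
    sketch assigns them to Tier 2.
    """
    n_words = len(text.split())
    if n_words <= 2:
        return "tier1_auto_approve"   # zero cost
    if n_words <= 15:
        return "tier2_claude_api"     # semantic classification for short text
    return "tier3_xgboost_shap"       # full classifier with SHAP explanations

print(route("hello"))                # tier1_auto_approve
print(route("you are a christian"))  # tier2_claude_api
```

The router is the cheapest possible gate: a word count costs nothing, and it keeps the expensive calls (API or model inference) for inputs where they actually add accuracy.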
SHAP Word-Level Explainability
Every verdict includes SHAP highlights showing the specific words that triggered the classification. A trust and safety analyst can see exactly which words drove the score — not just a number. This is the difference between a tool reviewers trust and one they ignore.
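One way to turn per-feature SHAP attributions into the word highlights described above. This is a sketch under stated assumptions: `shap_values` is taken as already computed (e.g. via `shap.TreeExplainer` over the TF-IDF features) and `feature_names` as coming from the vectoriser's vocabulary; the toy attribution values below are hand-made, not real model output.

```python
import numpy as np

def top_word_highlights(shap_values, feature_names, text, k=3):
    """Map per-feature SHAP values back to words present in the text.

    shap_values: 1-D array of SHAP values for one document
    feature_names: TF-IDF vocabulary, aligned with shap_values
    Only words that actually appear in the input are highlighted.
    """
    tokens = set(text.lower().split())
    # Keep features present in the text, ranked by attribution toward "biased"
    candidates = [
        (name, float(v))
        for name, v in zip(feature_names, shap_values)
        if name in tokens and v > 0
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]

# Toy example with hand-made attributions (not real model output)
names = np.array(["ambition", "england", "lack", "north", "people"])
vals = np.array([0.00, 0.05, 0.00, 0.40, 0.01])
print(top_word_highlights(vals, names, "People from the north of England lack ambition"))
```

The toy values also illustrate the geographic failure mode discussed later: a TF-IDF model attributes the verdict to location nouns like "north" rather than the adjectives carrying the bias.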
Comment Moderation Simulator
Live two-panel interface: public feed on the left (comments post instantly), moderation queue on the right (bias analysis appears with Approve/Flag/Remove actions). Demonstrates the full human-in-the-loop workflow in under 30 seconds — the exact demo moment that lands in interviews.
Architecture Diagram
Technical Implementation
Data & Methodology
Data Dictionary
| Feature | Type | Description | Source |
|---|---|---|---|
| content | string | UK media content item — headline, social post, video description, or article excerpt | Claude API generation |
| label | binary (0/1) | 0 = neutral, 1 = biased (any non-neutral category) | Derived from category |
| category | categorical (6) | demographic_bias, gender_stereotyping, racial_bias, religious_bias, geographic_bias, neutral | Claude API labelling |
| confidence_ground_truth | float (0.70–0.99) | How clear-cut the example is. Clear examples: 0.85–0.99. Subtle/ambiguous: 0.70–0.84 | Designed by difficulty tier |
| split | categorical (train/test) | Stratified 80/20 split — 400 train, 100 test per category | Assigned at generation time |
Methodology
Generated 3,000 synthetic UK media content items using Claude API (claude-haiku-4-5). Six categories, 500 items each, with a deliberate 60/40 clear/subtle difficulty split. Neutral examples were specifically designed to reference the same groups and topics as biased categories — to force the model to learn the bias signal, not just the topic. When demographic_bias and racial_bias overlapped (both scoring F1 0.65 on first attempt), the fix was not model tuning — it was redesigning the generation prompts to create sharper category boundaries. This is a PM decision, not a data science one.
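The stratified 80/20 split described above can be reproduced in a few lines of pandas. This sketch uses a dummy 3,000-row frame in place of the real synthetic dataset; only the `content` and `category` column names are taken from the data dictionary.

```python
import pandas as pd

# Toy stand-in for the generated dataset: 6 categories x 500 items
categories = ["demographic_bias", "gender_stereotyping", "racial_bias",
              "religious_bias", "geographic_bias", "neutral"]
df = pd.DataFrame({
    "content": [f"item {i}" for i in range(3000)],
    "category": [categories[i % 6] for i in range(3000)],
})

# Stratified 80/20 split: sample 400 of each category's 500 for training
train = df.groupby("category").sample(frac=0.8, random_state=42)
test = df.drop(train.index)

print(train["category"].value_counts().to_dict())  # 400 per category
```

Sampling within each category group guarantees exactly 400 train / 100 test items per category, which is what makes the per-category F1 numbers comparable.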
Validation Approach
- Stratified 80/20 train/test split — 400 train, 100 test per category
- F1 score per category with explicit 0.78 threshold from PRD
- Fairness constraint: no category flagged at more than 2× the rate of any other
- Manual validation: 50 test cases reviewed for explanation accuracy
- SHAP validation: top feature words checked against domain knowledge
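The per-category F1 check against the 0.78 PRD threshold can be expressed directly. A minimal one-vs-rest sketch — the project likely uses scikit-learn's `f1_score`, which this mirrors; the toy labels below are for illustration only.

```python
def f1_per_category(y_true, y_pred, categories):
    """One-vs-rest F1 for each category."""
    scores = {}
    for cat in categories:
        tp = sum(t == cat and p == cat for t, p in zip(y_true, y_pred))
        fp = sum(t != cat and p == cat for t, p in zip(y_true, y_pred))
        fn = sum(t == cat and p != cat for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[cat] = (2 * precision * recall / (precision + recall)
                       if precision + recall else 0.0)
    return scores

def meets_prd(scores, threshold=0.78):
    """True only if every category clears the PRD threshold."""
    return all(f1 >= threshold for f1 in scores.values())

# Toy labels, not real test-set output
y_true = ["neutral", "racial_bias", "neutral", "racial_bias"]
y_pred = ["neutral", "racial_bias", "racial_bias", "racial_bias"]
scores = f1_per_category(y_true, y_pred, ["neutral", "racial_bias"])
print(scores, meets_prd(scores))
```

Making the threshold a hard pass/fail per category (rather than an averaged score) is what turns the PRD number into a design constraint: one weak category fails the whole release.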
Proof of Impact
0.90
Overall F1 score across all 6 bias categories — all above the 0.78 PRD threshold
Results Comparison
| Metric | Before | After | Change |
|---|---|---|---|
| demographic_bias F1 | 0.65 (first attempt) | 0.89 | Prompt redesign |
| racial_bias F1 | 0.65 (first attempt) | 0.87 | Journalism framing |
| geographic_bias F1 | N/A | 0.92 | Within threshold |
| Fairness disparity ratio | Unknown | 1.00× (perfect) | Constraint satisfied |
| Explanation accuracy | Hallucinated context | Content-grounded | Fixed by passing original content |
| Short text accuracy | 99% false positives | Semantic via Claude API | Tiered routing |
Key Insights
The model improvement from F1 0.65 to 0.89 on demographic_bias came from redesigning the training prompts, not tuning the model. This is the core PM insight: data quality is a product decision. The model learns what the data shows — if the data conflates two categories, no amount of hyperparameter tuning will fix it.
The tiered routing architecture was driven by a business and ethics decision, not a technical one. TF-IDF + XGBoost produces high-confidence false positives on short text — 'you are a christian' scoring 99% HIGH RISK religious bias. The cost of a wrong high-confidence verdict (regulatory risk, reviewer trust collapse) is orders of magnitude higher than the cost of a Claude API call (£0.0003).
Geographic bias misclassifies when the bias is carried by adjectives rather than location nouns. 'People from the north of England lack ambition' scores as NEUTRAL because the model sees no geographic trigger words — the bias is in 'lack ambition', not 'north of England'. The Claude explanation layer correctly identifies this as a model failure and tells the reviewer to flag it manually. This is the hybrid architecture working as designed.
The fairness disparity ratio of 1.00× means every category is flagged at exactly the same rate. This is a product outcome, not a coincidence — the dataset was deliberately balanced at 500 items per category, and the PRD specified a 2× maximum disparity as a hard constraint.
Ethics & Responsible AI
This project applies responsible AI principles from the ground up — not as documentation added after the build, but as design constraints that shaped architectural decisions. Drawing on Cathy O'Neil's Weapons of Math Destruction: a model is dangerous when it is opaque, operates at scale, and causes harm to the people it scores. Every decision in this system was made to be the opposite.
Human Agency is Non-Negotiable
The system never makes a content decision. Every result requires an explicit reviewer action: Approve, Flag, or Escalate. There is no auto-remove. The system cannot suppress or publish content on its own. This is an ethical constraint embedded in the architecture, not a UX choice.
Explainability is a Right
A reviewer who cannot understand why content was flagged cannot make a defensible decision. Every result includes SHAP word highlights, a plain English Claude explanation, and a confidence score. The reviewer must be able to read the explanation and either agree or override it with full understanding.
Uncertainty Must Be Communicated Honestly
The model is not certain. Confidence scores are shown prominently on every result. Low-confidence verdicts are visually distinct. The tiered routing architecture was implemented specifically because showing a 99% HIGH RISK verdict on neutral content violates this principle — false certainty is more dangerous than acknowledged uncertainty.
Synthetic Data is a Deliberate Privacy Choice
No real user content is stored or processed. The model learns patterns, not individuals. This is documented as a deliberate product decision in MODEL_DECISIONS.md — not a shortcut.
Guardrails & Safeguards
| Rule | Threshold | Rationale |
|---|---|---|
| Human reviewer always required | No auto-approve, no auto-remove | Removal without human review is censorship without accountability |
| Confidence shown on every verdict | Always visible in UI | Reviewers must know how certain the model is before acting on it |
| Short text routes to Claude, not XGBoost | 4–15 words via Tier 2 | False high-confidence verdicts destroy reviewer trust and create compliance risk |
| No real user content stored in v1 | Synthetic data only | Privacy by design — no DPIA required for v1 |
Bias Audit & Fairness Assessment
Fairness metric computed using the dataset directly: disparity ratio (max flag rate / min flag rate across categories) = 1.00× — no category is flagged at a disproportionate rate. Computed in audit.py using pandas. Fairlearn integration (demographic parity, equal opportunity, predictive parity, individual fairness) is planned for v2. Known limitation: geographic_bias is misclassified when bias is carried by adjectives rather than location nouns. Documented in MODEL_DECISIONS.md.
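The disparity ratio described above can be computed in a few lines of pandas. This is an illustrative sketch, not the actual `audit.py`; the `category` and `flagged` column names are assumptions.

```python
import pandas as pd

def disparity_ratio(results: pd.DataFrame) -> float:
    """Max flag rate / min flag rate across categories.

    `results` is assumed to have a `category` column and a boolean
    `flagged` column. 1.0 means every category is flagged at the same
    rate; the PRD requires the ratio to stay below 2.0.
    """
    rates = results.groupby("category")["flagged"].mean()
    return rates.max() / rates.min()

# Perfectly balanced toy data: each category flagged at the same rate
toy = pd.DataFrame({
    "category": ["racial_bias"] * 4 + ["neutral"] * 4,
    "flagged": [True, True, False, False] * 2,
})
print(disparity_ratio(toy))  # 1.0
```

Keeping the metric this simple has a product benefit: a reviewer or auditor can verify it by hand from the audit log, with no ML tooling required.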
OKRs & Success Metrics
Objective
Demonstrate responsible AI product management capability through a shipped, production-grade bias detection system that is technically sound, ethically grounded, and commercially relevant to UK media compliance requirements
Key Results
Achieve F1 > 0.78 on all 6 bias categories
100% (Target: F1 > 0.78 per category)
Implement human-in-the-loop architecture with no auto-remove capability
100% (Target: Human always decides)
Success Metrics
| Metric | Target | Achieved | Status |
|---|---|---|---|
| Overall F1 Score | > 0.78 all categories | 0.90 — all 6 above threshold | Achieved |
| Fairness disparity ratio | < 2× | 1.00× — constraint satisfied | Achieved |
| Live demo | Public URL | bias-audit-dashboard.vercel.app | Achieved |
| PM artefacts | 5 documents | 5 shipped before build started | Achieved |
Roadmap & Future Vision
Now
Completed
- TF-IDF + XGBoost classifier — F1 0.90 across 6 categories
- Tiered routing: Tier 1 (free), Tier 2 (Claude API), Tier 3 (XGBoost + SHAP)
- Live Content Analyser + Comment Moderation Simulator
- Audit Dashboard with fairness metrics
- Deployed: bias-audit-dashboard.vercel.app
Next
In Progress
- Supabase audit log with full session persistence
- Magic link auth for B2B trial access
- PDF export of audit log for Ofcom reporting
- Retrain geographic_bias to catch adjective-based bias patterns
Later
Planned
- Fine-tune on real UK media content (with consent)
- Bulk upload / batch processing for large content queues
- Real-time stream monitoring for live broadcast
- Sentence transformer upgrade for better semantic understanding
Learnings & Reflections
What Went Well
- Writing PM artefacts before any code forced clarity on what the product needed to do — the PRD's 0.78 F1 threshold became a real design constraint that shaped dataset generation, not a metric added at the end
- The hybrid architecture decision (ML for detection, Claude for explanation) was made on day one and proved correct — the two layers have different strengths and the product is better for keeping them separate
- The tiered routing architecture came from a genuine product problem (short text false positives) and was justified on business and ethics grounds before any code was written — that's the right order
Challenges Faced
- The first attempt at demographic_bias and racial_bias both scored F1 0.65 — not because the model was wrong, but because the training data created overlapping categories. The fix was redesigning the generation prompts, not tuning the model. This took a full iteration cycle to diagnose correctly.
- The explain endpoint initially hallucinated context because it only received the score and category — not the original content. The fix (passing content to the explain endpoint) was obvious in retrospect but took seeing a bad explanation in production to identify.
- Geographic bias remains a known failure mode: 'People from the north of England lack ambition' scores as NEUTRAL because the bias is in the adjectives, not the location noun. TF-IDF sees tokens, not semantic relationships. Documented honestly in MODEL_DECISIONS.md.
What I'd Do Differently
- Test the explanation endpoint with real edge cases before shipping — the hallucination bug would have been caught in 10 minutes of manual testing
- Design bias category boundaries explicitly before generating training data — the demographic/racial overlap cost a full iteration cycle that upfront design would have prevented
- Build the simulator earlier — it became the most compelling demo feature but was added last
"The model improved from F1 0.65 to 0.89 on the hardest categories by changing the training data, not the algorithm. Data quality is a product decision. The model learns exactly what the data shows — and if the data conflates two categories, no amount of hyperparameter tuning will teach the model to tell them apart."
PM Artefacts
Written before any code. Every project ships with a full PM artefact set.
Let's Connect
I am actively seeking Junior AI PM / Technical PM roles at companies building AI-powered products in media, trust and safety, e-commerce, or consumer applications. My background in sociology and anthropology combined with an MSc in Managing AI in Business gives me a perspective on responsible AI that most technical candidates don't have. Let's connect.
Quick Links
© 2025 Ogbebor Osaheni. Built with Next.js, React, and Tailwind CSS.
