Bias Audit Dashboard
B2B AI tool for trust and safety teams at UK media companies. Detects bias across 6 categories with SHAP explainability, a live moderation simulator, and a full audit trail — built for Online Safety Act 2023, Ofcom Broadcasting Code, and BBC Editorial Guidelines compliance.
Executive Summary
Built an end-to-end AI bias detection system for UK media trust and safety teams. The hybrid architecture routes content through a tiered pipeline: inputs of 2 words or fewer are auto-approved at zero cost, 4–15 word inputs go to Claude API for semantic classification, and longer inputs run through a TF-IDF + XGBoost classifier with SHAP word-level highlights. The system achieves F1 0.90 across 6 bias categories and includes a live comment moderation simulator that demonstrates the full human-in-the-loop workflow.
F1 Score
Across all 6 bias categories
Routing Architecture
Cost-optimised classification pipeline
Training Examples
Synthetic UK media content
Bias Categories
Demographic, gender, racial, religious, geographic, neutral
Problem Statement
UK media companies — broadcasters, publishers, streaming platforms — produce and moderate thousands of pieces of content daily. Trust and safety teams are responsible for ensuring that content does not systematically disadvantage groups based on age, gender, race, nationality, religion, sexuality, or geography.
Today this work is done manually. A reviewer reads content, applies their own judgement, and makes a call. This process has three critical failure modes: inconsistency (two reviewers assess the same content differently with no shared rubric), scale (manual review cannot handle 50,000 items per week), and blind spots (human reviewers have their own biases — without a structured detection layer, systematic bias in content can go undetected for months).
The result: media companies face regulatory risk under Ofcom and the Online Safety Act 2023, reputational damage, and advertiser pressure — without the tooling to demonstrate they are actively auditing for bias.
"The Online Safety Act 2023 places a duty of care on platforms to protect users from harmful content. Ofcom can fine companies up to £18 million or 10% of global annual turnover for failures. The BBC Editorial Guidelines require content to treat audiences with respect and avoid unjustified harm — covering exactly the demographic, gender, racial, religious, and geographic bias categories this system detects. A trust and safety team that cannot demonstrate structured, auditable bias review is exposed — regardless of how good their intentions are."
Solution Overview
A hybrid AI system that gives trust and safety analysts a structured, explainable, auditable bias detection layer — so they can make faster, more consistent, and more defensible content moderation decisions at scale. The human always decides. The system never acts autonomously.
Tiered Routing Architecture
Three-tier classification pipeline: Tier 1 (≤2 words) auto-approves at zero cost. Tier 2 (4–15 words) routes to Claude API for semantic classification — handles short text that TF-IDF cannot reliably classify. Tier 3 (>15 words) runs the full XGBoost + SHAP pipeline. Cost and accuracy matched to input complexity.
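The tier boundaries above can be sketched as a simple word-count router. This is an illustrative reconstruction, not the project's actual code — note that the published spec leaves 3-word inputs unassigned, so this sketch sends them to Tier 2 as an assumption.

```python
def route(text: str) -> str:
    """Route content to a cost/accuracy tier by word count.

    Boundaries follow the published spec: <=2 words auto-approve,
    4-15 words go to the Claude API, >15 words run XGBoost + SHAP.
    ASSUMPTION: 3-word inputs sit between the published tiers; this
    sketch assigns them to Tier 2.
    """
    n_words = len(text.split())
    if n_words <= 2:
        return "tier1_auto_approve"   # zero cost
    if n_words <= 15:
        return "tier2_claude_api"     # semantic classification for short text
    return "tier3_xgboost_shap"       # full classifier with SHAP explanations

print(route("hello"))                # tier1_auto_approve
print(route("you are a christian"))  # tier2_claude_api
```

The router is the cheapest possible gate: a word count costs nothing, and it keeps the expensive calls (API or model inference) for inputs where they actually add accuracy.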
SHAP Word-Level Explainability
Every verdict includes SHAP highlights showing the specific words that triggered the classification. A trust and safety analyst can see exactly which words drove the score — not just a number. This is the difference between a tool reviewers trust and one they ignore.
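One way to turn per-feature SHAP attributions into the word highlights described above. This is a sketch under stated assumptions: `shap_values` is taken as already computed (e.g. via `shap.TreeExplainer` over the TF-IDF features) and `feature_names` as coming from the vectoriser's vocabulary; the toy attribution values below are hand-made, not real model output.

```python
import numpy as np

def top_word_highlights(shap_values, feature_names, text, k=3):
    """Map per-feature SHAP values back to words present in the text.

    shap_values: 1-D array of SHAP values for one document
    feature_names: TF-IDF vocabulary, aligned with shap_values
    Only words that actually appear in the input are highlighted.
    """
    tokens = set(text.lower().split())
    # Keep features present in the text, ranked by attribution toward "biased"
    candidates = [
        (name, float(v))
        for name, v in zip(feature_names, shap_values)
        if name in tokens and v > 0
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]

# Toy example with hand-made attributions (not real model output)
names = np.array(["ambition", "england", "lack", "north", "people"])
vals = np.array([0.00, 0.05, 0.00, 0.40, 0.01])
print(top_word_highlights(vals, names, "People from the north of England lack ambition"))
```

The toy values also illustrate the geographic failure mode discussed later: a TF-IDF model attributes the verdict to location nouns like "north" rather than the adjectives carrying the bias.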
Comment Moderation Simulator
Live two-panel interface: public feed on the left (comments post instantly), moderation queue on the right (bias analysis appears with Approve/Flag/Remove actions). Demonstrates the full human-in-the-loop workflow in under 30 seconds — the exact demo moment that lands in interviews.
Architecture Diagram
Technical Implementation
Data & Methodology
Data Dictionary
| Feature | Type | Description | Source |
|---|---|---|---|
| content | string | UK media content item — headline, social post, video description, or article excerpt | Claude API generation |
| label | binary (0/1) | 0 = neutral, 1 = biased (any non-neutral category) | Derived from category |
| category | categorical (6) | demographic_bias, gender_stereotyping, racial_bias, religious_bias, geographic_bias, neutral | Claude API labelling |
| confidence_ground_truth | float (0.70–0.99) | How clear-cut the example is. Clear examples: 0.85–0.99. Subtle/ambiguous: 0.70–0.84 | Designed by difficulty tier |
| split | categorical (train/test) | Stratified 80/20 split — 400 train, 100 test per category | Assigned at generation time |
Methodology
Generated 3,000 synthetic UK media content items using Claude API (claude-haiku-4-5). Six categories, 500 items each, with a deliberate 60/40 clear/subtle difficulty split. Neutral examples were specifically designed to reference the same groups and topics as biased categories — to force the model to learn the bias signal, not just the topic. When demographic_bias and racial_bias overlapped (both scoring F1 0.65 on first attempt), the fix was not model tuning — it was redesigning the generation prompts to create sharper category boundaries. This is a PM decision, not a data science one.
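The stratified 80/20 split described above can be reproduced in a few lines of pandas. This sketch uses a dummy 3,000-row frame in place of the real synthetic dataset; only the `content` and `category` column names are taken from the data dictionary.

```python
import pandas as pd

# Toy stand-in for the generated dataset: 6 categories x 500 items
categories = ["demographic_bias", "gender_stereotyping", "racial_bias",
              "religious_bias", "geographic_bias", "neutral"]
df = pd.DataFrame({
    "content": [f"item {i}" for i in range(3000)],
    "category": [categories[i % 6] for i in range(3000)],
})

# Stratified 80/20 split: sample 400 of each category's 500 for training
train = df.groupby("category").sample(frac=0.8, random_state=42)
test = df.drop(train.index)

print(train["category"].value_counts().to_dict())  # 400 per category
```

Sampling within each category group guarantees exactly 400 train / 100 test items per category, which is what makes the per-category F1 numbers comparable.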
Validation Approach
- Stratified 80/20 train/test split — 400 train, 100 test per category
- F1 score per category with explicit 0.78 threshold from PRD
- Fairness constraint: no category flagged at more than 2× the rate of any other
- Manual validation: 50 test cases reviewed for explanation accuracy
- SHAP validation: top feature words checked against domain knowledge
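The per-category F1 check against the 0.78 PRD threshold can be expressed directly. A minimal one-vs-rest sketch — the project likely uses scikit-learn's `f1_score`, which this mirrors; the toy labels below are for illustration only.

```python
def f1_per_category(y_true, y_pred, categories):
    """One-vs-rest F1 for each category."""
    scores = {}
    for cat in categories:
        tp = sum(t == cat and p == cat for t, p in zip(y_true, y_pred))
        fp = sum(t != cat and p == cat for t, p in zip(y_true, y_pred))
        fn = sum(t == cat and p != cat for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[cat] = (2 * precision * recall / (precision + recall)
                       if precision + recall else 0.0)
    return scores

def meets_prd(scores, threshold=0.78):
    """True only if every category clears the PRD threshold."""
    return all(f1 >= threshold for f1 in scores.values())

# Toy labels, not real test-set output
y_true = ["neutral", "racial_bias", "neutral", "racial_bias"]
y_pred = ["neutral", "racial_bias", "racial_bias", "racial_bias"]
scores = f1_per_category(y_true, y_pred, ["neutral", "racial_bias"])
print(scores, meets_prd(scores))
```

Making the threshold a hard pass/fail per category (rather than an averaged score) is what turns the PRD number into a design constraint: one weak category fails the whole release.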
Proof of Impact
0.90
Overall F1 score across all 6 bias categories — all above the 0.78 PRD threshold
Results Comparison
| Metric | Before | After | Change |
|---|---|---|---|
| demographic_bias F1 | 0.65 (first attempt) | 0.89 | Prompt redesign |
| racial_bias F1 | 0.65 (first attempt) | 0.87 | Journalism framing |
| geographic_bias F1 | N/A | 0.92 | Within threshold |
| Fairness disparity ratio | Unknown | 1.00× (perfect) | Constraint satisfied |
| Explanation accuracy | Hallucinated context | Content-grounded | Fixed by passing original content |
| Short text accuracy | 99% false positives | Semantic via Claude API | Tiered routing |
Key Insights
The model improvement from F1 0.65 to 0.89 on demographic_bias came from redesigning the training prompts, not tuning the model. This is the core PM insight: data quality is a product decision. The model learns what the data shows — if the data conflates two categories, no amount of hyperparameter tuning will fix it.
The tiered routing architecture was driven by a business and ethics decision, not a technical one. TF-IDF + XGBoost produces high-confidence false positives on short text — 'you are a christian' scoring 99% HIGH RISK religious bias. The cost of a wrong high-confidence verdict (regulatory risk, reviewer trust collapse) is orders of magnitude higher than the cost of a Claude API call (£0.0003).
Geographic bias misclassifies when the bias is carried by adjectives rather than location nouns. 'People from the north of England lack ambition' scores as NEUTRAL because the model sees no geographic trigger words — the bias is in 'lack ambition', not 'north of England'. The Claude explanation layer correctly identifies this as a model failure and tells the reviewer to flag it manually. This is the hybrid architecture working as designed.
The fairness disparity ratio of 1.00× means every category is flagged at exactly the same rate. This is a product outcome, not a coincidence — the dataset was deliberately balanced at 500 items per category, and the PRD specified a 2× maximum disparity as a hard constraint.
Ethics & Responsible AI
This project applies responsible AI principles from the ground up — not as documentation added after the build, but as design constraints that shaped architectural decisions. Drawing on Cathy O'Neil's Weapons of Math Destruction: a model is dangerous when it is opaque, operates at scale, and causes harm to the people it scores. Every decision in this system was made to be the opposite.
Human Agency is Non-Negotiable
The system never makes a content decision. Every result requires an explicit reviewer action: Approve, Flag, or Escalate. There is no auto-remove. The system cannot suppress or publish content on its own. This is an ethical constraint embedded in the architecture, not a UX choice.
Explainability is a Right
A reviewer who cannot understand why content was flagged cannot make a defensible decision. Every result includes SHAP word highlights, a plain English Claude explanation, and a confidence score. The reviewer must be able to read the explanation and either agree or override it with full understanding.
Uncertainty Must Be Communicated Honestly
The model is not certain. Confidence scores are shown prominently on every result. Low-confidence verdicts are visually distinct. The tiered routing architecture was implemented specifically because showing a 99% HIGH RISK verdict on neutral content violates this principle — false certainty is more dangerous than acknowledged uncertainty.
Synthetic Data is a Deliberate Privacy Choice
No real user content is stored or processed. The model learns patterns, not individuals. This is documented as a deliberate product decision in MODEL_DECISIONS.md — not a shortcut.
Guardrails & Safeguards
| Rule | Threshold | Rationale |
|---|---|---|
| Human reviewer always required | No auto-approve, no auto-remove | Removal without human review is censorship without accountability |
| Confidence shown on every verdict | Always visible in UI | Reviewers must know how certain the model is before acting on it |
| Short text routes to Claude, not XGBoost | 4–15 words via Tier 2 | False high-confidence verdicts destroy reviewer trust and create compliance risk |
| No real user content stored in v1 | Synthetic data only | Privacy by design — no DPIA required for v1 |
Bias Audit & Fairness Assessment
Fairness metric computed using the dataset directly: disparity ratio (max flag rate / min flag rate across categories) = 1.00× — no category is flagged at a disproportionate rate. Computed in audit.py using pandas. Fairlearn integration (demographic parity, equal opportunity, predictive parity, individual fairness) is planned for v2. Known limitation: geographic_bias is misclassified when bias is carried by adjectives rather than location nouns. Documented in MODEL_DECISIONS.md.
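The disparity ratio described above can be computed in a few lines of pandas. This is an illustrative sketch, not the actual `audit.py`; the `category` and `flagged` column names are assumptions.

```python
import pandas as pd

def disparity_ratio(results: pd.DataFrame) -> float:
    """Max flag rate / min flag rate across categories.

    `results` is assumed to have a `category` column and a boolean
    `flagged` column. 1.0 means every category is flagged at the same
    rate; the PRD requires the ratio to stay below 2.0.
    """
    rates = results.groupby("category")["flagged"].mean()
    return rates.max() / rates.min()

# Perfectly balanced toy data: each category flagged at the same rate
toy = pd.DataFrame({
    "category": ["racial_bias"] * 4 + ["neutral"] * 4,
    "flagged": [True, True, False, False] * 2,
})
print(disparity_ratio(toy))  # 1.0
```

Keeping the metric this simple has a product benefit: a reviewer or auditor can verify it by hand from the audit log, with no ML tooling required.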
OKRs & Success Metrics
Objective
Demonstrate responsible AI product management capability through a shipped, production-grade bias detection system that is technically sound, ethically grounded, and commercially relevant to UK media compliance requirements
Key Results
Achieve F1 > 0.78 on all 6 bias categories
100% (Target: F1 > 0.78 per category)
Implement human-in-the-loop architecture with no auto-remove capability
100% (Target: Human always decides)
Success Metrics
| Metric | Target | Achieved | Status |
|---|---|---|---|
| Overall F1 Score | > 0.78 all categories | 0.90 — all 6 above threshold | Achieved |
| Fairness disparity ratio | < 2× | 1.00× — constraint satisfied | Achieved |
| Live demo | Public URL | bias-audit-dashboard.vercel.app | Achieved |
| PM artefacts | 5 documents | 5 shipped before build started | Achieved |
Roadmap & Future Vision
Now
Completed
- TF-IDF + XGBoost classifier — F1 0.90 across 6 categories
- Tiered routing: Tier 1 (free), Tier 2 (Claude API), Tier 3 (XGBoost + SHAP)
- Live Content Analyser + Comment Moderation Simulator
- Audit Dashboard with fairness metrics
- Deployed: bias-audit-dashboard.vercel.app
Next
In Progress
- Supabase audit log with full session persistence
- Magic link auth for B2B trial access
- PDF export of audit log for Ofcom reporting
- Retrain geographic_bias to catch adjective-based bias patterns
Later
Planned
- Fine-tune on real UK media content (with consent)
- Bulk upload / batch processing for large content queues
- Real-time stream monitoring for live broadcast
- Sentence transformer upgrade for better semantic understanding
Learnings & Reflections
What Went Well
- Writing PM artefacts before any code forced clarity on what the product needed to do — the PRD's 0.78 F1 threshold became a real design constraint that shaped dataset generation, not a metric added at the end
- The hybrid architecture decision (ML for detection, Claude for explanation) was made on day one and proved correct — the two layers have different strengths and the product is better for keeping them separate
- The tiered routing architecture came from a genuine product problem (short text false positives) and was justified on business and ethics grounds before any code was written — that's the right order
Challenges Faced
- The first attempt at demographic_bias and racial_bias both scored F1 0.65 — not because the model was wrong, but because the training data created overlapping categories. The fix was redesigning the generation prompts, not tuning the model. This took a full iteration cycle to diagnose correctly.
- The explain endpoint initially hallucinated context because it only received the score and category — not the original content. The fix (passing content to the explain endpoint) was obvious in retrospect but took seeing a bad explanation in production to identify.
- Geographic bias remains a known failure mode: 'People from the north of England lack ambition' scores as NEUTRAL because the bias is in the adjectives, not the location noun. TF-IDF sees tokens, not semantic relationships. Documented honestly in MODEL_DECISIONS.md.
What I'd Do Differently
- Test the explanation endpoint with real edge cases before shipping — the hallucination bug would have been caught in 10 minutes of manual testing
- Design bias category boundaries explicitly before generating training data — the demographic/racial overlap cost a full iteration cycle that upfront design would have prevented
- Build the simulator earlier — it became the most compelling demo feature but was added last
"The model improved from F1 0.65 to 0.89 on the hardest categories by changing the training data, not the algorithm. Data quality is a product decision. The model learns exactly what the data shows — and if the data conflates two categories, no amount of hyperparameter tuning will teach the model to tell them apart."
PM Artefacts
Written before any code. Every project ships with a full PM artefact set.
Let's Connect
I am actively seeking Junior AI PM / Technical PM roles at companies building AI-powered products in media, trust and safety, e-commerce, or consumer applications. My background in sociology and anthropology combined with an MSc in Managing AI in Business gives me a perspective on responsible AI that most technical candidates don't have. Let's connect.
Quick Links
© 2025 Ogbebor Osaheni. Built with Next.js, React, and Tailwind CSS.
