Bias Audit Dashboard

B2B AI tool for trust and safety teams at UK media companies. Detects bias across 6 categories with SHAP explainability, a live moderation simulator, and a full audit trail — built for Online Safety Act 2023, Ofcom Broadcasting Code, and BBC Editorial Guidelines compliance.

Shipped • By Ogbebor Osaheni • March 2026

Executive Summary

Built an end-to-end AI bias detection system for UK media trust and safety teams. The hybrid architecture routes content through a tiered pipeline: inputs of 2 words or fewer are auto-approved at zero cost, 4–15 word inputs go to Claude API for semantic classification, and longer inputs run through a TF-IDF + XGBoost classifier with SHAP word-level highlights. The system achieves F1 0.90 across 6 bias categories and includes a live comment moderation simulator that demonstrates the full human-in-the-loop workflow.

  • F1 score: 0.90 across all 6 bias categories
  • Routing architecture: 3-tier, cost-optimised classification pipeline
  • Training examples: 3,000 synthetic UK media content items
  • Bias categories: 6 (demographic, gender, racial, religious, geographic, neutral)

Problem Statement

UK media companies — broadcasters, publishers, streaming platforms — produce and moderate thousands of pieces of content daily. Trust and safety teams are responsible for ensuring that content does not systematically disadvantage groups based on age, gender, race, nationality, religion, sexuality, or geography.

Today this work is done manually. A reviewer reads content, applies their own judgement, and makes a call. This process has three critical failure modes: inconsistency (two reviewers assess the same content differently with no shared rubric), scale (manual review cannot handle 50,000 items per week), and blind spots (human reviewers have their own biases — without a structured detection layer, systematic bias in content can go undetected for months).

The result: media companies face regulatory risk under Ofcom and the Online Safety Act 2023, reputational damage, and advertiser pressure — without the tooling to demonstrate they are actively auditing for bias.

"The Online Safety Act 2023 places a duty of care on platforms to protect users from harmful content. Ofcom can fine companies up to £18 million or 10% of global annual turnover for failures. The BBC Editorial Guidelines require content to treat audiences with respect and avoid unjustified harm — covering exactly the demographic, gender, racial, religious, and geographic bias categories this system detects. A trust and safety team that cannot demonstrate structured, auditable bias review is exposed — regardless of how good their intentions are."

Solution Overview

A hybrid AI system that gives trust and safety analysts a structured, explainable, auditable bias detection layer — so they can make faster, more consistent, and more defensible content moderation decisions at scale. The human always decides. The system never acts autonomously.

Tiered Routing Architecture

Three-tier classification pipeline: Tier 1 (≤2 words) auto-approves at zero cost. Tier 2 (4–15 words) routes to Claude API for semantic classification — handles short text that TF-IDF cannot reliably classify. Tier 3 (>15 words) runs the full XGBoost + SHAP pipeline. Cost and accuracy matched to input complexity.
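The tiered dispatch above reduces to a word-count check. A minimal sketch in Python (function and tier names are illustrative, not the project's actual code; the three-word case, which the stated tier boundaries leave unspecified, is assumed here to fall through to Tier 2):

```python
def route(text: str) -> str:
    """Pick a classification tier from input length, per the pipeline above."""
    n_words = len(text.split())
    if n_words <= 2:
        return "tier1_auto_approve"   # trivial input: approved at zero cost
    if n_words <= 15:
        return "tier2_claude"         # short text: semantic classification
    return "tier3_xgboost_shap"       # long text: full ML + SHAP pipeline

print(route("ok"))                    # tier1_auto_approve
print(route("you are a christian"))   # tier2_claude
```

Routing before classifying means cost scales with input complexity rather than volume.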

SHAP Word-Level Explainability

Every verdict includes SHAP highlights showing the specific words that triggered the classification. A trust and safety analyst can see exactly which words drove the score — not just a number. This is the difference between a tool reviewers trust and one they ignore.

Comment Moderation Simulator

Live two-panel interface: public feed on the left (comments post instantly), moderation queue on the right (bias analysis appears with Approve/Flag/Remove actions). Demonstrates the full human-in-the-loop workflow in under 30 seconds — the exact demo moment that lands in interviews.

Architecture Diagram

Technical Implementation

Data & Methodology

Data Dictionary

| Feature | Type | Description | Source |
| --- | --- | --- | --- |
| content | string | UK media content item — headline, social post, video description, or article excerpt | Claude API generation |
| label | binary (0/1) | 0 = neutral, 1 = biased (any non-neutral category) | Derived from category |
| category | categorical (6) | demographic_bias, gender_stereotyping, racial_bias, religious_bias, geographic_bias, neutral | Claude API labelling |
| confidence_ground_truth | float (0.70–0.99) | How clear-cut the example is. Clear examples: 0.85–0.99; subtle/ambiguous: 0.70–0.84 | Designed by difficulty tier |
| split | categorical (train/test) | Stratified 80/20 split — 400 train, 100 test per category | Assigned at generation time |

Methodology

Generated 3,000 synthetic UK media content items using Claude API (claude-haiku-4-5). Six categories, 500 items each, with a deliberate 60/40 clear/subtle difficulty split. Neutral examples were specifically designed to reference the same groups and topics as biased categories — to force the model to learn the bias signal, not just the topic. When demographic_bias and racial_bias overlapped (both scoring F1 0.65 on first attempt), the fix was not model tuning — it was redesigning the generation prompts to create sharper category boundaries. This is a PM decision, not a data science one.
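The generation design described above can be sketched as a plan plus a prompt template. This is a hypothetical reconstruction (the real generation prompts and helper names are not shown in this write-up); the category list and the 500-per-category, 60/40 clear/subtle numbers come from the text:

```python
CATEGORIES = [
    "demographic_bias", "gender_stereotyping", "racial_bias",
    "religious_bias", "geographic_bias", "neutral",
]

# 500 items per category, split 60/40 between clear and subtle examples.
plan = [
    {"category": cat, "difficulty": diff, "count": n}
    for cat in CATEGORIES
    for diff, n in (("clear", 300), ("subtle", 200))
]

def build_prompt(category: str, difficulty: str, n: int) -> str:
    """Illustrative prompt: neutral items must reference the same groups
    and topics as biased ones, so the bias signal separates categories."""
    return (
        f"Generate {n} short UK media content items for the category "
        f"'{category}' at '{difficulty}' difficulty. Neutral items must "
        "mention the same groups and topics as biased ones. One per line."
    )

total = sum(p["count"] for p in plan)
print(total)  # 3000
```

Sharpening category boundaries then means editing `build_prompt`, not the model, which is exactly the fix described above.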

Validation Approach

  • Stratified 80/20 train/test split — 400 train, 100 test per category
  • F1 score per category with explicit 0.78 threshold from PRD
  • Fairness constraint: no category flagged at more than 2× the rate of any other
  • Manual validation: 50 test cases reviewed for explanation accuracy
  • SHAP validation: top feature words checked against domain knowledge
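The per-category F1 gate can be sketched with scikit-learn, scoring each category one-vs-rest against the PRD's 0.78 threshold; the labels below are toy values, not the project's real test set:

```python
from sklearn.metrics import f1_score

THRESHOLD = 0.78  # PRD threshold: every category must clear this
y_true = ["neutral", "racial_bias", "gender_stereotyping", "neutral",
          "racial_bias", "gender_stereotyping", "neutral", "racial_bias"]
y_pred = ["neutral", "racial_bias", "gender_stereotyping", "neutral",
          "racial_bias", "gender_stereotyping", "neutral", "neutral"]

# One-vs-rest F1 per category: binarise against each category in turn.
categories = sorted(set(y_true))
per_category_f1 = {
    cat: f1_score([t == cat for t in y_true], [p == cat for p in y_pred])
    for cat in categories
}
failures = [c for c, f in per_category_f1.items() if f < THRESHOLD]
print(per_category_f1, failures)
```

A non-empty `failures` list is the signal to go back to dataset design, not hyperparameters.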

Proof of Impact

0.90

Overall F1 score across all 6 bias categories — all above the 0.78 PRD threshold

Results Comparison

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| demographic_bias F1 | 0.65 (first attempt) | 0.89 | Prompt redesign |
| racial_bias F1 | 0.65 (first attempt) | 0.87 | Journalism framing |
| geographic_bias F1 | N/A | 0.92 | Within threshold |
| Fairness disparity ratio | Unknown | 1.00× (perfect) | Constraint satisfied |
| Explanation accuracy | Hallucinated context | Content-grounded | Fixed by passing original content |
| Short text accuracy | 99% false positives | Semantic via Claude API | Tiered routing |

Key Insights

The model improvement from F1 0.65 to 0.89 on demographic_bias came from redesigning the training prompts, not tuning the model. This is the core PM insight: data quality is a product decision. The model learns what the data shows — if the data conflates two categories, no amount of hyperparameter tuning will fix it.

The tiered routing architecture was driven by a business and ethics decision, not a technical one. TF-IDF + XGBoost produces high-confidence false positives on short text — 'you are a christian' scoring 99% HIGH RISK religious bias. The cost of a wrong high-confidence verdict (regulatory risk, reviewer trust collapse) is orders of magnitude higher than the cost of a Claude API call (£0.0003).

Geographic bias misclassifies when the bias is carried by the surrounding phrasing rather than by location nouns. 'People from the north of England lack ambition' scores as NEUTRAL because the model sees no geographic trigger words — the bias is in 'lack ambition', not 'north of England'. The Claude explanation layer correctly identifies this as a model failure and tells the reviewer to flag it manually. This is the hybrid architecture working as designed.

The fairness disparity ratio of 1.00× means every category is flagged at exactly the same rate. This is a product outcome, not a coincidence — the dataset was deliberately balanced at 500 items per category, and the PRD specified a 2× maximum disparity as a hard constraint.

Ethics & Responsible AI

This project applies responsible AI principles from the ground up — not as documentation added after the build, but as design constraints that shaped architectural decisions. Drawing on Cathy O'Neil's Weapons of Math Destruction: a model is dangerous when it is opaque, operates at scale, and causes harm to the people it scores. Every decision in this system was made to be the opposite.

Human Agency is Non-Negotiable

The system never makes a content decision. Every result requires an explicit reviewer action: Approve, Flag, or Escalate. There is no auto-remove. The system cannot suppress or publish content on its own. This is an ethical constraint embedded in the architecture, not a UX choice.

Explainability is a Right

A reviewer who cannot understand why content was flagged cannot make a defensible decision. Every result includes SHAP word highlights, a plain English Claude explanation, and a confidence score. The reviewer must be able to read the explanation and either agree or override it with full understanding.

Uncertainty Must Be Communicated Honestly

The model is not certain. Confidence scores are shown prominently on every result. Low-confidence verdicts are visually distinct. The tiered routing architecture was implemented specifically because showing a 99% HIGH RISK verdict on neutral content violates this principle — false certainty is more dangerous than acknowledged uncertainty.

Synthetic Data is a Deliberate Privacy Choice

No real user content is stored or processed. The model learns patterns, not individuals. This is documented as a deliberate product decision in MODEL_DECISIONS.md — not a shortcut.

Guardrails & Safeguards

| Rule | Threshold | Rationale |
| --- | --- | --- |
| Human reviewer always required | No auto-approve, no auto-remove | Removal without human review is censorship without accountability |
| Confidence shown on every verdict | Always visible in UI | Reviewers must know how certain the model is before acting on it |
| Short text routes to Claude, not XGBoost | 4–15 words via Tier 2 | False high-confidence verdicts destroy reviewer trust and create compliance risk |
| No real user content stored in v1 | Synthetic data only | Privacy by design — no DPIA required for v1 |

Bias Audit & Fairness Assessment

Fairness metric computed using the dataset directly: disparity ratio (max flag rate / min flag rate across categories) = 1.00× — no category is flagged at a disproportionate rate. Computed in audit.py using pandas. Fairlearn integration (demographic parity, equal opportunity, predictive parity, individual fairness) is planned for v2. Known limitation: geographic_bias is misclassified when bias is carried by the surrounding phrasing rather than by location nouns. Documented in MODEL_DECISIONS.md.
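In the spirit of the audit.py check described above (the actual script is not shown in this write-up), the disparity ratio reduces to a pandas groupby; the category counts below are toy numbers:

```python
import pandas as pd

# Toy audit log: flag rate per category, then max rate over min rate.
df = pd.DataFrame({
    "category": ["racial_bias"] * 100 + ["gender_stereotyping"] * 100,
    "flagged":  [1] * 40 + [0] * 60 + [1] * 40 + [0] * 60,
})
flag_rates = df.groupby("category")["flagged"].mean()
disparity_ratio = flag_rates.max() / flag_rates.min()
print(round(disparity_ratio, 2))  # 1.0 for these balanced toy numbers
assert disparity_ratio < 2.0      # PRD hard constraint: max 2x disparity
```

The same groupby extends naturally to the Fairlearn metrics planned for v2.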

OKRs & Success Metrics

Objective

Demonstrate responsible AI product management capability through a shipped, production-grade bias detection system that is technically sound, ethically grounded, and commercially relevant to UK media compliance requirements

Key Results

Achieve F1 > 0.78 on all 6 bias categories

100%

Target: F1 > 0.78 per category

Implement human-in-the-loop architecture with no auto-remove capability

100%

Target: Human always decides

Success Metrics

| Metric | Target | Achieved | Status |
| --- | --- | --- | --- |
| Overall F1 score | > 0.78 all categories | 0.90 — all 6 above threshold | Achieved |
| Fairness disparity ratio | < 2× | 1.00× — constraint satisfied | Achieved |
| Live demo | Public URL | bias-audit-dashboard.vercel.app | Achieved |
| PM artefacts | 5 documents | 5 shipped before build started | Achieved |

Roadmap & Future Vision

Now

Completed
  • TF-IDF + XGBoost classifier — F1 0.90 across 6 categories
  • Tiered routing: Tier 1 (free), Tier 2 (Claude API), Tier 3 (XGBoost + SHAP)
  • Live Content Analyser + Comment Moderation Simulator
  • Audit Dashboard with fairness metrics
  • Deployed: bias-audit-dashboard.vercel.app

Next

In Progress
  • Supabase audit log with full session persistence
  • Magic link auth for B2B trial access
  • PDF export of audit log for Ofcom reporting
  • Retrain geographic_bias to catch bias carried by phrasing, not just location nouns

Later

Planned
  • Fine-tune on real UK media content (with consent)
  • Bulk upload / batch processing for large content queues
  • Real-time stream monitoring for live broadcast
  • Sentence transformer upgrade for better semantic understanding

Learnings & Reflections

What Went Well

  • Writing PM artefacts before any code forced clarity on what the product needed to do — the PRD's 0.78 F1 threshold became a real design constraint that shaped dataset generation, not a metric added at the end
  • The hybrid architecture decision (ML for detection, Claude for explanation) was made on day one and proved correct — the two layers have different strengths and the product is better for keeping them separate
  • The tiered routing architecture came from a genuine product problem (short text false positives) and was justified on business and ethics grounds before any code was written — that's the right order

Challenges Faced

  • The first attempt at demographic_bias and racial_bias both scored F1 0.65 — not because the model was wrong, but because the training data created overlapping categories. The fix was redesigning the generation prompts, not tuning the model. This took a full iteration cycle to diagnose correctly.
  • The explain endpoint initially hallucinated context because it only received the score and category — not the original content. The fix (passing content to the explain endpoint) was obvious in retrospect but took seeing a bad explanation in production to identify.
  • Geographic bias remains a known failure mode: 'People from the north of England lack ambition' scores as NEUTRAL because the bias is carried by 'lack ambition', not the location noun. TF-IDF sees tokens, not semantic relationships. Documented honestly in MODEL_DECISIONS.md.

What I'd Do Differently

  • Test the explanation endpoint with real edge cases before shipping — the hallucination bug would have been caught in 10 minutes of manual testing
  • Design bias category boundaries explicitly before generating training data — the demographic/racial overlap cost a full iteration cycle that upfront design would have prevented
  • Build the simulator earlier — it became the most compelling demo feature but was added last

"The model improved from F1 0.65 to 0.89 on the hardest categories by changing the training data, not the algorithm. Data quality is a product decision. The model learns exactly what the data shows — and if the data conflates two categories, no amount of hyperparameter tuning will teach the model to tell them apart."

PM Artefacts

Written before any code. Every project ships with a full PM artefact set.

PRD — Bias Audit Dashboard
Model Decisions — Bias Audit Dashboard
Ethics Framework — Bias Audit Dashboard

Let's Connect

I am actively seeking Junior AI PM / Technical PM roles at companies building AI-powered products in media, trust and safety, e-commerce, or consumer applications. My background in sociology and anthropology combined with an MSc in Managing AI in Business gives me a perspective on responsible AI that most technical candidates don't have. Let's connect.

© 2025 Ogbebor Osaheni. Built with Next.js, React, and Tailwind CSS.