Pulse — AI Audience Sentiment Monitor

Real-time AI audience sentiment classification for live UK broadcast events. TF-IDF + XGBoost emotion classifier (Macro F1 0.830) with tiered routing, SHAP explainability, and an editorial guardrail built into the architecture — designed for BBC, Channel 4, ITV, and Sky.

Shipped • By Ogbebor Osaheni • March 2026

Executive Summary

A real-time AI sentiment classification system that gives broadcast producers structured, explainable, auditable audience intelligence — so they can make faster, more audience-aware editorial decisions during live events. The system never makes editorial decisions. The producer always decides.

  • 0.830 Macro F1 Score: emotion classifier across 5 categories
  • 3-Tier Routing Architecture: auto-neutral, Claude API, XGBoost
  • 2,699 Training Examples: synthetic BAFTA social posts
  • 5+6 Classifications: 5 emotions, 6 topics multi-label

Problem Statement

Live broadcast producers at UK broadcasters make editorial decisions in real time with almost no structured audience intelligence. During high-stakes events — the BAFTAs, election nights, live finals — audience reaction exists in volume on social media but reaches the gallery too late, too unstructured, and too noisy to act on.

The challenge is not technical. The barrier is product thinking: how do you surface real-time audience sentiment in a form that a live broadcast producer can read in under 5 seconds and act on immediately? How do you build a tool that is fast enough for a gallery, honest enough about its uncertainty, and safe enough for an Ofcom-regulated broadcaster to use?

This project addresses the full product lifecycle: from ML model design and labelling guide to tiered routing architecture to SHAP explainability to editorial ethics to a shipped, interactive demo — demonstrating how AI Product Managers bridge technical capability with editorial integrity and regulatory awareness.

Solution Overview

Pulse pairs a tiered classification pipeline with word-level SHAP explainability and a persistent editorial guardrail: three design decisions that keep the signal advisory and the producer in control during live events.

Tiered Routing Architecture

Three-tier classification: Tier 1 (≤3 words) auto-classifies as neutral at zero cost. Tier 2 (4–20 words) routes to Claude API for semantic understanding of short casual text. Tier 3 (>20 words) runs full TF-IDF + XGBoost + SHAP pipeline. Cost and accuracy matched to input complexity.
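The routing rule itself is a simple word-count dispatch. A minimal sketch of the tier boundaries described above (the function name is illustrative, not the production code):

```python
def route_tier(text: str) -> int:
    """Route a post to a classification tier by word count.

    Tier 1: <=3 words  -> auto-classify neutral, zero cost
    Tier 2: 4-20 words -> Claude API for short casual text
    Tier 3: >20 words  -> TF-IDF + XGBoost + SHAP pipeline
    """
    n_words = len(text.split())
    if n_words <= 3:
        return 1
    if n_words <= 20:
        return 2
    return 3

route_tier("lol")                                   # tier 1: too short to classify
route_tier("that speech was incredible honestly")   # tier 2: short casual text
```

The boundaries (3 and 20 words) match the tier definitions above; in a real deployment they would be configurable rather than hard-coded.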

SHAP Word-Level Explainability

Every Tier 3 classification includes SHAP highlights showing the specific words that drove the emotion classification. A producer can see exactly why a post was flagged as angry versus negative — not just a score.
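A sketch of how the word-level highlights might be assembled for display, assuming per-word SHAP values have already been computed by the Tier 3 pipeline (`top_shap_words` and the example values are hypothetical, not the production code):

```python
def top_shap_words(shap_by_word: dict[str, float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k words with the largest positive contribution
    to the predicted emotion, for display as highlights."""
    positive = [(w, v) for w, v in shap_by_word.items() if v > 0]
    return sorted(positive, key=lambda wv: wv[1], reverse=True)[:k]

# Hypothetical per-word SHAP values for a post classified as 'angry'
shap_values = {"disgrace": 0.41, "absolute": 0.18, "ceremony": 0.02, "the": -0.01}
top_shap_words(shap_values, k=2)
# surfaces 'disgrace' and 'absolute' as the words that drove the verdict
```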

Editorial Guardrail Built In

A non-dismissible persistent label on every screen: 'Pulse surfaces audience signals. Editorial decisions remain with the producer.' This is not a UX flourish — it is an ethical constraint embedded in the architecture, documented in the Ethics Framework.

Technical Implementation

Data & Methodology

Data Dictionary

| Feature | Type | Description | Source |
| --- | --- | --- | --- |
| content | string | Synthetic BAFTA 2026 social media post in Twitter/X register | Claude API generation |
| emotion | categorical (5) | excited, positive, neutral, negative, angry — classified by tone, not topic | Generation prompt + labelling guide |
| topics | multi-label (6) | winner_reaction, presenter_performance, ceremony_production, diversity_representation, fashion_red_carpet, general_audience_reaction | Claude API topic assignment |
| confidence | float (0.70–0.99) | Generation difficulty signal: clear examples 0.85–0.99, subtle 0.70–0.84 | Designed by difficulty tier |
| split | categorical (train/test) | Stratified 80/20 split by emotion category | Assigned after generation |
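The split column can be assigned with a stratified 80/20 sampler. A minimal pure-Python sketch (the actual pipeline may equally use scikit-learn's `train_test_split` with `stratify`):

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key="emotion", test_frac=0.2, seed=42):
    """Assign each row to train/test, preserving per-label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    for group in by_label.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        for i, row in enumerate(group):
            row["split"] = "test" if i < n_test else "train"
    return rows

# Toy check: 100 posts, 60 neutral / 40 negative
posts = [{"emotion": "neutral"} for _ in range(60)] + \
        [{"emotion": "negative"} for _ in range(40)]
stratified_split(posts)
n_test_neu = sum(1 for p in posts if p["emotion"] == "neutral" and p["split"] == "test")
n_test_neg = sum(1 for p in posts if p["emotion"] == "negative" and p["split"] == "test")
# each stratum contributes exactly 20% of its examples to the test split
```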

Methodology

Generated 2,699 synthetic BAFTA 2026 social media posts using Claude API (claude-haiku-4-5). Key design decision: the labelling guide was written before any data was generated — defining all emotion boundaries, topic definitions, and decision trees upfront. When the negative/angry boundary produced F1 0.742, the fix was targeted augmentation (300 additional negative posts with sharper prompt definitions), not hyperparameter tuning. When fashion_red_carpet produced F1 0.696, the fix was rebalancing the test split from 15 to 30 examples — a measurement problem, not a model problem.

Validation Approach

  • Stratified 80/20 train/test split by emotion category
  • F1 per category with 0.78 threshold (emotion) and 0.75 threshold (topic multi-label)
  • Diagnostic script: class balance, test sample size per category, similar category imbalance
  • Targeted augmentation for negative category: 180 clear + 120 subtle additional examples
  • Test split rebalance for fashion_red_carpet: 15 → 30 test examples
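The per-category threshold check from the diagnostic script can be sketched in a few lines (the F1 scores below are illustrative, not the project's actual per-category numbers):

```python
EMOTION_THRESHOLD = 0.78  # acceptance threshold from the validation criteria above

def below_threshold(f1_by_category: dict[str, float], threshold: float) -> list[str]:
    """Return categories whose F1 falls below the acceptance threshold,
    worst first, so augmentation effort targets the weakest boundary."""
    failing = [c for c, f1 in f1_by_category.items() if f1 < threshold]
    return sorted(failing, key=lambda c: f1_by_category[c])

# Illustrative per-category scores
emotion_f1 = {"excited": 0.86, "positive": 0.84, "neutral": 0.88,
              "negative": 0.75, "angry": 0.82}
below_threshold(emotion_f1, EMOTION_THRESHOLD)  # flags only 'negative'
```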

Proof of Impact

0.830 Macro F1 — emotion classifier across all 5 categories

Results Comparison

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| negative F1 | 0.742 | 0.750 | Targeted augmentation |
| fashion_red_carpet F1 | 0.696 (15 test) | 0.776 (30 test) | Split rebalance |
| Tier 2 short text accuracy | Wrong (TF-IDF out of distribution) | Correct (Claude API semantic) | Tiered routing |
| general_audience_reaction F1 | N/A | 0.304 (accepted) | Documented limitation |

Key Insights

The negative/angry F1 improvement came from redesigning the generation prompts using the labelling guide decision trees — not from tuning the model. The labelling guide is a product decision. The model learns exactly what the data shows.

The fashion_red_carpet failure was a measurement problem, not a model problem. At 15 test examples, one wrong prediction moves F1 by 6.7 percentage points. Rebalancing the split to 30 examples moved F1 from 0.696 to 0.776 at zero additional cost.
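The sensitivity claim is easy to verify: with n test examples for a class, one missed prediction moves recall by 1/n (6.7 points at n = 15), and per-class F1 swings by an amount that scales the same way, with the exact size depending on the error type. A pure-Python illustration (the confusion counts are illustrative):

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def swing_one_miss(n_pos: int, baseline_errors: int) -> float:
    """F1 drop for a class when one true positive becomes a false negative.
    Assumes symmetric baseline errors (FP == FN) for simplicity."""
    tp = n_pos - baseline_errors
    before = f1(tp, baseline_errors, baseline_errors)
    after = f1(tp - 1, baseline_errors, baseline_errors + 1)
    return before - after

small = swing_one_miss(15, 3)   # 15-example test split, 20% baseline error rate
large = swing_one_miss(30, 6)   # 30-example split, same error rate
# the per-prediction F1 swing roughly halves when the split doubles
```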

The tiered routing architecture was driven by a product and ethics decision: TF-IDF + XGBoost produces wrong high-confidence results on short text. Showing a wrong answer with 99% confidence destroys producer trust. Claude API for Tier 2 costs $0.0003 per call. The cost of a wrong high-confidence verdict in a live broadcast context is orders of magnitude higher.

general_audience_reaction F1 0.304 is an accepted limitation documented in MODEL_DECISIONS.md. The category is defined by exclusion — it applies when nothing else does. Models learn by positive examples. A category defined by the absence of other signals is structurally harder to learn. In production, low-confidence topic tags are shown with a distinct visual indicator.

Ethics & Responsible AI

This project applies responsible AI principles from the ground up. The central ethical tension: if a live broadcast producer consistently responds to sentiment signals, the algorithm gradually shapes editorial decisions — and the broadcast stops reflecting editorial judgement and starts reflecting algorithmic optimisation. Every decision in this system was made to prevent that failure mode.

Editorial Sovereignty is Non-Negotiable

The system never makes a content decision. Every result requires an explicit producer response. The editorial guardrail is persistent and non-dismissible on every screen. This is an ethical constraint embedded in the architecture.

Transparency Over False Confidence

Every classification includes a confidence score. Low-confidence results are visually distinct. The tiered routing architecture exists specifically because showing wrong high-confidence results violates this principle — false certainty is more dangerous than acknowledged uncertainty.

The Feedback Loop Risk is Documented

If producers consistently respond to sentiment spikes, audience behaviour adapts, and the model begins learning the consequences of its own influence rather than genuine sentiment. This risk is documented in the Ethics Framework and the v2 roadmap requires training data from multiple diverse events to mitigate it.

Social Signal Caveat Built Into the UI

Social media audiences skew younger and more urban than linear broadcast audiences. This demographic gap is surfaced in the dashboard as a persistent caveat — not buried in documentation.

Guardrails & Safeguards

| Rule | Threshold | Rationale |
| --- | --- | --- |
| Editorial guardrail on every screen | Non-dismissible | Producer must never forget the signal is advisory, not directive |
| Confidence shown on every verdict | Always visible | Producer must know how certain the model is before acting |
| Short text routes to Claude, not XGBoost | 4–20 words via Tier 2 | Wrong high-confidence verdicts destroy trust and could cause editorial errors |
| Alert names signal, not action | Descriptive only | Alert says 'Negative spike: Winner Reaction', never 'Consider changing coverage' |

Bias Audit & Fairness Assessment

  • Fairness constraint: no emotion category flagged at more than 2× the rate of any other
  • general_audience_reaction F1 0.304: structural catch-all limitation; low-confidence results shown with a distinct visual indicator
  • Negative/angry boundary F1 0.750: in production, the alert system uses a combined negative + angry score, not either emotion alone
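The 2× fairness constraint can be monitored with a simple ratio check over per-category alert counts. A sketch (the counts are illustrative, and every category is assumed to have been flagged at least once):

```python
def flag_rate_ratio(flag_counts: dict[str, int]) -> float:
    """Ratio of the most-flagged to least-flagged emotion category.
    Assumes every category has a non-zero count."""
    return max(flag_counts.values()) / min(flag_counts.values())

def passes_fairness(flag_counts: dict[str, int], max_ratio: float = 2.0) -> bool:
    """True when no category is flagged at more than max_ratio times any other."""
    return flag_rate_ratio(flag_counts) <= max_ratio

# Illustrative per-category alert counts over a monitoring window
counts = {"excited": 120, "positive": 150, "neutral": 90,
          "negative": 140, "angry": 110}
passes_fairness(counts)   # 150 / 90 is about 1.67, within the 2x constraint
```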

OKRs & Success Metrics

Objective

Demonstrate responsible AI product management capability through a shipped, production-grade broadcast sentiment tool that is technically sound, editorially safe, and commercially relevant to UK live broadcasting.

Key Results

  • Achieve Macro F1 > 0.78 on emotion classifier: 100% (target: Macro F1 > 0.78)
  • Implement tiered routing with editorial guardrail: 100% (target: all three tiers operational)
  • Deploy to public URL with scripted BAFTA simulation: 100% (target: live public demo)

Success Metrics

| Metric | Target | Achieved | Status |
| --- | --- | --- | --- |
| Emotion Macro F1 | > 0.78 | 0.830 — all categories above 0.75 | Achieved |
| Tiered routing | 3 tiers operational | Tier 1 auto-neutral, Tier 2 Claude API, Tier 3 XGBoost | Achieved |
| Live demo | Public URL | pulse-pi-inky.vercel.app | Achieved |
| PM artefacts | 5 documents | 5 shipped before build started + Labelling Guide | Achieved |

Learnings & Reflections

What Went Well

  • Writing the labelling guide before generating any data produced cleaner category boundaries than the iterative approach used in earlier projects — the decision trees made every label unambiguous before a single API call was made
  • The tiered routing architecture solved two problems simultaneously: short text accuracy and honest uncertainty communication — the same architecture decision served both technical and ethical requirements
  • The scripted BAFTA narrative arc makes the demo immediately legible to any broadcaster — the CONTROVERSY stage with negative/angry spikes tells the product story better than any explanation

Challenges Faced

  • The negative/angry boundary remained at F1 0.750 after two targeted augmentation runs — the remaining gap is likely irreducible with synthetic data and requires real broadcast audience comments under a proper consent and anonymisation framework
  • general_audience_reaction F1 0.304 is a structural limitation of catch-all categories — accepted and documented, but it surfaces the fundamental tension between ML training methodology and category design
  • Render free tier cold start (30–50 seconds) creates friction in the demo — health check retry loop mitigates this but does not eliminate it

What I'd Do Differently

  • Build tiered routing from day one — the moment I decided to support manual text input I should have asked 'what is the shortest input a producer would type?' and designed the routing architecture before the classification layer
  • Write the labelling guide before the generation script, not after the first failed training run — writing it first proved to be the correct sequence, and committing to it from the start would have saved two full regeneration cycles
  • Generate a 100-row pilot before full generation — $0.02 to validate label boundaries before committing to a $0.50 full run

"The labelling guide is not documentation — it is product design. Every emotion boundary definition is a decision about what the model learns. Every topic decision tree is a specification. Writing it before generating data is the correct order. The model learns exactly what the data shows."

PM Artefacts

Written before any code. Every project ships with a full PM artefact set.

PRD — Pulse
Model Decisions — Pulse
Ethics Framework — Pulse

Let's Connect

I am actively seeking Junior AI PM and Technical PM roles at UK media companies — BBC, Channel 4, ITV, Sky, and media-adjacent tech. My background in sociology and anthropology, combined with an MSc in Managing AI in Business, gives me a perspective on responsible AI and editorial ethics that most technical candidates do not have.

© 2025 Ogbebor Osaheni. Built with Next.js, React, and Tailwind CSS.