Pulse — AI Audience Sentiment Monitor
Real-time AI audience sentiment classification for live UK broadcast events. TF-IDF + XGBoost emotion classifier (Macro F1 0.830) with tiered routing, SHAP explainability, and an editorial guardrail built into the architecture — built for BBC, Channel 4, ITV, and Sky.
Executive Summary
A real-time AI sentiment classification system that gives broadcast producers structured, explainable, auditable audience intelligence — so they can make faster, more audience-aware editorial decisions during live events. The system never makes editorial decisions. The producer always decides.
Macro F1 Score
Emotion classifier across 5 categories
Routing Architecture
Auto-neutral, Claude API, XGBoost
Training Examples
Synthetic BAFTA social posts
Classifications
5 emotions, 6 topics multi-label
Problem Statement
Live broadcast producers at UK broadcasters make editorial decisions in real time with almost no structured audience intelligence. During high-stakes events — the BAFTAs, election nights, live finals — audience reaction exists in volume on social media but reaches the gallery too late, too unstructured, and too noisy to act on.
The challenge is not technical. The barrier is product thinking: how do you surface real-time audience sentiment in a form that a live broadcast producer can read in under 5 seconds and act on immediately? How do you build a tool that is fast enough for a gallery, honest enough about its uncertainty, and safe enough for an Ofcom-regulated broadcaster to use?
This project addresses the full product lifecycle: from ML model design and labelling guide to tiered routing architecture to SHAP explainability to editorial ethics to a shipped, interactive demo — demonstrating how AI Product Managers bridge technical capability with editorial integrity and regulatory awareness.
Solution Overview
Pulse classifies live social reaction into five emotions and six topics, routes each post through a tier matched to its length and complexity, and presents every result with a confidence score and a word-level explanation. The system never makes editorial decisions. The producer always decides.
Tiered Routing Architecture
Three-tier classification: Tier 1 (≤3 words) auto-classifies as neutral at zero cost. Tier 2 (4–20 words) routes to Claude API for semantic understanding of short casual text. Tier 3 (>20 words) runs full TF-IDF + XGBoost + SHAP pipeline. Cost and accuracy matched to input complexity.
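The word-count routing described above can be sketched in a few lines. This is an illustrative stand-in, not the production dispatcher; the tier names are assumptions, and the real Tier 2 and Tier 3 branches would call the Claude API and the XGBoost pipeline respectively.

```python
def route(post: str) -> str:
    """Return the tier a post is dispatched to, based on word count.

    Tier boundaries follow the architecture described above:
    <=3 words auto-neutral, 4-20 words to the Claude API,
    >20 words to the full TF-IDF + XGBoost + SHAP pipeline.
    """
    n_words = len(post.split())
    if n_words <= 3:
        return "tier1_auto_neutral"   # zero cost: too short to classify reliably
    elif n_words <= 20:
        return "tier2_claude_api"     # semantic model for short casual text
    else:
        return "tier3_xgboost_shap"   # full explainable pipeline

print(route("lol"))  # tier1_auto_neutral
```

The design point is that cost scales with input complexity: the cheapest tier handles the inputs where any model output would be noise.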
SHAP Word-Level Explainability
Every Tier 3 classification includes SHAP highlights showing the specific words that drove the emotion classification. A producer can see exactly why a post was flagged as angry versus negative — not just a score.
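Once SHAP produces per-word contributions for the predicted class, selecting the highlight words is a simple ranking step. The sketch below assumes the contributions have already been extracted into a dict; the example words and scores are hypothetical, not model output.

```python
def top_contributors(shap_values: dict[str, float], k: int = 3) -> list[str]:
    """Return the k words pushing hardest toward the predicted emotion.

    Positive SHAP values push toward the predicted class, so we sort
    descending and keep the top k for the producer-facing highlight.
    """
    ranked = sorted(shap_values.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]

# Hypothetical per-word contributions for a post classified as "angry"
scores = {"disgrace": 0.41, "robbed": 0.33, "ceremony": 0.02, "the": -0.01}
print(top_contributors(scores, k=2))  # ['disgrace', 'robbed']
```

This is what lets a producer distinguish "angry" from "negative" at a glance: the highlighted words carry the evidence, not just the score.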
Editorial Guardrail Built In
A non-dismissible persistent label on every screen: 'Pulse surfaces audience signals. Editorial decisions remain with the producer.' This is not a UX flourish — it is an ethical constraint embedded in the architecture, documented in the Ethics Framework.
Technical Implementation
Data & Methodology
Data Dictionary
| Feature | Type | Description | Source |
|---|---|---|---|
| content | string | Synthetic BAFTA 2026 social media post in Twitter/X register | Claude API generation |
| emotion | categorical (5) | excited, positive, neutral, negative, angry — classified by tone not topic | Generation prompt + labelling guide |
| topics | multi-label (6) | winner_reaction, presenter_performance, ceremony_production, diversity_representation, fashion_red_carpet, general_audience_reaction | Claude API topic assignment |
| confidence | float (0.70–0.99) | Generation difficulty signal. Clear examples 0.85–0.99, subtle 0.70–0.84 | Designed by difficulty tier |
| split | categorical (train/test) | Stratified 80/20 split by emotion category | Assigned after generation |
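The data dictionary above maps naturally onto a typed record. This is a minimal sketch of what a validated row might look like; the class name and validation style are assumptions, not the project's actual schema code.

```python
from dataclasses import dataclass, field

EMOTIONS = {"excited", "positive", "neutral", "negative", "angry"}
TOPICS = {"winner_reaction", "presenter_performance", "ceremony_production",
          "diversity_representation", "fashion_red_carpet",
          "general_audience_reaction"}

@dataclass
class Post:
    """One synthetic BAFTA social post, matching the data dictionary."""
    content: str
    emotion: str              # one of the 5 EMOTIONS
    topics: list[str] = field(default_factory=list)  # multi-label subset of TOPICS
    confidence: float = 0.85  # generation difficulty signal, 0.70-0.99
    split: str = "train"      # stratified train/test assignment

    def __post_init__(self):
        assert self.emotion in EMOTIONS, f"unknown emotion: {self.emotion}"
        assert set(self.topics) <= TOPICS, "unknown topic label"
        assert 0.70 <= self.confidence <= 0.99
        assert self.split in {"train", "test"}
```

Validating rows at load time catches label drift between regeneration runs before it reaches training.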
Methodology
Generated 2,699 synthetic BAFTA 2026 social media posts using Claude API (claude-haiku-4-5). Key design decision: wrote the labelling guide before generating any data — defining all emotion boundaries, topic definitions, and decision trees upfront. When negative/angry boundary produced F1 0.742, the fix was targeted augmentation (300 additional negative posts with sharper prompt definitions) not hyperparameter tuning. When fashion_red_carpet produced F1 0.696, the fix was rebalancing the test split from 15 to 30 examples — a measurement problem, not a model problem.
Validation Approach
- Stratified 80/20 train/test split by emotion category
- F1 per category with a 0.78 threshold (emotion) and a 0.75 threshold (topic multi-label)
- Diagnostic script: class balance, test sample size per category, similar-category imbalance
- Targeted augmentation for the negative category: 180 clear + 120 subtle additional examples
- Test split rebalance for fashion_red_carpet: 15 → 30 test examples
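A stratified split like the one used for validation can be done in plain Python. This is a sketch under the assumption that each post carries its emotion label; the project itself may have used a library utility such as scikit-learn's stratified splitter instead.

```python
import random
from collections import defaultdict

def stratified_split(posts, test_frac=0.2, seed=42):
    """80/20 split that preserves per-emotion proportions.

    Grouping by emotion before splitting guarantees every category
    appears in the test set at roughly its training-set rate.
    """
    by_emotion = defaultdict(list)
    for post in posts:
        by_emotion[post["emotion"]].append(post)

    rng = random.Random(seed)
    train, test = [], []
    for items in by_emotion.values():
        rng.shuffle(items)
        n_test = round(len(items) * test_frac)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test
```

Stratification matters here precisely because of the small-category problem flagged in the diagnostics: an unstratified split can leave a rare category with too few test examples to measure.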
Proof of Impact
0.830
Macro F1 — emotion classifier across all 5 categories
Results Comparison
| Metric | Before | After | Intervention |
|---|---|---|---|
| negative F1 | 0.742 | 0.750 | Targeted augmentation |
| fashion_red_carpet F1 | 0.696 (15 test) | 0.776 (30 test) | Split rebalance |
| Tier 2 short text accuracy | Wrong (TF-IDF out of distribution) | Correct (Claude API semantic) | Tiered routing |
| general_audience_reaction F1 | N/A | 0.304 (accepted) | Documented limitation |
Key Insights
The negative/angry F1 improvement came from redesigning the generation prompts using the labelling guide decision trees — not from tuning the model. The labelling guide is a product decision. The model learns exactly what the data shows.
The fashion_red_carpet failure was a measurement problem, not a model problem. At 15 test examples, one wrong prediction moves F1 by 6.7 percentage points. Rebalancing the split to 30 examples moved F1 from 0.696 to 0.776 at zero additional cost.
The tiered routing architecture was driven by a product and ethics decision: TF-IDF + XGBoost produces wrong high-confidence results on short text. Showing a wrong answer with 99% confidence destroys producer trust. Claude API for Tier 2 costs $0.0003 per call. The cost of a wrong high-confidence verdict in a live broadcast context is orders of magnitude higher.
general_audience_reaction F1 0.304 is an accepted limitation documented in MODEL_DECISIONS.md. The category is defined by exclusion — it applies when nothing else does. Models learn by positive examples. A category defined by the absence of other signals is structurally harder to learn. In production, low-confidence topic tags are shown with a distinct visual indicator.
Ethics & Responsible AI
This project applies responsible AI principles from the ground up. The central ethical tension: if a live broadcast producer consistently responds to sentiment signals, the algorithm gradually shapes editorial decisions — and the broadcast stops reflecting editorial judgement and starts reflecting algorithmic optimisation. Every decision in this system was made to prevent that failure mode.
Editorial Sovereignty is Non-Negotiable
The system never makes a content decision. Every result requires an explicit producer response. The editorial guardrail is persistent and non-dismissible on every screen. This is an ethical constraint embedded in the architecture.
Transparency Over False Confidence
Every classification includes a confidence score. Low-confidence results are visually distinct. The tiered routing architecture exists specifically because showing wrong high-confidence results violates this principle — false certainty is more dangerous than acknowledged uncertainty.
The Feedback Loop Risk is Documented
If producers consistently respond to sentiment spikes, audience behaviour adapts, and the model begins learning the consequences of its own influence rather than genuine sentiment. This risk is documented in the Ethics Framework and the v2 roadmap requires training data from multiple diverse events to mitigate it.
Social Signal Caveat Built Into the UI
Social media audiences skew younger and more urban than linear broadcast audiences. This demographic gap is surfaced in the dashboard as a persistent caveat — not buried in documentation.
Guardrails & Safeguards
| Rule | Threshold | Rationale |
|---|---|---|
| Editorial guardrail on every screen | Non-dismissible | Producer must never forget the signal is advisory not directive |
| Confidence shown on every verdict | Always visible | Producer must know how certain the model is before acting |
| Short text routes to Claude not XGBoost | 4–20 words via Tier 2 | Wrong high-confidence verdicts destroy trust and could cause editorial errors |
| Alert names signal, not action | Descriptive only | Alerts say 'Negative spike: Winner Reaction', never 'Consider changing coverage' |
Bias Audit & Fairness Assessment
Fairness constraint: no emotion category flagged at more than 2× the rate of any other. Known limitation: general_audience_reaction F1 0.304 — structural catch-all limitation, low-confidence results shown with distinct visual indicator. Negative/angry boundary F1 0.750 — in production, alert system uses combined negative + angry score, not either emotion alone.
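The 2× fairness constraint stated above is straightforward to monitor. This sketch assumes per-category flag counts over a fixed window; the counts shown are hypothetical.

```python
def max_rate_ratio(flag_counts: dict[str, int]) -> float:
    """Ratio of the highest to the lowest per-category flag rate.

    The fairness constraint requires this ratio to stay <= 2.0:
    no emotion category flagged at more than twice the rate of another.
    """
    counts = [c for c in flag_counts.values() if c > 0]
    return max(counts) / min(counts)

# Hypothetical flag counts over one monitoring window
window = {"excited": 120, "positive": 150, "neutral": 200,
          "negative": 180, "angry": 110}
assert max_rate_ratio(window) <= 2.0  # constraint holds for this window
```

Running this check per window rather than over the whole event catches transient skews, such as a single controversy drowning every other category.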
OKRs & Success Metrics
Objective
Demonstrate responsible AI product management capability through a shipped, production-grade broadcast sentiment tool that is technically sound, editorially safe, and commercially relevant to UK live broadcasting.
Key Results
Achieve Macro F1 > 0.78 on emotion classifier
100% (Target: Macro F1 > 0.78)
Implement tiered routing with editorial guardrail
100% (Target: All three tiers operational)
Deploy to public URL with scripted BAFTA simulation
100% (Target: Live public demo)
Success Metrics
| Metric | Target | Achieved | Status |
|---|---|---|---|
| Emotion Macro F1 | > 0.78 | 0.830 — all categories above 0.75 | Achieved |
| Tiered routing | 3 tiers operational | Tier 1 auto-neutral, Tier 2 Claude API, Tier 3 XGBoost | Achieved |
| Live demo | Public URL | pulse-pi-inky.vercel.app | Achieved |
| PM artefacts | 5 documents | 5 shipped before build started + Labelling Guide | Achieved |
Learnings & Reflections
What Went Well
- Writing the labelling guide before generating any data produced cleaner category boundaries than the iterative approach used in earlier projects — the decision trees made every label unambiguous before a single API call was made
- The tiered routing architecture solved two problems simultaneously: short text accuracy and honest uncertainty communication — the same architecture decision served both technical and ethical requirements
- The scripted BAFTA narrative arc makes the demo immediately legible to any broadcaster — the CONTROVERSY stage with negative/angry spikes tells the product story better than any explanation
Challenges Faced
- The negative/angry boundary remained at F1 0.750 after two targeted augmentation runs — the remaining gap is likely irreducible with synthetic data and requires real broadcast audience comments under a proper consent and anonymisation framework
- general_audience_reaction F1 0.304 is a structural limitation of catch-all categories — accepted and documented, but it surfaces the fundamental tension between ML training methodology and category design
- Render free tier cold start (30–50 seconds) creates friction in the demo — a health check retry loop mitigates this but does not eliminate it
What I'd Do Differently
- Build tiered routing from day one — the moment I decided to support manual text input I should have asked "what is the shortest input a producer would type?" and designed the routing architecture before the classification layer
- Write the labelling guide before the generation script, not after the first failed training run — writing it first is the correct sequence and would have saved two full regeneration cycles
- Generate a 100-row pilot before full generation — $0.02 to validate label boundaries before committing to a $0.50 full run
"The labelling guide is not documentation — it is product design. Every emotion boundary definition is a decision about what the model learns. Every topic decision tree is a specification. Writing it before generating data is the correct order. The model learns exactly what the data shows."
PM Artefacts
Written before any code. Every project ships with a full PM artefact set.
Let's Connect
I am actively seeking Junior AI PM and Technical PM roles at UK media companies — BBC, Channel 4, ITV, Sky, and media-adjacent tech. My background in sociology and anthropology combined with an MSc in Managing AI in Business gives me a perspective on responsible AI and editorial ethics that most technical candidates do not have. Let's connect.
Quick Links
© 2025 Ogbebor Osaheni. Built with Next.js, React, and Tailwind CSS.
