Pulse — AI Audience Sentiment Monitor
Real-time AI audience sentiment classification for live UK broadcast events. TF-IDF + XGBoost emotion classifier (Macro F1 0.830) with tiered routing, SHAP explainability, and an editorial guardrail built into the architecture — built for BBC, Channel 4, ITV, and Sky.
Executive Summary
A real-time AI sentiment classification system that gives broadcast producers structured, explainable, auditable audience intelligence — so they can make faster, more audience-aware editorial decisions during live events. The system never makes editorial decisions. The producer always decides.
Macro F1 Score
Emotion classifier across 5 categories
Routing Architecture
Auto-neutral, Claude API, XGBoost
Training Examples
Synthetic BAFTA social posts
Classifications
5 emotions, 6 topics multi-label
Problem Statement
Live broadcast producers at UK broadcasters make editorial decisions in real time with almost no structured audience intelligence. During high-stakes events — the BAFTAs, election nights, live finals — audience reaction exists in volume on social media but reaches the gallery too late, too unstructured, and too noisy to act on.
The challenge is not technical. The barrier is product thinking: how do you surface real-time audience sentiment in a form that a live broadcast producer can read in under 5 seconds and act on immediately? How do you build a tool that is fast enough for a gallery, honest enough about its uncertainty, and safe enough for an Ofcom-regulated broadcaster to use?
This project addresses the full product lifecycle: from ML model design and labelling guide to tiered routing architecture to SHAP explainability to editorial ethics to a shipped, interactive demo — demonstrating how AI Product Managers bridge technical capability with editorial integrity and regulatory awareness.
Solution Overview
Pulse classifies live social reaction into five emotions and six topics, routes each post through a tier matched to its length and complexity, and presents every result with a confidence score and a word-level explanation. The system never makes editorial decisions. The producer always decides.
Tiered Routing Architecture
Three-tier classification: Tier 1 (≤3 words) auto-classifies as neutral at zero cost. Tier 2 (4–20 words) routes to Claude API for semantic understanding of short casual text. Tier 3 (>20 words) runs full TF-IDF + XGBoost + SHAP pipeline. Cost and accuracy matched to input complexity.
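The word-count routing described above can be sketched in a few lines. This is an illustrative stand-in, not the production dispatcher; the tier names are assumptions, and the real Tier 2 and Tier 3 branches would call the Claude API and the XGBoost pipeline respectively.

```python
def route(post: str) -> str:
    """Return the tier a post is dispatched to, based on word count.

    Tier boundaries follow the architecture described above:
    <=3 words auto-neutral, 4-20 words to the Claude API,
    >20 words to the full TF-IDF + XGBoost + SHAP pipeline.
    """
    n_words = len(post.split())
    if n_words <= 3:
        return "tier1_auto_neutral"   # zero cost: too short to classify reliably
    elif n_words <= 20:
        return "tier2_claude_api"     # semantic model for short casual text
    else:
        return "tier3_xgboost_shap"   # full explainable pipeline

print(route("lol"))  # tier1_auto_neutral
```

The design point is that cost scales with input complexity: the cheapest tier handles the inputs where any model output would be noise.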
SHAP Word-Level Explainability
Every Tier 3 classification includes SHAP highlights showing the specific words that drove the emotion classification. A producer can see exactly why a post was flagged as angry versus negative — not just a score.
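Once SHAP produces per-word contributions for the predicted class, selecting the highlight words is a simple ranking step. The sketch below assumes the contributions have already been extracted into a dict; the example words and scores are hypothetical, not model output.

```python
def top_contributors(shap_values: dict[str, float], k: int = 3) -> list[str]:
    """Return the k words pushing hardest toward the predicted emotion.

    Positive SHAP values push toward the predicted class, so we sort
    descending and keep the top k for the producer-facing highlight.
    """
    ranked = sorted(shap_values.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:k]]

# Hypothetical per-word contributions for a post classified as "angry"
scores = {"disgrace": 0.41, "robbed": 0.33, "ceremony": 0.02, "the": -0.01}
print(top_contributors(scores, k=2))  # ['disgrace', 'robbed']
```

This is what lets a producer distinguish "angry" from "negative" at a glance: the highlighted words carry the evidence, not just the score.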
Editorial Guardrail Built In
A non-dismissible persistent label on every screen: 'Pulse surfaces audience signals. Editorial decisions remain with the producer.' This is not a UX flourish — it is an ethical constraint embedded in the architecture, documented in the Ethics Framework.
Technical Implementation
Data & Methodology
Data Dictionary
| Feature | Type | Description | Source |
|---|---|---|---|
| content | string | Synthetic BAFTA 2026 social media post in Twitter/X register | Claude API generation |
| emotion | categorical (5) | excited, positive, neutral, negative, angry — classified by tone not topic | Generation prompt + labelling guide |
| topics | multi-label (6) | winner_reaction, presenter_performance, ceremony_production, diversity_representation, fashion_red_carpet, general_audience_reaction | Claude API topic assignment |
| confidence | float (0.70–0.99) | Generation difficulty signal. Clear examples 0.85–0.99, subtle 0.70–0.84 | Designed by difficulty tier |
| split | categorical (train/test) | Stratified 80/20 split by emotion category | Assigned after generation |
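The data dictionary above maps naturally onto a typed record. This is a minimal sketch of what a validated row might look like; the class name and validation style are assumptions, not the project's actual schema code.

```python
from dataclasses import dataclass, field

EMOTIONS = {"excited", "positive", "neutral", "negative", "angry"}
TOPICS = {"winner_reaction", "presenter_performance", "ceremony_production",
          "diversity_representation", "fashion_red_carpet",
          "general_audience_reaction"}

@dataclass
class Post:
    """One synthetic BAFTA social post, matching the data dictionary."""
    content: str
    emotion: str              # one of the 5 EMOTIONS
    topics: list[str] = field(default_factory=list)  # multi-label subset of TOPICS
    confidence: float = 0.85  # generation difficulty signal, 0.70-0.99
    split: str = "train"      # stratified train/test assignment

    def __post_init__(self):
        assert self.emotion in EMOTIONS, f"unknown emotion: {self.emotion}"
        assert set(self.topics) <= TOPICS, "unknown topic label"
        assert 0.70 <= self.confidence <= 0.99
        assert self.split in {"train", "test"}
```

Validating rows at load time catches label drift between regeneration runs before it reaches training.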
Methodology
Generated 2,699 synthetic BAFTA 2026 social media posts using Claude API (claude-haiku-4-5). Key design decision: wrote the labelling guide before generating any data — defining all emotion boundaries, topic definitions, and decision trees upfront. When negative/angry boundary produced F1 0.742, the fix was targeted augmentation (300 additional negative posts with sharper prompt definitions) not hyperparameter tuning. When fashion_red_carpet produced F1 0.696, the fix was rebalancing the test split from 15 to 30 examples — a measurement problem, not a model problem.
Validation Approach
- Stratified 80/20 train/test split by emotion category
- F1 per category with a 0.78 threshold (emotion) and a 0.75 threshold (topic multi-label)
- Diagnostic script: class balance, test sample size per category, similar-category imbalance
- Targeted augmentation for the negative category: 180 clear + 120 subtle additional examples
- Test split rebalance for fashion_red_carpet: 15 → 30 test examples
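A stratified split like the one used for validation can be done in plain Python. This is a sketch under the assumption that each post carries its emotion label; the project itself may have used a library utility such as scikit-learn's stratified splitter instead.

```python
import random
from collections import defaultdict

def stratified_split(posts, test_frac=0.2, seed=42):
    """80/20 split that preserves per-emotion proportions.

    Grouping by emotion before splitting guarantees every category
    appears in the test set at roughly its training-set rate.
    """
    by_emotion = defaultdict(list)
    for post in posts:
        by_emotion[post["emotion"]].append(post)

    rng = random.Random(seed)
    train, test = [], []
    for items in by_emotion.values():
        rng.shuffle(items)
        n_test = round(len(items) * test_frac)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test
```

Stratification matters here precisely because of the small-category problem flagged in the diagnostics: an unstratified split can leave a rare category with too few test examples to measure.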
Proof of Impact
0.830
Macro F1 — emotion classifier across all 5 categories
Results Comparison
| Metric | Before | After | Intervention |
|---|---|---|---|
| negative F1 | 0.742 | 0.750 | Targeted augmentation |
| fashion_red_carpet F1 | 0.696 (15 test) | 0.776 (30 test) | Split rebalance |
| Tier 2 short text accuracy | Wrong (TF-IDF out of distribution) | Correct (Claude API semantic) | Tiered routing |
| general_audience_reaction F1 | N/A | 0.304 (accepted) | Documented limitation |
Key Insights
The negative/angry F1 improvement came from redesigning the generation prompts using the labelling guide decision trees — not from tuning the model. The labelling guide is a product decision. The model learns exactly what the data shows.
The fashion_red_carpet failure was a measurement problem, not a model problem. At 15 test examples, one wrong prediction moves F1 by 6.7 percentage points. Rebalancing the split to 30 examples moved F1 from 0.696 to 0.776 at zero additional cost.
The tiered routing architecture was driven by a product and ethics decision: TF-IDF + XGBoost produces wrong high-confidence results on short text. Showing a wrong answer with 99% confidence destroys producer trust. Claude API for Tier 2 costs $0.0003 per call. The cost of a wrong high-confidence verdict in a live broadcast context is orders of magnitude higher.
general_audience_reaction F1 0.304 is an accepted limitation documented in MODEL_DECISIONS.md. The category is defined by exclusion — it applies when nothing else does. Models learn by positive examples. A category defined by the absence of other signals is structurally harder to learn. In production, low-confidence topic tags are shown with a distinct visual indicator.
Ethics & Responsible AI
This project applies responsible AI principles from the ground up. The central ethical tension: if a live broadcast producer consistently responds to sentiment signals, the algorithm gradually shapes editorial decisions — and the broadcast stops reflecting editorial judgement and starts reflecting algorithmic optimisation. Every decision in this system was made to prevent that failure mode.
Editorial Sovereignty is Non-Negotiable
The system never makes a content decision. Every result requires an explicit producer response. The editorial guardrail is persistent and non-dismissible on every screen. This is an ethical constraint embedded in the architecture.
Transparency Over False Confidence
Every classification includes a confidence score. Low-confidence results are visually distinct. The tiered routing architecture exists specifically because showing wrong high-confidence results violates this principle — false certainty is more dangerous than acknowledged uncertainty.
The Feedback Loop Risk is Documented
If producers consistently respond to sentiment spikes, audience behaviour adapts, and the model begins learning the consequences of its own influence rather than genuine sentiment. This risk is documented in the Ethics Framework and the v2 roadmap requires training data from multiple diverse events to mitigate it.
Social Signal Caveat Built Into the UI
Social media audiences skew younger and more urban than linear broadcast audiences. This demographic gap is surfaced in the dashboard as a persistent caveat — not buried in documentation.
Guardrails & Safeguards
| Rule | Threshold | Rationale |
|---|---|---|
| Editorial guardrail on every screen | Non-dismissible | Producer must never forget the signal is advisory not directive |
| Confidence shown on every verdict | Always visible | Producer must know how certain the model is before acting |
| Short text routes to Claude not XGBoost | 4–20 words via Tier 2 | Wrong high-confidence verdicts destroy trust and could cause editorial errors |
| Alert names signal, not action | Descriptive only | Alerts say 'Negative spike: Winner Reaction', never 'Consider changing coverage' |
Bias Audit & Fairness Assessment
Fairness constraint: no emotion category flagged at more than 2× the rate of any other. Known limitation: general_audience_reaction F1 0.304 — structural catch-all limitation, low-confidence results shown with distinct visual indicator. Negative/angry boundary F1 0.750 — in production, alert system uses combined negative + angry score, not either emotion alone.
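The 2× fairness constraint stated above is straightforward to monitor. This sketch assumes per-category flag counts over a fixed window; the counts shown are hypothetical.

```python
def max_rate_ratio(flag_counts: dict[str, int]) -> float:
    """Ratio of the highest to the lowest per-category flag rate.

    The fairness constraint requires this ratio to stay <= 2.0:
    no emotion category flagged at more than twice the rate of another.
    """
    counts = [c for c in flag_counts.values() if c > 0]
    return max(counts) / min(counts)

# Hypothetical flag counts over one monitoring window
window = {"excited": 120, "positive": 150, "neutral": 200,
          "negative": 180, "angry": 110}
assert max_rate_ratio(window) <= 2.0  # constraint holds for this window
```

Running this check per window rather than over the whole event catches transient skews, such as a single controversy drowning every other category.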
OKRs & Success Metrics
Objective
Demonstrate responsible AI product management capability through a shipped, production-grade broadcast sentiment tool that is technically sound, editorially safe, and commercially relevant to UK live broadcasting.
Key Results
Achieve Macro F1 > 0.78 on emotion classifier
100% (Target: Macro F1 > 0.78)
Implement tiered routing with editorial guardrail
100% (Target: All three tiers operational)
Deploy to public URL with scripted BAFTA simulation
100% (Target: Live public demo)
Success Metrics
| Metric | Target | Achieved | Status |
|---|---|---|---|
| Emotion Macro F1 | > 0.78 | 0.830 — all categories above 0.75 | Achieved |
| Tiered routing | 3 tiers operational | Tier 1 auto-neutral, Tier 2 Claude API, Tier 3 XGBoost | Achieved |
| Live demo | Public URL | pulse-pi-inky.vercel.app | Achieved |
| PM artefacts | 5 documents | 5 shipped before build started + Labelling Guide | Achieved |
Learnings & Reflections
What Went Well
- Writing the labelling guide before generating any data produced cleaner category boundaries than the iterative approach used in earlier projects — the decision trees made every label unambiguous before a single API call was made
- The tiered routing architecture solved two problems simultaneously: short text accuracy and honest uncertainty communication — the same architecture decision served both technical and ethical requirements
- The scripted BAFTA narrative arc makes the demo immediately legible to any broadcaster — the CONTROVERSY stage with negative/angry spikes tells the product story better than any explanation
Challenges Faced
- The negative/angry boundary remained at F1 0.750 after two targeted augmentation runs — the remaining gap is likely irreducible with synthetic data and requires real broadcast audience comments under a proper consent and anonymisation framework
- general_audience_reaction F1 0.304 is a structural limitation of catch-all categories — accepted and documented, but it surfaces the fundamental tension between ML training methodology and category design
- Render free tier cold start (30–50 seconds) creates friction in the demo — a health check retry loop mitigates this but does not eliminate it
What I'd Do Differently
- Build tiered routing from day one — the moment I decided to support manual text input I should have asked "what is the shortest input a producer would type?" and designed the routing architecture before the classification layer
- Write the labelling guide before the generation script, not after the first failed training run — writing it first is the correct sequence and would have saved two full regeneration cycles
- Generate a 100-row pilot before full generation — $0.02 to validate label boundaries before committing to a $0.50 full run
"The labelling guide is not documentation — it is product design. Every emotion boundary definition is a decision about what the model learns. Every topic decision tree is a specification. Writing it before generating data is the correct order. The model learns exactly what the data shows."
PM Artefacts
Written before any code. Every project ships with a full PM artefact set.
Let's Connect
I am actively seeking Junior AI PM and Technical PM roles at UK media companies — BBC, Channel 4, ITV, Sky, and media-adjacent tech. My background in sociology and anthropology combined with an MSc in Managing AI in Business gives me a perspective on responsible AI and editorial ethics that most technical candidates do not have. Let's connect.
Quick Links
© 2025 Ogbebor Osaheni. Built with Next.js, React, and Tailwind CSS.
