Conversation AI · 15 min read · Dec 2025

Intent scoring at 60ms: an engineering teardown

How we got streaming intent and sentiment classification under 60ms median, why TTFT (time-to-first-token) is the wrong metric, and what the right metric actually is.

When we started building real-time intent scoring for live calls, the first question everyone asked was: why does latency matter if the call is 3 minutes long? The answer is that real-time scoring isn't useful for post-call reporting — it's useful for routing decisions, escalation triggers, and dynamic script changes that happen mid-call. At 300ms, you're routing 5 seconds late. At 60ms, you're routing before the caller finishes their first sentence. That's the difference between a system that reacts and one that leads.

Pipeline: audio stream → 20ms chunks → feature extraction → edge inference → intent signal (<60ms)

Why TTFT is the wrong metric

Time-to-first-token (TTFT) is the standard latency metric in LLM systems — it measures how long until the model starts generating output. For our use case, it's not just incomplete; it's misleading. We don't need the model to start generating — we need it to produce a useful signal that our routing layer can act on.

The metric we optimized for is TTFS: time-to-first-signal — specifically, the time from start of audio input to a routing-actionable intent score. This is a stricter requirement than TTFT in one dimension (we need a complete score, not partial output) and a looser requirement in another (we don't need the full transcript, just the intent classification).

  1. TTFT measures generation start. Useful for conversational AI where the caller expects a response. Not useful for routing decisions where what matters is the routing event, not a generated sentence.
  2. TTFS measures signal readiness. The threshold: a confidence score of ≥0.80 on any intent class. Below 0.80, we classify as "uncertain" and continue collecting audio rather than routing prematurely.
  3. The gap matters. In our architecture, TTFS typically runs 2–3× faster than TTFT because we route on a classification signal, not a generated response. Optimizing for TTFT would have delivered a slower routing system.
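
The "collect audio until confident" rule can be sketched roughly as follows. This is a minimal illustration, not our production code: `first_signal`, `Signal`, and the per-chunk score dictionaries are hypothetical names, and the elapsed time here counts only the audio consumed, not inference or network time.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

CONFIDENCE_THRESHOLD = 0.80  # production threshold quoted in the article
CHUNK_MS = 20                # chunk duration quoted in the article


@dataclass
class Signal:
    intent: str
    confidence: float
    audio_ms: int  # audio consumed before the score became routable


def first_signal(
    chunk_scores: Iterable[Dict[str, float]],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Optional[Signal]:
    """Return the first routing-actionable signal, or None if still uncertain.

    chunk_scores yields one (hypothetical) mapping of intent class to
    running confidence per 20ms chunk.
    """
    for index, scores in enumerate(chunk_scores, start=1):
        if not scores:
            continue  # nothing scored for this chunk yet
        intent, confidence = max(scores.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            return Signal(intent, confidence, index * CHUNK_MS)
    return None  # "uncertain": keep collecting audio instead of routing


# Illustrative scores only; confidence crosses 0.80 on the third chunk.
stream = [
    {"sales": 0.55, "support": 0.30},
    {"sales": 0.72, "support": 0.20},
    {"sales": 0.84, "support": 0.10},
]
signal = first_signal(stream)
```

The key design point is that nothing is emitted below the threshold: the function returns `None` rather than a best guess, which is what keeps the routing layer from acting prematurely.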

Architecture: how we got to 60ms

The 60ms target was non-negotiable for our use case: it's the highest latency at which real-time call routing can still happen before a human caller registers any interaction with the system. Getting there required decisions at four levels of the stack:

  1. Audio chunking. We process in 20ms non-overlapping chunks at 16kHz mono. Most production speech pipelines use 100–200ms chunks because they're optimized for transcription accuracy. At 20ms, we sacrifice some transcription quality in exchange for signal readiness 5–10× earlier. For intent classification (not verbatim transcription), this tradeoff is favorable.
  2. Feature extraction. Rather than running audio through a full transcription pipeline before classification, we extract acoustic and prosodic features (pitch, energy, speaking rate, pause patterns) in parallel with partial transcription. These features carry significant intent signal even before words are fully recognized.
  3. Model pruning and quantization. Our production intent classifier is a pruned version of a larger model, running at INT8 precision. Full-precision inference took 35ms per chunk — too slow for our target. Quantized inference runs in 8–12ms per chunk, with less than 2% accuracy degradation on our validation set.
  4. Edge inference. The model runs on GPU-backed edge nodes co-located with our carrier interconnects. Round-trip to a central inference cluster adds 12–25ms. Running at the edge eliminates that hop. This was the single largest latency reduction in our architecture: an 18ms improvement from moving inference closer to the source.
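
To make the chunking arithmetic in step 1 concrete, here is a small sketch using only the Python standard library. The 16kHz sample rate and 20ms chunk size come from the list above; the function names and the choice of RMS energy as the example feature are illustrative assumptions, not a description of the production feature extractor.

```python
import math
from typing import Iterator, Sequence

SAMPLE_RATE_HZ = 16_000  # 16kHz mono, as stated above
CHUNK_MS = 20            # 20ms non-overlapping chunks, as stated above
SAMPLES_PER_CHUNK = SAMPLE_RATE_HZ * CHUNK_MS // 1000  # 320 samples


def chunks(samples: Sequence[float]) -> Iterator[Sequence[float]]:
    """Yield non-overlapping 20ms chunks; a trailing partial chunk is dropped."""
    last_start = len(samples) - SAMPLES_PER_CHUNK
    for start in range(0, last_start + 1, SAMPLES_PER_CHUNK):
        yield samples[start:start + SAMPLES_PER_CHUNK]


def rms_energy(chunk: Sequence[float]) -> float:
    """Root-mean-square energy of one chunk, a simple stand-in for 'energy'."""
    return math.sqrt(sum(s * s for s in chunk) / len(chunk))


# 50ms of constant-amplitude audio: two full chunks, 10ms remainder dropped.
audio = [0.5] * 800
energies = [rms_energy(c) for c in chunks(audio)]
```

At these settings each chunk is 320 samples, so a 100ms chunk would carry 1,600 samples and a 200ms chunk 3,200, which is where the 5–10× earlier signal readiness comes from.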

False positives: the tradeoff you're actually managing

Confidence threshold | False positive rate | TTFS (p50) | Missed signals
0.70                 | 12.4%               | 38ms       | 3.2%
0.80                 | 4.1%                | 58ms       | 7.8%
0.90                 | 1.2%                | 94ms       | 18.3%
0.95                 | 0.4%                | 142ms      | 31.5%

We set 0.80 as our production threshold. At 0.70, the false positive rate (12.4%) causes too many incorrect routing decisions: an agent receiving a caller who doesn't match the intent signal is worse than a slightly delayed correct signal. At 0.90 and above, median TTFS climbs past the useful window for live routing decisions, and missed signals leave too many callers unrouted. The 0.80 setting delivers a 4.1% false positive rate and sub-60ms median TTFS (58ms), which is the right tradeoff for production-scale call centers.
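
The selection logic can be written down as a small constraint check over the measured operating points. The numbers below are the ones from the table; the 60ms latency budget is the target discussed above, while the 5% false positive cap and the names `OperatingPoint` and `pick_threshold` are assumptions made for this sketch.

```python
from typing import List, NamedTuple, Optional


class OperatingPoint(NamedTuple):
    threshold: float
    false_positive_pct: float
    ttfs_p50_ms: int
    missed_pct: float


# Measured operating points, copied from the table above.
TABLE = [
    OperatingPoint(0.70, 12.4, 38, 3.2),
    OperatingPoint(0.80, 4.1, 58, 7.8),
    OperatingPoint(0.90, 1.2, 94, 18.3),
    OperatingPoint(0.95, 0.4, 142, 31.5),
]


def pick_threshold(
    points: List[OperatingPoint],
    max_ttfs_ms: int = 60,                # latency budget from the article
    max_false_positive_pct: float = 5.0,  # assumed cap, for illustration only
) -> Optional[OperatingPoint]:
    """Among points meeting both limits, return the one missing the fewest signals."""
    eligible = [
        p for p in points
        if p.ttfs_p50_ms <= max_ttfs_ms
        and p.false_positive_pct <= max_false_positive_pct
    ]
    return min(eligible, key=lambda p: p.missed_pct, default=None)


choice = pick_threshold(TABLE)
```

With these limits only the 0.80 row qualifies: 0.70 fails the false positive cap, and 0.90 and 0.95 fail the latency budget. Tightening the latency budget far enough leaves no eligible row, in which case the function returns `None`.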
