Latency
Latency — The time delay between a user action and the system's response. The dominant felt-quality metric for consumer apps; affects everything from autocomplete to AI chat to smart-lock unlock time.
What latency is, in user-facing terms
Latency is the time between when you do a thing and when the app shows you the response. Tap a button, the app updates in 80ms — that’s the latency. Type a query in a chatbot, the first token streams back in 600ms — that’s the latency. Press your finger on a smart-lock keypad, the bolt retracts in 850ms — that’s the latency.
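In a web app, one rough way to capture that number yourself is to timestamp the input event and then wait for the frame that paints the response. A minimal sketch, assuming a hypothetical button id and update function:

```ts
// Minimal sketch: tap-to-paint latency in a web UI. The element ids and
// applyUpdate() are hypothetical stand-ins for a real app's update path.
function applyUpdate(): void {
  document.getElementById("status")!.textContent = "Saved";
}

document.getElementById("save-button")!.addEventListener("click", (event) => {
  const start = event.timeStamp; // when the browser registered the tap

  applyUpdate(); // queue the DOM change for the response

  // Two nested requestAnimationFrame callbacks approximate the moment the
  // frame containing the update has actually been painted.
  requestAnimationFrame(() => {
    requestAnimationFrame(() => {
      const latencyMs = performance.now() - start;
      console.log(`tap-to-paint: ${Math.round(latencyMs)}ms`);
    });
  });
});
```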
Latency is the dominant felt-quality metric for consumer apps. A product can be feature-complete, accurate, and well-designed; if its latency is bad, it feels bad to use. The reverse is rarely true — fast and ugly beats slow and pretty for daily-use products.
The thresholds that matter
Researchers have measured the latency thresholds at which user experience degrades:
- <100ms: the response feels instantaneous. The user perceives no delay.
- 100-300ms: the response feels fast but not instant. Acceptable for most interactions.
- 300-1000ms: the user notices the delay. They may start to lose flow on rapid interactions.
- >1000ms: the user begins to wonder if the app is broken. Trust degrades.
These thresholds matter especially for autocomplete, AI chat, and any interaction where the user is actively waiting on a response.
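If you log latencies in production, one way to make these bands actionable is to tag each measurement with the band it landed in. A small sketch; the band names are illustrative, not a standard:

```ts
// Sketch: mapping a measured delay onto the perceptual bands above,
// e.g. for tagging requests in monitoring dashboards.
type LatencyBand = "instant" | "fast" | "noticeable" | "broken-feeling";

function classifyLatency(ms: number): LatencyBand {
  if (ms < 100) return "instant";      // no perceived delay
  if (ms < 300) return "fast";         // fast but not instant
  if (ms < 1000) return "noticeable";  // user notices, flow suffers
  return "broken-feeling";             // user wonders if the app is broken
}

console.log(classifyLatency(80));  // "instant"
console.log(classifyLatency(850)); // "noticeable"
```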
The mean isn’t the right metric
Latency is usually measured in two ways. The mean (or median) summarizes the typical response time. The 95th percentile (p95) is the response time that only 5% of requests exceed. For user experience, p95 is the more important metric.
The reason: a product whose mean latency is 200ms but p95 is 4 seconds feels worse than a product with mean 400ms and p95 600ms. The 5% of slow responses dominate user perception because they’re the ones the user notices and remembers. “It usually works fast but sometimes it just hangs” is a worse UX than “it’s never instant but it’s always reasonable.”
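A small sketch of the arithmetic, using synthetic samples (the numbers are illustrative, not measurements): a distribution with a lower mean can still have a far worse p95.

```ts
// Sketch: why the mean can mislead about felt latency.
function mean(samples: number[]): number {
  return samples.reduce((sum, x) => sum + x, 0) / samples.length;
}

// Nearest-rank percentile: the value that p% of samples are at or below.
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.min(sorted.length, Math.max(1, Math.ceil((p / 100) * sorted.length)));
  return sorted[rank - 1];
}

// 92 fast responses plus 8 multi-second stalls: decent mean, ugly tail.
const spiky = [...Array(92).fill(120), ...Array(8).fill(3500)];
// Uniformly "never instant, always reasonable".
const steady = Array(100).fill(450);

console.log(mean(spiky), percentile(spiky, 95));   // 390.4, 3500
console.log(mean(steady), percentile(steady, 95)); // 450, 450
```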
We weight tail latency (p95, p99) explicitly in the AI tools verdicts. Claude Code’s mean response time is competitive with Cursor’s; the tail-latency profile is what makes it feel different.
Why this is hard for AI products specifically
AI inference has high latency variance compared to traditional CRUD applications. The same prompt can take 800ms or 8 seconds depending on the load on the model server, the length of the response, and the routing. This makes p95 latency a more honest metric than the mean for any AI product; marketing materials that quote average response times often hide tail performance that is much worse.
Streaming responses (where the AI sends tokens as they’re generated rather than waiting for the full response) are a partial fix. First-token latency becomes the metric that matters for perceived responsiveness; total response time still matters for completion. Most modern AI products stream by default.
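A minimal sketch of measuring both numbers on the client, assuming a hypothetical streaming endpoint and request body, using the standard fetch ReadableStream API:

```ts
// Sketch: first-token latency vs. total time for a streaming AI response.
// The URL and JSON body are hypothetical; the pattern assumes the server
// streams the response body as it generates tokens.
async function measureStreaming(url: string, prompt: string) {
  const start = performance.now();
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, stream: true }),
  });

  const reader = res.body!.getReader();
  let firstTokenMs: number | null = null;

  // Read chunks as they arrive; the first chunk is what the user feels.
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    if (firstTokenMs === null && value && value.length > 0) {
      firstTokenMs = performance.now() - start;
    }
  }

  const totalMs = performance.now() - start;
  return { firstTokenMs, totalMs };
}
```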
Where latency comes from
Three sources, in order of most-controllable to least:
- The application code. Inefficient code in your app’s hot paths is the most common source of avoidable latency. This is what good engineering teams optimize.
- The network. Round-trip time to the server, packet loss, congestion. Limited by physics and network quality.
- The backend service. Database queries, third-party API calls, AI inference. Often the largest contributor.
For consumer apps that are PWAs or web-based, network latency dominates. For native apps with local processing (PlateLens’s photo recognition, Apple Watch on-device ML), the latency is mostly the application code and the local compute. For AI products, the inference latency dominates.
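One way to see which source dominates for a given request is the browser’s Resource Timing API, which splits a fetch into DNS, connection, server wait, and download phases. A sketch, assuming the resource exposes its timings (same-origin, or served with a Timing-Allow-Origin header):

```ts
// Sketch: attributing one request's latency with the Resource Timing API.
// The URL is a hypothetical placeholder.
async function breakDownLatency(url: string) {
  await fetch(url);

  const [entry] = performance.getEntriesByName(url) as PerformanceResourceTiming[];
  if (!entry) return;

  console.log({
    dnsMs: entry.domainLookupEnd - entry.domainLookupStart,
    connectMs: entry.connectEnd - entry.connectStart,
    // Request sent to first response byte: network round trip plus whatever
    // the backend (database, third-party APIs, inference) took.
    waitingMs: entry.responseStart - entry.requestStart,
    downloadMs: entry.responseEnd - entry.responseStart,
  });
}
```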
Related concepts
For the machine-learning models whose inference latency is often the dominant component of AI-product response times, see machine learning. For the PWA model and its latency trade-offs vs. native, see PWA.