
Speech-to-text (STT) performance benchmark on diverse name set

We benchmarked current STT models to see how they perform

Ibrahim Cotran, Christopher Zheng

Sep 23, 2025

Speech-to-text (STT) services often struggle with short audio snippets (e.g., numbers, “yes”, “no”) and with the long tail of names, because such inputs lack the surrounding context that normally improves transcription accuracy. We ran a targeted benchmark of three STT engines on 97 short audio clips to compare accuracy. Here are the results.

How we tested

  • Content: 97 single-utterance clips with a diverse set of names and numbers (phone numbers, ZIP codes, and dates).

  • Scoring: Measured edit distance with leading, trailing, and repeated spaces removed.

    • Exact (edit distance = 0), Close (0 < edit distance < 2), or Neither (edit distance >= 2)

  • Models tested: Google STT V2, Google STT V2 Chirp 2, and Deepgram Nova-3

All were run with default settings; no phrase lists or language hints were provided.

Edit distance was compared against human transcription of the audio clips. 
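The scoring rule above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's actual code: it assumes a character-level Levenshtein distance and a case-sensitive comparison, neither of which is specified above, and the function names `normalize`, `edit_distance`, and `classify` are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Strip leading/trailing whitespace and collapse whitespace runs to one space."""
    return re.sub(r"\s+", " ", text.strip())


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (assumed metric; not confirmed by the post)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution (or match)
            ))
        prev = curr
    return prev[-1]


def classify(hypothesis: str, reference: str) -> str:
    """Bucket a transcription as 'exact', 'close', or 'neither' per the thresholds above."""
    d = edit_distance(normalize(hypothesis), normalize(reference))
    if d == 0:
        return "exact"
    if d < 2:
        return "close"
    return "neither"
```

For example, `classify("8001", "80001")` falls in the "close" bucket because a single inserted digit separates the two strings.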

Results

From the edit distances, we computed, for each model, the percentage of transcriptions that were exact matches and the percentage that were either exact or close matches.

STT Results

** indicates best performing
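The two percentages described above can be computed from per-clip labels as in the sketch below. The `summarize` function and the label strings are hypothetical, and the labels in the usage note are made up for illustration; they are not benchmark data.

```python
from collections import Counter


def summarize(labels: list[str]) -> dict[str, float]:
    """Return the exact-match and exact-or-close percentages for one model.

    `labels` holds one of 'exact', 'close', or 'neither' per clip.
    """
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = len(labels)
    return {
        "exact_pct": 100 * counts["exact"] / total,
        "exact_or_close_pct": 100 * (counts["exact"] + counts["close"]) / total,
    }
```

With four illustrative labels, `summarize(["exact", "exact", "close", "neither"])` gives 50.0 for `exact_pct` and 75.0 for `exact_or_close_pct`.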

Key patterns observed

Names are still hard. While many models have improved with names like “Siobhan”, names such as “Min-seo Kim” or “Mai Pham” are consistently challenging for STT services to transcribe accurately. 

Numbers are much improved in modern models. Across phone numbers, ZIP codes, and dates, engines were solid, but we still see occasional issues with repeated digits (e.g., 80001 was transcribed as 8001). 

Lessons learned

Deepgram Nova-3 was consistently the best STT engine across names and numbers, but both Google options were dependable for numbers. Either way, it’s probably worth implementing validation with simple rules (e.g., regex for phone number formats, ZIP code length, and sensible date ranges). On cost, Deepgram Nova-3 is priced at $0.0077 per minute, while Google STT V2 costs $0.024 per minute and Google STT V2 Chirp 2 costs $0.016 per minute, which makes Nova-3 roughly one-third and one-half the price of the two Google options, respectively.
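As a rough illustration of the simple validation rules suggested above, the sketch below checks transcribed phone numbers, ZIP codes, and dates. It assumes US-style formats (10-digit phone numbers, 5-digit ZIP codes with an optional +4 suffix) and an arbitrary default date window. The helper names and bounds are hypothetical and would need adjusting for a real deployment.

```python
import re
from datetime import date
from typing import Optional

# Assumed US formats: e.g. "(555) 123-4567", "555-123-4567", "5551234567".
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
# Five digits, optionally followed by a ZIP+4 suffix.
ZIP_RE = re.compile(r"\d{5}(-\d{4})?")


def is_valid_phone(text: str) -> bool:
    """True if the whole string looks like a 10-digit US phone number."""
    return PHONE_RE.fullmatch(text.strip()) is not None


def is_valid_zip(text: str) -> bool:
    """True if the whole string is a 5-digit ZIP code or ZIP+4."""
    return ZIP_RE.fullmatch(text.strip()) is not None


def is_sensible_date(
    value: date,
    earliest: date = date(1900, 1, 1),  # assumed lower bound
    latest: Optional[date] = None,      # defaults to today
) -> bool:
    """True if the date falls inside the accepted window."""
    if latest is None:
        latest = date.today()
    return earliest <= value <= latest
```

A check like `is_valid_zip("8001")` returns `False`, so a dropped repeated digit such as the 80001 case mentioned earlier could be flagged for re-prompting rather than accepted silently.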

