ServiceNow AI Releases AU-Harness Benchmark to Evaluate Code-Switching in Automatic Speech Recognition

ServiceNow AI has released AU-Harness, a benchmark dataset and evaluation toolkit designed to assess Automatic Speech Recognition (ASR) systems on code-switching speech. Code-switching occurs when bilingual speakers alternate between languages within a single utterance, a common behavior in international customer service centers and multilingual workplaces. Despite its real-world prevalence, conventional ASR benchmarks frequently assume a single primary language, which leads to performance degradation in practical deployments.

The benchmark evaluates model performance using four distinct language pairs that mix English with Spanish, French, and German. The datasets represent typical IT support and human resources dialogues, containing spoken utterances ranging from 12 to 40 words. Testing seven modern speech models, including OpenAI's Whisper and several Large Audio-Language Models, showed that recognition accuracy varies significantly depending on the language pair and the length of word embeddings.

Traditional commercial ASR models often struggle when forced to parse multiple languages dynamically, resulting in high Word Error Rates (WER). AU-Harness provides a standardized framework to quantify these error rates under realistic, mixed-language conditions. The benchmark gives developers and enterprise architects concrete metrics to guide the selection and fine-tuning of speech-to-text models for global applications.
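
Word Error Rate, the metric named above, is conventionally the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not AU-Harness's own code; the function name `word_error_rate`, the lowercase-and-split normalization, and the example sentence are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Invented English-Spanish utterance; two of five words are misrecognized.
print(word_error_rate(
    "please reset mi contraseña today",
    "please reset me contrasena today",
))  # 0.4
```

Real evaluations usually apply heavier text normalization (punctuation, numerals, diacritics) before scoring, and that choice can shift WER noticeably on mixed-language transcripts.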

Comparison

| Aspect | Conventional ASR benchmarks | AU-Harness |
|---|---|---|
| Language Assumption | Assumes a single, pre-declared primary language for the audio stream | Accommodates dynamic language switching mid-utterance (code-switching) |
| Evaluation Context | General-purpose read speech or monolingual conversational datasets | Domain-specific enterprise dialogues (IT support and Human Resources) |
| Performance Metric Focus | Standard global Word Error Rate (WER) | WER variations analyzed across specific language pairs and embedding lengths |
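
To illustrate the last row, a per-language-pair breakdown can be computed by pooling edit errors and reference word counts within each pair before dividing. This is a hedged sketch rather than the toolkit's implementation: the function `wer_by_language_pair`, the pair labels, and all counts below are hypothetical and are not benchmark results.

```python
from collections import defaultdict


def wer_by_language_pair(utterances):
    """Pool errors and reference words per language pair, then divide.

    `utterances` is an iterable of (language_pair, edit_errors, reference_words)
    tuples, where edit_errors is the word-level edit distance for one utterance.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for pair, edit_errors, reference_words in utterances:
        errors[pair] += edit_errors
        words[pair] += reference_words
    # Skip pairs with no reference words to avoid dividing by zero.
    return {pair: errors[pair] / words[pair] for pair in words if words[pair] > 0}


# Hypothetical counts for illustration only.
sample = [
    ("en-es", 3, 20),
    ("en-es", 1, 20),
    ("en-fr", 6, 30),
]
print(wer_by_language_pair(sample))  # {'en-es': 0.1, 'en-fr': 0.2}
```

Pooling counts weights long utterances more heavily than averaging per-utterance WER does, so the two approaches can disagree when utterance lengths vary.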
Source: Hugging Face Blog
This page summarizes the original source. Check the source for full details.
