12th November, 2025
The dataset is publicly available on 🤗: soketlabs/CoSHE-Eval
Automatic Speech Recognition (ASR) research in India has evolved rapidly, yet the majority of existing benchmarks remain monolingual. For instance, datasets such as Vistaar focus exclusively on Hindi (Devanagari script), providing limited insight into multilingual and mixed-language speech performance.
However, real-world Indian conversations are rarely monolingual. Code-switching—alternating between two or more languages within a single utterance—is a natural linguistic behaviour across India. In particular, Hindi–English mixing (Hinglish) is deeply integrated into urban and semi-urban communication patterns.
This behaviour introduces significant challenges for ASR models, including frequent intra-sentential language switches, mixed vocabularies, and ambiguity between Devanagari and Roman-script transcription.
Despite these realities, there has been no standardised benchmark for evaluating ASR models on code-switched speech. To bridge this gap, we developed CoSHE-Eval, a Hindi–English code-switching benchmark curated through a hybrid pipeline that combines multimodal transcription (Gemini 2.5 Pro, Thinking Mode) with human verification to produce high-fidelity transcriptions.
The dataset construction process followed a three-stage pipeline: (1) audio curation and segmentation, (2) automated transcription with Gemini 2.5 Pro, and (3) human verification.
Each record in the ground-truth table contains:
| Column | Description |
|---|---|
| audio_file_name | Unique identifier for each audio sample |
| audio | Path or URI of the audio clip |
| transcription | Final verified bilingual transcription, including inline tags and metadata |
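For readers who want to inspect these fields directly, a minimal loading sketch using the 🤗 `datasets` library is shown below (the split name is an assumption; check the dataset card for the exact configuration):

```python
from datasets import load_dataset

# Load CoSHE-Eval from the Hugging Face Hub.
# The split name "test" is an assumption; see the dataset card for the actual splits.
ds = load_dataset("soketlabs/CoSHE-Eval", split="test")

sample = ds[0]
print(sample["audio_file_name"])  # unique identifier for the clip
print(sample["transcription"])    # verified bilingual transcription with inline tags
print(sample["audio"])            # decoded audio (or path/URI, depending on the feature type)
```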
Audio samples were curated from publicly available sources spanning diverse speech contexts.
Each clip was segmented into chunks of up to roughly 60 seconds (maximum 59.8 s, median ≈57 s).
Note: No speaker demographic balancing (age, gender, accent) was performed; this was intentional, to preserve the natural variability of the source material.
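The exact curation tooling is not part of the release. As a rough approximation of this chunking step, the sketch below uses `pydub` to split on silences (so that utterances are not cut mid-sentence) and then merges pieces up to a ~59-second cap; all thresholds here are illustrative assumptions, not the pipeline's actual parameters.

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

MAX_MS = 59_000  # ~59 s cap, matching the benchmark's maximum segment length

def chunk_audio(path: str) -> list[AudioSegment]:
    """Split an audio file on silences, then greedily merge pieces up to ~59 s."""
    audio = AudioSegment.from_file(path)
    # Silence length and threshold are guesses; tune them per source.
    pieces = split_on_silence(audio, min_silence_len=500,
                              silence_thresh=audio.dBFS - 16, keep_silence=100)

    chunks, current = [], AudioSegment.empty()
    for piece in pieces:
        if len(current) > 0 and len(current) + len(piece) > MAX_MS:
            chunks.append(current)
            current = AudioSegment.empty()
        current += piece  # a single piece longer than the cap is kept whole here
    if len(current) > 0:
        chunks.append(current)
    return chunks
```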
Initial transcriptions were generated using Gemini 2.5 Pro (Thinking Mode) — a large multimodal reasoning model by Google DeepMind with advanced multilingual understanding. A custom system prompt was engineered to capture bilingual structure, tone, and contextual meaning.
The model was instructed to transcribe each clip bilingually and to embed inline emotion and tone labels in square brackets (e.g., [calm], [sarcastically]). This design enabled consistent multilingual fidelity while capturing the prosodic and emotional character of each utterance.
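The production prompt itself is not reproduced here. Purely as an illustration, a simplified transcription call through the `google-genai` SDK with a system instruction might look like the sketch below; the prompt text, model identifier, and file-upload flow are assumptions rather than the exact pipeline configuration.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

# Abbreviated, illustrative system prompt; the actual CoSHE-Eval prompt is more detailed.
SYSTEM_PROMPT = (
    "Transcribe this Hindi-English code-switched audio faithfully, keeping each "
    "word in the language it was spoken. Insert inline tone/emotion tags in "
    "square brackets, e.g. [calm], [sarcastically]."
)

audio_file = client.files.upload(file="sample_clip.wav")  # hypothetical local clip

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[audio_file, "Transcribe this clip."],
    config=types.GenerateContentConfig(system_instruction=SYSTEM_PROMPT),
)
print(response.text)
```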
Following automated transcription, each entry was subjected to manual review by human annotators to ensure precision and consistency.
Annotators verified each transcription against the source audio, checking the bilingual text, inline tags, and timestamps.
This human-in-the-loop process produces a high-confidence ground truth suitable for benchmarking fine-grained ASR performance.
To capture paralinguistic nuances, Gemini was prompted to embed inline emotion and tone labels in square brackets.
| Category | Example Tags |
|---|---|
| Voice Delivery Styles | [excited], [whispering], [shouting], [rushed] |
| Emotional States | [nervous], [frustrated], [cheerfully], [calm] |
| Narrative / Structural Cues | [pause], [reflective], [dramatic tone] |
| Character Styles | [sarcastically], [matter-of-fact], [playfully] |
Although not used for the ASR benchmark itself, these enrichments make the dataset suitable for emotion-aware speech models and speech-to-text sentiment analysis.
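When scoring ASR output against the ground truth, these inline tags can simply be stripped beforehand; a minimal sketch:

```python
import re

TAG_PATTERN = re.compile(r"\[[^\]]*\]")  # matches inline tags such as [calm] or [dramatic tone]

def strip_tags(transcription: str) -> str:
    """Remove inline emotion/tone tags and collapse leftover whitespace."""
    return re.sub(r"\s+", " ", TAG_PATTERN.sub("", transcription)).strip()

print(strip_tags("[excited] ये तो कमाल है yaar, [pause] totally unexpected!"))
# -> "ये तो कमाल है yaar, totally unexpected!"
```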
| Attribute | Value |
|---|---|
| Total Samples | 1985 |
| Total Duration | ~30 hours |
| Minimum Segment Length | 0.60 seconds |
| Maximum Segment Length | 59.8 seconds |
| Mean Segment Length | 53.3 seconds |
| Median Segment Length | 56.9 seconds |
| Timestamp Validation | Incremental and aligned with audio duration |
| Speaker Segmentation | Maintains full utterances; no mid-sentence cuts |
| Accent Labels | Haryanvi, UP Hindi, South Indian, Urban North Indian (where inferable) |
| Metadata Fields | speaker_id, gender, confidence_score, pitch, pace |

CoSHE-Eval fills a critical gap in ASR evaluation for bilingual Indian speech. By integrating multimodal transcription, emotion tagging, and human verification, it provides a high-quality benchmark for assessing real-world code-switching performance.
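As an illustration of how the benchmark might be consumed, the sketch below computes WER with `jiwer` after stripping the inline tags; the `transcribe` function is a placeholder for whichever ASR system is under evaluation, and the split name is again an assumption.

```python
import re
import jiwer
from datasets import load_dataset

def strip_tags(text: str) -> str:
    # Remove inline [tag] annotations before scoring (same helper as above).
    return re.sub(r"\s+", " ", re.sub(r"\[[^\]]*\]", "", text)).strip()

def transcribe(audio) -> str:
    # Placeholder: plug in the ASR system being evaluated here.
    raise NotImplementedError

ds = load_dataset("soketlabs/CoSHE-Eval", split="test")  # assumed split name

references = [strip_tags(sample["transcription"]) for sample in ds]
hypotheses = [transcribe(sample["audio"]) for sample in ds]

print("WER:", jiwer.wer(references, hypotheses))
```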
In upcoming releases, we plan to extend the benchmark further.
Through these efforts, we aim to standardise evaluation of multilingual ASR models and foster open, reproducible research in Indian language technologies.
This dataset is distributed under a Research-Only, Non-Commercial License.
By accessing or using the dataset, you acknowledge and agree to the terms of this license.