Sommelier: A Scalable Web-Speech Pre-Processing Pipeline for Full-Duplex Speech Language Models

Contributions

Scalable Pipeline for Full-Duplex SLMs: We release a scalable pipeline for curating podcast-style, multi-turn conversational speech suitable for full-duplex training, helping to alleviate the community-wide scarcity of such data.
High-Fidelity Overlap Processing: We provide a detailed processing strategy that explicitly handles overlapping speech via rigorous diarization analysis and reduces ASR hallucinations using parallel model ensembling and n-gram filtering.
Full-Duplex Fine-Tuning and Data Requirement Insights: We validate our pipeline by fine-tuning the full-duplex model Moshi on Sommelier-processed speech and analyze practical data requirements for stable full-duplex training.

🍷Sommelier