🍷Sommelier

A Scalable Open Multi-turn Audio Pre-processing
for Full-duplex Speech Language Models

Sommelier Pipeline Overview

Overview of the Sommelier pipeline for curating podcast-style, multi-turn conversational speech suitable for full-duplex training.

Contributions

  • Scalable Pipeline for Full-Duplex SLMs: We release a scalable pipeline for curating podcast-style, multi-turn conversational speech suitable for full-duplex training, which helps alleviate the community-wide scarcity of such data.
  • High-Fidelity Overlap Processing: We provide a detailed processing strategy that explicitly handles overlapping speech through rigorous diarization analysis and reduces ASR hallucinations through parallel model ensembling and n-gram filtering.
  • Full-Duplex Fine-Tuning and Data Requirement Insights: We validate our pipeline by fine-tuning the full-duplex model Moshi on Sommelier-processed speech and analyze practical data requirements for stable full-duplex training.
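
The overlap handling and hallucination filtering named in the second contribution can be illustrated with a minimal Python sketch. This is not the Sommelier implementation: the `Segment` type, the function names, and every threshold (`n`, `max_repeats`, `min_ratio`) are hypothetical choices made for illustration, and the page does not specify the actual diarization or ASR models. The sketch assumes diarization yields per-speaker time spans and that two or more ASR systems each return a transcript for the same audio.

```python
from collections import Counter
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class Segment:
    """One diarized span (hypothetical structure, times in seconds)."""
    speaker: str
    start: float
    end: float


def find_overlaps(segments):
    """Return (start, end, speaker_a, speaker_b) for each pair of segments
    from different speakers that intersect in time."""
    ordered = sorted(segments, key=lambda s: s.start)
    overlaps = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # sorted by start: no later segment can intersect a
            if a.speaker != b.speaker:
                overlaps.append((b.start, min(a.end, b.end), a.speaker, b.speaker))
    return overlaps


def max_ngram_repeats(text, n=3):
    """Count of the most frequent word n-gram in the transcript."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return 0
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return max(counts.values())


def looks_hallucinated(text, n=3, max_repeats=2):
    """Flag transcripts where some n-gram repeats more than max_repeats times
    (assumed heuristic: looping output is a common ASR hallucination pattern)."""
    return max_ngram_repeats(text, n) > max_repeats


def transcripts_agree(hypotheses, min_ratio=0.8):
    """True if every pair of ASR hypotheses is similar at the word level."""
    token_lists = [h.lower().split() for h in hypotheses]
    return all(
        SequenceMatcher(None, a, b).ratio() >= min_ratio
        for a, b in combinations(token_lists, 2)
    )
```

A segment could then be kept only when `transcripts_agree` holds across the ensemble and `looks_hallucinated` is false, while the spans returned by `find_overlaps` mark regions that need explicit overlap handling. The acceptance rule itself is an assumption; the source does not state how the two signals are combined.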

Demo (1)

Explore the Sommelier pipeline output: original audio, separated speaker segments, and transcripts.

Original Audio

Episode: Dr. Beth Harris and Dr. Steven Zucker of Smarthistory

From Open Minds from Creative Commons · Apple Podcasts · Licensed under Creative Commons




Demo (2)

A second example from a different podcast episode.

Original Audio

Episode: Test English with Overlap (2min)

