The MICO speech-to-text pipeline

[Written by H. Björklund, J. Björklund, Y. Demeke, and A. Dahlgren]

At IBC we had many questions about the MICO speech pipeline. Its purpose is to perform speech recognition on audio and video content and output the transcription as time-stamped text, XML, or RDF. The analysis is composed of four steps: (1) audio demultiplexing, (2) speaker diarization, (3) speech transcription, and (4) an optional conversion to RDF.

The demultiplexing serves to facilitate video analysis and to down-sample the audio signal to the rate used in the transcription step. The diarization is done by LIUM and results in segmentation information along with gender classification and speaker partitioning. The speech-to-text extractor is responsible for the actual transcription, and we based it on the open-source speech recognition toolkit Kaldi. Kaldi is written in C++, which aligns well with the rest of the MICO platform. The language model for US English provided with the Kaldi toolkit has been the basis for experiments within the platform.

By default, the pipeline produces a transcription in XML format, but there are auxiliary components which translate the transcript to time-stamped text or RDF to simplify processing in downstream extractors. More information about the speech-to-text pipeline is available in the technical reports in the Publications section of this site.

Image: By Morten Bisgaard, from the book “Tidens naturlære” (1903) by Poul la Cour, Public Domain, https://commons.wikimedia.org/w/index.php?curid=1030099
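To give a feel for what one of the auxiliary conversion components does, here is a minimal sketch in Python of turning diarized, transcribed segments into time-stamped text. It is an illustration only, not the MICO implementation: the `Segment` structure, its field names, the speaker and gender labels, and the output line layout are all assumptions made for this example, and the actual MICO XML schema is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One diarized, transcribed stretch of audio (hypothetical structure)."""
    start: float     # segment start, in seconds
    duration: float  # segment length, in seconds
    speaker: str     # speaker label from diarization (assumed form, e.g. "S0")
    gender: str      # gender class from diarization (assumed form, e.g. "F")
    text: str        # transcription produced for this segment


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def to_timestamped_text(segments: List[Segment]) -> str:
    """Emit one line per segment, ordered by start time."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        begin = format_timestamp(seg.start)
        end = format_timestamp(seg.start + seg.duration)
        lines.append(f"[{begin} - {end}] {seg.speaker} ({seg.gender}): {seg.text}")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        Segment(start=2.0, duration=1.5, speaker="S1", gender="M", text="world"),
        Segment(start=0.0, duration=2.0, speaker="S0", gender="F", text="hello"),
    ]
    print(to_timestamped_text(demo))
```

Sorting by start time keeps the output readable even if segments arrive grouped by speaker rather than in chronological order, which is one plausible reason to do the conversion in a separate step after transcription.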


26 Sep
