Speech recognition is a key technological enabler in multimedia information retrieval, and a prioritised requirement in the MICO use cases. The platform offers an advanced speech processing pipeline that inputs content items with audio parts, and outputs transcriptions in various formats.

The first extractor in the pipeline is audio demultiplexing, which downsamples the audio signal to match the sample rate used in the transcription step. The resulting signal is at 8 kHz, commonly known as ‘telephone speech’. The second step is diarization, which is mainly done with LIUM. Its purpose is to divide the signal into smaller segments, which are easier to process and also open the way for parallelization. Although this is currently not in use in MICO, LIUM is also capable of performing speaker identification and gender detection. The central step in the pipeline is the speech-to-text extractor.
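To illustrate why segmentation opens the way for parallelization, here is a minimal Python sketch. It is not the MICO or LIUM code: the segment boundaries are assumed to come from a diarization step, and the names `slice_segments`, `transcribe_stub` and `transcribe_in_parallel` are hypothetical. The stub only reports the duration of each segment, where a real pipeline would call the speech-to-text extractor.

```python
from concurrent.futures import ThreadPoolExecutor

SAMPLE_RATE_HZ = 8000  # the 8 kHz rate mentioned in the text


def slice_segments(samples, boundaries_s, sample_rate=SAMPLE_RATE_HZ):
    """Cut a mono sample sequence into segments, given (start, end) times in seconds."""
    return [
        samples[int(round(start * sample_rate)):int(round(end * sample_rate))]
        for start, end in boundaries_s
    ]


def transcribe_stub(segment):
    """Hypothetical stand-in for a speech-to-text call; reports the duration only."""
    return f"<{len(segment) / SAMPLE_RATE_HZ:.2f}s of audio>"


def transcribe_in_parallel(segments, worker=transcribe_stub, max_workers=4):
    """Process segments concurrently; pool.map keeps the results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, segments))


# Three seconds of silence, split at assumed (made-up) segment boundaries.
samples = [0] * (3 * SAMPLE_RATE_HZ)
boundaries = [(0.0, 1.0), (1.0, 2.5), (2.5, 3.0)]
segments = slice_segments(samples, boundaries)
results = transcribe_in_parallel(segments)
```

Because each segment is handled independently, the worker function could be swapped for a call to any transcription backend without changing the surrounding logic.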

The current implementation is based on the open-source library Kaldi, and we are about to release an alternative based on Microsoft Speech. The transcription is done using neural nets and Gaussian mixture models. Although Kaldi supports deeper forms of analysis, we use only online decoding and thus trade some precision for real-time performance. The last step in the pipeline produces a transcription in XML format, and there are auxiliary components to translate this into plain text or RDF.
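The translation from XML to plain text can be sketched as follows. The XML schema used here (the `transcription`, `segment` and `word` elements and their attributes) is purely an assumption for illustration, since the actual format is not described in this text, and `xml_to_plain_text` is a hypothetical helper rather than one of the MICO components.

```python
import xml.etree.ElementTree as ET

# Hypothetical transcription document; the real XML schema may differ.
EXAMPLE_XML = """
<transcription>
  <segment start="0.00" end="1.20" speaker="S0">
    <word>hello</word>
    <word>world</word>
  </segment>
  <segment start="1.20" end="2.05" speaker="S1">
    <word>good</word>
    <word>morning</word>
  </segment>
</transcription>
"""


def xml_to_plain_text(xml_string):
    """Join the words of each segment into one line of plain text."""
    root = ET.fromstring(xml_string.strip())
    lines = []
    for segment in root.iter("segment"):
        words = [w.text.strip() for w in segment.iter("word") if w.text]
        lines.append(" ".join(words))
    return "\n".join(lines)


plain = xml_to_plain_text(EXAMPLE_XML)
```

A converter to RDF would walk the same tree but emit triples instead of lines of text.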

A major challenge in our work is multilingual support. Since Kaldi is a relatively new library, fewer language models are available than for, e.g., CMU Sphinx. Furthermore, since training these language models requires access to large corpora, a good deal of computational power, and manual labour, new models are slow to appear. We address this by, on the one hand, training our own language models, and on the other, wrapping commercial extractors such as Bing, which are known to have broad language support.


Text: H. Björklund, J. Björklund, A. Dahlgren, and Y. Demeke

Photograph: By MaeLRie – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=31467975