MICO is all about cross-media analysis, and this includes, as a subtask, automatic speech recognition (ASR) in video. For this reason, we are interested in the capabilities of different ASR libraries. Here follows a short account of an evaluation, conducted in collaboration with CodeMill Ltd., of two open-source libraries, PocketSphinx and Kaldi, and two proprietary libraries, Nuance Dragon and VoxSigma.
The task is to obtain transcriptions of arbitrary speech recordings. This application of speech recognition software is called large-vocabulary continuous-speech recognition (LVCSR), as opposed to, for example, voice control or keyword search, which place very different requirements on the software.
Description of libraries
PocketSphinx is one of a family of software libraries, CMU Sphinx, from Carnegie Mellon University. It is written in C and works with acoustic and language models that can be downloaded for free for US English; less reliable models exist for a few other languages. As its name suggests, PocketSphinx requires little memory and storage space (around 50 MB), and it can be installed on most platforms with few dependencies.
Kaldi is a C++ library that was originally designed for speech researchers but is now starting to be used in transcription applications. Kaldi aims to be open-ended enough to support different algorithms; a recent addition is a neural-network library, an approach that is currently believed to be the state of the art. In contrast to PocketSphinx, the model files for Kaldi are large (just under 1 GB), and its installation requires the numerical library ATLAS. Kaldi used to be supported on Windows, but is currently only guaranteed to build on Linux.
Nuance Dragon is commercial software designed primarily for dictation. As such, it comes with extensive functionality for training speaker profiles. It ships with models for 6 languages and a large number of accents, and it has an API in C++, C# and Visual Basic.
VoxSigma from Vocapia is geared toward transcription, and its home page lists 16 supported languages. In addition to transcription, it performs speaker segmentation and language recognition. VoxSigma is available either as a web-based API or as a local installation on Linux.
PocketSphinx has a fairly well-documented API, and its functions can be called either as C++ library calls or as standalone programs. Because it has few dependencies, we found it easy to install and integrate on Linux, Windows and OSX.
Kaldi’s extensive API documentation is hard to digest for the non-expert, but once it is understood, the functionality can be accessed either as library calls or as standalone programs, just as with PocketSphinx. We have built the current version of Kaldi on Linux, and we know that it can be built on OSX. However, we do not know how much effort it takes to build the current version on Windows, since that platform is no longer supported.
We tested all libraries against a test suite consisting of approx. 70 minutes of speech from videos freely available on YouTube, for which there existed official transcripts. The numbers for the word accuracy rate (WACC) are shown in Table 1. We have included only audio with good sound quality in the comparison. The averages are taken over the files, without adjusting for their different lengths.
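As an illustration of the metric, the following is a minimal Python sketch of how a word accuracy rate and the unweighted per-file average could be computed. It assumes the common definition WACC = 1 - (S + D + I) / N, where S, D and I are the word substitutions, deletions and insertions needed to turn the reference transcript into the recognizer output, and N is the number of words in the reference. The function names and the simple whitespace tokenization are assumptions made for this example; they do not describe the scoring tool that was actually used in the evaluation.

```python
def tokenize(text):
    """Lowercase and split on whitespace (an assumed, simplified normalization)."""
    return text.lower().split()


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Word accuracy in percent; can be negative if there are many insertions."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    return 100.0 * (1.0 - edit_distance(ref, hyp) / len(ref))


def average_over_files(scores):
    """Unweighted mean: every file counts equally, regardless of its length."""
    return sum(scores) / len(scores)
```

With this averaging, a short clip influences the average exactly as much as a long one, which matches the statement that the averages are not adjusted for file length.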
The obvious conclusion to be drawn from Table 1 is that, with the chosen settings, VoxSigma outperforms all other libraries in terms of WACC; that Nuance Dragon and Kaldi perform about equally well on most videos, and finally that Pocketsphinx has the worst WACC.
The one test case that breaks the pattern is cameron_mandela, which is spoken in a British accent. We speculate that PocketSphinx can more easily accommodate different accents, whereas Kaldi (with the chosen model) performs better on US English specifically. We have not yet tried to train a specific British-English model for Kaldi, which could improve the result.
In conclusion, when choosing a software library, the higher average WACC for Kaldi must be weighed against its disadvantages in ease of implementation, portability, and storage space consumption. For the commercial libraries, we conclude that Nuance Dragon is probably too narrowly geared toward dictation to be of good use for transcription.
|Conan O’Brien and Mila Kunis||36,1||34,3||59,5||59,9|
|Average over files||61,9||71,4||84,0||67,2||85,9|
Table 1. Word accuracy rates (WACC) for CodeMill’s test suite of videos decoded by four speech engines. Default US English language models were used in all cases.
The sample set was as follows:
- Amanda Burden (2014) http://youtu.be/j7fRIGphgtk
- Barack Obama (2009) http://youtu.be/-1ljmtaibC4
- George Takei (2014) http://youtu.be/LeBKBFAPwNc
- A.J. Jacobs (2014) http://youtu.be/2_lBiFZ85d0
- Conan O’Brien and Mila Kunis (2012) http://youtu.be/BCCypss2HDc
- David Cameron (2013) http://youtu.be/61Zp4T_b9fs