Automated Tools to Wreck A Nice Beach
An Overview of Automatic Speech Recognition
Automated speech recognition has been a work in progress for decades. The companies that took an early leadership position in the field in the 1950s proclaimed that the problem would be solved within five years. Similar proclamations have been made ever since, but nearly 60 years later, modern speech recognition systems are still outperformed by a typical three-year-old child.
This is not to say we have not made progress. Speech recognition research has come a long way, and the technology is commercially viable for a number of important applications -- subject to a couple of important constraints: the recognizer must be "trained" for a particular talker, and/or must work from a tightly constrained vocabulary. While automated speech recognition for captioning is a compelling idea, captioning typically satisfies neither of these constraints.
Speech Recognition-Based Captioning Products Continue to Appear
Despite this, every few years someone brings a speech-recognition-based captioning product to market. The most recent to make a splash is Google's system for captioning YouTube videos. Manufacturers can stage very impressive demos with these systems, and while this makes great fodder for the press, the demos are not indicative of the performance you will actually get when captioning your own videos.
Accuracy is Critical for Accessibility
On a typical captioning task (where neither the talker nor the vocabulary is constrained), the error rates of today's speech recognition systems exceed 20%. To put this in perspective, readers report that error rates above 3% significantly degrade the intelligibility of text, and by the time the error rate reaches 10% they are unable even to discern the topic being discussed (see our Research). To see what this looks like, try this video with auto-captions enabled. For the full experience, turn the sound off and see if you can follow the content.
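The error-rate figures above are word error rates (WER): the number of word substitutions, insertions, and deletions needed to turn the recognizer's output into the reference transcript, divided by the length of the reference. A minimal sketch of the standard computation (word-level Levenshtein distance) is below; the example transcripts are hypothetical, not taken from any real recognizer.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of words in the reference, computed via
    a word-level Levenshtein edit-distance table."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming table: d[i][j] is the edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution in a six-word reference: WER ~16.7%, already well past
# the 10% level at which readers report losing the topic entirely.
print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

Note that because insertions count against the hypothesis, WER can exceed 100%: scoring the title's "wreck a nice beach" against the reference "recognize speech" yields a WER of 200%.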
The "something is better than nothing" argument is often put forward for these tools, but at best it holds only for low-stakes content. The accuracy typical of speech-to-text tools may be adequate for entertainment videos or keyword searching; for higher-stakes content, such as academic material, it has met neither legal accessibility guidelines nor the standard needed to rely on the output to deliver an education.
Government agencies, academic institutions, and corporations that do not want their message distorted by errors, do not want to risk civil rights lawsuits, and do not want to compromise academic integrity need a solution that ensures the best possible output.
AST's Commitment to Quality Captioning
At AST, we believe that speech recognition technology is still not sufficient to provide a quality result to your viewers; captioning with a 20%+ error rate may provide comic relief, but it offers nothing in the way of accessibility. We make extensive use of automation to keep costs low, but we believe this must be done without compromising the quality of the end result.
(About the title: "Wreck a nice beach" is acoustically very similar to "Recognize speech" and represents the type of recognition error that an automated system could easily make but a human transcriber is unlikely to make.)