Automatic Sync Technologies brings years of experience and expertise in the speech processing and multimedia production arenas, offering a solid foundation in the areas of speech recognition, audio and voice processing and software engineering.
Work on our core technology began in 1990 and resulted in our first-generation product, Lipsync: off-line and real-time production software that enables the automatic synchronization of a voice recording with visual media, such as animation or text.
In 2003, AST was awarded a Small Business Innovation Research (SBIR) grant by the Department of Education to investigate innovative ways to leverage technology for captioning and to create a functioning prototype of an automated closed captioning system that would facilitate access to broadcast and instructional materials for the deaf community and others who benefit from closed captioning. This proof-of-concept system was to provide a fast, accurate, and inexpensive alternative to traditional captioning, where high costs and long turnaround times have hampered compliance with federal regulations and universal access to broadcast and instructional materials.
AST developed a functioning proof-of-concept prototype of an automated, web-based captioning system that has now evolved into the CaptionSync product, in its fifth year of serving customers.
As part of this research project, AST evaluated the use of speech recognition technology. Analyses of error rates for speech recognition systems, trained stenographers, and student workers were conducted. Speech recognition products offer an inexpensive way to automate the conversion of speech into text, but the accuracy of their results varies widely. They achieve their highest-quality output when each file contains a single speaker and the system has been trained to process that speaker’s language. Training typically requires correcting the output for each speaker and investing, session after session, in the creation of a speaker profile. In addition, speech recognition engine output can be improved by adding terms to a dictionary.
When the audio to be transcribed has multiple speakers, poor audio recording quality, complicated terminology, or when the speaker has an accent, the quality of the transcript tends to deteriorate.
Table 1 depicts the typical output quality results for various forms of transcription.
|Source|Typical Error Rate|Result|
|Trained Stenographer|0.5% to 1%|No problems|
|Student transcriber|Variable|Expect to be worse than stenographer|
|Speech Rec: trained|3% to 5+%|Varies from acceptable to poor|
|Speech Rec: untrained|20% to 40%|Unintelligible|
Table 1. Error Rates By Transcriber Type
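The text does not say how the error rates in Table 1 were measured. One common convention, assumed here purely for illustration, is the word error rate: the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance. The function name and the sample sentences are hypothetical and are not taken from AST's study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: errors are counted as substitutions, deletions, and
    insertions of whole words, ignoring case. The source document does
    not specify its own error-rate definition.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up sample: one substituted word and one dropped word out of ten.
reference = "the quick brown fox jumps over the lazy dog today"
transcript = "the quick brown box jumps over lazy dog today"
print(f"{word_error_rate(reference, transcript):.0%}")
```

Under this definition, a transcript described as 80% accurate corresponds to roughly two errors in every ten reference words, as in the made-up sample above.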
Additional research was performed on comprehension rates and accuracy. Examples of an error-free document and of a transcript with an 80% accuracy rate are found in Tables 2 and 3.
Table 2. Document with no errors
Table 3. Document transcribed using speech recognition systems with 80% accuracy
Analysis of comprehension and attention focus indicates that when the error rate exceeds 10%, readers are less able to comprehend the main concepts and facts presented.
Table 4 demonstrates the impact on comprehension at different error rates.
In an academic environment, accuracy becomes even more critical as students are assessed based on the accuracy of their retention.
The economics behind using speech recognition systems to deliver accurate results indicated that when error rates were 3% or greater, the cost of repairing a bad transcript outweighed the cost of performing a transcription with a trained stenographer.
Instructional materials were delivered randomly to students: 50% received captioned videos and 50% did not. Students who watched captioned videos were more engaged, more responsive to questions about the video, and better able to make connections to their own lives. Students who received captioned video averaged a one-point GPA increase over students not exposed to captions.
The Closed Captioning Handbook, Robson, Gary, 2004
Augmenting an auditory experience with captions more than doubles the retention and comprehension levels.
Adult Literacy: Captioned Videotapes and Word Recognition. Rogner, Benjamin Michael, 1992
Adult students who used captioned video presentations progressed significantly better than those using traditional literacy techniques.
Dual coding and bilingual memory. Paivio, A., & Lambert, W. 1981. Journal of Verbal Learning & Verbal Behavior, 20, 532-539.
Dual Coding Theory postulates that visual and verbal information are processed differently and along distinct channels, with the human mind creating separate representations for the information processed in each channel. Allan Paivio conducted several studies at the University of Western Ontario.
Multi-Modal Learning: See It, Hear It, Do It, Master It. Granström, House, & Karlsson 2002, Clark & Mayer 2003
Use of two or more senses to avoid sensory overload
UC Berkeley: Costs, Culture, and Complexity: A Two-year Analysis of Technology Enhancements in a Large Lecture Course at UC Berkeley
Northwestern University: Lecture capture in higher education: a Northwestern study
University of Wisconsin: Insights regarding undergraduate preference for lecture capture
University of Western Australia: The Lectopia Service and Students with Disabilities
New York University: Study on Uses of Video in Higher Education
The Essential Higher Ed Closed Captioning Guide: An AST Whitepaper by Kevin Erler, Ph.D.