Speech recognition is the interdisciplinary subfield of computational linguistics that develops methodologies and technologies that enable computers to recognize spoken language and translate it into text.
Types of Speech Recognition
- Speaker Dependent
- Speaker Independent
Speaker Dependent: Some speech recognition systems require “training” where an individual speaker reads text or isolated vocabulary into the system. The system analyzes the person’s specific voice and uses it to fine-tune the recognition of that person’s speech, resulting in increased accuracy.
Speaker Independent: Systems that do not use training are called “speaker independent” systems.
Techniques Used for Speech Recognition
In the early 2000s, speech recognition was still dominated by traditional approaches such as Hidden Markov Models combined with feedforward artificial neural networks.
Today, however, many aspects of speech recognition have been taken over by a deep learning method called long short-term memory (LSTM), a type of recurrent neural network (RNN).
LSTM RNNs avoid the vanishing gradient problem and can learn “Very Deep Learning” tasks that require memories of events that happened thousands of discrete time steps ago, which is important for speech.
Around 2007, LSTM trained by Connectionist Temporal Classification (CTC) started to outperform traditional speech recognition in certain applications.
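As a rough illustration of what CTC adds, a CTC-trained network emits one label per audio frame, including a special "blank" label, and the transcript is read off by merging consecutive repeated labels and then dropping the blanks. The sketch below shows only that collapsing rule, not the training procedure; the function name ctc_collapse and the use of "_" as the blank symbol are assumptions made for this example.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol chosen for this sketch


def ctc_collapse(frame_labels, blank=BLANK):
    """Turn a per-frame label sequence into a transcript using the CTC rule."""
    # Step 1: merge runs of identical consecutive labels into one label.
    merged = [label for label, _ in groupby(frame_labels)]
    # Step 2: drop the blank label, which only separates real labels.
    return "".join(label for label in merged if label != blank)


if __name__ == "__main__":
    # The blank between the two "l" runs is what keeps the double letter.
    print(ctc_collapse("hh_e_ll_lloo"))
```

Because repeats are merged before blanks are removed, a doubled letter survives only when a blank separates its two runs, which is why the blank label is needed at all.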
In 2015, Google’s speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which is now available through Google Voice to all smartphone users.
In contrast to the steady incremental improvements of the past few decades, the application of deep learning decreased word error rate by 30%.
Benefits of Speech Recognition Systems
For individuals who are Deaf or Hard of Hearing, speech recognition software is used to automatically generate closed captioning of conversations such as discussions in conference rooms, classroom lectures, and religious services.
It is also very useful for people who have difficulty using their hands, ranging from mild repetitive stress injuries to disabilities that preclude using conventional computer input devices.
Individuals with learning disabilities who have problems with thought-to-paper communication (essentially, they think of an idea but it is processed incorrectly, causing it to end up differently on paper) may also benefit from the software.
Performance of Speech Recognition Systems
- The performance of speech recognition systems is usually evaluated in terms of accuracy and speed.
- Accuracy is usually rated with word error rate (WER), whereas speed is measured with the real-time factor (RTF).
- Other measures of accuracy include Single Word Error Rate (SWER) and Command Success Rate (CSR).
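The two main measures above can be sketched in a few lines. WER is commonly computed as the word-level edit distance between a reference transcript and the recognizer's output (substitutions plus deletions plus insertions), divided by the number of words in the reference; the real-time factor is processing time divided by audio duration. The function names and the whitespace tokenization below are assumptions made for this example, not part of any particular toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means the recognizer runs faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    print(real_time_factor(5.0, 10.0))
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, since the denominator counts only reference words.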
Challenges of Speech Recognition Systems
- Speech recognition by machine is a very complex problem.
- Vocalizations vary in terms of accent, pronunciation, articulation, roughness, nasality, pitch, volume, and speed.
- Speech is distorted by background noise, echoes, and electrical characteristics.