Automatic Speech Recognition - An Overview
Introduction to ASR Systems
Overview of ASR Systems
- Preethi introduces herself and the topic of Automatic Speech Recognition (ASR) systems, explaining that the tutorial will cover a high-level overview, challenges, and open problems in ASR.
- An ASR system translates spoken utterances into text, which can be represented as words, syllables, sub-word units, or characters.
Examples of ASR Applications
- Common examples include YouTube's closed captioning and voicemail transcription. Voicemail transcriptions often have poor accuracy.
- Dictation systems are highlighted as early prototypes of ASR technology; they have improved significantly over time.
- Popular voice assistants like Siri and Cortana utilize ASR engines for speech recognition but may struggle with understanding context due to limitations in their language understanding modules.
Why is ASR Desirable?
Benefits of ASR Technology
- Speech is a natural communication method; effective ASR systems allow users to interact with devices hands-free.
- Companies like Toyota and Honda invest in robust speech recognition so that drivers can operate in-car technology safely and hands-free.
- Voice-driven interfaces can assist both literate and illiterate users, making technology more accessible.
Preservation of Languages
- Developing ASR systems for endangered languages can help preserve these languages by providing technological support.
Challenges in Building Effective ASR Systems
Sources of Variability
- ASR research dates back to the 1950s; although recognizing speech is easy for humans familiar with a language, several factors make it difficult for machines.
Style of Speech
- Continuous speech is harder for systems than isolated word recognition, largely because of coarticulation: neighboring sounds influence one another's articulation, blurring the boundaries between words.
Environmental Factors
- Noisy environments or challenging acoustics (e.g., reverberation or background conversations) significantly hinder an ASR system's performance.
Speaker Characteristics
- Individual differences such as accent, rate of speech, and age-related variations add to the complexity faced by ASR technologies.
Task-Specific Constraints
ASR Challenges and Historical Development
Understanding ASR Constraints
- The complexity of Automatic Speech Recognition (ASR) depends on task constraints: a limited grammar and vocabulary make the task simpler, whereas languages without written forms, or with rich morphology, make it considerably harder.
- ASR presents significant challenges, as evidenced by its historical development since the 1920s. The discussion will cover notable systems rather than an exhaustive history.
Early Prototypes of ASR
- The first prototype, Radio Rex, was a frequency detector rather than a true recognition system: it responded to acoustic energy around 500 Hertz, roughly the vowel in "Rex," but was not robust to noise or speaker variability.
- Radio Rex's design was criticized for being biased towards adult male voices, failing to recognize children's or female speech effectively.
Advancements in ASR Technology
- IBM's SHOEBOX system (1962) marked a significant step forward: it recognized 16 spoken words, including the digits, and could perform basic arithmetic operations. It was still limited to isolated word recognition.
- In the 1970s, ARPA funded a $3 million project to develop advanced speech recognition systems capable of handling continuous speech. The winning system, CMU's HARPY, recognized connected speech with a vocabulary of about 1,000 words.
Statistical Models Revolutionize ASR
- By the 1980s, statistical models became prevalent in ASR development. These models were formulated as noisy channel problems and utilized hidden Markov models for better generalization compared to previous rule-based approaches.
- Vocabulary sizes expanded significantly during this period; however, progress plateaued until deep neural networks emerged around 2006.
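The noisy-channel formulation mentioned above is conventionally written as follows (standard notation, not taken verbatim from the talk):

```latex
W^{*} \;=\; \arg\max_{W} P(W \mid X) \;=\; \arg\max_{W} P(X \mid W)\, P(W)
```

Here $X$ is the sequence of acoustic feature vectors, $P(X \mid W)$ is the acoustic model, and $P(W)$ is the language model.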
Deep Neural Networks Transforming ASR
- Modern state-of-the-art ASR systems like Cortana and Siri rely on deep neural network models for improved performance.
Breakthroughs in Speech Recognition Technology
Introduction to Deep Neural Networks
- Deep neural networks, loosely inspired by the structure of the human brain, were applied to speech recognition to improve performance.
- This led to roughly a 30% relative reduction in error rates, a substantial advance over previous models.
Current Challenges in Speech Recognition
- Despite advancements, questions remain about the ongoing challenges in speech recognition, particularly regarding accent adaptation.
- A video featuring Pratibha Patil, India's 12th president, illustrates the effectiveness of automatic captioning systems and highlights areas for improvement.
Evaluating ASR Systems
- The word error rate (WER), a common metric for evaluating Automatic Speech Recognition (ASR) systems, shows a respectable 10% error rate for Indian English.
- Google's sophisticated ASR system accounts for various English dialects, including Indian English, enhancing its performance.
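The word error rate mentioned above is computed from a word-level edit-distance alignment: WER = (substitutions + deletions + insertions) / number of reference words. A minimal sketch (the function name and implementation details are illustrative, not from the talk):

```python
def wer(reference, hypothesis):
    """Word error rate via Levenshtein distance over words:
    (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat", "the bat sat")` gives one substitution out of three reference words, i.e. about 0.33. Note that WER can exceed 100% when the hypothesis contains many insertions.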
Accent Impact on Recognition Rates
- A comparison with speeches from different speakers reveals that accents significantly affect recognition accuracy; Abdul Kalam's pronounced accent leads to higher error rates.
- Factors such as room acoustics and sentence structure also contribute to the challenges faced by ASR systems when processing heavily accented speech.
Case Study: Sarah Palin's Speech
- Analyzing Sarah Palin’s speech demonstrates how strong accents and non-standard word order can complicate predictions made by language models.
Understanding ASR Systems
Introduction to ASR Challenges
- The initial misinterpretation of "gift basket" as "your basket" highlights challenges in Automatic Speech Recognition (ASR) systems, particularly with accents or dialects.
- The discussion touches on the metaphorical idea of welcoming illegal immigrants with "gift baskets," indicating a broader societal commentary intertwined with technology.
Structure of an ASR System
- An overview of the typical structure of an ASR system is introduced, emphasizing the importance of understanding its components before delving deeper into specific questions.
- The first component involves acoustic analysis, which converts speech waveforms into discrete representations or features for further processing.
Acoustic Analysis and Feature Extraction
- Raw speech signals are divided into short overlapping frames, typically 20-25 milliseconds long with a shift of about 10 milliseconds, within which speech can be treated as stationary for feature extraction.
- Features extracted from these frames must be representative and non-redundant to ensure efficient processing; this process is complex and requires significant signal processing expertise.
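The framing step described above can be sketched as follows. The function name and defaults (25 ms windows, 10 ms shift, 16 kHz audio) are illustrative conventions, not specifics from the talk; a real front end would additionally apply a window function and compute spectral features such as MFCCs:

```python
def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D sequence of audio samples into overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)      # 160 samples at 16 kHz
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames
```

One second of 16 kHz audio yields 98 such frames, each 400 samples long, with adjacent frames sharing most of their samples.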
Importance of Phonemes in ASR
- A phoneme is defined as a discrete unit of sound that differentiates words in a language; understanding phonemes is crucial for accurate speech recognition.
- The "beads-on-a-string" analogy illustrates how words can be represented as sequences of phonemes, similar to letters forming words in written text.
Pronunciation Dictionaries and Language Dependency
- Experts create pronunciation dictionaries that map words to their phonemic representations; CMUdict is highlighted as a notable resource containing around 150,000 English words.
- Developing pronunciation dictionaries is labor-intensive due to the need for linguistic expertise and identifying commonly used words across languages.
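A pronunciation dictionary is, at its core, a mapping from words to phone sequences. Below is a toy sketch in the style of CMUdict's ARPAbet entries; the handful of entries and the function name are illustrative only, and a real dictionary covers vastly more words (and multiple pronunciations per word):

```python
# Toy pronunciation dictionary in the style of CMUdict (ARPAbet phones).
PRON_DICT = {
    "the":    ["DH", "AH"],
    "cat":    ["K", "AE", "T"],
    "cats":   ["K", "AE", "T", "S"],
    "speech": ["S", "P", "IY", "CH"],
}

def words_to_phones(sentence):
    """Expand a word sequence into its phone sequence.
    Returns None if any word is out of vocabulary (OOV)."""
    phones = []
    for word in sentence.lower().split():
        entry = PRON_DICT.get(word)
        if entry is None:
            return None  # OOV word: no phonemic expansion available
        phones.extend(entry)
    return phones
```

The OOV case in this sketch is exactly the limitation discussed later: a word-level or dictionary-based system cannot handle words it has never seen.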
Understanding the Need for Phonemes in Speech Recognition
The Challenge of Vocabulary Limitations
- The speaker discusses the limitations of using whole words in speech recognition, particularly when encountering unfamiliar words during testing.
- Emphasizes that while a small vocabulary may seem manageable, larger vocabularies present challenges when certain words have not been encountered during training.
Transitioning to Phoneme Representation
- Proposes representing words as sequences of phonemes, which allows for better generalization by leveraging known acoustic samples from training data.
- Notes that for limited vocabulary tasks, word-level models might suffice; however, most tasks benefit from a more granular phoneme representation.
Acoustic Models and Hidden Markov Models (HMM)
- Introduces the concept of acoustic features extracted from raw speech signals and how they relate to phoneme sequences through probabilistic models.
- Describes HMM as a method used to learn mappings between feature vectors and phonemes, utilizing hidden states represented in a weighted graph format.
Probabilistic Nature of Phone Sequences
- Highlights the uncertainty in predicting which phone will appear next, necessitating multiple hypotheses within the model.
- Explains how Gaussian Mixture Models are employed to determine probabilities associated with generating specific speech vectors based on current states.
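The forward algorithm over such a model can be sketched as below. For brevity this toy version uses discrete emission probabilities in place of Gaussian mixtures over continuous feature vectors, and all probabilities are invented for illustration:

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: total probability that an HMM generates
    the observation sequence `obs`, summed over all state paths."""
    # Initialize with start probabilities times first emission
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    # Recurse: sum over predecessor states, then emit the next observation
    for o in obs[1:]:
        alpha = {
            s: sum(alpha[prev] * trans_p[prev][s] for prev in states) * emit_p[s][o]
            for s in states
        }
    return sum(alpha.values())
```

In a real acoustic model, `emit_p[s][o]` would be replaced by a Gaussian mixture (or DNN) score for a continuous feature vector, but the dynamic-programming recursion is the same.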
Evolution to Deep Neural Networks (DNN)
- Discusses the transition from HMM to DNN for mapping speech signals to phonemes, emphasizing the extraction of features from fixed windows around speech frames.
Understanding Acoustic Models in Speech Recognition
Introduction to Acoustic Features
- The speaker invites questions, indicating a willingness to clarify technical details about mapping acoustic features to phone sequences.
- Emphasizes that the output from the acoustic model is a probabilistic distribution of phone sequences rather than a single best sequence.
Transition from Phones to Words
- Discusses the need for an intermediate representation (phones) before arriving at word sequences from speech utterances.
- Introduces pronunciation dictionaries as essential tools linking phone sequences to words; highlights that this module is expert-derived and not learned from data.
Pronunciation Variation Insights
- References research conducted on the Switchboard corpus, which includes detailed phonetic transcriptions of conversational speech.
- Points out significant pronunciation variations observed in actual speech compared to dictionary pronunciations, noting how accents and speed affect articulation.
Analysis of Phonetic Transcriptions
- Illustrates examples of four words with discrepancies between standard dictionary pronunciations and actual spoken forms, highlighting multiple legitimate pronunciations.
- Notes that entire syllables can be omitted in fast speech while still remaining intelligible due to contextual understanding.
Modeling Pronunciation Variations
- Acknowledges the challenge of recognizing conversational speech versus read speech, which tends to adhere more closely to dictionary norms.
- Proposes exploring the source of pronunciation variations through studying the human speech production system and articulators involved in sound production.
Speech Production Research
- Mentions ongoing work by USC's SPAN group focusing on modeling pronunciation variation using MRI technology synchronized with audio recordings.
Articulatory Phonology and Speech Production
Discretizing Vocal Tract Variables
- The approach taken involved discretizing the space of vocal tract variables, allowing for various configurations that lead to different speech sounds. This method identifies eight vocal tract variables, each capable of taking on multiple values.
Linguistic Foundations
- The framework is grounded in established linguistic theories, particularly articulatory phonology, which redefines speech representation from sequences of phones to streams of articulatory features.
Overlapping Articulatory Features
- Articulatory features are described as quasi-overlapping streams that influence pronunciation. These features interact with one another, leading to variations in how words are pronounced based on their articulatory movements.
Explaining Pronunciation Variation
- By representing pronunciations as streams of features rather than fixed sequences, the model can account for significant deviations in pronunciation through asynchronous feature movements (e.g., nasalization affecting subsequent sounds).
Challenges in Modeling
- While the articulatory feature framework offers elegant explanations for pronunciation variation, it poses modeling challenges. The speaker reflects on difficulties encountered during their thesis work related to this representation.
Dynamic Bayesian Networks and Language Models
Dynamic Bayesian Networks (DBNs)
- The discussion transitions to Dynamic Bayesian Networks (DBNs), a generalization of Hidden Markov Models used to represent constraints between various articulatory features within the speech production model.
Probabilistic Mapping of Sounds
- The mapping from vocal tract variables to produced sounds is probabilistic rather than deterministic. This allows flexibility in how articulatory configurations correspond to acoustic outputs while acknowledging variability.
Role of Language Models
- Language models determine the order of words after mapping phone sequences into valid words. They analyze large text corpora to understand word co-occurrences and structure language output effectively.
Integration with Pronunciation Models
- The integration between pronunciation models and language models is crucial; even if a phone sequence does not match exactly, probabilities help identify likely word sequences based on contextual usage.
Complexity in Probabilistic Systems
Understanding Language Models and Their Applications
The Role of Context in Language Models
- The word "context" is crucial for predicting subsequent words; for example, "the dog" is likely to be followed by "ran" rather than "pan," which has a low probability.
- Language models help disambiguate similar acoustic sequences, such as distinguishing between "is the baby crying" and "is the bay bee crying," based on their likelihood in large text corpora.
Tools for Implementing Language Models
- The SRILM toolkit is widely used in various communities for language modeling tasks, offering extensive built-in functionalities.
- KenLM Toolkit is gaining popularity due to its efficiency in handling large volumes of text with sophisticated data structures.
- OpenGrm NGram Library, developed by Google, is ideal for those who prefer working with finite state machines.
Applications of Language Models
- Language models are utilized across numerous applications including:
- Speech recognition
- Machine translation
- Handwriting recognition
- Optical character recognition
- Spelling correction
- Summarization and dialog generation
N-Gram Models Explained
- N-Gram models analyze co-occurring words (bigrams, trigrams, etc.) to predict the next word based on previous context. For instance, a four-gram model assesses the probability of “class” following “she taught a.”
- As n increases (e.g., five-grams), reconstructing sentences becomes more accurate but also leads to an exponential increase in possible n-grams.
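A maximum-likelihood bigram model of the kind described above can be estimated by simple counting. This sketch (the function name and sentence-boundary markers are illustrative conventions) computes P(w2 | w1) from a toy corpus:

```python
from collections import Counter

def bigram_probs(corpus_sentences):
    """Maximum-likelihood bigram probabilities P(w2 | w1):
    count(w1, w2) / count(w1), with <s> and </s> boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus_sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])          # history words only
        bigrams.update(zip(tokens, tokens[1:]))
    return {pair: count / unigrams[pair[0]] for pair, count in bigrams.items()}
```

On the two-sentence corpus `["the dog ran", "the dog barked"]`, P(dog | the) is 1.0 while P(ran | dog) is 0.5, matching the intuition that frequent continuations get high probability.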
Challenges with Unseen N-Grams
- A significant limitation of n-gram models is encountering unseen n-grams that do not appear in training data. This issue arises frequently even with large datasets.
- When testing with unseen n-grams during sentence reordering, the model assigns a probability of zero if it encounters these unknown sequences.
Importance of Smoothing Techniques
Understanding Smoothing Techniques in N-gram Models
Importance of Smoothing Methods
- Discussion on the significance of distributing remaining probability mass across unseen N-grams, highlighting various smoothing techniques.
- Recommendation to read the 1998 paper by Chen and Goodman for a deep understanding of smoothing techniques in language models (LMs).
- Emphasis on the relevance of N-grams today, despite advancements in language modeling.
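Add-one (Laplace) smoothing is the simplest way to give unseen n-grams nonzero probability; the Chen and Goodman paper recommended above covers far stronger methods such as Kneser-Ney. A minimal sketch with made-up counts (function name illustrative):

```python
def laplace_bigram_prob(bigram_counts, unigram_counts, vocab_size, w1, w2):
    """Add-one smoothed P(w2 | w1): every bigram count is incremented by 1,
    so unseen bigrams receive a small nonzero probability instead of zero."""
    return (bigram_counts.get((w1, w2), 0) + 1) / \
           (unigram_counts.get(w1, 0) + vocab_size)
```

With `bigram_counts = {("the", "dog"): 2}`, `unigram_counts = {"the": 2}`, and a vocabulary of 5 words, the seen bigram gets 3/7 while an unseen one like ("the", "cat") still gets 1/7 rather than zero, which is exactly the fix needed for the zero-probability problem described above.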
Transition to Neural Network-Based Models
- Mention that while recurrent neural network-based language models are emerging, many production systems still rely on traditional N-gram models due to speed concerns.
- Explanation that ASR (Automatic Speech Recognition) systems often use a combination of N-grams and rescoring with neural networks.
The Role of Decoders in ASR Systems
Understanding the Search Problem
- Overview of how decoders integrate various components to determine the most likely word sequence from phoneme sequences.
- Description of a naive search graph illustrating transitions between words like "nine" and "one," emphasizing associated probabilities.
Complexity of Search Graphs
- Acknowledgment that even simple examples can lead to large graphs; real systems may involve tens of thousands or millions of states.
- Clarification that exact searches through these extensive graphs are impractical, necessitating approximate search techniques.
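The approximate search mentioned above is typically a beam search: at each step only the top few partial hypotheses are kept alive. A toy sketch over per-step symbol scores (all names and scores are illustrative; a real decoder would also fold in transition and language-model scores):

```python
import heapq

def beam_search(step_scores, beam_width=2):
    """Approximate best symbol sequence given a list of per-step
    {symbol: log_probability} dicts, keeping `beam_width` hypotheses."""
    beams = [("", 0.0)]  # (partial sequence, cumulative log-probability)
    for scores in step_scores:
        # Extend every surviving hypothesis by every candidate symbol
        candidates = [
            (seq + sym, logp + sym_logp)
            for seq, logp in beams
            for sym, sym_logp in scores.items()
        ]
        # Prune: keep only the highest-scoring hypotheses
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
    return beams[0][0]  # best surviving hypothesis
```

Pruning keeps the number of live hypotheses constant per step, so the search stays tractable even when the full graph has millions of states; the price is that the true best path can occasionally be pruned away.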
Emerging Trends: End-to-End ASR Systems
Simplifying ASR Components
- Introduction to end-to-end ASR systems aiming to eliminate traditional components by directly mapping acoustic features to character sequences.
Advantages and Challenges
- Highlighting the benefit of removing pronunciation models, which simplifies system development for new languages but requires substantial data for effectiveness.
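One common end-to-end formulation (CTC-style; the talk does not name a specific training criterion here) collapses per-frame character predictions into an output string by merging repeated symbols and dropping a special blank. A minimal sketch of that collapsing step, assuming per-frame argmax symbols are already available from a neural network:

```python
def ctc_greedy_decode(frame_symbols, blank="_"):
    """Collapse a per-frame best-symbol sequence CTC-style:
    merge consecutive repeats, then remove blank symbols."""
    out = []
    prev = None
    for sym in frame_symbols:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return "".join(out)
```

For example, the frame sequence `h h _ e _ l l _ l o` collapses to "hello"; the blank between the two l-runs is what lets the model emit a doubled letter.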
References for Further Reading
Understanding Character Models in Speech Recognition
Direct Mapping of Acoustic Vectors
- The system discussed operates without a dictionary or language model, directly mapping acoustic vectors to letters and characters.
- This approach leads to lexical errors due to phonetic similarities, such as "shingle" being interpreted as "single."
- Certain words like "Dukakis" and "Milan" may not appear in the vocabulary, showcasing an advantage of character models that predict one character at a time.
Analysis of Phone Correspondence
- An analysis from Maas et al. (2014) shows how speech samples correspond to various phones, despite their system lacking explicit phone definitions.
- The letter 'C' aligns with its core sound, while 'S' and 'H' together are prominent for the phone /sh/, indicating complex relationships between letters and sounds.
Advances with Sequence-to-Sequence Models
- A 2016 system significantly improved upon previous models by utilizing sequence-to-sequence networks with attention mechanisms.
- Despite advancements, end-to-end systems still lag behind traditional pipelines in terms of understanding and performance.
Challenges in Speech Recognition Systems
- These systems require extensive data for training due to the complexity of understanding speech sounds and spelling variations.
- Addressing variations in age and accent is crucial for improving Automatic Speech Recognition (ASR), particularly for individuals with speech impairments.
Future Directions for ASR Development
- There is a need for ASR systems that can handle noisy environments with multiple speakers effectively.
- Current state-of-the-art methods struggle with meeting transcription tasks involving overlapping speech from different speakers.
Efficiency and Adaptation Challenges
- Adapting existing models to new languages or dialects remains challenging; collecting large datasets is often necessary but inefficient.
- The goal is to reduce reliance on extensive labeled data while leveraging unlabeled data more effectively across domains.
Resource Efficiency Goals
Speech Recognition and Language Technology Challenges
Promising Developments in Speech-to-Speech Translation
- Microsoft is advancing speech-to-speech translation, showcasing its potential through an advertisement featuring Skype.
- The integration of cues in speech can enhance machine translation, suggesting a need for improved models that leverage both speech recognition (SR) and production.
Addressing Pronunciation Variability
- Utilizing speech production models could minimize reliance on language-specific resources by focusing on universal vocal tract configurations.
- Current challenges include managing new languages without extensive data collection; existing ASR supports around 80 languages, but many are variations rather than distinct languages.
Geographic Disparities in Language Support
- Europe leads in language representation within speech technologies, while Asia has significantly fewer supported languages despite its linguistic diversity.
- Crowdsourcing transcription from native speakers poses challenges due to demographic mismatches on platforms like Amazon's Mechanical Turk.
Limitations of Crowdsourcing Solutions
- The disparity between the language expertise required for transcription tasks and the available crowdworkers highlights issues with minority language support.
- This method may not be viable for all languages, indicating a need for innovative solutions in the field of language technology.
Insights into Language Models and Phonetic Annotation
- Language models can revert to simpler forms like unigram models if individual words have been encountered previously, allowing recovery of word sequences even with limited constraints.
- English phonetic transcriptions often exceed 40 phones due to fine-grained variations; achieving such detailed annotation remains challenging.
Computational Resources and System Performance
- End-to-end systems currently rely heavily on recurrent neural networks, facing limitations regarding context retention and effectiveness of attention mechanisms.
Understanding Speech Recognition in Indian Languages
Challenges of Predicting Characters
- The discussion begins with the prediction of a single character at a time in speech recognition, noting that out-of-vocabulary words are not an issue since the model predicts individual characters rather than whole words.
- It is noted that for morphologically rich Indian languages, end-to-end character-level systems might perform well, since no fixed vocabulary needs to cover every inflected word form. However, data availability remains a challenge.
Learning Sound Mapping and Spelling
- The conversation highlights the need for systems to learn sound mapping and spelling due to irregularities in language mapping.
- There’s mention of Unicode initiatives that could help predict characters more effectively by breaking down complex sounds into simpler components.
Data Requirements for Effective Models
- A comparison is made regarding data requirements; while English models may require extensive machinery, Indian languages could potentially yield better results with less data if mapped correctly.
- The standard pipeline for speech recognition is discussed, referencing Switchboard's 200 hours of speech achieving a 5% error rate as a benchmark.
Minimum Data Threshold for Experiments
- The minimum amount of data needed to experiment with Indian languages is debated; previous studies suggest around 100 hours could be sufficient to start developing effective systems.
- There's interest expressed in conducting experiments on Indian languages even with smaller datasets, acknowledging ongoing research on lesser-known languages.
Current Research Landscape
- The conversation touches upon the novelty and complexity of applying these methods to different languages, emphasizing that no comprehensive evidence exists yet for their effectiveness across all complicated language structures.