Automatic Speech Recognition from Neural Signals: A Focused Review

ABSTRACT

Speech interfaces are now widely accepted and integrated into many real-life applications and devices; they have become part of our daily lives. However, speech interfaces presume the ability to produce intelligible speech, which may not be possible in loud environments, in situations where speaking would disturb bystanders, or for people who are unable to speak (e.g., patients suffering from locked-in syndrome). In these situations, it would be highly desirable not to speak aloud but simply to imagine saying words or sentences. Interfaces based on imagined speech would enable fast and natural communication without the need for audible speech and would give a voice to otherwise mute people. This focused review analyzes the potential of different brain imaging techniques to recognize speech from neural signals by applying Automatic Speech Recognition (ASR) technology. We argue that modalities based on metabolic processes, such as functional Near Infrared Spectroscopy and functional Magnetic Resonance Imaging, are less suited for Automatic Speech Recognition from neural signals due to their low temporal resolution, but are very useful for investigating the underlying neural mechanisms involved in speech processes. In contrast, electrophysiological activity is fast enough to capture speech processes and is therefore better suited for ASR. Our experimental results indicate the potential of these signals for speech recognition from neural data, with a focus on invasively measured brain activity (electrocorticography). As a first example of Automatic Speech Recognition techniques applied to neural signals, we discuss the Brain-to-text system.



Figure 2: Decoding process in the Brain-to-text system. Broadband gamma power is extracted for a phrase of ECoG data. The most likely word sequence is then decoded by combining ECoG phone models, a pronunciation dictionary, and a language model.
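As context for the first step in the caption, the following is a minimal sketch of broadband gamma power extraction from a single ECoG channel. The 70-170 Hz band, filter order, and Hilbert-envelope power estimate are common choices assumed here for illustration, not parameters taken from the reviewed work.

```python
# A minimal sketch (assumed parameters, not the authors' exact pipeline) of
# extracting broadband gamma power from one ECoG channel: band-pass filter
# in the high-gamma range, then take the squared Hilbert envelope.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def broadband_gamma_power(ecog: np.ndarray, fs: float,
                          band=(70.0, 170.0), order: int = 4) -> np.ndarray:
    """Return the instantaneous broadband gamma power of one ECoG channel.

    ecog : 1-D array of raw samples; fs : sampling rate in Hz.
    The 70-170 Hz band and filter order are common conventions and
    are not taken from the reviewed paper.
    """
    nyq = fs / 2.0
    b, a = butter(order, [band[0] / nyq, band[1] / nyq], btype="band")
    filtered = filtfilt(b, a, ecog)       # zero-phase band-pass filtering
    envelope = np.abs(hilbert(filtered))  # analytic-signal amplitude
    return envelope ** 2                  # power = squared envelope

# Example on synthetic data: 2 s of noise sampled at 1 kHz.
fs = 1000.0
signal = np.random.default_rng(0).normal(size=int(2 * fs))
power = broadband_gamma_power(signal, fs)
```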

Mentions: These models can be used to estimate the likelihood of a certain phone given a segment of ECoG data. Additionally, the fitted generative models for each phone can be used to gain insights into the neural basis of producing different phones. Although these ECoG phone models alone could be used to pick the most likely phone for each interval of ECoG activity, ASR software adds crucial information through a statistical language model (Jelinek, 1997; Stolcke, 2002) and a pronunciation dictionary. The combination of these three components yields the strong performance known from speech interfaces. The ASR software identifies the sequence of words from the dictionary with the best combined score from the language model and the ECoG phone models. Using these ideas from ASR, our Brain-to-text system creates a textual representation of spoken words from neural data. See Figure 2 for a graphical illustration of the decoding process; a toy sketch of the search follows below.
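The following is a minimal, self-contained sketch of this three-component decoding idea, not the authors' implementation. The Gaussian phone models, two-word dictionary, bigram probabilities, and feature dimensions are all illustrative assumptions, and the exhaustive enumeration stands in for the HMM alignment and beam search that real ASR decoders use.

```python
# Toy sketch of Brain-to-text-style decoding: Gaussian phone models score
# broadband-gamma feature frames, a pronunciation dictionary maps words to
# phone sequences, and a bigram language model re-scores word sequences.
# All names, features, and probabilities are illustrative placeholders.

import itertools
import numpy as np
from scipy.stats import multivariate_normal

# --- Hypothetical generative phone models (one Gaussian per phone). ------
# In practice these would be fit to broadband gamma power features
# extracted from ECoG recordings aligned to each phone.
rng = np.random.default_rng(0)
PHONES = ["h", "i", "b", "ay"]
N_FEATURES = 8  # e.g., gamma power on 8 electrodes

phone_models = {
    p: multivariate_normal(mean=rng.normal(size=N_FEATURES),
                           cov=np.eye(N_FEATURES))
    for p in PHONES
}

# --- Hypothetical pronunciation dictionary and bigram language model. ----
DICTIONARY = {"hi": ["h", "i"], "bye": ["b", "ay"]}
LM_LOGPROB = {  # log P(word | previous word); "<s>" marks sentence start
    ("<s>", "hi"): np.log(0.6), ("<s>", "bye"): np.log(0.4),
    ("hi", "bye"): np.log(0.5), ("bye", "hi"): np.log(0.5),
    ("hi", "hi"): np.log(0.5), ("bye", "bye"): np.log(0.5),
}

def phone_loglik(frames: np.ndarray, phone: str) -> float:
    """Log-likelihood of a block of gamma-power frames under one phone model."""
    return float(np.sum(phone_models[phone].logpdf(frames)))

def word_loglik(frames: np.ndarray, word: str) -> float:
    """Score a word by splitting its frames evenly across its phones.
    (Real systems align frames to phones with HMMs/Viterbi instead.)"""
    phones = DICTIONARY[word]
    chunks = np.array_split(frames, len(phones))
    return sum(phone_loglik(c, p) for c, p in zip(chunks, phones))

def decode(frames_per_word: list, lm_weight: float = 1.0) -> list:
    """Exhaustively search word sequences, combining ECoG phone-model
    scores with language-model scores (real decoders use beam search)."""
    best_seq, best_score = None, -np.inf
    for seq in itertools.product(DICTIONARY, repeat=len(frames_per_word)):
        score, prev = 0.0, "<s>"
        for word, frames in zip(seq, frames_per_word):
            score += word_loglik(frames, word)
            score += lm_weight * LM_LOGPROB.get((prev, word), np.log(1e-6))
            prev = word
        if score > best_score:
            best_seq, best_score = list(seq), score
    return best_seq

# Example: decode two "words" of simulated gamma-power frames.
fake_frames = [rng.normal(size=(10, N_FEATURES)) for _ in range(2)]
print(decode(fake_frames))
```

The `lm_weight` parameter (an assumption here, though weighting the language model against the acoustic or phone model is standard ASR practice) controls how strongly prior knowledge about plausible word sequences can override uncertain phone-model evidence.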

