Abstract
Brain-computer interfaces (BCIs) can provide a rapid, intuitive way for people with paralysis to communicate by transforming the cortical activity associated with attempted speech into text. Despite recent advances, communication with BCIs has been restricted by the need for many weeks of training data and by inadequate decoding accuracy. Here we report a speech BCI that decodes neural activity from 256 microelectrodes in the left precentral gyrus of a person with ALS and severe dysarthria. This system achieves daily word error rates as low as 1% (2.66% average; 9 times fewer errors than previous state-of-the-art speech BCIs) using a comprehensive 125,000-word vocabulary. On the first day of system use, following only 30 minutes of attempted speech training data, the BCI achieved 99.6% word accuracy with a 50-word vocabulary. On the second day of use, we increased the vocabulary size to 125,000 words and, after an additional 1.4 hours of training data, the BCI achieved 90.2% word accuracy. At the beginning of subsequent days of use, the BCI reliably achieved 95% word accuracy, and adaptive online fine-tuning continuously improved this accuracy throughout the day. Our participant used the speech BCI in self-paced conversation for over 32 hours to communicate with friends, family, and colleagues (both in-person and over video chat). These results indicate that speech BCIs have reached a level of performance suitable to restore naturalistic communication to people living with severe dysarthria.
Introduction
Communication is a top priority for the millions of people living with dysarthria from neurological disorders such as stroke and amyotrophic lateral sclerosis (ALS)1. As communication fails, people report increased rates of isolation, depression, and decreased quality of life2,3; losing communication often determines if a person will pursue or withdraw life-sustaining care in advanced ALS. Existing augmentative and assistive communication technologies such as eye trackers suffer from low information transfer rates and become increasingly less reliable and more onerous for patients as they lose voluntary motor control4. Brain-computer interfaces (BCIs) are a promising assistive technology to meet patients’ fundamental need for fast and effortless communication by bypassing the damaged parts of the nervous system and directly decoding their intended speech from neural measurements (reviewed in 5). Efforts to develop a speech neuroprosthesis are built on a large body of prior work, consisting mostly of offline (post hoc) speech decoding studies using data from able speakers undergoing electrophysiological monitoring for clinical purposes (e.g. 6–14, but see 15). Several groups have now started closed-loop speech BCI studies specifically to restore lost speech using chronically implanted electrocorticography (ECoG)16–19 and intracortical multielectrode arrays20. Two recent studies have established the state-of-the-art for ‘brain-to-text’ speech BCIs18,20 by decoding the neural underpinnings of attempted speech into phonemes (the building blocks of words), which are then assembled into words and sentences using a language model and displayed on a computer screen. These studies achieved communication accuracies – as quantified using the word error rate (WER) metric – of 25.5%18 and 23.8%20. 
However, as we wrote in our previous study: “it is important to note that it does not yet constitute a complete, clinically viable system … work remains to be done to reduce the time needed to train the decoder … 24% word error rate is probably not yet sufficiently low for everyday use.”20.
Here, we report an intracortical speech neuroprosthesis to meet the need for high accuracy communication (WER below 5%), using a comprehensive vocabulary (125,000 words), with low training data requirements. Our work builds upon prior results20 with multiple innovations including: (1) doubling the number of electrodes chronically placed in the ventral precentral gyrus to 256; (2) improvements to the language model; (3) online decoder fine-tuning that enables consistently high accuracy decoding over hours of use; (4) a personalized text-to-speech module that reproduces the participant’s original voice; and (5) demonstration of self-initiated personal communication with an open vocabulary. We report that these advances resulted in very high accuracy brain-to-text communication in a person living with severe dysarthria due to ALS, beginning on the very first day of use.
Methods
Study, participant, and implanted device
We recruited a left-handed male participant in his 40’s (referred to as ‘SP2’ in this preprint rather than the actual trial participant designation, which the participant is familiar with, as per medRxiv policy) with amyotrophic lateral sclerosis (ALS) for the BrainGate2 pilot clinical trial (identifier: NCT00912041). SP2 retains limited orofacial movement with the capacity for vocalization, but is unable to produce intelligible speech (Audio 1). His eye and neck movements remain intact.
Our objective was to decode SP2’s attempted speech from neural signals recorded with four 64-electrode Utah arrays chronically implanted in the precentral gyrus. The arrays were targeted to the brain areas that contributed most to speech decoding in recent studies16,18,20. We located these areas using the Human Connectome Project’s multi-modal MRI-derived cortical parcellation precisely mapped to SP2’s brain21 (Fig. S1, Section S1.02), while accounting for placement constraints from his brain’s anatomy and vasculature (Fig. 2a).
Real-time acquisition and processing of neural data
A signal processing system (NeuroPort System, Blackrock Neurotech) was used to acquire signals from the 256 implanted electrodes and transmit them to a computer running custom software22 (Section S1.05) for real-time signal processing (Section S1.04), decoding (Sections S2-3), and task control.
Speech task designs
The study consisted of 18 research sessions over the course of 11 weeks (Section S1.06; Table S2) and took place in the participant’s home. SP2 engaged in two types of tasks: 1) an instructed-delay Copy Task (Videos 1-2 and Section S1.07), and 2) a self-paced Conversational Task (Video 3 and Section S1.08).
Decoding speech
We used neural activity collected during the speech tasks to train a recurrent neural network (RNN, Section S2) to predict the probability of each English phoneme being spoken. Day-specific input layers were used to correct for nonstationarities between neural data from each research session. Sequences of phoneme probabilities were converted to the most likely word sequence by a multi-stage language model (Section S3), as described in 20.
The RNN and language model ran in real time to convert neural activity during attempted speech into words that appear on a screen. Prior to each session, a new RNN was trained from scratch using all data from previous sessions (Section S2.02). Starting from session 12, we added an ‘online training’ capability23, which used new neural data to fine-tune the RNN after each sentence (Section S2.03).
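To make the first decoding stage concrete, the following minimal sketch turns a sequence of per-time-step phoneme probabilities into a phoneme string by taking the most probable symbol at each step, merging consecutive repeats, and dropping a "no phoneme" symbol. This greedy collapse is an assumption chosen for illustration; it is not the multi-stage language model described above, and the `BLANK` symbol, the function name, and the symbol ordering are all hypothetical.

```python
from typing import List, Sequence

BLANK = "_"  # hypothetical "no phoneme / silence" symbol, not taken from the study


def greedy_phonemes(probs: Sequence[Sequence[float]],
                    symbols: Sequence[str]) -> List[str]:
    """Greedy decode: argmax per time step, merge repeats, drop BLANK.

    probs[t][k] is the predicted probability of symbols[k] at time step t.
    """
    decoded: List[str] = []
    previous = None
    for frame in probs:
        # Index of the most probable symbol in this time step.
        best_index = max(range(len(frame)), key=lambda k: frame[k])
        best = symbols[best_index]
        # Emit a phoneme only when it differs from the previous step's symbol.
        if best != previous and best != BLANK:
            decoded.append(best)
        previous = best
    return decoded
```

In a full system, the probabilities would instead be passed to a language model that searches over word sequences; the greedy output here is only a rough analogue of the raw (pre-language model) phoneme sequence.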
Evaluation
We used two metrics to analyze the speech decoding performance: phoneme error rate (PER) and word error rate (WER), consistent with previous speech decoding studies16,18,20. We evaluated our online speech decoding performance only on predetermined “evaluation blocks” (Section S1.09). The first-ever closed-loop block (session 1) was excluded from evaluation because the participant cried with joy as the words he was trying to say correctly appeared on-screen. To calculate overall decoding performance during the Copy Task, we used all evaluation blocks from the final three sessions. For evaluating the self-paced conversational task, we used all blocks of data.
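Both metrics are edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the decoded sequence into the reference, divided by the reference length. The sketch below is one common way to compute them, assuming whitespace tokenization for words and aggregation over all sentences in a block (total edits divided by total reference tokens); the function names are illustrative and the exact scoring details of the study are given in its supplement, not here.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Aggregate error rate: total edits / total reference tokens."""
    total_tokens = sum(len(r) for r in refs)
    if total_tokens == 0:
        raise ValueError("reference must contain at least one token")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_tokens


def word_error_rate(ref_sentences: List[str], hyp_sentences: List[str]) -> float:
    """WER with words obtained by whitespace splitting (an assumption)."""
    refs = [s.split() for s in ref_sentences]
    hyps = [s.split() for s in hyp_sentences]
    return error_rate(refs, hyps)
```

The phoneme error rate follows from the same `error_rate` function by passing lists of phoneme tokens instead of words.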
Statistical analyses
Results for each analysis are presented with 95% confidence intervals or as mean ± standard deviation. The evaluation metrics (phoneme error rate and word error rate) were chosen before the start of data collection.
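One way to obtain a 95% confidence interval for an aggregate error rate is a percentile bootstrap over sentences. The sketch below assumes that approach purely for illustration; the text above does not state how the intervals were computed, and the function name, resample count, and seed are hypothetical choices.

```python
import random
from typing import List, Tuple


def bootstrap_wer_ci(edits: List[int],
                     ref_lengths: List[int],
                     n_resamples: int = 10000,
                     alpha: float = 0.05,
                     seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for aggregate WER.

    edits[i] is the edit distance for sentence i and ref_lengths[i] its
    reference word count (assumed to be positive for every sentence).
    """
    rng = random.Random(seed)
    n = len(edits)
    stats = []
    for _ in range(n_resamples):
        # Resample sentences with replacement, then recompute aggregate WER.
        idx = [rng.randrange(n) for _ in range(n)]
        total_words = sum(ref_lengths[i] for i in idx)
        stats.append(sum(edits[i] for i in idx) / total_words)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Resampling whole sentences, rather than individual words, keeps the errors within a sentence together, which matters because decoding errors within a sentence are unlikely to be independent.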
Results
Online decoding performance
In the very first research session, we asked participant SP2 to read prompted sentences, which were limited to a 50-word vocabulary16, while we recorded his neural data. After collecting 213 sentences (30 minutes) of training data, we trained the RNN and switched to the BCI’s closed-loop mode, where predicted words appeared on-screen as SP2 attempted to speak. In 50 evaluation sentences, SP2’s attempted sentences were decoded with a word error rate (WER) of 0.44%. We replicated this high-accuracy result for 50-word decoding in the second research session, where all 50 of SP2’s attempted sentences were decoded completely correctly (0% WER; Fig. 1b).
In this second research session, we also expanded the vocabulary of the neuroprosthesis from 50 words to over 125,000 words, which encompasses the majority of the English language. We collected an additional 260 sentences of training data (1.4 hours), which contained a much larger vocabulary of words from conversational English24. After incorporating these data into the decoder, the BCI decoded SP2’s attempted speech with a WER of 9.8% (Fig. 1b). Decoding performance continued to improve in subsequent research sessions as we collected more training data, optimized algorithm hyperparameters, added online decoder fine-tuning23, and expanded the training dataset to include personal use data. We reduced the WER to 2.5% by session 15, and 1% by session 17. Average Copy Task decoding performance in the final 3 sessions had a 2.66% WER at SP2’s self-paced speaking rate of 32.9 words per minute (Fig. S2).
Notably, the system achieved high accuracy at the start of new research sessions (2-5 days after the previous research session), maintaining an average WER of 4.8% over the first 50 sentences across six sessions (Fig. 1c). We attribute this “plug and play” capability to the increased stability afforded by a larger number of recording electrodes, and also to the decoder’s continuous online fine-tuning23.
Recording array implant locations and decoding contributions
This performance was enabled by four Utah arrays chronically implanted in the left precentral gyrus (Fig. 2a; Supplemental Fig. S1), targeting putative language-related area 55b, premotor areas dorsal 6v (d6v) and ventral 6v (v6v), and primary motor cortex (area 4) with action potential resolution (Fig. 2b). To identify each array’s contribution to speech decoding, we trained decoders with data from one array at a time, or by omitting one array, and evaluated offline the raw (pre-language model) PERs (Fig. 2c). Consistent with our previous findings20, the ventral 6v array provided the most accurate decoding. The dorsal 6v array’s performance was notably worse, while the performance of the 55b and M1 arrays was only slightly worse than ventral 6v. Moreover, phoneme-specific error rates showed differences across arrays but no one array was essential for decoding specific phoneme groups (Fig. 2e). Finally, decoding performance as a function of the total number of electrodes utilized revealed an expected trend: an increase in channel count correlated with higher decoding accuracy, but the gains in performance showed diminishing returns as more electrodes were added (Fig. 2d).
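The single-array and leave-one-array-out comparisons amount to retraining the decoder on different subsets of the 256 channels. The sketch below only enumerates those subsets; the array labels follow the text, but the channel ordering (array k occupying channels 64k to 64k + 63) is a hypothetical convention, and training and scoring the decoder on each subset is left out.

```python
from typing import Dict, List

# Array labels follow the text. The channel layout is an assumed convention
# for illustration only, not the study's actual electrode mapping.
ARRAYS = ["v6v", "d6v", "55b", "M1"]
ELECTRODES_PER_ARRAY = 64


def channel_indices(arrays_used: List[str]) -> List[int]:
    """Channel indices belonging to the given arrays under the assumed layout."""
    idx: List[int] = []
    for name in arrays_used:
        start = ARRAYS.index(name) * ELECTRODES_PER_ARRAY
        idx.extend(range(start, start + ELECTRODES_PER_ARRAY))
    return sorted(idx)


def ablation_conditions() -> Dict[str, List[str]]:
    """One condition per single array and per omitted array."""
    conditions: Dict[str, List[str]] = {}
    for name in ARRAYS:
        conditions[f"only_{name}"] = [name]
        conditions[f"without_{name}"] = [a for a in ARRAYS if a != name]
    return conditions
```

Each condition's channel list would then select the input features for a separately trained decoder, whose raw phoneme error rate is compared across conditions.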
Retrospective decoding analyses
Throughout the study, we refined our decoding pipeline several times, which substantially improved performance (Fig. 1b). This raises an intriguing question: how good could performance have been on the first day of speech BCI use, had we used these more refined methods? A retrospective decoding analysis shows that for a vocabulary of 50 words, we could have achieved a 0% WER with just 165 training sentences. For a 125,000-word vocabulary, a WER as low as 8.3% could have been attained after training on 323 sentences (Fig. 3a).
To assess speech decoding stability, we tested (offline) pretrained decoders on data collected on subsequent days without additional fine-tuning. Results showed that fixed decoders maintained high accuracy up to 20 days post-training. Furthermore, decoders trained on larger amounts of data were more stable beyond 20 days (Fig. 3b).
In online evaluation blocks, most words (76.8%) were always decoded accurately, including 66.8% of words that the decoder had never previously encountered (i.e., they were not in the training dataset). This suggests the decoder generalizes well (Fig. 3c, inset). In cases where words were not decoded correctly, we found that the number of occurrences of a word in the decoder training dataset was predictive of the accuracy with which it was decoded (Fig. 3c).
Conversational speech using the BCI
We developed a system for SP2 to have conversations via self-initiated speech. The BCI automatically detected when SP2 started or stopped speaking from neural activity, and decoded his attempted speech accordingly (Fig. 4d). Additionally, SP2 had the option to use an eye tracker for selecting actions (Fig. 4a) to (i) finalize and read aloud the sentence, (ii) indicate whether the sentence was decoded correctly or not, or (iii) spell out words letter-by-letter that were not correctly predicted by the decoder (e.g., because they were not in the vocabulary, such as certain proper nouns).
SP2’s first use of the BCI for naturalistic communication with his family is exemplified in Fig. 4b (Table S3 provides additional transcripts). In subsequent sessions, SP2 utilized the neuroprosthesis for personal use (e.g., Video 3), communicating a total of 1189 sentences. For the majority of these sentences (925; 77.8%) we were able to confirm SP2’s intended speech through directly asking SP2, contextual analysis, and examining the RNN-derived phoneme probability patterns. Self-initiated sentences for which we knew the ground-truth were decoded with a WER of 3.7% (Fig. 4c). For one session where we validated the ground truth of every sentence (43 sentences, 873 words) with SP2, the WER was 2.5%. Using the speech BCI, SP2 was empowered to tell the research team, “I hope that we are very close to the time when everyone who is in a position like me has the same option to have this device as I do” (Table S3).
Discussion
Beginning on the first day of device use, a brain-to-text speech neuroprosthesis with 256 recording sites in the precentral gyrus accurately decoded intended speech in a man with severe dysarthria due to ALS. He communicated using a comprehensive 125,000-word vocabulary on the second day of use (and retrospective analysis indicated this could have been achieved on Day 1). Within 16 hours of use, the BCI correctly identified 97.3% of attempted words. To contextualize this 2.7% WER performance, the state-of-the-art for English automated speech recognition (e.g., smartphone dictation) has an approximately 5% WER25 and able speakers have a 1-2% WER26 when reading aloud. To our knowledge this is also the first study to report extensive open conversation via a large-vocabulary speech BCI, including decoding words never seen during training. We believe that the high decoding accuracy demonstrated in this study indicates that speech neuroprostheses have reached a level of performance suitable for rapidly and accurately restoring communication to people living with paralysis.
This study’s participant used the brain-to-text speech BCI to converse with family, friends, healthcare professionals, and colleagues. His regular means of communication without a BCI involves either (1) having trained caregivers interpret his severely dysarthric speech, or (2) using a head-mouse with point-and-click selections on a computer screen. The (investigational) BrainGate Neural Interface System is now his preferred way to communicate with our research team, and he has requested the ability to use it on his own time to be able to more rapidly write and communicate as part of his occupation and family life. The own-voice text-to-speech at the end of each sentence is also a novel capability in a brain-to-text speech BCI; SP2 and his family reported being pleased that the system’s voice resembled his own.
A clinically viable neuroprosthesis must not only be accurate, but should also minimize calibration time. This study demonstrated a large reduction in the quantity of training data required to achieve high accuracy decoding. In our previous study20, the participant attempted to speak 260-480 sentences at the start of each day, after which up to ∼30 minutes of computation time was required until the speech neuroprosthesis was ready for use. That previous study’s reported closed-loop results were measured starting 113 days post-implant, and used more than 15 days of training data and 10,000 training sentences to achieve a WER of 23.8%, while a previous ECoG speech BCI required 17.7 hours of training data, collected over 13 days, to reach a WER of 25.5%18. This new neuroprosthesis provided over 99% accuracy on a limited set of 50 words16 after just 30 minutes of training data on the very first day of use. It also achieved over 95% accuracy on a large vocabulary after collecting 6.6 cumulative hours of training data (over 7 sessions), and offline analyses indicate that optimized methods could provide >91% accurate large-vocabulary communication on the first day of use. Rapid communication with an intracortical speech BCI builds on our previous demonstration of rapid point-and-click communication with first-time BCI users27.
Previous studies have reported that intracortical devices often require recalibration due to signal nonstationarities28–30. Here, building on the recent finding that neural data from preceding days can be used to calibrate an effective neural decoder for a new day31,32, we demonstrated that a speech decoder could similarly provide >95% accuracy at the start of each session. Future work is needed to establish whether the online fine-tuning we employed23 can maintain performance indefinitely in the absence of ground-truth labels of intended speech.
We believe that a significant factor enabling the higher performance of this study relative to our prior intracortical speech BCI20 was doubling the number of microelectrodes in speech motor cortex. Our finding that ∼200 electrodes in these regions are sufficient for very high accuracy brain-to-text communication provides an important design parameter to guide ongoing efforts to build neural interface hardware that can reach patients at scale. Using an improved phoneme-to-sentences language model relative to our prior work20 also improved performance (Fig. S5), and SP2’s slow speaking rate (Fig. S2) may also have contributed.
In addition to recording from two arrays in the putative ventral portion of area 6v (speech motor cortex) as in 20, we also targeted one array each into two areas which, to our knowledge, have not previously been recorded from with multielectrode arrays: area 4 (primary motor cortex, which in humans is often in the sulcus21 and thus largely not accessible with Utah arrays) and area 55b. We found that the strongest phoneme encoding was from the array in ventral 6v, which is consistent with our previous participant20. The array in area 4 also showed high phoneme encoding, as did the array in area 55b, which has recently been proposed as an important node in the wider speech production network33. We note that these brain area descriptions are estimations based on precisely aligning SP2’s brain to a Human Connectome Project derived atlas using multi-modal MRI.
Limitations
As with other recent clinical trial reports in the nascent field of implanted speech BCIs16–18,20,34, this study involved a single participant. Future work with additional participants is needed to establish the across-individual distribution of performances for speech BCIs using similar methods. Whether similar results can be expected may depend on whether the signal-to-noise ratio of SP2’s speech-related neural signals is typical. Nevertheless, these data, when combined with our previous speech BCI results with two 64-electrode arrays in area 6v20, demonstrate both successful initial replication and subsequent methodological improvements of the intracortical speech BCI approach. It is also not yet known how the performance of the system may change over the long term, but previous studies decoding attempted arm and hand movements using Utah arrays35 sustained high accuracy for multiple years after implantation36–39.
The participants in both this study and 20 had dysarthria due to ALS. Further work will assess whether similar methods will work for other etiologies of dysarthria. Given that we recorded from ventral precentral gyrus, which is upstream of the neuronal injury incurred in many conditions, and that recent ECoG speech neuroprostheses were demonstrated in two individuals with brainstem stroke16–18, we predict that this approach will also work in other conditions35.
While the demonstrated brain-to-text capabilities can provide widely useful communication, they do not capture the full expressive richness of voice; the more difficult challenge of closed-loop brain-to-voice synthesis remains an active area of speech BCI research18,34,40.
Audio 1 - Demonstration of SP2’s unintelligible dysarthric speech. SP2 is attempting to say prompted sentences aloud in an instructed delay Copy Task displayed on the screen in front of him (session 10; see Video 1). He retains intact eye movement and limited orofacial movement with the capacity for vocalization, but is unable to produce intelligible speech. At the end of each sentence, the decoded sentence is read aloud by a text-to-speech algorithm that sounds like his pre-ALS voice.
Link to listen online: https://ucdavis.box.com/s/gegiqcl4jzqdnug6dwxjmnd5t7gams4h
Video 1 - Copy Task speech decoding with eye tracker control. This video shows the same speech decoding trials as in Audio 1 (session 10). Prompted sentences appear on the screen in front of SP2. When the red square turns green, SP2 attempts to say the prompted sentence aloud while the speech decoder predicts what he is saying in real time. In this video, SP2 is signaling the end of a sentence by using an eye tracker to hit an on-screen “done” button. At the end of each sentence, the decoded sentence is read aloud by a text-to-speech algorithm that sounds like his pre-ALS voice.
Link to view online: https://ucdavis.box.com/s/0ono8rwx1evmp8ee27po8bs0u45bcvjn
Video 2 - Copy Task decoding with neural click control. This video shows another example of Copy Task speech decoding from a later session (session 17). Prompted sentences appear on the screen in front of SP2. When the red square turns green, SP2 attempts to say the prompted sentence aloud while the speech decoder predicts what he is saying in real time. In this video, SP2 is signaling the end of a sentence by attempting to squeeze his right fist, the neural correlates of which are decoded (Section S6). At the end of each sentence, the decoded sentence is read aloud by a text-to-speech algorithm that sounds like his pre-ALS voice.
Link to view online: https://ucdavis.box.com/s/afeqbmljk81rt4yscr2uh71ksn563g6y
Video 3 - Self-initiated conversational speech decoding. SP2 is using the speech decoder to engage in freeform conversation with those around him. The video is muted while conversation partners are speaking for privacy reasons. The BCI reliably detects when SP2 begins attempting to speak, and shows the decoded words on-screen in real time. SP2 can signal the end of a sentence using an on-screen eye tracker button (“DONE” button in the top-right of the screen), or by not speaking for 6 seconds (as he does in this video), after which the BCI finalizes the sentence. At the end of each sentence, the decoded sentence is read aloud by a text-to-speech algorithm that sounds like his pre-ALS voice. Finally, SP2 uses the eye tracker to confirm whether the decoded sentence was correct or not. Correctly decoded sentences are used to fine-tune the neural decoder online.
Link to view online: https://ucdavis.box.com/s/79nyhal9q6x7kc4toq63jkcfezdtditq
Data Availability
Derivatives of the neural data, including RNN probabilities and language model outputs, which can be used to reproduce the reported performance measurements and figures, will be made publicly available on Dryad at publication. Code that implements an offline reproduction of the central findings in this study (high-performance neural decoding of real-time attempted speech) will be made publicly available at publication. Neural data will be publicly available after completion of the trial.