Abstract
Brain-computer interfaces can enable rapid, intuitive communication for people with paralysis by transforming the cortical activity associated with attempted speech into text on a computer screen. Despite recent advances, communication with brain-computer interfaces has been restricted by extensive training data requirements and inaccurate word output. A man in his 40’s with ALS with tetraparesis and severe dysarthria (ALSFRS-R = 23) was enrolled into the BrainGate2 clinical trial. He underwent surgical implantation of four microelectrode arrays into his left precentral gyrus, which recorded neural activity from 256 intracortical electrodes. We report a speech neuroprosthesis that decoded his neural activity as he attempted to speak in both prompted and unstructured conversational settings. Decoded words were displayed on a screen, then vocalized using text-to-speech software designed to sound like his pre-ALS voice. On the first day of system use, following 30 minutes of attempted speech training data, the neuroprosthesis achieved 99.6% accuracy with a 50-word vocabulary. On the second day, the size of the possible output vocabulary increased to 125,000 words, and, after 1.4 additional hours of training data, the neuroprosthesis achieved 90.2% accuracy. With further training data, the neuroprosthesis sustained 97.5% accuracy beyond eight months after surgical implantation. The participant has used the neuroprosthesis to communicate in self-paced conversations for over 248 hours. In an individual with ALS and severe dysarthria, an intracortical speech neuroprosthesis reached a level of performance suitable to restore naturalistic communication after a brief training period.
Introduction
Communication is a priority for the millions of people living with dysarthria from neurological disorders such as stroke and amyotrophic lateral sclerosis (ALS)1. People living with diseases that impair communication report increased rates of isolation, depression, and decreased quality of life2,3; losing communication often determines if a person will pursue or withdraw life-sustaining care in advanced ALS4. While existing augmentative and assistive communication technologies such as head or eye trackers are available, they suffer from low information transfer rates and become increasingly onerous to use as patients lose voluntary muscle control5. Brain-computer interfaces are a promising communication technology that can directly decode the user’s intended speech from cortical neural signals6. Efforts to develop a speech neuroprosthesis are built on largely offline (post hoc) studies using data from able-bodied speakers undergoing electrophysiological monitoring for clinical purposes (e.g. 7–15, but see 16). Several groups have performed online (real-time) brain-computer interface studies specifically to restore lost speech using chronically implanted electrocorticography (ECoG)17–20 and intracortical multielectrode arrays21. Two recent reports have established ‘brain-to-text’ speech performance19,21 by decoding the neural signals generated by attempted speech into phonemes (the building blocks of words), and assembling these phonemes into words and/or sentences displayed on a computer screen. These studies achieved communication performance, quantified using the word error rate metric, of 25.5% with a 1,024-word vocabulary19 and 23.8% with a 125,000-word vocabulary21, which is insufficient for reliable general-purpose communication. These prior studies also required at least 13 days of recording to collect sufficient training data to obtain that level of performance (17.7 training data hours in 19, 16.8 hours in21).
Here, we report an intracortical speech neuroprosthesis capable of providing access to a comprehensive 125,000-word vocabulary, with low training data requirements. We report that our clinical trial participant, living with advanced ALS and severe dysarthria, achieved very high accuracy brain-to-text communication (word error rates consistently below 5%), with useful function beginning on the very first day of use.
Methods
Study participant
A left-handed man in his 40’s (referred to as ‘SP2’ in this preprint rather than the actual trial participant designation, which the participant is familiar with, as per medRxiv policy) with amyotrophic lateral sclerosis (ALS) was enrolled into the BrainGate2 pilot clinical trial. At the time of enrollment, he had no functional use of his upper and lower extremities and had severe dysarthria (ALSFRS-R = 23). For the duration of this study (8 months following implantation), he has maintained a modified mini-mental status exam score of 27 (the highest score attainable). At the time of this report, he still retains eye and neck movements, but has limited orofacial movement with a mixed upper- and lower-motor neuron dysarthria resulting in a monotone, low-volume, hypernasal speech with off-target articulation. He requires non-invasive respiratory support at night, and does not have a tracheostomy. When speaking with non-expert listeners, he is unintelligible (Audio 1): his oral motor tasks on the Frenchay Dysarthria Assessment-2 were an “E” rating, representing non-functioning or profound impairment. When speaking to expert listeners, he communicates at 6.8 ± 5.6 (mean ± standard deviation) correct words per minute. His typing speed using his gyroscopic headmouse (Zono 2, Quha, Nokia, Finland) is 6.3 ± 1.3 correct words per minute (Fig. S1). The severity of dysarthria has remained stable during the time-period of this report, including the immediate postoperative period. Additional participant details are in Section S1.01.
Surgical implantation
The goal of the surgery was to implant four microelectrode arrays (NeuroPort Array, Blackrock Neurotech, Salt Lake City, Utah, USA) into the precentral gyrus, an important cortical target for coordinating the motor activities related to speech17,19,21. Each single electrode has one recording site ∼50 μm in size, and is designed to record from a single or a small number of cortical neurons. A microelectrode array is 3.2 × 3.2 mm in size, has 64 recording sites in an 8 × 8 grid arrangement, and is inserted 1.5 mm into the cortex using a specialized high-speed pneumatic inserter.
We performed a left-sided 5 × 5 cm craniotomy under general anesthesia. Care was taken to avoid placing the microelectrode arrays through large vessels on the cortical surface, identified by visual inspection. Two arrays are connected to one percutaneous pedestal designed to transmit the neural recordings to external computers. We placed two percutaneous pedestals, for a total of 256 recording sites. Reference wires were placed in both the subdural and epidural spaces.
Our participant underwent surgery in July 2023, had no study-related serious adverse events, and was discharged on postoperative day 3. From initial incision to closure, the operation took 5 hours to perform. We began collecting data in August 2023, 25 days after surgery.
Recording array implant locations and decoding contributions
Prior to implanting arrays in the precentral gyrus, we identified the central sulcus on an anatomical MRI scan, and confirmed that our study participant was left-hemisphere language dominant on functional MRI, using standard clinical fMRI tasks (sentence completion, silent word generation, silent verb generation, and object naming). In addition to pre-operative morphological MRI and fMRI studies, we further refined our implantation targets using the Human Connectome Project’s multi-modal MRI-derived cortical parcellation precisely mapped to the participant’s brain22 (Fig. 1a, Fig. S2, Section S1.02). We targeted language-related area 55b23 and three areas in the precentral gyrus associated with motor control of the speech articulators: premotor areas dorsal 6v (d6v) and ventral 6v (v6v), and primary motor cortex (area 4; Fig. S2). Our choice of targeting speech motor cortex was motivated by our previous study, which found that two 6v arrays provided informative signals for speech decoding21. Compared to that previous report, the implantation of four micro-electrode arrays doubles the number of electrodes implanted into the ventral precentral gyrus and doubles the corresponding cortical area covered.
Real-time acquisition and processing of neural data
Neural activity was transmitted from the pedestals to a computer system designed to decode the activity in real-time (Fig. 1b). A signal processing system (NeuroPort System, Blackrock Neurotech) was used to acquire signals from the two pedestals (Fig. S3) and send them to a series of commercially available computers running custom software24 (Section S1.5) for real-time signal processing (Section S1.4) and decoding (Sections S2, S3). The system sits on a wheeled cart and requires the computers to be connected to standard wall outlets. Blackrock Neurotech was not involved in the data collection or reporting in this study, and had no oversight regarding the decision to publish.
Speech task designs
The study consisted of 84 data collection sessions over the course of 32 weeks (Section S1.06; Table S2) and took place in the participant’s home. Each session consisted of a series of task blocks, lasting approximately 5-30 minutes, wherein the participant used the neuroprosthesis. Between blocks, he would take breaks, eat meals, etc. One session was performed on any day. During each block, the participant used the system in two different ways: 1) an instructed-delay Copy Task (Videos 1, 2 and Section S1.07); and 2) a self-paced Conversation Mode (Videos 4, 5 and Section S1.08). The instructed-delay task consisted of words being presented on a computer screen, and the participant attempted to say the words after a visual/audio cue21. The self-paced Conversation Mode involved the participant trying to say whatever he wanted, from the 125,000 word dictionary, in an unstructured conversational setting. In both tasks, speech decoding occurred in real-time: as he spoke, the cortical activity at the four micro-electrode arrays were recorded and interpreted, and the predicted words were presented on the screen. Completed sentences were read aloud by a computer program and, in later sessions, automatically punctuated (Sections S4 and S3.03). The neuroprosthesis could also send the sentence to the participant’s personal computer by acting as a bluetooth keyboard, which allowed him to use it for activities such as writing emails (Section S1.08). Sampled phoneme and words used for decoder training accumulated over the course of the study (Figure S4).
Decoding speech
Our neuroprosthesis was designed to translate the participant’s cortical neural activity into words when he attempted to speak (Section S2, Figs. S5, S6). No microphone input was used for decoding, and we found no evidence of acoustic or vibration-related contamination in the recorded neural signals (Section S4.03, Fig. S7). Every 80 ms, the activity from the population of cortical neurons was used to predict the most likely English phoneme (the building block of sounds of a language) being attempted. Phoneme sequences were then combined into words using an openly-available language model21. Next, we applied two further open-source language models to translate the sequence of words into the most likely English sentence. Hence, the neural activity was the only input for all subsequent language-model refinements (Section S3, Fig. S8), as described in 21. Machine learning techniques enabled high accuracy decoding; data from different days were combined to continuously calibrate the decoder, enabling maintenance of high performance during neuroprosthesis use (Section S2 and Figs. S9, S10).
Evaluation
We used two measures to analyze the speech decoding performance: phoneme error rate and word error rate, consistent with previous speech decoding studies17,19,21. These measures are the ratio of phonemic or word errors to the total number of phonemes or words expected to be decoded, respectively. An error is defined as when an insertion, deletion, or substitution is needed to have the decoded sentence match the intended sentence. The phoneme error rate can be understood as the system’s ability to translate cortical neural activity into phonemes without language models (to make this explicit, we label it as ‘raw’ phoneme error rate in figures), and word error rate as an estimate of overall communication accuracy. Data was collected in continuous blocks (lasting 5-30 minutes in length) which were separated by short breaks. Blocks were either “training blocks” in which data were collected for decoder training or optimization, or predetermined “evaluation blocks” used to measure and report neuroprosthesis performance. Reported error rates are aggregated over all evaluation sentences for each session (Section S1.09). The first-ever closed-loop block (session 1) was excluded from evaluation because the participant cried with joy as the words he was trying to say correctly appeared on-screen; the research team paused the evaluation until after the participant and his family had a chance to celebrate the moment.
Statistical analyses
Results for each analysis are presented with 95% confidence intervals or as mean ± standard deviation. Confidence intervals were estimated by randomly resampling each dataset 10,000 times with replacement, and have not been adjusted for multiplicity. The evaluation metrics for decoding performance (phoneme error rate and word error rate) were chosen before the start of data collection (Section S1.09).
Results
Online decoding performance
In the first session, the participant attempted to speak prompted sentences constructed from a 50-word vocabulary17. We recorded 213 sentences (30 minutes) of Copy Task data, which were used to calibrate the speech neuroprosthesis. Next, we decoded his neural cortical activity in real-time as he tried to speak. The neuroprosthesis decoded his attempted speech with a word error rate of 0.44% (95% confidence interval [CI], 0.0% to 1.4%). We replicated this result for 50-word vocabulary decoding in the second research session, in which all of the participant’s attempted sentences were decoded correctly (0% word error rate; Fig. 2).
In this second research session, we also expanded the vocabulary of the neuroprosthesis from 50 words to over 125,000 words, which encompasses the majority of the English language25. We collected an additional 260 sentences of training data (1.4 hours). After being trained on these additional data, the neuroprosthesis decoded the participant’s attempted speech with a word error rate of 9.8% (95% CI, 4.1% to 16.0%; Fig. 2); offline analyses suggest that even better first-session performance may have been possible given the information content of the recorded neural signals (Figs. S12, S13). Performance continued to improve in subsequent research sessions as we collected more training data and adapted innovations for incorporating new data more effectively26. The neuroprosthesis achieved a word error rate of 2.5% (95% CI, 1.0% to 4.5%) by session 15, and high accuracy decoding performance was maintained through session 84, more than eight months after implant. Average Copy Task decoding performance in the final 5 evaluation sessions had a word error rate of 2.5% (95% CI, 2.0% to 3.1%) at the participant’s self-paced speaking rate of 31.6 words per minute (95% CI, 31.2% to 32.0%; Fig. S1), with individual day’s average word error rates ranging from 1.0% to 3.3% (Table S3). The neuroprosthesis’ communication rate far exceeded the participant’s standard means of communication using a head mouse or skilled interpreter (Fig. S1a).
The system achieved high accuracy at the start of sessions, maintaining an average word error rate of 4.9% (95% CI, 3.7% to 6.2%) over the first 50 sentences of seven sessions (Fig. S14). Offline analyses indicated that this rapidly-available communication capability was enabled by the neuroprosthesis being able to maintain accurate decoding for at least twenty days without any new training data, and for multiple months when continuously fine-tuning the neuroprosthesis26 (Fig. S15). In decreasing order, the most informative electrode arrays for decoding speech were in areas ventral 6v (best), 55b, 4, and dorsal 6v (worst; Fig. S16).
Decoding errors tended to occur between phonemes that are articulated similarly (Fig. S17). The speech decoder generalized to new words, and accuracy improved for a given word the more often it appeared in the training data (Fig. S18). It could also generalize to different attempted amplitudes of speech, including non-vocalized speech (Fig. S19).
Conversational speech using the brain-computer interface
We developed a system for the participant to have conversations via self-initiated speech. Relying exclusively on cortical neural activity, the speech neuroprosthesis detected when he started or stopped speaking and decoded his attempted speech accordingly (Section S1.08, Fig. S21). The system rarely falsely detected that he wanted to speak (Fig. S22). Additionally, the participant had the option to use an eye tracker for selecting actions (Fig. 3a) to i.) finalize and read aloud the sentence, ii.) indicate whether the sentence was decoded correctly or not, or iii.) spell out words letter-by-letter that were not correctly predicted by the decoder (e.g., because they were not in the vocabulary, such as certain proper nouns).
The participant’s first use of the neuroprosthesis for naturalistic communication with his family is exemplified in Fig. 3b (Video 3; Table S4 provides additional transcripts). In subsequent sessions, he utilized the neuroprosthesis for personal use (e.g., Videos 4-5), communicating a total of 1189 sentences from sessions 1-31. For the majority of these Conversation Mode sentences (925 sentences; 77.8%) we were able to confirm the participant’s intended speech through directly asking him, contextual analysis, and examining the decoded phoneme probability patterns. Self-initiated sentences for which we knew the ground-truth were decoded with a word error rate of 3.7% (95% CI, 3.3% to 4.3%; Fig. 3d), although we note that the word error rate reported only for sentences with known ground truth labels may be inflated (Section S1.09). For one session where we validated the ground truth of every sentence (43 sentences, 873 words) with the participant, the word error rate was 2.5% (95% CI, 1.3% to 4.0%). Using the neuroprosthesis, the participant told the research team, “I hope that we are very close to the time when everyone who is in a position like me has the same option to have this device as I do” (Table S4). The participant used the neuroprosthesis in this Conversation Mode during 72 (out of 84 total) sessions over 8 months (248.3 cumulative hours; 22,697 sentences). The longest continuous use of the speech neuroprosthesis in Conversation Mode was 7.7 hours. He used it to perform activities ranging from talking to the research team and his family and friends, to performing his occupation by participating in videoconferencing meetings and writing documents and emails.
Discussion
Beginning on the first day of device use, a brain-to-text speech neuroprosthesis with 256 recording sites in the left precentral gyrus accurately decoded intended speech in a man with severe dysarthria due to ALS. He communicated using a comprehensive 125,000 word vocabulary on the second day of use. Within 16 hours of use, the neuroprosthesis correctly identified 97.3% of attempted words. To contextualize this 2.7% word error rate, the state-of-the-art for English automated speech recognition (e.g., smartphone dictation) has an approximate 5% word error rate27 and able speakers have a 1-2% word error rate28 when reading a paragraph aloud. We believe that the high decoding accuracy demonstrated in this study indicates that speech neuroprostheses have reached a level of performance suitable for rapidly and accurately restoring communication to people living with paralysis.
This study’s participant used the brain-to-text speech neuroprosthesis to converse with family, friends, healthcare professionals, and colleagues. His regular means of communication without a neuroprosthesis involved either (1) having expert caregivers interpret his severely dysarthric speech, or (2) using a head-mouse with point-and-click selections on a computer screen. The investigational BrainGate Neural Interface System became his preferred way to communicate with our research team, and he used it on his own time (with a researcher’s assistance to connect and launch the system) to be able to more rapidly write and communicate as part of his occupation and family life. The participant and his family and friends also reported being pleased with the own-voice text-to-speech at the end of each sentence; they indicated that the system’s voice did resemble his own.
This study demonstrated a large reduction in the quantity of training data required to achieve high accuracy decoding. In our previous study21, the participant attempted to speak 260-480 sentences at the start of each day, after which up to ∼30 minutes of computation time was required until the speech neuroprosthesis was ready for use. That previous study’s reported closed-loop results were measured starting 113 days post-implant, and used 16.8 hours of training data, collected over 15 days, to achieve a word error rate of 23.8%. A previous ECoG speech neuroprosthesis required 17.7 hours of training data, collected over 13 days, to reach a word error rate of 25.5%19. The new neuroprosthesis demonstrated here provided over 99% accuracy on a limited set of 50 words17 after just 30 minutes of training data on the very first day of use. It also achieved over 95% accuracy on a large vocabulary after collecting 6.6 cumulative hours of training data (over 7 sessions), and offline analyses indicate that optimized methods could provide >91% accurate large-vocabulary communication on the first day of use.
Previous studies have reported that intracortical devices require frequent recalibration29–31. Motivated by recent studies showing that multiple days of neural data can be used to calibrate an effective decoder for a new day32,33, here we demonstrated that a speech decoder trained with multiple previous days of data could similarly be used to provide >95% accuracy at the start of a new session. Future work is needed to establish whether the strategy we employed26 can maintain performance indefinitely in the absence of ground-truth labels of intended speech, but it is encouraging that across the 29 session days where the participant used the neuroprosthesis solely for personal use, he only requested to recalibrate the decoder on three occasions. Each calibration took approximately 7.5 minutes, during which twenty Copy Task sentences were displayed to provide ground-truth training labels and thereby rapidly update the decoder.
It is possible that an important factor enabling the higher performance of this study relative to our prior intracortical speech neuroprosthesis21 was doubling the number of microelectrodes in speech motor cortex. Our finding that ∼200 electrodes in these regions is sufficient for very high accuracy brain-to-text communication provides an important design parameter to guide ongoing efforts to build neural interface hardware that can reach patients at scale. Using an improved phoneme-to-sentences language model (relative to 21) also improved performance (Fig. S8), and the participant’s slow speaking rate (Fig. S1) may also have contributed.
In addition to recording from two arrays in the putative ventral portion of area 6v (speech motor cortex) as in 21, we also targeted one array each into two areas which, to our knowledge, have not previously been recorded from with multielectrode arrays: area 4 (primary motor cortex, which in humans is often in the sulcus22 and thus largely not accessible with Utah arrays) and area 55b. We found that the strongest phoneme encoding was from the array in ventral 6v, which is consistent with our previous participant21. The array in area 4 also showed high phoneme encoding, as did the array in area 55b, which has recently been proposed as an important node in the wider speech production network23. We note that these brain area descriptions are estimations based on precisely aligning the participant’s brain to a Human Connectome Project derived atlas using multi-modal MRI.
Limitations
As with other recent clinical trial reports in the nascent field of implanted speech brain-computer interfaces17–19,21, this study involved a single participant. Future work with additional participants is needed to establish the across-individual distribution of performances for speech decoding. Whether similar results can be expected in future users may depend on whether the signal-to-noise ratio of this participant’s speech-related neural signal modulation is typical. Nevertheless, these data, when combined with our previous speech decoding results with two 64-electrode arrays in area 6v21, demonstrate both successful initial replication and subsequent methodological improvements of the intracortical speech neuroprosthesis approach. Furthermore, while the demonstrated brain-to-text capabilities can provide widely useful communication, they do not capture the full expressive richness of voice; the more difficult challenge of closed-loop brain-to-voice synthesis remains an active area of speech neuroprosthesis research19,34. Notable steps that remain before brain-to-text neuroprostheses are likely to be adopted at scale include removing the percutaneous connector (i.e., making the electrodes fully implanted with wireless power and telemetry), reducing the size and number of computers used for processing, and automating the software so that users and their care partners can operate it independently.
Decoding speech in patients with other neurological impairments
The participants in both this study and our previous report21 had dysarthria due to ALS. Further work is needed to assess whether similar methods will work for other etiologies of dysarthria. Given that we recorded from ventral precentral gyrus, which is upstream of the neuronal injury incurred in many conditions, and that recent ECoG speech neuroprostheses were demonstrated in two individuals with brainstem stroke17–19, we predict that this approach will also work in other conditions35.
We note that the participant retains voluntary (albeit impaired) bulbar muscle control, normal sensation, and normal hearing. We do not know whether our approach will continue to restore communication should he develop anarthria. We also do not know if our approach will work in patients who already have anarthria or have hearing impairments.
Long-term stability of neural decoding
Prior reports have described that neural recording quality can decrease over years with the type of microelectrode array used in this study (e.g., 36). While intracortical neural decoding of attempted hand movements has been shown to be sustained beyond five years37–40, our study reports data only up to 8 months after implantation. It is not known how well decoding performance will be sustained over time.
Conclusion
Overall, the rapid and highly accurate restoration of full vocabulary, speech-based communication, enabled by an intracortical neuroprosthesis and used by a person with advanced ALS, suggests that this approach may be useful in improving the communication ability and autonomy of people with severe speech and monitor impairments.
Audio 1 - Demonstration of the participant’s unintelligible dysarthric speech
The participant is attempting to say prompted sentences aloud in an instructed delay Copy Task displayed on the screen in front of him (session 10; see Video 2). He retains intact eye movement and limited orofacial movement with the capacity for vocalization, but is unable to produce intelligible speech. At the end of each sentence, the decoded sentence is read aloud by a text-to-speech algorithm that sounds like his pre-ALS voice. Link to listen online: https://ucdavis.box.com/s/f1qt5p4r9bdqflqgm519p49i10isefp5
Video 1 - Copy Task speech decoding (session 10)
This video shows the same speech decoding trials as in Audio 1 (session 10). Prompted sentences appear on the screen in front of the participant. When the red square turns green, he attempts to say the prompted sentence aloud while the speech decoder predicts what he is saying in real time. In this video, he is signaling the end of a sentence by using an eye tracker to hit an on-screen “done” button. The participant could also end trials by attempting to squeeze his right hand into a fist, the neural correlates of which were decoded (see Video 2; Section S6). At the end of each trial, the decoded sentence is read aloud by a text-to-speech algorithm that sounds like his pre-ALS voice. Link to view online: https://ucdavis.box.com/s/b7g9jmobangbjvdp28gb8d0d5esy7fuy
Video 2 - Copy Task speech decoding (session 17)
This video shows another example of Copy Task speech decoding from session 17. Prompted sentences appear on the screen in front of the participant. When the red square turns green, he attempts to say the prompted sentence aloud while the speech decoder predicts what he is saying in real time. In this video, he is signaling the end of a sentence by attempting to squeeze his right hand into a fist, the neural correlates of which are decoded (Section S6). At the end of each trial, the decoded sentence is read aloud by a text-to-speech algorithm that sounds like his pre-ALS voice. Link to view online: https://ucdavis.box.com/s/bj155qbekzc93l5b2bbjd9h7xb4fm68r
Video 3 - First self-directed use of the speech neuroprosthesis (session 2)
The participant uses the speech neuroprosthesis to say whatever he wants for the first time (125,000-word vocabulary; session 2). He chose to speak to his daughter; the transcript of what he said is also available in Fig. 3b. Because the dedicated Conversation Mode (and speech detection) had not yet been developed during session 2, the participant waits for the onscreen square to turn from red to green before attempting to speak. Link to view online: https://ucdavis.box.com/s/7svzpzq2ldxp0wbqd4ewdry4cc5jdalm
Video 4 - Conversation Mode speech decoding (session 31)
The participant is using the speech decoder in Conversation Mode to engage in freeform conversation with those around him. The video is muted while conversation partners are speaking for privacy reasons. The speech neuroprosthesis reliably detects when the participant begins attempting to speak, and shows the decoded words on-screen in real time. He can signal the end of a sentence using an on-screen eye tracker button (“DONE” button in the top-right of the screen), or by not speaking for 6 seconds (as he does in this video), after which the neuroprosthesis finalizes the sentence. At the end of each sentence, the decoded sentence is read aloud by a text-to-speech algorithm that sounds like his pre-ALS voice. Finally, the participant uses the eye tracker to confirm whether the decoded sentence was correct or not. Correctly decoded sentences are used to fine-tune the neural decoder online. Link to view online: https://ucdavis.box.com/s/ezpz1fbizsma8cqmavjhhlarce7a2v40
Video 5 - Conversation Mode speech decoding (session 80)
Another example of the participant using the speech decoder in Conversation Mode to engage in freeform conversation with those around him. The participant is speaking to a researcher about movies. He is able to select on-screen buttons using an eye tracker. The participant interface for Conversation Mode was updated in session 72 to add new features (see Section S1.09), including the ability for him to control when text-to-speech audio is played. The own-voice text-to-speech model was updated in session 55 to sound more lifelike and closer to the participant’s pre-ALS voice (see Section S5.02).
Link to view online: https://ucdavis.box.com/s/f7glntzwvj9h4p79yic5zfrq5844o2z2
Data Availability
Derivatives of the neural data, including RNN probabilities and language model outputs, which can reproduce the reported performance quantification measurements and figures will be made publicly available on Dryad at publication. Code that implements an offline reproduction of the central findings in this study (high-performance neural decoding of real-time attempted speech) will be made publicly available at publication.
List of investigators and contributions
List of investigators (authors)
Nicholas S. Card
Maitreyee Wairagkar
Carrina Iaccobacci
Xianda Hou
Tyler Singer-Clark
Francis R. Willett
Erin M. Kunz
Chaofei Fan
Maryam Vahdati Nia
Darrel R. Deo
Aparna Srinivasan
Eun Young Choi
Matthew F. Glasser
Leigh R. Hochberg
Jaimie M. Henderson
Kiarash Shahlaie
David M. Brandman*
Sergey D. Stavisky*
* These authors contributed equally
Author contributions
N.S.C. led the experiments, implemented the real-time neural decoder and language models, designed and implemented the Copy Task and the self-initiated Conversation Mode, analyzed the data, and created the figures. N.S.C. and M.W. developed and implemented the real-time neural signal processing, noise removal, and feature extraction pipelines. N.S.C., M.W., X.H., and T.S-C. coded the real-time data collection system and built the neuroprosthetic cart system. N.S.C., M.W., and C.I. collected the primary data for this study. N.S.C. and C.I. interfaced with the participant and scheduled research sessions. E.M.K. ran hyperparameter sweeps to identify optimal hyperparameters for the neural decoder. C.F. designed and trained the language models. N.S.C. and M.V.N. created the own-voice text-to-speech models. A.S. checked for acoustic contamination and contributed to the speech intensity analyses. E.Y.C. acquired the MRI data for the HCP cortical parcellation for surgical targeting. F.R.W., D.R.D., E.Y.C, and M.F.G. processed and interpreted the HCP cortical parcellation data and provided presurgical target locations for array implantation. D.M.B and K.S. led planning and performing the surgical implant placement procedure. L.R.H. is the sponsor-investigator of the multisite clinical trial. D.M.B. was responsible for all clinical trial-related activities at U.C. Davis. N.S.C., M.W., S.D.S., and D.M.B. were involved in conceptualization of the study and experimental design. S.D.S. and D.M.B. supervised all aspects of the project. J.M.H. supervised the team members at Stanford. N.S.C., S.D.S., and D.M.B wrote the manuscript. All authors reviewed and helped edit the manuscript.
Supplemental Text
S1 - Experimental procedures
S1.01 - Study participant
This study includes data from one participant (referred to as ‘SP2’ in this preprint rather than the actual trial participant designation, which the participant is familiar with, as per medRxiv policy) who gave informed consent and was enrolled in the BrainGate2 clinical trial (identifier: NCT00912041). This pilot clinical trial was approved under an investigational device exemption by the US Food and Drug Administration (Investigational Device Exemption #G090003). Permission was also granted by the Institutional Review Boards at Mass General Brigham (#2009P000505) and the University of California, Davis (protocol #1843264). SP2 gave consent to publish photographs and videos containing his likeness. All research was performed in accordance with the relevant guidelines and regulations.
BrainGate2 is an ongoing open-label, non-blinded, multi-center, sponsor-investigator-led, feasibility study investigating the safety of chronically implanted electrodes (Utah array) in people living with paralysis. The primary endpoint of the trial quantifies the safety of the implanted device at 13 months after implantation. After meeting the primary endpoint, participants may decide to remain enrolled in the clinical trial, or to have their device explanted. Inclusion criteria include patients living with spinal cord injury, brainstem stroke, and motor neuron diseases such as ALS. Further details of the interim safety analysis have recently been reported1.
SP2 is a left-handed man in his 40’s with Amyotrophic Lateral Sclerosis (ALS). Four years before enrollment in the BrainGate2 clinical trial, he developed fasciculations and cramping in the legs, and was originally diagnosed with benign fasciculation syndrome. His weakness progressed resulting in recurrent falls and dropping objects. One year after initial symptoms he met El Escorial EMG criteria and was diagnosed with ALS by a neuromuscular-disease board-certified neurologist. At initial diagnosis he had mild weakness in the upper and lower extremities, and was still ambulatory with a cane. SP2 was enrolled into the clinical trial after providing informed consent. At the time of enrollment, he had no functional use of his upper and lower extremities (with the exception of 4/5 MRC-grade strength in knee extension and flexion) and had severe dysarthria, with an ALSFRS-R = 23. Since enrollment, he has maintained a modified mini-mental status exam score of 27 (the highest scores attainable). At the time of this report, he still retains eye and neck movements, but has limited orofacial movement with a mixed upper- and lower-motor neuron dysarthria resulting in a monotone, low-volume, hypernasal speech with off-target articulation. He requires non-invasive respiratory support only at night, and does not have a tracheostomy.
When speaking with non-expert listeners, he is unintelligible (Audio 1): his oral motor tasks on the Frenchay Dysarthria Assessment-2 were an “E” rating, representing non-functioning or profound impairment, and his reading of the “Grandfather Passage” was at 49 words in one minute, corresponding to severe intelligibility impairment2,3. When speaking to expert listeners, he communicates at 6.8 ± 5.6 (mean ± standard deviation) correct words per minute. His typing speed using his gyroscopic headmouse (Quha Zono 2), which enables him to move and click a mouse on a computer screen, is 6.3 ± 1.3 correct words per minute (Fig. S.7). The severity of dysarthria has remained stable during the time-period of this report, including the immediate postoperative period.
During the summer of 2023, four 64-electrode, 1.5 mm-length silicon microelectrode arrays coated with sputtered iridium oxide (Blackrock Microsystems, Salt lake City, UT) were placed in his left precentral gyrus. For more information on array targeting, see Sections S1.02 and S1.03, below.
S1.02 - Multi-modal MRI-based speech localization
Prior to microelectrode array placement, SP2 underwent a multi-modal MRI session for array targeting based on the Human Connectome Project (HCP’s) prior protocols4,5 and as described for BrainGate2 clinical trial participant ‘T12’6. SP2 was scanned in a 3T Ultra High Performance scanner (GEHealthcare) with a Nova 32-channel coil. Scan parameters were based on HCP Lifespan protocols and modified for the GE system (Table S1). Briefly, 0.8mm isotropic T1w and T2w images were acquired together with 2 mm isotropic resting state fMRI with TR = 800 ms in 4 runs each lasting 5 minutes and 45 seconds. In addition, phase reversed single band reference fMRI and spin echo MRI images geometrically and distortion matched to the MRI were acquired for distortion correction, unaliasing, and motion correction. The HCP’s minimal preprocessing pipelines4 were used to align the data within and across modalities, correct for image distortions, reconstruct white and pial cortical surfaces, and compute T1w/T2w myelin maps and cortical thickness maps. Subsequently, multi-run spatial Independent Components Analysis (sICA) was applied to remove spatially specific fMRI artifacts related to head motion, physiology, and the MRI scanner7.
These independent components were hand checked after initial automated classification using the “FIX” tool8 before non-aggressive regression of the artifactual components out of the fMRI data. Hand component classification was used because the application involved surgical planning. At this point the T1w/T2w myelin maps and fMRI data were used to align SP2’s brain to the HCP’s atlas space using MSMAll areal-feature-based surface registration9. This multi-modal cortical surface registration compensates for individual variability in areal size, shape, and position and enabled the HCP’s multi-modal cortical parcellation4 to be overlaid directly on SP2’s pial surface. The multi-modal surface registration was hand checked by comparing the multi-modal features computed from SP2’s brain to the same features in the atlas, with special attention paid to the features that defined the borders of areas 4, 6v, and 55b, including the T1w/T2w myelin maps (Fig. S2e) and multiple spatial ICA-based functional networks, including the language network (Fig. S2c-d), head sensori-motor network, and upper extremity sensori-motor network. Again these maps were hand checked given the neurosurgical targeting application. Once precise alignment of the HCP’s atlas of multi-modal cortical areas was confirmed, targets for the arrays were proposed, including targets in ventral area 6v, area 4, dorsal area 6v, and area 55b. These visual analyses were carried out within the HCP’s Connectome Workbench software (Fig. S2b-h).
S1.03 - Array placement targeting
The surgical targets for array placement within the precentral gyrus were chosen based on gross anatomical structure, vasculature, previous speech decoding studies6,10–12, and from estimates of cortical boundaries obtained using a cortical parcellation method derived from multi-modal Human Connectome Project (HCP) data (see Section S1.02, above). For two arrays, we targeted the dorsal and ventral aspects of area 6v due to their contributions to speech decoding in 6. A third array was targeted to speech primary motor cortex (area 4). We targeted area 55b for the fourth array due to emerging evidence of its importance in speech13.
S1.04 - Neural signal processing and feature extraction
Neural signals were recorded using Neuroplex-E headstages (Blackrock Microsystems) attached to two percutaneous connectors (each connected to two of the four implanted microelectrode arrays). The headstages analog filtered raw signals between 0.3 Hz to 7.5 kHz (4th order Butterworth filter) and performed analog-to-digital conversion with sampling rate of 30 kHz (250 nV resolution). 1 ms windows of the digitized 30 kHz signal from 256 channels were sent to our custom BRAND node (see Section S1.05) written in Python for real-time digital filtering and feature extraction.
Each incoming 1 ms neural signal window was first band pass filtered between 250 to 5000 Hz using a 4th order zero-phase non-causal Butterworth filter. 1 ms neural signal windows were padded on both sides (using the previous 1 ms window on the left side and 1.2 ms of mean padding on the right side) to minimize discontinuities at the edges. Linear Regression Referencing (LRR) was applied to each array-group of 64 electrodes to reduce noise artifacts6,14.
We extracted threshold crossings (putative action potentials) and spike-band power features from every 1 ms window of filtered and denoised neural signals. Threshold crossings were identified for each electrode if the voltage of the signal in this window crossed the threshold of −4.5 times the root mean squared (RMS) value of the neural signal for that electrode. Spike-band power was obtained by squaring the samples in the window and temporally averaging it for each electrode. Spike-band power was clipped at 12,500 µV2 to avoid outliers. This real-time signal processing, de-noising and feature extraction was performed in less than 1 ms, minimizing signal latency. These neural features were then binned into 20 ms non-overlapping bins. Binned threshold crossing counts were obtained by summing threshold crossings in 20 consecutive 1 ms neural feature windows. Binned spike-band power was computed by averaging spike-band power in 20 consecutive neural feature windows. Threshold crossings and spike-band power are commonly used measurements of local spiking (neuronal action potential) activity that have been shown to be comparable to sorted single unit activity in terms of decoding performance and neural population structure15–17. For brain-to-text decoding, binned threshold crossings and spike-band power from all 256 electrodes were assembled into a single 1 x 512 feature vector at every time step. Sequences of the feature vectors were smoothed and normalized before passing them into the RNN decoder (see Section S2.01).
At the start of each session, a short “diagnostic” block with attempted speech of repeated single words was recorded (see Section S1.07), used to estimate electrode-specific RMS thresholds for obtaining threshold crossing features, and LRR filter coefficients for de-noising signals as described above. Subsequently during the session, we recomputed these RMS thresholds and LLR coefficients after every block of neural data recording. Recomputing these parameters after every block helped with minimizing nonstationarities in the neural activity throughout the day.
S1.05 - Data collection rig
In this research, we used multiple computers to enable both rigorous scientific study of high bandwidth neural data and rapidly calibrated, real-time speech decoding with a 125,000 word vocabulary. The current form factor is larger (and has more compute capability) than is strictly necessary for the algorithms described here, in order to support rapid iteration and flexible exploration of different signal processing and decoding methods. A future, final-form medical device would reduce the current setup to an embedded system with a small, portable external component. We direct readers to a previous Utah arrays-based reach-and-grasp brain-computer interface study as an example of how a similar research data collection rig was made substantially more compact and portable18.
All real-time data collection, processing, analysis, and decoding was performed by a group of four computers connected in a local area network. A computer running Windows 10 interfaced with the Neuroplex-E system to start and stop neural data recording. A second computer (running Ubuntu 22.04 LTS) was used to process and extract neural features from raw 30 kHz neural data. A third computer (running Ubuntu 22.04 LTS) was responsible for real time neural decoding, fine-tuning the RNN model online, displaying the task to the participant, and displaying the task control GUI on the research team-facing monitor. Finally, a fourth computer (running Ubuntu 22.04 LTS) was used to run the language model that converted phoneme sequences to words. We used the Backend for Realtime Asynchronous Neural Decoding (BRAND19) to run our data collection computer setup. All code was written in Python, C, or MATLAB.
S1.06 - Overview of data collection sessions
Neural data were recorded in 5-7 hour long research sessions, which took place at the participant’s home 2-4 days per week. SP2 chose the dates and times for sessions to occur. Sessions typically included 1-2 breaks for food or beverages. During the sessions, SP2 sat in his power lift chair in an upright position. A computer monitor placed in front of SP2 displayed the task. An eye tracker mounted to the bottom of the computer monitor allowed SP2 to select on-screen “buttons” by looking at them. Data was collected in 15-25 minute “blocks” consisting of an uninterrupted series of trials. Trials could be paused as necessary and continued, or terminated, as appropriate. Between blocks, SP2 was encouraged to rest as needed. Table S2 lists all data collection sessions reported in this study. Across all sessions, we collected an average of 236 minutes of neural data. Microphone data was recorded at 30 kHz during all sessions and synchronized with neural data via the Neural Signal Processor’s analog in port. Video was recorded for all sessions from two cameras set up behind and in front of the participant.
In keeping with historical precedent in our clinical trial, we began data collection 25 days after implantation1. While a delay between surgery and device initialization is standard practice in clinically approved neuromodulation procedures such as deep brain stimulation and vagal nerve stimulation, in principle data collection could have begun within hours or days after implantation. Building on our previous demonstration of rapid point-and-click communication with first-time BCI users20, we began closed-loop speech neuroprosthesis evaluation on the first day when the participant was connected to the recording system (session 1).
S1.07 - Instructed delay Copy Task
In an instructed-delay Copy Task (Videos 1-2), a prompted sentence was displayed as text on a computer monitor facing SP2. A colored square changed from red to green to indicate when he should begin speaking. SP2 triggered the end of each sentence using an on-screen eye-tracker “button”, at which time the final decoded sentence was read aloud with a text-to-speech algorithm that was customized to sound like the participant’s pre-ALS voice21 (Section S5). To support future neuroprosthesis users incapable of eye gaze control, we also demonstrated sentence finalization triggered by neural decoding of SP2’s attempted hand squeezes (Section S6). The majority of Copy Task blocks were 50 trials long, which took 15-25 minutes depending on how long the prompted sentences were. SP2 controlled the pace of each Copy Task trial via either eye tracking (Section S1.11) or gesture decoding (Section S6). In early sessions, prior to eye tracking implementation, SP2 signaled to the researcher that he was done speaking at the end of each trial by making eye contact with the researcher, who would press a button to end the trial.
At the start of each session, we did a “diagnostic block”, which was an instructed delay task with 8 single-word cues each repeated 6-8 times. The word set consisted of the words ‘bah’, ‘choice’, ‘day’, ‘kite’, ‘though’, ‘veto’, ‘were’, and a ‘DO NOTHING’ condition where SP2 was instructed not to say or do anything, consistent with 6. Data from this block was used to calculate initial thresholds and weights for linear regression referencing, which were then updated after each subsequent block.
S1.08 - Conversation Mode
In Conversation Mode (Videos 3, 4, and 5), no prompted sentences were shown on screen. Instead, SP2 could say whatever he wanted. Video 2 shows the first ever self-directed use of the speech neuroprosthesis in Session 2, where he spoke to his daughter (Fig. 3b). In this first self-directed usage, speech detection had not been implemented yet, so SP2 waited until the square turned from red to green (like in the Copy Task) and then began to say whatever he wanted to.
In subsequent sessions, we developed a dedicated Conversation Mode that detected when SP2 spoke and decoded accordingly. In Conversation Mode, SP2 would initiate a new sentence by simply attempting to speak, which the speech neuroprosthesis would reliably detect using only neural data (Fig. S22). To accomplish this detection, the RNN decoder was always running in the background to predict phoneme probabilities every 80 ms. These phoneme probabilities were analyzed in real time to detect when speech had started (as indicated by the probability of any phoneme being higher than the probability of silence) or ended (if the probability of silence was higher than the probability of any phoneme for 6 consecutive seconds; this duration was determined over the first few sessions of this task to balance accidentally timing out a sentence early, with not making SP2 wait too long for it to end when he wanted it to). SP2 could end a sentence using the eye tracker (by dwelling his gaze over an on-screen button that said “DONE”) or by waiting six seconds for the trial to time out, after which time the final sentence was read aloud by the TTS algorithm. SP2 used the eye tracker to confirm whether the final decoded sentence was correct, or if not, he could specify whether it was “mostly correct” or “incorrect”. We also included a “spelling mode”, toggled by an on-screen eye tracker button, which allowed SP2 to spell out individual words by attempting to say each individual letter (e.g., “h”, “e”, “l”, “l”, “o”). Spelling mode leveraged a separate language model that had an output space limited to the 26 English letters, and was biased towards (but not limited) predicting letter sequences that formed valid English words. Sentences and letter sequences that were confirmed to be correct were used to fine-tune the RNN in the background (Section S2.03), which helped decoding performance remain stable and accurate throughout usage of the speech neuroprosthesis.
A major upgrade to Conversation Mode was implemented in session 72 (see Video 5). This upgrade included two major additional features: (1) SP2 could now control if and when the text-to-speech audio was played after each sentence by looking at an on-screen eye tracker button. Previously, every sentence was automatically read aloud upon finalization. This functionality helped SP2 to better fit his sentences into the flow of conversation, and also enabled him to play audio for sentences multiple times, either to repeat himself to get someone’s attention, or for comedic effect. (2) Upon SP2’s marking a final decoded sentence as “mostly correct”, the speech neuroprosthesis would now offer him up to 5 additional candidate sentences to choose from, which were the next five most likely decoded sentences. In each candidate sentence, words that differed from the original final decoded sentence were highlighted in yellow to make the differences easier to see. This functionality enabled SP2 to communicate more quickly by arriving at the correct decoded sentence without having to repeat himself.
The duration of personal use blocks ranged from approximately five minutes to over 7 hours. The user interface design of the self-paced Conversation Mode was iteratively improved in response to feedback from SP2 over the course of the study (e.g., see Table S4). SP2 used the speech neuroprosthesis in Conversation Mode for over 248 cumulative hours to speak with the research team, with family and friends, or with work colleagues. Conversations took place in-person, over phone and video calls, and over text messages and emails. In session 36 we added the (optional) ability for the speech neuroprosthesis to type final decoded sentences onto SP2’s personal computer by acting as a bluetooth keyboard. This functionality enabled SP2 to independently send text messages, emails, and perform other applications on his personal computer that involved text entry.
S1.09 - Decoder evaluation
To evaluate speech decoding performance, we computed phoneme error rate and word error rate using Levenshtein distance, which counts the number insertions, deletions, or substitutions necessary to match the decoded phonemes or words to the ground truth labels. For assessing the RNN output (without language models), we calculate the “raw phoneme error rate” by comparing the most probable phoneme decoded in each time step (duplicates removed) with the ground truth phoneme sequence. Consistent with 6, reported error rates were aggregated across all evaluation sentences from each session by summing the number of errors (insertions, deletions, or substitutions) for all sentences and then dividing it by the total number of words in those sentences. This helps prevent very short sentences from overly influencing the result. Confidence intervals for error rates were computed via bootstrap resampling over individual trials and then re-calculating the aggregate error rates over the resampled distribution (10,000 resamples).
Blocks where the participant was excessively tired, per his own report, were excluded from evaluation (2 of 41 total blocks); the word error rates on these blocks were 8.3% (session 14) and 5.3% (session 15). In each session, we collected 1-5 evaluation blocks (50-250 sentences). The number of evaluation blocks collected on any given day was dependent on the variable amount of time available in each session based on the participant’s schedule.
Before every session, an RNN was trained (Section S2.02) on all previous data. In early sessions (1-11), an additional new model was also trained halfway through data collection to calibrate it to the current day. From session 12 onward, after online fine-tuning was introduced, we stopped training a new model halfway through the day and instead relied on the online fine-tuning (Section S2.03).
We note that evaluating performance of Conversation Mode blocks has some inherent limitations. First, since some of the sentences may have been aborted early or be so incorrect as to be unintelligible, an estimate of the ground truth of these sentences would be impossible to determine. Second, aside from the session where we verified the content of all attempted sentences by asking the participant to identify ground truth labels for incorrectly decoded sentences, we used a combination of directly asking the participant, and examining the predicted phoneme patterns and the top 100 candidate sentences for each utterance in light of the context of the conversation; the latter approach involves some subjectivity.
S1.10 - Sentence selection
For 50-word vocabulary decoding (sessions 1 and 2), custom-written prompted sentences contained words from a 50-word vocabulary10. For 125,000-word vocabulary decoding (sessions 2 onward), sentences were sourced from the Switchboard corpus22, as in 6. We chose to use the same vocabularies as these two prior studies to facilitate comparison of speech decoding performance. Additional training sentences were sourced from the OpenWebText2 corpus23 and the Harvard Sentences24 in an effort to expand the sampled vocabulary and thus the decoder’s ability to generalize (Fig. S18, S17). Sentences were manually screened for grammatical errors or offensive language. We collected data for 10,481 prompted sentences over 50 sessions, totaling 67.4 hours of neural recording.
S1.11 - Eye tracking
SP2’s gaze data were tracked using a Tobii Pro Spark eye tracker (Tobii AB, Stockholm, Sweden). Eye tracker calibration was performed at the beginning of each session, and repeated as necessary between data collection blocks. During data collection blocks, SP2’s on-screen gaze location was recorded at 60 Hz and used by the speech neuroprosthesis to allow him to select on-screen “buttons” by looking at them for 0.5 seconds. Gaze data was recorded independently from each eye before being averaged and smoothed over time. All eye tracker calibration, gaze data recording, and logic for on-screen button selection was done with custom written Python code that was integrated into our BRAND-based19 data collection rig.
S1.12 - Estimating electrode array placement in MNI space
We estimated the location of the four Utah arrays in the precentral gyrus using the asymmetric MNI2009b atlas. This was performed as follows. First, by corroborating intra-operative photographs with a post-operative CT scan co-registered to the pre-operative anatomical T1 MRI, we identified the center-point of each array on SP2’s precentral gyrus. Second, we aligned SP2’s brain to ACPC coordinate space, and computed the ACPC coordinate of each array. Third, we used the Advanced Normalization Tools package25, to “skull strip” the MRI (using the asymmetric MNI2009b atlas as a template), and then applied antsRegistration to perform non-linear deformation from SP2’s anatomical T1 to the MNI2009b template26,27. Finally, we applied the warping composite deformation of the ACPC coordinates of the arrays.
This approach yielded the following coordinates:
We note that this approach to estimating corresponding array location in MNI space is highly dependent on parameters chosen at each step of the ANTs registration. The “skull stripping” procedure used the default parameters provided in the ANTs registration package of version 2.4.0. Motivated by author D.M.B.’s experience in performing nonlinear deformations of template atlases for post-operative analysis of stereotactic procedures, the following parameters were chosen, implemented using the open-source nipype package (version 1.8.3):
Transforms = [’Rigid’, ‘Affine’,’SyN’]
Transform_parameters = [(0.1,), (0.15,), (0.3,3,0)]
Convergence_window_size = [10,10,10]
Shrink_factors = [[12,8,4,1], [8,4,2,1], [8,6,4,2,1]]
Convergence_threshold = [1e-6, 1e-6, 1e-6]
Sigma_units = [’vox’, ‘vox’,’vox’]
Metric = [’MI’,’MI’,’MI’]
Smoothing_sigmas = [[4,3,2,1],[4,3,2,1],[4,3,2,1,1]]
Number_of_iterations = [[1000,500,250,0],[100,500,250,0],[200]*5]
Radius_or_number_of_bins = [32, 32, 32]
Sampling_strategy = [’Regular’, ‘Regular’,’Regular’]
Sampling_percentage = [0.25, 0.25, 0.25]
S2 - RNN decoder
S2.01 - RNN architecture and feature preprocessing
A recurrent neural network was used in this study to predict sequences of phoneme probabilities from neural data. In brief, the RNN consisted of (1) linear day-specific input layers to correct for nonstationarity in neural data between days, (2) 5 layers of gated recurrent unit (GRU) architecture with 512 units per layer, and (3) a dense output layer that outputted the probabilities of 41 classes (39 phonemes, silence, and a CTC blank token). The RNN ran every 4 bins (20 ms per bin) to predict phoneme probabilities from the most recent 14 bins of neural data (280 ms). Thus, a set of probabilities of each phoneme being spoken were generated by the RNN every 80 ms.
Before neural features were input into the RNN, they were z-scored using their means and standard deviations from the speech epochs of the previous 20 trials, and then smoothed with a Gaussian kernel (s.d. = 40 ms) that was delayed by 160 ms. The smoothing step reduced noise in the data, but also added 160 ms of latency to the neuroprosthesis, which was not consequential for the brain-to-text decoding described here, but may negatively affect performance in other decoder paradigms where minimizing feedback latency is important. Connectionist temporal classification (CTC) loss was used to output a sequence of predicted labels (phonemes) from an unlabeled time sequence of neural data. For additional details about the RNN architecture, refer to the supplemental methods section of 6. All RNN parameters are listed in Table S5. We note that the hyperparameters chosen here were optimized for decoding SP2’s attempted speech (Fig. S9) and may need to be individually tuned to future users.
S2.02 - Offline RNN training
The offline RNN training protocol used here is the same as in 6. A new RNN was trained before each session, and also mid-session for the first 11 sessions (before online fine-tuning was introduced). Whenever an offline RNN was trained, 90% of all previous data were used for training, and 10% of data were randomly (uniformly from each session) held-out for validation. Prompted sentences from each trial were converted to a sequence of phonemes using the Python g2p-en package 28. The RNN was trained with neural feature sequences and target phoneme sequences for 2,000-50,000 batches (the number of batches was increased as the training data pool grew throughout data collection), and the learning rate was linearly decayed from 0.02 to 0.0 across all batches. In each batch, up to 64 trials of data from a randomly selected session were input into the corresponding day-specific input layer followed by the GRU and dense layers. Neural data was dynamically augmented on a batch-by-batch basis during training to improve decoder generalizability and stability by adding (1) white noise and (2) artificial constant offsets to the neural features29. At the end of each batch, weights for the relevant day-specific input layer, GRU layers, and dense output layer were updated using stochastic gradient descent (ADAM; β1 = 0.9, β2 = 0.999, ε = 0.1). We applied dropout and L2 weight regularization during training to improve generalization. Parameters used for online speech decoding are listed in Table S5.
S2.03 - Online RNN fine-tuning
Online RNN fine-tuning was introduced in session 12 in an effort to frequently adjust the RNN to shifts in neural signals to ensure that speech decoding performance remained consistently high throughout a session. The online fine-tuning method employed here is similar to the one introduced in 30 for a handwriting decoder, but was adapted here to work with speech decoding tasks. At the beginning of a new session, an RNN trained on all previous data was loaded, and the day-specific input layer corresponding to the most recent session was duplicated. After each trial in the new session (starting after 10 trials of data had accumulated in sessions 12-25, and 5 thereafter]), the weights of this day-specific input layer and the weights of the base GRU model were updated using the neural data and ground-truth sentences from each trial. Ground-truth sentences were defined as either the prompted sentence (in the Copy Task) or as decoded sentences that SP2 confirmed to be correct (using the eye tracker) in the self-initiated Conversation Mode. During each fine-tuning epoch, data from previous sessions was randomly sampled in a proportion of 60% new data (from the current session) to 40% old data (from previous sessions) when training the model in an effort to ensure that the model did not overfit to the current day’s data. A static learning rate of 0.04 was used to fine-tune the RNN throughout the session. For additional details about online RNN fine-tuning, refer to 30. Parameters used for online speech decoding are listed in Table S5.
S2.04 - Hyperparameter optimization
Optimal hyperparameters for RNN architecture and training were determined with hyperparameter sweeps twice throughout data collection (Fig. S21), and optimal parameters were used in subsequent online decoding sessions. RNNs were trained offline (using all previous data) with one hyperparameter varied at a time. Each RNN was validated on randomly-selected held-out validation trials to evaluate performance (raw phoneme error rate). For each parameter condition, 10 RNNs were trained and their validation performances were averaged.
We note that all data used to train the hyperparameters for the RNN were exclusively from participant SP2. An open question remains whether the hyperparameters from one participant could be used to seed a generic RNN, to be used with a future participant.
S3 - Language model
S3.01 - Architecture
The n-gram language models in this study take sequences of phoneme probabilities as an input and output the most likely sequence of words. The 50-word language model used here was a 5-gram model trained on 2,413 custom-written sentences that contained only words from the 50 word vocabulary10. This model outputs only the singular most likely sequence of words. The 125,000-word language model, trained on the OpenWebText2 corpus23, was the same 5-gram model described for post-hoc offline analyses in 6. The 125,000-word vocabulary included in this language model stems from the CMU dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) and encompasses the majority of the English language; native English speakers typically know ∼20,000 to 40,000 words31,32. We added a list of names provided by the participant (e.g., those of his family members and the research team) to the output vocabulary space of the language model, enabling him to say those names using the speech neuroprosthesis. This language model initially predicts up to 100 of the most likely sequences of words, before rescoring them in multiple stages to identify the singular most likely sequence of words. To use this language model online and in real time, we implemented custom-written Python code to integrate it into our real-time decoding system. For additional details about the language models utilized in this study, refer to the supplemental methods section of 6. Parameters used for online speech decoding are listed in Table S5.
S3.02 - Hyperparameter optimization
As we collected data, we ran offline language model hyperparameter sweeps to identify optimal speech decoding parameters (Fig. S21). In particular, the acoustic scale and alpha parameters had the biggest impact on decoding performance. The acoustic scale parameter is the weighting ratio between the RNN-derived phoneme probabilities and the n-gram derived sentence probabilities, and the alpha parameter is the weight ratio between the n-gram model rescoring and the OPT LLM rescoring (see supplemental methods in 6 for more details). After hyperparameter tuning, we used these identified optimal hyperparameters to enhance speech decoding accuracy online in subsequent speech decoding sessions.
S3.03 - Automatic sentence punctuation
Starting in session 38, the final decoded sentence was automatically punctuated by passing the predicted word sequence text through a publicly available open-source auto punctuation model (“fullstop-punctuation-multilang-large”)33. This model is capable of quickly predicting the punctuation of English, Italian, French and German texts. Punctuation was not always correct, and the participant was instructed not to consider punctuation in his assessment of correctness when using the speech neuroprosthesis in Conversation Mode. The participant reported being largely pleased with this auto-punctuation implementation, and the punctuation also helped to add appropriate pauses to audio generated by the text-to-speech model.
S4 - Offline analyses
S4.01 - Offline RNN analyses
Offline RNN decoding analyses were utilized in Figures S6, S10, S11, S13, S14, S15, S19, and S20 of this study. All analyses averaged the results from 5-10 RNN seeds per condition. Chance decoding values (reported in Fig. S16c) were calculated by training a decoder where the phonemes of the ground-truth sentence for each trial were shuffled. This allowed us to maintain the statistical distribution of uttered phonemes while obtaining a chance decoding value. For each electrode count condition of the electrode dropping analysis (Fig. S16b), random electrodes were sampled uniformly from all arrays, and data from unused electrodes was zeroed out. Unless specified, RNN architecture and hyperparameters remained consistent between all conditions.
S4.02 - Offline language model analyses
Offline language model performance analyses were performed for Figures S12, S15, and S21. In these offline analyses, RNN-predicted phoneme sequences, either from online speech decoding sessions or from offline RNN analyses (see Section S4.01), were sequentially fed into language models for each trial of data, before finalization. Offline language models were initialized with a range of parameters or vocabularies as relevant for the analysis. Language model performance was assessed as the aggregate word error rate of all predicted sentences, calculated as described in Section S1.09.
S4.03 - Acoustic contamination analysis
We assessed whether the neural data was contaminated by the audio vibrations associated with attempted speech due to a potential microphonic effect at one or more elements of the electrophysiology recording chain34. We designed a Speech Amplitude task where SP2 was instructed to say single words (“be”, “go”, “my”, “know”, “have”, and “going”) at a specified amplitude ( “SILENT”, “WHISPER”, “NORMAL”, and “LOUD”). In offline analyses, we then performed the same analysis as described in 34 to determine if the neural data was potentially acoustically contaminated.
The Speech Amplitude task was designed similarly to the Copy Task described in Section S1.07. Here, a prompt consisting of a word and the required amplitude at which the word needs to be spoken was displayed as text (for example, “LOUD: be”) on a screen facing SP2. A red square on the screen turned to green and a beep sound played over a speaker, prompting SP2 to attempt to speak. Each trial had a fixed duration of 3 seconds. The words used in this task were selected because they are some of the most frequently sampled words in the Switchboard dataset. For the “SILENT” amplitude, SP2 was instructed to attempt to speak absolutely silently, as if he was mouthing a word to someone across a room. For the “NORMAL” amplitude, SP2 was instructed to maintain the same speech amplitude level as used during the Copy Task (Section 1.07). We also included a “DO NOTHING” prompt, where SP2 was instructed to not say or do anything.
In each research block, the 25 prompts (6 words × 4 amplitudes + “DO NOTHING”) were repeated 5 times each in a random order. Each block took approximately 10 minutes. In total, we collected 5 blocks of data, resulting in 625 total trials. Two trials were excluded because SP2 was coughing during them.
We used the methods and code described in 34 to check each trial from the speech amplitude task for acoustic contamination. Neural data from 585/623 trials (93.9%) did not have evidence of acoustic contamination (p > 0.05; Fig. S7a). Neural data from the remaining 38/623 trials (6.1%) were flagged as potentially acoustically contaminated using a statistical threshold of p < 0.05. These trials with significant correlation between the microphone recording and multielectrode array voltages included 4 “DO NOTHING” trials, 6 “LOUD” trials, 9 “SILENT’’ trials, 10 “NORMAL” trials and 9 “WHISPER” trials. The flagged trials were distributed across the different speech amplitudes and, given that we have chosen alpha to be 0.05, we expect ∼5% of the trials will be flagged as potentially contaminated at random due to chance under the null hypothesis that there is no acoustic contamination. Thus, we consider the 6.1% of all trials that were flagged as potentially acoustically contaminated to be at the chance level, and conclude that there is no evidence that the neural data presented in this study was acoustically contaminated. This is supported by the finding that 9/149 trials during the “SILENT” condition showed potential acoustic contamination despite the complete absence of vocalized speech.
S4.04 - Decoding accuracy during attempted silent, whispered, or loud speech
In both offline and online analyses, we aimed to determine if the speech neuroprosthesis could generalize to other amplitudes of attempted speech, including attempted silent, whispered, or loud speech.
First, we used a decoder trained on speech attempted at a normal volume to predict (offline) what SP2 was attempting to say during the Speech Amplitude task described in Section S4.03. We found that the decoder generalized reasonably well to the other speech amplitudes (Fig. S19a), which could be decoded with much better accuracy than would be expected by chance (∼80%; Fig. S16a). It is not surprising that we observed the lowest phoneme error rate for the “normal” speech volume condition, considering that the decoder was trained solely on attempted speech using the same strategy.
We hypothesized that including more training data where SP2 speaks silently may help the decoder adapt to that condition and more accurately predict the intended words. To this end, we dedicated an entire session and a half to calibrating the speech neuroprosthesis with silent speech training data, and then evaluating how accurately it could predict silent speech online. We found that we could decode silent speech online with word error rates below 15%, and as low as below 5% (Fig. S19b). As predicted, word error rates started higher in early training blocks during each session and tended to be lower in subsequent blocks as the neuroprosthesis calibrated itself. These results show that silent speech could be decoded online with accuracy exceeding previous speech decoding accuracies6,12 but lower than SP2’s attempted normal amplitude speech performance. We predict that collecting additional silent speech data and optimizing the RNN and LM hyperparameters on that speech modality could further increase decoding accuracy, which remains an avenue for further investigation.
S5 - Own-voice text-to-speech
S5.01 - Fine-tuned VITS text-to-speech model
We trained a text-to-speech algorithm21 to sound like SP2’s pre-ALS voice. SP2 and his family provided us with home videos and other recordings of SP2 speaking. Recordings with clear samples of SP2’s voice were segmented into individual sentences, noise-reduced (RNNoise 1.435), and amplitude-normalized in preparation for training the TTS model. Signal-to-noise ratio (SNR; calculated with Waveform Amplitude Distribution Analysis (WADA) with the Coqui TTS Check-DatasetSNR notebook21) was used to quantify how noisy each audio clip was. Each audio clip was then manually transcribed, and the phoneme distribution across all audio clips was calculated (using the Python g2p-en package28) to ensure that each phoneme was adequately sampled. Comprehensive coverage of each phoneme is required to train robust text-to-speech models and accurately reproduce speech patterns.
For training an own-voice TTS, we chose the VITS model36, which we subjectively found to reproduce SP2’s pre-ALS voice most accurately and without a long delay at the end of each sentence. A pre-trained VITS model with LJSpeech corpus was fine-tuned using SP2’s processed audio samples to create a TTS that sounded like SP2’s pre-ALS voice. This TTS was used to read the final decoded sentence at the end of each trial in both the Copy Task and Conversation Mode. SP2 and his family reported that they found our recreation of his pre-ALS voice to be more representative of his pre-ALS voice than his previously purchased commercial version.
S5.02 - Fine-tuned StyleTTS2 text-to-speech model
In session 55 we deployed a significantly upgraded own-voice TTS model which was based on StyleTTS2, which leverages style diffusion and adversarial training with large speech language models to achieve human-level TTS synthesis37. Using the same training dataset described in Section S5.01, plus additional audio clips acquired from SP2 since training the original own voice TTS model, we followed the instructions available on the StyleTTS2 GitHub repo to fine-tune a pre-trained StyleTTS2 model to sound like SP2’s pre-ALS voice. The fine-tuned StyleTTS2 own-voice TTS model sounded much more naturalistic and less robotic than the previously described VITS own-voice TTS model. SP2 and several of his family members and friends independently agreed that the StyleTTS2 own-voice TTS model closely resembled SP2’s pre-ALS voice. The first time that SP2 used the new TTS with the speech neuroprosthesis, he was emotionally moved and used the speech neuroprosthesis to repeatedly tell the research team and his family: “I am so f*****g back.” (The censored-for-publication word was decoded and synthesized aloud with correct spelling and pronunciation.)
S6 - Gesture decoding for task control
People with ALS may lose precise and reliable eye gaze control as their disease progresses. Thus, eliminating reliance on eye gaze (e.g., by making the system controllable through exclusively neural signals) is necessary to ensure that the neuroprosthesis will remain usable in the long-term for users with degenerative diseases such as ALS. We provided SP2 with a “neural click” functionality as an alternative to eye tracker control for indicating that he is done speaking, which triggers sentence finalization and text-to-speech output. This neural click functionality was tested during closed-loop speech neuroprosthesis use in sessions 17 and 18 (275 total sentences).
S6.01 - Motor imagery
We chose “right-hand squeeze” as the motor imagery to perform the neural click, because it had a robust neural signal to noise ratio in previously collected SP2 movement sweep data, and the hand squeeze gesture has been used for neural click in prior intracortical brain-computer interface studies38,39. Other discrete gestures may have worked just as well for this purpose40.
S6.02 - Decoder architecture
We implemented a linear gesture decoder (independent from the RNN speech decoder) to solve the binary classification problem: “For each 10 ms bin, is the user attempting right-hand squeeze or not?”. We chose to use a linear discriminant analysis (LDA) model, because linear models are simple and fast to train, and in this case were able to reliably distinguish the neural correlates of hand squeezes from those of speech or silence in preceding offline tests.
S6.03 - Decoder training
We interspersed 16 trials of a “RIGHT HAND - CLOSE” condition into the instructed delay task (our “diagnostic block”) performed at the start of each session. After the diagnostic block, we trained the LDA classifier on all the trials (“RIGHT HAND - CLOSE” = click, single-word speech trials = non-click, “DO NOTHING” = non-click). The training data from each trial came from the epoch 0.5 - 1.5 seconds after the go cue. Each trial yielded 100 training samples, as our LDA classifier operated on individual 10 ms time bins.
Each training sample was a single time bin’s feature vector (256 threshold crossings + 256 spike band power = 512 features). These feature vectors were processed identically to those used for speech decoding (filtered, z-scored, etc.). To fit the LDA model on these training data, we used the LinearDiscriminantAnalysis class from the Python package sklearn41.
S6.04 - Decoder inference
After training the LDA classifier, we used it during closed-loop speech blocks to decode neural clicks in real-time. Because this LDA neural click decoder was independent from the RNN speech decoder, it was run in parallel using the BRAND software architecture. Though the LDA model outputted a prediction for every 10 ms time bin, a click was not immediately performed every time the LDA model predicted click. Instead, a click was only performed when all time bins in a 100 ms sliding window were predicted as click, to reduce spurious clicks. Additionally, after each click we maintained a refractory period of 1 second during which no additional clicks were performed, to avoid erroneous repeated clicking.
S7 - Trial Registration
Clinicaltrials.gov: NCT00912041
Date registered: June 2009
Date first patient enrolled in study: May 2009*
Number of patients enrolled before registration: 1*
Reason for delay: *No new patients were enrolled before registration. The “first” participant in this study was a transitional participant who was previously enrolled in another clinical trial. She was technically transitioned from that study to this study in May 2009, which was the date the Investigational Device Exemption (IDE) for this study was granted. The IDE at that time preceded the clinicaltrials.gov registration (June 2009); we’ve consistently used the May date to mark the enrollment of this transitional participant, as this is when all regulatory permissions required had been granted for her to transition to the current trial.
Supplemental Figures
Supplemental Tables
Footnotes
Author financial interests: Stavisky, Henderson, and Willett, are inventors on intellectual property owned by Stanford University that has been licensed to Blackrock Neurotech and Neuralink Corp. Wairagkar, Stavisky, and Brandman have patent applications related to speech BCI owned by the Regents of the University of California. Stavisky was an advisor to wispr.ai and received equity. Brandman is a surgical consultant to Paradromics Inc. Henderson is a consultant for Neuralink Corp, serves on the Medical Advisory Board of Enspire DBS and is a shareholder in Maplight Therapeutics. The MGH Translational Research Center has a clinical research support agreement with Neuralink, Synchron, Axoft, Precision Neuro, and Reach Neuro, for which LRH provides consultative input. Mass General Brigham (MGB) is convening the Implantable Brain-Computer Interface Collaborative Community (iBCI-CC); charitable gift agreements to MGB, including those received to date from Paradromics, Synchron, Precision Neuro, Neuralink, and Blackrock Neurotech, support the iBCI-CC, for which LRH provides effort. Glasser is a consultant to Sora Neuroscience, Manifest Technologies, and Turing Medical.
Funding:
NS Card:
○ A.P. Giannini Postdoctoral Fellowship
JM Henderson:
○ Stanford Wu Tsai Neurosciences Institute
○ HHMI
○ NIH-NIDCD (1U01DC019430)
SD Stavisky:
○ DP2 from the NIH Office of the Director and managed by NIDCD (1DP2DC021055)
○ DOD CDMRP ALS Pilot Clinical Trial Award (AL220043)
○ Pilot Award from the Simons Collaboration for the Global Brain (872146SPI)
○ Searle Scholars Program
○ Career Award at the Scientific Interface from the Burroughs Wellcome Fund
○ UC Davis Early Career Faculty Award for Creativity and Innovation
LR Hochberg:
○ NIH-NIDCD (U01DC017844)
○ VA RR&D (A2295-R)
The manuscript and supplemental appendix were revised in response to peer review. Existing figures were rearranged. New data, analyses, and figures were added. Aparna Srinivasan was added as an author because she contributed to data collection and analyses during the revision process.