Main

SARS-CoV-2 is an enveloped virus consisting of a positive-sense, single-stranded RNA genome of around 30 kb. Two overlapping ORFs, ORF1a and ORF1b, are translated from the positive-strand genomic RNA and generate continuous polypeptides, which are cleaved into a total of 16 nonstructural proteins (NSPs). The translation of ORF1b is mediated by a −1 frameshift that allows translation to continue beyond the stop codon of ORF1a. Negative-strand RNA intermediates are produced from the viral genome and serve as templates for the synthesis of positive-strand genomic RNA and subgenomic RNAs5. The subgenomic RNAs contain a common 5′ leader fused to different segments from the 3′ end of the viral genome, a 5′-cap structure and a 3′ poly(A) tail6,7. These distinct fusions occur during negative-strand synthesis at 6–7 nucleotide (nt) core sequences called transcription-regulating sequences (TRSs), which are located at the 3′ end of the leader sequence as well as preceding each viral ORF. The different subgenomic RNAs encode four conserved structural proteins—spike (S), envelope (E), membrane (M) and nucleocapsid (N)—and several accessory proteins. On the basis of sequence similarity to other betacoronaviruses, and specifically to SARS-CoV, the current annotation of SARS-CoV-2 includes predictions of six accessory proteins (3a, 6, 7a, 7b, 8 and 10, NC_045512.2), but not all have been experimentally confirmed8,9.

To capture the full coding capacity of SARS-CoV-2, we applied a range of ribosome-profiling approaches to Vero E6 cells infected with SARS-CoV-2 for 5 or 24 h (Fig. 1a). At 24 h post-infection (hpi) the vast majority of cells were infected and cells were still intact (Extended Data Fig. 1). For each time point, we prepared three ribosome-profiling libraries (Ribo-seq), each one in two biological replicates. To facilitate mapping of translation initiation sites, we prepared two Ribo-seq libraries by treating cells with lactimidomycin (LTM) or harringtonine (Harr), two drugs with distinct mechanisms that prevent elongation at 80S ribosomes at translation initiation sites. These treatments lead to excessive accumulation of ribosomes precisely at the sites of translation initiation and depletion of ribosomes over the body of the ORF (Fig. 1a). The third Ribo-seq library was prepared from cells treated with the translation elongation inhibitor cycloheximide (CHX), and provides a snapshot of actively translating ribosomes across the body of the translated ORF (Fig. 1a). In parallel, RNA-sequencing (RNA-seq) was applied to map viral transcripts. Analysis of cellular genes from the different Ribo-seq libraries revealed the expected distinct profiles in both replicates. Ribosome footprints displayed a strong peak at the translation initiation site, which, as expected, is more pronounced in the Harr and LTM libraries; the CHX library also exhibited a distribution of ribosomes across the entire coding region, and the mapped ribosome footprints were enriched in fragments that align to the translated frame (Fig. 1b, Extended Data Fig. 2a). As expected, the RNA-seq reads were uniformly distributed across coding and non-coding regions (Fig. 1b). The footprint profiles of viral coding sequences at 5 hpi fit the expected profile of translated sequences (Fig. 1c, Extended Data Fig. 2b) and the footprint densities were highly reproducible between biological replicates, at single-nucleotide resolution (Extended Data Fig. 2c). Notably, the footprint profile over the viral genome at 24 hpi did not fit the expected profile of translating ribosomes and was generally not affected by Harr or LTM treatments (Extended Data Fig. 2b). To further examine the characteristics of the footprints, we applied a fragment length organization similarity score (FLOSS) that measures the magnitude of disagreement between the footprint distribution on a given transcript and the footprint distribution on canonical coding sequences (CDSs)10. At 5 hpi, protected fragments from SARS-CoV-2 ORFs did not differ from highly expressed cellular transcripts (Fig. 1d). However, reads at 24 hpi could be clearly distinguished from cellular CDSs (Fig. 1e). We conclude that the footprint data from 5 hpi constitutes robust and reproducible ribosome footprint information but that viral protected fragments at 24 hpi may reflect additional interactions with viral RNA that occur at late time points in infection.
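The FLOSS comparison described above can be illustrated with a minimal sketch. It assumes the score is half the summed absolute difference between two footprint-length distributions, which is our reading of the definition in ref. 10; the function names and the 26–34 nt length window are illustrative assumptions, not the parameters used in this study.

```python
from collections import Counter


def length_distribution(lengths, min_len=26, max_len=34):
    """Fraction of reads at each footprint length in [min_len, max_len].

    The length window is an assumption made for this sketch.
    """
    counts = Counter(n for n in lengths if min_len <= n <= max_len)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads in the requested length range")
    return {n: counts.get(n, 0) / total for n in range(min_len, max_len + 1)}


def floss(transcript_lengths, reference_lengths, min_len=26, max_len=34):
    """Half the summed absolute difference between the two distributions.

    0 means identical length distributions; 1 means no overlap at all.
    """
    f = length_distribution(transcript_lengths, min_len, max_len)
    f_ref = length_distribution(reference_lengths, min_len, max_len)
    return 0.5 * sum(abs(f[n] - f_ref[n]) for n in f)
```

A transcript whose protected fragments have the same length distribution as the reference CDS set scores 0, whereas fragments of unrelated origin push the score towards 1.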

Fig. 1: Ribosome profiling of SARS-CoV-2 infected cells.

a, Vero E6 and Calu3 cells infected with SARS-CoV-2 were collected at 5, 24 (Vero E6) and 7 (Calu3) hpi for RNA-seq, and for Ribo-seq using LTM, Harr or CHX treatments. b, Metagene analysis of read densities relative to the maximal signal of each gene around the start and stop codons of cellular CDSs at 5 hpi. The read densities are shown with different colours indicating the three frames (red, 0; black, +1; grey, +2). c, Metagene analysis around the start codon, as described in b, for viral ORFs at 5 hpi. d, e, FLOSS score for cellular and SARS-CoV-2 ORFs at 5 hpi (d) and 24 hpi (e).

A global view of RNA and CHX-footprint reads mapping to the viral genome at 5 hpi demonstrates that RNA levels are constant across ORFs 1a and 1b, and steadily increase towards the 3′ end, reflecting the cumulative abundance of these sequences due to the nested transcription of subgenomic RNAs (Fig. 2a). Increased coverage is also seen at the 5′ untranslated region (UTR), reflecting the presence of the 5′ leader sequence in all subgenomic RNAs as well as the genomic RNA. Reduction in footprint density between ORF1a and ORF1b reflects the proportion of ribosomes that terminate at the ORF1a stop codon instead of frameshifting into ORF1b (Extended Data Fig. 3). By dividing the footprint density in ORF1b by the density in ORF1a we estimate a frameshift efficiency of 57% ± 12%. This value is similar to the frameshift efficiency measured by Ribo-seq of mouse hepatitis virus (MHV) (48–75%)3. Similar to observations in MHV and avian infectious bronchitis virus (IBV)3,11, we did not observe noticeable ribosome pausing before or at the frameshift site, but we identified several potential pausing sites within ORF1a and ORF1b (Extended Data Fig. 3).
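The frameshift estimate above is a ratio of two length-normalized footprint densities. The following sketch shows the arithmetic only; the function names are hypothetical and the read counts in the usage note are invented for illustration, not taken from our data.

```python
def footprint_density(read_count, region_length_nt):
    """Footprint reads per nucleotide over a region."""
    if region_length_nt <= 0:
        raise ValueError("region length must be positive")
    return read_count / region_length_nt


def frameshift_efficiency(orf1a_reads, orf1a_len, orf1b_reads, orf1b_len):
    """Density in ORF1b divided by density in ORF1a.

    Ribosomes that do not frameshift terminate at the ORF1a stop codon,
    so the ratio approximates the fraction that continue into ORF1b.
    """
    d1a = footprint_density(orf1a_reads, orf1a_len)
    d1b = footprint_density(orf1b_reads, orf1b_len)
    return d1b / d1a
```

For example, with hypothetical counts of 1,000 reads over a 10,000-nt ORF1a region and 570 reads over a 10,000-nt ORF1b region, `frameshift_efficiency(1000, 10000, 570, 10000)` gives approximately 0.57.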

Fig. 2: Expression level of canonical viral ORFs.

a, RNA-seq (green) and Ribo-seq CHX (red) read densities (number of reads) at 5 hpi along the SARS-CoV-2 genome. SARS-CoV-2 canonical ORFs are labelled. b, Transcript abundance relative to ribosome densities of each SARS-CoV-2 canonical ORF at 5 hpi. c, Scatter plot of the abundance of reads that span canonical leader-dependent junctions (red), non-canonical leader-dependent junctions (green), non-canonical leader-independent junctions (purple) or genomic deletions (cyan) at 5 and 24 hpi.

Besides ORF1a and ORF1b, all other canonical viral ORFs are translated from subgenomic RNAs. As raw RNA-seq densities represent the cumulative sum of genomic and subgenomic RNAs, we calculated transcript abundance using two approaches: deconvolution of RNA densities, in which RNA expression of each ORF is calculated by subtracting the cumulative RNA-read density upstream of the ORF region; and relative abundances of RNA reads spanning leader–body junctions of each of the canonical subgenomic RNAs. For the majority of the ORFs, there was high correlation between these two approaches (Pearson’s R = 0.897; Extended Data Fig. 4a), and in both approaches the N transcript was the most abundant transcript, in agreement with other studies9,12. We next compared footprint densities to RNA abundance. For the majority of viral ORFs, transcript abundance correlated almost perfectly with footprint densities (Fig. 2b), indicating that these viral ORFs are translated with similar efficiencies (probably owing to their almost identical 5′ UTRs); however, three ORFs were outliers. The translation efficiencies of ORF1a and ORF1b were considerably lower. This may stem from distinct features in their 5′ UTR (discussed below) or from underestimation of their true translation efficiency, as some of the full-length RNA molecules may serve as template for replication or packaging and are thus not part of the translated mRNA pool. The third outlier is ORF7b, for which we identified very few body–leader junctions; nevertheless, it exhibited relatively high translation, probably owing to ribosome leaky scanning of the ORF7a transcript, as was suggested for SARS-CoV13.
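The deconvolution approach can be sketched as follows. Because subgenomic RNAs are nested, the read density over each ORF region is the sum of all transcripts that cover it, so the abundance of a given transcript is approximated by the density over its ORF minus the density over the region immediately upstream. The function name, the clipping of negative values to zero and the example densities are assumptions made for illustration.

```python
def deconvolve(cumulative_densities):
    """Estimate per-transcript abundance from nested RNA-seq densities.

    cumulative_densities: list of (orf_name, density) pairs ordered from
    the 5' end to the 3' end of the genome. Each density is the mean
    RNA-seq read density over that ORF region.
    """
    abundances = {}
    upstream = 0.0
    for name, density in cumulative_densities:
        # Clipping at zero is an assumption of this sketch; noise can
        # otherwise give small negative differences.
        abundances[name] = max(density - upstream, 0.0)
        upstream = density
    return abundances
```

With invented densities `[("ORF1ab", 10), ("S", 25), ("ORF3a", 40), ("N", 140)]`, the sketch assigns 10, 15, 15 and 100 units to the respective transcripts.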

Many transcripts derived from non-canonical junctions have been identified for SARS-CoV-29,12. These junctions contain either the leader combined with 3′ fragments at unexpected sites in the middle of ORFs (leader-dependent noncanonical junction) or fusions between sequences that do not show similarity to the leader (leader-independent junction). We estimated the frequency of junction-spanning reads in our RNA libraries and obtained excellent agreement between our replicates (Extended Data Fig. 4b, c, Supplementary Table 1) and significant correlation with previous data from Vero E6 cells12 (Pearson’s R = 0.81; Extended Data Fig. 4d), illustrating that many of these junctions are reproducible between experimental systems. We also identified five abundant leader-independent junctions that, to our knowledge, were unique to our data (Supplementary Table 2). We noted that three of these junctions represent short in-frame deletions in the spike protein that overlap deletions in the furin-like cleavage site that were recently described9 (Extended Data Fig. 4e). The recurrence of the same genomic deletion supports the conclusion that this deletion is being selected for during passage in Vero E6 cells. To examine whether additional non-canonical junctions are derived from genomic deletions, we sequenced the genomic RNA of the virus we used in our infections. In addition to the deletions at the furin-like cleavage site, we identified an 8 amino acid (aa) deletion in ORF-E in 2.3% of the genomic RNA (Fig. 2c, Supplementary Table 2). When we compared the frequency of junctions between 5 h and 24 h time points, the leader-dependent junctions and the genomic deletions showed good correlation, but the leader-independent junctions were specifically increased at 24 hpi (Fig. 2c). 
These data show that a small number of the leader-independent junctions represent genomic deletions and that a larger subset increases at late stages of infection, when genome replication is dominant, and therefore probably does not substantially affect viral transcripts and translated ORFs.

Examination of SARS-CoV-2 translation, as reflected by the diverse Ribo-seq libraries, revealed unannotated translated ORFs. We detected in-frame internal ORFs (iORFs) within existing ORFs, resulting in N-terminally truncated products. These include relatively long truncated versions of canonical ORFs, such as the one found in ORF6 (Fig. 3a, Extended Data Fig. 5a), or very short truncated ORFs that may serve as an upstream ORF (uORF), such as truncated ORF7a, which might regulate ORF7b translation (Fig. 3b, Extended Data Fig. 5b). We also detected internal out-of-frame translations, which would yield novel polypeptides, such as ORFs within ORF3a (41 aa and 33 aa in size, respectively; Fig. 3c, Extended Data Fig. 5c) and within ORF-S (39 aa; Fig. 3d, Extended Data Fig. 5d) or short ORFs that probably serve as uORFs (Fig. 3e, Extended Data Fig. 5e). Additionally, we observed a 13-aa extended ORF-M, in addition to the canonical ORF-M, which is predicted to start at the near-cognate codon AUA (Fig. 3f, Extended Data Fig. 5f).

Fig. 3: Ribosome densities reveal novel viral coding regions.

a–i, Ribosome density profiles (number of reads) of CHX, Harr and LTM samples at 5 hpi. Densities are shown with different colours indicating the frame relative to the main ORF (red, 0; black, +1; grey, +2). Filled and open rectangles indicate the canonical and novel ORFs, respectively. ORFs starting in a near-cognate codon are labelled with stripes. a, b, In-frame iORFs within ORF6 (a) and within ORF7a (b). c–e, Out-of-frame internal initiations within ORF3a (c), within ORF-S (d) and within ORF-M (e). f, An extended version of ORF-M. g, A uORF that overlaps ORF10 initiation and an in-frame internal initiation generating truncated ORF10. h, Two uORFs embedded in the ORF1a 5′ UTR. i, Non-canonical CUG initiation upstream of the TRS leader. Reads that were cut to fit the scale are indicated with horizontal lines.

The presence of the annotated ORF10 was recently called into question, as almost no subgenomic reads were found for its corresponding transcript12,14. Although we also did not detect subgenomic RNA designated for ORF10 translation (Supplementary Table 1), the ribosome footprint densities indicate the presence of a translation initiation signal in ORF10 (Fig. 3g, Extended Data Fig. 5g). Of note, we additionally detected two putative ORFs in this region, an upstream out-of-frame ORF that overlaps ORF10 initiation and an in-frame internal initiation that leads to a truncated ORF10 product. Further research is needed to delineate how ORFs in this region are translated and whether they have any functional roles.

Finally, we detected four distinct initiation sites in the SARS-CoV-2 5′ UTR. Three of these encode uORFs that are located just upstream of ORF1a; the first initiating at an AUG (uORF1) and the other two initiating at near-cognate codons (uORF2 and extended uORF2; Fig. 3h, Extended Data Fig. 5h). These uORFs are in line with findings in other coronaviruses3,15. The fourth site is the most prominent peak in the ribosome footprint densities on the SARS-CoV-2 genome and is located on a CUG codon at position 59, just 10 nucleotides upstream of the TRS leader (Fig. 3i, Extended Data Fig. 5i). The reads mapped to this site have the tight length-distribution characteristic of ribosome-protected fragments (Extended Data Fig. 6a). The occupancy at the CUG is higher than the downstream translation signal (Fig. 3i), implying that this peak might reflect ribosomal pausing. Notably, potential ribosome pauses located just upstream of the TRS leader were also identified in MHV and IBV genomes3,11. Owing to its location upstream of the TRS leader, footprints mapping to this site could potentially derive from any of the subgenomic or genomic RNAs. Therefore, to view this initiation in its context, we aligned the footprints to the genomic RNA or to the most abundant subgenomic N transcript. On the genome and on the ORF-N transcript, this initiation results in translation of uORFs, which, on the genome, would generate an extension of uORF1 (Extended Data Fig. 6b, c). To assess the distribution of footprints at this initiation on the different viral transcripts, viral transcripts were divided into three groups on the basis of their sequence similarity downstream of the leader–junction site (to enable unique alignment) (Extended Data Fig. 6d). Of note, substantially more footprints were mapped to the group that includes the genomic RNA and the subgenomic E and M transcripts than would be expected from their relative RNA abundance (Extended Data Fig. 6e). 
When only footprints that allow unique mapping to genomic RNA or subgenomic M and E transcripts are used (sizes 31–33 base pairs (bp) to discriminate M from genome or E transcript, and sizes 32–33 bp to discriminate E from the genome) a strong enrichment of footprints that originate from the genome is observed (Extended Data Fig. 6f, g). This footprint enrichment on genomic RNA suggests that ribosome pausing might be more prominent on the genome or that ribosomes engage with genomic RNA differently than with subgenomic transcripts. The proximity of this pause to the leader TRS, which seems to be conserved in MHV and IBV3,11, together with the relative enrichment on the viral genome, raises the possibility that a ribosome at this position might affect discontinuous transcription either by sterically blocking the TRS-L site or by affecting RNA secondary structure. In addition, ribosomes initiating at the CUG have the potential to generate uORFs or ORF extensions in the different subgenomic transcripts (Supplementary Table 3).

To systematically define the SARS-CoV-2 translated ORFs we used PRICE and ORF-RATER, two computational methods that rely on a combination of translation features to predict novel translated ORFs from ribosome-profiling measurements16,17. After application of a minimal expression cut-off and manual curation of the predictions, these classifiers identified 25 ORFs, which included 10 out of the 11 canonical translation initiations and 15 novel viral ORFs. In addition, ORF-RATER identified three putative ORFs that originate from the CUG initiation and extend to the subgenomic transcripts of S, M and ORF6 (Supplementary Table 3). The majority (85%) of the classifier-identified ORFs were independently identified in each of the biological replicates (Supplementary Table 4). Visual inspection of the ribosome-profiling data suggested eight additional putative novel ORFs, some of which are presented above (Fig. 3a, b, g, Supplementary Table 4). Overall, we identified 23 putative ORFs, in addition to the 12 canonical viral ORFs that are currently annotated in the NCBI database and 3 additional potential ORFs that stem from the CUG initiation upstream of the leader.

To confirm the robustness of these annotations, we extended these experiments to human cells. We first examined the infection efficiency of several human cell lines that were used to study SARS-CoV-2 infection: Calu3, A549 and Caco-2. Infection of Calu3 was most efficient and the presence of trypsin increased infection efficiency by at least twofold (Extended Data Fig. 7a). We infected Calu3 with a different SARS-CoV-2 isolate, which was sequenced to confirm its integrity. The same set of Ribo-seq techniques was applied to cells at 7 hpi, each in two biological replicates, in parallel with RNA-seq. The different Ribo-seq libraries showed the expected distinct profiles in both replicates, confirming the overall quality of these libraries (Extended Data Fig. 7b). We examined the translation of the new viral ORFs; all 23 novel ORFs we identified as being translated in Vero E6 cells also showed evidence of translation in infected Calu3 cells and 16 were annotated by PRICE and ORF-RATER (Extended Data Fig. 8, Supplementary Table 4). ORF-RATER also identified the same three ORFs that originate from the CUG initiation upstream of the leader (Supplementary Table 3). LTM-induced ribosome accumulation at the canonical and predicted initiation sites was highly reproducible between biological replicates as well as between Calu3 and Vero E6 cells (Extended Data Fig. 9a–c). Furthermore, ribosome-protected footprints displayed a 3-nt periodicity that was in phase with the predicted start site in both Vero E6 and Calu3 cells, providing further evidence for the active translation of the predicted ORFs (Extended Data Fig. 9d). We conclude that 23 unannotated ORFs are reproducibly translated from SARS-CoV-2 independently of the host cell and the viral origin, and additional ORFs may be translated from the CUG initiation located upstream of the TRS leader.
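The 3-nt periodicity check can be illustrated by counting how P-site positions distribute over the three reading frames relative to a predicted start codon. This is a simplified sketch with a hypothetical function name; it is not the procedure used by PRICE or ORF-RATER, and it ignores the ORF end for brevity.

```python
def frame_fractions(p_site_positions, start):
    """Fraction of P-sites in frames 0, +1 and +2 relative to `start`.

    p_site_positions: genomic coordinates of inferred ribosome P-sites.
    start: coordinate of the first nucleotide of the predicted start codon.
    Positions upstream of `start` are ignored.
    """
    counts = [0, 0, 0]
    for pos in p_site_positions:
        if pos >= start:
            counts[(pos - start) % 3] += 1
    total = sum(counts)
    if total == 0:
        return [0.0, 0.0, 0.0]
    return [c / total for c in counts]
```

An actively translated ORF is expected to show a clear excess of P-sites in frame 0, whereas footprints arising from a different frame or from non-ribosomal protection would not be in phase with the predicted start.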

Ribosome density also enables accurate quantification of viral protein production. We first quantified the relative expression levels of canonical viral ORFs on the basis of the non-overlapping regions. ORF-N shows the highest expression in Vero E6 and Calu3 cells, followed by the other viral ORFs, with some differences in the relative expression between the two cell types (Fig. 4a). To quantify the expression of out-of-frame iORFs, we computed the contribution of the iORF to the frame periodicity signal relative to the expected contribution of the main ORF. For in-frame iORF quantification, we subtracted the coverage of the main ORF in the non-overlapping region. We also used ORF-RATER, which uses a regression strategy to calculate relative expression of overlapping ORFs, resulting in largely similar estimates of translation levels (Extended Data Fig. 10a, b). These measurements show that many of the novel ORFs we annotated are expressed at similar levels to the canonical ORFs (Fig. 4b, Supplementary Table 5). Furthermore, the relative expression of viral proteins seems to be mostly independent of the host cell type (Fig. 4c).

Fig. 4: Translation of host and viral genes.

a, Translation levels of canonical viral ORFs in Vero E6 cells at 5 hpi and in Calu3 cells at 7 hpi. ORFs are ordered on the basis of their genomic location. b, Viral ORF translation levels, as calculated from ribosome densities for Vero E6 cells at 5 hpi. Solid fill represents canonical ORFs, and striped fill represents novel ORFs. c, Scatter plot of viral ORF expression in Vero E6 cells at 5 hpi and Calu3 cells at 7 hpi. Points representing canonical ORFs are outlined in black. d–f, Relative transcript abundance versus ribosome densities for each host and viral ORF at 5 hpi (d) and 24 hpi (e) in Vero E6 cells and at 7 hpi in Calu3 cells (f). Transcript abundance was estimated by counting the reads that span the corresponding junction and footprint densities were calculated from the CHX sample.

Of the novel ORFs we identified, 14 are very short (up to 20 codons) or located in the 5′ UTR of the genomic RNA and therefore likely have a regulatory role, and three are extensions or truncations of canonical ORFs (M, 6 and 7a). We examined the properties of the six out-of-frame iORFs that are longer than 20 aa: one of these ORFs is ORF9b and its truncated version (size 97 aa and 90 aa; Extended Data Fig. 10c, d). ORF9b appears in UniProt annotations and was detected previously8 in proteomic measurements. Together with our translation measurements this indicates that the product of ORF9b is a bona fide SARS-CoV-2 protein. In addition, we detected an iORF at the 5′ of ORF-S and its truncated version (39 aa and 31 aa; Fig. 3d), and two iORFs within ORF3a (41 aa and 33 aa; Fig. 3c). Mining proteomic measurements of SARS-CoV-2 infected cells8,9 did not detect peptides that originate from these out-of-frame ORFs, probably owing to challenges in detecting trypsin-digested products from short coding regions16. Indeed, two canonical SARS-CoV-2 proteins, ORF7b (43 aa) and ORF-E (75 aa) were also not detected by mass-spectrometry8,9, and our ribosome-profiling data are the first to show that these SARS-CoV-2 proteins are indeed expressed.

S.iORF1 and 3a.iORF1 are predicted to contain a transmembrane domain (Extended Data Fig. 10e, f) and 3a.iORF2 contains a predicted signal peptide (Extended Data Fig. 10g). Analysis of the conservation of these out-of-frame iORFs in SARS-CoV and in related viruses (sarbecoviruses) revealed that 3a.iORF1 is highly conserved (Supplementary Table 6). This ORF was also identified by three independent comparative genomic studies, which demonstrate that it has a purifying selection signature, implying that it is a functional polypeptide18,19,20. In combination, these findings indicate that 3a.iORF1 encodes a functional transmembrane protein, which is conserved throughout sarbecoviruses, and should thus be named ORF3c19,20. The second iORF overlapping ORF3a (3a.iORF2) and the iORF overlapping S (S.iORF1) are not conserved in most sarbecoviruses19 (Supplementary Table 6). The expression level of 3a.iORF2 is much lower compared to those of ORF3a and ORF3c (Fig. 4b, Extended Data Fig. 9d). A protein corresponding to an extended version of this ORF was pulled down21 and shown to elicit an antibody response22, but we found that translation of the truncated version predominated (Extended Data Fig. 10h, i). The internal S-ORF (S.iORF1) is situated just downstream of ORF-S AUG, suggesting that ribosomes might initiate translation via leaky scanning. This region in the S-protein shows extremely rapid evolution20, but in the SARS-CoV-2 isolates that have been sequenced, its coding capacity is maintained23. Future work is needed to delineate whether this ORF, which is highly expressed (Fig. 4b), represents a functional transmembrane protein. Translated ORFs that do not act as functional polypeptides could still be an important part of the immunological repertoire of the virus, as MHC class I bound peptides are generated at higher efficiency from rapidly degraded polypeptides24.

Finally, although we identified two internal out-of-frame ORFs within ORF3a, we did not detect translation of the SARS-CoV ORF3b homologue, which contains a premature stop codon in SARS-CoV-2 (Extended Data Fig. 10h, i). We also did not find evidence of translation of ORF14, which appears in some SARS-CoV-2 annotations15 (Extended Data Fig. 10c, d).

Translation of viral proteins relies on the cellular translation machinery, and coronaviruses, like many other viruses, are known to cause host shut-off25. To quantitatively evaluate whether SARS-CoV-2 skews the translation machinery to preferentially translate viral transcripts, we compared the ratio of footprints to mRNAs for virus and host CDSs at 5 hpi and 24 hpi in Vero E6 cells and at 7 hpi in Calu3 cells. Because ribosome densities were masked by a contaminant signal at 24 hpi, for samples from this time point we used the footprints that were mapped to subgenomic RNA junctions (and therefore reflect bona fide transcripts) to estimate ribosome densities. In all samples, the virus translation efficiencies fall within the low range of most of the host genes (Fig. 4d–f), indicating that viral transcripts are not preferentially translated in infected cells. Instead, viral transcripts take over the mRNA pool, probably through massive transcription coupled to host-induced RNA degradation26,27.
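The comparison above rests on translation efficiency, taken here as the ratio of footprint density to mRNA density for each CDS. The sketch below shows that ratio and one simple way to place a viral value within the host distribution; the function names, the percentile summary and the example numbers are illustrative assumptions rather than the analysis behind Fig. 4d–f.

```python
def translation_efficiency(footprint_density, mrna_density):
    """Footprint density divided by mRNA density; None if mRNA is absent."""
    if mrna_density <= 0:
        return None
    return footprint_density / mrna_density


def percentile_rank(value, reference_values):
    """Percentage of reference values that are strictly below `value`."""
    if not reference_values:
        raise ValueError("reference_values must not be empty")
    below = sum(1 for v in reference_values if v < value)
    return 100.0 * below / len(reference_values)
```

For instance, a viral ORF with a hypothetical efficiency of 0.5 compared against host efficiencies of 0.2, 1.0, 2.0 and 4.0 has a percentile rank of 25, that is, it sits in the lower part of the host range.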

In summary, we have delineated the translation landscape of SARS-CoV-2. Comprehensive mapping of the expressed ORFs is a prerequisite for the functional investigation of viral proteins and for deciphering virus–host interactions. An in-depth analysis of the ribosome-profiling experiments demonstrated a highly complex landscape of translation products, including translation of 23 novel viral ORFs and revealed the relative production of canonical viral proteins. The ORFs that we have identified may serve as accessory proteins or as regulatory units controlling the balanced production of different viral proteins. Studies on the functional importance and antigenic potential of these ORFs will increase our understanding of SARS-CoV-2 and coronaviruses in general.

Methods

No statistical methods were used to predetermine sample size. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.

Cells and viruses

Vero C1008 (Vero E6) cells (ATCC CRL-1586) were cultured in T-75 flasks with DMEM supplemented with 10% fetal bovine serum (FBS), MEM non-essential amino acids, 2 mM l-glutamine, 100 U ml−1 penicillin, 0.1 mg ml−1 streptomycin, 12.5 U ml−1 nystatin (Biological Industries). Calu3 cells (ATCC HTB-55) were cultured in 10-cm plates with DMEM supplemented with 10% FBS, MEM non-essential amino acids, 2 mM l-glutamine, 100 U ml−1 penicillin, 1% non-essential amino acid and 1% sodium pyruvate. Caco-2 cells (ATCC HTB-37) were cultured in 10-cm plates with DMEM supplemented with 20% FBS, 1% GlutaMAX, 100 U ml−1 penicillin, 0.1 mg ml−1 streptomycin and 1% sodium pyruvate. A549 cells (ATCC CCL-185) were cultured in 10-cm plates with DMEM supplemented with 10% FBS, 100 U ml−1 penicillin, 0.1 mg ml−1 streptomycin and 2 mM l-glutamine. All cell lines were purchased from and authenticated by ATCC and tested negative for mycoplasma. Monolayers were washed once with DMEM (for Vero E6) or RPMI (for Calu3, A549 and Caco-2) without FBS and infected with SARS-CoV-2 virus at a multiplicity of infection (MOI) of 0.2. For Calu3 infection, 20 μg ml−1 TPCK trypsin (Thermo Scientific) was added unless otherwise stated. After 1 h of infection, cells were cultured in their respective medium supplemented with 2% FBS, MEM non-essential amino acids, l-glutamine and penicillin–streptomycin–nystatin at 37 °C with 5% CO2. SARS-CoV-2 (GISAID accession no. EPI_ISL_406862) was provided by Bundeswehr Institute of Microbiology, Munich, Germany. It was propagated (4 passages) and titred on Vero E6 cells and then sequenced (details below) before it was used. SARS-CoV-2 BavPat1/2020 Ref-SKU: 026V-03883 was provided by C. Drosten, Charité–Universitätsmedizin Berlin, Germany. It was propagated (5 passages), titred on Vero E6 and then sequenced before use in experiments. Infected cells were collected at the indicated times as described below. 
Handling and work with SARS-CoV-2 virus was conducted in a Biosafety Level 3 facility in accordance with the biosafety guidelines of the Israel Institute for Biological Research. The Institutional Biosafety Committee of Weizmann Institute approved the protocol used in these studies.

Preparation of ribosome-profiling and RNA-seq samples

For RNA-seq, cells were washed with PBS and then collected with Tri-Reagent (Sigma-Aldrich), total RNA was extracted, and poly(A) selection was performed using Dynabeads mRNA DIRECT Purification Kit (Invitrogen). mRNA samples were subjected to DNaseI treatment and 3′ dephosphorylation using FastAP Thermosensitive Alkaline Phosphatase (Thermo Scientific) and T4 PNK (NEB) followed by 3′ adaptor ligation using T4 ligase (NEB). The ligated products were used for reverse transcription with SSIII (Invitrogen) for first-strand cDNA synthesis. The cDNA products were 3′ ligated with a second adaptor using T4 ligase and amplified with 8 cycles of PCR for final library products of 200–300 bp. For Ribo-seq libraries, cells were treated with 50 μM LTM for 30 min or 2 μg ml−1 Harr for 5 min for translation initiation libraries, or left untreated for the translation-elongation libraries (CHX library). All three samples were subsequently treated with 100 μg ml−1 CHX for 1 min. Cells were then placed on ice, washed twice with PBS containing 100 μg ml−1 CHX, scraped from the T-75 flasks (Vero E6 cells) or 10-cm plates (Calu3 cells), pelleted and lysed with lysis buffer (1% Triton X-100 in 20 mM Tris pH 7.5, 150 mM NaCl, 5 mM MgCl2, 1 mM dithiothreitol supplemented with 10 U ml−1 Turbo DNase and 100 μg ml−1 CHX). After lysis, samples stood on ice for 2 h and subsequent Ribo-seq library generation was performed as previously described4. In brief, cell lysate was treated with RNaseI for 45 min at room temperature followed by SUPERase-In quenching. Sample was loaded on sucrose solution (34% sucrose, 20 mM Tris pH 7.5, 150 mM NaCl, 5 mM MgCl2, 1 mM dithiothreitol and 100 μg ml−1 CHX) and spun for 1 h at 100,000 rpm in a TLA-110 rotor (Beckman) at 4 °C. Pellet was collected using TRI reagent and the RNA was collected using chloroform phase separation. 
For size selection, 15 μg total RNA was separated on a 15% TBE-urea gel for 65 min, and 28–34-nt footprints were excised using 28-nt and 34-nt flanking RNA oligos, followed by RNA extraction and the Ribo-seq protocol4.

Virus genomic sequencing

RNA from viruses (culture supernatant after removal of cell debris) was extracted using the viral RNA kit (Qiagen). The SMARTer Pico RNA V2 Kit (Clontech) was used for library preparation. Genome sequencing was conducted on the Illumina MiSeq platform, in single-read mode (60 bp) for BetaCoV/Germany/BavPat1/2020 EPI_ISL_406862 and in paired-end mode (150 bp × 2) for BavPat1/2020 Ref-SKU: 026V-03883, producing 2,239,263 and 4,332,551 reads, respectively. Reads were aligned to the viral genome using STAR 2.5.3a aligner. Even coverage along the genome was assessed and the relative abundance of junctions (which may reflect genomic deletions) was calculated. For EPI_ISL_406862 passage 4 (which was used for Vero E6 cell infection) the junctions that were found in more than 1% of genomes are listed in Supplementary Table 2. For BavPat1/2020 Ref-SKU: 026V-03883 passage 5 (which was used for Calu3 infection) no junctions with an abundance of more than 1% of the genomes were detected. All genomic sequencing data was recorded.

Sequence alignment, metagene analysis

Sequencing reads were aligned as previously described28. In brief, linker (CTGTAGGCACCATCAAT) and poly(A) sequences were removed and the remaining reads were aligned to the Chlorocebus sabaeus genome (ENSEMBL release 99) and to the SARS-CoV-2 genome (GenBank NC_045512.2, with three changes to match the strain used, BetaCoV/Germany/BavPat1/2020 EPI_ISL_406862: C241T, C3037T and A23403G) for infection of Vero E6 cells, or to hg19 and the NC_045512.2 sequence with the same changes for infection of Calu3 cells. Alignment was performed using Bowtie v.1.1.2 (ref. 29) with a maximum of two mismatches per read. Reads that were not aligned to the genome were aligned to the transcriptome of C. sabaeus (ENSEMBL) and to SARS-CoV-2 junctions that were recently annotated12. The aligned position on the genome was determined as the 5′ position of RNA-seq reads; for Ribo-seq reads, the P site of the ribosome was calculated according to read length, using the offset from the 5′ end of the read that was calculated from canonical cellular ORFs. The offsets used are +12 for reads that were 28–29 bp and +13 for reads that were 30–33 bp. Reads of other lengths were discarded. In all figures presenting ribosome-density data, all footprint lengths (28–33 bp) are presented.
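The length-dependent P-site assignment can be illustrated with a minimal sketch. The function and argument names are hypothetical, and the sketch assumes plus-strand reads whose 5′ coordinate is already known; only the offsets and the 28–33 length window come from the text.

```python
# Offsets stated in the text: +12 for 28-29 nt reads, +13 for 30-33 nt reads.
P_SITE_OFFSETS = {28: 12, 29: 12, 30: 13, 31: 13, 32: 13, 33: 13}


def p_site_position(read_start, read_length):
    """Return the P-site coordinate of a plus-strand footprint.

    read_start  -- coordinate of the 5' end of the read (hypothetical input)
    read_length -- length of the read after linker and poly(A) trimming

    Reads outside the 28-33 window return None, i.e. they are discarded.
    """
    offset = P_SITE_OFFSETS.get(read_length)
    if offset is None:
        return None
    return read_start + offset
```

For example, a 28-nt read whose 5′ end maps to position 100 would be assigned to position 112, whereas a 27-nt read would be dropped.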

Novel junctions were mapped using the STAR 2.5.3a aligner30, with running flags as suggested12 to overcome filtering of non-canonical junctions. Reads aligned to multiple locations were discarded. Junctions with 5′ break sites mapped to genomic positions 55–85 were assigned as leader-dependent junctions. Matching of leader junctions to ORFs, and categorization of junctions as canonical or non-canonical, were adapted from supplementary table 3 in ref. 12 or were assigned manually for strong novel junctions that appear only in our data.
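As an illustration of the leader-dependent assignment, the following sketch splits a junction list by the position of the 5′ break site. The tuple layout and function names are hypothetical, and treating both ends of the 55–85 window as inclusive is an assumption of the sketch.

```python
# Genomic window for the 5' break site of leader-dependent junctions (from the text).
LEADER_WINDOW = (55, 85)


def is_leader_dependent(five_prime_break):
    """True if the 5' break site lies in the leader window (bounds assumed inclusive)."""
    low, high = LEADER_WINDOW
    return low <= five_prime_break <= high


def split_junctions(junctions):
    """Split junctions into leader-dependent and leader-independent lists.

    junctions -- iterable of (five_prime_break, three_prime_break, read_count)
                 tuples; this layout is hypothetical, not the STAR output format.
    """
    leader, other = [], []
    for junction in junctions:
        target = leader if is_leader_dependent(junction[0]) else other
        target.append(junction)
    return leader, other
```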

For the metagene analysis only genes with more than 50 reads were used. For each gene, normalization was done to its maximum signal and each position was normalized to the number of genes contributing to the position. In the virus 24-h samples, normalization for each gene was done to its maximum signal within the presented region.
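The two normalization steps of the metagene analysis can be sketched as follows. The input layout (one list of per-position counts per gene, aligned to a common anchor such as the start codon) is an assumption, as are the function and argument names.

```python
def metagene(profiles, min_reads=50):
    """Average per-gene profiles after scaling each gene to its own maximum.

    profiles  -- list of per-gene lists of read counts, all aligned to the same
                 anchor position; lists may differ in length (assumed layout)
    min_reads -- genes need more than this many reads to be included
    """
    kept = [p for p in profiles if sum(p) > min_reads]
    if not kept:
        return []
    length = max(len(p) for p in kept)
    totals = [0.0] * length
    contributing = [0] * length
    for profile in kept:
        peak = max(profile)  # > 0 because the gene passed the read filter
        for i, value in enumerate(profile):
            totals[i] += value / peak
            contributing[i] += 1
    # Each position is divided by the number of genes that cover it.
    return [t / n for t, n in zip(totals, contributing)]
```

Scaling each gene to its maximum keeps highly expressed genes from dominating the average, and dividing by the number of contributing genes corrects for genes that are shorter than the plotted window.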

Quantification of gene expression

The deconvolution of RNA expression was done by subtracting the reads per kilobase of transcript, per million mapped reads (RPKM) of an ORF from the RPKM of the ORF located just upstream of it in the genome. The junction counts were based on the number of uniquely mapped reads crossing the junction in the STAR alignment. For comparing transcript and footprint expression levels, RNA and footprint counts from Bowtie alignments were converted to RPKM to normalize for gene length and for sequencing depth. On the basis of the correlation between the deconvoluted RPKM and junction abundance of the subgenomic RNAs, the genomic RNA abundance was estimated and was used to estimate ORF1a and ORF1b RNA levels compared with footprint levels.
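A minimal sketch of the RPKM normalization and of the adjacent-ORF subtraction is given below. The function names and input layout are hypothetical. The sign convention (each ORF minus its upstream neighbour) is an assumption of the sketch: because subgenomic RNAs share the 3′ end of the genome, coverage is expected to accumulate towards the 3′ end, so this direction yields non-negative values.

```python
def rpkm(read_count, length_nt, total_mapped_reads):
    """Reads per kilobase of transcript, per million mapped reads."""
    return read_count * 1e9 / (length_nt * total_mapped_reads)


def deconvolve(rpkm_by_orf):
    """Difference in RPKM between each ORF and the ORF just upstream of it.

    rpkm_by_orf -- list of (orf_name, rpkm) pairs ordered 5' to 3' along the
                   genome (assumed layout). The first ORF has no upstream
                   neighbour and is therefore not reported.
    """
    deconvolved = {}
    for i in range(1, len(rpkm_by_orf)):
        name, value = rpkm_by_orf[i]
        upstream_value = rpkm_by_orf[i - 1][1]
        deconvolved[name] = value - upstream_value  # assumed sign convention
    return deconvolved
```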

The estimation of the viral footprint densities from the 24 hpi samples was performed by calculating the ratio of the RPKM of ORF1a to the total number of leader canonical junctions at 5 hpi. This ratio was used as a factor to calculate a proxy for the ‘true’ viral footprint densities from the number of footprints that were mapped to leader canonical junctions at 24 hpi.
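The scaling described here amounts to a single conversion factor, sketched below. All names are hypothetical, and applying the factor separately to the junction footprint count of each ORF is an assumption of the sketch.

```python
def footprint_density_proxy(orf1a_rpkm_5h, leader_junctions_5h, junction_footprints_24h):
    """Scale 24 hpi junction footprint counts by a factor derived at 5 hpi.

    orf1a_rpkm_5h          -- RPKM of ORF1a footprints at 5 hpi
    leader_junctions_5h    -- total footprints on canonical leader junctions at 5 hpi
    junction_footprints_24h -- dict of ORF name -> footprints on its canonical
                               leader junction at 24 hpi (assumed layout)
    """
    factor = orf1a_rpkm_5h / leader_junctions_5h
    return {orf: factor * count for orf, count in junction_footprints_24h.items()}
```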

To quantify the translation levels of novel viral ORFs at 5 hpi and 7 hpi, many of which are overlapping, three types of calculations were used, based on ORF type. For ORFs that have a unique region, with no overlap to any other ORF, bowtie-aligned read density was calculated in that region. For out-of-frame iORFs, the read density of the iORF region was calculated by estimating the expected 3-bp periodicity distribution of footprints based on non-overlapping translated regions in the main ORF. Using linear regression, we calculated the relative contribution of the frames of the main and of the iORF to the reads covering the region of the iORF. The relative contribution of the iORF was then multiplied by the read density in that region to obtain the estimated translation level of the internal out-of-frame ORF. For in-frame iORFs the read density of the main overlapping ORF was calculated from a non-overlapping region and then subtracted from the read density in the overlapping iORF region to get an estimate of translation levels of the iORF. In cases where the unique region used to calculate read density contained the start codon of the ORF, the first 20% of the codons in the region were excluded from the calculation to avoid bias from initiation peaks, unless the region was very short and trimming it would harm the ability to estimate coverage (ORF 8 and extended ORF M). The exact regions that were used for calculation can be found in Supplementary Table 5. Finally, read density was normalized to the length of the region used for calculation and to the sum of length normalized reads in each sample to get transcripts per kilobase million (TPM) values. P values for the relative contribution levels of out-of-frame ORFs were calculated from both replicates using a mixed-effects linear model using the three-base periodicity distribution as the fixed effect and the replicates as random effect. 
In parallel, ORF-RATER was used to quantify the translation levels of the viral ORFs (using regression), giving largely similar values (Spearman’s R = 0.92 and 0.87 in Vero E6 and Calu3 cells, respectively).
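The frame-based estimate for out-of-frame iORFs can be sketched as a two-variable least-squares fit. This is an illustration under stated assumptions, not the authors' code: the regression is taken to be ordinary least squares without an intercept, the expected periodicity profile of the iORF is taken to be the main-ORF profile rotated by the frame shift, and all names are hypothetical.

```python
def shift_frames(profile, shift):
    """Rotate a 3-element frame profile so that index 0 moves to index `shift`."""
    return [profile[(i - shift) % 3] for i in range(3)]


def frame_contributions(observed, main_profile, iorf_shift):
    """Fit observed = a * main + b * iorf by least squares and return (a, b).

    observed     -- fraction of reads in frames 0, 1, 2 of the main ORF within
                    the iORF region
    main_profile -- expected frame fractions for a translated ORF, estimated
                    from non-overlapping regions of the main ORF
    iorf_shift   -- frame of the iORF relative to the main ORF (1 or 2)
    """
    x1 = main_profile
    x2 = shift_frames(main_profile, iorf_shift)
    s11 = sum(u * u for u in x1)
    s22 = sum(v * v for v in x2)
    s12 = sum(u * v for u, v in zip(x1, x2))
    y1 = sum(u * o for u, o in zip(x1, observed))
    y2 = sum(v * o for v, o in zip(x2, observed))
    det = s11 * s22 - s12 * s12  # zero only if the profile has no periodicity
    a = (y1 * s22 - y2 * s12) / det
    b = (y2 * s11 - y1 * s12) / det
    return a, b


def iorf_density(observed, main_profile, iorf_shift, region_density):
    """Read density attributed to the iORF: its relative contribution times
    the total read density of the overlapping region."""
    a, b = frame_contributions(observed, main_profile, iorf_shift)
    return region_density * b / (a + b)
```

With a main-ORF profile of (0.6, 0.25, 0.15) and a region in which 30% of footprints come from an iORF in frame +1, the fit would attribute 30% of the region's read density to the iORF.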

Prediction of translation initiation sites

Translation initiation sites were predicted using PRICE16 and ORF-RATER17. To estimate the codons generating the sequencing reads with maximum likelihood, PRICE requires a predefined set of annotated coding sequences from the same experiment. Thus, it does not perform well on reference sequences with a small number of annotated ORFs such as SARS-CoV-2. As our experiment generated ribosome footprints from both SARS-CoV-2 and host mRNAs, which were exposed to the exact same conditions in the protocol, we used annotated CDSs from the host cells to evaluate the parameters of the experiment. For libraries of infected Vero E6 cells, sequencing reads were aligned using Bowtie to a FASTA file containing chromosome 20 of C. sabaeus (1,240 annotated start codons, downloaded from Ensembl: ftp://ftp.ensembl.org/pub/release99/fasta/chlorocebus_sabaeus/dna/) and the genomic sequence of SARS-CoV-2 (RefSeq NC_045512.2). A GTF file with the annotations of C. sabaeus and SARS-CoV-2 genomes was constructed and provided as the annotations file when running PRICE. For technical reasons, the annotation of the first CDS of the two CDSs in the ORF1ab gene was deleted because having two CDSs encoded from a single gene was not permitted by PRICE. For libraries of infected Calu3 cells, sequencing reads were mapped to a FASTA file containing chromosome 1 of hg19 (2,843 annotated start codons) and the genomic sequence of SARS-CoV-2 (RefSeq NC_045512.2). A GTF file with the annotations of hg19 and SARS-CoV-2 genomes was constructed and provided as the annotations file when running PRICE. For the data that were generated from infected Vero E6 cells at 5 hpi, training and ORF prediction by PRICE were done once using the CHX data from both replicates, and again using all Ribo-seq libraries from both replicates, and the resulting predictions were combined. To test reproducibility, the same predictions were performed on each replicate separately. 
For the data that were generated from infected Calu3 cells at 7 hpi, training and ORF prediction by PRICE were done using all Ribo-seq libraries from both replicates. The predictions were further filtered to include only ORFs with at least 100 reads at the initiation site in the LTM samples of at least one replicate. ORFs were then defined by extending each initiating codon to the next in-frame stop codon.
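The post-filtering steps applied to the PRICE predictions can be sketched as follows. The function names, the 0-based half-open coordinates and the choice to include the stop codon in the returned interval are assumptions of the sketch; the 100-read threshold and the extension rule come from the text.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def extend_to_stop(sequence, start):
    """Extend an initiating codon to the next in-frame stop codon.

    sequence -- DNA-alphabet string of the plus strand
    start    -- 0-based index of the first nucleotide of the initiating codon

    Returns (start, end) as a half-open interval that includes the stop codon,
    or None if no in-frame stop codon is found.
    """
    for i in range(start, len(sequence) - 2, 3):
        if sequence[i:i + 3] in STOP_CODONS:
            return start, i + 3
    return None


def passes_ltm_filter(ltm_reads_per_replicate, min_reads=100):
    """True if at least one replicate has min_reads or more LTM reads at the
    predicted initiation site."""
    return any(n >= min_reads for n in ltm_reads_per_replicate)
```

Stepping through the sequence three nucleotides at a time means that stop codons in the other two frames, such as the TGA that overlaps an ATG followed by A, are ignored.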

ORF-RATER was used with the default values, except that all start codons with at most one mismatch to ATG were allowed. For each cell type, two runs of ORF-RATER were used: one in which ORF-RATER was trained on cellular annotations (chr 20 for the Vero E6 cells, and chr 1 for the Calu3 cells) and SARS-CoV-2 canonical ORFs (similar to the procedure that was used for running PRICE); and a second in which SARS-CoV-2 canonical ORFs were used for training. In both cases, ORF1b and ORF10 were omitted from the training set. BAM files from STAR alignment were used as input. The CHX data from both replicates were used in the first prune step to omit low-coverage ORFs. The calculations of the P-site offsets and the regression were performed for each type of Ribo-seq library separately. The final score was calculated on the basis of all three types of libraries. A score of 0.5 was used as cut-off for the final predictions; these were further manually curated. Additional ORFs that were not recognized by the trained models (probably owing to differences in the features of the viral genome compared with cellular genomes) but presented a reproducible translation profile in the two cell lines were added manually to the final ORF list (Supplementary Table 4). ORFs were manually identified as such if they had reproducible initiation peaks in the CHX libraries that were enhanced in the LTM and Harr libraries, and exhibited increased CHX signal in the correct reading frame along the coding region.
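The set of start codons allowed by the "at most one mismatch to ATG" rule can be enumerated directly. The sketch below is a standalone illustration with a hypothetical function name and does not reproduce how ORF-RATER represents this option internally.

```python
def near_cognate_starts(canonical="ATG", alphabet="ACGT"):
    """Return all codons that differ from the canonical start codon at no more
    than one position (the canonical codon itself is included)."""
    codons = {canonical}
    for position in range(len(canonical)):
        for nucleotide in alphabet:
            codons.add(canonical[:position] + nucleotide + canonical[position + 1:])
    return sorted(codons)
```

This gives ten codons in total: ATG plus nine single-mismatch variants, including CTG, GTG and TTG.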

Mapping reads to CUG initiation upstream of the TRS leader

Reads from ribosome-profiling libraries were aligned using Bowtie to a single reference that contained the transcripts and the genome, allowing no mismatches or gaps. Reads with a P site mapped to position 59 of the viral genome were collected and divided into four groups according to the nucleotide in position +17 of the read (position 76 of the genome). The first group contains reads that are short (28 nucleotides) and do not have any nucleotide at position +17. The other three groups, referred to as T, A and G, correspond to combinations of genomic and subgenomic RNAs based on their sequence, as shown in Extended Data Fig. 6d. Group T is attributed to the genome or to ORF E and ORF M subgenomic RNAs, group A to the subgenomic RNAs of ORF S, ORF7a, ORF8 and ORF N, and group G to the subgenomic RNA of ORF 6. Reads that mapped uniquely to the subgenomic RNA of ORF3a were excluded from the calculation, and the number of reads in each group was summed. Group T, containing genomic reads, was further divided on the basis of the nucleotide at position +18, where reads with A at that position can originate from the subgenomic RNA of ORF M and reads with T at that position can originate from the genome or from the subgenomic RNA of ORF E. Final division of the genomic group was done based on position +19, where T corresponds to genomic reads and A corresponds to ORF E subgenomic reads. RNA values as calculated from junction densities (as described in ‘Quantification of gene expression’) were summed for the subgenomic and genomic RNAs in each group. The analysis was performed for each ribosome-profiling library separately.
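The grouping of reads by the nucleotides at positions +17, +18 and +19 follows a small decision tree, sketched below. The function name, the group labels and the "unresolved" and "other" fallbacks are hypothetical; the nucleotide-to-source assignments are taken from the text.

```python
def classify_read(nt17, nt18=None, nt19=None):
    """Assign a read with its P site at position 59 to a source group.

    nt17, nt18, nt19 -- nucleotides at positions +17, +18 and +19, or None
                        when the read is too short to cover that position
    """
    if nt17 is None:
        return "short"            # read ends before position +17
    if nt17 == "A":
        return "A"                # subgenomic RNAs of ORF S, ORF7a, ORF8, ORF N
    if nt17 == "G":
        return "G"                # subgenomic RNA of ORF 6
    if nt17 != "T":
        return "other"            # hypothetical fallback, not described in the text
    if nt18 == "A":
        return "T:ORF M"          # subgenomic RNA of ORF M
    if nt18 == "T":
        if nt19 == "T":
            return "T:genome"     # genomic RNA
        if nt19 == "A":
            return "T:ORF E"      # subgenomic RNA of ORF E
    return "T:unresolved"         # group T read too short to subdivide
```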

Mining of proteomics data and transmembrane predictions

Data from ref. 8 were searched using the Byonic search engine with 10 ppm tolerance for MS1 and 20 ppm tolerance for MS2, against the concatenated database containing our 26 novel ORFs as well as the human proteome DB (SwissProt, November 2019), and the SARS-CoV-2 proteome. Modifications allowed were fixed carbamidomethylation on C, fixed TMT6 on K and peptide N terminus, variable K8 and R10 SILAC labelling, variable M oxidation and variable NQ deamidation. Data downloaded from ref. 9 were searched with the Byonic search engine using 10 ppm tolerance for MS1 and 0.6 Da tolerance for MS2, against the concatenated database containing our 26 novel ORFs as well as the human proteome DB (SwissProt, November 2019), and the SARS-CoV-2 proteome. Modifications allowed were fixed carbamidomethylation on C, variable N-terminal protein acetylation, M oxidation and NQ deamidation. Transmembrane and signal peptide predictions were performed using Phobius31.

Immunofluorescence

Cells were plated on Ibidi slides, fixed in 3% paraformaldehyde for 20 min, permeabilized with 0.5% Triton X-100 in PBS for 2 min, and then blocked with 2% FBS in PBS for 30 min. Immunostaining was performed with rabbit anti-SARS-CoV-2 serum32 at a 1:200 dilution. Cells were washed and labelled with anti-rabbit FITC antibody and with DAPI at a 1:200 dilution. Imaging was performed on a Zeiss AxioObserver Z1 wide-field microscope using a ×40 objective and Axiocam 506 mono camera.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.