Pseudo-Likelihood Based Logistic Regression for Estimating COVID-19 Infection and Case Fatality Rates by Gender, Race, and Age in California
============================================================================================================================================

* Di Xiong
* Lu Zhang
* Gregory L. Watson
* Phillip Sundin
* Teresa Bufford
* Joseph A. Zoller
* John Shamshoian
* Marc A. Suchard
* Christina M. Ramirez

## Abstract

In emerging epidemics, early estimates of key epidemiological characteristics of the disease are critical for guiding public policy. In particular, identifying high risk population subgroups aids policymakers and health officials in combatting the epidemic. This has been challenging during the coronavirus disease 2019 (COVID-19) pandemic, because governmental agencies typically release aggregate COVID-19 data as marginal summary statistics of patient demographics. These data may identify disparities in COVID-19 outcomes between broad population subgroups, but do not provide comparisons between more granular population subgroups defined by combinations of multiple demographics.

We introduce a method that overcomes the limitations of aggregated summary statistics and yields estimates of COVID-19 infection and case fatality rates, key quantities for guiding public policy on the control and prevention of COVID-19, for population subgroups defined by combinations of demographic characteristics. Our approach uses pseudo-likelihood based logistic regression to combine aggregate COVID-19 case and fatality data with population-level demographic survey data.

We illustrate our method on California COVID-19 data to estimate test-based infection and case fatality rates for population subgroups defined by gender, age, and race and ethnicity. Our analysis indicates that in California, males have higher test-based infection rates and test-based case fatality rates across age and race/ethnicity groups, with the gender gap widening with increasing age. Although elderly individuals infected with COVID-19 are at an elevated risk of mortality, the test-based infection rates do not increase monotonically with age. LatinX and African Americans have higher test-based infection rates than other race/ethnicity groups. The five subgroups with the highest test-based case fatality rates are African American male, Multi-race male, Asian male, African American female, and American Indian or Alaska Native male, indicating that African Americans are an especially vulnerable California subpopulation.

Keywords
*   COVID-19
*   Infection Rate
*   Case Fatality Rate
*   California Health Interview Survey
*   Logistic Regression

## 1. Introduction

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has spread from its zoonotic origins in Hubei Province, China, causing a global pandemic of coronavirus disease 2019 (COVID-19) [1, 2]. As of June 12, 2020, COVID-19 has infected over 7 million people across 188 countries and regions [3]. In the early stages of an emerging epidemic such as COVID-19, estimating the infection rate (IR) and case fatality rate (CFR) of the infectious disease is of utmost importance to health officials, policy makers, and the population at large. Accurate population and subgroup estimates of CFRs provide an evidence-based rationale for policies designed to mitigate the spread of the infectious disease, help identify disparities in disease vulnerability, and inform resource allocation to communities in greatest need.

Official COVID-19 data released by governmental health agencies and other public sources are prohibited by U.S. law from containing personally identifiable information. Consequently, these data are generally summarized in an aggregate format that comprises only marginal or limited bivariate summary statistics of patient demographics, providing valuable but limited information on the heterogeneity of patient attributes. Indeed, in New York City, the epicenter of the COVID-19 outbreak in the U.S., the reported infection rates and case fatality rates for African Americans were disproportionately higher than those for other races, according to data released by the New York City Department of Health and Mental Hygiene [4]. Data from several other U.S. states, including New Jersey [5], California [6], and Illinois [7], exhibited similar trends. Gender- and age-disaggregated national case data from many countries reveal that males and older individuals generally have substantially higher case fatality rates. Furthermore, evidence from numerous clinical studies of COVID-19 risk factors has established that gender and age are risk factors for mortality after COVID-19 infection [8, 9, 10, 11]. However, because they are aggregated, data from governmental health agencies or other public sources do not provide granular information on the combined effect of the risk factors under consideration. In particular, how IRs and CFRs vary across population subgroups characterized by gender, age, and race jointly has not yet received substantial attention. Understanding the gender-age-race dynamics of COVID-19 infection and mortality would provide deeper insights into the disparities that exist in the effects of COVID-19 on the population.

Various methods for using information contained in aggregate data have been proposed in a wide array of applications [see, e.g., 12, 13, 14], and there is growing interest in leveraging marginal summary statistics in publicly released COVID-19 datasets to quantify the impact of various risk factors on COVID-19 mortality [15]. In this paper, we propose a method that helps overcome the limitations of having only aggregate summary statistics on COVID-19 cases and fatalities to obtain early estimates of COVID-19 IRs and CFRs for population subgroups defined by combinations of risk factors. A major difference between the prevalent approaches to analyzing marginalized data and our proposed method is that we incorporate multivariate population demographic data, which provide estimates of the joint probability distribution of risk factors for the disease. Specifically, we propose a pseudo-likelihood based multivariable logistic regression approach that combines publicly released aggregate COVID-19 case and fatality data with multivariate population-level demographic survey data.

The proposed method is composed of two main steps. First, we model COVID-19 IRs using a multivariable logistic regression model, estimating its parameter values from publicly available COVID-19 case data. Second, we estimate COVID-19 CFRs based on the recovered IRs and publicly available COVID-19 fatality data. This paper uses California as an example case study, but the approach is easily generalized to other states. We carry out the analysis using the most recent COVID-19 case and fatality data from the California Department of Public Health (CDPH) [6, 16, 17] and population-level demographic data from the California Health Interview Survey (CHIS) [18] to obtain estimates of IRs and CFRs for subgroups of the California state population characterized by the joint distribution of gender, age, and race. Not every person who may be infected with COVID-19 is tested, and in some locations only people who are symptomatic are tested. This introduces sampling bias into the COVID-19 data that prevents straightforward estimation of the true IRs and CFRs. To circumvent this issue, we estimate test-based IRs (T-IRs) and test-based CFRs (T-CFRs) that depend on the availability and use of testing and may differ from the true IRs and CFRs. In particular, we expect true IRs to be greater than T-IRs and true CFRs to be less than T-CFRs due to the presence of asymptomatic and undiagnosed infections. While the test-based rates do not estimate the overall population rates, they capture the vast majority of severe or fatal COVID-19 infections, because these individuals are very likely to be tested. Consequently, the T-IRs and T-CFRs estimated by our method provide valuable insights into the disparities in COVID-19 outcomes that exist across gender, age, and race/ethnicity groups and furnish guidance for public policy related to the control and prevention of COVID-19.

## 2. Data

Our method for estimating COVID-19 T-IRs and T-CFRs relies on two data sources: daily COVID-19 data for California from the California Department of Public Health (CDPH), and the 2017–2018 wave of the California Health Interview Survey (CHIS).

CDPH data are publicly available and provide up-to-date information on the number of COVID-19 cases and fatalities in California by gender [16], age [17], and race/ethnicity [6], separately. The case and fatality data as of June 12, 2020 are presented in Table 1. CDPH divides the population into ten age groups: less than 5, 5–17, 18–34, 35–49, 50–59, 60–64, 65–69, 70–74, 75–79, and 80 and above. Age group is missing from less than 0.1% of the confirmed cases and deaths reported by CDPH. The eight race and ethnicity groups in the publicly released dataset are LatinX/ Hispanic (LatinX), White/ Caucasian (White), Asian, African American/Black (AA), Multi-Race, American Indian or Alaska Native (AIAN), Native Hawaiian and other Pacific Islander, and others. We combined the last two race and ethnicity groups due to their small size in the California population. Race and ethnicity are missing in almost 29% of confirmed cases and 1% of the deaths. The CDPH data also provide the number of COVID-19 cases and fatalities by race and age jointly [6], the only state in the U.S. doing so at the time of writing.

[Table 1](http://medrxiv.org/content/early/2020/07/01/2020.06.29.20141978/T1): Confirmed COVID-19 cases and fatalities by gender, age and race/ethnicity in California as of June 13, 2020 [16, 17, 6]

To supplement the CDPH COVID-19 data, we used demographic data on the California population collected by the California Health Interview Survey (CHIS). CHIS is the largest state health survey in the U.S., conducted by the UCLA Center for Health Policy Research in collaboration with the California Departments of Public Health and Health Care Services. CHIS interviews over 20,000 Californians each year, collecting information on a wide range of demographic and health variables. CHIS oversamples certain population subgroups to achieve more reliable and precise estimates for these subgroups, and estimates a sampling weight for each respondent to represent the reciprocal of the probability of selection. We use the 2017–2018 wave of CHIS in our analysis, which consists of 45,369 subjects interviewed, focusing on the following three demographic variables recorded: gender, age, and race and ethnicity groups.

## 3. Methods

### 3.1. Infection Rate Estimation Procedure

We propose estimating COVID-19 T-IRs given gender, age, and race using a multivariable logistic regression model. The variables we use in our analysis are listed in Table 2. Letting age 0–17, female, and LatinX be the reference categories, we let $\boldsymbol{z} \in \{0, 1\}^{p} \subset \mathbb{R}^{p}$ denote the gender-age-race covariate setting of the covariates $\boldsymbol{Z}$ in Table 2, where $p = 15$. The postulated IR model is

$$\log \frac{\mathbb{P}(\mathcal{I} = 1 \mid \boldsymbol{z})}{1 - \mathbb{P}(\mathcal{I} = 1 \mid \boldsymbol{z})} = \gamma_0 + \boldsymbol{z}^{\top} \boldsymbol{\gamma}, \tag{1}$$

where $\mathcal{I} \in \{0, 1\}$ represents infection status, $\gamma_0$ is the log odds of infection for the female age 0–17 LatinX group, and $\boldsymbol{\gamma} \in \mathbb{R}^{p}$ are the log odds ratios of infection associated with the other demographic categories.

[Table 2](http://medrxiv.org/content/early/2020/07/01/2020.06.29.20141978/T2): Variables used in the infection rate estimation procedure

The CDPH data provide the gender, age, and race distributions of COVID-19 infections separately [16]. To estimate the T-IRs given gender, age, and race jointly, we employ a pseudo-likelihood approach that maximizes a likelihood function constructed from univariate logistic regression models obtained by marginalizing over the covariates. The proposed method begins by expressing Equation (1) in terms of the probability of infection conditional on the covariates,

$$\mathbb{P}(\mathcal{I} = 1 \mid \boldsymbol{z}) = \frac{\exp(\gamma_0 + \boldsymbol{z}^{\top} \boldsymbol{\gamma})}{1 + \exp(\gamma_0 + \boldsymbol{z}^{\top} \boldsymbol{\gamma})}. \tag{2}$$

We introduce $\mathbb{P}_{\boldsymbol{X}}(\boldsymbol{x})$ as the probability mass function of a $p^{*}$-dimensional discrete random variable $\boldsymbol{X}$ with support $\mathcal{X} \subset \{0, 1\}^{p^{*}}$ that represents the proportion of the California population with gender-age-race attributes $\boldsymbol{x}$. Here $\boldsymbol{x}$ is simply an augmentation of the covariate setting $\boldsymbol{z}$ in (1) to include the reference levels listed in Table 2. In other words, there exists a bijection $\boldsymbol{z} = \boldsymbol{z}(\boldsymbol{x})$ from the $\boldsymbol{X}$-space to the $\boldsymbol{Z}$-space. We then define the conditional probability mass function of $\boldsymbol{X}_{-i}$ given $X_i = 1$ as $\mathbb{P}_{\boldsymbol{X}_{-i} \mid X_i}(\boldsymbol{x}_{-i} \mid X_i = 1)$, where $X_i$ is the $i$th element of $\boldsymbol{X}$, and $\boldsymbol{X}_{-i}$ is the subset of $\boldsymbol{X}$ that omits $X_i$. Defining $\mathcal{X}_i$ to be the subset of $\mathcal{X}$ with the constraint that $X_i = 1$ and taking the expectation of both sides of Equation (2) conditional on $X_i = 1$, by the Law of Iterated Expectations we have

$$\mathbb{P}(\mathcal{I} = 1 \mid X_i = 1) = \sum_{\boldsymbol{x} \in \mathcal{X}_i} \mathbb{P}(\mathcal{I} = 1 \mid \boldsymbol{z}(\boldsymbol{x})) \, \mathbb{P}_{\boldsymbol{X}_{-i} \mid X_i}(\boldsymbol{x}_{-i} \mid X_i = 1). \tag{3}$$

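Next, we construct the individual pseudo-log-likelihoods corresponding to each univariate logistic regression of $\mathcal{I}$ on $X_i = 1$ for each $X_i \in \boldsymbol{X}$. Let $N$ denote the total population size, $N_{i1}$ denote the number of individuals in the population with $X_i = 1$, and $N_{i1}(\mathcal{I})$ denote the total number of individuals with $X_i = 1$ who have been or will be infected with COVID-19. Then $N_{i1}(\mathcal{I})$ follows a binomial distribution,

$$N_{i1}(\mathcal{I}) \sim \text{Binomial}\big(N_{i1}, \; \mathbb{P}(\mathcal{I} = 1 \mid X_i = 1)\big), \tag{4}$$

for $i = 1, \ldots, p^{*}$. We define the individual pseudo-log-likelihood of $(\gamma_0, \boldsymbol{\gamma})$ for $X_i$ corresponding to the binomial distribution (4) as $\ell_i(\gamma_0, \boldsymbol{\gamma})$, which up to an additive constant equals $N_{i1}(\mathcal{I}) \log \mathbb{P}(\mathcal{I} = 1 \mid X_i = 1) + \{N_{i1} - N_{i1}(\mathcal{I})\} \log\{1 - \mathbb{P}(\mathcal{I} = 1 \mid X_i = 1)\}$, and we define the full pseudo-log-likelihood of $(\gamma_0, \boldsymbol{\gamma})$ as the sum of the individual pseudo-log-likelihoods

$$\ell(\gamma_0, \boldsymbol{\gamma}) = \sum_{i=1}^{p^{*}} \ell_i(\gamma_0, \boldsymbol{\gamma}). \tag{5}$$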

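We use the CHIS data to approximate $\mathbb{P}_{\boldsymbol{X}}(\boldsymbol{x})$, and denote the approximation $\hat{\mathbb{P}}_{\boldsymbol{X}}(\boldsymbol{x})$. Let $N(\mathcal{I})$ denote the total number of individuals in the population who have been or will be infected with COVID-19, and let $\pi_{\mathcal{I}} = \mathbb{P}(\mathcal{I} = 1)$ denote the overall infection rate in the population. Thus, the total population size is $N = N(\mathcal{I}) / \pi_{\mathcal{I}}$. From the CDPH data presented in Table 1, we have the cumulative number of reported COVID-19 infections as of June 12, 2020, which we denote $\tilde{N}(\mathcal{I})$. Because $\tilde{N}(\mathcal{I})$ measures the cumulative number of COVID-19 infections up to June 12, 2020, and increases daily, $\tilde{N}(\mathcal{I})$ is smaller than $N(\mathcal{I})$, perhaps substantially. Furthermore, $\pi_{\mathcal{I}}$ is unknown, and for a given estimate $\hat{\pi}_{\mathcal{I}}$ of $\pi_{\mathcal{I}}$, we define $\tilde{N}$ to be $\tilde{N}(\mathcal{I}) / \hat{\pi}_{\mathcal{I}}$. Therefore, even for accurate estimates $\hat{\pi}_{\mathcal{I}}$, $\tilde{N}$ will be smaller, perhaps substantially, than the total number of individuals in the population. However, we assume here that the relative size of $\tilde{N}$ to $N$ is approximately equal to the relative size of $\tilde{N}(\mathcal{I})$ to $N(\mathcal{I})$. Hence, $\tilde{N}$ may be interpreted as an appropriately scaled version of $N$ with respect to $\tilde{N}(\mathcal{I})$ and $\hat{\pi}_{\mathcal{I}}$ as of June 12, 2020. Likewise, we define $\tilde{N}_{i1} = \tilde{N} \, \hat{\mathbb{P}}(X_i = 1)$, with $\tilde{N}_{i1}$ having the same interpretation as $\tilde{N}$ but for the subset of the population with $X_i = 1$. We denote by $\tilde{N}_{i1}(\mathcal{I})$ the cumulative number of infected individuals with $X_i = 1$ as of June 12, 2020, and present $\tilde{N}_{i1}(\mathcal{I})$ in Table 1.

Substituting $\hat{\mathbb{P}}_{\boldsymbol{X}}$ for $\mathbb{P}_{\boldsymbol{X}}$, $\{\tilde{N}_{i1}(\mathcal{I}) : i = 1, \ldots, p^{*}\}$ for $\{N_{i1}(\mathcal{I}) : i = 1, \ldots, p^{*}\}$, and $\{\tilde{N}_{i1} : i = 1, \ldots, p^{*}\}$ for $\{N_{i1} : i = 1, \ldots, p^{*}\}$ in the pseudo-log-likelihood (5), we obtain an approximate pseudo-likelihood

$$\tilde{\ell}(\gamma_0, \boldsymbol{\gamma}) = \sum_{i=1}^{p^{*}} \Big[ \tilde{N}_{i1}(\mathcal{I}) \log \hat{\mathbb{P}}(\mathcal{I} = 1 \mid X_i = 1) + \{\tilde{N}_{i1} - \tilde{N}_{i1}(\mathcal{I})\} \log\{1 - \hat{\mathbb{P}}(\mathcal{I} = 1 \mid X_i = 1)\} \Big], \tag{6}$$

where $\hat{\mathbb{P}}(\mathcal{I} = 1 \mid X_i = 1)$ is the marginal probability in Equation (3) evaluated with $\hat{\mathbb{P}}_{\boldsymbol{X}}$ in place of $\mathbb{P}_{\boldsymbol{X}}$.

We maximize the approximate pseudo-likelihood (6) with respect to $(\gamma_0, \boldsymbol{\gamma})$ to obtain our estimates

$$(\hat{\gamma}_0, \hat{\boldsymbol{\gamma}}) = \underset{(\gamma_0, \boldsymbol{\gamma})}{\arg\max} \; \tilde{\ell}(\gamma_0, \boldsymbol{\gamma}). \tag{7}$$

Lastly, by plugging $(\hat{\gamma}_0, \hat{\boldsymbol{\gamma}})$ into Equation (2), we obtain the predicted test-based infection probabilities for individuals with gender-age-race covariate setting $\boldsymbol{z}$,

$$\hat{\mathbb{P}}(\mathcal{I} = 1 \mid \boldsymbol{z}) = \frac{\exp(\hat{\gamma}_0 + \boldsymbol{z}^{\top} \hat{\boldsymbol{\gamma}})}{1 + \exp(\hat{\gamma}_0 + \boldsymbol{z}^{\top} \hat{\boldsymbol{\gamma}})}. \tag{8}$$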
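The estimation steps in this subsection can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes two hypothetical factors (gender and a two-level age), made-up population shares standing in for CHIS, noise-free marginal counts, and a plain gradient ascent with backtracking as the optimizer. All names (`pop_share`, `fit`, and so on) are illustrative.

```python
import math
from itertools import product

def expit(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical joint population shares (the role CHIS plays in the text).
cells = list(product(["female", "male"], ["young", "old"]))
pop_share = dict(zip(cells, [0.3, 0.2, 0.3, 0.2]))
margins = ["female", "male", "young", "old"]  # one marginal group per level

def design(cell):
    """Intercept plus dummies; female and young are the reference levels."""
    return (1.0, float(cell[0] == "male"), float(cell[1] == "old"))

def cell_probs(gamma):
    """Cell-level infection probabilities, as in Equation (2)."""
    return {c: expit(sum(g * d for g, d in zip(gamma, design(c)))) for c in cells}

def marginal_probs(gamma):
    """P(infected | level): population-weighted average, as in Equation (3)."""
    p = cell_probs(gamma)
    out = {}
    for level in margins:
        members = [c for c in cells if level in c]
        total = sum(pop_share[c] for c in members)
        out[level] = sum(pop_share[c] * p[c] for c in members) / total
    return out

def pseudo_loglik(gamma, n, y):
    """Sum of binomial log-likelihoods over the marginal groups."""
    m = marginal_probs(gamma)
    return sum(y[k] * math.log(m[k]) + (n[k] - y[k]) * math.log(1.0 - m[k])
               for k in margins)

def gradient(gamma, n, y):
    """Analytic gradient of the pseudo-log-likelihood."""
    p, m = cell_probs(gamma), marginal_probs(gamma)
    grad = [0.0, 0.0, 0.0]
    for level in margins:
        members = [c for c in cells if level in c]
        total = sum(pop_share[c] for c in members)
        score = y[level] / m[level] - (n[level] - y[level]) / (1.0 - m[level])
        for c in members:
            w = pop_share[c] / total * p[c] * (1.0 - p[c])
            for j, d in enumerate(design(c)):
                grad[j] += score * w * d
    return grad

def fit(n, y, iters=5000):
    """Gradient ascent with Armijo backtracking on the scaled objective."""
    scale = sum(n.values())
    overall = sum(y.values()) / scale
    gamma = [math.log(overall / (1.0 - overall)), 0.0, 0.0]
    ll = pseudo_loglik(gamma, n, y) / scale
    for _ in range(iters):
        g = [v / scale for v in gradient(gamma, n, y)]
        sq = sum(v * v for v in g)
        step = 8.0
        while step > 1e-12:
            cand = [a + step * b for a, b in zip(gamma, g)]
            cand_ll = pseudo_loglik(cand, n, y) / scale
            if cand_ll >= ll + 1e-4 * step * sq:
                gamma, ll = cand, cand_ll
                break
            step /= 2.0
        else:
            break  # no improving step found: treat as converged
    return gamma

# Simulate noise-free marginal counts from assumed "true" parameters, then refit.
true_gamma = [-2.0, 0.5, 1.0]
N = 100_000
n = {k: N * sum(pop_share[c] for c in cells if k in c) for k in margins}
true_m = marginal_probs(true_gamma)
y = {k: n[k] * true_m[k] for k in margins}
gamma_hat = fit(n, y)
```

Only the four marginal totals enter the fit, yet `cell_probs(gamma_hat)` returns a rate for every gender-age cell, which is the point of combining marginal counts with a joint population distribution. A production version would likely use a dedicated optimizer rather than this hand-rolled ascent.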

### 3.2. Case Fatality Rate Estimation Procedure

Similar to the T-IR estimation method, we model the T-CFRs given gender, age, and race using a multivariable logistic regression model. The gender-age-race covariate we use for CFR estimation (see Table 3) is the same as the covariate we use for IR estimation, except that we combined the 0–17 and 18–34 age groups due to low numbers of fatalities among the 0–17 age group. With a slight abuse of notation, we denote $\boldsymbol{z} \in \{0, 1\}^{q} \subset \mathbb{R}^{q}$ to be the covariate setting of the vector of non-reference group covariates $\boldsymbol{Z}$, where $q = 14$. The corresponding random variable $\boldsymbol{X}$ and its covariate setting $\boldsymbol{x}$ are as defined in the preceding subsection and have dimension $q^{*}$, where $q^{*} = 17$. We give the T-CFR model as

$$\log \frac{\mathbb{P}(\mathcal{M} = 1 \mid \boldsymbol{z}, \mathcal{I} = 1)}{1 - \mathbb{P}(\mathcal{M} = 1 \mid \boldsymbol{z}, \mathcal{I} = 1)} = \delta_0 + \boldsymbol{z}^{\top} \boldsymbol{\delta}, \tag{9}$$

where $\mathcal{M} \in \{0, 1\}$ represents mortality status, $\delta_0$ is the log odds of mortality for the LatinX female age 0–34 group, and $\boldsymbol{\delta} \in \mathbb{R}^{q}$ are the log odds ratios of mortality for other covariate settings.

[Table 3](http://medrxiv.org/content/early/2020/07/01/2020.06.29.20141978/T3): Variables used in the case fatality rate estimation procedure

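We again employ a pseudo-likelihood approach to estimate $(\delta_0, \boldsymbol{\delta})$ that maximizes a likelihood function constructed from univariate logistic regression models. Following steps similar to those in the preceding subsection, and writing $\mathcal{X}_i$ for the subset of the support of $\boldsymbol{X}$ with $X_i = 1$, we have

$$\mathbb{P}(\mathcal{M} = 1 \mid X_i = 1, \mathcal{I} = 1) = \sum_{\boldsymbol{x} \in \mathcal{X}_i} \mathbb{P}(\mathcal{M} = 1 \mid \boldsymbol{z}(\boldsymbol{x}), \mathcal{I} = 1) \, \mathbb{P}_{\boldsymbol{X}_{-i} \mid X_i, \mathcal{I}}(\boldsymbol{x}_{-i} \mid X_i = 1, \mathcal{I} = 1). \tag{10}$$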

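We use the CHIS data and the IR model (1) with coefficient estimates (7) to estimate the conditional probability mass function $\mathbb{P}_{\boldsymbol{X}_{-i} \mid X_i, \mathcal{I}}(\boldsymbol{x}_{-i} \mid X_i = 1, \mathcal{I} = 1)$. First, we estimate $\mathbb{P}(\boldsymbol{z} \mid \mathcal{I} = 1)$ using Bayes' Rule,

$$\hat{\mathbb{P}}(\boldsymbol{z} \mid \mathcal{I} = 1) = \frac{\hat{\mathbb{P}}(\mathcal{I} = 1 \mid \boldsymbol{z}) \, \hat{\mathbb{P}}(\boldsymbol{z})}{\sum_{\boldsymbol{z}'} \hat{\mathbb{P}}(\mathcal{I} = 1 \mid \boldsymbol{z}') \, \hat{\mathbb{P}}(\boldsymbol{z}')}, \tag{11}$$

where $\hat{\mathbb{P}}(\mathcal{I} = 1 \mid \boldsymbol{z})$ comes from Equation (8), and $\hat{\mathbb{P}}(\boldsymbol{z})$ is obtained from the CHIS dataset. Then, by the definition of conditional probability, we estimate the conditional probability mass function for $\boldsymbol{x} \in \mathcal{X}_i$ by

$$\hat{\mathbb{P}}_{\boldsymbol{X}_{-i} \mid X_i, \mathcal{I}}(\boldsymbol{x}_{-i} \mid X_i = 1, \mathcal{I} = 1) = \frac{\hat{\mathbb{P}}(\boldsymbol{z}(\boldsymbol{x}) \mid \mathcal{I} = 1)}{\sum_{\boldsymbol{x}' \in \mathcal{X}_i} \hat{\mathbb{P}}(\boldsymbol{z}(\boldsymbol{x}') \mid \mathcal{I} = 1)}, \tag{12}$$

where $\hat{\mathbb{P}}(\boldsymbol{z}(\boldsymbol{x}) \mid \mathcal{I} = 1)$ comes from Equation (11).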
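The two reweighting steps above amount to a few lines of arithmetic. The sketch below uses made-up numbers for four hypothetical gender-age cells; `pop_share` stands in for the CHIS-based population shares and `infect_prob` for fitted infection probabilities, and neither reflects the paper's data.

```python
# Hypothetical inputs for four toy cells (not values from the paper).
pop_share = {("female", "young"): 0.3, ("female", "old"): 0.2,
             ("male", "young"): 0.3, ("male", "old"): 0.2}
infect_prob = {("female", "young"): 0.1, ("female", "old"): 0.3,
               ("male", "young"): 0.2, ("male", "old"): 0.4}

def share_among_infected(pop_share, infect_prob):
    """Bayes' rule: P(cell | infected) is proportional to P(infected | cell) * P(cell)."""
    joint = {c: infect_prob[c] * pop_share[c] for c in pop_share}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}

def conditional_on_level(infected_share, level):
    """Renormalize P(cell | infected) within the cells that contain `level`."""
    members = {c: v for c, v in infected_share.items() if level in c}
    total = sum(members.values())
    return {c: v / total for c, v in members.items()}

infected_share = share_among_infected(pop_share, infect_prob)
among_infected_males = conditional_on_level(infected_share, "male")
```

In this toy setup older males make up 20% of the population but a larger share of the infected, which is why the CFR model weights cells by their share among the infected rather than by their population share.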

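Analogous to the IR model, we denote $N_{i1}(\mathcal{M})$ to be the number of individuals with $X_i = 1$ who have died or will die from COVID-19. Each $N_{i1}(\mathcal{M})$ then follows a binomial distribution,

$$N_{i1}(\mathcal{M}) \sim \text{Binomial}\big(N_{i1}(\mathcal{I}), \; \mathbb{P}(\mathcal{M} = 1 \mid X_i = 1, \mathcal{I} = 1)\big), \quad i = 1, \ldots, q^{*}, \tag{13}$$

where $N_{i1}(\mathcal{I})$ is the number of individuals with $X_i = 1$ who have been or will be infected.

We then construct the full pseudo-log-likelihood of $(\delta_0, \boldsymbol{\delta})$ as the sum of the individual pseudo-log-likelihoods $\ell_i(\delta_0, \boldsymbol{\delta})$ for $X_i$ corresponding to binomial distribution (13),

$$\ell(\delta_0, \boldsymbol{\delta}) = \sum_{i=1}^{q^{*}} \ell_i(\delta_0, \boldsymbol{\delta}). \tag{14}$$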

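From the CDPH data presented in Table 1, we have the cumulative number of COVID-19 deaths by gender, age, and race. We denote $\tilde{N}_{i1}(\mathcal{M})$ to be the cumulative number of reported deaths of infected individuals with $X_i = 1$ as of June 12, 2020, and $\tilde{N}_{i1}(\mathcal{I})$ to be the corresponding cumulative number of reported infections. Analogous to the infection risk model, we assume that the relative size of $\tilde{N}_{i1}(\mathcal{M})$ to $\tilde{N}_{i1}(\mathcal{I})$ is approximately equal to the relative size of $N_{i1}(\mathcal{M})$ to $N_{i1}(\mathcal{I})$. Substituting the estimated conditional probability mass function (12) for its population counterpart, $\tilde{N}_{i1}(\mathcal{M})$ for $N_{i1}(\mathcal{M})$, and $\tilde{N}_{i1}(\mathcal{I})$ for $N_{i1}(\mathcal{I})$ in the pseudo-likelihood (14), we obtain an approximate pseudo-likelihood

$$\tilde{\ell}(\delta_0, \boldsymbol{\delta}) = \sum_{i=1}^{q^{*}} \Big[ \tilde{N}_{i1}(\mathcal{M}) \log \hat{\mathbb{P}}(\mathcal{M} = 1 \mid X_i = 1, \mathcal{I} = 1) + \{\tilde{N}_{i1}(\mathcal{I}) - \tilde{N}_{i1}(\mathcal{M})\} \log\{1 - \hat{\mathbb{P}}(\mathcal{M} = 1 \mid X_i = 1, \mathcal{I} = 1)\} \Big], \tag{15}$$

and maximize it with respect to $(\delta_0, \boldsymbol{\delta})$ to obtain our estimates

$$(\hat{\delta}_0, \hat{\boldsymbol{\delta}}) = \underset{(\delta_0, \boldsymbol{\delta})}{\arg\max} \; \tilde{\ell}(\delta_0, \boldsymbol{\delta}). \tag{16}$$

Lastly, from $(\hat{\delta}_0, \hat{\boldsymbol{\delta}})$, we can obtain the predicted COVID-19 test-based case fatality rates for individuals with gender-age-race covariate setting $\boldsymbol{z}$,

$$\hat{\mathbb{P}}(\mathcal{M} = 1 \mid \boldsymbol{z}, \mathcal{I} = 1) = \frac{\exp(\hat{\delta}_0 + \boldsymbol{z}^{\top} \hat{\boldsymbol{\delta}})}{1 + \exp(\hat{\delta}_0 + \boldsymbol{z}^{\top} \hat{\boldsymbol{\delta}})}. \tag{17}$$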

### 3.3. Monte Carlo Simulation Procedure

To quantify the uncertainty of the T-IR and T-CFR estimates in (8) and (17), respectively, we carry out a Monte Carlo procedure that repeatedly performs the T-IR and T-CFR estimation procedures described in Sections 3.1 and 3.2 sequentially, introducing sampling variation in the data in three stages. The first stage bootstraps the CHIS data with selection probabilities proportional to the sampling weights. The second stage introduces variation in ![Graphic][65]</img> immediately prior to maximizing the approximate pseudo-log-likelihood (6), by simulating values of ![Graphic][66]</img> for each *i* independently from a binomial distribution with success probability equal to ![Graphic][67]</img>, i.e., ![Formula][68]</img> 

Similarly, the third stage introduces variation in the ![Graphic][69]</img> prior to maximizing the approximate pseudo-log-likelihood (15) by simulating values of ![Graphic][70]</img> for each *i* independently from a binomial distribution with success probability equal to ![Graphic][71]</img>, ![Formula][72]</img> 

The entire Monte Carlo simulation procedure can be summarized in 5 steps:

View this table:
[Table4](http://medrxiv.org/content/early/2020/07/01/2020.06.29.20141978/T4)
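The three sources of variation can be sketched as follows. This is a hedged illustration only: the survey records, weights, and counts are hypothetical, the binomial size and success probabilities (observed rates) are assumptions rather than the paper's exact specification, and the model refits that a real replicate would perform are replaced by simple summary statistics.

```python
import random
import statistics

def weighted_bootstrap(records, weights, rng):
    """Stage 1: resample respondents with probability proportional to weight."""
    return rng.choices(records, weights=weights, k=len(records))

def binomial_draw(n, p, rng):
    """Draw from Binomial(n, p) by summing Bernoulli trials (fine for modest n)."""
    return sum(rng.random() < p for _ in range(n))

rng = random.Random(2020)
survey = ["female", "male", "female", "male", "female"] * 40  # 200 hypothetical respondents
weights = [1.0, 2.0, 1.5, 0.5, 1.0] * 40                      # hypothetical sampling weights
scaled_total, cases, deaths = 500, 50, 5                      # hypothetical marginal counts

B = 100  # number of Monte Carlo replicates
replicates = []
for _ in range(B):
    resample = weighted_bootstrap(survey, weights, rng)               # stage 1
    male_share = resample.count("male") / len(resample)
    cases_b = binomial_draw(scaled_total, cases / scaled_total, rng)  # stage 2
    deaths_b = binomial_draw(cases, deaths / cases, rng)              # stage 3 (assumed size: observed cases)
    # A real replicate would refit the IR and CFR models here;
    # this sketch only records summaries of the perturbed inputs.
    replicates.append((male_share, cases_b / scaled_total, deaths_b / cases))

se_case_rate = statistics.stdev([r[1] for r in replicates])
mean_male_share = statistics.mean([r[0] for r in replicates])
```

The standard deviation of an estimate across the `B` replicates serves as its bootstrap standard error; fixing the seed makes the replicates reproducible.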

Figure 1 illustrates the Monte Carlo simulation procedure in a flow chart.

[Figure 1](http://medrxiv.org/content/early/2020/07/01/2020.06.29.20141978/F1): Flow chart depicting the Monte Carlo simulation procedure

### 3.4. Summary Statistics for Infection and Case Fatality Rate Estimates

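In addition to estimating the T-IR and T-CFR for specific covariate settings $\boldsymbol{z}$ through Equations (8) and (17), respectively, we can provide collapsed estimates of T-IRs and T-CFRs for specific values of any subset $\boldsymbol{X}_{J_r}$ of $\boldsymbol{X}$. Let $J_r = \{j_1, \ldots, j_r\} \subset \{1, \ldots, p^{*}\}$; let $\boldsymbol{X}_{J_r} = (X_{j_1}, \ldots, X_{j_r})$ with value $\boldsymbol{x}_{J_r}$; and let $\mathcal{X}_{J_r}$ denote the subset of $\mathcal{X}$ with the constraint that $\boldsymbol{X}_{J_r} = \boldsymbol{x}_{J_r}$. Estimates of collapsed T-IRs given $\boldsymbol{X}_{J_r} = \boldsymbol{x}_{J_r}$ can be obtained using the marginalization formula

$$\hat{\mathbb{P}}(\mathcal{I} = 1 \mid \boldsymbol{X}_{J_r} = \boldsymbol{x}_{J_r}) = \sum_{\boldsymbol{x} \in \mathcal{X}_{J_r}} \hat{\mathbb{P}}(\mathcal{I} = 1 \mid \boldsymbol{z}(\boldsymbol{x})) \, \hat{\mathbb{P}}_{\boldsymbol{X}_{-J_r} \mid \boldsymbol{X}_{J_r}}(\boldsymbol{x}_{-J_r} \mid \boldsymbol{x}_{J_r}). \tag{20}$$

Likewise, collapsed estimates of T-CFRs given $\boldsymbol{X}_{J_r} = \boldsymbol{x}_{J_r}$ can be obtained using the marginalization formula

$$\hat{\mathbb{P}}(\mathcal{M} = 1 \mid \boldsymbol{X}_{J_r} = \boldsymbol{x}_{J_r}, \mathcal{I} = 1) = \sum_{\boldsymbol{x} \in \mathcal{X}_{J_r}} \hat{\mathbb{P}}(\mathcal{M} = 1 \mid \boldsymbol{z}(\boldsymbol{x}), \mathcal{I} = 1) \, \frac{\hat{\mathbb{P}}(\boldsymbol{z}(\boldsymbol{x}) \mid \mathcal{I} = 1)}{\sum_{\boldsymbol{x}' \in \mathcal{X}_{J_r}} \hat{\mathbb{P}}(\boldsymbol{z}(\boldsymbol{x}') \mid \mathcal{I} = 1)}, \tag{21}$$

where $\hat{\mathbb{P}}(\boldsymbol{z}(\boldsymbol{x}) \mid \mathcal{I} = 1)$ comes from Equation (11).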
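Both marginalization formulas are weighted averages of cell-level rates; they differ only in the weights. The sketch below uses hypothetical numbers for four toy gender-age cells to collapse over age: infection rates are weighted by population share, and case fatality rates by each cell's share among the infected.

```python
# Hypothetical cell-level inputs (not values from the paper).
pop_share = {("female", "young"): 0.3, ("female", "old"): 0.2,
             ("male", "young"): 0.3, ("male", "old"): 0.2}
infect_prob = {("female", "young"): 0.1, ("female", "old"): 0.3,
               ("male", "young"): 0.2, ("male", "old"): 0.4}
fatality_prob = {("female", "young"): 0.005, ("female", "old"): 0.08,
                 ("male", "young"): 0.01, ("male", "old"): 0.10}

def collapsed_rate(rate, weight, level):
    """Weighted average of cell-level rates over the cells that contain `level`."""
    members = [c for c in rate if level in c]
    total = sum(weight[c] for c in members)
    return sum(rate[c] * weight[c] for c in members) / total

# Weights for the CFR: share of each cell among the infected (unnormalized is fine,
# because collapsed_rate renormalizes within the selected cells).
infected_weight = {c: infect_prob[c] * pop_share[c] for c in pop_share}

male_tir = collapsed_rate(infect_prob, pop_share, "male")
male_tcfr = collapsed_rate(fatality_prob, infected_weight, "male")
```

Using population shares as weights for the CFR would understate it here, because the higher-risk older cell is over-represented among the infected.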

## 4. Results

We present select estimates of T-IRs and T-CFRs obtained from our IR (1) and CFR (9) models fit to the California data described in Section 2 and summarized according to Section 3.4. All standard errors were computed using a bootstrap size of $\mathcal{B} = 100$. As a baseline, we assume an overall California COVID-19 infection rate $\pi_{\mathcal{I}}$ of 2%, a rough estimate of the U.S. infection rate of the 1918 Influenza Pandemic [19], which is estimated to have had a basic reproductive number ($\mathcal{R}$) comparable to that of COVID-19 [20, 21]. However, there is still substantial uncertainty surrounding the true COVID-19 infection rate, primarily due to the lack of testing and the high prevalence of asymptomatic cases. Recent studies suggest that the true overall infection rate in the U.S. is much higher than what was initially hypothesized [22, 23].

Figure 2 depicts T-IR estimates and error bars indicating two bootstrap standard errors (SEs) for different combinations of gender and age group under the assumption of an overall California infection rate of 2%. The T-IR estimates range from 0.1% to 8.3% for females and from 0.1% to 8.9% for males. The figures present 6 race/ethnicity groups: LatinX/Hispanic (LatinX), White/Caucasian (White), Asian, African American/Black (AA), Multi-Race, and American Indian or Alaska Native (AIAN). The LatinX group has the highest T-IRs, followed by the African American group. Both females and males age 80 and older have extremely high T-IRs compared with other age groups across race/ethnicity groups. T-IRs were non-monotonic at younger ages, with age groups 60–64 and 70–74 having slightly lower T-IRs than the preceding age groups, 50–59 and 65–69 respectively. Males had higher T-IRs than females across all age groups, with the gender gap slightly increasing with age.

[Figure 2](http://medrxiv.org/content/early/2020/07/01/2020.06.29.20141978/F2): Estimated test-based infection rates given age and race/ethnicity, stratified by gender. (A) and (B) present the bootstrapped mean infection rates for female and male respectively. Only 6 racial and ethnicity groups are considered in the figures, including LatinX/Hispanic (LatinX), White/Caucasian (White), Asian, African American/Black (AA), Multi-Race, and American Indian or Alaska Native (AIAN). The overall infection rate was assumed to be 2%, and the error bars denote two bootstrap standard errors.

We also considered alternate values for the overall California IR. Table 4 presents the point estimates and associated two SE intervals of the marginal T-IRs for gender and age group obtained from marginalization formula (20) assuming overall IRs of 1%, 2%, and 5%. The estimated marginal T-IRs for gender, age groups and race and ethnicity groups are consistent with the results presented in Figure 2, including males and older individuals having higher estimated T-IRs.

[Table 4](http://medrxiv.org/content/early/2020/07/01/2020.06.29.20141978/T5): Estimated marginal test-based infection rates for different overall infection rates

Table 5 presents the point estimates and associated two SE intervals of the bootstrap estimated marginal T-CFRs obtained from marginalization formula (21), assuming an overall infection rate of $\pi_{\mathcal{I}}$ = 2%; T-CFR estimates do not vary in expectation for different values of $\pi_{\mathcal{I}}$. Males have a mean T-CFR 0.78% higher than females, and T-CFRs increase with age, ranging from less than 0.11% for the 0–34 age group to over 27.45% for the 80+ age group. Among the 6 race and ethnicity groups, African American, Asian, and White are high-risk groups, with mean T-CFRs of 7.42%, 6.80%, and 6.78% respectively. The Other, LatinX, and Multi-race subgroups have T-CFRs below the overall 3.68% T-CFR for California.

[Table 5](http://medrxiv.org/content/early/2020/07/01/2020.06.29.20141978/T6): Estimated marginal test-based case fatality rates

Figure 3 presents the estimated T-CFRs, obtained from formula (21), with error bars displaying two SEs of uncertainty for different combinations of gender and age groups, stratified by the 6 race and ethnicity groups shown in Figure 2. Males have higher estimated T-CFRs than females across all age-race levels, and the gender gap increases with age. In the stratified results, the African American group, followed by the Multi-race group, generally has higher estimated T-CFRs than other race groups across age groups. African American females even have higher T-CFRs than AIAN, LatinX, and White males in the corresponding age groups.

[Figure 3](http://medrxiv.org/content/early/2020/07/01/2020.06.29.20141978/F3): Estimated test-based case fatality rates by age and gender, stratified by race/ethnicity. The bootstrap mean case fatality rates are presented separately for LatinX, White, Asian, African American, Multi-Race, and AIAN groups. The overall infection rate was assumed to be 2%, and the error bars denote two bootstrap standard errors.

Although the Multi-race group has higher T-CFRs than the White group in each age group, its overall marginal T-CFR is lower than that of the White group, as shown in Table 5. This reversal of the inequality after aggregation is an example of Simpson's paradox. The Multi-race population is younger than the White population in California. For example, we can compare the number of adults (18 and above) and the number of children and adolescents (0 to 17 years old) within each race and ethnicity group. The overall adult-child ratio is defined as the total number of adults divided by the total number of children and adolescents, ignoring race and ethnicity groups. The Multi-race population in California has an adult-child ratio 0.43 times the overall ratio (1.7% of adults and 4% of children and adolescents), while the White population has a ratio 1.3 times the overall ratio (38.8% of adults and 29.2% of children and adolescents) [6]. Since T-CFRs increase with age, the higher adult-child ratio in the age structure of the White population leads to a higher marginal T-CFR. The same reasoning holds with multiple age groups. A similar paradox occurs for the LatinX and Asian groups. Meanwhile, the small proportions of the Multi-race and AIAN groups in the general population also result in large error bars.
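A small numerical example makes the reversal concrete. The numbers below are invented for illustration and are not estimates from the paper: group "A" has the higher case fatality rate in every age stratum, but its infected population is much younger, so its marginal rate ends up lower.

```python
# Hypothetical stratum-specific case fatality rates and age mixes among the infected.
cfr = {"A": {"young": 0.02, "old": 0.30},
       "B": {"young": 0.01, "old": 0.25}}
age_mix = {"A": {"young": 0.9, "old": 0.1},   # group A is mostly young
           "B": {"young": 0.4, "old": 0.6}}   # group B is mostly old

def marginal_cfr(group):
    """Age-mix-weighted average of the stratum-specific rates."""
    return sum(cfr[group][age] * age_mix[group][age] for age in cfr[group])

marginal = {g: marginal_cfr(g) for g in cfr}
higher_in_every_stratum = all(cfr["A"][a] > cfr["B"][a] for a in ("young", "old"))
lower_marginally = marginal["A"] < marginal["B"]
```

Here the marginal rates are 4.8% for group A and 15.4% for group B, even though A is worse within each age stratum, which mirrors the Multi-race versus White comparison described above.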

## 5. Discussion

In this paper, we combined aggregate COVID-19 case and fatality data with population demographic data in a pseudo-likelihood based multivariable logistic regression approach for obtaining early estimates of COVID-19 T-IRs and T-CFRs for subgroups of the California population. Overall, our results revealed that males, the elderly, and LatinX are marginally at relatively higher risk of COVID-19 infection, and that males, the elderly, African Americans, Asians, and Whites are marginally at elevated risk of mortality after COVID-19 infection. However, due to the imbalance in the age distribution of different races in California, the subgroups with the top 5 T-CFRs are African American male, Multi-race male, Asian male, African American female, and AIAN male for each age group. Overall, therefore, African Americans are the race/ethnicity group most vulnerable to COVID-19 in California. We also found that the elevated infection and mortality risk for males and the greater mortality risk for all races increase with age.

The proposed methods are subject to three general limitations. First, the analysis is based on publicly available test-based infection rates and case fatality rates. It has been well documented that the lack of testing for COVID-19 in the U.S. has hindered efforts to estimate the true COVID-19 infection rate. Further compounding this issue is the high prevalence of asymptomatic COVID-19 cases. These two issues may lead to substantial underestimates of the infection rates and/or substantial overestimates of the case fatality rates from our analyses. Second, race/ethnicity is missing in 29% of the reported cases from CDPH, which may bias our estimates. Even though CDPH releases summary statistics for age-race covariates and our model can fit the finer data, we fit the marginal statistics for each risk factor in the analysis to minimize the impact of the potential sampling bias. Moreover, the case and fatality data released by CDPH provide marginal summary statistics for a subset of risk factors, and we do not have direct information on the joint distribution of all risk factors. Although the central goal of our proposed methods is to circumvent this limitation, the absence of direct multivariate information on the risk factors of COVID-19 infection and mortality as well as the sampling bias should be taken into account when interpreting the results of our models. Third, in this paper, we do not consider regularity conditions ensuring concavity associated with the pseudo-log-likelihood functions constructed in Equations (6) and (15), nor do we examine the asymptotic properties of the parameter estimates in Equations (7) and (16). Future research investigating the mathematical theory of the proposed methods is warranted.

Another promising avenue for future work is combining this method with a COVID-19 prediction model [24] to provide detailed demographic projections of COVID-19 cases and mortalities. This would be a substantial improvement over most COVID-19 prediction models, as they tend to be quite limited in their ability to forecast the demographic characteristics of the infected.

In summary, this paper provides a pragmatic tool for producing early estimates of COVID-19 T-IRs and T-CFRs for the California population, which offer valuable information to guide health policies concerning the control and prevention of COVID-19. In addition, our methods can be generalized into a general framework for early estimation of subpopulation IRs and CFRs from aggregate case and fatality data in other locations and for future epidemics.

## Data Availability

The data underlying the results presented in the study are based on the most recent COVID-19 case and fatality data from the California Department of Public Health (CDPH) and population-level demographic data from the California Health Interview Survey (CHIS).

[https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/COVID-19/Race-Ethnicity.aspx](https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/COVID-19/Race-Ethnicity.aspx) 

[https://update.covid19.ca.gov/#top](https://update.covid19.ca.gov/#top) 

[https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/COVID-19/COVID-19-Cases-by-Age-Group.aspx](https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/COVID-19/COVID-19-Cases-by-Age-Group.aspx) 

[https://healthpolicy.ucla.edu/chis/data/Pages/GetCHISData.aspx](https://healthpolicy.ucla.edu/chis/data/Pages/GetCHISData.aspx) 

## Author Contributions

DX, MAS, and CMR conceptualized the study design. DX, LZ, GLW, and CMR drafted the original manuscript. DX, PS, TB, and CMR performed the data collection and literature search. DX and LZ developed the methodology, conducted the statistical analysis, and interpreted the results. DX, LZ, GLW, PS, TB, JZ, JS, and MAS performed the computer programming. All authors revised the manuscript and approved the final version.

## Funding

The authors received no specific funding for this work.

## Declaration of Competing Interest

The authors have declared that no competing interests exist.

## Acknowledgements

We thank Dr. Sudipto Banerjee and Jay J. Xu (University of California, Los Angeles) for their many helpful comments and assistance.

*   Received June 29, 2020.
*   Revision received June 29, 2020.
*   Accepted July 1, 2020.


*   © 2020, Posted by Cold Spring Harbor Laboratory

This pre-print is available under a Creative Commons License (Attribution-NoDerivs 4.0 International), CC BY-ND 4.0, as described at [http://creativecommons.org/licenses/by-nd/4.0/](http://creativecommons.org/licenses/by-nd/4.0/)

## References

1.  [1]. A. L. Phelan, R. Katz, L. O. Gostin, The novel coronavirus originating in Wuhan, China: challenges for global health governance, JAMA 323 (8) (2020) 709–710.
    

2.  [2]. D. Cucinotta, M. Vanelli, WHO declares COVID-19 a pandemic, Acta Biomedica: Atenei Parmensis 91 (1) (2020) 157–160.
    
    
3.  [3]. C. COVID, Global cases by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU), ArcGIS. Johns Hopkins CSSE. Retrieved April 8 (19) 2020.
    
    
4.  [4]. S. Garg, Hospitalization Rates and Characteristics of Patients Hospitalized with Laboratory-Confirmed Coronavirus Disease 2019 — COVID-NET, 14 States, March 1–30, 2020, MMWR. Morbidity and Mortality Weekly Report 69.
    
    
5.  [5].[dataset]New Jersey Department of Health, COVID-19 Information Hub, [https://covid19.nj.gov/](https://covid19.nj.gov/) (2020 (accessed June 13, 2020)).
    
    
6.  [6].[dataset]California Department of Public Health, COVID-19 Race and Ethnicity Data, [https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/COVID-19/Race-Ethnicity.aspx](https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/COVID-19/Race-Ethnicity.aspx) (2020 (accessed June 13, 2020)).
    
    
7.  [7].[dataset] Illinois Department of Public Health, Coronavirus Disease 2019 (COVID-19), [https://www.dph.illinois.gov/covid19/covid19-statistics](https://www.dph.illinois.gov/covid19/covid19-statistics) (2020 (accessed June 13, 2020)).
    
    
8.  [8]. Z. Zheng,  F. Peng,  B. Xu,  J. Zhao,  H. Liu,  J. Peng,  Q. Li,  C. Jiang,  Y. Zhou,  S. Liu, et al., Risk factors of critical & mortal COVID-19 cases: A systematic literature review and meta-analysis, Journal of Infection.
    
    
9.  [9]. J.-M. Jin,  P. Bai,  W. He,  F. Wu,  X.-F. Liu,  D.-M. Han,  S. Liu,  J.-K. Yang, Gender Differences in Patients With COVID-19: Focus on Severity and Mortality, Frontiers in Public Health 8 (2020) 152.
    
    
10. [10]. A. B. Docherty, E. M. Harrison, C. A. Green, H. E. Hardwick, R. Pius, L. Norman, K. A. Holden, J. M. Read, F. Dondelinger, G. Carson, L. Merson, J. Lee, D. Plotkin, L. Sigfrid, S. Halpin, C. Jackson, C. Gamble, P. W. Horby, J. S. Nguyen-Van-Tam, A. Ho, C. D. Russell, J. Dunning, P. J. Openshaw, J. K. Baillie, M. G. Semple, Features of 20 133 UK patients in hospital with COVID-19 using the ISARIC WHO Clinical Characterisation Protocol: prospective observational cohort study, BMJ 369 (m1985).
    
    
11. [11]. R.-H. Du, L.-R. Liang, C.-Q. Yang, W. Wang, T.-Z. Cao, M. Li, G.-Y. Guo, J. Du, C.-L. Zheng, Q. Zhu, et al., Predictors of mortality for patients with COVID-19 pneumonia caused by SARS-CoV-2: a prospective cohort study, European Respiratory Journal 55 (5).
    
    
12. [12]. D. Mavridis, G. Salanti, A practical introduction to multivariate meta-analysis, Statistical Methods in Medical Research 22 (2) (2013) 133–158.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1177/0962280211432219&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=22275379&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2020%2F07%2F01%2F2020.06.29.20141978.atom) 

13. [13]. M. C. Simmonds, J. P. Higgins, A general framework for the use of logistic regression models in meta-analysis, Statistical Methods in Medical Research 25 (6) (2016) 2858–2877.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1177/0962280214534409&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=24823642&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2020%2F07%2F01%2F2020.06.29.20141978.atom) 

14. [14]. B.-H. Chang,  S. Lipsitz,  C. Waternaux, Logistic regression in meta-analysis using aggregate data, Journal of Applied Statistics 27 (4) (2000) 411–424.
    
    
15. [15]. F. Caramelo, N. Ferreira, B. Oliveiros, Estimation of risk factors for COVID-19 mortality - preliminary results, medRxiv.
    
    
16. [16].[dataset]California Department of Public Health, California Coronavirus (COVID-19) Response, [https://update.covid19.ca.gov/#top](https://update.covid19.ca.gov/#top) (2020 (accessed June 13, 2020)).
    
    
17. [17].[dataset]California Department of Public Health, Cases and Deaths Associated with COVID-19 by Age Group in California, [https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/COVID-19/COVID-19-Cases-by-Age-Group.aspx](https://www.cdph.ca.gov/Programs/CID/DCDC/Pages/COVID-19/COVID-19-Cases-by-Age-Group.aspx) (2020 (accessed June 13, 2020)).
    
    
18. [18].[dataset]California Health Interview Survey, CHIS 2017-2018 Public Use File. Los Angeles, CA: UCLA Center for Health Policy Research, [https://healthpolicy.ucla.edu/chis/data/Pages/GetCHISData.aspx](https://healthpolicy.ucla.edu/chis/data/Pages/GetCHISData.aspx) (2020 (accessed June 13, 2020)).
    
    
19. [19]. E. Vynnycky, A. Trindall, P. Mangtani, Estimates of the reproduction numbers of Spanish influenza using morbidity data, International Journal of Epidemiology 36 (4) (2007) 881–889.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1093/ije/dym071&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=17517812&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2020%2F07%2F01%2F2020.06.29.20141978.atom) 
    
    [Web of Science](http://medrxiv.org/lookup/external-ref?access_num=000250050300033&link_type=ISI) 

20. [20]. Q. Liu, Z. Liu, J. Zhu, Y. Zhu, D. Li, Z. Gao, L. Zhou, J. Yang, Q. Wang, Assessing the global tendency of COVID-19 outbreak, medRxiv.
    
    
21. [21]. S. Zhao, Q. Lin, J. Ran, S. S. Musa, G. Yang, W. Wang, Y. Lou, D. Gao, L. Yang, D. He, et al., Preliminary estimation of the basic reproduction number of novel coronavirus (2019-nCoV) in China, from 2019 to 2020: A data-driven analysis in the early phase of the outbreak, International Journal of Infectious Diseases 92 (2020) 214–217.
    
    [CrossRef](http://medrxiv.org/lookup/external-ref?access_num=10.1016/j.ijid.2020.01.050&link_type=DOI) 
    
    [PubMed](http://medrxiv.org/lookup/external-ref?access_num=32007643&link_type=MED&atom=%2Fmedrxiv%2Fearly%2F2020%2F07%2F01%2F2020.06.29.20141978.atom) 

22. [22]. E. Bendavid, B. Mulaney, N. Sood, S. Shah, E. Ling, R. Bromley-Dulfano, C. Lai, Z. Weissberg, R. Saavedra, J. Tedrow, et al., COVID-19 antibody seroprevalence in Santa Clara County, California, medRxiv.
    
    
23. [23]. D. Sutton, K. Fuchs, M. D'Alton, D. Goffman, Universal screening for SARS-CoV-2 in women admitted for delivery, New England Journal of Medicine.
    
    
24. [24]. G. L. Watson, D. Xiong, L. Zhang, J. A. Zoller, J. Shamshoian, P. Sundin, T. Bufford, A. W. Rimoin, M. A. Suchard, C. M. Ramirez, Fusing a Bayesian case velocity model with random forest for predicting COVID-19 in the US, Available at SSRN 3594606.

 [1]: /embed/graphic-3.gif
 [2]: /embed/graphic-4.gif
 [3]: /embed/inline-graphic-1.gif
 [4]: /embed/inline-graphic-2.gif
 [5]: /embed/inline-graphic-3.gif
 [6]: /embed/graphic-5.gif
 [7]: /embed/inline-graphic-4.gif
 [8]: /embed/inline-graphic-5.gif
 [9]: /embed/graphic-6.gif
 [10]: /embed/inline-graphic-6.gif
 [11]: /embed/graphic-7.gif
 [12]: /embed/inline-graphic-7.gif
 [13]: /embed/inline-graphic-8.gif
 [14]: /embed/inline-graphic-9.gif
 [15]: /embed/inline-graphic-10.gif
 [16]: /embed/inline-graphic-11.gif
 [17]: /embed/inline-graphic-12.gif
 [18]: /embed/inline-graphic-13.gif
 [19]: /embed/inline-graphic-14.gif
 [20]: /embed/inline-graphic-15.gif
 [21]: /embed/inline-graphic-16.gif
 [22]: /embed/inline-graphic-17.gif
 [23]: /embed/inline-graphic-18.gif
 [24]: /embed/inline-graphic-19.gif
 [25]: /embed/inline-graphic-20.gif
 [26]: /embed/inline-graphic-21.gif
 [27]: /embed/inline-graphic-22.gif
 [28]: /embed/inline-graphic-23.gif
 [29]: /embed/inline-graphic-24.gif
 [30]: /embed/inline-graphic-25.gif
 [31]: /embed/inline-graphic-26.gif
 [32]: /embed/inline-graphic-27.gif
 [33]: /embed/inline-graphic-28.gif
 [34]: /embed/graphic-8.gif
 [35]: /embed/graphic-9.gif
 [36]: /embed/inline-graphic-29.gif
 [37]: /embed/graphic-10.gif
 [38]: /embed/graphic-12.gif
 [39]: /embed/graphic-13.gif
 [40]: /embed/inline-graphic-30.gif
 [41]: /embed/graphic-14.gif
 [42]: /embed/inline-graphic-31.gif
 [43]: /embed/inline-graphic-32.gif
 [44]: /embed/inline-graphic-33.gif
 [45]: /embed/graphic-15.gif
 [46]: /embed/inline-graphic-34.gif
 [47]: /embed/inline-graphic-35.gif
 [48]: /embed/inline-graphic-36.gif
 [49]: /embed/graphic-16.gif
 [50]: /embed/graphic-17.gif
 [51]: /embed/inline-graphic-37.gif
 [52]: /embed/inline-graphic-38.gif
 [53]: /embed/inline-graphic-39.gif
 [54]: /embed/inline-graphic-40.gif
 [55]: /embed/inline-graphic-41.gif
 [56]: /embed/inline-graphic-42.gif
 [57]: /embed/inline-graphic-43.gif
 [58]: /embed/inline-graphic-44.gif
 [59]: /embed/inline-graphic-45.gif
 [60]: /embed/inline-graphic-46.gif
 [61]: /embed/graphic-18.gif
 [62]: /embed/graphic-19.gif
 [63]: /embed/inline-graphic-47.gif
 [64]: /embed/graphic-20.gif
 [65]: /embed/inline-graphic-48.gif
 [66]: /embed/inline-graphic-49.gif
 [67]: /embed/inline-graphic-50.gif
 [68]: /embed/graphic-21.gif
 [69]: /embed/inline-graphic-51.gif
 [70]: /embed/inline-graphic-52.gif
 [71]: /embed/inline-graphic-53.gif
 [72]: /embed/graphic-22.gif
 [73]: /embed/inline-graphic-54.gif
 [74]: /embed/inline-graphic-55.gif
 [75]: /embed/inline-graphic-56.gif
 [76]: /embed/inline-graphic-57.gif
 [77]: /embed/inline-graphic-58.gif
 [78]: /embed/inline-graphic-59.gif
 [79]: /embed/graphic-25.gif
 [80]: /embed/inline-graphic-60.gif
 [81]: /embed/graphic-26.gif
 [82]: /embed/inline-graphic-61.gif