Demographic, spatial, and temporal dietary intake patterns among 526,774 23andMe research participants

Objective: To characterize dietary habits, their temporal and spatial patterns, and associations with body mass index (BMI) in the 23andMe study population. Design: We present a large-scale cross-sectional analysis of self-reported dietary intake data derived from the web-based NHANES 2009-2010 dietary screener. Survey-weighted estimates for each food item were characterized by age, sex, race/ethnicity, education, and BMI. Temporal patterns were plotted over a 2-year time period, and average consumption for select food items was mapped by state. Finally, dietary intake variables were tested for association with BMI. Setting: U.S. based adults 20-85 years of age participating in the 23andMe research program. Participants: Participants were 23andMe customers who consented to participate in research (n=526,774) and completed web-based surveys on demographic and dietary habits. Results: Survey-weighted estimates show very few participants met federal recommendations for fruit: 2.6%, vegetables: 5.9%, and dairy intake: 2.8%. Between 2017-2019, fruit, vegetables, and milk intake frequency declined, while total dairy remained stable and added sugars increased. Seasonal patterns in reporting were most pronounced for ice cream, chocolate, fruits, and vegetables. Dietary habits varied across the U.S., with higher intake of sugar and calorie dense foods characterizing areas with higher average BMI. In multivariate-adjusted models, BMI was directly associated with intake of processed meat, red meat, dairy, and inversely associated with consumption of fruit, vegetables, and whole grains. Conclusions: 23andMe research participants have created an opportunity for rapid, large scale, real time nutritional data collection, informing demographic, seasonal and spatial patterns with broad geographical coverage across the U.S.


For example, among those who consume 22g/day or more of saturated fat, weight gain was more pronounced among those with the -265 C/C genotype of APOA2, an estimated 10-20% of the population, compared to those without it. (7) However, longer term health consequences, such as varying cardio-metabolic disease risk across combinations of exposures, are not well understood.
While precision nutrition has been described as a major priority for epidemiology, (8) the vast majority of studies published to date are underpowered for more granular discovery.
In order to advance our understanding of how nutritional and other factors interact and impact health, data are needed from large populations followed over time. Since 2017, over 500,000 genotyped 23andMe research participants answered a survey about their dietary habits over the past month. Due to the growing size of the customer database, widespread geographical representation, and continuous data collection over time, data provided by 23andMe research participants now represent a large enough sample to inform population-based inferences for a variety of health behaviors, including diet.
However, because 23andMe participants are a subsample of 23andMe personal genome service customers, they are not a representative sample of the general population. Therefore, the aim of this manuscript is to describe how the sociodemographic profile and dietary habits of the 23andMe research participants compare to the US population, to characterize its dietary habits using survey weights that account for potential imbalances, and to test cross-sectional associations between BMI and consumption of several food items in this uniquely large cohort.
Our manuscript illustrates both the potential and the caveats of conducting nutritional epidemiology research in large-scale digital cohorts.
medRxiv preprint doi: https://doi.org/10.1101/2020.04.14.20058263. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Customer data are de-identified; 23andMe researchers who conducted the statistical analyses in this manuscript did not have access to personally identifiable information (e.g. name, address, etc.) and were trained in the responsible conduct of research. 23andMe has obtained a Certificate of Confidentiality from the National Institutes of Health, further protecting the privacy of research participants. Additional consent form information is available at www.23andme.com/about/consent.

Data collection
Once a 23andMe personal genome service customer receives their sample collection kit, they are asked to register it online prior to return. During this process, all customers are invited to participate in research, which occurs predominantly through web-based research surveys, some of which are developed in-house by 23andMe and others are implemented using previously validated instruments. Beginning in March 2017, dietary intake frequencies were collected using a validated web-based version of the self-administered 25-item NHANES dietary screener questionnaire (2009/2010), otherwise referred to as the DSQ. (9) Accompanying surveys provided self-reported data on covariates such as age, sex, race/ethnicity, education, and BMI. The questions on race/ethnicity and education utilized the same response options as the U.S. Census.
Participant recruitment to the surveys took place through two approaches: actively through email and passively through the website and mobile application. Passive recruitment occurs via the 'research stream', a feature of the 23andMe website and the 23andMe app (available on both iOS and Android devices) that continuously surfaces surveys on a variety of topics to eligible respondents.
As such, the DSQ was fielded to those who had likely already completed higher priority surveys such as the Health survey (which collects data on basic demographics, health and disease status), or disease specific surveys based on conditions they may have reported. Because of this passive recruitment targeting scheme, we do not know the total population of people who were offered the survey, and therefore cannot directly estimate a survey response rate.

Dietary assessment and intake estimation
The DSQ measures intake of fruits and vegetables (cups/day), dairy (cups/day), calcium (mg), added sugars (g), whole grains (ounce equivalents), and fiber (g). (10) Total fruit includes both whole fruit and fruit juice; vegetable intake is estimated from consumption frequency of salad, potatoes, beans, tomato sauce, salsa, pizza toppings, and other vegetables. Dairy intake is estimated from milk, cheese, ice cream, and pizza. Whole grains are derived from cereals, whole grain bread, popcorn, and whole grain rice. Added sugars are derived from soda, fruit drinks, cookies/cakes/pie, doughnuts, ice cream, added sugar/honey in coffee or tea, candy, and cereal.
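As a rough illustration of how item-level responses roll up into a food-group estimate such as dairy, the sketch below converts each reported frequency to times per day, multiplies by a portion size, and sums over the items in the group. It is a minimal Python sketch, not the DSQ scoring algorithm: the response labels, the `TIMES_PER_DAY` mapping, the portion sizes in `DAIRY_CUPS_PER_OCCASION`, and the function `daily_group_intake` are all hypothetical and chosen only for illustration.

```python
# Hypothetical mapping from screener-style frequency responses to times per day.
# The labels and values are illustrative assumptions, not the DSQ response set.
TIMES_PER_DAY = {
    "never": 0.0,
    "1 time last month": 1 / 30,
    "1 time per week": 1 / 7,
    "1 time per day": 1.0,
    "2 or more times per day": 2.0,
}

# Hypothetical portion sizes in cup equivalents per eating occasion.
# The published scoring uses its own item-specific values.
DAIRY_CUPS_PER_OCCASION = {
    "milk": 1.0,
    "cheese": 0.5,
    "ice cream": 0.3,
    "pizza": 0.4,
}


def daily_group_intake(responses, portions):
    """Sum (daily frequency x portion size) over the items in a food group."""
    total = 0.0
    for item, cups in portions.items():
        # Items the respondent did not answer are treated as "never" here.
        freq = TIMES_PER_DAY[responses.get(item, "never")]
        total += freq * cups
    return total


example = {
    "milk": "1 time per day",
    "cheese": "1 time per week",
    "ice cream": "never",
    "pizza": "1 time last month",
}
dairy_cups = daily_group_intake(example, DAIRY_CUPS_PER_OCCASION)
print(f"Estimated dairy intake: {dairy_cups:.2f} cups/day")
```

The published scoring procedures rely on item-specific portion and serving size estimates rather than the placeholder values used here.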
Added sugars from sugar sweetened beverages are derived from soda, fruit drinks, and sugar/honey added to coffee or tea. (10) The DSQ and scoring algorithms to derive broader food groups were validated against dietary intake using the 24-hour recall method in a representative population of non-institutionalized U.S. based NHANES study participants aged 2-69. (11) In that context, the DSQ was reported to produce stable estimates of intake for this set of dietary factors and correlate well with 24-hour recall estimates in three external study populations. (11) In the DSQ, individual food items are ascertained as frequency over the past month. Based on the data processing and scoring procedures recommended for use with the DSQ, we converted monthly estimates to daily estimates, and then multiplied the latter by item-specific portion and serving size estimates.

Time trends and maps
Mean daily intake and 95% confidence intervals were plotted by week of survey completion to explore seasonal variation in dietary intake. Although the data collected refer to the past month, we did not apply a lag period or adjust the date in any way. Broader temporal changes in reported dietary habits over time are visualized with loess curves.
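The weekly summary described above can be sketched in a few lines. The example below groups responses by ISO week of survey completion and computes an unweighted mean with a normal-approximation 95% interval. The function name `weekly_mean_ci`, the choice of ISO weeks, the interval formula, and the sample records are assumptions made for illustration; the manuscript does not specify how the intervals were computed.

```python
from collections import defaultdict
from datetime import date
from math import sqrt
from statistics import mean, stdev


def weekly_mean_ci(records, z=1.96):
    """Group (completion_date, intake) pairs by ISO week; return mean and CI.

    Returns {(iso_year, iso_week): (mean, lower, upper)}. A week with a single
    response gets a zero-width interval because no variance can be estimated.
    """
    by_week = defaultdict(list)
    for completed, intake in records:
        iso = completed.isocalendar()
        by_week[(iso[0], iso[1])].append(intake)

    summary = {}
    for week, values in sorted(by_week.items()):
        m = mean(values)
        half = z * stdev(values) / sqrt(len(values)) if len(values) > 1 else 0.0
        summary[week] = (m, m - half, m + half)
    return summary


# Hypothetical responses: (survey completion date, fruit intake in cups/day).
records = [
    (date(2018, 1, 1), 0.7),
    (date(2018, 1, 3), 0.9),
    (date(2018, 1, 5), 0.8),
    (date(2018, 7, 2), 1.1),
    (date(2018, 7, 4), 1.3),
]
summary = weekly_mean_ci(records)
for week, (m, lower, upper) in summary.items():
    print(week, round(m, 2), round(lower, 2), round(upper, 2))
```

Because no lag is applied, each point reflects intake reported for the month preceding the completion week, not intake during that week.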
Participants self-reported their current zip code, which we subsequently mapped to states to characterize the geographic distribution of average intake of food groups, select food items, and average BMI across the U.S. Regional average intake was estimated for the South (Delaware, For this analysis, the foods selected for seasonality were those with the most pronounced seasonal trends: chocolate, fruit, ice cream and salad. Those selected for mapping showed the most pronounced geographic patterns: fruit, vegetables, whole grains, red meat, processed meat, dairy, and pizza.

Development of survey weights
To address differences between our respondent population and the national population (Table 1), we developed survey weights and applied them to our sample. We used iterative proportional fitting (IPF) to calculate weights based on age (20-64, 65+), sex (M, F), race (white, non-white), education (less than college, completed college), and BMI (obese, not obese). We weighted our sample proportions to match the demographic distribution as reported by the U.S. Census (age, sex, race, education) (14) and the Centers for Disease Control and Prevention (BMI). (15) We excluded respondents (n=156,053) from the raking procedure (implemented via the survey package (version 3.29.5) (16) in R) and subsequent analyses if they did not provide this demographic information. To address the increase in standard error that may be introduced with large sample weights, we chose binary classifications instead of utilizing all strata (e.g. for BMI), and trimmed all initial weights to be less than 5 times the mean survey weight. (17)
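To make the raking step concrete, the following is a minimal Python sketch of iterative proportional fitting over categorical margins, followed by a single-pass cap on large weights. It is not the authors' R code (they used the survey package); the function names `rake`, `trim`, and `weighted_share`, the single-pass trimming rule, and the toy sample and target proportions are all assumptions made for illustration.

```python
def rake(rows, targets, max_iter=100, tol=1e-8):
    """Iterative proportional fitting over categorical margins.

    rows: list of dicts, one per respondent, e.g. {"sex": "F", "college": "no"}.
    targets: {variable: {category: population proportion}}.
    Assumes every target category occurs at least once in rows.
    Returns one weight per row, scaled so the weights average 1.
    """
    weights = [1.0] * len(rows)
    for _ in range(max_iter):
        max_shift = 0.0
        for var, shares in targets.items():
            total = sum(weights)
            # Current weighted total in each category of this variable.
            current = {cat: 0.0 for cat in shares}
            for row, w in zip(rows, weights):
                current[row[var]] += w
            # Rescale each category so its weighted share matches the target.
            factors = {cat: shares[cat] * total / current[cat] for cat in shares}
            weights = [w * factors[row[var]] for row, w in zip(rows, weights)]
            max_shift = max(max_shift, max(abs(f - 1.0) for f in factors.values()))
        if max_shift < tol:
            break
    scale = len(weights) / sum(weights)
    return [w * scale for w in weights]


def trim(weights, cap_ratio=5.0):
    """Cap each weight at cap_ratio times the mean weight (single pass)."""
    cap = cap_ratio * sum(weights) / len(weights)
    return [min(w, cap) for w in weights]


def weighted_share(rows, weights, var, cat):
    """Weighted proportion of respondents whose `var` equals `cat`."""
    hit = sum(w for row, w in zip(rows, weights) if row[var] == cat)
    return hit / sum(weights)


# Hypothetical sample that over-represents women and college graduates.
rows = (
    [{"sex": "F", "college": "yes"}] * 40
    + [{"sex": "F", "college": "no"}] * 25
    + [{"sex": "M", "college": "yes"}] * 20
    + [{"sex": "M", "college": "no"}] * 15
)
targets = {
    "sex": {"F": 0.5, "M": 0.5},
    "college": {"yes": 0.3, "no": 0.7},
}
weights = trim(rake(rows, targets))
print(round(weighted_share(rows, weights, "sex", "F"), 3))
print(round(weighted_share(rows, weights, "college", "yes"), 3))
```

In this toy sample no weight reaches the cap, so trimming leaves the margins intact. When trimming does change weights, the margins drift from their targets, which is one reason coarse binary categories help keep weights small.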

Dietary intake associations with BMI
To measure the cross-sectional correlation between dietary intake and BMI, we explored the association between each food item continuously using the daily frequency or estimated quantity for derived measures, but ultimately classified intake by tertiles for ease of interpretation and standardization of quantity (high vs. low intake frequency). We limited our sample to participants who self-reported height and weight values yielding estimated BMI scores between 14 and 70. We evaluated BMI both with and without log transformation.
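The sample restriction can be sketched as follows. This is a minimal Python illustration that assumes metric units and inclusive bounds, neither of which is stated in the text; the helper names `bmi_kg_m2` and `plausible` and the sample values are hypothetical.

```python
from math import log


def bmi_kg_m2(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


def plausible(bmi, low=14.0, high=70.0):
    """Keep BMI values inside the 14-70 range (bounds assumed inclusive)."""
    return low <= bmi <= high


# Hypothetical self-reports as (weight in kg, height in m).
reports = [(70.0, 1.75), (20.0, 1.80), (95.0, 1.65)]
kept = [b for b in (bmi_kg_m2(w, h) for w, h in reports) if plausible(b)]
log_bmi = [log(b) for b in kept]  # log transformation evaluated alongside raw BMI
print([round(b, 1) for b in kept])
```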
During model development, race/ethnicity, education, age, sex, age*sex, and age squared were evaluated as potential confounders. The final linear models tested the associations between food intake frequency (highest vs. lowest consumption tertiles for all food items measured in the DSQ) and log-transformed BMI, adjusting for race/ethnicity, education, age (centered at 50 years), sex, and centered age squared to maximize the variance explained by the model and the uniformity of the plotted residuals. We estimated regression coefficients and 95% confidence intervals for each food item to evaluate the relationship between frequency of intake and BMI.
We used R (R Foundation for Statistical Computing, version 3.2.5) for all statistical analyses and data visualization.
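For intuition about the highest-versus-lowest tertile contrast, the sketch below computes an unadjusted version in Python: it assigns tertiles from the intake distribution and takes the difference in mean log BMI between the top and bottom tertiles. This is a simplification, not the adjusted linear model described above (no covariates are included). The function names, the use of `statistics.quantiles` for cut points, and the data are assumptions for illustration.

```python
from math import exp, log
from statistics import mean, quantiles


def tertile_labels(values):
    """Assign 1 (lowest) to 3 (highest) using the two tertile cut points."""
    cut1, cut2 = quantiles(values, n=3)
    return [1 if v <= cut1 else 2 if v <= cut2 else 3 for v in values]


def high_vs_low_log_bmi(intakes, bmis):
    """Unadjusted mean difference in log BMI, highest minus lowest tertile."""
    labels = tertile_labels(intakes)
    low = [log(b) for b, t in zip(bmis, labels) if t == 1]
    high = [log(b) for b, t in zip(bmis, labels) if t == 3]
    return mean(high) - mean(low)


# Hypothetical data: daily frequency of one food item and BMI.
intakes = [0.0, 0.1, 0.2, 0.4, 0.5, 0.6, 0.9, 1.0, 1.2]
bmis = [24.0, 25.0, 24.5, 26.0, 27.0, 26.5, 29.0, 30.0, 28.5]
beta = high_vs_low_log_bmi(intakes, bmis)
# On the log scale, exp(beta) - 1 is the relative difference in geometric-mean BMI.
print(f"beta = {beta:.3f}, relative difference = {100 * (exp(beta) - 1):.1f}%")
```

Because the outcome is log-transformed, a coefficient is read as an approximate proportional difference in BMI rather than a difference in BMI units.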

Respondent characteristics
A total of 526,774 U.S.-based respondents with non-missing data for sex, age, education, race/ethnicity, and BMI completed the DSQ between March 2017 and August 2019. Compared to the U.S. population, sample respondents were more likely to be female (65% vs 50%), more likely to be white (86% vs. 77%), less likely to be Hispanic (6% vs. 18%), nearly twice as likely to have completed college (59% vs. 30%), and less likely to be obese (31% vs. 36%).
Application of sampling weights yielded a more representative sample based on age, sex, education, and BMI.

Temporal characteristics of dietary intake
Due to the high rate of gift giving of the 23andMe genetic testing kit during the holidays, survey completion between late December and early January is on average 4-5 fold greater than the average of other weeks throughout the year. Due to this high degree of response in these months, the precision of the estimates by season is highest over the Northern Hemisphere winter period as compared to other seasons (Figure 1).
Because survey data have been collected continuously over two years, seasonal trends in dietary frequencies are observable. Figure 2 shows the unadjusted mean daily intake patterns for fruit, For fruits and vegetables, we observed higher reported consumption in Northern Hemisphere summer than winter (Figure 2), but also noted a general decline over the reporting period (Figure 3). We have explored various possible explanations for this observation, such as changing customer demographics (e.g. age, sex, or type of genotyping kit purchased or the influx of winter customers), but saw no clear explanatory patterns. Further, because national published estimates are not yet available for this time period, we have no comparison with which to support or refute the observation that fruit and vegetable consumption is declining in the general population.

Spatial characteristics of dietary intake
The respondent population shows high geographical coverage across the U.S. (Figure 1), with a minimum of 1,000 respondents in every state. Higher proportions of respondents were from high population density states (California, Texas, Florida). By exploring the fraction of respondents as a percent of the total adult population, we observed relative overrepresentation in California, Florida, and Pennsylvania and underrepresentation in Nebraska, South Dakota, and Vermont.
Minor regional differences are noted when comparing data in aggregate (Table 4), but intake frequencies plotted at the state level show distinct dietary patterns by region. In the Southeastern states, fruit, vegetable, and whole grain intake are markedly lower than in coastal areas, and intake of processed meat, regular soda, and added sugars is comparatively higher. Higher average intake of milk, pizza, and red meat is observed in the northern Midwestern states, and while lower vegetable intake is observed, fruit and whole grain intake are similar to coastal areas (Figure 3). Dietary frequencies which correspond with higher BMI in Figure 4 are also demonstrated to spatially correspond with patterns of higher BMI at the state level (e.g. higher frequency of processed meat and lower consumption of fruits, vegetables, and whole grains in the southern U.S. where BMI is highest).

Dietary intake estimates
Only 2.6% of the respondent population met national recommendations of fruit intake (1½-2 cups per day for adults depending on age and sex), with an average population intake of 0.83 cups/day. (20) Mean vegetable intake (excluding French fries but including other potatoes) across all respondents was 1.54 cups per day, translating to only 5.9% of the total adult population who met guidance of 2½-3 cups/day depending on age and sex ( In stratified estimates of average intake, college education vs. no college education conferred higher intake of fruits, vegetables, and whole grains, with substantially less sugar intake both in total added sugars and the percent consuming at least one sugar sweetened beverage per day. Differences across race/ethnicity showed Oceanian populations (i.e. Americans with origins in the Pacific islands) eat substantially less whole grains compared to other groups, and nearly half of American Indian respondents reported at least one sugar sweetened beverage per day. Dairy consumption was lower among Asians compared to other groups.
As shown in Figure 5, consumption of many of the food items included in the NHANES dietary screener shows strong associations with BMI. In models adjusted for age, sex, education, and race/ethnicity, foods associated with the greatest increase in BMI included processed and red meat, fried potatoes, and pastry. Higher consumption of dairy products (cheese and ice cream) was also associated with higher BMI. Conversely, each tertile increase in the past month
Frequency of chocolate consumption showed no statistically significant association with BMI.

Discussion
Here, we detail demographic, temporal, and spatial characteristics of dietary factors measured by the DSQ and how they relate to BMI among 23andMe research participants. Because 23andMe research participants are twice as likely to have a college education and more likely to be female, white, and older than the general US population, the dietary habits that differ across these characteristics were the ones most changed by weighting the sample to better represent the broader free-living US population of adults. In the weighted sample, the majority of 23andMe participants fall well below dietary guidelines for consumption of fruits and vegetables, showing remarkable similarity to the broader US population. In our stratified analyses, the most pronounced difference by education, which we used as a proxy for socio-economic status, was for soda consumption, which is less common as educational attainment increases. In our cross-sectional analyses of BMI and dietary habits, these under-consumed foods corresponded to lower BMI, whereas dairy, meats, and added sugars were associated with higher BMI. Descriptive spatial patterns of models of high vs. low intake of select dietary factors and BMI replicate results from large-scale nationally representative studies such as the Behavioral Risk Factor Surveillance System. (23,24) Although it is generally agreed that there exists a troubling degree of bias in self-reported dietary intake data, (25) the observation that these patterns are replicated across more rigorous data collection methods (26) is reassuring.
Direct comparisons between weighted intake estimates among 23andMe research participants and the NHANES sample may be limited based on methodological differences in data collection.
As described, 23andMe estimates are based on a brief screener, whereas the NHANES estimates are based on two in-person 24-hour food intake interviews. For example, due to a truncated response option at 2 or more times per day, for many adults (all men and women aged 19-30), it is not possible to report eating enough whole fruit to meet current dietary recommendations (2 cups per day for women aged 19-30, and men aged 19-50) if the portion size estimates are applied, which are less than 1 cup per serving. This becomes a greater problem for women, because the quantity multiplier (i.e. the estimated number of cups per serving) used to estimate cups per day is smaller for women than it is for men. Because the total fruit estimate is based on the combination of whole fruit and 100% fruit juice, participants that report the maximum serving size frequency (2 or more times per day) must also drink 100% fruit juice regularly to meet the recommended minimum fruit intake requirements.
Because fruit juice is not universally considered a healthful dietary option due to its high added sugar and low fiber content, (28) those who eat fruit in quantities equal to or exceeding 2-cups per day but abstain from drinking fruit juice will be systematically misclassified by the DSQ as not meeting fruit intake recommendations. In an assessment of this screener, it was recommended to pilot test in each population prior to use, and to take caution when deriving precise estimates, but neither ceiling effects on fruit intake nor this particular problem was noted. (11) Because the 23andMe respondent population includes an over representation of college educated people who we show here consume less added sugars compared to those without a college degree, limitations inherent to this questionnaire may be more pronounced.
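The ceiling effect discussed above can be written as a short inequality. The symbols are introduced here only for illustration: $f_{\max}$ is the highest reportable frequency, assumed to be scored as 2 times per day, and $p$ is the portion size in cups per serving, which the text states is below 1.

```latex
\[
  \text{whole fruit}_{\max} \;=\; f_{\max} \times p \;=\; 2p \;<\; 2 \ \text{cups/day}
  \qquad \text{because } p < 1 .
\]
```

Under these assumptions, a respondent facing a 2-cup recommendation can be classified as meeting it only if reported 100% fruit juice supplies at least $2 - 2p$ cups per day.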
Generalizability of the survey-weighted 23andMe respondent sample to the U.S. population was achievable for age, sex, education, white vs. non-white, and BMI. However, limits to sample weighting still require a higher representation of non-white participants to develop a generalizable weighted sample. An alternative approach would be sub-sampling the database (29) to achieve a more balanced distribution across race/ethnicity, which we would recommend for
Our exploration of nutrition patterns within the 23andMe database identified several unique advantages of this large-scale, participant-driven, digital cohort. The ability to quickly and contemporaneously collect nutritional information can inform more expeditious assessment of nutrition trends compared to traditional surveys; in contrast to the 23andMe data, NHANES data are usually made available to researchers several years after collection. Additionally, our unprecedentedly large sample size enables well-powered subgroup analyses, including future nutrigenomic studies, and rigorous ascertainment of spatial and seasonal variation in dietary intake. Finally, digital ascertainment of nutritional exposures minimizes participant burden, encouraging participation in future surveys. (30)
In conclusion, we have characterized the demographic, seasonal and spatial patterns of nutritional habits among 23andMe research participants in the U.S. Additionally, we report cross-sectional positive associations between BMI and the intake of red and processed meat as well as dairy, and inverse associations between BMI and the intake of fruit, vegetables, and whole grains. Our dataset offers a unique opportunity for rapid, large-scale, real-time data collection, which can inform national trends in a much shorter time frame than current nationwide surveys.
Efforts to diversify the 23andMe research participant database will increase the generalizability to the U.S. population, but because of the significant number of participants, survey sampling and weighting methods can achieve this at present. While large-scale cohorts like the 23andMe participants offer exciting future opportunities in precision nutrition, general efforts to continuously work towards improvements in dietary habits remain critical for maximizing health, maintaining a healthy BMI, and preventing chronic diseases.
Figure 5: Mean difference in log body mass index (beta, 95% CI) between the highest and lowest consumption tertiles. Beta estimates and 95% confidence intervals were derived from linear models regressing log-transformed body mass index on intake tertile, adjusted for race/ethnicity, education, age (centered), sex, and age (centered)-squared.
Figure S1. Average intake reported by week of data collection, components of total dairy over two-year data collection period (2017-2019).