Estimating salt consumption in 49 low- and middle-income countries: Development, validation and application of a machine learning model

Background: Global targets to reduce salt intake have been proposed but their monitoring is challenged by the lack of population-based data on salt consumption. We developed a machine learning (ML) model to predict salt consumption based on simple predictors, and applied this model to national surveys in low- and middle-income countries (LMICs).
Methods: Pooled analysis of WHO STEPS surveys. We used 19 surveys with spot urine samples for the ML model derivation and validation; we developed a supervised ML regression model based on: sex, age, weight, height, systolic and diastolic blood pressure. We applied the ML model to 49 new STEPS surveys to quantify the mean salt consumption in the population.
Results: The pooled dataset in which we developed the ML model included 45,152 people. Overall, there were no substantial differences between the observed (8.1 g/day (95% CI: 8.0-8.2 g/day)) and ML-predicted (8.1 g/day (95% CI: 8.1-8.2 g/day)) mean salt intake (p=0.065). The pooled dataset where we applied the ML model included 157,699 people; the overall predicted mean salt consumption was 8.1 g/day (95% CI: 8.1-8.2 g/day). The countries with the highest predicted mean salt intake were in Western Pacific. The lowest predicted intake was found in Africa. The country-specific predicted mean salt intake was within reasonable difference from the best available evidence.
Conclusions: A ML model based on readily available predictors estimated daily salt consumption with good accuracy. This model could be used to predict mean salt consumption in the general population where urine samples are not available.


INTRODUCTION
The association between high sodium/salt intake and high blood pressure, a major risk factor for cardiovascular diseases (CVD), is well established. 1-3 More than 1.7 million CVD deaths were attributed to a diet high in sodium in 2019, with ~90% of these deaths occurring in low- and middle-income countries (LMICs). 4,5 Consequently, salt reduction has been included in international goals: the World Health Organization (WHO) recommendation of limiting salt consumption to <5 g/day, 2 and the agreement by the WHO Member States on a 30% relative reduction in mean population salt intake by 2025. 6 Because available evidence suggests that sodium/salt consumption is higher than the global targets, 7-9 we need timely and consistent data on sodium/salt consumption in the general population to track progress towards salt reduction targets.
Global efforts have been made to produce comparable estimates of sodium/salt intake for all countries. 7 Similarly, researchers have summarized the available evidence in specific world regions. 8,9 Although the global endeavor was based on the gold standard method to assess sodium/salt intake (i.e., 24-hour urine sample), its estimates only extend to 2010. 7 Therefore, robust and comparable sodium/salt intake estimates for all countries are lacking for the last ten years. The regional endeavors summarized population-based evidence, yet they conducted study-level meta-analyses in which the original studies could have followed different laboratory methods, and they did not study all countries in the region. Therefore, comparability across studies could be limited and evidence is lacking for many countries. A method to estimate sodium/salt consumption in national samples by leveraging available data is needed to update and complement the existing evidence. 7-10 However, quantifying sodium/salt intake based on 24-hour urine samples is costly and burdensome, limiting its use in population-based studies or national health surveys. As an alternative, equations have been developed to estimate sodium/salt intake based on spot urine (SU) samples. 11-14 However, these equations have been used in few WHO STEPS and other national health surveys, 15 leaving several countries without data to quantify local sodium/salt consumption because they do not have access to SU samples. 16 If we could (accurately) estimate sodium/salt intake based on variables that are routinely available in national health surveys (e.g., weight or blood pressure), mean sodium/salt intake in countries [...]
[...] 6 kg/m²) or with implausible weight (outside the range 12-300 kg) or height records (outside the range 1.00-2.50 m). Participants with SBP outside the range 70-270 mmHg were discarded, and so were participants with DBP outside the range 30-150 mmHg.
We excluded records with SU creatinine <1.8 or >32.7 mmol/L for males and <1.8 or >28.3 mmol/L for females. 17,18 In addition, we excluded participants with estimated salt intake (using the 4 equations) above or below three standard deviations from the equation-specific mean (Supplementary Figure 1). 19 After completing data preparation, observations were randomly assigned from the pooled dataset (100%) into three datasets for the ML analysis: training data (50%), test data (30%) and validation data (20%).
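The exclusion rules and the random 50/30/20 split described above could be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the column names, the helper function names and the random seed are hypothetical, and the three-standard-deviation rule is shown for a single set of estimates rather than for each equation separately.

```python
import numpy as np

# Plausibility ranges quoted in the text (weight in kg, height in m, blood pressure in mmHg).
# The dictionary keys are hypothetical column names.
RANGES = {
    "weight": (12.0, 300.0),
    "height": (1.00, 2.50),
    "sbp": (70.0, 270.0),
    "dbp": (30.0, 150.0),
}

def plausible_mask(data):
    """Boolean mask of rows whose values fall inside every plausibility range."""
    n_rows = len(next(iter(data.values())))
    mask = np.ones(n_rows, dtype=bool)
    for name, (low, high) in RANGES.items():
        values = np.asarray(data[name], dtype=float)
        mask &= (values >= low) & (values <= high)
    return mask

def within_three_sd(values):
    """Boolean mask of values within three standard deviations of their mean."""
    values = np.asarray(values, dtype=float)
    return np.abs(values - values.mean()) <= 3 * values.std(ddof=1)

def split_50_30_20(n_rows, seed=0):
    """Randomly assign row indices to training (50%), test (30%) and validation (20%)."""
    rng = np.random.default_rng(seed)  # the seed is an arbitrary choice for reproducibility
    order = rng.permutation(n_rows)
    n_train = n_rows * 50 // 100
    n_test = n_rows * 30 // 100
    return order[:n_train], order[n_train:n_train + n_test], order[n_train + n_test:]
```

Integer arithmetic is used for the split sizes so that the three parts always add up to the full dataset, with any remainder going to the validation set.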

Machine learning modelling
Our research aim was a regression problem where we had a known outcome attribute (salt consumption at the subject level). Therefore, we planned a supervised ML regression analysis.
Details about the modelling process are available in the Extended Methods (Supplementary Material pp. 03-06). In brief, we designed a work pipeline with five steps. First, data analysis, where we dropped observations with missing values, explored the available data to choose scaling and transformation methods that would bring all variables onto the same scale or units, and planned transformations for categorical variables (e.g., one-hot encoding). Second, feature importance analysis, where we investigated the contribution of each predictor to the regression model through methods like Random Forest and Recursive Feature Elimination. The aim of this second step was to exclude any predictor that would not contribute to the regression model.
Notably, all predictors (see Variables section) chosen following expert knowledge were kept in the analysis (i.e., the feature importance analysis did not suggest the exclusion of any predictor).
Third, data processing: having explored the available data (first step in the work pipeline), we implemented different scaling and transformation methods (e.g., Box-Cox, Principal Component Analysis and polynomial features). Fourth, data modelling, where we implemented ten ML algorithms; the decision to choose one was postponed to the fifth (last) step in the work pipeline. Up to this point, we used the training and validation datasets.
medRxiv preprint doi: https://doi.org/10.1101/2021.08.31.21262944; this version posted September 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
Fifth, forecasting of the predicted attribute in new data (i.e., data not used for model training); in this step we used the test dataset to choose the model that yielded predictions closest to the observed salt intake. Results comparing the observed and the predicted salt intake were computed in the test dataset alone. For each country we ran a paired t-test between the observed and predicted salt consumption, where a difference was deemed significant at p<0.05. We also computed the absolute difference between the observed and predicted salt intake. We chose the Random Forest (RF) algorithm because it showed the mean difference closest to zero in both sexes combined (observed - predicted = 0) (Supplementary table 1, Supplementary figure 2). All summary estimates (e.g., mean salt intake) were computed accounting for the complex survey design of the WHO STEPS surveys.
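The comparison between observed and predicted intake, and the rule of choosing the algorithm whose mean difference is closest to zero, can be illustrated with a short sketch. The function names and the example model labels are hypothetical, and the sketch ignores the survey design for simplicity. It returns the paired t statistic and its degrees of freedom; the p-value would then be read from a t distribution (for instance with a statistics package).

```python
import math
import statistics

def paired_t_statistic(observed, predicted):
    """Return (mean difference, t statistic, degrees of freedom) for paired data."""
    diffs = [o - p for o, p in zip(observed, predicted)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the paired differences
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, t_stat, n - 1

def pick_model(observed, predictions_by_model):
    """Pick the model whose mean (observed - predicted) is closest to zero."""
    def abs_mean_diff(name):
        return abs(paired_t_statistic(observed, predictions_by_model[name])[0])
    return min(predictions_by_model, key=abs_mean_diff)
```

A mean difference close to zero indicates that the population mean is well reproduced; it does not by itself describe the error for an individual prediction, which is why the absolute difference is reported alongside it.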

Application of the developed ML model
Having developed the ML model following the steps described above, we applied the model to 49 WHO STEPS national surveys that did not have urine samples but included the predictors in the ML model (see Variables section). In each of these 49 surveys we computed the mean daily salt intake accounting for the complex survey design. These surveys were pre-processed following the same procedures described in the Data preparation section.
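As a simplified illustration of the last step, the country-level mean of the predicted intake can be computed as a weighted mean using each participant's sampling weight. This sketch is an assumption about one part of the calculation only: the record layout and function name are hypothetical, and a full design-based analysis (strata and primary sampling units, needed for confidence intervals) is not shown.

```python
from collections import defaultdict

def weighted_mean_by_country(records):
    """records: iterable of (country, predicted_salt_g_day, sampling_weight) tuples."""
    totals = defaultdict(lambda: [0.0, 0.0])  # country -> [sum(w * x), sum(w)]
    for country, salt, weight in records:
        totals[country][0] += weight * salt
        totals[country][1] += weight
    return {country: wx / w for country, (wx, w) in totals.items()}
```

For example, two participants in the same survey with predicted intakes of 8.0 and 10.0 g/day and weights of 1 and 3 give a weighted mean of 9.5 g/day.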

Ethics
We did not seek approval by an Institutional Review Board. We used individual-level survey data which do not include any personal identifiers.

Role of the funding source
The funder had no role in the study design, analysis, interpretation or decision to publish. The authors are collectively responsible for the accuracy of the data. The arguments and opinions in this work are those of the authors alone, and do not represent the position of the institutions to which they belong.

Observed and predicted mean salt intake during the ML model derivation and validation
In the test dataset including 19 WHO STEPS surveys, the observed mean salt intake computed as per the INTERSALT equation was 8.1 g/day (95% CI: 8.0-8.2 g/day) across countries. The observed salt intake was higher in men (8.9 g/day; 95% CI: 8.7-9.0 g/day) than in women (7.4 g/day; 95% CI: 7.3-7.4 g/day). Across countries, the predicted mean salt intake was 8.1 g/day (95% CI: 8.1-8.2 g/day). Men had a higher predicted mean salt intake (8.9 g/day; 95% CI: 8.8-8.9 g/day) than women (7.4 g/day; 95% CI: 7.4-7.5 g/day). Overall, there were no substantial differences between the observed and predicted estimates (p=0.065). Results for each survey are presented in Figure 1 and Supplementary table 3.
In men across all countries in the test dataset including 19 WHO STEPS surveys representing 17 LMICs, the mean difference between observed and predicted mean salt intake was 0.02 g/day (p=0.277). Across all surveys, the positive mean difference farthest from zero was 1.39 g/day (Malawi, p<0.001), and the negative mean difference farthest from zero was -0.74 g/day (Lebanon, p=0.227). The mean difference closest to zero was 0.03 g/day (Armenia, p=0.787) (Supplementary table 4).
In women across all countries in the test dataset including 19 WHO STEPS surveys representing 17 LMICs, the mean difference between the observed and predicted mean salt intake was -0.02 g/day (p=0.124). The positive mean difference farthest from zero was 1.02 g/day (Malawi, p<0.001) and the negative mean difference farthest from zero was -0.71 g/day (Brunei [...]
None of the LMICs herein analysed, regardless of the method of sodium intake assessment (i.e., observed or predicted), showed a mean salt intake below the WHO recommended level of <5 g/day (Figure 1, Supplementary table 3).
Across the 49 LMICs, the predicted mean salt intake was 8.1 g/day (95% CI: 8.1-8.2 g/day), and it was higher in men (8.9 g/day; 95% CI: 8.8-8.9 g/day) than in women (7.4 g/day; 95% CI: 7.3-7.4 g/day). None of the LMICs herein analysed, regardless of sex, showed a predicted mean salt intake below the WHO recommended level of <5 g/day (Figure 2, Supplementary table 7).
In men, the countries with the highest predicted mean salt intakes were Nauru and American Samoa (both with 11.1 g/day), Cook Islands (11.0 g/day) and Niue (10.7 g/day); remarkably, three of these countries (Nauru, Cook Islands and Niue) are in the Western Pacific. In contrast, the lowest predicted mean salt intakes in men were in Ethiopia (8.3 g/day), Eritrea and Timor-Leste (both with 8.4 g/day), and United Republic of Tanzania, Botswana and Uganda (all with 8.6 g/day); remarkably, all of these countries except Timor-Leste are in Africa.
In women, the countries with the highest predicted mean salt intakes were American Samoa (9.0 g/day), Nauru (8.8 g/day), and Cook Islands, Samoa and Tuvalu (all with 8.6 g/day); the latter four countries are in the Western Pacific. Conversely, the lowest predicted mean salt intakes in women were in Ethiopia and Eritrea (both with 6.9 g/day), Timor-Leste (7.0 g/day), and Vietnam (7.1 g/day); the first two countries (out of four) are in Africa.

Main findings
This work leveraged 19 national health surveys and readily available predictors to develop a ML model to predict salt consumption; this model was then applied to national surveys in 49 LMICs. The RF ML algorithm yielded the predictions closest to the observed salt intake: the mean difference between observed and predicted salt consumption across surveys was 0.02 g/day in men and -0.02 g/day in women. We used this novel ML model to predict salt consumption in 49 LMICs, where the mean salt consumption ranged from 8.3 g/day (Ethiopia) to 11.1 g/day (Nauru) in men; these numbers in women ranged from 6.9 g/day (Ethiopia and Eritrea) to 9.0 g/day (American Samoa).

Public health implications
ML models have been used extensively to predict relevant clinical outcomes (e.g., mortality) and epidemiological indicators (e.g., forecasting COVID-19 cases). 20-24 Furthermore, ML algorithms have proven to be useful for understanding complex outcomes (e.g., identifying clusters of people with diabetes) based on simple predictors (e.g., BMI) in nationally-representative survey data. 25-27 Our work complements the current evidence on ML algorithms by demonstrating their use in a relevant field: population salt consumption. In so doing, we delivered a pragmatic tool which could be used to inform the surveillance of salt consumption in countries where national surveys do not objectively collect this information (e.g., SU samples). Moreover, this work provided preliminary evidence to update the global estimates of population-based sodium consumption, 7 by informing about the mean sodium consumption in 49 LMICs. Our results suggest that mean salt consumption is above the WHO recommended level in all the 49 LMICs herein analysed, and it was the highest among LMICs in the Western Pacific, and the lowest among LMICs in Africa. This finding, which is consistent with a global work, 7 calls for urgent actions to reduce salt consumption in these 49 LMICs, especially those in the Western Pacific.
We do not believe that our (or any other) ML model should replace a comprehensive population-based nationally representative health survey with 24-hour or spot urine samples. However, until such surveys are available in many LMICs and periodically conducted, we could suggest using an estimation approach to shed light on the mean salt consumption in the population. Our ML model seems to be a reasonably good alternative, and could become a pragmatic tool for surveillance systems that keep track of sodium consumption in accordance with global goals. 2,6

Research in context
[...] it appears that our predictions were higher than those provided by Powles et al. 7 in countries with presumably low salt consumption (e.g., Comoros, Rwanda); conversely, in countries with presumably high salt consumption (e.g., Kyrgyzstan), our predictions revealed smaller estimates than those by Powles et al. 7 (Supplementary table 8). [...] Samoa showed that the mean salt consumption was 10.6 g/day and 7.1 g/day, respectively. 17 The estimates from our ML model for Fiji (2011) and Samoa (2013) suggested that the mean salt consumption was 8.9 g/day and 9.6 g/day, respectively. A survey in Vanuatu in 2016 based on 24-hour urine samples reported that the mean salt intake was 5.9 g/day; 18 our estimate for the year 2011 was 8.6 g/day. In 2009 in Vietnam a survey with SU samples revealed that the mean salt consumption was 9.9 g/day; 19 our prediction for the year 2015 was 7.9 g/day. These comparisons suggest that our ML-predicted estimates are plausible and close to the best available evidence.
Although these comparisons do not validate our predictions in the 49 national surveys, they suggest that our salt consumption estimates are within reasonable distance from the best available evidence. Until better data are available (e.g., national survey with spot or 24-hour urine sample), our model could provide preliminary evidence to inform the national mean salt consumption. Careful interpretation is warranted to understand the strengths and limitations of our ML-based predictions.

Strengths and limitations
We followed sound and transparent methods to develop a ML model to predict salt consumption at the individual level. We leveraged open-access national data collected following standard and consistent protocols (WHO STEPS surveys). Most of the surveys we analysed were conducted after 2010, providing more recent evidence than the latest global effort to quantify salt consumption in all countries. 7 Notwithstanding, we must acknowledge some limitations. First, urine data were based on a spot sample, which is not the gold standard (24-hour urine sample) to measure daily salt consumption. Future work should verify and advance our results using 24-hour urine samples available in nationally representative samples; in the meantime, our work has laid the foundations and hopefully sparked interest in using available data and novel analytical techniques to deliver estimates of salt consumption in the general population. Second, even [...]

DATA SHARING STATEMENT
This study used nationally-representative survey data that are in the public domain, which were requested through the online repository (https://extranet.who.int/ncdsmicrodata/index.php/home).
We provide the analysis code of data preparation and data analysis as supplementary materials to this paper.

Figure 1. Observed and predicted mean salt intake (g/day) by sex in each survey included in the ML model development
Exact estimates (along with their 95% CI) are presented in Supplementary table 3. These results were computed with the test dataset only. Results are for the Random Forest algorithm, which was the model with the best performance.

Panels: Men; Women.

Expanded methods

Overview
We worked with a structured dataset which mostly had numeric attributes (variables). Given our study problem, we opted for a supervised learning model because there was a target attribute (i.e., salt consumption at the subject level); specifically, we conducted a supervised regression because the target attribute was a numeric variable. For the machine learning analyses we used Python and the Scikit-Learn library.
First, we developed a pipeline for data management and model development. This way, we followed a consistent and transparent methodology to obtain an optimal model for the training set that would also generalize adequately to other (unseen) datasets. The following figure depicts the pipeline we developed: i) we studied the available data and, where needed, applied one-hot encoding; ii) we did feature importance analysis; iii) we chose and tried different scaling and transformation methods, so that all variables would be in the same scale or units; iv) we tried a set of machine learning models, including a customized neural network; v) we forecasted (predicted) the attribute of interest (salt consumption at the subject level) in an unseen dataset (i.e., not used for model training). Notably, we went backwards and forwards (see arrows in the figure) between the first four stages until we reached the best combinations and results for each model. In the following sections, we will describe each of these five stages.
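Stages iv and v of this pipeline (fitting several candidate regressors and evaluating them on data not used for training) might look like the sketch below with scikit-learn. The three algorithms shown are only a subset chosen for illustration, and the function name, hyperparameters and seed are assumptions rather than the settings used in the study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression

def fit_and_compare(X_train, y_train, X_new, y_new, seed=0):
    """Fit candidate regressors and report mean(observed - predicted) on unseen data."""
    # Hyperparameters are illustrative defaults, not the study's tuned values.
    models = {
        "linear": LinearRegression(),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=seed),
        "gradient_boosting": GradientBoostingRegressor(random_state=seed),
    }
    mean_diffs = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        mean_diffs[name] = float(np.mean(y_new - model.predict(X_new)))
    return mean_diffs
```

The returned dictionary maps each algorithm to its mean difference on the unseen data, which is the quantity used to compare algorithms in this work.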

Data analysis
This was an exploratory analysis to understand the dataset and its characteristics. We worked with a complete-case dataset; in other words, we excluded missing observations in the variables considered in the analysis. Consequently, we did not do any data imputation analysis.
We explored the distribution of all numerical variables, which were in different units and scales; this exploratory analysis informed the choices of data processing methods (e.g., Box-Cox) implemented in the third stage.

Feature importance analysis
Even though we followed expert knowledge to select a small but relevant set of predictors to be included in the regression model, we conducted feature importance analyses to understand the role each predictor would play in the model. This process aimed to eliminate variables that would not carry substantial information for the model. We used Random Forest, Recursive Feature Elimination and Extra Trees. Consistently, these three methods suggested that all the chosen predictors would contribute to a better model.
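A minimal sketch of these three feature importance methods with scikit-learn is given below. It runs on synthetic data, so the resulting importances say nothing about the real predictors; the feature labels, helper names, number of trees and the estimator wrapped by Recursive Feature Elimination are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.feature_selection import RFE

# Hypothetical labels for the six predictors named in the paper.
FEATURES = ["sex", "age", "weight", "height", "sbp", "dbp"]

def synthetic_data(n=200, seed=0):
    """Synthetic stand-in data in which only 'weight' drives the outcome."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, len(FEATURES)))
    y = 8.0 + 2.0 * X[:, FEATURES.index("weight")] + rng.normal(scale=0.1, size=n)
    return X, y

def rank_features(X, y, seed=0):
    """Importance scores from Random Forest and Extra Trees, plus an RFE ranking."""
    rf = RandomForestRegressor(n_estimators=100, random_state=seed).fit(X, y)
    et = ExtraTreesRegressor(n_estimators=100, random_state=seed).fit(X, y)
    # Eliminating down to one feature yields a complete ranking (1 = kept longest).
    rfe = RFE(RandomForestRegressor(n_estimators=50, random_state=seed),
              n_features_to_select=1).fit(X, y)
    return {
        "random_forest": dict(zip(FEATURES, rf.feature_importances_)),
        "extra_trees": dict(zip(FEATURES, et.feature_importances_)),
        "rfe_rank": {f: int(r) for f, r in zip(FEATURES, rfe.ranking_)},
    }
```

Calling `rank_features(*synthetic_data())` returns one dictionary per method; agreement between the three methods is what would support keeping or dropping a predictor.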

Data processing
As described in the data analysis section (first stage), numeric variables were in different units and scales; therefore, these variables needed to be scaled or transformed. This scaling would also help to find a better prediction model, because machine learning models can perform differently depending on the data transformation applied. We did: i) Min-Max, whereby numeric variables were scaled to a range between 0 and 1; ii) Standardization; iii) Normalization; iv) Polynomial features of degree 2 (quadratic polynomial); v) Principal Component Analysis with 3 components and explained variance of ≥0.95; and vi) Box-Cox.
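The six transformations listed above all have counterparts in scikit-learn, and a sketch of how they could be instantiated is shown below. The dictionary keys and function names are hypothetical, and the parameter choices beyond those stated in the text (degree 2, 3 components) are library defaults. Note that Box-Cox requires strictly positive inputs.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import (MinMaxScaler, Normalizer, PolynomialFeatures,
                                   PowerTransformer, StandardScaler)

def candidate_transformers():
    """One scikit-learn object per transformation named in the text."""
    return {
        "min_max": MinMaxScaler(),                      # scales each column to [0, 1]
        "standardization": StandardScaler(),            # zero mean, unit variance per column
        "normalization": Normalizer(),                  # unit L2 norm per row
        "polynomial_2": PolynomialFeatures(degree=2),   # quadratic terms and interactions
        "pca_3": PCA(n_components=3),                   # first three principal components
        "box_cox": PowerTransformer(method="box-cox"),  # needs strictly positive data
    }

def transform_all(X):
    """Fit each transformer on X and return the transformed arrays by name."""
    X = np.asarray(X, dtype=float)
    return {name: t.fit_transform(X) for name, t in candidate_transformers().items()}
```

In a modelling workflow each transformer would normally be fitted on the training data only and then applied unchanged to the other datasets, so that no information leaks from the data used for evaluation.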

Data modelling
There are several machine learning algorithms for a supervised regression model. Those that we used, and that are depicted in the figure, yielded much better results and were studied in detail. That is, at the beginning of our work we explored other algorithms, though these did not perform well and were not [...]
In addition to these nine machine learning algorithms, we also implemented a neural network (see figure below). This neural network was optimized empirically. We used a batch size = 256; epochs = 300; and optimizer = 'adam'. The neural network was implemented in Python using the Keras library.
[...] men (mean difference = 0.0052), and in women the GBR algorithm showed the best results (mean difference = -0.0005).
To support our decision process, we plotted the mean differences in men and women for each survey

Algorithm application
To make the predictions in the 49 new datasets without information about urine samples, we used the RF model (i.e., ML algorithm and predictors) developed following the methods described above (see [...]

Supplementary Table 5. Observed mean salt intake (g/day) by equation and sex in each survey included in the ML model development.

Country    Year    Sex    Mean salt intake (g/day)

Mean salt intake (g/day) lower