Abstract
Housing infrastructure and quality are major determinants of infectious disease risk and other health outcomes in regions of the world where vector-borne, waterborne and neglected tropical diseases are endemic. It is important to quantify the geographical distribution of improvements to the major dwelling components to identify and target resources towards populations at risk. The aim of this study was to model the sub-national spatial variation in housing materials using covariates with quasi-global coverage and use the resulting estimates to map the predicted coverage across the world’s low- and middle-income countries (LMICs). Data relating to the materials used in dwelling construction were sourced from nationally representative household surveys conducted since 2005. Materials used for construction of flooring, walls, and roof were reclassified as improved or unimproved. Households lacking location information were georeferenced using a novel methodology, and a suite of environmental and demographic spatial covariates were extracted at those locations for use as model predictors. Integrated nested Laplace approximation (INLA) models were fitted to obtain and map predicted probabilities for each dwelling component. The dataset compiled included information from households in 283,000 clusters from 350 surveys. Low coverage of improved housing was predicted across the Sahel and southern Sahara regions of Africa, much of inland Amazonia, and areas of the Tibetan plateau. Coverage of improved roofs and walls was high in the Central Asia, East Asia and Pacific and Latin America and the Caribbean regions, while coverage of improvements in all three components, most notably floors, was low in Sub-Saharan Africa.
Human development was by far the strongest determinant of dwelling component quality, though vegetation greenness and land use were also relevant markers. These findings are made available to the reader as files that can be imported into a GIS for integration into relevant analyses to derive improved estimates of preventable health burdens attributed to housing.
Introduction
The United Nations’ Sustainable Development Goals (SDGs) include ambitious commitments to fight communicable diseases (target 3.3) and provide adequate, safe and affordable housing (target 11.1) throughout its member states [1]. Although they fall under separate goals, housing quality has long been recognized as a social determinant of health, and epidemiological evidence is now elucidating the mechanisms by which this relationship operates [2]. Many endemic infectious diseases of global public health concern, including several named in SDG3, are transmitted within and between households, with the majority of infections occurring while the susceptible individual is at home [3]. Consequently, features of the built peridomestic environment and infrastructure play a role in promoting or impeding the spread of pathogens and their insect vectors [4]. This is particularly true of tropical and rural regions of Africa, Asia and Latin America, where numerous vector-borne and neglected tropical diseases circulate and where dwellings are often constructed using locally available, naturally occurring materials and traditional techniques such as wattle and daub, dried or burnt bricks, adobe, woven reed or bamboo and thatch [4]. These construction methods often require great skill and community mobilization to implement and are adapted over generations to suit local climate, ecology and topography; however, numerous disease-causing insects and microbes are also well adapted to take advantage of the ecological niches that such buildings provide [5,6].
Infants and young children are particularly vulnerable to the health effects of housing construction material due to the high proportion of time spent in the family dwelling and behaviors common to early life such as crawling or playing on the floor [7–9]. Floors that are finished with wood, tiles or cement may protect against transmission of some diarrhea-causing enteric pathogens compared to those made of packed earth or sand either because they are easier to clean, or because they are less hospitable to pathogen survival outside the host [9].
As childhood mortality continues to decline globally, becoming concentrated in subnational hotspots, it will be increasingly necessary to target interventions ever more specifically, both geographically and to particular causes [10]. Several household-level determinants of health have been mapped at continental or global scale using survey data and spatial interpolation methods, including water source and sanitation facility type [11], crowded living space [12], educational attainment [13], and relative wealth [14]. Tusting and colleagues applied a similar approach to mapping houses built with finished materials across Sub-Saharan Africa for 2000 and 2015, defining such households as those in which at least two of the three components (walls, roof and floor) were built with finished materials, though they did not separate out these three components [15]. Building on these efforts, the aim of this study, a project of the Planetary Child Health & Enterics Observatory (Plan-EO, www.planeo.earth) [16], was to model the sub-national spatial variation in housing materials using covariates with quasi-global coverage and use the resulting estimates to map the predicted coverage across low- and middle-income countries (LMICs). The guiding hypothesis was that coverage of improved housing materials varies spatially as a function of environmental and socio-demographic factors in a way that can be modelled using publicly available global datasets and state-of-the-art geostatistical methods.
Materials and Methods
Objective and scope
The objective of this analysis was to estimate the percent coverage of each category of materials used in dwelling component construction at all locations throughout the world’s LMICs (as defined by the Organisation for Economic Co-operation and Development [17], excluding those in Europe).
Outcome variables
The categories of housing materials used in this analysis were those proposed by Florey and Taylor, who classify materials used for construction of flooring, walls, and roofs into natural, rudimentary, and finished types, and then further into improved and unimproved [18]. Data relating to these variables were compiled from nationally representative, population-based household surveys with two-stage cluster-randomized sample designs such as the Demographic and Health Surveys (DHS) [19], the Multiple Indicator Cluster Surveys (MICS) [20] and others. These programs collect information on coverage of health and development indicators and make the resulting microdata publicly available through their websites. All Standard DHSs, Malaria and AIDS Indicator Surveys (MIS and AIS) and MICS that collected information on housing material dating back to 2005 from all LMICs were included. For countries where no such surveys were available, either similar surveys from the 2000-2004 period or country-specific surveys were sourced where available. The unit of analysis was the household, and these were classified into three, mutually exclusive categories (natural, rudimentary, and finished) based on the housing material recorded by the survey interviewer for each of the three dwelling components (floors, walls, and roof) as shown in Table 1.
Georeferencing households
For this spatial analysis it was necessary to assign coordinates to each household representing its approximate location. Cluster-randomized surveys have a hierarchical design such that households are nested within clusters, the census enumeration areas that serve as the primary sampling unit, which are in turn nested within survey strata (sub-national region and urban/rural status). The DHS Program provides coordinates of the cluster centroids for most of the surveys they carry out [21] (though these are randomly “displaced” – systematically shifted up to a certain distance to preserve confidentiality [22]). However, these are not available for all clusters and surveys and equivalent coordinates have been made available only for a handful of MICS and no country-specific surveys. For this analysis, households were georeferenced to their displaced cluster centroid coordinates where available, otherwise their clusters were randomly assigned to populated settlement locations taken from the Humanitarian OpenStreetMap database [23] that fell within the same survey stratum (sub-national region and urban/rural status) with probability proportional to the population density of the settlement (extracted from the WorldPop [24] database at settlement coordinates). OpenStreetMap settlements were reclassified such that cities and towns were categorized as urban, and villages, hamlets, and isolated dwellings as rural. This novel cluster location assignment process was automated in ArcGIS Pro ModelBuilder [25] and Stata 18 [26].
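The settlement-assignment step for clusters without coordinates can be illustrated with a minimal Python sketch. This is not the authors' ArcGIS Pro ModelBuilder and Stata workflow; the function name, the tuple layout of the settlement records, and the choice to sample with replacement are assumptions made only for illustration.

```python
import numpy as np

def assign_cluster_locations(settlements, n_clusters, seed=0):
    """Draw one settlement location per cluster within a single survey stratum.

    settlements: list of (lon, lat, pop_density) tuples for settlements that
    fall inside the stratum (hypothetical input layout).
    Sampling with replacement is an assumption; the source does not specify it.
    """
    rng = np.random.default_rng(seed)
    coords = np.array([(s[0], s[1]) for s in settlements], dtype=float)
    density = np.array([s[2] for s in settlements], dtype=float)
    # Selection probability proportional to population density.
    probs = density / density.sum()
    idx = rng.choice(len(settlements), size=n_clusters, replace=True, p=probs)
    return coords[idx]

# Example: three settlements in one rural stratum, five clusters to place.
stratum = [(30.10, -1.95, 120.0), (30.25, -2.05, 40.0), (30.40, -1.80, 15.0)]
print(assign_cluster_locations(stratum, n_clusters=5))
```

In this sketch, denser settlements are drawn more often, so the assigned cluster locations follow the population distribution within the stratum rather than being spread uniformly.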
Covariates
A suite of time-static environmental and demographic spatial covariates available in raster format were compiled based on their hypothesized associations with the outcome variables. Definitions and sources of each covariate are shown in Table 2. Variable values were extracted at the georeferenced cluster locations in Python. In addition, time was calculated in continuous months since January 1st, 2005, based on the date of survey interview and log-transformed based on the assumption that changes in household material over time would be non-linear but unidirectional. Countries were grouped into the six regions used for administrative purposes by the World Bank [27], and this categorical variable was also treated as a covariate so that, for countries with no available survey data, estimates would be based partly on regional averages.
Analysis
To reduce the database size and computational demands, and to neutralize the issue of within-cluster correlation, one household with non-missing outcome value was randomly sampled per cluster and retained for analysis (this selection was done separately for each of the three outcomes). Due to the computational demands of performing geospatial analysis at the global scale, we recoded all outcomes to be binary, by collapsing two of the response categories together (“rudimentary” was grouped with “natural”) to give “improved” / “unimproved” response categories as shown in Table 1, and in a modification of the schema proposed by Florey and Taylor (those authors grouped rudimentary and finished walls and roofs into the improved category, but not floors, however we opted for a consistent categorization across components to facilitate comparison between outcome variables [18]).
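A minimal pandas sketch of the sub-sampling and recoding described above follows. The column names (`cluster_id`, `floor`) and the function name are hypothetical, and the code is an illustration of the logic rather than the code used in the analysis.

```python
import pandas as pd

def sample_one_per_cluster(df, outcome, seed=0):
    """Keep one randomly chosen household per cluster for a given outcome.

    Households with a missing outcome are dropped first, so the selection is
    made separately for each outcome variable, as described in the text.
    """
    valid = df.dropna(subset=[outcome])
    sampled = valid.groupby("cluster_id").sample(n=1, random_state=seed)
    # Binary recode: "finished" is improved; "natural" and "rudimentary"
    # are collapsed into unimproved.
    return sampled.assign(improved=(sampled[outcome] == "finished").astype(int))

households = pd.DataFrame({
    "cluster_id": [1, 1, 2, 2, 3],
    "floor": ["finished", None, "natural", "rudimentary", "finished"],
})
print(sample_one_per_cluster(households, "floor"))
```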
Exploratory spatial data analysis
We first assessed the presence of spatial autocorrelation by generating semi-variograms of the Pearson residuals from a non-spatial logistic regression that included all explanatory variables listed in Table 2 (Supplementary Figure S1). We fit spherical spatial correlation models to each semi-variogram and estimated the nugget, range, and sill for each outcome. The semi-variograms and respective models were estimated using the gstat R package [28]. Based on the nugget:sill ratio and the estimated range, we determined that an explicitly spatial modeling approach was required to account for the non-trivial spatial correlation in the Pearson residuals.
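For readers unfamiliar with these quantities, the spherical model and the nugget:sill ratio can be written out in a few lines of Python. This is a sketch of the standard spherical semi-variogram formula, not a re-implementation of the gstat fit; the parameter values in the example are invented, and the parameterization (nugget plus partial sill) is an assumption about how the fitted values would be reported.

```python
import numpy as np

def spherical_semivariance(h, nugget, partial_sill, range_km):
    """Spherical semi-variogram model evaluated at lag distances h (km).

    The semivariance rises from the nugget to the total sill
    (nugget + partial_sill) at the range, and stays flat beyond it.
    """
    h = np.asarray(h, dtype=float)
    ratio = h / range_km
    rising = nugget + partial_sill * (1.5 * ratio - 0.5 * ratio ** 3)
    gamma = np.where(h < range_km, rising, nugget + partial_sill)
    # By definition the semivariance is zero at zero lag.
    return np.where(h == 0, 0.0, gamma)

def nugget_to_sill_ratio(nugget, partial_sill):
    """Share of the total sill that is unstructured (nugget) variance."""
    return nugget / (nugget + partial_sill)

# Illustrative values only: a low ratio and a long range indicate
# spatially structured residuals.
print(spherical_semivariance([0, 50, 100, 200], 0.2, 0.8, 100.0))
print(nugget_to_sill_ratio(0.2, 0.8))
```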
Model fitting
Given the massive spatial scale of the database, with hundreds of thousands of points spanning most of the globe, incorporating spatial correlation into the models presented computational challenges. We used the inlabru R package to implement an integrated nested Laplace approximation (INLA) modeling approach in which all locations are projected onto a coarsened grid or “mesh” containing several thousand vertices that carry the spatial information and can be reprojected onto the observed data [29,30]. INLA models approximate Bayesian models by constructing the posterior distribution and then applying Laplace approximations, thus bypassing the need for time-consuming Markov chain Monte Carlo sampling and making global-scale computation feasible. All coordinates were transformed via the Mollweide projection and scaled into kilometers prior to analysis. The mesh used for modeling had 18,352 vertices, placed within continental boundaries. Further details on the implementation of the INLA model are provided in Supplementary File S1.
Model predictions
Predicted probabilities for each outcome were generated for all locations in the domain of interest (the LMICs) at 5 km² resolution and exported in Georeferenced Tagged Image File Format (GeoTIFF). The spatial covariates from Table 2, along with the time variable, were used to generate the predicted probability of the finished class of each building material from the logistic INLA model. A value for time corresponding to the first of January 2023 was used for making predictions. Missing pixel values were imputed using the k-nearest neighbors method implemented in the Python Scikit-learn package [31].
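One way such a gap-filling step might look is sketched below using Scikit-learn's `KNeighborsRegressor` on pixel row and column indices. The text does not state which estimator, neighbor count, or distance features were used, so the function name, `k=5`, and the use of grid indices as predictors are all assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def fill_missing_pixels(grid, k=5):
    """Fill NaN pixels with the mean of the k nearest non-missing pixels.

    grid: 2-D array of predicted probabilities with NaN for missing pixels.
    Nearness is measured in (row, column) index space (an assumption).
    """
    filled = np.asarray(grid, dtype=float).copy()
    missing = np.isnan(filled)
    if not missing.any() or missing.all():
        return filled
    known_rc = np.argwhere(~missing)        # coordinates of observed pixels
    model = KNeighborsRegressor(n_neighbors=min(k, len(known_rc)))
    model.fit(known_rc, filled[~missing])   # both are in row-major order
    filled[missing] = model.predict(np.argwhere(missing))
    return filled

# Toy 3 x 3 raster with one missing pixel in the centre.
raster = np.array([[0.2, 0.3, 0.4],
                   [0.3, np.nan, 0.5],
                   [0.4, 0.5, 0.6]])
print(fill_missing_pixels(raster, k=4))
```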
Model evaluation
The predictive performance of the spatial models was assessed using common metrics: recall (sensitivity), precision (positive predictive value), accuracy (the proportion correctly classified), F1-score (the harmonic mean of precision and recall), and area under the receiver operating characteristic curve (ROC-AUC). For each per-class performance metric, two multiclass averages were calculated, the macro average and the weighted average, given here for precision by:

$$\text{Macro average} = \frac{1}{K}\sum_{i=1}^{K} Pr_i, \qquad \text{Weighted average} = \frac{1}{n}\sum_{i=1}^{K} Obs_i \, Pr_i$$

where $Pr_i$ is the precision calculated for class $i$, $Obs_i$ is the number of observations in class $i$, $K$ is the number of classes, and $n$ is the total number of observations across all classes. To assess the relative contribution of each covariate to the models, feature importance values for the input raster covariates were calculated by running parallel non-spatial linear regression models (since the inlabru package does not provide feature importance output) that were otherwise identically specified and scaling the output coefficients to the 0–1 range using the Scikit-learn Python package. These feature importance values can be interpreted as conditional associations, quantifying the variation in the response when only the given feature is allowed to vary while all other features are held constant.
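The two averaging schemes can be made concrete with a short NumPy sketch, shown here for precision. The function name and the toy labels are illustrative; the definitions follow the standard macro and support-weighted averages, which is assumed to be what the text describes.

```python
import numpy as np

def macro_and_weighted_precision(y_true, y_pred):
    """Return (macro average, weighted average) of per-class precision.

    Macro: unweighted mean of the per-class precisions.
    Weighted: per-class precisions weighted by the number of observations
    in each class, divided by the total number of observations.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions, counts = [], []
    for c in np.unique(y_true):
        predicted_c = y_pred == c
        true_pos = np.sum(predicted_c & (y_true == c))
        n_predicted = predicted_c.sum()
        # Precision is set to 0 when the class is never predicted.
        precisions.append(true_pos / n_predicted if n_predicted else 0.0)
        counts.append(np.sum(y_true == c))
    precisions = np.array(precisions, dtype=float)
    counts = np.array(counts, dtype=float)
    macro = precisions.mean()
    weighted = np.sum(counts * precisions) / counts.sum()
    return macro, weighted

# 1 = improved, 0 = unimproved (toy example).
print(macro_and_weighted_precision([1, 1, 1, 0], [1, 1, 0, 0]))
```

When classes are imbalanced, as with roofs in this dataset, the weighted average is pulled towards the performance on the larger class, whereas the macro average treats both classes equally.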
Ethics statement
All human subject information used in this analysis was anonymized, publicly available secondary data, and therefore ethical approval was not required or sought. For data provided by the DHS Program, data access requests (including for the displaced cluster coordinates) were submitted and authorized through the Program’s website. A completed checklist of Guidelines for Accurate and Transparent Health Estimates Reporting (GATHER [32]) is included in Supplementary File S1.
Results
350 nationally representative household surveys (together containing data from more than 6 million households in 283,000 clusters) met the inclusion criteria, reported information on construction material types for one or more of the dwelling components and were included in the model training dataset. Figure 1 shows the number of surveys contributed by each LMIC, while Supplementary File S2 gives the national level distribution of each of the three housing construction variables in each survey (before within-cluster sub-sampling, and without sample weights applied). All eligible surveys included information on floor material; however, wall and roof material information were only available from 328 and 324 surveys respectively. No relevant data from household surveys could be found for several LMICs with large geographies and populations, most notably China, Iran, Venezuela, Libya, and Malaysia, as well as the smaller countries of Eritrea, North Korea, Lebanon, Equatorial Guinea, and numerous island nations such as Sri Lanka.
Figure 1. Number of nationally representative household surveys included in input dataset by country for included LMICs (small countries represented by circles). Base map compiled from shapefiles obtained from U.S. Department of State—Humanitarian Information Unit [50] and Natural Earth free vector map data @ naturalearthdata.com that are made available in the public domain with no restrictions.
Figure 2 shows the geographical distribution of the coverage of improved materials predicted by the INLA models for each of the three binary dwelling component variables across the domain of included LMICs. These predictions are also provided as raster TIFF files available on the Dryad data repository. There are some similarities across the variables, with low coverage predicted for all three across a wide belt of the Sahel and southern Sahara regions of Africa, much of inland Amazonia, and areas of the Tibetan plateau, as well as individual countries including the Democratic Republic of the Congo, Mozambique, Madagascar, Pakistan, and Papua New Guinea. High coverage of all three improved components coincided across much of the Middle East, Mediterranean North Africa, the coast of the Bight of Benin, the Caribbean, sub-Amazonian Brazil, southern Argentina, and South Africa. However, divergence in coverage of the three variables is evident across many locations. Across Kazakhstan, Mongolia, Azerbaijan, Cambodia and Laos, low coverage of improved floors, but high coverage of walls and roofs were predicted, while in Afghanistan, the reverse was the case. Yemen has mostly high improved floor coverage predicted, but low improved roof and mixed improved wall coverage, while on the island of Borneo, that pattern is reversed. Importantly, sub-national patterns are clearly visible, for example, with respect to improved floors, walls, and roofs in India, China, Mexico, and Brazil.
Figure 2. Coverage of improved material for three dwelling components (a. floors, b. walls, c. roofs) in LMICs predicted by integrated nested Laplace approximation (INLA) models fitted to household survey data. Base maps compiled from shapefiles obtained from U.S. Department of State—Humanitarian Information Unit [50] and Natural Earth free vector map data @ naturalearthdata.com that are made available in the public domain with no restrictions.
Figure 3 shows ridge plots visualizing the distribution of predicted values for the coverage of improved status for each of the three dwelling components, stratified by the six world regions. The distribution of improved roofs was highly concentrated at values very close to 100% in the Central Asia region, a finding borne out by the input data, in which most surveys recorded a coverage of finished roofs greater than 97% (Supplementary File S2). This was true to a far lesser extent for other regions (with the exception of Sub-Saharan Africa, which had predicted values much more evenly dispersed along the range of values) and for improved walls, though the South Asia region had a much more dispersed, bimodal distribution for the latter variable. For improved floors, predicted values were highly concentrated at the low extreme in Sub-Saharan Africa.
Figure 3. Distribution of values predicted for coverage of improved dwelling components by INLA models, stratified by component and world region.
Figure 4 visualizes the feature importance values for each covariate in each of the three models. More than half (eleven) of the variables did not contribute to any of the models. Feature importance was dominated by the same single variable (human development index), accounting for more than 50% of the variation in all three models. For the walls and to a lesser extent the floors models, the next most important feature was provided by the enhanced vegetation index (EVI), whereas for the roofs model, cropland and pasture areas contributed more to the model prediction, with EVI ranking fourth.
Figure 4. Feature importance for each covariate included in the final model for each of the dwelling components (HDI – Human Development Index; EVI – Enhanced Vegetation Index; LST – Land Surface Temperature; ET – Evapotranspiration; GDP – Gross Domestic Product).
Table 3 gives statistics that evaluate the models’ performance in classifying household construction material types for the three dwelling components. Across the whole database, floors were the dwelling component for which coverage of improved construction material was lowest at 57.9%, the equivalent coverage for walls and roofs being 67.1% and 80.3% respectively. While precision, recall and F1-score statistics were generally high for the unimproved category in all models, they varied considerably for the improved category, particularly for the roofs model, for which recall and F1-score were just 0.4 and 0.5 respectively. However, the roofs model was the one with the highest weighted average for those three statistics (a precision of 0.84, recall of 0.85 and F1-score of 0.83, compared with 0.78, 0.79, and 0.78 respectively for the walls model and 0.77 for all three statistics for the floors model). All three models demonstrated similarly strong discriminatory power in distinguishing between households with improved and unimproved construction materials in the respective dwelling components, with ROC-AUC statistics of 0.85–0.87.
Discussion
Housing infrastructure and quality are major determinants of infectious disease risk and other health outcomes, particularly in regions of the world where vector-borne, waterborne and neglected tropical diseases are endemic. Although the nature of this relationship is complex and multifaceted and varies depending on the specific pathogen and vector species, it highlights the importance of targeting interventions to mitigate these adverse health outcomes, particularly in LMICs where the overwhelming majority of childhood mortality occurs. As attention turns to improving housing quality in low-resource settings as a strategy for controlling infectious diseases, it is important to quantify the geographical distribution of improvements to the major dwelling components to identify and target resources towards populations at risk. This study is the first attempt to meet this objective.
The importance of housing materials is clearly not restricted to vector-borne diseases. Finished floors have been associated with decreases of 0.89 in log10 E. coli contamination in Peru [51], 78% in intestinal parasite prevalence in Mexican children [52], and 9% for diarrheal disease risk, 11% for both enteric bacteria and enteric protozoa risk [8], and 17% for Shigella spp. infection probability in meta-analyses of children under 5 years across multiple LMIC surveillance sites [53]. Traditional roof material has also been shown to be associated with childhood diarrhea [54], even after adjusting for floor material [55]. Pooled analyses of household survey data from multiple countries have found associations between living in improved housing and numerous child health outcomes, including cognitive and social-emotional development [7] and nutritional status [56], in addition to malaria infection [18,57]. Additionally, there is evidence of increased acute respiratory illness (ARI) in children in Pakistan, with unimproved flooring increasing ARI risk by 18%, and unimproved walling materials also increasing the risk of ARI in children under the age of five [58]. These results are supported by similar findings from studies in India, Nigeria, and Lao PDR [59–61].
This study is subject to several limitations. Our characterization of housing was constrained by the availability of data from household surveys, which generally ask about only three components and do not include questions about other relevant features of the built household environment, such as screens covering openings [62], elevation of sleeping areas, or improvements to windows and ventilation [63]. Although the variables were originally in three-class ordinal categorical format, we had to combine categories and model them as dichotomous, because there is currently no way to address adjacent categories and parallel odds using the INLA modeling approach. Additionally, our spatial models assume a stationary (i.e., global) covariance structure that does not vary across the globe. This is likely an oversimplification of the latent spatial effects; however, estimating a non-stationary spatial model at the global scale falls outside the scope of the current article and presents a worthwhile future direction. Likewise, improving the precision of the mesh used by INLA may improve predictions, but with ROC-AUC values already relatively high, this is likely to yield only marginal gains.
Despite these limitations, the product developed fills an important gap in spatially characterizing determinants of the principal causes of infectious disease burden in LMICs. Many types of mosquitoes such as those that transmit malaria (Anopheles spp.), dengue (Aedes spp.), filariasis and Japanese encephalitis (Culex spp.) enter the home through eaves and other openings [64] and rest on walls and ceilings after ingesting a blood meal (the basis behind indoor residual spraying [IRS] of these surfaces as a malaria control intervention). Indeed, in Africa, 80% of malaria transmission occurs indoors [3] and houses with roofs and walls constructed of natural material provide more points of entry [64,65] and preferred resting places [66] for malaria-transmitting mosquitoes, insights which are increasingly putting housing improvements on the research agendas as potential disease control strategies [63,65]. In rural Gambia, studies have found reductions in intradomiciliary mosquito vector abundance and survival through installing plywood ceilings [67], closing eaves in thatched roofs [68,69], and replacing thatch with ventilated metal roofing [70]. In rural Uganda, living in a house constructed of traditional materials (thatched roof, mud walls, earth floor etc.) has been associated with increased clinical malaria incidence [71] and parasitemia in children [72] and pregnant women [73], and decreased effectiveness of IRS in reducing Anopheles biting rates [72]. Similar protective effects of improved housing construction material on entomological and clinical malaria outcomes have been documented separately in Burkina Faso [74], Ethiopia [75], Laos [76], Malawi [77], South Africa [78], and Tanzania [79], while pooled effects from systematic reviews have been reported on the order of a 32% reduction in mosquito-borne diseases, 47% for malaria infection and 85% for indoor vector densities [65,80]. 
Aside from mosquito-borne illnesses, living in households with walls made of mud or thatch carries an increased risk of leishmaniasis infection and indoor abundance of sandfly vectors [81], while in the Americas, Chagas disease vectors (triatomine bugs) are drawn to houses with thatched palm roofs and mud walls [82]. In a Guatemalan community, for example, the odds of triatomine presence were 3.85 times higher in houses with walls that lacked plastering [83], while in rural Paraguay, an intervention to provide houses with smooth, flat and crack-free walls reduced triatomine infestation by 96.4%, a comparable effect to that of fumigation [84].
Conclusions
In conclusion, this study applies a relatively computationally efficient and spatially explicit modeling approach to a very large dataset, representative of but standardized across diverse geographies, and collected through rigorous and standardized methodologies. The findings allow us to assess the predictive performance of the models as well as the relative contribution of particular covariates, and the resulting predictions are made available to the reader in a readily usable format (available from www.datadryad.org). Human development is by far the strongest determinant of dwelling component quality, though vegetation greenness and land use (cropland and pasture) are also relevant markers. Prevalence of improved roofs and walls is high in the Central Asia, East Asia and Pacific and Latin America and the Caribbean regions, while coverage of improvements in all three components, but most notably floors, is low in Sub-Saharan Africa.
Data Availability
The data used in this analysis are publicly available from the sources listed in Table 2 and Supplementary File S2. The statistical source code used to generate estimates is available from the corresponding author upon reasonable request.