Abstract
The routine measurement of children’s developmental health varies across educational settings and systems. The Early Years Foundation Stage Profile (EYFSP) is a routinely recorded measure of a child’s development completed at the end of their first school year, for all children attending school in England and Wales. Despite widespread use for research and educational purposes, the measurement properties are unknown. This study examined the internal consistency and structural validity of the EYFSP, investigating whether the summed item-level scores, which we refer to as the ‘total score’, can be used as a summary of children’s developmental health. It also examined predictive validity of the total score with respect to later academic attainment and behavioural, social, and emotional difficulties.
The data source was the longitudinal prospective birth cohort, Born in Bradford (BiB), and routine education data were obtained from Local Authorities. The internal consistency and structural validity of the EYFSP total score were investigated using Confirmatory Factor Analysis and a Rasch model. Predictive validity was assessed using linear mixed effects models for Key Stage 2 (Maths, Reading, Grammar/Punctuation/Spelling), and behavioural, social, and emotional difficulties (Strengths and Difficulties Questionnaire).
We found that the EYFSP items demonstrated internal consistency; however, an Item Response model suggested weak structural validity (n=10,589). Mixed effects regression found that the EYFSP total score predicted later academic outcomes (n=2711), and behavioural, social, and emotional difficulties (n=984). The EYFSP total score appears to be a reasonable measure of child developmental health, given its internal consistency and predictive validity. However, caution should be exercised when interpreting scores of children with very close to ‘average’ ability levels.
Introduction
‘Developmental health’ is a broad concept that combines a holistic understanding of physical, mental, social, and emotional wellbeing (1). Measurements of children’s early developmental health can be used to predict later educational performance and health (2–4), which are both, in turn, important predictors of adult social and health outcomes (5,6). Ensuring that children have strong developmental health in the earliest years of their lives can therefore improve their future educational attainment (7) and, consequently, help to close socioeconomic inequalities in educational outcomes (8,9). It is therefore important to routinely measure children’s developmental health using an accurate and valid measure to identify those who may need extra support (10,11).
The extent to which standardised measurement of developmental health is embedded into educational systems varies greatly across countries. Because standardised exam settings can place educational pressure on children, assessments completed by children’s teachers can offer a valuable alternative insight (12). Some countries have embedded teacher-completed routine measures into educational practice, for instance, the Early Development Instrument (EDI) in Canada. The 103-item EDI is completed by kindergarten teachers in the second half of the school year and has been used since 1998 in all but one province. The EDI measures children’s developmental health, skills, and behaviour (13,14), and has generally demonstrated adequate psychometric properties in terms of internal consistency (15), and predictive validity for social relationships, emotional wellbeing, and educational performance at 9-10 years old (16).
The Early Years Foundation Stage (EYFS) was introduced in England and Wales in 2008 to provide a research-based framework with information on how children learn and develop, aimed at practitioners to assist them in delivering high quality early years environments (17). Based on the EYFS framework, the EYFS ‘Profile’ (EYFSP) was introduced as a teacher assessment of children’s development and learning, completed at the end of the academic year in which the child turns five (18). It was originally introduced with 69 ‘Early Learning Goals’ (ELGs), and following a review which indicated a need to simplify and reduce the number of goals for teachers to complete (17), a new profile consisting of 17 ELGs was introduced. Whilst specific, detailed information regarding how the specific ELGs were chosen is limited, and the EYFSP was not developed as a robust measurement tool (in comparison to, for instance, the EDI), the ELGs do appear to relate to children’s early developmental health. The ELGs span seven different developmental areas; ‘Communication and language development’, ‘Physical development’, ‘Personal, social and emotional development’, ‘Literacy’, ‘Mathematics’, ‘Understanding the world’, and ‘Expressive arts and design’ (19,20).
The EYFSP is scored according to whether a child meets each ELG (original version, child scored as “Emerging”, “Expected” or “Exceeding”; and revised version, child scored only as “Emerging” or “Expected” (21)). The present study investigates the original version, as this has been used nationally and routinely for nine years, and cohort studies have utilised it in research studies, both as an outcome in evaluations of interventions or policies (22), and as a predictor in association studies (23). It is also likely to continue to be used in the future, as there are several studies listed on the ISRCTN that are using the EYFSP as an outcome, and protocols for evaluations which plan to use it as an outcome in the future (24).
The ‘Good Level of Development’ (GLD)
The EYFSP has been predominantly used in research studies and educational monitoring as a binary measure, where children either meet a ‘Good Level of Development’ (GLD), or they do not. Children are scored as having achieved a GLD if they have achieved at least the expected level for the ELGs in the core areas of “communication and language”, “physical development”, “personal, social and emotional development”, “literacy” and “mathematics” (19). The Department for Education monitors national and regional averages of children reaching a GLD, and compares the number of children achieving GLD across different groups according to characteristics such as gender and eligibility for free school meals (25).
Further, several research studies have investigated risk factors for not achieving a GLD. Children with ‘English as an Additional Language’ (EAL) status have been found to have lower proportions of GLD achievement in comparison to native English-speaking children (26), and children born later in the academic year are much less likely to achieve a GLD (27–29). Additionally, children achieving the GLD have higher odds of performing at expected levels on later academic assessments at age 7 (30), and lower odds of later being identified as having Special Educational Needs (SEN) (31).
Whilst the GLD is a useful benchmark to establish which children are meeting the core components of the EYFSP, it has important limitations. Dichotomising variables (continuous or categorical) is problematic for two key reasons. First, much information is lost, so the statistical power to detect an association using the variable is reduced substantially (32). In fact, dichotomising a variable can reduce statistical power by the same amount as would discarding a third of the data or more (33). Second, dichotomisation can lead to an underestimation of the extent of variation in outcome between groups, as individuals close to but on opposite sides of the cut point are characterised as being very different rather than very similar (32).
Applying the GLD method to the EYFSP therefore means missing out potentially valuable information on the number of goals that a child meets or exceeds, meaning children very close to, but on opposite sides of, the GLD threshold are characterised as being very different, despite meeting or exceeding a similar number of goals. For instance, children who meet zero goals, and children who meet eleven out of twelve GLD goals, would be scored as ‘0’ on the GLD. The GLD also essentially ignores the distinction between children who are “Expected” and “Exceeding” in various goals, as a child who scores “Expected” in all the GLD goals, and a child who scores “Exceeding” in all the GLD goals would both be scored as a 1. As children vary considerably across different areas of development during early childhood (34,35), this simple GLD approach is likely a very limited assessment of children’s developmental health. In summary, much of the variation in the EYFSP items, and thus the variation in development amongst children, is ignored by the GLD measure.
The ‘Total Score’
An alternative to the GLD is to instead assign numerical scores to each category in the EYFSP (e.g. 0 for emerging, 1 for expected, and 2 for exceeding in the original version; or 0 for emerging and 1 for expected in the revised version), and sum these scores into a ‘total score’ (resulting in a score ranging between 0-34 for the original version, and 0-17 for the revised version). This approach overcomes the above limitations of the GLD, as it better captures the variation in EYFSP responses.
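The contrast between the two scoring approaches can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's code: the function names are hypothetical, and it assumes the 17 goals are ordered so that the first 12 are the GLD goals from the five core areas.

```python
# Codes per ELG in the original EYFSP: 0 = Emerging, 1 = Expected, 2 = Exceeding.
# Assumption: the first 12 of the 17 goals are the GLD goals (communication and
# language, physical, personal/social/emotional, literacy, mathematics).
N_GOALS = 17
N_GLD_GOALS = 12


def total_score(elg_codes):
    """Sum the 17 item codes into a total score in the range 0-34."""
    if len(elg_codes) != N_GOALS or any(c not in (0, 1, 2) for c in elg_codes):
        raise ValueError("expected 17 codes, each 0, 1 or 2")
    return sum(elg_codes)


def good_level_of_development(elg_codes):
    """Return 1 if every GLD goal is at least 'Expected', otherwise 0."""
    return int(all(c >= 1 for c in elg_codes[:N_GLD_GOALS]))


# Three hypothetical children
none_met = [0] * 17
eleven_of_twelve = [1] * 11 + [0] + [1] * 5
all_exceeding = [2] * 17

for child in (none_met, eleven_of_twelve, all_exceeding):
    print(good_level_of_development(child), total_score(child))
```

The first two hypothetical children receive the same GLD value despite very different total scores, which is the loss of information described above.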
Nonetheless, the EYFSP total score has been seldom used in research studies in comparison to the GLD. Previous research has considered the impact of early years workforce qualifications on children’s later EYFSP total scores (36,37). Only one study has used the total score to predict later outcomes, finding it to be a strong predictor of later Autism Spectrum Disorder diagnoses for children within the Born in Bradford cohort (38). Importantly, there are no studies exploring the measurement properties of the EYFSP total score.
Subscale scores of the EYFSP
As described earlier, there are seven individual learning areas within the EYFSP. However, associations between the seven individual areas of the EYFSP and later related outcomes have not been extensively explored. Exploring these associations may provide information about the construct validity (i.e. the extent to which a test measures what it is intended to measure) of the specific areas (39). For instance, do the ‘personal, social and emotional development’ goals have significant predictive associations with a validated measure of children’s social and emotional development? If so, this specific area (with its three goals giving a score ranging between 0-6) could be used as an outcome in isolation, meaning that intervention studies aiming to improve children’s social and emotional development could use this area as an outcome. This rationale can be generalised to all seven areas of the EYFSP.
The preliminary evidence on whether the individual areas significantly relate to other outcomes is promising, but very limited. Children with higher language comprehension scores achieved higher scores on the EYFSP writing scale (40). In the Born in Bradford cohort, EYFSP scores relating to literacy and physical development were found to predict total difficulties on the Strengths and Difficulties Questionnaire (SDQ) (23). However, these are not the most relevant subscales for the SDQ, and it was not reported how the EYFSP subscale scores were calculated for this study.
This information could also be useful in educational settings, as it could compare children’s relative strengths and weaknesses across different domains. In understanding these strengths and weaknesses, a child could then be provided with support in a particular area.
Rationale and objectives
The EYFSP total score has huge potential to provide useful information on children’s early developmental health that could be utilised for research and educational purposes, at both a population and individual level. Despite the EYFSP having been administered to over 7.5 million children since its introduction (30), there is an absence of any psychometric research on it. Specifically, there is no previous research on the internal consistency or structural validity of the EYFSP ‘total score’, nor any research on its predictive validity for academic outcomes. Research is therefore needed to establish whether the EYFSP ‘total score’ is fit for purpose in both research studies and applied educational settings.
We first investigate the internal consistency of the EYFSP; that is, the degree of the interrelatedness among the items which represents the extent to which all items of a test measure the same construct (39,41). This is an essential first step prior to investigating the structural validity of the EYFSP; that is, the degree to which the total score reflects the dimensionality of the construct to be measured (39). We achieve this using Item Response Theory (IRT); a set of psychometric models for developing and refining psychological measures (42).
We also investigate the predictive validity of the EYFSP total score, to assess the degree to which it predicts future outcomes (39). Since it is assumed that measures administered at the start of school provide an understanding into children’s future attainment, predictive validity is crucial (4). Whilst the predictive validity of the EYFSP GLD has been investigated (30,31), the predictive validity of the total score for academic outcomes has not been investigated. We investigate whether the EYFSP total score is predictive of children’s later academic outcomes at age 10-11 years, and investigate whether specific EYFSP subscales (relating to communication and socioemotional wellbeing), are predictive of children’s behavioural, social, and emotional difficulties.
In summary, we had five aims that assessed two key aspects of using the EYFSP for research and educational purposes:
Internal Consistency/Structural validity of the EYFSP
1) Investigate whether the EYFSP items demonstrate internal consistency
2) Investigate whether the EYFSP items demonstrate structural validity, i.e. that the total scores from the instrument can be used as a summary measurement that represents children’s early school skills
Predictive Validity of the EYFSP
3) Investigate if the EYFSP total score predicts children’s later academic attainment (for maths, reading, and grammar/punctuation/spelling)
4) Investigate if the EYFSP total score predicts children’s later behavioural, social, and emotional difficulties
5) Investigate if the EYFSP subscales (relating only to communication and socioemotional wellbeing) predict later behavioural, social, and emotional difficulties
Materials & Methods
Design
This study comprises secondary data analyses of an observational birth cohort.
Setting
The data source is the longitudinal cohort study, Born in Bradford (BiB). The BiB cohort recruited pregnant mothers between March 2007 and December 2010 at the Bradford Royal Infirmary. All babies born to these mothers were eligible to participate and more than 80% of women invited agreed to participate (43). The cohort comprises 12,453 mothers, 13,776 pregnancies and 3,448 fathers. At recruitment, the two largest ethnic groups in the sample were Pakistani heritage (45%) and White British (40%) (44).
Mothers completed the BiB baseline questionnaire when they were recruited and reported information on family demographics and socioeconomic indicators. Routine education data relating to personal characteristics and educational outcomes were obtained from the Local Authority every year that the child attends school. Additional data were collected on children aged 7 to 10 years in 89 Bradford schools between 2016 and 2019, including a teacher reported Strengths and Difficulties Questionnaire (SDQ) (which is the outcome for Research Questions 4-5) (34). Born in Bradford and the ‘Primary School Years’ wave received ethical approval for the data collection from the NHS Health Research Authority’s Yorkshire and the Humber—Bradford Leeds Research Ethics committee (references: 07/H1302/112, 16/YH/0062). Informed written consent was obtained for all women recruited.
Internal Consistency and Structural Validity Analyses
The analyses were preregistered at osf.io/s6num. Data were combined and cleaned using Stata/MP 18.0. Internal validity analyses were completed using the mirt (45,46) and ggmirt (47) packages in R.
Measurements
EYFSP total score
The EYFSP total score was summed from the 17 Early Learning Goals (ELGs) in the profile.
As seen in Table 1, each area of learning contains specific goals. The EYFSP handbook provides a description of each goal and what a child must achieve to meet each level (20). Practitioners are instructed to review the evidence gathered in order to make a judgement for each child and for each ELG, and then to score each ELG as either:
Emerging: not yet at the level of development expected at the end of the EYFS
Expected: best described by the level of development expected at the end of the EYFS
Exceeding: beyond the level of development expected at the end of the EYFS
The EYFSP handbook instructs that practitioners must make their final EYFSP assessments based on all their evidence, where ‘evidence’ means any “material, knowledge of the child, anecdotal incident or result of observation, or information from additional sources that supports the overall picture of a child’s development” (GOV.UK, 2019a, p.15).
The responses to each ELG and how they were coded in this study are as follows: ‘Emerging’ = 0, ‘Expected’ = 1, and ‘Exceeding’ = 2. If children were absent from school for a long period of time, this is marked on their records and these children were dropped from the analyses. The EYFSP total score was summed from the 17 ELGs (see Table 1), and therefore ranged between 0–34.
Item Response Theory (IRT) models
We first explored unidimensionality with a Confirmatory Factor Analysis (CFA) of a latent trait with all EYFSP items loading onto it and examined McDonald’s hierarchical Omega, which reflects the percentage of variance in the scale score accounted for by a single general factor. This allows us to estimate the extent of internal consistency among the EYFSP items.
Next, we used Item Response Theory (IRT) to assess the structural validity (42). IRT can be used to assess whether creating a total score from the items is appropriate and assess the strength of relationships between items and constructs of interest. Item response models assume the latent trait variable is reflected by a unidimensional continuum (i.e., item responses are explained by one latent continuous variable, or single dimension). We fitted a polytomous ‘Rating Scale’ version of the 1-parameter logistic Rasch model, since the items have more than two possible response categories (see further details under ‘Rasch model parameters’) (48). Under the Rasch model, two test takers who both achieved, for example, 12 EYFSP items, but who achieved a different set of items would receive the same ability estimate (49). This allows us to interrogate the structural validity of the summed ‘total score’.
Rasch model parameters
Let Yij denote the response to item i for child j, with Yij taking the values 0 (‘Emerging’), 1 (‘Expected’) or 2 (‘Exceeding’). The polytomous rating scale Rasch model expresses the probabilities of child j with latent ability θj obtaining responses 0, 1 or 2 for item i in terms of bi, the overall difficulty of item i, and d1, d2, the distances between adjacent response categories (common across all items). Furthermore, it is assumed that θj ∼ N(0, σ²) and that the item discrimination parameters are 1 across all items. This contrasts with the conventional Rasch parameterisation, which constrains the item discrimination parameters to be constant across all items (but not equal to unity) and assumes the latent ability θj to be distributed N(0, 1).
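For orientation, one standard way of writing a rating scale model with unit discrimination is given below. This is a sketch in the Andrich form and should be read as an assumption: the sign convention and the exact role of d1 and d2 may differ from the parameterisation used by the fitting software.

```latex
P(Y_{ij} = k \mid \theta_j)
  = \frac{\exp\left( \sum_{m=1}^{k} (\theta_j - b_i - d_m) \right)}
         {\sum_{h=0}^{2} \exp\left( \sum_{m=1}^{h} (\theta_j - b_i - d_m) \right)},
  \qquad k \in \{0, 1, 2\},
```

where the empty sum (k = 0 or h = 0) is taken to be zero, b_i is the difficulty of item i, and d_1, d_2 are the category parameters shared by all items.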
The item difficulty parameter measures the difficulty of achieving a higher scoring response, whereas the discrimination parameter is a measure of the differential capability of an item (i.e. a high discrimination value suggests an item that has a high ability to differentiate between subjects with similar, latent abilities) (50). In a Rasch model, discrimination is constrained to be equal across all items, and difficulty is estimated separately for all items (49). The polytomous rating scale version of the Rasch model also includes category threshold parameters which are constrained to be equal across items, and provide a measure of the distances between the difficulties of adjacent levels of response for each item.
Model fit
The fit of the Rasch model was assessed using Root Mean Square Error of Approximation (RMSEA), where values of <0.02 with sample sizes of 1000+ indicate that the data do not underfit the model (51). We also report the Comparative Fit Index (CFI) (values >.90 are acceptable), and Standardised Root Mean Square Residual (SRMR) (values <.08 are acceptable) (52).
Item fit
Item infit and outfit indicate how well the item responses fit the model (53). Item fit was assessed using infit/outfit statistics, with values between 0.5 and 1.5 considered to be acceptable (54) and RMSEA as described above.
Local Dependence
Local independence is the assumption that the only influence on an individual’s item response is the latent trait variable being measured and that no other variable (e.g., other items on the EYFSP scale) influences individual item responses; local dependence is a violation of this assumption. We used the ‘residuals’ function in the mirt package to examine the standardised local dependency χ2 statistic, classifying any residual correlation more than .2 above the average item residual as local dependency (55).
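The flagging rule described above can be sketched as follows. This is a minimal illustration, assuming the pairwise residual correlations have already been extracted into a dictionary; the function name is hypothetical, and all values other than the two residuals reported in the Results (.44 and .48) are made up.

```python
from statistics import mean


def flag_local_dependence(residual_corr, margin=0.2):
    """Return item pairs whose residual correlation exceeds the average + margin.

    residual_corr maps (item_a, item_b) tuples to residual correlations.
    """
    cutoff = mean(residual_corr.values()) + margin
    return sorted(pair for pair, r in residual_corr.items() if r > cutoff)


# Hypothetical residual correlations for a handful of item pairs
residuals = {
    (1, 2): 0.05,
    (2, 3): 0.44,
    (4, 5): 0.10,
    (6, 7): -0.03,
    (9, 10): 0.48,
    (11, 12): 0.02,
}
print(flag_local_dependence(residuals))
```

With these illustrative inputs only the pairs (2, 3) and (9, 10) exceed the cutoff.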
Item Response Theory visuals
The test information function shows a measure of the information provided by the total test score across the range of latent ability levels (denoted θ). Information is a statistical concept that refers to the ability of a test (or item) to reliably measure the latent ability θ. The test characteristic curve shows the relationship between the total summed score on the y axis, and latent ability (θ) on the x axis (56). Plots of item characteristic curves and item information functions are provided at osf.io/s6num/.
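To make the two curves concrete, the sketch below computes a test characteristic curve and a test information function for a rating scale model with unit discrimination. It is illustrative only: it assumes the Andrich form of the model, uses the category threshold estimates reported in the Results under that assumed sign convention, and uses hypothetical item difficulties, so it will not reproduce the published figures.

```python
import math

# Category threshold estimates reported in the Results; their sign convention
# within this assumed model form is an assumption.
THRESHOLDS = (-3.585, 3.473)


def category_probs(theta, b, d=THRESHOLDS):
    """P(Y = 0, 1, 2 | theta) for one item under the assumed rating scale form."""
    logits = [0.0]
    for d_m in d:
        logits.append(logits[-1] + (theta - b - d_m))
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in logits]
    norm = sum(weights)
    return [w / norm for w in weights]


def item_mean_and_information(theta, b, d=THRESHOLDS):
    """Expected item score and item information (score variance given theta)."""
    probs = category_probs(theta, b, d)
    mean = sum(k * p for k, p in enumerate(probs))
    info = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, info


def scale_curves(theta, difficulties):
    """Test characteristic curve value and test information at one ability."""
    pairs = [item_mean_and_information(theta, b) for b in difficulties]
    return sum(m for m, _ in pairs), sum(i for _, i in pairs)


# Hypothetical difficulties for 17 items
difficulties = [-2 + 0.25 * i for i in range(17)]
for theta in (-6, -3, 0, 3, 6):
    expected_total, information = scale_curves(theta, difficulties)
    print(theta, round(expected_total, 2), round(information, 2))
```

Under these assumptions, a single item carries little information when θ is close to its difficulty, because the middle category dominates there, and more information near the category thresholds.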
Predictive Validity Analyses
The analysis plan for the predictive validity analyses was preregistered at osf.io/s6num. We made two key changes to this upon starting the analysis: (1) the inclusion of a binary term for Special Educational Needs and Disabilities (SEND) status as a covariate in all analysis models and (2) inclusion of ‘school at time of outcome’ as a random intercept in all analysis models. All analyses for this component of the research were undertaken using Stata/MP 18.0.
Measurements
Predictors
EYFSP total score
See the above section.
EYFSP Communication and Socioemotional goals (EYFSP-CS)
We tested the strength of the association between the ‘communication and language’ and ‘personal, social, and emotional’ ELGs and children’s outcomes. This EYFSP-CS score ranged between 0-12 and was obtained by summing the responses to the six items in the two relevant areas.
Outcomes
Research Question 3 - Academic attainment
The Key Stage 2 Assessment is completed towards the end of Year 6 at school. There are continuous scaled scores for (1) maths, (2) reading, and (3) grammar/punctuation/spelling that range between 0 and 120. Any children who scored ‘0’ were excluded from the analyses, as any children with ‘0’s recorded are pupils who have achieved too few marks to be awarded a scaled score (57). Analysed scores therefore ranged between 80 and 120.
Research Question 4-5 - Strengths and Difficulties Questionnaire (SDQ)
We used the SDQ to measure children’s behavioural, social, and emotional difficulties (58). The SDQ was collected for children when they were aged 7-10 in the ‘Primary School Years’ wave. The 25 items in the SDQ comprise 5 scales of 5 items each. ‘Somewhat True’ is always scored as 1, but the scoring of ‘Not True’ and ‘Certainly True’ varies with the item. A total difficulties score is generated by summing scores from all the scales except the prosocial scale, and the resultant score ranges from 0 to 40.
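The total difficulties calculation can be sketched as below. This is a minimal illustration with hypothetical names; it assumes the five scale scores (each 0 to 10) have already been computed from the item responses, and it uses the standard SDQ scale names.

```python
# The four difficulty scales that contribute to the total; prosocial is excluded.
DIFFICULTY_SCALES = ("emotional", "conduct", "hyperactivity", "peer")
ALL_SCALES = DIFFICULTY_SCALES + ("prosocial",)


def sdq_total_difficulties(scale_scores):
    """Sum the four difficulty scale scores into a total in the range 0-40."""
    for name in ALL_SCALES:
        if not 0 <= scale_scores[name] <= 10:
            raise ValueError(f"{name} score must be between 0 and 10")
    return sum(scale_scores[name] for name in DIFFICULTY_SCALES)


# Hypothetical child
child = {"emotional": 2, "conduct": 1, "hyperactivity": 4, "peer": 0, "prosocial": 9}
print(sdq_total_difficulties(child))
```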
Covariates
Table 2 below provides an overview of all covariates included in both models. Covariates were included in the regression models if they were thought to be confounders of the association between EYFSP and the outcome, or if they were covariates that would be expected to improve the precision of our estimates.
Analysis models
All research questions were answered using linear mixed effects models, with fixed effects of socioeconomic status, parent immigration status, child ethnicity, SEND, child age, and child language as covariates (see Table 2), and a random intercept for school at the time of outcome measurement. The four outcomes were: (1) Reading, (2) Maths, (3) Grammar, Punctuation, and Spelling, and (4) SDQ. The SDQ scores were analysed twice, once using the EYFSP total score as a predictor, and once using the EYFSP-CS subscale as a predictor. In the model for each outcome, δ is the outcome, β0 is the intercept, each β is a coefficient, uj is the random intercept for school j, and εij is the residual error for individual i within school j. The subscripts identify the levels within the model, where i is the individual and j is the school. Child ethnicityij and Socioeconomic statusij each represent a set of dummy variables.
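Written out, a random-intercept model with these terms would take the following general form. This is a sketch assembled from the description above; the ordering and numbering of the coefficients are assumptions, as are the normality assumptions for the random terms, which are the usual ones for a linear mixed model.

```latex
\delta_{ij} = \beta_0
  + \beta_1 \, \text{EYFSP}_{ij}
  + \boldsymbol{\beta}_2^{\top} \, \text{Socioeconomic status}_{ij}
  + \beta_3 \, \text{Parent immigration status}_{ij}
  + \boldsymbol{\beta}_4^{\top} \, \text{Child ethnicity}_{ij}
  + \beta_5 \, \text{SEND}_{ij}
  + \beta_6 \, \text{Child age}_{ij}
  + \beta_7 \, \text{Child language}_{ij}
  + u_j + \varepsilon_{ij},
\qquad u_j \sim N(0, \sigma_u^2), \quad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)
```

Here EYFSP denotes either the total score or the EYFSP-CS score, depending on the analysis, and the bold coefficients are vectors applied to the sets of dummy variables.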
Unstandardized regression coefficients and Wald method 95% confidence intervals based on variance estimates obtained via Rubin’s rules are reported for all models (60).
Missing data methods
Multiple Imputation using Chained Equations (MICE) was used to impute missing data on parent immigration status, socioeconomic position, and SEND (see Figure 1 for numbers of missing values). Multiple imputation assumes that data are missing at random (MAR: the probability of data being missing does not depend on the unobserved data, conditional on the observed data) (60,61). Every variable that was in the analysis model was included in the imputation model. We used Stata’s ‘mi impute chained’ command to generate 25 imputed datasets for each research question. The results section presents the pooled results from the multiply imputed datasets (results from analyses based on complete cases were similar).
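The pooling step can be illustrated with a short sketch of Rubin's rules for a single coefficient. This is not the study's code (the analyses were run in Stata); the function name and the input values are hypothetical.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: the coefficient estimated in each imputed dataset.
    variances: the squared standard error from each imputed dataset.
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching estimates and variances from >= 2 imputations")
    pooled = mean(estimates)        # average of the per-dataset estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation (sample) variance
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical estimates and squared standard errors from three imputations
estimates = [0.35, 0.36, 0.37]
variances = [0.0003, 0.0003, 0.0003]
pooled, se = pool_rubin(estimates, variances)
print(round(pooled, 3), round(se, 4))
```

A Wald confidence interval is then the pooled estimate plus or minus a critical value times the pooled standard error, with the reference distribution chosen by the software.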
Robustness checks
Model fit was assessed between models run with (1) EYFSP modelled as a continuous variable via a single linear term and (2) EYFSP as an unordered categorical variable modelled via a series of dummy variables. Model fit assessed via AIC and BIC was marginally better with EYFSP as a continuous variable, and the continuous modelling provides a more parsimonious estimate, so this model was selected. A scatter plot of fitted and residual values was considered to show no evidence of heteroskedasticity.
Effect sizes
Half of a standard deviation has been previously found to correspond to a minimum clinically important difference (62,63). We therefore calculated half of a standard deviation in the outcomes, and compared these to our effect estimates. The outcomes, standard deviations, and effect sizes of interest are given below:
Results
Figure 1 shows the total number of recruited BiB children (n=13,858), and the numbers within each measurement and analysis set.
Descriptive information
Table 4 describes the sample for all children who had EYFSP data (n=10,589). The mean EYFSP total score in this sample was 15.30 (SD=8.07), and it ranged between 0-34. The mean EYFSP score within children who achieved a GLD (n=6,272, 59%) was 20.38 (SD=4.96), and scores ranged between 12-34. The mean EYFSP score within children who did not achieve a GLD (n=4,317, 41%) was 7.92 (SD=5.67), and scores ranged between 0-27.
Figure 2 further demonstrates that there is considerable overlap in total scores between children who do and do not achieve a GLD. It also demonstrates that there is substantial variability in scores within children who do and do not achieve a GLD.
Item Response Theory analysis
Full analyses with the code, results, and additional sensitivity analyses are provided at https://osf.io/s6num/.
Internal consistency
The CFA indicated high factor loadings (all >.8) onto one construct, and a parallel analysis indicated that a one factor model was a reasonable representation of the data (64). We assessed internal consistency using McDonald’s hierarchical omega, finding a point estimate of 0.89.
Structural validity: Rasch model parameters, model fit, and item fit
The model fit values (RMSEA=0.138, SRMSR=0.162, CFI=0.938) indicated poor fit to the overall Rasch model. The maximum likelihood estimates of the category threshold parameters were -3.585 and 3.473 and the maximum likelihood estimate of the variance of the latent ability was 9.532 (discrimination parameter constrained to be equal to 1). We next assessed the item parameters and the item fit values for the overall Rasch model.
Table 5 shows that the easiest item is ‘Moving and handling’ (goal 4), and hardest item is ‘Writing’ (goal 10). The item fit values show that Item 9 has the highest RMSEA value (although other items also have problems with misfit). The item infit/outfit values are provided at osf.io/s6num/ and generally indicated values within the acceptable range.
Local dependence
The local dependency matrix is presented in osf.io/s6num/ (Section 2.5). The matrix identifies a local dependence issue between Items 2 & 3 (communication items) (residual = .44); and Item 9 & Item 10 (the literacy items) (residual = .48).
Test information function and test characteristic curve
Figure 3 demonstrates that most information is provided at the lower/higher ends of ability (i.e. those children with latent abilities at least one standard deviation above/below the mean latent ability). It also shows that less information is provided for children with close to average abilities, shown by the dip in the curve around θ = 0. Figure 4 presents the scale characteristic curve, showing the relationship between the total summed score on the y axis, and the overall latent ability (θ) on the x axis. The test shows good discrimination for children with latent abilities that are slightly-to-moderately higher and lower than average (i.e. θ ∈ [-5, -1] ∪ [1, 5]), and slightly less powerful discrimination for children with close to average abilities (shown by the flattening in the curve around θ = 0), and for children with very high or low abilities (shown by the flattening of the curve at the more extreme values of θ).
Predictive Validity Analysis
Academic attainment outcome
The mean scores and standard deviations for the Key Stage 2 outcomes were Maths = 105.08 (7.05), Reading = 103.67 (8.16) and Grammar/Punctuation/Spelling (GPS) = 106.64 (8.09). For full regression results and the key analyses code, please see Technical Appendix File 1 (Attachment C). All models indicated that EYFSP total score was associated with a higher Key Stage 2 outcome. Key results are described and displayed below.
Maths
The model explained a significant amount of the variance (unadjusted R2 = .33; F(11,23939.7) = 124.65, p<.001). The EYFSP total score was associated with higher Maths scores (B=0.356 [0.322 to 0.390], p<.001).
Reading
The model explained a significant amount of the variance (unadjusted R2 = .31; F(11,22414.3) = 108.91, p<.001). The EYFSP total score was associated with higher Reading scores (B=0.424 [0.384 to 0.464], p<.001).
GPS
The model explained a significant amount of the variance (unadjusted R2 = .37; F(11,18477.5) = 146.05, p<.001). The EYFSP total score was associated with higher GPS scores (B=0.427 [0.390 to 0.464], p<.001).
Figure 5 displays the association between an increase in EYFSP goals (ranging between 1-10), and the estimated change in the different academic outcomes. For instance, an increment of 1 EYFSP total score point is associated with a change of between 0.36 and 0.42 in the outcomes, and an increment of 10 with a change of between 3.56 and 4.24. To reach a minimum clinically important difference (shaded area in Figure 5), the change in EYFSP total score required is approximately 8 for Reading and Grammar/Punctuation/Spelling, and 10 for Maths.
Behavioural, social, and emotional difficulties outcome
The mean EYFSP-CS score was 5.78 (SD=3.16). The mean SDQ score was 7.31 (SD=6.26). For full regression results, please see Technical Appendix File 1, Attachment D. Key results are described and displayed below. Note that a higher score on SDQ indicates more socioemotional difficulties.
EYFSP total score predictor (n=984). When we included the EYFSP total score as the predictor, the model explained a significant amount of the variance (unadjusted R2 = .25; F(11,66324.9) = 27.04, p<.001). The EYFSP total score was associated with a decrease in the SDQ total difficulties (B=-0.20 [-0.26 to -0.15], p<.001).
EYFSP-CS predictor (n=984). When we included only the EYFSP-CS predictor, the model explained a significant amount of the variance (unadjusted R2 = .25; F(11,67390.5) = 26.72, p<.001). The EYFSP-CS scores were associated with a stronger decrease in the SDQ total difficulties (B=-0.48 [-0.61 to -0.37], p<.001).
Figure 4 displays the association between an increase in EYFSP goals (ranging from 1 to 10) and the estimated change in the socioemotional wellbeing measure (SDQ). For instance, a change of 1 in the EYFSP total score is associated with a change of -0.20 in SDQ, and a change of 1 in EYFSP-CS with a change of -0.48 in SDQ. A change of 6 in the EYFSP total score is associated with a change of -1.22 in SDQ, and a change of 6 in EYFSP-CS with a change of -2.90 in SDQ (with the confidence interval crossing the clinically important difference).
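The point at which a confidence interval first reaches the important-difference threshold can be checked with a short calculation. The sketch below assumes the threshold for SDQ is half the observed standard deviation (the definition this paper gives for the Key Stage 2 outcomes; applying it to SDQ is an assumption here) and uses the rounded confidence limits reported above. The function name is hypothetical.

```python
def first_increment_reaching(threshold, ci_low, ci_high, max_points=10):
    """Smallest whole-point increment whose scaled CI includes or passes the threshold.

    threshold is a positive magnitude; the CI bounds are per-point coefficients.
    Returns None if no increment up to max_points gets there.
    """
    strongest = max(abs(ci_low), abs(ci_high))  # CI bound furthest from zero
    for points in range(1, max_points + 1):
        if strongest * points >= threshold:
            return points
    return None

SDQ_SD = 6.26            # observed SDQ standard deviation reported above
THRESHOLD = SDQ_SD / 2   # assumption: half an SD counts as an important difference

total_score = first_increment_reaching(THRESHOLD, -0.26, -0.15)
eyfsp_cs = first_increment_reaching(THRESHOLD, -0.61, -0.37)
print("EYFSP total score:", total_score)
print("EYFSP-CS:", eyfsp_cs)
```

Under these assumptions the EYFSP-CS interval first reaches the threshold at a 6-point change, while the total score interval does not reach it anywhere in the 1 to 10 range.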
Discussion
Embedding routine measurement of children’s developmental health into educational systems is crucial to provide support to those who need it (10,11), and potentially close inequalities in educational outcomes (9). In England and Wales, the EYFSP with 17 goals has been routinely completed by teachers for all children attending school for nearly ten years. Due to the potential use of the EYFSP ‘total score’ for both research studies and applied educational settings, we investigated whether it is fit for purpose as an overall summary of child developmental health.
Internal consistency and structural validity of the EYFSP total score
The first aim was to investigate the internal consistency of the EYFSP items. The EYFSP items demonstrated high internal consistency, with results indicating that the items primarily measure a single construct. We tentatively suggest that the measured construct is children’s ‘developmental health’. The construct of developmental health encompasses a holistic understanding of children’s physical, mental, social, and emotional wellbeing (1), and this reflects the EYFSP’s original purpose to operate as a research-based framework of children’s learning and development (17,65).
The second aim was to investigate whether the EYFSP demonstrated structural validity. The IRT analyses indicated a poor fit to the polytomous Rasch model. However, the test information and scale characteristic curves show that the total score provides substantial information across a wide range of underlying ability, with some loss of precision for children with ‘average’ latent abilities (e.g. the 40% of children between roughly the 35th and 75th percentiles). This means that two children with equal scores of, for example, 16 may have different ability levels in reality (e.g. one could have slightly below average ability and one slightly above), but the EYFSP total score is not able to discriminate precisely between them. It should also be noted that the Rasch model is extremely restrictive, as it requires all items to be equally discriminating, which is very rarely the case in measurements of person ability (49).
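To make the equal-discrimination restriction concrete, the expressions below contrast one common polytomous Rasch parameterisation, the partial credit model, with its generalised form. Treating the fitted model as a partial credit model is an assumption made for illustration; the symbols δ (step difficulties), a (item discrimination) and M (highest category) are introduced here and are not taken from the analysis.

```latex
% Partial credit model (a polytomous Rasch model): item i, categories k = 0..M_i,
% step difficulties \delta_{ij}; the empty sum for k = 0 is defined as 0.
P(X_i = k \mid \theta) =
  \frac{\exp\left(\sum_{j=1}^{k} (\theta - \delta_{ij})\right)}
       {\sum_{m=0}^{M_i} \exp\left(\sum_{j=1}^{m} (\theta - \delta_{ij})\right)}

% Generalised partial credit model: each item has its own discrimination a_i.
% Setting a_i = 1 for every item recovers the Rasch form above.
P(X_i = k \mid \theta) =
  \frac{\exp\left(\sum_{j=1}^{k} a_i (\theta - \delta_{ij})\right)}
       {\sum_{m=0}^{M_i} \exp\left(\sum_{j=1}^{m} a_i (\theta - \delta_{ij})\right)}

% Test information and the resulting precision of the ability estimate.
I(\theta) = \sum_i I_i(\theta), \qquad
\mathrm{SE}(\hat{\theta}) \approx \frac{1}{\sqrt{I(\theta)}}
```

Where the test information is low, the standard error of the ability estimate is larger, which is why children with the same total score near the average may still differ in latent ability.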
The less precise estimates of ability evident for children with ‘average’ ability may relate to the varying administration of the measure in educational settings (36,37). The administration is not standardised or moderated, and is therefore susceptible to considerable variation. Additionally, the procedures and requirements of the EYFSP may not lend themselves to identification of more nuanced differences in ability for children with generally average levels of development. The high number of children meeting expected levels of development in all 17 goals is potentially indicative of this issue. More guidance for teachers on how to identify differences in children’s abilities, as well as more robust procedures for moderating scores, could potentially address the apparent issues with reduced precision for children with close to average abilities, and increase the information provided by the measure.
Predictive validity of the EYFSP total score
Our third aim was to investigate the predictive validity of the EYFSP total score for academic outcomes. The EYFSP total score strongly and consistently predicts academic outcomes at ages 10-11 in Maths, Reading, and Grammar, Punctuation and Spelling assessments.
It has previously been found that the EYFSP GLD is predictive of children’s academic outcomes at ages 6-7 during Key Stage 1 (Atkinson et al., 2022), and the present study extends this finding to the EYFSP total score, and to Key Stage 2 assessments at ages 10-11 years. To reach an important change in academic outcomes (considered to be half the standard deviation of the observed Key Stage 2 scores), a change in EYFSP total score of 8-10 points was required (dependent on the outcome). This information will be useful for researchers who wish to use the EYFSP total score as an outcome for intervention studies. For instance, as a change in EYFSP total score of 8 was required to reach an important difference for the Reading outcome, this could be used as a benchmark for future educational interventions which aim to improve children’s reading abilities. It is important to note, however, that the estimates reported in Figure 5 could easily be used to identify differences in the EYFSP total score that translate to smaller differences in these outcomes, which may serve as more realistic target differences for future intervention studies.
Our fourth and fifth aims were to explore the predictive validity of the EYFSP total score and the EYFSP-CS subscales for children’s behavioural, social, and emotional difficulties. The relevant EYFSP-CS subscales had a much stronger association with behavioural, social, and emotional difficulties than the EYFSP total score. A change of 6 points for the EYFSP-CS score was associated with important differences in behavioural, social, and emotional difficulties, whereas no changes in the EYFSP total score were associated with important differences. Again, an improvement of 6 (for the EYFSP-CS score) could be used as a benchmark for future interventions which aim to improve children’s behavioural, social, and emotional abilities (or translated for a more realistic target difference). Researchers can more confidently use the communication and social subscales to measure behavioural, social, and emotional difficulties.
Limitations
We expect that the results from this study will generalise to the revised version of the EYFSP, particularly the findings regarding internal consistency and predictive validity.
However, as data from the revised EYFSP become widely available, future research will need to test whether these findings generalise to the revised EYFSP, particularly as the structural validity may be affected by the removal of one of the response categories.
This study included only children in the Born in Bradford cohort, and therefore may only be relevant for comparable populations with high levels of deprivation and a diverse ethnic population. However, the ethnic diversity of this sample improves the generalisability of the findings to diverse populations.
Implications and future directions
This study supports future use of the EYFSP total score over the EYFSP GLD score for research and educational purposes. Although both the GLD and now the total score have been shown to predict future outcomes, there is substantial variation in total scores within children who do and do not achieve a GLD (see Figure 1). The GLD therefore does not capture the variation in children’s developmental health that the EYFSP total score does, and it has no other evidence regarding its measurement properties. We therefore recommend that researchers use the EYFSP total score instead of the GLD, if it suits their study purpose. For teachers, the GLD is a useful metric for identification of children who may later be diagnosed with special educational needs (31); however, teachers may also wish to examine a child’s EYFSP total score to gain a more nuanced understanding of a pupil’s development. It is important to note, however, that the EYFSP total score should be used with some caution when making inferences about ‘average’ ability children (those with total scores between approximately 15 and 18).
There is still much to be learnt about the measurement properties of the EYFSP. It would be beneficial to directly compare the measurement properties of the EYFSP total score to the GLD in a future study, explicitly examining whether more valid and accurate conclusions can be made about child developmental health using the total score than using the GLD. There are also other measurement properties which could be tested, including content validity (the degree to which the EYFSP reflects children’s developmental health as a construct) and criterion validity (the degree to which the EYFSP items and summaries derived from them are an adequate reflection of a gold standard). This would require collection of an additional measure of child development at the same time as the EYFSP, perhaps the EDI, since this is implemented in a comparable way at population level and has undergone substantial development since its inception (13,16). In terms of predictive validity, more research is needed to explore whether each specific goal area is associated with other measurements (e.g. literacy to reading assessments, physical activity to motor skill measurements).
Conclusions
While the EYFSP has been used as a measure of children’s early developmental health, this was not its intended purpose. Despite this, this study has shown that the EYFSP total score is an internally consistent measure with predictive validity. The EYFSP total score also provides information across a range of abilities; however, we caution against using it for measurement of children with very close to ‘average’ ability levels. Given that the EYFSP was not intended to be a robust measurement tool, the EYFSP total score appears to be a reasonable measure of child developmental health for routine use in England and Wales.
Data Availability
These data cannot be shared publicly as they are available through a system of managed open access (see https://borninbradford.nhs.uk/research/how-to-access-data/). Full analysis code for the IRT analyses is available at https://osf.io/s6num/.
Acknowledgements
Born in Bradford is only possible because of the enthusiasm and commitment of the children and parents in BiB. We are grateful to all the participants, health professionals, schools and researchers who have made Born in Bradford happen.