Abstract
Purpose Dynamic assessments (DAs) of word reading skills (e.g., phonological awareness, decoding) demonstrate predictive validity with word reading outcomes but are characterized by substantial heterogeneity in terms of format, administration method, word, and symbol type used, factors which may affect their validity. This systematic review and meta-analysis examined whether the validity of DAs of word reading skills is affected by these characteristics.
Method Five electronic databases (Medline, Embase, PsycINFO, ERIC and CINAHL), 3 preprint repositories (MedRxiv, PsyArxiv and EdArxiv) and the gray literature were searched between March 2022 and March 2023, to identify studies with participants aged 4-10 that reported a Pearson’s correlation coefficient between a DA of word reading and a word reading measure. A random effects meta-analysis and 4 subgroup analyses based on DA format, administration method, word and symbol type were conducted.
Results Thirty-two studies from 30 articles were identified. The overall effect size between DAs of word reading skills and word reading is large. There are no significant differences in mean effect sizes based on format (graduated prompt vs. train-test) or administration method (computer vs. in-person). However, DAs that use nonwords and those that use familiar letters or characters demonstrate significantly stronger correlations with word reading measures than those that use real words and those that use novel symbols.
Conclusions Outcomes provide preliminary evidence to suggest that DAs of word reading skills that use nonwords and familiar letters in their test items are more strongly associated with later word reading ability than those that use real words or novel symbols. There were no significant differences between DAs administered in-person versus via computer. Results inform development of novel DAs of word reading, and clinical practice when it comes to selecting assessment tools.
Introduction
Literacy Assessment
Literacy, the ability to read and write, is a complex construct which requires integration of multiple skills but can simply be described as the product of the ability to decode words (or read words) and comprehend language (Hoover & Gough, 1990). In this review, we derive our definition of the construct of word reading from the subskills that comprise word recognition ability in the evidence-based model Scarborough’s reading rope (2001). These subskills, namely phonological awareness, knowledge of the alphabetic principle (or sound-symbol knowledge), and word recognition (or decoding) ability, have been consistently found to be among the strongest and most accurate predictors of later reading ability for young children beginning to learn to read (e.g., Catts et al., 2005; Hogan et al., 2005; Scarborough, 1998). In recent years, early assessment and identification of difficulties with word reading skills has received increasing attention, in large part due to widespread global literacy challenges exacerbated by the COVID-19 pandemic (e.g., Annie E. Casey Foundation, 2014; OHRC (Ontario Human Rights Commission), 2022; The Conference Board of Canada, 2014; UNESCO (United Nations Educational, Scientific and Cultural Organization), 2013; UNESCO, 2021).
In the fields of speech-language pathology (SLP), psychology and education, many of the widely used traditional word reading tools employ a static assessment (SA) paradigm (e.g., Phonological Awareness Test-2 (PAT-2:NU), Robertson & Salter, 2017; Woodcock Reading Mastery Test-III (WRMT-III), Woodcock, 2011). In SA, an examiner measures an individual’s learning product via a correct or incorrect binary grading system, and passively evaluates performance without provision of prompting or corrective feedback (Grigorenko & Sternberg, 1998). In this type of testing, children from diverse linguistic backgrounds or those with limited literacy experiences are prone to perform poorly (Bedore & Peña, 2008; Ginsborg, 2006; Sewell, 1987). When many children underperform on a test, the result is floor effects, which weaken the validity of a measure and make it difficult to distinguish those who are truly at-risk from those who have simply not had enough linguistic or educational experience to perform test tasks. This can result in failure to identify word reading difficulties early and provide intervention to prevent the long-term negative effects associated with word reading difficulty (Catts et al., 2009).
Given the limitations associated with SAs, interest in alternative approaches to early word reading assessment has been increasing. Dynamic assessment (DA) is a potential solution. While SAs purport to quantify what a child is capable of at the time of testing, DAs endeavor to examine a child’s current performance and their ability to learn a skill with support (Grigorenko & Sternberg, 1998). DA is characterized by provision of prompts, feedback, and interaction between the child and examiner within the assessment (Caffrey, 2006). This approach has been shown to reduce bias in testing and misidentification of difficulty because the impact of previous linguistic or educational experiences on test outcomes is minimized (Bedore & Peña, 2008; Petersen & Gillam, 2013). In the domain of word reading skills, DAs have been shown to predict unique variance beyond SAs in reading outcomes (Dixon et al., 2022a), to contribute to the accurate identification of reading difficulties (Dixon et al., 2022b), and to demonstrate strong concurrent validity with equivalent SAs and predictive validity with word reading outcome measures across typically developing, at-risk, bilingual, and monolingual children (Wood et al., 2023). However, these DAs are characterized by substantial heterogeneity in terms of their format, administration method, word type, and symbol type. The impact of these characteristics on the strength of a DA measure’s relationship with word reading measures has yet to be considered. The DA characteristics of interest are described below.
Dynamic Assessment Characteristics
Format
Broadly, there are two approaches to DA: interventionist and interactionist (Lantolf & Poehner, 2004). Interactionist or contingent DA is typically unscripted and endeavors to modify cognitive or skill ability. In this approach the examiner responds to the individual examinee and their capacities. Interventionist or non-contingent DA, however, more closely parallels traditional SA testing. The examiner provides pre-defined and increasingly explicit levels of support in response to student need. Its scripted nature requires less clinical skill and time to administer, and its standardization permits researchers to evaluate its validity (Poehner, 2008). In the field of word reading assessment, most studies have focused on developing and validating DAs that can be characterized as interventionist. The studies included in this review focus on two types of interventionist DA, which are utilized with similar frequency in DAs of word reading skills (Dixon et al., 2022b; Sternberg & Grigorenko, 2002).
The first is an approach pioneered by Milton Budoff (1987), which follows a (pre-test), train, re-test structure, referred to in this paper as the train/test (TT) format. This TT design consists of a static pre-test, followed by a dynamic teaching/training phase, and a final static post-test, though not all assessments incorporate the initial static pre-test. During the training phase, children receive feedback and instruction (e.g., encouragement to try again, a hint, or the correct solution). If a pre-test is conducted, outcomes of the post-test are compared to the child’s initial score to assess the difference in their performance following the teaching session. If no pre-test is conducted, the post-test serves as a measure of how a child performs a skill after receiving explicit dynamic instruction in the task.
In contrast, the second format combines teaching and testing phases of the assessment within each item (Campione & Brown, 1987). In this approach, referred to as the graduated prompts (GP) format, children are provided with feedback about whether they were correct or incorrect following their response. If incorrect, a series of increasingly explicit prompts is provided until the child answers correctly or all prompts are exhausted. Scoring is directly influenced by the individual’s performance; the greater the number of prompts required, the lower the score on an item (Brown & Ferrara, 1985; Campione et al., 1984). A meta-analysis found that interventionist DAs demonstrate stronger predictive validity than interactionist DAs (Caffrey et al., 2008), and a recent systematic review documented that within interventionist DAs, the GP and TT formats are used with similar frequency in assessment of word reading (Dixon et al., 2022b), but to date there has been no consideration of whether these formats differ in the strength of the relationship between the DA and word reading outcomes.
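The scoring logic of a GP item can be illustrated with a minimal sketch. It assumes a hypothetical scheme in which an item has a fixed number of available prompts, one point is lost for each prompt used, and an item left unsolved after all prompts scores zero. The function name, the point scale, and the example responses are assumptions made for illustration and are not taken from any DA included in this review.

```python
def score_gp_item(prompts_used: int, max_prompts: int, solved: bool) -> int:
    """Score one graduated-prompts item under a hypothetical scheme.

    A correct answer with no prompts earns max_prompts + 1 points, each
    prompt required lowers the score by one, and an item that is still
    unsolved after all prompts scores 0.
    """
    if not 0 <= prompts_used <= max_prompts:
        raise ValueError("prompts_used must be between 0 and max_prompts")
    if not solved:
        return 0
    return max_prompts + 1 - prompts_used


# Invented responses for one child: (prompts used, solved?), 3 prompts per item.
responses = [(0, True), (2, True), (3, False)]
total = sum(score_gp_item(used, 3, solved) for used, solved in responses)
print(total)
```

Under this assumed scheme, a child who needs fewer prompts earns a higher total, which mirrors the inverse relationship between prompts and score described above.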
Administration Method
Assessments, dynamic or otherwise, can be conducted in-person or via computer. Development of virtual or computer-based assessments of early literacy has become increasingly important, in the wake of the COVID-19 pandemic and the subsequent shift to distance/remote learning (Campbell & Goldstein, 2022; Tohidast et al., 2020). There is evidence to suggest that there are no significant differences between administering an SA online vs. in-person (e.g., Alfano et al., 2022; Nelson & Plante, 2022), but this factor has not been considered in the context of DAs, which can be administered in-person (e.g., Spector, 1992), virtually by an examiner through a computer (e.g., Barker & Saunders, 2020), or in a computerized fashion where no examiner is required to carry out the assessment (e.g., Aravena et al., 2018). As previously noted, DA is characterized by increased interaction between examiner and examinee, and as a result may be impacted to a greater extent by computer administration. Post-pandemic, many clinicians and researchers continue to operate virtually or in a hybrid format and therefore the factor of administration method should be considered.
Word Type
Assessments of the word reading skills of phonological awareness and decoding can also be differentiated by the type of words used in their items (i.e., real words or nonwords). Nonwords are words that do not exist in the language of testing but that abide by its phonotactic and orthotactic constraints (e.g., “meeb” in English). Commercially available SAs that are in widespread clinical and research use, such as the Comprehensive Test of Phonological Processing-2 (CTOPP-2; Wagner et al., 2013), which evaluates phonological awareness, or the Woodcock Reading Mastery Test, Third Edition (WRMT-III; Woodcock, 2011), which evaluates reading and decoding, include subtests with both words and nonwords. Similarly, commercially developed DAs, like the CUBED dynamic test of decoding, include a word reading and a nonword decoding measure (Petersen et al., 2016). However, many DAs have fewer subtests and are, at present, primarily used in research. These tests tend to employ either words or nonwords, not both. For example, Gellert and Elbro’s phoneme identification task uses real words (2017b), but their decoding measure uses nonwords (2017a).
In young children, reading words and reading nonwords are purported to tap into two different reading processes (Shapiro et al., 2013). There is evidence indicating that children may initially recognize some familiar, high frequency words by sight without activating their decoding skills (Ehri & Wilce, 1985). For example, children can often recognize their names or high frequency words like “the” in print without using knowledge of sound-symbol correspondences, phoneme blending and decoding skills. However, when it comes to reading nonwords, decoding skills are necessary to make sense of the written text because these words are categorically unfamiliar (Hoover & Tunmer, 1993). In this way, nonword decoding can be informative of word reading skills (e.g., sound-symbol knowledge, decoding) and not just a measure of word familiarity and recognition. Nonword repetition tasks have been shown to reduce bias against culturally and linguistically diverse children in the domain of oral language assessment (Ortiz, 2021). Nonword reading tasks may similarly reduce bias against those with different or limited literacy experiences. Children enter kindergarten with a wide range of literacy abilities that can be attributed to factors like linguistic diversity, but also their home literacy environment, access to books and libraries, or exposure to literacy instruction in preschool or daycare (Ackerman & Barnett, 2005).
Importantly, nonword reading tasks have not been found to disadvantage strong readers with advanced lexical knowledge (Castles et al., 2018). They can account for significant unique variance in word reading ability beyond word reading (e.g., Hogan et al., 2005). To date, no studies have considered the role of word type in the validity of DAs. It is possible that nonword tasks may be better suited to predict later reading ability in a DA paradigm, because DAs are designed to measure ability to learn rather than acquired knowledge, and this ability to learn can be more easily captured in a task with nonwords.
Symbol Type
Word reading assessments of sound-symbol knowledge and decoding can differ in the type of symbol used in their items (i.e., familiar letters and characters or novel symbols). Typically, SAs, which are developed and normed for a specific population, use the letters or characters of the language for which they were created. For instance, in the PAT-2 (Robertson & Salter, 2017) the phoneme-grapheme subtest (a measure of sound-symbol knowledge) evaluates a child’s acquired knowledge of the relationship between English letters and sounds, and the phoneme decoding subtest (a decoding measure) evaluates their ability to read nonwords comprised of English graphemes. However, in recent years there has been increased interest in using novel symbols in place of familiar letters or characters in DAs. Using unfamiliar symbols allows researchers and clinicians to determine how well a child can learn new symbol-sound relationships (e.g., that the symbol ◊= sound /m/) (Gellert & Elbro, 2017a), and apply this knowledge to decode symbol-based words (e.g., that the symbols ◊ ◘ = the nonword /ma/) (Gellert & Elbro, 2017a), all while minimizing the influence of previous linguistic and literacy exposure.
Measures that use novel symbols have been shown to differentiate between typical readers and those with dyslexia in adult populations (Elbro et al., 2012), and in children (Aravena et al., 2013, 2018). Additionally, studies of DAs that use novel symbols have documented that these measures can explain unique variance in later reading ability beyond traditional measures for preliterate children (Horbach et al., 2015). Outcomes from two recent studies that examined the capacity of a nonword decoding measure administered in kindergarten to predict reading difficulty in grade 1 suggest that the measure that used novel symbols (Gellert & Elbro, 2017a) had superior diagnostic accuracy compared with a task that used familiar letters (Petersen et al., 2016). However, the use of novel symbols is a recent development in the field of word reading assessment and there has not yet been a systematic quantitative examination of whether DAs that use novel symbols are comparably valid to those that use familiar letters or characters. It is possible that in DA, evaluating ability to learn sound-symbol knowledge (SSK) or decoding skills may be more easily achieved with novel symbols than with familiar letters.
Previous Research in Dynamic Assessment of Word Reading Skills
Reviews that evaluated the use of DAs reported promising findings on their utility and validity. Caffrey et al. (2008) found that DAs demonstrated greater predictive validity than SAs across several domains (e.g., DAs of cognitive ability, literacy, and mathematics). A more recent review focussing exclusively on the domain of word reading skills reported that DAs of phonological awareness and decoding demonstrate concurrent validity with SAs and predictive validity with word reading outcomes (Wood et al., 2023). In terms of format of DAs of word reading skills, no quantitative comparison has been conducted evaluating differences in associations with word reading outcomes, although findings suggest that there are no differences in classification accuracy of reading disorder for DAs that use a GP vs. TT format (Dixon et al., 2022b). Regarding administration method, SAs of oral language skills are not affected by computer vs. in-person delivery (Alfano et al., 2022). A recent review reported that computerised DAs were used less frequently than in-person measures in research (Dixon et al., 2022b). However, the effect of administration method and its implications for the validity of DAs have not yet been considered. To our knowledge no prior reviews have qualitatively or quantitatively examined whether word type (real word vs. nonword) and symbol type (novel vs. familiar) are associated with stronger correlational relationships with later word reading outcomes. In summary, DAs of word reading skills vary based on several characteristics, which could have implications for the validity of these measures.
These factors should be considered to inform clinical decision-making and development of novel DA measures.
The Current Study
The current systematic review and correlational meta-analysis investigates whether format, administration method, word type, and symbol type affect the validity of DAs of word reading skills. Like Caffrey et al. (2008), we use Pearson’s correlation coefficients as our effect size measure, given that these are the most commonly reported type of effect size in studies investigating DA measures. We focus exclusively on DAs of word reading skills (phonological awareness, sound-symbol knowledge, and decoding), as these skills demonstrate concurrent validity with SA counterparts and predictive validity with later word reading outcomes (Wood et al., 2023). In our analyses, we stratify DAs by their format (graduated prompts vs. train/test) and their administration method (in-person vs. computer vs. computerized; also see Dixon et al., 2022a; 2022b), and include stratifications by word type (real word vs. nonword) and symbol type (familiar vs. novel). Unlike all previous reviews, we conduct a comprehensive search of the grey literature and include studies published in languages other than English. The outcomes of this review will inform which administration method, word and symbol types used in DAs of word reading skills are associated with the strongest correlations with word reading measures and will have both clinical and research implications. For clinicians, it is critical to understand which characteristics of DAs are associated with stronger correlational relationships with word reading outcomes, to make informed choices about which tools to use in their practice. For researchers, a quantitative examination of how these factors affect the validity of DAs of word reading skills can inform development of high quality, novel tools, or revisions of existing measures.
Method
The review objectives and meta-analytic approach were planned a priori and detailed in a registered protocol on the Open Science Framework. This protocol is available online at https://osf.io/bcghx/ (Wood & Molnar, 2022).
Research Questions
Do DAs of word reading skills (phonological awareness (PA), sound-symbol knowledge (SSK), and decoding) demonstrate similarly strong correlations with word reading measures when stratified by:
A) Format (train-test (TT) vs. graduated prompts (GP))
B) Administration method (computer vs. in-person)
C) Word type (real word vs. nonword)
D) Symbol type (novel vs. familiar letters or characters)
Eligibility Criteria
Study inclusion criteria were determined a priori and outlined in the protocol on Open Science Framework (Wood & Molnar, 2022). All studies included in this review were:
(i) Primary research articles found in peer-reviewed journals, and unpublished grey literature found in preprint repositories and on Google Scholar. Systematic reviews, books or book chapters, case studies, commentaries, and editorials were excluded.
(ii) Studies that assessed children with a mean age between 4;0 and 10;0. Articles that included adults or children with developmental challenges, such as hearing impairment, developmental language disorder, or autism spectrum disorder were excluded.
(iii) Articles that reported a correlation coefficient between a DA of one of three word reading skills, and a static word reading measure, concurrently or longitudinally. This allowed for a comparison of the relationships between DAs of different format and administration methods with word reading outcome measures.
(iv) No limitation was placed on setting or geographical location, but only articles written in English, French, Spanish, or a different language with full text translations were included.
Search Strategy and Information Sources
The initial search was carried out in 5 databases, MEDLINE, Embase, CINAHL (Cumulative Index to Nursing and Allied Health Literature), PsycINFO and ERIC (Education Resources Information Centre), using the terms “dynamic assessment” and “literacy” as well as their related keywords in titles and abstracts. Filters were not used in the search process. A complete list of search terms used in each database can be found in Tables 1 and 2 of the supplemental files and online at https://osf.io/bcghx/.
Terms equivalent to “dynamic assessment” and “literacy” were searched in the MedRxiv, EdArxiv and PsyArxiv preprint repositories. Upon completion of the database and preprint repository search, the first author and a research assistant carried out forward searching of included articles on Google Scholar. Sources that cited each included article were located using the “cited by” function. To check whether any relevant articles had been missed during the database, preprint, and Google Scholar search, the first author and a research assistant reviewed the reference lists of the included articles and compared them with the list of included articles. Lastly, appeals for unpublished work were made on lab and researcher social media platforms and sent out twice to mailing lists and labs across Canada, the United States and Europe that reported conducting research in the field of literacy.
Data Collection
Data collection and extraction were managed in Covidence, a web-based software that facilitates completion of reviews (Covidence, 2023). A team of ten research assistants (RAs), trained by the first author, assisted in article screening and extraction. At the title/abstract stage, two independent team members voted to include or exclude based on relevance. In the full text stage, two reviewers voted to include or exclude articles based on whether they met pre-defined characteristics. The same team of ten RAs extracted the data from included studies using a custom template in Covidence. This template is available at https://osf.io/bcghx/. In all stages, the first author resolved conflicts.
Data Items
The following information was extracted from the included papers.
General Information
The study title, journal name, date of publication, DOI, author name(s), institutional affiliation(s), and the country in which the study took place were extracted. Study funding and any potential conflicts of interest were also noted.
Participants
The number of participants at the end of the study included in analyses (taking attrition into account for longitudinal studies), the percentage of males, the language(s) spoken by participants, as well as the mean age and grade level of the children at the outset of the study were noted.
Measures
Dynamic Assessment(s)
In this review, DA is defined as an assessment that provides teaching, training, feedback on performance, or prompting during testing. The research team reported the word reading skills evaluated (phonological awareness, sound-symbol knowledge, decoding, or a combination of these), as well as the type of task used to assess the skill (e.g., phonological awareness can be assessed by syllable or phoneme blending). If multiple tasks were used to evaluate a skill, coders listed all tasks utilized. Reviewers also noted the format of the DA (i.e., graduated prompts (GP) or train/test (TT)), which administration method was employed (i.e., in person or computer), and whether real words or nonwords, and novel or familiar symbols, were used.
Word Reading Measures (WRM)
For the purposes of this review, WRMs are assessments that measure ability to read single words using a standardized correct/incorrect grading system and without provision of feedback, prompting or teaching. WRMs were conducted concurrently with the DA or longitudinally at a later timepoint. Coders noted the name of the WRM and the subtest used (e.g., the Woodcock Reading Mastery Tests-III, Word Identification subtest), which word reading skill(s) was evaluated (e.g., single word reading accuracy), and whether this task used words or nonwords.
Effect Sizes
Pearson’s correlation coefficients representing the relationship between DAs and WRMs were extracted. Coders initially extracted all correlation coefficients between a DA and a WRM listed in a study (e.g., a DA that used multiple PA tasks to assess PA skills and multiple measures to evaluate word reading ability). Following review of extracted data points, the first, second and last author developed a set of decision rules for selecting a single effect size from each study so as not to violate the assumption of independence in the meta-analysis. This decision-making process was based on which measure was most frequently observed among the included studies (i.e., every included study utilized word reading accuracy as a WRM, while word reading fluency was scarcely used). In cases where tasks were observed with equal frequency, the choice was informed by theory. For example, research suggests that phoneme level tasks demonstrate stronger predictive validity than syllable or onset-rime level tasks, and so a phoneme deletion task would be preferred over a syllable deletion task. Effect sizes representing the relationship between DAs and WRMs are presented in Table 3 in the supplemental material. The Excel table with extracted data and the R script can be found at https://osf.io/bcghx/.
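The decision rule for keeping one effect size per study can be sketched roughly as follows. This is an illustration only and is not the authors’ R script: the study names, measure labels, correlation values, and the `tie_break_order` list standing in for the theory-informed choice are all invented for the example.

```python
from collections import Counter

# Invented example data: each study lists (WRM type, Pearson r) pairs.
studies = {
    "Study A": [("accuracy", 0.52), ("fluency", 0.47)],
    "Study B": [("accuracy", 0.61)],
    "Study C": [("fluency", 0.40), ("accuracy", 0.44)],
}


def select_one_effect(studies, tie_break_order):
    """Keep one correlation per study: the one whose measure type is
    reported by the largest number of studies."""
    # Count each measure type once per study, not once per coefficient.
    counts = Counter(
        measure
        for effects in studies.values()
        for measure in {m for m, _ in effects}
    )

    def rank(effect):
        measure, _ = effect
        # Most frequent measure first; ties fall back to the
        # (hypothetical) theory-based priority list.
        return (-counts[measure], tie_break_order.index(measure))

    return {name: min(effects, key=rank) for name, effects in studies.items()}


selected = select_one_effect(studies, tie_break_order=["accuracy", "fluency"])
print(selected)
```

Selecting exactly one coefficient per study in this way keeps the pooled effect sizes independent of one another.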
Quality Appraisal Assessment
Each included study was evaluated independently by two RAs using an adapted and amalgamated version of two quality assessment tools for (i) cross sectional design and (ii) diagnostic accuracy studies from the Joanna Briggs Institute (Moola et al., 2020). Studies were assessed on the following five areas: (i) participant selection, (ii) index assessments (DAs), (iii) reference assessments (WRMs), (iv) flow and timing of the study, and (v) statistical analysis.
First, coders rated whether the age, sex, and demographic characteristics of the participants were adequately described. The rating for the DA domain was informed by whether the tool was explained with adequate detail regarding the skills assessed, the format, the type of prompting and scoring used, and the method of administration. Coders also noted whether the word reading skill(s) assessed were developmentally appropriate for the sample population. The developmental appropriateness of assessment tools for evaluating word reading skills was also evaluated when rating the standards of reference assessments (WRMs). Additionally, coders rated whether the studies indicated the psychometric properties of the reference measures. To evaluate flow and timing, coders evaluated whether the analyses included all participants and if not, whether the author(s) provided adequate reasoning for attrition. Lastly, coders considered whether appropriate statistical analyses were conducted.
Overall, the quality appraisal consisted of 8 items to be rated over 5 domains. Items regarding participants, flow and timing, and statistical analyses were assigned one point, while items concerned with the index test (DA) and the reference tests (WRMs) were worth two points due to their greater significance in achieving the review objectives. The first author reviewed all ratings and resolved any conflicts, and the quality of each study was ranked on the following spectrum: low quality (0-33%), medium quality (34-66%) or high quality (67-100%). Only medium and high-quality studies were included in the analyses. No studies were excluded based on their score. Refer to Table 4 in the Supplemental Material for quality appraisal questions and ratings.
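The weighting and banding described above can be sketched as follows. The item weights per domain and the three quality bands come from the text; the allocation of the 8 items across domains, the rounding of percentages to the nearest whole number, and the example ratings are assumptions made for illustration.

```python
# Points per item by domain: index (DA) and reference (WRM) items count double.
DOMAIN_WEIGHTS = {
    "participants": 1,
    "index_test_da": 2,
    "reference_test_wrm": 2,
    "flow_and_timing": 1,
    "statistical_analysis": 1,
}


def quality_band(ratings):
    """Return (percentage, band) for a list of (domain, item_met) ratings."""
    earned = sum(DOMAIN_WEIGHTS[domain] for domain, met in ratings if met)
    possible = sum(DOMAIN_WEIGHTS[domain] for domain, _ in ratings)
    # Assumption: percentages are rounded before banding, so 33.3% is "low".
    pct = round(100 * earned / possible)
    if pct <= 33:
        band = "low"
    elif pct <= 66:
        band = "medium"
    else:
        band = "high"
    return pct, band


# Hypothetical study rated on 8 items (the item split per domain is assumed).
example_ratings = [
    ("participants", True),
    ("index_test_da", True),
    ("index_test_da", True),
    ("reference_test_wrm", True),
    ("reference_test_wrm", True),
    ("reference_test_wrm", False),
    ("flow_and_timing", False),
    ("statistical_analysis", True),
]
pct, band = quality_band(example_ratings)
print(pct, band)
```

In this invented example the study earns 10 of 13 possible points, so a missed reference-test item costs twice as much as a missed flow-and-timing item.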
Analyses
A random effects meta-analysis was conducted to account for between-study variance. A single coefficient from each study was selected to ensure compliance with the assumption of independence.
Coefficients were transformed into Z scores using the Fisher Z transformation with the ‘metacor’ package in RStudio (Laliberté, 2019; R Core Team, 2021). A weighted average of these scores was calculated, then transformed back to Pearson’s correlation coefficients for interpretation. Heterogeneity statistics of Q, I², and Tau² were calculated and reported. The Sidik-Jonkman estimator was used to calculate Tau². A Baujat plot (Figure 1 in supplemental material) was generated to determine which studies contributed most to heterogeneity (Baujat et al., 2002). Significant between-study variance was anticipated given differences in study design, participant factors, DA and WRM characteristics. To examine this heterogeneity, subgroup analyses by DA format (graduated prompts vs. train/test), administration method (computer vs. in-person), word type (real vs. nonword) and symbol type (novel vs. familiar letters/characters) were planned a priori. A funnel plot (Figure 2 in supplemental material) was generated, and Egger’s regression test was conducted to examine risk of publication bias (Egger et al., 1997).
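The general pipeline (Fisher Z transformation, weighted pooling, back-transformation, and heterogeneity statistics) can be illustrated with a minimal Python sketch. It is not the authors’ R analysis. In particular, it estimates Tau² with the simpler DerSimonian-Laird method rather than the Sidik-Jonkman estimator used in this review, so its numbers would differ from those reported, and the correlations and sample sizes are invented.

```python
import math


def random_effects_pooled_r(correlations, sample_sizes):
    """Pool Pearson correlations with a random-effects model on Fisher's z.

    Tau^2 is estimated with DerSimonian-Laird (an assumption for this
    sketch; the review itself used the Sidik-Jonkman estimator).
    """
    if len(correlations) < 2:
        raise ValueError("need at least two studies")

    # Fisher z transform; the sampling variance of z is 1 / (n - 3).
    z = [math.atanh(r) for r in correlations]
    v = [1.0 / (n - 3) for n in sample_sizes]

    # Fixed-effect weights and Cochran's Q.
    w = [1.0 / vi for vi in v]
    z_fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, z))
    df = len(z) - 1

    # Between-study variance (DerSimonian-Laird) and I^2 in percent.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Random-effects weights, pooled z, and back-transformation to r.
    w_re = [1.0 / (vi + tau2) for vi in v]
    z_re = sum(wi * zi for wi, zi in zip(w_re, z)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    ci95 = (math.tanh(z_re - 1.96 * se), math.tanh(z_re + 1.96 * se))
    return {"r": math.tanh(z_re), "ci95": ci95, "Q": q, "I2": i2, "tau2": tau2}


# Invented correlations and sample sizes for three hypothetical studies.
result = random_effects_pooled_r([0.45, 0.60, 0.52], [80, 120, 60])
print(result)
```

Pooling on the z scale rather than on raw correlations is what allows the inverse-variance weights to depend only on sample size.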
Preferred Reporting Items for Systematic Review and Meta-Analyses Flowchart
Note. N= Number of participants, DA= Dynamic Assessment, WRM = Word Reading Measure
Forest plot of random effects meta-analysis examining the relationship between dynamic assessments of word reading skills and word reading measures.
Note. Study names, sample size =N, effect sizes =COR, and 95% confidence intervals =CI (95%) are reported. The grey box associated with each study represents the weight allocated to each effect size, while the horizontal line that extends from either side of the box is a measure of the confidence interval (95%). The solid vertical line is the line of no effect while the dashed vertical line represents the significant overall mean effect size. The blue diamonds are an indication of the overall confidence interval, and the black bar represents the prediction interval. Figure drawn in R using ‘metacor’ package (R Core Team, 2021; Laliberté, 2019).
Results
Study Selection
The database search yielded 4824 articles, of which 21 were included. Three preprint repositories were searched, yielding 1 additional included article. The 22 included articles were then subjected to forward searching using the “cited by” function in Google Scholar, which led to identification of an additional 7 articles. The reference lists of these 29 articles were reviewed to determine if there were relevant articles that had been missed. One additional article was identified. Callouts were made for unpublished studies or data to mailing lists, via posts on social media and by directly contacting labs conducting literacy related research across Canada, the United States and Europe, but no relevant articles were identified via this process. In summary, 30 articles including a total of 32 studies met the criteria for inclusion. The study identification process, including reasons for study exclusion such as incorrect population type (e.g., den Ouden, 2019), is outlined in the PRISMA diagram below (Page et al., 2021).
Study Characteristics
Participant Characteristics
A total of 6225 participants were included across 32 studies from 30 articles. The overall mean age of participants was 5 years 8 months, and the overall average % of males was 49.27%. Table 1 below provides additional details regarding the mean age and % of males across subgroups stratified by DA characteristics.
Study Location and Language
Most studies were conducted in the United States (n=17), followed by The Netherlands (n=2), Germany (n=2), Denmark (n=2), England (n=2), China (n=2), and Hong Kong, Finland, Belgium, Spain, and Singapore (n=1 each). The most common language profile across studies was monolingual English speakers (n=15), followed by bilingual English/Other speakers (n=6), and monolingual Danish (n=4), German (n=2), Mandarin (n=2), Finnish (n=1), Dutch (n=1) and Spanish (n=1) speakers. Refer to Table 2 for additional information about characteristics of participants in included studies.
Country, Number of Participants, Mean Age, Grade, % Males, Language Status, Reading Status, Study Design, Type and Characteristics of DAs, SAs and WR Outcome Measures of Included Studies.
Dynamic Assessments
Of the 32 included studies, 11 examined a DA of PA, 5 a DA of SSK and 16 a DA of decoding. Eleven DAs were administered via computer, either by a person or in a computerized program, while 21 were conducted in-person. Seventeen studies employed a GP format, 14 used a TT approach and 1 used a game-based method that was neither GP nor TT. Most studies used explicit verbal feedback (n=29) while only 3 used implicit feedback in the context of a game. Given their nature, DAs of SSK used neither words nor nonwords, only symbols and sounds or syllables. However, of the 27 studies that examined a DA of PA or decoding, 15 used real words and the remaining 12 used nonwords. Studies used either novel symbols or familiar letters and characters in their DAs of SSK and decoding. Given their auditory nature, PA tasks used neither symbols nor letters. Of the 21 DAs of SSK and decoding, 12 used novel symbols, while 9 used familiar letters or characters. For DA details, refer to Table 2.
Word Reading Measures
WRMs used in included studies were characterized as either norm-referenced or researcher-developed. Most studies (n=27) used norm-referenced measures, while fewer (n=5) used a researcher-developed tool. The norm-referenced measures used included versions of the Woodcock Reading Mastery Test (either WRMT-R, WRMT-RNU, or WRMT-III) (n=9), the Test of Word Reading Efficiency (n=3), the Woodcock Johnson-III (n=2), the WRAT (n=2), the Woodcock Muñoz Language Survey-Revised (n=2), the Salzburger Lese und Rechtschreib Test-III (n=2), the One Minute Test (n=2), and the Test de Análisis de la Lecto-Escritura, British Ability Scales-2, Single Word Reading Test, Lukilasse, 3DM, and San Diego Quick Assessment (n=1 each). Most WRMs evaluated word reading accuracy (n=25), though several assessed both accuracy and speed (n=7). Finally, the majority of WRMs used real word reading tasks (n=28), apart from two studies that used a combination of word and nonword reading subtests, and two that used exclusively nonword reading tasks. Refer to Table 2 for details regarding WRMs.
Research Question: Do DAs of word reading skills demonstrate consistent relationships with word reading measures when stratified by administration method, format, word, and symbol type?
Thirty-two studies from 30 articles reported correlations between a DA of word reading skills (phonological awareness, sound-symbol knowledge, or decoding) and a WRM (see Table 3 in supplemental material). The effect sizes from these 32 studies were included in the correlational random effects meta-analysis examining the relationship between DAs of word reading skills and WRMs, and results are displayed in Figure 2. The overall mean effect size is large (r=0.54, 95%CI = [0.47-0.60]), suggesting that DAs of word reading skills are strongly correlated with WRMs. The prediction interval ranged from r=0.10 to r=0.90, suggesting that future relevant studies would be likely to find a positive correlation. As expected, significant heterogeneity was detected (Q=230.22, p<0.01), indicating that a substantial amount of heterogeneity can be attributed to true between-study variance rather than sampling error. The effect sizes are presented in subgroups according to DA type. Findings of a previous review examining validity of DAs by word reading skill type (Wood et al., 2023) are replicated here, with DAs of PA and decoding demonstrating narrower positive prediction intervals and significantly stronger correlations with WRMs than DAs of SSK (Q=11.01, df=2, p<0.01). As in previous analyses, even after subgroup analysis by DA type, significant residual heterogeneity was detected (Q=131.24, p<0.01). Additional subgroup analyses by DA format, administration method, word type and symbol type were planned a priori and were conducted to further examine this heterogeneity.
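To make the heterogeneity statistics above more concrete, the following Python sketch shows how two of the reported values can be checked by hand. It is an illustration under stated assumptions, not the authors' R code: it uses the Higgins formula I² = (Q − df) / Q with df = k − 1 = 31, which may differ from the I² reported in the tables if the software derived I² from Tau² instead, and it uses the closed-form upper tail of a chi-square distribution with exactly 2 degrees of freedom. The function names are hypothetical.

```python
import math

def i_squared(q, df):
    """Higgins I^2 (in percent) from Cochran's Q and its degrees of freedom."""
    if q <= 0:
        return 0.0
    return 100.0 * max(0.0, (q - df) / q)

def chi2_upper_tail_df2(q):
    """Upper-tail p-value of a chi-square statistic with exactly 2 df.

    For 2 df the survival function has the closed form exp(-q / 2).
    """
    return math.exp(-q / 2.0)

# Values taken from the text: overall Q = 230.22 across k = 32 studies,
# and the between-subgroup test by DA type, Q = 11.01 with df = 2.
overall_i2 = i_squared(230.22, 32 - 1)
p_between = chi2_upper_tail_df2(11.01)

print(f"I^2 is roughly {overall_i2:.1f}%")
print(f"p for the DA-type subgroup test is roughly {p_between:.4f}")
```

Under these assumptions, the subgroup p-value falls below 0.01, in line with the reported result, and the I² value indicates that most of the observed variance reflects between-study differences.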
Subgroup Analyses
Mixed effects models were used to examine whether there were significant differences in mean effect sizes for DAs based on their A) administration method, B) format, C) word type and D) symbol type. Results of these subgroup analyses are reported in Table 3 below.
There is no significant difference in the strength of correlation between DAs of word reading skills and WRMs based on the DA format (p=0.14) or administration method (p=0.21). However, mean effect sizes for DAs that used a TT format and those that were conducted via computer were moderate (r=.46 and r=.47, respectively), while effect sizes for DAs that used a GP format and that were conducted in-person were large (r=.59 and r=.57, respectively). Furthermore, the prediction intervals for DAs that used a GP format and that were conducted in-person did not cross 0, suggesting that future relevant studies may be more likely to document positive correlations for DAs with these characteristics. Significant differences were found in terms of strength of correlation between DAs and WRMs based on the type of word (p<0.01) and symbol (p=0.01) used. Results suggest that DAs that use nonwords (r=.57) and those that use familiar symbols (r=.62) are more strongly correlated with WRMs than those that use real words (r=.50) or those that use novel symbols (r=.42).
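The logic of a two-group subgroup comparison of this kind can be sketched as follows. This Python example is a simplified illustration rather than the mixed effects models fitted in R for this review: it assumes each subgroup has already been pooled (on the Fisher z scale) into one estimate with a standard error, and it tests the difference with a between-subgroup Q statistic on 1 degree of freedom. The function name and the numeric inputs in the usage note are hypothetical, since subgroup standard errors are not given in the text.

```python
import math

def two_subgroup_q_test(estimates, std_errors):
    """Between-subgroup Q test for exactly two pooled subgroup estimates.

    estimates:  pooled effect of each subgroup (e.g., on the Fisher z scale)
    std_errors: standard error of each pooled effect
    Returns (Q_between, p_value) with 1 degree of freedom.
    """
    if len(estimates) != 2 or len(std_errors) != 2:
        raise ValueError("this sketch handles exactly two subgroups")
    weights = [1.0 / se ** 2 for se in std_errors]
    combined = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q_between = sum(w * (e - combined) ** 2 for w, e in zip(weights, estimates))
    # For 1 df, P(chi-square > q) = erfc(sqrt(q / 2)).
    p_value = math.erfc(math.sqrt(q_between / 2.0))
    return q_between, p_value
```

With made-up inputs such as `two_subgroup_q_test([0.5, 0.7], [0.05, 0.05])`, the two subgroup means differ by four standard errors of either estimate, so the test rejects the hypothesis of equal subgroup effects at the 0.01 level. Identical subgroup estimates give Q = 0 and p = 1.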
Risk of Publication Bias
A funnel plot was generated to subjectively examine risk of publication bias in the meta-analysis of the relationship between DAs of word reading skills and WRMs. This plot is presented in Figure 2 in the supplemental material. Visual inspection of the funnel plot suggests potential asymmetry. More studies with small sample sizes and positive findings were identified and included than studies with small sample sizes and negative findings (e.g., Horbach et al., 2018; Loreti, 2013; Wyman Chin, 2018). This suggests that there is a possibility that studies with negative outcomes were not completed, published, or submitted into the grey literature (Lee & Hotopf, 2012). To objectively investigate risk of publication bias, Egger’s test was calculated. Despite the apparent visual asymmetry, this test was not significant for presence of plot asymmetry (Intercept = 1.4922, 95%CI = [-0.14, 3.12], p=.08). It is plausible that the word reading assessment constructs are reliably correlated with word reading outcome measures in young children, and therefore negative correlations between the two would not be anticipated. Overall, these findings suggest that the risk of publication bias in this analysis is low.
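For readers unfamiliar with Egger’s regression test, the idea can be sketched briefly. The following Python example is a minimal illustration, not the R routine used in this review: it regresses each study's standardized effect (effect divided by its standard error) on its precision (1 divided by the standard error) using ordinary least squares, and returns the intercept, which is near zero when the funnel plot is symmetric. The function name and the example inputs are hypothetical, and the p-value (which requires a t distribution) is deliberately left out.

```python
import math

def egger_intercept(effects, std_errors):
    """Intercept of Egger's regression and its standard error.

    effects:    study effect sizes (e.g., Fisher z values)
    std_errors: matching standard errors
    Requires at least 3 studies with differing standard errors.
    """
    n = len(effects)
    x = [1.0 / se for se in std_errors]                  # precision
    y = [e / se for e, se in zip(effects, std_errors)]   # standardized effect
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual variance and the usual OLS standard error of the intercept.
    resid_ss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_intercept = math.sqrt(resid_ss / (n - 2) * (1.0 / n + x_bar ** 2 / sxx))
    return intercept, se_intercept
```

When every study reports the same effect regardless of its size, the intercept is zero; when smaller studies (larger standard errors) report systematically larger effects, the intercept becomes positive, which is the pattern associated with possible publication bias.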
Discussion
This review examined whether characteristics of dynamic assessments (DAs) of word reading skills (phonological awareness, sound-symbol knowledge, and decoding) affected the strength of the correlational relationship between the DA and performance on a word reading measure (WRM). Thirty-two studies from 30 articles met inclusion criteria of evaluating children between the ages of 4 and 10 and reporting a correlation coefficient between a DA of a word reading skill and a WRM. Results of the overall meta-analysis were consistent with previous findings and suggested that DAs of word reading are strongly correlated with WRMs (Wood et al., 2023).
The subgroup analysis evaluating DA format found no significant differences between the graduated prompts (GP) and train-test (TT) approaches. However, mean effect sizes were larger, and prediction intervals were narrower, for DAs that employed a GP approach versus those that used the TT format. These results are consistent with findings from a previous review (Caffrey et al., 2008), which found that DAs that used contingent feedback demonstrated stronger predictive validity than those that used non-contingent feedback. GP DAs are typically very scripted and use highly contingent feedback, employing a series of pre-defined, increasingly explicit prompts following an examinee’s response (e.g., Spector’s (1992) use of 6 pre-defined prompts). Many of the TT DAs also used non-contingent feedback (e.g., Horbach et al. used non-contingent scripted verbal feedback in the learning phase of their DA).
However, the feedback in the training and teaching phase of TT DAs was characterized by greater variability. For example, in Petersen & Gillam’s (2015) study examining a DA of nonword decoding, examiners used non-contingent feedback to teach children how to read the nonwords if they were unsuccessful in the initial pre-test phase. This increased variability in feedback may have contributed to the weaker relationship between TT DAs and WRMs.
A secondary subgroup analysis examining the role of administration method in DAs of word reading skills found that there were no significant differences between those administered in-person, vs. those conducted via computer. These findings are consistent with a previous review which reported no significant differences between in-person and computer administration methods for static assessments (SAs) across domains (Alfano et al., 2022). However, the mean effect size for DAs conducted in-person was larger and had a narrower prediction interval that did not cross zero. The weaker mean effect size and wider prediction interval associated with computer administration may be a result of two factors.
First, all WRMs were conducted in-person, which may have impacted the strength of the relationship between the two measures. It is promising that despite this difference in administration method, computer DAs still demonstrated moderate mean effect sizes with in-person WRMs. Second, as posited earlier, it is possible that because DAs are characterized by increased interaction between examiner and examinee relative to SAs, administering them via computer may result in a reduced ability to engage in meaningful interaction or provision of accurate feedback. Similar challenges (e.g., technical issues disrupting assessment, the need for caregiver support in evaluation, and difficulties associated with providing feedback and maintaining child engagement) have been documented in the literature examining computer and virtual use of SAs (e.g., Hodge et al., 2019; Wood et al., 2021). However, results from these studies and this review would indicate that, much like SAs, computer-based or virtual DAs are a valid alternative to in-person administration.
Finally, two additional subgroup analyses examined the impact of word and symbol type in DAs of word reading skills. Results indicate that DAs that use nonwords and those that use familiar letters or characters demonstrate significantly greater mean effect sizes than those that use real words and those that employ novel symbols. These results differ from previous findings, which suggested that nonwords were too distant from real word reading to be valid and were impractical for beginning readers who lack the necessary skills to decode (Wagner et al., 1997). However, as previously stated, it is possible that use of nonwords in DAs permits evaluation of a child’s ability to learn, since all children are unfamiliar with them and cannot use previous knowledge or experiences to guess or recognize words in testing (Hoover & Tunmer, 1993). The trend of unfamiliarity of test items leading to increased capacity to evaluate ability to learn was not reflected in the subgroup analysis by symbol type, given that DAs that used familiar letters or characters were associated with stronger mean effect sizes than those that used novel symbols. We hypothesize that this may be a result of the types of symbols used in these DAs. For instance, some used real letters and characters from a different language in their test items (e.g., Aravena et al., 2013 used Hebrew characters in evaluating Dutch children). Others, however, used symbols that did not resemble any letter or character in an existing script (e.g., Horbach et al. used dots and dashes to represent the syllable-sound correspondences in their DA measure). True letters and characters, whether familiar or unfamiliar, exhibit features and characteristics that allow them to be differentiated from scribbles or symbols (Dehaene, 2009; Heimann et al., 2013). It is possible that unfamiliar symbols that minimally resemble real characters or letters may be better suited to predict word reading ability.
Limitations
Correlation coefficients were selected as the measure of effect size because they were the most commonly reported statistic across studies. While this allowed for inclusion of a higher number of studies, it also means that only correlational inferences can be made about the results reported in this review. It is possible that relevant studies may not have been identified because they were published in a language that our review team was not able to read (e.g., many studies in Korean and Hebrew were excluded in the title and abstract screening phase), or because they used key terms not captured by our search strategy. While it has been suggested that some ‘paired associate learning’ (PAL) tasks should be considered DA measures (Dixon et al., 2022), our team elected not to include PAL tasks, because most of these tasks are not dynamic in nature. Additionally, despite examination of four moderators via subgroup analysis, residual heterogeneity within groups was still significant. This suggests that other factors that were not examined in this review may be contributing to the overall strength of relationship between DAs of word reading skills and word reading measures.
Clinical Implications
The results of this systematic review and meta-analysis have implications for clinicians, such as speech-language pathologists and psychologists, and for educators who routinely evaluate word reading skills. Outcomes support the use of both graduated prompts and train/test formats of DA, conducted either in-person or via computer. This is particularly relevant post-pandemic, as many professionals continue to evaluate children in a virtual context. Findings also suggest that clinicians may wish to favour DAs of word reading that use nonwords and familiar letters over those that use real words or novel symbols, as these characteristics were associated with significantly stronger correlational relationships with word reading outcomes.
Future Research
Results of this study can inform development of novel DAs of word reading skills, or revisions of existing tools. It will be important for researchers to directly compare DAs with differing characteristics, using research designs and statistical analyses that permit a better understanding of the causal role these factors play. This can be achieved through longitudinal studies comparing the relative predictive validity of DAs that differ in their format, administration method, word and symbol type, or other relevant factors via regression or structural equation modelling. Finally, studies should explicitly examine whether specific characteristics of DAs of word reading skills have a greater capacity to limit floor effects associated with traditional static measures or result in improved diagnostic accuracy.
Ideally, these studies should include populations for whom DA is purported to be most useful, particularly bilingual children and those with limited previous literacy experiences.
Author Notes
The authors declare no conflicts of interest at the time of publication.
Data Availability
All data produced are available online at https://osf.io/bcghx/
Data Availability Statement
Additional supplemental material (full list of search terms, Excel files with extracted correlation coefficients, and R code for correlational meta-analyses and subgroup analyses) is available on the Open Science Framework at https://osf.io/bcghx.
Search terms for concept 1 – Dynamic assessment
Search terms for concept 2 – Literacy
Effect Sizes Representing the Relationship between Dynamic Assessments of Word Reading Skills (Phonological Awareness, Sound-Symbol Knowledge, and Decoding) and Single Word Reading Measures
Quality Appraisal of Included Studies
Baujat plot for studies included in the meta-analysis of the relationship between dynamic assessments of phonological awareness and word reading outcome measures
Note. In the Baujat plot, individual contribution to overall heterogeneity is represented on the horizontal axis, and the influence on overall result on the vertical axis. Studies with greatest influence are found in the top right quadrant of the figure. Drawn in R using the ‘metacor’ package (R Core Team, 2021; Laliberté, 2019).
Funnel plot of studies included in the meta-analyses of the relationships between dynamic assessments of word reading skills (phonological awareness, sound-symbol-knowledge, and decoding) and word reading measures
Note. In the funnel plot, individual Fisher z transformed effect sizes are presented on the horizontal axis, and the standard error on the vertical axis. Studies with smaller standard errors (larger studies) are found closer to the top of the plot. Drawn in R using the ‘metacor’ package (R Core Team, 2021; Laliberté, 2019).
Acknowledgements
This systematic review and meta-analysis is funded by a Canada Graduate Scholarship-Master’s grant from the Social Sciences and Humanities Research Council of Canada, at the Rehabilitation Sciences Institute at the University of Toronto and an Ontario Graduate Scholarship from the Ministry of Colleges and Universities, awarded to EW, by a University of Toronto Excellence Award, awarded to KB and by a Natural Sciences and Engineering Research Council of Canada grant awarded to MM (RGPIN-2019-06523).