What Is the Relationship among NAEP Scores, Educational Policy, and Classroom Practice?

Annually, the media, the public, and political leaders overreact to and misrepresent the SAT and ACT scores released from across the US. Most notably, despite years of warnings from the College Board against the practice, many persist in ranking states by average state scores, ignoring that vastly different populations are being incorrectly compared.

These media, public, and political reactions to SAT and ACT scores are premature and superficial, but the one recurring conclusion that would be fair to emphasize is that, as with all standardized test data, the most persistent correlations with these scores are the socio-economic status of the students’ families and the educational attainment of their parents.

Over many decades of test scores, in fact, educational policy and classroom practices have changed many times, and the consistency of those policies and practices has been significantly lacking and almost entirely unexamined.

For example, when test scores fell in California in the late 1980s and early 1990s, the media, public, and political leaders all blamed the state’s shift to whole language as the official reading policy.

This was a compelling narrative that, as I noted above, proved to be premature and superficial, relying on the most basic assumptions of correlation. A more careful analysis exposed two powerful facts. First, California test scores were far more likely to have dropped because of drastic cuts to educational funding and a significant influx of English language learners. Second (and here is a very important point), even as whole language was the official reading policy of the state, few teachers were implementing whole language in their classrooms.

This last point cannot be emphasized enough: throughout the history of public education, because teaching is mostly a disempowered profession (primarily staffed by women), one recurring phenomenon is that teachers often shut their doors and teach—claiming their professional autonomy by resisting official policy.

November 2019 has brought us a similar and expected round of outlandish and unsupported claims about NAEP data. With reading scores trending downward since 2017, this round is characterized by sky-is-falling political histrionics and hollow fist pounding, with NAEP scores said to have proven policies a success or a failure (depending on the agenda).

If we slip back in time just a couple of decades, when the George W. Bush administration heralded the “Texas miracle” as a template for No Child Left Behind, we witnessed a transition from state-based educational accountability to federal accountability. But this moment in political history also raised the stakes on scientifically based educational policy and practice.

Specifically, the National Reading Panel was charged with identifying the highest quality research in effective reading programs and practices. (As a note, while the NRP touted its findings as scientific, many, including a member of the panel itself [1], have discredited the quality of the findings and accurately cautioned against political misuse of the findings to drive policy.)

Here is where our trip back in history may sound familiar during this current season of NAEP hand wringing. While Secretary of Education (2005-2009), Margaret Spellings announced that a jump of 7 points in NAEP reading scores from 1999 to 2005 was proof No Child Left Behind was working. The problem, however, was in the details:

[W]hen then-Secretary Spellings announced that test scores were proving NCLB a success, Gerald Bracey and Stephen Krashen exposed one of two possible problems with the data. Spellings either did not understand basic statistics or was misleading for political gain. Krashen detailed the deception or ineptitude by showing that the gain Spellings noted did occur from 1999 to 2005, a change of seven points. But he also revealed that the scores rose as follows: 1999 = 212; 2000 = 213; 2002 = 219; 2003 = 218; 2005 = 219. The jump Spellings used to promote NCLB and Reading First occurred from 2000 to 2002, before the implementation of Reading First. Krashen notes even more problems with claiming success for NCLB and Reading First, including:

“Bracey (2006) also notes that it is very unlikely that many Reading First children were included in the NAEP assessments in 2004 (and even 2005). NAEP is given to nine year olds, but RF is directed at grade three and lower. Many RF programs did not begin until late in 2003; in fact, Bracey notes that the application package for RF was not available until April, 2002.”
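The arithmetic behind Krashen's point is easy to check. The short sketch below uses only the five scores quoted above and splits the seven-point gain into changes between consecutive assessments; the variable names are mine, chosen for illustration.

```python
# NAEP reading scores as quoted above (year -> average scale score).
scores = {1999: 212, 2000: 213, 2002: 219, 2003: 218, 2005: 219}

years = sorted(scores)

# Change between each pair of consecutive assessments.
changes = {
    (start, end): scores[end] - scores[start]
    for start, end in zip(years, years[1:])
}

# The headline figure: the gain over the whole 1999-2005 window.
total_gain = scores[2005] - scores[1999]

# The single interval that contributes the most to that gain.
largest_interval = max(changes, key=changes.get)

# Net change after 2002, the period in which Reading First could have mattered.
gain_after_2002 = scores[2005] - scores[2002]

for (start, end), delta in changes.items():
    print(f"{start}-{end}: {delta:+d}")
print("total:", total_gain, "| largest:", largest_interval,
      "| after 2002:", gain_after_2002)
```

Six of the seven points fall in the 2000 to 2002 interval, and the net change from 2002 to 2005 is zero, which is the pattern the quoted passage describes.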

Jump to the 2019 NAEP data release to hear Secretary of Education Betsy DeVos shout that the sky is falling and public education needs more school choice, without a shred of scientific evidence establishing causal relationships of any kind among test data, educational policy, and classroom practice.

But an even better example has been unmasked by Gary Rubinstein, who discredits the claim by Louisiana’s Chief of Change John White (praised by former Secretary of Education Arne Duncan) that his educational policy changes caused the state’s NAEP gain in math:

So while, yes, Louisiana’s 8th grade math NAEP in 2017 was 267 and their 8th grade math NAEP in 2019 was 272 which was a 5 point gain in that two year period and while that was the highest gain over that two year period for any state, if you go back instead to their scores from 2007, way before their reform effort happened, you will find that in the 12 year period from 2007 to 2019, Louisiana did not lead the nation in 8th grade NAEP gains. In fact, Louisiana went DOWN from a scale score of 272.39 in 2007 to a scale score of 271.64 in 2019 on that test. Compared to the rest of the country in that 12 year period. This means that in that 12 year period, they are 33rd in ‘growth’ (is it even fair to call negative growth ‘growth’?). The issue was that from 2007 to 2015, Louisiana ranked second to last on ‘growth’ in 8th grade math. Failing to mention that relevant detail when bragging about your growth from 2017 to 2019 is very sneaky.
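Rubinstein's point about baselines can be reproduced with the figures he cites. This is a minimal sketch: it keeps the rounded 2017 and 2019 scores separate from the unrounded 2007 and 2019 scores, because the quote reports them at different precision, and the helper name `change` is my own.

```python
# Louisiana 8th grade math NAEP figures as quoted by Rubinstein.
rounded = {2017: 267, 2019: 272}          # rounded scale scores
unrounded = {2007: 272.39, 2019: 271.64}  # scale scores to two decimals


def change(scores, start_year, end_year):
    """Return the score change from start_year to end_year."""
    return scores[end_year] - scores[start_year]


# Two-year window used to claim success.
short_window = change(rounded, 2017, 2019)

# Twelve-year window that begins before the reform effort.
long_window = change(unrounded, 2007, 2019)

print(f"2017-2019: {short_window:+d}")
print(f"2007-2019: {long_window:+.2f}")
```

The same 2019 score reads as a five-point gain or as a loss of about three quarters of a point, depending only on which starting year is chosen.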

The media and public join right in with this political playbook that has persisted since the early 1980s: Claim that public education is failing, blame an ever-changing cause for that failure (low standards, public schools as monopolies, teacher quality, etc.), promote reform and change that includes “scientific evidence” and “research,” and then make unscientific claims of success (or yet more failure) based on simplistic correlation while offering no credible or complex research to support those claims.

Here is the problem, then: What is the relationship among NAEP scores, educational policy, and classroom practice?

There are only a couple fair responses.

First, 2019 NAEP data replicate a historical fact of standardized testing in the US: the strongest and most persistent correlations with those data are the socio-economic status of the students, their families, and the states. When students or average state data do not conform to that norm, these are outliers that may or may not provide evidence for replication or scaling up. However, you must consider the next point as well.

Second, as Rubinstein shows, the best way to draw causal relationships among NAEP data, educational policy, and classroom practices is to use longitudinal data; I would recommend at least 20 years (reaching back to NCLB), but 30 years would add in a larger section of the accountability era that began in the 1980s but was in wide application across almost all states by the 1990s.

The longitudinal data would next have to be aligned, state by state, with the educational policy in math and reading that was in effect at each round of NAEP testing.

As Bracey and Krashen cautioned, that alignment would have to account accurately for when the policy was implemented and allow enough time to claim that the change affected the sample of students taking NAEP.
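The timing caution from Bracey and Krashen can be made concrete with a rough sketch. Everything here is an assumption for illustration: the 2003 start year comes from Bracey's remark that many Reading First programs did not begin until late in 2003, the four-year cap on schooling before the assessment is hypothetical, and counting whole calendar years is a crude stand-in for real enrollment and implementation records.

```python
def years_of_exposure(policy_start_year, test_year, max_years):
    """Whole years a tested cohort could have spent under a policy.

    Returns 0 when the policy began in or after the test year, and is
    capped at max_years, the (assumed) years of schooling before the test.
    """
    elapsed = test_year - policy_start_year
    return max(0, min(elapsed, max_years))


# Hypothetical inputs: policy start taken from the Bracey quote above,
# max_years chosen only for illustration.
POLICY_START = 2003
MAX_YEARS_BEFORE_TEST = 4

naep_years = [2002, 2003, 2005]
exposure = {
    year: years_of_exposure(POLICY_START, year, MAX_YEARS_BEFORE_TEST)
    for year in naep_years
}

for year, exposed in exposure.items():
    print(f"{year}: up to {exposed} year(s) under the policy")
```

Under these assumptions, students tested in 2002 and 2003 had no exposure at all, and those tested in 2005 had at most two years, so a score change in the earlier rounds cannot be credited to the policy.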

But that isn’t all, as complex and overwhelming as this process already is.

We must address the lesson from the so-called whole language collapse in California by documenting whether or not classroom practice implemented state policy with some measurable level of fidelity.

This process is a herculean task, and no one has had the time to examine 2019 NAEP data in any credible way to make valid causal claims about the scores and the impact of educational policy and classroom practice.

What seems fair to acknowledge, however, is that there is no decade over the past 100 years when the media, public, and political leaders deemed test scores successful, regardless of the myriad of changes to policies and practices.

Over the history of public education, also, before and after the accountability era began, student achievement in the US has been mostly a reflection of socio-economic factors, and less a reflection of student effort, teacher quality, or any educational policies or practices.

If NAEP data mean anything, and I am prone to say they are much ado about nothing, we simply do not know what that is because we have chosen political rhetoric over the scientific process and research that could give us the answers.

[1] See:

Babes in the Woods: The Wanderings of the National Reading Panel, Joanne Yatvin

Did Reading First Work?, Stephen Krashen

My Experiences in Teaching Reading and Being a Member of the National Reading Panel, Joanne Yatvin

I Told You So! The Misinterpretation and Misuse of The National Reading Panel Report, Joanne Yatvin

The Enduring Influence of the National Reading Panel (and the “D” Word)