Daniel Willingham--Science & Education
Hypothesis non fingo

How to abuse standardized tests

4/9/2012

 
The insidious thing about tests is that they seem so straightforward. I write a bunch of questions. My students try to answer them. And so I find out who knows more and who knows less.

But if you have even a minimal knowledge of the field of psychometrics, you know that things are not so simple.

And if you lack that minimal knowledge, Howard Wainer would like a word with you.

Wainer is a psychometrician who spent many years at the Educational Testing Service and now works at the National Board of Medical Examiners. He describes himself as the kind of guy who shouts back at the television when he sees something to do with standardized testing that he regards as foolish. These one-way shouting matches occur with some regularity, and Wainer decided to record his thoughts more formally.

The result is an accessible book, Uneducated Guesses, explaining the source of his ire on ten current topics in testing. It makes for an interesting read for anyone with even a minimal interest in the subject.

For example, consider making a standardized test like the SAT or ACT optional for college applicants, a practice that seems egalitarian and surely harmless. Bowdoin College has made the SAT optional since 1969. Wainer points out the drawback--useful information about the likelihood that students will succeed at Bowdoin is omitted.

Here's the analysis. Students who didn't submit SAT scores with their applications nevertheless took the test; they simply didn't send Bowdoin the scores. Wainer finds that, not surprisingly, students who chose not to submit their scores did worse than those who did, by about 120 points.

[Figure taken from Wainer's blog.]

Wainer also finds that those who didn't submit their scores had worse GPAs in their freshman year, and by about the amount that one would predict, based on the lower scores.

So although one might reject the use of a standardized admissions test out of some conviction, if the job of admissions officers at Bowdoin is to predict how students will fare there, they are leaving useful information on the table.
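To see why a roughly 0.2-point GPA gap is "about the amount one would predict," here is a back-of-the-envelope sketch. The 120-point score gap comes from Wainer's analysis above; the correlation and the standard deviations are illustrative assumptions, not Bowdoin's actual figures.

```python
# Back-of-the-envelope check of "about the amount one would predict."
# The 120-point SAT gap comes from Wainer's analysis; the correlation and
# the standard deviations below are illustrative assumptions, not Bowdoin data.

r_sat_gpa = 0.40    # assumed SAT / first-year-GPA correlation (raw values run roughly .35-.45)
sd_sat    = 100.0   # assumed standard deviation of SAT scores
sd_gpa    = 0.60    # assumed standard deviation of first-year GPA
sat_gap   = 120.0   # score gap between submitters and non-submitters (from Wainer)

# Slope of the regression of GPA on SAT: r * (sd_gpa / sd_sat)
slope = r_sat_gpa * (sd_gpa / sd_sat)

# First-year GPA gap implied by the SAT gap
predicted_gpa_gap = slope * sat_gap
print(f"Predicted first-year GPA gap: {predicted_gpa_gap:.2f}")  # ~0.29 with these numbers
```

With these made-up but plausible inputs, the implied gap lands in the 0.2-0.3 range, the same ballpark as the freshman GPA difference Wainer reports.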

The practice does bring a different sort of advantage to Bowdoin, however. Because the lower, unsubmitted scores don't figure in the reported average, the apparent average SAT score of its students increases, and average SAT score is one factor in the quality rankings offered by US News and World Report.

In another fascinating chapter, Wainer offers a for-dummies guide to equating tests. In a nutshell, the problem is that one sometimes wants to compare scores on tests that use different items—for example, different versions of the SAT. As Wainer points out, if the tests have some identical items, you can use performance on those items as “anchors” for the comparison. Even so, the solution is not straightforward, and Wainer deftly takes the reader through some of the issues.
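To give a concrete feel for what anchoring involves, here is a minimal sketch of one common approach, chained linear (mean-sigma) equating through a set of common items. Everything in it is synthetic and illustrative; it is not the procedure Wainer works through, just the basic idea of linking two forms via the anchor.

```python
import numpy as np

# Minimal sketch of chained linear equating through an anchor test.
# All scores are synthetic: group 1 took form X plus the anchor items,
# group 2 took form Y plus the same anchor items.
rng = np.random.default_rng(0)
x_scores, anchor_g1 = rng.normal(50, 10, 500), rng.normal(20, 4, 500)
y_scores, anchor_g2 = rng.normal(55, 12, 500), rng.normal(22, 4, 500)

def linear_link(scores_from, scores_to):
    """Map the 'from' scale onto the 'to' scale by matching means and SDs."""
    m_f, s_f = scores_from.mean(), scores_from.std()
    m_t, s_t = scores_to.mean(), scores_to.std()
    return lambda s: m_t + (s_t / s_f) * (s - m_f)

# Chain: form X -> anchor scale (using group 1), then anchor scale -> form Y (using group 2)
x_to_anchor = linear_link(x_scores, anchor_g1)
anchor_to_y = linear_link(anchor_g2, y_scores)

x_raw = 60
print(f"A form-X score of {x_raw} equates to roughly {anchor_to_y(x_to_anchor(x_raw)):.1f} on form Y")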

But what if there is very little overlap on the tests?

Wainer offers this analogy. In 1998, the Princeton High School football team was undefeated. In the same year, the Philadelphia Eagles won just three games. If we imagine each as a test-taker, the high school team got a perfect score, whereas the Eagles got just three items right. But the “tests” each faced contained very different questions and so they are not  comparable. If the two teams competed, there's not much doubt as to who would win.

The problem seems obvious when spelled out, yet one often hears calls for uses of tests that would entail such comparisons—for example, comparing how much kids learn in college, given that some major in music, some in civil engineering, and some in French.

And yes, the problem is the same when one contemplates comparing student learning in a high school science class and a high school English class as a way of evaluating their teachers. Wainer devotes a chapter to value-added measures. I won't go through his argument, but will merely telegraph it: he's not a fan.

In all, Uneducated Guesses is a fun read for policy wonks. The issues Wainer takes on are technical and controversial—they represent the intersection of an abstruse field of study and public policy. For that reason, the book can't be read as a definitive guide. But as a thoughtful starting point, the book is rare in its clarity and wisdom.

matthew levey
4/10/2012 12:16:28 pm

Dan,

Thanks for this review. Looks fascinating.

At the risk of plugging another book before reading this one, I've picked up Daniel Kahneman's Thinking, Fast and Slow.

I'm fascinated by Kahneman's accounts of how willing we humans are to ignore evidence that our beliefs are not consistent with reality. Which explains why Wainer has to spend time yelling at the TV and writing books about how policy makers abuse tests.

Yet in my fair city, I'm surrounded by intelligent, seemingly well-intended ed reformers who explain to me that if there are 5 schools achieving extraordinary results (based on standardized test scores, of course), all schools should be able to do this, if only they would study the causes of this success.

Were I to point to the possibility that there is no causality, just preparing the kids for one kind of test and a bit of luck, and that the most likely outcome for these schools is regression to the mean, I'd be as welcome as a skunk at a garden party. Or worse, be accused of being an apologist for the system!

All this in an environment where we are told the most important thing is that we teach children to think critically. Except, apparently, when evaluating data-based claims made by their local elected officials.

LarryW
4/11/2012 12:54:25 am

I'm often up for a lay book on psychometric application. There're several that do a good job of explaining in practical terms the pros and cons and misapplications. This book by Wainer is not one of them, and continues the charade. Typical of such charades, he often tells the truth and lies at the same time. He shows the flaws of his arguments, but like the small print in consumer contracts which nobody reads, he hopes and knows you won't catch him on it.

I've only read the first chapter, which discusses the SAT, ETS's big money maker. He concludes, of course, that SAT scores are and must be required and determinative for acceptance into college. He proves that only if you have no knowledge and miss the tricks he is performing.

The topic of the first chapter is the comparison of the SAT scores of kids who submit the scores to Bowdoin College vs the SAT scores of those who don't. The SAT is optional at Bowdoin. Not surprisingly, ON AVERAGE, those not submitting score significantly lower. Now to his proof. Those not submitting have a 0.2 lower first-year GPA than those submitting, ON AVERAGE.

He concludes that if the goal of college admissions is to choose kids who do better in their college courses, then you must use the SAT scores to make admission decisions.

I'll bet you missed the truth and the disconnect between it and his conclusion. First, the SAT scores on average differed significantly, but the GPA differences were only 0.2 points. Second, this comparison and prediction was for the first year of college; if the SAT has any predictive power, one would expect the correlation to be strongest in the first year. That is the key -- after the first year, the predictive power of the SAT for college success is insignificant.

A. Ericsson's and others' work on "deliberate practice" shows the insignificance of tests like the SAT in predicting success. In these studies, the correlation between such tests and success is about 0.2, which translates to saying that 4% of the variation is due to whatever is being measured by these tests; 96% of one's success is based on other factors.

Another almost laughable con Wainer uses in his first chapter is a direct steal of the cons illustrated in the seminal book "How to Lie with Statistics". To illustrate the differences between SAT submitters and non-submitters, he plots frequency distributions of the SAT and, separately, the GPA scores. This is legitimate. But to further his big lie, the vertical axis of his plots uses the actual frequencies of submitters and non-submitters, not the percentages, which would have been the honest and accurate way to display the results. Because the vast majority of kids submitted SAT scores, their frequency distribution plot towers over the vastly smaller frequency distribution plot of the non-submitters.

So instead of seeing two equal area distributions side by side with the non-submitter distribution a little off to the left, showing lower SAT scores or lower first year GPA scores, with the distributions overlapping significantly, one sees the non-submitter distribution, a tiny insignificant plot, occupying almost no area, to the lower left, and this overwhelmingly large area plot representing the submitters. A parent vs child image.

My bet is you'll have the unwashed masses of conservatives drooling over this obvious con as though it were the Second Coming, and liberals, who couldn't make a substantive argument if their lives depended on it, decrying the results and perhaps making some effort at protest.

But the truth is this book is an obvious con. It's a book for dummies alright. I suggest a new title: "Psychometrics for Those Who Want to Remain Dummies."

Daniel Willingham
4/11/2012 02:30:22 am

@LarryW
To your points:
1) The difference in first-year GPA between submitters and non-submitters is 0.2, which you seem to imply is insignificantly small. Wainer's claim is not that it's large, but that it's about the difference one would predict, based on the difference in SAT scores.
2) Yes, the standard outcome measure for predicting college success is first year GPA, and yes, the predictive value of *all* predictors, including the SAT, drops drastically after the first year. College life is complicated, and it's hard to predict in the fall of a high school student's senior year how hard he or she will be studying, what his or her motivation will be like, social life, etc., two years hence. Admissions officers have to use the best tools they can.
3) There are a great many studies on the predictive validity of the SAT for first-year GPA, and the raw correlation ranges from .35 to .45. See, for example, "Does socioeconomic status explain the relationship between admissions tests and post-secondary academic performance?" by Sackett, Paul R.; Kuncel, Nathan R.; Arneson, Justin J.; Cooper, Sara R.; Waters, Shonna D., Psychological Bulletin, Vol 135(1), Jan 2009, 1-22. Other studies try to correct for the restriction of range problem--that is, that this raw correlation probably underestimates the correlation because students with very low scores do not attend selective schools and students with very high scores don't attend non-selective schools. I don't really understand these corrections and so can't comment on them, but the restriction of range problem does seem to me to indicate that the raw correlation is probably a conservative estimate.
4) I don't understand why you think that presenting the distributions as percentages would present a more honest picture. The figure as presented carries the information that many more people submitted than did not submit. It seems to me that the graph you suggest would actually be misleading, because it would suggest that the numbers of submitters and non-submitters were equivalent.
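As an aside, here is a small sketch (with made-up numbers, not the Bowdoin data) of the two plotting choices under debate: raw counts per group versus within-group densities, which give each group equal area.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up scores: a large group of submitters and a smaller, lower-scoring
# group of non-submitters (illustrative only, not the Bowdoin data).
rng = np.random.default_rng(1)
submitters     = rng.normal(1300, 100, 1400)
non_submitters = rng.normal(1180, 100, 200)

bins = np.arange(900, 1650, 50)
fig, (ax_counts, ax_density) = plt.subplots(1, 2, figsize=(10, 4))

# Raw counts: keeps the information that one group is much larger
ax_counts.hist(submitters, bins=bins, alpha=0.6, label="submitted")
ax_counts.hist(non_submitters, bins=bins, alpha=0.6, label="did not submit")
ax_counts.set_title("Raw counts")

# Within-group densities: equal areas, so the two shapes are directly comparable
ax_density.hist(submitters, bins=bins, alpha=0.6, density=True, label="submitted")
ax_density.hist(non_submitters, bins=bins, alpha=0.6, density=True, label="did not submit")
ax_density.set_title("Within-group density")

for ax in (ax_counts, ax_density):
    ax.legend()
plt.tight_layout()
plt.show()
```

Raw counts answer "how many students are in each group?"; equal-area densities answer "how do the two score distributions compare in shape and location?" Which is the right display depends on which question the figure is meant to answer.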

Cal
4/11/2012 07:50:52 am

I always find it amusing how everyone focuses on first year GPA as the predictor. That's what the SAT was designed for, back in the day when everyone took the same first year college courses.

What is often ignored is that the SAT, as well as the ACT, is probably the single best predictor of *college readiness*, which is why almost all state schools use an SAT/ACT score as a proxy for a placement test, or a "get out of remediation free" card. Grades do not and can not do this.

EB
4/11/2012 03:37:41 am

"Restriction of range" is a term I was not familiar with (not being a statistician or anywhere close) but it does a good job of pointing out what commonb sense will also tell you: the predictive value of SAT or other standardized admissions tests is very robust when you compare scores that are widely separated. Very robust. This is why even when selective colleges are asked to admit the children of major donors, they resist the temptation if the child's SAT score is too much lower than the median for admited students.

Barry Garelick
4/17/2012 12:24:39 pm

Define "policy wonk". I worked in Washington DC for 13 years in a technical area. Those with degrees in Public Policy and those with degrees in English were indistinguishable, and both held the same disdain for people who worked on "technical" issues. To those outside of DC, I was considered a policy wonk. Therefore, your definition would be enlightening.

Dan Willingham
4/17/2012 10:45:03 pm

Hey Barry. I wasn't using the term all that seriously--maybe it carries connotations I don't know about, and I should use it more carefully. I just meant someone who is interested in the technical details of policy.

LarryW
4/17/2012 01:41:27 pm

It makes little sense to determine college eligibility based on the SAT/ACT since it is only predictive of freshman-year GPA (and a small difference at that, even given a high correlation). However, the real criterion of admissions at any university is whether this potential student can succeed in getting a quality education at this institution. An education is not a sprint but a marathon. It neither makes sense nor is it justified to decide on college access based on such flimsy evidence as the SAT.


