On a scale of 1 to 10, how much do you think Pearson publishing cares about the efficacy of their products?
Now now, I asked for a numerical rating, not invective or expletives.
My own rating might be a three or a four. I'm guessing that the folks at Pearson care about effectiveness to some
extent because it affects how much things sell.
But the bottom line is that what matters is the bottom line. The success and failure of particular marketing strategies are followed closely, I'm guessing, as are sales of particular products. Learning outcomes from the product? Well, the customer can track them if they are interested.
So what are we to make of it when Pearson says
"We are putting the pursuit of efficacy and learning outcomes at the centre of our new global education strategy."
Educators have every right to be cynical. It's not just that Pearson has shown little inclination in this direction in the past, but also that it's a publicly traded company that shareholders ought to expect will put profits first.
Ironically, the path by which Pearson plans to effect this change is mostly about inputs: hiring people who care about efficacy, developing a global research network to gather evidence, that sort of thing.
But crucially they also promise to track outcomes, namely "to report audited learning outcomes, measures, and targets alongside its financial accounts, covering its whole business by 2018."
That's an enormous commitment and if they really follow through, it gives me some confidence that this is not merely a marketing ploy. Or if it is, the marketing team has concluded that to make this ploy appear not to be a ploy, they need to put some teeth in the plan.
A significant aspect of the success of this step turns on that small adjective "audited." It's not that hard to cook the learning outcome books. For this new effort to be persuasive, Pearson will need to have disinterested parties weigh in on the efficacy measures used, and their interpretation.
A person knowledgeable about testing, yet wholly disinterested? Does Pearson have Diogenes on staff?
There's another aspect of this plan that I find even more interesting, and potentially useful. Pearson has published a do-it-yourself efficacy review tool. It's a series of questions you are to consider to help you think about the effectiveness of a product you are currently using, or are contemplating using. There's an online version as well as a downloadable pdf.
The tool encourages you to consider four factors (listed here in my own phrasing):
- What am I trying to achieve?
- What evidence is there that this product will help with my goal?
- What's my plan to use this tool?
- Do I have what I need to make my plan work?
These simple, sensible questions are elaborated in the framework, but working through the details should still take less than an hour. The tool includes sample ratings to help the user think through the rating scheme.
I think this tool is great, and not just because it aligns well with a similar tool I offered in When Can You Trust the Experts?
I think it offers Pearson a way to gain credibility as the company that cares about efficacy. If I were to hear that Pearson's sales force made a habit of encouraging district decision-makers to apply this efficacy framework to the educational products of Pearson (and others) that would be a huge step forward.
I would be even more impressed if Pearson warned users about the difficulty of overcoming confirmation bias and of making these judgments objectively.
Still, this is a start. There might be some satisfaction in greeting this move with cynicism, but I think it's better to start with skepticism--skepticism that will prompt action and help to encourage educators to think effectively about efficacy.
The cover story of the latest New Republic wonders whether American educators have fallen in blind love with self-control. Author Elizabeth Weil thinks we have. Titled “American Schools Are Failing Nonconformist Kids: In Defense of the Wild Child,” the article suggests that educators harping on self-regulation are really trying to turn kids into submissive little robots. And they do so because little robots are easier to control in the classroom.
But lazy teachers are not the only cause. Education policy makers are also to blame, according to Weil. She writes that “valorizing self-regulation shifts the focus away from an impersonal, overtaxed, and underfunded school system and places the burden for overcoming those shortcomings on its students.”
And the consequence of educators’ selfishness? Weil tells stories that amount to Self-Regulation Gone Wild. A boy has trouble sitting cross-legged in class—the teacher opines he should be tested because something must be wrong with him. During story time the author’s daughter doesn’t like to sit still and to raise her hand when she wants to speak. The teacher suggests occupational therapy.
I can see why Weil and her husband were angry when their daughter’s teacher suggested occupational therapy simply because the child’s behavior was an inconvenience to him. But I don’t take that to mean that there is necessarily a widespread problem in the psyche of American teachers. I take that to mean that their daughter’s teacher was acting like a selfish bastard.
The problem with stories, of course, is that there are stories to support nearly anything. For every story a parent could tell about a teacher diagnosing typical behavior as a problem, a teacher could tell a story about a child who really could do with some therapeutic help, and whose parents were oblivious to that fact.
What about evidence beyond stories?
Weil cites a study by Duncan et al. (2007) that analyzed six large data sets and found social-emotional skills were poor predictors of later success.
She also points out that creativity among American school kids dropped between 1984 and 2008 (as measured by the Torrance Test of Creative Thinking) and she notes “Not coincidentally, that decrease happened as schools were becoming obsessed with self-regulation.”
There is a problem here. Weil uses different terms interchangeably: self-regulation, grit, social-emotional skills. They are not the same thing. Self-regulation (most simply put) is the ability to hold back an impulse when you think that the impulse will not serve your other interests. (The marshmallow study would fit here.) Grit refers to dedication to a long-term goal, one that might take years to achieve, like winning a spelling bee or learning to play the piano proficiently. Hence, you can have lots of self-regulation but not be very gritty. Social-emotional skills might have self-regulation as a component, but the term refers to a broader complex of skills in interacting with others.
These are not niggling academic distinctions. Weil is right that some research indicates a link between socioemotional skills and desirable outcomes, and some doesn’t. But there is quite a lot of research showing associations between self-control and positive outcomes for kids, including academic outcomes, getting along with peers, parents, and teachers, and the avoidance of bad teen outcomes (early unwanted pregnancy, problems with drugs and alcohol, and the like). I reviewed those studies here. There is another literature showing associations of grit with positive outcomes (e.g., Duckworth et al., 2007).
Of course, those positive outcomes may carry a cost. We may be getting better test scores (and fewer drug and alcohol problems) but losing kids’ personalities. Weil calls on the reader’s schema of a “wild child,” that is, an irrepressible imp who may sometimes be exasperating, but whose very lack of self-regulation is the source of her creativity and personality.
But irrepressibility and exuberance are not perfectly inversely correlated with self-regulation. The purpose of self-regulation is not to lose your exuberance. It’s to recognize that sometimes it’s not in your own best interests to be exuberant. It’s adorable when your six-year-old is at a family picnic and impulsively practices her pas de chat because she cannot resist the Call of the Dance. It’s less adorable when it happens in class when everyone else is trying to listen to a story.
So there’s a case to be made that American society is going too far in emphasizing self-regulation. But the way to make it is not to suggest that the natural consequence of this emphasis is the crushing of children’s spirits because self-regulation is the same thing as no exuberance. The way to make the case is to show us that we’re overdoing self-regulation. Kids feel burdened, anxious, worried about their behavior.
Weil doesn’t have data that would bear on this point. I don’t either. But my perspective definitely differs from hers. When I visit classrooms or wander the aisles of Target, I do not feel that American kids are over-burdened by self-regulation.
As for the decline in creativity between 1984 and 2008 being linked to an increased focus on self-regulation…I have to disagree with Weil’s suggestion that it’s not a coincidence (setting aside the adequacy of the creativity measure). I think it might very well be a coincidence. Note that scores on the mathematics portion of the long-term NAEP increased during the same period. Why not suggest that kids’ improvement in a rigid, formulaic understanding of math inhibited their creativity?
Can we talk about important education issues without hyperbole?
Duckworth, A. L., Peterson, C., Matthews, M. D., & Kelly, D. R. (2007). Grit: Perseverance and passion for long-term goals. Journal of Personality and Social Psychology, 92(6), 1087.
Duncan, G. J., Dowsett, C. J., Claessens, A., Magnuson, K., Huston, A. C., Klebanov, P., ... & Japel, C. (2007). School readiness and later achievement. Developmental Psychology, 43(6), 1428.
I read a lot of blogs. I only comment when I think I have something to add (which is rare, even on my own blog) but I read a lot of them.
Today, I offer a plea and a suggestion for making education blogs less boring, specifically on the subject of standardized testing.
I begin with two Propositions about human behavior:
- Proposition 1: If you provide incentives for X, people are more likely to do what they think will help them get X. They may even attempt to get X through means that are counterproductive.
- Proposition 2: If we use procedure Z to change Y in order to make it more like Y’, we need to measure Y in order to know whether procedure Z is working. We have to be able to differentiate Y and Y’.
A lot of blog posts on the subject of testing are boring because authors pretend that one of these propositions is false or irrelevant.
On Proposition 1: Standardized tests typically gain validity by showing that scores are associated with some outcome you care about. You seldom care about the items on the test specifically. You care about what they signify. Sometimes tests have face validity, meaning test items look like they test what they are meant to test—a purported history test asks questions about history, for example. Often they don’t, but the test is still valid. A well-constructed vocabulary test can give you a pretty good idea of someone’s IQ, for example.
Just as body temperature is a reliable, partial indicator of certain types of disease, a test score is a reliable, partial indicator of certain types of school outcomes. But in most circumstances your primary goal is not a normal body temperature; it’s that the body is healthy, in which case body temperature will be normal as a natural consequence of the healthy state.
Bloggers ignoring basic propositions about human behavior? What's up with that?
If you attach stakes to the outcome, you can’t be surprised if some people treat the test as something different than that. They focus on getting body temperature to 98.6, whatever the health of the patient. That’s Proposition 1 at work. If a school board lets an administrator know that test scores had better go up or she can start looking for another job. . . well, what would you do in those circumstances? So you get test-prep frenzy. These are social consequences of tests, as typically used.
On Proposition 2: Some form of assessment is necessary. Without it, you have no idea how things are going. You won’t find many defenders of No Child Left Behind, but one thing we should remember is that the required testing did expose a number of schools—mostly ones serving disadvantaged children—where students were performing very poorly. And assessments have to be meaningful, i.e., reliable and valid. Portfolio assessments, for example, sound nice, but there are terrible problems with reliability and validity. It’s very difficult to get them to do what they are meant to do.
So here’s my plea. Admit that both Proposition 1 and Proposition 2 are true, and apply to testing children in schools.
People who are angry about the unintended social consequences of standardized testing have a legitimate point. They are not all apologists for lazy teachers or advocates of the status quo. Calling for high-stakes testing while taking no account of these social consequences, offering no solution to the problem . . . that's boring.
People who insist on standardized assessments have a legitimate point. They are not all corporate stooges and teacher-haters. Deriding “bubble sheet” testing while offering no viable alternative method of assessment . . . that's boring.
Naturally, the real goal is not to entertain me with more interesting blog posts. The goal is to move the conversation forward. The landscape will likely change consequentially in the next two years. This is the time to have substantive conversations.
What aspects of background, personality, or achievement predict success in college--at least, "success" as measured by GPA? A recent meta-analysis (Richardson, Abraham, & Bond, 2012) gathered articles published between 1997 and 2010, the products of 241 data sets. These articles had investigated these categories of predictors:
- three demographic factors (age, sex, socio-economic status)
- five traditional measures of cognitive ability or prior academic achievement (intelligence measures, high school GPA, SAT or ACT, A-level points)
- no fewer than forty-two non-intellective measures of personality, motivation, or the like, summarized into the categories shown in the figure below
Make this fun. Try to predict which of the factors correlate with college GPA.
Let's start with simple correlations.
41 out of the 50 variables examined showed statistically significant correlations. But statistical significance is a product of the magnitude of the effect AND the size of the sample--and the samples are so big that relatively puny effects end up being statistically significant. So in what follows I'll mention correlations of .20 or greater.
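The dependence of significance on sample size is easy to make concrete with the usual t-test for a correlation coefficient (a generic statistics sketch, not a calculation from the paper):

```python
import math

def t_for_correlation(r, n):
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# The same puny correlation (r = .10) is "significant" with a big sample
# (t well above the ~1.96 cutoff) but not with a small one.
t_big = t_for_correlation(0.10, 5000)
t_small = t_for_correlation(0.10, 30)
print(f"n=5000: t={t_big:.2f}; n=30: t={t_small:.2f}")
```

With thousands of participants, even r = .10 clears the conventional significance threshold, which is why it makes sense here to attend only to correlations of .20 or more.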
Among the demographic factors, none of the three were strong predictors. It seems odd that socio-economic status would not be important, but bear in mind that we are talking about college students, so this is a pretty select group, and SES likely played a significant role in that selection. Most low-income kids didn't make it, and those who did likely have a lot of other strengths.
The best class of predictors (by far) is the traditional correlates, all of which correlate at r = .20 or better, ranging from r = .20 (intelligence measures) up to r = .40 (high school GPA; ACT scores were also correlated at r = .40).
Personality traits were mostly a bust, with the exception of conscientiousness (r = .19), need for cognition (r = .19), and tendency to procrastinate (r = -.22). (Procrastination has a pretty tight inverse relationship to conscientiousness, so it strikes me as a little odd to include it.)
Motivation measures were also mostly a bust but there were strong correlations with academic self-efficacy (r = .31) and performance self-efficacy (r = .59). You should note, however, that the former is pretty much like asking students "are you good at school?" and the latter is like asking "what kind of grades do you usually get?" Somewhat more interesting is "grade goal" (r = .35) which measures whether the student is in the habit of setting a specific goal for test scores and course grades, based on prior feedback.
Self-regulatory learning strategies likewise showed only a few factors that provided reliable predictors, including time/study management (r = .22) and effort regulation (r = .32), a measure of persistence in the face of academic challenges.
Not much happened in the Approach to learning category nor in psychosocial contextual influences.
We would, of course, expect that many of these variables would themselves be correlated, and that's the case, as shown in this matrix.
So the really interesting analyses are regressions that try to sort out which matter more.
The researchers first conducted five hierarchical linear regressions, in each case beginning with SAT/ACT, then adding high school GPA, and then investigating whether each of the five non-intellective predictors would add some predictive power. The variables were conscientiousness, effort regulation, test anxiety, academic self efficacy, and grade goal, and each did, indeed, add power in predicting college GPA after "the usual suspects" (SAT or ACT, and high school GPA) were included.
But what happens when you include all the non-intellective factors in the model?
The order in which they are entered matters, of course, and the researchers offer a reasonable rationale for their choice; they start with the most global characteristic (conscientiousness) and work towards the more proximal contributors to grades (effort regulation, then test anxiety, then academic self-efficacy, then grade goal).
As they ran the model, SAT and high school GPA continued to be important predictors. So were effort regulation and grade goal.
You can usually quibble about the order in which variables were entered and the rationale for that ordering, and that's the case here. As they put the data together, the most important predictors of college grade point average are: your grades in high school, your score on the SAT or ACT, the extent to which you plan for and target specific grades, and your ability to persist in challenging academic situations.
There is not much support here for the idea that demographic or psychosocial contextual variables matter much. Broad personality traits, most motivation factors, and learning strategies matter less than I would have guessed.
No single analysis of this sort will be definitive. But aside from that caveat, it's important to note that most admissions officers would not want to use this study as a one-to-one guide for admissions decisions. Colleges are motivated to admit students who can do the work, certainly. But beyond that they have goals for the student body on other dimensions: diversity of skill in non-academic pursuits, or creativity, for example.
When I was a graduate student at Harvard, an admissions officer mentioned in passing that, if Harvard wanted to, the college could fill the freshman class with students who had perfect scores on the SAT. Every single freshman-- 800, 800. But that, he said, was not the sort of freshman class Harvard wanted.
I nodded as though I knew exactly what he meant. I wish I had pressed him for more information.
Richardson, M., Abraham, C., & Bond, R. (2012). Psychological correlates of university students' academic performance: A systematic review and meta-analysis. Psychological Bulletin, 138, 353-387.
The PIRLS results are better than you may realize.
Last week, the results of the 2011 Progress in International Reading Literacy Study (PIRLS) were published. This test compared reading ability in 4th grade children.
U.S. fourth-graders ranked 6th among 45 participating countries. Even better, US kids scored significantly better than the last time the test was administered in 2006.
There's a small but decisive factor that is often forgotten in these discussions: differences in orthography across languages.
Lots of factors go into learning to read. The most obvious is learning to decode--learning the relationship between letters and (in most languages) sounds. Decode is an apt term. The correspondence of letters and sound is a code that must be cracked.
In some languages the correspondence is relatively straightforward, meaning that a given letter or combination of letters reliably corresponds to a given sound. Such languages are said to have a shallow orthography. Examples include Finnish, Italian, and Spanish.
In other languages, the correspondence is less consistent. English is one such language. Consider the letter sequence "ough." How should that be pronounced? It depends on whether it's part of the word "cough," "through," "although," or "plough." In these languages, there are more multi-letter sound units, more context-dependent rules, and more out-and-out quirks.
Another factor is syllabic structure. Syllables in languages with simple structures typically (or exclusively) have the form CV (i.e., a consonant, then a vowel, as in "ba") or VC (as in "ab"). Slightly more complex forms include CVC ("bat") and CCV ("pla"). As the number of permissible combinations of vowels and consonants that may form a single syllable increases, so does the complexity. In English, it's not uncommon to see forms like CCCVCC (e.g., "splint").
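The C/V shorthand is easy to make concrete. A toy sketch of my own (it treats every non-vowel letter as a consonant and ignores complications like digraphs and "y"):

```python
VOWELS = set("aeiou")

def cv_skeleton(syllable):
    """Map each letter of a syllable to C (consonant) or V (vowel)."""
    return "".join("V" if ch in VOWELS else "C" for ch in syllable.lower())

print(cv_skeleton("ba"))      # CV
print(cv_skeleton("bat"))     # CVC
print(cv_skeleton("splint"))  # CCCVCC
```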
Here's a figure (Seymour et al., 2003) showing the relative orthographic depth of 13 languages, as well as the complexity of their syllabic structure.
From Seymour et al (2003)
Orthographic depth correlates with the incidence of dyslexia (e.g., Wolf et al., 1994) and with word and nonword reading in typically developing children (Seymour et al., 2003). Syllabic complexity correlates with word decoding (Seymour et al., 2003).
This highlights two points, in my mind.
First, when people trumpet the fact that Finland doesn't begin reading instruction until age 7 we should bear in mind that the task confronting Finnish children is easier than that confronting English-speaking children. The late start might be just fine for Finnish children; it's not obvious it would work well for English-speakers.
Of course, a shallow orthography doesn't guarantee excellent reading performance, at least as measured by the PIRLS. Children in Greece, Italy, and Spain had mediocre scores, on average. Good instruction is obviously still important.
But good instruction is more difficult in languages with deep orthography, and that's the second point. The conclusion from the PIRLS should not just be "Early elementary teachers in the US are doing a good job with reading." It should be "Early elementary teachers in the US are doing a good job with reading despite teaching reading in a language that is difficult to learn."
Seymour, P. H. K., Aro, M., & Erskine, J. M. (2003). Foundation literacy acquisition in European orthographies. British Journal of Psychology, 94, 143-174.
Wolf, M., Pfeil, C., Lotz, R., & Biddle, K. (1994). Toward a more universal understanding of the developmental dyslexias: The contribution of orthographic factors. In V. W. Berninger (Ed.), The varieties of orthographic knowledge, 1: Theoretical and developmental issues. Neuropsychology and cognition, Vol. 8 (pp. 137-171). New York, NY: Kluwer.
Michael Gove, Secretary of State for Education in Great Britain, delivered a speech on education policy last week called "In Praise of Tests" (text here), in which he argued for "regular, demanding, rigourous examinations." The reasons offered included arguments invoking scientific evidence, and he cited my work as an example of such evidence. That invites the question: "Does Willingham think that the scientific evidence supports testing, as Gove suggested?"

This question really has two parts. Did Gove get the science right? And did he apply it in a way that is likely to work as he expects?

The answer to the first question is straightforward: yes, he got the science right.
The answer to the second question is that I agree that testing is necessary, but I have a different take on the scientific backing for this claim than Gove offered.

First, the science. Gove made three scientific claims. First, that people enjoy mental activity that is successful--it's fun to solve challenging problems. Much of the first chapter of Why Don't Students Like School is devoted to this idea, but it's a commonplace observation; that's why people enjoy problem-solving hobbies like crossword puzzles or reading mystery novels. Second, Gove claimed that background knowledge is critical for higher thought, a topic I've written about in several places (e.g., here).

The only quibble I have with Gove on this topic is when he says "Memorisation is a necessary precondition of understanding." I'd have preferred "knowledge" to "memorisation," because the latter makes it sound as though one must sit down and willfully commit information to memory. This is a poor way to learn new information--it's much more desirable that the to-be-learned material be embedded in some interesting activity, so that the student will be likely to remember it as a matter of course.

It's plain that Gove agrees with me on this point, because he emphasized that exam preparation should not mean a dull drilling of facts, but rather should happen through "entertaining narratives in history, striking practical work in science and unveiling hidden patterns in maths." I think the word "memorisation" may be what led the Guardian to use a headline suggesting Gove was advocating rote learning.

Third, Gove argued that people (teachers and others) are biased in their evaluations of students, based on the student's race, ethnicity, gender, or other features that have nothing to do with the student's actual performance. A number of studies from the last forty years show that this danger is real.
So on the science, I think Gove is on firm ground. What of the policy he's advocating?

I lack expertise in policy matters, and I've argued on this blog that the world of education might be less chaotic if each of us stuck a little closer to the home territory of what we know. Worse yet, I know little about the British education system or about Gove's larger policy plans. With those caveats in place, I'll tread on Gove's territory and offer these thoughts on policy.

It's true that successful thought brings pleasure. The sort of effort I (and others) meant was the solving of a cognitive problem. Gove offers the example of a singer finishing an aria or a craftsman finishing an artefact. These works of creative productivity likely would bring the sort of pleasure I discussed. It's less certain that the passing of an examination would be "successful thought" in this sense. Why? Because exams seldom call for the creative deployment of knowledge. Instead, they call for the straightforward recall of knowledge. That's because it's very difficult to write exams that call for creative responses, yet are psychometrically reliable and valid.

There is a second manner in which achievement can bring pleasure; I haven't written about it, but I think it's the one Gove may have in mind. It's the pleasure of overcoming a formidable obstacle that you were not sure you could surmount. I agree that passing a difficult test could be a profound experience. Some children really don't see themselves as students. They have self-confidence, but it comes from knowing that they are effective in other activities. Passing a challenging exam might prompt a child who never really thought of himself as "a student" to recognize that he's every bit as able as other children, and that might redirect the remainder of his school experience, even his life.
But there are some obvious difficulties in reaching this goal. How do we motivate the student to work hard enough to actually pass the difficult test? The challenge of the exam is unlikely to do it--the child is much more likely to conclude that he can't possibly pass, so there is no point in trying.
The clear solution is to engage creative teachers who have the skill to work with students who begin school poorly prepared and who may come from homes where education is not a priority. But motivation was the problem we began with, the one we hoped to address. It seems to me that the motivational boost we get from kids passing a tough exam might be a good outcome of successfully motivating kids. It's not clear to me that it will motivate them.
My second concern with Gove's vision of testing is how teachers will believe they should best prepare kids for a difficult exam that demands a lot of factual recall.

Gove is exactly right when he argues that teachers ought not to construe this as a call for rote learning of lists of facts, but rather should ensure that rich factual content is embedded in rich learning activities. My concern is that some British teachers--in particular, the ones whose performance Gove hopes to boost--won't listen to him. I say that because of the experience in the US with the No Child Left Behind Act. In the face of mandatory testing for students, some teachers kept doing what they had been doing, which is exactly what Gove suggests: rich content interwoven with a demand for critical thinking, delivered in a way that motivates kids. These teachers were unfazed by the test, certain that their students would pass.

Other teachers changed lesson plans to emphasize factual knowledge, and focused activities on test prep. I've never met a teacher who was happy about this change. Teachers emphasize facts at the expense of all else and engage in disheartening test prep because they think it's necessary. Teachers believed it was necessary because (1) they were uncertain that their old lesson plans would leave kids with the factual knowledge base to pass the test; or (2) they thought that their students entered the class so far behind that extreme measures were necessary to get them to the point of passing; or (3) they thought that the test was narrow or poorly designed and would not capture the learning that their old set of lesson plans brought to kids; or (4) some combination of these factors. So pointing out that exam prep and memorization of facts is bad practice will probably not be enough.
Despite these difficulties, I think some plan of testing is necessary. Gove puts it this way: "Exams help those who need support to better know what support they need." A cognitive psychologist would say "learning is not possible without feedback." That learning might be an individual student mastering a subject, OR a teacher evaluating whether his students learned more from a new set of lesson plans he devised compared to last year, OR whether students at a school are learning more with block scheduling compared to their old schedule. In each case, you want to be confident that the feedback is valid, reliable, and unbiased. And if social psychology has taught us anything in the last fifty years, it's that people will believe their informal judgments are valid, reliable, and unbiased, whether they are or not.

There's more to the speech and I encourage you to read all of it. Here I've commented only on some of the centerpiece scientific claims in it. Again, I emphasize that I don't know British education and I don't know Gove's plans in their entirety, so what I've written here may be inaccurate because it lacks broader context. I can confidently say this: hard as it is, good science is easier than good policy.
The insidious thing about tests is that they seem so straightforward. I write a bunch of questions. My students try to answer them. And so I find out who knows more and who knows less.
But if you have even a minimal knowledge of the field of psychometrics, you know that things are not so simple.
And if you lack that minimal knowledge, Howard Wainer would like a word with you.
Wainer is a psychometrician who spent many years at the Educational Testing Service and now works at the National Board of Medical Examiners. He describes himself as the kind of guy who shouts back at the television when he sees something to do with standardized testing that he regards as foolish. These one-way shouting matches occur with some regularity, and Wainer decided to record his thoughts more formally.
The result is an accessible book, Uneducated Guesses, explaining the source of his ire on 10 current topics in testing. The chapters make for an interesting read for anyone with even minimal interest in the topic.
For example, consider making a standardized test like the SAT or ACT optional for college applicants, a practice that seems egalitarian and surely harmless. Officials at Bowdoin College have made the SAT optional since 1969. Wainer points out the drawback--useful information about the likelihood that students will succeed at Bowdoin is omitted. Here's the analysis.
Students who didn't submit SAT scores with their application nevertheless took the test. They just didn't submit their scores. Wainer finds that, not surprisingly, students who chose not to submit their scores did worse than those who did, by about 120 points.
Figure taken from Wainer's blog
Wainer also finds that those who didn't submit their scores had worse GPAs in their freshman year, and by about the amount that one would predict, based on the lower scores.
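The logic of "about the amount that one would predict" can be sketched with a toy regression (all numbers below are hypothetical, not Wainer's or Bowdoin's data): fit freshman GPA against SAT score for the submitters, then see whether non-submitters' GPAs land near what their unsubmitted scores would predict.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit: returns (slope, intercept) for y ~ slope*x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical submitters' SAT scores and freshman GPAs (made-up numbers).
sat = [1200, 1300, 1400, 1500]
gpa = [2.9, 3.1, 3.3, 3.5]
slope, intercept = fit_line(sat, gpa)

# Predicted GPA for a non-submitter scoring 120 points below the submitter mean.
predicted = slope * (sum(sat) / len(sat) - 120) + intercept
```

If non-submitters' actual GPAs cluster near that prediction, the unsubmitted scores were carrying real information about how those students would fare, which is exactly Wainer's point.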
So although one might reject the use of standardized admissions tests out of some conviction, if the job of admissions officers at Bowdoin is to predict how students will fare there, they are leaving useful information on the table.
The practice does bring a different sort of advantage to Bowdoin, however. The apparent average SAT score of their students increases, and average SAT score is one factor in the quality rankings offered by US News and World Report.
In another fascinating chapter, Wainer offers a for-dummies guide to equating tests. In a nutshell, the problem is that one sometimes wants to compare scores on tests that use different items—for example, different versions of the SAT. As Wainer points out, if the tests have some identical items, you can use performance on those items as “anchors” for the comparison. Even so, the solution is not straightforward, and Wainer deftly takes the reader through some of the issues.
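To make the anchor idea concrete, here is a minimal sketch of mean-sigma linear equating, one standard way to use anchor items (the numbers are hypothetical, not from the book): each group's performance on the shared anchor items fixes a linear map from one form's scale onto the other's.

```python
from statistics import mean, stdev

def mean_sigma_equate(score_a, anchors_a, anchors_b):
    """Map a Form A score onto Form B's scale.

    anchors_a / anchors_b: each group's scores on the common anchor items.
    Assumes a linear relation between the two scales (mean-sigma method).
    """
    slope = stdev(anchors_b) / stdev(anchors_a)
    return mean(anchors_b) + slope * (score_a - mean(anchors_a))

# Group B averaged one point higher on the same anchor items, so a
# Form A score of 10 corresponds to about 11 on Form B's scale.
equated = mean_sigma_equate(10, [8, 10, 12], [9, 11, 13])
```

The simplicity is deceptive: the method assumes the anchor items behave the same way for both groups, which is one of the issues Wainer walks the reader through.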
But what if there is very little overlap on the tests?
Wainer offers this analogy. In 1998, the Princeton High School football team was undefeated. In the same year, the Philadelphia Eagles won just three games. If we imagine each as a test-taker, the high school team got a perfect score, whereas the Eagles got just three items right. But the “tests” each faced contained very different questions and so they are not comparable. If the two teams competed, there's not much doubt as to who would win.
The problem seems obvious when spelled out, yet one often hears calls for uses of tests that would entail such comparisons—for example, comparing how much kids learn in college, given that some major in music, some in civil engineering, and some in French.
And yes, the problem is the same when one contemplates comparing student learning in a high school science class and a high school English class as a way of evaluating their teachers. Wainer devotes a chapter to value-added measures. I won't go through his argument, but will merely telegraph it: he's not a fan.
In all, Uneducated Guesses is a fun read for policy wonks. The issues Wainer takes on are technical and controversial—they represent the intersection of an abstruse field of study and public policy. For that reason, the book can't be read as a definitive guide. But as a thoughtful starting point, the book is rare in its clarity and wisdom.
I want to highlight two incredibly valuable papers, although they are increasingly dated. One paper reports on an enormous project in which observers went into a large sample of US first grade classrooms (827 of them in 295 districts) and simply recorded what was happening.
The other paper reports on a comparable project for third grade classrooms (780 students in 250 districts). Both papers are a treasure trove of information, but I want to highlight one striking datum: the percentage of time spent on science.
In first grade classrooms it was 4%. In third grade classrooms it was 5%.

There are a few oddities that might make you wonder about these figures. In the first grade paper, the observations typically took place in the morning, so perhaps teachers tend to focus on ELA in the morning and save science for the afternoon. But the third grade project sampled throughout the day. And although there's always some chance that there's something odd about the method, the estimates accord with those from other measures, such as teachers' own reports. (See data from an NSF project here.) And before you blame NCLB for crowding science out of the classroom, note that the data for these studies were collected before NCLB (first grade, mostly '97-'98; third grade, mostly '00-'01). I don't think there's much reason to suspect that the time spent on science instruction has increased since then, and smaller-scale studies indicate it hasn't.

The fact that so little time is spent on science is, to me, shocking. It's even more surprising when paired with the observation that US kids fare pretty well in international comparisons of science achievement. In 2003, when more or less the same cohort of kids took the TIMSS, US kids ranked 6th in science. (They ranked 5th in 2008.)

How are US kids doing fairly well in science in the absence of science instruction? Possibly US schools are terribly efficient in science instruction and get a lot done in minimal time. Possibly other countries are doing even less. Possibly US culture offers good support for informal opportunities to learn science. It remains a puzzle.
There is a lot of talk about STEM instruction these days, yet in most districts science doesn't get serious until middle school. US schools could be doing a whole lot more simply by devoting more time to science instruction.
I'll have more to say about time in elementary classrooms next week.

NICHD Early Child Care Research Network (2002). The relation of global first-grade classroom environment to structural classroom features and teacher and student behaviors. The Elementary School Journal, 102.

NICHD Early Child Care Research Network (2005). A day in third grade: A large-scale study of classroom quality and teacher and student behavior. The Elementary School Journal, 105.
Am I stupid if I can't turn on my stove? The picture below (or one very similar) appears in most textbooks on human factors psychology.
The arrangement of controls is spatially incompatible with the arrangement of stove elements, so if I want to turn on the back left element, I may very well turn on the front left one. What's notable is that this stove likely came with an instruction book describing which knob goes with which burner. But something about that feels wrong. It feels like the designer of the stove should have known how my mind works, and taken that into account, rather than shrugging and saying, "Well, it's in the manual. It's not my fault if you don't read the manual."

The stove reminds me of value-added measures of teacher effectiveness.

Even the staunchest boosters of value-added measures agree that they should not be the whole story, that there should be multiple measures of teacher effectiveness. But I'm afraid that asking people to remember that fact is a little like asking people to remember which knob goes with which burner on their stove. It's not that people can't do it, but you are swimming upstream against the mind's biases.

To be clear, I don't think there are data to prove this contention, but let me describe why I'm guessing it's true. We're talking about a case of missing information: you tell people, "Teacher Smith's value-added score is X. By the way, value-added scores are incomplete as a measure of teacher effectiveness." How do people interpret information that they know to be incomplete? It varies with the situation. Sometimes they assume the missing information is positive. ("I haven't heard that the roads are closed, so I guess all's well.") Sometimes they assume missing information is negative. ("He left 'prior experience' blank, so I guess he doesn't have any.") And sometimes missing information is forgotten or discounted. My guess--and I emphasize that it's a guess--is that that will be the case here.

I make this guess in part by analogy to the evaluation of college applicants. A student's high school record has lots of "soft" components whose values are tricky to evaluate: participation in sports and clubs, leadership positions, recommendations from teachers... even a student's grade point average must be evaluated in light of
the difficulty of the courses taken and the competitiveness of the high school. But then there's the SAT. It has the gloss of being numeric, and it makes comparisons across students easy. Make no mistake, I believe that the SAT does what it's supposed to do--predict success in the freshman year of college. But it's often interpreted as much more meaningful than that. That's the problem.

I'm afraid that value-added measures will have the same problem. They are produced via a fancy formula, they make comparisons simple, and they are numeric, which can lead one to conclude that they are more precise than they really are.
And at this point, we don't even have any of the other "soft" measures to round out the picture of teacher effectiveness.

I don't think value-added measures are meaningless. But handing people value-added measures with the bland warning "these are incomplete" is like giving me a stove with a bad mapping plus an instruction booklet. The solution to the stove problem is straightforward: design the layout so the knobs map onto the burners.
The solution to teacher evaluation is not straightforward, and I won't attempt to resolve it in a blog posting.
My purpose here is simply to highlight the problem in publishing value-added data for individual teachers, with the caveat "these measures are incomplete." I predict that caveat will go unnoticed or be forgotten.
An article in yesterday’s New York Times covered some recent research on the growing education achievement gap between rich and poor. It’s worth a read, but it misses a couple of important points.
Regarding reasons for the gap, the article dwells on one hypothesis, commonly called the investment theory: richer families have more money to invest in their kids. (The article might have mentioned that richer families not only have more financial capital, but also more human capital and social capital.) The article does not mention another major theory of the economics of educational achievement: stress theory. Kids (and parents) who live in poverty live under systemic stress. A great deal of research in the last ten years has shown that this stress has direct cognitive consequences for kids, and also affects how parents treat their kids. (Any parent knows that you’re not at your best when you’re stressed.) An open-access review of this research can be found here.
Another important point the article misses concerns what might be done. It ends with a gloomy quote from an expert: “No one has the slightest idea what will work. The cupboard is bare.”
I think there is more reason for optimism, because other countries are doing a better job with this problem than we are. The OECD analyzes the PISA results by reported family SES. In virtually every country, high-SES kids outperform low-SES kids. But in some countries the gap is smaller, and it’s not just countries that have smaller income gaps.
Economic inequality within a country is often measured with a statistic called the Gini coefficient, which varies from 0 (everyone has the same net worth) to 1 (one person has all the money, and everyone else has nothing). Rich children score better than poor children in countries with large Gini coefficients (like the US), and the rich outscore the poor in countries with lower Gini coefficients (like Norway). Being poor predicts lower scores everywhere, but what’s significant is that the relationship between family income and test performance is stronger in the US than in most countries. (The US has the 3rd strongest relationship between income and student performance in science, and the 10th strongest in math, in the 2006 PISA results.)
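For concreteness, the Gini coefficient can be computed from a list of wealth values in a few lines. This is a minimal sketch using the standard sorted-values formula, with toy inputs:

```python
def gini(values):
    """Gini coefficient of a list of non-negative values.

    0 means perfect equality; for a sample of n people the maximum
    (one person holds everything) is (n - 1) / n, approaching 1 as n grows.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    # Sorted-values identity: G = 2 * sum(i * x_i) / (n * total) - (n + 1) / n
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n

equal = gini([1, 1, 1, 1])    # 0.0: everyone has the same
skewed = gini([0, 0, 0, 1])   # 0.75: one of four people holds everything
```

The point of the statistic in this context: the US sits at the high end of the Gini range among wealthy countries, yet it is the strength of the income-to-test-score relationship, not the Gini coefficient itself, that sets the US apart.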
Some countries (e.g., Hong Kong), despite an enormous disparity between rich and poor, manage to level the playing field when the kids are at school. The US does a particularly poor job at this task; wealthy kids enjoy a huge advantage over poor kids. People generally argue that the US is different from Hong Kong, that we’re a large, heterogeneous country, and so forth. All true, but a defeatist attitude won’t get us anywhere. We need more systematic study of how those countries solve the problem.