The Rasch Paradigm: Revolution or Normal Progression?

Much of the historical and philosophical analysis (e.g., Engelhard, Fisher) from the Rasch camp has followed the notion that Rasch’s principles and methods flow naturally and logically from the best measurement thinking (Thurstone, Binet, Guttman, Terman, et al.) of the early 20th century and beyond. From this very respectable and defensible perspective, Rasch’s contribution was a profound, but normal progression based on this earlier work and provided the tools to deal with the awkward measurement problems of the time, e.g. validity, reliability, equating. Before Rasch, the consensus was the only forms that could be equated were those that didn’t need equating.

When I reread Thomas Kuhn’s “The Structure of Scientific Revolutions” I was led to the conclusion that Rasch’s contribution rises to the level of a revolution, not just a refinement of earlier thinking or elaboration of previous work. It is truly a paradigm shift, although Kuhn didn’t particularly like the phrase (and it probably doesn’t appear in my 1969 edition of “Structure”). I don’t particularly like it because it doesn’t adequately differentiate between “new paradigm” and “tweaked paradigm”; in more of Kuhn’s words, a new world, not just a new view of an old world.

To qualify as a Kuhnian Revolution requires several things: the new paradigm, of course, which needs to satisfactorily resolve the anomalies that have accumulated under the old paradigm, which were sufficient to provoke a crisis in the field. It must be appealing enough to attract a community of adherents. To attract adherents, it must solve enough of the existing puzzles to be satisfying and it must present some new ones to send the adherents in a new direction and give them something new to work on.

One of Kuhn’s important contributions was his description of “Normal Science,” which is what most scientists do most of the time. It can be the process of eliminating inconsistencies, either by tinkering with the theory or by disqualifying observations. It can be clarifying details or bringing more precision to the experimentation. It can be articulating implications of the theory, i.e., if that is, then this must be. We get more experiments to do and other hypotheses to prove.

Kuhn described this process as “Puzzle Solving,” with, I believe, no intent of being dismissive. These fall into the rough categories of tweaking the theory, designing better experiments, or building better instruments.

The term “paradigm” wasn’t coined by Kuhn but he certainly brought it to the fore. There has been a lot of discussion and criticism since of the varied and often casual ways he used the word but it seems to mean the accepted framework within which the community who accept the framework perform normal science. I don’t think that is as circular as it seems.

The paradigm defines the community and the community works on the puzzles that are “normal science” under the paradigm. The paradigm can be ‘local,’ existing as an example or, perhaps, even an exemplar of the framework. Or it can be ‘global.’ Then it is the view that defines a community of researchers and the world view that holds that community together. This requires that it be attractive enough to divert adherents from competing paradigms and that it be open-ended enough to give them issues to work on or puzzles to solve.

If it’s not attractive, it won’t have adherents. The attraction has to be more than just able to “explain” the data more precisely. Then it would just be normal science with a better ruler. To truly be a new paradigm, it needs to involve a new view of the old problems. One might say, and some have, that after, say, Aristotle and Copernicus and Galileo and Newton and Einstein and Bohr and Darwin and Freud, etc., etc., we were in a new world.

Your paradigm won’t sell or attract adherents if it doesn’t give them things to research and publish. The requirement that the paradigm be open-ended is more than marketing. If it’s not open-ended, then it has all the answers, which makes it dogma or religion, not science.

Everything is fine until it isn’t. Eventually, an anomaly will present itself that can’t be explained away by tweaking the theory, censoring the data, or building a better microscope. Or perhaps, anomalies and the tweaks required to fit them in become so cumbersome, the whole thing collapses of its own weight. When the anomalies become too obvious to dismiss, too significant to ignore, or too cumbersome to stand, the existing paradigm cracks, ‘normal science’ doesn’t help, and we are in a ‘crisis’.

Crisis

The psychometric new world may have turned with Lord’s seminal 1950 thesis. (Like most of us at a similar stage, Lord had a more personal crisis: he needed a topic that would get him admitted into the community of scholars.) When he looked at a plot of item percent correct against total number correct (the item’s characteristic curve), he saw a normal ogive. That fit his plotted data pretty well, except in the tails. So he tweaked the lower end to “explain” too many right answers from low scorers. The mathematics of the normal ogive are, at least, cumbersome and, in 1950, computationally intractable. So that was pretty much that, for a while.

In the 1960s, the normal ogive morphed into the logistic; perhaps the idea came from following Rasch’s (1960) lead, perhaps from somewhere else, perhaps due to Birnbaum (1968). I’m not a historian and this isn’t a history lesson. The mathematics were a lot easier and computers were catching up. The logistic was winning out but with occasional retreats to the normal ogive because it fit a little better in the tails.

US psychometricians saw the problem as data fitting and it wasn’t easy. There were often too many parameters to estimate without some clever footwork. But we’re clever and most of those computational obstacles have been overcome to the satisfaction of most. The nagging questions remaining are more epistemological than computational.

Can we know if our item discrimination estimates are truly indicators of item "quality" and not loadings on some unknown, extraneous factor(s)?

If the lower asymptote is what happens at minus infinity where we have no data and never want to have any, why do we even care?

If the lower asymptote is the probability of a correct response from an examinee with infinitely low ability, how can it be anything but 1/k, where k is the number of response choices?

How can the lower asymptote ever be higher than 1/k? (See Slumdog Millionaire, 2008)

If the lower asymptote is population-dependent, isn't the ability estimate dependent on the population we choose to assign the person to? Mightn't individuals vary in their propensity to respond to items they don't know?

Wouldn't any population-dependent estimate be wrong on the level of the individual?

If you ask the data for “information” beyond the sufficient statistics, not only are your estimates population-dependent, they are subject to whatever extraneous factors that might separate high scores from low scores in that population. This means sacrificing validity in the name of reliability.

Rasch did not see his problem as data fitting. As an educator, he saw it directly: more able students do better on the set tasks than less able students. As an associate of Ronald Fisher (either the foremost statistician of the twentieth century who also made contributions to genetics or the foremost geneticist of the twentieth century who also made contributions to statistics), Rasch knew about logistic growth models and sufficient statistics. Anything left in the data, after reaping the information with the sufficient statistics, should be noise and should be used to control the model. The size of the residuals isn’t as interesting as the structure, or lack thereof.¹

Rasch Measurement Theory certainly has its community and the members certainly adhere and seem to find enough to do. Initially, Rasch found his results satisfying because they got him around the vexing problem of how to assess the effectiveness of remedial reading instruction when he didn’t have a common set of items or common set of examinees over time. This led him to identify a class of models that define Specific Objectivity.

Rasch’s crisis (how to salvage a poorly thought-out experiment) hardly rises to the epic level of Galileo’s crisis with Aristotle, or Copernicus’ crisis with Ptolemy, or Einstein’s crisis with Newton. A larger view would say the crisis came about because the existing paradigms did not lead us to “measurement”, as most of science would define it.

In the words of William Thomson, Lord Kelvin:

When you can measure what you are speaking about and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind.

Revolution

Rasch’s solution did change the world for any adherents who were willing to accept his principles and follow his methods. They now knew how to ‘equate’ scores from disparate instruments, but beyond that, how to develop scales for measuring, define constructs to be measured, and do better science.

Rasch’s solution to his problem in the 1950s with remedial reading scores is still the exemplar and “local” definition of the paradigm. His generalization of that solution to an entire class of models and his exposition of “specific objectivity” are the “global” definition. (Rasch, 1960) 

There’s a problem with all this. I am trying to force fit Rasch’s contribution into Kuhn’s “Structure of Scientific Revolutions” paradigm when Rasch Measurement admittedly isn’t science. It’s mathematics, or statistics, or psychometrics; a tool, certainly a very useful tool, like Analysis of Variance or Large Hadron Colliders.

Measures are necessary precursors to science. Some of the weaknesses in pre-Rasch thinking about measurement are suggested in the following koans, hoping for enlightened measurement, not Zen enlightenment.

"Whatever exists at all exists in some amount. To know it thoroughly involves knowing its quantity as well as its quality." E. L. Thorndike

"Within the range of objects for which the measuring instrument is intended, its function must be independent of the object of measurement." L. L. Thurstone

"You never know a line is crooked unless you have a straight one to put next to it." Socrates

"Correlations are population-dependent, and therefore scientifically rather uninteresting." Georg Rasch

"We can act as though measured differences along the latent trait are distances on a river but whoever is piloting better watch for meanders and submerged rocks."

"We may be trying to measure size; perhaps height and weight would be better. Or perhaps, we are measuring 'weight', when we should go after 'mass'"

"The larger our island of knowledge, the longer our shoreline of wonder." Ralph W. Sockman

"The most exciting thing to hear in science ... is not 'Eureka' but 'That's funny.'" Isaac Asimov

¹ A subtitle for my own dissertation could be “Rasch’s errors are my data.”

The Five ‘S’s to Rasch Measurement

The mathematical, statistical, and philosophical faces of Rasch measurement are separability, sufficiency, and specific objectivity. ‘Separable’ because the person parameter and the item parameter interact in a simple way; Β/Δ in the exponential metric or β-δ in the log metric. ‘Sufficient’ because ‘nuisance’ parameters can be conditioned out so that, in most cases, the number of correct responses is the sufficient statistic for the person’s ability or the item’s difficulty. Specific Objectivity is Rasch’s term for ‘fundamental measurement’; what Wright called ‘sample-free item calibration’. It is objective because it does not depend on the specific sample of items or people; it is specific because it may not apply universally and the validity in any new application must be established.
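A minimal sketch of what sufficiency buys, assuming the dichotomous Rasch model in the log metric (β for the person, δ for the item). The function names and the three item difficulties are invented for illustration. It computes the probability of each response pattern given the raw score, once for a low-ability person and once for a high-ability person. If the number correct really is sufficient, the two columns should agree, because nothing about ability is left in the pattern once the count is known.

```python
import math
from itertools import product

def p_correct(beta, delta):
    """Rasch probability of a correct response in the log metric."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))

def pattern_prob(pattern, beta, deltas):
    """Probability of one right/wrong pattern, assuming local independence."""
    prob = 1.0
    for x, delta in zip(pattern, deltas):
        p = p_correct(beta, delta)
        prob *= p if x == 1 else 1.0 - p
    return prob

def conditional_given_score(beta, deltas, score):
    """P(pattern | raw score) for every pattern with that raw score."""
    patterns = [p for p in product((0, 1), repeat=len(deltas)) if sum(p) == score]
    probs = [pattern_prob(p, beta, deltas) for p in patterns]
    total = sum(probs)
    return {p: q / total for p, q in zip(patterns, probs)}

deltas = [-1.0, 0.0, 1.5]   # hypothetical item difficulties, in logits
low = conditional_given_score(-2.0, deltas, score=1)
high = conditional_given_score(2.0, deltas, score=1)
for pattern in low:
    # The two conditional probabilities match despite very different abilities.
    print(pattern, round(low[pattern], 4), round(high[pattern], 4))
```

The same conditioning argument, applied to columns instead of rows, is what lets item difficulties be estimated without knowing anything about the people.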

I add two more ‘S’s to the trinity: simplicity and symmetry.

Simplicity

We have talked ad nauseam about simplicity. It is, in fact, one of my favorite themes. The chance that the person will answer the item correctly is Β / (Β + Δ), which is about as simple as life gets.[1] Or in less-than-elegant prose:

The likelihood that the person wins is the odds of the person winning
divided by the sum of the odds for the person winning and the odds for the item winning.

With such a simple model, the sufficient statistics are simple counts, and the estimators can be as simple as row averages. Rasch (1960) did many of his analyses graphically; Wright and Stone (1979) give algorithms for doing the arithmetic, somewhat laboriously, without the benefit of a computer. The first Rasch software at the University of Chicago (CALFIT and BICAL) ran on a ‘mini-computer’ that wouldn’t fit in your kitchen and had one millionth the capacity of your phone.
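As a small illustration of how little machinery the model needs, here is a sketch that writes the same probability three ways: in the exponential metric, as odds, and in the log metric. It assumes the usual link between the two metrics, Β = e^β and Δ = e^δ; the function names and numeric values are invented for the example.

```python
import math

def p_exponential(B, D):
    """Chance of success with ability B and difficulty D in the exponential metric."""
    return B / (B + D)

def p_odds(B, D):
    """The same chance written in terms of the odds B/D."""
    odds = B / D
    return odds / (1.0 + odds)

def p_logistic(beta, delta):
    """The same chance in the log metric, with beta = ln(B) and delta = ln(D)."""
    return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))

beta, delta = 1.2, -0.3                    # made-up log-metric values
B, D = math.exp(beta), math.exp(delta)     # the same values in the exponential metric

# All three forms give the same probability.
print(p_exponential(B, D), p_odds(B, D), p_logistic(beta, delta))

# Swapping the roles of person and item gives the chance that the item "wins".
print(p_exponential(D, B), 1.0 - p_exponential(B, D))
```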

Symmetry

The first level of symmetry with Rasch models is that person ability and item difficulty have identical status. We can flip the roles of ability and difficulty in everything I have said in this post and every preceding one, or in everything Rasch or Gerhard Fischer has ever written, and nothing changes. It makes just as much sense to say Δ / (Δ + Β) as Β / (Β + Δ). Granted we could be talking about anti-ability and anti-difficulty, but all the relationships are just the same as before. That’s almost too easy.

Just as trivially, we have noted, or at least implied, that we can flip, as suits our purposes, between the logistic and exponential expressions of the models without changing anything. In the exponential form, we are dealing with the odds that a person passes the standard item; in the logistic form, we have the log odds. If we observe one, we observe the other and the relationships among items and students are unchanged in any fundamental way. We are not limited to those two forms. Using base e is mathematically convenient, but we can choose any other base we like; 10, or 100, or 91 are often used in converting to ‘scale scores’. Any of these transformations preserves all the relationships because they all preserve the underlying interval scale and the relative positions of objects and agents on it.
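A hedged sketch of the change-of-base point, assuming that re-expressing a logit measure in another base simply divides it by the natural log of that base. The helper names and values are invented. Every measure is rescaled by the same constant, so the probabilities, and with them the relative positions on the scale, come out unchanged.

```python
import math

def rebase(measure_in_logits, base):
    """Convert a natural-log (logit) measure to log units of another base."""
    return measure_in_logits / math.log(base)

def p_base(theta, d, base):
    """Rasch probability when both measures are logs to the given base."""
    odds = base ** (theta - d)
    return odds / (1.0 + odds)

beta, delta = 0.8, -0.4   # made-up person and item measures, in logits
for base in (math.e, 10, 100):
    theta, d = rebase(beta, base), rebase(delta, base)
    # The measures change with the base; the probability does not.
    print(base, round(theta, 4), round(d, 4), p_base(theta, d, base))
```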

That’s the trivial part.

Symmetry was a straightforward concept in mathematics: Homo sapiens, all vertebrates, and most other fauna have bilateral symmetry; a snowflake has sixfold; a sphere an infinite number. The more degrees of symmetry, the fewer parameters that are required to describe the object. For a sphere, only one, the radius, is needed and that’s as low as it goes.

Leave it to physicists to take an intuitive idea and make it into a topic for advanced graduate seminars[2]:

A symmetry of a physical system is a physical or mathematical feature of the system
(observed or intrinsic)
that is preserved or remains unchanged under some transformation. 

For every invariant (i.e., symmetry) in the universe, there is a conservation law.
Equally, for every conservation law in physics, there is an invariant.
(Noether’s Theorem, 1918)[3].

Right. I don’t understand enough of that to wander any deeper into space, time, or electromagnetism or to even know if this sentence makes any sense.

In Rasch’s world,[4] when specific objectivity holds, the ‘difficulty’ of an item is preserved whether we are talking about high ability students or low, fifth graders or sixth, males or females, North America or British Isles, Mexico or Puerto Rico, or any other selection of students that might be thrust upon us.

Rasch is not suggesting that the proportion answering the item correctly (aka, p-value) never varies or that it doesn’t depend on the population tested. In fact, just the opposite, which is what makes p-values and the like “rather scientifically uninteresting”. Nor do we suggest that the likelihood that a third grader will correctly add two unlike fractions is the same as the likelihood for a ninth grader. What we are saying is that there is an aspect of the item that is preserved across any partitioning of the universe; that the fraction addition problem has its own intrinsic difficulty unrelated to any student.

“Preserved across any partitioning of the universe” is a very strong statement. We’re pretty sure that kindergarten students and graduate students in Astrophysics aren’t equally appropriate for calibrating a set of math items. And frankly, we don’t much care. We start caring if we observe different difficulty estimates from fourth-grade boys or girls, or from Blacks, Whites, Asians, or Hispanics, or from different ability clusters, or in 2021 and 2022. The task is to establish not if it ever fails but when symmetry holds.

I need to distinguish a little more carefully between the “latent trait” and our quantification of locations on the scale. An item has an inherent difficulty that puts it somewhere along the latent trait. That location is a property of the item and does not depend on any group of people that have been given, or that may ever be given the item. Nor does it matter if we choose to label it in yards or meters, Fahrenheit or Celsius, Wits or GRits. This property is what it is whether we use the item for a preschooler, junior high student, astrophysicist, or Supreme Court Justice. This we assume is invariant. Even IRTists understand this.

Although the latent trait may run the gamut, few items are appropriate for use in more than one of the groups I just listed. That would be like suggesting we can use the same thermometer to assess the status of a feverish preschooler that we use for the surface of the sun, although here we are pretty sure we are talking about the same latent ‘trait’. It is equally important to choose an appropriate sample for calibrating the items. A group of preschoolers could tell us very little about the difficulty of items appropriate for assessing math proficiency of astrophysicists.

Symmetry can break in our data for a couple reasons. Perhaps there is no latent trait that extends all the way from recognizing basic shapes to constructing proofs with vector calculus. I am inclined to believe there is in this case, but that is theory and not my call. Or perhaps we did not appropriately match the objects and agents. Our estimates of locations on the trait should be invariant regardless of which objects and agents we are looking at. If there is an issue, we will want to know why: are difficulty and ability poorly matched? Is there a smart way to get the item wrong? Is there a not-smart way to get it right? Is the item defective? Is the person misbehaving? Or did the trait shift? Is there a singularity?

My physics is even weaker than my mathematics.

What most people call ‘Goodness of Fit’ and Rasch called ‘Control of the Model’, we are calling an exploration of the limits of symmetry. For me, I have a new buzz word, but the question remains, “Why do bright people sometimes miss easy items and non-bright people sometimes pass hard items?”[5] This isn’t astrophysics.

Here is my “item response theory”:

The Rasch Model is a main effects model; the sufficient statistics for ability and difficulty are the row and column totals of the item response matrix. Before we say anything important about the students or items, we need to verify that there are no interactions. This means no matter how we sort and block the rows, estimates of the column parameters are invariant (enough).
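One way to see what “sort and block the rows” might look like in practice is a small simulation. The sketch below is mine and not a description of any particular Rasch program. It simulates data that fit the model, splits the rows into a low-scoring and a high-scoring block, and estimates item difficulties separately in each block with a rough pairwise (conditional) method. The sample size, seed, item difficulties, and the 0.5 added to each count are all arbitrary choices for illustration.

```python
import math
import random

def simulate(abilities, difficulties, rng):
    """Simulate a 0/1 response matrix (rows = persons) under the Rasch model."""
    return [[1 if rng.random() < 1 / (1 + math.exp(-(b - d))) else 0
             for d in difficulties] for b in abilities]

def pairwise_difficulties(rows, n_items):
    """Rough pairwise estimates of item difficulty, centered at zero.

    For items i and j, only rows with exactly one of the pair correct are used:
    log(count of "j right, i wrong" / count of "i right, j wrong") estimates d_i - d_j.
    Averaging over j gives d_i minus the mean difficulty.
    """
    estimates = []
    for i in range(n_items):
        total = 0.0
        for j in range(n_items):
            if i == j:
                continue
            i_only = sum(1 for r in rows if r[i] == 1 and r[j] == 0)
            j_only = sum(1 for r in rows if r[i] == 0 and r[j] == 1)
            total += math.log((j_only + 0.5) / (i_only + 0.5))
        estimates.append(total / n_items)
    return estimates

rng = random.Random(2024)
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]      # made-up values, mean zero
abilities = [rng.gauss(0.0, 1.0) for _ in range(4000)]
data = simulate(abilities, difficulties, rng)

# Block the rows by raw score: a low-scoring half and a high-scoring half.
data.sort(key=sum)
low, high = data[:2000], data[2000:]

low_est = pairwise_difficulties(low, len(difficulties))
high_est = pairwise_difficulties(high, len(difficulties))
for d, lo, hi in zip(difficulties, low_est, high_est):
    print(f"true {d:+.2f}   low block {lo:+.2f}   high block {hi:+.2f}")
```

With data that fit the model, the two columns of estimates should agree to within sampling error even though the proportions correct in the two blocks are very different. With real data, a disagreement larger than sampling error is the interaction we are looking for.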

That’s me regressing to my classical statistical training to say that symmetry holds for these data.


[1] It may look more familiar but less simple if we rewrite it as (Β/Δ) / (1 + Β/Δ), or even better as e^(β-δ) / (1 + e^(β-δ)), but it’s all the same for any observer.

[2] Both of the following statements were lifted (plagiarized?) from a Wikipedia discussion of symmetry. I deserve no credit for the phrasing, nor do I seek it.

[3] Emmy Noether was a German mathematician whose contributions, among other things, changed the science of physics by relating symmetry and conservation. The first implication of her theorem was that it solved Hilbert and Einstein’s problem that General Relativity appeared to violate the conservation of energy. She was generally unpaid and dismissed, notably and emphatically not by Hilbert and Einstein, because she was a woman and a Jew. In that order.

When Göttingen University declined to give her a paid position, Hilbert responded, “Are we running a University or a bathing society?” In 1933, all Jews were forced out of academia in Germany; she spent the remainder of her career teaching young women at Bryn Mawr College and researching at the Institute for Advanced Study in Princeton (See Einstein, A.)

[4] We could flip this entire conversation and talk about the ‘ability’ of a person preserved across shifts of item difficulty, type, content, yada, yada, yada, and it would be equally true. But I repeat myself again.

[5] Except for the ‘boy meets girl, . . .’ aspect, this question is the basic plot of “Slumdog Millionaire”, undoubtedly the greatest psychometric movie ever made. I wouldn’t however describe the protagonist as “non-bright”, which suggests there is something innate in whatever trait is operating and exposes some of the flaws in my use of the rather pejorative term. I should use something more along the lines of “poorly schooled” or “untrained”, placing effort above talent.

Lexiles: the making of a measure

PDF download: Using Lexiles Safely

A recent conversation with a former colleague (it was more of a one-way lecture) about what psychometricians don’t understand about students and education led me to resurrect an article that I wrote for the Rasch Measurement Transactions four or five years ago. It deals specifically with Lexiles© but it is really about how one defines and uses measures in education and science.

The antagonism toward Lexiles in particular and Rasch measures in general is an opportunity to highlight some distinctions between measurement and analysis and between a measure and an assessment. Often when trying to discuss the development of reading proficiency, specialists in measurement and reading seem to be talking at cross-purposes. Reverting to argument by metaphor, measurement specialists are talking about measuring weight; and reading specialists, about providing proper nutrition.

There is a great deal involved in physical development that is not captured when we measure a child’s weight and the process of measuring weight tells us nothing about whether the result is good, bad, or normal; if you should continue on as you are, schedule a doctor’s appointment, or go to the emergency room without changing your underwear. Evaluation of the result is an analysis that comes after the measurement and depends on the result being a measure. No one would suggest that, because it doesn’t define health, weight is not worth measuring or that it is too politically sensitive to talk about in front of nutritionists. A high number does not imply good nutrition nor does a low number imply poor nutrition. Nonetheless, the measurement of weight is always a part of an assessment of well-being.

A Lexile score, applied to a person, is a measure of reading ability[i], which I use to mean the capability to decode words, sentences, paragraphs, and Supreme Court decisions. A Lexile score, as applied to a text, is a measure of how difficult the text is to decode. Hemingway’s “For Whom the Bell Tolls” (840 Lexile score) has been cited as an instance where Lexiles do not work. The argument runs that, because a 50th percentile sixth-grade reader could engage with this text, something must be wrong, since the book was written for adults. This counter-example, if true, is an interesting case. I have two counter-counter-arguments: first, all measuring instruments have limitations to their use and, second, Lexiles may be describing Hemingway appropriately.

First, outside the context of Lexiles, there is always difficulty for either humans or computer algorithms in scoring exceptional, highly creative writing. (I would venture to guess that many publishers, who make their livings recognizing good writing[ii], would reject Hemingway, Joyce, or Faulkner-like manuscripts if they received them from unknown authors.) I don’t think it follows that we should avoid trying to evaluate exceptional writing. But we do need to know the limits of our instruments.

I rely, on a daily basis, on a bathroom scale. I rely on it even though I believe I shouldn’t use it on the moon, under water, or for elephants or measuring height. It does not undermine the validity of Lexiles in general to discover an extraordinary case for which it does not apply. We need to know the limits of our instrument; when does it produce valid measures and when does it not.

Second, given that we have defined the Lexile for a text as the difficulty of decoding the words and sentences, the Lexile analyzer may be doing exactly what it should with a Hemingway text. Decoding the words and sentences in Hemingway is not that hard: the vocabulary is simple, the sentences short. That’s pretty much what the Lexile score reflects.

Understanding or appreciating Hemingway is something else again. This may be getting into the distinction between reading ability, as I defined it, and reading comprehension, as the specialists define that. You must be able to read (i.e., decode) before you can comprehend. Analogously, you have to be able to do arithmetic before you can solve math word problems[iii]. The latter requires the former; the former does not guarantee the latter. Necessary but not sufficient.

The Lexile metric is a developmental scale that is not related to instructional method or materials, or to grade-level content standards. The metric reflects increasing ability to read, in the narrow sense of decode, increasingly advanced text. As students advance through the reading/language arts curriculum, they should progress up the Lexile scale. Effective, including standards-based, instruction in ELA[iv] should cause them to progress on the Lexile scale; analogously good nutrition should cause children to progress on the weight scale[v].

One could coach children to progress on the weight scale in ways counter to good nutrition[vi]. One might subvert Lexile measurements by coaching students to write with big words and long sentences. This does not invalidate either weight or reading ability as useful things to measure. There do need to be checks to ensure we are effecting what we set out to effect.

The role of standards-based assessment is to identify which constituents of reading ability and reading comprehension are present and which absent. Understanding imagery and literary devices, locating topic sentences, identifying main ideas, recognizing sarcasm or satire, comparing authors’ purposes in two passages are within its purview but are not considered in the Lexile score. Its analyzer relies on rather simple surrogates for semantic and syntactic complexity.

The role of measurement on the Lexile scale is to provide a narrowly defined measure of the student’s status on an interval scale that extends over a broad range of reading from Dick and Jane to Scalia and Sotomayor. The Lexile scale does not define reading, recognize the breadth of the ELA curriculum, or replace grade-level content standards-based assessment, but it can help us design instruction and target assessment to be appropriate to the student. We do not expect students to say anything intelligent about text they cannot decode, nor should we attempt to assess their analytic skills using such text.

Jack Stenner (aka Dr. Lexile) offers as one of his parables that you don’t buy shoes for a child based on grade level, yet we don’t think twice about assigning textbooks with the formula (age – 5). It’s not one-size-fits-all in physical development. Cognitive development is probably no simpler if we were able to measure all its components. To paraphrase Ben Wright, how we measure weight has nothing to do with how skillful you are at football, but you better have some measures before you attempt the analysis.

[i] Ability may not be the best choice of a word. As used in psychometrics, ability is a generic placeholder for whatever we are trying to measure about a person. It implies nothing about where it came from, what it is good for, or how much is enough. In this case, we are using reading ability to refer to a very specific skill that must be taught, learned, and practiced.

[ii] It may be more realistic to say they make their livings recognizing marketable writing, but my cynicism may be showing.

[iii] You also have to decode the word problem but that’s not the point of this sentence. We assume, often erroneously, that the difficulty of decoding the text is not an impediment to anyone doing the math.

[iv] Effective instruction in science, social studies, or basketball strategy should cause progress on the Lexile measure as well; perhaps not so directly. Anything that adds to the student’s repertoire of words and ideas should contribute.

[v] For weight, progress often does not equal gain.

[vi] Metaphors, like measuring instruments, have their limits and I may have exceeded one. However, one might consider the extraordinary measures amateur wrestlers or professional models employ to achieve a target weight.

Useful and Almost Number-free Reports

If I had asked my customers what they wanted, they would have said a faster horse. Henry Ford

Percentile ranks on student reports are tests as contests.

Raw scores on student reports are like live chickens on dinner plates.

If I were to step on my bathroom scale and see a single number like, say for example, 165 lbs (or 75 kilos) with no further explanation, I wouldn’t need an interpretation guide or course in psychometrics to know what the number means or to decide if I like it or not. Nor would I be writing to the manufacturer of the scale asking, “What’s a pound?” If I were to take a qualifying test to be a licensed dog walker and I received a postcard in the mail that said simply, “Your score is 509 GRits ± 41,” I would be a little frustrated and a lot annoyed. And I would need to ask some basic questions like, “What does that mean?” “What’s a GRit?” “Is the ‘41’ the standard error of measurement, or does it represent a confidence interval?” “If so, what level of confidence?” “What does 509 actually say about my proficiency to walk dogs?” And, of course, “Did I pass?”

If the answer to the last question is yes, then most candidates, possibly excluding the psychometricians, will quickly lose interest in the others. If the answer is no, then the tone becomes a little more strident and now includes questions like, “Who decides what passing is?” “What did I miss?” “How close was I?” and if I was close, “Isn’t there almost a 50% chance that I actually passed?”[1] People who did pass never seem concerned about the other half of this question.

If a postcard with a Scale Score (even with some form of the standard error of measurement) isn’t useful or meaningful, what does a report need to be? Examinee reports vary depending on the audience and the purpose of the exam, but for a report going to a student, teacher, parent, or anyone else who might actually make use of the information for the betterment of the examinee, there would seem to be four basic components:

  1. Identification
  2. Measurement
  3. Control
  4. Interpretation

There needs to be enough identifying information to locate the examinee and to deliver the report to the right place. For dog walking candidates, the address on the front of the postcard did the trick. For education, it probably takes some combination of student name, teacher name, classroom/section/period, grade, school, and district. We should also mention the name of the test and the date taken. That is almost always more than enough to locate the right person; if you are still worried about it, add birth date or a parent’s name. Our original list should be adequate to locate the teacher and the teacher should know the student by name.

Measurement of the examinee to determine something about status or progress is the point of the exercise. This report section could be the simple “509 GRits” but it should also include some indication of our confidence in this measurement, which means the standard error of measurement in some guise. To make it concrete, in this example, the standard error of measurement is 41, with a 95% confidence interval of 509 ± 82, or 427 to 591. It is probably prudent to never use a phrase involving the word “error” when communicating with parents or school boards; they tend to interpret “error” as “mistake” and blame you. One often sees phrases like “probable range” to describe the interval between the measure plus and minus two standard errors (or some other arbitrary multiplier), which avoids saying ‘error’ and also ducks the squabble between the frequentists and the Bayesians about what confidence means. A picture may not be worth a thousand words in this case but here it is.

Dog walking scale (3)

I am inclined to leave it at that for the moment but not everyone thinks a line with scale scores and a marker for the person’s location is all that informative. I am often overruled and asked to add other (useful and relevant) information like a conclusion (e.g., pass/fail or performance level[2]) and sometimes even subtest measures to the measurement section. One could also say things like a person at 509 has a 41% likelihood of testing below the Competent level next time and a 1e-6 likelihood of testing above Skilled. These are really steps toward control and interpretation, not measurement, so the purist in me wants to put them in the next two sections. It’s not worth falling on my sword, though, and what goes in which section is less rigid than I seem to be implying.
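The probable range and the two likelihoods quoted above can be roughly reproduced with a normal approximation. This is only a sketch: it assumes (my assumption, not something the report states) that a retest measure is normally distributed around 509 with a standard deviation equal to the standard error of 41. The cut scores are the performance levels marked on the scale.

```r
# Sketch only: assumes a retest measure is normal around the observed measure,
# with standard deviation equal to the reported standard error.
measure <- 509   # observed GRits
se      <- 41    # standard error of measurement
cuts    <- c(Competent = 500, Skilled = 700, Master = 850)

# "Probable range": the measure plus and minus two standard errors
probable.range <- measure + c(-2, 2) * se

# Chance of testing below Competent, and above Skilled, next time
p.below.competent <- pnorm(cuts[["Competent"]], mean = measure, sd = se)
p.above.skilled   <- pnorm(cuts[["Skilled"]], mean = measure, sd = se,
                           lower.tail = FALSE)

print(probable.range)
print(round(p.below.competent, 2))
print(signif(p.above.skilled, 2))
```

Under that assumption the first probability comes out near 41% and the second is of the order of one in a million. A different assumption (say, allowing for error in both measurements) would move both numbers.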

I am willing to give some meaning to the score by showing the ruler and some milestones along it. At this point, the scale score may be less meaningful than the milestones, but with experience, the scale score can become a useful shorthand for the milestones. It doesn’t take very much experience to understand what temperatures of 0°C and 37°C imply, even for US residents. This leads me to the less simple “map” below.

Dog walking scale (2)

Diagnosis With the Model

The vertical scale is GRits[3], which is our measure of dog walking proficiency and almost meaningless by itself; we wouldn’t lose much if we left the numbers off entirely[4]. The column of text labels is the substantive description of the scale. Topics at the low end, which are relatively easy, deal with type and use of basic equipment; topics at the high end, which are more difficult, deal with complex behaviors. The GRits bring precision; the text labels bring meaning.
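For readers who want to turn GRit differences into odds, here is a minimal sketch of the base-three convention described in footnote 3. The function names and the item locations are hypothetical, chosen only for illustration; only differences between person and item matter.

```r
# Sketch: odds and probability of success from a GRit difference,
# using the base-three convention of footnote 3 (100 GRits = 3 to 1 odds).
grit.odds <- function (person, item) 3^((person - item) / 100)
grit.prob <- function (person, item) {
   odds <- grit.odds(person, item)
   odds / (1 + odds)
}
grit.to.logit <- function (grits) grits * log(3) / 100   # 100 GRits = log(3) logits

person <- 509
# Hypothetical item locations, 100 GRits below, at, and above the person
items <- c(easier = 409, on.target = 509, harder = 609)

print(grit.odds(person, items))            # 3 to 1, 1 to 1, 1 to 3
print(round(grit.prob(person, items), 2))  # 0.75, 0.50, 0.25
print(grit.to.logit(100))
```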

The red vertical line has a tic mark for each possible raw score and a red diamond to mark the location of our hypothetical person. The red horizontal lines are the person’s location and plus/minus two standard errors. You can also add some normative information like means, standard deviations, frequency distributions, or quantiles, if you are into that sort of thing.

The gray horizontal lines mark the performance levels: 500 is Competent, 700 is Skilled, and 850 is Master. Labelling the lines rather than the regions between is not standard practice in educational assessment but it avoids the inconvenience of needing to label the region below Competent and the misinterpretation of the levels as actual developmental states or stages rather than simply more or less arbitrary criteria for addressing school accountability or dispensing certificates. So far we are just displaying the result, not interpreting it.

Control of the measurement model means, either, ensuring that we are warranted in treating the result as a valid measure, in the full sense of the word as we just did, or diagnosing what the anomalies tell us about the examinee. This is again the dichotomy of “diagnosing with the model” and “diagnosing from the model.” Determining which of these paths to follow requires a bit more than simply computing ‘infit’ or ‘outfit’ and consulting the appropriate table of big numbers. This involves looking at individual items, splitting the items into clusters, and looking for things that are funny. Maps like the following can be more useful than any fit statistic and almost number-free.

Dog Walking

Diagnosis From the Model

The display has the same measurement information as before and considerable detail about items and item clusters. First, the red vertical line still refers to the total test and still has a tic mark for each possible raw score and the red diamond for the person. It now has a black diamond for each item response; items to the left of the line are incorrect; those to the right are correct with the distance from the line representing the probability against the response; the greater the distance, the more improbable the response. The dotted vertical lines (blue shading) are control lines and represent probabilities of 75%. We don’t need to be much concerned about anything in the blue. There are four or five items, two difficult items passed and two or three easy items missed, outside the control lines that might warrant investigation.
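The arithmetic behind the plotting symbols is simple enough to sketch. The block below computes the probability against each observed response and flags anything beyond a 75% control limit. The function name, the item difficulties, and the response string are all made up for illustration, and everything is in logits rather than GRits.

```r
# Sketch: flag responses whose probability against exceeds a control limit.
# Abilities and difficulties are in logits; all values here are made up.
surprises <- function (b, d, x, limit = 0.75) {
   p.right   <- 1 / (1 + exp(d - b))                  # chance of a correct response
   p.against <- ifelse(x == 1, 1 - p.right, p.right)  # chance of NOT seeing what we saw
   data.frame(d = d, x = x, p.against = round(p.against, 2),
              flag = p.against > limit)
}

d <- c(-2.5, -1.5, -0.5, 0, 0.5, 1.5, 2.5)   # hypothetical item difficulties
x <- c( 1,    0,    1,   1, 0,   0,   1)     # hypothetical responses (1 = correct)
result <- surprises(b = 0, d = d, x = x)
print(result)
```

In this made-up string, the missed easy item and the passed hard item are the only two flagged. A 75% limit corresponds to an item sitting log(3), or about 1.1 logits, away from the person on the improbable side.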

Most of the same information has been added for each of five item clusters. These are very short tests so the results may seem a little erratic but in all cases, one more item right or wrong would bring them close to the total test measure. If you are so inclined, the number correct score can be found by counting the tic marks[5] up to the red diamond. You can’t necessarily find it by counting the item plotting symbols to the right of the scales because they can represent multiple items. (And it is further confused because some items were not assigned to clusters.) Overall, this is a well-behaved person.

Because I often disparage anything short of computer-administered, fully adaptive tests (CAT), I need to point out a serious issue for me and model control: in the world of CAT, there are no surprises. If we do the CAT right, everything should be in the blue bands. This puts all the load for model control on the item clusters. In our example, we have behaved as though clusters were based on content, which is appropriate for reporting. For control, we would do more by forming clusters on other bases. Difficulty, sequence, item type, item format, and item exposure or age are natural choices, but as we become more creative in developing items for computer administration, there could be others.

Interpretation of the measure means explaining what the measurement tells us about the status and progress of the examinee. Establishing ‘performance levels,’ say, Master, Skilled, or Competent dog walker, is a significant step from measurement to meaning, or from quantification to qualification. Announcing that the candidate is above, say, the ‘Competent’ performance level is a start. Diagnosis with the model would then talk about what items candidates at this level have mastery of, what items they have no clue about, and what items are at the candidate’s level. This is reporting what the candidate can do, what the candidate can’t do, and what the candidate should find challenging but possible. That suggests three obvious comments that any computer could readily generate as personalized feedback assuming a well-behaved response pattern and diagnosis with the model.

Personalizing a report takes more than generating text that restates the obvious and uses the candidate’s name in a complete sentence, like “Ron, your Dog Walking GRit is 509 ± 82, which means we think you are Competent to walk dogs”. When we have a computer generating the feedback, we should use any intelligence, artificial or otherwise, that is available. It is generally ok to start with the generic, “Ron, your total score is …” and “You did well on item clusters D and A, but were weak on cluster C,” and move on to things that are less obvious. I prefer to open with a positive, encouraging statement (clusters D and A), then mention problem areas (cluster C), and close with things to work on immediately (topics that haven’t been mastered but are close). Ideally, we would discuss the specifics of the surprising responses. This includes difficult items that were passed and easy items that were missed. This is moving into diagnosis from the model.

The more specifics, the better, even item statistics and foil analysis if anyone asks. But it would be much more valuable and much more work for both the item developers and systems analysts to provide a discussion of the type of misunderstandings or errors implied by any incorrect responses. It is work for the item developers because they would need to understand and explain why every distractor is there and what selecting it means. It is work for the system analysts because they need to keep track of and manage everything.

In today’s world, there is little reason to limit reporting to what can be squeezed onto an 8.5×11 sheet of paper or by concerns about the cost of color printing[6]. Paper copies are static, often cramped and overwhelming. Ideally, an electronic report, like an electronic test, will be interactive, dynamic, and engaging, with effective, targeted scaffolding. It should begin with the general overview and then allow the user to explore or be led through the interesting, important, and useful aspects of the responses, showing more and more detail as needed. Performance Level Descriptors and item clusters could be defined and displayed on request; item details could pop up when the plotting symbol is clicked.

This is not free; there will be resistance to giving items away because they are expensive and the item bank is sacred. Hopefully, we are moving away from once-a-year, high-stakes tests toward testing when it is helpful for the student, drawing from computer-generated and crowd-sourced item banks. And more importantly, toward immediate and informative feedback that might actually have some educational value.

 

[1] No, you didn’t pass, but if you test again with a parallel form, there is almost a 50% chance that you will.

[2] The little gray lines mark the performance levels (Competent, Skilled, and Master from left to right).

[3] Because GRits use three, rather than e, as their base, a difference of 100 GRits means 3 to 1 odds. Our hypothetical person has 3 to 1 odds of answering an item about leash tension but less than 1 to 3 odds for an item about tugging. More generally, a difference of 100k GRits means odds of 3^k to 1. That’s friendlier than base e and odds like 2.71828…^k to one.

[4] The report is (almost) number-free in the sense that the numbers are not needed to understand and use the results. It is not number-free in another sense because they are essential to provide the framework to create and arrange the display.

[5] This demonstration does not include tic marks for zero and perfect; in real life, you would probably have to account for them somehow. They tend to radically distort the scale without adding much, if any, information. They would extend the scale over 100 GRits in both directions and have probable ranges more than four times that in width.

[6] Someone will undoubtedly want a printer-friendly version to stick in the file cabinet because they have the file cabinet.

Viiif: Apple Pie and Disordered Thresholds Redux

A second try at disordered thresholds

It has been suggested, with some justification, that I may be a little chauvinistic depending so heavily on a baseball analogy when pondering disordered thresholds. So for my friends in Australia, Cyprus, and the Czech Republic, I’ll try one based on apple pie.

Certified pie judges for the Minnesota State Fair are trained to evaluate each entry on the criteria in Table 1 and the results for pies, at least the ones entered into competitions, are unimodal, somewhat skewed to the left.

Table 1: Minnesota State Fair Pie Judging Rubric

Aspect               Points
Appearance           20
Color                10
Texture              20
Internal appearance  15
Aroma                10
Flavor               25
Total                100

We might suggest some tweaks to this process, but right now our assignment is to determine preferences of potential customers for our pie shop. All our pies would be 100s on the State Fair rubric so it won’t help. We could collect preference data from potential customers by giving away small taste samples at the fair and asking each taster to respond to a short five-category rating scale with categories suggested by our psychometric consultant.

My feeling about this pie is:

0  I’d rather have boiled liver
1  Can I have cake instead?
2  Almost as good as my mother’s
3  Among the best I’ve ever eaten
4  I could eat this right after a major feast!

The situation is hypothetical; the data are simulated from unimodal distributions with roughly equal means. On day one, thresholds 3 and 4 were reversed; on day two, thresholds 2 and 3 for some tasters were also reversed. None of that will stop me from interpreting the results. It is not shown in the summary of the data below, but the answer to our marketing question is that pies made with apples were the clear winners. (To appropriate a comment that Rasch made about correlation coefficients, this result is population-dependent and therefore scientifically rather uninteresting.) Any problems that the data might have with the thresholds did not prevent us from reaching this conclusion rather comfortably. The most preferred pies received the highest scores in spite of our problematic category labels. Or at least that’s the story I will include with my invoice.

The numbers we observed for the categories are shown in Table 2. Right now we are only concerned with the categories, so this table is summed over the pies and the tasters.

Table 2: Results of Pie Preference Survey for Categories

Category                                        Day One  Day Two
0  I’d rather have boiled liver                      10      120
1  Can I have cake instead?                         250      751
2  Almost as good as my mother’s                    785       95
3  Among the best I’ve ever eaten                    83       22
4  I could eat this right after a major feast!      321      482

In this scenario, we have created at least two problems. First, the wording of the category descriptions may be causing some confusion. I hope those distinctions survive the cultural and language differences between the US and the UK. Second, the day two group is making an even cruder distinction among the pies: almost “I like it” or “I don’t like it.”
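A quick way to see the reversals in the Table 2 counts is to look at the log ratio of adjacent category frequencies. This is a crude descriptive stand-in for threshold estimates, not a Rasch calibration (it ignores pies and tasters entirely), and the function name is my own.

```r
# Counts from Table 2; categories 0 to 4
counts <- rbind(One = c(10, 250, 785, 83, 321),
                Two = c(120, 751, 95, 22, 482))
colnames(counts) <- 0:4

# Share of responses in each category, by day
print(round(prop.table(counts, margin = 1), 2))

# Crude threshold k: log of (count in category k-1 / count in category k).
# A descriptive stand-in only, not a Rasch estimate.
crude.thresholds <- function (n) log(n[-length(n)] / n[-1])
tau <- t(apply(counts, 1, crude.thresholds))
colnames(tau) <- paste0("t", 1:4)
print(round(tau, 2))
```

On these counts the crude third threshold sits above the fourth on day one, and on day two the second sits above the third as well, which is the pattern of reversals described for the simulation.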

Category 4 was intended to capture the idea that this pie is so good that I will eat it even if I have already eaten myself to the point of pain. For some people that may not be different from “this pie is among the best I’ve ever eaten,” which is why relatively few chose category 3. Anything involving mothers is always problematic on a rating scale. Depending on your mother, “Almost as good as my mother’s” may be the highest possible rating; for others, it may be slightly above boiled liver. That suggests there may be a problem with the category descriptors that our psychometrician gave us, but the fit statistics would not object. And it doesn’t explain the difference between days one and two.

Day Two happened to be the day that apples were being judged in a separate arena, completely independently of the pie judging. Consequently every serious apple grower in Minnesota was at the fair. Rather than spreading across the five categories, more or less, this group tended to see pies as a dichotomy: those that were made with apples and those that weren’t. While the general population spread out reasonably well across the continuum, the apple growers were definitely bimodal in their preferences.

The day two anomaly is in the data, not the model or thresholds. The disordered thresholds, which exposed the anomaly because we imposed a strong model but which are not reflected in the standard fit statistics, are an indication that we should think a little more about what we are doing. Almost certainly, we could improve on the wording of the category descriptions. But we might also want to separate apple orchard owners from other respondents to our survey. The same might also be true for banana growers but they don’t figure heavily in Minnesota horticulture. Once again, Rasch has shown us what is population-independent, i.e., the thresholds (and therefore scientifically interesting) and what is population-dependent, i.e., frequencies and preferences (and therefore only interesting to marketers).

These insights don’t tell us much about marketing pies better but I wouldn’t try to sell banana cream to apple growers and I would want to know how much of my potential market are apple growers. I am still at a loss to explain why anyone, even beef growers, would pick liver over anything involving sugar and butter.

Viib. Using R to do a little work

Ability estimates, perfect scores, and standard errors

The philosophical musing of most of my postings has kept me entertained, but eventually we need to connect models to data if they are going to be of any use at all. There are plenty of software packages out there that will do a lot of arithmetic for you but it is never clear exactly what someone else’s black box is actually doing. This is sort of a DIY black box.

The dichotomous case is almost trivial. Once we have estimates of the item’s difficulty d and the person’s ability b, the probability of the person succeeding on the item is p = B / (B + D), where B = exp(b) and D = exp(d). If you have a calibrated item bank (i.e., a bunch of items with estimated difficulties neatly filed in a cloud, flash drive, LAN, or box of index cards), you can estimate the ability of any person tested from the bank by finding the value of b that makes the observed score equal the expected score, i.e., solves the equation r = ∑p, where r is the person’s number correct score and p is as just defined.

If you are more concrete than that, here is a little R-code that will do the arithmetic, although it’s not particularly efficient nor totally safe. A responsible coder would do some error trapping to ensure that r is in the range 1 to L-1 (where L = length of d) and that the ds are in logits and centered at zero. Rasch estimation and the R interpreter are robust enough that you and your computer will probably survive those transgressions.


#Block 1: Routine to compute logit ability for number correct r given d
Able <- function (r, d, stop=0.01) { # r is raw score; d is vector of logit difficulties
   b <- log (r / (length (d)-r))    # Initialize
   repeat {
         p <- P(b,d)                               # probabilities at the current b
         adjust <- (r - sum(p)) / sum(PQ(p))       # Newton-Raphson step
         b <- b + adjust
         if (abs(adjust) < stop) return (b)
   }
}
P <- function (b, d) (1 / (1+exp (d-b))) # computationally convenient form for probability
PQ <- function (p) (p-p^2)               # p(1-p); its sum is minus the 2nd derivative of the log likelihood


If you would like to try it, copy the text between the lines above into an R-window and then define the ds somehow and type in, say, Able(r=1, d=ds) or else copy the commands between the lines below to make it do something. Most of the following is just housekeeping; all you really need is the command Able(r,d) if r and d have been defined. If you don’t have R installed on your computer, following the link to LLTM in the menu on the right will take you to an R site that has a “Get R” option.

In the world of R, the hash tag marks a comment so anything that follows is ignored. This is roughly equivalent to other uses of hash tags and R had it first.


#Block 2: Test ability routines
Test.Able <- function (low, high, inc) {
# Create a vector of logit difficulties to play with
   d = seq(low, high, inc)

# The ability for a raw score of 1,
# overriding the default convergence criterion of 0.01 with 0.0001
   print ("Ability r=1:")
   print (Able(r=1, d=d, stop=0.0001))
# To get all the abilities from 1 to L-1
# first create a spot to receive results
   b = NA
# Then compute the abilities; default convergence = 0.01
   for (r in 1:(length(d)-1) )
      b[r] = Able (r, d)
# Show what we got
   print ("Ability r=1 to L-1:")
   print (round(b,3))
}
Test.Able (-2,2,0.25)
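
As a cross-check on a Newton-type routine like Able, the same equation r = ∑p can be handed to base R’s general-purpose root finder, uniroot. This is a sketch of an alternative, not the routine above; the function name able.uniroot and the search interval of -10 to 10 logits are my own assumptions.

```r
# Sketch: solve r = sum(p) for b with base R's uniroot (a bracketing root
# finder), as a cross-check on a Newton-type routine like Able.
able.uniroot <- function (r, d, lower = -10, upper = 10) {
   expected.minus.observed <- function (b) sum(1 / (1 + exp(d - b))) - r
   uniroot(expected.minus.observed, lower = lower, upper = upper, tol = 1e-8)$root
}

d <- seq(-2, 2, 0.25)                 # same difficulties as Test.Able (-2,2,0.25)
b <- sapply(1:(length(d) - 1), able.uniroot, d = d)
print(round(b, 3))
```

The two approaches should agree to within the convergence criterion; if they don’t, one of us has a bug.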


I would be violating some sort of sacred oath if I were to leave this topic without the standard errors of measurement (sem); we have everything we need for them. For a quick, average-of-sorts sem, useful for planning and test design, we have the Wright-Douglas approximation: sem = 2.5/√L, where L is the number of items on the test. Wright & Stone (1979, p 135) provide another semi-shortcut based on height, width, and length, where height is the percent correct, width is the range of difficulties, and length is the number of items. Or to extricate the sem for almost any score r from the logit ability table, sem_r = √[(b_{r+1} – b_{r-1})/2]. Or if you want to do it right, sem_r = 1 / √[∑p_r(1-p_r)].

Of course, I have some R-code. Let me know if it doesn’t work.


#Block 3: Standard Errors and a few shortcuts
# Wright-Douglas ‘typical’ sem
wd.sem <- function (k) (2.5/sqrt(k))
#
# Wright-Stone from Mead-Ryan
SEMbyHWL <- function (H=0.5,W=4,L=1) {
     C2 <- NA
     W <- ifelse(W>0,W,.001)
     for (k in 1:length(H))
            C2[k] <-W*(1-exp(-W))/((1-exp(-H[k]*W))*(1-exp(-(1-H[k])*W)))
return (sqrt( C2 / L))
}
# SEM from logit ability table
bToSem <- function (r1, r2, b) {
     s  <- NA
     for (r in r1:r2)
           s[r] <- (sqrt((b[r+1]-b[r-1])/2))
return (s)
}
# Full blown SEM
sem <- function (b, d) {
     s <-  NA
    for (r in 1:length(b))
          s[r] <- 1 / sqrt(sum(PQ(P(b[r],d))))
 return (s)
}

To get the SEM’s from all four approaches, all you really need are the four lines after “Now we’re ready” below. The rest is start up and reporting.


 

#Block 4: Try out Standard Error procedures
Test.SEM <- function (d) {
# First, a little setup (assuming Blocks 1 and 3 are still loaded.)
   L = length (d)
   W = max(d) - min(d)
   H = seq(L-1)/L
# Then compute the abilities; default convergence = 0.01
   b = NA
   for (r in 1:(L-1))
      b[r] = Able (r, d)
# Now we're ready
   s.wd = wd.sem (length(d))
   s.HWL = SEMbyHWL (H,W,L)
   s.from.b = bToSem (2,L-2,b) # ignore raw score 1 and L-1 for the moment
   s = sem(b,d)
# Show what we got
   print ("Height")
   print (H)
   print ("Width")
   print (W)
   print ("Length")
   print (L)
   print ("Wright-Douglas typical SEM:")
   print (round(s.wd,2))
   print ("HWL SEM r=1 to L-1:")
   print (round(s.HWL,3))
   print ("SEM r=2 to L-2 from Ability table:")
   print (round(c(s.from.b,NA),3))
   print ("MLE SEM r=1 to L-1:")
   print (round(s,3))
   plot (b,s,xlim=c(-4,4),ylim=c(0.0,1),col="red",type="l",xlab="Logit Ability",ylab="Standard Error")
   points (b,s.HWL,col="green",type="l")
   points (b[-(L-1)],s.from.b,col="blue",type="l")
   abline (h=s.wd,lty=3)
}
Test.SEM (seq(-3,3,0.25))

Among other sweeping assumptions, the Wright-Douglas approximation for the standard error assumes a “typical” test with items piled up near the center. What we have been generating with d=seq(-3,3,0.25) are items uniformly distributed over the interval. While this is effective for fixed-form group-testing situations, it is not a good design for measuring any individual. The wider the interval, the more off-target the test will be. The point of bringing this up at this point is that Wright & Douglas will underestimate the typical standard error for a wide, uniform test. Playing with the Test.SEM command will make this painfully clear.
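To see the effect of width without running the whole of Block 4, here is a small self-contained sketch. It looks only at a person sitting exactly at the center of a uniform 25-item test, which is the best case for that test, and compares the standard error with the Wright-Douglas 2.5/√L. The function name sem.at and the particular widths are my choices.

```r
# Sketch: standard error for a person at the center of a uniform test,
# compared with the Wright-Douglas 2.5/sqrt(L), as the test gets wider.
sem.at <- function (b, d) {
   p <- 1 / (1 + exp(d - b))
   1 / sqrt(sum(p * (1 - p)))
}
L <- 25
widths <- c(2, 4, 6, 8)                      # logit range of item difficulties
centered <- sapply(widths,
                   function (W) sem.at(0, seq(-W/2, W/2, length.out = L)))
print(data.frame(width = widths, sem.center = round(centered, 3),
                 wright.douglas = 2.5 / sqrt(L)))
```

The centered standard error grows steadily with the width and passes the Wright-Douglas value somewhere between widths of four and six logits; people away from the center fare worse.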

The Wright-Stone HWL approach, which preceded Wright-Douglas, is also intended for test design, determining how many items are needed and how they should be distributed. This suggested the best test design is a uniform distribution of item difficulties, which may have been true in 1979 when there were no practicable alternatives to paper-based tests. The approach boils down to an expression of the form SEM = C / √L, where C is a rather messy function of H and W. The real innovation in HWL was the recognition that test length L could be separated from the other parameters. In hindsight, realizing that the standard error of measurement has the square root of test length in the denominator doesn’t seem that insightful.
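For planning, the form SEM = C / √L can be turned around to ask how many items a target standard error requires: L = (C / SEM)². A minimal sketch, using the Wright-Douglas constant of 2.5 as the default C; the function name is hypothetical, and any other design would need its own C from H and W.

```r
# Sketch: items needed to reach a target standard error, from SEM = C / sqrt(L).
# C = 2.5 is the Wright-Douglas "typical" constant; other designs need their own C.
items.needed <- function (target.sem, C = 2.5) ceiling((C / target.sem)^2)

print(items.needed(0.5))    # 25 items
print(items.needed(0.3))    # 70 items
print(items.needed(0.25))   # 100 items: halving the sem quadruples the length
```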

We also need to do something intelligent or at least defensible about the zero and perfect scores. We can’t really estimate them because there are no abilities high enough for a perfect number correct or low enough for zero to make either L = ∑p or 0 = ∑p true. This reflects the true state of affairs; we don’t know how high or how low perfect and zero performances really are but sometimes we need to manufacture something to report.

Because the sem for 1 and L-1 are typically a little greater than one, in logits, we could adjust the ability estimates for 1 and L-1 by 1.2 or so; the appropriate value gets smaller as the test gets longer. Or we could estimate the abilities for something close to 0 and L, say, 0.25 and L-0.25. Or you can get slightly less extreme values using 0.33 or 0.5, or more extreme using 0.1.

For the example we have been playing with, here’s how much difference it does or doesn’t make. The first entry in the table below abandons the pseudo-rational arguments and says the square of something a little greater than one is 1.2 and that works about as well as anything else. This simplicity has never been popular with technical advisors or consultants. The second line moves out one standard error squared from the abilities for a score of one and for one less than perfect. The last three lines estimate the ability for something “close” to zero or perfect. Close is defined as 0.33 or 0.25 or 0.10 score points. Once the blanks for zero and perfect are filled in, we can proceed with computing a standard error for them using the standard routines and then reporting measures as though we had complete confidence.

Method     Shift   Zero    Perfect
Constant   1.20    -5.58   5.58
SE shift   One     -5.51   5.51
Shift      0.33    -5.57   5.57
Shift      0.25    -5.86   5.86
Shift      0.10    -6.80   6.80

#Block 5: Abilities for zero and perfect: A last bit of code to play with the extreme scores and what to do about it.
# (Assumes Able from Block 1 and sem from Block 3 are still loaded.)
Test.0100 <- function (shift) {
   d = seq(-3,3,0.25)
   b = NA
   for (r in 1:(length(d)-1) ) b[r] = Able (r, d)
   score = 0:length(d)      # number correct, zero through perfect
# Adjust by something a little greater than one squared
   b0 = b[1]-shift[1]
   bL = b[length(d)-1]+shift[1]
   print (c("Constant shift",shift[1],round(b0, 2),round(bL, 2)))
   plot (c(b0,b,bL),score,xlim=c(-6.5,6.5),type="b",xlab="Logit Ability",ylab="Number Correct",col="blue")
# Adjust by one standard error squared
   s = sem(b,d)
   b0 = b[1]-s[1]^2
   bL = b[length(d)-1]+s[1]^2
   print (c("SE shift",round(b0, 2),round(bL, 2)))
   points (c(b0,b,bL),score,col="red",type="b")
# Estimate ability for something "close" to zero or perfect
   for (x in shift[-1]) {
      b0 = Able(x,d)                 # if you try Able(0,d) you will get an inscrutable error.
      bL = Able(length(d)-x,d)
      print (c("Shift",x,round(b0, 2),round(bL, 2)))
      points (c(b0,b,bL),score,type="b")
   }
}

Test.0100 (c(1.2,.33,.25,.1))

The basic issue is not statistics; it’s policy for how much the powers that be want to punish or reward zero or perfect. But, if you really want to do the right thing, don’t give tests so far off target.

Viiie: Ordered Categories, Disordered Thresholds

When the experts all agree, it doesn’t necessarily follow that the converse is true. When the experts don’t agree, the average person has no business thinking about it. B. Russell

The experts don’t agree on the topic of reversed thresholds and I’ve been thinking about it anyway. But I may be even less lucid than usual.

The categories, whether rating scale or partial credit, are always ordered: 0 always implies less than 1; 1 always implies less than 2; 2 always implies less than 3 . . . The concentric circle for k on the archery target is always inside (smaller thus harder to hit) than the circle for k-1. In baseball, you can’t get to second without touching first first. The transition points, or thresholds, might or might not be ordered in the data. Perhaps the circle for k-1 is so close in diameter to k that it is almost impossible to be inside k-1 without being inside k. Category k-1 might be very rarely observed, unless you have very sharp arrows and very consistent archers. Perhaps four-base hits actually require less of the aspect than three-base.

Continue . . . Ordered categories, disordered thresholds

Viiic: More than One; Less than Infinity

Rating Scale and Partial Credit models and the twain shall meet

For many testing situations, simple zero-one scoring is not enough and Poisson-type counts are too much. Polytomous Rasch models (PRM) cover the middle ground between one and infinity and allow scored responses from zero to a maximum of some small integer m. The integer scores must be ordered in the obvious way so that responding in category k implies more of the trait than responding in category k-1. While the scores must be consecutive integers, there is no requirement that the categories be equally spaced; that is something we can estimate just like ordinary item difficulties.

Once we admit the possibility of unequal spacing of categories, we almost immediately run into the issue, Can the thresholds (i.e., boundaries between categories) be disordered? To harken back to the baseball discussion, a four-base hit counts for more than a three-base hit, but four-bases are three or four times more frequent than three-bases. This raises an important question about whether we are observing the same aspect with three- and four-base hits, or with underused categories in general; we’ll come back to it.

To continue the archery metaphor, we now have a number, call it m, of concentric circles rather than just a single bull’s-eye with more points given for hitting within smaller circles. The case of m=1 is the dichotomous model and m→infinity is the Poisson, both of which can be derived as limiting cases of almost any of the models that follow. The Poisson might apply in archery if scoring were based on the distance from the center rather than which one of a few circles was hit; distance from the center (in, say, millimeters) is the same as an infinite number of rings, if you can read your ruler that precisely.
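To make the threshold question concrete, here is a minimal sketch of category probabilities under the usual partial credit form, in which the probability of category k is proportional to exp of the sum of (b - tau_j) over the thresholds up to k. That form is assumed here; this excerpt does not spell it out, and the threshold values are hypothetical. The sketch shows that when two thresholds are disordered, the middle category is never the most likely response at any ability, and that a single threshold gives back the dichotomous model.

```r
# Sketch of the usual partial credit form (assumed, not taken from this excerpt):
# P(x = k) is proportional to exp(sum over j <= k of (b - tau[j])); k = 0 gets exp(0).
category.probs <- function (b, tau) {
   logit.sums <- c(0, cumsum(b - tau))
   exp(logit.sums) / sum(exp(logit.sums))
}

ordered    <- c(-1, 1)    # hypothetical thresholds, in logits
disordered <- c( 1, -1)   # the same two thresholds, reversed

b <- seq(-3, 3, 0.5)
# Most likely category (0, 1, or 2) at each ability
modal <- function (tau) sapply(b, function (x) which.max(category.probs(x, tau)) - 1)
print(rbind(b = b, ordered = modal(ordered), disordered = modal(disordered)))

# With a single threshold (m = 1) this is the dichotomous model
print(category.probs(0.5, 0.2))
print(1 / (1 + exp(0.2 - 0.5)))
```

With the ordered thresholds, category 1 is the most likely response for abilities between the two thresholds; with the disordered pair it never is, which is the rarely observed middle ring of the archery target.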

Read on . . . Polytomous Rasch Models

VIII. Beyond “THE RASCH MODEL”

All models are wrong. Some are useful. G.E.P.Box

Models must be used but must never be believed. Martin Bradbury Wilk

The Basic Ideas and polytomous items

We have thus far occupied ourselves entirely with the basic, familiar form of the Rasch model. I justify this fixation in two ways. First, it is the simplest and the form that is most used and second, it contains the kernel (b_n – d_i) for pretty much everything else. It is the mathematical equivalent of a person throwing a dart at a balloon. Scoring is very simple; either you hit it or you don’t and they know if you did or not. The likelihood of the person hitting the target depends only on the skill of the person and the “elusiveness” of the target. If there is one The Rasch Model, this is it.

Continue reading . . . More Models

Vb. Mean Squares; Outfit and Infit

A question can only be valid if the students’ minds are doing the things we want them to show us they can do. Alastair Pollitt

Able people should pass easy items; unable people should fail difficult ones. Everything else is up for grabs.

One can liken progress along a latent trait to navigating a river; we can treat it as a straight line but the pilot had best remember sandbars and meanders.

More about what could go wrong and how to find it

However one validates the items, with a plethora of sliced and diced matrices, between group analyses based on gender, ethnicity, ses, age, instruction, etc., followed by enough editing, tweaking, revising, and discarding to ensure a perfectly functioning item bank and to placate any Technical Advisory Committee, there is no guarantee that the next kid to sit down in front of the computer won’t bring something completely unanticipated to the process. After the items have all been “validated,” we still must validate the measure for every new examinee.

The residual analysis that we are working our way toward is a natural approach to validating any item and any person. But we should know what we are looking for before we get lost in the swamps of arithmetic. First, we need to make sure that we haven’t done something stupid, like score the responses against the wrong key or post the results to the wrong record.

Checking the scoring for an examinee is no different than checking for miskeyed items but with less data; either would have both surprising misses and surprising passes in the response string. Having gotten past that minefield, we can then check for differences by item type, content, or sequence, to note just the easy ones. Then, depending on what we discover, we proceed with doing the science either with the results of the measurement process or with the anomalies from the measurement process.
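To make “surprising misses and surprising passes” concrete, here is a rough sketch of the residual arithmetic for one examinee. It assumes the person’s ability and the item difficulties are already known in logits, uses the conventional definitions of the outfit and infit mean squares, and flags residuals beyond an arbitrary cutoff of 2; the data, the cutoff, and the function names are illustrative assumptions, not a prescription.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of success under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def person_fit(responses, ability, difficulties, cutoff=2.0):
    """Return (outfit, infit, surprises) for one 0/1 response string.

    outfit: plain average of the squared standardized residuals.
    infit: squared raw residuals summed, divided by the summed variances.
    surprises: (item index, response, z) where |z| exceeds the cutoff.
    """
    squared_residuals, variances, surprises = [], [], []
    for index, (x, d) in enumerate(zip(responses, difficulties)):
        p = rasch_probability(ability, d)
        variance = p * (1.0 - p)
        z = (x - p) / math.sqrt(variance)
        squared_residuals.append((x - p) ** 2)
        variances.append(variance)
        if abs(z) > cutoff:
            surprises.append((index, x, round(z, 2)))
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    infit = sum(squared_residuals) / sum(variances)
    return outfit, infit, surprises

# Hypothetical seven-item test, easiest item first, and a person at 0 logits.
difficulties = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
orderly = [1, 1, 1, 1, 0, 0, 0]    # passes the easy items, misses the hard ones
puzzling = [0, 1, 1, 1, 0, 0, 1]   # misses the easiest, passes the hardest

orderly_fit = person_fit(orderly, 0.0, difficulties)
puzzling_fit = person_fit(puzzling, 0.0, difficulties)
```

The second string has the same raw score as the first, so it yields the same measure; only the residuals reveal that something needs explaining, whether a wrong key, a misposted record, or something more interesting.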

IVb. The Point is Measurement

To measure the person, not the test

In spite of most of what has been said up to this point, we did not undertake this project with the hope of building better thermometers. The point is to measure the person. Because of the complete symmetry of the model, everything we have done for items, we can do again for people just by reversing the subscripts. For any two people who took some of the same items, count the number N12 that person 2 answered correctly and person 1 missed; also the number N21 that person 1 passed and person 2 missed. The relative abilities of the people will parallel expressions 23 and 25:
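Expressions 23 and 25 are not reproduced on this page, so the following is only a sketch of the counting just described. It assumes the standard pairwise result for the dichotomous model: among items where exactly one of the two people succeeded, the log of N21 over N12 estimates the ability of person 1 minus the ability of person 2, whatever the difficulties of the items. The response strings are invented.

```python
import math

def pairwise_ability_difference(responses_1, responses_2):
    """Estimate (ability of person 1) - (ability of person 2) in logits.

    Both arguments are 0/1 scores on the same items in the same order.
    Items that both people passed, or both missed, say nothing about
    the comparison and are ignored.
    """
    pairs = list(zip(responses_1, responses_2))
    n21 = sum(1 for x1, x2 in pairs if x1 == 1 and x2 == 0)  # person 1 passed, person 2 missed
    n12 = sum(1 for x1, x2 in pairs if x1 == 0 and x2 == 1)  # person 2 passed, person 1 missed
    if n21 == 0 or n12 == 0:
        raise ValueError("need at least one item in each direction to form the log ratio")
    return math.log(n21 / n12)

# Hypothetical response strings for two people on ten common items.
person_1 = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
person_2 = [1, 0, 1, 0, 0, 1, 1, 0, 0, 0]

# Here N21 = 4 and N12 = 1, so the estimate is log(4) logits in favor of person 1.
difference = pairwise_ability_difference(person_1, person_2)
```

No item difficulties appear anywhere in the calculation, which is the symmetry the paragraph above is pointing at: the items have been eliminated from the comparison of the people.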

IIIf. Another Aspect, Reading Aloud

Truth emerges more readily from error than from confusion. Bacon

There is no such thing as measurement absolute; there is only measurement relative. Jeanette Winterson.

The Case of the Missing Person Parameters

Eliminating nuisance parameters and #SpecificObjectivity

It was a cold and snowy night when, while trying to make a living as a famous statistical consultant, Rasch was summoned to the isolated laboratory of a renowned reading specialist to analyze data related to the effect of extra instruction for poor readers. There may be better ways to make a statistician feel a valued and respected member of the team than to ask for an analysis of data collected years earlier but Rasch took it on (Rasch, 1977, p. 63).

If we could measure, in the strictest sense, reading proficiency, measurements could be made before the intervention, after the intervention, and perhaps several points along the way. Then the analysis is no different, in principle, than if we were investigating the optimal blend of feed for finishing hogs or concentration of platinum for re-forming petroleum.

IIIe: A Spectrum of Math Proficiency and the Specter of Word Problems

In mathematics, one does not understand things. You just get used to them. Johann von Neumann

Defining mile posts along the way from counting your toes to doing calculus

The world has divided itself into two factions: those who think they don’t understand math and those who think they do. But we’re not talking about proving Fermat’s Last Theorem or correcting Stephen Hawking’s tensor algebra; we’re talking about counting, applying the four basic operators, and solving the dreaded word problems using basic algebra, geometry, and perhaps a little calculus. That just about covers the range from counting your toes to determining the spot in the outfield where a player should stand to catch a fly ball and should be good enough to get you through freshman math.

IIIc. Hot and Cold: making and connecting scales

In educational measurement, we don’t yet know if we are measuring heat or temperature.

A man with one watch knows what time it is; a man with two watches is never quite sure. Lee Segal

Building, calibrating, and equating instruments

Meaning comes from experience and experience comes from ignorance. You learn what hot means by touching the stove; you learn what cold is by not wearing your mittens. Nothing here answers the question of where cold ends and hot begins or what’s the line between “medium” and “medium rare;” those points are subjective, arbitrary, and personal; hopefully not capricious.

Temperature is one of the first lessons we learn: what things are too hot to touch? When is the weather too cool to go without a jacket? When is it warm enough to go barefoot? How much fever warrants staying home from school? These concepts may define meaningful temperature bands, but “because mom says so” is not very objective, and definitely not measurement.

III. Abstracting Some Aspects

Measure what is measurable, and make measurable what is not so. Galileo Galilei

Measuring rocks and the significance of being sufficient

The process of measurement begins long before any data are collected. The starting point is a notion, or even better, a theory about an aspect of a class of things we want to understand better, maybe even do some science on. Successful measurement depends on clear thinking about the aspect and clever ideas for the agents. This is much more challenging and much more rewarding than any mathematical gymnastics that might be performed to fit model to data.

All analogies are limited but some are useful. Considering aspects of things far removed from cognitive traits may help avoid some of the pitfalls encountered when working too close to home. Hardness is a property of materials that is hard to define but we all know what it is when it hits us. Color is a narrow region of a continuous spectrum that non-physicists tend to think about as discrete categories. Temperature is an intimate part of our daily lives, which we are quite adept at sensing and more recently at measuring, but the closely connected idea, heat, may actually be more real, less bound to conventions and populations. If I could scale the proficiency of professional football teams and reliably predict the outcomes of games, I wouldn’t be writing this.

II. Measurement in Science

Measurement is the breaking up of a quantum of energy into equal units. George Herbert Mead

What does it mean “to measure” and #RaschMeasurement as a foreign language

If, in a discussion about buying a new table, your spouse were to say to you, “I measured the width of the room and …” you would not expect the conversation to degenerate immediately into a discussion about what is width, or what does measured mean, or who made your yardstick, or what units you used. But if, in a discussion with the school guidance counselor, you are told, “I measured the intelligence of your child and …” you could, and probably should, ask those same questions, although they probably won’t be any more warmly received in the guidance office than they were in the dining room.

I. Rasch’s Theory of Relativity

The reasonable man strives to adapt himself to the world; the unreasonable man persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man. George Bernard Shaw, Man and Superman: Maxims to a Revolutionist

Population dependent and convention versus Rasch’s #SpecificObjectivity

While trying to understand what Rasch meant when he referred (as he often did) to tests “built by my methods and conforming to my principles,” I began this philosophical apology, which was intended to help bring sanity and civility to our approach to the practical problems of assessing status, monitoring growth, evaluating effectiveness, and, in short, doing science in the social sciences.

If you have been introduced to Rasch Measurement as a special case of the more mathematically exotic item response theory (IRT), then you and Rasch have not been properly introduced, and you are probably under the misimpressions that:

  • All the mathematical gymnastics associated with IRT are necessary.
  • Rasch is what you do if you don’t have the resources to do it right.

The point of this treatise is to disabuse you of those misunderstandings. If I had graduate students, this is what I would tell them to inoculate them against the standard introduction-to-measurement course. We will reserve the term IRT for the models that are not Rasch models, because that term seems to describe the intent of those models, with their focus on fitting the data. The older term latent trait theory fits better with the Rasch perspective, with its focus on the underlying aspect to be measured.

Our concern is the efficacy of Rasch measurement, how it works under controlled conditions, which can hardly be controversial. When the data conform to Rasch’s principles, i.e., the data are based on agents that are equally valid and reliable and not subject to interference from extraneous attributes of the objects, the models have the power to encompass and extend the best of Thurstone and Guttman. Guttman created a non-stochastic Rasch model, with very sufficient statistics; Thurstone defined “fundamental measurement”, which foreshadowed Rasch’s “specific objectivity.” This leads to measurement, as the layman understands the word, and sets the stage for the more vital tasks of making and analyzing measures.

Most of the mainstream debate surrounding Rasch measurement has focused on effectiveness, how the models function when confronted with real responses from real people to real tests, questionnaires, surveys, checklists, and other instruments, some put together with little or no thought for their suitability for measurement[1]. The conclusion that Rasch models are robust, i.e., do pretty well in this real world, should not be taken as justification to continue doing what we’ve been doing.

There are two commonly cited motivations for using Rasch’s models. The most popular is that they are extraordinarily easy to apply, compared to IRT models. Useful results can come from relatively small samples and the estimation algorithms converge readily unless the data are pathologically bizarre. In this very data-driven world, “Rasch analysis” (or the verb form, to rasch) seems to mean running data through Rasch calibration software. This requires minimal intellectual commitment and by itself doesn’t accomplish what Rasch set out to accomplish. Rasch’s more compelling motivation takes more effort.

Rasch’s Motivation

While trying to solve practical problems in statistics and in educational and psychological testing, Georg Rasch came upon a special class of models, which led him to a general philosophy of measurement. Rasch defined measurement: if it’s not Rasch measurement, it’s not measurement! Georg was a very unreasonable man.

The phrase “Rasch Measurement” is redundant; I use it to avoid ambiguity. For its adherents, Rasch measurement is axiomatic: self-evident because this is the way it must be or it’s not measurement:

The calibration of the agents must be independent of the objects used and the measurement of the objects must be independent of the agents used, over a useful range.[2]

This is not an unchecked assertion, but a rational criterion by which one can evaluate data under the bright light of a theory. From the model follow consequences. If the observed data are consistent with the anticipated consequences, we have met the conditions of Thurstone’s fundamental measurement and can treat the results with the same respect we have for measures of, say, mass, heat, distance, or duration, which, like reading fluency or art appreciation, are not real things but aspects of things.

I come by my biases naturally. Professionally, I am a grandson, on my statistics side, and great grandson, on my measurement side, of Sir Ronald Fisher. My view of statistics was shaped by people who worked with Fisher. I was grounded in statistics at Iowa State University in a department founded by George Snedecor; the focus was heavily on the design of experiments and the analysis of variance, which I learned from the likes of T.A. Bancroft, Oscar Kempthorne, David Jowett, Herbert David, and David Huntsberger, some of whom had known, worked, and undoubtedly argued (if you knew Fisher, Snedecor or Kempthorne) with Fisher on a regular basis.

My view of measurement was shaped by Georg Rasch, who worked with Fisher in England, taking away from that experience Fisher’s concepts of sufficient statistics and maximum likelihood estimation (MLE). The existence of sufficient statistics is the sine qua non of Rasch measurement. I learned Rasch measurement at Chicago (in the Department of Education founded by John Dewey) sitting at the feet of Benjamin Wright, Rasch’s most active and vocal North American disciple and evangelist. Rasch visited the University of Chicago while I was a student there, although I was too green to benefit much from that visitation.

There are strong parallels and much overlap between my two universes. Rasch measurement is to item response theory (IRT) as design of experiments is to general linear models (GLM). GLM is what you do if you can’t, or won’t, get the design right; IRT is what you do if you can’t, or won’t, get the instrument right. Both cases necessitate mathematical gymnastics that can substitute for clear thinking and mask poor planning. GLM and IRT rely on fitting models to “explain” data in a statistical sense, a venerable activity in some statistical traditions and very much in vogue in today’s Big Data world. But it’s not my tradition and it’s not measurement.

The point of experimental design is to produce observations that permit unambiguous inferences to answer specific, carefully stated questions: questions like, what level of catalyst is optimal for reforming crude oil or which feed ration is best for finishing hogs? We would really like the answer to be independent of the specific breed, gender, age, growing conditions, and intended use of the pig, but that isn’t going to happen. More likely, the answer will include a description of the specific domain to which it applies.

The point of Rasch measurement is to produce measures that unambiguously quantify a specific aspect of the object of interest; measures that are independent of any other attribute of either the object (e.g., a person) or our agents (e.g., items). We would like the agents to be universally applicable, but more likely they will be valid for small neighborhoods in the universe, which must be described.

Rasch’s Principle and Method

Design of experiments and Rasch measurement rely fundamentally on sufficient statistics to make inferences. Sufficient statistics are the constant, overarching theme. They are what make analysis of variance[3], as described by Fisher and Snedecor, which implies more than just simply partitioning the sum of squares, work; they are what make Rasch measurement, which implies more than just running data through an appropriately named piece of software, measurement. Once you have harvested the information in the sufficient statistic, you know everything the data have to tell you about the factor that you are testing or the aspect that you are measuring. That is Rasch’s principle.

Anything left in the data should be noise and anything gleaned from the residual is to be used for control of the model. That is Rasch’s method.
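The principle can be checked by brute force on a toy test. The sketch below assumes the dichotomous model, where the sufficient statistic for a person is the raw score, and four item difficulties that I made up. It enumerates every response pattern with a given raw score and computes the probability of each pattern given that score, once for a weak person and once for a strong one.

```python
import math
from itertools import product

def pattern_probability(pattern, ability, difficulties):
    """Probability of one complete 0/1 response pattern under the Rasch model."""
    probability = 1.0
    for x, d in zip(pattern, difficulties):
        p = 1.0 / (1.0 + math.exp(-(ability - d)))
        probability *= p if x == 1 else 1.0 - p
    return probability

def conditional_given_score(ability, difficulties, score):
    """P(pattern | raw score) for every pattern that has that raw score."""
    patterns = [p for p in product((0, 1), repeat=len(difficulties)) if sum(p) == score]
    joint = {p: pattern_probability(p, ability, difficulties) for p in patterns}
    total = sum(joint.values())
    return {p: value / total for p, value in joint.items()}

difficulties = [-1.0, 0.0, 0.5, 2.0]  # hypothetical item difficulties in logits

# The same raw score of 2, earned by a very weak and a very strong person.
weak = conditional_given_score(-2.0, difficulties, score=2)
strong = conditional_given_score(3.0, difficulties, score=2)

# Largest disagreement between the two conditional distributions.
largest_gap = max(abs(weak[p] - strong[p]) for p in weak)
```

The gap is zero up to rounding error: once the raw score is known, which items were passed says nothing more about ability. That is what frees the pattern of responses to be used for control of the model.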

It is unlikely that you will learn anything useful about Maximum Likelihood Estimation in these pages; the mathematics employed here are several steps down. The methods and derivations included are workable but you will need to look elsewhere for scholarly discussions and rigorous derivations of the most efficient and fashionable methods (e.g., Andrich; Fischer; Masters & Wright; Smith & Smith; Wilson). I rely less on calculus and proof and more on analogy and metaphor than is generally deemed proper in scholarly circles. While I will try to make the presentation non-mathematical, I will not try to make it simple.

There is little new here; the majority of the entries in the reference list are between 1960, when Rasch’s Probabilistic Models for Some Intelligence and Attainment Tests (Rasch, 1960) was published, and 1980, when it was republished shortly after Rasch’s death. There is as much here about rocks, darts, football, and oral reading as about multiple-choice items. I attempted, not at all successfully, to avoid mathematics, but those seeking rigorous explanations of estimation methods or fit statistics will need to look elsewhere (e.g., Smith & Smith, 2004; Fischer & Molenaar, 1995). This is not the manual for any Rasch computer package; it will not explain what WinSteps, RUMM, ConQuest, LPCM-WIN, or especially eRm.R is actually doing. For more hand-holding, try Wright and Stone (1979) or Bond and Fox (2007).

Finally, this is not a cookbook for applying a special case of IRT models, although we do embrace the notion that Rasch models are very special indeed. Over the last forty years, I have come to understand that:

  1. Rasch Measurement is such an extraordinarily special case of IRT that the general IRT literature says almost nothing that helps us understand or achieve measurement.
  2. The mathematical complexities are a distraction and often counter-productive; our resources are better spent elsewhere.
  3. Measurement doesn’t happen by graciously accepting whatever instrument the author proudly hands you; fitting a model that explains the data isn’t progress.
  4. You need to be extraordinarily lucky to have an instrument that meets Rasch’s requirements; the harder you work on the design, the luckier you will be.[4]

I very purposefully keep saying Rasch Measurement, rather than The Rasch Model. There are a number of mathematical expressions that qualify as Rasch models and the specific expression of the appropriate form for a given situation is incidental, probably self-evident, once we understand what we want to accomplish. Our goal is measurement, as the world[5] understands the word.

Rasch’s Theory of Relativity

Saying our goal is measurement blatantly begs the question, What is measurement? For the moment, our answer will be, Measurement is the process of quantifying an aspect of an object. This then begs the question, What is an aspect? An aspect is not a thing nor the essence of the thing but simply an interesting, at least to us at the moment, property of the thing.

Rasch’s principles define measurement; Rasch’s methods are the process.

Since we are trying to avoid the metaphysics, as well as the calculus, it doesn’t matter for our purpose whether there is some ideal, pure form for the aspect out there in hyperspace or the mind of God independent of any actual objects or, alternatively, whether the aspect exists only if objects having the aspect exist. Is there an abstract idea of, for example, “reading proficiency” or are there just students who vary in their capacities to decode written text? In either event, our problem is to imagine the consequences of that capacity for students and devise tactics to make the consequences manifest. We are trying to learn something about the status of a kid in a classroom, not to deduce the nature of a Platonic form.

There are, however, basic theoretical or philosophical issues that are much more interesting, much more challenging, and that come before the arithmetic I will eventually describe. Doing the arithmetic correctly may inform the discussion and analysis but it neither starts them nor ends them.

For much of the twentieth century, mental measurement and, by inheritance, the social sciences were hamstrung by superficial thinking and logical shortcuts, illustrated by assertions like “IQ is what IQ tests measure.” The items that make up such a test are an operational definition of some aspect and, I guess, you can call it IQ if you like but there is a valid validity question looming out there. It is far better to have a sound theoretical basis[6] of the aspect and a statement of observable consequences before we commit to a bunch of test items. If we don’t know where we are going, any items’ll get us there.[7]

The purpose of this apology is to nudge, by Rasch, mental measurement toward the same level of consensus and respect that physical measurement enjoys, or that it enjoyed before Einstein. This is Rasch’s theory of relativity: the path to understand what is real and invariant and to recognize what is convention and population-dependent. That doesn’t seem so unreasonable.

[1] Or often when confronted with data sets deliberately simulated to not match Rasch’s requirements.

[2] Paraphrasing Thurstone and Rasch.

[3] In Bancroft’s view, the analysis of variance is just the computing algorithm to do the arithmetic for extracting mean squares and components of variance from appropriately designed experiments. Rasch analysis, as differentiated from Rasch measurement and as implemented by any number of pieces of software past and present, is just the computing algorithm to do the arithmetic for producing measures from appropriately generated observations and establishing the domains for which they are applicable and appropriate, i.e., valid.

[4] Paraphrasing Thomas Jefferson and Branch Rickey.

[5] With the possible exception of that portion of the world populated by mainstream psychometricians doing educational research in the US.

[6] IQ may have been a poor choice to illustrate the point of a sound theoretical basis.

[7] Paraphrasing “If you don’t care where you are going, any road’ll get you there.” George Harrison, paraphrasing Lewis Carroll, from a conversation between Alice and the Cheshire Cat.