The Rasch Paradigm: Revolution or Normal Progression?

Much of the historical and philosophical analysis (e.g., Engelhard, Fisher) from the Rasch camp has followed the notion that Rasch’s principles and methods flow naturally and logically from the best measurement thinking (Thurstone, Binet, Guttman, Terman, et al.) of the early 20th century and beyond. From this very respectable and defensible perspective, Rasch’s contribution was a profound but normal progression that built on this earlier work and provided the tools to deal with the awkward measurement problems of the time, e.g., validity, reliability, equating. Before Rasch, the consensus was that the only forms that could be equated were those that didn’t need equating.

When I reread Thomas Kuhn’s “The Structure of Scientific Revolutions,” I was led to the conclusion that Rasch’s contribution rises to the level of a revolution, not just a refinement of earlier thinking or an elaboration of previous work. It is truly a paradigm shift, although Kuhn didn’t particularly like that phrase (it probably doesn’t appear in my 1969 edition of “Structure”). I don’t particularly like it either, because it doesn’t adequately differentiate between a “new paradigm” and a “tweaked paradigm”; in more of Kuhn’s words, a new world, not just a new view of an old world.

To qualify as a Kuhnian revolution requires several things. There must be the new paradigm, of course, which needs to satisfactorily resolve the anomalies that accumulated under the old paradigm and were sufficient to provoke a crisis in the field. It must be appealing enough to attract a community of adherents. To attract adherents, it must solve enough of the existing puzzles to be satisfying, and it must present some new ones to send the adherents in a new direction and give them something new to work on.

One of Kuhn’s important contributions was his description of “Normal Science,” which is what most scientists do most of the time. It can be the process of eliminating inconsistencies, either by tinkering with the theory or by disqualifying observations. It can be clarifying details or bringing more precision to the experimentation. It can be articulating implications of the theory, i.e., if that is so, then this must be. We get more experiments to do and more hypotheses to prove.

Kuhn described this process as “Puzzle Solving,” with, I believe, no intent of being dismissive. These fall into the rough categories of tweaking the theory, designing better experiments, or building better instruments.

The term “paradigm” wasn’t coined by Kuhn, but he certainly brought it to the fore. There has been a lot of discussion and criticism since of the varied and often casual ways he used the word, but it seems to mean the accepted framework within which the community that accepts the framework performs normal science. I don’t think that is as circular as it seems.

The paradigm defines the community, and the community works on the puzzles that are “normal science” under the paradigm. The paradigm can be ‘local,’ existing as an example, or perhaps even an exemplar, of the framework. Or it can be ‘global’: the view that defines a community of researchers and the world view that holds that community together. This requires that it be attractive enough to divert adherents from competing paradigms and open-ended enough to give them issues to work on or puzzles to solve.

If it’s not attractive, it won’t have adherents. The attraction has to be more than just the ability to “explain” the data more precisely; that would just be normal science with a better ruler. To truly be a new paradigm, it needs to involve a new view of the old problems. One might say, and some have, that after, say, Aristotle and Copernicus and Galileo and Newton and Einstein and Bohr and Darwin and Freud, etc., etc., we were in a new world.

Your paradigm won’t sell or attract adherents if it doesn’t give them things to research and publish. The requirement that the paradigm be open-ended is more than marketing. If it’s not open-ended, then it has all the answers, which makes it dogma or religion, not science.

Everything is fine until it isn’t. Eventually, an anomaly will present itself that can’t be explained away by tweaking the theory, censoring the data, or building a better microscope. Or perhaps the anomalies, and the tweaks required to fit them in, become so cumbersome that the whole thing collapses under its own weight. When the anomalies become too obvious to dismiss, too significant to ignore, or too cumbersome to stand, the existing paradigm cracks, ‘normal science’ doesn’t help, and we are in a ‘crisis.’

Crisis

The psychometric new world may have turned with Lord’s seminal 1950 thesis. (Like most of us at a similar stage, Lord’s crisis was that he needed a topic that would get him admitted into the community of scholars.) When he looked at a plot of item percent correct against total number correct (the item’s characteristic curve), he saw a normal ogive. That fit his plotted data pretty well, except in the tails. So he tweaked the lower end to “explain” too many right answers from low scorers. The mathematics of the normal ogive is, to say the least, cumbersome and, in 1950, computationally intractable. So that was pretty much that, for a while.

In the 1960s, the normal ogive morphed into the logistic; perhaps the idea came from following Rasch’s (1960) lead, perhaps from somewhere else, perhaps due to Birnbaum (1968); I’m not a historian and this isn’t a history lesson. The mathematics was a lot easier and computers were catching up. The logistic was winning out, but with occasional retreats to the normal ogive because it fit a little better in the tails.

US psychometricians saw the problem as data fitting and it wasn’t easy. There were often too many parameters to estimate without some clever footwork. But we’re clever and most of those computational obstacles have been overcome to the satisfaction of most. The nagging questions remaining are more epistemological than computational.

Can we know if our item discrimination estimates are truly indicators of item "quality" and not loadings on some unknown, extraneous factor(s)?

If the lower asymptote is what happens at minus infinity where we have no data and never want to have any, why do we even care?

If the lower asymptote is the probability of a correct response from an examinee with infinitely low ability, how can it be anything but 1/k, where k is the number of response choices?

How can the lower asymptote ever be higher than 1/k? (See Slumdog Millionaire, 2008)

If the lower asymptote is population-dependent, isn't the ability estimate dependent on the population we choose to assign the person to? Mightn't individuals vary in their propensity to respond to items they don't know?

Wouldn't any population-dependent estimate be wrong on the level of the individual?

If you ask the data for “information” beyond the sufficient statistics, not only are your estimates population-dependent, they are subject to whatever extraneous factors might separate high scores from low scores in that population. This means sacrificing validity in the name of reliability.

Rasch did not see his problem as data fitting. As an educator, he saw it directly: more able students do better on the set tasks than less able students. As an associate of Ronald Fisher (either the foremost statistician of the twentieth century who also made contributions to genetics or the foremost geneticist of the twentieth century who also made contributions to statistics), Rasch knew about logistic growth models and sufficient statistics. Anything left in the data, after reaping the information with the sufficient statistics, should be noise and should be used to control the model. The size of the residuals isn’t as interesting as the structure, or lack thereof.¹

Rasch Measurement Theory certainly has its community, and the members certainly adhere and seem to find enough to do. Initially, Rasch found his results satisfying because they got him around the vexing problem of how to assess the effectiveness of remedial reading instruction when he didn’t have a common set of items or a common set of examinees over time. This led him to identify a class of models that define Specific Objectivity.

Rasch’s crisis (how to salvage a poorly thought-out experiment) hardly rises to the epic level of Galileo’s crisis with Aristotle, or Copernicus’ crisis with Ptolemy, or Einstein’s crisis with Newton. A larger view would say the crisis came about because the existing paradigms did not lead us to “measurement”, as most of science would define it.

In the words of William Thomson, Lord Kelvin:

When you can measure what you are speaking about and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind.

Revolution

Rasch’s solution did change the world for any adherents who were willing to accept his principles and follow his methods. They now knew how to ‘equate’ scores from disparate instruments, but beyond that, how to develop scales for measuring, define constructs to be measured, and do better science.

Rasch’s solution to his problem in the 1950s with remedial reading scores is still the exemplar and “local” definition of the paradigm. His generalization of that solution to an entire class of models and his exposition of “specific objectivity” are the “global” definition. (Rasch, 1960) 

There’s a problem with all this. I am trying to force fit Rasch’s contribution into Kuhn’s “Structure of Scientific Revolutions” paradigm when Rasch Measurement admittedly isn’t science. It’s mathematics, or statistics, or psychometrics; a tool, certainly a very useful tool, like Analysis of Variance or Large Hadron Colliders.

Measures are necessary precursors to science. Some of the weaknesses in pre-Rasch thinking about measurement are suggested in the following koans, offered in hope of enlightened measurement, not Zen enlightenment.

"Whatever exists at all exists in some amount. To know it thoroughly involves knowing its quantity as well as its quality." E. L. Thorndike

"Within the range of objects for which the measuring instrument is intended, its function must be independent of the object of measurement." L. L. Thurstone

"You never know a line is crooked unless you have a straight one to put next to it." Socrates

"Correlations are population-dependent, and therefore scientifically rather uninteresting." Georg Rasch

"We can act as though measured differences along the latent trait are distances on a river but whoever is piloting better watch for meanders and submerged rocks."

"We may be trying to measure size; perhaps height and weight would be better. Or perhaps, we are measuring 'weight', when we should go after 'mass'"

"The larger our island of knowledge, the longer our shoreline of wonder." Ralph W. Sockman

"The most exciting thing to hear in science ... is not 'Eureka' but 'That's funny.'" Isaac Asimov

¹ A subtitle for my own dissertation could be “Rasch’s errors are my data.”

The Five ‘S’s of Rasch Measurement

The mathematical, statistical, and philosophical faces of Rasch measurement are separability, sufficiency, and specific objectivity. ‘Separable’ because the person parameter and the item parameter interact in a simple way: Β/Δ in the exponential metric or β − δ in the log metric. ‘Sufficient’ because ‘nuisance’ parameters can be conditioned out so that, in most cases, the number of correct responses is the sufficient statistic for the person’s ability or the item’s difficulty. Specific Objectivity is Rasch’s term for ‘fundamental measurement’, what Wright called ‘sample-free item calibration’. It is objective because it does not depend on the specific sample of items or people; it is specific because it may not apply universally, and its validity in any new application must be established.
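To put the notation in one place, here is the dichotomous model in both metrics (my rendering, with Β = e^β and Δ = e^δ; nothing here beyond the model itself):

```latex
\[
P(x = 1) \;=\; \frac{\mathrm{B}}{\mathrm{B} + \Delta}
         \;=\; \frac{e^{\beta - \delta}}{1 + e^{\beta - \delta}},
\qquad
\log\frac{P(x = 1)}{P(x = 0)} \;=\; \beta - \delta .
\]
```

Separability is right on the surface: the person and the item enter only through the ratio Β/Δ, or equivalently the difference β − δ.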

I add two more ‘S’s to the trinity: simplicity and symmetry.

Simplicity

We have talked ad nauseam about simplicity; it is, in fact, one of my favorite themes. The chance that the person will answer the item correctly is Β / (Β + Δ), which is about as simple as life gets.1 Or in less-than-elegant prose:

The likelihood that the person wins is the odds of the person winning
divided by the sum of the odds of the person winning and the odds of the item winning.

With such a simple model, the sufficient statistics are simple counts, and the estimators can be as simple as row averages. Rasch (1960) did many of his analyses graphically; Wright and Stone (1979) give algorithms for doing the arithmetic, somewhat laboriously, without the benefit of a computer. The first Rasch software at the University of Chicago (CALFIT and BICAL) ran on a ‘mini-computer’ that wouldn’t fit in your kitchen and had one millionth the capacity of your phone.
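For the record, the arithmetic really is that simple. Here is a minimal sketch in Python (the matrix is invented, and the calibration is only a crude first pass, roughly the opening move of something like PROX without the spread correction; Rasch did it graphically and Wright and Stone by hand):

```python
import numpy as np

# Toy scored matrix: rows = persons, columns = items, 1 = correct.
# (Hypothetical data, just to show the arithmetic.)
X = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1],
])

# The sufficient statistics are bare counts: row and column totals.
person_scores = X.sum(axis=1)   # everything the model needs about each person
item_scores = X.sum(axis=0)     # everything the model needs about each item

# Crude first-pass item calibration: centered log odds of an incorrect
# response. Higher delta = harder item.
p = item_scores / X.shape[0]    # classical p-values
delta = np.log((1 - p) / p)     # log metric
delta -= delta.mean()           # anchor the scale at mean item difficulty 0

print("person scores:", person_scores)
print("item difficulties:", np.round(delta, 2))
```

Nothing beyond counting and a logarithm; the point of the exercise is that no clever footwork is needed to get serviceable estimates.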

Symmetry

The first level of symmetry with Rasch models is that person ability and item difficulty have identical status. We can flip the roles of ability and difficulty in everything I have said in this post and every preceding one, or in everything Rasch or Gerhard Fischer has ever written, and nothing changes. It makes just as much sense to say Δ / (Δ + Β) as Β / (Β + Δ). Granted, we could be talking about anti-ability and anti-difficulty, but all the relationships are just the same as before. That’s almost too easy.

Just as trivially, we have noted, or at least implied, that we can flip, as suits our purposes, between the logistic and exponential expressions of the models without changing anything. In the exponential form, we are dealing with the odds that a person passes the standard item; in the logistic form, we have the log odds. If we observe one, we observe the other and the relationships among items and students are unchanged in any fundamental way. We are not limited to those two forms. Using base e is mathematically convenient, but we can choose any other base we like; 10, or 100, or 91 are often used in converting to ‘scale scores’. Any of these transformations preserves all the relationships because they all preserve the underlying interval scale and the relative positions of objects and agents on it.
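The harmlessness of the base change can be shown in one line: switching from base e to base b simply divides every logit by a constant,

```latex
\[
\log_b \frac{P}{1 - P} \;=\; \frac{\ln\!\bigl(P/(1 - P)\bigr)}{\ln b}
                       \;=\; \frac{\beta - \delta}{\ln b},
\]
```

so base 91, say, rescales the whole line by 1/ln 91 ≈ 0.22, a linear transformation that cannot disturb order or intervals.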

That’s the trivial part.

Symmetry was a straightforward concept in mathematics: Homo sapiens, all vertebrates, and most other fauna have bilateral symmetry; a snowflake has sixfold; a sphere, an infinite number. The more degrees of symmetry, the fewer parameters are required to describe the object. For a sphere, only one, the radius, is needed, and that’s as low as it goes.

Leave it to physicists to take an intuitive idea and make it into a topic for advanced graduate seminars2:

A symmetry of a physical system is a physical or mathematical feature of the system
(observed or intrinsic)
that is preserved or remains unchanged under some transformation. 

For every invariant (i.e., symmetry) in the universe, there is a conservation law.
Equally, for every conservation law in physics, there is an invariant.
(Noether’s Theorem, 1918)3.

Right. I don’t understand enough of that to wander any deeper into space, time, or electromagnetism or to even know if this sentence makes any sense.

In Rasch’s world,4 when specific objectivity holds, the ‘difficulty’ of an item is preserved whether we are talking about high ability students or low, fifth graders or sixth, males or females, North America or British Isles, Mexico or Puerto Rico, or any other selection of students that might be thrust upon us.

Rasch is not suggesting that the proportion answering the item correctly (aka, p-value) never varies or that it doesn’t depend on the population tested. In fact, just the opposite, which is what makes p-values and the like “rather scientifically uninteresting.” Nor do we suggest that the likelihood that a third grader will correctly add two unlike fractions is the same as the likelihood for a ninth grader. What we are saying is that there is an aspect of the item that is preserved across any partitioning of the universe; that the fraction addition problem has its own intrinsic difficulty unrelated to any student.

“Preserved across any partitioning of the universe” is a very strong statement. We’re pretty sure that kindergarten students and graduate students in Astrophysics aren’t equally appropriate for calibrating a set of math items. And frankly, we don’t much care. We start caring if we observe different difficulty estimates from fourth-grade boys or girls, or from Blacks, Whites, Asians, or Hispanics, or from different ability clusters, or in 2021 and 2022. The task is not to establish whether symmetry ever fails but when it holds.

I need to distinguish a little more carefully between the “latent trait” and our quantification of locations on the scale. An item has an inherent difficulty that puts it somewhere along the latent trait. That location is a property of the item and does not depend on any group of people that have been given, or that may ever be given the item. Nor does it matter if we choose to label it in yards or meters, Fahrenheit or Celsius, Wits or GRits. This property is what it is whether we use the item for a preschooler, junior high student, astrophysicist, or Supreme Court Justice. This we assume is invariant. Even IRTists understand this.

Although the latent trait may run the gamut, few items are appropriate for use in more than one of the groups I just listed. That would be like suggesting we can use the same thermometer to assess the status of a feverish preschooler that we use for the surface of the sun, although here we are pretty sure we are talking about the same latent ‘trait’. It is equally important to choose an appropriate sample for calibrating the items. A group of preschoolers could tell us very little about the difficulty of items appropriate for assessing math proficiency of astrophysicists.

Symmetry can break in our data for a couple of reasons. Perhaps there is no latent trait that extends all the way from recognizing basic shapes to constructing proofs with vector calculus. I am inclined to believe there is in this case, but that is theory and not my call. Or perhaps we did not appropriately match the objects and agents. Our estimates of locations on the trait should be invariant regardless of which objects and agents we are looking at. If there is an issue, we will want to know why: are difficulty and ability poorly matched? Is there a smart way to get the item wrong? Is there a not-smart way to get it right? Is the item defective? Is the person misbehaving? Or did the trait shift? Is there a singularity?

My physics is even weaker than my mathematics.

What most people call ‘Goodness of Fit’ and Rasch called ‘Control of the Model’, we are calling an exploration of the limits of symmetry. For me, I have a new buzz word, but the question remains, “Why do bright people sometimes miss easy items and non-bright people sometimes pass hard items?”5 This isn’t astrophysics.

Here is my “item response theory”:

The Rasch Model is a main effects model; the sufficient statistics for ability and difficulty are the row and column totals of the item response matrix. Before we say anything important about the students or items, we need to verify that there are no interactions. This means no matter how we sort and block the rows, estimates of the column parameters are invariant (enough).

That’s me regressing to my classical statistical training to say that symmetry holds for these data.
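To make that concrete, here is a sketch of what “no interactions” means operationally (assuming a scored 0/1 matrix and reusing the crude calibration above; a proper control of the model would use conditional estimation and formal fit statistics, and this is neither):

```python
import numpy as np

def item_logits(X):
    """Centered log-odds item difficulties from a 0/1 scored matrix."""
    p = X.mean(axis=0)
    # In practice, items answered all right or all wrong within a block
    # must be dropped or handled specially; this sketch assumes none are.
    d = np.log((1 - p) / p)
    return d - d.mean()

def check_invariance(X, groups):
    """Calibrate the columns separately within each blocking of the rows.

    `groups` holds one label per person (row): gender, grade, score band,
    or any other partitioning thrust upon us. If symmetry holds, the
    difficulty vectors agree across blocks up to estimation error,
    i.e., no row-by-column interaction.
    """
    return {g: item_logits(X[groups == g]) for g in np.unique(groups)}

# Hypothetical use, given a matrix X and a person covariate:
# for g, d in check_invariance(X, np.array(person_grade)).items():
#     print(g, np.round(d, 2))
```

Large, structured disagreements between the blocks are the anomalies worth chasing; small, patternless ones are the noise the model promised us.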


[1] It may look more familiar but less simple if we rewrite it as (Β/Δ) / (1 + Β/Δ) or, better yet, e^(β−δ) / (1 + e^(β−δ)), but it’s all the same for any observer.

[2] Both of the following statements were lifted (plagiarized?) from a Wikipedia discussion of symmetry. I deserve no credit for the phrasing, nor do I seek it.

[3] Emmy Noether was a German mathematician whose contributions, among other things, changed the science of physics by relating symmetry and conservation. The first implication of her theorem was that it solved Hilbert and Einstein’s problem that General Relativity appeared to violate the conservation of energy. She was generally unpaid and dismissed, notably and emphatically not by Hilbert and Einstein, because she was a woman and a Jew. In that order.

When Göttingen University declined to give her a paid position, Hilbert responded, “Are we running a University or a bathing society?” In 1933, all Jews were forced out of academia in Germany; she spent the remainder of her career teaching young women at Bryn Mawr College and researching at the Institute for Advanced Study in Princeton (See Einstein, A.)

[4] We could flip this entire conversation and talk about the ‘ability’ of a person preserved across shifts of item difficulty, type, content, yada, yada, yada, and it would be equally true. But I repeat myself again.

[5] Except for the ‘boy meets girl . . .’ aspect, this question is the basic plot of “Slumdog Millionaire,” undoubtedly the greatest psychometric movie ever made. I wouldn’t, however, describe the protagonist as “non-bright,” which suggests there is something innate in whatever trait is operating and exposes some of the flaws in my use of the rather pejorative term. I should use something more along the lines of “poorly schooled” or “untrained,” placing effort above talent.

Latent Trait Analysis or Item Response ‘Theory’?

In the 1960s, we had Latent Trait Analysis (Birnbaum, A., in Lord & Novick, 1968); in the 1970s, Darrell Bock lobbied for a new label, ‘Item Response Theory’ (IRT), as more descriptive of what was actually going on. He was right about conventional psychometric practice in the US; he was wrong for condoning it as the reasonable thing to do. I am uncertain in what sense he used the word ‘theory’. We don’t seem to be talking about “a well-confirmed explanation of nature, in a way consistent with scientific method, and fulfilling the criteria required by modern science.” Maybe a theory of measurement would be better than a theory of item responses.

The ‘item response matrix’ is a means to an end. It is nothing more or less than rows and columns of ones and zeros recording which students answered which items correctly and which incorrectly.[1] It is certainly useful in the classroom for grading students and culling items; however, it is population dependent, and therefore scientifically rather uninteresting, like correlations and factor analysis. My interest in it is to harvest whatever the matrix can tell me about some underlying aspect of students.

The focus of IRT is to estimate parameters that “best explain” the ones and zeros. “Explain” is used here in the rather sterile statistical sense of ‘minimize the residual errors’, which has long been a respectable activity in statistical practice. Once IRTists are done with their model estimation, the ‘tests of fit’ are typically likelihood ratio tests, which are relevant to the question, “Have we included enough parameters and built a big enough model?” and less relevant to “Why didn’t that item work?” or “Why did these students miss those items?”

‘Latent Trait Analysis’ is a more descriptive term for the harvesting of relevant information when our focus is on the aspect of the student that caused us to start this entire venture. Once all the information is extracted from the data, we will have located the students on an interval scale for a well-defined trait (the grading part). Anything left over is useful and essential for ‘controlling the model’, i.e., selecting, revising, or culling items and diagnosing students. Parameter estimation is only the first step toward understanding what happened: “explaining!” in the sense of ‘illuminating the underlying mechanism’.

“Grist” is grain separated from the chaff. The ones and zeros are grist for the measurement mill but several steps from sliced bread. The item response matrix is data carefully elicited from appropriate students using appropriate items to expose, isolate, and abstract the specific trait we are after. In addition to the chaff of extraneous factors, we need to deal with the nuisance parameters and insect parts as well.

Wright’s oft misunderstood phrases “person-free item calibration” and “item-free person measurement” (Wright, B. D., 1968) notwithstanding, it is still necessary to subject students to items before we make any inferences about the underlying trait. The problem is to make inferences that are “sample-freed.” The key, the sine qua non, and the contribution of Rasch’s that makes this possible was a class of models with separable parameters, leading to sufficient statistics and ending at Specific Objectivity.

Describing Rasch latent trait models as a special case of 3PL IRT is Algebra 101 but fails to recognize the power and significance of sufficient statistics and the generality of his theory of measurement. Rasch’s Specific Objectivity is the culmination of the quest for Thurstone’s holy grail of ‘fundamental measurement’.
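The Algebra 101 in question, in the usual 3PL notation (a for discrimination, b for difficulty, c for the lower asymptote):

```latex
\[
P(x = 1 \mid \theta)
  \;=\; c + (1 - c)\,\frac{e^{a(\theta - b)}}{1 + e^{a(\theta - b)}}
\;\xrightarrow{\;a = 1,\; c = 0\;}\;
\frac{e^{\theta - b}}{1 + e^{\theta - b}}.
\]
```

The nesting is trivial; what the “special case” framing loses is that only with the discriminations equal does the raw score remain sufficient, and only with sufficiency does specific objectivity follow.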


[1] For the moment, we will stick with the dichotomous case and talk in terms of educational assessment because it’s what I know, although it doesn’t change the argument to think of polytomous responses in non-educational applications.