The Rasch Paradigm: Revolution or Normal Progression?

Much of the historical and philosophical analysis (e.g., Engelhard, Fisher) from the Rasch camp has followed the notion that Rasch’s principles and methods flow naturally and logically from the best measurement thinking (Thurstone, Binet, Guttman, Terman, et al.) of the early 20th century and beyond. From this very respectable and defensible perspective, Rasch’s contribution was a profound but normal progression from this earlier work, one that provided the tools to deal with the awkward measurement problems of the time, e.g., validity, reliability, and equating. Before Rasch, the consensus was that the only forms that could be equated were those that didn’t need equating.

When I reread Thomas Kuhn’s “The Structure of Scientific Revolutions,” I was led to the conclusion that Rasch’s contribution rises to the level of a revolution, not just a refinement of earlier thinking or an elaboration of previous work. It is truly a paradigm shift, although Kuhn didn’t particularly like that phrase (and it probably doesn’t appear in my 1969 edition of “Structure”). I don’t particularly like it either, because it doesn’t adequately differentiate between a “new paradigm” and a “tweaked paradigm”; in more of Kuhn’s words, a revolution gives us a new world, not just a new view of an old world.

To qualify as a Kuhnian revolution requires several things: the new paradigm, of course, must satisfactorily resolve the anomalies that accumulated under the old paradigm, anomalies sufficient to provoke a crisis in the field. It must be appealing enough to attract a community of adherents. To attract adherents, it must solve enough of the existing puzzles to be satisfying, and it must present some new ones to send the adherents in a new direction and give them something new to work on.

One of Kuhn’s important contributions was his description of “Normal Science,” which is what most scientists do most of the time. It can be the process of eliminating inconsistencies, either by tinkering with the theory or by disqualifying observations. It can be clarifying details or bringing more precision to the experimentation. It can be articulating implications of the theory, i.e., if that is so, then this must be. We get more experiments to do and other hypotheses to test.

Kuhn described this process as “Puzzle Solving,” with, I believe, no intent of being dismissive. These fall into the rough categories of tweaking the theory, designing better experiments, or building better instruments.

The term “paradigm” wasn’t coined by Kuhn but he certainly brought it to the fore. There has been a lot of discussion and criticism since of the varied and often casual ways he used the word but it seems to mean the accepted framework within which the community who accept the framework perform normal science. I don’t think that is as circular as it seems.

The paradigm defines the community and the community works on the puzzles that are “normal science” under the paradigm. The paradigm can be ‘local,’ existing as an example, or perhaps even an exemplar, of the framework. Or it can be ‘global’: the view that defines a community of researchers and the world view that holds that community together. This requires that it be attractive enough to divert adherents from competing paradigms and that it be open-ended enough to give them issues to work on or puzzles to solve.

If it’s not attractive, it won’t have adherents. The attraction has to be more than just being able to “explain” the data more precisely; then it would just be normal science with a better ruler. To truly be a new paradigm, it needs to involve a new view of the old problems. One might say, and some have, that after, say, Aristotle and Copernicus and Galileo and Newton and Einstein and Bohr and Darwin and Freud, etc., etc., we were in a new world.

Your paradigm won’t sell or attract adherents if it doesn’t give them things to research and publish. The requirement that the paradigm be open-ended is more than marketing. If it’s not open-ended, then it has all the answers, which makes it dogma or religion, not science.

Everything is fine until it isn’t. Eventually, an anomaly will present itself that can’t be explained away by tweaking the theory, censoring the data, or building a better microscope. Or perhaps anomalies, and the tweaks required to fit them in, become so cumbersome that the whole thing collapses of its own weight. When the anomalies become too obvious to dismiss, too significant to ignore, or too cumbersome to stand, the existing paradigm cracks, ‘normal science’ doesn’t help, and we are in a ‘crisis’.

Crisis

The psychometric new world may have turned with Lord’s seminal 1950 thesis. (Like most of us at a similar stage, Lord faced a crisis of his own: he needed a topic that would get him admitted into the community of scholars.) When he looked at a plot of item percent correct against total number correct (the item’s characteristic curve), he saw a normal ogive. That fit his plotted data pretty well, except in the tails. So he tweaked the lower end to “explain” too many right answers from low scorers. The mathematics of the normal ogive are, to say the least, cumbersome and, in 1950, computationally intractable. So that was pretty much that, for a while.

In the 1960s, the normal ogive morphed into the logistic; perhaps the idea came from following Rasch’s (1960) lead, perhaps from somewhere else, perhaps it was due to Birnbaum (1968); I’m not a historian and this isn’t a history lesson. The mathematics were a lot easier and computers were catching up. The logistic was winning out, but with occasional retreats to the normal ogive because it fit a little better in the tails.

US psychometricians saw the problem as data fitting and it wasn’t easy. There were often too many parameters to estimate without some clever footwork. But we’re clever and most of those computational obstacles have been overcome to the satisfaction of most. The nagging questions remaining are more epistemological than computational.

Can we know if our item discrimination estimates are truly indicators of item "quality" and not loadings on some unknown, extraneous factor(s)?

If the lower asymptote is what happens at minus infinity where we have no data and never want to have any, why do we even care?

If the lower asymptote is the probability of a correct response from an examinee with infinitely low ability, how can it be anything but 1/k, where k is the number of response choices?

How can the lower asymptote ever be higher than 1/k? (See Slumdog Millionaire, 2008)

If the lower asymptote is population-dependent, isn't the ability estimate dependent on the population we choose to assign the person to? Mightn't individuals vary in their propensity to respond to items they don't know?

Wouldn't any population-dependent estimate be wrong on the level of the individual?

If you ask the data for “information” beyond the sufficient statistics, not only are your estimates population-dependent, they are subject to whatever extraneous factors might separate high scores from low scores in that population. This means sacrificing validity in the name of reliability.

Rasch did not see his problem as data fitting. As an educator, he saw it directly: more able students do better on the set tasks than less able students. As an associate of Ronald Fisher (either the foremost statistician of the twentieth century who also made contributions to genetics or the foremost geneticist of the twentieth century who also made contributions to statistics), Rasch knew about logistic growth models and sufficient statistics. Anything left in the data, after reaping the information with the sufficient statistics, should be noise and should be used to control the model. The size of the residuals isn’t as interesting as the structure, or lack thereof.¹

Rasch Measurement Theory certainly has its community and the members certainly adhere and seem to find enough to do. Initially, Rasch found his results satisfying because they got him around the vexing problem of how to assess the effectiveness of remedial reading instruction when he didn’t have a common set of items or a common set of examinees over time. This led him to identify a class of models that define Specific Objectivity.

Rasch’s crisis (how to salvage a poorly thought-out experiment) hardly rises to the epic level of Galileo’s crisis with Aristotle, or Copernicus’ crisis with Ptolemy, or Einstein’s crisis with Newton. A larger view would say the crisis came about because the existing paradigms did not lead us to “measurement”, as most of science would define it.

In the words of William Thomson, Lord Kelvin:

When you can measure what you are speaking about and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind.

Revolution

Rasch’s solution did change the world for any adherents who were willing to accept his principles and follow his methods. They now knew how to ‘equate’ scores from disparate instruments, but beyond that, how to develop scales for measuring, define constructs to be measured, and do better science.

Rasch’s solution to his problem in the 1950s with remedial reading scores is still the exemplar and “local” definition of the paradigm. His generalization of that solution to an entire class of models and his exposition of “specific objectivity” are the “global” definition. (Rasch, 1960) 

There’s a problem with all this. I am trying to force fit Rasch’s contribution into Kuhn’s “Structure of Scientific Revolutions” paradigm when Rasch Measurement admittedly isn’t science. It’s mathematics, or statistics, or psychometrics; a tool, certainly a very useful tool, like Analysis of Variance or Large Hadron Colliders.

Measures are necessary precursors to science. Some of the weaknesses in pre-Rasch thinking about measurement are suggested in the following koans, offered in hope of enlightened measurement, not Zen enlightenment.

"Whatever exists at all exists in some amount. To know it thoroughly involves knowing its quantity as well as its quality." E. L. Thorndike

"Within the range of objects for which the measuring instrument is intended, its function must be independent of the object of measurement." L. L. Thurstone

"You never know a line is crooked unless you have a straight one to put next to it." Socrates

"Correlations are population-dependent, and therefore scientifically rather uninteresting." Georg Rasch

"We can act as though measured differences along the latent trait are distances on a river but whoever is piloting better watch for meanders and submerged rocks."

"We may be trying to measure size; perhaps height and weight would be better. Or perhaps, we are measuring 'weight', when we should go after 'mass'"

"The larger our island of knowledge, the longer our shoreline of wonder." Ralph W. Sockman

"The most exciting thing to hear in science ... is not 'Eureka' but 'That's funny.'" Isaac Asimov

¹ A subtitle for my own dissertation could be “Rasch’s errors are my data.”

The Five ‘S’s of Rasch Measurement

The mathematical, statistical, and philosophical faces of Rasch measurement are separability, sufficiency, and specific objectivity. ‘Separable’ because the person parameter and the item parameter interact in a simple way; Β/Δ in the exponential metric or β-δ in the log metric. ‘Sufficient’ because ‘nuisance’ parameters can be conditioned out so that, in most cases, the number of correct responses is the sufficient statistic for the person’s ability or the item’s difficulty. Specific Objectivity is Rasch’s term for ‘fundamental measurement’; what Wright called ‘sample-free item calibration’. It is objective because it does not depend on the specific sample of items or people; it is specific because it may not apply universally and the validity in any new application must be established.
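To make “conditioned out” concrete, here is a minimal sketch in Python (the function names and numbers are mine, purely illustrative): for two items answered by the same person, the probability that the first item is the one answered correctly, given that exactly one of the two was answered correctly, depends only on the two difficulties; the person parameter drops out.

```python
import math

def p(beta, delta):
    """Rasch probability of a correct response, beta and delta in logits."""
    return 1 / (1 + math.exp(-(beta - delta)))

def p_first_given_one_right(beta, d1, d2):
    """P(item 1 right and item 2 wrong | exactly one of the two is right)."""
    a = p(beta, d1) * (1 - p(beta, d2))
    b = (1 - p(beta, d1)) * p(beta, d2)
    return a / (a + b)

d1, d2 = -0.5, 1.0                     # two illustrative item difficulties
for beta in (-2.0, 0.0, 3.0):          # three very different abilities...
    print(round(p_first_given_one_right(beta, d1, d2), 4))
# ...one identical answer, 1 / (1 + exp(d1 - d2)): the person parameter has
# been conditioned out, which is what lets the simple counts be sufficient.
```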

I add two more ‘S’s to the trinity: simplicity and symmetry.

Simplicity

We have talked ad nauseam about simplicity. It is, in fact, one of my favorite themes. The chance that the person will answer the item correctly is Β / (Β + Δ), which is about as simple as life gets.[1] Or in less-than-elegant prose:

The likelihood that the person wins is the odds of the person winning
divided by the sum of the odds of the person winning and the odds of the item winning.

With such a simple model, the sufficient statistics are simple counts, and the estimators can be as simple as row averages. Rasch (1960) did many of his analyses graphically; Wright and Stone (1979) give algorithms for doing the arithmetic, somewhat laboriously, without the benefit of a computer. The first Rasch software at the University of Chicago (CALFIT and BICAL) ran on a ‘mini-computer’ that wouldn’t fit in your kitchen and had one millionth the capacity of your phone.
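As a minimal sketch of that simplicity (the toy data and names below are my own illustration, not CALFIT or BICAL), here is the model in both metrics and the counts that serve as the sufficient statistics:

```python
import numpy as np

def p_exponential(B, D):
    """Probability of a correct response in the exponential metric."""
    return B / (B + D)

def p_logistic(beta, delta):
    """The same model in the log metric: beta - delta is the log odds."""
    return 1 / (1 + np.exp(-(beta - delta)))

# The two forms agree when B = exp(beta) and D = exp(delta).
assert np.isclose(p_exponential(np.exp(0.7), np.exp(-0.2)), p_logistic(0.7, -0.2))

# A toy 5-person by 4-item response matrix (1 = right, 0 = wrong).
X = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 1, 1, 1],
              [0, 1, 0, 0]])

person_counts = X.sum(axis=1)   # row totals: everything the data say about ability
item_counts = X.sum(axis=0)     # column totals: everything the data say about difficulty
print(person_counts, item_counts)
```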

Symmetry

The first level of symmetry with Rasch models is that person ability and item difficulty have identical status. We can flip the roles of ability and difficulty in everything I have said in this post and every preceding one, or in everything Rasch or Gerhardt Fischer has ever written, and nothing changes. It makes just as much sense to say Δ / (Δ + Β) as Β / (Β + Δ). Granted we could be talking about anti-ability and anti-difficulty, but all the relationships are just the same as before. That’s almost too easy.

Just as trivially, we have noted, or at least implied, that we can flip, as suits our purposes, between the logistic and exponential expressions of the models without changing anything. In the exponential form, we are dealing with the odds that a person passes the standard item; in the logistic form, we have the log odds. If we observe one, we observe the other and the relationships among items and students are unchanged in any fundamental way. We are not limited to those two forms. Using base e is mathematically convenient, but we can choose any other base we like; 10, or 100, or 91 are often used in converting to ‘scale scores’. Any of these transformations preserves all the relationships because they all preserve the underlying interval scale and the relative positions of objects and agents on it.
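Here is a minimal sketch of such a rescaling (the base, unit, and origin below are arbitrary choices of mine, made only to illustrate the point): any strictly increasing linear map of the logit preserves every ordering and every interval.

```python
import math

def scale_score(logit, base=3.0, unit=100.0, origin=500.0):
    """Re-express a measure in logits so that `unit` points on the new
    scale correspond to odds of `base` to 1 (here: 100 points = 3:1)."""
    return origin + unit * logit / math.log(base)

measures = [-1.2, 0.0, 0.4, 2.0]                 # measures in logits
rescaled = [scale_score(m) for m in measures]    # friendlier numbers, same scale

# Order is preserved, and so are interval ratios, because the map is
# linear in the logit; nothing about the persons or items has changed.
assert all(a < b for a, b in zip(rescaled, rescaled[1:]))
print([round(s) for s in rescaled])
```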

That’s the trivial part.

Symmetry was a straightforward concept in mathematics: Homo sapiens, all vertebrates, and most other fauna have bilateral symmetry; a snowflake has sixfold; a sphere an infinite number. The more degrees of symmetry, the fewer parameters are required to describe the object. For a sphere, only one, the radius, is needed, and that’s as low as it goes.

Leave it to physicists to take an intuitive idea and make it into a topic for advanced graduate seminars[2]:

A symmetry of a physical system is a physical or mathematical feature of the system
(observed or intrinsic)
that is preserved or remains unchanged under some transformation. 

For every invariant (i.e., symmetry) in the universe, there is a conservation law.
Equally, for every conservation law in physics, there is an invariant.
(Noether’s Theorem, 1918)[3].

Right. I don’t understand enough of that to wander any deeper into space, time, or electromagnetism or to even know if this sentence makes any sense.

In Rasch’s world,[4] when specific objectivity holds, the ‘difficulty’ of an item is preserved whether we are talking about high ability students or low, fifth graders or sixth, males or females, North America or British Isles, Mexico or Puerto Rico, or any other selection of students that might be thrust upon us.

Rasch is not suggesting that the proportion answering the item correctly (aka, the p-value) never varies or that it doesn’t depend on the population tested. In fact, just the opposite, which is what makes p-values and the like “rather scientifically uninteresting”. Nor do we suggest that the likelihood that a third grader will correctly add two unlike fractions is the same as the likelihood for a ninth grader. What we are saying is that there is an aspect of the item that is preserved across any partitioning of the universe; that the fraction addition problem has its own intrinsic difficulty unrelated to any student.

“Preserved across any partitioning of the universe” is a very strong statement. We’re pretty sure that kindergarten students and graduate students in astrophysics aren’t equally appropriate for calibrating a set of math items. And frankly, we don’t much care. We start caring if we observe different difficulty estimates from fourth-grade boys or girls, or from Blacks, Whites, Asians, or Hispanics, or from different ability clusters, or in 2021 and 2022. The task is not to establish whether symmetry ever fails but when it holds.

I need to distinguish a little more carefully between the “latent trait” and our quantification of locations on the scale. An item has an inherent difficulty that puts it somewhere along the latent trait. That location is a property of the item and does not depend on any group of people that have been given, or that may ever be given the item. Nor does it matter if we choose to label it in yards or meters, Fahrenheit or Celsius, Wits or GRits. This property is what it is whether we use the item for a preschooler, junior high student, astrophysicist, or Supreme Court Justice. This we assume is invariant. Even IRTists understand this.

Although the latent trait may run the gamut, few items are appropriate for use in more than one of the groups I just listed. That would be like suggesting we can use the same thermometer to assess the status of a feverish preschooler that we use for the surface of the sun, although here we are pretty sure we are talking about the same latent ‘trait’. It is equally important to choose an appropriate sample for calibrating the items. A group of preschoolers could tell us very little about the difficulty of items appropriate for assessing math proficiency of astrophysicists.

Symmetry can break in our data for a couple of reasons. Perhaps there is no latent trait that extends all the way from recognizing basic shapes to constructing proofs with vector calculus. I am inclined to believe there is in this case, but that is theory and not my call. Or perhaps we did not appropriately match the objects and agents. Our estimates of locations on the trait should be invariant regardless of which objects and agents we are looking at. If there is an issue, we will want to know why: are difficulty and ability poorly matched? Is there a smart way to get the item wrong? Is there a not-smart way to get it right? Is the item defective? Is the person misbehaving? Or did the trait shift? Is there a singularity?

My physics is even weaker than my mathematics.

What most people call ‘Goodness of Fit’ and Rasch called ‘Control of the Model’, we are calling an exploration of the limits of symmetry. For me, I have a new buzzword, but the question remains, “Why do bright people sometimes miss easy items and non-bright people sometimes pass hard items?”[5] This isn’t astrophysics.

Here is my “item response theory”:

The Rasch Model is a main effects model; the sufficient statistics for ability and difficulty are the row and column totals of the item response matrix. Before we say anything important about the students or items, we need to verify that there are no interactions. This means no matter how we sort and block the rows, estimates of the column parameters are invariant (enough).

That’s me regressing to my classical statistical training to say that symmetry holds for these data.
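Here is a minimal sketch of that check with simulated data (the two groups, the sample sizes, and the crude column-proportion estimator are my illustrative choices, not a proper Rasch calibration): calibrate the items separately in two blocks of rows and see whether the column estimates tell the same story.

```python
import numpy as np

def crude_difficulty(X):
    """Centered log odds of an incorrect response, from column proportions.
    A rough stand-in for a proper Rasch calibration, good enough for a sketch."""
    p = X.mean(axis=0)
    d = np.log((1 - p) / p)
    return d - d.mean()

rng = np.random.default_rng(42)
delta = np.array([-1.0, -0.3, 0.2, 1.1])     # simulated item difficulties (logits)
beta_a = rng.normal(-0.5, 1, size=500)        # block A: lower average ability
beta_b = rng.normal(+0.5, 1, size=500)        # block B: higher average ability

def simulate(beta):
    # Rasch responses: P(correct) = exp(beta - delta) / (1 + exp(beta - delta))
    P = 1 / (1 + np.exp(-(beta[:, None] - delta[None, :])))
    return (rng.random(P.shape) < P).astype(int)

X_a, X_b = simulate(beta_a), simulate(beta_b)
print(np.round(crude_difficulty(X_a), 2))   # the two rows should agree
print(np.round(crude_difficulty(X_b), 2))   # up to sampling noise
# Systematic disagreement between the blocks would be an interaction --
# in the language above, broken symmetry.
```

A real analysis would use conditional or joint maximum likelihood and formal fit statistics, but the logic is the same: block the rows however you like and verify that the column estimates are invariant (enough).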


[1] It may look more familiar but less simple if we rewrite it as (Β/Δ) / (1 + Β/Δ), or even better e^(β−δ) / (1 + e^(β−δ)), but it’s all the same for any observer.

[2] Both of the following statements were lifted (plagiarized?) from a Wikipedia discussion of symmetry. I deserve no credit for the phrasing, nor do I seek it.

[3] Emmy Noether was a German mathematician whose contributions, among other things, changed the science of physics by relating symmetry and conservation. The first implication of her theorem was that it solved Hilbert and Einstein’s problem that General Relativity appeared to violate the conservation of energy. She was generally unpaid and dismissed, notably and emphatically not by Hilbert and Einstein, because she was a woman and a Jew. In that order.

When Göttingen University declined to give her a paid position, Hilbert responded, “Are we running a University or a bathing society?” In 1933, all Jews were forced out of academia in Germany; she spent the remainder of her career teaching young women at Bryn Mawr College and researching at the Institute for Advanced Study in Princeton (See Einstein, A.)

[4] We could flip this entire conversation and talk about the ‘ability’ of a person preserved across shifts of item difficulty, type, content, yada, yada, yada, and it would be equally true. But I repeat myself again.

[5] Except for the ‘boy meets girl . . .’ aspect, this question is the basic plot of “Slumdog Millionaire”, undoubtedly the greatest psychometric movie ever made. I wouldn’t, however, describe the protagonist as “non-bright”, which suggests there is something innate in whatever trait is operating and exposes some of the flaws in my use of the rather pejorative term. I should use something more along the lines of “poorly schooled” or “untrained”, placing effort above talent.

Useful and Almost Number-free Reports

If I had asked my customers what they wanted, they would have said a faster horse. Henry Ford

Percentile ranks on student reports are tests as contests.

Raw scores on student reports are like live chickens on dinner plates.

If I were to step on my bathroom scale and see a single number like, say, 165 lbs (or 75 kilos) with no further explanation, I wouldn’t need an interpretation guide or a course in psychometrics to know what the number means or to decide if I like it or not. Nor would I be writing to the manufacturer of the scale asking, “What’s a pound?” If I were to take a qualifying test to be a licensed dog walker and I received a postcard in the mail that said simply, “Your score is 509 GRits ± 41,” I would be a little frustrated and a lot annoyed. And I would need to ask some basic questions like, “What does that mean?” “What’s a GRit?” “Is the ‘41’ the standard error of measurement, or does it represent a confidence interval?” “If so, what level of confidence?” “What does 509 actually say about my proficiency to walk dogs?” And, of course, “Did I pass?”

If the answer to the last question is yes, then most candidates, possibly excluding the psychometricians, will quickly lose interest in the others. If the answer is no, then the tone becomes a little more strident and now includes questions like, “Who decides what passing is?” “What did I miss?” “How close was I?” and, if I was close, “Isn’t there almost a 50% chance that I actually passed?”[1] People who did pass never seem concerned about the other half of this question.

If a postcard with a Scale Score (even with some form of the standard error of measurement) isn’t useful or meaningful, what does a report need to be? Examinee reports vary depending on the audience and the purpose of the exam, but for a report going to a student, teacher, parent, or anyone else who might actually make use of the information for the betterment of the examinee, there would seem to be four basic components:

  1. Identification
  2. Measurement
  3. Control
  4. Interpretation

There needs to be enough identifying information to locate the examinee and to deliver the report to the right place. For dog walking candidates, the address on the front of the postcard did the trick. For education, it probably takes some combination of student name, teacher name, classroom/section/period, grade, school, and district. We should also mention the name of the test and the date taken. That is almost always more than enough to locate the right person; if you are still worried about it, add birth date or a parent’s name. Our original list should be adequate to locate the teacher and the teacher should know the student by name.

Measurement of the examinee to determine something about status or progress is the point of the exercise. This report section could be the simple “509 GRits” but it should also include some indication of our confidence in this measurement, which means the standard error of measurement in some guise. To make it concrete, in this example, the standard error of measurement is 41, with a 95% confidence interval of 509 ± 82, or 427 to 591. It is probably prudent to never use a phrase involving the word “error” when communicating with parents or school boards; they tend to interpret “error” as “mistake” and blame you. One often sees phrases like “probable range” to describe the interval between the measure plus and minus two standard errors (or some other arbitrary multiplier), which avoids saying ‘error’ and also ducks the squabble between the frequentists and the Bayesians about what confidence means. A picture may not be worth a thousand words in this case but here it is.

Dog walking scale (3)

I am inclined to leave it at that for the moment, but not everyone thinks a line with scale scores and a marker for the person’s location is all that informative. I am often overridden and add other (useful and relevant) information like a conclusion (e.g., pass/fail or performance level[2]) and sometimes even subtest measures to the measurement section. One could also say things like: a person at 509 has a 41% likelihood of testing below the Competent level next time and a 1e-6 likelihood of testing above Skilled. These are really steps toward control and interpretation, not measurement, so the purist in me wants to put them in the next two sections, although it’s not worth falling on my sword and what goes in which section is less rigid than I seem to be implying.
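Those two likelihoods are just arithmetic on the reported measure and its standard error. Here is a minimal sketch, assuming normally distributed measurement error and using the cut scores from this example; nothing below comes from an actual scoring system.

```python
from math import erf, sqrt

def prob_below(cut, measure, sem):
    """P(a remeasurement falls below `cut`), treating the reported SEM
    as the spread of normally distributed measurement error."""
    z = (cut - measure) / sem
    return 0.5 * (1 + erf(z / sqrt(2)))

measure, sem = 509, 41
print(round(prob_below(500, measure, sem), 2))    # ~0.41: below Competent (500)
print(1 - prob_below(700, measure, sem))          # ~1e-6: above Skilled (700)
```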

I am willing to give some meaning to the score by showing the ruler and some milestones along it. At this point, the scale score may be less meaningful than the milestones, but with experience, the scale score can become a useful shorthand for the milestones. It doesn’t take very much experience to understand what temperatures of 0°C and 37°C imply, even for US residents. This leads me to the less simple “map” below.

Dog walking scale (2)

Diagnosis With the Model

The vertical scale is GRits[3], which is our measure of dog walking proficiency and almost meaningless by itself; we wouldn’t lose much if we left the numbers off entirely[4]. The column of text labels is the substantive description of the scale. Topics at the low end, which are relatively easy, deal with type and use of basic equipment; topics at the high end, which are more difficult, deal with complex behaviors. The GRits bring precision; the text labels bring meaning.

The red vertical line has a tic mark for each possible raw score and a red diamond to mark the location of our hypothetical person. The red horizontal lines are the person’s location and plus/minus two standard errors. You can also add some normative information like means, standard deviations, frequency distributions, or quantiles, if you are into that sort of thing.

The gray horizontal lines mark the performance levels: 500 is Competent, 700 is Skilled, and 850 is Master. Labelling the lines rather than the regions between is not standard practice in educational assessment but it avoids the inconvenience of needing to label the region below Competent and the misinterpretation of the levels as actual developmental states or stages rather than simply more or less arbitrary criteria for addressing school accountability or dispensing certificates. So far we are just displaying the result, not interpreting it.

Control of the measurement model means either ensuring that we are warranted in treating the result as a valid measure, in the full sense of the word, as we just did, or diagnosing what the anomalies tell us about the examinee. This is again the dichotomy of “diagnosing with the model” and “diagnosing from the model.” Determining which of these paths to follow requires a bit more than simply computing ‘infit’ or ‘outfit’ and consulting the appropriate table of big numbers. It involves looking at individual items, splitting the items into clusters, and looking for things that are funny. Maps like the following can be more useful than any fit statistic and are almost number-free.

Dog Walking

Diagnosis From the Model

The display has the same measurement information as before and considerable detail about items and item clusters. First, the red vertical line still refers to the total test and still has a tic mark for each possible raw score and the red diamond for the person. It now has a black diamond for each item response; items to the left of the line are incorrect; those to the right are correct, with the distance from the line representing the probability against the response: the greater the distance, the more improbable the response. The dotted vertical lines (blue shading) are control lines and represent probabilities of 75%. We don’t need to be much concerned about anything in the blue. There are four or five items, two difficult items passed and two or three easy items missed, outside the control lines that might warrant investigation.

Most of the same information has been added for each of five item clusters. These are very short tests so the results may seem a little erratic but in all cases, one more item right or wrong would bring them close to the total test measure. If you are so inclined, the number correct score can be found by counting the tic marks[5] up to the red diamond. You can’t necessarily find it by counting the item plotting symbols to the right of the scales because they can represent multiple items. (And it is further confused because some items were not assigned to clusters.) Overall, this is a well-behaved person.
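Here is a minimal sketch of the flagging logic behind those control lines (the person measure, item names, and difficulties are hypothetical, chosen only to illustrate the idea): compute the probability against each observed response and flag anything beyond 75%.

```python
import math

def surprise(correct, beta, delta):
    """Probability against the observed response under the Rasch model:
    1 - P(observed response), with beta and delta in logits."""
    p = 1 / (1 + math.exp(-(beta - delta)))
    return (1 - p) if correct else p

# Hypothetical person and items, in logits, purely for illustration.
beta = 0.1
items = {"leash tension": (-1.0, 1),     # easy item, answered correctly
         "tugging": (2.2, 1),            # hard item, answered correctly
         "basic commands": (-1.8, 0)}    # easy item, missed

for name, (delta, correct) in items.items():
    s = surprise(correct, beta, delta)
    flag = "investigate" if s > 0.75 else "ok"   # the 75% control line
    print(f"{name:15s} probability against = {s:.2f}  {flag}")
```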

Because I often disparage anything short of computer-administered, fully adaptive tests (CAT), I need to point out a serious issue for me and model control: in the world of CAT, there are no surprises. If we do the CAT right, everything should be in the blue bands. This puts all the load for model control on the item clusters. In our example, we have behaved as though clusters were based on content, which is appropriate for reporting. For control, we would do more: difficulty, sequence, item type, item format, and item exposure or age are natural choices for forming clusters, but as we become more creative in developing items for computer administration, there could be others.

Interpretation of the measure means explaining what the measurement tells us about the status and progress of the examinee. Establishing ‘performance levels,’ say, Master, Skilled, or Competent dog walker, is a significant step from measurement to meaning, or from quantification to qualification. Announcing that the candidate is above, say, the ‘Competent’ performance level is a start. Diagnosis with the model would then talk about what items candidates at this level have mastery of, what items they have no clue about, and what items are at the candidate’s level. This is reporting what the candidate can do, what the candidate can’t do, and what the candidate should find challenging but possible. That suggests three obvious comments that any computer could readily generate as personalized feedback assuming a well-behaved response pattern and diagnosis with the model.

Personalizing a report takes more than generating text that restates the obvious and uses the candidate’s name in a complete sentence, like “Ron, your Dog Walking GRit is 509 ± 82, which means we think you are Competent to walk dogs”. When we have a computer generating the feedback, we should use any intelligence, artificial or otherwise, that is available. It is generally ok to start with the generic, “Ron, your total score is …” and “You did well on item clusters D and A, but were weak on cluster C,” and move on to things that are less obvious. I prefer to open with a positive, encouraging statement (clusters D and A), then mention problem areas (cluster C), and close with things to work on immediately (topics that haven’t been mastered but are close). Ideally, we would discuss the specifics of the surprising responses. This includes difficult items that were passed and easy items that were missed. This is moving into diagnosis from the model.

The more specifics, the better, even item statistics and foil analysis if anyone asks. But it would be much more valuable, and much more work for both the item developers and the systems analysts, to provide a discussion of the type of misunderstandings or errors implied by any incorrect responses. It is work for the item developers because they would need to understand and explain why every distractor is there and what selecting it means. It is work for the systems analysts because they need to keep track of and manage everything.

In today’s world, there is little reason to limit reporting to what can be squeezed onto an 8.5×11 sheet of paper or by concerns about the cost of color printing[6]. Paper copies are static, often cramped and overwhelming. Ideally, an electronic report, like an electronic test, will be interactive, dynamic, and engaging, with effective, targeted scaffolding. It should begin with the general overview and then allow the user to explore or be led through the interesting, important, and useful aspects of the responses, showing more and more detail as needed. Performance Level Descriptors and item clusters could be defined and displayed on request; item details could pop up when the plotting symbol is clicked.

This is not free; there will be resistance to giving items away because they are expensive and the item bank is sacred. Hopefully, we are moving away from once-a-year, high-stakes tests toward testing when it is helpful for the student, drawing from computer-generated and crowd-sourced item banks. And more importantly, toward immediate and informative feedback that might actually have some educational value.

 

[1] No, you didn’t pass, but if you test again with a parallel form, there is almost a 50% chance that you will.

[2] The little gray lines mark the performance levels (Competent, Skilled, and Master from left to right).

[3] Because GRits use three, rather than e, as their base, a difference of 100 GRits means 3 to 1 odds. Our hypothetical person has 3 to 1 odds of answering an item about leash tension but less than 1 to 3 odds for an item about tugging. More generally, a difference of 100k GRits means odds of 3^k to 1. That’s friendlier than base e and odds like 2.71828…^k to one.

[4] The report is (almost) number-free in the sense that the numbers are not needed to understand and use the results. It is not number-free in another sense because they are essential to provide the framework to create and arrange the display.

[5] This demonstration does not include tic marks for zero and perfect scores; in real life, you would probably have to account for them somehow. They tend to radically distort the scale without adding much, if any, information. They would extend the scale over 100 GRits in both directions and have probable ranges more than four times that in width.

[6] Someone will undoubtedly want a printer-friendly version to stick in the file cabinet because they have the file cabinet.

V. Control of Rasch’s Models: Beyond Sufficient Statistics

No single fit statistic is either necessary or sufficient. David Andrich

You won’t get famous by inventing the perfect fit statistic. Benjamin Wright[1]

That’s funny, or when the model reveals something we didn’t know

You say goodness of fit; Rasch said control. The important distinction in the words is that, for the measure, once you have extracted, through the sufficient statistics, all the information in the data relevant to measuring the aspect you are after, you shouldn’t care what or how much gets left in the trash. Whatever it is, it doesn’t contribute to the measurement … directly. It’s of no more than passing interest to us how well the estimated parameters reproduce the observed data, but very much our concern that we have all the relevant information and nothing but the relevant information for our task. Control, not goodness of fit, is the emphasis.

Rasch, very emphatically, did not mean that you run your data through some fashionable software package to calculate its estimates of parameters for a one-item-parameter IRT model and call it Rasch. Going beyond the sufficient statistics and parameter estimates to validate the model’s requirements is where the control is; that’s how one establishes Specific Objectivity. If it holds, then we have a pretty good idea what the residuals will look like. They are governed by the binomial variance pᵥᵢ(1 − pᵥᵢ) and they should be just noise, with no patterns related to person ability or item difficulty, nor to gender, format, culture, type, sequence, or any of the other factors we keep harping on (but not restricted to the ones that have occurred to me this morning) as potential threats. If the residuals do look like pᵥᵢ(1 − pᵥᵢ), then we are on reasonably solid ground for believing Specific Objectivity does obtain, but even that’s not good enough.
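Here is a minimal sketch of that kind of control with simulated data (the variable names and the simple correlation check are mine; Rasch’s graphical methods and the usual infit/outfit statistics are more refined): compute the standardized residuals (x − p)/√(p(1 − p)) and look for structure rather than size.

```python
import numpy as np

rng = np.random.default_rng(7)
beta = rng.normal(0, 1, size=300)       # simulated person measures (logits)
delta = np.linspace(-1.5, 1.5, 10)      # simulated item difficulties (logits)

P = 1 / (1 + np.exp(-(beta[:, None] - delta[None, :])))   # Rasch probabilities p_vi
X = (rng.random(P.shape) < P).astype(int)                  # simulated responses x_vi

Z = (X - P) / np.sqrt(P * (1 - P))      # standardized residuals; should be pure noise

# Control is not "how big are the residuals?" but "do they have structure?"
# For example, do they trend with ability? A noticeable correlation is a red flag.
for i, d in enumerate(delta):
    r = np.corrcoef(beta, Z[:, i])[0, 1]
    print(f"item {i} (difficulty {d:+.1f}): corr(residual, ability) = {r:+.2f}")
```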

It does not matter if there are other models out there that can “explain” a particular data set “better”, in the rather barren statistical sense of explain meaning they have smaller mean residual deviates. Rasch recognized that models can exist on three planes in increasing order of usefulness[2]:

  1. Models that explain the data,
  2. Models that predict the future, and
  3. Models that reveal something we didn’t know about the world.

Models that only try to maximize goodness of fit are stuck at the first level and are perfectly happy fitting something other than the aspect you want. This mind-set is better suited to trying to explain the stock market, weather, or Oscar winners and to generating statements like “The stock market goes up when hemlines go up.” Past performance does not ensure future performance. They try to go beyond the information in the sufficient statistics, using anything in the data that might be correlated and, to appropriate a comment by Rasch, correlation coefficients are population-dependent and therefore scientifically rather uninteresting.

Models that satisfy Rasch’s principle of Specific Objectivity have reached the second level and we can begin real science, possibly at the third level. Control of the models often points directly toward the third level, when the agents or objects didn’t interact the way we intended or anticipated[3]. “The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny.’” (Isaac Asimov.)

Continue reading . . . Model Control ala Choppin

[1] I chose to believe Ben’s comment reflected his attitude toward hypothesis testing, not his assessment of my prospects, although in that sense, it was prophetic.

[2] Paraphrasing E. D. Ford.

[3] “In the best designed experiments, the rats will do as they damn well please.” (Murphy’s Law of Experimental Psychology.)

Previous: Doing the Math                                Next: Model Control ala Panchakesan