Vb. Mean Squares; Outfit and Infit

A question can only be valid if the students’ minds are doing the things we want them to show us they can do. Alastair Pollitt

Able people should pass easy items; unable people should fail difficult ones. Everything else is up for grabs.

One can liken progress along a latent trait to navigating a river; we can treat it as a straight line but the pilot had best remember sandbars and meanders.

More about what could go wrong and how to find it

However one validates the items, with a plethora of sliced-and-diced matrices, between-group analyses based on gender, ethnicity, SES, age, instruction, etc., followed by enough editing, tweaking, revising, and discarding to ensure a perfectly functioning item bank and to placate any Technical Advisory Committee, there is no guarantee that the next kid to sit down in front of the computer won’t bring something completely unanticipated to the process. After the items have all been “validated,” we still must validate the measure for every new examinee.

The residual analysis that we are working our way toward is a natural approach to validating any item and any person. But we should know what we are looking for before we get lost in the swamps of arithmetic. First, we need to make sure that we haven’t done something stupid, like score the responses against the wrong key or post the results to the wrong record.

Checking the scoring for an examinee is no different from checking for miskeyed items, but with less data; either would show both surprising misses and surprising passes in the response string. Having gotten past that minefield, we can then check for differences by item type, content, and sequence, to name just the easy ones. Then, depending on what we discover, we proceed with doing the science either with the results of the measurement process or with the anomalies from it.
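The section heading promises mean squares, so it may help to sketch the standard person-fit computation before going further. This is a minimal illustration in Python using the textbook definitions of outfit (unweighted mean of squared standardized residuals) and infit (information-weighted mean square); the item difficulties, the 0-logit ability, and the response strings are invented for the example, and this is not any particular package’s implementation:

```python
import numpy as np

def rasch_prob(theta, b):
    """P(correct) for ability theta on an item of difficulty b (dichotomous Rasch)."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def person_fit(x, theta, b):
    """Outfit and infit mean squares for one response string x (0/1 per item)."""
    p = rasch_prob(theta, b)
    w = p * (1.0 - p)                       # binomial variance of each response
    z2 = (x - p) ** 2 / w                   # squared standardized residuals
    outfit = z2.mean()                      # unweighted: blown up by lucky passes, careless misses
    infit = ((x - p) ** 2).sum() / w.sum()  # weighted: responds to misfit near the person's level
    return outfit, infit

# Five items from easy to hard; both strings have the same raw score (3),
# but the "surprising" one misses the easiest item and passes a hard one.
b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
consistent = np.array([1, 1, 1, 0, 0])
surprising = np.array([0, 1, 1, 1, 0])

print(person_fit(consistent, 0.0, b))   # both mean squares below 1
print(person_fit(surprising, 0.0, b))   # outfit well above 1
```

The outfit statistic, being an unweighted mean, is dominated by the highly improbable responses; infit, by weighting each residual with its binomial variance, responds more to misfit on items near the person’s ability. The same arithmetic, summed down a column instead of across a row, gives the item versions.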

Continue reading . . . Model Control à la Panchapakesan


Previous: Model Control à la Choppin                        Next: Beyond Outfit and Infit

V. Control of Rasch’s Models: Beyond Sufficient Statistics

 No single fit statistic is either necessary or sufficient.  David Andrich

You won’t get famous by inventing the perfect fit statistic. Benjamin Wright[1]

That’s funny, or when the model reveals something we didn’t know

You say goodness of fit; Rasch said control. The important distinction in the words is that, for the measure, once you have extracted, through the sufficient statistics, all the information in the data relevant to measuring the aspect you are after, you shouldn’t care what or how much gets left in the trash. Whatever it is, it doesn’t contribute to the measurement … directly. It’s of no more than passing interest to us how well the estimated parameters reproduce the observed data, but very much our concern that we have all the relevant information and nothing but the relevant information for our task. Control, not goodness of fit, is the emphasis.

Rasch, very emphatically, did not mean that you run your data through some fashionable software package to calculate its estimates of parameters for a one-item-parameter IRT model and call it Rasch. Going beyond the sufficient statistics and parameter estimates to validate the model’s requirements is where the control is; that is how one establishes Specific Objectivity. If it holds, then we have a pretty good idea what the residuals will look like. They are governed by the binomial variance p_vi(1 - p_vi), and they should be just noise, with no patterns related to person ability or item difficulty, nor to gender, format, culture, type, sequence, or any of the other factors we keep harping on (but not restricted to the ones that have occurred to me this morning) as potential threats. If the residuals do look like p_vi(1 - p_vi), then we are on reasonably solid ground for believing Specific Objectivity does obtain, but even that’s not good enough.
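The behavior described above is easy to check by simulation. A minimal sketch, assuming only the dichotomous Rasch model (the sample sizes, random seed, and difficulty range are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(0.0, 1.0, size=2000)          # person abilities (logits)
b = np.linspace(-2.0, 2.0, 20)                   # item difficulties (logits)
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b)))  # Rasch probabilities p_vi
x = (rng.random(p.shape) < p).astype(float)      # simulated responses

resid = x - p                                    # raw residuals
z = resid / np.sqrt(p * (1.0 - p))               # standardized residuals

# If the model holds, the residual variance matches the binomial term p(1 - p) ...
print(resid.var(), (p * (1.0 - p)).mean())
# ... and person-level mean residuals show no trend with ability.
print(np.corrcoef(theta, z.mean(axis=1))[0, 1])
```

With data simulated to conform, the two variance figures agree and the correlation with ability is negligible; with real data, any structure that survives in z, whether it tracks ability, difficulty, gender, format, or sequence, is exactly the kind of anomaly the control step is hunting.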

It does not matter if there are other models out there that can “explain” a particular data set “better,” in the rather barren statistical sense of explain: that they have smaller mean residual deviates. Rasch recognized that models can exist on three planes, in increasing order of usefulness[2]:

  1. Models that explain the data,
  2. Models that predict the future, and
  3. Models that reveal something we didn’t know about the world.

Models that only try to maximize goodness of fit are stuck at the first level and are perfectly happy fitting something other than the aspect you want. This mind-set is better suited to trying to explain the stock market, weather, or Oscar winners, and to generating statements like “The stock market goes up when hemlines go up.” Past performance does not ensure future performance. Such models try to go beyond the information in the sufficient statistics, using anything in the data that might have been correlated and, to appropriate a comment by Rasch, correlation coefficients are population dependent and therefore scientifically rather uninteresting.

Models that satisfy Rasch’s principle of Specific Objectivity have reached the second level and we can begin real science, possibly at the third level. Control of the models often points directly toward the third level, when the agents or objects didn’t interact the way we intended or anticipated[3]. “The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny.’” (Isaac Asimov.)

Continue reading . . . Model Control à la Choppin

[1] I chose to believe Ben’s comment reflected his attitude toward hypothesis testing, not his assessment of my prospects, although in that sense, it was prophetic.

[2] Paraphrasing E. D. Ford.

[3] “In the best designed experiments, the rats will do as they damn well please.” (Murphy’s Law of Experimental Psychology.)

Previous: Doing the Math                                Next: Model Control à la Panchapakesan

IIIf. Another Aspect, Reading Aloud

Truth emerges more readily from error than from confusion. Bacon

There is no such thing as measurement absolute; there is only measurement relative. Jeanette Winterson

The Case of the Missing Person Parameters

Eliminating nuisance parameters and #SpecificObjectivity

It was a cold and snowy night when, while trying to make a living as a famous statistical consultant, Rasch was summoned to the isolated laboratory of a renowned reading specialist to analyze data related to the effect of extra instruction for poor readers. There may be better ways to make a statistician feel a valued and respected member of the team than to ask for an analysis of data collected years earlier, but Rasch took it on (Rasch, 1977, p. 63).

If we could measure, in the strictest sense, reading proficiency, measurements could be made before the intervention, after the intervention, and perhaps at several points along the way. Then the analysis is no different, in principle, than if we were investigating the optimal blend of feed for finishing hogs or the concentration of platinum for reforming petroleum.

Continue reading . . . IIIf. Reading Aloud

Previous: Specter of Math                                                    Return to Start

IIIe: A Spectrum of Math Proficiency and the Specter of Word Problems

In mathematics you don’t understand things. You just get used to them. John von Neumann

Defining mile posts along the way from counting your toes to doing calculus

The world has divided itself into two factions: those who think they don’t understand math and those who think they do. But we’re not talking about proving Fermat’s Last Theorem or correcting Stephen Hawking’s tensor algebra; we’re talking about counting, applying the four basic operators, and solving the dreaded word problems using basic algebra, geometry, and perhaps a little calculus. That just about covers the range from counting your toes to determining the spot in the outfield where a player should stand to catch a fly ball, and it should be good enough to get you through freshman math.

Continue reading . . . A Spectrum of Math Proficiency

Previous: Any given Sunday                                    Next: Before Science

IIIb. The Aspect of Color

Roy G. Biv: How many bands in your rainbow?

Art is the imposing of a pattern on experience. Alfred North Whitehead

Performance bands are arbitrary but useful

Qualitative meaning and quantitative precision

We all know our basic colors before we start school. We learn early on that there are three primary colors (red, yellow, and blue), from which all others can be created, although designers of color printers apparently missed that lesson. The ancients saw five colors (red, yellow, green, blue, violet) in the rainbow. Newton saw seven, adding orange and indigo (perhaps to align with the natural harmony of the universe found in the number of musical notes, days of the week, and known planets; or perhaps he was just buying some vowels).

Continue reading . . . The Aspect of Color

[While my analogy comparing bands of the rainbow to performance levels may be cute, I probably don’t have the physiology right. While light is a continuous spectrum, our perception of discrete bands may be real, depending on the distribution of cones in our eyes. Was it more important for our ancestors to discern white animals against a white background or to distinguish ripe fruit from poisonous reptiles?]

Previous: IIIa. Abstracting Some Aspects        Next: IIIc. Hot and Cold

III. Abstracting Some Aspects

Measure what is measurable, and make measurable what is not so. Galileo Galilei

Measuring rocks and the significance of being sufficient

The process of measurement begins long before any data are collected. The starting point is a notion, or even better, a theory about an aspect of a class of things we want to understand better, maybe even do some science on. Successful measurement depends on clear thinking about the aspect and clever ideas for the agents. This is much more challenging and much more rewarding than any mathematical gymnastics that might be performed to fit model to data.

All analogies are limited but some are useful. Considering aspects of things far removed from cognitive traits may help avoid some of the pitfalls encountered when working too close to home. Hardness is a property of materials that is hard to define but we all know what it is when it hits us. Color is a narrow region of a continuous spectrum that non-physicists tend to think about as discrete categories. Temperature is an intimate part of our daily lives, which we are quite adept at sensing and more recently at measuring, but the closely connected idea, heat, may actually be more real, less bound to conventions and populations. If I could scale the proficiency of professional football teams and reliably predict the outcomes of games, I wouldn’t be writing this.

Continue reading . . . Hard Headedness: the importance of being sufficient


Previous: II: Measurement in Science         Next: IIIb: The Aspect of Color

II. Measurement in Science

Measurement is the breaking up of a quantum of energy into equal units. George Herbert Mead

What does it mean “to measure” and #RaschMeasurement as a foreign language

If, in a discussion about buying a new table, your spouse were to say to you, “I measured the width of the room and …” you would not expect the conversation to degenerate immediately into a discussion about what is width, or what does measured mean, or who made your yardstick, or what units you used. But if, in a discussion with the school guidance counselor, you are told, “I measured the intelligence of your child and …” you could, and probably should, ask those same questions, although they probably won’t be any more warmly received in the guidance office than they were in the dining room.

Continue reading . . . Measurement in Science

Previous: I. Rasch’s Theory of Relativity               Next: IIIa: Abstracting Some Aspects

I. Rasch’s Theory of Relativity

The reasonable man strives to adapt himself to the world; the unreasonable man persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man. George Bernard Shaw, Man and Superman: Maxims for Revolutionists

Population dependent and convention versus Rasch’s #SpecificObjectivity

While trying to understand what Rasch meant when he referred (as he often did) to tests “built by my methods and conforming to my principles,” I began this philosophical apology, which was intended to help bring sanity and civility to our approach to the practical problems of assessing status, monitoring growth, evaluating effectiveness, and, in short, doing science in the social sciences.

If you have been introduced to Rasch Measurement as a special case of the more mathematically exotic item response theory (IRT), then you and Rasch have not been properly introduced, and you are probably under the misimpressions that:

  • All the mathematical gymnastics associated with IRT are necessary.
  • Rasch is what you do if you don’t have the resources to do it right.

The point of this treatise is to disabuse you of those misunderstandings. If I had graduate students, this is what I would tell them to inoculate them against the standard introduction-to-measurement course. We will reserve the term IRT for the models that are not Rasch models, because that term seems to describe the intent of those models, with their focus on fitting the data. The older term latent trait theory fits better with the Rasch perspective, with its focus on the underlying aspect to be measured.

Our concern is the efficacy of Rasch measurement, how it works under controlled conditions, which can hardly be controversial. When the data conform to Rasch’s principles, i.e., when they are based on agents that are equally valid and reliable and not subject to interference from extraneous attributes of the objects, the models have the power to encompass and extend the best of Thurstone and Guttman. Guttman created a non-stochastic Rasch model, with very sufficient statistics; Thurstone defined “fundamental measurement,” which foreshadowed Rasch’s “specific objectivity.” This leads to measurement, as the layman understands the word, and sets the stage for the more vital tasks of making and analyzing measures.

Most of the mainstream debate surrounding Rasch measurement has focused on effectiveness, how the models function when confronted with real responses from real people to real tests, questionnaires, surveys, checklists, and other instruments, some put together with little or no thought for their suitability for measurement[1]. The conclusion that Rasch models are robust, i.e., do pretty well in this real world, should not be taken as justification to continue doing what we’ve been doing.

There are two commonly cited motivations for using Rasch’s models, the more popular being that they are extraordinarily easy to apply, compared to IRT models. Useful results can come from relatively small samples, and the estimation algorithms converge readily unless the data are pathologically bizarre. In this very data-driven world, “Rasch analysis” (or the verb form, to rasch) seems to mean running data through Rasch calibration software. This requires minimal intellectual commitment and by itself doesn’t accomplish what Rasch set out to accomplish. Rasch’s more compelling motivation takes more effort.

Rasch’s Motivation

While trying to solve practical problems in statistics and in educational and psychological testing, Georg Rasch came upon a special class of models, which led him to a general philosophy of measurement. Rasch defined measurement: if it’s not Rasch measurement, it’s not measurement! Georg was a very unreasonable man.

The phrase “Rasch Measurement” is redundant; I use it to avoid ambiguity. For its adherents, Rasch measurement is axiomatic: self-evident because this is the way it must be or it’s not measurement:

The calibration of the agents must be independent of the objects used and the measurement of the objects must be independent of the agents used, over a useful range.[2]

This is not an unchecked assertion, but a rational criterion by which one can evaluate data under the bright light of a theory. From the model follow consequences. If the observed data are consistent with the anticipated consequences, we have met the conditions of Thurstone’s fundamental measurement and can treat the results with the same respect we have for measures of, say, mass, heat, distance, or duration, which, like reading fluency or art appreciation, are not real things but aspects of things.

I come by my biases naturally. Professionally, I am a grandson, on my statistics side, and great grandson, on my measurement side, of Sir Ronald Fisher. My view of statistics was shaped by people who worked with Fisher. I was grounded in statistics at Iowa State University in a department founded by George Snedecor, with the focus heavily on the design of experiments and the analysis of variance, which I learned from the likes of T.A. Bancroft, Oscar Kempthorne, David Jowett, Herbert David, and David Huntsberger, some of whom had known, worked with, and undoubtedly argued with Fisher (if you knew Fisher, Snedecor, or Kempthorne) on a regular basis.

My view of measurement was shaped by Georg Rasch, who worked with Fisher in England, taking away from that experience Fisher’s concepts of sufficient statistics and maximum likelihood estimation (MLE). The existence of sufficient statistics is the sine qua non of Rasch measurement. I learned Rasch measurement at Chicago (in the Department of Education founded by John Dewey) sitting at the feet of Benjamin Wright, Rasch’s most active and vocal North American disciple and evangelist. Rasch visited the University of Chicago while I was a student there, although I was too green to benefit much from that visitation.

There are strong parallels and much overlap between my two universes. Rasch measurement is to item response theory (IRT) as design of experiments is to general linear models (GLM). GLM is what you do if you can’t, or won’t, get the design right; IRT is what you do if you can’t, or won’t, get the instrument right. Both cases necessitate mathematical gymnastics that can substitute for clear thinking and mask poor planning. GLM and IRT rely on fitting models to “explain” data in a statistical sense, a venerable activity in some statistical traditions and very much in vogue in today’s Big Data world. But it’s not my tradition and it’s not measurement.

The point of experimental design is to produce observations that permit unambiguous inferences to answer specific, carefully stated questions: questions like, what level of catalyst is optimal for reforming crude oil or which feed ration is best for finishing hogs? We would really like the answer to be independent of the specific breed, gender, age, growing conditions, and intended use of the pig, but that isn’t going to happen. More likely, the answer will include a description of the specific domain to which it applies.

The point of Rasch measurement is to produce measures that unambiguously quantify a specific aspect of the object of interest; measures that are independent of any other attribute of either the object (e.g., a person) or our agents (e.g., items). We would like the agents to be universally applicable, but more likely they will be valid for small neighborhoods in the universe, which must be described.

Rasch’s Principle and Method

Design of experiments and Rasch measurement rely fundamentally on sufficient statistics to make inferences. Sufficient statistics are the constant, overarching theme. They are what make analysis of variance[3], as described by Fisher and Snedecor, which implies more than just simply partitioning the sum of squares, work; they are what make Rasch measurement, which implies more than just running data through an appropriately named piece of software, measurement. Once you have harvested the information in the sufficient statistic, you know everything the data have to tell you about the factor that you are testing or the aspect that you are measuring. That is Rasch’s principle.

Anything left in the data should be noise and anything gleaned from the residual is to be used for control of the model. That is Rasch’s method.
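Rasch’s principle can be made concrete for the dichotomous model, where the raw score is the sufficient statistic for ability. A minimal sketch (the item difficulties and response patterns are invented for the example; the log-likelihood is the standard one for the dichotomous Rasch model):

```python
import numpy as np

def rasch_loglik(x, theta, b):
    """Log-likelihood of response pattern x under the dichotomous Rasch model."""
    logit = theta - b
    return float(np.sum(x * logit - np.log1p(np.exp(logit))))

b = np.array([-1.5, -0.5, 0.5, 1.5])
x1 = np.array([1, 1, 0, 0])   # two different patterns ...
x2 = np.array([0, 0, 1, 1])   # ... with the same raw score (2)

# The difference in log-likelihood between equal-score patterns is the same
# at every theta, so any inference about theta uses the patterns identically:
# the raw score has harvested all the information about the measure.
diffs = [rasch_loglik(x1, t, b) - rasch_loglik(x2, t, b) for t in (-2.0, 0.0, 2.0)]
print(np.allclose(diffs, diffs[0]))  # True
```

Because the theta-dependent part of the log-likelihood depends on the responses only through the raw score, the pattern itself carries no further information about the measure; what the pattern does carry is material for control.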

It is unlikely that you will learn anything useful about Maximum Likelihood Estimation in these pages; the mathematics employed here are several steps down. The methods and derivations included are workable but you will need to look elsewhere for scholarly discussions and rigorous derivations of the most efficient and fashionable methods (e.g., Andrich; Fischer; Masters & Wright; Smith & Smith; Wilson). I rely less on calculus and proof and more on analogy and metaphor than is generally deemed proper in scholarly circles. While I will try to make the presentation non-mathematical, I will not try to make it simple.

There is little new here; the majority of the entries in the reference list are between 1960, when Rasch’s Probabilistic Models for Some Intelligence and Attainment Tests (Rasch, 1960) was published, and 1980, when it was republished shortly after Rasch’s death. There is as much here about rocks, darts, football, and oral reading as about multiple-choice items. I attempted, not at all successfully, to avoid mathematics, but those seeking rigorous explanations of estimation methods or fit statistics will need to look elsewhere (e.g., Smith & Smith, 2004; Fischer & Molenaar, 1995). This is not the manual for any Rasch computer package; it will not explain what WinSteps, RUMM, ConQuest, LPCM-WIN, or especially eRm.R is actually doing. For more hand-holding, try Wright and Stone (1979) or Bond and Fox (2007).

Finally, this is not a cookbook for applying a special case of IRT models, although we do embrace the notion that Rasch models are very special indeed. Over the last forty years, I have come to understand that:

  1. Rasch Measurement is such an extraordinarily special case of IRT that the general IRT literature says almost nothing that helps us understand or achieve measurement.
  2. The mathematical complexities are a distraction and often counter-productive; our resources are better spent elsewhere.
  3. Measurement doesn’t happen by graciously accepting whatever instrument the author proudly hands you; fitting a model that explains the data isn’t progress.
  4. You need to be extraordinarily lucky to have an instrument that meets Rasch’s requirements; the harder you work on the design, the luckier you will be.[4]

I very purposefully keep saying Rasch Measurement, rather than The Rasch Model. There are a number of mathematical expressions that qualify as Rasch models and the specific expression of the appropriate form for a given situation is incidental, probably self-evident, once we understand what we want to accomplish. Our goal is measurement, as the world[5] understands the word.

Rasch’s Theory of Relativity

Saying our goal is measurement blatantly begs the question, What is measurement? For the moment, our answer will be, Measurement is the process of quantifying an aspect of an object. This then begs the question, What is an aspect? An aspect is neither a thing nor the essence of the thing but simply an interesting, at least to us at the moment, property of the thing.

Rasch’s principles define measurement; Rasch’s methods are the process.

To avoid the metaphysics, as well as the calculus: it doesn’t matter for our purpose if there is some ideal, pure form for the aspect out there in hyperspace or the mind of God, independent of any actual objects, or, alternatively, if the aspect exists only if objects having the aspect exist. Is there an abstract idea of, for example, “reading proficiency,” or are there just students who vary in their capacities to decode written text? In either event, our problem is to imagine the consequences of that capacity for students and devise tactics to make the consequences manifest. We are trying to learn something about the status of a kid in a classroom, not to deduce the nature of a Socratic form.

There are, however, basic theoretical or philosophical issues that are much more interesting, much more challenging, and that come before the arithmetic I will eventually describe. Doing the arithmetic correctly may inform the discussion and analysis but it neither starts them nor ends them.

For much of the twentieth century, mental measurement and, by inheritance, the social sciences were hamstrung by superficial thinking and logical shortcuts, illustrated by assertions like “IQ is what IQ tests measure.” The items that make up such a test are an operational definition of some aspect and, I guess, you can call it IQ if you like but there is a valid validity question looming out there. It is far better to have a sound theoretical basis[6] of the aspect and a statement of observable consequences before we commit to a bunch of test items. If we don’t know where we are going, any items’ll get us there.[7]

The purpose of this apology is to nudge, by Rasch, mental measurement toward the same level of consensus and respect that physical measurement enjoys, or that it enjoyed before Einstein. This is Rasch’s theory of relativity: the path to understand what is real and invariant and to recognize what is convention and population-dependent. That doesn’t seem so unreasonable.

Next: II. Measurement in Science

[1] Or often when confronted with data sets deliberately simulated to not match Rasch’s requirements.

[2] Paraphrasing Thurstone and Rasch.

[3] In Bancroft’s view, the analysis of variance is just the computing algorithm to do the arithmetic for extracting mean squares and components of variance from appropriately designed experiments. Rasch analysis, as differentiated from Rasch measurement and as implemented by any number of pieces of software past and present, is just the computing algorithm to do the arithmetic for producing measures from appropriately generated observations and establishing the domains for which they are applicable and appropriate, i.e., valid.

[4] Paraphrasing Thomas Jefferson and Branch Rickey.

[5] With the possible exception of that portion of the world populated by mainstream psychometricians doing educational research in the US.

[6] IQ may have been a poor choice to illustrate the point of a sound theoretical basis.

[7] Paraphrasing “If you don’t care where you are going, any road’ll get you there.” George Harrison, paraphrasing Lewis Carroll, from a conversation between Alice and the Cheshire Cat.