Measurement

A question can only be valid if the students’ minds are doing the things we want them to show us they can do. Alastair Pollitt

Able people should pass easy items; unable people should fail difficult ones. Everything else is up for grabs.

One can liken progress along a latent trait to navigating a river; we can treat it as a straight line but the pilot had best remember sandbars and meanders.

More about what could go wrong and how to find it

However one validates the items, with a plethora of sliced and diced matrices, between group analyses based on gender, ethnicity, ses, age, instruction, etc., followed by enough editing, tweaking, revising, and discarding to ensure a perfectly functioning item bank and to placate any Technical Advisory Committee, there is no guarantee that the next kid to sit down in front of the computer won’t bring something completely unanticipated to the process. After the items have all been “validated,” we still must validate the measure for every new examinee.

The residual analysis that we are working our way toward is a natural approach to validating any item and any person. But we should know what we are looking for before we get lost in the swamps of arithmetic. First, we need to make sure that we haven’t done something stupid, like score the responses against the wrong key or post the results to the wrong record.

Checking the scoring for an examinee is no different than checking for miskeyed items but with less data; either would have both surprising misses and surprising passes in the response string. Having gotten past that mine field, we can then check for differences by item type, content, sequence to just note the easy ones. Then depending on what we discover, we proceed with doing the science either with the results of the measurement process or with the anomalies from the measurement process.

Continue . . .Model Control ala Panchapekesan

Previous: Model Control ala Choppin Next: Beyond Outfit and Infit

The reasonable man strives to adapt himself to the world; the unreasonable man persists in trying to adapt the world to himself. Therefore, all progress depends on the unreasonable man. —George Bernard Shaw, Man and Superman: Maxims to a Revolutionist

Population dependent and convention versus Rasch’s #SpecificObjectivity

While trying to understand what Rasch meant when he referred (as he often did) to tests “built by my methods and conforming to my principles,” I began this philosophical apology, which was intended to help bring sanity and civility to our approach to the practical problems of assessing status, monitoring growth, evaluating effectiveness, and, in short, doing science in the social sciences.

If you have been introduced to the Rasch Measurement as a special case of the more mathematically exotic item response theory (IRT) , then you and Rasch have not been properly introduced, and you are probably under the misimpressions that:

All the mathematical gymnastics associated with IRT are necessary.
Rasch is what you do if you don’t have the resources to do it right.

The point of this treatise is to disabuse you of those misunderstandings. If I had graduate students, this is what I would tell them to inoculate them against the standard introduction-to-measurement course. We will reserve the term IRT for the models that are not Rasch models, because that term seems to describe the intent of those models, with their focus on fitting the data. The older term latent trait theory fits better with the Rasch perspective, with its focus on the underlying aspect to be measured.

Our concern is the efficacy of Rasch measurement, how it works under controlled conditions, which can hardly be controversial. When the data conform to Rasch’s principles, i.e., the data are based on agents that are equally valid and reliable; not subject to interference from extraneous attributes of the objects, the models have the power to encompass and extend the best of Thurstone and Guttman. Guttman created a non-stochastic Rasch model, with very sufficient statistics; Thurstone defined “fundamental measurement”, which foreshadowed Rasch’s “specific objectivity.” This leads to measurement, as the layman understands the word, and sets the stage for the more vital tasks of making and analyzing measures.

Most of the mainstream debate surrounding Rasch measurement has focused on effectiveness, how the models function when confronted with real responses from real people to real tests, questionnaires, surveys, checklists, and other instruments, some put together with little or no thought for their suitability for measurement[1]. The conclusion that Rasch models are robust, i.e., do pretty well in this real world, should not be taken as justification to continue doing what we’ve been doing.

There are two commonly cited motivations for using Rasch’s models: the most popular being they are extraordinarily easy to apply, compared to IRT models. Useful results can come from relatively small samples and the estimation algorithms converge readily unless the data are pathologically bizarre. In this very data-driven world, “Rasch analysis” (or the verb form, to rasch) seems to mean running data through Rasch calibration software. This requires minimal intellectual commitment and by itself doesn’t accomplish what Rasch set out to accomplish. Rasch’s more compelling motivation takes more effort.

Rasch’s Motivation

While trying to solve practical problems in statistics and in educational and psychological testing, Georg Rasch came upon a special class of models, which led him to a general philosophy of measurement. Rasch defined measurement: if it’s not Rasch measurement, it’s not measurement! Georg was a very unreasonable man.

The phrase “Rasch Measurement” is redundant; I use it to avoid ambiguity. For its adherents, Rasch measurement is axiomatic: self-evident because this is the way it must be or it’s not measurement:

The calibration of the agents must be independent of the objects used and the measurement of the objects must be independent of the agents used, over a useful range.[2]

This is not an unchecked assertion, but a rational criterion by which one can evaluate data under the bright light of a theory. From the model follow consequences. If the observed data are consistent with the anticipated consequences, we have met the conditions of Thurstone’s fundamental measurement and can treat the results with the same respect we have for measures of, say, mass, heat, distance, or duration, which, like reading fluency or art appreciation, are not real things but aspects of things.

I come by my biases naturally. Professionally, I am a grandson, on my statistics side, and great grandson, on my measurement side, of Sir Ronald Fisher. My view of statistics was shaped by people who worked with Fisher. I was grounded in statistics at Iowa State University in a department founded by George Snedecor; the focus heavily on the design of experiments and the analysis of variance, which I learned from the likes of T.A. Bancroft, Oscar Kempthorne, David Jowett, Herbert David, and David Huntsberger, some of whom had known, worked, and undoubtedly argued (if you knew Fisher, Snedecor or Kempthorne) with Fisher on a regular basis.

My view of measurement was shaped by Georg Rasch, who worked with Fisher in England, taking away from that experience Fisher’s concepts of sufficient statistics and maximum likelihood estimation (MLE). The existence of sufficient statistics is the sine qua non of Rasch measurement. I learned Rasch measurement at Chicago (in the Department of Education founded by John Dewey) sitting at the feet of Benjamin Wright, Rasch’s most active and vocal North American disciple and evangelist. Rasch visited the University of Chicago while I was a student there, although I was too green to benefit much from that visitation.

There are strong parallels and much overlap between my two universes. Rasch measurement is to item response theory (IRT) as design of experiments is to general linear models (GLM). GLM is what you do if you can’t, or won’t, get the design right; IRT is what you do if you can’t, or won’t, get the instrument right. Both cases necessitate mathematical gymnastics that can substitute for clear thinking and mask poor planning. GLM and IRT rely on fitting models to “explain” data in a statistical sense, a venerable activity in some statistical traditions and very much in vogue in today’s Big Data world. But it’s not my tradition and it’s not measurement.

The point of experimental design is to produce observations that permit unambiguous inferences to answer specific, carefully stated questions: questions like, what level of catalyst is optimal for reforming crude oil or which feed ration is best for finishing hogs? We would really like the answer to be independent of the specific breed, gender, age, growing conditions, and intended use of the pig, but that isn’t going to happen. More likely, the answer will include a description of the specific domain to which it applies.

The point of Rasch measurement is to produce measures that unambiguously quantify a specific aspect of the object of interest; measures that are independent of any other attribute of either the object (e.g., a person) or our agents (e.g., items.) We would like the agents to be universally applicable, but more likely they will be valid for small neighborhoods in the universe, which must be described.

Rasch’s Principle and Method

Design of experiments and Rasch measurement rely fundamentally on sufficient statistics to make inferences. Sufficient statistics are the constant, overarching theme. They are what make analysis of variance[3], as described by Fisher and Snedecor, which implies more than just simply partitioning the sum of squares, work; they are what make Rasch measurement, which implies more than just running data through an appropriately named piece of software, measurement. Once you have harvested the information in the sufficient statistic, you know everything the data have to tell you about the factor that you are testing or the aspect that you are measuring. That is Rasch’s principle.

Anything left in the data should be noise and anything gleaned from the residual is to be used for control of the model. That is Rasch’s method.

It is unlikely that you will learn anything useful about Maximum Likelihood Estimation in these pages; the mathematics employed here are several steps down. The methods and derivations included are workable but you will need to look elsewhere for scholarly discussions and rigorous derivations of the most efficient and fashionable methods (e.g., Andrich; Fischer; Masters & Wright; Smith & Smith; Wilson). I rely less on calculus and proof and more on analogy and metaphor than is generally deemed proper in scholarly circles. While I will try to make the presentation non-mathematical, I will not try to make it simple.

There is little new here; the majority of the entries in the reference list are between 1960, when Rasch’s Probabilistic Models for Some Intelligence and Attainment Tests (Rasch, 1960) was published, and 1980, when it was republished shortly after Rasch’s death. There is as much here about rocks, darts, football, and oral reading as about multiple-choice items. I attempted, not at all successfully, to avoid mathematics, but those seeking rigorous explanations of estimation methods or fit statistics will need to look elsewhere (e.g., Smith & Smith, 2004; Fischer & Molenaar, 1995). This is not the manual for any Rasch computer package; it will not explain what WinSteps, RUMM, ConQuest, LPCM-WIN, or especially eRm.R is actually doing. For more hand-holding, try Wright and Stone (1979) or Bond and Fox (2007.)

Finally, this is not a cookbook for applying a special case of IRT models, although we do embrace the notion that Rasch models are very special indeed. Over the last forty years, I have come to understand that:

Rasch Measurement is such an extraordinarily special case of IRT that the general IRT literature says almost nothing that helps us understand or achieve measurement.
The mathematical complexities are a distraction and often counter-productive; our resources are better spent elsewhere.
Measurement doesn’t happen by graciously accepting whatever instrument the author proudly hands you; fitting a model that explains the data isn’t progress.
You need to be extraordinarily lucky to have an instrument that meets Rasch’s requirements; the harder you work on the design, the luckier you will be.[4]

I very purposefully keep saying Rasch Measurement, rather than The Rasch Model. There are a number of mathematical expressions that qualify as Rasch models and the specific expression of the appropriate form for a given situation is incidental, probably self-evident, once we understand what we want to accomplish. Our goal is measurement, as the world[5] understands the word.

Rasch’s Theory of Relativity

Saying our goal is measurement blatantly begs the question, What is measurement? For the moment, our answer will be, Measurement is the process of quantifying an aspect of an object. This then begs the question, What is an aspect? An aspect is not a thing nor the essence of the thing but simply an interesting, at least to us at the moment, property of the thing.

Rasch’s principles define measurement; Rasch’s methods are the process.

Trying to avoid the Metaphysics, as well as the calculus, it doesn’t matter for our purpose if there is some ideal, pure form for the aspect out there in hyperspace or the mind of God independent of any actual objects or, alternatively, if the aspect exists only if objects having the aspect exist. Is there an abstract idea of, for example, “reading proficiency” or are there just students who vary in their capacities to decode written text? In either event, our problem is to imagine the consequences of that capacity for students and devise tactics to make the consequences manifest. We are trying to learn something about the status of a kid in a classroom, not to deduce the nature of a Socratic form.

There are, however, basic theoretical or philosophical issues that are much more interesting, much more challenging, and that come before the arithmetic I will eventually describe. Doing the arithmetic correctly may inform the discussion and analysis but it neither starts them nor ends them.

For much of the twentieth century, mental measurement and, by inheritance, the social sciences were hamstrung by superficial thinking and logical shortcuts, illustrated by assertions like “IQ is what IQ tests measure.” The items that make up such a test are an operational definition of some aspect and, I guess, you can call it IQ if you like but there is a valid validity question looming out there. It is far better to have a sound theoretical basis[6] of the aspect and a statement of observable consequences before we commit to a bunch of test items. If we don’t know where we are going, any items’ll get us there.[7]

The purpose of this apology is to nudge, by Rasch, mental measurement toward the same level of consensus and respect that physical measurement enjoys, or that it enjoyed before Einstein. This is Rasch’s theory of relativity: the path to understand what is real and invariant and to recognize what is convention and population-dependent. That doesn’t seem so unreasonable.

Next: II. Measurement in Science

[1] Or often when confronted with data sets deliberately simulated to not match Rasch’s requirements.

[2] Paraphrasing Thurstone and Rasch.

[3] In Bancroft’s view, the analysis of variance is just the computing algorithm to do the arithmetic for extracting mean squares and components of variance from appropriately designed experiments. Rasch analysis, as differentiated from Rasch measurement and as implemented by any number of pieces of software past and present, is just the computing algorithm to do the arithmetic for producing measures from appropriately generated observations and establishing the domains for which they are applicable and appropriate, i.e., valid.

[4] Paraphrasing Thomas Jefferson and Branch Rickey.

[5] With the possible exception of that portion of the world populated by mainstream psychometricians doing educational research in the US.

[6] IQ may have been a poor choice to illustrate the point of a sound theoretical basis.

[7] Paraphrasing “If you don’t care where you are going, any road’ll get you there.” George Harrison, paraphrasing Lewis Carroll, from a conversation between Alice and the Cheshire Cat.

The Trouble with Rasch: the Rasch Model Exposed

Measurement solutions too simple to publish.

Vb. Mean Squares; Outfit and Infit

IIIf. Another Aspect, Reading Aloud

IIIb. The Aspect of Color

III. Abstracting Some Aspects

II. Measurement in Science

I. Rasch’s Theory of Relativity