Lexiles: the making of a measure


A recent conversation with a former colleague (it was more of a lecture) about what psychometricians don’t understand about students and education led me to resurrect an article that I wrote for the Rasch Measurement Transactions four or five years ago. It deals specifically with Lexiles© but it is really about how one defines and uses measures in education and science.

The antagonism toward Lexiles in particular and Rasch measures in general is an opportunity to highlight some distinctions between measurement and analysis and between a measure and an assessment. Often when trying to discuss the development of reading proficiency, specialists in measurement and reading seem to be talking at cross-purposes. Reverting to argument by metaphor, measurement specialists are talking about measuring weight; and reading specialists, about providing proper nutrition.

There is a great deal involved in physical development that is not captured when we measure a child’s weight and the process of measuring weight tells us nothing about whether the result is good, bad, or normal; if you should continue on as you are, schedule a doctor’s appointment, or go to the emergency room without changing your underwear. Evaluation of the result is an analysis that comes after the measurement and depends on the result being a measure. No one would suggest that, because it doesn’t define health, weight is not worth measuring or that it is too politically sensitive to talk about in front of nutritionists. A high number does not imply good nutrition nor does a low number imply poor nutrition. Nonetheless, the measurement of weight is always a part of an assessment of well-being.

A Lexile score, applied to a person, is a measure of reading ability[i], which I use to mean the capability to decode words, sentences, paragraphs, and Supreme Court decisions. A Lexile score, applied to a text, is a measure of how difficult the text is to decode. Hemingway’s “For Whom the Bell Tolls” (840 Lexile score) has been cited as an instance where Lexiles do not work. The argument is that because a 50th percentile sixth-grade reader could engage with this text, and the book was written for adults, something must be wrong. This counter-example, if true, is an interesting case. I have two counter-counter-arguments: first, all measuring instruments have limitations to their use and, second, Lexiles may be describing Hemingway appropriately.

First, outside the context of Lexiles, there is always difficulty for either humans or computer algorithms in scoring exceptional, highly creative writing. (I would venture to guess that many publishers, who make their livings recognizing good writing[ii], would reject Hemingway, Joyce, or Faulkner-like manuscripts if they received them from unknown authors.) I don’t think it follows that we should avoid trying to evaluate exceptional writing. But we do need to know the limits of our instruments.

I rely, on a daily basis, on a bathroom scale. I rely on it even though I believe I shouldn’t use it on the moon, under water, or for elephants or measuring height. It does not undermine the validity of Lexiles in general to discover an extraordinary case for which it does not apply. We need to know the limits of our instrument: when it produces valid measures and when it does not.

Second, given that we have defined the Lexile for a text as the difficulty of decoding the words and sentences, the Lexile analyzer may be doing exactly what it should with a Hemingway text. Decoding the words and sentences in Hemingway is not that hard: the vocabulary is simple, the sentences short. That’s pretty much what a Lexile score reflects.

Understanding or appreciating Hemingway is something else again. This may be getting into the distinction between reading ability, as I defined it, and reading comprehension, as the specialists define that. You must be able to read (i.e., decode) before you can comprehend. Analogously, you have to be able to do arithmetic before you can solve math word problems[iii]. The latter requires the former; the former does not guarantee the latter.

The Lexile metric is a true developmental scale that is not related to instructional method or materials, or to grade-level content standards. The metric reflects increasing ability to read, in the narrow sense of decode, increasingly advanced text. As students advance through the reading/language arts curriculum, they should progress up the Lexile scale. Effective, even standards-based, instruction in ELA[iv] should cause them to progress on the Lexile scale; analogously, good nutrition should cause children to progress on the weight scale[v].

One could coach children to progress on the weight scale in ways counter to good nutrition[vi]. One might subvert Lexile measurements by coaching students to write with big words and long sentences. This does not invalidate either weight or reading ability as useful things to measure. There do need to be checks to ensure we are effecting what we set out to effect.

The role of standards-based assessment is to identify which constituents of reading ability and reading comprehension are present and which absent. Understanding imagery and literary devices, locating topic sentences, identifying main ideas, recognizing sarcasm or satire, comparing authors’ purposes in two passages are within its purview but are not considered in the Lexile score. Its analyzer relies on rather simple surrogates for semantic and syntactic complexity.

The role of measurement on the Lexile scale is to provide a narrowly defined measure of the student’s status on an interval scale that extends over a broad range of reading from Dick and Jane to Scalia and Sotomayor. The Lexile scale does not define reading, recognize the breadth of the ELA curriculum, or replace grade-level content standards-based assessment, but it can help us design instruction and target assessment to be appropriate to the student. We do not expect students to say anything intelligent about text they cannot decode, nor should we attempt to assess their analytic skills using such text.

Jack Stenner (aka Dr. Lexile) uses this as one of his parables: you don’t buy shoes for a child based on grade level, but we don’t think twice about assigning textbooks with the formula (age – 5). It’s not one-size-fits-all in physical development. Cognitive development would probably be no simpler if we were able to measure all its components. To paraphrase Ben Wright, how we measure weight has nothing to do with how skillful you are at football, but you better have some measures before you attempt the analysis.

[i] Ability may not be the best choice of a word. As used in psychometrics, ability is a generic placeholder for whatever we are trying to measure about a person. It implies nothing about where it came from, what it is good for, or how much is enough. In this case, we are using reading ability to refer to a very specific skill that must be taught, learned, and practiced.

[ii] It may be more realistic to say they make their livings recognizing marketable writing, but my cynicism may be showing.

[iii] You also have to decode the word problem but that’s not the point of this sentence. We assume, often erroneously, that the difficulty of decoding the text is not an impediment to anyone doing the math.

[iv] Effective instruction in science, social studies, or basketball strategy should cause progress on the Lexile measure as well; perhaps not so directly. Anything that adds to the student’s repertoire of words and ideas should contribute.

[v] For weight, progress often does not equal gain.

[vi] Metaphors, like measuring instruments, have their limits and I may have exceeded one. However, one might consider the extraordinary measures amateur wrestlers or professional models employ to achieve a target weight.

Computer-Administered Tests That May Teach


One of the political issues with computer-administered tests (CAT) is what to do about examinees who want to revisit, review, and revise earlier responses. Examinees sometimes express frustration when they are not allowed to; psychometricians don’t like the option being available because each item selection is based on previous successes and failures, so changing answers after moving on has the potential of upsetting the psychometric apple cart. One of our more diabolical thinkers has suggested that a clever examinee would intentionally miss several early items, thereby getting an easier test, and then return later to fix the intentionally incorrect responses, ensuring more correct answers and presumably a higher ability estimate. While this strategy could sometimes work in the examinee’s favor (if receiving an incorrect estimate is actually in anyone’s favor), it is somewhat limited because many right answers on an easy test are not necessarily better than fewer right answers on a difficult test and because a good CAT engine should recover from a bad start given the opportunity. While we might trust in CAT, we should still row away from the rocks.

The core issue for educational measurement is test as contest versus a useful self-assessment. When the assessments are infrequent and high stakes with potentially dire consequences for students, schools, districts, administrators, and teachers, there is little incentive not to look for a rumored edge whenever possible[1]. Frequent, low-stakes tests with immediate feedback could actually be valued and helpful to both students and teachers. There is research, for example, suggesting that taking a quiz is more effective for improved understanding and retention than rereading the material.

The issue of revisiting can be avoided, even with high stakes, if we don’t let the examinee leave an item until the response is correct. First, present a multiple choice item (hopefully more creatively than putting a digitized image of a print item on a screen). If we get the right response, we say “Congratulations” or “Good work” and move on to the next item. If the response is incorrect, we give some kind of feedback, ranging from “Nope, what are you thinking?” to “Interesting but not what we’re looking for” or perhaps some discussion of why it isn’t what we’re looking for (recommended). Then we re-present the item with the selected, incorrect foil omitted.  Repeat. The last response from the examinee will always be the correct one, which might even be retained.

The examinee’s score on the item is the number of distractors remaining when we finally get to the correct response[2]. Calibration of the thresholds can be quick and dirty. It is convenient for me here to use the “rating scale” form for the logit [b_v – (d_i + t_ij)]. The highest threshold, associated with giving the correct response on the first attempt, is the same as the logit difficulty of the original multiple choice item, because that is exactly the situation we are in, and t_im = 0 for an item with m distractors (i.e., m+1 foils). The logits for the other thresholds depend on the attractiveness of the distractors. (Usually when written in this form, the t_ij sum to zero, but that’s not helpful here.)

To make things easy for myself, I will use a hypothetical example of a four-choice item with equally popular distractors. The difficulty of the item is captured in the d_i and doesn’t come into the thresholds. Assuming an item with a p-value of 0.5 and equally attractive distractors, the incorrect responses will be spread across the three, with 17% on each. After one incorrect response, we expect the typical examinee to have a [0.5 / (0.5 + 0.17 + 0.17)] = 0.6 chance of success on the second try. A 0.6 chance of success corresponds to a logit difficulty ln [(1 – 0.6) / 0.6] = –0.4. Similarly for the third attempt, the probability of success is [0.5 / (0.5 + 0.17)] = 0.75 and the logit difficulty ln [(1 – 0.75) / 0.75] = –1.1. All of which gives us the three thresholds t = {–1.1, –0.4, 0.0}.
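The arithmetic above can be sketched in a few lines of R. This is only an illustration under the same assumptions as the text (equally attractive distractors, with the remaining probabilities simply renormalized once a foil is removed); the function name auc_thresholds and its arguments are made up for this example and do not come from any package.

```r
# Thresholds for an answer-until-correct item with equally attractive
# distractors (hypothetical helper; the name and arguments are illustrative).
#   p: probability of a correct first response for the standard examinee
#   m: number of distractors
auc_thresholds <- function(p, m = 3) {
  q <- (1 - p) / m                        # probability on each distractor
  left <- m:1                             # distractors still available on attempts 1..m
  success <- p / (p + left * q)           # chance of success on each attempt
  logit <- log((1 - success) / success)   # logit difficulty of each attempt
  list(success = success,
       logit = logit,
       t = logit - logit[1])              # thresholds relative to the first attempt
}

res <- auc_thresholds(0.5)
round(res$success, 2)   # chance of success on attempts 1, 2, 3
round(res$t, 1)         # thresholds in attempt order (first attempt first)
```

The thresholds come out in attempt order, so they read in the reverse of the order used in the text.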

This was easy because I assumed distractors that are equally attractive across the ability continuum; then the order in which they are eliminated doesn’t matter in the arithmetic. With other patterns, it is more laborious but no more profound. If, for example, we have an item like:

  1. Litmus turns what color in acid?
    1. red
    2. blue
    3. black
    4. white,

we could see probabilities across the foils like (0.5, 0.4, 0.07, and 0.03) for the standard examinee. There is one way to answer correctly on the first attempt and score 3; this is the original multiple choice item and the probability of this is still 0.5. There are, assuming we didn’t succeed on the first attempt, three ways to score 2 (ba, ca, and da) that we would need to evaluate. And even more paths to scores of 1 or zero, which I’m not going to list.
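The laborious part can be handed to the computer. The sketch below walks every order in which the distractors might be eliminated and accumulates the probability of each score. It assumes, as in the equally-attractive case, that when a foil is removed the remaining probabilities are renormalized in proportion; that is an assumption of this example, not something the item itself guarantees. The name score_dist is hypothetical.

```r
# Probability of each answer-until-correct score for the standard examinee.
#   p_correct:     probability of choosing the key on the first attempt
#   p_distractors: first-attempt probabilities of the distractors
# Assumes choice probabilities are renormalized after each foil is removed.
score_dist <- function(p_correct, p_distractors) {
  m <- length(p_distractors)
  probs <- numeric(m + 1)                 # probs[s + 1] holds P(score = s)
  walk <- function(remaining, path_prob) {
    total <- p_correct + sum(remaining)
    s <- length(remaining)                # distractors left if we succeed now
    probs[s + 1] <<- probs[s + 1] + path_prob * p_correct / total
    for (j in seq_along(remaining)) {     # otherwise, each way of missing next
      walk(remaining[-j], path_prob * remaining[j] / total)
    }
  }
  walk(p_distractors, 1)
  names(probs) <- 0:m
  probs
}

# The litmus item: red is the key; blue, black, and white are distractors.
litmus <- score_dist(0.5, c(0.4, 0.07, 0.03))
round(litmus, 3)
```

The entry for a score of 2 sums the three paths (ba, ca, and da) mentioned above; the entries for 1 and 0 cover the longer paths.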

Nor does it matter what p-value we start with, although the arithmetic would change. For example, reverting to equally attractive distractors, if we start with p=0.7 instead of 0.5, the chance of success on the second attempt is 0.78 and on the third is 0.875. This leads to logit thresholds of ln [(1 – 0.78) / 0.78] = –1.25, and ln [(1 – 0.875) / 0.875] = –1.95. There is also a non-zero threshold for the first attempt of ln [(1 – 0.7) / 0.7] = –0.85. This is reverting to the “partial credit” form of the logit (b_v – d_ij). To compare to the earlier paragraph requires taking the –0.85 out so that (–0.85, –1.25, –1.95) becomes –0.85 + (0.0, –0.4, –1.1) as before. I should note that this is not the partial credit or rating scale model, although a lot of the arithmetic turns out to be pretty much the same (see Linacre, 1991). It has been called “Answer until Correct” or the Failure model because you keep going on the item until you succeed. This contrasts with the Success model[3] where you keep going until you fail. Or maybe I have the names reversed.

Because we don’t let the examinee end on a wrong answer and we provide some feedback along the way, we are running a serious risk that the examinees could learn something during this process with feedback and second chances. This would violate an ancient tenet in assessment that the agent shalt not alter the object, although I’m not sure how the Quantum Mechanics folks feel about this.

[1] Admission, certifying, and licensing tests have other cares and concerns.

[2] We could give a maximum score of one for an immediate correct response and fractional values for the later stages, but using fractional scores would require slightly different machinery and have no effect on the measures.

[3] DBA, the quiz show model.

Probability Time Trials

It has come to my attention that I write the basic Rasch probability in half a dozen different forms; half of them are in logits (aka, log odds) and half are in the exponential metric (aka, odds). My two favorites for exposition are, in logits, exp (b-d) / [1 + exp (b-d)] and, in exponentials, B / (B + D), where B = e^b and D = e^d. The second of these I find the most intuitive: the probability in favor of the person is the person’s odds divided by the sum of the person and item odds. The first, the logit form, may be the most natural because logits are the units used for the measures and exhibit the interval scale properties and this form emphasizes the basic relationship between the person and item.

There are variations on each of these forms like, [B / D]/ [1 + B / D] and 1 / [1+ D / B], which are simple algebraic manipulations. The forms are all equivalent; the choice of which to use is simply convention, personal preference, or perhaps computing efficiency, but that has nothing to do with how we talk to each other, only how we talk to the computer. The goal of computing efficiency means to minimize the calls to the log and exponential functions, which causes me to work internally mostly in the exponentials and to do input and output in logits.
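A quick numerical check makes the equivalence concrete. The values of b and d here are arbitrary illustrations; any pair of logits would do.

```r
b <- 1.5; d <- -1.5          # logit ability and difficulty (illustrative values)
B <- exp(b); D <- exp(d)     # the same quantities in the exponential (odds) metric

forms <- c(
  logit_1 = exp(b - d) / (1 + exp(b - d)),
  logit_2 = 1 / (1 + exp(d - b)),
  odds_1  = B / (B + D),
  odds_2  = (B / D) / (1 + B / D),
  odds_3  = 1 / (1 + D / B)
)
print(forms)                 # all five entries agree
```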

These deliberations led to a small time trial to provide a partial answer to the efficiency question in R. I first set up some basic parameters and defined a function to compute 100,000 probabilities. (When you consider a state-wide assessment, which can range from a few thousand to a few hundred thousand students per grade, that’s not a very big number. If I were more serious, I would use a timer with more precision than whole seconds.)

> b = 1.5; d = -1.5

> B = exp(b); D = exp(d)

> # Prob is passed as a function of (b, d) so that it is re-evaluated on every pass

> timetrial = function (b, d, N=100000, Prob) { p = numeric(N); for (k in 1:N) p[k] = Prob(b,d); invisible(p) }

Then I ran timetrial, with its 100,000 evaluations, for each of seven expressions for the probability; the first three and the seventh use logits; four, five, and six use exponentials.

> date ()

[1] "Tue Jan 06 11:49:00 "

> timetrial(b,d,,function(b,d) 1 / (1+exp(d-b)))            # 26 seconds

> date ()

[2] "Tue Jan 06 11:49:26 "

> timetrial(b,d,,function(b,d) exp(b-d) / (1+exp(b-d))) # 27 seconds

> date ()

[3] "Tue Jan 06 11:49:53 "

> timetrial(b,d,,function(b,d) exp(b)/(exp(b)+exp(d))) # 27 seconds

> date ()

[4] "Tue Jan 06 11:50:20 "

> timetrial(b,d,,function(b,d) 1 / (1+D/B))                  # 26 seconds

> date ()

[5] "Tue Jan 06 11:50:46 "

> timetrial(b,d,,function(b,d) (B/D) / (1+B/D))            # 27 seconds

> date ()

[6] "Tue Jan 06 11:51:13 "

> timetrial(b,d,,function(b,d) B / (B+D))                     # 26 seconds

> date ()

[7] "Tue Jan 06 11:51:39 "

> timetrial(b,d,,function(b,d) plogis(b-d))                  # 27 seconds

> date ()

[8] "Tue Jan 06 11:52:06 "

The winners were the usual suspects, the ones with the fewest calls and operations but the bottom line seems to be, at least in this limited case using an interpreted language, it makes very little difference. That I take as good news: there is little reason to bother using the exponential metric in the computing.
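For a timer with more precision than whole seconds, base R’s system.time() reports elapsed time in fractions of a second. The sketch below is not a rerun of the trial above; it only shows the mechanics, using the same illustrative b and d, and it also times a single vectorized call for comparison. The variable names are invented for this example, and no particular timings are claimed because they depend on the machine.

```r
b <- 1.5; d <- -1.5
N <- 100000

# Loop version: one probability per pass through the loop.
loop_time <- system.time({
  p_loop <- numeric(N)
  for (k in 1:N) p_loop[k] <- 1 / (1 + exp(d - b))
})

# Vectorized version: the same N probabilities from a single call.
vec_time <- system.time(p_vec <- plogis(rep(b - d, N)))

print(loop_time["elapsed"])
print(vec_time["elapsed"])
```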

The seventh form of the probability, plogis, is the built-in R function for the logistic distribution. While it was no faster, it is an R function and so can handle a variety of arguments in a call like plogis(b-d). If b and d are both scalars, the value of the expression is a scalar. If either b or d is a vector or a matrix, the value is a vector or matrix of the same size. If both b and d are vectors then the argument (b-d) doesn’t work in general, but the argument outer(b,d,"-") will create a matrix of probabilities with dimensions matching the lengths of b and d. This will allow computing all the probabilities for, say, a class or school on a fixed form with a single call.
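As a small illustration of the single call, here is a sketch with made-up abilities and difficulties; the numbers are hypothetical and stand in for a class and a fixed form.

```r
b <- c(-1.0, 0.0, 0.5, 2.0)         # hypothetical person abilities, in logits
d <- c(-1.5, -0.5, 0.0, 0.5, 1.5)   # hypothetical item difficulties, in logits

P <- plogis(outer(b, d, "-"))       # P[v, i] = probability person v succeeds on item i
dim(P)                              # 4 persons by 5 items

expected_raw <- rowSums(P)          # expected raw score for each person on this form
round(expected_raw, 2)
```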

The related R function, dlogis (b-d) has the value of p(1-p), which is useful in Newton’s method or when computing the standard errors. And may be useful for impressing your geek friends or further mystifying your non-geek clients.
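A short sketch of the standard-error use, again with hypothetical numbers: the sum of p(1-p) over the items is the information at a given ability, and the usual standard error is one over its square root.

```r
b <- 0.5                            # hypothetical ability, in logits
d <- c(-1.5, -0.5, 0.0, 0.5, 1.5)   # hypothetical item difficulties, in logits

p <- plogis(b - d)
info <- dlogis(b - d)               # item information, p * (1 - p)
all.equal(info, p * (1 - p))        # TRUE

se <- 1 / sqrt(sum(info))           # standard error of the ability, in logits
se
```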

Useful and Almost Number-free Reports

If I had asked my customers what they wanted, they would have said a faster horse. Henry Ford

Percentile ranks on student reports are tests as contests.

Raw scores on student reports are like live chickens on dinner plates.

If I were to step on my bathroom scale and see a single number like, say for example, 165 lbs (or 75 kilos) with no further explanation, I wouldn’t need an interpretation guide or course in psychometrics to know what the number means or to decide if I like it or not. Nor would I be writing to the manufacturer of the scale asking, “What’s a pound?” If I were to take a qualifying test to be a licensed dog walker and I received a postcard in the mail that said simply, “Your score is 509 GRits ± 41,” I would be a little frustrated and a lot annoyed. And I would need to ask some basic questions like, “What does that mean?” “What’s a GRit?” “Is the ‘41’ the standard error of measurement, or does it represent a confidence interval?” “If so, what level of confidence?” “What does 509 actually say about my proficiency to walk dogs?” And, of course, “Did I pass?”

If the answer to the last question is yes, then most candidates, possibly excluding the psychometricians, will quickly lose interest in the others. If the answer is no, then the tone becomes a little more strident and now includes questions like, “Who decides what passing is?” “What did I miss?” “How close was I?” and if I was close, “Isn’t there almost a 50% chance that I actually passed?”[1] People who did pass never seem concerned about the other half of this question.

If a postcard with a Scale Score (even with some form of the standard error of measurement) isn’t useful or meaningful, what does a report need to be? Examinee reports vary depending on the audience and the purpose of the exam, but for a report going to a student, teacher, parent, or anyone else who might actually make use of the information for the betterment of the examinee, there would seem to be four basic components:

  1. Identification
  2. Measurement
  3. Control
  4. Interpretation

There needs to be enough identifying information to locate the examinee and to deliver the report to the right place. For dog walking candidates, the address on the front of the postcard did the trick. For education, it probably takes some combination of student name, teacher name, classroom/section/period, grade, school, and district. We should also mention the name of the test and the date taken. That is almost always more than enough to locate the right person; if you are still worried about it, add birth date or a parent’s name. Our original list should be adequate to locate the teacher and the teacher should know the student by name.

Measurement of the examinee to determine something about status or progress is the point of the exercise. This report section could be the simple “509 GRits” but it should also include some indication of our confidence in this measurement, which means the standard error of measurement in some guise. To make it concrete, in this example, the standard error of measurement is 41, with a 95% confidence interval of 509 ± 82, or 427 to 591. It is probably prudent to never use a phrase involving the word “error” when communicating with parents or school boards; they tend to interpret “error” as “mistake” and blame you. One often sees phrases like “probable range” to describe the interval between the measure plus and minus two standard errors (or some other arbitrary multiplier), which avoids saying ‘error’ and also ducks the squabble between the frequentists and the Bayesians about what confidence means. A picture may not be worth a thousand words in this case but here it is.

[Figure: Dog walking scale]

I am inclined to leave it at that for the moment but not everyone thinks a line with scale scores and a marker for the person’s location is all that informative. I am oft over-ridden to add other (useful and relevant) information like a conclusion (e.g., pass/fail or performance level[2]) and sometimes even subtest measures to the measurement section. One could also say things like a person at 509 has 41% likelihood of testing below the Competent level next time and a 1e-6 likelihood of testing above Skilled. These are really steps toward control and interpretation, not measurement, so the purist in me wants to put them in the next two sections, although it’s not worth falling on my sword, and what goes in what section is less rigid than I seem to be implying.
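The probable range and the two likelihoods can be reproduced with a couple of calls to pnorm. This sketch assumes the next result is normally distributed around the current measure with a standard deviation equal to the standard error of measurement, which appears to match the figures quoted; the cut scores of 500 and 700 are the Competent and Skilled levels given later in the text. The variable names are invented for the example.

```r
measure <- 509; sem <- 41           # the example measure and its standard error, in GRits
competent <- 500; skilled <- 700    # performance-level cuts from the example

probable_range <- measure + c(-2, 2) * sem   # 427 to 591

# Assumes a normal distribution centered on the measure with sd = sem.
p_below_competent <- pnorm(competent, mean = measure, sd = sem)
p_above_skilled   <- pnorm(skilled, mean = measure, sd = sem, lower.tail = FALSE)

probable_range
round(p_below_competent, 2)
signif(p_above_skilled, 1)
```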

I am willing to give some meaning to the score by showing the ruler and some milestones along it. At this point, the scale score may be less meaningful than the milestones, but with experience, the scale score can become a useful shorthand for the milestones. It doesn’t take very much experience to understand what temperatures of 0°C and 37°C imply, even for US residents. This leads me to the less simple “map” below.

[Figure: Dog walking scale map]

Diagnosis With the Model

The vertical scale is GRits[3], which is our measure of dog walking proficiency and almost meaningless by itself; we wouldn’t lose much if we left the numbers off entirely[4]. The column of text labels is the substantive description of the scale. Topics at the low end, which are relatively easy, deal with type and use of basic equipment; topics at the high end, which are more difficult, deal with complex behaviors. The GRits bring precision; the text labels bring meaning.

The red vertical line has a tic mark for each possible raw score and a red diamond to mark the location of our hypothetical person. The red horizontal lines are the person’s location and plus/minus two standard errors. You can also add some normative information like means, standard deviations, frequency distributions, or quantiles, if you are into that sort of thing.

The gray horizontal lines mark the performance levels: 500 is Competent, 700 is Skilled, and 850 is Master. Labelling the lines rather than the regions between is not standard practice in educational assessment but it avoids the inconvenience of needing to label the region below Competent and the misinterpretation of the levels as actual developmental states or stages rather than simply more or less arbitrary criteria for addressing school accountability or dispensing certificates. So far we are just displaying the result, not interpreting it.

Control of the measurement model means either ensuring that we are warranted in treating the result as a valid measure, in the full sense of the word as we just did, or diagnosing what the anomalies tell us about the examinee. This is again the dichotomy of “diagnosing with the model” and “diagnosing from the model.” Determining which of these paths to follow requires a bit more than simply computing ‘infit’ or ‘outfit’ and consulting the appropriate table of big numbers. This involves looking at individual items, splitting the items into clusters, and looking for things that are funny. Maps like the following can be more useful than any fit statistic and almost number-free.

[Figure: Dog Walking]

Diagnosis From the Model

The display has the same measurement information as before and considerable detail about items and item clusters. First, the red vertical line still refers to the total test and still has a tic mark for each possible raw score and the red diamond for the person. It now has a black diamond for each item response; items to the left of the line are incorrect; those to the right are correct with the distance from the line representing the probability against the response; the greater the distance, the more improbable the response. The dotted vertical lines (blue shading) are control lines and represent probabilities of 75%. We don’t need to be much concerned about anything in the blue. There are four or five items, two difficult items passed and two or three easy items missed, outside the control lines that might warrant investigation.

Most of the same information has been added for each of five item clusters. These are very short tests so the results may seem a little erratic but in all cases, one more item right or wrong would bring them close to the total test measure. If you are so inclined, the number correct score can be found by counting the tic marks[5] up to the red diamond. You can’t necessarily find it by counting the item plotting symbols to the right of the scales because they can represent multiple items. (And it is further confused because some items were not assigned to clusters.) Overall, this is a well-behaved person.

Because I often disparage anything short of computer-administered, fully adaptive tests (CAT), I need to point out a serious issue for me and model control: in the world of CAT, there are no surprises. If we do the CAT right, everything should be in the blue bands. This puts all the load for model control on the item clusters. In our example, we have behaved as though clusters were based on content, which is appropriate for reporting. For control, we would do more by forming clusters on other bases; difficulty, sequence, item type, item format, and item exposure or age are natural choices, but as we become more creative in developing items for computer administration, there could be others.

Interpretation of the measure means explaining what the measurement tells us about the status and progress of the examinee. Establishing ‘performance levels,’ say, Master, Skilled, or Competent dog walker, is a significant step from measurement to meaning, or from quantification to qualification. Announcing that the candidate is above, say, the ‘Competent’ performance level is a start. Diagnosis with the model would then talk about what items candidates at this level have mastery of, what items they have no clue about, and what items are at the candidate’s level. This is reporting what the candidate can do, what the candidate can’t do, and what the candidate should find challenging but possible. That suggests three obvious comments that any computer could readily generate as personalized feedback assuming a well-behaved response pattern and diagnosis with the model.

Personalizing a report takes more than generating text that restates the obvious and uses the candidate’s name in a complete sentence, like “Ron, your Dog Walking GRit is 509 ± 82, which means we think you are Competent to walk dogs.” When we have a computer generating the feedback, we should use any intelligence, artificial or otherwise, that is available. It is generally ok to start with the generic, “Ron, your total score is …” and “You did well on item clusters D and A, but were weak on cluster C,” and move on to things that are less obvious. I prefer to open with a positive, encouraging statement (clusters D and A), then mention problem areas (cluster C), and close with things to work on immediately (topics that haven’t been mastered but are close). Ideally, we would discuss the specifics of the surprising responses. This includes difficult items that were passed and easy items that were missed. This is moving into diagnosis from the model.

The more specifics, the better, even item statistics and foil analysis if anyone asks. But it would be much more valuable and much more work for both the item developers and systems analysts to provide a discussion of the type of misunderstandings or errors implied by any incorrect responses. It is work for the item developers because they would need to understand and explain why every distractor is there and what selecting it means. It is work for the system analysts because they need to keep track of and manage everything.

In today’s world, there is little reason to limit reporting to what can be squeezed onto an 8.5×11 sheet of paper or by concerns about the cost of color printing[6]. Paper copies are static, often cramped and overwhelming. Ideally, an electronic report, like an electronic test, will be interactive, dynamic, and engaging, with effective, targeted scaffolding. It should begin with the general overview and then allow the user to explore or be led through the interesting, important, and useful aspects of the responses, showing more and more detail as needed. Performance Level Descriptors and item clusters could be defined and displayed on request; item details could pop up when the plotting symbol is clicked.

This is not free; there will be resistance to giving items away because they are expensive and the item bank is sacred. Hopefully, we are moving away from once-a-year, high-stakes tests toward testing when it is helpful for the student, drawing from computer-generated and crowd-sourced item banks. And more importantly, toward immediate and informative feedback that might actually have some educational value.


[1] No, you didn’t pass, but if you test again with a parallel form, there is almost a 50% chance that you will.

[2] The little gray lines mark the performance levels (Competent, Skilled, and Master from left to right).

[3] Because GRits use three, rather than e, as their base, a difference of 100 GRits means 3 to 1 odds. Our hypothetical person has 3 to 1 odds of answering an item about leash tension but less than 1 to 3 odds for an item about tugging. More generally, a difference of 100k GRits means odds of 3^k to 1. That’s friendlier than base e and odds like 2.71828…^k to one.

[4] The report is (almost) number-free in the sense that the numbers are not needed to understand and use the results. It is not number-free in another sense because they are essential to provide the framework to create and arrange the display.

[5] This demonstration does not include tick marks for zero and perfect; in real life, you would probably have to account for them somehow. They tend to radically distort the scale without adding much, if any, information. They would extend the scale over 100 GRits in both directions and would probably have ranges more than four times that in width.

[6] Someone will undoubtedly want a printer-friendly version to stick in the file cabinet because they have the file cabinet.

Ixb. R-code to make a simple model less simple and more useful

My life as a psychometrician, the ability algorithm, and some R procs to do the work

The number one job of the psychometrician, in the world of large-scale, state-wide assessments, is to produce the appropriate raw-to-scale tables on the day promised. When they are wrong or late, lawsuits ensue. When they are right and on time, everyone is happy. If we did nothing else, most wouldn’t notice; few would complain.

Once the setup is done, computers can produce the tables in a blink of an eye. It is so easy it is often better, especially in the universe beyond fixed forms, to provide the algorithm to produce scale scores on demand and not bother with lookup tables at all. Give it the item difficulties, feed in the raw score, and the scale score pops out. Novices must take care that management never finds out how easy this step really is.

With modern technology, the ability algorithm can be written in almost any computer language (there is probably an app for your phone) but some are easier than others. My native language is Fortran, so I am most at home with C, C++, R, or related dialects. I am currently using R most of the time. For me with dichotomous items, this does it:

Ability (d)          # where d is the vector of logit difficulties.

But first, I need to copy a few other things into the R window, like a procedure named Ability.

(A simple cut and paste from this post into R probably won’t work but the code did work when I copied it from the Editor. The website seems to use a font for which not all math symbols are recognized by R. In particular, slash (/), minus (-), single and double quotes (‘ “), and ellipses (…) needed to be fixed. I’ve done it with a “replace all” in a text editor before moving it into R by copying the offending symbol from the text into the “replace” box of the “replace all” command and typing the same symbol into the “with” box. Or leave a comment and I’ll email you a simple text version.)

Ability <- function (d, M=rep(1,length(d)), first=1, last=(length(d)-1), A = 500, B = 91, ...) {
    b <- NULL; s <- NULL
    b[first] <- first / (length(d) - first)   # starting value, in exponential form
    D <- exp(d)
    for (r in first:last) {
      b[r] <- Ablest(r, D, M,  b[r], ...)     # replaces the starting value with the logit ability
      s[r] <- SEM (exp(b[r]), D, M)
      b[r+1] <- exp(b[r] + s[r]^2)            # starting value for the next raw score
    }
return (data.frame(raw=(first:last), logit=b[first:last], sem.logit=s[first:last],
          GRit=round((A+B*b[first:last]),0),  sem.GRit=round(B*s[first:last],1)))
} ##############################################################

This procedure is just a front for functions named Ablest and SEM that actually do the work so you will need to copy them as well:

Ablest <- function (r, D, M=rep(1,length(D)), B=(r / (length (D)-r)), stop=0.01) {
# r is raw score; D is vector of exponential difficulties; M is vector of m[i]; stop is the convergence criterion
      repeat {
         adjust <- (r - SumP(B,D,M)) / SumPQ (B,D,M)
         B <- exp(log(B) + adjust)
         if (abs(adjust) < stop) return (log(B))
      }
} ##########################################################ok
SEM <- function (b, d, m=(rep(1,length(d))))  return (1 / sqrt(SumPQ(b,d,m)))

And Ablest needs some even more basic utilities copied into the window:

SumP <- function (b, d, m=NULL, P=function (B,D) (B / (B+D))) {
   if (is.null(m)) return (sum (P(b,d))) # dichotomous case; sum() is a built-in function
   k <- 1
   Sp <- 0
   for (i in 1:length(m)) {
       Sp <- Sp + EV (b, d[k:(k+m[i]-1)])
       k <- k + m[i]                      # step past the thresholds of item i
   }
return (Sp)
} ##################################################################ok
EV <- function (b, d) { #  %*% is the inner product; as.numeric() reduces the 1x1 result to a scalar
   return (as.numeric(seq(1:length(d)) %*% P.Rasch(b, d, m=length(d))))
} ##################################################################ok
SumPQ <- function (B, D, m=NULL, P=function (B,D) {B/(B+D)}, PQ=function (p) {p-p^2}) {
   if (is.null(m)) return (sum(PQ(P(B,D))))  # dichotomous case;
   k <- 1
   Spq <- 0
   for (i in 1:length(m)) {
       Spq <- Spq + VAR (B,D[k:(k+m[i]-1)])
       k <- k + m[i]                      # step past the thresholds of item i
   }
return (Spq)
} ##################################################################ok
VAR <- function (b,d) {  # this is just the polytomous version of (p - p^2)
   return (as.numeric(P.Rasch(b, d, m=length(d)) %*% ((1:length(d))^2) - EV(b,d)^2))
} ##################################################################ok
P.Rasch <- function (b, d, m=NULL, P=function (B,D) (B / (B+D)) ) {
   if (is.null(m)) return (P(b,d)) # simple logistic
   return (P.poly (P(b,d),m))     # polytomous
} ##################################################################ok
P.poly <- function (p, m) { # p is a simple vector of category probabilities
   k <- 1
   for (i in 1:length(m)) {
      p[k:(k+m[i]-1)] = P.star (p[k:(k+m[i]-1)], m[i])
      k <- k + m[i]
   }
return (p)
} ##################################################################ok
P.star <- function (pstar, m=length(pstar)) {
#       Converts p* to p; assumes a vector of probabilities
#       computed naively as B/(B+D).  This routine takes account
#       of the Guttman response patterns allowed with PRM.
    q <- 1-pstar  # all wrong, 000...
    p <- prod(q)
    for (j in 1:m) {
        q[j] <- pstar[j] # one more right, e.g., 100..., or 110..., ...
        p[j+1] <- prod(q)
    }
    return (p[-1]/sum(p)) # Don't return p for category 0
} ##################################################################ok
summary.ability <- function (score, dec=5) {
   print (round(score,dec))
   plot(score[,4],score[,1],xlab="GRit",ylab="Raw Score",type='l',col='red')
} ##################################################################

This is very similar to the earlier version of Ablest but has been generalized to handle polytomous items, which is where the vector M of maximum scores or number of thresholds comes in.
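
As a rough illustration of what the polytomous generalization is doing for a single item, the sketch below builds the category probabilities from the naive threshold probabilities B/(B+D) by walking through the Guttman patterns, and checks them against the usual partial credit expression. The function names category.probs and pcm.probs and the example values are made up for this illustration; they are not part of the routines above.

```r
# Hypothetical helper (not from the post): category probabilities for one
# polytomous item, built from the "naive" threshold probabilities B/(B+D).
category.probs <- function (b, d) {     # b = logit ability, d = logit thresholds
   B <- exp(b); D <- exp(d)
   pstar <- B / (B + D)                  # naive probability of passing each threshold
   m <- length(d)
   w <- numeric(m + 1)
   for (k in 0:m) {                      # Guttman pattern: first k passed, the rest failed
      w[k + 1] <- prod(pstar[seq_len(k)]) * prod(1 - pstar[seq_len(m - k) + k])
   }
   w / sum(w)                            # probabilities for scores 0, 1, ..., m
}

# The same probabilities written the partial credit way:
# score k has weight exp(k*b - sum of the first k thresholds).
pcm.probs <- function (b, d) {
   w <- exp((0:length(d)) * b - c(0, cumsum(d)))
   w / sum(w)
}

p1 <- category.probs(b = 0.5, d = c(-1, 0, 1))   # made-up ability and thresholds
p2 <- pcm.probs(b = 0.5, d = c(-1, 0, 1))
print(round(rbind(p1, p2), 4))                   # the two rows should agree
expected.score <- sum((0:3) * p1)                # this item's share of the expected raw score
print(expected.score)
```

Restricting attention to the Guttman patterns is what makes the two versions agree; the expected score is the quantity that gets summed over items when solving for ability.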

To use more bells and whistles, the call statement can be things like:

Ability (d, M, first, last, A, B, stop)         # All the parameters it has
Ability (d, M)                                         # first, last, A, B, & stop have defaults
Ability (d,,,,,, 0.0001)                             # stop is positional so the commas are needed
Ability (d, ,10, 20,,, 0.001)                       # default for M assumes dichotomous items
Ability (d, M,,, 100, 10)                          # defaults for A & B are 500 and 91
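
A small sketch of what the A and B defaults appear to be doing, assuming the scaling is simply GRit = A + B × logit as in the data frame that Ability returns. The helper names logit.to.GRit and GRit.to.odds are invented for this example. With B = 91, a difference of 100 GRits corresponds to odds very close to 3 to 1, which is consistent with the base-three description of GRits.

```r
# Assumed scaling, read off the defaults in Ability: GRit = A + B * logit
A <- 500
B <- 91
logit.to.GRit <- function (logit) A + B * logit      # hypothetical helper name
GRit.to.odds  <- function (diff) exp(diff / B)       # odds implied by a GRit difference

print(100 / log(3))                 # about 91.02, so B = 91 makes 100 GRits roughly 3-to-1 odds
print(GRit.to.odds(100))            # close to 3
print(GRit.to.odds(200))            # close to 9
print(logit.to.GRit(c(-1, 0, 1)))   # logits of -1, 0, 1 on the GRit scale
```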

To really try it out, we can define a vector of item difficulties with, say, 25 uniformly spaced dichotomous items and two polytomous items, one with three thresholds and one with five. The vector m defines the matching vector of maximum scores.

dd=c(seq(-3,3,.25), c(-1,0,1), c(-2,-1,0,1,2))
m = c(rep(1,25),3,5)
score = Ability (d=dd, M=m)
summary.ability (score, 4)

Or give it your own vectors of logit difficulties and maximum scores.

For those who speak R, the code is fairly intuitive, perhaps not optimal, and could be translated almost line by line into Fortran, although some lines would become several. Most of the routines can be called directly if you’re so inclined and get the arguments right. Most importantly, Ability expects logit difficulties and returns logit abilities. Almost everything expects and uses exponentials. Almost all error messages are unintelligible and arise either because d and m don’t match or because something is an exponential when it should be a logit or vice versa.

I haven’t mentioned what to do about zero and perfect scores today because, first, I’m annoyed that they are still necessary, second, these routines don’t do them, and, third, I talked about the problem a few posts ago. But, if you must, you could use b[0] = b[1] - SEM[1]^2 and b[M] = b[M-1] + SEM[M-1]^2, where M is the maximum possible score, not necessarily the number of items. Or you could appear even more scientific and use something like b[0] = Ablest(0.3, D, m) and b[M] = Ablest(M-0.3, D, m). Here D is the vector of difficulties in the exponential form and m is the vector of maximum scores for the items (and M is the sum of the m’s.) The length of D is the total number of thresholds (aka, M) and the length of m is the number of items (sometimes called L.) Ablest doesn’t care that the score isn’t an integer but Ability would care. The value 0.3 was a somewhat arbitrary choice; you may prefer 0.25 or 0.33 instead.

To call this the “setup” is a little misleading; we normally aren’t allowed to just make up the item difficulties this way. There are a few other preliminaries that the psychometrician might weigh in on or at least show up at meetings; for example, test design, item writing, field testing, field test analysis, item reviews, item calibration, linking, equating, standards setting, form development, item validation, and key verification. There is also the small matter of presenting the items to the students. Once those are out of the way, the psychometrician’s job of producing the raw score to scale score lookup table is simple.

Once I deal with a few more preliminaries, I’ll go ahead and go back to the good stuff like diagnosing item and person anomalies.

Ix. Doing the Arithmetic Redux with Guttman Patterns

For almost the same thing as a PDF with better formatting: Doing the Arithmetic Redux

Many posts ago, I asserted that doing the arithmetic to get estimates of item difficulties for dichotomous items is almost trivial. You don’t need to know anything about second derivatives, Newton’s method iterations, or convergence criteria. You do need to:

  1. Create an L x L matrix N = [n_ij], where L is the number of items.
  2. For each person, add a 1 to n_ij if item j is correct and i is incorrect; zero otherwise.
  3. Create an L x L matrix R = [r_ij] of log odds; i.e., r_ij = log(n_ij / n_ji)
  4. Calculate the row averages; d_i = ∑ r_ij / L.

Done; the row average for row i is the logit difficulty of item i.

That’s the idea but it’s a little too simplistic. Mathematically, step three won’t work if either n_ij or n_ji is zero; in one case, you can’t do the division and in the other, you can’t take the log. In the real world, this means everyone has to take the same set of items and every item has to be a winner and a loser in every pair. For reasonably large fixed form assessments, neither of these is an issue.

Expressing step 4 in matrix speak, Ad = S, where A is an LxL diagonal matrix with L on the diagonal, d is the Lx1 vector of logit difficulties that we are after, and S is the Lx1 vector of row sums. Or d = A⁻¹S, which says nothing more than that the d are the row averages.

R-code that probably works, assuming L, x, and data have been properly defined, and almost line for line what we just said:

Block 1: Estimating Difficulties from a Complete Matrix of Counts N

N = matrix (0, L, L)                                 # Define and zero an LxL matrix
for ( x in data)                                          # Loop through people
   N = N + ((1 - x) %o% x)                   # Outer product of vectors creates square
R = log (N / t(N))                                      # Log Odds (ij) over (ji)
d = rowMeans(R)                                     # Find the row averages

This probably requires some explanation. The object data contains the scored data with one row for each person. The vector x contains the zero-one scored response string for a person. The outer product, %o%, of x with its complement creates a square matrix with a 1 in cell (i, j) when both x_j and (1 - x_i) are one; zero otherwise. The log odds line we used here to define R will always generate some errors as written because the diagonal of N will always be zero. It should have an error trap in it like: R = ifelse ((t(N)*N), log (N / t(N) ), 0).

But if the N and R aren’t full, we will need the coefficient matrix A. We could start with a diagonal matrix with L on the diagonal. Wherever we find a zero off-diagonal entry in R, subtract one from the diagonal and add one to the same off-diagonal entry of A. Block 2 accomplishes the same thing with slightly different logic because of what I know how to do in R; here we start with a matrix of all zeros except ones where the log odds are missing and then figure out what the diagonal should be.

Block 2: Taking Care of Cells Missing from the Matrix of Log Odds R
Build_A <- function (L, R) {
   A = ifelse (R,0,1)                                              # Mark missing cells (includes diagonal)
   diag(A) = L - (rowSums(A) - 1)                       # Fix the diagonal (now every row sums to L)
return (A)
}

We can tweak the first block of code a little to take care of empty cells. This is pretty much the heart of the pair-wise method for estimating logit difficulties. With this and an R-interpreter, you could do it. However, any functional, self-respecting, self-contained package would surround this core with several hundred lines of code to take care of the housekeeping to find and interpret the data and to communicate with you.

Block 3: More General Code Taking Care of Missing Cells

N = matrix (0, L, L)                                   # Define and zero an LxL matrix
for (x in data)                                            # Loop through people
   {N = N + ((1 - x) %o% x)}                   # Outer product of vectors creates square
R = ifelse ((t(N)*N), log (N / t(N) ), 0)         # Log Odds (ij) over (ji)
A = Build_A (L, R)                             # Create coefficient matrix with empty cells
d = solve (A, rowSums(R))                       # Solve equations simultaneously

There is one gaping hole hidden in the rather innocuous expression, for (x in data), which will probably keep you from actually using this code. The vector x is the scored, zero-one item responses for one person. The object data presumably holds all the response vectors for everyone in the sample. The idea is to retrieve one response vector at a time, add it into the counts matrix N in the appropriate manner, until we’ve worked our way through everyone. I’m not going to tackle how to construct data today. What I will do is skip ahead to the fourth line and show you some actual data.
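
One possible way to fill that hole, offered only as a sketch: assume data is a plain zero-one matrix with one row per person and one column per item, and loop over its rows. The simulated responses, the sample size, and the made-up difficulties in true.d are all assumptions for the sake of having something to run. The same counts can also be had from a single matrix product.

```r
set.seed(1)                               # simulated data, purely for illustration
L <- 5                                    # number of items
n <- 500                                  # number of people
true.d <- c(-1, -0.5, 0, 0.5, 1)          # made-up logit difficulties
ability <- rnorm(n)                       # made-up logit abilities
p <- 1 / (1 + exp(outer(ability, true.d, function (b, d) d - b)))
data <- matrix(as.numeric(runif(n * L) < p), n, L)   # one row per person, one column per item

# Version 1: loop over the rows, one person at a time
N <- matrix(0, L, L)
for (i in seq_len(nrow(data))) {
   x <- data[i, ]
   N <- N + ((1 - x) %o% x)               # row item missed, column item passed
}

# Version 2: the same counts in one matrix product
N2 <- t(1 - data) %*% data

R <- ifelse(t(N) * N, log(N / t(N)), 0)   # log odds, with the trap for empty cells
d <- rowMeans(R)                          # fine here because no off-diagonal cell is empty
print(round(d, 2))                        # should track true.d reasonably well
```

Looping over rows keeps the code close to the description in the text; the matrix product is just a shortcut to the same N.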

Table 1: Table of Count Matrix N for Five Multiple Choice Items

        MC.1   MC.2   MC.3   MC.4   MC.5
MC.1       0     35     58     45     33
MC.2     280      0    240    196    170
MC.3     112     49      0     83     58
MC.4     171     77    155      0     99
MC.5     253    145    224    193      0


Table 1 is the actual counts for part of a real assessment. The entries in the table are the number of times the row item was missed and the column item was passed. The table is complete (i.e., all non-zeros except for the diagonal). Table 2 is the log odds computed from Table 1; e.g., log (280 / 35) = 2.079 indicating item 2 is about two logits harder than item 1. Because the table is complete, we don’t really need the A-matrix of coefficients to get difficulty estimates; just add across each row and divide by five.
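
The arithmetic from counts to log odds to difficulties can be sketched directly from the published numbers. The counts below are the Table 1 values; the last-column entries for MC.3, MC.4, and MC.5 are taken from the matching rows of Table 3, which repeats the same five items. The object names are arbitrary.

```r
items <- paste0("MC.", 1:5)
N <- matrix(c(  0,  35,  58,  45,  33,
              280,   0, 240, 196, 170,
              112,  49,   0,  83,  58,
              171,  77, 155,   0,  99,
              253, 145, 224, 193,   0),
            nrow = 5, byrow = TRUE, dimnames = list(items, items))

R <- ifelse(t(N) * N, log(N / t(N)), 0)   # log odds; the diagonal is set to zero
d <- rowMeans(R)                          # complete table, so row averages are enough
print(round(cbind(R, Logit = d), 3))
```

The first two rows of the printout should reproduce the logits shown in Table 2 (-1.222 and 0.952).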

Table 2: Table of Log Odds R for Five Multiple Choice Items

          MC.1     MC.2     MC.3     MC.4     MC.5    Logit
MC.1         0   -2.079   -0.658   -1.335   -2.037   -1.222
MC.2     2.079        0    1.589    0.934    0.159    0.952
MC.3     0.658   -1.589        0   -0.625   -1.351
MC.4     1.335   -0.934    0.625        0   -0.668
MC.5     2.037   -0.159    1.351    0.668        0


This brings me to the true elegance of the algorithm in Block 3. When we build the response vector x correctly (a rather significant qualification,) we can use exactly the same algorithm that we have been using for dichotomous items to handle polytomous items as well. So far, with zero-one items, the response vector was a string of zeros and ones and the vector’s length was the maximum possible score, which is also the number of items. We can coerce constructed responses into the same format.

If, for example, we have a constructed response item with four categories, there are three thresholds and the maximum possible score is three. With four categories, we can parse the person’s response into three non-independent items. There are four allowable response patterns, which not coincidentally, happen to be the four Guttman patterns: (000), (100), (110), and (111), which correspond to the four observable scores: 0, 1, 2, and 3. All we need to do to make our algorithm work is replace the observed zero-to-three polytomous score with the corresponding zero-one Guttman pattern.


Score   CR.1-1   CR.1-2   CR.1-3
0            0        0        0
1            1        0        0
2            1        1        0
3            1        1        1


If, for example, the person’s response vector for the five MC and one CR was (101102), the new vector will be (10110110). The person’s total score of five hasn’t changed but we now have a response vector of all ones and zeros of length equal to the maximum possible score, which is the number of thresholds, which is greater than the number of items. With all dichotomous items, the length was also the maximum possible score and the number of thresholds but that was also the number of items. With the reconstructed response vectors, we can now naively apply the same algorithm and receive in return the logit difficulty for each threshold.
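
The recoding itself is only a few lines. This is a sketch with an invented helper name, to.guttman, which takes the observed item scores and the vector of maximum scores and returns the expanded zero-one vector.

```r
# Hypothetical helper: expand observed item scores into zero-one Guttman patterns.
# scores = observed score on each item; m = maximum score on each item.
to.guttman <- function (scores, m) {
   unlist(lapply(seq_along(m), function (i) {
      c(rep(1, scores[i]), rep(0, m[i] - scores[i]))   # e.g., score 2 of 3 becomes 1 1 0
   }))
}

m <- c(1, 1, 1, 1, 1, 3)                  # five MC items and one three-threshold CR item
x <- to.guttman(c(1, 0, 1, 1, 0, 2), m)   # the response vector (101102) from the text
print(x)                                  # 1 0 1 1 0 1 1 0
```

The expanded vector has one entry per threshold and the same total score as the original responses, so it can be fed to the outer-product step unchanged.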

Here are some more numbers to make it a little less obscure.

Table 3: Table of Counts for Five Multiple Choice Items and One Constructed Response

          MC.1   MC.2   MC.3   MC.4   MC.5   CR.1-1   CR.1-2   CR.1-3
MC.1         0     35     58     45     33       36       70
MC.2       280      0    240    196    170       91      234       21
MC.3       112     49      0     83     58       52       98
MC.4       171     77    155      0     99       59      162       12
MC.5       253    145    224    193      0       74      225
CR.1-1      14      5     14     11      8        0        0        0
CR.1-2     101     46     85     78     63      137        0        0
CR.1-3     432    268    404    340    277      639      502        0


The upper left corner is the same as we had earlier but I have now added one three-threshold item. Because we are restricted to the Guttman patterns, part of the lower right is missing: e.g., you cannot pass item CR.1-2 without passing CR.1-1, or put another way, we cannot observe non-Guttman response patterns like (0, 1, 0).

Table 4: Table of Log Odds R for Five Multiple Choice Items and One Constructed Response

            MC.1     MC.2     MC.3     MC.4     MC.5   CR.1-1   CR.1-2   CR.1-3       Sum     Mean
MC.1           0   -2.079   -0.658   -1.335   -2.037    0.944   -0.367   -4.682   -10.214   -1.277
MC.2       2.079        0    1.589    0.934    0.159    2.901    1.627   -2.546     6.743    0.843
MC.3       0.658   -1.589        0   -0.625   -1.351    1.312    0.142   -3.362    -4.814   -0.602
MC.4       1.335   -0.934    0.625        0   -0.668    1.680    0.731   -3.344    -0.576   -0.072
MC.5       2.037   -0.159    1.351    0.668        0    2.225    1.273   -2.405     4.989
CR.1-1    -0.944   -2.901   -1.312   -1.680   -2.225        0        0        0    -9.062   -1.133
CR.1-2     0.367   -1.627   -0.142   -0.731   -1.273        0        0        0    -3.406
CR.1-3     4.682    2.546    3.362    3.344    2.405        0        0        0    16.340


Moving to the matrix of log odds, we have even more holes. The table includes the row sums, which we will need, and the row means, which are almost meaningless. The empty section of the log odds does make it obvious that the constructed response thresholds are estimated from their relationship to the multiple choice items, not from anything internal to the constructed response itself.

The A-matrix of coefficients (Table 5) is now useful. The rows define the simultaneous equations to be solved. For the multiple choice, we can still just use the row means because those rows are complete. The logit difficulties in the final column are slightly different than the row means we got when working just with the five multiple choice for two reasons: the logits are now centered on the eight thresholds rather than the five difficulties, and we have added in some more data from the constructed response.

Table 5: Coefficient Matrix A for Five Multiple Choice Items and One Constructed Response

          MC.1   MC.2   MC.3   MC.4   MC.5   CR.1-1   CR.1-2   CR.1-3       Sum    Logit
MC.1         8      0      0      0      0        0        0        0   -10.214   -1.277
MC.2         0      8      0      0      0        0        0        0     6.743    0.843
MC.3         0      0      8      0      0        0        0        0    -4.814   -0.602
MC.4         0      0      0      8      0        0        0        0    -0.576   -0.072
MC.5         0      0      0      0      8        0        0        0     4.989
CR.1-1       0      0      0      0      0        6        1        1    -9.062
CR.1-2       0      0      0      0      0        1        6        1    -3.406
CR.1-3       0      0      0      0      0        1        1        6    16.340
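
The simultaneous equations in Table 5 can be handed straight to solve(). This sketch types in the coefficient matrix as printed and uses the row sums from the Sum column of Table 4; the object names are arbitrary and nothing here goes beyond the published numbers.

```r
labels <- c(paste0("MC.", 1:5), paste0("CR.1-", 1:3))
A <- diag(8, 8)                           # complete rows: L = 8 on the diagonal
cr <- 6:8                                 # the three constructed-response thresholds
A[cr, cr] <- 1                            # ones where the log odds are missing
diag(A)[cr] <- 6                          # diagonal reduced so each row still sums to 8

S <- c(-10.214, 6.743, -4.814, -0.576, 4.989, -9.062, -3.406, 16.340)  # row sums, Table 4
d <- solve(A, S)                          # logit difficulty for each threshold
names(d) <- labels
print(round(d, 3))
```

For the multiple choice rows this is just the row sum divided by eight; the constructed-response thresholds are the only place where the off-diagonal ones make a difference.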


This is not intended to be an R primer so much as an alternative way to show some algebra and do some arithmetic. I have found the R language to be a convenient tool for doing matrix operations, the R packages to be powerful tools for many, perhaps most, complex analyses, and the R documentation to be almost impenetrable. The language was clearly designed by and most packages written by very clever people; the examples in the documentation seemed intended to impress the other very clever people with how very clever the author is rather than illustrate something I might actually want to do.

My examples probably aren’t any better.

Viiif: Apple Pie and Disordered Thresholds Redux

A second try at disordered thresholds

It has been suggested, with some justification, that I may be a little chauvinistic depending so heavily on a baseball analogy when pondering disordered thresholds. So for my friends in Australia, Cyprus, and the Czech Republic, I’ll try one based on apple pie.

Certified pie judges for the Minnesota State Fair are trained to evaluate each entry on the criteria in Table 1 and the results for pies, at least the ones entered into competitions, are unimodal, somewhat skewed to the left.

Table 1: Minnesota State Fair Pie Judging Rubric

[Rubric table not recovered; the only criterion that survives is “Internal appearance”.]

We might suggest some tweaks to this process, but right now our assignment is to determine preferences of potential customers for our pie shop. All our pies would be 100s on the State Fair rubric so it won’t help. We could collect preference data from potential customers by giving away small taste samples at the fair and asking each taster to respond to a short five-category rating scale with categories suggested by our psychometric consultant.

My feeling about this pie is:

0  I’d rather have boiled liver
1  Can I have cake instead?
2  Almost as good as my mother’s
3  Among the best I’ve ever eaten
4  I could eat this right after a major feast!

The situation is hypothetical; the data are simulated from unimodal distributions with roughly equal means. On day one, thresholds 3 and 4 were reversed; on day two, thresholds 2 and 3 for some tasters were also reversed. None of that will stop me from interpreting the results. It is not shown in the summary of the data below, but the answer to our marketing question is that pies made with apples were the clear winners. (To appropriate a comment that Rasch made about correlation coefficients, this result is population-dependent and therefore scientifically rather uninteresting.) Any problems that the data might have with the thresholds did not prevent us from reaching this conclusion rather comfortably. The most preferred pies received the highest scores in spite of our problematic category labels. Or at least that’s the story I will include with my invoice.

The numbers we observed for the categories are shown in Table 2. Right now we are only concerned with the categories, so this table is summed over the pies and the tasters.

Table 2: Results of Pie Preference Survey for Categories

Categories, in order: I’d rather have boiled liver; Can I have cake instead?; Almost as good as my mother’s; Among the best I’ve ever eaten; I could eat this right after a major feast!

Day One    10   250   785    83
Day Two   120   751    95    22


In this scenario, we have created at least two problems; first, the wording of the category descriptions may be causing some confusion. I hope those distinctions survive the cultural and language differences between the US and the UK. Second, the day two group is making an even cruder distinction among the pies; almost I like it or I don’t like it.

The category 4 was intended to capture the idea that this pie is so good that I will eat it even if I have already eaten myself to the point of pain. For some people that may not be different than this pie is among the best I’ve ever eaten, which is why relatively few chose category 3. Anything involving mothers is always problematic on a rating scale. Depending on your mother, “Almost as good as my mother’s” may be the highest possible rating; for others, it may be slightly above boiled liver. That suggests there may be a problem with the category descriptors that our psychometrician gave us, but the fit statistics would not object. And it doesn’t explain the difference between days one and two.

Day Two happened to be the day that apples were being judged in a separate arena, completely independently of the pie judging. Consequently every serious apple grower in Minnesota was at the fair. Rather than spreading across the five categories, more or less, this group tended to see pies as a dichotomy: those that were made with apples and those that weren’t. While the general population spread out reasonably well across the continuum, the apple growers were definitely bimodal in their preferences.

The day two anomaly is in the data, not the model or thresholds. The disordered thresholds, which exposed the anomaly because we imposed a strong model but which were not reflected in the standard fit statistics, are an indication that we should think a little more about what we are doing. Almost certainly, we could improve on the wording of the category descriptions. But we might also want to separate apple orchard owners from other respondents to our survey. The same might also be true for banana growers but they don’t figure heavily in Minnesota horticulture. Once again, Rasch has shown us what is population-independent, i.e., the thresholds (and therefore scientifically interesting) and what is population-dependent, i.e., frequencies and preferences, (and therefore only interesting to marketers.)

These insights don’t tell us much about marketing pies better but I wouldn’t try to sell banana cream to apple growers and I would want to know how much of my potential market are apple growers. I am still at a loss to explain why anyone, even beef growers, would pick liver over anything involving sugar and butter.

Viib. Using R to do a little work

Ability estimates, perfect scores, and standard errors

The philosophical musing of most of my postings has kept me entertained, but eventually we need to connect models to data if they are going to be of any use at all. There are plenty of software packages out there that will do a lot of arithmetic for you but it is never clear exactly what someone else’s black box is actually doing. This is sort of a DIY black box.

The dichotomous case is almost trivial. Once we have estimates of the item’s difficulty d and the person’s ability b, the probability of person succeeding on the item is p = B / (B + D), where B = exp(b) and D = exp(d). If you have a calibrated item bank (i.e., a bunch of items with estimated difficulties neatly filed in a cloud, flash drive, LAN, or box of index cards), you can estimate the ability of any person tested from the Bank by finding the value of the b that makes the observed score equal the expected score, i.e., solves the equation r = ∑p, where r is the person’s number correct score and p was just defined.

If you are more concrete than that, here is a little R-code that will do the arithmetic, although it’s not particularly efficient nor totally safe. A responsible coder would do some error trapping to ensure r is in the range 1 to L-1 (where L = length of d,) the ds are in logits and centered at zero. Rasch estimation and the R interpreter are robust enough that you and your computer will probably survive those transgressions.

#Block 1: Routine to compute logit ability for number correct r given d
Able <- function (r, d, stop=0.01) { # r is raw score; d is vector of logit difficulties
   b <- log (r / (length (d)-r))    # Initialize
   repeat {
         adjust <- (r - sum(P(b,d))) / sum(PQ (P(b,d)))
         b <- b + adjust
         if (abs(adjust) < stop) return (b)
}      }
P <- function (b, d) (1 / (1+exp (d-b))) # computationally convenient form for probability
PQ <- function (p) (p-p^2)                     # p(1-p) aka inverse of the 2nd derivative

If you would like to try it, copy the text between the lines above into an R-window and then define the ds somehow and type in, say, Able(r=1, d=ds) or else copy the commands between the lines below to make it do something. Most of the following is just housekeeping; all you really need is the command Able(r,d) if r and d have been defined. If you don’t have R installed on your computer, following the link to LLTM in the menu on the right will take you to an R site that has a “Get R” option.

In the world of R, the hash tag marks a comment so anything that follows is ignored. This is roughly equivalent to other uses of hash tags and R had it first.

#Block 2: Test ability routines
Test.Able <- function (low, high, inc) {
#Create a vector of logit difficulties to play with,
d = seq(low, high, inc)

# The ability for a raw score of 1,
# overriding the default convergence criterion of 0.01 with 0.0001
print ("Ability r=1:")
    print (Able(r=1, d=d, stop=0.0001))
#To get all the abilities from 1 to L-1
# first create a spot to receive results
b = NA
#Then compute the abilities; default convergence = 0.01
for (r in 1:(length(d)-1) )
     b[r] = Able (r, d)
#Show what we got
print ("Ability r=1 to L-1:")
    print (b)
}
Test.Able (-2,2,0.25)

I would be violating some sort of sacred oath if I were to leave this topic without the standard errors of measurement (sem); we have everything we need for them. For a quick average, of sorts, sem, useful for planning and test design, we have the Wright-Douglas approximation: sem = 2.5/√L, where L is the number of items on the test. Wright & Stone (1979, p 135) provide another semi-shortcut based on height, width, and length, where height is the percent correct, width is the range of difficulties, and length is the number of items. Or to extricate the sem for almost any score from the logit ability table, sem_r = √[(b_{r+1} - b_{r-1})/2]. Or if you want to do it right, sem_r = 1 / √[∑p_r(1-p_r)].

Of course, I have some R-code. Let me know if it doesn’t work.

#Block 3: Standard Errors and a few shortcuts
# Wright-Douglas 'typical' sem
wd.sem <- function (k) (2.5/sqrt(k))
# Wright-Stone from Mead-Ryan
SEMbyHWL <- function (H=0.5,W=4,L=1) {
     C2 <- NA
     W <- ifelse(W>0,W,.001)
     for (k in 1:length(H))
            C2[k] <-W*(1-exp(-W))/((1-exp(-H[k]*W))*(1-exp(-(1-H[k])*W)))
return (sqrt( C2 / L))
}
# SEM from logit ability table
bToSem <- function (r1, r2, b) {
     s  <- NA
     for (r in r1:r2)
           s[r] <- (sqrt((b[r+1]-b[r-1])/2))
return (s)
}
# Full blown SEM
sem <- function (b, d) {
     s <-  NA
    for (r in 1:length(b))
          s[r] <- 1 / sqrt(sum(PQ(P(b[r],d))))
 return (s)
}

To get the SEM’s from all four approaches, all you really need are the four lines after “Now we’re ready” below. The rest is start up and reporting.


#Block 4: Try out Standard Error procedures
Test.SEM <- function (d) {
# First, a little setup (assuming Able is still loaded.)
        L = length (d)
        W = max(d) - min(d)
        H = seq(L-1)/L
# Then compute the abilities; default convergence = 0.01
      b = NA
      for (r in 1:(L-1))
            b[r] = Able (r, d)
# Now we're ready
       s.wd = wd.sem (length(d))
       s.HWL = SEMbyHWL (H,W,L)
       s.from.b = bToSem (2,L-2,b) # ignore raw score 1 and L-1 for the moment
       s = sem(b,d)
# Show what we got
     print ("Height")
        print (round(H,3))
     print ("Width")
        print (W)
     print ("Length")
        print (L)
    print ("Wright-Douglas typical SEM:")
        print (round(s.wd,2))
    print ("HWL SEM r=1 to L-1:")
        print (round(s.HWL,3))
    print ("SEM r=2 to L-2 from Ability table:")
       print (round(c(s.from.b,NA),3))
    print ("MLE SEM r=1 to L-1:")
      print (round(s,3))
   plot(b,s,xlim=c(-4,4),ylim=c(0.0,1),col="red",type="l",xlab="Logit Ability",ylab="Standard Error")
}
Test.SEM (seq(-3,3,0.25))

Among other sweeping assumptions, the Wright-Douglas approximation for the standard error assumes a “typical” test with items piled up near the center. What we have been generating with d=seq(-3,3,0.25) are items uniformly distributed over the interval. While this is effective for fixed-form group-testing situations, it is not a good design for measuring any individual. The wider the interval, the more off-target the test will be. The reason for bringing this up here is that Wright & Douglas will underestimate the typical standard error for a wide, uniform test. Playing with the Test.SEM command will make this painfully clear.
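
A compact way to see the underestimate without loading the earlier blocks, as a sketch: evaluate the maximum likelihood standard error directly at a handful of abilities and compare it with 2.5/√L. The helper name sem.at, the ability grid, and the narrower comparison test d.narrow are assumptions made up for this example.

```r
sem.at <- function (b, d) {               # MLE standard error at logit ability b
   p <- 1 / (1 + exp(d - b))
   1 / sqrt(sum(p * (1 - p)))
}
d.wide   <- seq(-3, 3, 0.25)              # the uniform test used in the post (25 items)
d.narrow <- seq(-1, 1, length.out = 25)   # hypothetical test with items nearer the center
b <- seq(-2, 2, 0.5)                      # a few abilities to look at

wd <- 2.5 / sqrt(length(d.wide))          # Wright-Douglas: 0.5 for 25 items
print(wd)
print(round(sapply(b, sem.at, d = d.wide), 3))     # every value sits above wd
print(round(sapply(b, sem.at, d = d.narrow), 3))   # smaller near the center, larger far from it
```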

The Wright-Stone HWL approach, which preceded Wright-Douglas, is also intended for test design: determining how many items are needed and how they should be distributed. It suggested that the best test design is a uniform distribution of item difficulties, which may have been true in 1979 when there were no practicable alternatives to paper-based tests. The approach boils down to an expression of the form SEM = C / √L, where C is a rather messy function of H and W. The real innovation in HWL was the recognition that test length L could be separated from the other parameters. In hindsight, realizing that the standard error of measurement has the square root of test length in the denominator doesn’t seem that insightful.
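To see the separation of L from the other parameters, here is a minimal sketch. It does not use the HWL formula itself; it assumes only the usual Rasch information expression (the same one behind sem above) and holds the width fixed at -3 to 3 while the length changes. The helper names p.rasch and sem.at are mine, not part of the earlier blocks.

```r
# Rasch probability of success for ability b on items with difficulties d
p.rasch <- function (b, d) 1 / (1 + exp(d - b))

# MLE standard error at ability b: 1 / sqrt(sum of p*(1-p))
sem.at <- function (b, d) {
      p <- p.rasch(b, d)
      1 / sqrt(sum(p * (1 - p)))
}

# Uniform tests of the same width but different lengths (hypothetical choices)
test.lengths <- c(13, 25, 49, 97)
se <- sapply(test.lengths, function (L) sem.at(0, seq(-3, 3, length.out = L)))

# If SEM = C / sqrt(L), then SEM * sqrt(L) should barely move as L changes
C <- se * sqrt(test.lengths)
print(round(cbind(L = test.lengths, SEM = se, C = C), 3))
```

The standard error shrinks as the test gets longer, while the product in the last column stays close to constant for a fixed width; that near-constant is the C in the expression above.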

We also need to do something intelligent or at least defensible about the zero and perfect scores. We can’t really estimate them because there are no abilities high enough for a perfect number correct or low enough for zero to make either L = ∑p or 0 = ∑p true. This reflects the true state of affairs; we don’t know how high or how low perfect and zero performances really are but sometimes we need to manufacture something to report.
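A small sketch of why the extremes cannot be estimated, assuming the same uniform test we have been using: the expected score ∑p creeps toward L as ability rises but never gets there. The function name expected.score is mine and the abilities tried are arbitrary.

```r
d <- seq(-3, 3, 0.25)
L <- length(d)

# Expected number correct for ability b: the sum of the Rasch probabilities
expected.score <- function (b, d) sum(1 / (1 + exp(d - b)))

# How far short of a perfect score is the expectation at ever higher abilities?
b <- c(2, 4, 6, 8, 10)
gap <- L - sapply(b, expected.score, d = d)
print(cbind(b, gap = signif(gap, 3)))
```

The gap keeps shrinking but stays positive, so no finite ability satisfies L = ∑p; by symmetry the same holds for zero at the low end.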

Because the SEMs for 1 and L-1 are typically a little greater than one, in logits, we could shift the ability estimates for 1 and L-1 outward by 1.2 or so to stand in for 0 and L; the appropriate value gets smaller as the test gets longer. Or we could estimate the abilities for something close to 0 and L, say, 0.25 and L-0.25. We could get slightly less extreme values using 0.33 or 0.5, or more extreme values using 0.1.

For the example we have been playing with, here’s how much difference it does or doesn’t make. The first entry in the table below abandons the pseudo-rational arguments and says the square of something a little greater than one is 1.2 and that works about as well as anything else. This simplicity has never been popular with technical advisors or consultants. The second line moves out one standard error squared from the abilities for one and one less than perfect. The last three lines estimate the ability for something “close” to zero or perfect. Close is defined as 0.33 or 0.25 or 0.10 raw score points. Once the blanks for zero and perfect are filled in, we can proceed with computing a standard error for them using the standard routines and then reporting measures as though we had complete confidence.

Method      Shift    Zero     Perfect
Constant    1.20     -5.58    5.58
SE shift    One      -5.51    5.51
Shift       0.33     -5.57    5.57
Shift       0.25     -5.86    5.86
Shift       0.10     -6.80    6.80

#Block 5: Abilities for zero and perfect: A last bit of code to play with the extreme scores and what to do about them.
Test.0100 <- function (shift) {
      d = seq(-3,3,0.25)
      b = NA
      for (r in 1:(length(d)-1) ) b[r] = Able (r, d)
# Number correct runs from 0 to L, one value for each ability plotted
      nc = 0:length(d)
# Adjust by something a little greater than one squared
      b0 = b[1]-shift[1]
      bL = b[length(d)-1]+shift[1]
      print(c("Constant shift",shift[1],round(b0, 2),round(bL, 2)))
      plot(c(b0,b,bL),nc,xlim=c(-6.5,6.5),type="b",xlab="Logit Ability",ylab="Number Correct",col="blue")
# Adjust by one standard error squared
      s = sem(b,d)
      b0 = b[1]-s[1]^2
      bL = b[length(d)-1]+s[1]^2
      print(c("SE shift",round(b0, 2),round(bL, 2)))
      points (c(b0,b,bL),nc,col="red",type="b")
# Estimate ability for something "close" to zero
      for (x in shift[-1]) {
            b0 = Able(x,d)                         # if you try Able(0,d) you will get an inscrutable error.
            bL = Able(length(d)-x,d)
            print( c("Shift",x,round(b0, 2),round(bL, 2)))
            points (c(b0,b,bL),nc,type="b")
      }
}

Test.0100 (c(1.2,.33,.25,.1))

The basic issue is not statistics; it’s policy for how much the powers that be want to punish or reward zero or perfect. But, if you really want to do the right thing, don’t give tests so far off target.

Viiid: Measuring Bowmanship

Archery as an example of decomposing item difficulty and validating the construct

The practical definition of the aspect is the tasks we use to provoke the person into providing evidence. Items that are hard to get right, tasks that are difficult to perform, statements that are distasteful, targets that are hard to hit will define the high end of the scale; easy items, simple tasks, or popular statements will define the low end. The order must be consistent with what would be expected from the theory that guided the design of the instrument in the first place. Topaz is always harder than quartz regardless of how either is measured. If not, the items may be inappropriate or the theory wrong[1]. The structure that the model provides should guide the content experts through the analysis, with a little help from their friends.

Table 5 shows the results of a hypothetical archery competition. The eight targets are described in the center panel. It is convenient to set the difficulty of the base target (i.e., largest bull’s-eye, shortest distance and level range) to zero. The scale is a completely arbitrary choice; we could multiply by 9/5 and add 32, if that seemed more convenient or marketable. The most difficult target was the smallest bull’s-eye, longest distance, and swinging. Any other outcome would have raised serious questions about the validity of the competition or the data.

Table 5 Definition of Bowmanship

The relative difficulties of the basic components of target difficulty are just to the right of the numeric logit scale: a moving target added 0.5 logits to the base difficulty; moving the target from 30 m. to 90 m. added 1.0 logits; and reducing the diameter of the bull’s-eye from 122 cm to 60 cm added 2.0 logits.
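As a sketch of how the eight target difficulties follow from the three components, assume the effects simply add with no interactions (my assumption for illustration; Table 5 is the authority). The effect sizes are the ones quoted above; the column names are mine.

```r
# Component effects in logits, as quoted in the text
effect <- c(moving = 0.5, far = 1.0, small = 2.0)

# All eight combinations of the three two-level factors (0 = base level)
targets <- expand.grid(moving = 0:1, far = 0:1, small = 0:1)

# Difficulty of each target is the sum of the effects that apply to it
targets$difficulty <- as.vector(as.matrix(targets) %*% effect)

targets <- targets[order(targets$difficulty), ]
print(targets, row.names = FALSE)
```

The base target sits at zero and the small, distant, swinging target at the top, with the other six spread in between.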

The role of specific objectivity in this discussion is subtle but crucial. We have arranged the targets according to our estimated scale locations and are now debating among ourselves if the scale locations are consistent with what we believe we know about bowmanship. We are talking about the scale locations of the targets, period, not about the scale locations of the targets for knights or pages, for long bows or crossbows, for William Tell or Robin Hood. And we now know that William Tell is about a quarter logit better than Robin Hood, but maybe we should take the difference between a long bow and a crossbow into consideration.

While it may be interesting to measure and compare the bowmanship of any and all of these variations and we may use different selections of targets for each, those potential applications do not change the manner in which we define bowmanship. The knights and the pages may differ dramatically in their ability to hit targets and in the probabilities that they hit any given target, but the targets must maintain the same relationships, within statistical limits, or we do not know as much about bowmanship as we thought.

The symmetry of the model allows us to express the measures of the archers in the same metric as the targets. Thus, after a competition that might have used different targets for different archers, we would still know who won, we would know how much better Robin Hood is than the Sheriff, and we would know what each is expected to do and not do. We could place both on the bowmanship continuum and make defendable statements about what kinds of targets they could or could not hit.
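To make “what each is expected to do” concrete, here is a minimal sketch using the basic Rasch expression for the probability of a hit. The target difficulties assume the additive components described earlier; the two ability values are hypothetical, chosen only so that the archers sit a quarter logit apart.

```r
# Probability that an archer of ability b hits a target of difficulty d
p.hit <- function (b, d) 1 / (1 + exp(d - b))

difficulty <- seq(0, 3.5, 0.5)                       # eight targets, easiest to hardest
ability <- c(RobinHood = 2.00, WilliamTell = 2.25)   # hypothetical measures in logits

hits <- outer(ability, difficulty, p.hit)
colnames(hits) <- difficulty
print(round(hits, 2))

# The gap between the archers, in log-odds, is the same on every target
print(round(qlogis(hits[2, ]) - qlogis(hits[1, ]), 2))
```

The probabilities differ from target to target, but the log-odds difference between the two archers does not depend on which target they shoot at, which is the point of putting archers and targets in one metric.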

[1] A startling new discovery, like quartz scratching topaz, usually means that the data are miscoded.

PDF version: Measuring Bowmanship

Viiic: More than One; Less than Infinity

Rating Scale and Partial Credit models and the twain shall meet

For many testing situations, simple zero-one scoring is not enough and Poisson-type counts are too much. Polytomous Rasch models (PRM) cover the middle ground between one and infinity and allow scored responses from zero to a maximum of some small integer m. The integer scores must be ordered in the obvious way so that responding in category k implies more of the trait than responding in category k-1. While the scores must be consecutive integers, there is no requirement that the categories be equally spaced; that is something we can estimate just like ordinary item difficulties.

Once we admit the possibility of unequal spacing of categories, we almost immediately run into the issue, Can the thresholds (i.e., boundaries between categories) be disordered? To harken back to the baseball discussion, a four-base hit counts for more than a three-base hit, but four-base hits are three or four times more frequent than three-base hits. This raises an important question about whether we are observing the same aspect with three- and four-base hits, or with underused categories in general; we’ll come back to it.

To continue the archery metaphor, we now have a number, call it m, of concentric circles rather than just a single bull’s-eye with more points given for hitting within smaller circles. The case of m=1 is the dichotomous model and m→infinity is the Poisson, both of which can be derived as limiting cases of almost any of the models that follow. The Poisson might apply in archery if scoring were based on the distance from the center rather than which one of a few circles was hit; distance from the center (in, say, millimeters) is the same as an infinite number of rings, if you can read your ruler that precisely.
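As a sketch of what scoring by rings looks like, here is one common way to write the category probabilities, the partial credit form, in which each threshold gets its own location. The threshold values are hypothetical and the function name pcm.prob is mine; the full treatment is in the linked post.

```r
# Probability of each score 0..m for ability b and thresholds tau[1..m]
pcm.prob <- function (b, tau) {
      # cumulative sums of (b - tau_j); a score of 0 has an empty sum of zero
      num <- exp(c(0, cumsum(b - tau)))
      num / sum(num)
}

tau <- c(-1.0, 0.0, 1.5)        # hypothetical thresholds for m = 3 rings
p <- pcm.prob(0.5, tau)
names(p) <- 0:3
print(round(p, 3))

# With a single threshold, this is the dichotomous model again
print(pcm.prob(0.5, 0)[2])
print(1 / (1 + exp(0 - 0.5)))
```

The last two lines print the same value, which is the m=1 case mentioned above.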

Read on . . . Polytomous Rasch Models

Viiib. Linear Logistic Test Model and the Poisson Model


Rather than treating each balloon as a unique “fixed effect” and estimating a difficulty specific to it, there may be other types of effects for which it is more effective and certainly more parsimonious to represent the difficulty as a composite, i.e., linear combination of more basic factors like size, distance, drafts. With estimates of the relevant effects in hand, we would have a good sense of the difficulty of any target we might face in the future. This is the idea behind Fischer’s (1973) Linear Logistic Test Model (LLTM), which dominates the Viennese school and has been almost totally absent in Chicago.


Rasch (1960) started with the Poisson model circa 1950 with his original problem in reading remediation, for seconds needed to read a passage or for errors made in the process. Andrich (1973) used it for errors in written essays. It could also be appropriate for points scored in almost any game. The Poisson can be viewed as a limiting case of the binomial (see Wright, 2003 and Andrich, 1988) where the probability of any particular error becomes so small (i.e., bn – di so large and positive) that the di and the probabilities are all essentially equal.
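The limiting argument is easy to check numerically. This sketch holds the expected number of errors fixed (a hypothetical value of 2) while the number of opportunities for error grows and the chance of any one error shrinks, then compares the binomial and Poisson probabilities.

```r
lambda <- 2                 # hypothetical expected number of errors
k <- 0:8                    # error counts to compare
n <- c(10, 100, 1000)       # opportunities for error

# Largest difference between binomial(n, lambda/n) and Poisson(lambda) over k
gap <- sapply(n, function (n.i)
      max(abs(dbinom(k, n.i, lambda / n.i) - dpois(k, lambda))))

print(cbind(n, max.gap = signif(gap, 2)))
```

The largest discrepancy falls steadily as n grows, which is all “limiting case” means here.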

Read more . . . More Models Details


All models are wrong. Some are useful. G.E.P.Box

Models must be used but must never be believed. Martin Bradbury Wilk

The Basic Ideas and polytomous items

We have thus far occupied ourselves entirely with the basic, familiar form of the Rasch model. I justify this fixation in two ways. First, it is the simplest and the form that is most used and second, it contains the kernel (bn – di) for pretty much everything else. It is the mathematical equivalent of a person throwing a dart at a balloon. Scoring is very simple; either you hit it or you don’t and they know if you did or not. The likelihood of the person hitting the target depends only on the skill of the person and the “elusiveness” of the target. If there is one The Rasch Model, this is it.

Continue reading . . . More Models

Vii: Significant Relationships in the Life of a Psychometrician

Rules of Thumb, Shortcuts, Loose Ends, and Other Off-Topic Topics:

Unless you can prove your approximation is as good as my exact solution, I am not interested in your approximation. R. Daryl Bock[1]

Unless you can show me your exact solution is better than my approximation, I am not interested in your exact solution. Benjamin D. Wright[2]

Rule of Thumb Estimates for Rasch Standard Errors

The asymptotic standard error for Marginal Maximum Likelihood estimates of the Rasch difficulty d or ability b parameters is:

Continue: Rules of Thumb, Short Cuts, Loose Ends


[1] I first applied to the University of Chicago because Prof. Bock was there.

[2] There was a reason I ended up working with Prof. Wright.

VIc. Measuring and Monitoring Growth

The things taught in schools and colleges are not an education but the means of an education. Ralph Waldo Emerson.

There is no such thing as measurement absolute; there is only measurement relative. Jeanette Winterson.

#GrowthModels and longitudinal scales

We dream about measuring cognitive status so effectively that we can monitor progress over the student’s career as confidently as we monitor changes in height, weight, and time for the 100-meters. We’re not there yet but we aren’t where we were. Partly because of Rasch. Celsius and Fahrenheit probably did not decide in their youth that their mission in life was to build thermometers; when they wanted to understand something about heat, they needed measures to do that. Educators don’t do assessment because they want to build tests; they build tests because they need measures to do assessment.

Historically, we have tried to build longitudinal scales by linking together a series of grade-level tests. I’ve tried to do it myself; sometimes I claimed to have succeeded. The big publishers often go us one better by building “bridge” forms that cover three or four grades in one fell swoop. The process requires finding, e.g., third grade items that can be given effectively to fourth graders and fourth grade items that can be given to third graders, and onward and upward. We immediately run into problems with opportunity to learn for topics that haven’t been presented and opportunity to forget with topics that haven’t been reinforced. We often aren’t sure if we are even measuring the same aspect in adjacent grades.

Given the challenges of building longitudinal scales, perhaps we should ponder our original motivation for them. For purposes of this treatise, the following assertions will be taken as axiomatic.

  1. Educational growth implies additional capability to do increasingly complex tasks.
  2. Content standards that are tightly bound to grade-level instruction can be important building blocks and diagnostically useful, but they are not the goal of education.
  3. Any agency will put resources into areas where it is accountable and every agency should be accountable for areas it can affect.
  4. Status Model questions that Standards-based assessment was conceived to answer are about school accountability and better lesson plans, e.g., Did the students finishing third grade have what they need to succeed in fourth grade; if not, what tools were they lacking?
  5. Improvement Model questions were added as annual grade-level data began to pile up in the superintendent’s office and are asking about the system’s improvement, e.g., Are the third graders this year better equipped than the third graders last year?
  6. Growth Model questions are personal, e.g., Is this individual (enough) better at solving complex tasks now than last year, or last month, or last week?

Continue . . . Longitudinal Scales


VIb: Equating Multiple Links

As always, objectivity is specific to the threats eliminated.

Linking, Equating, and Bank Building

If we can equate two forms, we can equate multiple forms with multiple interconnections. We can use the same form-to-form analysis to proceed one link at a time until eventually the entire network is equated. Any redundancies can be used to monitor and control the process. For example, linking form A to form C should give the same result as linking form A to form B to form C. Or, alternatively, linking A to B to C to A should bring us back to where we started and result in a zero shift, within statistical limits. What goes up, must come down, or conversely. Perhaps inconveniently, perhaps usefully, multiple links will be inconsistent. This is either a problem for recognizing truth or an opportunity to gain understanding.

There is a straightforward least squares path to resolving any inconsistencies due to random noise.
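One way to sketch the least squares idea, not necessarily the procedure in the linked post: treat each observed link as an estimate of the difference between two form locations and solve for the locations that fit all links best. The three shifts below are hypothetical, chosen so that the loop A to B to C to A fails to close by 0.06 logits; form A is anchored at zero.

```r
# Hypothetical observed shifts; each link estimates location(to) - location(from)
links <- data.frame(from  = c("A", "B", "C"),
                    to    = c("B", "C", "A"),
                    shift = c(0.50, 0.30, -0.74),
                    stringsAsFactors = FALSE)
print(sum(links$shift))          # closure error around the loop; ideally zero

# Design matrix: -1 for the form a link leaves, +1 for the form it reaches
forms <- c("A", "B", "C")
X <- matrix(0, nrow(links), length(forms), dimnames = list(NULL, forms))
for (i in seq_len(nrow(links))) {
      X[i, links$from[i]] <- -1
      X[i, links$to[i]]   <-  1
}

# Anchor form A at zero and solve the overdetermined system by least squares
T.hat <- c(0, qr.solve(X[, -1], links$shift))
names(T.hat) <- forms
adjusted <- as.vector(X %*% T.hat)

print(round(T.hat, 3))
print(round(cbind(observed = links$shift, adjusted = adjusted), 3))
```

The adjusted shifts sum to zero around the loop. With a single loop of three links, least squares just spreads the closure error evenly; with more forms and more redundancy, the same setup shows which links are carrying the disagreement.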

Continue . . . Multiple Link Forms


VI. Linking and Equating: Getting from A to B

To link, then to equate

It’s not possible to equate if you didn’t bother to link. In my language, to link means to physically connect; to equate means to do the arithmetic ensuring interchangeable logit scores. Equating is just another version of controlling the model so we could use everything we have just learned about controlling ourselves. But since we are looking for a different kind of answer, it is worth treating as its own topic.

Unleashing the full power of Rasch measurement means identifying, perhaps conceiving an important aspect, defining a useful construct, and calibrating a pool of relevant items that measure it over a meaningful range. So far we have concerned ourselves with processing isolated bunches of items. In this world, linking and equating item sets has been treated as a distinct and unique phase in the process from conception to measurement. With the technology available, this has typically been the most convenient and efficient approach, and may continue to be so.

In the new world, post fixed-form, paper-based instruments, which are more and more passé, building a calibrated pool can be an inherent and natural part of the process and not a separate step. Calibration procedures allow us to combine individual level test data across administrations, perhaps years apart, to check if specific objectivity holds across time or distance. This is just another between-groups comparison, which gives more opportunities for control and investigation of the process.

Continue . . . Linking and Equating


Vd. Measuring, Diagnosing, and Perhaps Understanding Objects

Measurement when the data fit; diagnosis when it doesn’t

Our purpose when undertaking this venture was not to explain data or even to build better instruments. It may not seem like it based on the discussion so far but our objective is to say something useful about the objects. Although the person and item parameters have equal status in our symmetrical model, they aren’t equally important in our minds. The items are the agents of measurement; the person is the object.

Continue . . . Vc. Diagnosis with and from the Model



Vb. Mean Squares; Outfit and Infit

A question can only be valid if the students’ minds are doing the things we want them to show us they can do. Alastair Pollitt

Able people should pass easy items; unable people should fail difficult ones. Everything else is up for grabs.

One can liken progress along a latent trait to navigating a river; we can treat it as a straight line but the pilot had best remember sandbars and meanders.

More about what could go wrong and how to find it

However one validates the items, with a plethora of sliced and diced matrices, between-group analyses based on gender, ethnicity, SES, age, instruction, etc., followed by enough editing, tweaking, revising, and discarding to ensure a perfectly functioning item bank and to placate any Technical Advisory Committee, there is no guarantee that the next kid to sit down in front of the computer won’t bring something completely unanticipated to the process. After the items have all been “validated,” we still must validate the measure for every new examinee.

The residual analysis that we are working our way toward is a natural approach to validating any item and any person. But we should know what we are looking for before we get lost in the swamps of arithmetic. First, we need to make sure that we haven’t done something stupid, like score the responses against the wrong key or post the results to the wrong record.

Checking the scoring for an examinee is no different than checking for miskeyed items but with less data; either would have both surprising misses and surprising passes in the response string. Having gotten past that minefield, we can then check for differences by item type, content, or sequence, to note just the easy ones. Then, depending on what we discover, we proceed with doing the science either with the results of the measurement process or with the anomalies from the measurement process.

Continue . . . Model Control ala Panchapakesan



V. Control of Rasch’s Models: Beyond Sufficient Statistics

 No single fit statistic is either necessary or sufficient.  David Andrich

You won’t get famous by inventing the perfect fit statistic. Benjamin Wright[1]

That’s funny or when the model reveals something we didn’t know

You say goodness of fit; Rasch said control. The important distinction in the words is that, for the measure, once you have extracted, through the sufficient statistics, all the information in the data relevant to measuring the aspect you are after, you shouldn’t care what or how much gets left in the trash. Whatever it is, it doesn’t contribute to the measurement … directly. It’s of no more than passing interest to us how well the estimated parameters reproduce the observed data, but very much our concern that we have all the relevant information and nothing but the relevant information for our task. Control, not goodness of fit, is the emphasis.

Rasch, very emphatically, did not mean that you run your data through some fashionable software package to calculate its estimates of parameters for a one-item-parameter IRT model and call it Rasch. Going beyond the sufficient statistics and parameter estimates to validate the model’s requirements is where the control is; that’s how one establishes Specific Objectivity. If it holds, then we have a pretty good idea what the residuals will look like. They are governed by the binomial variance pvi(1-pvi) and they should be just noise, with no patterns related to person ability or item difficulty, nor to gender, format, culture, type, sequence, or any of the other factors we keep harping on (but not restricted to the ones that have occurred to me this morning) as potential threats. If the residuals do look like pvi(1-pvi), then we are on reasonably solid ground for believing Specific Objectivity does obtain but even that’s not good enough.
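A minimal simulation of what “just noise” looks like, assuming data that really do follow the model and, for simplicity, using the generating parameters rather than estimates. The abilities, difficulties, and seed are all hypothetical.

```r
set.seed(1)                                   # arbitrary seed, for repeatability
b <- seq(-2, 2, length.out = 500)             # hypothetical person abilities
d <- seq(-2, 2, 0.5)                          # hypothetical item difficulties

p <- plogis(outer(b, d, "-"))                 # p_vi for person v and item i
x <- matrix(rbinom(length(p), 1, p), nrow = nrow(p))   # simulated responses

# Standardized residuals: (observed - expected) / sqrt(p_vi * (1 - p_vi))
z <- (x - p) / sqrt(p * (1 - p))

print(round(colMeans(z^2), 2))                # mean square for each item
print(round(mean(z^2), 2))                    # overall mean square
print(round(cor(rowMeans(z), b), 2))          # residuals against ability
```

When the model holds, the mean squares hover around one and the residuals show no relation to ability. With real data the interesting part is where they don’t, and slicing the same residuals by gender, format, or sequence is the control the paragraph describes.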

It does not matter if there are other models out there that can “explain” a particular data set “better”, in the rather barren statistical sense of explain meaning they have smaller mean residual deviates. Rasch recognized that models can exist on three planes in increasing order of usefulness[2]:

  1. Models that explain the data,
  2. Models that predict the future, and
  3. Models that reveal something we didn’t know about the world.

Models that only try to maximize goodness of fit are stuck at the first level and are perfectly happy fitting something other than the aspect you want. This mind-set is better suited to trying to explain the stock market, weather, or Oscar winners and to generate statements like “The stock market goes up when hemlines go up.” Past performance does not ensure future performance. They try to go beyond the information in the sufficient statistics, using anything in the data that might have been correlated and, to appropriate a comment by Rasch, correlation coefficients are population dependent and therefore scientifically rather uninteresting.

Models that satisfy Rasch’s principle of Specific Objectivity have reached the second level and we can begin real science, possibly at the third level. Control of the models often points directly toward the third level, when the agents or objects didn’t interact the way we intended or anticipated[3]. “The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny.’” (Isaac Asimov.)

Continue reading . . . Model Control ala Choppin

[1] I chose to believe Ben’s comment reflected his attitude toward hypothesis testing, not his assessment of my prospects, although in that sense, it was prophetic.

[2] Paraphrasing E. D. Ford.

[3] “In the best designed experiments, the rats will do as they damn well please.” (Murphy’s Law of Experimental Psychology.)


IVb. The Point is Measurement

To measure the person, not the test

In spite of most of what has been said up to this point, we did not undertake this project with the hope of building better thermometers. The point is to measure the person. Because of the complete symmetry of the model, everything we have done for items, we can do again for people just by reversing the subscripts. For any two people who took some of the same items, count the number N12 that person 2 answered correctly and person 1 missed; also the number N21 that person 1 passed and person 2 missed. The relative abilities of the people will parallel expressions 23 and 25:
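Expressions 23 and 25 live in the linked post, so the following is only a sketch of the counting, assuming the comparison takes the usual paired-comparison form in which the log of the ratio of the two counts estimates the difference in ability. The response strings are hypothetical.

```r
# Hypothetical scored responses (1 = pass, 0 = miss) on the same ten items
person1 <- c(1, 1, 1, 0, 1, 1, 0, 1, 1, 0)
person2 <- c(1, 0, 1, 0, 0, 1, 1, 0, 0, 0)

N12 <- sum(person2 == 1 & person1 == 0)   # person 2 passed, person 1 missed
N21 <- sum(person1 == 1 & person2 == 0)   # person 1 passed, person 2 missed

# Items both passed or both missed say nothing about who is more able
diff.hat <- log(N21 / N12)                # estimate of b1 - b2 in logits
print(c(N12 = N12, N21 = N21, b1.minus.b2 = round(diff.hat, 2)))
```

Only the items on which the two people disagree enter the comparison, and the item difficulties never appear, which is the symmetry with the item calibration.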

Continue reading . . .The Point is Measurement

