Viib. Using R to do a little work

Ability estimates, perfect scores, and standard errors

The philosophical musing of most of my postings has kept me entertained, but eventually we need to connect models to data if they are going to be of any use at all. There are plenty of software packages out there that will do a lot of arithmetic for you but it is never clear exactly what someone else’s black box is actually doing. This is sort of a DIY black box.

The dichotomous case is almost trivial. Once we have estimates of the item’s difficulty d and the person’s ability b, the probability of the person succeeding on the item is p = B / (B + D), where B = exp(b) and D = exp(d). If you have a calibrated item bank (i.e., a bunch of items with estimated difficulties neatly filed in a cloud, flash drive, LAN, or box of index cards), you can estimate the ability of any person tested from the bank by finding the value of b that makes the observed score equal the expected score, i.e., solves the equation r = ∑p, where r is the person’s number correct score and p was just defined.
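One standard way to solve r = ∑p is Newton's method, which is what the adjustment step in the Able routine amounts to. As a sketch, with the index i running over the items and the superscript (t) counting iterations (both are notation added here, not taken from the post):

```latex
p_i = \frac{e^{b}}{e^{b} + e^{d_i}} = \frac{1}{1 + e^{d_i - b}},
\qquad
\frac{\partial p_i}{\partial b} = p_i \,(1 - p_i)

b^{(t+1)} = b^{(t)} + \frac{r - \sum_i p_i}{\sum_i p_i \,(1 - p_i)}
```

The numerator is how far the observed score is from the expected score; the denominator is the slope of the expected score with respect to b. Iteration stops when the adjustment is smaller than the convergence criterion.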

If you are more concrete than that, here is a little R-code that will do the arithmetic, although it’s neither particularly efficient nor totally safe. A responsible coder would do some error trapping to ensure that r is in the range 1 to L-1 (where L = length of d) and that the ds are in logits and centered at zero. Rasch estimation and the R interpreter are robust enough that you and your computer will probably survive those transgressions.

#Block 1: Routine to compute logit ability for number correct r given d
Able <- function (r, d, stop=0.01) { # r is raw score; d is vector of logit difficulties
   b <- log (r / (length (d)-r))    # Initialize
   repeat {
         adjust <- (r - sum(P(b,d))) / sum(PQ (P(b,d)))   # Newton-Raphson step
         b <- b + adjust
         if (abs(adjust) < stop) return (b)
   }
}
P <- function (b, d) (1 / (1+exp (d-b))) # computationally convenient form for probability
PQ <- function (p) (p-p^2)                     # p(1-p), the summand of the (negative) 2nd derivative

If you would like to try it, copy the text between the lines above into an R-window and then define the ds somehow and type in, say, Able(r=1, d=ds) or else copy the commands between the lines below to make it do something. Most of the following is just housekeeping; all you really need is the command Able(r,d) if r and d have been defined. If you don’t have R installed on your computer, following the link to LLTM in the menu on the right will take you to an R site that has a “Get R” option.

In the world of R, the hash tag marks a comment so anything that follows is ignored. This is roughly equivalent to other uses of hash tags and R had it first.

#Block 2: Test ability routines
Test.Able <- function (low, high, inc) {
# Create a vector of logit difficulties to play with
   d = seq(low, high, inc)

# The ability for a raw score of 1,
# overriding the default convergence criterion of 0.01 with 0.0001
   print ("Ability r=1:")
   print (Able(r=1, d=d, stop=0.0001))
# To get all the abilities from 1 to L-1,
# first create a spot to receive results
   b = NA
# Then compute the abilities; default convergence = 0.01
   for (r in 1:(length(d)-1) )
        b[r] = Able (r, d)
# Show what we got
   print ("Ability r=1 to L-1:")
   print (round(b,2))
}
Test.Able (-2,2,0.25)
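If you would like a second opinion on Able, the same equation r = ∑p can be handed to base R's root finder, uniroot. This is only a cross-check sketched under my own assumptions: p.rasch and able.root are illustrative names, and the search interval of -10 to 10 logits is an arbitrary choice that should be wide enough for tests like the ones here.

```r
# Cross-check: solve r = sum(p) with base R's uniroot() instead of Newton steps.
# p.rasch and able.root are illustrative names, not part of the blocks above.
p.rasch <- function (b, d) 1 / (1 + exp(d - b))

able.root <- function (r, d, lim = 10) {
   # expected score minus observed score crosses zero at the estimate
   f <- function (b) sum(p.rasch(b, d)) - r
   uniroot(f, interval = c(-lim, lim), tol = 1e-8)$root
}

d <- seq(-2, 2, 0.25)
b1 <- able.root(1, d)
print(round(b1, 3))
print(round(sum(p.rasch(b1, d)), 3))   # expected score at the solution
```

The last line prints the expected score at the solution, which should reproduce the raw score of 1 to within rounding.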

I would be violating some sort of sacred oath if I were to leave this topic without the standard errors of measurement (sem); we have everything we need for them. For a quick average sem, of sorts, useful for planning and test design, we have the Wright-Douglas approximation: sem = 2.5/√L, where L is the number of items on the test. Wright & Stone (1979, p. 135) provide another semi-shortcut based on height, width, and length, where height is the percent correct, width is the range of difficulties, and length is the number of items. Or to extricate the sem for almost any score r from the logit ability table, sem_r = √[(b_{r+1} - b_{r-1})/2]. Or if you want to do it right, sem_r = 1 / √[∑p_r(1 - p_r)], where the p_r are the item probabilities evaluated at b_r.
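The reason the ability table can stand in for the full calculation is worth one line of calculus. Treating the raw score as a smooth function of ability (an approximation, since r only takes integer values):

```latex
\frac{dr}{db} = \sum_i p_i\,(1 - p_i) = \frac{1}{\mathrm{sem}_r^{2}}
\quad\Longrightarrow\quad
\frac{b_{r+1} - b_{r-1}}{2} \approx \frac{db}{dr} = \mathrm{sem}_r^{2}
```

So half the logit distance spanned by two raw score points is roughly the squared standard error, and its square root is the sem.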

Of course, I have some R-code. Let me know if it doesn’t work.

#Block 3: Standard Errors and a few shortcuts
# Wright-Douglas 'typical' sem
wd.sem <- function (k) (2.5/sqrt(k))
# Wright-Stone from Mead-Ryan
SEMbyHWL <- function (H=0.5,W=4,L=1) {
     C2 <- NA
     W <- ifelse(W>0,W,.001)
     for (k in 1:length(H))
            C2[k] <-W*(1-exp(-W))/((1-exp(-H[k]*W))*(1-exp(-(1-H[k])*W)))
     return (sqrt( C2 / L))
}
# SEM from logit ability table
bToSem <- function (r1, r2, b) {
     s  <- NA
     for (r in r1:r2)
           s[r] <- (sqrt((b[r+1]-b[r-1])/2))
     return (s)
}
# Full blown SEM
sem <- function (b, d) {
     s <-  NA
     for (r in 1:length(b))
          s[r] <- 1 / sqrt(sum(PQ(P(b[r],d))))
     return (s)
}

To get the SEMs from all four approaches, all you really need are the four lines after “Now we’re ready” below. The rest is startup and reporting.


#Block 4: Try out Standard Error procedures
Test.SEM <- function (d) {
# First, a little setup (assuming Able is still loaded.)
        L = length (d)
        W = max(d) - min(d)
        H = seq(L-1)/L
# Then compute the abilities; default convergence = 0.01
      b = NA
      for (r in 1:(L-1))
            b[r] = Able (r, d)
# Now we're ready
       s.wd = wd.sem (length(d))
       s.HWL = SEMbyHWL (H,W,L)
       s.from.b = bToSem (2,L-2,b) # ignore raw score 1 and L-1 for the moment
       s = sem(b,d)
# Show what we got
     print ("Height")
        print (round(H,2))
     print ("Width")
        print (W)
     print ("Length")
        print (L)
    print ("Wright-Douglas typical SEM:")
        print (round(s.wd,2))
    print ("HWL SEM r=1 to L-1:")
        print (round(s.HWL,3))
    print ("SEM r=2 to L-2 from Ability table:")
       print (round(c(s.from.b,NA),3))
    print ("MLE SEM r=1 to L-1:")
      print (round(s,3))
   plot(b,s,xlim=c(-4,4),ylim=c(0.0,1),col="red",type="l",xlab="Logit Ability",ylab="Standard Error")
}
Test.SEM (seq(-3,3,0.25))

Among other sweeping assumptions, the Wright-Douglas approximation for the standard error assumes a “typical” test with items piled up near the center. What we have been generating with d=seq(-3,3,0.25) are items uniformly distributed over the interval. While this is effective for fixed-form group-testing situations, it is not a good design for measuring any individual. The wider the interval, the more off-target the test will be. The reason for bringing this up here is that Wright & Douglas will underestimate the typical standard error for a wide, uniform test. Playing with the Test.SEM command will make this painfully clear.
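A quick way to see the effect of width is to hold the test length fixed and stretch the items out. The following is a minimal, self-contained sketch rather than one of the routines above: sem.at is a made-up helper, the test length of 25 and the widths are arbitrary, and the exact SEM is evaluated only at the center of the test (b = 0).

```r
# Compare the Wright-Douglas shortcut 2.5/sqrt(L) with the exact SEM at the
# center of a uniform test (b = 0) as the test gets wider.
# sem.at is an illustrative helper, not one of the routines above.
sem.at <- function (b, d) {
   p <- 1 / (1 + exp(d - b))
   1 / sqrt(sum(p * (1 - p)))
}

L <- 25
for (W in c(2, 4, 6, 8)) {
   d <- seq(-W/2, W/2, length.out = L)     # L items spread uniformly over width W
   cat("Width", W, " exact SEM at center:", round(sem.at(0, d), 3),
       " Wright-Douglas:", round(2.5 / sqrt(L), 3), "\n")
}
```

The shortcut does not depend on width at all, while the exact value at the center grows as the items spread out. Away from the center of a uniform test, the exact value is larger still.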

The Wright-Stone HWL approach, which preceded Wright-Douglas, is also intended for test design, determining how many items are needed and how they should be distributed. It suggested the best test design is a uniform distribution of item difficulties, which may have been true in 1979 when there were no practicable alternatives to paper-based tests. The approach boils down to an expression of the form SEM =  C / √L, where C is a rather messy function of H and W. The real innovation in HWL was the recognition that test length L could be separated from the other parameters. In hindsight, realizing that the standard error of measurement has the square root of test length in the denominator doesn’t seem that insightful.
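For the record, the messy function, read directly off the SEMbyHWL routine in Block 3 (with H as the proportion correct and W the width in logits), is:

```latex
\mathrm{SEM} = \frac{C}{\sqrt{L}},
\qquad
C^{2} = \frac{W\left(1 - e^{-W}\right)}
             {\left(1 - e^{-HW}\right)\left(1 - e^{-(1-H)W}\right)}

\lim_{W \to 0} C^{2} = \frac{1}{H(1-H)},
\quad \text{so } H = 0.5 \text{ gives } C = 2
```

The limit is a sanity check: when all the items sit at the person's ability, every p is 0.5, every p(1-p) is 0.25, and the exact formula also gives 2/√L.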

We also need to do something intelligent or at least defensible about the zero and perfect scores. We can’t really estimate them because there are no abilities high enough for a perfect number correct or low enough for zero to make either L = ∑p or 0 = ∑p true. This reflects the true state of affairs; we don’t know how high or how low perfect and zero performances really are but sometimes we need to manufacture something to report.

Because the sem for 1 and L-1 are typically a little greater than one, in logits, we could adjust the ability estimates for 1 and L-1 by 1.2 or so; the appropriate value gets smaller as the test gets longer. Or we could estimate the abilities for something close to 0 and L, say, 0.25 and L-0.25. Or you can get slightly less extreme values using 0.33 or 0.5, or more extreme using 0.1.

For the example we have been playing with, here’s how much difference it does or doesn’t make. The first entry in the table below abandons the pseudo-rational arguments and says the square of something a little greater than one is 1.2 and that works about as well as anything else. This simplicity has never been popular with technical advisors or consultants. The second line moves out one standard error squared from the abilities for one and one less than perfect. The last three lines estimate the ability for something “close” to zero or perfect. Close is defined as 0.33 or 0.25 or 0.10 raw score points. Once the blanks for zero and perfect are filled in, we can proceed with computing a standard error for them using the standard routines and then reporting measures as though we had complete confidence.

Method      Shift    Zero     Perfect
Constant    1.20     -5.58    5.58
SE shift    One      -5.51    5.51
Shift       0.33     -5.57    5.57
Shift       0.25     -5.86    5.86
Shift       0.10     -6.80    6.80

#Block 5: Abilities for zero and perfect: A last bit of code to play with the extreme scores and what to do about them.
Test.0100 <- function (shift) {
      d = seq(-3,3,0.25)
      b = NA
      for (r in 1:(length(d)-1) ) b[r] = Able (r, d)
# Adjust by something a little greater than one squared
      b0 = b[1]-shift[1]
      bL = b[length(d)-1]+shift[1]
      print(c("Constant shift",shift[1],round(b0, 2),round(bL, 2)))
      # number correct runs from 0 to L, one value for each of b0, b, bL
      plot(c(b0,b,bL),0:length(d),xlim=c(-7,7),type="b",xlab="Logit Ability",ylab="Number Correct",col="blue")
# Adjust by one standard error squared
      s = sem(b,d)
      b0 = b[1]-s[1]^2
      bL = b[length(d)-1]+s[length(d)-1]^2
      print(c("SE shift",round(b0, 2),round(bL, 2)))
      points (c(b0,b,bL),0:length(d),col="red",type="b")
# Estimate ability for something "close" to zero or perfect
      for (x in shift[-1]) {
           b0 = Able(x,d)                         # if you try Able(0,d) you will get an inscrutable error.
           bL = Able(length(d)-x,d)
           print( c("Shift",x,round(b0, 2),round(bL, 2)))
           points (c(b0,b,bL),0:length(d),type="b")
      }
}

Test.0100 (c(1.2,.33,.25,.1))

The basic issue is not statistics; it’s policy for how much the powers that be want to punish or reward zero or perfect. But, if you really want to do the right thing, don’t give tests so far off target.


Viiid: Measuring Bowmanship

Archery as an example of decomposing item difficulty and validating the construct

The practical definition of the aspect is the tasks we use to provoke the person into providing evidence. Items that are hard to get right, tasks that are difficult to perform, statements that are distasteful, targets that are hard to hit will define the high end of the scale; easy items, simple tasks, or popular statements will define the low end. The order must be consistent with what would be expected from the theory that guided the design of the instrument in the first place. Topaz is always harder than quartz regardless of how either is measured. If not, the items may be inappropriate or the theory wrong[1]. The structure that the model provides should guide the content experts through the analysis, with a little help from their friends.

Table 5 shows the results of a hypothetical archery competition. The eight targets are described in the center panel. It is convenient to set the difficulty of the base target (i.e., largest bull’s-eye, shortest distance and level range) to zero. The scale is a completely arbitrary choice; we could multiply by 9/5 and add 32, if that seemed more convenient or marketable. The most difficult target was the smallest bull’s-eye, longest distance, and swinging. Any other outcome would have raised serious questions about the validity of the competition or the data.

Table 5 Definition of Bowmanship

The relative difficulties of the basic components of target difficulty are just to the right of the numeric logit scale: a moving target added 0.5 logits to the base difficulty; moving the target from 30 m. to 90 m. added 1.0 logits; and reducing the diameter of the bull’s-eye from 122 cm to 60 cm added 2.0 logits.
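As a small illustration of this decomposition, the eight target difficulties can be rebuilt from the three component effects quoted above. The sketch assumes the effects are strictly additive with no interactions; the column names, the helper p.hit, and the two archer abilities (a quarter logit apart) are made up for the example.

```r
# Build the eight target difficulties from the three component effects quoted
# in the text (moving +0.5, 90 m +1.0, small bull's-eye +2.0 logits).
# Additivity without interactions is an assumption of this sketch.
targets <- expand.grid(moving = 0:1, far = 0:1, small = 0:1)
effects <- c(moving = 0.5, far = 1.0, small = 2.0)
targets$difficulty <- as.vector(as.matrix(targets) %*% effects)

# Probability of a hit for two hypothetical archers, a quarter logit apart
p.hit <- function (b, d) 1 / (1 + exp(d - b))
targets$archer.A <- round(p.hit(1.25, targets$difficulty), 2)
targets$archer.B <- round(p.hit(1.00, targets$difficulty), 2)
print(targets[order(targets$difficulty), ])
```

The base target sits at zero and the most difficult target comes out at 0.5 + 1.0 + 2.0 = 3.5 logits above it.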

The role of specific objectivity in this discussion is subtle but crucial. We have arranged the targets according to our estimated scale locations and are now debating among ourselves if the scale locations are consistent with what we believe we know about bowmanship. We are talking about the scale locations of the targets, period, not about the scale locations of the targets for knights or pages, for long bows or crossbows, for William Tell or Robin Hood. And we now know that William Tell is about a quarter logit better than Robin Hood, but maybe we should take the difference between a long bow and a crossbow into consideration.

While it may be interesting to measure and compare the bowmanship of any and all of these variations and we may use different selections of targets for each, those potential applications do not change the manner in which we define bowmanship. The knights and the pages may differ dramatically in their ability to hit targets and in the probabilities that they hit any given target, but the targets must maintain the same relationships, within statistical limits, or we do not know as much about bowmanship as we thought.

The symmetry of the model allows us to express the measures of the archers in the same metric as the targets. Thus, after a competition that might have used different targets for different archers, we would still know who won, we would know how much better Robin Hood is than the Sheriff, and we would know what each is expected to do and not do. We could place both on the bowmanship continuum and make defendable statements about what kinds of targets they could or could not hit.

[1] A startling new discovery, like quartz scratching topaz, usually means that the data are miscoded.


Viiic: More than One; Less than Infinity

Rating Scale and Partial Credit models and the twain shall meet

For many testing situations, simple zero-one scoring is not enough and Poisson-type counts are too much. Polytomous Rasch models (PRM) cover the middle ground between one and infinity and allow scored responses from zero to a maximum of some small integer m. The integer scores must be ordered in the obvious way so that responding in category k implies more of the trait than responding in category k-1. While the scores must be consecutive integers, there is no requirement that the categories be equally spaced; that is something we can estimate just like ordinary item difficulties.

Once we admit the possibility of unequal spacing of categories, we almost immediately run into the issue, Can the thresholds (i.e., boundaries between categories) be disordered? To harken back to the baseball discussion, a four-base hit counts for more than a three-base hit, but four-base hits are three or four times more frequent than three-base hits. This raises an important question about whether we are observing the same aspect with three- and four-base hits, or with underused categories in general; we’ll come back to it.

To continue the archery metaphor, we now have a number, call it m, of concentric circles rather than just a single bull’s-eye with more points given for hitting within smaller circles. The case of m=1 is the dichotomous model and m→infinity is the Poisson, both of which can be derived as limiting cases of almost any of the models that follow. The Poisson might apply in archery if scoring were based on the distance from the center rather than which one of a few circles was hit; distance from the center (in, say, millimeters) is the same as an infinite number of rings, if you can read your ruler that precisely.
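To make the concentric-circle picture concrete, here is a sketch of category probabilities for a single item scored 0 to m, written in the usual partial credit form. That form is my assumption about what is meant by the models that follow; pcm.prob is an illustrative name and the threshold values are made up.

```r
# Category probabilities for one polytomous item with thresholds t[1..m],
# in the partial credit form. pcm.prob is an illustrative name; the threshold
# values below are made up.
pcm.prob <- function (b, t) {
   num <- exp(c(0, cumsum(b - t)))   # score 0 has an empty sum, i.e., exp(0)
   num / sum(num)                    # probabilities for scores 0, 1, ..., m
}

print(round(pcm.prob(b = 0.5, t = c(-1, 0, 1.5)), 3))   # m = 3, ordered thresholds
print(round(pcm.prob(b = 0.5, t = c(-1, 1.5, 0)), 3))   # disordered thresholds

# With m = 1 this reduces to the dichotomous form 1 / (1 + exp(d - b))
print(pcm.prob(b = 0.5, t = 0.2)[2])
print(1 / (1 + exp(0.2 - 0.5)))
```

Disordered thresholds still produce a legitimate set of probabilities that sum to one; what they change is how often the middle categories are the most likely response. The last two lines should print the same value.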


Viiib. Linear Logistic Test Model and the Poisson Model


Rather than treating each balloon as a unique “fixed effect” and estimating a difficulty specific to it, there may be other types of effects for which it is more effective and certainly more parsimonious to represent the difficulty as a composite, i.e., linear combination of more basic factors like size, distance, drafts. With estimates of the relevant effects in hand, we would have a good sense of the difficulty of any target we might face in the future. This is the idea behind Fischer’s (1973) Linear Logistic Test Model (LLTM), which dominates the Viennese school and has been almost totally absent in Chicago.
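In symbols, the composite can be sketched as follows. The notation is the conventional one for the LLTM and is an addition here, not something taken from this post: q_ij is the known weight of basic factor j in item i, η_j is the estimated effect of that factor, and c is a constant that fixes the origin of the scale.

```latex
d_i = \sum_{j} q_{ij}\,\eta_j + c
```

With a handful of η's estimated, the difficulty of a target that has never been used can be predicted from its q's alone.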


Rasch (1960) started with the Poisson model circa 1950 with his original problem in reading remediation, for seconds needed to read a passage or for errors made in the process. Andrich (1973) used it for errors in written essays. It could also be appropriate for points scored in almost any game. The Poisson can be viewed as a limiting case of the binomial (see Wright, 2003 and Andrich, 1988) where the probability of any particular error becomes small (i.e., b_n - d_i large positively) enough that the d_i and the probabilities are all essentially equal.
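The limiting argument is easy to see numerically with base R's dbinom and dpois. The numbers below are purely illustrative: 1000 opportunities for an error, each with the same small probability.

```r
# Binomial with many opportunities and a small error probability versus the
# Poisson with the same expected count. The numbers are illustrative.
n.words <- 1000          # opportunities for an error
p.error <- 0.003         # small, equal probability of each error
x <- 0:10
binom <- dbinom(x, size = n.words, prob = p.error)
pois  <- dpois(x, lambda = n.words * p.error)
print(round(cbind(errors = x, binomial = binom, poisson = pois), 4))
print(max(abs(binom - pois)))   # largest discrepancy over the counts shown
```

The two columns should agree closely, and the agreement improves as the number of opportunities grows and the error probability shrinks with their product held fixed.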



All models are wrong. Some are useful. G.E.P.Box

Models must be used but must never be believed. Martin Bradbury Wilk

The Basic Ideas and polytomous items

We have thus far occupied ourselves entirely with the basic, familiar form of the Rasch model. I justify this fixation in two ways. First, it is the simplest and the form that is most used and second, it contains the kernel (b_n - d_i) for pretty much everything else. It is the mathematical equivalent of a person throwing a dart at a balloon. Scoring is very simple; either you hit it or you don’t and they know if you did or not. The likelihood of the person hitting the target depends only on the skill of the person and the “elusiveness” of the target. If there is one The Rasch Model, this is it.


Vii: Significant Relationships in the Life of a Psychometrician

Rules of Thumb, Shortcuts, Loose Ends, and Other Off-Topic Topics:

Unless you can prove your approximation is as good as my exact solution, I am not interested in your approximation. R. Daryl Bock[1]

Unless you can show me your exact solution is better than my approximation, I am not interested in your exact solution. Benjamin D. Wright[2]

Rule of Thumb Estimates for Rasch Standard Errors

The asymptotic standard error for Marginal Maximum Likelihood estimates of the Rasch difficulty d or ability b parameters is:

Continue: Rules of Thumb, Short Cuts, Loose Ends


[1] I first applied to the University of Chicago because Prof. Bock was there.

[2] There was a reason I ended up working with Prof. Wright.

VIc. Measuring and Monitoring Growth

The things taught in schools and colleges are not an education but the means of an education. Ralph Waldo Emerson.

There is no such thing as measurement absolute; there is only measurement relative. Jeanette Winterson.

#GrowthModels and longitudinal scales

We dream about measuring cognitive status so effectively that we can monitor progress over the student’s career as confidently as we monitor changes in height, weight, and time for the 100-meters. We’re not there yet but we aren’t where we were. Partly because of Rasch. Celsius and Fahrenheit probably did not decide in their youth that their mission in life was to build thermometers; when they wanted to understand something about heat, they needed measures to do that. Educators don’t do assessment because they want to build tests; they build tests because they need measures to do assessment.

Historically, we have tried to build longitudinal scales by linking together a series of grade-level tests. I’ve tried to do it myself; sometimes I claimed to have succeeded. The big publishers often go us one better by building “bridge” forms that cover three or four grades in one fell swoop. The process requires finding, e.g., third grade items that can be given effectively to fourth graders and fourth grade items that can be given to third graders, and onward and upward. We immediately run into problems with opportunity to learn for topics that haven’t been presented and opportunity to forget with topics that haven’t been reinforced. We often aren’t sure if we are even measuring the same aspect in adjacent grades.

Given the challenges of building longitudinal scales, perhaps we should ponder our original motivation for them. For purposes of this treatise, the following assertions will be taken as axiomatic.

  1. Educational growth implies additional capability to do increasingly complex tasks.
  2. Content standards that are tightly bound to grade-level instruction can be important building blocks and diagnostically useful, but they are not the goal of education.
  3. Any agency will put resources into areas where it is accountable and every agency should be accountable for areas it can affect.
  4. Status Model questions that Standards-based assessment was conceived to answer are about school accountability and better lesson plans, e.g., Did the students finishing third grade have what they need to succeed in fourth grade; if not, what tools were they lacking?
  5. Improvement Model questions were added as annual grade-level data began to pile up in the superintendent’s office and are asking about the system’s improvement, e.g., Are the third graders this year better equipped than the third graders last year?
  6. Growth Model questions are personal, e.g., Is this individual (enough) better at solving complex tasks now than last year, or last month, or last week?

