Ability estimates, perfect scores, and standard errors

The philosophical musings in most of my postings have kept me entertained, but eventually we need to connect models to data if they are going to be of any use at all. There are plenty of software packages out there that will do a lot of arithmetic for you, but it is never clear exactly what someone else’s black box is actually doing. This is sort of a DIY black box.

The dichotomous case is almost trivial. Once we have estimates of the item’s difficulty *d* and the person’s ability *b*, the probability of the person succeeding on the item is *p = B / (B + D)*, where *B = exp(b)* and *D = exp(d)*. If you have a calibrated item bank (i.e., a bunch of items with estimated difficulties neatly filed in a cloud, flash drive, LAN, or box of index cards), you can estimate the ability of any person tested from the bank by finding the value of *b* that makes the observed score equal the expected score, i.e., solves the equation *r = ∑p*, where *r* is the person’s number correct score and *p* was just defined.

If you are more concrete than that, here is a little *R* code that will do the arithmetic, although it is neither particularly efficient nor totally safe. A responsible coder would do some error trapping to ensure that *r* is in the range *1* to *L-1* (where *L* is the length of *d*) and that the *d*s are in logits and centered at zero. Rasch estimation and the R interpreter are robust enough that you and your computer will probably survive those transgressions.

#Block 1: Routine to compute logit ability for number correct r given d

Able <- function (r, d, stop=0.01) { # r is raw score; d is vector of logit difficulties

b <- log (r / (length (d)-r)) # Initialize

repeat {

adjust <- (r - sum(P(b,d))) / sum(PQ (P(b,d)))

b <- b + adjust

if (abs(adjust) < stop) return (b)

} }

P <- function (b, d) (1 / (1+exp (d-b))) # computationally convenient form for probability

PQ <- function (p) (p-p^2) # p(1-p); summed over items it is the information, i.e., minus the 2nd derivative of the log-likelihood
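As a quick sanity check on Block 1, the sketch below confirms that the convenient form in P matches *B / (B + D)* and that the ability Able returns reproduces the raw score it was given. The three routines are repeated, with only formatting changes, so the block runs on its own. The difficulty vector `ds`, the raw score `r`, the trial ability `b.try`, and the cross-check against base R's `uniroot` are my own illustrative choices, not part of the original routines.

```r
# Routines from Block 1, repeated so this sketch runs on its own
P  <- function (b, d) 1 / (1 + exp(d - b))
PQ <- function (p) p - p^2
Able <- function (r, d, stop = 0.01) {
  b <- log(r / (length(d) - r))
  repeat {
    adjust <- (r - sum(P(b, d))) / sum(PQ(P(b, d)))
    b <- b + adjust
    if (abs(adjust) < stop) return(b)
  }
}

ds <- seq(-2, 2, 0.25)   # illustrative difficulties (an assumption)
r  <- 5                  # illustrative raw score (an assumption)

# 1. B / (B + D) and 1 / (1 + exp(d - b)) should be the same probability
b.try  <- 0.7
same.p <- all.equal(exp(b.try) / (exp(b.try) + exp(ds)), P(b.try, ds))

# 2. At the estimate, the expected score should match the observed score
b.hat    <- Able(r, ds, stop = 1e-8)
expected <- sum(P(b.hat, ds))

# 3. Cross-check against a general-purpose root finder
b.root <- uniroot(function (b) sum(P(b, ds)) - r,
                  interval = c(-10, 10), tol = 1e-10)$root

print(same.p)
print(c(observed = r, expected = expected))
print(c(newton = b.hat, uniroot = b.root))
```

If the two ability estimates disagree by more than the convergence criterion, something is wrong with the difficulties or the score, not with the arithmetic.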

If you would like to try it, copy the text between the lines above into an *R* window and then define the *d*s somehow and type in, say, **Able(r=1, d=ds)** or else copy the commands between the lines below to make it do something. Most of the following is just housekeeping; all you really need is the command **Able(r,d)** if *r* and *d* have been defined. If you don’t have *R* installed on your computer, following the link to *LLTM* in the menu on the right will take you to an *R* site that has a “*Get R*” option.

In the world of *R*, the hash tag marks a comment, so anything that follows it on the same line is ignored. This is roughly equivalent to other uses of hash tags and *R* had it first.

#Block 2: Test ability routines

Test.Able <- function (low, high, inc) {

#Create a vector of logit difficulties to play with,

d = seq(low, high, inc)

# overriding the default convergence criterion of 0.01 with 0.0001

print ("Ability r=1:")

print (Able(r=1, d=d, stop=0.0001))

# To get all the abilities from 1 to L-1, first create a spot to receive results

b = NA

# Then compute the abilities; default convergence = 0.01

for (r in 1:(length(d)-1) )

b[r] = Able (r, d)

# Show what we got

print ("Ability r=1 to L-1:")

print(round(b,3))

}

Test.Able (-2,2,0.25)
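The error trapping that a responsible coder would add can be sketched as a thin wrapper around Able. The name `Able.checked`, its `tol` argument, and the exact checks are hypothetical choices of mine; the limits follow what the text asks for, except that fractional scores strictly between 0 and *L* are allowed. Block 1 is repeated so the sketch runs on its own.

```r
# Routines from Block 1, repeated so this sketch runs on its own
P  <- function (b, d) 1 / (1 + exp(d - b))
PQ <- function (p) p - p^2
Able <- function (r, d, stop = 0.01) {
  b <- log(r / (length(d) - r))
  repeat {
    adjust <- (r - sum(P(b, d))) / sum(PQ(P(b, d)))
    b <- b + adjust
    if (abs(adjust) < stop) return(b)
  }
}

# Hypothetical wrapper (not in the original): validate inputs, then call Able
Able.checked <- function (r, d, tol = 0.01) {
  L <- length(d)
  if (!is.numeric(d) || L < 2 || anyNA(d))
    stop("d must be a numeric vector of at least two difficulties, with no NAs")
  if (!is.numeric(r) || length(r) != 1 || is.na(r) || r <= 0 || r >= L)
    stop("r must be a single score strictly between 0 and length(d)")
  if (abs(mean(d)) > 1e-8)
    warning("difficulties are not centered at zero")
  Able(r, d, stop = tol)
}

d   <- seq(-2, 2, 0.25)
ok  <- Able.checked(1, d)
bad <- tryCatch(Able.checked(0, d), error = function (e) conditionMessage(e))

# All abilities from 1 to L-1 in one call, instead of the explicit loop
b.all <- sapply(1:(length(d) - 1), Able.checked, d = d)

print(ok)
print(bad)
print(round(b.all, 3))
```

The wrapper turns a zero score into a readable message instead of whatever Able would do on its own, and `sapply` is just a more compact spelling of the loop in Block 2.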

I would be violating some sort of sacred oath if I were to leave this topic without the standard errors of measurement (*sem*); we have everything we need for them. For a quick, average-of-sorts *sem*, useful for planning and test design, we have the Wright-Douglas approximation: *sem = 2.5/√L*, where *L* is the number of items on the test. Wright & Stone (1979, p 135) provide another semi-shortcut based on *height*, *width*, and *length*, where height is the proportion correct, width is the range of difficulties, and length is the number of items. Or to extricate the *sem* for almost any score from the logit ability table, *sem_{r} = √[(b_{r+1} - b_{r-1})/2]*. Or if you want to do it right, *sem_{r} = 1 / √[∑p_{r}(1-p_{r})]*, where the sum is over the items and each *p_{r}* is evaluated at *b_{r}*.

Of course, I have some *R* code. Let me know if it doesn’t work.

#Block 3: Standard Errors and a few shortcuts

# Wright-Douglas ‘typical’ sem

wd.sem <- function (k) (2.5/sqrt(k))

#

# Wright-Stone from Mead-Ryan

SEMbyHWL <- function (H=0.5,W=4,L=1) {

C2 <- NA

W <- ifelse(W>0,W,.001)

for (k in 1:length(H)) C2[k] <-W*(1-exp(-W))/((1-exp(-H[k]*W))*(1-exp(-(1-H[k])*W)))

return (sqrt( C2 / L))

}

# SEM from logit ability table

bToSem <- function (r1, r2, b) { s <- NA

for (r in r1:r2) s[r] <- (sqrt((b[r+1]-b[r-1])/2))

return (s)

}

# Full blown SEM

sem <- function (b, d) { s <- NA

for (r in 1:length(b)) s[r] <- 1 / sqrt(sum(PQ(P(b[r],d))))

return (s)

}
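The shortcut in bToSem is not magic. Differentiating the estimation equation *r = ∑p* with respect to *b* gives the sum of *p(1-p)*, which is the reciprocal of the squared *sem*, and the central difference over neighbouring raw scores approximates the inverse of that slope. A sketch of the argument, where the item index *i* is notation I have added:

```latex
\begin{align*}
r &= \sum_{i=1}^{L} p_i(b), \qquad p_i(b) = \frac{e^{b}}{e^{b} + e^{d_i}} \\
\frac{dr}{db} &= \sum_{i=1}^{L} p_i (1 - p_i) = \frac{1}{\mathit{sem}_r^{2}} \\
\mathit{sem}_r^{2} &= \frac{db}{dr} \approx \frac{b_{r+1} - b_{r-1}}{(r+1) - (r-1)} = \frac{b_{r+1} - b_{r-1}}{2}
\end{align*}
```

The approximation needs a neighbour on each side, which is why the table version cannot produce a value for the lowest and highest scores in the table.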

To get the SEMs from all four approaches, all you really need are the four lines after “*Now we’re ready*” below. The rest is start up and reporting.

#Block 4: Try out Standard Error procedures

Test.SEM <- function (d) {

# First, a little setup (assuming Able is still loaded.)

L = length (d)

W = max(d) - min(d)

H = seq(L-1)/L

# Then compute the abilities; default convergence = 0.01

b = NA

for (r in 1:(L-1))

b[r] = Able (r, d)

# Now we’re ready

s.wd = wd.sem (length(d))

s.HWL = SEMbyHWL (H,W,L)

s.from.b = bToSem (2,L-2,b) # ignore raw score 1 and L-1 for the moment

s = sem(b,d)

# Show what we got

print ("Height")

print(H)

print ("Width")

print(W)

print ("Length")

print(L)

print ("Wright-Douglas typical SEM:")

print (round(s.wd,2))

print ("HWL SEM r=1 to L-1:")

print (round(s.HWL,3))

print ("SEM r=2 to L-2 from Ability table:")

print (round(c(s.from.b,NA),3))

print ("MLE SEM r=1 to L-1:")

print (round(s,3))

plot(b,s,xlim=c(-4,4),ylim=c(0.0,1),col="red",type="l",xlab="Logit Ability",ylab="Standard Error")

points(b,s.HWL,col="green",type="l")

points(b[-(L-1)],s.from.b,col="blue",type="l")

abline(h=s.wd,lty=3)

}

Test.SEM (seq(-3,3,0.25))

Among other sweeping assumptions, the Wright-Douglas approximation for the standard error assumes a “typical” test with items piled up near the center. What we have been generating with *d=seq(-3,3,0.25)* are items uniformly distributed over the interval. While this is effective for fixed-form group-testing situations, it is not a good design for measuring any individual. The wider the interval, the more of the items will be off-target for any given person. The reason for bringing this up here is that Wright-Douglas will underestimate the typical standard error for a wide, uniform test. Playing with the Test.SEM command will make this painfully clear.
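One way to play with this without reading plots is to tabulate the MLE standard errors for uniform tests of different widths against the Wright-Douglas value. The helper `sem.by.width`, the choice of 25 items, and the list of widths are hypothetical choices for illustration; the routines from Block 1 are repeated so the sketch runs on its own.

```r
# Routines from Block 1, repeated so this sketch runs on its own
P  <- function (b, d) 1 / (1 + exp(d - b))
PQ <- function (p) p - p^2
Able <- function (r, d, stop = 0.01) {
  b <- log(r / (length(d) - r))
  repeat {
    adjust <- (r - sum(P(b, d))) / sum(PQ(P(b, d)))
    b <- b + adjust
    if (abs(adjust) < stop) return(b)
  }
}
wd.sem <- function (k) 2.5 / sqrt(k)   # Wright-Douglas 'typical' sem

# Hypothetical helper: smallest and median MLE sem over raw scores 1 to L-1
# for L items spread uniformly over an interval of the given width
sem.by.width <- function (width, L = 25) {
  d <- seq(-width / 2, width / 2, length.out = L)
  b <- sapply(1:(L - 1), Able, d = d)
  s <- sapply(b, function (x) 1 / sqrt(sum(PQ(P(x, d)))))
  c(width = width, min.sem = min(s), median.sem = median(s))
}

widths <- c(1, 2, 4, 6, 8)             # illustrative widths in logits
result <- t(sapply(widths, sem.by.width))

print(round(result, 3))
print(round(wd.sem(25), 3))
```

For the width of 6 used in the examples above, even the smallest standard error on the test should come out above the Wright-Douglas figure, while a very narrow test should come in below it.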

The Wright-Stone HWL approach, which preceded Wright-Douglas, was also intended for test design: determining how many items are needed and how they should be distributed. It suggested that the *best test design* is a uniform distribution of item difficulties, which may have been true in 1979 when there were no practicable alternatives to paper-based tests. The approach boils down to an expression of the form *SEM = C / √L*, where *C* is a rather messy function of *H* and *W*. The real innovation in HWL was the recognition that test length *L* could be separated from the other parameters. In hindsight, realizing that the standard error of measurement has the square root of test length in the denominator doesn’t seem that insightful.
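To see how messy *C* is or isn't, the sketch below pulls it out of the SEMbyHWL expression in Block 3 and evaluates it for a few widths at *H = 0.5*. The function name `hwl.C`, the widths, and the test length of 25 are illustrative assumptions of mine. As a check on the algebra, when the width shrinks toward zero *C* should approach *1/√[H(1-H)]*, which is what you get when every item has the same probability *H*.

```r
# C from the HWL expression used in SEMbyHWL (Block 3), with L factored out.
# hwl.C is a hypothetical name; the formula is copied from the original code.
hwl.C <- function (H, W) {
  sqrt(W * (1 - exp(-W)) / ((1 - exp(-H * W)) * (1 - exp(-(1 - H) * W))))
}

L <- 25                                 # illustrative test length
for (W in c(0.001, 2, 4, 6)) {          # illustrative widths in logits
  C <- hwl.C(H = 0.5, W = W)
  print(round(c(width = W, C = C, sem = C / sqrt(L)), 3))
}

# Limit check: as W goes to zero, C should approach 1 / sqrt(H * (1 - H))
print(c(limit = 1 / sqrt(0.5 * 0.5), narrow = hwl.C(0.5, 0.001)))
```

Changing `L` rescales every *sem* by the same factor and leaves *C* alone, which is the separation the paragraph above describes.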

We also need to do something intelligent or at least defensible about the zero and perfect scores. We can’t really estimate them because no finite ability is high enough for a perfect number correct, or low enough for zero, to make either *L = ∑p* or *0 = ∑p* true. This reflects the true state of affairs; we don’t know how high or how low perfect and zero performances really are, but sometimes we need to manufacture something to report.
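A small numerical illustration of why the perfect score has no estimate: however high the ability, the expected score stays short of *L*. The ability grid `b.grid` is an arbitrary choice of mine, and the difficulties are the ones used in the examples above. The shortfall is computed as the sum of *1 - p* rather than *L - ∑p* to keep some precision when the numbers get tiny.

```r
P <- function (b, d) 1 / (1 + exp(d - b))   # probability of success, as in Block 1

d <- seq(-3, 3, 0.25)
L <- length(d)

b.grid <- c(4, 6, 8, 10, 15, 20)            # arbitrary, increasingly high abilities

# How far the expected score falls short of a perfect L at each ability
shortfall <- sapply(b.grid, function (b) sum(1 - P(b, d)))

print(data.frame(b = b.grid, shortfall = signif(shortfall, 3)))
```

The shortfall keeps shrinking but never reaches zero, so the equation *L = ∑p* has no finite solution; by symmetry the same holds for a score of zero at the low end.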

Because the *sem*s for *1* and *L-1* are typically a little greater than one, in logits, we could move the ability estimates for *1* and *L-1* outward by *1.2* or so to stand in for *0* and *L*; the appropriate value gets smaller as the test gets longer. Or we could estimate the abilities for something close to *0* and *L*, say, *0.25* and *L-0.25*. Or you can get slightly less extreme values using *0.33* or *0.5*, or more extreme using *0.1*.

For the example we have been playing with, here’s how much difference it does or doesn’t make. The first entry in the table below abandons the pseudo-rational arguments and says the square of something a little greater than one is 1.2 and that works about as well as anything else. This simplicity has never been popular with technical advisors or consultants. The second line moves out one standard error squared from the abilities for scores of one and one less than perfect. The last three lines estimate the ability for something “close” to zero or perfect. *Close* is defined as *0.33* or *0.25* or *0.10* raw score points. Once the blanks for zero and perfect are filled in, we can proceed with computing a standard error for them using the standard routines and then reporting measures as though we had complete confidence.

| Method | Shift | Zero | Perfect |
|---|---|---|---|
| Constant | 1.20 | -5.58 | 5.58 |
| SE shift | One sem² | -5.51 | 5.51 |
| Shift | 0.33 | -5.57 | 5.57 |
| Shift | 0.25 | -5.86 | 5.86 |
| Shift | 0.10 | -6.80 | 6.80 |

#Block 5: Abilities for zero and perfect: A last bit of code to play with the extreme scores and what to do about it.

Test.0100 <- function (shift) {

d = seq(-3,3,0.25)

b = NA

for (r in 1:(length(d)-1) ) b[r] = Able (r, d)

# Adjust by something a little greater than one squared

b0 = b[1]-shift[1]

bL = b[length(d)-1]+shift[1]

print(c("Constant shift",shift[1],round(b0, 2),round(bL, 2)))

# y-axis is the raw score 0 to L; b0 and bL stand in for zero and perfect

plot(c(b0,b,bL),0:length(d),xlim=c(-6.5,6.5),type="b",xlab="Logit Ability",ylab="Number Correct",col="blue")

# Adjust by one standard error squared

s = sem(b,d)

b0 = b[1]-s[1]^2

bL = b[length(d)-1]+s[length(d)-1]^2

print(c("SE shift",round(b0, 2),round(bL, 2)))

points (c(b0,b,bL),0:length(d),col="red",type="b")

# Estimate ability for something "close" to zero or perfect

for (x in shift[-1]) {

b0 = Able(x,d) # if you try Able(0,d) you will get an inscrutable error.

bL = Able(length(d)-x,d)

print( c("Shift",x,round(b0, 2),round(bL, 2)))

points (c(b0,b,bL),0:length(d),type="b")

} }

Test.0100 (c(1.2, 0.33, 0.25, 0.10)) # shift values taken from the table above
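Once an ability has been manufactured for zero and perfect, the standard error comes from the same formula as for any other score. The sketch below does this for the fractional-score approach with a shift of 0.25, which is just one of the values discussed above; the names `b.ext` and `s.ext` are my own, and Block 1 is repeated so the sketch runs on its own.

```r
# Routines from Block 1, repeated so this sketch runs on its own
P  <- function (b, d) 1 / (1 + exp(d - b))
PQ <- function (p) p - p^2
Able <- function (r, d, stop = 0.01) {
  b <- log(r / (length(d) - r))
  repeat {
    adjust <- (r - sum(P(b, d))) / sum(PQ(P(b, d)))
    b <- b + adjust
    if (abs(adjust) < stop) return(b)
  }
}

d <- seq(-3, 3, 0.25)
L <- length(d)
x <- 0.25    # illustrative "close to zero" score (an assumption)

# Abilities for the stand-ins for zero and perfect, next to their neighbours
b.ext <- c(zero            = Able(x, d),
           one             = Able(1, d),
           next.to.perfect = Able(L - 1, d),
           perfect         = Able(L - x, d))

# Same sem formula as for any other score
s.ext <- sapply(b.ext, function (b) 1 / sqrt(sum(PQ(P(b, d)))))

print(round(rbind(ability = b.ext, sem = s.ext), 2))
```

The standard errors for the stand-ins come out larger than those for their neighbours, which is at least an honest signal of how little the test says about such performances.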