Comparative Judgement (CJ) is slowly gaining momentum in the context of educational assessment. Data from CJ are typically analysed with the BTL model, using a frequentist approach. The increasing computational power of computers has made Bayesian inference more accessible, resulting in the rise of Bayesian model estimation. In this post we demonstrate how the package `pcFactorStan` can be applied to analyse CJ data within a Bayesian framework, using data from CJ judgements on argumentative writing coming from the D-PAC project.

*Comparative Judgement* (CJ) is increasingly used within the context of educational assessment. CJ is seen as a reliable and valid alternative for judging student work. In contrast to the use of rubrics or criteria lists, it assesses the quality of student work in a relative rather than an absolute way. CJ builds on the idea that people are more reliable at comparing things than at scoring or rating them (an idea introduced into psychometrics by Thurstone 1927).

In CJ judges are presented with two pieces of student work and asked which of the two is better in the light of the skill being measured. Multiple judges each make multiple comparisons, resulting in data that can be used to estimate the latent worth of each piece of student work. Several scholarly articles are dedicated to the merits of CJ for educational applications (e.g. Lesterhuis et al. 2018; Van Daal et al. 2019, 2017; Verhavert et al. 2019). To run a CJ in an educational setting a dedicated tool like Comproved (Comproved, n.d.) can be used.

The data resulting from a CJ are analysed with the Bradley-Terry-Luce (BTL) model, which can be formally written as

\[ p(x_{ij}=1|v_i,v_j)= \frac{e^{(v_j-v_i)}}{1+e^{(v_j-v_i)}} \]

with \(x_{ij}=1\) if piece \(j\) wins the comparison against piece \(i\). The parameters \(v_i\) and \(v_j\) are the latent worths of the two pieces of student work. A higher latent worth means that a specific piece of student work is better in the light of the skill being assessed.
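The formula above translates into a one-line R function; a minimal sketch (the function name `btl_prob` is ours, for illustration only):

```r
# Win probability under the BTL model: P(piece j beats piece i)
# given their latent worths v_i and v_j
btl_prob <- function(v_i, v_j) {
  exp(v_j - v_i) / (1 + exp(v_j - v_i))
}

btl_prob(0, 0)    # equal worths -> 0.5
btl_prob(0, 1.5)  # j clearly better -> ~0.82
```

Note the symmetry: `btl_prob(v_i, v_j)` and `btl_prob(v_j, v_i)` always sum to 1, as a win for one piece is a loss for the other.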

To estimate the BTL model, dedicated packages are available for R users that facilitate a frequentist estimation of the model, such as `BradleyTerry2` (see Turner and Firth 2012). In a frequentist analysis we get a point estimate for each parameter and we make inferences based on ‘hypothetical data’ (see Lambert 2018). For each student’s piece of work in the assessment we get a maximum likelihood (ML) estimate expressing its latent worth, together with an accompanying confidence interval around that estimate. These confidence intervals express how our parameter estimates are assumed to vary from sample to sample of similar data drawn from the population. So, imagine we get an ML estimate for a certain piece of work of 2 with a standard error of .5; then we can infer that in 95% of the hypothetical samples of data that we draw, we would get a parameter estimate somewhere between roughly 1 (\(\approx 2-1.96 \times .5\)) and 3 (\(\approx 2+1.96 \times .5\)).
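The interval arithmetic from this example in a couple of lines of R:

```r
# The frequentist example from the text: ML estimate 2, standard error .5
est <- 2
se  <- 0.5

# 95% confidence interval: estimate +/- 1.96 * SE
ci_95 <- c(lower = est - 1.96 * se, upper = est + 1.96 * se)
ci_95
#> lower upper
#>  1.02  2.98
```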

Bayesian inferences, on the other hand, are based on a posterior probability density distribution (further referred to as *the posterior*) of the parameters, given the data collected. This allows for a more intuitive interpretation of parameter uncertainty. Within the framework of CJ a Bayesian analysis results in a posterior for the worth of each student’s piece of work. Consequently we can derive probabilities for each possible parameter value. So, for a given piece of work we can derive the probability that its latent worth lies within a certain range, given the data used to estimate our model. Bayesian analysts have different ways of summarizing the posterior to make inferences. Two popular summaries are the *Credible Interval* (CI) and the *Highest Density Interval* (HDI) (Lambert 2018; Kruschke 2018). The former expresses an interval within which a parameter falls with a given probability. So, if we get a 90% credible interval going from 1.2 to 2.8, this means we are 90% sure the parameter is somewhere in this interval. This interpretation of parameter uncertainty is actually the way many researchers mistakenly interpret confidence intervals in a frequentist analysis. Some authors also call this CI an ETI (*Equal Tailed Interval*), referring to the fact that the two values defining the range are, for a 90% CI, the 5th and the 95th percentiles. An HDI is an alternative way of expressing the uncertainty we learn from the posterior: it is the interval containing the x% most probable estimates. For instance, if we get a 90% HDI for the worth of a student’s piece of work ranging from 1.4 to 2.9, then we can interpret it as: the 90% most probable estimates lie within this range. If the posterior is roughly symmetric, a CI and an HDI will be similar. When the posterior is not symmetric, however, the two intervals can differ, each casting a different light on the estimates. A quick example to illustrate: imagine the following skewed posterior probability density function:

```
# Packages used: bayestestR (distribution_gamma, ci, estimate_density),
# ggplot2 for plotting, and dplyr for the pipe
library(bayestestR)
library(ggplot2)
library(dplyr)

# Generate a gamma distribution (which is skewed)
posterior <- distribution_gamma(1000, 1.5, 2)

# Compute the HDI and the quantile-based (equal-tailed) CI
ci_hdi <- ci(posterior, method = "HDI")
ci_eti <- ci(posterior, method = "ETI")

# Plot the distribution and add the limits of the two intervals
posterior %>%
  estimate_density(extend = TRUE) %>%
  ggplot(aes(x = x, y = y)) +
  geom_area(fill = "#E6AB02") +
  # HDI in green
  geom_vline(xintercept = ci_hdi$CI_low, color = "#66A61E", size = 2) +
  geom_vline(xintercept = ci_hdi$CI_high, color = "#66A61E", size = 2) +
  # ETI in purple
  geom_vline(xintercept = ci_eti$CI_low, color = "#7570B3", size = 2) +
  geom_vline(xintercept = ci_eti$CI_high, color = "#7570B3", size = 2) +
  scale_y_continuous("posterior probability density") +
  scale_x_continuous("possible parameter values") +
  ggtitle("A skewed posterior with an 89% HDI (green lines) \n and an 89% CI (aka ETI) (purple lines)")
```

We have drawn two sets of lines: green and purple. The green lines mark the range of the HDI. Looking at the tails outside the HDI we see that the proportion on the right side is larger than on the left side. The central idea of an HDI is that it places its limits such that no value outside the range has a similar or higher probability density than the values within the range. The purple lines are the limits of the equal-tailed CI. Here we see that certain values outside the range on the left side actually have a higher probability density than some values inside the range (near the upper limit).

Back to our CJ data and the use of Bayesian estimation within that context. One of the crucial factors explaining the growth in usage of Bayesian analyses is the exponential increase in computational power. This power is necessary to apply a crucial ‘toolkit’ in Bayesian analysis: Markov chain Monte Carlo (MCMC) sampling methods to estimate the posterior (Bürkner 2020). The powerful open-source software Stan (Stan Development Team 2020b) makes it possible to apply *Hamiltonian Monte Carlo* sampling in an efficient way (see Lambert 2018 for an intuitive introduction to different MCMC sampling algorithms). To interface with `Stan` we can make use of different statistical programming languages. In this article we interface with R (R Core Team 2020), making use of the package `rstan` (Stan Development Team 2020a). More specifically, we will use the package `pcFactorStan` (Pritikin 2020), which was recently released to estimate a more general *factor analysis* model of multidimensional pairwise comparison data. CJ data can be seen as a restricted case of the more general model presented by Pritikin (2020), which extends the model with two features: integrating comparisons on multiple dimensions, resulting in multiple latent worth parameters per piece of student work, and allowing more nuanced judgements with multiple ordinal categories. In the discussion of this post we will briefly address the opportunities this more general model creates for CJ in the educational assessment context. Here we restrict ourselves to demonstrating how this package can be used to apply MCMC sampling to the analysis of CJ data.

To demonstrate how `pcFactorStan` can be applied to model CJ data we rely on a dataset from the D-PAC project (Comproved, n.d.) containing judgements of *argumentative writing texts* of 5th graders in Flemish secondary education. More information on the ins and outs of this dataset can be found in Lesterhuis et al. (2018). More specifically, we use two subsets of the data based on the topics of the texts judged: ‘getting children’ (56 judges, 135 texts, 1229 judgements) and ‘organ donation’ (52 judges, 136 texts, 918 judgements).

In this section we demonstrate how to analyse the example dataset with `pcFactorStan`. Four steps are documented: data preparation, model estimation, evaluating model convergence, and making inferences based on the posterior.

As input for `pcFactorStan`’s central function we need to supply data in a specific format. First of all, we have to organise our data at the comparison level, so that each row contains the result of one comparison. Three columns are needed: `pa1`, the identifier of the first piece of student work; `pa2`, the identifier of the other piece of student work; and a column with the result, labelled `Quality` in the example. The resulting dataset should look like this for the `Children` dataset:

```
Children <- DV1 %>%
  filter(assessment == "Schrijfopdracht 1: Kinderen") %>%
  mutate(winner = (selected.position == "left") * 1,
         pa1 = left.representation,
         pa2 = right.representation) %>%
  mutate(Quality = recode(winner, `0` = -1, `1` = 1)) %>%
  select(pa1, pa2, Quality)

Children %>%
  datatable(extensions = 'Buttons',
            options = list(dom = 'Blfrtip',
                           buttons = c('copy', 'csv'),
                           lengthMenu = list(c(5, 25, 50, -1),
                                             c(5, 25, 50, "All"))))
```

and like this for the `Organs` dataset:

```
Organs <- DV1 %>%
  filter(assessment == "Schrijfopdracht 2: Orgaan") %>%
  mutate(winner = (selected.position == "left") * 1,
         pa1 = left.representation,
         pa2 = right.representation) %>%
  mutate(Quality = recode(winner, `0` = -1, `1` = 1)) %>%
  select(pa1, pa2, Quality)

Organs %>%
  datatable(extensions = 'Buttons',
            options = list(dom = 'Blfrtip',
                           buttons = c('copy', 'csv'),
                           lengthMenu = list(c(5, 25, 50, -1),
                                             c(5, 25, 50, "All"))))
```

Once datasets containing these three variables have been created, we can use the function `prepData()` from the package `pcFactorStan` to transform each dataset into a list with the necessary data and metadata to pass on to Stan for the estimation. The list will look like this:

```
#> List of 18
#> $ nameInfo :List of 2
#> ..$ pa : chr [1:135] "0404_Kinderen2.pdf" "0215EMT_kinderen_3.pdf" "0911EMT_kinderen_1.pdf" "0816_Kinderen_1.pdf" ...
#> ..$ item: chr "Quality"
#> $ NPA : int 135
#> $ NCMP : int 1229
#> $ N : int 1229
#> $ NITEMS : int 1
#> $ NTHRESH : num 1
#> $ TOFFSET : num 1
#> $ pa1 : int [1:1229] 1 1 1 1 1 1 1 1 1 1 ...
#> $ pa2 : int [1:1229] 2 21 27 46 68 69 75 79 83 84 ...
#> $ weight : int [1:1229] 1 1 1 1 1 1 1 1 1 1 ...
#> $ pick : int [1:1229] 1 1 1 -1 1 1 -1 -1 1 -1 ...
#> $ numRefresh : int 1229
#> $ refresh : int [1:1229] 1 1 1 1 1 1 1 1 1 1 ...
#> $ numOutcome : int [1:1229] 1 1 1 1 1 1 1 1 1 1 ...
#> $ item : int [1:1229] 1 1 1 1 1 1 1 1 1 1 ...
#> $ alphaScalePrior: num 0.2
#> $ corLKJPrior : num 2.5
#> $ propShape : num 3
```
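For completeness, the list above can be produced with a call along these lines (a sketch; `Children` and `Organs` are the data frames created earlier, and `ChildrenL`/`OrgansL` are the object names used in the remainder of this post):

```r
library(pcFactorStan)

# prepData() converts a data frame with pa1, pa2 and an outcome column
# into the list of data and metadata that Stan expects
ChildrenL <- prepData(Children)
OrgansL   <- prepData(Organs)

str(ChildrenL, max.level = 1)  # inspect the resulting list
```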

Once you have created the list you are ready for the estimation.

To estimate the worth parameters of the students’ pieces of work we can use the function `pcStan()` from the `pcFactorStan` package to estimate the `unidim_adapt` model (more technical information in Pritikin 2020). Before we can do this we have to add to the data list a constant needed for a non-centered parametrization (see Pritikin 2020). Following the results of the simulation study presented in Pritikin (2020) we set this constant to \(5.0\) and store it in an element named `varCorrection` in the list we created earlier.

```
ChildrenL$varCorrection <- 5.0
OrgansL$varCorrection <- 5.0
```

Now we can use the very convenient function `pcStan()` to estimate the models for both datasets. By default this function runs 4 chains of 2000 iterations each, with 1000 iterations used as warmup.

```
# Use all available cores to run the chains in parallel
options(mc.cores = parallel::detectCores())

M_Children <- pcStan(model = 'unidim_adapt', data = ChildrenL)
M_Organs   <- pcStan(model = 'unidim_adapt', data = OrgansL)
```

Once the estimation is finished we are ready to check whether the MCMC algorithm converged properly.

The package `rstan` contains different functions to check the convergence of the MCMC sampling. One convenient option is the `check_hmc_diagnostics()` function, which prints convergence diagnostics for the MCMC estimation. But we can also call the individual diagnostics ourselves and put them in a table.

```
n_divergent_Children     <- sum(get_divergent_iterations(M_Children))
n_divergent_Organs       <- sum(get_divergent_iterations(M_Organs))
n_max_treedepth_Children <- sum(get_max_treedepth_iterations(M_Children))
n_max_treedepth_Organs   <- sum(get_max_treedepth_iterations(M_Organs))
# get_low_bfmi_chains() returns the indices of low-BFMI chains,
# so we count them with length()
n_low_energy_Children    <- length(get_low_bfmi_chains(M_Children))
n_low_energy_Organs      <- length(get_low_bfmi_chains(M_Organs))

convergence_table <- data.frame(
  Labels = c("Children", "Organs"),
  n_divergent_iterations = c(n_divergent_Children, n_divergent_Organs),
  n_max_tree_depth = c(n_max_treedepth_Children, n_max_treedepth_Organs),
  n_low_energy = c(n_low_energy_Children, n_low_energy_Organs)
)

knitr::kable(
  convergence_table,
  col.names = c('Assessment', 'n divergent iterations', 'n max tree depth', 'n low energy'),
  align = "lccc",
  caption = "HMC convergence information"
)
```

| Assessment | n divergent iterations | n max tree depth | n low energy |
|---|---|---|---|
| Children | 0 | 0 | 0 |
| Organs | 0 | 0 | 0 |

We can also check whether \(\widehat R < 1.015\) and whether the effective sample size is greater than 100 times the number of chains for each parameter (Vehtari et al. 2021).

```
allParsChildren <- summary(M_Children, probs = c())$summary
n_eff_Children  <- min(allParsChildren[, 'n_eff'])  # lowest n_eff
rhat_Children   <- max(allParsChildren[, 'Rhat'])   # highest Rhat

allParsOrgans <- summary(M_Organs, probs = c())$summary
n_eff_Organs  <- min(allParsOrgans[, 'n_eff'])
rhat_Organs   <- max(allParsOrgans[, 'Rhat'])

N_eff <- c(n_eff_Children, n_eff_Organs)
R_hat <- c(rhat_Children, rhat_Organs)

Fit_table <- data.frame(
  Labels = c("Children", "Organs"),
  round(N_eff, 3),
  round(R_hat, 3)
)

knitr::kable(
  Fit_table,
  col.names = c('Assessment', 'min. number ESS', 'max. $\\widehat R$'),
  align = "lcc",
  caption = "Convergence information for both estimated scales"
)
```

| Assessment | min. number ESS | max. \(\widehat R\) |
|---|---|---|
| Children | 809.546 | 1.004 |
| Organs | 956.053 | 1.001 |

This all looks fine!

Finally we can check the *Scale Separation Reliability* (SSR), a measure used by CJ researchers to express the reliability of the scale of estimated worths (see Bramley 2015; Verhavert et al. 2018).
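In terms of the posterior summaries computed below, with \(\mathrm{SD}_{\text{obs}}\) the standard deviation of the estimated worths and \(\mathrm{RMSE}\) the root mean square of the posterior standard deviations, the SSR is obtained as:

\[
\mathrm{SD}_{\text{true}} = \sqrt{\mathrm{SD}_{\text{obs}}^2 - \mathrm{RMSE}^2}, \qquad
G = \frac{\mathrm{SD}_{\text{true}}}{\mathrm{RMSE}}, \qquad
\mathrm{SSR} = \frac{G^2}{1+G^2}
\]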

```
theta_children      <- summary(M_Children, pars = c("theta"), probs = c())$summary
mse_children        <- mean(theta_children[, 'sd']^2)
observedSD_children <- sd(theta_children[, 'mean'])
trueSD_children     <- sqrt(observedSD_children^2 - mse_children)
G_children          <- trueSD_children / sqrt(mse_children)
SSR_Children        <- G_children^2 / (1 + G_children^2)

theta_Organs      <- summary(M_Organs, pars = c("theta"), probs = c())$summary
mse_Organs        <- mean(theta_Organs[, 'sd']^2)
observedSD_Organs <- sd(theta_Organs[, 'mean'])
trueSD_Organs     <- sqrt(observedSD_Organs^2 - mse_Organs)
G_Organs          <- trueSD_Organs / sqrt(mse_Organs)
SSR_Organs        <- G_Organs^2 / (1 + G_Organs^2)

SSR_table <- data.frame(
  Labels = c("Children", "Organs"),
  SSR = c(round(SSR_Children, 3), round(SSR_Organs, 3))
)

knitr::kable(
  SSR_table,
  col.names = c('Assessment', 'SSR'),
  align = "lc",
  caption = "SSR for both estimated scales"
)
```

| Assessment | SSR |
|---|---|
| Children | 0.801 |
| Organs | 0.688 |

These SSRs are in line with those reported by Verhavert et al. (2018) for the same assessments (see Appendix B of Verhavert et al. 2018).

Once we are confident in the estimation of the posterior we can start making inferences based on it. This phase of the analysis offers many options. A simple, though not the most reader-friendly, method is using the `print()` function. In the following we print some information on the posterior for a random selection of 20 texts from the ‘Children’ task. First we create an object named `pick_20` that contains 20 random labels of the form `theta[...]`, where `...` refers to a number. We do this for two reasons: it makes the output more comprehensible compared to printing 135 lines of output, and it mimics a hypothetical situation where you want to make inferences for texts written by 20 students belonging to the same class at school. After creating this identifier of 20 sampled texts we print the results for the accompanying parameters. With `probs = c(.1, .5, .9)` we request the 10th, 50th and 90th percentiles, i.e. the bounds of an 80% credible interval.
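A call along these lines would produce a printout like the one below (a sketch; the seed is hypothetical, and the `probs` argument matches the 10%/50%/90% columns in the output):

```r
set.seed(42)  # hypothetical seed, for reproducibility only

# 20 random parameter labels of the form theta[...]
pick_20 <- paste0("theta[", sample(1:135, 20), "]")

# print the posterior summaries for just those parameters
print(M_Children, pars = pick_20, probs = c(.1, .5, .9))
```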

```
#> Inference for Stan model: unidim_adapt.
#> 4 chains, each with iter=2000; warmup=1000; thin=1;
#> post-warmup draws per chain=1000, total post-warmup draws=4000.
#>
#> mean se_mean sd 10% 50% 90% n_eff Rhat
#> theta[68] -0.23 0.01 0.38 -0.72 -0.23 0.25 4282 1
#> theta[129] -0.61 0.01 0.36 -1.08 -0.62 -0.17 4957 1
#> theta[43] 0.00 0.01 0.35 -0.45 0.01 0.45 4890 1
#> theta[14] -0.33 0.01 0.37 -0.80 -0.34 0.14 4256 1
#> theta[51] -0.17 0.01 0.35 -0.61 -0.16 0.27 4492 1
#> theta[85] -1.04 0.01 0.40 -1.56 -1.04 -0.53 5437 1
#> theta[21] 1.33 0.01 0.46 0.75 1.32 1.94 4994 1
#> theta[106] 0.92 0.01 0.42 0.39 0.91 1.47 4786 1
#> theta[54] 1.16 0.01 0.45 0.61 1.14 1.76 5528 1
#> theta[110] 0.62 0.01 0.38 0.14 0.62 1.12 5068 1
#> theta[74] 1.33 0.01 0.43 0.78 1.32 1.89 4645 1
#> theta[7] -1.73 0.01 0.46 -2.33 -1.71 -1.15 5200 1
#> theta[73] 1.62 0.01 0.47 1.03 1.60 2.24 5621 1
#> theta[79] 0.23 0.01 0.35 -0.22 0.22 0.68 4784 1
#> theta[130] -0.23 0.00 0.36 -0.69 -0.22 0.22 5159 1
#> theta[37] 0.28 0.01 0.36 -0.18 0.28 0.74 4498 1
#> theta[105] 1.73 0.01 0.47 1.15 1.72 2.34 4417 1
#> theta[89] -0.82 0.01 0.40 -1.33 -0.82 -0.33 5275 1
#> theta[126] -0.98 0.01 0.40 -1.49 -0.98 -0.47 5577 1
#> theta[101] 0.13 0.01 0.37 -0.34 0.13 0.62 4471 1
#>
#> Samples were drawn using NUTS(diag_e) at Fri Jan 22 18:14:43 2021.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split chains (at
#> convergence, Rhat=1).
```

An alternative way to make inferences on a parameter based on the posterior is extracting the *Highest Density Interval* (HDI). To do this we use the package `bayestestR`, which contains the function `hdi()`. Before calling this function we do some preparation to obtain only the worth parameters (the thetas). With the function `extract()` from the `rstan` package we extract the draws (i.e. the sampling results). Next, we transform the resulting list into a data frame with only the theta parameters. Before printing the result we want to change the column names of the data frame, because by default they are automatically named `X1` \(\rightarrow\) `X135`. To change these column names we first create a vector named `pick` that contains names like `theta[1]` and use `pick` as input for the column names of our data frame. Now we are ready to call the `hdi()` function.
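The preparation steps just described can be sketched as follows (the objects `M_Children` and `pick_20` come from earlier in the post; the intermediate names are ours, for illustration):

```r
library(rstan)
library(bayestestR)

# extract the posterior draws and keep only the theta parameters
draws <- rstan::extract(M_Children)
Children_theta_parameters <- as.data.frame(draws$theta)

# rename the columns X1...X135 to theta[1]...theta[135]
pick <- paste0("theta[", seq_len(ncol(Children_theta_parameters)), "]")
colnames(Children_theta_parameters) <- pick

# HDI for the 20 sampled texts
hdi(Children_theta_parameters[, pick_20])
```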

```
#> # Highest Density Interval
#>
#> Parameter | 89% HDI
#> ---------------------------
#> theta[68] | [-0.82, 0.38]
#> theta[129] | [-1.22, -0.07]
#> theta[43] | [-0.55, 0.57]
#> theta[14] | [-0.90, 0.27]
#> theta[51] | [-0.72, 0.40]
#> theta[85] | [-1.69, -0.42]
#> theta[21] | [ 0.64, 2.11]
#> theta[106] | [ 0.29, 1.61]
#> theta[54] | [ 0.47, 1.91]
#> theta[110] | [ 0.01, 1.23]
#> theta[74] | [ 0.59, 1.99]
#> theta[7] | [-2.39, -0.92]
#> theta[73] | [ 0.85, 2.34]
#> theta[79] | [-0.34, 0.77]
#> theta[130] | [-0.79, 0.33]
#> theta[37] | [-0.34, 0.81]
#> theta[105] | [ 0.99, 2.48]
#> theta[89] | [-1.42, -0.17]
#> theta[126] | [-1.59, -0.33]
#> theta[101] | [-0.44, 0.74]
```

From the above table we learn, for instance, that the worth parameter for text number 68 is most probably situated somewhere between -0.82 and 0.38. For text number 129 the most probable worth is situated in the range -1.22 to -0.07. So for text 68 this range includes the value 0 (the mean of our worth scale), while for text 129 it does not.

Notice that by comparing this range with the point value 0.0 we do something that resembles *Null Hypothesis Testing* in frequentist analyses. This is somewhat counterintuitive in a Bayesian framework. Therefore John Kruschke, one of the key scholars in Bayesian analysis, introduced the concept of a *Region Of Practical Equivalence* (ROPE) and an accompanying decision rule concerning the overlap of the HDI with the ROPE (Kruschke 2018). For inferences we can calculate the percentage of posterior parameter estimates that falls within the ROPE.
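In terms of raw posterior draws, this ROPE percentage is simply the share of draws inside the interval; a minimal sketch (the function name `rope_share` and the toy draws are ours, for illustration):

```r
# Share of posterior draws that fall inside a ROPE [low, high]
rope_share <- function(draws, low, high) {
  mean(draws >= low & draws <= high)
}

draws <- c(-1, -0.5, 0, 0.5, 2)  # toy draws, for illustration only
rope_share(draws, -0.75, 1.05)   # 3 of 5 draws inside -> 0.6
```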

Let’s apply this concept to our results for the ‘Children’ task using the `model_parameters()` function from the `parameters` package. We also demonstrate how the `export_table()` function from the `insight` package can be called to create a formatted table. Before we do this, we have to decide on the range of the ROPE. One way to define a ROPE is to consult experts and let them place two benchmarks on the scale: one to distinguish between ‘poor’ and ‘mediocre’ texts, and one to distinguish between ‘mediocre’ and ‘strong’ texts. Imagine that this consultation results in -.75 and 1.05 respectively (note that the ROPE need not be a symmetric interval around zero, and indeed here it is not); then we can use these values as input for our call to the `model_parameters()` function.

```
model_parameters(Children_theta_parameters[, pick_20],
                 rope_range = c(-.75, 1.05)) %>%
  export_table(format = "markdown",
               caption = "Theta estimates 20 texts (ROPE limits -.75 & 1.05)")
```

| Parameter | Median | CI_low | CI_high | pd | ROPE_Percentage |
|---|---|---|---|---|---|
| theta[68] | -0.23 | -0.82 | 0.38 | 0.72 | 0.91 |
| theta[129] | -0.62 | -1.22 | -0.07 | 0.95 | 0.65 |
| theta[43] | 9.32e-03 | -0.55 | 0.57 | 0.51 | 0.98 |
| theta[14] | -0.34 | -0.90 | 0.27 | 0.81 | 0.88 |
| theta[51] | -0.16 | -0.72 | 0.40 | 0.68 | 0.95 |
| theta[85] | -1.04 | -1.69 | -0.42 | 1.00 | 0.23 |
| theta[21] | 1.32 | 0.64 | 2.11 | 1.00 | 0.28 |
| theta[106] | 0.91 | 0.29 | 1.61 | 0.99 | 0.63 |
| theta[54] | 1.14 | 0.47 | 1.91 | 1.00 | 0.41 |
| theta[110] | 0.62 | 0.01 | 1.23 | 0.95 | 0.86 |
| theta[74] | 1.32 | 0.59 | 1.99 | 1.00 | 0.26 |
| theta[7] | -1.71 | -2.39 | -0.92 | 1.00 | 0.01 |
| theta[73] | 1.60 | 0.85 | 2.34 | 1.00 | 0.11 |
| theta[79] | 0.22 | -0.34 | 0.77 | 0.74 | 0.99 |
| theta[130] | -0.22 | -0.79 | 0.33 | 0.74 | 0.93 |
| theta[37] | 0.28 | -0.34 | 0.81 | 0.78 | 0.98 |
| theta[105] | 1.72 | 0.99 | 2.48 | 1.00 | 0.07 |
| theta[89] | -0.82 | -1.42 | -0.17 | 0.98 | 0.43 |
| theta[126] | - |