Bayesian analysis of comparative judgement data

Bayesian inference Comparative Judgement Comproved Stan R

Comparative Judgement (CJ) is slowly gaining momentum in the context of educational assessment. Data coming from CJ are typically analysed with the BTL model, using a frequentist approach. The increasing computational power of computers has made Bayesian inference more accessible, resulting in the rise of Bayesian model estimation. In this post we demonstrate how the package pcFactorStan can be applied to analyse CJ data within a Bayesian framework, using comparative judgements of argumentative writing from the D-PAC project.

Sven De Maeyer (https://svendemaeyer.netlify.app/), University of Antwerp (https://www.uantwerpen.be)
01-26-2021

Introduction

Comparative Judgement (CJ) is used more and more within the context of educational assessment. CJ is seen as a reliable and valid alternative for judging student work. In contrast to rubrics or criteria lists, it assesses the quality of student work in a relative rather than an absolute way. CJ builds on the idea that people are more reliable at comparing things than at scoring or rating them (a thought introduced in psychometrics by Thurstone 1927).

In CJ, judges are presented with two pieces of student work and asked which of the two is better in the light of the skill being measured. Multiple judges each make multiple comparisons, resulting in data that can be used to estimate the latent worth of each piece of student work. Several scholarly articles are dedicated to the merits of CJ for educational applications (e.g. Lesterhuis et al. 2018; Van Daal et al. 2019, 2017; Verhavert et al. 2019). To run a CJ assessment in an educational setting, a dedicated tool like Comproved (Comproved, n.d.) can be used.

The data resulting from a CJ assessment are analysed making use of the Bradley-Terry-Luce (BTL) model, which can be formally written as

\[ p(x_{ij}=1|v_i,v_j)= \frac{e^{(v_j-v_i)}}{1+e^{(v_j-v_i)}} \] with \(x_{ij}=1\) if piece \(j\) wins the comparison against piece \(i\). The parameters \(v_i\) and \(v_j\) are the latent worths of both pieces of student work. A higher latent worth means that this specific piece of student work is better in the light of the skill that is assessed.
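To make the formula concrete, here is a minimal R sketch of that win probability; the worths 0.5 and 1.5 are made-up values purely for illustration:

Show code
# BTL model: probability that piece j beats piece i, given their latent worths
p_j_beats_i <- function(v_i, v_j) {
  exp(v_j - v_i) / (1 + exp(v_j - v_i))
}

p_j_beats_i(v_i = 0.5, v_j = 1.5)
#> [1] 0.7310586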

Why might a Bayesian approach be preferred?

To estimate the BTL model, dedicated packages are available for R users that facilitate a frequentist estimation of the model, such as the package BradleyTerry2 (see Turner and Firth 2012). In a frequentist analysis we get a point estimate for each parameter and we make inferences based on ‘hypothetical data’ (see Lambert 2018). For each of the students’ pieces of work in the assessment we get a maximum likelihood (ML) estimate expressing its latent worth, together with a confidence interval around that estimate. These confidence intervals express how our parameter estimates are assumed to vary from sample to sample of similar data drawn from the population. So, imagine we get an ML estimate of 2 for a certain piece of work with a standard error of .5; the 95% confidence interval then runs roughly from 1 (\(\approx 2-1.96*.5\)) to 3 (\(\approx 2+1.96*.5\)), and the claim is that intervals constructed in this way would capture the true worth in 95% of those hypothetical samples.
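As a quick check of that arithmetic (the estimate of 2 and the standard error of .5 are the made-up values from the paragraph above):

Show code
estimate <- 2
se       <- 0.5
round(estimate + c(-1, 1) * qnorm(.975) * se, 2)   # Wald-type 95% confidence interval
#> [1] 1.02 2.98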

Bayesian inference, on the other hand, is based on a posterior probability density distribution (further referred to as the posterior) of the parameters, given the data collected. This allows for a more intuitive interpretation of parameter uncertainty. Within the framework of CJ, a Bayesian analysis results in a posterior for the worth of each student’s piece of work. Consequently we can derive probabilities for each possible parameter value. So, for any piece of work we can derive the probability that its latent worth lies within a certain range, given the data we used to estimate our model.

Bayesian analysts have different ways of presenting the posterior to make inferences. Two popular summaries of the posterior are a Credible Interval (CI) and a Highest Density Interval (HDI) (Lambert 2018; Kruschke 2018). The former expresses an interval within which the parameter falls with a given probability. So, if we get a 90% credible interval going from 1.2 to 2.8, this means that we are 90% sure the parameter lies somewhere in this interval. This way of interpreting parameter uncertainty is actually similar to the way many researchers mistakenly interpret confidence intervals in a frequentist analysis. Some authors also call this CI an ETI (Equal Tailed Interval), referring to the fact that the two values defining the range are, for a 90% CI, the 5th and the 95th percentile.

A HDI is an alternative way of expressing the uncertainty we learn from the posterior. It is the interval that contains the x% most probable parameter values. For instance, if we get a 90% HDI for a student’s piece of work’s worth ranging from 1.4 to 2.9, we can interpret it as: the 90% most probable estimates lie within this range. If the posterior is roughly symmetric, a CI and an HDI will be similar. When the posterior is not symmetric, however, the two intervals can differ, each casting a different light on the estimates. A quick example to illustrate. Imagine the following skewed posterior probability density function:

Show code
library(bayestestR)   # distribution_gamma(), ci(), estimate_density()
library(ggplot2)      # plotting
library(dplyr)        # provides the %>% pipe

# Generate a gamma distribution (that is skewed) to act as an example posterior
posterior <- distribution_gamma(1000, 1.5, 2)

# Compute HDI and Quantile CI
ci_hdi <- ci(posterior, method = "HDI")
ci_eti <- ci(posterior, method = "ETI")

# Plot the distribution and add the limits of the two CIs
posterior %>% 
  estimate_density(extend = TRUE) %>% 
  ggplot(aes(x = x, y = y)) +
  geom_area(fill = "#E6AB02") +

  # HDI in green
  geom_vline(xintercept = ci_hdi$CI_low, color = "#66A61E", size = 2) +
  geom_vline(xintercept = ci_hdi$CI_high, color = "#66A61E", size = 2) +
  # ETI in purple
  geom_vline(xintercept = ci_eti$CI_low, color = "#7570B3", size = 2) +
  geom_vline(xintercept = ci_eti$CI_high, color = "#7570B3", size = 2) +
  scale_y_continuous("posterior probability density") +
  scale_x_continuous("possible parameter values") +
  ggtitle("A skew posterior with a 89% HDI (green lines) \n and a 89% CI (aka ETI) (purple lines)")

We have drawn two sets of lines: green and purple. The green lines mark the limits of the HDI. Looking at the tails outside the HDI, we see that the proportion on the right side is larger than on the left side. The central idea of an HDI is that it places its limits so that no value outside the interval has a higher probability density than the values inside it. The purple lines are the limits of the CI with equal tails. Here we see that certain values outside the interval on the left side actually have a higher probability density than some values inside the interval (near the upper limit).

The pcFactorStan package

Back to our CJ data and the use of Bayesian estimation in that context. One of the crucial factors explaining the growth in the use of Bayesian analyses is the exponential increase in the computational power of computers. This computational power is necessary to apply a crucial ‘toolkit’ in Bayesian analysis: Markov Chain Monte Carlo (MCMC) sampling methods to estimate the posterior (Bürkner 2020). The powerful open-source software Stan (Stan Development Team 2020b) makes it possible to apply Hamiltonian Monte Carlo sampling in an efficient way (see Lambert 2018 for an intuitive introduction to different MCMC sampling algorithms). To interface with Stan we can make use of different statistical programming languages. In this article we interface with R (R Core Team 2020), making use of the package rstan (Stan Development Team 2020a). More specifically, we use the recently released package pcFactorStan (Pritikin 2020), which estimates a more general factor-analysis model for multidimensional pairwise comparison data. CJ data can be seen as a restricted case of this more general model, which extends the BTL model with two features: it integrates comparisons on multiple dimensions, resulting in multiple latent worth parameters per piece of student work, and it allows a more nuanced judgement with multiple ordinal categories. In the discussion of this post we will briefly consider the opportunities this more general model creates for CJ in the educational assessment context. Here we restrict ourselves to demonstrating how the package can be used to apply MCMC sampling to the analysis of CJ data.
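Before diving in, note that pcFactorStan and the other packages used below are available from CRAN. A minimal setup sketch (the exact setup chunk behind this post is not shown, so treat this as an assumption about what needs to be loaded):

Show code
# install.packages(c("pcFactorStan", "rstan", "parameters", "insight", "see", "DT"))
library(pcFactorStan)  # prepData(), pcStan()
library(rstan)         # extract() and HMC diagnostics
library(parameters)    # model_parameters()
library(insight)       # export_table()
library(see)           # plot methods and colour scales
library(DT)            # interactive tables with datatable()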

The example dataset

To demonstrate how we can apply pcFactorStan to model CJ data we rely on a dataset coming from the D-PAC project (Comproved, n.d.) with judgements of argumentative texts written by 5th graders in Flemish secondary education. More information on the ins and outs of this dataset can be found in Lesterhuis et al. (2018). More specifically, we make use of two subsets of the data based on the topic of the texts judged: ‘getting children’ (56 judges, 135 texts, 1229 judgements) and ‘organ donation’ (52 judges, 136 texts, 918 judgements).

Analyses & Results

In this section we demonstrate how to analyse the example dataset with pcFactorStan. The different steps are documented: data preparation, model estimation, evaluating model convergence, and making inferences based on the posterior.

Data preparation

The central functions of pcFactorStan expect input data in a specific format. First of all, we have to organise our data at the comparison level, so that each row contains the result of one comparison. Three columns are needed: pa1, an identifier of the first piece of student work; pa2, an identifier of the other piece of student work; and a column with the result, labelled Quality in this example. The resulting dataset should look like this for the Children dataset:

Show code
Children <- DV1 %>%
  # keep only the comparisons from the 'Children' writing task
  filter(assessment == "Schrijfopdracht 1: Kinderen") %>%
  # winner = 1 if the text presented on the left was selected
  mutate(winner = (selected.position == "left")*1, 
         pa1 = left.representation, pa2 = right.representation) %>%
  # recode the outcome to -1/1 as required for the pcFactorStan data list
  mutate(Quality = recode(winner, `0`=-1, `1`=1)) %>%
  select(pa1, pa2, Quality)

Children %>%
  datatable(extensions = 'Buttons',
            options = list(dom = 'Blfrtip',
                           buttons = c('copy', 'csv'),
                           lengthMenu = list(c(5,25,50,-1),
                                             c(5,25,50,"All"))))

and this for the Organs dataset:

Show code
Organs <- DV1 %>%
  filter(assessment == "Schrijfopdracht 2: Orgaan") %>%
  mutate(winner = (selected.position == "left")*1, 
         pa1 = left.representation, pa2 = right.representation) %>%
  mutate(Quality = recode(winner, `0`=-1, `1`=1)) %>%
  select(pa1, pa2, Quality)

Organs %>%
  datatable(extensions = 'Buttons',
            options = list(dom = 'Blfrtip',
                           buttons = c('copy', 'csv'),
                           lengthMenu = list(c(5,25,50,-1),
                                             c(5,25,50,"All"))))

Once datasets containing these three variables are created, we can use the function prepData() from the package pcFactorStan to transform each dataset into a list with the data and metadata that Stan needs for the estimation. The list looks like this:

Show code
ChildrenL <- prepData(Children)
OrgansL <- prepData(Organs)

str(ChildrenL)
#> List of 18
#>  $ nameInfo       :List of 2
#>   ..$ pa  : chr [1:135] "0404_Kinderen2.pdf" "0215EMT_kinderen_3.pdf" "0911EMT_kinderen_1.pdf" "0816_Kinderen_1.pdf" ...
#>   ..$ item: chr "Quality"
#>  $ NPA            : int 135
#>  $ NCMP           : int 1229
#>  $ N              : int 1229
#>  $ NITEMS         : int 1
#>  $ NTHRESH        : num 1
#>  $ TOFFSET        : num 1
#>  $ pa1            : int [1:1229] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ pa2            : int [1:1229] 2 21 27 46 68 69 75 79 83 84 ...
#>  $ weight         : int [1:1229] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ pick           : int [1:1229] 1 1 1 -1 1 1 -1 -1 1 -1 ...
#>  $ numRefresh     : int 1229
#>  $ refresh        : int [1:1229] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ numOutcome     : int [1:1229] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ item           : int [1:1229] 1 1 1 1 1 1 1 1 1 1 ...
#>  $ alphaScalePrior: num 0.2
#>  $ corLKJPrior    : num 2.5
#>  $ propShape      : num 3

Once you have created the list, you’re ready for the estimation.

Model estimation

To estimate the worth parameters of the students’ pieces of work we can use the function pcStan() from the pcFactorStan package to estimate the unidim_adapt model (more technical information in Pritikin 2020). Before we can do this we have to add a constant to the data list that is needed for a non-centered parametrization (see Pritikin 2020). Following the results of the simulation study presented in Pritikin (2020) we set this constant to \(5.0\) and store it as an element named varCorrection in the list we created earlier.

Show code
ChildrenL$varCorrection <- 5.0
OrgansL$varCorrection <- 5.0

Now we can make use of the very convenient function pcStan() to estimate the models for both datasets. By default this function runs 4 chains of 2000 iterations each, with 1000 iterations used as warmup.

Show code
options(mc.cores=parallel::detectCores())
M_Children <- pcStan(model = 'unidim_adapt', data = ChildrenL)
M_Organs   <- pcStan(model = 'unidim_adapt', data = OrgansL)
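pcStan() returns a regular stanfit object, so all the usual rstan tooling applies. Because sampling can take a while, an optional extra step is to save the fits for later sessions; a small sketch with hypothetical file names:

Show code
# Save the fitted stanfit objects so the sampler does not have to be rerun
saveRDS(M_Children, "M_Children.rds")
saveRDS(M_Organs, "M_Organs.rds")
# M_Children <- readRDS("M_Children.rds")   # reload in a later session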

Once the estimation is finished we are ready to check whether the MCMC algorithm converged properly.

Evaluating model convergence

The package rstan contains different functions to check the convergence of the MCMC sampling. One of the convenient functions is check_hmc_diagnostics(), which prints the HMC diagnostics (divergences, tree depth and energy) to the console:
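Show code
check_hmc_diagnostics(M_Children)
check_hmc_diagnostics(M_Organs)

But we can also call the individual diagnostic functions ourselves and collect the results in a table.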

Show code
n_divergent_Children     <- sum(get_divergent_iterations(M_Children)*1)
n_divergent_Organs       <- sum(get_divergent_iterations(M_Organs)*1)
n_max_treedepth_Children <- sum(get_max_treedepth_iterations(M_Children)*1)
n_max_treedepth_Organs   <- sum(get_max_treedepth_iterations(M_Organs)*1)
n_low_energy_Children    <- sum(get_low_bfmi_chains(M_Children))
n_low_energy_Organs      <- sum(get_low_bfmi_chains(M_Organs))

convergence_table <- data.frame(Labels = c("Children", "Organs"), 
                                n_divergent_iterations = c(n_divergent_Children, n_divergent_Organs),
                                n_max_tree_depth = c(n_max_treedepth_Children, n_max_treedepth_Organs),
                                n_low_energy = c(n_low_energy_Children, n_low_energy_Organs))

knitr::kable(
  convergence_table,
  col.names = c('Assessment', 'n divergent iterations', 'n max tree depth', 'n low energy'),
  align = "lccc",
  caption = "HMC convergence information"
)
Table 1: HMC convergence information

Assessment   n divergent iterations   n max tree depth   n low energy
Children     0                        0                  0
Organs       0                        0                  0

We can also check whether \(\widehat R\) < 1.015 and whether the effective sample size is greater than 100 times the number of chains for each parameter (Vehtari et al. 2021).

Show code
allParsChildren <- summary(M_Children, probs=c())$summary
n_eff_Children  <- min(allParsChildren[,'n_eff']) # lowest n_eff
rhat_Children   <- max(allParsChildren[,'Rhat'])  # highest Rhat

allParsOrgans   <- summary(M_Organs, probs=c())$summary
n_eff_Organs    <- min(allParsOrgans[,'n_eff'])
rhat_Organs     <- max(allParsOrgans[,'Rhat'])

N_eff = c(n_eff_Children , n_eff_Organs)
R_hat = c(rhat_Children , rhat_Organs)

Fit_table <- data.frame(Labels = c("Children", "Organs"),
                        round(N_eff,3),
                        round(R_hat,3)
                        )

knitr::kable(
  Fit_table,
  col.names = c('Assessment', 'min. number ESS', 'max. $\\widehat R$'),
  align = "lcc",
  caption = "Convergence information for both estimated scales"
)
Table 2: Convergence information for both estimated scales

Assessment   min. number ESS   max. \(\widehat R\)
Children     809.546           1.004
Organs       956.053           1.001

This all looks fine!

Evaluating Scale Separation Reliability

Finally we can check the Scale Separation Reliability (SSR), a measure used by CJ researchers to express the reliability of the scale of estimated worths (see Bramley 2015; Verhavert et al. 2018).
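Concretely, the code below treats the posterior standard deviations of the worths as measurement error. Writing \(RMSE\) for the root mean squared posterior standard deviation and \(SD_{obs}\) for the standard deviation of the posterior means, the quantities computed below are

\[ G = \frac{\sqrt{SD_{obs}^2 - RMSE^2}}{RMSE} \qquad \text{and} \qquad SSR = \frac{G^2}{1+G^2} \]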

Show code
theta_children <- summary(M_Children, pars=c("theta"), probs=c())$summary
mse_children <- mean(theta_children[,'sd']^2)
observedSD_children <- sd(theta_children[,'mean'])
trueSD_children <- sqrt(observedSD_children^2 - mse_children)
G_children <- trueSD_children / sqrt(mse_children)
SSR_Children <- G_children^2 / (1+G_children^2)

theta_Organs <- summary(M_Organs, pars=c("theta"), probs=c())$summary
mse_Organs <- mean(theta_Organs[,'sd']^2)
observedSD_Organs <- sd(theta_Organs[,'mean'])
trueSD_Organs <- sqrt(observedSD_Organs^2 - mse_Organs)
G_Organs <- trueSD_Organs / sqrt(mse_Organs)
SSR_Organs <- G_Organs^2 / (1+G_Organs^2)

SSR_table <- data.frame(Labels = c("Children", "Organs"), 
                                SSR = c(round(SSR_Children,3), round(SSR_Organs,3))
                        )

knitr::kable(
  SSR_table,
  col.names = c('Assessment', 'SSR'),
  align = "lc",
  caption = "SSR for both estimated scales"
)
Table 3: SSR for both estimated scales

Assessment   SSR
Children     0.801
Organs       0.688

These SSRs are in line with the SSRs reported by Verhavert et al. (2018) for the same assessments (see Appendix B of Verhavert et al. 2018).

Inferences on the posterior

Once we’re confident about the estimation of the posterior, we can start making inferences based on it. This phase of the analysis offers many options. A simple, though not very readable, method is the print() function. In the following we print some information on the posterior for a random selection of 20 texts from the ‘Children’ task. First we create an object named pick_20 that contains 20 random labels of the form theta[...], where ... refers to a number. We do this for two reasons: it makes the output more comprehensible than printing 135 lines, and it mimics a hypothetical situation where you want to make inferences for texts written by 20 students belonging to the same class at school. After creating this identifier of 20 sampled texts we print the results for the accompanying parameters. With c(.1,.5,.9) we request the 10th, 50th and 90th percentiles of the posterior, i.e. the median and the limits of an 80% credible interval.

Show code
set.seed(1)
pick_20 <- paste0('theta[',sample(1:135,20),']')
print(M_Children, pars = pick_20, c(.1,.5,.9))
#> Inference for Stan model: unidim_adapt.
#> 4 chains, each with iter=2000; warmup=1000; thin=1; 
#> post-warmup draws per chain=1000, total post-warmup draws=4000.
#> 
#>             mean se_mean   sd   10%   50%   90% n_eff Rhat
#> theta[68]  -0.23    0.01 0.38 -0.72 -0.23  0.25  4282    1
#> theta[129] -0.61    0.01 0.36 -1.08 -0.62 -0.17  4957    1
#> theta[43]   0.00    0.01 0.35 -0.45  0.01  0.45  4890    1
#> theta[14]  -0.33    0.01 0.37 -0.80 -0.34  0.14  4256    1
#> theta[51]  -0.17    0.01 0.35 -0.61 -0.16  0.27  4492    1
#> theta[85]  -1.04    0.01 0.40 -1.56 -1.04 -0.53  5437    1
#> theta[21]   1.33    0.01 0.46  0.75  1.32  1.94  4994    1
#> theta[106]  0.92    0.01 0.42  0.39  0.91  1.47  4786    1
#> theta[54]   1.16    0.01 0.45  0.61  1.14  1.76  5528    1
#> theta[110]  0.62    0.01 0.38  0.14  0.62  1.12  5068    1
#> theta[74]   1.33    0.01 0.43  0.78  1.32  1.89  4645    1
#> theta[7]   -1.73    0.01 0.46 -2.33 -1.71 -1.15  5200    1
#> theta[73]   1.62    0.01 0.47  1.03  1.60  2.24  5621    1
#> theta[79]   0.23    0.01 0.35 -0.22  0.22  0.68  4784    1
#> theta[130] -0.23    0.00 0.36 -0.69 -0.22  0.22  5159    1
#> theta[37]   0.28    0.01 0.36 -0.18  0.28  0.74  4498    1
#> theta[105]  1.73    0.01 0.47  1.15  1.72  2.34  4417    1
#> theta[89]  -0.82    0.01 0.40 -1.33 -0.82 -0.33  5275    1
#> theta[126] -0.98    0.01 0.40 -1.49 -0.98 -0.47  5577    1
#> theta[101]  0.13    0.01 0.37 -0.34  0.13  0.62  4471    1
#> 
#> Samples were drawn using NUTS(diag_e) at Fri Jan 22 18:14:43 2021.
#> For each parameter, n_eff is a crude measure of effective sample size,
#> and Rhat is the potential scale reduction factor on split chains (at 
#> convergence, Rhat=1).

An alternative way to make inferences on a parameter based on the posterior is extracting the Highest Density Interval (HDI). To do this we use the package bayestestR, which contains the function hdi(). Before calling this function we do some preparation to get only the worth parameters (the thetas). With the function extract() from the rstan package we extract the draws (i.e. the sampling results). Next, we transform the resulting list into a data frame with only the theta parameters. Before printing the result we want to change the column names of the data frame, because by default they are automatically named X1 through X135. To change these column names we first create a vector named pick that contains names like theta[1] and use pick as input for the column names of our data frame. Now we are ready to call the hdi() function.

Show code
Children_parameters <- extract(M_Children)
Children_theta_parameters <- data.frame(Children_parameters[["theta"]])
pick <- paste0('theta[',c(1:135),']')
colnames(Children_theta_parameters) <- pick
hdi(Children_theta_parameters[,pick_20])
#> # Highest Density Interval
#> 
#> Parameter  |        89% HDI
#> ---------------------------
#> theta[68]  | [-0.82,  0.38]
#> theta[129] | [-1.22, -0.07]
#> theta[43]  | [-0.55,  0.57]
#> theta[14]  | [-0.90,  0.27]
#> theta[51]  | [-0.72,  0.40]
#> theta[85]  | [-1.69, -0.42]
#> theta[21]  | [ 0.64,  2.11]
#> theta[106] | [ 0.29,  1.61]
#> theta[54]  | [ 0.47,  1.91]
#> theta[110] | [ 0.01,  1.23]
#> theta[74]  | [ 0.59,  1.99]
#> theta[7]   | [-2.39, -0.92]
#> theta[73]  | [ 0.85,  2.34]
#> theta[79]  | [-0.34,  0.77]
#> theta[130] | [-0.79,  0.33]
#> theta[37]  | [-0.34,  0.81]
#> theta[105] | [ 0.99,  2.48]
#> theta[89]  | [-1.42, -0.17]
#> theta[126] | [-1.59, -0.33]
#> theta[101] | [-0.44,  0.74]

From the above table we learn, for instance, that the worth parameter for text number 68 is most probably situated somewhere between -0.82 and 0.38. For text number 129 the most probable worth lies in the range -1.22 to -0.07. So for text 68 this range includes the value 0 (the mean of our worth scale), while for text 129 it does not.

Notice that by comparing this range with the point value 0.0 we do something that resembles Null Hypothesis Testing in frequentist analyses. This is somewhat counterintuitive in a Bayesian framework. Therefore one of the key scholars in Bayesian analysis, J. Kruschke, introduced the concept of a Region Of Practical Equivalence (ROPE) and an accompanying decision rule concerning the overlap of the HDI with the ROPE (Kruschke 2018). For inferences we can calculate the percentage of the posterior of a parameter that falls within the ROPE.
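The underlying idea is simply counting which share of the posterior draws falls inside the ROPE. A minimal sketch for a single text, using the ROPE limits we adopt in the next paragraph (note that bayestestR can also compute this share relative to a given HDI of the posterior rather than to all draws, so the numbers in the table below may differ slightly):

Show code
draws_68 <- Children_theta_parameters[, "theta[68]"]
mean(draws_68 >= -0.75 & draws_68 <= 1.05)  # share of the posterior inside the ROPE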

Let’s apply this concept to our results on the task ‘Children’ by making use of the model_parameters() function from the parameters package. We also demonstrate how the export_table() function from the insight package can be called to create a formatted table. Before we do this, we have to decide on the range of the ROPE. One way to define a ROPE could be to consult experts and let them place two benchmarks on the scale: one to distinguish between ‘poor’ and ‘mediocre’ texts and one to distinguish between ‘mediocre’ and ‘strong’ texts. Imagine that this consultation results in -.75 and 1.05 respectively (note that the ROPE does not have to be a symmetric interval around zero, as this example shows); we can then use these values as input for our call to the model_parameters() function.

Show code
model_parameters(Children_theta_parameters[,pick_20], 
                 rope_range = c(-.75,1.05)) %>% 
    export_table(format = "markdown", 
                 caption = "Theta estimates 20 texts (ROPE limits -.75 & 1.05)")
Theta estimates 20 texts (ROPE limits -.75 & 1.05)

Parameter    Median     CI_low   CI_high   pd     ROPE_Percentage
------------------------------------------------------------------
theta[68]    -0.23      -0.82     0.38     0.72   0.91
theta[129]   -0.62      -1.22    -0.07     0.95   0.65
theta[43]    9.32e-03   -0.55     0.57     0.51   0.98
theta[14]    -0.34      -0.90     0.27     0.81   0.88
theta[51]    -0.16      -0.72     0.40     0.68   0.95
theta[85]    -1.04      -1.69    -0.42     1.00   0.23
theta[21]     1.32       0.64     2.11     1.00   0.28
theta[106]    0.91       0.29     1.61     0.99   0.63
theta[54]     1.14       0.47     1.91     1.00   0.41
theta[110]    0.62       0.01     1.23     0.95   0.86
theta[74]     1.32       0.59     1.99     1.00   0.26
theta[7]     -1.71      -2.39    -0.92     1.00   0.01
theta[73]     1.60       0.85     2.34     1.00   0.11
theta[79]     0.22      -0.34     0.77     0.74   0.99
theta[130]   -0.22      -0.79     0.33     0.74   0.93
theta[37]     0.28      -0.34     0.81     0.78   0.98
theta[105]    1.72       0.99     2.48     1.00   0.07
theta[89]    -0.82      -1.42    -0.17     0.98   0.43
theta[126]   -0.98      -1.59    -0.33     0.99   0.29
theta[101]    0.13      -0.44     0.74     0.64   0.99

From this output we learn that the worth of text 68 is most probably situated in the region that we defined as ‘mediocre’, given that 91% of the posterior falls within our defined ROPE. For text 129 we are less confident that it meets the ‘mediocre’ benchmark, given the negative median of the posterior and the fact that only 65% of the posterior is situated in the ROPE. Another example is text number 105. For this text we can be quite confident that it reaches the benchmark of a strong argumentative text (the median of the posterior is 1.72 and only 7% of the posterior is in the ROPE).

We have shown how to create tables that contain information to support making inferences. But most of the time a visualisation tells a lot more. In what follows we present some visualisations made with the package see. We continue with the results of our hypothetical class of 20 students’ texts for the task ‘Children’.

We start by plotting the HDI. To do this we store the results of the hdi() function from the package bayestestR in an object and then pass this object to the plot() function, which plots the results making use of the see package.

Show code
HDI_results <- hdi(Children_theta_parameters[,pick_20],
                     ci = c(.5,.75,.89))


plot(data=Children_theta_parameters[,pick_20],
     HDI_results) +
  scale_fill_metro()

But we can also plot information on the ROPE to aid interpretation. In the following graph we can, for instance, quickly identify text 105, whose posterior barely overlaps with the ROPE we defined.

Show code
ROPE_results <- rope(Children_theta_parameters[,pick_20],
                     ci = c(.75,.89,1),
                     range = c(-.75,1.05))


plot(data=Children_theta_parameters[,pick_20],
     ROPE_results, rope_color = "red") +
 scale_fill_brewer(palette = "Greens", direction = -1)

Of course, many other visualisations can be applied, making use of the growing number of packages dedicated to generating tables and graphs for Bayesian models.
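As one example, here is a quick sketch with bayesplot (one of the packages listed in the acknowledgments); mcmc_areas() draws the posterior density of each selected worth parameter with its median and a shaded credible interval:

Show code
library(bayesplot)

# Posterior densities for the 20 sampled texts, with medians and 80% intervals
mcmc_areas(as.array(M_Children), pars = pick_20, prob = 0.8)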

To conclude

With this article we hope to have demonstrated how to make use of MCMC sampling for Bayesian inference in the context of Comparative Judgement in educational assessment. The recently published package pcFactorStan (Pritikin 2020) is a great resource that enables us to use the Stan software for the estimation.

What we didn’t discuss in this article is how the package pcFactorStan also opens up possibilities for implementing different CJ scenarios in educational assessment. The package was developed to analyse multidimensional ordinal pairwise comparison data. Let’s translate this into the possibilities it implies for the evaluation of texts as an example.

First of all we can now ask judges for a more nuanced judgement. For instance we can ask judges to compare both texts and give a judgement on the following scale: “text A is a lot better than text B”, “text A is somewhat better than text B”, “both texts are of equal quality”, “text B is somewhat better than text A” and “text B is a lot better than text A”.

A second feature that we can introduce is the possibility to apply this kind of comparison to multiple aspects of text quality. For instance, we can ask judges to compare both texts on the following dimensions: ‘argumentation’, ‘coherence’, ‘organisation’ and ‘mechanics’. With the pcFactorStan package we can estimate individual parameters for each dimension, reflecting the quality of each text on that dimension, but we can also estimate whether there is an overall latent ‘text quality’ scale (hence the Factor part in the name of the package) and how well each text scores on this overall scale.

Finally we want to encourage the reader to use the two datasets that accompany this article to further explore the possibilities of Bayesian analyses of CJ data.

Acknowledgments

A big thank you to the whole D-PAC team! Without your dedicated and hard work there would not have been a dataset to use here.

Without the package pcFactorStan I wouldn’t have written this article at all. Thank you, Joshua Pritikin, for putting all your effort into the development of this package and for sharing your knowledge and insights!

Also I wish to acknowledge the University of Antwerp for awarding my Sabbatical Leave. Without the sabbatical leave I wouldn’t have found the time to start my learning path on Bayesian analysis and Stan coding…

For the analyses presented and the writing of this post, I used the programming language R (R Core Team 2020) through the interface RStudio (RStudio Team 2020). The following R packages were crucial (in alphabetical order): bayesplot (Gabry and Mahr 2020), bayestestR (Makowski, Ben-Shachar, and Lüdecke 2019), dplyr (Wickham et al. 2020), DT (Xie, Cheng, and Tan 2020), ggplot2 (Wickham 2016), insight, knitr (Xie 2015), loo (Vehtari, Gelman, and Gabry 2017), parameters (Lüdecke, Ben-Shachar, Patil, et al. 2020), pcFactorStan (Pritikin 2020), rmarkdown (Allaire et al. 2021; Xie, Dervieux, and Riederer 2020), rstan (Stan Development Team 2020a), see (Lüdecke, Ben-Shachar, Waggoner, et al. 2020) and tidyr (Wickham 2020).

Allaire, JJ, Yihui Xie, Jonathan McPherson, Javier Luraschi, Kevin Ushey, Aron Atkins, Hadley Wickham, Joe Cheng, Winston Chang, and Richard Iannone. 2021. Rmarkdown: Dynamic Documents for R. https://github.com/rstudio/rmarkdown.

Bramley, Tom. 2015. “Investigating the Reliability of Adaptive Comparative Judgment.” Cambridge, UK: Cambridge Assessment. https://www.cambridgeassessment.org.uk/Images/232694-investigating-the-reliability-of-adaptive-comparative-judgment.pdf.

Bürkner, Paul-Christian. 2020. “Analysing Standard Progressive Matrices (SPM-LS) with Bayesian Item Response Models.” Journal of Intelligence 8 (1): 5. https://doi.org/10.3390/jintelligence8010005.

Comproved. n.d. “Welcome to Comproved - Assess Better. Learn More.” Available at https://comproved.com/en/ (2021/01/18).

Gabry, Jonah, and Tristan Mahr. 2020. “Bayesplot: Plotting for Bayesian Models.” https://mc-stan.org/bayesplot.

Kruschke, John K. 2018. “Rejecting or Accepting Parameter Values in Bayesian Estimation.” Advances in Methods and Practices in Psychological Science 1 (2): 270–80. https://doi.org/10.1177/2515245918771304.

Lambert, Ben. 2018. A Student’s Guide to Bayesian Statistics. Sage, Thousand Oaks.

Lesterhuis, Marije, Tine van Daal, Roos Van Gasse, Liesje Coertjens, Vincent Donche, and Sven De Maeyer. 2018. “When Teachers Compare Argumentative Texts: Decisions Informed by Multiple Complex Aspects of Text Quality.” L1-Educational Studies in Language and Literature 18: 1–22. https://doi.org/10.17239/L1ESLL-2018.18.01.02.

Lüdecke, Daniel, Mattan S. Ben-Shachar, Indrajeet Patil, and Dominique Makowski. 2020. “Parameters: Extracting, Computing and Exploring the Parameters of Statistical Models Using R.” Journal of Open Source Software 5 (53): 2445. https://doi.org/10.21105/joss.02445.

Lüdecke, Daniel, Mattan S. Ben-Shachar, Philip Waggoner, and Dominique Makowski. 2020. “See: Visualisation Toolbox for ’Easystats’ and Extra Geoms, Themes and Color Palettes for ’Ggplot2’.” CRAN. https://doi.org/10.5281/zenodo.3952153.

Makowski, Dominique, Mattan S. Ben-Shachar, and Daniel Lüdecke. 2019. “BayestestR: Describing Effects and Their Uncertainty, Existence and Significance Within the Bayesian Framework.” Journal of Open Source Software 4 (40): 1541. https://doi.org/10.21105/joss.01541.

Pritikin, Joshua N. 2020. “An Exploratory Factor Model for Ordinal Paired Comparison Indicators.” Heliyon 6 (9): e04821. https://doi.org/10.1016/j.heliyon.2020.e04821.

R Core Team. 2020. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

RStudio Team. 2020. RStudio: Integrated Development Environment for R. Boston, MA: RStudio, PBC. http://www.rstudio.com/.

Stan Development Team. 2020a. “RStan: The R Interface to Stan.” http://mc-stan.org/.

———. 2020b. Stan Modeling Language Users Guide and Reference Manual. https://mc-stan.org.

Thurstone, L. L. 1927. “A Law of Comparative Judgment.” Psychological Review 34 (4): 273–86.

Turner, Heather, and David Firth. 2012. “Bradley-Terry Models in R: The BradleyTerry2 Package.” Journal of Statistical Software 48 (9): 1–21. https://www.jstatsoft.org/v48/i09/.

Van Daal, Tine, Marije Lesterhuis, Liesje Coertjens, Vincent Donche, and Sven De Maeyer. 2019. “Validity of Comparative Judgement to Assess Academic Writing : Examining Implications of Its Holistic Character and Building on a Shared Consensus.” Assessment in Education : Principles, Policy & Practice 26 (1): 59–74. https://doi.org/10.1080/0969594X.2016.1253542.

Van Daal, Tine, Marije Lesterhuis, Liesje Coertjens, Marie-Therese van de Kamp, Vincent Donche, and Sven De Maeyer. 2017. “The Complexity of Assessing Student Work Using Comparative Judgment : The Moderating Role of Decision Accuracy.” Frontiers in Education 2: 1–13. https://doi.org/10.3389/FEDUC.2017.00044.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.

Vehtari, Aki, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. 2021. “Rank-Normalization, Folding, and Localization: An Improved \(\widehat{R}\) for Assessing Convergence of MCMC.” Bayesian Analysis. https://doi.org/10.1214/20-BA1221.

Verhavert, San, Renske Bouwer, Vincent Donche, and Sven De Maeyer. 2019. “A Meta-Analysis on the Reliability of Comparative Judgement.” Assessment in Education : Principles, Policy & Practice, 1–22. https://doi.org/10.1080/0969594X.2019.1602027.

Verhavert, San, Sven De Maeyer, Vincent Donche, and Liesje Coertjens. 2018. “Scale Separation Reliability: What Does It Mean in the Context of Comparative Judgment?” Applied Psychological Measurement 42: 428–45. https://doi.org/10.1177/0146621617748321.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

———. 2020. Tidyr: Tidy Messy Data. https://CRAN.R-project.org/package=tidyr.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2020. Dplyr: A Grammar of Data Manipulation. https://CRAN.R-project.org/package=dplyr.

Xie, Yihui. 2015. Dynamic Documents with R and Knitr. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. https://yihui.org/knitr/.

Xie, Yihui, Joe Cheng, and Xianying Tan. 2020. DT: A Wrapper of the Javascript Library ’Datatables’. https://CRAN.R-project.org/package=DT.

Xie, Yihui, Christophe Dervieux, and Emily Riederer. 2020. R Markdown Cookbook. Boca Raton, Florida: Chapman; Hall/CRC. https://bookdown.org/yihui/rmarkdown-cookbook.

References

Citation

For attribution, please cite this work as

Maeyer (2021, Jan. 26). Reproducible Stats in Education Sciences: Bayesian analysis of comparative judgement data. Retrieved from https://svendemaeyer.netlify.app/posts/2021-01-18-bayesian-analysis-of-comparative-judgement-data/

BibTeX citation

@misc{maeyer2021bayesian,
  author = {Maeyer, Sven De},
  title = {Reproducible Stats in Education Sciences: Bayesian analysis of comparative judgement data},
  url = {https://svendemaeyer.netlify.app/posts/2021-01-18-bayesian-analysis-of-comparative-judgement-data/},
  year = {2021}
}