Abstract screening

Preparation - Combining lists and removing duplicates

Before the initial screening, the search results from multiple databases need to be combined. There is a high possibility that the combined list will contain duplicates, which need to be removed so that only one record of each article is kept. Systematic literature review tools such as Covidence and Rayyan can automatically detect duplicates using AI, but they are not free. We can also detect duplicated documents in Microsoft Excel, using its sorting and conditional formatting features.
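If you prefer to work in R, a minimal sketch along these lines can flag duplicate records by title. The file name combined_results.csv and the Title column are assumptions made for illustration; adjust them to match your own export.

# A rough sketch of duplicate removal in R (file name and column name are assumed)
search_results <- read.csv("combined_results.csv", stringsAsFactors = FALSE)

# Normalise titles (lower case, trimmed) so minor formatting differences
# do not hide duplicates, then keep the first record of each title
normalised_title <- tolower(trimws(search_results$Title))
deduplicated <- search_results[!duplicated(normalised_title), ]

nrow(search_results) - nrow(deduplicated) # number of duplicate records removed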

Screening practice

When the list is ready, team members need to meet to agree on how to screen the abstracts and to practise by screening several papers together, so that everyone has a shared understanding of the process. The team leader needs to facilitate the discussion to achieve this common understanding. I prefer an online meeting through a video conferencing application so that it can be held in a relaxed atmosphere outside working hours; this format is also best when team members are from different institutions. As a suggestion, keep the inclusion and exclusion criteria handy to help with the abstract screening. While article titles and keywords might help in detecting the relevance of articles, the abstract must always be prioritized.

Cycles of screening to achieve reliability

In this process, it is important to objectively demonstrate that all team members have the same understanding of the inclusion and exclusion criteria. To determine whether the team members have adequate agreement, each member needs to independently screen the same set of at least 30 articles. After completing this first cycle of screening, the members’ results can be compared using Cohen’s kappa (for two raters) or Fleiss’ kappa (for more than two raters). See Cole (2024) for further reading about reliability in qualitative research. In this book, I will demonstrate how both kappas are calculated manually and in R, the statistical software that I am familiar with. You can explore how it is done in the application of your choice if you prefer. We will use this data for the calculation.

Cohen’s kappa - manual calculation

For this calculation, we will use only Rater_1 and Rater_2 because Cohen’s kappa is used to calculate inter-rater reliability when only two raters are involved. To calculate Cohen’s kappa, the data needs to be prepared as in the following table.

                     Included (Rater_2)   Excluded (Rater_2)   Total
Included (Rater_1)            9                    3             12
Excluded (Rater_1)           11                    7             18
Total                        20                   10             30

The formula for kappa is: \[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

where \(p_o\) is the observed proportion of agreement and \(p_e\) is the proportion of agreement expected by chance.

\(p_o\) is calculated by adding the number of articles that both raters rate as included (9) to the number they both rate as excluded (7), and dividing the sum by the total number of articles.

\[ p_o = \frac{(9 + 7)}{30} = \frac{16}{30}=0.5333 \]

\(p_e\) is calculated by combining the probability that both raters would randomly arrive at the rating “included” and the probability that both raters would randomly arrive at the rating “excluded”.

The probability that both raters would randomly arrive at the rating “included” is calculated by multiplying the proportion of the first row total (12 out of 30) by the proportion of the first column total (20 out of 30) in the table above.

\[ \text{Expected to include} = \frac{12}{30} \cdot \frac{20}{30} = 0.2667 \]

The probability that both raters would randomly arrive at the rating “excluded” is calculated by multiplying the proportion of the second row total (18 out of 30) by the proportion of the second column total (10 out of 30).

\[ \text{Expected to exclude} = \frac{18}{30} \cdot \frac{10}{30} = 0.2 \]

Therefore \(p_e\) can be calculated as follows:

\[p_e = 0.2667 + 0.2 = 0.4667\]

Now we are ready to calculate Cohen’s kappa:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.5333 - 0.4667}{1 - 0.4667} = 0.125\]
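The same arithmetic can be reproduced in a few lines of base R, which is a handy way to check the manual calculation before turning to the irr package later in this section. The counts below come from the 2 × 2 table above.

# Cohen's kappa from the 2 x 2 agreement table (Rater_1 in rows, Rater_2 in columns)
agreement <- matrix(c(9, 3,
                      11, 7),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(Rater_1 = c("Included", "Excluded"),
                                    Rater_2 = c("Included", "Excluded")))

n_total <- sum(agreement)                          # 30 articles
p_o <- sum(diag(agreement)) / n_total              # (9 + 7) / 30 = 0.5333
p_e <- sum((rowSums(agreement) / n_total) *
           (colSums(agreement) / n_total))         # 0.2667 + 0.2 = 0.4667
(p_o - p_e) / (1 - p_e)                            # kappa, approximately 0.125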

Fleiss’ kappa - manual calculation

To calculate Fleiss’ kappa manually, we need to add two columns to record the total number of raters who rate each paper as “included” and as “excluded”, and another two columns for the squares of those two counts.

The kappa formula is the same but \(p_o\) and \(p_e\) are calculated differently.

\[ p_o = \frac{1}{N \cdot n \cdot (n - 1)} \left(\sum_{i = 1}^{N}\sum_{j = 1}^{k}n^2_{ij} - N \cdot n\right) \quad \text{and} \quad p_e = \sum_{j = 1}^{k}p^2_j \]

where \(N\) is the number of articles, \(n\) is the number of raters, \(k\) is the number of rating categories, \(i\) indexes articles, \(j\) indexes categories, \(n_{ij}\) is the number of raters who placed article \(i\) in category \(j\), and \(p_j\) is the proportion of all ratings that fall in category \(j\).

Now let me walk you through the \(p_o\) formula.

\[ p_o = \frac{1}{30 \cdot 4 \cdot (4 - 1)} \left( \text{sum of Included}^2 + \text{sum of Excluded}^2 - 30 \cdot 4 \right) \]
\[ p_o = \frac{1}{120 \cdot 3} \left( 151 + 127 - 120 \right) = \frac{158}{360} = 0.4389 \]

Now we can calculate \(p_e\). First allow me to translate the formula into a more intuitive version.

\[ p_e =\left( \frac{\text{sum of Included}}{N \cdot n} \right)^2 + \left( \frac{\text{sum of Excluded}}{N \cdot n}\right)^2 \]
\[ p_e = \left( \frac{63}{30 \cdot 4} \right)^2 + \left( \frac{57}{30 \cdot 4}\right)^2 = 0.525^2 + 0.475^2 = 0.5012 \]

Now that we have obtained the values for \(p_o\) and \(p_e\), we can fit them into the Fleiss’ kappa formula.

\[ \kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.4389 - 0.5012}{1 - 0.5012} = -0.125\]
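For readers who want to check this in R as well, the formulas above can be wrapped in a small function. The sketch below assumes counts is an \(N \times k\) matrix in which each row records how many of the \(n\) raters placed that article in each category (here, Included and Excluded); it is only an illustration of the formulas, and the irr package used in the next section does the same job.

# Fleiss' kappa from an N x k matrix of per-article category counts
# (each row sums to n, the number of raters); a sketch of the formulas above
fleiss_kappa <- function(counts) {
  N <- nrow(counts)                      # number of articles
  n <- sum(counts[1, ])                  # number of raters per article
  p_o <- (sum(counts^2) - N * n) / (N * n * (n - 1))
  p_j <- colSums(counts) / (N * n)       # proportion of all ratings per category
  p_e <- sum(p_j^2)
  (p_o - p_e) / (1 - p_e)
}

Applied to a counts matrix tabulated from the kappa_data generated below, this should return roughly -0.125, the same value reported by irr::kappam.fleiss.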

Cohen’s kappa and Fleiss’ kappa in R

Calculating kappa is straightforward in R. We will use the irr package to perform the calculation. First, let’s reproduce the data I used in this material.

set.seed(22) # this is to ensure that the data generated is the same as mine
kappa_data <- data.frame(Article_ID = paste0("Article_", 1:30),
                         Rater_1 = sample(c("Included", "Excluded"), 30, replace = TRUE),
                         Rater_2 = sample(c("Included", "Excluded"), 30, replace = TRUE),
                         Rater_3 = sample(c("Included", "Excluded"), 30, replace = TRUE),
                         Rater_4 = sample(c("Included", "Excluded"), 30, replace = TRUE))
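Before moving on, it can be reassuring to take a quick look at the simulated decisions:

head(kappa_data)                               # preview the first few screening decisions
table(kappa_data$Rater_1, kappa_data$Rater_2)  # should match the 2 x 2 table used earlier
                                               # (rows and columns appear in alphabetical order)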

Next, install the irr package if you do not have it yet and run the calculation. We only use the data in columns 2 and 3, referring to Rater_1 and Rater_2, because Cohen’s kappa is used when only two raters are involved.

#install.packages("irr")
irr::kappa2(kappa_data[,2:3], weight = "unweighted")
 Cohen's Kappa for 2 Raters (Weights: unweighted)

 Subjects = 30 
   Raters = 2 
    Kappa = 0.125 

        z = 0.791 
  p-value = 0.429 

To calculate Fleiss’ kappa, we will use the same package, but columns 2-5 are targeted because we include all four raters.

irr::kappam.fleiss(kappa_data[,2:5])
 Fleiss' Kappa for m Raters

 Subjects = 30 
   Raters = 4 
    Kappa = -0.125 

        z = -1.68 
  p-value = 0.0934 

Interpreting kappa and moving forward

Many researchers refer to Landis & Koch (1977) to interpret the kappa coefficient. By Landis and Koch’s (1977) thresholds, a kappa of at least 0.6 (substantial) is required before the team moves on to screen the rest of the abstracts independently. Otherwise, the team members need to meet to discuss their disagreements, reach consensus, and independently screen another set of 30 articles, and the process continues until an adequate kappa coefficient is obtained.

Kappa   Level of agreement
> 0.8   Almost perfect
> 0.6   Substantial
> 0.4   Moderate
> 0.2   Fair
> 0     Slight
< 0     No agreement
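As a small convenience, the thresholds in the table can be turned into a helper function; this is merely an illustrative sketch, not something provided by the irr package.

# Map a kappa coefficient to the Landis & Koch (1977) labels in the table above
interpret_kappa <- function(kappa) {
  if (kappa > 0.8) "Almost perfect"
  else if (kappa > 0.6) "Substantial"
  else if (kappa > 0.4) "Moderate"
  else if (kappa > 0.2) "Fair"
  else if (kappa > 0)   "Slight"
  else                  "No agreement"
}

interpret_kappa(0.125)  # "Slight": the team would need another screening cycle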

Abstract screening

After an adequate kappa coefficient is reached, the rest of the papers are divided equally among the team members to be coded individually. At this stage, it is advisable to have three rating categories, i.e., Included, Excluded, and Not sure. The articles rated as Not sure can be discussed in the weekly project meeting so that the decision is made together, which adds to the robustness of the process. Remember to always refer to the inclusion and exclusion criteria when making decisions, both during the independent screening and in the team meeting.

References

Cole, R. (2024). Inter-rater reliability methods in qualitative case study research. Sociological Methods & Research, 53(4), 1944–1975. https://doi.org/10.1177/00491241231156971
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310