What is test-retest reliability and why is it important?

Posted on 15 September 2016 in Research

Operational Scientist Matthew Hobbs explores what test re-test reliability is, how you would measure it, and why it is important when choosing cognitive tests.


What is test re-test reliability?

When you come to choose the measurement tools for your experiment, it is important to check that they are valid (i.e. appropriately measure the construct or domain in question), and that they could also reliably replicate the result more than once in the same situation and population.

In an experiment with multiple time points, you would hope that the measurement tool chosen could consistently reproduce the same result across all visits, provided all other variables remain the same. Tools which do provide such consistency are regarded as having high test re-test reliability, and are therefore appropriate for use in longitudinal research.


Why is it important to choose measures with good reliability?

Having good test re-test reliability signifies the internal validity of a test and ensures that the measurements obtained in one sitting are both representative and stable over time. Often, test re-test reliability analyses are conducted at two time points (T1, T2) separated by a relatively short interval, so that any instability can be attributed to the test itself rather than to age-related changes in performance.

Without good reliability, it is difficult for you to trust that the data provided by the measure is an accurate representation of the participant’s performance rather than the product of irrelevant artefacts in the testing session, such as environmental, psychological or methodological factors.

Often your aim in research will be to evaluate the impact of an intervention on an individual’s performance. Without the confidence that the measure you’ve chosen is reliable, it is difficult to ascertain whether differences in performance pre- and post-intervention are genuinely due to the intervention provided and not an artefact of the tool.

A tool with low reliability can therefore mask the true effects of an intervention, which could have serious ramifications on the conclusions drawn, and therefore the future progression of that intervention.


How is test re-test reliability calculated?

Traditionally, the approach to assessing the reliability of scores has been to ascertain the magnitude of the relationship between the scores obtained at each time point. Thus, if a measurement tool consistently produces the same result, the relationship between those data points would be high.

To answer the question of relationship, researchers have often turned to calculating the correlation coefficient (r) which measures the strength of relationship. A measurement tool providing the same data output at every time point would therefore produce a perfect linear correlation of r = 1.

However, whilst it is useful to know the degree of relationship between the data points, the true question we are aiming to ascertain with test re-test reliability is the magnitude of agreement between the time points rather than the relationship.

When we use the same measure in the same population over T1 and T2, it is very possible to obtain a high degree of relationship as measured through the correlation coefficient, yet show a poor level of agreement (Bland & Altman, 1986).
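To make the distinction concrete, here is a minimal sketch in Python. The scores are invented for illustration (they are not CANTAB data), and `pearson_r` is a hypothetical helper written for this example. Every participant scores exactly 5 points higher at T2, so the correlation is perfect even though no T2 score agrees with its T1 counterpart.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Invented scores: T2 is T1 shifted upwards by a constant 5 points.
t1 = [10.0, 14.0, 18.0, 22.0, 26.0]
t2 = [score + 5.0 for score in t1]

r = pearson_r(t1, t2)
mean_shift = sum(b - a for a, b in zip(t1, t2)) / len(t1)

print(f"correlation r = {r:.3f}")
print(f"mean difference (T2 - T1) = {mean_shift:.1f}")
```

The correlation alone would suggest excellent reliability here, while the mean difference reveals a systematic shift between the two sessions.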

The question of ascertaining agreement between data points rather than the relationship can be answered through Bland and Altman’s (1986) statistical procedure which can summarise the lack of agreement through calculating the bias.

By calculating the difference between each participant's two scores, then the mean of those differences (the bias) and their standard deviation, we can assess how well the two sets of measurements agree. Plotting each difference against the mean of the pair shows how far the data points deviate from the line of equality (zero difference). If the differences are approximately normally distributed, we would expect about 95% of them to lie within two standard deviations of the mean difference; this range is known as the limits of agreement.
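As a rough illustration of the Bland and Altman (1986) calculation, the sketch below computes the bias and the two-standard-deviation limits for paired T1 and T2 scores. The function name `bland_altman`, its parameter `k`, and the scores are all assumptions made for this example; the scores are invented and are not CANTAB data.

```python
from statistics import mean, stdev


def bland_altman(t1, t2, k=2.0):
    """Return the bias and limits of agreement for paired scores.

    k is the number of standard deviations used for the limits
    (2 here, following the text; 1.96 is also commonly used).
    """
    diffs = [b - a for a, b in zip(t1, t2)]            # y-axis of the plot
    pair_means = [(a + b) / 2 for a, b in zip(t1, t2)]  # x-axis of the plot
    bias = mean(diffs)   # mean difference between the two sessions
    sd = stdev(diffs)    # sample standard deviation of the differences
    return {
        "diffs": diffs,
        "pair_means": pair_means,
        "bias": bias,
        "lower_limit": bias - k * sd,
        "upper_limit": bias + k * sd,
    }


# Invented scores for six participants at two visits.
t1_scores = [12, 15, 9, 20, 17, 11]
t2_scores = [13, 14, 11, 19, 18, 13]

result = bland_altman(t1_scores, t2_scores)
inside = [
    result["lower_limit"] <= d <= result["upper_limit"]
    for d in result["diffs"]
]

print(f"bias = {result['bias']:.2f}")
print(f"limits = {result['lower_limit']:.2f} to {result['upper_limit']:.2f}")
print(f"{sum(inside)} of {len(inside)} differences fall within the limits")
```

Whether limits of a given width are acceptable is not something the calculation can decide; that judgement depends on the measure and the purpose of the study.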


CANTAB test re-test reliability:

Many papers exist in the literature calculating the test-retest reliability of our CANTAB tests, with the overall conclusion demonstrating relatively good reliability (Lowe & Rabbitt, 1998).

However, the conclusions drawn from literature-based analyses rely heavily on the outcome measures the researchers chose to investigate, which are often those most relevant to their own research question.

It is therefore not appropriate to summarise a test re-test reliability conclusion for an overall CANTAB task based on the analysis of one outcome measure, but instead to assess multiple outcome measures, especially those applicable to the majority of research projects.

We are currently conducting an updated analysis of the test re-test reliability credentials of our tasks using these more appropriate statistical measures, across outcome measures more commonly recommended for experiments.

Some of our preliminary analyses are trickling in, and we are excited to share these results in the near future.

As a teaser, have a look at our latest analysis of the Paired Associates Learning task (PAL). The data below is generated using the PAL Total Errors Adjusted (PALTEA) outcome measure comparing two separate visits by the same participants (N = 45).

The Bland-Altman plot above is a special variation of a scatter plot. The x-axis represents the baseline value, which is equal to the mean of the T1 and T2 scores for each participant; the y-axis displays the difference between the two scores. The solid horizontal line represents the overall mean difference, while the dashed line stands for the “zero difference”: in the ideal case of perfect agreement between two methods, or, as in our case, two identical T1 and T2 scores, all the points would lie on this dashed line. The upper dashed line indicates 2 standard deviations above the mean, while the lower dashed line represents 2 standard deviations below it. Based on clinical and experimental considerations and goals, the scientist(s) have to define a priori acceptable limits for the Bland-Altman plot. Finally, the bar plots on the graph axes show us the score frequency distributions: the higher the bar, the greater the number of points with a given value in our dataset.

The data above is a brief snapshot of the analysis currently underway to re-confirm the test-retest reliability of our CANTAB tasks on our newest platform, CANTAB Connect. We are building upon the published literature, which already shows good test-retest reliability for CANTAB, and providing a more comprehensive body of work for all of our tasks across a wider variety of outcome measures. We will keep you updated with the progress of this project, and look forward to presenting more reliability data once we have finished crunching the numbers.

References:

Bland, J. M., & Altman, D. G. (1986). Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet, 327(8476), 307–310. doi:10.1016/s0140-6736(86)90837-8

Giavarina, D. (2015). Understanding Bland Altman analysis. Biochemia Medica, 25(2), 141–151.

Lowe, C., & Rabbitt, P. (1998). Test/re-test reliability of the CANTAB and ISPOCD neuropsychological batteries: Theoretical and practical issues. Neuropsychologia, 36(9), 915–923. doi:10.1016/s0028-3932(98)00036-0