A think out loud post - Showing causality - Randomized control trials, Difference in difference, Instrumental variables & Mendelian Randomization
Warning: This is a thinking-out-loud post, where I try to elucidate my understanding of statistics and of ways of showing evidence for causation. However, I am not a statistician and took only minimal research methods courses years ago, and such matters are complicated and subtle, so do take any of my musings here with a larger pinch of salt than usual.
Like most librarians I am familiar with the saying "correlation is not causation", but what does it actually take to show evidence of causation rather than just correlation?
We see so many library impact studies that calculate correlations between library use (typically electronic resource usage) and "student success" (often defined as GPA), and when talking about such studies we will gamely intone the mantra "correlation is not causation", but then act as if saying it settles the matter and go no further.
So what does it take to show causation? Even if we don't do such "learning analytics" studies, whether out of concerns about lack of statistical power or on ethical grounds, we are still often asked to make decisions on whether to implement a certain tool or policy, and this often hinges on knowing the root of the issue.
Conversely, when we implement a policy change and observe a (hopefully desired) change, how do we know if something we introduced really caused it?
Sad to say, while I have some basic knowledge of statistics, showing causation is an area I was not trained in, so after some study this is my attempt to explain why causation isn't easy to show and to talk about a few methods that begin to show how it might be done.
As always, my understanding is not 100% correct. You have been warned!
What is required to show causation?
It is suggested that causation exists when
(1) there is association(correlation) between independent and dependent variables
(2) the variation in the independent variable happens before the change in the dependent variable
(3) nonspuriousness of the association - essentially the relationship is not driven by a third omitted/extraneous variable - this is the main reason why we say "correlation does not imply causation" (see below).
Once you have shown these three factors exist, if you can also identify a mechanism by which the causation can happen, you further strengthen the evidence for causation.
The gold standard - Randomized Controlled Trials (RCT)
There are many reasons why correlation does not imply causation but one of the problems commonly seen with correlation in the library context is we often run into spurious relationships where there are confounding variables.
For example, most studies (at least those that are published!) easily find a moderate correlation between electronic resource use and higher GPA. Assuming such studies are conducted properly with sufficient statistical power, and ignoring problems like the file drawer problem, does that mean electronic resource usage causes higher GPA?
Unfortunately this in my opinion is unlikely.
Instead probably what is happening (at the very least) is we are not accounting for a hidden/omitted variable called "motivation".
From my personal observation motivated/capable students tend to take an interest in trying out electronic resources (I see this with students requesting restricted databases just to try it out for example) and their motivation also causes them to work hard and do well in GPA.

As such, the simple two-factor correlation we see between electronic resource usage and higher GPA could actually be driven more by the motivation of the student than by the use of electronic resources itself.
You could say: simple, why don't we construct a model that takes into account their motivation/ability with some suitable proxy and control for that? Besides the difficulty of measuring it, even if you somehow did, can you be sure your model accounts for every other possible factor out there that might interact and confound the relationship we are trying to tease out? After all, the causal model I propose above is just one of many possibilities.
So it seems causation is really hard to show, but this is where the concept of control groups and randomized controlled trials (RCT) comes in.
This idea is explained typically in the context of a medical trial of a new drug or treatment.

The way to do an RCT is to get a sample of patients, then randomly assign them (say by rolling a die or flipping a coin) into a control group and a treatment group. We then leave the control group alone, apply a treatment to the other group, and then measure any differences between the groups.
Ideally these patients should also be representative of the population at large for the results to be generalizable, but this isn't strictly necessary to show causation for the sample.
Why is random assignment important? Firstly, if people were allowed to decide which group they would be in that would bias the result. For example, say you wanted to see the impact of librarian workshops on GPAs, and you allowed students to self-select whether to attend, obviously this would cause problems if motivation to attend workshops was correlated with motivation to work hard for GPAs.
So clearly, you as the experimenter need to assign your subjects to the two groups.
But using what criteria? If you use any systematic non-random criterion such as age, gender, or even alphabetical order of names for assignment to groups, who is to say that criterion won't correlate with other variables, leading to similar problems?
You can now see why assignment should be as random as possible, say by flipping a coin, to ensure (hopefully) that membership of a group isn't linked to any systematic factor other than random chance.
If you have done this correctly, in a large enough sample the two groups, control and treatment, should in aggregate be pretty similar, so the main difference between them is the treatment.
Normally you would also run descriptive statistics after the assignment to ensure the two groups are more or less similar in various respects, e.g. roughly the same age and gender distributions.
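The shuffle-and-split plus balance check above can be sketched in a few lines of Python. All the numbers below are made up for illustration:

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical subject pool: (age, prior GPA) pairs, invented for illustration
subjects = [(random.randint(18, 25), round(random.uniform(2.0, 4.0), 2))
            for _ in range(1000)]

# "Coin flip" assignment: shuffle the pool, then split it in half
random.shuffle(subjects)
control, treatment = subjects[:500], subjects[500:]

# Balance check: with random assignment, group averages should be close
print(mean(a for a, _ in control), mean(a for a, _ in treatment))
print(mean(g for _, g in control), mean(g for _, g in treatment))
```

With a sample this size the group means of age and prior GPA come out nearly identical, which is exactly what random assignment buys you.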
For a real-life library-related example, see The Citation Advantage of Promoted Articles in a Cross‐Publisher Distribution Platform: A 12‐Month Randomized Controlled Trial; see also an evaluation of the research here
Difficulty of RCT
The reality is that RCTs are often difficult to use because in the real world we usually don't have the luxury of running experiments. Even if we could, it would often not be ethical.
For example, say you wanted to study the effects of electronic resource use on GPA. Could you really run an RCT by randomly blocking some students from accessing electronic resources (the treatment) for a semester, then measuring the average GPA of the treatment group vs the control group?
While this would give pretty strong evidence of causation if an effect was found, it's obviously not doable in practice. Similarly, you can't test the impact of library workshops by arbitrarily turning away students on a coin toss.
The only way an RCT could work in our library context is if you exploit some event or happening that already naturally results in such a segregation; this is often called a "natural experiment".
Here's an example. In my University, 80% of freshmen (excluding the law school and some of the best students) have to attend a compulsory writing course which has a heavy library component.
Due to the size of the cohort, half of the freshmen (around a thousand students) are assigned by the University to take this course in semester 1 and half in semester 2.
Assuming that this assignment is random (or at least randomized enough), one could divide the students into a control group (those who did the course in semester 2) and a treatment group (those who did the course in semester 1) and see if the writing course had an impact on overall academic performance in semester 1, perhaps proxied by overall GPA (excluding the writing course GPA).
If any effect was found, this would provide quite strong evidence of the impact of library workshops, and if one took the whole population of freshmen rather than a sample, it would be generalizable to the whole University (at least to the 80% who had to take the writing classes).
Quasi-experiment design
So what can we do if we can't do random assignment for experiments or can't find natural experiments that fit those criteria?
Then we are in the realm of quasi-experiment designs where the "treatment" isn't randomly assigned.
In such situations you can still show some evidence of causality but it's in a weaker form.
There are multiple methods, such as propensity score matching (matching members of the treatment group to appropriate control group members, based on a logistic regression that predicts group membership) and regression discontinuity design (studying effects just above and below a certain threshold), but in this blog post I will talk about only Difference in Difference (DID) and instrumental variables (IV).
Difference in Difference (DID)
I recently was asked a question. Did adding ebook versions of hard copy textbooks in course reserves this semester (semester 2) cause print copies usage to decline (from semester 1)?
One could easily, of course, show that usage of the print books (with ebook versions) declined this semester compared to the last.
But we also know that print circulation, even of textbooks, has been declining over the years, so how can we be sure this decline wouldn't have occurred anyway? Put another way, how much of the decline was caused just by introducing e-copies?
Another problem is that when we measure across periods, other changes occur that affect usage; how do we account for these other effects?
So we need a way to capture the trend that would have happened even without the change.
The method we can use is DID (Difference in Difference).
The idea is that you measure another, control group and assume the trend in the change for the control group approximates the trend in the change for the experimental group, which allows you to simulate the change that would have happened even if you had not applied the treatment.

In my example, one could define a control group, say all textbooks in course reserves that didn't have an e-copy. (Note: unlike in the above graph, T2 is a decline in usage compared to T1.)
One could then calculate the mean usage for both groups for both periods.

So in my example above, for the treatment group (textbooks that had an e-copy), mean print usage dropped by 25/35 = 71.4%.
On the other hand, for the control group (textbooks that had no e-copy), mean print usage dropped by 10/30 = 33.3%.
Intuitively, we can now account for the impact of the natural decline in print usage using the control group, and the difference in difference (DID) effect of adding ebook versions to existing print copies would be 71.4% − 33.3% = an extra decline of about 38 percentage points.
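The arithmetic above is short enough to reproduce in a few lines of Python. The before/after usage figures are the made-up means from the example (treatment: 35 down to 10; control: 30 down to 20):

```python
# Hypothetical mean print usage per group (made-up numbers from the example):
# treatment = textbooks that got an e-copy, control = textbooks that did not
treat_before, treat_after = 35, 10
ctrl_before, ctrl_after = 30, 20

treat_decline = (treat_before - treat_after) / treat_before  # 25/35, about 71.4%
ctrl_decline = (ctrl_before - ctrl_after) / ctrl_before      # 10/30, about 33.3%
did = treat_decline - ctrl_decline  # the extra decline attributed to e-copies

print(f"{treat_decline:.1%} - {ctrl_decline:.1%} = {did:.1%}")
```

The control group's decline stands in for the decline the treatment group would have seen anyway, so only the leftover difference is attributed to the e-copies.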
Now that we have an intuitive sense of how DID works, let's look at the proper way of doing it with a regression.
Essentially you construct a worksheet of data that looks like this.

You need
Column for print usage = Y
Column for whether it is control or treatment = S; S = 1 if treatment, S = 0 if control
Column for whether it is from semester 1 or 2 = T; T = 0 if semester 1, T = 1 if semester 2
You then construct a multiple linear regression:

Y = β0 + β1·S + β2·T + β3·(S×T) + ε

S and T are defined as above (dummy variables). The key is to look at the coefficient for S*T, the interaction term for S and T, aka the effect of being in the treatment group and in the second period at the same time.
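Because S and T are dummies and the model includes their interaction, the OLS estimate of the S*T coefficient works out to exactly the difference in differences of the four group means, so you can sanity-check any regression output by hand. A minimal stdlib sketch, with invented usage numbers:

```python
from statistics import mean

# Made-up print-usage observations keyed by (S, T):
# S = 1 treatment (has e-copy), T = 1 semester 2
data = {
    (0, 0): [32, 28, 30],  # control, semester 1
    (0, 1): [21, 19, 20],  # control, semester 2
    (1, 0): [36, 34, 35],  # treatment, semester 1
    (1, 1): [11, 9, 10],   # treatment, semester 2
}
m = {cell: mean(values) for cell, values in data.items()}

# In the saturated regression Y = b0 + b1*S + b2*T + b3*(S*T), the OLS
# estimate of b3 equals this difference in differences of cell means:
b3 = float((m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)]))
print(b3)  # -15.0: the extra decline attributable to the e-copy
```

This identity holds because a regression on two dummies and their interaction simply reproduces the four cell means; the regression's added value is the standard error and p value it gives you for b3.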
You can use whatever your favorite tool is for the linear regression; below I show the use of JASP, a powerful open-source statistical tool.

You can see T*S the interaction term added in the model terms.
How to add any desired interaction effect to your analysis in JASP #stats #interactioneffect pic.twitter.com/DHeFfDM2me
— JASP Statistics (@JASPStats) June 17, 2018
Here are the results from JASP using sample data

So you can see from the table above that even though the standardised coefficient is -0.118 (in the right direction), the p value is not below 0.05, so the results are not quite statistically significant.
One thing to note is even if the results were statistically significant, the evidence for causation using DID method is weaker than in randomised control trials.
As noted in Wikipedia, "Although it is intended to mitigate the effects of extraneous factors and selection bias, depending on how the treatment group is chosen, this method may still be subject to certain biases (e.g., mean regression, reverse causality and omitted variable bias)."
In our example, imagine the problem if the physical textbooks we got e-copies for were chosen because they were the most popular textbooks, while the control consisted of all other textbooks; regression to the mean alone could then produce a larger drop in the treatment group.
Using Instrumental variables(IV)
A much more advanced and difficult-to-understand concept for showing evidence of causation when RCTs are not possible is using instrumental variables (IV).
This method is used a lot in economics and epidemiology, so some of the terminology might be a bit technical.
Imagine an independent variable X and a dependent variable Y.
You are looking for an instrumental variable Z that
(1) is strongly correlated with X
(2) affects Y only through its effect on X and not otherwise
(3) is not affected by other variables in the system, including Y - aka Z is an exogenous factor.
If this seems confusing don't panic.
But the idea is simply to find a variable Z (the instrumental variable) that affects X, which in turn affects Y.
Importantly, Z is unaffected by anything else and only affects Y through X. So by manipulating Z and looking at what happens to X and ultimately Y, you can see if X causes Y.

Try this example
Imagine X = Depression and Y = Smoking.
Does depression (X) cause smoking (Y)? It's tempting to say yes by looking only at X and Y, but perhaps smoking (Y) causes depression (X) as well? Or another possibility: being stressed (an omitted variable) causes both depression (X) and smoking (Y).
So how do we disentangle this?
The trick here might be to consider the instrumental variable (IV) lack of job opportunities (Z).
It is reasonable to believe lack of job opportunities (Z) could lead to depression (X) - (1).
It is also reasonable to believe lack of job opportunities affects smoking (Y) only via depression (X) - (2).
Most importantly, it is reasonable to believe lack of job opportunities (Z) is not affected by smoking (Y) or any other relevant causal factors - (3).
Side note: "it is reasonable to believe" is code for "there is a lot of literature studying the correlation between these variables".
So by looking at differences in lack of job opportunities (Z), we can see if X actually causes Y.
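This intuition can be checked on simulated data. Below is a toy stdlib sketch (all variables and effect sizes are invented): a hidden confounder U drives both X and Y, so a naive regression of Y on X is biased, while the simple IV (Wald) estimate, cov(Z, Y) / cov(Z, X), recovers the true effect because Z is independent of U:

```python
import random
from statistics import mean

random.seed(42)
n = 20000
true_effect = 2.0  # the causal effect of X on Y we hope to recover

Z = [random.gauss(0, 1) for _ in range(n)]  # instrument (exogenous)
U = [random.gauss(0, 1) for _ in range(n)]  # hidden confounder
X = [0.8 * z + u + random.gauss(0, 0.5) for z, u in zip(Z, U)]
Y = [true_effect * x + u + random.gauss(0, 0.5) for x, u in zip(X, U)]

def slope(xs, ys):
    """OLS slope of ys on xs: cov(xs, ys) / var(xs)."""
    mx, my = mean(xs), mean(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

naive = slope(X, Y)             # biased upward: U pushes X and Y together
iv = slope(Z, Y) / slope(Z, X)  # Wald/IV estimate, close to true_effect
print(round(naive, 2), round(iv, 2))
```

The naive slope lands noticeably above 2.0 because of the confounder, while the IV estimate lands near 2.0, which is the whole point of the technique.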
Library Scenario - Instrumental variables
Let's apply this to the earlier library scenario. How do we show usage of library databases causes higher GPA?
Let's say we are looking only at specific databases that can be used only in the library, say Bloomberg terminals.
Again, if we just compute the correlation of Bloomberg database use with GPA, we will likely run into various confounding factors such as motivation/ability.
The trick here is to study an instrumental variable - "Dorm proximity to the library".
Assume students are randomly assigned dorm rooms with differing proximity to the library. Given that proximity to the library is correlated with use of in-library databases such as the Bloomberg databases, "dorm proximity to library" looks like a potential IV.
More importantly, this IV is not expected to affect GPA except via the intermediate use of the (in-library-only) databases, and it itself is not affected by any other factor.

"Dorm proximity to library" might be a good IV if the above holds
The tricky part of this technique is that there isn't always a good IV. In the above example, we assumed proximity to the library did not affect GPA except via use of the Bloomberg databases.
One could easily argue this isn't true, since proximity to the library could also lead to increases in GPA via increased consultations with librarians, professors etc. (omitted variables), so even if you see an effect, it might be due to other factors.

"Dorm proximity to library" is not a good IV if the above holds
Mendelian Randomization
The main problem with the IV technique is that it is very hard to think of a good IV that affects the dependent variable only through the variable of interest and is itself not related to other variables.
But in the last ten years, the field of epidemiology has hit on a clever idea called Mendelian randomization, which is basically the IV technique, except using genetic variants as IVs!
The thinking is as follows: which genes we inherit can be seen as a purely random process that isn't affected by any decisions/choices we make in life, which avoids confounding factors and of course helps a lot for a potential IV.
If we can identify suitable genetic traits to study, we can compare different groups of people based on those traits. One way is to see this as a natural randomized trial, but a better way is to treat the gene variants as IVs.
Here's a real example of how it is used.
You might have read articles claiming that drinking moderate amounts of red wine reduces heart disease. But the problem is that the evidence for this usually comes from cohort studies, where large numbers of people were observed and differences in drinking habits compared, not from proper randomised controlled trials where people were assigned to groups that drank moderately and groups that didn't.
As such, even though researchers found a correlation between drinking wine and living longer, there might be confounding factors. For example, some suspected that people who were ill tended to cut down on drinking, which would explain why the group that drank more (being generally healthier to begin with) lived longer!
But how do we see if this theory is true, since we can't usually force people to drink or not drink wine?
The trick is that researchers identified genes which reduce tolerance to alcohol, so most people with those genes do not drink much.
Assuming the genes affect health only via drinking habits, which genes people carry is a good IV.
How does this apply to our context, is there likely to be a gene associated/correlated with using library electronic resources?
Does schooling cause myopia? Genes for educational attainment?
This seems absurd on the surface but a study using Mendelian randomization entitled - Education and myopia: assessing the direction of causality by mendelian randomisation caught my eye.
It has been known for years that there is a correlation between level of educational attainment/years of schooling and myopia, but of course we can't be sure there is causation, due to many possible confounding factors.
As the article states
"It is not known with any certainty whether more years in education causes myopia, children with myopia spend more time on near work leading to better educational outcomes, children with myopia are more intelligent, or, indeed, an association with another confounding factor, such as socioeconomic position, leads to more years in education and myopia"
The study then uses gene variants/alleles associated with myopia and, more amazingly, with "educational attainment" (measured in years of schooling) to do a bidirectional Mendelian randomization, where they study whether educational attainment causes myopia and, for good measure, the reverse: whether myopia causes educational attainment.
First they studied the impact of the first IV (alleles that affect myopia; they used the top 50 variants found in this study and calculated a single weighted "allele score") to assess the impact of myopia on education.
This is to see if having myopia causes people to attain more education, e.g. if you are more nearsighted, that might make you more inclined to stay in school longer. They found this was not true.

Mendelian randomization showed increased myopia does not cause one to have more years of schooling
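A weighted allele score just multiplies each variant's dosage (0, 1 or 2 copies carried) by its effect-size weight from a genome-wide association study and sums, collapsing many variants into one number to use as the IV. A toy sketch (the variant names and all numbers are invented):

```python
# Hypothetical per-variant effect sizes (weights) from a GWAS, and one
# person's allele dosages (0, 1 or 2 copies of each variant).
# All names and numbers here are invented for illustration.
weights = {"rs_a": 0.11, "rs_b": 0.07, "rs_c": 0.02}
dosages = {"rs_a": 2, "rs_b": 0, "rs_c": 1}

# The weighted allele score collapses many variants into one number,
# which is then used as the instrumental variable in the analysis.
allele_score = sum(weights[v] * dosages[v] for v in weights)
print(round(allele_score, 2))  # 0.24
```

In the actual study the same idea is applied over 50 (or 74) variants per person, giving each person a single score whose only route to the outcome is, one hopes, through the trait of interest.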
Secondly, they studied the impact of the second IV (alleles that affect educational attainment/years of schooling; they used the 74 identified in this study and calculated a single weighted "allele score") on myopia.
And indeed they found that, when measuring the impact of educational attainment via the alleles correlated with education (using the single weighted allele score), people with more education tend to have worse myopia.

Mendelian randomization showed increased educational attainment does cause myopia.
So what's the catch here?
We earlier discussed the caveats of using IVs, and these apply to Mendelian randomization too.
Researchers think that which genes we have may not be totally random, and worse yet, genes are complicated things we barely understand. A gene may affect a multitude of behaviours and outcomes, which is known as "horizontal pleiotropy", and this can of course reintroduce confounding.
Notice this is different from "vertical pleiotropy", which is okay for an IV: the gene has an indirect effect. For example, a gene might be associated with high cholesterol, which then increases your chance of a stroke. In this case, the gene might still be suitable as an IV.

Vertical pleiotropy - gene variant is appropriate as an IV
On the other hand, if the same gene directly causes higher stroke chances as well as high cholesterol, independently of its effect on cholesterol, you should see why it can't be used as an IV if you followed the earlier discussion.

Horizontal pleiotropy - gene variant is not appropriate as an IV
The part that blew my mind was that they were able to identify gene variants associated with educational attainment (years of schooling). I suppose this means, in theory, that if we studied a large enough number of people, we could find gene variants associated with library-related activities like visits to the library, use of databases, etc.?
Still mulling over this.
Conclusion
It's my first 2020 blog post, so this is my attempt to stretch and learn something new by explaining what I read to myself. No doubt this post is riddled with errors, so feel free to comment....

