4 interesting things about bibliometrics that I recently learnt
I've always had an interest in bibliometrics, and while I have served on teams supporting bibliometrics in the past (2010-2015) and enjoy reading and skimming bibliometrics papers, I have never officially held the title of a librarian supporting bibliometrics.
Still, based on the 2021 Competency Model for Bibliometric Work by the LIS-BIBLIOMETRICS people in the UK, I self-assess my own knowledge and skill level as roughly "Advanced" in both "Knowledge in the field" and "Technical skills", and even at the same level for "Responsibilities and tasks".

This is to say, while I think my understanding of this area is fairly decent for an academic librarian, I don't quite measure up to the top librarian practitioners who are absolute stars in this space (e.g. Elizabeth Gadd, Bianca Kramer etc.), nor am I by any means a bibliometrician.
Yet, the more I study bibliometrics, the more I realize it is a very deep topic, and it's easy to fool yourself into thinking that because you can mostly explain how common bibliometric indicators are used, and/or are familiar with common bibliometric tools like Web of Science/InCites, Scopus/SciVal and VOSviewer, you are an expert.
Below are 4 different "epiphanies" I have had this year alone that made me realize really understanding bibliometrics is not that simple...
1. A size-independent metric like mean normalized citation score (MNCS), % of highly cited articles etc. may not fully account for size/resource differences.
As I am from a smaller institution, I am acutely aware of how comparing institutions of different sizes with indicators can be tricky, and I am careful to look at size-normalized or size-independent metrics when doing comparisons.
Taking a simple example, the total number of citations to an institution's output is clearly a size-dependent metric. The bigger you are, the more likely you are to have a higher total citation count. In contrast, measures like the average citation count per output, or the % of output at or above the 95th percentile by citations, are size-independent: when you compare a bigger institution against a smaller one, there is no advantage to being bigger.
The same thing occurs for metrics at the journal level. The Eigenfactor is an overall measure of the importance/prestige of a journal as a whole. This is a size-dependent metric: the more articles a journal publishes, the higher its Eigenfactor score tends to be.
A very similar type of metric to Eigenfactor is SJR. Both are network-based metrics where citations are weighted by their importance, unlike the standard Journal Impact Factor, where one citation is "worth" the same no matter where it comes from. (See a comparison of the differences between SJR and Eigenfactor.)
To get a size-independent metric, you would have to convert it to an "Article Influence" score, which measures the average influence per article.
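(Roughly speaking, as I understand the definition, Article Influence = 0.01 × Eigenfactor ÷ the journal's share of all articles in the database over the same window, scaled so that the average article scores 1.)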
That is why the Journal Impact Factor, which is a size-independent metric (there is no advantage to being particularly big in terms of publishing more articles), is more properly compared with Article Influence scores than with Eigenfactor.
So far so obvious.
Do size-independent metrics account for everything?
The key point here is that to convert a measure to a size-independent metric, you need to normalize it by some number that serves as a proxy for size, and most of the time you just divide by output, which is typically the number of publications. This solves the issue of size. But does it?
In the paper A farewell to the MNCS (mean normalized citation score) and like size-independent indicators, the authors point out the following issue.
Imagine the following scenario, where Institution A and B are being compared.
Assume I told you Institutions A and B had the exact same "resources", including funding and number of researchers, and after a year,
Institution A has produced 100 articles each with 10 citations,
while
Institution B has produced 100 articles each with 10 citations and an additional 100 articles with 5 citations.
To keep things simple, I'm going to assume all the citation counts are already normalized by field and/or any other relevant factor, such as type of article.
If you calculate, say, the average citations per output (a size-independent metric), you get:
Institution A = (10 × 100) / 100 = 10
Institution B = ((10 × 100) + (5 × 100)) / 200 = 7.5
So based on average citation per output Institution A does better.
But wait a moment - look at it from a common-sense/economics viewpoint. Institution B in fact produced the exact same output as A (100 papers with 10 citations each) and even produced additional papers which, while not as highly cited, did have some impact!
So how can you say Institution B did worse?
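To make the paradox concrete, here is a quick sketch in Python using the toy numbers above:

```python
# Toy numbers from the scenario above (citations already field-normalized)
inst_a = [10] * 100               # 100 papers with 10 citations each
inst_b = [10] * 100 + [5] * 100   # A's exact output PLUS 100 papers with 5 citations

mean_a = sum(inst_a) / len(inst_a)
mean_b = sum(inst_b) / len(inst_b)
print(mean_a, mean_b)  # 10.0 7.5 - B "loses" despite producing strictly more impact

# The paradox in one line: any extra paper cited below the current mean
# drags the average down, even though total impact can only go up.
print(sum(inst_a + [9]) / 101)  # ~9.99, below A's original 10.0
```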
Here's a quote from the paper:
Basic Economic reasoning confirms that the better performer under parity of resources is the actor who produces more; or under parity of output, the better is the one who uses fewer resources. Indeed the MNCS, the proportion of HCAs (Highly cited articles), and all other size-independent indicators based on the ratio to publications are invalid indicators of performance, because they violate an axiom of production theory: as output increases under equal inputs, performance cannot be considered to diminish. Indeed an organization (or individual) will find itself in the paradoxical situation of worsened MNCS ranking if it produces an additional article, whose normalized impact is even slightly below the previous MNCS value. (bold emphasis mine)
Notice the problem here isn't that we used average citations per output. If you switched to another typical size-independent measure like % of highly cited articles (i.e. at or above the 95th percentile), FWCI etc., you would get the same problem, because at the end of it you are essentially just normalizing by publication output and not by the actual resources used. And since what you are really measuring here is productivity, you need inputs as the denominator.
When I first read this, I was still feeling the itch of confusion. I suspect my problem was that when we say "size", we may mean different things. Certainly the "size" of an institution can be seen as its output in terms of the number of papers produced. I was also carrying an unexamined assumption that the number of publications produced by an institution would indeed be proportional to the resources poured in. These two factors combined meant I expected such size-independent metrics to measure productivity.
How do the authors suggest solving this issue? They recommend measures like "Fractional Scientific Strength" (FSS), a more complicated measure that assesses the productivity of institutions by including input factors like the number of research staff (and even wage levels).
This led to quite a few bibliometric researchers responding to this proposal, but I think the response here is a good one.
First, they summarize and acknowledge the issue:
The key element in the criticism of AA is that commonly used scientometric indicators of scientific performance do not take into account the productivity of research unit. Indicators such as the MNCS are obtained by calculating the total field normalized number of citations of the publications of a research unit and by dividing this number by the number of publications of the research unit. These indicators provide a proxy of the average scientific impact of the publications of a research unit, but they do not take into consideration the productivity of the research unit. An indicator of productivity can be obtained by dividing the number of publications of a research unit by the number of researchers, or alternatively, by the amount of money spent on research.
They then go on to explain why this isn't usually done.
The obvious reason is that to do this you need input data (notice that the number of publications is an output figure you almost always have, by definition), and it is extremely hard to get all this information when comparing institutions. To do this at the international level, you would need not just the number of researchers, but also, in detail by field, how much of their jobs are devoted to research vs teaching, etc.
Beyond this reason, there's an even bigger lesson. They remind us that scientific performance should be seen as a multidimensional concept, and one should not expect any single metric, even a normalized one, to capture it.
There are times when you might be interested in a size-dependent measure, for example a researcher trying to figure out which institution in Singapore made the most research impact in a field, regardless of the size of the institution.
On the other hand, if you are a policy maker, say a government looking into efficiency and effectiveness, you probably want productivity measures, which require input data (e.g. the number of researchers in each field). The tricky bit here is that while you can easily get size-independent measures that normalize by publication output, these don't actually measure productivity. You should be clear about what you are doing when you use such measures, and not just assume all size-independent measures are the same.
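To round off the point, here is the same toy scenario normalized by an input instead. The figure of 50 researchers is entirely made up, and this shows only the core idea of a productivity measure, not the actual FSS formula (which also weights by field, fractional contributions and wages):

```python
# Hypothetical input figure: suppose both institutions employ 50 researchers
researchers = 50

total_a = 10 * 100             # 1000 normalized citations from 100 papers
total_b = 10 * 100 + 5 * 100   # 1500 normalized citations from 200 papers

# Size-independent (per paper): A wins
print(total_a / 100, total_b / 200)  # 10.0 7.5

# Productivity (per unit of input): B wins, matching the economic intuition
print(total_a / researchers, total_b / researchers)  # 20.0 30.0
```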
2. Elsevier's CiteScore formula changed substantially in 2020.
Elsevier launched CiteScore in 2016, and it was widely seen as their version of Clarivate's Journal Impact Factor (JIF).
There were some differences of course, such as:
a 3-year citation window (the traditional JIF uses 2 years, and there is also a 5-year JIF variant)
differences in what counts in the numerator and denominator (CiteScore in 2016 included all document types in both, while JIF counts citations to all document types in the numerator but only "citable items" in the denominator)
and obviously the database/citation index used was different
But basically the core idea was the same as JIF: you take citations from ONE year to papers published within the citation window (3 years in the case of CiteScore in 2016, and 2 years in the case of JIF).
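For example, reconstructing from the definitions above, the original CiteScore for 2015 would be:

CiteScore 2015 = (citations received in 2015 to documents published in 2012-2014) ÷ (number of documents published in 2012-2014)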

For the JIF of the same year, you would have a similar formula, except it would be citations in 2015 to papers published in 2013 and 2014.
But in 2020, Elsevier totally revised the CiteScore formula. One relatively minor change was to tighten up what types of documents count in the numerator and denominator: it used to be any type, but it is now restricted to only the following publication types: articles, reviews, conference papers, data papers and book chapters.
But the biggest change was this: not only has the publication window widened to 4 years (instead of 3), but they also now count citations from 4 years (not 1).

In other words, to get the CiteScore for 2021, you count all the citations received in 2018-2021 by all the documents published in those same years!
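As a formula (again my reconstruction):

CiteScore 2021 = (citations received in 2018-2021 to documents published in 2018-2021) ÷ (number of documents published in 2018-2021)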
I don't have sufficient expertise in bibliometrics to say for sure whether this is better than just using one (current) year's worth of citations, but it annoys me because I think it is going to add to the confusion over how JIF works.
As a side note, the Journal Impact Factor must be one of the most confusingly marketed concepts ever created in bibliometrics. Everybody seems to have heard of JIF but is confused about a multitude of things. I have seen plenty of people confused about:
where to get it - Journal Citation Reports isn't a household name compared to JIF/Web of Science
the difference/relationship between the Web of Science Core Collection (SCIE, SSCI etc.) and Web of Science (don't get me started on people still calling it Web of Knowledge)
the fact that the JCR 2021 release gets you the JIF for 2020, and the JCR 2022 release gets you the JIF for 2021
the fact that if you looked up the JIF in early June 2022, it was still "2 years behind" and the latest you could get was JIF 2020 (due to a combination of confusion #3 and the fact that people never imagine it takes 6+ months for the past year's results to come out!)
Moreover, Clarivate added to the confusion in 2021 by releasing yet another journal metric - the Journal Citation Indicator (JCI) - which looks to me like a normalized JIF using CNCI (somewhat similar to SNIP - Source Normalized Impact per Paper?).
Side note: given how JCI works, it is also effectively using 4 years of citations rather than 1.
But here's one more possible confusion: I have found that quite a lot of people are confused about how the Journal Impact Factor actually works. Some, even librarians, may have just a vague idea of JIF as an "average of a sort", and some might even think it works the way CiteScore 2020 works (multiple citing years to multiple years of publications).
3. Normalization isn't a magic bullet
As practitioners, we often face faculty who bemoan how unfair it is that their disciplines have lower citation rates, or we are asked by university administrators to find some magic metric that allows comparisons across all disciplines.
Hence the rise of "normalized" metrics at every level of aggregation. Everything from FWCI, CNCI, RCR and ways to normalize the h-index, to things like SNIP, SJR, Eigenfactor/Article Influence and, most recently, 2021's Journal Citation Indicator, seems to give us what we want (at least at the journal level).
But do they really work? In the first section above, we encountered a common case where a measure that seems to normalize by size may not actually do what we want - i.e. measure productivity.
I can imagine an unsuspecting administrator not thinking too hard, using "size-independent" measures based on dividing by output, and assuming this makes everything comparable.
But surely things like FWCI, which normalize against similar publications (determined by publication year, publication type and subject area), do exactly what we want?
Leaving aside the serious flaw that FWCI should not be used for small sample sizes, I suspect that how well normalization handles field differences is far from a settled matter (there are so many ways to do it, with many tricky technical details like how to delineate fields), and we should be careful of overselling metrics even if they have normalization built in.
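To see why those technical details matter, here is a minimal sketch of the basic idea behind an indicator like FWCI - my simplification, with made-up numbers; the real implementation works against the whole Scopus database and has many more wrinkles:

```python
from statistics import mean

# Each paper: (citations, (publication year, document type, subject area))
papers = [
    (12, (2020, "article", "economics")),
    (3,  (2020, "article", "economics")),
    (40, (2020, "article", "immunology")),
    (35, (2020, "article", "immunology")),
]

def expected_citations(group, papers):
    # "Expected" citations = average citations of all papers sharing
    # the same year/type/subject group
    return mean(c for c, g in papers if g == group)

for cites, group in papers:
    score = cites / expected_citations(group, papers)
    print(group[2], round(score, 2))  # 1.0 = exactly average for its group
```

Notice that reassigning even one paper to a different subject area changes the expected value for every paper in that group, which is one reason field delineation is such a contested detail.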
I'm not saying not to use these normalized citation indicators, just to be a bit more cautious and skeptical about presenting them as the holy grail that lets you compare across all relevant factors.
4. Why showing causation, and in particular gender bias, using bibliometrics isn't a straightforward matter
Warning: This section is based on my limited understanding of the complicated issues regarding causal inference.
I wrote in December last year that one of the things I was learning about was causal inference. Specifically, after watching a few YouTube videos, I discovered the book "The Book of Why: The New Science of Cause and Effect" by Judea Pearl and Dana Mackenzie.
While it was written as a "popular science" book, I was sucked in by the historical anecdotes and real examples, where an understanding of causal inference - particularly of colliders and the dangers of controlling for variables without thinking - would have clarified things faster.
I'm not going to try to explain concepts like colliders, confounders and mediators here, but essentially, without a causal model you might end up controlling for a collider, which causes a collider bias, misleading you into thinking an association or even a causal relationship exists when it doesn't.
At the time I was fascinated by this because I was interested in how you show causality beyond randomized controlled trials. But I didn't think about how this could apply to bibliometrics as well.
For example, the study of gender differences in academia using bibliometrics has become (more) trendy. This is obviously a politically sensitive area, with some results attracting a lot of scrutiny and controversy (see two examples here and here).
There has also been a lot of news on this lately, such as Nature's The rise of citational justice: how scholars are making references fairer, on getting researchers to be aware of a gender bias in citations.
One of my favorite bibliometricians, Ludo Waltman, suggests things are not so simple. In a blog post, The causal intricacies of studying gender bias in science, Vincent Traag and Ludo Waltman distinguish between
“gender bias”, which we propose to define as any difference between people with a different gender that is directly causally affected by their gender.
and "gender disparity"
to refer to any difference between people with a different gender that is causally affected by their gender.
Both terms are defined in terms of causal effects; the difference lies in the phrase "directly causally affected", which applies only to gender bias.
At the risk of misinterpreting their reasoning....
[Causal diagram: Gender directly affects acceptance by peer review.]
In the example above (assuming this is the causal model we believe is at work), there is a gender bias: gender directly affects whether a manuscript is accepted in peer review, and this causal effect is not mediated by another variable.
[Causal diagram: Gender affects the research question studied, which in turn affects acceptance by peer review.]
In the second example above, gender doesn't directly affect whether a manuscript is accepted; rather, gender affects the type of research question that is studied, which indirectly affects whether it is accepted in peer review. In other words, the effect of gender on acceptance is mediated by the research question.
Under their definition this is a gender disparity and not a gender bias on acceptance by peer review, because gender does not directly affect acceptance.
Note: this doesn't mean there isn't a causal effect - it's still causal in nature, just not direct - and this is still important to understand.
There might also be a bias in terms of research question affecting acceptance.
[Causal diagram: Gender affects quality of research, which in turn affects acceptance by peer review.]
Another possible causal model is shown above. Here gender again doesn't directly affect acceptance; rather, gender affects the quality of research, which affects acceptance.
Again, this is a gender disparity on acceptance and not a bias, as the effect of gender on acceptance is indirect, mediated through the "quality of research" variable.
Incidentally, you most likely won't say there is a bias of quality of research on acceptance, since a bias is usually considered an "unfair" causal effect, and if quality of research is the only thing affecting acceptance, that's how things should work and not a "bias"!
Why is this distinction important? Firstly, most studies do not attempt to distinguish between these scenarios and simply look at gender differences in acceptance or in citations. But why do we care whether gender is causing a "bias" or a "disparity" if either way it's causal in nature?
The reason is that depending on the actual cause, the solutions might be different. Which of the causal models above is shown to be true would heavily affect what we want to do about the outcome.
For example, if for some reason we have evidence that manuscripts from females are not being accepted because the quality is objectively weaker (for whatever reason), should we just insist on quotas on acceptance by gender without finding out why such manuscripts tend to be weaker? This would just result in far weaker papers getting accepted. Conversely, if the real reason females tend to be accepted less is that they tend to choose topics that are less popular (while producing equally high-quality work), the solution is again different.
I unfortunately missed the talk given by Ludo Waltman entitled "Are we all biased? The complexity of the diversity puzzle" but what I saw in his tweets intrigued me greatly.
A clear definition of bias is essential, which can be obtained by distinguishing between direct causal effects (biases) and indirect causal effects (disparities), in line with a suggestion made by @vtraag and me in a @LeidenMadtrics blog post https://t.co/xD4zrB7AQD @yudapearl pic.twitter.com/STtk0JTldH
— Ludo Waltman (@LudoWaltman) April 20, 2022
This led me to watch a talk by his collaborator Vincent Traag entitled "Causal intricacies of bias in the research system", which goes into more of the details.
In his talk, he refers to the following causal model.
[Causal diagram: Gender (X) and Quality (Q) both affect Journal (Z); Journal (Z) and Quality (Q) both affect Citations (Y); the question is whether there is also a direct effect of Gender (X) on Citations (Y).]
The tricky bit here is that if this causal model is true, it is going to be almost impossible to measure the direct causal effect of gender on citations.
In the model above, we want to control for the variable Z, Journal, because we want to measure the direct effect of Gender (X) on Citations (Y) without the indirect path through Z.
However, Z is also a collider between Gender (X) and Quality (Q), so controlling for Z leads to a collider bias: X becomes correlated with Q (a correlation that doesn't exist overall and appears only when we condition/control on the collider), and Q directly affects citations.
In other words, we are in trouble either way. The crux of the problem is that Q, quality, is unobservable, so we can't solve the issue by controlling for it.
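To see the collider problem in action, here is a toy simulation I put together. The numbers, including the 0.5 "gender effect" on journal placement, are entirely made up; the point is only the structure:

```python
import random

random.seed(0)
papers = []
for _ in range(100_000):
    female = random.random() < 0.5
    quality = random.gauss(0, 1)  # independent of gender by construction
    # Journal placement depends on BOTH gender and quality,
    # which makes Journal (Z) a collider between Gender (X) and Quality (Q)
    top_journal = quality + (0.5 if female else 0.0) + random.gauss(0, 1) > 1.0
    papers.append((female, quality, top_journal))

def avg_quality(subset):
    return sum(q for _, q, _ in subset) / len(subset)

# Unconditioned: no gender/quality association, as designed
print(avg_quality([p for p in papers if p[0]]),
      avg_quality([p for p in papers if not p[0]]))   # both ~0.0

# Conditioned on the collider (top journals only): a quality gap appears,
# which would contaminate any within-journal comparison of citations
top = [p for p in papers if p[2]]
print(avg_quality([p for p in top if p[0]]),
      avg_quality([p for p in top if not p[0]]))      # female average now lower
```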
Thinking about causal models can also help explain counterintuitive findings.
Earlier I linked to a controversial paper where the authors reported an unpopular finding, that
increasing the proportion of female mentors is associated not only with a reduction in post-mentorship impact of female protégés, but also a reduction in the gain of female mentors. While current diversity policies encourage same-gender mentorships to retain women in academia, our findings raise the possibility that opposite-gender mentorship may actually increase the impact of women who pursue a scientific career.
In other words, females who are mentored by other females do worse in terms of research impact (citations) than those who are mentored by men!
This paper was eventually retracted after many criticisms of its methods, but I think the major reason given in the retraction note was the problem of defining mentorship by co-authorship.
But as explained in the same talk, depending on your causal model, other explanations could exist for why this finding was seen.
[Causal diagram: Gender of mentor and Quality both affect staying in academia, A (i.e. having published X papers); Quality also affects Citations.]
Assume a causal structure as above.
The main issue with the retracted study above is that it studies only people who stay in academia (i.e. who have published X papers), so it is clearly conditioning on variable A.
But you can see from above that A is a collider. Controlling for it results in an "illusory" relationship between quality and mentor gender (one that wouldn't appear if you didn't condition on A), and of course higher quality leads to higher citations.
Intuitively, the idea here is that if female mentors tend to be more supportive than male ones, their mentees are more likely to stay on compared to those mentored by males. This means that if you look only at mentees who stay on, the ones mentored by females will on average tend to be less talented/lower quality than those mentored by males. Naturally, those lower-quality mentees will have lower citations.
As they discuss in their blog post, if this is indeed the true causal model, trying to reduce mentorship of women by women might make things even worse, particularly if the aim is to encourage more women to stay in academia!
Conclusion
The proper use of bibliometrics is not as easy as it seems at first glance, and there are numerous traps ready to ensnare the unwary and the over-confident.
Simply memorizing formulas for indicators and/or learning how to use commercial interfaces like Scopus, Web of Science, SciVal and InCites is just the first step...

