Estimating the number of papers published in potentially predatory journals by your institution or country - an attempt
With the rise of open access journals, and in particular the business model involving APCs (Article Processing Charges) where authors pay a fee upfront to make their papers open access, there is understandable worry about authors publishing in "potentially predatory journals", where publishers have an incentive to collect a fee without doing proper vetting before publishing the paper.
Definitions and terminologies of predatory publishing vary, but this recent definition published in Nature, thrashed out by "leading scholars and publishers from ten countries", defines it as follows:
“Predatory journals and publishers are entities that prioritize self-interest at the expense of scholarship and are characterized by false or misleading information, deviation from best editorial and publication practices, a lack of transparency, and/or the use of aggressive and indiscriminate solicitation practices.”
As noted, this definition is not fully agreed upon, but as I understand it, these are journals whose publishers generally accept a fee but do not provide proper vetting via peer review and other quality controls.
This is an area where there has been a ton of debate. For example, which journals deserve to be considered "predatory"? What is the size of the issue (which itself depends on what counts as predatory)? Are researchers who publish in such journals fooled into doing so, or are they doing it with their eyes wide open (i.e. to pad their CVs)?
While no one doubts such journals do exist, how big is the problem?
Do they really damage the integrity of our scholarly communication system, with researchers citing such dubious research? This recent study suggests articles in predatory journals are, if not ignored, generally cited a lot less (e.g. "60% of the articles in predatory journals attracted no citations, compared with just 9% of those in the peer-reviewed journals.")
While this issue is supposedly more serious in Asia and developing countries, a recent study has shown that 5% of 46,000 Italian researchers have apparently published in such journals. (I would add that this study uses Beall's list as its standard of "predatory", which might overstate the problem, though to be fair the article goes well beyond this: it surveys authors, which led to interesting comments, and tries to quantify the impact on performance review of publishing in such journals.)
As a librarian who is called upon to be knowledgeable about scholarly communication issues, I was curious: how big is this issue among authors in my institution and country (Singapore)? How do I find out?
Depending on the answer, this might lead to different courses of action, such as doubling down on raising awareness of the issue in workshops by librarians.
A plan for finding out
It seems to me this is a two-step process:
#1. Pulling out all papers you are interested in
#2. Checking whether any of them were published in potentially predatory journals
For #1, one way would be to pull all papers from my institution's CRIS/RIM (Current Research Information System / Research Information Management system) for study.
While this should be done, I have some doubts about how complete our data is, so instead I decided to see if I could use a search index and filter down to papers that belonged to us. The idea was to do a quick check rather than go for absolute accuracy.
Looking for a suitable index
Clearly, for our purposes we should be looking at the broadest index available rather than a selective one such as Web of Science or Scopus (though there are claims that even those selective indexes include a few potentially predatory journals; also see Are search results in library discovery really more trust-worthy? Of Predatory journals and Authority).
To be inclusive we would be talking about indexes in the neighborhood of 50+ million journal articles.
We will also need a way to filter by affiliation or country.
While Dimensions is a broad, inclusive index that allows you to export in bulk, its free version did not allow filtering by affiliation, and as I did not have access to the premium version, I ruled it out. Similarly, the large web-scale discovery indexes such as ProQuest's Summon and OCLC's WorldCat Discovery were ruled out because they do not have the ability to filter by affiliation.
Today we have more choices though, and among the other major indexes that have emerged, many draw heavily from two main sources - Microsoft Academic Graph (MAG) and Crossref.
The two result from fundamentally different processes. In the case of Crossref, metadata on articles is human-generated and deposited into Crossref when a Crossref DOI is registered. In the case of Microsoft Academic, the index is generated by crawlers harvesting data from the web, which allows it to pick up articles with no DOIs.
How inclusive are these indexes?
It is important to note that just because a journal article has a DOI, it does not mean the article is legitimate, since all an organization needs to do to get a DOI is to register with Crossref or any other DOI registration agency (e.g. DataCite).
Since most journals, whether legitimate or not, do have DOIs, using Crossref data or indexes based on it is a good start.
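As an aside, if you ever want to check which registration agency minted a given DOI, Crossref's REST API has an endpoint for exactly this. A minimal sketch in Python (the DOI shown is just a placeholder; substitute your own):

```python
import requests

def doi_agency(doi):
    """Return the registration agency (e.g. "crossref", "datacite") for a DOI."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}/agency")
    resp.raise_for_status()
    return resp.json()["message"]["agency"]["id"]

print(doi_agency("10.1234/placeholder"))  # placeholder DOI, substitute a real one
```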
But what about journals without DOIs?
Most indexes out there, such as Web of Science and Scopus, are generated from cleaned publisher feeds (which mostly cover articles that have DOIs).
On the other hand, Google Scholar and Microsoft Academic Graph generate their indexes using bots that crawl the web looking for academic content and harvest the resulting metadata.
This allows them to pick up academic papers that have no DOIs and/or are not in Crossref.
Since Google Scholar does not harvest institutional affiliation data, we are down to using Microsoft Academic Graph, which does have institutional affiliation data.
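(For completeness: at the time of writing, Microsoft also offers the Academic Knowledge API over MAG, which can query by affiliation directly. A rough sketch follows; treat the expression syntax and attribute names as my best reading of the API documentation, and note you need your own subscription key.)

```python
import requests

# Sketch of querying MAG via the Academic Knowledge API's evaluate endpoint.
# The expression syntax and attribute names below are assumptions based on
# the API documentation; "YOUR_KEY" is a placeholder subscription key.
URL = "https://api.labs.cognitive.microsoft.com/academic/v1.0/evaluate"
params = {
    "expr": "Composite(AA.AfN='nanyang technological university')",  # affiliation names are normalized to lowercase
    "attributes": "Ti,Y,J.JN",  # title, year, journal name
    "count": 100,
}
headers = {"Ocp-Apim-Subscription-Key": "YOUR_KEY"}

resp = requests.get(URL, params=params, headers=headers)
for entity in resp.json().get("entities", []):
    print(entity.get("Y"), entity.get("Ti"))
```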
The main question, though, is: does Microsoft Academic Graph already filter out predatory journals? On the face of it, given the number of records it returns, it does not seem to be selective like Scopus or Web of Science.
But do they really not filter predatory or suspect journals?
As I write this, there are two articles written by the Microsoft Research team that explain the nuts and bolts of Microsoft Academic:
A Review of Microsoft Academic Services for Science of Science Studies (a technical paper focused on the machine learning/AI aspects)
Microsoft Academic Graph: When experts are not enough (published in QSS, targeted at the bibliometrics community; I highly recommend reading this)
First off, we are told that Microsoft Academic takes an "article centric" as opposed to a "venue centric" approach to indexing publications, which is in the spirit of the San Francisco Declaration on Research Assessment (DORA) "and includes all articles from the web that are deemed as scholarly by a machine-learning-based classifier", such "that articles published in obscure venues will be found in MAG, including journals that some considered “predatory” in nature."
So this looks promising.
But we are told they use a "multipronged approach" to handle predatory journals. My limited understanding from reading "A Review of Microsoft Academic Services for Science of Science Studies" is that Microsoft calculates a "saliency" (think of it as importance) for the various entities in the research graph (not just articles but also authors, institutions, conference venues, journals, etc.), and this can be used to handle the problem.

Microsoft Academic ranking of journals by Salience
The obvious thing to do, of course, is to drop the journals with the lowest saliency. Indeed, we are told:
"MAG does not immediately report a newly discovered publication venue as an entity as soon as it is recognized by the automatic algorithm, until the saliencies of its publications have jointly exceeded a manually chosen threshold." - In fact, they mention that this is one of the few areas where they "exercise human intervention", a rare departure from their philosophy of relying on automated processes.
In the latest article, they reveal that, because calculating saliency is computationally expensive, they even do a first cut of the processing where "a principal component analysis (PCA) on the citation graph is taken and only the nodes corresponding to the largest component are selected for MAG".
The idea, it seems, is that predatory articles are rarely cited, so one can exclude most of them just by keeping the largest cluster. They do point out this has a drawback of excluding non-English material, because while such material cites English-language papers, the reverse is rarely true.
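To make that first cut concrete, here is a toy sketch using networkx. I am assuming the effect of the procedure is roughly "keep only the largest component of the citation graph"; the PCA machinery described in the paper is more involved than this.

```python
import networkx as nx

# Toy citation graph: an edge A -> B means paper A cites paper B.
G = nx.DiGraph([
    ("A", "B"), ("B", "C"), ("C", "A"),  # well-cited cluster
    ("D", "E"),                          # isolated island, e.g. papers only citing each other
])

# Keep only the largest (weakly) connected component; citation
# "islands" like D and E, typical of rarely cited venues, are dropped.
largest = max(nx.weakly_connected_components(G), key=len)
print(sorted(largest))  # ['A', 'B', 'C']
```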
So the upshot is that MAG does indeed try to algorithmically exclude dubious journals/venues.
Going for Lens.org
In any case, after looking at the various indexes, Lens.org quickly became the obvious choice of index.
Why? It has, among other data, Crossref, Microsoft Academic and PubMed data, which is probably as inclusive a coverage as you can get, covering both articles with DOIs (Crossref) and picking up the rest via harvesting (MAG). Coupled with unsurpassed filters (including affiliation and OA status data) and powerful bulk export and visualization features, it is a clear choice to start with.
What counts as potentially predatory
A lot of ink has been spilled over the issue of what counts as predatory, from the infamous and controversial Beall's list (which some researchers and librarians still swear by even though it is now defunct, though there are some archived versions), to the current commercial list, Cabell's.
While Beall's uses a blacklist approach, it is well known that you can use a whitelist instead in the form of DOAJ (Directory of Open Access Journals). DOAJ is one of the most reputable names out there, and after its cleanup a few years back, I estimate the chance of a predatory journal being listed there is pretty low.

https://blog.doaj.org/tag/doaj-criteria/
While Beall's list is a blacklist and DOAJ is a whitelist, Cabell's offers both a blacklist and a whitelist.
It is interesting to note that while there are 10,000+ titles in Cabell's whitelist that are not in DOAJ, and 11,000+ in DOAJ that are not in Cabell's whitelist (understandable, as the two have different coverage), there are actually 37 titles in Cabell's blacklist that are in DOAJ, showing there is some small disagreement between the two.

Blacklists and Whitelists To Tackle Predatory Publishing: a Cross-Sectional Comparison and Thematic Analysis
In any case, 37 titles is a small amount of disagreement percentage-wise, so you can get away with using the free DOAJ as a whitelist.
Given that my institution and Singapore are relatively small, I expect the number of journals that do not pass the whitelist to be relatively small, so I can eyeball them. Bigger countries or institutions might want to use the Cabell's whitelist and blacklist to further increase coverage.
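To make the whitelist check concrete, here is a minimal sketch that takes journal ISSNs (say, pulled from a Lens.org export) and looks each one up in DOAJ's public search API. The ISSNs below are placeholders for your own list.

```python
import requests

def in_doaj(issn):
    """Return True if a journal with this ISSN is listed in DOAJ."""
    url = f"https://doaj.org/api/search/journals/issn:{issn}"
    resp = requests.get(url)
    resp.raise_for_status()
    return resp.json().get("total", 0) > 0

# Placeholder ISSNs; in practice, read these from your export
for issn in ["2046-1402", "1234-5678"]:
    print(issn, "in DOAJ" if in_doaj(issn) else "NOT in DOAJ -> eyeball this one")
```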
An alternative approach would be to study the publishers....
Using Unpaywall OA status
In Lens.org there is an Open Access Colour filter derived from Unpaywall. Clearly you would export the articles tagged as Hybrid and Gold for inspection, since these might be APC-based and possibly predatory.
With regard to hybrids, there is no DOAJ whitelist to lean on here, so you would have to inspect those articles yourself, but in theory hybrids should be "safe"?

The Gold journals (here defined as fully OA journals) are a bit tricky though.
My initial thought was that you would export all the articles tagged by Unpaywall as Gold and then match them against the DOAJ API.
But does Unpaywall already label journals as Gold only if they are in DOAJ? Mousing over the help text in Lens suggested so. If so, and we trust DOAJ, we need only look at the "unknowns".

Looking at the Unpaywall documentation - How do we decide if a given journal is fully OA? - suggests that even if a journal is not in DOAJ, Unpaywall can still recognise it as fully OA/Gold.
Personally, to play it safe, I would just export the DOIs from Lens and then run them against the Unpaywall API directly to look for titles that are tagged journal_is_oa=true and journal_is_in_doaj=false, then eyeball those.
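A minimal sketch of that check (Unpaywall requires a contact email with every request; the DOI list here is a placeholder for your Lens export):

```python
import requests

EMAIL = "you@example.com"  # Unpaywall requires a contact email on every request

def flag_if_oa_not_in_doaj(doi):
    """Print DOIs whose journal Unpaywall tags as fully OA but not listed in DOAJ."""
    resp = requests.get(f"https://api.unpaywall.org/v2/{doi}", params={"email": EMAIL})
    resp.raise_for_status()
    data = resp.json()
    if data.get("journal_is_oa") and not data.get("journal_is_in_doaj"):
        print(doi, "-", data.get("journal_name"), "-> OA journal not in DOAJ, eyeball this")

for doi in ["10.1234/placeholder"]:  # placeholder, substitute DOIs exported from Lens
    flag_if_oa_not_in_doaj(doi)
```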
Affiliation accuracy
One can easily filter by institutional affiliation in Lens.org, but there are many questions about the accuracy of affiliation assignment.

A recent paper - Comparing institutional-level bibliometric research performance indicator values based on different affiliation disambiguation systems - compared performance using Scopus author identifiers and Web of Science organization-enhanced identifiers against a gold standard of papers from German institutions. It found that, assuming the institution existed as an identifier, there was an average precision of 0.95 and average recall of 0.93 for WoS, and an average precision of 0.96 and average recall of 0.86 for Scopus.
It's unfortunate the same study didn't look at MAG affiliations (Crossref data has minimal affiliation data), but we can guess that, being a newer source that is mostly tuned via machine learning, MAG affiliations are likely to have a lower recall. As such, one could try to increase recall by doing additional searches.
Conclusion
Quantifying the predatory journal problem in a country or an institution isn't as straightforward as I thought, due to a host of issues with accuracy. Among them we find:
Uncertainty in the recall/precision of the institution filter
Uncertainty in the categorization of "predatory"
Uncertainty in the categorization of OA journals
The method above also only has a chance of detecting articles published in dubious sources; it will not catch authors participating in, or even organizing, predatory or at least dubious conferences.
Unfortunately, there does not seem to be a list of predatory conferences for all subjects. The closest we have is to use MAG's saliencies for conferences, taking the top 100, say, as a whitelist.
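If you had the MAG conference entities and their ranks to hand, building such a whitelist is trivial. A sketch with pandas, where the file and column names are assumptions about the dump format; MAG's "Rank" is a log-transformed saliency in which lower values mean more salient:

```python
import pandas as pd

# Assumed export of MAG conference-series entities with columns
# "DisplayName" and "Rank" (lower Rank = more salient); the actual
# dump layout may differ.
conferences = pd.read_csv("ConferenceSeries.csv")
whitelist = set(conferences.nsmallest(100, "Rank")["DisplayName"])
print(sorted(whitelist)[:10])
```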

Post edit note: In case you are wondering about the results for my institution, it seems we have practically no journal articles published in titles that are obviously problematic. In a sense this isn't surprising because (1) Singapore has never had a tradition of funding APCs for publishing, and (2) while we are not at China's level of chasing prestige by publishing only in SCI (Science Citation Index) journals, there is enough pressure to ensure that publishing in a less-than-established journal is not common.

