OpenRefine & reconciliation services - learning the ropes
I've come to believe that learning how to search for and manipulate data will become the next "must-have" skill set for academic librarians. As such, I've been trying to teach myself.
One of the obvious tools to gain competency in is OpenRefine (formerly Google Refine) for cleaning data. I've been playing with the tool on and off for the last few years, and have generally found its clustering functions the most impressive feature - and one almost impossible to duplicate in Excel.
Recently, I finally buckled down to formally learn the basics because I decided to do a lunch-time workshop introducing the tool.
I'm still far from an expert, but I have the basics down. After the class, I decided to go through the book below to see what gaps remained.

Why OpenRefine?
Have a messy data set (e.g. from surveys or scraped data) and want to standardise the values in a column? OpenRefine's variety of clustering functions is pretty much unmatched at finding similar strings for you to group together.

Clustering function in OpenRefine
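To make the clustering idea concrete, here's a small Python sketch of the "fingerprint" key-collision method OpenRefine uses by default. This is a simplified version of the real algorithm - the steps follow OpenRefine's documentation, but the code itself is mine:

```python
import re
import unicodedata

def fingerprint(value):
    """Simplified version of OpenRefine's fingerprint keying:
    trim, lowercase, strip accents and punctuation, then sort and
    de-duplicate the tokens so word order and repeats don't matter."""
    value = value.strip().lower()
    # strip accents, e.g. "café" -> "cafe"
    value = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    # drop punctuation and split into tokens
    tokens = re.sub(r"[^\w\s]", "", value).split()
    return " ".join(sorted(set(tokens)))

# Values that produce the same key end up in the same cluster
assert fingerprint("Smith, John") == fingerprint("john  smith")
assert fingerprint("Café") == fingerprint("cafe")
```

OpenRefine also offers nearest-neighbour methods (e.g. Levenshtein distance) for clusters that key collision misses.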
This, coupled with the ability to quickly get a sense of the data by slicing and dicing with facets (e.g. faceting to show only values that are blank, errors, of a certain character length and more) and a bunch of common transforms (which you can supplement using the OpenRefine expression language), makes for a very useful data cleaning tool.
This hasn't gone unnoticed, and it is now a standard data cleaning tool used by historians, archivists, journalists and, yes, librarians (particularly metadata and electronic resource librarians).

Common transforms built into OpenRefine
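For illustration, a few of these common transforms are easy to mimic in Python. These are rough equivalents only - OpenRefine exposes them in GREL as value.trim(), value.toTitlecase() and so on:

```python
import re

# Rough Python equivalents of OpenRefine's built-in common transforms
def trim(value):
    """Trim leading and trailing whitespace (GREL: value.trim())."""
    return value.strip()

def collapse_ws(value):
    """Collapse consecutive whitespace to a single space."""
    return re.sub(r"\s+", " ", value)

def to_titlecase(value):
    """Title-case a string (GREL: value.toTitlecase())."""
    return value.title()

assert trim("  Japan  ") == "Japan"
assert collapse_ws("stock   market") == "stock market"
assert to_titlecase("stock market") == "Stock Market"
```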
While the clustering in OpenRefine impressed me immediately, and the ability to carry out the common transformations needed to clean data is extremely useful, what eventually dawned on me is that the killer feature is the ability to enrich existing data columns with other data.
The stereotypical example is when you have longitude and latitude values in your columns and you connect to an external system to convert them into geographical locations - in other words, geocoding your data using a service like Google Maps.
But you can go beyond that using what OpenRefine calls reconciliation services. For example, say you have a bunch of author-supplied keywords attached to your institutional repository deposits, but you want to map them to a controlled vocabulary like LCSH or MeSH. How do you do it in a semi-automatic way?
Using OpenRefine reconciliation services, you can match a column of data - called, say, "Subject terms" - against an external source of data like LCSH, DBpedia (linked data extracted from Wikipedia, including its infoboxes), Wikidata (Wikimedia's collaboratively edited knowledge base) etc., and OpenRefine will do some fuzzy matching and offer you candidate matches (by default, exact matches are auto-selected).
Here's the official explanation.
"A reconciliation service is a web service that, given some text which is a name or label for something, and optionally some additional details, returns a ranked list of potential entities matching the criteria. The candidate text does not have to match each entity's official name perfectly, and that's the whole point of reconciliation--to get from ambiguous text name to precisely identified entities. For instance, given the text "apple", a reconciliation service probably should return the Apple Inc. company, the apple fruit, and New York city (also known as the Big Apple)."
Say you have the file below and you want to match the data column "dataTXT" against LCSH.

Original file
So you run an LC reconciliation service on the data, and this is how it looks once the reconciliation is done.

For the first row, the reconciliation service automatically recognises that the right LCSH term is Japan. For the next row, the original was "stock market", and the service suggests a few close matches such as Stock Market (Firm), Stock Market Crash, 1929, etc.
If you select the double checkmark next to, say, Stock Market (Firm), all the cells in the column containing "stock market" will be matched to Stock Market (Firm), while the single checkmark matches only that one cell.
You can also reject all these choices and choose your own.
Interested to know more? There are basically four ways to enrich data using OpenRefine.
1) Use the "Add column by fetching URLs" function
2) Use a standard reconciliation service
3) Use any RDF store (via a SPARQL endpoint or local RDF) as a reconciliation service via the RDF extension
4) Use named-entity recognition via the NER extension
1. Use "Add column by fetching URLs"

This method tells OpenRefine to construct a URL from the values in a column, fetch it from a RESTful API (which typically returns JSON or XML), and then use OpenRefine's expression language (GREL) to parse the output.
As mentioned before, geocoding your data with a service like Google Maps is a typical example, but a more librarian-ish example is this tutorial, which uses the Sherpa-Romeo API to pull in information on which version of a paper can be archived for each journal title.
You can find the instructions here, so I will briefly explain the idea.
1. First, get a dataset with a list of ISSNs in one column.
2. Sherpa-Romeo, as you probably know, lets you enter an ISSN and then tells you what the publisher permits to be archived. You can enter each ISSN manually, or better yet, exploit the Sherpa-Romeo API to do it automatically for a whole batch of ISSNs at once.

Example of searching by ISSN in Sherpa Romeo
3. To call the API, you need a URL like this:
"http://www.sherpa.ac.uk/romeo/api29.php?issn="+value+"&ak=YOURAPIKEY"
4. The +value+ part will be filled in with the values from your ISSN column. In addition, you need to register for a free API key (free for up to 500 calls a day) and substitute it for YOURAPIKEY.
5. Once you have that string ready, go to the column with ISSNs, select Edit column -> Add column by fetching URLs, and enter the string above (remembering to use the API key you registered).

6. If you did it correctly, you will get a mess of XML. Then use OpenRefine's functions to parse the results (instructions here).
Though this is a very powerful way to work with APIs, it isn't unique to OpenRefine. You can do something similar with Google Sheets or Excel (particularly with tools like Power Query).
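The same fetch-and-parse idea can be sketched outside OpenRefine. This Python version builds the per-ISSN URL the way the GREL expression above does and pulls one value out of the XML; the `romeocolour` element name is from memory of the RoMEO v2.9 responses, so check it against a real response before relying on it:

```python
import urllib.parse
import xml.etree.ElementTree as ET

API_KEY = "YOURAPIKEY"  # register with Sherpa-Romeo for a free key

def romeo_url(issn):
    """Build the per-row URL, mirroring the GREL string
    "http://www.sherpa.ac.uk/romeo/api29.php?issn=" + value + "&ak=..."."""
    return ("http://www.sherpa.ac.uk/romeo/api29.php?"
            + urllib.parse.urlencode({"issn": issn, "ak": API_KEY}))

def parse_colour(xml_text):
    """Pull the first RoMEO 'colour' out of an API response, if present."""
    root = ET.fromstring(xml_text)
    node = root.find(".//romeocolour")
    return node.text if node is not None else None

assert "issn=0028-0836" in romeo_url("0028-0836")

# Trimmed, hypothetical response body:
sample = ("<romeoapi><publishers><publisher>"
          "<romeocolour>green</romeocolour>"
          "</publisher></publishers></romeoapi>")
assert parse_colour(sample) == "green"
```

In OpenRefine itself you would do the equivalent of `parse_colour` with a GREL expression on the fetched column.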
2. Use a standard reconciliation service
Somewhat more impressive to me are OpenRefine's reconciliation services. These are services others have built on top of existing APIs, following the OpenRefine standard reconciliation API, which you can run your data against to "reconcile" it.

Running a reconciliation service
Most standard reconciliation services can be run remotely (you point to a URL) or set up locally on your machine (and point to localhost).
Some examples
a. Wikidata Reconciliation for OpenRefine (en) - This is a default service in my version of OpenRefine (2.7-rc2).
If you don't have it, you can add the service by clicking Reconciliation -> Start Reconciliation -> "Add standard service" -> "Add Reconciliation service", then entering https://tools.wmflabs.org/openrefine-wikidata/en/api

If done correctly, you will see a new standard reconciliation service named "Wikidata Reconciliation for OpenRefine (en)".

Wikidata and DBpedia are two of the most useful reconciliation services because they are cross-domain; if you don't know of a specialised service to reconcile your data against, they are good bets.
Once you have the reconciliation service working, it's not too difficult to use.

If the service (DBpedia above) recognises the type of content in the column you are reconciling, it will display the types of entities you can limit the match to. In the example above, the service recognises that the column includes place-based data.
You can also choose "Reconcile against no particular type", though that might slow things down.
See also this for more details on how to use the advanced functions when reconciling.
If you are a librarian, having a way to semi-automatically reconcile data against the Library of Congress Subject Headings or the Library of Congress Name Authority File would be very useful.
b. LCSH reconciliation service
Later, you will see another way to do it, but there is a standard reconciliation service at https://lc-reconcile.herokuapp.com/ that you can point to (same procedure as for the Wikidata one), or you can run it locally.
A somewhat similar reconciliation service is one that works with OCLC's FAST.
MeSH (Medical Subject Headings) seems to be available via a SPARQL endpoint (see below).
c. VIAF and ORCID Reconciliation service
VIAF and ORCID are both people identifiers, so you can see how this could be useful - e.g. get a list of your faculty and run a reconciliation service on them against VIAF or ORCID.
You can find the instructions here.
d. FundRef Reconciliation service
e. JournalTocs reconciliation service
I haven't tried this yet, but JournalTOCs has some of the most interesting data, so it's not surprising someone built an OpenRefine standard reconciliation service for it.
Want to do fuzzy matching of data in OpenRefine against your own CSV file, rather than an existing service? Now you can! You can use this to match, say, two files of data that are almost but not quite identical.
(Note: there's also a fuzzy matching plugin for Microsoft Excel that works well for matching names and addresses.)
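If you want a feel for what such fuzzy matching does, Python's standard library can approximate it. Here's a toy sketch matching one list of messy values against a reference list - a stand-in for the two CSV files; the cutoff and data are illustrative:

```python
import difflib

def fuzzy_match(values, reference, cutoff=0.5):
    """For each messy value, find the closest string in the reference
    list, or None if nothing clears the similarity cutoff."""
    return {
        v: (difflib.get_close_matches(v, reference, n=1, cutoff=cutoff) or [None])[0]
        for v in values
    }

messy = ["Jon Smith", "Jane Do"]
clean = ["John Smith", "Jane Doe", "ACME Inc."]
matches = fuzzy_match(messy, clean)
assert matches["Jon Smith"] == "John Smith"
assert matches["Jane Do"] == "Jane Doe"
```

The CSV reconciliation extension does the heavy lifting for you inside OpenRefine, with proper candidate ranking rather than a single best guess.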
3. Use a reconciliation service via the RDF extension
What if the service you want to use does not have an OpenRefine standard reconciliation service? One way around the problem is to install the RDF extension (also see instructions here). Once you have this, a wealth of possibilities opens up, as you can now reconcile against linked data available on the web.
I won't go into linked data/RDF in depth, because it's an area I've only recently started to grasp, but basically linked data is typically expressed using RDF, and you can query the RDF/linked data graph using SPARQL.
With the RDF extension installed, you can download a whole RDF file (if available and not too large) to match against as a reconciliation service.
The RDF extension can also hook up with SPARQL endpoints (basically URLs that accept SPARQL queries and return responses), which most linked data sources make available, and use them as reconciliation services.

For example, you can follow the instructions here to set up a reconciliation service using the Library of Congress SPARQL endpoint.
Other sources you could reconcile against via SPARQL endpoints include DBpedia (linked data obtained by extracting Wikipedia data), Europeana data, MeSH (Medical Subject Headings), the JISC Open Bibliography British National Bibliography dataset, etc.
Want more? Look up Datahub, the Mannheim Linked Data Catalog, or here for other linked data sources with SPARQL endpoints or RDF to download.

Getting reconciliation services to work using SPARQL endpoints can be a little hit or miss, for various reasons detailed in the documentation, but it may still be worth it. E.g. my connection to the DBpedia SPARQL endpoint at http://dbpedia.org/sparql is a bit flaky. There are also many options to consider when setting it up.
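For the curious, the kind of lookup an endpoint-backed reconciliation performs can be sketched as a plain SPARQL label query. This is my simplified guess at the idea, not the RDF extension's actual query:

```python
import urllib.parse

ENDPOINT = "http://dbpedia.org/sparql"  # the DBpedia endpoint mentioned above

def label_query(term, limit=5):
    """A naive label-lookup SPARQL query: find resources whose
    rdfs:label matches the term, case-insensitively."""
    safe = term.replace('"', '\\"').lower()
    return (
        "SELECT ?s ?label WHERE { "
        "?s rdfs:label ?label . "
        f'FILTER (lcase(str(?label)) = "{safe}") '
        f"}} LIMIT {limit}"
    )

def endpoint_url(query):
    """URL for a GET request to the endpoint, asking for JSON results."""
    return ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"})

q = label_query("Stock market")
assert '"stock market"' in q and "rdfs:label" in q
```

The real extension scores and ranks candidates rather than requiring an exact label match, which is what makes it usable on messy data.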
I'm still learning and would appreciate tips by more experienced users.
4. Using named-entity recognition via the NER extension
The methods above generally work well if you have a single string or phrase of text. But what happens if you have a whole chunk of text, say the text of an abstract? Say you have an abstract like this:
"Wikipedia is fast becoming a key information source for many despite criticism that it is unreliable and inaccurate. A number of recommendations have been made to sort the chaff from the wheat in Wikipedia, among which is the idea of color-coding article segment edits according to age (Cross, 2006). Using data collected as part of a wider study published in Nature, this article examines the distribution of errors throughout the life of a select group of Wikipedia articles. The survival time of each “error edit” in terms of the edit counts and days was calculated and the hypothesis that surviving material added by older edits is more trustworthy was tested. Surprisingly, we find that roughly 20% of errors can be attributed to surviving text added by the first edit, which confirmed the existence of a “first-mover” effect (Viegas, Wattenberg, & Kushal, 2004) whereby material added by early edits are less likely to be removed. "
Which parts of this text are important? Do any of them refer to "entities" known to other datasets?
The process of going through a chunk of text and extracting known entities is called named-entity recognition (NER).
To do this in OpenRefine, you will need to download and install this extension. By default it works with DBpedia Spotlight, but you can register for and add other services that do named-entity recognition.
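You can also call DBpedia Spotlight directly to see what the extension is doing. Here's a minimal Python sketch - the annotate endpoint and JSON keys match Spotlight's public API as I understand it, and the sample response is trimmed and hypothetical:

```python
import urllib.parse
import urllib.request

def spotlight_request(text, confidence=0.5):
    """Build a request to the public DBpedia Spotlight annotate endpoint,
    asking for a JSON response."""
    url = ("https://api.dbpedia-spotlight.org/en/annotate?"
           + urllib.parse.urlencode({"text": text, "confidence": confidence}))
    return urllib.request.Request(url, headers={"Accept": "application/json"})

def extract_entities(response_json):
    """Pull (surface form, entity URI) pairs out of a Spotlight response."""
    return [(r["@surfaceForm"], r["@URI"]) for r in response_json.get("Resources", [])]

# Trimmed, hypothetical response for a text mentioning Wikipedia:
sample = {"Resources": [{"@URI": "http://dbpedia.org/resource/Wikipedia",
                         "@surfaceForm": "Wikipedia"}]}
assert extract_entities(sample) == [("Wikipedia", "http://dbpedia.org/resource/Wikipedia")]
```

To actually fetch, you would pass the request to `urllib.request.urlopen`; the extension handles all of this per-cell for you.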

Here's a concrete toy example of named-entity extraction. First, I run the plugin on my column of article titles. As my plugin is set up with two services (DBpedia Spotlight and dataTXT), after the extraction OpenRefine lists two columns of extracted entities.

In this case, dataTXT seems a bit better than the free DBpedia Spotlight; it generally extracts more named entities per title.
What I've extracted so far is just the entities from these two services, but if I want them in LCSH, I can just run the output against the LCSH reconciliation service (see above).
Conclusion
OpenRefine is a really flexible tool, and I'm still learning the ropes. If you have any tips for making better use of OpenRefine, please let me know. If you use OpenRefine, let me know how you use it and how useful it has been in your day-to-day work.

