GND reconciliation for OpenRefine

27 Aug 2018 – 19 Aug 2019, Fabian Steeg, Adrian Pohl | 🏷 lobid-gnd 

Our lobid-gnd service provides access to the Integrated Authority File (GND). The service integrates with OpenRefine, a powerful tool for working with messy data. This tutorial provides an overview of GND reconciliation for OpenRefine. The features used here require OpenRefine 2.8 or later.

Reconciliation is the process of matching name strings to identifiers of entities in a database such as an authority file or Wikidata. This is useful whenever you want to merge differing name strings for the same person in your data, or when you want to fetch additional data from the database you are reconciling against.

The first step in the reconciliation process is to create a project. OpenRefine can import data from various sources. For this tutorial, we’ll simply import data from the clipboard:

1

Copy these lines and paste them in OpenRefine:

name;beruf;ort
J. Weizenbaum;Informatiker;Berlin
Twain, Mark;Schriftsteller;
Kumar, Lalit;;
Jemand;;

2
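To check how the sample lines above split into columns before pasting them, here is a minimal Python sketch. It assumes the semicolon as separator and the first line as header, which is what the sample suggests; the names `SAMPLE` and `parse_rows` are only illustrative.

```python
import csv
import io

# The sample lines from this tutorial.
SAMPLE = """name;beruf;ort
J. Weizenbaum;Informatiker;Berlin
Twain, Mark;Schriftsteller;
Kumar, Lalit;;
Jemand;;
"""


def parse_rows(text: str) -> list[dict[str, str]]:
    """Parse semicolon-separated text with a header row into dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    return [dict(row) for row in reader]


rows = parse_rows(SAMPLE)
for row in rows:
    print(row)
```

Because the separator is a semicolon, the comma in "Twain, Mark" stays inside the name field, and the missing values become empty strings.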

In the following preview screen you can accept the automatically detected settings and create the project:

3

We now want to reconcile the text strings in the name column with GND entries:

4

We’ll have to add the GND reconciliation service:

5

Paste https://lobid.org/gnd/reconcile as the service URL:

6
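For the curious, this is roughly what a client sends to that service URL. The sketch assumes the generic reconciliation protocol, in which a batch of queries is passed as JSON in a form field called `queries`; the helper name `build_form_body` is made up for this example, and nothing is actually sent.

```python
import json
from urllib.parse import parse_qs, urlencode

SERVICE_URL = "https://lobid.org/gnd/reconcile"  # the service URL from this tutorial


def build_form_body(names: list[str]) -> str:
    """Encode a batch of name queries as a form body with a 'queries' field."""
    # The keys q0, q1, ... are arbitrary labels; responses are keyed the same way.
    batch = {f"q{i}": {"query": name} for i, name in enumerate(names)}
    return urlencode({"queries": json.dumps(batch, ensure_ascii=False)})


body = build_form_body(["J. Weizenbaum", "Twain, Mark"])
print(SERVICE_URL)
print(body)
```

To try it against the live service, the body could be posted with `urllib.request.urlopen(SERVICE_URL, data=body.encode("utf-8"))`; the sketch leaves that out so it runs offline.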

Collapse the drawer on the left-hand side by clicking the newly added service. As our list for reconciliation consists solely of personal names, we now select DifferentiatedPerson to reconcile only against GND entries of that type:

7-1

Optionally, we could reconcile against a non-default type by typing into the “Reconcile against type” field and selecting one of the suggested types, e.g. Person:

7-2

It can make sense to pass additional data from other columns to improve the reconciliation results. Type in the text field for each column and select one of the suggested properties. For example, use the data from the beruf column to search in the professionOrOccupationAsLiteral field in the GND:

8
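The type and column settings chosen above end up in the queries that are sent to the service. The following sketch shows what such a batch might look like, assuming the generic reconciliation protocol (`type` for the type restriction, `properties` with `pid` and `v` for extra column values) and assuming that the type ID is the same string as the name shown in the dialog. `build_query` and `property_map` are hypothetical names for this example.

```python
import json


def build_query(row: dict[str, str], type_id: str, property_map: dict[str, str]) -> dict:
    """Build one reconciliation query from a table row.

    property_map maps a column name to the property ID it is matched against.
    """
    query = {"query": row["name"], "type": type_id}
    # Only pass along columns that actually have a value in this row.
    props = [
        {"pid": pid, "v": row[column]}
        for column, pid in property_map.items()
        if row.get(column)
    ]
    if props:
        query["properties"] = props
    return query


rows = [
    {"name": "J. Weizenbaum", "beruf": "Informatiker", "ort": "Berlin"},
    {"name": "Kumar, Lalit", "beruf": "", "ort": ""},
]
property_map = {"beruf": "professionOrOccupationAsLiteral"}

batch = {
    f"q{i}": build_query(row, "DifferentiatedPerson", property_map)
    for i, row in enumerate(rows)
}
print(json.dumps(batch, indent=2, ensure_ascii=False))
```

Rows without a value in the mapped column are sent as plain name queries, so an empty beruf cell does not narrow the search.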

After reconciliation, we can inspect candidates that have not been automatically matched by clicking or hovering over (depending on your OpenRefine version) their name:

9

This brings up a preview, with the option to match them:

10

Alternatively, we can search for a match by clicking “Search for match”. This brings up a dialog with a text field prefilled with the cell value. Select one of the suggestions to match the cell:

10
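The split between automatically matched cells and cells that need a manual decision can be sketched in a few lines. This assumes the response shape of the generic reconciliation protocol, where each query key holds a `result` list of candidates with `id`, `name`, `score` and a boolean `match`. The sample response below is invented for illustration, with placeholder IDs and scores; it is not real GND data.

```python
def triage(results: dict) -> tuple[dict, dict]:
    """Split reconciliation results into auto-matched IDs and review lists."""
    matched, review = {}, {}
    for key, entry in results.items():
        candidates = entry.get("result", [])
        auto = [c for c in candidates if c.get("match")]
        if len(auto) == 1:
            # Exactly one candidate flagged as a match: accept it.
            matched[key] = auto[0]["id"]
        else:
            # Otherwise keep all candidates, best score first, for manual review.
            review[key] = sorted(
                candidates, key=lambda c: c.get("score", 0), reverse=True
            )
    return matched, review


# Invented sample response with placeholder IDs (not real GND data).
sample = {
    "q0": {"result": [
        {"id": "example-id-1", "name": "Example, One", "score": 95.0, "match": True},
    ]},
    "q1": {"result": [
        {"id": "example-id-2", "name": "Example, Two", "score": 40.0, "match": False},
        {"id": "example-id-3", "name": "Example, Three", "score": 55.0, "match": False},
    ]},
    "q2": {"result": []},
}

matched, review = triage(sample)
print(matched)
print({key: [c["id"] for c in cands] for key, cands in review.items()})
```

Entries that end up in `review` correspond to the cells we inspect by hand in the steps above; an empty list means the service returned no candidates at all.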

After matching, we can enrich our data with the reconciled data. We want to add columns based on the reconciled values:

11

We can now select the properties we want to add, either by typing in the search field and picking one of the suggestions or by choosing from the prefilled list below the search field, and preview them. Here, we choose Beruf oder Beschäftigung, Geburtsort, Sterbeort, and Ländercode:

12

The first three properties are GND entries themselves, so they are recognized as reconciled items (they are links in the preview).

For non-reconciled items that have a label and an ID in lobid-gnd (such as Ländercode), we can configure the content we want (label or ID) using the configure link for that property:

13

Note also the limit setting, which works for all properties and limits the number of values added for each entry (0 is the default, meaning no limit).
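Behind the dialog, the selected properties and their settings are sent to the service as a data extension request. Here is a rough sketch of such a request, assuming the generic data extension format (`ids` plus `properties` with optional `settings`). The property IDs and the setting names `content` and `limit` are our guesses at what lies behind the labels in the dialog, and the entry ID is a placeholder.

```python
import json


def build_extend_request(ids, properties):
    """Build a data extension request.

    properties is a list of (property_id, settings) pairs; settings may be None.
    """
    props = []
    for prop_id, settings in properties:
        prop = {"id": prop_id}
        if settings:
            prop["settings"] = settings
        props.append(prop)
    return {"ids": list(ids), "properties": props}


request = build_extend_request(
    ids=["example-gnd-id"],  # placeholder, not a real GND ID
    properties=[
        # Assumed IDs for Beruf oder Beschäftigung, Geburtsort, Sterbeort.
        ("professionOrOccupation", None),
        ("placeOfBirth", None),
        ("placeOfDeath", None),
        # Assumed ID for Ländercode: ask for the ID instead of the label,
        # and for at most one value per entry.
        ("geographicAreaCode", {"content": "id", "limit": 1}),
    ],
)
print(json.dumps(request, indent=2))
```

Properties without settings fall back to the service defaults, which for the limit means no restriction on the number of values.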

After confirming the preview, removing the old columns beruf and ort, and filtering out the non-reconciled item using the facet on the left-hand side, we have the enriched table with new data:

14

We can now use the new reconciled items (like Berlin in the Sterbeort column here) to add more columns based on their properties (i.e. properties of Berlin, not Weizenbaum, Joseph):

15

As an example, we add a link to a depiction of the Sterbeort:

16

Finally, we can export our data in various supported formats:

17
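If you post-process the data in a script instead, a tab-separated export is easy to produce. The sketch below uses Python's `csv` module; the row reuses the values mentioned above (Weizenbaum, Joseph and Berlin as Sterbeort), and `export_tsv` is a hypothetical helper, not part of OpenRefine.

```python
import csv
import io


def export_tsv(rows: list[dict[str, str]], columns: list[str]) -> str:
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # silently skip keys that are not exported
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [{"name": "Weizenbaum, Joseph", "Sterbeort": "Berlin"}]
tsv = export_tsv(rows, ["name", "Sterbeort"])
print(tsv)
```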

This concludes our overview of GND reconciliation in OpenRefine. For further information, see the general OpenRefine documentation and the reconciliation wiki page.
