Tuesday, July 20, 2010

Building a Linked Data based index of library institutions

As a side effect of our efforts to convert bibliographic data to Linked Data, we realized that library institutions are not yet identified by Linked Data URIs (i.e. HTTP URIs) [1]. Since these institutions play a significant role in linked bibliographic data, we set out to provide such URIs as the first segment of lobid.org. This would not only enable us to refer to institutions by URI, but would also allow other institutions to use the URIs in other contexts.

The URIs are based on the existing and well-established International Standard Identifier for Libraries and Related Organizations (ISIL), which can also act as MARC Organization Codes. ISILs are assigned to library institutions by national or institutional agencies. But these HTTP URIs are just a small first step. The Linked Data Design Issues advise that "[w]hen someone looks up a URI, provide useful information, using the standards", so we went looking for useful information we could serve. It is currently aggregated from two sources: the address database for German libraries and the MARC Organization Codes Database. The data in those sources differ in detail: the MARC Organization Codes Database provides the name and address of an institution in varying levels of detail, while the database for German libraries additionally contains contact information and opening hours (which are unfortunately hard to parse, since they are provided as arbitrary literal descriptions). Please note that we have not processed all gathered data yet (e.g. we have only just begun to add US libraries).
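
To make the idea of ISIL-based HTTP URIs concrete, here is a minimal Python sketch. The base URI and the function name are assumptions made for illustration (the text does not spell out the actual lobid.org URI layout), and the pattern is a simplified check of the ISIL shape (prefix, hyphen, unit identifier, at most 16 characters), not a full validator.

```python
import re
from urllib.parse import quote

# Hypothetical base URI; the real lobid.org path layout is not given in the text.
BASE = "http://lobid.org/organisation/"

# Simplified ISIL shape: a prefix of 1-4 alphanumerics, a hyphen, then a unit
# identifier of up to 11 characters (letters, digits, ':', '/', '-').
ISIL_PATTERN = re.compile(r"^[A-Za-z0-9]{1,4}-[A-Za-z0-9:/-]{1,11}$")


def isil_to_uri(isil: str, base: str = BASE) -> str:
    """Return an HTTP URI for the given ISIL, or raise ValueError."""
    isil = isil.strip()
    if not ISIL_PATTERN.match(isil):
        raise ValueError(f"not a plausible ISIL: {isil!r}")
    # Percent-encode '/' so the ISIL stays a single path segment.
    return base + quote(isil, safe="-:")


if __name__ == "__main__":
    print(isil_to_uri("DE-605"))  # http://lobid.org/organisation/DE-605
```

Percent-encoding the slash is one possible design choice; it keeps every ISIL in exactly one path segment at the cost of a less readable URI.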

After finishing a first quick-and-dirty prototype, we began to think about how to enhance the quality of the data, since it is often sparse and/or out of date (especially the opening hours). One solution would be to provide editing capabilities directly on the central website. But most library institutions already have websites, and that is where the information in question is first, and often only, created and updated. Maintaining the same information a second time in a central database is repetitive, and that maintenance would probably lapse at some point. One might argue that the information held in the central database could be included on the institution's website via AJAX or the like, but this would make the institution dependent on the centralized service.

This is why we decided to try to adapt a solution for a similar problem that Mark Birbeck describes here. It is based on the idea of aggregating data provided directly by the individual institutions. The suggestion is to use an interface that you already have: your institutional website. Libraries probably already provide all the information there for their human visitors. With just a little enhancement, namely a couple of additional attributes, that information can be made available to machines, too. By providing these additional attributes, you effectively embed an RDF description in the HTML you serve anyway. This technique is called RDFa. Search engines are beginning to support RDFa, so enriching a website is useful not only for us to aggregate the data but for just about every other machine agent on the web.

An RDFa description containing address and opening hours
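
A description of this kind might look like the markup rendered by the following Python sketch. It is an illustration under assumptions, not the markup used on lobid.org: the vCard vocabulary is a guess at a suitable choice for names and addresses, the `ex:openingHours` property and its namespace are hypothetical placeholders, and the function name and sample values are invented.

```python
from html import escape

VCARD_NS = "http://www.w3.org/2006/vcard/ns#"
EX_NS = "http://example.org/vocab#"  # hypothetical namespace for opening hours


def render_rdfa(uri, name, street, postal_code, locality, hours):
    """Render an HTML fragment carrying an RDFa description of a library."""
    lines = [
        f'<div xmlns:v="{VCARD_NS}" xmlns:ex="{EX_NS}"',
        f'     about="{escape(uri)}" typeof="v:VCard">',
        f'  <span property="v:fn">{escape(name)}</span>',
        '  <div rel="v:adr">',
        '    <div typeof="v:Address">',
        f'      <span property="v:street-address">{escape(street)}</span>',
        f'      <span property="v:postal-code">{escape(postal_code)}</span>',
        f'      <span property="v:locality">{escape(locality)}</span>',
        '    </div>',
        '  </div>',
        # Opening hours stay a plain literal, as in the source databases.
        f'  <span property="ex:openingHours">{escape(hours)}</span>',
        '</div>',
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_rdfa(
        "http://example.org/library/1",  # invented sample data
        "Example Library",
        "Main Street 1",
        "50667",
        "Cologne",
        "Mon-Fri 9:00-17:00",
    ))
```

The visible text is exactly what a human visitor already sees; only the `about`, `typeof`, `rel` and `property` attributes are added for machines.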



A challenge with this approach is that libraries need to be made aware of the need to encode their data in this form (after all, this is why we started out by converting existing databases). Most probably, the content management systems (CMS) used for library websites will need to be adapted to include this XML snippet automatically when a person who does not know any markup enters the relevant data into a database through a web form. Should libraries adopt this mechanism, a method for harvesting the library data and integrating it with the lobid.org list must be implemented. Websites that provide this data need to be identified and then visited by a web spider, either regularly or on the fly, in order to update the information in the lobid.org database and serve it to users of the service.
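
The harvesting step could start out as simply as the following Python sketch, which uses only the standard library. It is not a conforming RDFa processor: it merely collects the text of elements that carry a `property` attribute, together with the first `about` URI it encounters. The class name, function name and sample page are invented for illustration, and fetching the page over HTTP is left out.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PropertyCollector(HTMLParser):
    """Collect literal values of RDFa `property` attributes (simplified)."""

    def __init__(self):
        super().__init__()
        self.subject = None   # first `about` URI seen
        self.values = {}      # property name -> list of text values
        self._stack = []      # property (or None) for each open element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.subject is None and "about" in attrs:
            self.subject = attrs["about"]
        self._stack.append(attrs.get("property"))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        prop = self._stack[-1] if self._stack else None
        if prop and data.strip():
            self.values.setdefault(prop, []).append(data.strip())


def harvest(html_text):
    """Return (subject URI, {property: [values]}) for one HTML page."""
    collector = PropertyCollector()
    collector.feed(html_text)
    collector.close()
    return collector.subject, collector.values


SAMPLE_PAGE = """
<div about="http://example.org/library/1" typeof="v:VCard">
  <span property="v:fn">Example Library</span><br>
  <span property="v:locality">Cologne</span>
  <span property="ex:openingHours">Mon-Fri 9:00-17:00</span>
</div>
"""

if __name__ == "__main__":
    subject, values = harvest(SAMPLE_PAGE)
    print(subject, values)
```

A real spider would additionally have to resolve prefixes, handle nested subjects, and decide how often to revisit each site; a full RDFa parser would be the more robust choice for that.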

Disclaimer

We are currently using data provided by Google’s Geocoding API which may not be distributed. As soon as institutions publish their data under an open licence, we can also provide the aggregate as Open Data.

[1] According to the German Wikipedia, a request was made to register the info:isil namespace for non-HTTP URIs based on ISILs. That namespace does not seem to have been registered, though, and as of May 22, 2010 the registration of "info" namespaces has been closed. If it turns out that info:isil URIs have been used, we will link them to the HTTP URIs with owl:sameAs.
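
Such a link amounts to one triple per institution. The Python sketch below writes it in N-Triples syntax; both the `info:isil/` form and the HTTP base URI are assumptions for illustration, since the info namespace was apparently never registered and the lobid.org URI layout is not given here.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(isil, base="http://lobid.org/organisation/"):
    """Return an N-Triples line equating an info:isil URI with an HTTP URI."""
    info_uri = "info:isil/" + isil  # hypothetical form of the info URI
    http_uri = base + isil          # hypothetical HTTP URI layout
    return f"<{info_uri}> <{OWL_SAME_AS}> <{http_uri}> ."


if __name__ == "__main__":
    print(same_as_triple("DE-605"))
```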
