=======
OAI-PMH
=======

Introduction
------------

The Open Archives Initiative Protocol for Metadata Harvesting
(OAI-PMH) is a protocol that allows servers to publish metadata to the
web. The information can then be harvested by OAI-PMH clients. The
word *harvest* is important; OAI-PMH does not define real-time search
but a way for sites to share large quantities of metadata with other
sites. The sites that harvest the data can then offer their own
services for this information.

The OAI-PMH protocol is XML-based and is accessible over HTTP. Clients
express harvesting requests as simple HTTP requests, and XML data is
returned by the OAI-PMH server. This makes it easy to implement
OAI-PMH compliant harvesters in many programming languages: you just
need a library that can open URLs, and some XML processing library.
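
As a rough illustration of how little is needed, the sketch below uses
only Python's standard library: ``urllib`` to open the URL and
``xml.etree.ElementTree`` to parse the response. The function name
``oai_request``, its ``opener`` parameter and the ``example.org`` base
URL are assumptions made for this example; they are not part of the
protocol or of any particular library::

  import io
  import xml.etree.ElementTree as ET
  from urllib.parse import urlencode
  from urllib.request import urlopen


  def oai_request(base_url, verb, opener=urlopen, **arguments):
      """Send one OAI-PMH request and return the parsed XML root element."""
      # Every request is the base URL plus a 'verb' and its arguments.
      query = urlencode({"verb": verb, **arguments})
      with opener(base_url + "?" + query) as response:
          return ET.parse(response).getroot()


  def fake_opener(url):
      """Stand-in for a real server so the sketch runs without a network."""
      return io.BytesIO(
          b'<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
          b"<Identify><repositoryName>Demo</repositoryName></Identify>"
          b"</OAI-PMH>"
      )


  root = oai_request("http://example.org/oai", "Identify", opener=fake_opener)
  print(root.tag)

Leaving out the ``opener`` argument makes the same function talk to a
real repository over HTTP.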

The use of OAI-PMH makes the Document Library a true *open data*
application, meaning that using the Document Library does not lock an
organization into one single monolithic application but makes it one
component in a collection of collaborating applications.

What is OAI-PMH used for?
-------------------------

OAI-PMH is used to open up the metadata information in online archives
for use by other applications and organizations. Large organizations
often deal with large amounts of information resources such as
documents, images or papers. Systems to manage this information are in
place, but often added services need to be provided, and it doesn't
always make sense to add these services within the original
application itself when it can be done more effectively in another
application.

An organization could be managing all its documents in a system like
the Document Library. While such a system is good for managing this
information, it does not allow this information to be integrated
within larger websites, and does not offer much in the way of search
interfaces. Luckily the Document Library exposes its information to be
harvested using OAI-PMH, so other software can provide advanced
browsing and searching features. Infrae in fact provides such features
on top of Zope (``OAICore``) and the Silva CMS (``SilvaOAI``) as open
source software.

Another reason to expose information through OAI-PMH can be
visibility. Universities for instance are very motivated to have their
researchers publish the results of their research far and wide. By
exposing metadata about academic papers through OAI-PMH, OAI-PMH-based
aggregation services can put the papers in their index, and thus
increase the visibility of these papers. A number of these service
providers, often aggregators, are listed on the `OAI-PMH service
providers listing`_.

.. _`OAI-PMH service providers listing`: http://www.openarchives.org/service/listproviders.html

In the Netherlands, Dutch universities participate in the Digital
Academic Repositories (DARE) initiative, to make all their research
results (typically in the form of scientific papers) digitally
accessible. The protocol chosen for exposing this information is
OAI-PMH. Some services made possible by this standardization
on OAI-PMH between universities in the Netherlands are the `Cream of
Science`_ showcase of prominent Dutch research, and NARCIS_, a gateway
to Dutch scientific information.

.. _`Cream of Science`: http://www.creamofscience.org/en/page/language.view/keur.page

.. _NARCIS: http://www.narcis.info/narcis/

OAI-PMH as a web service
------------------------

Offering OAI-PMH access to metadata managed by a site could be seen as
offering a *web service*.

What is a web service? It's an interface that's accessible over the
WWW that is normally used by applications, and is not meant for direct
consumption by humans. Applications can use web services to access
external information or to gain new capabilities. Web services are
thus used for application to application communication, as opposed to
a web *site*, which is an interface accessible by humans, albeit
intermediated by a web browser.

OAI-PMH is a web service in this sense: it is definitely not meant for
direct human consumption - pointing your web browser to an OAI-PMH
service will result in a lot of hard-to-read XML that describes
metadata.

Web services are an attractive way to enable application to
application communication because they reuse the infrastructure
already available for the web: the HTTP protocol, web servers,
proxies, and so on.

There are multiple competing visions about how web services should
work. A well-known vision is based around the SOAP_ protocol and the
WS series of documents (WS-Addressing_, WS-Transfer_, WS-Eventing_,
WS-Enumeration_, WS-Security_, and so on). The "WS" vision is heavy on
specifications, provides thick layers over HTTP and is generally seen
as a tool-based approach towards web services: software providers such
as Microsoft and IBM will provide the toolkits and IDE environments to
work with these web services so it's not necessary to understand the
details of these documents. This backing by large industry players is
considered to be an advantage of this approach.

.. _SOAP: http://www.w3.org/TR/soap/
.. _WS-Addressing: http://www.w3.org/2002/ws/addr/
.. _WS-Transfer: http://www.w3.org/Submission/WS-Transfer/
.. _WS-Eventing: http://www.w3.org/Submission/WS-Eventing/
.. _WS-Enumeration: http://www.w3.org/Submission/WS-Enumeration/
.. _WS-Security: http://xml.coverpages.org/ws-security.html

Another, competing vision is REST_ (Representational State
Transfer). REST advocates criticize the WS vision as overly
complicated, and even as a way for large software vendors to be able
to sell more complicated tools. REST is not a standard or a protocol,
but a so-called "architectural style" that can be used to inform the
design of web services. With a REST-style web service, the programmer
accesses the web service directly through URLs, using an HTTP client
library and the basic HTTP protocol. Resources are accessed over
HTTP GET, created and altered using HTTP POST and PUT, and deleted
using HTTP DELETE. Resources typically represent themselves as XML
data or some other simple textual markup. Care is taken that the
"stateless" nature of the web is preserved: the server does not need
to retain state between multiple web requests.

.. _REST: http://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm

REST has advantages over WS because of simplicity and
introspectability: a human developer with a web browser can access a
REST-based service and get an idea of what's going on. In addition,
since REST is so much like the web architecture today, the same
knowledge and toolkits used to build large and scalable web
applications can be used to build large and scalable REST
applications.

There is a third approach to web services that nobody advocates very
much but lots of people actually use: XML over HTTP. REST advocates
often claim this approach as REST, but just as often call it "impure"
(or "low") REST, because the full REST style is not used (often only
HTTP GET).

As open source developers using the Python programming language,
Infrae is naturally somewhat biased in this debate; we favor REST
(pure or not), as XML over HTTP is something we can easily work
with. We would prefer not to have to worry too much about the
seemingly endless series of WS specifications. Luckily OAI-PMH does
not build on them.

OAI-PMH is a protocol that follows an XML over HTTP approach. It isn't
pure REST; the protocol only provides for access to data over HTTP GET
and does not allow for clients to alter or add metadata. For OAI-PMH,
HTTP POST is defined as doing the same as HTTP GET, just with a
different encoding of parameters. The way different OAI-PMH "verbs"
are all tunneled as request parameters over the same URL is also not
very "RESTful".
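
The equivalence of GET and POST can be illustrated with Python's
standard library. This is only a sketch: the ``example.org`` base URL
is hypothetical, and the request objects are constructed but never
sent::

  from urllib.parse import urlencode
  from urllib.request import Request

  base_url = "http://example.org/oai"  # hypothetical endpoint
  arguments = {"verb": "ListIdentifiers", "metadataPrefix": "oai_dc"}
  encoded = urlencode(arguments)

  # GET: the arguments travel in the query string of the URL.
  get_request = Request(base_url + "?" + encoded)

  # POST: the same arguments travel form-encoded in the request body.
  post_request = Request(
      base_url,
      data=encoded.encode("ascii"),
      headers={"Content-Type": "application/x-www-form-urlencoded"},
  )

  print(get_request.get_method(), get_request.full_url)
  print(post_request.get_method(), post_request.data)

In both cases the server sees the same verb and arguments; only the
place where they are encoded differs.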

OAI-PMH examples
----------------

High time for some examples. OAI-PMH only exposes a single URL as its
entry point. We'll use the following URL as an example::

  http://ep.eur.nl/oai/request

This is the OAI-PMH service of the Erasmus University Library in
Rotterdam, the Netherlands. Accessing that URL by itself will result
in an error, as to make a real OAI-PMH request we need to add HTTP
parameters that specify what bit of information we are requesting.

This is a valid OAI-PMH request::

  http://ep.eur.nl/oai/request?verb=Identify

This asks the service to give some information about itself. We receive
the following information in return::

  <?xml version="1.0" encoding="UTF-8"?>
  <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
    <responseDate>2006-04-10T15:01:02Z</responseDate>
    <request verb="Identify">http://ep.eur.nl/oai/request</request>
    <Identify>
      <repositoryName>DSpace at Erasmus</repositoryName>
      <baseURL>http://ep.eur.nl/oai/request</baseURL>
      <protocolVersion>2.0</protocolVersion>
      <adminEmail>eepi@ubib.eur.nl</adminEmail>
      <earliestDatestamp>2001-01-01T00:00:00Z</earliestDatestamp>
      <deletedRecord>persistent</deletedRecord>
      <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
      <compression>gzip</compression>
      <compression>deflate</compression>
      <description>
        <toolkit xmlns="http://oai.dlib.vt.edu/OAI/metadata/toolkit" xsi:schemaLocation="http://oai.dlib.vt.edu/OAI/metadata/toolkit http://oai.dlib.vt.edu/OAI/metadata/toolkit.xsd">
          <title>OCLC's OAICat Repository Framework</title>
          <author>
            <name>Jeffrey A. Young</name>
            <email>jyoung@oclc.org</email>
            <institution>OCLC</institution>
          </author>
          <version>1.5.26</version>
          <toolkitIcon>http://alcme.oclc.org/oaicat/oaicat_icon.gif</toolkitIcon>
          <URL>http://www.oclc.org/research/software/oai/cat.shtm</URL>
        </toolkit>
      </description>
    </Identify>
  </OAI-PMH>

As you can see, the OAI-PMH service returns an XML body. Any OAI-PMH
response is wrapped in an *OAI-PMH* element. All OAI-PMH specific XML
elements are in the "http://www.openarchives.org/OAI/2.0/"
namespace. In the XML defined for the "Identify" verb, we see things
like the name of the repository and the email address of the
administrator.
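
Reading such a response from a program mostly comes down to handling
the namespace correctly. The following sketch parses a trimmed copy of
the ``Identify`` response shown above with
``xml.etree.ElementTree``; the ``oai`` prefix and the keys of the
``info`` dictionary are choices made for this example only::

  import xml.etree.ElementTree as ET

  # Map a prefix of our own choosing to the OAI-PMH namespace.
  OAI = {"oai": "http://www.openarchives.org/OAI/2.0/"}

  # Trimmed copy of the Identify response.
  response = b"""<?xml version="1.0" encoding="UTF-8"?>
  <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
    <responseDate>2006-04-10T15:01:02Z</responseDate>
    <request verb="Identify">http://ep.eur.nl/oai/request</request>
    <Identify>
      <repositoryName>DSpace at Erasmus</repositoryName>
      <protocolVersion>2.0</protocolVersion>
      <adminEmail>eepi@ubib.eur.nl</adminEmail>
      <granularity>YYYY-MM-DDThh:mm:ssZ</granularity>
    </Identify>
  </OAI-PMH>"""

  root = ET.fromstring(response)
  identify = root.find("oai:Identify", OAI)
  info = {
      "name": identify.findtext("oai:repositoryName", namespaces=OAI),
      "admin": identify.findtext("oai:adminEmail", namespaces=OAI),
      "granularity": identify.findtext("oai:granularity", namespaces=OAI),
  }
  print(info)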

OAI-PMH defines a number of verbs that can be used to access
information. Above we saw the "Identify" verb, but now let's try a
verb that gives us actual access to metadata::

  http://ep.eur.nl/oai/request?verb=GetRecord&metadataPrefix=eur_qdc&identifier=oai:ep.eur.nl:1765/9

We have to supply the following parameters: the verb used
(``GetRecord``), the metadata format we're interested in
(``metadataPrefix``) and the identifier of the actual record we're
interested in (``identifier``). Below is the result (with some
information elided)::

  <?xml version="1.0" encoding="UTF-8"?>
  <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
    <responseDate>2006-04-11T10:31:10Z</responseDate>
    <request 
      identifier="oai:ep.eur.nl:1765/9" 
      metadataPrefix="eur_qdc" verb="GetRecord">http://ep.eur.nl/oai/request</request>
    <GetRecord>
      <record>
        <header>
          <identifier>oai:ep.eur.nl:1765/9</identifier>
          <datestamp>2006-02-03T14:03:59Z</datestamp>
          <setSpec>hdl_1765_2</setSpec>
        </header>
        <metadata>
          <eur_qdc:dc 
            xmlns:eur_qdc="http://ubib.eur.nl/eur_qdc/1.0" 
            xmlns:dc="http://ubib.eur.nl/dc/">
            <dc:contributor value="author">Jong, G. de</dc:contributor>
            <dc:contributor value="author">Nooteboom, B.</dc:contributor>
            <dc:date value="null">2001-01-04</dc:date>
            <dc:date value="accessioned">2003-03-11T14:00:50Z</dc:date>
            <dc:date value="available">2003-03-11T14:00:50Z</dc:date>
            <dc:date value="created">2001-01-04</dc:date>
            <dc:date value="issued">2001-01-04</dc:date>
            <dc:identifier value="uri">http://hdl.handle.net/1765/9</dc:identifier>
            <dc:description value="null">This study examines...</dc:description>
            <dc:description value="abstract">This study examines...</dc:description>
            <dc:format value="extent">38</dc:format>
            <dc:format value="extent">681026</dc:format>
            <dc:format value="mimetype">application/pdf</dc:format>
            <dc:language value="null">en</dc:language>
            <dc:language value="iso">en_US</dc:language>
            <dc:publisher value="null">Erasmus Research Institute of Management (ERIM), Erasmus University Rotterdam</dc:publisher>
            <dc:relation value="ispartofseries">ERS; ERS-2001-73-ORG</dc:relation>
            <dc:rights value="null">Copyright 2001, G. de  Jong, B. Nooteboom,...</dc:rights>
            <dc:subject value="null">Automobile industries</dc:subject>
            <dc:subject value="null">Learning theory</dc:subject>
            <dc:subject value="null">Social exchange theory</dc:subject>
            <dc:subject value="null">commitment</dc:subject>
            <dc:subject value="null">Supply relationships</dc:subject>
            <dc:subject value="lcc">5001-6182</dc:subject>
            <dc:subject value="lcc">5546.5548.6</dc:subject>
            <dc:subject value="lcc">5548.7-548.85</dc:subject>
            <dc:subject value="lcc">HD-41</dc:subject>
            <dc:title value="null">The Causality of Supply Relationships</dc:title>
            <dc:type value="null">Research paper</dc:type>
            <dc:subject value="jel">M</dc:subject>
            <dc:subject value="jel">M10</dc:subject>
            <dc:subject value="jel">L12</dc:subject>
            <dc:subject value="jel">L14</dc:subject>
            <dc:subject value="ebslg">85A</dc:subject>
            <dc:subject value="ebslg">100B</dc:subject>
            <dc:subject value="ebslg">240B</dc:subject>
            <dc:subject value="ebslg">260N</dc:subject>
            <dc:subject value="ebslg">270K</dc:subject>
            <dc:identifier value="repec">RePEc:dgr:eureri:2001134</dc:identifier>
            <dc:identifier value="webdoc">erimrs20020104123434</dc:identifier>
            <dc:date value="modified">2005-10-27</dc:date>
          </eur_qdc:dc>
        </metadata>
      </record>
    </GetRecord>
  </OAI-PMH>

The metadata payload is in a different namespace than the OAI-PMH one,
namely `Dublin Core`_ (dc) and eur_qdc, a qualified version of Dublin
Core specific to the Erasmus University. OAI-PMH does not specify the
format of the metadata provided, except that for each resource at
least a generic, unqualified Dublin Core representation (metadata
prefix ``oai_dc``) must be provided. This allows OAI-PMH to be used
with many different metadata standards, though varieties of Dublin
Core are the most common.

.. _`Dublin Core`: http://www.dublincore.org
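
Because the payload lives in its own namespaces, a harvester has to
know those namespaces to get at the fields. The sketch below pulls a
few values out of a shortened copy of the ``GetRecord`` response
above; the namespace URIs are the ones that appear in that response,
while the variable names are chosen for this example only::

  import xml.etree.ElementTree as ET

  NS = {
      "oai": "http://www.openarchives.org/OAI/2.0/",
      "eur_qdc": "http://ubib.eur.nl/eur_qdc/1.0",
      "dc": "http://ubib.eur.nl/dc/",
  }

  # Shortened copy of the GetRecord response.
  response = b"""<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
    <GetRecord>
      <record>
        <header>
          <identifier>oai:ep.eur.nl:1765/9</identifier>
          <datestamp>2006-02-03T14:03:59Z</datestamp>
        </header>
        <metadata>
          <eur_qdc:dc xmlns:eur_qdc="http://ubib.eur.nl/eur_qdc/1.0"
                      xmlns:dc="http://ubib.eur.nl/dc/">
            <dc:contributor value="author">Jong, G. de</dc:contributor>
            <dc:contributor value="author">Nooteboom, B.</dc:contributor>
            <dc:title value="null">The Causality of Supply Relationships</dc:title>
            <dc:subject value="jel">M10</dc:subject>
          </eur_qdc:dc>
        </metadata>
      </record>
    </GetRecord>
  </OAI-PMH>"""

  root = ET.fromstring(response)
  record = root.find("oai:GetRecord/oai:record", NS)

  # The header is OAI-PMH's own; the payload uses the repository's format.
  identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
  payload = record.find("oai:metadata/eur_qdc:dc", NS)
  title = payload.findtext("dc:title", namespaces=NS)
  authors = [
      element.text
      for element in payload.findall("dc:contributor[@value='author']", NS)
  ]
  print(identifier, title, authors)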

We've just seen how to look at a single record. OAI-PMH is a
harvesting protocol, meaning it is possible to get a potentially very
long list of records out from the system. This functionality is
provided for with the verbs ``ListIdentifiers`` and
``ListRecords``. ``ListIdentifiers`` is used to harvest a list of
record identifiers available in the repository, without showing the
metadata itself. ``ListRecords`` is used to harvest metadata records
like the one we saw in the ``GetRecord`` example.

Here is a simple example of ``ListIdentifiers``::

  http://ep.eur.nl/oai/request?verb=ListIdentifiers&metadataPrefix=eur_qdc

And this is an example of ``ListRecords``, only retrieving those
records that were added to the repository or updated since April 1,
2006::

  http://ep.eur.nl/oai/request?verb=ListRecords&metadataPrefix=eur_qdc&from=2006-04-01
   
Both these verbs can be qualified with ``from`` and ``until`` to
restrict harvesting to records that were added to or updated in the
system within a particular time range. If the sets feature is
supported by the repository, it is also possible to restrict the
harvesting to particular sets (collections). There is a special
*resumption token* facility so that the server does not have to return
enormous quantities of XML for thousands of records in a single HTTP
response, but instead can batch the results in smaller amounts. A
harvester can in this way retrieve records batch by batch in multiple
smaller requests, and thus avoid stressing the server too heavily.
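
A harvesting loop that follows resumption tokens could look roughly
like the sketch below. The ``fetch`` callable, the ``fake_fetch``
stand-in server and the ``oai:example:...`` identifiers are
hypothetical; a real harvester would perform an HTTP request where
``fake_fetch`` returns canned XML::

  import xml.etree.ElementTree as ET

  OAI = "{http://www.openarchives.org/OAI/2.0/}"


  def harvest_identifiers(fetch, **arguments):
      """Yield record identifiers, following resumption tokens.

      ``fetch`` takes a dict of request parameters and returns the
      response body as bytes.
      """
      params = {"verb": "ListIdentifiers", **arguments}
      while True:
          root = ET.fromstring(fetch(params))
          for header in root.iter(OAI + "header"):
              yield header.findtext(OAI + "identifier")
          token = root.find(".//" + OAI + "resumptionToken")
          # A missing or empty token means this was the last batch.
          if token is None or not (token.text or "").strip():
              return
          # Follow-up requests carry only the verb and the token.
          params = {
              "verb": "ListIdentifiers",
              "resumptionToken": token.text.strip(),
          }


  def fake_fetch(params):
      """Stand-in server that returns two batches of one record each."""
      if "resumptionToken" in params:
          body = (
              "<header><identifier>oai:example:2</identifier></header>"
              "<resumptionToken/>"
          )
      else:
          body = (
              "<header><identifier>oai:example:1</identifier></header>"
              "<resumptionToken>batch-2</resumptionToken>"
          )
      return (
          '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
          "<ListIdentifiers>%s</ListIdentifiers></OAI-PMH>" % body
      ).encode("utf-8")


  identifiers = list(harvest_identifiers(fake_fetch, metadataPrefix="oai_dc"))
  print(identifiers)

Writing the loop as a generator means the caller sees one continuous
stream of identifiers, while the batching stays hidden inside the
function.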

Infrae OAI Pack: ``pyoai``, ``OAICore`` and ``SilvaOAI``
--------------------------------------------------------

Infrae has written a number of OAI-PMH related libraries. Together we
call these libraries the `OAI Pack`_. The components in the OAI Pack
are entirely open source and available under the BSD license.

.. _`OAI Pack`: http://www.infrae.com/products/oaipack

----------------
pyoai for Python
----------------

pyoai is a Python library that implements a Python-friendly OAI
harvester (``oaipmh.client``) as well as an easy way to create OAI
data providers (``oaipmh.server``).

The client library maps the OAI-PMH verbs to Python methods, so that
the programmer can deal with a Python API that returns Python objects,
instead of HTTP and XML. Batching is transparently handled - asking
for a result set of thousands may result in multiple HTTP requests to
the server, but will appear as a single list in Python to the
programmer. The programmer can create support for different metadata
sets by specifying XPath expressions that extract the right
information from the XML that describes the metadata.
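
The idea of describing a metadata format with path expressions can be
sketched with the standard library alone. The ``SimpleReader`` class
below is a hypothetical illustration written for this document, not
pyoai's actual API, and it relies on the limited XPath subset that
``xml.etree.ElementTree`` supports::

  import xml.etree.ElementTree as ET


  class SimpleReader:
      """Map field names to path expressions evaluated on a metadata element."""

      def __init__(self, fields, namespaces):
          self.fields = fields
          self.namespaces = namespaces

      def __call__(self, element):
          # Each field becomes a list of the text values its path selects.
          return {
              name: [
                  node.text
                  for node in element.findall(path, self.namespaces)
              ]
              for name, path in self.fields.items()
          }


  reader = SimpleReader(
      fields={"title": "dc:title", "creator": "dc:creator"},
      namespaces={"dc": "http://purl.org/dc/elements/1.1/"},
  )

  # Made-up unqualified Dublin Core payload, for illustration only.
  metadata = ET.fromstring(
      '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"'
      ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
      "<dc:title>An example paper</dc:title>"
      "<dc:creator>Author, A.</dc:creator>"
      "<dc:creator>Writer, B.</dc:creator>"
      "</oai_dc:dc>"
  )
  fields = reader(metadata)
  print(fields)

Supporting another metadata format then only means supplying a
different mapping of field names to paths.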

The server library allows the programmer to *implement* the OAI-PMH
verbs as Python methods, exactly as they appear from the
perspective of the client library. The programmer also needs to supply
the system with a metadata to XML serializer, so that the server
library can also support arbitrary metadata sets. The server library
then takes care of the bits that generate the OAI-PMH container XML,
handle HTTP request parameters and batching.
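
To make this division of labour concrete, here is a deliberately tiny
sketch of the pattern: the repository object implements a verb as a
method, and a generic function dispatches on the ``verb`` parameter
and wraps the result in the OAI-PMH container. ``ToyRepository`` and
``handle_request`` are hypothetical names invented for this example
and do not reflect pyoai's actual API; the ``Identify`` response is
also incomplete compared to what the protocol requires::

  import xml.etree.ElementTree as ET
  from datetime import datetime, timezone

  OAI_NS = "http://www.openarchives.org/OAI/2.0/"


  class ToyRepository:
      """Hypothetical data provider: one method per supported verb."""

      def identify(self):
          return {"repositoryName": "Toy repository", "protocolVersion": "2.0"}


  def handle_request(repository, base_url, params):
      """Dispatch on the 'verb' parameter and build the OAI-PMH envelope."""
      verbs = {"Identify": repository.identify}
      # Writing xmlns as a plain attribute keeps the serialization simple.
      root = ET.Element("OAI-PMH", {"xmlns": OAI_NS})
      stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
      ET.SubElement(root, "responseDate").text = stamp
      handler = verbs.get(params.get("verb"))
      if handler is None:
          # Unknown verb: report the protocol's badVerb error.
          ET.SubElement(root, "request").text = base_url
          error = ET.SubElement(root, "error", {"code": "badVerb"})
          error.text = "Illegal or missing verb"
      else:
          ET.SubElement(root, "request", dict(params)).text = base_url
          body = ET.SubElement(root, params["verb"])
          for name, value in handler().items():
              ET.SubElement(body, name).text = value
      return ET.tostring(root, encoding="unicode")


  ok = handle_request(ToyRepository(), "http://example.org/oai", {"verb": "Identify"})
  bad = handle_request(ToyRepository(), "http://example.org/oai", {"verb": "Nope"})
  print(ok)
  print(bad)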

Overall the client and server libraries make it easy for a Python
programmer to create both OAI-PMH data providers (servers) and
OAI-PMH service providers using the harvesting component.

Since pyoai is a Python library it can be integrated with any piece of
Python software, such as Zope 3 as well as Zope 2, which are
internally very different. The pyoai server component is used in the
Document Library to make it expose itself as an OAI-PMH
repository. The pyoai client component is a building block used to
construct the ``OAICore`` Zope 2 component.

--------------------
``OAICore`` for Zope
--------------------

``OAICore`` is a Zope 2 extension built on top of pyoai. It can be
used to build Zope 2 applications that make use of OAI-PMH harvested
data, i.e. to construct OAI-PMH service providers. ``OAICore``
harvests data from an OAI-PMH data provider and then stores this data
in Zope's object database (ZODB). It also indexes the metadata using
Zope's catalog functionality so that fast queries over metadata can be
performed, something the OAI-PMH protocol by itself does not
offer. Programmers can control programmatically which metadata fields
get indexed.

``OAICore`` provides a foundation layer on top of which further
Zope-based applications can be built. One of these applications is
``SilvaOAI``.

----------------------
``SilvaOAI`` for Silva
----------------------

Silva_ is an open source CMS based on Zope 2 that is developed by
Infrae. Authors in Silva can add new web pages by adding content
items. Silva can be extended in multiple ways: by adding new content
items, and by adding new external sources. The external sources
feature is used to include arbitrary generated data in other contents,
such as documents.

.. _Silva: http://www.infrae.com/products/silva

``SilvaOAI`` is a Silva extension built on ``OAICore`` that
provides a new content item, the OAI Query. OAI Query items can be
added by a Silva author to a website in order to list metadata
harvested using ``OAICore`` on a web page (as an HTML table). The Silva
author can select the criteria that determine the content of the
listing, such as for instance only the harvested resources that have a
certain author, or only those harvested resources in a particular set.

An OAI Query can be configured programmatically with a schema. Part of
the schema facility is handled by ``OAICore``, the rest by
``SilvaOAI``. A programmer can write new schemas in which a number of
things are configured:

* How to extract metadata fields from the XML, using an XPath
  expression.

* How to index the metadata field (full text, keyword, field-based, no
  index).

* When displaying a detail view on a record, how to present the data.

* When displaying the data in tabular form, which columns to present.

* Which columns can be sorted by the end user.

* Which search criteria can be configured by the Silva author.

* Whether any search UI is exposed to the end user of the web page, so
  that they can create further selections of the metadata.

The ``SilvaOAI`` extension also provides two external sources:

* The OAI source, which allows listings much like those generated by
  OAI Query to be included as external sources in other Silva content,
  such as documents.

* The "cherry picking" source, which allows references to a single
  record to be included as external sources in other Silva content.

Conclusion
----------

OAI-PMH is an interesting protocol that is pragmatic enough to make it
easy to integrate with many programming languages and platforms. It
is little known, but its adoption is in fact widespread in the academic
community. Those who wish to work with OAI-PMH and are working with
the Python language or the Zope platform should consider the
components in the Infrae OAI Pack.
