Harvester Documentation
Overview
The jOAI harvester is used to retrieve metadata records from remote OAI data providers and save them to the local file system, one record per file. In addition, records that have been harvested may be packaged into zip archives that can be downloaded and opened through the harvester's web-based interface. The harvester can be configured to harvest automatically at regular intervals, effectively maintaining a mirror of the remote repository on the local file system.
The jOAI harvester supports OAI protocol versions 1.1 and 2.0, including data providers that use resumption tokens for flow control, selective harvesting by date or set, gzip response compression, and other protocol features. See the Harvester FAQ for additional information.
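To make selective harvesting and resumption tokens concrete, the following Python sketch builds OAI-PMH 2.0 ListRecords request URLs and reads the resumption token out of a response. It is an illustration of the protocol only, not the jOAI implementation: the function names are hypothetical, and the namespace constant assumes a protocol version 2.0 provider.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses (version 1.1 providers use different ones).
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix=None, set_spec=None,
                     from_date=None, until_date=None, resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper, not part of jOAI)."""
    if resumption_token:
        # resumptionToken is an exclusive argument in OAI-PMH 2.0:
        # it is sent with the verb and no other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec        # selective harvesting by set
        if from_date:
            params["from"] = from_date      # selective harvesting by date
        if until_date:
            params["until"] = until_date
    return base_url + "?" + urlencode(params)


def next_resumption_token(response_xml):
    """Return the resumption token in a response, or None if the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()
```

A harvester would request the first URL, save the returned records, and keep issuing requests with the returned token until `next_resumption_token` returns `None`.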
Harvester setup
1. Install the jOAI software on a system in a servlet container such as Apache Tomcat.
See INSTALL.md for installation instructions. If you are reading this page, this step has most likely been completed already.
2. Complete the harvester setup. Add a new harvest and provide the following:
- Enter a repository name (required)
- Provide a repository base URL that starts with http:// (required)
- Include a setSpec (optional)
- Provide the metadata format being harvested (required)
- Indicate if the harvest should occur at regular intervals (optional)
- Indicate where metadata files should be saved (required)
- Indicate how metadata files are saved (by set or not)
The repository name is a name to describe the data provider being harvested. The harvester status table is organized as an alphabetical listing of repository names.
The base URL is the access point of a data provider. It is a web address that starts with http://.
The harvested metadata format can be any metadata format, as long as it matches a format offered by the provider being harvested. Use the OAI ListMetadataFormats request to find the metadata formats available at the provider. A ListMetadataFormats request looks like:
http://some.provider.org/base/url?verb=ListMetadataFormats
That is, concatenate [base URL] + [?verb=ListMetadataFormats].
The ListMetadataFormats request returns an XML document in which each metadataPrefix element names one of the available metadata formats.
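As a rough illustration of the request and response described above, this Python sketch builds the ListMetadataFormats URL and pulls the metadataPrefix values out of a response document. The function names and the sample response are made up for the example, and the namespace assumes an OAI protocol version 2.0 provider.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 responses (an assumption; 1.1 providers differ).
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_metadata_formats_url(base_url):
    """Concatenate [base URL] + [?verb=ListMetadataFormats]."""
    return base_url + "?verb=ListMetadataFormats"


def metadata_prefixes(response_xml):
    """Return the text of every metadataPrefix element in the response."""
    root = ET.fromstring(response_xml)
    return [el.text.strip() for el in root.iter(OAI_NS + "metadataPrefix") if el.text]


# Hand-written sample response, trimmed to the parts relevant here.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""

print(list_metadata_formats_url("http://some.provider.org/base/url"))
print(metadata_prefixes(SAMPLE_RESPONSE))
```

Any value returned by `metadata_prefixes` can be entered as the metadata format in the harvester setup.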
Harvesting automatically at regular intervals means that a time interval (in days, hours, minutes, or seconds) can be specified that tells the jOAI harvester when and how often to perform an automatic harvest that checks the provider for new and updated records.
Saving files at the default harvest location means metadata files are saved to a directory within the OAI application context, generally of the form "~oai/WEB-INF/harvested_records/". To view the full path of this default directory, click on the save files help button (the question mark).
Saving files to a non-default harvest location means metadata files are saved either to a user-specified location, given as a full directory path, or to a recently used location.
If a setSpec is specified, metadata files are saved as a single group. If a setSpec is not specified, metadata files can be saved into one large group (the 'do not split by set' option) or into many groups (the 'split by set' option), depending on how the provider being harvested is organized. The default save option is 'do not split by set'.
Harvest test files
Conduct a test harvest by completing the harvester setup section above using one of the registered data providers listed at openarchives.org:
- Repository name: name of repository
- Repository base URL: BASE_URL
- Metadata format: oai_dc
Leave all other fields blank and save the entry.
On the Harvester Setup and Status page, click 'View harvest history' to see the harvest being performed. Click 'Refresh page' to see the number of metadata files increase. The entire harvest may take several minutes to complete.
The test harvest is successful if the metadata files can be viewed by one of these methods from the Harvester Setup and Status page:
- Locate and go to the 'Harvested to' directory on the server and view the files.
- If zipping of files was enabled, under 'Download zipped harvest', click on 'Most recent'. Save the zip file to your Desktop, unzip it, and view the harvested records.
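Beyond viewing the files by hand, a short script can confirm that every harvested file parses as XML. The Python sketch below is one way to do that. It assumes the harvested records carry a .xml extension, and the function name and the directory you pass in (the 'Harvested to' directory or an unzipped archive) are placeholders for this example rather than anything jOAI provides.

```python
from pathlib import Path
import xml.etree.ElementTree as ET


def check_harvested_records(directory):
    """Return (well_formed, broken) lists of .xml paths found under directory.

    Subdirectories are searched too, which covers harvests that were
    saved in several groups.
    """
    well_formed, broken = [], []
    for path in sorted(Path(directory).rglob("*.xml")):
        try:
            ET.parse(str(path))          # raises ParseError if not well-formed
            well_formed.append(path)
        except ET.ParseError:
            broken.append(path)
    return well_formed, broken
```

Calling `check_harvested_records` with the path shown under 'Harvested to' returns two lists. A non-empty first list and an empty second list suggest the harvest wrote complete, well-formed records. This checks well-formedness only, not validity against a schema.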
Registered data providers
The Open Archives Initiative maintains a list of registered data providers that can be harvested.
The Java Harvester API
The jOAI code base includes a Harvester API that may be used in Java programs to harvest from OAI data providers. The API is part of the DLESETools.jar Java library, found in the $tomcat/webapps/oai/WEB-INF/lib/ directory of the jOAI installation. See the Harvester Javadoc for details. Use of the API assumes familiarity with the Java programming language.
Harvest, validate and transform from the command line
Linux shell scripts are included in the jOAI distribution that allow you to perform OAI harvests, XML validation, and XML transformations from the command line.
- To install: See the instructions provided in the README and script files located in the jOAI installation at $tomcat/webapps/oai/WEB-INF/bin/. Once installed, the scripts do not require the jOAI Web application in order to be used.
harvest - This script performs harvests from OAI data providers and saves the harvested records as individual files on disk. It accepts options to harvest by date range, set, and variations on how the metadata is written to files. It is simply a wrapper to the Java Harvester API mentioned above.
validate - This script performs schema validation on a single XML file or batch validation on a directory of files, outputting a summary report of the results.
transform - This script performs an XSL transformation on a single XML file or batch transformations on a directory of files, outputting the transformed XML files to a directory.