
Indexing

Overview

Harvesting is the process of retrieving data records from one or more data providers and storing them in a collection. This includes retrieving new records, updating modified records, and deleting records that are no longer available from a resource. The harvesting process only manages the items in a collection. It does not include indexing, though the index should be updated in conjunction with each harvest by processing the items that were modified after the last index operation.

-- NOTE: Need to tag last index operation, i.e. have metadata about the last
time an index was updated.
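
The three harvest operations described in the overview (add new records, update modified records, delete records that are no longer available) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the Record class, the harvest() function, and the use of a plain dict as the collection are hypothetical names, not part of the actual design.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Record:
    # Hypothetical record shape: an identifier, a modification time, and a payload.
    identifier: str
    modified: datetime
    payload: str


def harvest(collection, provider_records):
    """Bring `collection` (a dict keyed by identifier) in line with the provider.

    Returns a tuple with the number of added, updated, and deleted records.
    """
    added = updated = deleted = 0
    seen = set()
    for record in provider_records:
        seen.add(record.identifier)
        existing = collection.get(record.identifier)
        if existing is None:
            # New record: not yet present in the collection.
            collection[record.identifier] = record
            added += 1
        elif record.modified > existing.modified:
            # Modified record: the provider copy is newer than ours.
            collection[record.identifier] = record
            updated += 1
    # Records the provider no longer offers are removed from the collection.
    for identifier in list(collection):
        if identifier not in seen:
            del collection[identifier]
            deleted += 1
    return added, updated, deleted
```

Note that the sketch only touches the collection; the index is left alone, which matches the separation between harvesting and indexing described above.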

Indexing the Collection

Indexing is the process of creating an index that enables the generation of subsets of the items in a collection. The indexing process iterates through all items modified since the index was last updated, generates an index entry for each, and updates the index accordingly.

The process for indexing a collection is:

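# Select items modified since the index was last updated (Django-style lookup).
items = Item.objects.filter(modified__gt=index.modified)
for item in items:
    # Build the index entry for this item and add it to the index.
    indexrepr = index.getItemRepresentation(item)
    index.add(indexrepr)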
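
The loop above depends on the index remembering when it was last updated, as raised in the earlier note. A minimal self-contained sketch of that idea follows. It is an assumption, not the actual implementation: the Index class, its update() method, and the dict-based items are hypothetical, and the item representation (a lower-cased title) is chosen only to keep the example short.

```python
from datetime import datetime


class Index:
    """Toy index mapping item id to its representation, with a last-update time."""

    def __init__(self):
        self.entries = {}
        # Metadata about the last index operation; nothing has been indexed yet.
        self.modified = datetime.min

    def get_item_representation(self, item):
        # Hypothetical representation: the lower-cased title only.
        return item["title"].lower()

    def update(self, items, now):
        """Index items modified since the last update; return how many were indexed."""
        changed = [item for item in items if item["modified"] > self.modified]
        for item in changed:
            self.entries[item["id"]] = self.get_item_representation(item)
        # Tag the index with the time of this operation for the next run.
        self.modified = now
        return len(changed)
```

Passing the current time in as an argument, rather than reading the clock inside update(), keeps the sketch deterministic and easy to test.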

The collection of DataResource instances defines the entire set of targets that are harvested.

An instance of DataResource describes a target to be harvested.

The harvest of a DataResource involves generating a set of DataHarvestTask instances. Each DataHarvestTask retrieves a chunk of data from the DataResource and is an atomic operation that either succeeds or fails.

DataHarvestTask instances are generated by the DataResource instance through a call to loadHarvestTasks().

The state of a harvesting operation can be determined by examining the list of DataHarvestTask instances.
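
The relationship between DataResource, DataHarvestTask, and the overall harvest state might look like the sketch below. Only the class names and loadHarvestTasks() come from these notes. The chunking by offset and size, the TaskState values, the run() and harvest_state() methods, and the fetch callable are all assumptions made for illustration.

```python
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


class DataHarvestTask:
    """One atomic chunk of a harvest: it either succeeds or fails as a whole."""

    def __init__(self, offset, size):
        self.offset = offset
        self.size = size
        self.state = TaskState.PENDING

    def run(self, fetch):
        # `fetch` is a hypothetical callable that retrieves one chunk of records.
        try:
            fetch(self.offset, self.size)
            self.state = TaskState.SUCCEEDED
        except Exception:
            self.state = TaskState.FAILED


class DataResource:
    def __init__(self, total_records, chunk_size):
        # Assumed attributes: the notes do not say how chunks are defined.
        self.total_records = total_records
        self.chunk_size = chunk_size
        self.tasks = []

    def loadHarvestTasks(self):
        """Generate one task per chunk of the resource."""
        self.tasks = [
            DataHarvestTask(offset, min(self.chunk_size, self.total_records - offset))
            for offset in range(0, self.total_records, self.chunk_size)
        ]
        return self.tasks

    def harvest_state(self):
        """Summarise the harvest by examining the list of tasks."""
        if not self.tasks:
            return "not started"
        states = {task.state for task in self.tasks}
        if TaskState.FAILED in states:
            return "failed"
        if TaskState.PENDING in states:
            return "in progress"
        return "complete"
```

Because each task records its own outcome, a failed chunk can be identified and retried without repeating the chunks that already succeeded.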

loadHarvestTasks() can be called by an authorized user through the web interface or through the system management tasks.

Processing of DataHarvestTask instances is performed by a process separate from the web service, though it is controlled through the web interface (administrator only).