Harvesting and Indexing

Overview

Harvesting is the process of retrieving data records from one or more data providers and storing them in a collection. This includes retrieving new records, updating modified records, and deleting records that are no longer available from a resource. Harvesting only manages the items in a collection and does not itself update the index. The index should nevertheless be updated in conjunction with each harvest, by retrieving the items modified since the last index operation.

-- NOTE: Need to tag last index operation, i.e. have metadata about the last
time an index was updated.

Harvest Operation

DiGIR Providers

Harvesting begins by generating a set of harvest tasks. For a DiGIR resource, the tasks are a set of queries that, when executed, together retrieve all records from the resource.

The basic process for handling tasks, which applies to all items that are not under manual control, is:

# Harvest overview for DiGIR-style data providers
for task in tasks:
  # Get the records from a target. records is a list of XML documents.
  records = task.retrieve()
  for record in records:
    # Create, update, or touch items
    try:
      existing = Item.objects.get(id=record.id)
    except Item.DoesNotExist:
      # Doesn't exist, add new record
      item = CreateItemFromRecord(record)
      item.save()
      continue
    if existing.hash != CalculateRecordHash(record):
      existing.update(record)
    # Touch the item so the cleanup below does not treat it as stale
    existing.lastHarvest = now()
    existing.save()

# Delete items not harvested since the cutoff and not manually controlled
cutoff = now() - max_untouched_age
untouched = Item.objects.filter(lastHarvest__lt=cutoff)
untouched = untouched.filter(manualControl=False)
untouched.delete()
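
The create/update/touch logic and the cleanup step can be tried out without a database. The following is a minimal, self-contained sketch that keeps items in a plain dictionary. The names harvest, purge_untouched and calculate_record_hash, the dictionary layout, and the choice of SHA-256 for the record hash are all assumptions made for illustration, not part of the actual implementation.

```python
import hashlib
from datetime import datetime, timedelta, timezone

def calculate_record_hash(record_xml: str) -> str:
    """Return a stable digest of a record's XML text (SHA-256 is an assumption)."""
    return hashlib.sha256(record_xml.encode("utf-8")).hexdigest()

def harvest(store: dict, records: dict, now: datetime) -> None:
    """Create, update, or touch items in store from harvested records.

    store maps record id -> {"xml", "hash", "last_harvest", "manual"}.
    records maps record id -> XML text retrieved from the provider.
    """
    for record_id, xml in records.items():
        digest = calculate_record_hash(xml)
        item = store.get(record_id)
        if item is None:
            # New record: create the item.
            store[record_id] = {"xml": xml, "hash": digest,
                                "last_harvest": now, "manual": False}
        else:
            if item["hash"] != digest:
                # Content changed: replace the stored copy.
                item["xml"] = xml
                item["hash"] = digest
            # Seen in this harvest, so mark it as touched either way.
            item["last_harvest"] = now

def purge_untouched(store: dict, now: datetime, max_age: timedelta) -> list:
    """Delete items not seen within max_age, unless manually controlled."""
    cutoff = now - max_age
    stale = [rid for rid, item in store.items()
             if item["last_harvest"] < cutoff and not item["manual"]]
    for rid in stale:
        del store[rid]
    return stale
```

Comparing hashes avoids rewriting items whose content has not changed, while the touch step is what allows records that have disappeared from the provider to be detected later by their age.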

TAPIR Providers

TBD.

Pseudo-Static Sources

These data sources provide a URL that points to a single document (perhaps zipped) that contains a copy of all the Darwin Core records. The entire set can be retrieved and processed with minimal interaction from the server.
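
As a rough illustration of handling a "perhaps zipped" document, the sketch below accepts the downloaded bytes and returns the document itself, unzipping only when the payload is a zip archive. The function name load_document and the requirement that the archive hold exactly one member are assumptions; how the payload is downloaded and how the records are parsed are left open.

```python
import io
import zipfile

def load_document(payload: bytes) -> bytes:
    """Return the document bytes, unzipping the payload if it is a zip archive."""
    buffer = io.BytesIO(payload)
    if not zipfile.is_zipfile(buffer):
        # Not an archive: the payload is already the document.
        return payload
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        names = archive.namelist()
        if len(names) != 1:
            # Assumption: a pseudo-static source ships a single document.
            raise ValueError("expected exactly one document in the archive")
        return archive.read(names[0])
```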

TBD.

Push

The rcache implementation provides a REST-style interface supporting CRUD+L (Create, Read, Update, Delete, and List) operations, which allow an authenticated remote process to manage records within the collection. This method lets data holders expose their content without operating a web server.
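
To make the five operations concrete, here is a hypothetical in-memory stand-in for the collection. It is not the rcache interface: the class name, method names, and return values are assumptions, and authentication and the HTTP layer are omitted entirely.

```python
class RecordCollection:
    """In-memory stand-in for a collection managed through CRUD+L calls."""

    def __init__(self):
        self._records = {}

    def create(self, record_id: str, body: str) -> bool:
        # Refuse to overwrite: creating an existing id is treated as an error.
        if record_id in self._records:
            return False
        self._records[record_id] = body
        return True

    def read(self, record_id: str):
        # Returns None when the record does not exist.
        return self._records.get(record_id)

    def update(self, record_id: str, body: str) -> bool:
        # Only existing records can be updated.
        if record_id not in self._records:
            return False
        self._records[record_id] = body
        return True

    def delete(self, record_id: str) -> bool:
        return self._records.pop(record_id, None) is not None

    def list_ids(self) -> list:
        # List returns identifiers only, in a stable order.
        return sorted(self._records)
```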

TBD.

Indexing the Collection

Indexing is the process of creating an index that enables generation of subsets of the items in a collection. The indexing process iterates through all recently modified items, generates an index entry for each, and updates the index.

The process for indexing a collection is:

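# Select items modified since the last index operation
items = Item.objects.filter(modified__gt=index.modified)
for item in items:
  indexrepr = index.getItemRepresentation(item)
  index.add(indexrepr)
# Record when the index was last updated
index.modified = now()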
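
The incremental behaviour can be sketched with a toy index that remembers when it was last updated. Everything here is an assumption for illustration: the SimpleIndex class, the term-to-ids structure, and the update_index helper are hypothetical and say nothing about the index technology actually used.

```python
from datetime import datetime, timezone

class SimpleIndex:
    """Toy inverted index: term -> set of item ids (hypothetical structure)."""

    def __init__(self):
        self.terms = {}
        # Metadata about the last index operation; every item counts as new at first.
        self.modified = datetime.min.replace(tzinfo=timezone.utc)

    def get_item_representation(self, item: dict):
        """Reduce an item to (id, lowercase terms) for indexing."""
        return item["id"], set(item["text"].lower().split())

    def add(self, representation) -> None:
        item_id, terms = representation
        # Drop stale postings so a re-indexed item does not match old terms.
        for ids in self.terms.values():
            ids.discard(item_id)
        for term in terms:
            self.terms.setdefault(term, set()).add(item_id)

def update_index(index: SimpleIndex, items: list, now: datetime) -> int:
    """Index items modified since the last run and return how many were indexed."""
    changed = [i for i in items if i["modified"] > index.modified]
    for item in changed:
        index.add(index.get_item_representation(item))
    # Tag the index with the time of this operation.
    index.modified = now
    return len(changed)
```

Storing the timestamp on the index itself is one way to keep the metadata about the last index operation; a second run with no modified items then does no work.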

The collection of DataResource instances defines the entire set of targets that are harvested.

An instance of DataResource describes a target to be harvested.

The harvest of a DataResource involves generation of a set of DataHarvestTask instances. Each DataHarvestTask retrieves a chunk of data from the DataResource and is an atomic operation that either succeeds or fails.

DataHarvestTask instances are generated by the DataResource instance through a call to loadHarvestTasks().

The state of a harvesting operation can be determined by examining the list of DataHarvestTask instances.

loadHarvestTasks() can be called by an authorized user through the web interface or through the system management tasks.

DataHarvestTask instances are processed by a process separate from the web service, though that processing is controlled through the web interface (administrator only).
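
The relationship between DataResource and DataHarvestTask might look roughly like the sketch below. The field names (record_count, chunk_size, start, limit), the three task states, and the fetch callable are assumptions made for illustration; load_harvest_tasks stands in for loadHarvestTasks(), and the real models may differ substantially.

```python
from dataclasses import dataclass, field
from enum import Enum

class TaskState(Enum):
    PENDING = "pending"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

@dataclass
class DataHarvestTask:
    """One atomic chunk of a harvest: records [start, start + limit)."""
    start: int
    limit: int
    state: TaskState = TaskState.PENDING

    def run(self, fetch) -> list:
        # The chunk either succeeds as a whole or is marked failed.
        try:
            records = fetch(self.start, self.limit)
        except Exception:
            self.state = TaskState.FAILED
            return []
        self.state = TaskState.SUCCEEDED
        return records

@dataclass
class DataResource:
    """A harvest target; record_count and chunk_size are assumed fields."""
    name: str
    record_count: int
    chunk_size: int = 100
    tasks: list = field(default_factory=list)

    def load_harvest_tasks(self) -> list:
        # One task per chunk so the tasks together cover every record.
        self.tasks = [DataHarvestTask(start, self.chunk_size)
                      for start in range(0, self.record_count, self.chunk_size)]
        return self.tasks

    def harvest_state(self) -> dict:
        # Summarize progress by counting tasks in each state.
        counts = {state: 0 for state in TaskState}
        for task in self.tasks:
            counts[task.state] += 1
        return counts
```

Because each task covers a fixed chunk and records its own outcome, a failed chunk can be retried on its own, and the overall state of a harvest follows directly from the task list.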