Harvester

Running the harvester

  • The harvester is run using the forever wrapper, so it is restarted automatically if it dies.
  • It checks repeatedly whether it has been signed out, in order to handle restarts of entrystore and automatic sign-outs after the two-week period.
  • It listens for signals from the outside so that it can shut down gracefully.
  • All output is logged to special.
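
As a minimal sketch of the graceful-shutdown behaviour described above, assuming a Node.js process: the helper name `createShutdownFlag` and the choice of the SIGTERM and SIGINT signals are illustrative assumptions, not the harvester's actual code.

```javascript
// Hypothetical helper: records that a shutdown was requested so that a main
// loop can finish its current job before exiting. `emitter` is anything with
// a `once(name, handler)` method, such as the Node.js `process` object.
function createShutdownFlag(emitter, signals = ['SIGTERM', 'SIGINT']) {
  const state = { requested: false, signal: null };
  for (const signal of signals) {
    emitter.once(signal, () => {
      state.requested = true;
      state.signal = signal;
    });
  }
  return state;
}

// Possible usage in a real process (assumption):
//   const shutdown = createShutdownFlag(process);
//   ...before picking up the next job: if (shutdown.requested) process.exit(0);
```

Checking the flag between jobs, rather than exiting inside the signal handler, lets the job in progress complete first.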

Harvester process

  1. It checks for all pending or ongoing pipeline results (harvester jobs).
  2. It runs one job at a time.
  3. Each pipeline result is an instantiation of a pipeline.
  4. Each pipeline may contain one or several transforms.
  5. Only pipelines starting with an empty transform are considered on the client side, as the others are handled by entrystore.
  6. Transforms are run and the pipeline result is updated to reflect the new status.
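
The steps above can be sketched roughly as follows. The job shape (`id`, `status`, `pipeline.transforms`), the status strings `'pending'` and `'ongoing'`, and the transform type name `'empty'` are all assumptions made for illustration; the real data lives in entrystore and may look different.

```javascript
// Hypothetical job shape: { id, status, pipeline: { transforms: [{ type }] } }.
// Keeps only pending or ongoing jobs whose pipeline starts with an empty transform.
function selectClientJobs(jobs) {
  return jobs.filter((job) =>
    (job.status === 'pending' || job.status === 'ongoing') &&
    job.pipeline.transforms.length > 0 &&
    job.pipeline.transforms[0].type === 'empty');
}

// Runs the selected jobs strictly one at a time. `runJob` is a caller-supplied
// async function that executes the transforms and resolves to the new status.
async function runJobsSequentially(jobs, runJob) {
  const finished = [];
  for (const job of selectClientJobs(jobs)) {
    // Awaiting inside the loop guarantees that jobs never overlap.
    job.status = await runJob(job);
    finished.push(job.id);
  }
  return finished;
}
```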

Pipeline result instantiation

A cron job runs nightly to generate new pipeline results from the pipelines in the entrystore.

Executing a pipeline

A pipeline contains a set of transforms that are executed in order. Each transform passes its output to the following transform for further processing.

In some situations it is better to pass along a series of values. This is useful when fetching large amounts of data, and it is achieved with streamable transforms. A transform is streamable if its "stream" attribute has the value "start", "intermediate" or "stop". A stream must consist of a single start transform, followed by zero or more intermediate transforms, and end with a single stop transform.

All streamable transforms support a single "init" call, multiple "process" calls and finally a single "finish" call.
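
To illustrate the stream rules and the init/process/finish lifecycle, here is a small sketch. The function names `groupTransforms` and `driveTransform`, and the assumption that each transform is an object with a `stream` attribute and `init`, `process` and `finish` methods, are hypothetical; how values actually flow between the transforms in a stream is not specified here.

```javascript
// Splits an ordered list of transforms into groups: a standalone transform
// becomes a group of one, a stream becomes a group of start..stop.
// Throws if the "stream" attributes do not form well-formed streams.
function groupTransforms(transforms) {
  const groups = [];
  let current = null; // the stream currently being collected, if any
  for (const t of transforms) {
    switch (t.stream) {
      case 'start':
        if (current) throw new Error('stream started inside another stream');
        current = [t];
        break;
      case 'intermediate':
        if (!current) throw new Error('intermediate transform outside a stream');
        current.push(t);
        break;
      case 'stop':
        if (!current) throw new Error('stop transform outside a stream');
        current.push(t);
        groups.push(current);
        current = null;
        break;
      default: // no stream attribute: standalone transform
        if (current) throw new Error('standalone transform inside a stream');
        groups.push([t]);
    }
  }
  if (current) throw new Error('stream was never stopped');
  return groups;
}

// Lifecycle of one streamable transform: one init, one process call per
// item, then one finish.
function driveTransform(transform, items) {
  transform.init();
  const outputs = items.map((item) => transform.process(item));
  transform.finish();
  return outputs;
}
```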

Results of executing pipelines

The status of executing a pipeline can be reported per transform. However, in most situations the status of the standalone transforms and of the final transform in each stream is the most important. Hence, we refer to standalone transforms and the last transform in a stream as primary transforms, and to the other transforms as supportive transforms.

The result of running a pipeline with transforms is an array of integers, each taking one of the following five values:

Value  Status     Type        Explanation
1      Succeeded  Primary
-1     Failed     Primary
0      Succeeded  Supportive
-2     Failed     Supportive  Will hinder any following supportive transforms from being run
-3     Not run    Supportive  A supportive transform in this stream has failed before this transform was reached
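
The table can be expressed as a small lookup, sketched below. The names `STATUS`, `isPrimary` and `statusCode`, and the outcome strings `'succeeded'`, `'failed'` and `'notRun'`, are hypothetical. The sketch treats every transform that is not a start or intermediate transform as primary, following the definition of primary transforms above.

```javascript
// The five values from the table, under hypothetical constant names.
const STATUS = Object.freeze({
  PRIMARY_SUCCEEDED: 1,
  PRIMARY_FAILED: -1,
  SUPPORTIVE_SUCCEEDED: 0,
  SUPPORTIVE_FAILED: -2,
  SUPPORTIVE_NOT_RUN: -3,
});

// Standalone transforms and stop transforms are primary; start and
// intermediate transforms are supportive.
function isPrimary(transform) {
  return transform.stream !== 'start' && transform.stream !== 'intermediate';
}

// Maps a transform and its outcome to the integer reported in the result array.
function statusCode(transform, outcome) {
  if (!['succeeded', 'failed', 'notRun'].includes(outcome)) {
    throw new RangeError('unknown outcome: ' + outcome);
  }
  if (isPrimary(transform)) {
    if (outcome === 'notRun') {
      // The table lists no "not run" value for primary transforms.
      throw new RangeError('no "not run" value is defined for primary transforms');
    }
    return outcome === 'succeeded' ? STATUS.PRIMARY_SUCCEEDED : STATUS.PRIMARY_FAILED;
  }
  if (outcome === 'succeeded') return STATUS.SUPPORTIVE_SUCCEEDED;
  return outcome === 'failed' ? STATUS.SUPPORTIVE_FAILED : STATUS.SUPPORTIVE_NOT_RUN;
}
```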

A pipeline result is considered successful if at least one of the primary transforms succeeded. If successful, the following triple will be set in the entry information:

pipelineResultEntryURI
    store:status  <http://entrystore.org/terms/Success> .

Where store is expanded to: http://entrystore.org/terms/

The pipeline result will have the following triples:

<pipResURI>
    storepr:successCount "nr"^^xsd:integer ;
    storepr:allSucceeded "true"^^xsd:boolean ;
    storepr:oneSucceeded "true"^^xsd:boolean .

Where storepr is expanded to: http://entrystore.org/terms/pipelineresult#
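
One plausible way to derive these three values from the status array is sketched below. Only the meaning of `oneSucceeded` follows directly from the success rule stated above; the readings of `successCount` (number of succeeded primary transforms) and `allSucceeded` (every primary transform succeeded) are assumptions, as is the function name `summarize`.

```javascript
// Derives summary values from an array of per-transform status integers,
// where 1 and -1 denote succeeded and failed primary transforms.
function summarize(statuses) {
  const primary = statuses.filter((s) => s === 1 || s === -1);
  const successCount = primary.filter((s) => s === 1).length;
  return {
    successCount,
    // Assumption: a result with no primary transforms does not count as "all succeeded".
    allSucceeded: primary.length > 0 && successCount === primary.length,
    oneSucceeded: successCount > 0,
  };
}
```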

Passed on objects

Most transforms accept an object with the following signature:

{
    mainURI: "http://example.com",
    mainGraph: _instance_of_Graph,
    extraGraphs: {
        extraURI_11: _instance_of_Graph,
        extraURI_12: _instance_of_Graph
    },
    extraIndex: {
        extraURI_1: true,
        extraURI_2: true,
        ...
        extraURI_12: true
    }
}

The mainURI and the extraURIs are assumed to correspond to entries in a context with the given resourceURIs.

If a main resource depends on a supporting resource, e.g. a publisher, that has already been imported by a previously passed-on object, the supporting resource will not be passed on again as an extraGraph, since it is already listed in the extraIndex. If the validation transform (or some other transform) needs to filter out or block a resource from being imported, make sure to remove the affected extraURIs from the extraIndex to avoid broken links in later imports.
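
The bookkeeping around the extraIndex can be sketched as follows. The function names are hypothetical, and the sketch assumes that the same extraIndex object is shared between consecutive passed-on objects, which the text implies but does not state outright.

```javascript
// Returns the supporting-resource URIs that still need to be passed on as
// extraGraphs, i.e. those not yet registered in the extraIndex.
function urisNotYetIndexed(dependencyURIs, extraIndex) {
  return dependencyURIs.filter((uri) => !extraIndex[uri]);
}

// To be called when a transform blocks a passed-on object from being imported:
// its extra URIs are removed from the index, so that a later object depending
// on the same supporting resource gets it passed on again.
function releaseExtras(passedOn) {
  for (const uri of Object.keys(passedOn.extraGraphs)) {
    delete passedOn.extraIndex[uri];
  }
}
```

Without the second step, a later main resource would point to a supporting resource that was marked as imported but never actually stored.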

ValidationTransform

The validation transform uses RDForms templates to validate resources. To know which template to use, the validation transform requires a map from rdf:type to RDForms template. Provide this in the config like this:

type2template: {
    'dcat:Catalog': 'dcat:OnlyCatalog',
    'dcat:Dataset': 'dcat:OnlyDataset',
    'dcat:Distribution': 'dcat:OnlyDistribution',
    'vcard:Kind': 'dcat:contactPoint',
    'vcard:Individual': 'dcat:contactPoint',
    'vcard:Organization': 'dcat:contactPoint',
    'foaf:Agent': 'dcat:foaf:Agent',
}

It is also possible to force a check that at least one resource exists for each of a set of specified mandatory types by providing this config:

mandatoryTypes: ['dcat:Catalog']

The validation transform can be part of a stream as an intermediate, but it can also be standalone.
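
The template lookup and the mandatory-type check described above could work roughly like the sketch below. The function names, and the representation of each resource as a plain array of its rdf:type values, are assumptions; the actual transform operates on RDF graphs.

```javascript
// Returns the template id for the first rdf:type that has a mapping,
// or undefined if none of the types is mapped.
function templateFor(rdfTypes, type2template) {
  const match = rdfTypes.find((type) =>
    Object.prototype.hasOwnProperty.call(type2template, type));
  return match === undefined ? undefined : type2template[match];
}

// `resources` is an array with one entry per resource, each entry being the
// array of that resource's rdf:type values. Returns the mandatory types for
// which no resource was found.
function missingMandatoryTypes(resources, mandatoryTypes) {
  const present = new Set(resources.flat());
  return mandatoryTypes.filter((type) => !present.has(type));
}
```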

In addition, the config has the following options:

{
    validate: {
        saveReport: true,
        mandatoryFieldsRequired: true,
        to: "matthias@metasolutions.se",
        mailReportOnMissingMandatoryFields: true,
        mailReportOnMissingRecommendedFields: true,
    }
}

Option                                Explanation
saveReport                            A text version of the validation report is saved as an entry and pointed to via prov:generated.
mandatoryFieldsRequired               Resources not fulfilling the fields marked as mandatory in the matching RDForms template will not be imported (they are removed so that they will not be handled by the merge transform).
to                                    Email address(es) to send reports to; can be an array. Mail is only sent if the global mail: true flag is set.
mailReportOnMissingMandatoryFields    Send mail if mandatory fields are missing.
mailReportOnMissingRecommendedFields  Send mail even if only recommended fields are missing.

The validation transform will add the following triples to the pipeline result:

pipelineResultResourceURI
    storepr:validateMandatoryMissing   "1"^^xsd:integer ;
    storepr:validateErrors            "23"^^xsd:integer ;
    storepr:validateWarnings          "17"^^xsd:integer ;
    storepr:validateDeprecated         "2"^^xsd:integer ;
    storepr:validateMainResourceCount "33"^^xsd:integer .

Property                   Explanation
validateMandatoryMissing   Number of mandatory types for which no resources were matched.
validateErrors             Number of missing fields marked as mandatory, across all resources.
validateWarnings           Number of missing fields marked as recommended, across all resources.
validateDeprecated         Number of fields present that are marked as deprecated, across all resources.
validateMainResourceCount  Number of main resources, e.g. datasets, not including distributions, contact points and publishers.