Deduplication

Deduplication (dedup) ensures that data items are not duplicated across catalogs. By default, the data items in a DCAT catalog (datasets, data services and publishers) are added to the dedup index when they are harvested for the first time. If a data item's identifier already exists in the dedup index, however, the item is not harvested again from the new catalog.

However, a previously harvested catalog should not always take precedence over new catalogs. The configuration can therefore designate a catalog as the primary catalog. At harvest time, the primary catalog takes precedence over any other catalog that includes the same data item (matched by identifier).

Deduplication is configured by designating one catalog as the primary catalog; in the example below, this is catalog 504. A key must also be provided, 'dcterms:identifier' in the example, which is the property used to determine whether two data items are the same. Optionally, deduplication can be turned off by setting disabled to true, in which case the mechanism is not used.

module.exports = {
  dedup: {
    primary: '504',
    key: 'dcterms:identifier',
    disabled: false,
  },
};
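
To make the precedence rules concrete, the following is a minimal sketch of how a harvest-time check could use this configuration. It is an illustration only, not the harvester's actual implementation: the function shouldHarvest, the Map-based index, the catalog ids '101' and '202', and the choice to always harvest items that lack the key are all assumptions made for this example.

```javascript
// Hypothetical sketch of the dedup decision; names and data shapes are assumed.
const config = {
  dedup: { primary: '504', key: 'dcterms:identifier', disabled: false },
};

// Assumed index: maps a key value to the id of the catalog that currently owns it.
const dedupIndex = new Map();

// Returns true if `item` should be harvested from catalog `catalogId`.
function shouldHarvest(item, catalogId, { dedup }, index) {
  if (dedup.disabled) return true; // mechanism switched off
  const id = item[dedup.key];
  if (id === undefined) return true; // assumption: no key, nothing to compare
  const owner = index.get(id);
  // Harvest when the item is new, already belongs to this catalog,
  // or this catalog is the primary and therefore takes precedence.
  if (owner === undefined || owner === catalogId || catalogId === dedup.primary) {
    index.set(id, catalogId);
    return true;
  }
  return false;
}

const item = { 'dcterms:identifier': 'abc' };
console.log(shouldHarvest(item, '101', config, dedupIndex)); // true: first time seen
console.log(shouldHarvest(item, '202', config, dedupIndex)); // false: owned by 101
console.log(shouldHarvest(item, '504', config, dedupIndex)); // true: primary wins
```

In this sketch, once the primary catalog has claimed an item, other catalogs can no longer harvest it, including the catalog that first supplied it.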