Link Checker

Introduction

Overview

Link checking currently consists of the following parts:

  • Link extractor - pipeline script that extracts relevant links into a backlog per catalog
  • Link checker - pipeline script that checks the extracted links and generates a report entry for each catalog. It also exports an API links JSON file.
  • Link check report UI - EntryScape UI presenting the report entry data
  • API link checker - pipeline script that checks whether extracted links point to a Swagger or OpenAPI JSON file

Scheduled for implementation:

  • Link check notifications - pipeline script that sends notification emails
  • Link check notifications UI - registry UI to configure notifications

[Figure: Link checker overview]

Inquiries

A link check inquiry consists of the following information:

| Name | Value | Example |
| --- | --- | --- |
| uri | The link to check | https://www.migrationsverket.se/oppna-data/uppehallstillstand-till-adoptivbarn-1984-2014 |
| property | The property pointing to the URI | dcat:accessURL |
| entryType | The class on which the property is expressed | http://www.w3.org/ns/dcat#Distribution |
| entryURI | The URI for the instance of the class where the link failed | https://www.migrationsverket.se/datasets/dcat#distribution15 |
| entryLabel | A label for the main URI | Excel |
| createdAt | Date and time when the link was added to the queue | 2020-09-30T12:11:20.621Z |
| context | ID of the context (catalog) the entry belongs to | 308 |
| status | success, broken or excluded | broken |
| statusCode | HTTP response code | 404 |
| statusMessage | Message corresponding to the response code | Not Found (404) |
| checkedAt | Date and time when the link was last checked | 2020-10-14T06:09:48.000Z |
| attempts | Number of checking attempts, max 3 | 1 |

Link extractor

At a high level, the link extractor adds every link it finds that fits the provided configuration to the backlog. It is not smart: it does not care whether a link was already checked recently (yesterday, for example).

Configuration

The links that will be detected are based on a configuration that basically lists a set of properties grouped by classes. A single property may be checked on multiple classes, e.g. dcterms:conformsTo.

Sample configuration

extractLinksOnHarvest: false,
extractLinksProps: {
  'dcat:Distribution': [
    'dcterms:conformsTo',
    'dcat:accessURL',
    'dcat:downloadURL',
  ],
  'dcat:Dataset': [
    'dcat:landingPage',
    'dcterms:conformsTo',
    'foaf:page',
    'owl:versionInfo',
  ],
},
  • extractLinksOnHarvest: Expects a boolean value that denotes whether the link extractor should run after a harvesting job.

  • extractLinksProps: Expects an object whose keys are entry types (classes) and whose values are arrays of the properties to check for links.
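
As an illustration of how such a configuration could drive the extraction, here is a minimal sketch. It is not the pipeline's actual code: the function name `extractInquiries` and the in-memory entry shape (`type`, `uri`, `label`, `values`) are assumptions made for this example, and it skips namespace expansion (the real export uses full class URIs for entryType).

```javascript
// Hypothetical sketch, not the actual pipeline implementation.
// Assumed entry shape: { type, uri, label, values: { property: [uri, ...] } }.
function extractInquiries(entries, extractLinksProps, context, now = new Date()) {
  const inquiries = [];
  for (const entry of entries) {
    // Properties configured for this entry's class; unknown classes yield nothing.
    const properties = extractLinksProps[entry.type] || [];
    for (const property of properties) {
      for (const uri of (entry.values || {})[property] || []) {
        inquiries.push({
          entryType: entry.type,
          entryURI: entry.uri,
          entryLabel: entry.label,
          context,
          uri,
          property,
          createdAt: now.toISOString(),
        });
      }
    }
  }
  return inquiries;
}

// Example input: only dcat:accessURL is configured, so dcterms:license is ignored.
const sampleEntries = [
  {
    type: 'dcat:Distribution',
    uri: 'https://example.com/datasets/dcat#distribution1',
    label: 'Excel',
    values: {
      'dcat:accessURL': ['https://example.com/data.xlsx'],
      'dcterms:license': ['https://example.com/license'],
    },
  },
];
const sampleInquiries = extractInquiries(
  sampleEntries,
  { 'dcat:Distribution': ['dcat:accessURL'] },
  '308',
);
```

Every configured property value becomes its own inquiry, which matches the extractor's behaviour of adding all matching links without filtering.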

Export

The link extractor produces a JSON file containing the relevant link check inquiries.

Filename format

link-inquiries-`context`-2020-09-30T12:11:20.169Z.json
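
The timestamp in the filename has the same shape as an ISO 8601 string, so a name of this form could be built as in the sketch below. The helper name `inquiryFileName` is hypothetical and only illustrates the format.

```javascript
// Hypothetical helper illustrating the filename format; not taken from the pipeline.
function inquiryFileName(context, date = new Date()) {
  // Date.prototype.toISOString() yields e.g. 2020-09-30T12:11:20.169Z
  return `link-inquiries-${context}-${date.toISOString()}.json`;
}

const exampleFileName = inquiryFileName('308', new Date('2020-09-30T12:11:20.169Z'));
```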

Sample export

[
  {
    "entryType": "http://www.w3.org/ns/dcat#Distribution",
    "entryURI": "https://www.migrationsverket.se/datasets/dcat#distribution15",
    "entryLabel": "Excel",
    "context": "308",
    "uri": "https://www.migrationsverket.se/oppna-data/uppehallstillstand-till-adoptivbarn-1984-2014",
    "property": "dcat:accessURL",
    "createdAt": "2020-09-30T12:11:20.621Z"
  }
]

Running the extractor

./pl export links `contextID`

This may seem confusing since we want to run the extractor, but the extraction is part of the export process.

On success, a JSON file containing the link check inquiries should be in entryscape-pipelines/output/links, and a message like the following should show up in the terminal:

Link extractor - Exported 32 link inquiries for context `contextID` on file link-inquiries-`contextID`-2020-09-30T12:11:20.169Z.

Link check report

A link check report is produced per catalog and is represented as an entry stored in the context of the catalog. As always, an entry can consist of two parts, the resource and the metadata.

  • Resource - A detailed report in JSON containing the checked inquiry objects from the link checker.
  • Metadata - An overview of the resource, including the number of links checked, how many failed or were excluded, and which ones failed.

Resource

The JSON file should be very simple: an array of checked inquiry objects, e.g.:

[
  {
    "entryType": "http://www.w3.org/ns/dcat#Distribution",
    "entryURI": "https://ckan-storsthlm.dataplatform.se/dataset/f6333fc4-e07f-40d3-9405-a41875b840a1/resource/835017be-3ee2-459c-9fdd-bcbc5039790e",
    "entryLabel": "Årlig energislutanvändning Sundbyberg",
    "context": "427",
    "uri": "http://www.statistikdatabasen.scb.se/sq/86494",
    "property": "dcat:accessURL",
    "createdAt": "2020-10-12T09:31:20.348Z",
    "statusCode": 200,
    "checkedAt": "2020-10-12T09:39:25.000Z",
    "status": "success",
    "statusMessage": "OK"
  },
  ...
]

Note that if multiple inquiry objects refer to the same URI, some information is duplicated, since the same check info applies to each of them. This is intentional and increases readability.

Metadata

@prefix esterms: <http://entryscape.com/terms/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    <https://registrera.oppnadata.se/store/34/resource/2>
      a esterms:LinkCheckReport ;
      dcterms:created "2020-01-21T14:30:01" ;
      esterms:linkChecks "3"^^xsd:integer ;
      esterms:excludedLinkChecks "0"^^xsd:integer ;
      esterms:failedLinkChecks "2"^^xsd:integer ;
      esterms:failedLinkCheck <http://example.com/one> ;
      esterms:failedLinkCheck <http://example.com/two> .
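
The overview values in the metadata can be derived from the resource array. The sketch below shows one way to do that. It is an assumption, not the actual implementation: the function name `summarizeReport` is made up, and it assumes that the counts refer to distinct URIs and that failed links are the ones with status "broken".

```javascript
// Hypothetical sketch: derives the metadata overview from the checked inquiries.
// Assumption: counts are over distinct URIs, and "failed" means status === 'broken'.
function summarizeReport(checks) {
  const distinctUris = (list) => [...new Set(list.map((check) => check.uri))];
  const failed = distinctUris(checks.filter((check) => check.status === 'broken'));
  const excluded = distinctUris(checks.filter((check) => check.status === 'excluded'));
  return {
    linkChecks: distinctUris(checks).length,
    excludedLinkChecks: excluded.length,
    failedLinkChecks: failed.length,
    failedLinkCheck: failed, // the URIs listed individually in the metadata
  };
}

const exampleSummary = summarizeReport([
  { uri: 'http://example.com/ok', status: 'success' },
  { uri: 'http://example.com/one', status: 'broken' },
  { uri: 'http://example.com/two', status: 'broken' },
]);
```

For this input the summary lines up with the sample metadata above: three links checked, none excluded, and two failed.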

UI

Provides access to the list of broken links. It is available in the organization harvesting status dialog, under the link check report tab. Broken links are shown as a list in the UI, together with a link to the raw JSON data.

The UI should contain an overview showing how many links were checked and how many failed. A download report button should also be exposed. The entire report should be visible, either in an expandable section or in another tab, listing all links that failed first and then all that succeeded. For each link, it should be clear which main resource (dataset, distribution etc.) it belongs to, using the main_label, and which property pointed to the link.

Configuration

registry: {
  includeLinkCheckReport: true,
},

Link check notifications

** NOT IMPLEMENTED YET **

UI

It should be possible to configure notification preferences; see for example the harvesting notifications configuration.

Script

Pipelines job. If so configured in the catalog pipeline, an email based on a template should be sent, containing the information in the metadata and a link to the detailed JSON file.

Link checker

The link checker checks the available links, generates a report entry for each catalog, and exports potential API links to an api-links.json file. It mostly uses the UrlChecker from the broken-link-checker package, together with some additions to fit our use case.

High level functionality overview:

  • Reads the following files and populates an inbox inquiry list:
    • link check file, contains extracted inquiries
    • pending file, contains inquiries with a previously failed check
    • recent file, contains checks done recently. Recency can be set using the linkCheckSkipRetryHours configuration variable
  • Iterates the inbox. If an inquiry has been checked recently (a check exists in the recent file), it assigns the previous response. Otherwise it adds the inquiry to the broken link checker queue.
  • The broken link checker iterates over its queue and makes a HEAD request to each queued inquiry URI. Depending on the response, it emits one of the error, queue, link or end events. When the end of the queue has been reached, an end event is emitted.
  • The link checker takes over again and:
    • creates an api-links.json file containing checks that correspond to potential API links. These links are identified through the extractLinksApiProps configuration parameter
    • generates a report entry for each context
    • cleans up the old link check files
    • updates the pending and recent files
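
The inbox step above, where recently checked links reuse their previous response and everything else is queued, could look roughly like the following sketch. The function name `triageInbox` and the return shape are assumptions for illustration; the field names come from the inquiry description earlier in this document.

```javascript
// Hypothetical sketch of the inbox iteration; not the actual link checker code.
const HOUR_MS = 60 * 60 * 1000;

function triageInbox(inbox, recentChecks, linkCheckSkipRetryHours, now = new Date()) {
  // Only checks newer than the cutoff count as "recent".
  const cutoff = now.getTime() - linkCheckSkipRetryHours * HOUR_MS;
  const recentByUri = new Map();
  for (const check of recentChecks) {
    if (new Date(check.checkedAt).getTime() >= cutoff) {
      recentByUri.set(check.uri, check);
    }
  }

  const resolved = []; // inquiries answered from the recent file
  const queue = []; // inquiries that still need a request
  for (const inquiry of inbox) {
    const previous = recentByUri.get(inquiry.uri);
    if (previous) {
      resolved.push({
        ...inquiry,
        status: previous.status,
        statusCode: previous.statusCode,
        statusMessage: previous.statusMessage,
        checkedAt: previous.checkedAt,
      });
    } else {
      queue.push(inquiry);
    }
  }
  return { resolved, queue };
}
```

Keying the recent checks by URI means that several inquiries pointing to the same link share one earlier response, which is consistent with the duplication noted for the report resource.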

Configuration

  • linkCheckSkipRetryHours: Expects a number of hours. Determines which checks are read from the recent file when the link checker starts, and which ones are written to the updated recent file when it finishes.
  • extractLinksApiProps: An object containing an array of properties per entry type. A single property may be checked on multiple types, e.g. dcterms:conformsTo. This is necessary for the API link checker to work properly.
  • linkCheckExclude: A regular expression; every URL that matches it is excluded. Default: None [regexp]
  • linkCheck: An object containing values for:
    • maxSockets: How many sockets to use. Default: 10 [int]
    • hostDelay: Delay time (milliseconds) between link checks. Default: 500 [int]
    • attempts: How many request attempts per link. Default: 3 [int]
    • dryRun: Whether to do a test run without any commits. Default: False [bool]
    • verboseOutput: Whether to produce a more detailed log. Default: False [bool]
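
To make the defaults and the exclusion rule concrete, here is a small sketch of how these settings might be resolved. The helper names `resolveLinkCheckConfig` and `isExcluded` are hypothetical; only the option names and default values are taken from the list above.

```javascript
// Hypothetical helpers; only the option names and defaults come from the documentation.
const LINK_CHECK_DEFAULTS = {
  maxSockets: 10,
  hostDelay: 500,
  attempts: 3,
  dryRun: false,
  verboseOutput: false,
};

// Values given in the configuration override the defaults.
function resolveLinkCheckConfig(config) {
  return { ...LINK_CHECK_DEFAULTS, ...(config.linkCheck || {}) };
}

// A URL is excluded when linkCheckExclude is set and matches it.
// Avoid the "g" flag on the pattern, since RegExp.prototype.test is stateful with it.
function isExcluded(uri, config) {
  return config.linkCheckExclude instanceof RegExp && config.linkCheckExclude.test(uri);
}
```

With `linkCheck: { maxSockets: 20 }`, for example, the resolved configuration keeps hostDelay at 500 while maxSockets becomes 20.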

Sample configuration

extractLinksApiProps: {
  'dcat:Distribution': ['dcterms:conformsTo'],
  'dcat:DataService': ['dcat:endpointDescription'],
},
linkCheck: {
  maxSockets: 20,
  hostDelay: 200,
  attempts: 3,
  dryRun: false,
  verboseOutput: false
},

Running the link checker

./pl linkcheck

API link checker

The API link checker checks a set of links and tries to identify whether each one corresponds to a Swagger or OpenAPI JSON file. The links to check come from the api-links.json file that is exported by the link checker if the extractLinksApiProps configuration parameter has been set.
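
One common way to recognise such a file is by its top-level version field: Swagger 2.0 documents carry a `swagger` field and OpenAPI 3.x documents carry an `openapi` field. The sketch below relies on that convention. It is an assumption about how detection could work, not the checker's actual logic, and the function name `detectApiDefinition` and the `type` values it returns are made up for this example.

```javascript
// Hypothetical sketch: classify an already parsed JSON document.
// Relies on the top-level "swagger" (2.0) and "openapi" (3.x) version fields.
function detectApiDefinition(doc) {
  if (doc === null || typeof doc !== 'object') {
    return null; // not a JSON object, so not an API definition
  }
  if (typeof doc.swagger === 'string') {
    return { type: 'swagger', version: doc.swagger };
  }
  if (typeof doc.openapi === 'string') {
    return { type: 'openapi', version: doc.openapi };
  }
  return null;
}

const swaggerDetection = detectApiDefinition({ swagger: '2.0', info: { title: 'Example' } });
const openApiDetection = detectApiDefinition({ openapi: '3.0.1', info: { title: 'Example' } });
```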

Output

The API link checker's output is an api-link-check-results.json file with the following format:

{
    "data": [
        {
            "contextId" : "43",
            "entryId" : "43369", // Dataset
            "detections" : [
                {
                    "entryId" : "43323",   // Distribution
                    "apiDefinition": "http://urltillswagger",
                    "type": "swagger",
                    "version": "2.0"
                },
                {
                    "entryId" : "43324", // Distribution
                    "apiDefinition": "http://urltillswagger",
                    "type": "swagger",
                    "version": "3.0"
                }
            ]
        },
        {
            "contextId" : "43",
            "entryId" : "43370", // Data service
            "detections" : [
                {
                    "entryId" : "43370",   // Same as above
                    "apiDefinition": "http://urltillswagger",
                    "type": "swagger",
                    "version": "2.0"
                }
            ]
        }
    ]
}

Running the API link checker

./pl apilinkcheck