Link Checker¶
Introduction¶
Overview¶
Link checking right now consists of the following parts:
- Link extractor - pipeline script that extracts relevant links into a backlog per catalog
- Link checker - pipeline script that checks the extracted links and generates a report entry for each catalog. Also exports an api links json file.
- Link check report UI - EntryScape UI presenting the report entry data
- API link checker - pipeline script that checks extracted links for a Swagger or OpenAPI json file.
Scheduled for implementation:
- Link check notifications - pipeline script that sends notification emails
- Link check notifications UI - registry UI to configure notifications
Above: Link checker overview
Inquiries¶
A link check inquiry consists of the following information:
Name | Value | Example |
---|---|---|
uri | The link to check | https://www.migrationsverket.se/oppna-data/uppehallstillstand-till-adoptivbarn-1984-2014 |
property | The property pointing to the URI | dcat:accessURL |
entryType | The class on which the property is expressed | http://www.w3.org/ns/dcat#Distribution |
entryURI | The URI for the instance of the class where the link failed | https://www.migrationsverket.se/datasets/dcat#distribution15 |
entryLabel | A label for the main URI | Excel |
createdAt | Date and time when link was added to the queue | 2020-09-30T12:11:20.621Z |
context | ID of the context (catalog) the link was extracted from | 308 |
status | success, broken or excluded | broken |
statusCode | HTTP response code | 404 |
statusMessage | Message corresponding to the response code | Not Found (404) |
checkedAt | Date and time when the link was last checked | 2020-10-14T06:09:48.000Z |
attempts | Number of checking attempts, max 3 | 1 |
Link Extractor¶
On a high level, link extractor does the following:
- Extracts relevant links from a catalog based on the available configuration
- Creates a link check inquiry for each relevant link
- Exports the inquiries in json files
The link extractor is deliberately simple: it adds every link it finds that fits the provided configuration, regardless of whether the link was checked recently.
Configuration¶
The links to detect are determined by a configuration that lists a set of properties grouped by class. A single property may be checked on multiple classes, e.g. dcterms:conformsTo.
Sample configuration
extractLinksOnHarvest: false,
extractLinksProps: {
'dcat:Distribution': [
'dcterms:conformsTo',
'dcat:accessURL',
'dcat:downloadURL',
],
'dcat:Dataset': [
'dcat:landingPage',
'dcterms:conformsTo',
'foaf:page',
'owl:versionInfo',
],
},
- extractLinksOnHarvest: Expects a boolean value that denotes whether the link extractor should run after a harvesting job.
- extractLinksProps: Expects an object with entry types as keys and arrays as values, each array listing the properties we want to check for links.
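To make the role of extractLinksProps more concrete, here is a minimal sketch of how inquiries could be derived from such a configuration. It is not the pipeline's actual code: the entry shape (`uri`, `type`, `label`, `statements`), the `NAMESPACES` map, the `expand` and `extractInquiries` names, and the rule that only http(s) values count as links are all assumptions made for illustration.

```javascript
// Sketch only: entries are simplified plain objects, not real EntryScape entries.
// Assumption: a minimal prefix map, just enough for the dcat classes in the sample.
const NAMESPACES = { dcat: 'http://www.w3.org/ns/dcat#' };

// Expand a prefixed name such as 'dcat:Distribution' to a full URI.
// Names with an unknown prefix are returned unchanged.
function expand(prefixed) {
  const [prefix, local] = prefixed.split(':');
  return NAMESPACES[prefix] ? NAMESPACES[prefix] + local : prefixed;
}

// Hypothetical extractor: one inquiry per (entry, property, link value).
function extractInquiries(entries, extractLinksProps, context, now = new Date()) {
  const inquiries = [];
  for (const [type, properties] of Object.entries(extractLinksProps)) {
    const entryType = expand(type);
    for (const entry of entries.filter((e) => e.type === entryType)) {
      for (const property of properties) {
        for (const value of entry.statements[property] || []) {
          // Assumption: only http(s) values are treated as links.
          if (!/^https?:\/\//.test(value)) continue;
          inquiries.push({
            entryType,
            entryURI: entry.uri,
            entryLabel: entry.label,
            context,
            uri: value,
            property,
            createdAt: now.toISOString(),
          });
        }
      }
    }
  }
  return inquiries;
}
```

Note that the sketch does no deduplication and no recency check, in line with the extractor simply adding every link that fits the configuration.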
Export¶
The link extractor produces a json file with the relevant link check inquiries.
Filename format
link-inquiries-`context`-2020-09-30T12:11:20.169Z.json
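As a small illustration of the filename format, the name could be assembled from the context ID and an ISO 8601 timestamp. The helper name `inquiriesFilename` is hypothetical and not taken from the pipeline code.

```javascript
// Hypothetical helper mirroring the filename format shown above.
function inquiriesFilename(context, now = new Date()) {
  // toISOString() yields e.g. 2020-09-30T12:11:20.169Z
  return `link-inquiries-${context}-${now.toISOString()}.json`;
}
```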
Sample export
[
{
"entryType": "http://www.w3.org/ns/dcat#Distribution",
"entryURI": "https://www.migrationsverket.se/datasets/dcat#distribution15",
"entryLabel": "Excel",
"context": "308",
"uri": "https://www.migrationsverket.se/oppna-data/uppehallstillstand-till-adoptivbarn-1984-2014",
"property": "dcat:accessURL",
"createdAt": "2020-09-30T12:11:20.621Z"
}
]
Running the extractor¶
./pl export links `contextID`
This might seem confusing since we want to run the extractor, but the extraction is part of the exporting process.
On success, a json file containing the link check inquiries should be in entryscape-pipelines/output/links and a message like the following should show up in the terminal:
Link extractor - Exported 32 link inquiries for context `contextID` on file link-inquiries-`contextID`-2020-09-30T12:11:20.169Z.
Link Check Report¶
A link check report is produced per catalog and is represented as an entry stored in the context of the catalog. As always, an entry can consist of two parts, the resource and the metadata.
- Resource - A detailed report in JSON containing the checked inquiry objects from the link checker.
- Metadata - An overview of the resource, including the number of links checked, how many failed or were excluded, and which ones failed.
Resource¶
The json file is a simple array of checked inquiry objects, e.g.:
[
{
"entryType": "http://www.w3.org/ns/dcat#Distribution",
"entryURI": "https://ckan-storsthlm.dataplatform.se/dataset/f6333fc4-e07f-40d3-9405-a41875b840a1/resource/835017be-3ee2-459c-9fdd-bcbc5039790e",
"entryLabel": "Årlig energislutanvändning Sundbyberg",
"context": "427",
"uri": "http://www.statistikdatabasen.scb.se/sq/86494",
"property": "dcat:accessURL",
"createdAt": "2020-10-12T09:31:20.348Z",
"statusCode": 200,
"checkedAt": "2020-10-12T09:39:25.000Z",
"status": "success",
"statusMessage": "OK"
},
...
]
Note that if multiple inquiry objects refer to the same URI, the check information is duplicated across them (since the same check info applies). This is intentional, to increase readability.
Metadata¶
@prefix esterms: <http://entryscape.com/terms/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
<https://registrera.oppnadata.se/store/34/resource/2>
a esterms:LinkCheckReport ;
dcterms:created "2020-01-21T14:30:01" ;
esterms:linkChecks "3"^^xsd:integer ;
esterms:excludedLinkChecks "0"^^xsd:integer ;
esterms:failedLinkChecks "2"^^xsd:integer ;
esterms:failedLinkCheck <http://example.com/one> ;
esterms:failedLinkCheck <http://example.com/two> .
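The relation between the resource and the metadata can be sketched as a simple aggregation over the checked inquiries. The function name `summarizeReport` is hypothetical, and it is an assumption that the counts are per inquiry while the list of failed links holds unique URIs; the source does not state this.

```javascript
// Hypothetical aggregation from checked inquiries to the metadata values.
// Assumption: counts are per inquiry object, failed links are listed once per URI.
function summarizeReport(checks) {
  const failed = checks.filter((check) => check.status === 'broken');
  const excluded = checks.filter((check) => check.status === 'excluded');
  return {
    linkChecks: checks.length,
    excludedLinkChecks: excluded.length,
    failedLinkChecks: failed.length,
    // Several inquiries may point at the same link, so deduplicate the URIs.
    failedLinkCheck: [...new Set(failed.map((check) => check.uri))],
  };
}
```

With one successful and two broken checks, this would give the same numbers as the metadata example: 3 link checks, 0 excluded and 2 failed.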
UI¶
Provides access to the list of broken links. It is available in the organization harvesting status dialog, under the link check report tab. Broken links are shown as a list in the UI, together with a link to the raw json data.
The UI should contain an overview showing how many links were checked and how many failed. A download report button should also be exposed. The entire report should be visible either in an expandable section or in another tab, listing all failed links first and then all links that succeeded. For each link, it should be clear which main resource (dataset, distribution etc.) it belongs to, using the main_label, and which property pointed to the link.
Configuration¶
registry: {
includeLinkCheckReport: true,
},
Link Check Notifications¶
** NOT IMPLEMENTED YET **
UI¶
Should be able to configure notification preferences, see for example harvesting notifications configuration.
Script¶
Pipelines job. If so configured in the catalog pipeline, a mail based on a template should be sent, containing the information in the metadata and a link to the detailed JSON file.
Link Checker¶
The link checker checks the available links, generates a report entry for each catalog, and exports potential API links to an api-links.json file. It mostly uses the broken link checker package's URLChecker together with some additions to fit our use case.
High level functionality overview:
- Reads the following files and populates an inbox inquiry list
  - link check file, contains extracted inquiries
  - pending file, contains inquiries with a previously failed check
  - recent file, contains checks done recently. Recency can be set using the linkCheckSkipRetryHours configuration variable
- Iterates the inbox. If an inquiry has been checked recently (a check exists on the recent file), it assigns the previous response. Otherwise it adds the inquiry to the broken link checker queue.
- The broken link checker iterates over its queue and makes a HEAD request to each queued inquiry URI. Depending on the response, it emits one of the error, queue, link or end events. When the end of the queue has been reached, an end event is emitted.
- The link checker takes over again and
  - creates an api-links.json file containing checks that correspond to potential API links. These links are identified through the extractLinksApiProps configuration parameter
  - generates a report entry for each context
  - cleans up the old link check files
  - updates the pending and recent files
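The inbox step above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: the function name `partitionInbox`, the in-memory arrays standing in for the files, and matching previous checks by `uri` are all hypothetical.

```javascript
// Hypothetical sketch: split the inbox into inquiries that reuse a recent check
// and inquiries that still need to be queued for a real request.
function partitionInbox(inbox, recentChecks, linkCheckSkipRetryHours, now = new Date()) {
  const cutoff = now.getTime() - linkCheckSkipRetryHours * 60 * 60 * 1000;

  // Keep only checks that are newer than the cutoff, indexed by URI.
  const recentByUri = new Map();
  for (const check of recentChecks) {
    if (new Date(check.checkedAt).getTime() >= cutoff) {
      recentByUri.set(check.uri, check);
    }
  }

  const resolved = []; // inquiries answered from the recent checks
  const queue = []; // inquiries that need a new check
  for (const inquiry of inbox) {
    const previous = recentByUri.get(inquiry.uri);
    if (previous) {
      // Assign the previous response to the new inquiry.
      const { status, statusCode, statusMessage, checkedAt } = previous;
      resolved.push({ ...inquiry, status, statusCode, statusMessage, checkedAt });
    } else {
      queue.push(inquiry);
    }
  }
  return { resolved, queue };
}
```

Indexing the recent checks by URI keeps the lookup per inquiry constant-time, which matters when a catalog contributes many inquiries pointing at the same few hosts.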
Configuration¶
- linkCheckSkipRetryHours: Expects a number corresponding to hours. Used to filter which checks are read from the recent file when the link checker initializes and which ones end up in the updated file when it finishes.
- extractLinksApiProps: An object containing an array of properties per entry type. A single property may be checked on multiple types, e.g. dcterms:conformsTo. This is necessary for the API link checker to work properly.
- linkCheckExclude: A regular expression for excluding every url that matches it. Default: None [regexp]
- linkCheck: An object containing values for:
  - maxSockets: How many sockets to use. Default: 10 [int]
  - hostDelay: Delay time (milliseconds) between link checks. Default: 500 [int]
  - attempts: How many request attempts per link. Default: 3 [int]
  - dryRun: Whether to do a test run without any commits. Default: False [bool]
  - verboseOutput: Whether to produce a more detailed log. Default: False [bool]
Sample configuration
extractLinksApiProps: {
'dcat:Distribution': ['dcterms:conformsTo'],
'dcat:DataService': ['dcat:endpointDescription'],
},
linkCheck: {
maxSockets: 20,
hostDelay: 200,
attempts: 3,
dryRun: false,
verboseOutput: false
},
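A minimal sketch of how the documented defaults and the linkCheckExclude expression might be applied is shown below. The names `LINK_CHECK_DEFAULTS`, `resolveLinkCheckConfig` and `isExcluded` are hypothetical; only the default values come from the list above.

```javascript
// Defaults as documented above; the constant name itself is an assumption.
const LINK_CHECK_DEFAULTS = {
  maxSockets: 10,
  hostDelay: 500,
  attempts: 3,
  dryRun: false,
  verboseOutput: false,
};

// Hypothetical helper: values from the configuration override the defaults.
function resolveLinkCheckConfig(config) {
  return { ...LINK_CHECK_DEFAULTS, ...(config.linkCheck || {}) };
}

// Hypothetical helper: a link is excluded when it matches linkCheckExclude.
// With no expression configured (the default), nothing is excluded.
function isExcluded(uri, linkCheckExclude) {
  if (!linkCheckExclude) return false;
  return new RegExp(linkCheckExclude).test(uri);
}
```

For example, a configuration that only sets maxSockets and hostDelay would still end up with attempts set to 3.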
Running the link checker¶
./pl linkcheck
API Link Checker¶
The API link checker checks a set of links to identify whether they correspond to a Swagger or OpenAPI json file. The links to check come from the api-links.json file, which the link checker exports if the extractLinksApiProps configuration parameter has been set.
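One way such a detection could work is to look at the version field of the parsed JSON: Swagger 2.0 documents carry a top-level `swagger` string, while OpenAPI 3 documents carry a top-level `openapi` string. The sketch below relies only on that; the function name `detectApiDefinition` and the returned `type` values are assumptions and may differ from what the pipeline reports.

```javascript
// Hypothetical detection based on the top-level version field of a parsed document.
function detectApiDefinition(document) {
  if (document === null || typeof document !== 'object') return null;
  if (typeof document.swagger === 'string') {
    return { type: 'swagger', version: document.swagger };
  }
  if (typeof document.openapi === 'string') {
    return { type: 'openapi', version: document.openapi };
  }
  // Valid JSON, but not recognizable as an API definition.
  return null;
}
```

A real check would first have to fetch the link and parse the response as JSON, treating network and parse failures as "no detection".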
Output¶
The API link checker's output is an api-link-check-results.json file with the following format:
{
"data": [
{
"contextId" : "43",
"entryId" : "43369", // Dataset
"detections" : [
{
"entryId" : "43323", // Distribution
"apiDefinition": "http://urltillswagger",
"type": "swagger",
"version": "2.0"
},
{
"entryId" : "43324", // Distribution
"apiDefinition": "http://urltillswagger",
"type": "swagger",
"version": "3.0"
}
]
},
{
"contextId" : "43",
"entryId" : "43370", // Data service
"detections" : [
{
"entryId" : "43370", // Same as above
"apiDefinition": "http://urltillswagger",
"type": "swagger",
"version": "2.0"
}
]
}
]
}
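The nesting in this format, with detections grouped under a context and a main entry, can be sketched as a grouping step. The flat input shape (with a `parentEntryId` field) and the function name `groupDetections` are hypothetical and only serve to illustrate the structure.

```javascript
// Hypothetical grouping of flat detections into the output structure above.
// Assumption: each flat item knows its context and its main entry (parentEntryId).
function groupDetections(flatDetections) {
  const groups = new Map();
  for (const { contextId, parentEntryId, ...detection } of flatDetections) {
    const key = `${contextId}/${parentEntryId}`;
    if (!groups.has(key)) {
      groups.set(key, { contextId, entryId: parentEntryId, detections: [] });
    }
    groups.get(key).detections.push(detection);
  }
  return { data: [...groups.values()] };
}
```

For a data service, the main entry and the detected entry are the same, so `parentEntryId` and `entryId` would carry the same value.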
Running the API link checker¶
./pl apilinkcheck