Transforms
A pipeline is a recipe for performing a data processing job. The recipe consists of transforms, where each transform processes the data and passes it on to the next transform. One important aspect of the data flow between transforms is that in some situations it is better to pass along a series of values rather than all values at once. This is useful when fetching large amounts of data, and is achieved by using streamable transforms. A streamable transform has three modes, "start", "intermediate" and "stop", and depending on the mode, the data value handled by the transform is processed differently. Because data values are passed on between transforms as they become available, streamable transforms run concurrently. Non-streamable transforms, by contrast, process all values at once and then pass them all on to the next transform.
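The difference between the two kinds of transform can be illustrated with a minimal Python sketch. It is not the actual implementation: the function names and sample data are hypothetical, the three modes are not modelled, and Python generators only interleave the two transforms rather than running them truly concurrently. The log records the order in which values are handled.

```python
from typing import Iterable, Iterator, List

# Hypothetical sample data: two "pages" of values from a data source.
PAGES = [["a", "b"], ["c"]]


def fetch_streaming(pages: Iterable[List[str]], log: List[str]) -> Iterator[str]:
    """Streamable fetch (assumed shape): hand over each value as soon as it is read."""
    for page in pages:
        for value in page:
            log.append(f"fetch {value}")
            yield value


def convert_streaming(values: Iterable[str], log: List[str]) -> Iterator[str]:
    """Streamable convert: process each value as it arrives from the previous transform."""
    for value in values:
        log.append(f"convert {value}")
        yield value.upper()


def fetch_batch(pages: Iterable[List[str]], log: List[str]) -> List[str]:
    """Non-streamable fetch: collect all values before passing anything on."""
    values = [value for page in pages for value in page]
    log.extend(f"fetch {value}" for value in values)
    return values


def convert_batch(values: List[str], log: List[str]) -> List[str]:
    """Non-streamable convert: receives the complete list in one go."""
    log.extend(f"convert {value}" for value in values)
    return [value.upper() for value in values]


stream_log: List[str] = []
stream_result = list(convert_streaming(fetch_streaming(PAGES, stream_log), stream_log))

batch_log: List[str] = []
batch_result = convert_batch(fetch_batch(PAGES, batch_log), batch_log)
```

Both variants produce the same result. In the streaming log the fetch and convert entries alternate value by value, while in the batch log all fetch entries come before the first convert entry.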
Depending on the data source and its intended use case, transforms can be combined in various ways to fulfill the task at hand. For some use cases, the data processing job can be done with a pre-defined pipeline, which already includes the transforms needed to perform the job. One example of such a reusable pipeline handles RDF data sources that follow the DCAT application profile:
[
  {"type": "empty"},
  {"type": "fetch",
   "args": {"source": "https://oppnadata.regionblekinge.se/store/2/metadata/2?recursive=dcat"}
  },
  {"type": "validate",
   "args": {"profile": "dcat_ap_se"}
  },
  {"type": "merge",
   "args": {}
  }
]
This pipeline performs five steps. The first transform is an empty one, which signals that this is a client-side pipeline that is not to be processed directly by EntryStore. The second step checks the availability of the data source (it has no entry of its own in the listing above). The third is a transform of type fetch, which harvests the metadata from the source. The fourth validates that the harvested data conforms to the specification, in this case the DCAT-AP-SE profile. Lastly, the fifth merges the newly harvested data with data previously harvested using this pipeline.
Note: in this case the data source is known to be RDF, so there is no need to include a transform that converts the data into RDF.
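To make the idea of a pipeline as an ordered list of typed transforms more concrete, here is a minimal Python sketch of how such a definition could be executed client-side. Everything in it is an assumption made for illustration: the handler functions, the `run_pipeline` helper, the placeholder source URL and the sample records are hypothetical, no network access takes place, and the real harvester may work quite differently.

```python
import json
from typing import Any, Callable, Dict, List

Record = Dict[str, str]
Handler = Callable[[List[Record], Dict[str, Any]], List[Record]]

# Same shape as the pipeline definition above, with a placeholder source URL.
PIPELINE_JSON = """
[
  {"type": "empty"},
  {"type": "fetch", "args": {"source": "https://example.org/catalog"}},
  {"type": "validate", "args": {"profile": "dcat_ap_se"}},
  {"type": "merge", "args": {}}
]
"""

# Stand-in for data harvested by an earlier run of the same pipeline.
PREVIOUSLY_HARVESTED: List[Record] = [
    {"source": "https://example.org/catalog", "title": "Dataset 0"}
]


def empty(data: List[Record], args: Dict[str, Any]) -> List[Record]:
    """Marker transform: passes the data through unchanged."""
    return data


def fetch(data: List[Record], args: Dict[str, Any]) -> List[Record]:
    """Stub harvest: returns a fixed record instead of contacting the source."""
    return [{"source": args["source"], "title": "Dataset A"}]


def validate(data: List[Record], args: Dict[str, Any]) -> List[Record]:
    """Toy check standing in for real profile validation."""
    for record in data:
        if "title" not in record:
            raise ValueError(f"record does not satisfy profile {args['profile']}")
    return data


def merge(data: List[Record], args: Dict[str, Any]) -> List[Record]:
    """Combine new records with earlier ones, skipping exact duplicates."""
    new_records = [r for r in data if r not in PREVIOUSLY_HARVESTED]
    return PREVIOUSLY_HARVESTED + new_records


HANDLERS: Dict[str, Handler] = {
    "empty": empty,
    "fetch": fetch,
    "validate": validate,
    "merge": merge,
}


def run_pipeline(definition: List[Dict[str, Any]], handlers: Dict[str, Handler]) -> List[Record]:
    """Run each transform in order, feeding its output to the next one."""
    data: List[Record] = []
    for step in definition:
        step_type = step["type"]
        if step_type not in handlers:
            raise ValueError(f"unknown transform type: {step_type}")
        data = handlers[step_type](data, step.get("args", {}))
    return data


result = run_pipeline(json.loads(PIPELINE_JSON), HANDLERS)
```

Looking up the handler by the `type` field keeps the definition purely declarative, so the same runner can execute both pre-defined and custom pipelines.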
Transform types
Transforms can be chosen and put together to form a custom pipeline. As already noted, there are also pre-defined pipelines, such as the DCAT pipeline described above. Regardless of whether the pipeline is custom or pre-defined, its transforms are of different types. Four common types are fetch, convert, validate and merge. Each of these types in turn comes in various sub-types, depending on the use case. Going forward, there is a need to further refine how each transform is handled. This could be done by providing a profile via the configuration, which would refine the data processing job as a whole.
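As a rough illustration of assembling a custom pipeline from the four common types, the following Python sketch builds a definition in the same shape as the DCAT example. The `build_pipeline` helper and its parameters are hypothetical, and the `"from"` argument of the convert transform is an assumed key chosen for this sketch, since the arguments of convert are not described here.

```python
import json
from typing import Any, Dict, List, Optional


def build_pipeline(
    source: str,
    profile: Optional[str] = None,
    convert_from: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Assemble a client-side pipeline definition as a list of transforms."""
    steps: List[Dict[str, Any]] = [
        {"type": "empty"},
        {"type": "fetch", "args": {"source": source}},
    ]
    if convert_from is not None:
        # Only needed when the source is not already RDF.
        # The "from" key is an assumption, not a documented argument.
        steps.append({"type": "convert", "args": {"from": convert_from}})
    if profile is not None:
        steps.append({"type": "validate", "args": {"profile": profile}})
    steps.append({"type": "merge", "args": {}})
    return steps


# A non-RDF source needs a convert step before validation.
csv_pipeline = build_pipeline(
    "https://example.org/data.csv", profile="dcat_ap_se", convert_from="csv"
)

# An RDF source can skip conversion, as in the DCAT example.
rdf_pipeline = build_pipeline("https://example.org/catalog", profile="dcat_ap_se")

csv_pipeline_json = json.dumps(csv_pipeline, indent=2)
```

Keeping the optional steps behind parameters makes the ordering explicit: fetch comes first, conversion (if any) precedes validation, and merge comes last.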