DCAT Catalog Check

This project is a Python script designed to monitor and validate links in a DCAT catalog.

It helps maintain the integrity of distributions by verifying that links are reachable and that downloaded files match their declared format, so that broken links and invalid file types are detected early.

Features

  • Retrieves the DCAT catalog.
  • Checks if the URLs associated with the resources are alive or dead.
  • Checks each successfully downloaded file against the format specified in its metadata.
  • Validates the MIME type of the distributions if no specialized check is available.
  • Logs the results.

The following format checks are currently being carried out:

Format | Check
ATOM | Validates whether the file content is a valid ATOM feed by confirming the root element is <feed> in the Atom XML namespace.
DOCX | Verifies that the file is a valid DOCX by ensuring the ZIP archive contains the necessary XML files (document.xml and styles.xml).
GEOJSON | Loads and validates the file using GeoPandas.
GEOTIFF | Verifies that the file is a valid GeoTIFF by checking its GeoTransform information. Supports both standalone and ZIP-compressed GeoTIFF files.
GML | Loads and validates the file using GeoPandas.
JPEG | Loads and validates the image file.
JSON | Verifies that the file is syntactically correct JSON and, if it is a Frictionless Data Resource, checks it using Frictionless Tools.
ODS | Validates that the file is a valid ODS (OpenDocument Spreadsheet) by checking the ZIP structure, required files, and correct MIME type.
ODT | Validates that the file is a valid ODT (OpenDocument Text) by checking the ZIP structure, required files, and correct MIME type.
PARQUET | Verifies that the file is a readable Apache Parquet file by loading it using pandas.
PDF | Loads and validates the PDF document using pypdf.
PNG | Loads and validates the image file.
RDF | Verifies that the file is a valid RDF (Resource Description Framework) document and contains at least two statements.
SHP | Loads and validates the file using GeoPandas.
WFS | Validates whether the file is a well-formed WFS_Capabilities XML document. If not, a GetCapabilities request is made and validated.
WMS | Validates whether the file is a well-formed WMS_Capabilities XML document. If not, a GetCapabilities request is made and validated.
WMTS | Validates whether the file contains a valid WMTS (Web Map Tile Service) capabilities XML response, either directly or by performing a GetCapabilities request.
XLSX | Verifies that the file is a ZIP archive and contains the required files (xl/workbook.xml and xl/styles.xml) typical of a valid XLSX file.
XML | Verifies that the file is well-formed XML.
ZIP | Verifies that the file is a valid ZIP archive using Python's zipfile.is_zipfile() method.

Installation

Follow the steps below to set up the DCAT Catalog Check on your local machine.

Installation with Poetry

Using Poetry is recommended for dependency management and virtual environment handling.

  1. Install Dependencies

    Navigate to the project directory and install the project’s dependencies (including development dependencies) using Poetry:

    poetry install

    This command will create a virtual environment and install all necessary packages as specified in the pyproject.toml file.

  2. Activate the Virtual Environment

    Poetry automatically manages virtual environments. You can activate the virtual environment with:

    poetry shell

    To exit the virtual environment, simply run:

    exit

Usage

Parameters

The DCAT Catalog Check script accepts several command-line arguments to customize its behavior. Below is a detailed explanation of each parameter:

Parameter | Description | Type | Default
--url | The URL of the DCAT catalog to check. | String | Required
--log_file | Path to the log file for storing detailed output. | String | None
--results | File path to load results from previous runs. | String | None
--verbose | Enable verbose logging for more detailed output. | Flag | Off
--debug | Enable debug logging for troubleshooting purposes. | Flag | Off
--recheck | Use the previous results (specified by --results) as input for rechecking only. | Flag | Off
--no-recheck | Only check new entries from the catalog without rechecking existing results. | Flag | Off
--check-format | Specify a single format to check (e.g., JSON, JPEG). | String | None
--force-check-format | Force checking distributions with the specified format, regardless of previous results. | String | None
--check-http-5xx | Recheck entries that encountered HTTP 5xx errors in previous runs. | Flag | Off

Example Usage

Basic Run:

To check a DCAT catalog and save the results:

poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml > results.jsonl

The catalog (including possible subsequent pages) is completely downloaded and checked. The result is written to the file results.jsonl in JSON Lines text file format.
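
Because the output is in JSON Lines format, each non-empty line can be parsed as an independent JSON document. The following is a minimal sketch of reading such a file for further analysis. The function name parse_json_lines is hypothetical, and no assumption is made about which fields the result records contain, since they are not described here.

```python
import json


def parse_json_lines(text):
    """Parse JSON Lines text into a list of Python objects, skipping blank lines."""
    records = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as error:
            # Report which line is broken so a damaged file is easy to repair.
            raise ValueError(f"line {line_number} is not valid JSON: {error}") from error
    return records


# Usage with a results file written by the basic run above:
# with open("results.jsonl", encoding="utf-8") as handle:
#     records = parse_json_lines(handle.read())
```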

Recheck Previous Results:

To recheck only existing results from a previous run:

poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --results results.jsonl --recheck

Check New Entries Only:

To check only new entries without rechecking the existing ones:

poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --results results.jsonl --no-recheck > new.jsonl
mv new.jsonl results.jsonl

The results of a previous run are read from the file results.jsonl. The catalog is processed completely, but only new data records are checked. All results (new ones as well as old ones that were not checked again) are written to the file new.jsonl. Once the check is complete, the old results file is overwritten with the new one.

Debugging and Verbose Output:

To enable verbose and debug logging:

poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --verbose --debug

Format-Specific Checks:

To check only a specific format (e.g., JSON):

poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --check-format JSON

Force Format Check:

To force-check a specific format regardless of previous results:

poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --force-check-format JSON

Configuration

File Formats

The script reads the allowed file formats from the resources/file_types.json file. This file defines the MIME types that are considered valid for each format and should be placed in the same directory as the script.

Example file_types.json

{
  "HTML": [
    "text/html"
  ],
  "JPEG": [
    "image/jpeg"
  ],
  "JSON": [
    "application/json", "text/plain"
  ]
}
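
As a rough illustration of how such a mapping could be used, the sketch below checks a Content-Type value against the allowed MIME types for a format. The function name is_mime_type_allowed is hypothetical, and two behaviors are assumptions rather than documented features of the script: parameters after ";" (such as charset) are ignored, and the comparison is case-insensitive.

```python
import json

# Same content as the example file_types.json above.
FILE_TYPES_JSON = """
{
  "HTML": ["text/html"],
  "JPEG": ["image/jpeg"],
  "JSON": ["application/json", "text/plain"]
}
"""


def is_mime_type_allowed(file_format, content_type, file_types):
    """Return True if content_type is listed as valid for file_format.

    Hypothetical helper: strips parameters such as "; charset=utf-8" and
    compares case-insensitively (both are assumptions).
    """
    allowed = file_types.get(file_format.upper())
    if allowed is None:
        # Unknown format: nothing is considered valid.
        return False
    mime_type = content_type.split(";", 1)[0].strip().lower()
    return mime_type in {entry.lower() for entry in allowed}


file_types = json.loads(FILE_TYPES_JSON)
```

With this mapping, a JSON distribution served as text/plain would be accepted, while a JPEG distribution served as image/png would not.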

URI Replacements (optional)

The uri_replacements.json file is an optional configuration file that provides a way to preprocess and modify URLs before they are checked by the script. This can be useful for standardizing, correcting, or transforming URLs to match specific patterns or to comply with expected formats.

Example uri_replacements.json

The file is a JSON array, where each element is an object containing two keys:

  • regex: A regular expression (in Python regex syntax) that matches parts of the URL that need to be replaced.
  • replaced_by: A string specifying the replacement value for the matched parts of the URL.

Example:

[
  {
    "regex": "http://example.com/old-path",
    "replaced_by": "http://example.com/new-path"
  },
  {
    "regex": "https://(.*)/deprecated",
    "replaced_by": "https://\\1/updated"
  }
]

In this example:

  • URLs starting with http://example.com/old-path will be replaced with http://example.com/new-path.
  • Any URL containing /deprecated after the domain will have /deprecated replaced with /updated.
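
The effect of these two rules can be reproduced with a short sketch. It assumes that each rule is applied with Python's re.sub in the order given in the file; the script's actual implementation is not shown here, and the function name apply_uri_replacements is hypothetical.

```python
import json
import re

# Same content as the example uri_replacements.json above.
URI_REPLACEMENTS_JSON = r"""
[
  {"regex": "http://example.com/old-path", "replaced_by": "http://example.com/new-path"},
  {"regex": "https://(.*)/deprecated", "replaced_by": "https://\\1/updated"}
]
"""


def apply_uri_replacements(url, replacements):
    """Apply every regex rule to url, in file order (assumed behavior)."""
    for rule in replacements:
        url = re.sub(rule["regex"], rule["replaced_by"], url)
    return url


replacements = json.loads(URI_REPLACEMENTS_JSON)
```

Note that under this assumption an unescaped "." in a regex matches any character, so a stricter rule would escape it as "\\." in the JSON file.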

How to Use

  1. Create a file named uri_replacements.json in the script's directory.
  2. Define the desired replacements in the JSON array format described above.
  3. Run the script as usual. If the file exists, replacements will be applied automatically.

Docker

You can run the script in a Docker container. See the Dockerfile for more information.

Build and Run

  1. Build the Docker image:

    docker build -t dcat-catalog-check .

  2. Run the Docker container:

    docker run --rm dcat-catalog-check --url https://example.com

Tests

To ensure the quality of the code, we use unittest for testing and coverage to measure code coverage. Follow the instructions below to run the tests and generate coverage reports.

Running Tests

To run the tests with coverage, you can use either of the following commands:

# Using Python directly
python3 -m coverage run -m unittest

or

# Using Poetry
poetry run coverage run -m unittest

Generating a Coverage Report

After running the tests, you can generate a coverage report to see which parts of your code were exercised during testing:

# Using Python directly
python3 -m coverage report

or

# Using Poetry
poetry run coverage report

Code Linting

For code linting, we use ruff to enforce style and catch potential issues. Run the following command to lint your code:

# Using Python directly
python3 -m ruff check .

or

# Using Poetry
poetry run ruff check .

Contributing

Contributions are welcome! Please open an issue or submit a pull request with your changes.

License

This project is licensed under the European Union Public License 1.2. See the LICENSE file for details.