DCAT Catalog Check
This project is a Python script designed to monitor and validate links in a DCAT catalog.
The script is particularly useful for maintaining the integrity of distributions by ensuring that links are active and files are correctly formatted, thus helping to avoid issues related to broken links and invalid file types.
Features
- Retrieves the DCAT catalog.
- Checks if the URLs associated with the resources are alive or dead.
- If the file is downloaded successfully, its content is checked against the format specified in the metadata.
- Validates the MIME type of the distributions if no specialized check is available.
- Logs the results.
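As a rough illustration of the first step, the sketch below lists the access URL and declared format of each distribution in an RDF/XML catalog. It is a minimal sketch, not the script's actual implementation: the function name `list_distributions` and the sample catalog are hypothetical, and it assumes the catalog is serialized as RDF/XML with `dcat:accessURL` and `dct:format` on each `dcat:Distribution`.

```python
import xml.etree.ElementTree as ET

# Namespaces of the RDF, DCAT, and Dublin Core Terms vocabularies.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}
RDF_RESOURCE = f"{{{NS['rdf']}}}resource"

# Hypothetical one-distribution catalog, used only for illustration.
SAMPLE_CATALOG = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcat="http://www.w3.org/ns/dcat#"
         xmlns:dct="http://purl.org/dc/terms/">
  <dcat:Distribution rdf:about="https://example.com/dist/1">
    <dcat:accessURL rdf:resource="https://example.com/data.json"/>
    <dct:format>JSON</dct:format>
  </dcat:Distribution>
</rdf:RDF>
"""


def list_distributions(rdf_xml):
    """Return (access URL, format) pairs for every dcat:Distribution."""
    root = ET.fromstring(rdf_xml)
    pairs = []
    for dist in root.iter(f"{{{NS['dcat']}}}Distribution"):
        access = dist.find("dcat:accessURL", NS)
        fmt = dist.find("dct:format", NS)
        url = access.get(RDF_RESOURCE) if access is not None else None
        label = None
        if fmt is not None:
            # dct:format may be a literal or a reference to a format URI.
            label = fmt.get(RDF_RESOURCE) or (fmt.text or "").strip() or None
        pairs.append((url, label))
    return pairs


print(list_distributions(SAMPLE_CATALOG))
```

A full RDF library would also handle other serializations such as Turtle or JSON-LD; the standard-library parser is used here only to keep the example self-contained.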
The following format checks are currently being carried out:
Format | Check |
---|---|
ATOM | Validates whether the file content is a valid ATOM feed by confirming the root element is <feed> in the Atom XML namespace. |
DOCX | Verifies that the file is a valid DOCX by ensuring the ZIP archive contains the necessary XML files (document.xml and styles.xml). |
GEOJSON | Loads and validates the file using GeoPandas. |
GEOTIFF | Verifies the file is a valid GeoTIFF by checking its GeoTransform information. Supports both standalone and ZIP-compressed GeoTIFF files. |
GML | Loads and validates the file using GeoPandas. |
JPEG | Loads and validates the image file. |
JSON | Verifies that the file is syntactically correct JSON and, if it is a Frictionless Data Resource, checks it using Frictionless Tools. |
ODS | Validates that the file is a valid ODS (OpenDocument Spreadsheet) by checking the ZIP structure, required files, and correct MIME type. |
ODT | Validates that the file is a valid ODT (OpenDocument Text) by confirming the ZIP structure, required files, and correct MIME type. |
PARQUET | Verifies that the file is a readable Apache Parquet file by loading it using pandas. |
PDF | Loads and validates the PDF document using pypdf. |
PNG | Loads and validates the image file. |
RDF | Verifies the file is a valid RDF (Resource Description Framework) document and contains at least two statements. |
SHP | Loads and validates the file using GeoPandas. |
WFS | Validates whether the file is a well-formed WFS_Capabilities XML document. If not, a GetCapabilities request is made and validated. |
WMS | Validates whether the file is a well-formed WMS_Capabilities XML document. If not, a GetCapabilities request is made and validated. |
WMTS | Validates whether the file contains a valid WMTS (Web Map Tile Service) capabilities XML response, either directly or by performing a GetCapabilities request. |
XLSX | Verifies that the file is a ZIP archive and contains the required files (xl/workbook.xml and xl/styles.xml) typical of a valid XLSX file. |
XML | Verifies whether the file is well-formed XML. |
ZIP | Verifies whether the file is a valid ZIP archive using Python's zipfile.is_zipfile() method. |
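To give a feel for what such checks can look like, here is a minimal sketch of three of them (JSON, ATOM, and XLSX) using only the standard library. The function names are hypothetical and the project's own checks may be implemented differently; the sketch only follows the descriptions in the table above.

```python
import io
import json
import xml.etree.ElementTree as ET
import zipfile

ATOM_NS = "http://www.w3.org/2005/Atom"


def is_valid_json(data: bytes) -> bool:
    """True if the bytes parse as syntactically correct JSON."""
    try:
        json.loads(data)
        return True
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return False


def is_atom_feed(data: bytes) -> bool:
    """True if the root element is <feed> in the Atom namespace."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return False
    return root.tag == f"{{{ATOM_NS}}}feed"


def is_xlsx(data: bytes) -> bool:
    """True if the bytes are a ZIP archive holding the XLSX core files."""
    buffer = io.BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    try:
        with zipfile.ZipFile(buffer) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return False
    return {"xl/workbook.xml", "xl/styles.xml"} <= names


print(is_valid_json(b'{"ok": true}'))
print(is_atom_feed(b'<feed xmlns="http://www.w3.org/2005/Atom"/>'))
```

Each function takes the downloaded bytes and returns a boolean, so a dispatcher could map the format name from the metadata to the matching check.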
Installation
Follow the steps below to set up the DCAT Catalog Check on your local machine.
Installation with Poetry
Using Poetry is recommended for dependency management and virtual environment handling.
- Install Dependencies
  Navigate to the project directory and install the project’s dependencies (including development dependencies) using Poetry:
  poetry install
  This command will create a virtual environment and install all necessary packages as specified in the pyproject.toml file.
- Activating the Virtual Environment
  Poetry automatically manages virtual environments. You can activate the virtual environment with:
  poetry shell
  To exit the virtual environment, simply run:
  exit
Usage
Parameters
The DCAT Catalog Check script accepts several command-line arguments to customize its behavior. Below is a detailed explanation of each parameter:
Parameter | Description | Type | Default |
---|---|---|---|
--url | The URL of the DCAT catalog to check. | String | Required |
--log_file | Path to the log file for storing detailed output. | String | None |
--results | File path to load results from previous runs. | String | None |
--verbose | Enable verbose logging for more detailed output. | Flag | Off |
--debug | Enable debug logging for troubleshooting purposes. | Flag | Off |
--recheck | Use the previous results (specified by --results) as input for rechecking only. | Flag | Off |
--no-recheck | Only check new entries from the catalog without rechecking existing results. | Flag | Off |
--check-format | Specify a single format to check (e.g., JSON, JPEG). | String | None |
--force-check-format | Force checking distributions with the specified format, regardless of previous results. | String | None |
--check-http-5xx | Recheck entries that encountered HTTP 5xx errors in previous runs. | Flag | Off |
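For readers who want to see how the table translates into parsed values, the following is a hypothetical `argparse` declaration that mirrors the names, types, and defaults listed above. It is an assumption about how such a command line could be declared, not a copy of the script's own argument handling.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented parameters."""
    parser = argparse.ArgumentParser(description="Check a DCAT catalog.")
    parser.add_argument("--url", required=True, help="URL of the DCAT catalog")
    parser.add_argument("--log_file", default=None, help="path to the log file")
    parser.add_argument("--results", default=None, help="results of a previous run")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--debug", action="store_true")
    parser.add_argument("--recheck", action="store_true")
    parser.add_argument("--no-recheck", action="store_true")
    parser.add_argument("--check-format", default=None)
    parser.add_argument("--force-check-format", default=None)
    parser.add_argument("--check-http-5xx", action="store_true")
    return parser


args = build_parser().parse_args(
    ["--url", "https://example.com/catalog.xml", "--check-format", "JSON"]
)
# Hyphens in option names become underscores in attribute names.
print(args.url, args.check_format, args.no_recheck)
```

Flags default to off and string options to None, matching the Default column.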
Example Usage
Basic Run:
To check a DCAT catalog and save the results:
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml > results.jsonl
The catalog (including any subsequent pages) is downloaded completely and checked. The results are written to the file results.jsonl in JSON Lines format.
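Because JSON Lines stores one JSON document per line, the results can be post-processed with a few lines of Python. The sketch below is illustrative only: the field names `url` and `accessible` are hypothetical placeholders, so inspect an actual results file for the real keys.

```python
import json


def read_results(lines):
    """Parse JSON Lines input: one JSON document per non-empty line."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


# Hypothetical records; the real field names may differ.
sample = [
    '{"url": "https://example.com/a.json", "accessible": true}\n',
    '{"url": "https://example.com/b.pdf", "accessible": false}\n',
]
records = read_results(sample)
dead = [r["url"] for r in records if not r["accessible"]]
print(dead)
```

Since `read_results` accepts any iterable of lines, an open file object (for example from `open("results.jsonl", encoding="utf-8")`) can be passed in directly.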
Recheck Previous Results:
To recheck only existing results from a previous run:
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --results results.jsonl --recheck
Check New Entries Only:
To check only new entries without rechecking the existing ones:
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --results results.jsonl --no-recheck > new.jsonl
mv new.jsonl results.jsonl
The results of a previous run are read from the file results.jsonl. The catalog is processed completely, but only new records are checked. All results (new ones as well as old ones that were not checked again) are written to the file new.jsonl. Once the check is complete, the old results file is overwritten with the new one.
Debugging and Verbose Output:
To enable verbose and debug logging:
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --verbose --debug
Format-Specific Checks:
To check only a specific format (e.g., JSON):
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --check-format JSON
Force Format Check:
To force-check a specific format regardless of previous results:
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --force-check-format JSON
Configuration
File Formats
The script reads the allowed file formats from the resources/file_types.json file. This file defines the MIME types that are considered valid for each format and should be placed in the same directory as the script.
file_types.json Example
{
"HTML": [
"text/html"
],
"JPEG": [
"image/jpeg"
],
"JSON": [
"application/json", "text/plain"
]
}
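The mapping above can be used to decide whether a reported MIME type is acceptable for a declared format. The sketch below is a minimal illustration under two assumptions: the MIME type comes from an HTTP `Content-Type` header value, and the helper name `mime_type_matches` is hypothetical. How the script itself determines the MIME type is not specified here.

```python
import json

# Same content as the example file above, inlined to keep the sketch runnable.
FILE_TYPES = json.loads("""
{
  "HTML": ["text/html"],
  "JPEG": ["image/jpeg"],
  "JSON": ["application/json", "text/plain"]
}
""")


def mime_type_matches(file_types, fmt, content_type):
    """True if content_type is listed as valid for the given format."""
    allowed = file_types.get(fmt.upper(), [])
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in allowed


print(mime_type_matches(FILE_TYPES, "JSON", "application/json; charset=utf-8"))
print(mime_type_matches(FILE_TYPES, "JPEG", "text/html"))
```

Formats that are missing from the mapping are treated as non-matching in this sketch; a real implementation might prefer to skip the check or log a warning instead.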
URI Replacements (optional)
The uri_replacements.json file is an optional configuration file that provides a way to preprocess and modify URLs before they are checked by the script. This can be useful for standardizing, correcting, or transforming URLs to match specific patterns or to comply with expected formats.
uri_replacements.json Example
The file is a JSON array, where each element is an object containing two keys:
- regex: A regular expression (in Python regex syntax) that matches parts of the URL that need to be replaced.
- replaced_by: A string specifying the replacement value for the matched parts of the URL.
Example:
[
{
"regex": "http://example.com/old-path",
"replaced_by": "http://example.com/new-path"
},
{
"regex": "https://(.*)/deprecated",
"replaced_by": "https://\\1/updated"
}
]
In this example:
- URLs starting with http://example.com/old-path will be replaced with http://example.com/new-path.
- Any URL containing /deprecated after the domain will have /deprecated replaced with /updated.
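The effect of these two rules can be reproduced with Python's `re.sub`. This is a minimal sketch that assumes the rules are applied in order with `re.sub`; the helper name `apply_replacements` is hypothetical and the script's own handling may differ.

```python
import json
import re

# The example rules from above. In JSON, "\\1" decodes to the regex
# backreference \1, which refers to the first capture group.
REPLACEMENTS = json.loads(r"""
[
  {"regex": "http://example.com/old-path",
   "replaced_by": "http://example.com/new-path"},
  {"regex": "https://(.*)/deprecated",
   "replaced_by": "https://\\1/updated"}
]
""")


def apply_replacements(url, replacements):
    """Apply every rule in order and return the rewritten URL."""
    for rule in replacements:
        url = re.sub(rule["regex"], rule["replaced_by"], url)
    return url


print(apply_replacements("http://example.com/old-path/data.csv", REPLACEMENTS))
print(apply_replacements("https://example.org/deprecated", REPLACEMENTS))
```

Note that an unescaped dot in a pattern matches any character, so patterns that must match a literal dot should use `\\.` in the JSON file.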
How to Use
- Create a file named uri_replacements.json in the script's directory.
- Define the desired replacements in the JSON array format described above.
- Run the script as usual. If the file exists, replacements will be applied automatically.
Docker
You can run the script in a Docker container. See the Dockerfile for more information.
Build and Run
- Build the Docker image:
  docker build -t dcat-catalog-check .
- Run the Docker container:
  docker run --rm dcat-catalog-check --url https://example.com
Tests
To ensure code quality, we use unittest for testing and coverage to measure code coverage. Follow the instructions below to run the tests and generate coverage reports.
Running Tests
To run the tests with coverage, you can use either of the following commands:
# Using Python directly
python3 -m coverage run -m unittest
or
# Using Poetry
poetry run coverage run -m unittest
Generating a Coverage Report
After running the tests, you can generate a coverage report to see which parts of your code were exercised during testing:
# Using Python directly
python3 -m coverage report
or
# Using Poetry
poetry run coverage report
Code Linting
For code linting, we use ruff to enforce style and catch potential issues. Run the following command to lint your code:
# Using Python directly
python3 -m ruff check .
or
# Using Poetry
poetry run ruff check .
Contributing
Contributions are welcome! Please open an issue or submit a pull request with your changes.
License
This project is licensed under the European Union Public License 1.2.
See the LICENSE file for details.