
# DCAT Catalog Check

![pipeline status](https://code.schleswig-holstein.de/opendata/dcat-catalog-check/badges/main/pipeline.svg)
![Coverage](https://code.schleswig-holstein.de/opendata/dcat-catalog-check/badges/main/coverage.svg?job=test)

This project is a Python script designed to monitor and validate links in a DCAT catalog.

The script helps maintain the integrity of distributions by verifying that links are reachable and that files match their declared format, so that broken links and invalid file types are detected early.

## Table of Contents

- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Docker](#docker)
- [Tests](#tests)
- [Contributing](#contributing)
- [License](#license)

## Features

- Retrieves the DCAT catalog.
- Checks whether the URLs associated with the resources are alive or dead.
- Checks each successfully downloaded file against the **format specified in the metadata**.
- Validates the MIME type of a distribution if no specialized check is available for its format.
- Logs the results.

The following format checks are currently implemented:

| Format    | Check |
| --------- | ------- |
| `GEOJSON` | Load the file using [`GeoPandas`](https://geopandas.org). |
| `GML`     | Load the file using [`GeoPandas`](https://geopandas.org). |
| `JPEG`    | Load the image. |
| `JSON`    | Is it syntactically correct JSON? If it is a *Frictionless Data Resource*, it is checked with the Frictionless Tools. |
| `PNG`     | Load the image. |
| `PDF`     | Load the document using [`pypdf`](https://pypi.org/project/pypdf/). |
| `SHP`     | Load the file using [`GeoPandas`](https://geopandas.org). |
| `WFS`     | Is it a valid, well-formed `WFS_Capabilities` XML document? If the address does not contain the `request=GetCapabilities` parameter, a `GetCapabilities` request is performed and its response is checked. |
| `WMS`     | Is it a valid, well-formed `WMS_Capabilities` XML document? If the address does not contain the `request=GetCapabilities` parameter, a `GetCapabilities` request is performed and its response is checked. |
| `XML`     | Is it well-formed XML? |

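The `WFS` and `WMS` rows describe two steps: making sure the address carries `request=GetCapabilities`, and checking the root element of the response. The following is a minimal sketch of that idea using only the Python standard library. It is not taken from the script itself; the function names `ensure_get_capabilities` and `is_capabilities_document` are hypothetical, and the case-insensitive parameter comparison is an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def ensure_get_capabilities(url: str) -> str:
    """Return url with request=GetCapabilities appended if it is missing."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    # Assumption: parameter names and values are compared case-insensitively.
    has_request = any(
        key.lower() == "request" and value.lower() == "getcapabilities"
        for key, value in query
    )
    if not has_request:
        query.append(("request", "GetCapabilities"))
    return urlunsplit(parts._replace(query=urlencode(query)))


def is_capabilities_document(xml_text: str, expected_root: str) -> bool:
    """Check that xml_text is well-formed XML with the expected root element."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not well-formed
    # Drop a namespace prefix such as "{http://www.opengis.net/wms}".
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name == expected_root
```

For example, `ensure_get_capabilities("https://example.com/wms?service=WMS")` adds the missing parameter, and `is_capabilities_document(response_text, "WMS_Capabilities")` then checks the downloaded response.
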
## Installation

Follow the steps below to set up the **DCAT Catalog Check** on your local machine.

### Installation with Poetry

Using **Poetry** is recommended for dependency management and virtual environment handling.

1. **Install Dependencies**

   Navigate to the project directory and install the project’s dependencies (including development dependencies) using Poetry:

   ```sh
   poetry install
   ```

   This command will create a virtual environment and install all necessary packages as specified in the [`pyproject.toml`](./pyproject.toml) file.

2. **Activate the Virtual Environment**

   Poetry automatically manages virtual environments. You can activate the virtual environment with:

   ```sh
   poetry shell
   ```

   To exit the virtual environment, simply run:

   ```sh
   exit
   ```

## Usage

### Parameters

The **DCAT Catalog Check** script accepts several command-line arguments to customize its behavior. Below is a detailed explanation of each parameter:

| Parameter | Description | Type | Default |
| --------- | ----------- | ---- | ------- |
| `--url` | The URL of the DCAT catalog to check. | String | Required |
| `--log_file` | Path to the log file for storing detailed output. | String | None |
| `--results` | File path to load results from previous runs. | String | None |
| `--verbose` | Enable verbose logging for more detailed output. | Flag | Off |
| `--debug` | Enable debug logging for troubleshooting purposes. | Flag | Off |
| `--recheck` | Use the previous results (specified by `--results`) as input for rechecking only. | Flag | Off |
| `--no-recheck` | Only check new entries from the catalog without rechecking existing results. | Flag | Off |
| `--check-format` | Specify a single format to check (e.g., `JSON`, `JPEG`). | String | None |
| `--force-check-format`| Force checking distributions with the specified format, regardless of previous results. | String | None |
| `--check-http-5xx` | Recheck entries that encountered HTTP 5xx errors in previous runs. | Flag | Off |

### Example Usage

**Basic Run:**

To check a DCAT catalog and save the results:

```sh
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml > results.jsonl
```

The entire catalog, including any subsequent pages, is downloaded and checked. The results are written to the file `results.jsonl` in *JSON Lines* text format, that is, one JSON document per line.
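
Because the output is JSON Lines, it can be read back line by line with standard tools. The following sketch shows one way to do this in Python. The helper name `load_results` is hypothetical, and no assumption is made about which fields each record contains.

```python
import json
from pathlib import Path


def load_results(path: str) -> list:
    """Parse a JSON Lines file: one JSON document per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as error:
                # Report the offending line instead of failing silently.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error
    return records
```

Calling `load_results("results.jsonl")` returns a list with one parsed entry per line of the results file.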

**Recheck Previous Results:**

To recheck only existing results from a previous run:

```sh
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --results results.jsonl --recheck
```

**Check New Entries Only:**

To check only new entries without rechecking the existing ones:

```sh
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --results results.jsonl --no-recheck > new.jsonl
mv new.jsonl results.jsonl
```

The results of a previous run are read from the file `results.jsonl`. The catalog is processed completely, but only new data records are checked. All results (new ones as well as old ones that were not rechecked) are written to the file `new.jsonl`. Once the check is complete, the old results file is overwritten with the new one.

**Debugging and Verbose Output:**

To enable verbose and debug logging:

```sh
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --verbose --debug
```

**Format-Specific Checks:**

To check only a specific format (e.g., `JSON`):

```sh
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --check-format JSON
```

**Force Format Check:**

To force-check a specific format regardless of previous results:

```sh
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --force-check-format JSON
```

## Configuration

### File Formats

The script reads the allowed file formats from the [`resources/file_types.json`](./resources/file_types.json)
file. This file defines the MIME types that are considered valid for each
format and should be placed in the same directory as the script.

#### Example `file_types.json`

```json
{
  "HTML": [
    "text/html"
  ],
  "JPEG": [
    "image/jpeg"
  ],
  "JSON": [
    "application/json", "text/plain"
  ]
}
```
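
To illustrate how such a mapping could be used, here is a small sketch that checks a `Content-Type` value against the example above. It is an illustration only, not the script's actual logic. The function name `mime_type_allowed` is hypothetical, and both the stripping of parameters such as `charset` and the rejection of unknown formats are assumptions.

```python
import json


def mime_type_allowed(file_format: str, content_type: str, file_types: dict) -> bool:
    """Return True if content_type is listed for file_format in file_types."""
    # Assumption: parameters such as "; charset=utf-8" are ignored.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    # Assumption: a format without an entry is treated as not allowed.
    return mime_type in file_types.get(file_format.upper(), [])


# Same content as the example file_types.json above.
file_types = json.loads("""
{
  "HTML": ["text/html"],
  "JPEG": ["image/jpeg"],
  "JSON": ["application/json", "text/plain"]
}
""")

print(mime_type_allowed("JSON", "application/json; charset=utf-8", file_types))  # True
print(mime_type_allowed("JPEG", "image/png", file_types))  # False
```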

### URI Replacements (optional)

The `uri_replacements.json` file is an optional configuration file that provides a way to preprocess and modify URLs before they are checked by the script. This can be useful for standardizing, correcting, or transforming URLs to match specific patterns or to comply with expected formats.

#### Example `uri_replacements.json`

The file is a JSON array, where each element is an object containing two keys:

- `regex`: A regular expression (in Python regex syntax) that matches parts of the URL that need to be replaced.
- `replaced_by`: A string specifying the replacement value for the matched parts of the URL.

Example:

```json
[
  {
    "regex": "http://example.com/old-path",
    "replaced_by": "http://example.com/new-path"
  },
  {
    "regex": "https://(.*)/deprecated",
    "replaced_by": "https://\\1/updated"
  }
]
```

In this example:

- URLs starting with `http://example.com/old-path` will be replaced with `http://example.com/new-path`.
- Any URL containing `/deprecated` after the domain will have `/deprecated` replaced with `/updated`.

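The effect of the two example rules can be reproduced with Python's `re.sub`. This is a sketch under the assumption that the rules are applied one after another, in file order, to the whole URL; the helper name `apply_replacements` is hypothetical and the script may apply the rules differently.

```python
import json
import re


def apply_replacements(url: str, replacements: list) -> str:
    """Apply each regex/replaced_by pair to url, in the order given."""
    for rule in replacements:
        url = re.sub(rule["regex"], rule["replaced_by"], url)
    return url


# Same rules as in the example uri_replacements.json above.
replacements = json.loads(r"""
[
  {"regex": "http://example.com/old-path", "replaced_by": "http://example.com/new-path"},
  {"regex": "https://(.*)/deprecated", "replaced_by": "https://\\1/updated"}
]
""")

print(apply_replacements("http://example.com/old-path/data.csv", replacements))
# http://example.com/new-path/data.csv
print(apply_replacements("https://data.example.org/deprecated/file.json", replacements))
# https://data.example.org/updated/file.json
```
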
#### How to Use

1. Create a file named `uri_replacements.json` in the script's directory.
2. Define the desired replacements in the JSON array format described above.
3. Run the script as usual. If the file exists, replacements will be applied automatically.

By using `uri_replacements.json`, you can streamline URL handling and ensure consistent preprocessing for your link-checking tasks.

## Docker

You can run the script in a Docker container. See the [Dockerfile](./Dockerfile) for more information.

### Build and Run

1. Build the Docker image:

    ```sh
    docker build -t dcat-catalog-check .
    ```

2. Run the Docker container:

    ```sh
    docker run --rm dcat-catalog-check --url https://example.com
    ```

## Tests

To ensure the quality of the code, we utilize **unittest** for testing and **coverage** to measure code coverage. Follow the instructions below to run the tests and generate coverage reports.

### Running Tests

To run the tests with coverage, you can use either of the following commands:

```sh
# Using Python directly
python3 -m coverage run -m unittest
```

or

```sh
# Using Poetry
poetry run coverage run -m unittest
```

### Generating a Coverage Report

After running the tests, you can generate a coverage report to see which parts of your code were exercised during testing:

```sh
# Using Python directly
python3 -m coverage report
```

or

```sh
# Using Poetry
poetry run coverage report
```

### Code Linting

For code linting, we use **ruff** to enforce style and catch potential issues. Run the following command to lint your code:

```sh
# Using Python directly
python3 -m ruff check .
```

or

```sh
# Using Poetry
poetry run ruff check .
```

## Contributing

Contributions are welcome! Please open an issue or submit a pull request
with your changes.

## License

This project is licensed under the European Union Public License 1.2.
See the [`LICENSE`](./LICENSE) file for details.