# DCAT Catalog Check


This project is a Python script designed to monitor and validate links in a DCAT catalog.
The script is particularly useful for maintaining the integrity of distributions by ensuring that links are active and files are correctly formatted, thus helping to avoid issues related to broken links and invalid file types.
## Table of Contents
- [Features](#features)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Docker](#docker)
- [Tests](#tests)
- [Contributing](#contributing)
- [License](#license)
## Features
- Retrieves the DCAT catalog.
- Checks if the URLs associated with the resources are alive or dead.
- After a successful download, validates the file against the **format specified in the metadata**.
- Validates the MIME type of the distributions if no specialized check is available.
- Logs the results.
The following format checks are currently being carried out:
| Format | Check |
| --------- | ------- |
| `GEOJSON` | Load the file using [`GeoPandas`](https://geopandas.org). |
| `GML` | Load the file using [`GeoPandas`](https://geopandas.org). |
| `JPEG` | Load the image. |
| `JSON` | Is it syntactically correct JSON? If it is a *Frictionless Data Resource*, it is checked with the Frictionless Tools. |
| `PNG` | Load the image. |
| `PDF` | Load the document using [`pypdf`](https://pypi.org/project/pypdf/). |
| `SHP` | Load the file using [`GeoPandas`](https://geopandas.org). |
| `WFS` | Is it a valid, well-formed `WFS_Capabilities` XML document? If the address does not contain the `request=GetCapabilities` parameter, a `GetCapabilities` request is performed. This response is then checked. |
| `WMS` | Is it a valid, well-formed `WMS_Capabilities` XML document? If the address does not contain the `request=GetCapabilities` parameter, a `GetCapabilities` request is performed. This response is then checked. |
| `XML` | Is it well-formed XML? |
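
The `WFS` and `WMS` rows describe two steps: make sure the address is a `GetCapabilities` request, then inspect the XML that comes back. The following is a minimal sketch of those two steps, not the script's actual implementation. The function names `ensure_get_capabilities` and `root_local_name` are hypothetical, and the choice to also set a `service` parameter is an assumption.

```python
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
import xml.etree.ElementTree as ET


def ensure_get_capabilities(url: str, service: str) -> str:
    """Return `url` unchanged if it already requests GetCapabilities,
    otherwise return a copy with the request parameter added.

    Hypothetical helper: adding `service=...` is an assumption.
    """
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    # OGC parameter names are case-insensitive, so compare in lower case.
    for key, value in query:
        if key.lower() == "request" and value.lower() == "getcapabilities":
            return url
    # Drop any other request/service parameters before adding our own.
    query = [(k, v) for k, v in query if k.lower() not in ("request", "service")]
    query += [("service", service), ("request", "GetCapabilities")]
    return urlunsplit(parts._replace(query=urlencode(query)))


def root_local_name(xml_text: str) -> Optional[str]:
    """Return the root element name without its namespace,
    or None if the text is not well-formed XML."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    return root.tag.rsplit("}", 1)[-1]


if __name__ == "__main__":
    print(ensure_get_capabilities("https://example.com/ows?map=x", "WMS"))
    print(root_local_name('<WMS_Capabilities xmlns="http://www.opengis.net/wms"/>'))
```

A caller could compare the returned root name with `WMS_Capabilities` or `WFS_Capabilities`. Older service versions may use different root element names, so a real check likely needs to accept more than one.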
## Installation
Follow the steps below to set up the **DCAT Catalog Check** on your local machine.
### Installation with Poetry
Using **Poetry** is recommended for dependency management and virtual environment handling.
1. **Install Dependencies**
Navigate to the project directory and install the project’s dependencies (including development dependencies) using Poetry:
```sh
poetry install
```
This command will create a virtual environment and install all necessary packages as specified in the [`pyproject.toml`](./pyproject.toml) file.
2. **Activating the Virtual Environment**
Poetry automatically manages virtual environments. You can activate the virtual environment with:
```sh
poetry shell
```
To exit the virtual environment, simply run:
```sh
exit
```
## Usage
### Parameters
The **DCAT Catalog Check** script accepts several command-line arguments to customize its behavior. Below is a detailed explanation of each parameter:
| Parameter | Description | Type | Default |
| --------- | ----------- | ---- | ------- |
| `--url` | The URL of the DCAT catalog to check. | String | Required |
| `--log_file` | Path to the log file for storing detailed output. | String | None |
| `--results` | File path to load results from previous runs. | String | None |
| `--verbose` | Enable verbose logging for more detailed output. | Flag | Off |
| `--debug` | Enable debug logging for troubleshooting purposes. | Flag | Off |
| `--recheck` | Use the previous results (specified by `--results`) as input for rechecking only. | Flag | Off |
| `--no-recheck` | Only check new entries from the catalog without rechecking existing results. | Flag | Off |
| `--check-format` | Specify a single format to check (e.g., `JSON`, `JPEG`). | String | None |
| `--force-check-format`| Force checking distributions with the specified format, regardless of previous results. | String | None |
| `--check-http-5xx` | Recheck entries that encountered HTTP 5xx errors in previous runs. | Flag | Off |
### Example Usage
**Basic Run:**
To check a DCAT catalog and save the results:
```sh
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml > results.jsonl
```
The catalog (including possible subsequent pages) is completely downloaded and checked. The result is written to the file `results.jsonl` in *JSON Lines text file format*.
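
Because the output is JSON Lines, every non-empty line is an independent JSON document, so the results can be post-processed with a few lines of Python. The sketch below makes no assumptions about the fields in each record; `parse_jsonl` is a hypothetical helper name and is not part of the project.

```python
import json
from typing import Any, Iterable, List


def parse_jsonl(lines: Iterable[str]) -> List[Any]:
    """Parse an iterable of JSON Lines text into a list of Python objects.

    Blank lines are skipped; a malformed line raises json.JSONDecodeError.
    """
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # In-memory sample; the keys here are placeholders, not the real schema.
    sample = ['{"id": 1}\n', "\n", '{"id": 2}\n']
    print(len(parse_jsonl(sample)), "records")
```

An open file object is also an iterable of lines, so `parse_jsonl(open("results.jsonl", encoding="utf-8"))` would read the file produced above.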
**Recheck Previous Results:**
To recheck only existing results from a previous run:
```sh
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --results results.jsonl --recheck
```
**Check New Entries Only:**
To check only new entries without rechecking the existing ones:
```sh
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --results results.jsonl --no-recheck > new.jsonl
mv new.jsonl results.jsonl
```
The results of a previous run are read from the file `results.jsonl`. The catalog is processed completely, but only new data records are checked. All results (the new ones as well as the old ones that were not checked again) are written to the file `new.jsonl`. Once the check is complete, the old results file is overwritten with the new one.
**Debugging and Verbose Output:**
To enable verbose and debug logging:
```sh
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --verbose --debug
```
**Format-Specific Checks:**
To check only a specific format (e.g., `JSON`):
```sh
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --check-format JSON
```
**Force Format Check:**
To force-check a specific format regardless of previous results:
```sh
poetry run python dcat_catalog_check.py --url https://example.com/catalog.xml --force-check-format JSON
```
## Configuration
### File Formats
The script reads the allowed file formats from the [`resources/file_types.json`](./resources/file_types.json)
file. This file defines the MIME types that are considered valid for each
format and should be placed in the same directory as the script.
#### Example `file_types.json`
```json
{
"HTML": [
"text/html"
],
"JPEG": [
"image/jpeg"
],
"JSON": [
"application/json", "text/plain"
]
}
```
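
To illustrate how such a mapping can be used, here is a minimal sketch of a MIME type check against the example above. It is not the script's actual code: the function name `mime_type_allowed` is hypothetical, and stripping parameters such as `charset` from the `Content-Type` value is an assumption.

```python
import json
from typing import Dict, List, Optional

# The example mapping from above, embedded as a string for a self-contained demo.
FILE_TYPES: Dict[str, List[str]] = json.loads("""
{
  "HTML": ["text/html"],
  "JPEG": ["image/jpeg"],
  "JSON": ["application/json", "text/plain"]
}
""")


def mime_type_allowed(
    file_types: Dict[str, List[str]], fmt: str, content_type: str
) -> Optional[bool]:
    """Return True/False if `fmt` is configured, or None if it is unknown."""
    allowed = file_types.get(fmt.upper())
    if allowed is None:
        return None
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in allowed


if __name__ == "__main__":
    print(mime_type_allowed(FILE_TYPES, "JSON", "application/json; charset=utf-8"))
    print(mime_type_allowed(FILE_TYPES, "JPEG", "text/html"))
```

Returning `None` for formats missing from the file keeps "not configured" distinct from "configured but mismatched"; how the script treats unknown formats is not stated here.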
### URI Replacements (optional)
The `uri_replacements.json` file is an optional configuration file that provides a way to preprocess and modify URLs before they are checked by the script. This can be useful for standardizing, correcting, or transforming URLs to match specific patterns or to comply with expected formats.
#### Example `uri_replacements.json`
The file is a JSON array, where each element is an object containing two keys:
- `regex`: A regular expression (in Python regex syntax) that matches parts of the URL that need to be replaced.
- `replaced_by`: A string specifying the replacement value for the matched parts of the URL.
Example:
```json
[
{
"regex": "http://example.com/old-path",
"replaced_by": "http://example.com/new-path"
},
{
"regex": "https://(.*)/deprecated",
"replaced_by": "https://\\1/updated"
}
]
```
In this example:
- URLs starting with `http://example.com/old-path` will be replaced with `http://example.com/new-path`.
- Any URL containing `/deprecated` after the domain will have `/deprecated` replaced with `/updated`.
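
The effect of these two rules can be reproduced with Python's `re.sub`. The sketch below assumes the rules are applied in order, each to the result of the previous one; the function name `apply_replacements` is hypothetical and the script's own logic may differ.

```python
import json
import re
from typing import Dict, List

# The example rules from above. The raw string keeps the JSON escape "\\1"
# intact, which json.loads decodes to the regex backreference \1.
RULES: List[Dict[str, str]] = json.loads(r"""
[
  {"regex": "http://example.com/old-path",
   "replaced_by": "http://example.com/new-path"},
  {"regex": "https://(.*)/deprecated",
   "replaced_by": "https://\\1/updated"}
]
""")


def apply_replacements(url: str, rules: List[Dict[str, str]]) -> str:
    """Apply each regex rule in order and return the rewritten URL."""
    for rule in rules:
        url = re.sub(rule["regex"], rule["replaced_by"], url)
    return url


if __name__ == "__main__":
    print(apply_replacements("http://example.com/old-path/data.csv", RULES))
    print(apply_replacements("https://host.example/deprecated", RULES))
```

Note that with `re.sub` an unanchored pattern matches anywhere in the URL. Prefix the pattern with `^` if a rule should apply only at the start.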
#### How to Use
1. Create a file named `uri_replacements.json` in the script's directory.
2. Define the desired replacements in the JSON array format described above.
3. Run the script as usual. If the file exists, replacements will be applied automatically.
By using `uri_replacements.json`, you can streamline URL handling and ensure consistent preprocessing for your link-checking tasks.
## Docker
You can run the script in a Docker container. See the [Dockerfile](./Dockerfile) for more information.
### Build and Run
1. Build the Docker image:
```sh
docker build -t dcat-catalog-check .
```
2. Run the Docker container:
```sh
docker run --rm dcat-catalog-check --url https://example.com
```
## Tests
To ensure the quality of the code, we utilize **unittest** for testing and **coverage** to measure code coverage. Follow the instructions below to run the tests and generate coverage reports.
### Running Tests
To run the tests with coverage, you can use either of the following commands:
```sh
# Using Python directly
python3 -m coverage run -m unittest
```
or
```sh
# Using Poetry
poetry run coverage run -m unittest
```
### Generating a Coverage Report
After running the tests, you can generate a coverage report to see which parts of your code were exercised during testing:
```sh
# Using Python directly
python3 -m coverage report
```
or
```sh
# Using Poetry
poetry run coverage report
```
### Code Linting
For code linting, we use **ruff** to enforce style and catch potential issues. Run the following command to lint your code:
```sh
# Using Python directly
python3 -m ruff check .
```
or
```sh
# Using Poetry
poetry run ruff check .
```
## Contributing
Contributions are welcome! Please open an issue or submit a pull request
with your changes.
## License
This project is licensed under the European Union Public License 1.2.
See the [`LICENSE`](./LICENSE) file for details.