Published October 5, 2023 | Version v2
DSSGx 2023 - NRW Bebauungspläne (document_texts)
Description
This dataset contains the output of parsing the PDF documents with Apache Tika. For each document, it records the land parcel the document refers to and the extracted text content.
Data Fields
- filename: Name of the parsed PDF file.
- document_id: Unique ID of the document. It combines the land parcel ID with the number of the document within that land parcel.
- content: Extracted text content.
- land_parcel_id: Unique ID of the land parcel for the document.
- land_parcel_name: Name of the land parcel for the document.
- land_parcel_scanurl: URL for the parsed content.
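As a minimal sketch of how the fields above could be represented and checked when working with the file, the following Python snippet defines a record type with the six documented columns. The names `DocumentText`, `missing_columns`, `row_to_record`, and `load_document_texts` are hypothetical helpers introduced here for illustration and are not part of the project's code. Treating every field as a string is an assumption, as is the availability of `pandas` and `openpyxl` for reading the `.xlsx` file.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DocumentText:
    """One row of the dataset; all fields are assumed to be text."""
    filename: str
    document_id: str
    content: str
    land_parcel_id: str
    land_parcel_name: str
    land_parcel_scanurl: str


# Column names taken from the field list in the dataset description.
EXPECTED_COLUMNS = tuple(f.name for f in fields(DocumentText))


def missing_columns(columns):
    """Return the documented columns that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]


def row_to_record(row):
    """Convert a mapping (e.g. one spreadsheet row) into a DocumentText."""
    missing = missing_columns(row.keys())
    if missing:
        raise KeyError(f"row is missing fields: {missing}")
    # Extra keys are ignored; values are cast to str (an assumption).
    return DocumentText(**{name: str(row[name]) for name in EXPECTED_COLUMNS})


def load_document_texts(path):
    """Load the spreadsheet into a list of records.

    Assumes pandas and openpyxl are installed; imported lazily so the
    helpers above remain usable without them.
    """
    import pandas as pd

    # Read every cell as text and replace empty cells with "".
    frame = pd.read_excel(path, dtype=str).fillna("")
    missing = missing_columns(frame.columns)
    if missing:
        raise ValueError(f"file is missing columns: {missing}")
    return [row_to_record(row) for row in frame.to_dict(orient="records")]
```

With a local copy of the file, `load_document_texts("document_texts.xlsx")` would return one `DocumentText` per parsed PDF. Checking the column names first makes a changed or truncated export fail with a clear message instead of a later attribute error.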
Source Data
The data is produced by the document_texts_creation module in the code of the project's GitHub repository.
Contact
- Homepage: DSSGx Munich organization page.
Variables
| Name | Description |
|---|---|
| filename | Name of the parsed PDF file. |
| document_id | Unique ID of the document. It combines the land parcel ID with the number of the document within that land parcel. |
| content | Extracted text content. |
| land_parcel_id | Unique ID of the land parcel for the document. |
| land_parcel_name | Name of the land parcel for the document. |
| land_parcel_scanurl | URL for the parsed content. |
Details
| Resource type | Funded research project dataset |
| Title | DSSGx 2023 - NRW Bebauungspläne (document_texts) |
| Creators | |
| Research Fields | Other |
| Size | 95.4 MB |
| Formats | Microsoft Excel (OpenXML) (.xlsx) |
| External Resource | https://huggingface.co/datasets/DSSGxMunich/document_text/blob/main/document_texts.xlsx |
| Countries | Germany |