Published October 5, 2023 | Version v2

DSSGx 2023 - NRW Bebauungspläne (document_texts)

Description

This dataset contains the output of parsing the PDF documents with Apache Tika. For each document, it records the land parcel the document refers to and the text content extracted from it.

Data Fields

  • filename: Name of the parsed PDF file.
  • document_id: Unique ID of the document, formed by combining the land parcel ID and the number of the document within that land parcel, joined by an underscore.
  • content: Extracted text content.
  • land_parcel_id: Unique ID of the land parcel for the document.
  • land_parcel_name: Name of the land parcel for the document.
  • land_parcel_scanurl: URL for the parsed content.
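
The field list above can be illustrated with a minimal Python sketch. It assumes that document_id joins its two parts with the last underscore in the ID, and that the .xlsx file has been downloaded locally and carries exactly the column names listed above. The helper names split_document_id and load_document_texts are hypothetical and not part of the project code; loading requires pandas and openpyxl.

```python
# Column names as listed under "Data Fields".
EXPECTED_COLUMNS = [
    "filename",
    "document_id",
    "content",
    "land_parcel_id",
    "land_parcel_name",
    "land_parcel_scanurl",
]


def split_document_id(document_id: str) -> tuple[str, str]:
    """Split a document_id into (land_parcel_id, document_number).

    Assumption: the two parts are joined by the last underscore in the ID.
    Both parts are returned as strings, since their exact format is not
    documented.
    """
    land_parcel_id, sep, document_number = document_id.rpartition("_")
    if not sep or not land_parcel_id or not document_number:
        raise ValueError(f"unexpected document_id format: {document_id!r}")
    return land_parcel_id, document_number


def load_document_texts(path: str):
    """Load a local copy of document_texts.xlsx into a pandas DataFrame."""
    # Imported lazily so split_document_id works without pandas installed.
    import pandas as pd

    # Read every column as text so IDs keep any leading zeros.
    df = pd.read_excel(path, dtype=str)
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df
```

After loading, split_document_id could be applied to the document_id column and the first part compared with land_parcel_id as a consistency check on the assumed ID format.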

Source Data

The data is produced by the module document_texts_creation in the project's code repository on GitHub.

Details

Resource type Funded research project dataset
Title DSSGx 2023 - NRW Bebauungspläne (document_texts)
Creators
  • Domenech Burin, Laia
  • Klotz, Jonas
  • Ding, Franca
  • Srivastava, Sanya
Research Fields Other
Size 95.4 MB
Formats Microsoft Excel (OpenXML) (.xlsx)
External Resource https://huggingface.co/datasets/DSSGxMunich/document_text/blob/main/document_texts.xlsx
Countries Germany