Published October 20, 2025 | Version v1

Gold Standard and Annotation Dataset for CO2 Emissions Annotation

  • 1. Ludwig-Maximilians-Universität München
  • 2. Munich Center for Machine Learning
  • 3. Deutsche Bundesbank

Description

This repository contains the results of a research project that provides a benchmark dataset for extracting greenhouse gas emissions from corporate annual and sustainability reports. The data collection methodology and a detailed description of the benchmark dataset are given in the accompanying Scientific Data (Nature) data paper listed under Related works.

The zipped datasets file contains two datasets, gold_standard and annotation_dataset. The outer zip file contains a password-protected inner zip file that holds the two datasets. To unpack the inner file, use the password provided in the outer zip file.
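The two-step unpacking can be sketched in Python as follows. This is a minimal sketch, not the authors' procedure: the function name unpack_nested is hypothetical, and the inner archive name and the password are passed in as parameters because this description does not state them. It also assumes the inner archive uses legacy ZipCrypto encryption, which is the only scheme the standard-library zipfile module can decrypt; an AES-encrypted archive would need a different tool.

```python
import zipfile
from pathlib import Path


def unpack_nested(outer_zip, inner_name, password, target_dir):
    """Extract the outer archive, then the password-protected inner archive.

    inner_name and password are placeholders supplied by the caller;
    their real values come from the contents of the outer zip file.
    """
    target = Path(target_dir)

    # Step 1: the outer archive is not encrypted.
    with zipfile.ZipFile(outer_zip) as outer:
        outer.extractall(target / "outer")

    # Step 2: the inner archive needs the password, passed as bytes.
    inner_path = target / "outer" / inner_name
    with zipfile.ZipFile(inner_path) as inner:
        inner.extractall(target / "datasets", pwd=password.encode("utf-8"))
        return sorted(inner.namelist())
```

The function returns the names of the extracted members, so a caller can check that both datasets are present before loading them.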

Data collection

  1. A Large Language Model (LLM) based pipeline was used to extract the greenhouse gas emissions from the reports (see columns prefixed with llm_ in annotation_dataset). The extracted emissions follow the categories Scope 1, Scope 2 (market-based), Scope 2 (location-based), and Scope 3, as defined in the Greenhouse Gas Protocol (GHGP) (see the scope variables).
  2. The pipeline output was annotated in three phases: first by non-experts (columns prefixed with non_expert_ in annotation_dataset); then, where the non-experts disagreed, by expert groups (columns prefixed with exp_group_ in annotation_dataset); and finally, where the expert groups disagreed, in a discussion among all experts (columns prefixed with exp__disc in annotation_dataset). The annotation guidelines for the non-experts and experts are also included in this repository.
  3. The annotation results from all three phases are combined to form the final benchmark dataset, gold_standard. Codebooks detailing each variable of the two datasets are also provided. More details about the annotation template and the data wrangling scripts can be found in the GitHub repository listed under Related works.
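The three-phase escalation described above can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the project's actual wrangling script: the function name resolve and its arguments are hypothetical, and "agreement" is assumed here to mean that all annotators in a phase reported exactly the same value.

```python
def resolve(non_expert, expert_groups=None, discussion=None):
    """Return (value, phase) for one emission figure.

    non_expert    -- values entered by the non-expert annotators
    expert_groups -- values entered by the expert groups (if consulted)
    discussion    -- value agreed in the discussion of all experts (if held)
    """
    # Phase 1: non-experts agree, so no escalation is needed.
    if len(set(non_expert)) == 1:
        return non_expert[0], "non_expert"

    # Phase 2: non-experts disagreed; use the expert groups if they agree.
    if expert_groups and len(set(expert_groups)) == 1:
        return expert_groups[0], "expert_group"

    # Phase 3: expert groups disagreed; fall back to the expert discussion.
    return discussion, "expert_discussion"
```

For example, resolve([100, 120], [120, 120]) is settled in the second phase, because the non-experts disagree but the expert groups do not.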

Merging of datasets

Users can match the two datasets (gold_standard and annotation_dataset) on the combination of company_name, report_year, and merge_id (index column). The merge_id already encodes the company name and report year implicitly, but company_name and report_year should still be included as join variables to avoid duplicated columns in the join result. Such a merge is useful, for example, when comparing LLM extractions to the gold standard data.
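A merge along these lines might look as follows in Python with pandas. The sketch uses toy data: the key columns company_name, report_year, and merge_id come from the description above, while the company names, the merge_id values, and the value columns scope_1 and llm_scope_1 are hypothetical placeholders. The codebooks give the actual variable names.

```python
import pandas as pd

# Join keys named in the dataset description.
KEYS = ["company_name", "report_year", "merge_id"]

# Toy stand-ins for the two datasets (all values are made up).
gold_standard = pd.DataFrame({
    "company_name": ["Acme AG", "Beta SE"],
    "report_year": [2022, 2022],
    "merge_id": ["acme_2022", "beta_2022"],
    "scope_1": [1200.0, 450.0],        # hypothetical gold value column
})
annotation_dataset = pd.DataFrame({
    "company_name": ["Acme AG", "Beta SE"],
    "report_year": [2022, 2022],
    "merge_id": ["acme_2022", "beta_2022"],
    "llm_scope_1": [1200.0, 400.0],    # hypothetical LLM extraction column
})

# Joining on all three keys keeps one copy of each key column;
# joining on merge_id alone would yield company_name_x / company_name_y.
merged = gold_standard.merge(annotation_dataset, on=KEYS, how="inner")

# Flag rows where the LLM extraction equals the gold standard value.
merged["llm_matches_gold"] = merged["scope_1"] == merged["llm_scope_1"]
print(merged[KEYS + ["scope_1", "llm_scope_1", "llm_matches_gold"]])
```

If merge_id is stored as the DataFrame index rather than as a regular column after loading, calling reset_index() first makes it available as a join key.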

Files

codebook_gold_standard.csv

Files (4.7 MB)

Checksum Size
md5:af291b895c9c8c02c1f29d5a4965c0af 14.1 kB
md5:64c001e5d854ebd3624571f10ceb130f 4.3 kB
md5:bbaf29a2c1fee11f49a0cc77de1bbd30 213.6 kB
md5:153dc9fc6065561db62302b079746c1d 1.3 MB
md5:5033da6297152d20cf618b8fd9d13ec6 3.2 MB

Details

Resource type Funded research project dataset
Title Gold Standard and Annotation Dataset for CO2 Emissions Annotation
Creators
  • Beck, Jacob (1, 2)
  • Steinberg, Anna (1, 2)
  • Dimmelmeier, Andreas (1)
  • Domenech Burin, Laia (1)
  • Kormanyos, Emily (3)
  • Fehr, Maurice (3)
  • Schierholz, Malte (1, 2)
Size 4.7 MB
Dates of collection December 10, 2024

Additional Details

Related works

Is cited by
Data paper: 10.1038/s41597-025-05664-8 (DOI)
Is part of
Computational notebook: https://github.com/soda-lmu/gist-data-descriptor (URL)