GSAP-ERE

Otto, Wolfgang

doi:10.60914/c4c1d-s0587

Published December 1, 2025 | Version v1

Open dataset

Otto, Wolfgang (Contact person)¹

1. GESIS - Leibniz-Institut für Sozialwissenschaften

Contributors

Data curators:

Data managers:

1. GESIS - Leibniz-Institut für Sozialwissenschaften

Description

GSAP-ERE Dataset

Introduction

GSAP-ERE is a dataset for training and evaluating models for entity and relation extraction of machine-learning-related entities in scholarly publications (e.g., research papers). More information on the GSAP project is available at data.gesis.org/gsap.

Data Citation

Please reference:

Wolfgang Otto, Lu Gan, Sharmila Upadhyaya, Saurav Karmakar, Stefan Dietze (2026). GSAP-ERE: Fine-Grained Scholarly Entity and Relation Extraction Focused on Machine Learning. AAAI 2026.

Version Information

The annotation was completed on April 15, 2025. This version can be used to reproduce the results in the associated publication, Otto et al. 2026 (cited above).

Train/Dev/Test-Split

The dataset was partitioned into training, validation, and test sets with an 80% / 10% / 10% split, respectively, ensuring that all data points from a single publication remained within a single set to prevent data leakage.
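The publication-level separation described above can be checked after download. The following is a minimal sketch, assuming that `doc_id` identifies a publication (each line of a file is one publication); the helper names `doc_ids` and `find_leakage` are hypothetical and not part of the dataset or its tooling.

```python
import json
from typing import Dict, Iterable, Set, Tuple


def doc_ids(lines: Iterable[str]) -> Set:
    """Collect the doc_id of every non-empty JSON line."""
    return {json.loads(line)["doc_id"] for line in lines if line.strip()}


def find_leakage(splits: Dict[str, Set]) -> Dict[Tuple[str, str], Set]:
    """Return the doc_ids shared by each pair of splits (empty dict if none)."""
    names = sorted(splits)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = splits[first] & splits[second]
            if shared:
                overlaps[(first, second)] = shared
    return overlaps


# Usage with the downloaded files (paths are assumptions):
# splits = {}
# for name in ("train", "dev", "test"):
#     with open(f"{name}.jsonl", encoding="utf-8") as fh:
#         splits[name] = doc_ids(fh)
# print(find_leakage(splits))  # an empty dict means no publication is shared
```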

Label Sets

Our 10 Named Entity Labels in 4 semantic groups

Method related:
- MLModel
- MLModelGeneric
- ModelArchitecture
- Method
Data related:
- Dataset
- DatasetGeneric
- DataSource
Task related:
- Task
Referencing:
- ReferenceLink
- URL

Our 18 Relation Labels (incl. domain and range) in 7 semantic groups

Model Design:
- Method -usedFor-> Method|MLModel(Generic)
- MLModel(Generic)|Method -architecture-> ModelArchitecture
- MLModel(Generic) -isBasedOn-> MLModel(Generic)
Task Binding:
- MLModel(Generic)|Method -appliedTo-> Task
- Dataset(Generic) -benchmarkFor-> Task
Data Usage:
- MLModel(Generic)|Method -trainedOn-> Dataset(Generic)
- MLModel(Generic)|Method -evaluatedOn-> Dataset(Generic)
Data Provenance:
- Dataset(Generic) -transformedFrom-> Dataset(Generic)
- Dataset(Generic) -generatedBy-> Method
- Dataset(Generic) -sourcedFrom-> DataSource
Data Properties:
- Dataset(Generic) -size-> DatasetGeneric
- Dataset(Generic) -hasInstanceType-> DatasetGeneric
Peer Relations:
- <Any> -coreference-> <Same as Subject>
- <Any> -isPartOf-> <Same as Subject>
- <Any> -isHyponymOf-> <Same as Subject>
- <Any> -isComparedTo-> <Same as Subject>
Referencing:
- <Any> -citation-> ReferenceLink
- <Any> -url-> URL
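
The domain and range constraints listed above can be encoded as a lookup table, for example to sanity-check model predictions. This is a sketch under two assumptions that the list does not spell out: `X(Generic)` is read as "X or XGeneric", and `<Same as Subject>` is read as "identical entity label". The names `SCHEMA`, `PEER_RELATIONS`, and `is_valid` are hypothetical.

```python
# Label groups; "X(Generic)" is interpreted as {X, XGeneric} (assumption).
MODEL = {"MLModel", "MLModelGeneric"}
DATA = {"Dataset", "DatasetGeneric"}
ANY = MODEL | DATA | {
    "ModelArchitecture", "Method", "DataSource", "Task", "ReferenceLink", "URL",
}

# relation label -> (allowed subject labels, allowed object labels)
SCHEMA = {
    "usedFor": ({"Method"}, {"Method"} | MODEL),
    "architecture": (MODEL | {"Method"}, {"ModelArchitecture"}),
    "isBasedOn": (MODEL, MODEL),
    "appliedTo": (MODEL | {"Method"}, {"Task"}),
    "benchmarkFor": (DATA, {"Task"}),
    "trainedOn": (MODEL | {"Method"}, DATA),
    "evaluatedOn": (MODEL | {"Method"}, DATA),
    "transformedFrom": (DATA, DATA),
    "generatedBy": (DATA, {"Method"}),
    "sourcedFrom": (DATA, {"DataSource"}),
    "size": (DATA, {"DatasetGeneric"}),
    "hasInstanceType": (DATA, {"DatasetGeneric"}),
    "citation": (ANY, {"ReferenceLink"}),
    "url": (ANY, {"URL"}),
}

# Peer relations: object label must equal subject label (assumed reading).
PEER_RELATIONS = {"coreference", "isPartOf", "isHyponymOf", "isComparedTo"}


def is_valid(subject_label: str, relation: str, object_label: str) -> bool:
    """Check a (subject, relation, object) label triple against the schema."""
    if relation in PEER_RELATIONS:
        return subject_label in ANY and subject_label == object_label
    if relation not in SCHEMA:
        return False
    domain, range_ = SCHEMA[relation]
    return subject_label in domain and object_label in range_
```

If generic and named variants (e.g., MLModel and MLModelGeneric) should count as the same type for peer relations, the equality test in `is_valid` would need to be loosened accordingly.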

Format

The files are encoded in the JSONL format: each line is a valid JSON object representing one publication.

Data field for each document

The data format of the jsonl files is compatible with many works in the field of entity and relation extraction (e.g., HGERE).

Each line of the jsonl file represents one document containing the following fields:

sentences: A list of sentences, each represented by a list of token IDs (`[[<sentence_1_token_1_id>, <sentence_1_token_2_id>, ...], [<sentence_2_token_1_id>, ...], ...]`). Resolve the word_ids based on the vocabulary given in our GitHub project GSAP-ERE.

ner: A list of named entities for each sentence. Each entity is represented by a list of three elements: begin of entity, end of entity, and label (e.g., `[[<begin_idx>, <end_idx>, "MLModel"], ...]`). This includes stacked (i.e., overlapping) annotations.

relations: A list of relations for each sentence. Each relation is represented by the begin and end of the subject, the begin and end of the object, and the relation label (e.g., `[<begin_idx_subject>, <end_idx_subject>, <begin_idx_object>, <end_idx_object>, "isPartOf"]`).

clusters: This field exists for compatibility reasons. In this version no reference clusters are annotated. This will be reflected in future versions of the dataset.

doc_id: a unique identifier for each document

annotator: ID of the initial annotator of the document (0 or 1). During the refinement process, other annotators might have corrected some of the annotations.
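
The fields above can be read with the standard library alone. The sketch below parses JSONL lines and counts entity and relation labels per document. The record in `TOY_LINE` is an invented toy example that only mirrors the documented field layout; it is not taken from the dataset, and the helper names are hypothetical. The sketch deliberately does not map span indices back to tokens, since the description does not state whether indices are relative to the sentence or to the document.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def iter_documents(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one parsed document per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def label_counts(doc: Dict) -> Tuple[Counter, Counter]:
    """Count entity labels (3rd element) and relation labels (5th element)."""
    entity_labels = Counter(
        span[2] for sentence in doc["ner"] for span in sentence
    )
    relation_labels = Counter(
        rel[4] for sentence in doc["relations"] for rel in sentence
    )
    return entity_labels, relation_labels


# Invented toy record following the documented layout (not real data).
TOY_LINE = json.dumps({
    "doc_id": "toy-0",
    "annotator": 0,
    "sentences": [[11, 12, 13], [14, 15]],
    "ner": [[[0, 0, "MLModel"], [2, 2, "Dataset"]], []],
    "relations": [[[0, 0, 2, 2, "trainedOn"]], []],
    "clusters": [],
})

for document in iter_documents([TOY_LINE]):
    entities, relations = label_counts(document)
    print(document["doc_id"], dict(entities), dict(relations))

# With a downloaded file, pass the open file handle instead:
# with open("train.jsonl", encoding="utf-8") as fh:
#     for document in iter_documents(fh):
#         ...
```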

Files

Files (17.3 MB)

Name	Size
dev.jsonl md5:b3e379d168a21ca371cfaea80b1cbede	1.9 MB
test.jsonl md5:0462bc3719ec2ffbd2951fc4635a45a8	1.7 MB
train.jsonl md5:8466624a14973b2d88d658068c417db3	13.7 MB
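
The MD5 checksums listed above can be used to confirm that a download is complete. A minimal sketch, assuming the three files sit in one local directory; `md5_of_file` and `verify` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Dict

# Checksums as listed in the file table above.
EXPECTED_MD5 = {
    "dev.jsonl": "b3e379d168a21ca371cfaea80b1cbede",
    "test.jsonl": "0462bc3719ec2ffbd2951fc4635a45a8",
    "train.jsonl": "8466624a14973b2d88d658068c417db3",
}


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory=".") -> Dict[str, bool]:
    """Map each expected file name to True if it exists and its MD5 matches."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = path.is_file() and md5_of_file(path) == expected
    return results


# Usage: print(verify("path/to/downloaded/files"))
```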

Resource type	Open dataset
Title	GSAP-ERE
Alternative title	GSAP-ERE 1.0
Creators	Otto, Wolfgang¹
Contributors	Otto, Wolfgang¹ Gan, Lu¹ Upadhyaya, Sharmila¹ Kanishka, Silva¹
Research Fields	Other
Size	100 publications
Formats	jsonl
License(s)	Creative Commons Attribution Non Commercial 4.0 International
Dates of collection	April 15, 2025

