There is a newer version of the record available.

Published August 22, 2023 | Version v3
Open dataset

MLW Zettelmaterial

  • 1. Bayerische Akademie der Wissenschaften (BAdW)
  • 2. LMU Mun
  • 3. Universität Zürich
  • 4. LMU Munich

Description

General information:

The data set comprises a total of 114,653 images (18.9 GB), corresponding to 3,507 distinct lemmas.
All images are in RGB but are not uniform in size, i.e., height and width differ from image to image.
Additionally, the corresponding lemma is available for each image in a separate JSON file.

Structure:

Most record cards follow the same structure, consisting of three main parts.

  • The first one (1), and the one deemed most challenging, is the lemma, which is always located in the upper left corner of the record card. 
  • The second part (2) is the index of the text where the lemma is found. 
  • The third part (3) contains a text extract in which the word (corresponding to the lemma) occurs in context.

Character inventory:

There are 17 distinct first characters in total: eight letters that each occur in both upper- and lowercase, plus one special character.
The capitalization of a word plays a crucial role since a word's meaning changes depending on capitalization. 
Since the majority of our data stems from the S-series of the dictionary, most lemmas start with the letter "s". 
Likewise, a larger number of lemmas also start with "m", "v", "t", "u", "l", and "n".
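
The case-sensitive character inventory described above can be tallied with a few lines of Python. This is only a minimal sketch: the function name `first_char_counts` and the sample lemmas are made up for illustration and are not taken from the data set.

```python
from collections import Counter


def first_char_counts(lemmas):
    """Count lemmas by first character, keeping upper- and lowercase apart."""
    return Counter(lemma[0] for lemma in lemmas if lemma)


# Made-up sample lemmas (assumption), not taken from the data set.
sample = ["sacer", "Sacer", "salus", "mater", "vita", "Sabinus"]
counts = first_char_counts(sample)

# "s" and "S" are counted separately because capitalization changes the meaning.
for char, n in counts.most_common():
    print(char, n)
```

Keeping the two cases apart, rather than lowercasing first, mirrors the point that a capitalized lemma is a different target from its lowercase counterpart.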

Occurrence frequencies:

  • A total of 2,420 lemmas (69%) appear on ten record cards or fewer
  • 854 lemmas (24.4%) appear on between 11 and 100 record cards
  • 233 lemmas (6.6%) appear on more than 100 record cards
  • 1,123 lemmas (approximately 36.7%) appear on only one record card
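
For readers who want to recompute such a breakdown from their own copy of the labels, the following sketch groups lemmas by the number of record cards they appear on. It assumes the three ranges do not overlap (1 to 10, 11 to 100, more than 100); the function name `frequency_buckets` and the toy input are hypothetical.

```python
from collections import Counter


def frequency_buckets(card_lemmas):
    """Group lemmas by how many record cards carry them.

    card_lemmas holds one lemma per record card (assumed input format).
    """
    cards_per_lemma = Counter(card_lemmas)
    buckets = {"1-10": 0, "11-100": 0, ">100": 0, "exactly 1": 0}
    for n_cards in cards_per_lemma.values():
        if n_cards <= 10:
            buckets["1-10"] += 1
        elif n_cards <= 100:
            buckets["11-100"] += 1
        else:
            buckets[">100"] += 1
        # Singletons are a subset of the first bucket and are tracked separately.
        if n_cards == 1:
            buckets["exactly 1"] += 1
    return buckets


# Toy input (assumption): four lemmas on 1, 10, 11 and 101 cards.
toy_cards = ["a"] * 1 + ["b"] * 10 + ["c"] * 11 + ["d"] * 101
buckets = frequency_buckets(toy_cards)
print(buckets)
```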

Lengths:

  • Lemma lengths range from one character up to a maximum of 19 characters. 
  • The average length of the lemmas lies between five and six characters. 
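
The length figures can be checked in the same spirit. The sketch below measures length in Unicode code points, which is an assumption about how the characters were counted; `length_summary` and the sample lemmas are illustrative only.

```python
from statistics import mean


def length_summary(lemmas):
    """Return (shortest, longest, average) lemma length in code points."""
    lengths = [len(lemma) for lemma in lemmas]
    return min(lengths), max(lengths), mean(lengths)


# Made-up sample lemmas (assumption), not taken from the data set.
shortest, longest, average = length_summary(["a", "sal", "sacer", "salutatio"])
print(shortest, longest, average)
```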

Availability:

Research activity:

  • Koch, P., Nuñez, G. V., Arias, E. G., Heumann, C., Schöffel, M., Häberlin, A., & Aßenmacher, M. (2023). A tailored Handwritten-Text-Recognition System for Medieval Latin. arXiv preprint arXiv:2308.09368.

Variables

Name    Description
id      ID of the image of the record card
lemma   Lemma of the record card (target variable)
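
A minimal sketch of how the two variables might be read into an id-to-lemma lookup follows. The exact layout of the JSON files is not described in this record, so the list-of-objects structure, the integer ids, and the helper name `lemma_by_id` are all assumptions.

```python
import json


def lemma_by_id(json_text):
    """Build an {id: lemma} lookup from a JSON array of records.

    Assumes each record is an object with the keys "id" and "lemma".
    """
    records = json.loads(json_text)
    return {record["id"]: record["lemma"] for record in records}


# Hypothetical file content; the real files may be structured differently.
raw = '[{"id": 1, "lemma": "sacer"}, {"id": 2, "lemma": "Sacer"}]'
labels = lemma_by_id(raw)
print(labels)
```

The lookup keeps the lemma strings untouched, so case differences between target labels are preserved.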

Details

Resource type: Open dataset
Title: MLW Zettelmaterial
Translated title: MLW record cards
Creators:
  • Bayerische Akademie der Wissenschaften (BAdW)
Contributors:
  • Schöffel, Matthias (1)
  • Garces Arias, Esteban (2)
  • Häberlin, Alexander (3)
  • Aßenmacher, Matthias (4)
  • Heumann, Christian (4)
  • Koch, Philipp (4)
  • Nuñez, Gilary Vera (4)
Research Fields: Economic & Social History
Size: 18.9 GiB
Formats: JSON format (.json)
License(s): Creative Commons Attribution 4.0 International
External Resource: https://huggingface.co/datasets/misoda/MLW_data
Countries: Germany