Published September 15, 2025 | Version v1
Open dataset

10k-fraud-detection

  • 1. Ludwig-Maximilians-Universität München
  • 2. Munich Center for Machine Learning

Description

Description: To the best of our knowledge, this is the first publicly available data set that pairs specific sections of 10-K reports with labels indicating fraudulent behavior. To enable other researchers to replicate or extend our work, we describe the data scraping and labeling processes transparently, release our source code, and make the data available. In our paper "Do Companies Reveal Their Own Fraud? - A Novel Data Set for Fraud Detection Based on 10-K Reports", we report baseline results for a given data split motivated by specific temporal characteristics of fraud.
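
As a rough illustration of a temporally motivated split, the sketch below separates records by filing date so that all evaluation filings come after the training filings. It is not the split used in the paper: the cutoff date, the ISO date format of `filing_date`, and the toy records are assumptions made only for this example.

```python
from datetime import date

def chronological_split(records, cutoff):
    """Return (train, test): filings before `cutoff` go to train, the rest to test.

    Assumes each record has a `filing_date` string in ISO format (YYYY-MM-DD);
    the actual date format in the data set may differ.
    """
    train, test = [], []
    for record in records:
        filed = date.fromisoformat(record["filing_date"])
        (train if filed < cutoff else test).append(record)
    return train, test

# Hypothetical toy records, not taken from the data set.
toy_records = [
    {"cik": 1, "filing_date": "2001-03-30", "fraudulent": 0},
    {"cik": 2, "filing_date": "2004-12-31", "fraudulent": 1},
    {"cik": 3, "filing_date": "2005-01-01", "fraudulent": 0},
    {"cik": 4, "filing_date": "2010-06-15", "fraudulent": 1},
]

# Hypothetical cutoff chosen only for illustration.
train, test = chronological_split(toy_records, date(2005, 1, 1))
print(len(train), len(test))
```

Splitting on time rather than at random keeps filings from later years out of training, which matters when fraud is typically uncovered only some time after the affected reports were filed.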

Scientific Work: This data set accompanies the following research article:
Amin, Moustafa & Aßenmacher, Matthias (2025). "Do Companies Reveal Their Own Fraud? - A Novel Data Set for Fraud Detection Based on 10-K Reports". In Proceedings of the 10th Workshop on Financial Technology and Natural Language Processing (FinNLP), Suzhou, China (Hybrid). Association for Computational Linguistics.

Variables

Name Description
cik Central Index Key (unique company identifier)
name Company name
city City
state State
sic Standard Industry Classification number
incorp_state State of incorporation
filing_type Filing type
fye Fiscal year end
filing_date Date the 10-K was filed
reporting_date Period the 10-K reports on
url URL to the filing
mda Management Discussion and Analysis section (text)
late_filing Indicates a 10-K405 (late filing)
transition_filing Indicates a 10-KT (transition report)
amend_filing Indicates amended 10-K (any 10-K ending in /A)
dataTime Date and time of the AAER
respondents Names of respondents in the AAER
fraud_start Beginning date of the fraudulent period (mm-yyyy)
fraud_end End date of the fraudulent period (mm-yyyy)
revoked Revocation date of Exchange Act registration (mm-yyyy)
certainty_start Binary indicator: certainty regarding fraud start date
certainty_end Binary indicator: certainty regarding fraud end date
17a 17(a) Securities Act violation
17a2 17(a)(2) Securities Act violation
17a3 17(a)(3) Securities Act violation
17b 17(b) Securities Act violation
5a 5(a) Securities Act violation
5b1 5(b)(1) Securities Act violation
5c 5(c) Securities Act violation
10b 10(b) Securities Exchange Act violation
13a 13(a) Securities Exchange Act violation
12b20 Section 12b rule 12b-20 Securities Exchange Act violation
12b25 Section 12b rule 12b-25 Securities Exchange Act violation
13a1 Section 13a rule 13a-1 Securities Exchange Act violation
13a10 Section 13a rule 13a-10 Securities Exchange Act violation
13a11 Section 13a rule 13a-11 Securities Exchange Act violation
13a13 Section 13a rule 13a-13 Securities Exchange Act violation
13a14 Section 13a rule 13a-14 Securities Exchange Act violation
13a15 Section 13a rule 13a-15 Securities Exchange Act violation
13a16 Section 13a rule 13a-16 Securities Exchange Act violation
13b2A 13(b)(2)(A) Securities Exchange Act violation
13b2B 13(b)(2)(B) Securities Exchange Act violation
13b5 13(b)(5) Securities Exchange Act violation
14a 14(a) Securities Exchange Act violation
14c 14(c) Securities Exchange Act violation
30A Foreign Corrupt Practices Act violation
100a2 100(a)(2) Regulation G of Securities Act violation
100b 100(b) Regulation G of Securities Act violation
19a 19(a) violation under Investment Company Act
105c7B 105(c)(7)(B) violation under SOX
corruption Binary indicator of corruption
amis Binary indicator of asset misappropriation
fsf Binary indicator of financial statement fraud
fraudulent Binary indicator of fraud
char_count Character count of the MDA text
word_count Word count of the MDA text
word_density Number of characters per word
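
The following sketch shows one way to work with a few of the variables above: parsing the mm-yyyy fraud dates and recomputing the text statistics from the MDA text. The whitespace tokenization, the inclusion of spaces in the character count, and the inclusive month count are assumptions of this example; the exact definitions behind `char_count`, `word_count`, and `word_density` in the data set may differ. The record is a made-up toy example.

```python
from datetime import date

def parse_mm_yyyy(value):
    """Parse a 'mm-yyyy' string into a date set to the first day of that month."""
    month, year = value.split("-")
    return date(int(year), int(month), 1)

def fraud_duration_months(record):
    """Number of months from fraud_start to fraud_end, counting both end months."""
    start = parse_mm_yyyy(record["fraud_start"])
    end = parse_mm_yyyy(record["fraud_end"])
    return (end.year - start.year) * 12 + (end.month - start.month) + 1

def text_stats(mda):
    """Recompute simple length statistics for an MDA text.

    Assumption: characters include whitespace and words are whitespace-separated.
    """
    char_count = len(mda)
    word_count = len(mda.split())
    word_density = char_count / word_count if word_count else 0.0
    return {
        "char_count": char_count,
        "word_count": word_count,
        "word_density": word_density,
    }

# Hypothetical record using the documented variable names.
record = {
    "mda": "Revenue grew strongly this year.",
    "fraud_start": "03-2001",
    "fraud_end": "11-2002",
}

stats = text_stats(record["mda"])
months = fraud_duration_months(record)
print(stats, months)
```

Comparing recomputed statistics against the stored columns is a quick sanity check after loading the JSON file, provided the counting conventions match.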

Details

Resource type Open dataset
Title 10k-fraud-detection
Creators
  • Amin, Moustafa (1)
  • Aßenmacher, Matthias (1, 2)
Research Fields Economics; Business Administration
Size 5.6 GB
Formats JSON format (.json)
License(s) Creative Commons Attribution 4.0 International
External Resource https://doi.org/10.5281/zenodo.17121948
Countries United States
Dates of collection May 21, 2025; May 22, 2025; October 10, 2024; December 19, 2024