10k-fraud-detection
Description
Description: To the best of our knowledge, we provide the first publicly available data set of selected sections of 10-K reports together with labels indicating fraudulent behavior. To enable other researchers to replicate or extend our work, we describe the data scraping and labeling processes transparently, release our source code, and make the data available. In our paper "Do Companies Reveal Their Own Fraud? - A Novel Data Set for Fraud Detection Based on 10-K Reports", we provide baseline results for a fixed data split motivated by specific temporal characteristics of fraud.
Scientific Work: This data set accompanies the following research article:
Amin, Moustafa & Aßenmacher, Matthias (2025). "Do Companies Reveal Their Own Fraud? - A Novel Data Set for Fraud Detection Based on 10-K Reports". In Proceedings of the 10th Workshop on Financial Technology and Natural Language Processing (FinNLP), Suzhou, China (Hybrid). Association for Computational Linguistics.
Variables
| Name | Description |
|---|---|
| cik | Central Index Key (unique company identifier) |
| name | Company name |
| city | City |
| state | State |
| sic | Standard Industrial Classification (SIC) code |
| incorp_state | State of incorporation |
| filing_type | Filing type |
| fye | Fiscal year end |
| filing_date | Date the 10-K was filed |
| reporting_date | Period the 10-K reports on |
| url | URL to the filing |
| mda | Management's Discussion and Analysis (MD&A) section (text) |
| late_filing | Indicates a 10-K405 (late filing) |
| transition_filing | Indicates a 10-KT (transition report) |
| amend_filing | Indicates an amended 10-K (any 10-K filing type ending in /A) |
| dataTime | Date and time of the AAER |
| respondents | Names of respondents in the AAER |
| fraud_start | Beginning date of the fraudulent period (mm-yyyy) |
| fraud_end | End date of the fraudulent period (mm-yyyy) |
| revoked | Revocation date of Exchange Act registration (mm-yyyy) |
| certainty_start | Binary indicator: certainty regarding fraud start date |
| certainty_end | Binary indicator: certainty regarding fraud end date |
| 17a | 17(a) Securities Act violation |
| 17a2 | 17(a)(2) Securities Act violation |
| 17a3 | 17(a)(3) Securities Act violation |
| 17b | 17(b) Securities Act violation |
| 5a | 5(a) Securities Act violation |
| 5b1 | 5(b)(1) Securities Act violation |
| 5c | 5(c) Securities Act violation |
| 10b | 10(b) Securities Exchange Act violation |
| 13a | 13(a) Securities Exchange Act violation |
| 12b20 | Section 12b rule 12b-20 Securities Exchange Act violation |
| 12b25 | Section 12b rule 12b-25 Securities Exchange Act violation |
| 13a1 | Section 13a rule 13a-1 Securities Exchange Act violation |
| 13a10 | Section 13a rule 13a-10 Securities Exchange Act violation |
| 13a11 | Section 13a rule 13a-11 Securities Exchange Act violation |
| 13a13 | Section 13a rule 13a-13 Securities Exchange Act violation |
| 13a14 | Section 13a rule 13a-14 Securities Exchange Act violation |
| 13a15 | Section 13a rule 13a-15 Securities Exchange Act violation |
| 13a16 | Section 13a rule 13a-16 Securities Exchange Act violation |
| 13b2A | 13(b)(2)(A) Securities Exchange Act violation |
| 13b2B | 13(b)(2)(B) Securities Exchange Act violation |
| 13b5 | 13(b)(5) Securities Exchange Act violation |
| 14a | 14(a) Securities Exchange Act violation |
| 14c | 14(c) Securities Exchange Act violation |
| 30A | Foreign Corrupt Practices Act violation |
| 100a2 | 100(a)(2) Regulation G of Securities Act violation |
| 100b | 100(b) Regulation G of Securities Act violation |
| 19a | 19(a) violation under Investment Company Act |
| 105c7B | 105(c)(7)(B) violation under SOX |
| corruption | Binary indicator of corruption |
| amis | Binary indicator of asset misappropriation |
| fsf | Binary indicator of financial statement fraud |
| fraudulent | Binary indicator of fraud |
| char_count | Character count of the MDA text |
| word_count | Word count of the MDA text |
| word_density | Number of characters per word |
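The text statistics and the mm-yyyy date fields above can be illustrated with a minimal Python sketch. It rests on assumptions that the table does not confirm: words are split on whitespace, `word_density` is `char_count / word_count`, and the fraud period is inclusive at both ends. The helper names (`text_stats`, `parse_mm_yyyy`, `month_in_period`) and the example record are hypothetical and are not part of the released source code.

```python
from datetime import date


def text_stats(mda: str) -> dict:
    """Recompute the three text-statistic columns for one MDA string."""
    char_count = len(mda)
    # Assumption: words are whitespace-separated tokens.
    word_count = len(mda.split())
    # Assumption: density is characters per word; 0.0 for empty text.
    word_density = char_count / word_count if word_count else 0.0
    return {
        "char_count": char_count,
        "word_count": word_count,
        "word_density": word_density,
    }


def parse_mm_yyyy(value: str) -> date:
    """Parse an 'mm-yyyy' string into the first day of that month."""
    month, year = value.split("-")
    return date(int(year), int(month), 1)


def month_in_period(day: date, start: str, end: str) -> bool:
    """Check whether the month of `day` lies in [start, end] (inclusive)."""
    month = day.replace(day=1)
    return parse_mm_yyyy(start) <= month <= parse_mm_yyyy(end)


# Hypothetical record using the column names from the table above.
record = {
    "mda": "Revenue increased due to higher sales.",
    "fraud_start": "03-2001",
    "fraud_end": "11-2002",
}

stats = text_stats(record["mda"])
inside = month_in_period(date(2002, 6, 30), record["fraud_start"], record["fraud_end"])
print(stats, inside)
```

Comparing whole months avoids guessing a day of the month, since `fraud_start` and `fraud_end` are only given at month resolution.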
Details
| Field | Value |
|---|---|
| Resource type | Open dataset |
| Title | 10k-fraud-detection |
| Creators | |
| Research Fields | Economics; Business Administration |
| Size | 5.6 GB |
| Formats | JSON format (.json) |
| License(s) | Creative Commons Attribution 4.0 International |
| External Resource | https://doi.org/10.5281/zenodo.17121948 |
| Countries | United States |
| Dates of collection | May 21, 2025; May 22, 2025; October 10, 2024; December 19, 2024; May 22, 2025 |
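Since the record only states that the data is distributed as JSON, the following Python sketch reads records without assuming a particular layout. It guesses between a single JSON array and JSON Lines (one object per line) from the first non-whitespace character. The function name `iter_records` and the file name in the usage comment are hypothetical, and the actual file layout may differ.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield records from a JSON array file or a JSON Lines file."""
    with Path(path).open("r", encoding="utf-8") as fh:
        # Peek at the first non-whitespace character to guess the layout.
        first = ""
        while True:
            ch = fh.read(1)
            if not ch or not ch.isspace():
                first = ch
                break
        fh.seek(0)

        if first == "[":
            # JSON array: parsed in one call, so the file must fit in memory.
            yield from json.load(fh)
        else:
            # Assumed JSON Lines: one object per line, parsed lazily.
            for line in fh:
                line = line.strip()
                if line:
                    yield json.loads(line)


# Usage (hypothetical file name), counting records labeled as fraudulent:
# n_fraud = sum(int(r["fraudulent"]) for r in iter_records("10k_fraud.json"))
```

With a total size of 5.6 GB, the array branch may need a large amount of memory; a streaming JSON parser would be a reasonable substitute in that case.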