Published 2022 | Version v2
Open dataset

Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

Description

We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.

Variables

Name Description
text The document text.
created_timestamp The timestamp of the document's creation, included when the original source provided one. These timestamps may be inaccurate: for example, CourtListener case opinions carry the time the opinion was uploaded to CourtListener, not the time it was published.
downloaded_timestamp The time at which the document was scraped.
url The source URL.
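
The four variables above can be pictured as one record per document. The following is a minimal sketch, not the dataset's actual loader. It assumes each record is stored as one JSON object per line, that timestamps are plain strings, and that a missing created_timestamp appears as an empty string or is absent. None of these storage details are stated in this record. The names `Record` and `parse_record` and the sample values are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    """One document, using the four variables listed above."""
    text: str
    url: str
    downloaded_timestamp: str
    # Only present when the source provided it, and possibly inaccurate.
    created_timestamp: Optional[str] = None


def parse_record(line: str) -> Record:
    """Parse one JSON-encoded record (assumed format), tolerating a
    missing or empty created_timestamp."""
    raw = json.loads(line)
    return Record(
        text=raw["text"],
        url=raw["url"],
        downloaded_timestamp=raw["downloaded_timestamp"],
        created_timestamp=raw.get("created_timestamp") or None,
    )


# Hypothetical example line; every value here is invented for illustration.
sample = json.dumps({
    "text": "Example opinion text.",
    "created_timestamp": "",
    "downloaded_timestamp": "2022-01-01",
    "url": "https://example.com/opinion/1",
})
record = parse_record(sample)
```

Keeping created_timestamp optional reflects the caveat above: downstream code should treat it as a hint rather than as an authoritative publication date.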

Details

Resource type Open dataset
Title Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
Creators
  • Henderson, Peter
  • Krass, Mark S
  • Zheng, Lucia
  • Guha, Neel
  • Manning, Christopher D.
  • Jurafsky, Dan
  • Ho, Daniel E.
License(s) Creative Commons Attribution Share Alike 4.0 International
External Resource https://huggingface.co/datasets/pile-of-law/pile-of-law