Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset

Henderson, Peter; Krass, Mark S; Zheng, Lucia; Guha, Neel; Manning, Christopher D.; Jurafsky, Dan; Ho, Daniel E.

Published 2022 | Version v2

Open dataset Metadata-only

Description

We curate a large corpus of legal and administrative data. The utility of this data is twofold: (1) to aggregate legal and administrative data sources that demonstrate different norms and legal standards for data filtering; (2) to collect a dataset that can be used in the future for pretraining legal-domain language models, a key direction in access-to-justice initiatives.

Name	Description
text	The document text.
created_timestamp	The time the document was created, when the original source provides one. These timestamps may be inaccurate: for example, CourtListener case opinions carry the time the opinion was uploaded to CourtListener, not the time it was published.
downloaded_timestamp	The time the document was scraped.
url	The source URL.
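
As an illustration of the four variables above, the following minimal sketch shows one way a consumer might validate a single record before use. The `LegalDocument` class, the `parse_record` helper, and the sample values are hypothetical and are not part of the dataset tooling. The sketch also assumes that a missing creation timestamp appears as an empty string, and it treats timestamps as opaque strings because their format is not specified here.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Field names taken from the variables table above.
FIELDS = ("text", "created_timestamp", "downloaded_timestamp", "url")


@dataclass(frozen=True)
class LegalDocument:
    """Hypothetical container for one record; not part of the dataset itself."""
    text: str
    created_timestamp: Optional[str]  # may be absent, and may be inaccurate
    downloaded_timestamp: str
    url: str


def parse_record(record: Mapping[str, str]) -> LegalDocument:
    """Check that all four documented fields are present and wrap them."""
    missing = [name for name in FIELDS if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    # Assumption: an empty string means the source gave no creation time.
    created = record["created_timestamp"] or None
    return LegalDocument(
        text=record["text"],
        created_timestamp=created,
        downloaded_timestamp=record["downloaded_timestamp"],
        url=record["url"],
    )


# Made-up sample record, for illustration only.
sample = parse_record({
    "text": "Sample opinion text.",
    "created_timestamp": "",
    "downloaded_timestamp": "sample-download-time",
    "url": "https://example.org/opinion/1",
})
print(sample.created_timestamp is None)
```

Because `created_timestamp` can reflect an upload time rather than a publication time, downstream code should avoid using it as an authoritative publication date without checking the source.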

Resource type	Open dataset
Title	Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset
Creators	Henderson, Peter; Krass, Mark S; Zheng, Lucia; Guha, Neel; Manning, Christopher D.; Jurafsky, Dan; Ho, Daniel E.
License(s)	Creative Commons Attribution Share Alike 4.0 International
External Resource	https://huggingface.co/datasets/pile-of-law/pile-of-law
