
Published 2015 | Version v2

The Stanford Natural Language Inference (SNLI) Corpus

Description

The Stanford Natural Language Inference (SNLI) corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. The corpus is only 0.09 GB in size.

It consists of a training, validation, and test set. The variables contained in each of these subsets are described below.

The data providers aim for it to serve both as a benchmark for evaluating representational systems for text, especially those induced by representation-learning methods, and as a resource for developing NLP models of any kind.
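The splits are distributed in JSON format, with each example carrying the fields listed under Variables below. A minimal loading sketch, assuming a JSON-lines layout (one object per line) and a hypothetical file name; the actual file names in the download may differ:

```python
import json

def load_snli(path):
    """Read an SNLI split stored as JSON lines: one example object per line.

    Pairs whose gold_label is '-' (no annotator majority) are skipped,
    as recommended when evaluating hard classification accuracy.
    """
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            if example["gold_label"] == "-":
                continue
            examples.append(example)
    return examples
```

Skipping '-' examples at load time keeps downstream accuracy computations honest, since those pairs have no agreed-upon gold label.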

The following paper introduces the corpus in detail. If you use the corpus in published work, please cite it:

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Files (411.8 MB)

Name Size md5
README.txt 5.8 kB md5:73e0df3e2383fdde5fbcef7d58d5bc5d
(name not shown) 7.9 MB md5:d5d4ec5bdddc4a8b650f681d8756da1c
(name not shown) 7.9 MB md5:2507a418d103a7e37c78e34c6f6c5fa0
(name not shown) 395.9 MB md5:0d5ed5038816f516a1521623c9f964f6

Variables

Name Description
gold_label This is the label chosen by the majority of annotators. Where no majority exists, this is '-', and the pair should not be included when evaluating hard classification accuracy.
sentence1_binary_parse The same parse as in sentence{1,2}_parse, but formatted for use in tree-structured neural networks with no unary nodes and no labels.
sentence2_binary_parse The same parse as in sentence{1,2}_parse, but formatted for use in tree-structured neural networks with no unary nodes and no labels.
sentence1_parse The parse produced by the Stanford Parser (3.5.2, case insensitive PCFG, trained on the standard training set augmented with the parsed Brown Corpus) in Penn Treebank format.
sentence2_parse The parse produced by the Stanford Parser (3.5.2, case insensitive PCFG, trained on the standard training set augmented with the parsed Brown Corpus) in Penn Treebank format.
sentence1 The premise caption that was supplied to the author of the pair.
sentence2 The hypothesis caption that was written by the author of the pair.
captionID A unique identifier for each sentence1 from the original Flickr30k example.
pairID A unique identifier for each sentence1--sentence2 pair.
label1–label5 These are the individual labels from annotators in phases 1 and 2. The first label comes from the phase 1 author and is the only label for examples that did not undergo phase 2 annotation. In a few cases, one of the phase 2 labels may be blank, indicating that an annotator saw the example but could not annotate it.

Details

Resource type Open dataset
Title The Stanford Natural Language Inference (SNLI) Corpus
Alternative title SNLI
Creators
  • Bowman, Samuel R.
  • Angeli, Gabor
  • Potts, Christopher
  • Manning, Christopher D.
Size 0.09 GB
Formats JSON format (.json); Text (generally ASCII or ISO 8859-n) (.txt)
License(s) Creative Commons Attribution-ShareAlike 4.0 International
External Resource https://nlp.stanford.edu/projects/snli/