
Published 2015 | Version v2

The Stanford Natural Language Inference (SNLI) Corpus

Description

The Stanford Natural Language Inference (SNLI) corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. The corpus is only 0.09 GB in size.

It consists of a training, validation, and test set. The variables contained in each of these subsets are described below.

The data providers aim for it to serve both as a benchmark for evaluating representational systems for text, especially those induced by representation-learning methods, and as a resource for developing NLP models of any kind.
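The splits are distributed in JSON format, with each example carrying the fields listed under Variables below. A minimal loading sketch, assuming a JSON-lines layout (one object per line) and a hypothetical file name; the actual file names in the download may differ:

```python
import json

def load_snli(path):
    """Read an SNLI split stored as JSON lines: one example object per line.

    Pairs whose gold_label is '-' (no annotator majority) are skipped,
    as recommended when evaluating hard classification accuracy.
    """
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            if example["gold_label"] == "-":
                continue
            examples.append(example)
    return examples
```

Skipping '-' examples at load time keeps downstream accuracy computations honest, since those pairs have no agreed-upon gold label.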

The following paper introduces the corpus in detail. If you use the corpus in published work, please cite it:

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Files (411.8 MB)

Name Size md5
README.txt 5.8 kB md5:73e0df3e2383fdde5fbcef7d58d5bc5d
(name not shown) 7.9 MB md5:d5d4ec5bdddc4a8b650f681d8756da1c
(name not shown) 7.9 MB md5:2507a418d103a7e37c78e34c6f6c5fa0
(name not shown) 395.9 MB md5:0d5ed5038816f516a1521623c9f964f6

Variables

Name Description
gold_label This is the label chosen by the majority of annotators. Where no majority exists, this is '-', and the pair should not be included when evaluating hard classification accuracy.
sentence1_binary_parse The same parse as in sentence{1,2}_parse, but formatted for use in tree-structured neural networks with no unary nodes and no labels.
sentence2_binary_parse The same parse as in sentence{1,2}_parse, but formatted for use in tree-structured neural networks with no unary nodes and no labels.
sentence1_parse The parse produced by the Stanford Parser (3.5.2, case insensitive PCFG, trained on the standard training set augmented with the parsed Brown Corpus) in Penn Treebank format.
sentence2_parse The parse produced by the Stanford Parser (3.5.2, case insensitive PCFG, trained on the standard training set augmented with the parsed Brown Corpus) in Penn Treebank format.
sentence1 The premise caption that was supplied to the author of the pair.
sentence2 The hypothesis caption that was written by the author of the pair.
captionID A unique identifier for each sentence1 from the original Flickr30k example.
pairID A unique identifier for each sentence1--sentence2 pair.
label1–label5 These are the individual labels from annotators in phases 1 and 2. The first label comes from the phase 1 author and is the only label for examples that did not undergo phase 2 annotation. In a few cases, one of the phase 2 labels may be blank, indicating that an annotator saw the example but could not annotate it.

Details

Resource type Open dataset
Title The Stanford Natural Language Inference (SNLI) Corpus
Alternative title SNLI
Creators
  • Bowman, Samuel R.
  • Angeli, Gabor
  • Potts, Christopher
  • Manning, Christopher D.
Size 0.09 GB
Formats JSON format (.json); Text (generally ASCII or ISO 8859-n) (.txt)
License(s) Creative Commons Attribution-ShareAlike 4.0 International
External Resource https://nlp.stanford.edu/projects/snli/