The Stanford Natural Language Inference (SNLI) Corpus
Stanford University
Description
The Stanford Natural Language Inference (SNLI) corpus (version 1.0) is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. The full corpus is only 0.09 GB.
It consists of a training, validation, and test set. The variables contained in each of these subsets are described below.
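As a brief sketch, each split is distributed as a JSON-lines file (one record per sentence pair). The file names below assume the standard snli_1.0 distribution layout and are not stated on this page:

```python
import json

def load_split(path):
    """Read one SNLI split from a JSON-lines file into a list of dicts,
    one dict per sentence pair."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Hypothetical file names, assuming the standard snli_1.0 layout.
splits = {name: f"snli_1.0_{name}.jsonl" for name in ("train", "dev", "test")}
```

Each returned dict carries the fields listed in the Variables table below (gold_label, sentence1, sentence2, and so on).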
The data providers aim for it to serve both as a benchmark for evaluating representational systems for text, especially those induced by representation-learning methods, and as a resource for developing NLP models of any kind.
The following paper introduces the corpus in detail. If you use the corpus in published work, please cite it:
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Files
README.txt
Variables
| Name | Description |
|---|---|
| gold_label | This is the label chosen by the majority of annotators. Where no majority exists, this is '-', and the pair should not be included when evaluating hard classification accuracy. |
| sentence1_binary_parse | The same parse as in sentence{1,2}_parse, but formatted for use in tree-structured neural networks with no unary nodes and no labels. |
| sentence2_binary_parse | The same parse as in sentence{1,2}_parse, but formatted for use in tree-structured neural networks with no unary nodes and no labels. |
| sentence1_parse | The parse produced by the Stanford Parser (3.5.2, case insensitive PCFG, trained on the standard training set augmented with the parsed Brown Corpus) in Penn Treebank format. |
| sentence2_parse | The parse produced by the Stanford Parser (3.5.2, case insensitive PCFG, trained on the standard training set augmented with the parsed Brown Corpus) in Penn Treebank format. |
| sentence1 | The premise caption that was supplied to the author of the pair. |
| sentence2 | The hypothesis caption that was written by the author of the pair. |
| captionID | A unique identifier for each sentence1 from the original Flickr30k example. |
| pairID | A unique identifier for each sentence1--sentence2 pair. |
| label1 through label5 | The individual labels from annotators in phases 1 and 2. The first label (label1) comes from the phase 1 author, and is the only label for examples that did not undergo phase 2 annotation. In a few cases, one of the phase 2 labels may be blank, indicating that an annotator saw the example but could not annotate it. |
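The gold_label and binary-parse conventions above can be illustrated with a minimal sketch (the function names are illustrative, not part of the corpus):

```python
def tokens_from_binary_parse(binary_parse):
    """Recover the token sequence from an unlabeled binary parse string
    such as '( ( A man ) ( is walking ) )'."""
    return [tok for tok in binary_parse.split() if tok not in ("(", ")")]

def usable(record):
    """Exclude pairs with no majority gold label ('-') from hard
    classification evaluation, as recommended above."""
    return record["gold_label"] != "-"
```

For example, `tokens_from_binary_parse("( ( A man ) ( is walking ) )")` returns `["A", "man", "is", "walking"]`.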
Details
| Field | Value |
|---|---|
| Resource type | Open dataset |
| Title | The Stanford Natural Language Inference (SNLI) Corpus |
| Alternative title | SNLI |
| Creators | Samuel R. Bowman, Gabor Angeli, Christopher Potts, Christopher D. Manning |
| Research Fields | Business Administration, Economics, Psychology, Sociology, Political Science, Economic & Social History, Communication Sciences, Educational Research, Other |
| Size | 0.09 GB |
| Formats | JSON (.json), Text (generally ASCII or ISO 8859-n) (.txt) |
| License(s) | Creative Commons Attribution Share Alike 4.0 International |
| External Resource | https://nlp.stanford.edu/projects/snli/ |