YouTube-8M Dataset

Abu-El-Haija, Sami; Kothari, Nisarg; Lee, Joonseok; Natsev, Paul; Toderici, George; Varadarajan, Balakrishnan; Vijayanarasimhan, Sudheendra

Published 2019 | Version v2

Open dataset Metadata-only

1. Google Research

Description

YouTube-8M is a large-scale labeled video dataset. It consists of millions of YouTube video IDs with high-quality machine-generated, partially human-verified annotations drawn from a diverse vocabulary of 3,800+ visual entities.

It comprises two subsets:

8M Segments Dataset: 230K human-verified segment labels, 1,000 classes, 5 segments/video
8M Dataset (May 2018 version, current): 6.1M videos, 3,862 classes, 3.0 labels/video, 2.6B audio-visual features

The dataset comes with precomputed audio-visual features from billions of frames and audio segments and is designed to fit on a single hard disk. This makes it possible to train a strong baseline model in less than a day on a single GPU. At the same time, the dataset's scale and diversity enable deep exploration of complex audio-visual models that can take weeks to train, even in a distributed fashion.

The YouTube-8M dataset is available for download as TensorFlow Record files from the project website (see External Resource below). Starter code for the dataset can be found on the project's GitHub page.

Name	Description
id	Video id
labels	Video-level labels
segment_start_times	Starting time of a given segment
segment_end_times	End time of a given segment
segment_labels	Label of the segment
segment_scores	Whether the segment label is present (1) or not (0)
rgb	Frame-level RGB features
audio	Frame-level audio features
mean_rgb	Average of all RGB features for the video
mean_audio	Average of all audio features for the video
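
To illustrate how the variables above relate to each other, here is a minimal sketch in plain Python. It is not the official starter code and does not read TensorFlow Record files. The VideoRecord class, the helper names mean_feature and positive_segments, and all sample values are hypothetical; only the field names come from the table above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VideoRecord:
    """Hypothetical in-memory view of one record, using the variable names above."""
    id: str
    labels: List[int]
    rgb: List[List[float]] = field(default_factory=list)    # one feature vector per frame
    audio: List[List[float]] = field(default_factory=list)  # one feature vector per frame


def mean_feature(frames: List[List[float]]) -> List[float]:
    """Average per-frame vectors element-wise into one video-level vector."""
    if not frames:
        return []
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frame vectors must have the same length")
    return [sum(frame[i] for frame in frames) / len(frames) for i in range(dim)]


def positive_segments(
    start_times: List[int],
    end_times: List[int],
    seg_labels: List[int],
    seg_scores: List[int],
) -> List[Tuple[int, int, int]]:
    """Return (start, end, label) for segments whose score is 1 (label present)."""
    return [
        (start, end, label)
        for start, end, label, score in zip(start_times, end_times, seg_labels, seg_scores)
        if score == 1
    ]


# Made-up sample values; real feature vectors are much longer.
record = VideoRecord(
    id="example",  # placeholder, not a real video id
    labels=[3, 17],
    rgb=[[0.0, 2.0], [1.0, 4.0], [2.0, 6.0]],
    audio=[[1.0], [3.0]],
)
mean_rgb = mean_feature(record.rgb)      # plays the role of mean_rgb
mean_audio = mean_feature(record.audio)  # plays the role of mean_audio

# Made-up segment annotations: the second segment is marked "not present".
segments = positive_segments([30, 60, 90], [35, 65, 95], [3, 3, 17], [1, 0, 1])
```

In this sketch, mean_rgb and mean_audio are simply the element-wise averages of the frame-level vectors, matching the descriptions in the table, and segment_scores acts as a 0/1 mask over the parallel segment lists.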

Resource type	Open dataset
Title	YouTube-8M Dataset
Creators	Abu-El-Haija, Sami¹; Kothari, Nisarg¹; Lee, Joonseok¹; Natsev, Paul¹; Toderici, George¹; Varadarajan, Balakrishnan¹; Vijayanarasimhan, Sudheendra¹
Research Fields	Business Administration; Economics; Psychology; Sociology; Political Science; Economic & Social History; Communication Sciences; Educational Research; Other
Size	1530 GB
Formats	TensorFlow Record Files
License(s)	Creative Commons Attribution 4.0 International
External Resource	https://research.google.com/youtube8m/download.html
Companies	Google; YouTube
Industries	Social Media
Dates of collection	June 1, 2019

Is cited by: Journal article: 10.48550/arXiv.1612.07828 (DOI)

