YouTube Dataset Analysis
Use Cases
Possible uses for this dataset could include:
- Sentiment analysis in a variety of forms
- Categorising YouTube videos based on their comments and statistics.
- Training ML algorithms like RNNs to generate their own YouTube comments.
- Analysing what factors affect how popular a YouTube video will be.
- Statistical analysis over time.
Dataset Description
This dataset includes several months (and counting) of data on daily trending YouTube videos.
Data is included for the US, GB, DE, CA, and FR regions (USA, Great Britain, Germany, Canada, and France, respectively), with up to 200 listed trending videos per day.
EDIT: Now includes data from RU, MX, KR, JP and IN regions (Russia, Mexico, South Korea, Japan and India respectively) over the same time period.
Each region’s data is in a separate file. Data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count.
The data also includes a category_id field, whose meaning varies between regions. To retrieve the category for a specific video, look up its category_id in the associated JSON file. One such file is included for each region in the dataset.
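As a rough illustration of that lookup, the sketch below builds an id-to-title mapping from one region's category file. It assumes the JSON follows the YouTube Data API layout (a top-level "items" list whose entries carry "id" and "snippet"/"title"); the function name category_titles is hypothetical.

```python
import json


def category_titles(json_path):
    """Map integer category ids to titles for one region's category file.

    Assumes a top-level "items" list with "id" and "snippet" -> "title"
    keys; adjust the keys if the file is laid out differently.
    """
    with open(json_path, encoding="utf-8") as handle:
        payload = json.load(handle)
    return {int(item["id"]): item["snippet"]["title"] for item in payload["items"]}
```

Because ids can mean different things in different regions, the mapping should be built from the file that matches the region of the video being looked up.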
Citation
Mitchell J. (November 30, 2018). Trending YouTube Video Statistics, Version 1. Retrieved September 6, 2025 from https://www.kaggle.com/datasets/datasnaek/youtube-new/version/1.
Data Preparation
- Merge input/youtube-new/**videos.csv into a single dataframe (videos_df).
- Merge input/youtube-new/**_category_id.json into a single dataframe (categories_df).
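The two merges could be sketched with pandas roughly as follows. This is one possible approach, not necessarily the one used here: the function names, the single-star glob patterns, the encoding parameter, and the extra filename column are all assumptions, as is the "items" layout of the JSON files.

```python
import json
from pathlib import Path

import pandas as pd


def load_videos(input_dir, encoding="utf-8"):
    """Concatenate every *videos.csv file, recording each row's source file."""
    frames = []
    for path in sorted(Path(input_dir).glob("*videos.csv")):
        frame = pd.read_csv(path, encoding=encoding)
        frame["filename"] = path.name  # assumption: kept to recover the region later
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


def load_categories(input_dir):
    """Flatten the "items" list of every *_category_id.json file."""
    frames = []
    for path in sorted(Path(input_dir).glob("*_category_id.json")):
        with open(path, encoding="utf-8") as handle:
            items = json.load(handle)["items"]  # assumption: API-style layout
        # sep="_" turns nested keys such as snippet.title into snippet_title
        frame = pd.json_normalize(items, sep="_")
        frame["filename"] = path.name
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)
```

Typical usage would be videos_df = load_videos("input/youtube-new") and categories_df = load_categories("input/youtube-new"). Both functions raise a ValueError from pd.concat if no files match, and some region files may need a different encoding than UTF-8.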
Data Preprocessing & Cleaning
- Extract the country_cd from the filename.
- Convert the category_id column into int.
- Convert the trending_date into datetime.
- Drop the unnecessary columns (filename, video_id, kind, etag, snippet_channelId).
- Merge the categories_df into the videos_df on category_id and country_cd.
- Rename columns snippet_title to category_title, snippet_assignable to category_assignable.
- Write the cleaned dataframe to working/videos_cleaned.csv.
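The cleaning steps above might look roughly like this in pandas. It is a minimal sketch under several assumptions: both frames carry a filename column that starts with the two-letter region code, the category frame stores its id in a column named id, trending_date is formatted as YY.DD.MM, and the merge is a left join. The function name clean_videos is hypothetical.

```python
import pandas as pd


def clean_videos(videos_df, categories_df):
    """Apply the cleaning steps and return the merged dataframe."""
    videos = videos_df.copy()
    categories = categories_df.copy()

    # Assumption: filenames look like USvideos.csv / US_category_id.json
    videos["country_cd"] = videos["filename"].str[:2]
    categories["country_cd"] = categories["filename"].str[:2]

    # Assumption: the category frame names its key column "id"
    categories = categories.rename(columns={"id": "category_id"})
    videos["category_id"] = videos["category_id"].astype(int)
    categories["category_id"] = categories["category_id"].astype(int)

    # Assumption: dates are stored as YY.DD.MM, e.g. 17.14.11
    videos["trending_date"] = pd.to_datetime(videos["trending_date"], format="%y.%d.%m")

    # errors="ignore" skips any listed column that is not present
    videos = videos.drop(columns=["filename", "video_id"], errors="ignore")
    categories = categories.drop(
        columns=["filename", "kind", "etag", "snippet_channelId"], errors="ignore"
    )

    # A left join keeps videos whose category id has no match in that region
    merged = videos.merge(categories, on=["category_id", "country_cd"], how="left")
    return merged.rename(
        columns={
            "snippet_title": "category_title",
            "snippet_assignable": "category_assignable",
        }
    )
```

The result can then be written out with cleaned.to_csv("working/videos_cleaned.csv", index=False), where cleaned is the returned dataframe.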