The datasets we use for this project describe YouTube videos. The data was collected by a web crawler using the YouTube APIs and is publicly available on a web page hosted on Simon Fraser University's website. The raw data contains the features listed below, shown with one example record.
Video ID | Uploader | Age | Category | Length | Views | Rate | Ratings | Comments | Related IDs |
bRPeEVpHiI8 | ufc | 744 | Sports | 154 | 331333 | 3.32 | 1284 | 103 | "bRPeEVpHiI8,D7-oJH4TR88,etc" |
Some features (e.g. Video ID) are self-explanatory, but several are not. Age refers to how old the video (not the uploader) is. Ratings is the number of times the video has been rated, while Rate is the actual rating value (number of stars). Similarly, Comments is the number of comments that have been posted, not the text of those comments.
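To make the field meanings concrete, here is a minimal sketch that reads one record into a typed structure. It assumes the pipe-separated layout of the example row above; the delimiter actually used inside the raw files is not stated here and may differ. The names `VideoRecord`, `parse_row`, and `SAMPLE_ROW` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List

# Example row copied from the table above; "etc" is the table's own
# placeholder for the related IDs that were left out.
SAMPLE_ROW = ('bRPeEVpHiI8 | ufc | 744 | Sports | 154 | 331333 | 3.32 | '
              '1284 | 103 | "bRPeEVpHiI8,D7-oJH4TR88,etc" |')


@dataclass
class VideoRecord:
    video_id: str
    uploader: str
    age: int                # age of the video, not of the uploader
    category: str
    length: int
    views: int
    rate: float             # rating value (number of stars)
    ratings: int            # how many times the video was rated
    comments: int           # how many comments were posted
    related_ids: List[str]


def parse_row(row: str) -> VideoRecord:
    """Parse one pipe-separated row laid out like SAMPLE_ROW (assumed format)."""
    # Drop the trailing pipe, then split into the ten expected cells.
    cells = [c.strip() for c in row.strip().strip("|").split("|")]
    if len(cells) != 10:
        raise ValueError(f"expected 10 fields, got {len(cells)}")
    # Related IDs are a quoted, comma-separated string in the example.
    related = [r for r in cells[9].strip('"').split(",") if r]
    return VideoRecord(
        video_id=cells[0],
        uploader=cells[1],
        age=int(cells[2]),
        category=cells[3],
        length=int(cells[4]),
        views=int(cells[5]),
        rate=float(cells[6]),
        ratings=int(cells[7]),
        comments=int(cells[8]),
        related_ids=related,
    )


record = parse_row(SAMPLE_ROW)
print(record.uploader, record.views, record.rate, record.ratings)
```

Keeping Rate as a float and Ratings as an integer in the structure mirrors the distinction drawn above between the rating value and the rating count.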
More than 100 datasets are available on the original site. Because the web crawler ran for a different length of time on each crawl, the datasets differ in size: a small one can be under 100 KB, while a large one can exceed 10 MB. The oldest dataset was created in February 2007 and the newest in September 2008. We use the dataset 0.txt, created on March 2, 2007. To visualize it, we preprocessed the data: Views and Rate (a.k.a. Stars) were converted from numerical to categorical values, and some values of Category were also changed. The details are as follows:
Views:
< 10k: fewer than 10,000 views
< 50k: 10,000 to 49,999 views
< 100k: 50,000 to 99,999 views
≥ 100k: 100,000 views or more
Rate (Stars):
1: Rate of at least 0 and less than 1
2: Rate of at least 1 and less than 2
3: Rate of at least 2 and less than 3
4: Rate of at least 3 and less than 4
5: Rate from 4 to 5 (inclusive)
Category:
1. Howto & DIY
2. Pets & Animals
3. Autos & Vehicles
4. Gadgets & Games
5. Travel & Places
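The conversion of Views and Rate from numerical to categorical values described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the preprocessing script actually used: the function names `views_bin` and `stars_bin` are hypothetical, and the view bins are assumed to be non-overlapping (so "< 50k" covers 10,000 up to 49,999).

```python
def views_bin(views: int) -> str:
    """Map a raw view count to one of the four Views categories."""
    if views < 0:
        raise ValueError("views must be non-negative")
    if views < 10_000:
        return "< 10k"
    if views < 50_000:      # assumed to mean 10,000 to 49,999
        return "< 50k"
    if views < 100_000:     # assumed to mean 50,000 to 99,999
        return "< 100k"
    return "≥ 100k"


def stars_bin(rate: float) -> int:
    """Map a raw Rate in [0, 5] to a Stars category from 1 to 5."""
    if not 0 <= rate <= 5:
        raise ValueError("rate must be between 0 and 5")
    # Each unit-wide interval maps to the next integer; the top bin
    # is closed on both ends so that a rate of exactly 5 stays in 5.
    return min(int(rate) + 1, 5)


# The example record shown earlier has 331333 views and a rate of 3.32.
print(views_bin(331333), stars_bin(3.32))
```

Plain comparisons keep the bin edges visible next to the labels, which makes it easy to check the code against the lists above.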