AI finds Golden 🌟 Kaggle Keywords
These ten keywords are the top predictors of whether a Kaggle dataset will become popular and receive a lot of upvotes. First, we need to get the Kaggle dataset that lists the 10,000 most popular datasets on Kaggle. This dataset contains valuable information such as the dataset name and the number of upvotes.
Next, we will launch Devra, go into our project, and use the dataset we have already downloaded from Kaggle. We will create a Python notebook for the analysis of the Kaggle dataset. Our goal is to identify which keywords in the dataset names have the most influence on whether a dataset will receive upvotes.
To achieve this, we will first find the median number of upvotes across all the datasets. Then, for each dataset, we will break the dataset name apart into individual keywords and standardize them by converting them to lowercase. We will keep only keywords that appear in at least ten different datasets. For each remaining keyword, we will find the median number of upvotes among the datasets whose names contain it.
Next, for each keyword, we will calculate the ratio of the median number of upvotes for that keyword divided by the median number of upvotes for all the datasets. We will rank the keywords by that ratio in descending order and display the top ten keywords.
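The steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the notebook that Devra generated: the function name `top_keywords`, the `(name, upvotes)` row layout, and the rule for splitting names into keywords (runs of letters and digits) are all hypothetical choices made for the example.

```python
import re
from collections import defaultdict
from statistics import median


def top_keywords(rows, min_datasets=10, top_n=10):
    """Rank keywords by (keyword median upvotes) / (overall median upvotes).

    rows is an iterable of (dataset_name, upvotes) pairs. This layout is an
    assumption for the sketch; the real dataset's column names may differ.
    """
    rows = list(rows)

    # Median upvotes across all datasets (assumed to be non-zero).
    overall_median = median(upvotes for _, upvotes in rows)

    # Collect the upvotes of every dataset whose name contains each keyword.
    votes_by_keyword = defaultdict(list)
    for name, upvotes in rows:
        # Lowercase the name and split it into alphanumeric keywords.
        # set() makes a keyword count once per dataset even if repeated.
        for keyword in set(re.findall(r"[a-z0-9]+", name.lower())):
            votes_by_keyword[keyword].append(upvotes)

    # Keep keywords seen in enough datasets and compute their ratio.
    ratios = {
        keyword: median(votes) / overall_median
        for keyword, votes in votes_by_keyword.items()
        if len(votes) >= min_datasets
    }

    # Highest ratio first, limited to the top_n keywords.
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

A ratio above 1 means that datasets with that keyword in their name have a higher median upvote count than the collection as a whole. The median is used rather than the mean, as in the plan above, so a handful of extremely popular datasets does not dominate the ranking.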
Let's create and start the analysis to see what happens. The first step is to go through our project and identify the available files. Then, we will look at the first few rows of our dataset. The plan looks good, and Devra has created the Python notebook. Let's launch it and see the results.
The top ten keywords are: coronavirus, star, suicide, 2016, brain, shootings, mnip, legends, expression, and women. This is a great start. Even if this isn't the perfect analysis we want, look at all the work that has already been done. We have completed our imports, loaded the dataset, calculated the median, extracted keywords, and performed keyword analysis. Just consider all the legwork in terms of data wrangling that has been accomplished.
The link to the Kaggle page for this dataset is in the description. Check it out!