For this blog post, I’m focusing on a friend’s Goodreads data. Goodreads is a social media and rating platform for books. I love fantasy novels and enjoy talking about them with other people even more. I used this user’s data because she has been using the Goodreads platform longer than me (but I’m aiming to catch up!).
The Data and Data Cleaning:
Goodreads allows users easy access to all their data. Hooray! I loaded the dataset into a pandas DataFrame for efficiency and ease. I found this dataset had 570 entries (one entry per book) and 31 attributes. The attributes included information about the book (e.g., ID, title, authors, ISBN, number of pages, year published), information provided by the user (e.g., user rating, date read, notes) and some global information from the Goodreads platform (e.g., average rating).
Initial data cleaning was kept to a minimum for this project. I knew pretty early on that I was most interested in exploring the user’s ratings. Goodreads also lets users save books they might be interested in reading later. Data cleaning for this dataset meant removing the entries that were saved but never rated by this user. This took the number of entries (books) from 570 to 378. The remainder of the analysis refers only to those 378 entries.
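As a minimal sketch of that filtering step in pandas: the column name `My Rating` and the 0-for-unrated convention are assumptions about the export format, and the tiny DataFrame here is stand-in data, not the real export.

```python
import pandas as pd

# Stand-in for the Goodreads export; the real file had 570 rows.
# Assumption: shelved-but-unrated books carry a "My Rating" of 0.
df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "My Rating": [5, 0, 4, 0],
})

rated = df[df["My Rating"] > 0].copy()  # drop saved-but-unrated entries
```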
Before diving too deep into the individual attributes, I like to look at how well populated the attributes are in the dataset. The chart gives us a good idea. In general, attributes populated by the Goodreads platform such as ID, title, authors and year published are very consistently reported. Information provided by the user such as written reviews, purchase information and date read was less well reported and in many instances never reported. That’s not surprising. Even when I use Goodreads, it’s mostly just to record a rating. This information simply tells us where to focus our attention.
Now we can take a look at some trends among the dataset’s attributes. To start, I looked at 4 different features: year published, number of pages, Goodreads global ratings and frequency of authors read.
For year published and number of pages, I tested the null hypothesis that the attributes were normally distributed. The hypothesis was rejected for both attributes. Consequently, it is more appropriate to describe these distributions in terms of medians and quartiles.
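As a sketch, this kind of normality check can be done with SciPy’s D’Agostino-Pearson test; the skewed sample below is synthetic stand-in data, not the actual export.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for a right-skewed "year published" column
years = 1950 + rng.exponential(scale=10, size=378)

stat, p = stats.normaltest(years)  # D'Agostino-Pearson test of normality
is_normal = p >= 0.05              # null hypothesis: sample is normal
```

When the null is rejected (as it is for this skewed sample), the median and quartiles are the more robust summary.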
Feature #1: Year Published
Here we can observe the distribution of the year published for the books recorded. The histogram is fairly straightforward: it gives a count of the number of books that fall within each 5-year date range (or bin). From this, we see that the number of books in a 5-year period doesn’t exceed 10 until after 2000, but remains consistently high thereafter. The box plot above it gives us more descriptive statistics. If you are not familiar with box plots, here is a quick guide:
- The shaded box represents the interquartile range (IQR). The left bound of the box is the 25th percentile (25% of the points fall below this line) and the right bound is the 75th percentile (25% of the points fall above this line). The line inside the box represents the median.
- The lines protruding on either side of the box are called “whiskers.” Their definition can vary, but I like to use the “Tukey boxplot” methodology, where each whisker extends to the furthest data point within 1.5×IQR of the lower and upper quartiles.
- Data points shown outside the whiskers are called “outliers”.
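These definitions can be computed directly with NumPy; a small sketch with made-up year values:

```python
import numpy as np

# Made-up publication years, just to exercise the definitions
data = np.array([1950, 1999, 2002, 2004, 2006, 2008, 2012, 2015, 2018])

q1, med, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# Tukey whiskers reach the furthest points still inside the fences
whisker_lo = data[data >= lo_fence].min()
whisker_hi = data[data <= hi_fence].max()
outliers = data[(data < lo_fence) | (data > hi_fence)]
```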
From this boxplot we notice several things. Our reader recorded books published from 1950 to 2018. The median was 2006, but the first quartile was 2002.75 and the third quartile was 2012. Roughly speaking, this means that 75% of the books read were published between 2002-2018 and 25% of the books were published between 1950-2002.
Feature #2: Number of Pages
Next, we are going to take a look at the distribution of a book’s length. Our reader rated books ranging from 30 to 1,238 pages. This distribution is more compact and skewed toward shorter books. The median book length is 272 pages. The first and third quartiles are 168 and 386 pages, meaning 50% of the books recorded contained between 168 and 386 pages. The IQR is 218 pages, while the total range is 1,208 pages. That’s a small IQR relative to overall range! That likely contributes to another interesting point: there are a high number of outliers.
Feature #3: Global Goodreads Ratings
It seemed particularly fitting to look at the Goodreads average rating (from all users) of the books. After all, I often like books that other people liked as well. The histogram shows the ratings distribution, as well as a Gaussian fitted curve. A Chi-squared test confirms that the ratings are normally distributed (woo-hoo!), with a mean of 4.10 and a standard deviation of 0.59.
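A sketch of the fit-and-test procedure with SciPy, using a synthetic sample in place of the real ratings column (generated from the post’s reported mean 4.10 and std 0.59):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-in for global ratings (post reports mean 4.10, std 0.59)
ratings = rng.normal(loc=4.10, scale=0.59, size=378)

mu, sigma = stats.norm.fit(ratings)        # maximum-likelihood Gaussian fit

# Chi-squared goodness of fit: compare binned counts to the fitted curve
counts, edges = np.histogram(ratings, bins=10)
expected = len(ratings) * np.diff(stats.norm.cdf(edges, mu, sigma))
expected *= counts.sum() / expected.sum()  # match totals, as chisquare requires
chi2, p = stats.chisquare(counts, expected, ddof=2)  # ddof=2: mu, sigma fitted
```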
So far, all the features discussed were immediately reported and thus fairly easy to analyze and present. I got to thinking about my reading patterns and how I select books. I came up with the following questions I might ask myself when selecting a book:
- Have I read this author before and did I like their novels?
- Do I like the genre? (I am a huge fantasy novel nut!)
Unfortunately, downloadable Goodreads data does not include the novel’s genre. I hope to pull in additional data sources in the future to explore this topic. For now, let’s focus on the first question (author frequency).
Feature #4: Author Frequency
For our final feature, let’s see if our reader has author biases! This bar chart shows the number of authors binned by the number of books our reader read by that author. This chart is pretty interesting. 119 of the authors recorded were read only once; those single reads make up 31.5% of the books. What I also found interesting was how often the reader returned to her favorites: only 4 authors penned approximately 19% of the total books read! I found this interesting enough to pull out the authors’ names: Brian K. Vaughan, Bill Willingham, Haruki Murakami and Neil Gaiman. I’m not surprised Neil Gaiman made the top list. He is amazing!
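Counting books per author is essentially a one-liner in pandas; a sketch on a made-up `Author` column (the names here are placeholders, not the real data):

```python
import pandas as pd

# Hypothetical "Author" column; the real export has one row per book
authors = pd.Series(
    ["Gaiman", "Gaiman", "Murakami", "Vaughan", "Lee", "Gaiman", "Murakami"]
)

books_per_author = authors.value_counts()
read_once = (books_per_author == 1).sum()  # authors sampled only once
favorites = books_per_author.head(2)       # most-read authors
```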
Optimizing Book Selection
My ultimate goal in this project was to find a way to optimize book selection. I don’t know about you, but time is a very important resource to me! When I take the time to read a book, I want to make sure it’s something I’ll really like! If we could build a model that accurately predicts a user’s rating, then the model could be used to make book recommendations to our reader.
For this project, I chose to build a naïve Bayes binary classifier to classify books into one of the following two categories: books that receive a rating of ‘5’ and those that receive a rating of less than ‘5’. The remainder of this blog focuses on (1) providing an intuitive understanding of the naïve Bayes classifier and (2) exploring the performance tradeoffs inherent to feature selection.
I really wanted to start with the simplest solution first. I’m a big believer in Occam’s razor. Often the simplest solution is best. Let’s start with a one-feature classifier. This will also set us up nicely for our intuitive explanation of the Bayes classifier. So, first let’s see which feature seems to have the strongest impact on the user ratings. The correlation matrix shown here gives us an idea of the various relationships between the potential features. In this heat map, the intensity of the color signifies the strength of the relationship and the color indicates the direction (i.e. blue means the variables are negatively correlated while red means they are positively correlated). The first column provides information about the relationship between our features and the user rating. The global rating is moderately correlated with the user rating and is the feature with the strongest relationship. The remaining features are weakly correlated with user rating. Given that, we will start by building our binary classifier based only on Goodreads global ratings.
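A correlation matrix like this is a single pandas call. The sketch below builds synthetic features (one deliberately correlated with the target, one deliberately unrelated) just to illustrate the mechanics; the column names are stand-ins for the real attributes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 378
glob = rng.normal(4.1, 0.6, n)
user = 0.6 * glob + rng.normal(0, 0.5, n)  # built to correlate with glob
pages = rng.normal(300, 150, n)            # built to be unrelated

df = pd.DataFrame({"user_rating": user, "global_rating": glob, "pages": pages})
corr = df.corr()  # Pearson correlation matrix
# A heat map of `corr` is then one call away, e.g.
# seaborn.heatmap(corr, cmap="coolwarm", annot=True)
```

The first row/column of `corr` plays the role described above: each feature’s relationship to the user rating.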
Naïve Bayes Classifier Brief Intro
The naïve Bayes classifier is a gold standard in some circles. A friend of mine (who did his PhD in machine learning and AI) once told me that for something naïve, a naïve Bayes classifier was pretty darn smart, and his group often used it as a benchmark for their research results. So, how does a naïve Bayes classifier work? Simple, it uses the Bayes theorem. 😉 Bayes theorem can be written as follows:

p(Ck | X) = p(X | Ck) · p(Ck) / p(X)
where Ck is the classification label and X is the set of features. In this example, Ck indicates whether the user rating equals 5 (C1) or is less than 5 (C0). So far, we are only examining a one-feature system, so X is the global Goodreads average. When we expand our example later to include more features, we add an assumption core to the naïve Bayes implementation: the features are all conditionally independent of each other given the class label. It’s what makes naïve Bayes naïve. This assumption allows us to re-write the formula:

p(Ck | X) ∝ p(Ck) · p(x1 | Ck) · p(x2 | Ck) · … · p(xn | Ck)
The histogram figures below show us the conditional probabilities of global ratings (our feature) given the two different classification labels. Both distributions were fitted with a Gaussian curve, and a Chi-squared test confirms that the distributions are in fact normally distributed. The biggest takeaway from this figure is the following:
- If the user gave the book a rating of 5, the global Goodreads ratings had a mean of 4.21 and a standard deviation of 0.22. The probability of the user giving any book a 5 is 47%.
- If the user gave the book a rating less than 5, the global Goodreads ratings had a mean of 4.01 and a standard deviation of 0.25. The probability of the user giving any book less than a 5 is 53%.
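Estimating the pieces the classifier needs (the class prior and the per-class Gaussian parameters) is straightforward. A sketch on synthetic data drawn with the statistics above; the class sizes are chosen to match the reported priors, not taken from the real dataset:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic stand-ins drawn with the statistics reported above
x5 = rng.normal(4.21, 0.22, 178)  # global ratings of books the user rated 5
x0 = rng.normal(4.01, 0.25, 200)  # global ratings of books rated below 5

prior5 = len(x5) / (len(x5) + len(x0))  # p(C1), roughly 0.47
mu5, sd5 = x5.mean(), x5.std(ddof=1)    # parameters of p(x | C1)
mu0, sd0 = x0.mean(), x0.std(ddof=1)    # parameters of p(x | C0)
```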
But how do we make this a classifier? Well, the classifier ultimately uses Bayes theorem as follows to determine the classification of an arbitrary input, X:

Ĉ = argmax over Ck of p(Ck) · p(X | Ck)
In plainer English, the classifier chooses the classification label with the higher probability given the input features, p(Ck|x). Since the p(x) term in the denominator is the same for both classes, it drops out of the decision rule, and we only compare the numerator from Bayes theorem! The figure below shows p(Ck|x)·p(x), that numerator, and how it changes with the input feature. For this example, the probability of a user rating equal to 5 is greater than not when the global rating is greater than 4.11.
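The decision rule can be sketched numerically: compare the two numerators p(Ck)·p(x|Ck) and find where the preferred class flips. Using the distribution parameters reported above, the crossover lands near the 4.11 threshold:

```python
import numpy as np
from scipy import stats

# Class-conditional parameters and priors reported in the post
mu5, sd5, p5 = 4.21, 0.22, 0.47   # user rating = 5
mu0, sd0, p0 = 4.01, 0.25, 0.53   # user rating < 5

def predict(x):
    # Compare the Bayes-theorem numerators p(Ck) * p(x | Ck)
    n5 = p5 * stats.norm.pdf(x, mu5, sd5)
    n0 = p0 * stats.norm.pdf(x, mu0, sd0)
    return 5 if n5 > n0 else 0

# Scan the plausible rating range for the point where the decision flips
xs = np.linspace(3.5, 4.5, 10001)
preds = np.array([predict(x) for x in xs])
flips = xs[np.nonzero(np.diff(preds))[0]]
```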
Feature Selection and Results
I trained the classifier with just the global ratings feature and measured performance on a test set. I used 10-fold cross validation to avoid overfitting and averaged the performance metrics from the separate runs. The results are given in the first row of the results table below.
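A sketch of the 10-fold procedure with scikit-learn’s `GaussianNB`, run on synthetic data mirroring the class-conditional statistics above (the actual results table comes from the real dataset, not this stand-in):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(4)
n = 378
# Synthetic data mirroring the class-conditional statistics above
y = (rng.random(n) < 0.47).astype(int)  # 1 = user rated the book a 5
x = np.where(y == 1, rng.normal(4.21, 0.22, n), rng.normal(4.01, 0.25, n))
X = x.reshape(-1, 1)                    # single feature: global rating

clf = GaussianNB()
acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
prec = cross_val_score(clf, X, y, cv=10, scoring="precision")
rec = cross_val_score(clf, X, y, cv=10, scoring="recall")
```

Averaging each score array (and looking at its standard deviation) gives the per-metric mean ± spread reported in the results table.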
There are three different performance metrics presented. The easiest, and most obvious, way to judge the classifier is accuracy (the percentage of the time the classifier correctly classifies a book). The second metric is called precision: the percentage of the time that, when the classifier predicted a 5, it was correct. The third metric is recall: the percentage of the actual 5’s that the classifier identified.
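All three metrics fall straight out of the confusion counts; a sketch with hypothetical counts (these numbers are made up for illustration):

```python
# Hypothetical confusion counts for the "rated 5" class
tp, fp = 30, 15   # predicted 5: correct vs. incorrect
fn, tn = 17, 32   # predicted <5: missed 5s vs. correct rejections

accuracy = (tp + tn) / (tp + fp + fn + tn)  # fraction classified correctly
precision = tp / (tp + fp)  # of the predicted 5s, how many really were 5s
recall = tp / (tp + fn)     # of the actual 5s, how many were found
```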
And it doesn’t look bad! In comparison, if we randomly selected one of these books without knowledge of the global rating (a crummy baseline, since the dataset itself would have looked different without this knowledge, but an okay comparison for now), the accuracy is 0.50, precision is 0.47, and recall is 0.50. On average the classifier’s precision is 15% better. In other words, if we picked up a book recommended by this classifier, the user would be 15% more likely to give it a ‘5’. On average, the classifier’s recall is also 18% better than random. So, the classifier is 18% more likely to not let a ‘5’ book fall through the cracks!
Finally, let’s see how the classifier reacts to additional features. Adding features can be a double-edged sword. To some extent, more features might tell us more about the reader’s preferences. On the flip side, too many features can quickly lead to overfitting. Overfitting is something to be especially cautious of with a smaller dataset such as this one. The idea behind overfitting via too many features is that you can build a classifier that fits this data perfectly… but so perfectly that it stops relying on general trends and becomes far less robust to new data points.
I started with two-feature systems. The ultimate question became: how should I pick the second feature? Proper feature selection is an enormous part of data science, so I’m going to keep it brief here and save that much longer discussion for a future blog. I’m going to look at two different two-feature systems and discuss intuitively why each added feature might be a good choice.
The first two-feature system will feature (haha) the global ratings and author frequency variables. Just from an intuitive how-do-I-select-books standpoint, I can see these two features being relevant inputs. I like books that were well liked by others, and I tend to like books by certain authors. From a more basic mathematical standpoint, author frequency was the second most correlated feature with user rating (note the correlation coefficient heat map). So, when we put them together and perform the same training/testing as before, we get the second row of our results table. In summary, adding the author frequency feature yields very slight improvements to accuracy (and its consistency) and to precision, along with a slight, more consistent degradation to recall.
For the second two-feature system, I used the global ratings average and the publication year as features. This could make an interesting combination. The correlation coefficient between user rating and year published was very close (within 0.01) to that of author frequency, so we weren’t losing much there. However, the correlation coefficient between year published and global ratings (the other feature) is significantly smaller than that of author frequency, meaning the two features (hopefully) each provide good information about our class label while each having something different to contribute. For example, imagine two features, global ratings and author frequency, are closely related. Maybe the user tends to read and like highly rated books all written by the same authors. In that case, the two features together may not give us extra information. Now consider two features that are less related to one another: global rating averages and year published. If our reader really likes highly rated novels but did NOT like older novels (well-rated or not), then that additional feature becomes quite useful. There is a lot of work behind these ideas, including a paper I helped pen here.
But, let’s focus on this particular system. I performed the same training/testing as before. In comparison to the original one-feature system, adding the year published feature gave a very slight boost to accuracy (and its consistency) and to precision (and its consistency), with a slight, more consistent degradation to recall.
Finally, let’s throw four features at the Bayes classifier and see what happens. More features is more better, right? Maybe. We still need to be wary of overfitting. I repeated the experiment with the same 10-fold training/testing as before with the following four features: global rating average, author frequency, year published and number of pages. The results are shown in the results table. In comparison to the global rating + year published system, precision saw a slight, more consistent performance drop, and recall was significantly and negatively impacted. In more general terms, the four-feature system was slightly more likely to not give you a bad recommendation (fewer false positives) but much more likely to miss good books (more false negatives).
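The feature-set comparison can be sketched as a small loop over column subsets. The features below are synthetic (one informative column standing in for the global rating, plus noise columns standing in for the others), so the numbers only illustrate the mechanics, not the post’s actual table:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(5)
n = 378
y = (rng.random(n) < 0.47).astype(int)
# One informative column (global rating) plus three noise stand-ins
glob = np.where(y == 1, rng.normal(4.21, 0.22, n), rng.normal(4.01, 0.25, n))
X = np.column_stack([glob, rng.normal(size=(n, 3))])

feature_sets = {"global": [0], "global+year": [0, 1], "all four": [0, 1, 2, 3]}
results = {}
for name, cols in feature_sets.items():
    scores = cross_val_score(GaussianNB(), X[:, cols], y, cv=10,
                             scoring="precision")
    results[name] = (round(scores.mean(), 2), round(scores.std(), 2))
```

Swapping the `scoring` argument (`"accuracy"`, `"recall"`) fills in the rest of a results table like the one discussed above.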
So, how do you pick the best classifier? That is completely dependent on your goals. Personally, as a mother of two small children, I have little time to curl up and get into a good book. Consequently, I value precision most. Precision minimizes the number of false positives, ensuring the books I do read are ones I’m going to really enjoy. That being said, I still value recall, but as a distant second. A good recall score helps me not miss any novels I might think are absolutely amazing. In the end, I would choose the Global Ratings Average + Year Published classifier, which had great precision without the recall performance degradation seen in the four-feature system.
In this work, I presented a new dataset containing user data from the social platform Goodreads and a classifier to help guide book selection. Here is a summary and my conclusions:
- Goodreads user data has a few really great features that were explored here including user rating, Goodreads global rating, number of pages, year published and (extrapolated) author frequency. The Goodreads user data has several more features that are frequently not populated by the user (and thus not useful).
- A naïve Bayes binary classifier was built to predict whether the user would give a novel a rating of 5 or less than 5. Feature selection was discussed and tested.
- The optimal classifier selected here was the Global Ratings Average + Year Published classifier which optimized precision without sacrificing recall. The precision was 0.66 ± 0.20 and recall was 0.64 ± 0.21 with 10-fold cross validation.
- Our classifier is a significant improvement over a completely random choice and was easy to implement. However, in comparison to more advanced binary classifiers (like neural networks!), it probably performs only moderately well. I hypothesize that the small dataset makes this a harder problem. I plan to address this in future blog posts.
Happy (and Good) Reading! If you want to see the code and analysis for yourself, feel free to visit my Github page. If you want to send me your Goodreads data and we can see where it all takes us, please email me!
“A book is a dream that you hold in your hands.” -Neil Gaiman