My husband and I love craft beer. We enjoy searching for rare and yummy beers, making our own home-brews, discussing the different flavors, and most importantly, drinking craft beers! This blog post is the first of a series where I explore different aspects of that passion.

The Data:

The dataset is the set of records from my husband Colin’s Untappd account. He has graciously volunteered to let me use his data here… thank you, hubby! Untappd is a craft beer social media platform. Every time we try a beer, Colin checks in to the beer on the Untappd app. During a check-in, users select their beer from the Untappd registry, make notes about their experience, rate their beer from 0 to 5, and provide other details such as serving style, location and flavor profile. I’m pretty sure Colin is actually using Untappd in this picture!

All in all, Colin’s check-ins span from July 2013 to September 2018. There are 2,166 unique check-ins. Each check-in has up to 26 distinct attributes, including info about the beer (name, style, type, ABV, IBU), user feedback (comments, rating), location information for both the brewery and the venue, and time data.

Data Cleaning:

The first stage of the data cleaning removed duplicate check-ins. My hubby is motivated to collect as many check-ins as possible. A weird anomaly creeps into the data because Untappd tracks both a specific beer and all its variants/vintages. For example, let’s take a look at Colin’s check-ins for Sierra Nevada Brewing Company’s Narwhal Imperial Stout.

There are 9 different check-ins for this beer. They include the base version, Narwhal Imperial Stout, and vintages from 2013, 2014, 2016 and 2017. (I know you are dying to know what difference a variation/vintage of a beer makes. Remain calm! There will be a whole other blog post dedicated to this topic in the near future!) For now, I’m addressing the topic because I wanted to make sure each check-in was in fact unique. Some users believe that you should get “credit” for both the beer and its vintage and will check in to both for the same experience. Case in point, here are two check-ins created within one minute of each other for the Narwhal Imperial Stout and the Narwhal Imperial Stout (2014). I consider that a duplicate check-in since they are both recording the same experience. So, I filtered out all check-ins that were created within an hour of each other where the name of one completely contained the name of the other.
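The duplicate rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the record layout (a name plus a timestamp) and the sample rows are made up, and keeping the earliest check-in of each duplicate pair is my choice for the sketch, not necessarily what the original cleaning did.

```python
from datetime import datetime, timedelta

# Hypothetical check-in records: (beer name, check-in time). The layout is an
# assumption for illustration, not the actual Untappd export schema.
checkins = [
    ("Narwhal Imperial Stout", datetime(2014, 11, 1, 20, 0)),
    ("Narwhal Imperial Stout (2014)", datetime(2014, 11, 1, 20, 1)),
    ("Narwhal Imperial Stout (2013)", datetime(2015, 2, 14, 19, 30)),
    ("Pale Ale", datetime(2014, 11, 1, 20, 30)),
]


def drop_duplicate_checkins(records, window=timedelta(hours=1)):
    """Keep a check-in unless an already-kept one within `window` has a name
    that contains, or is contained in, its name."""
    kept = []
    # Process in time order so the earliest check-in of a pair is the one kept.
    for name, when in sorted(records, key=lambda record: record[1]):
        is_duplicate = any(
            abs(when - kept_when) <= window
            and (name in kept_name or kept_name in name)
            for kept_name, kept_when in kept
        )
        if not is_duplicate:
            kept.append((name, when))
    return kept


unique_checkins = drop_duplicate_checkins(checkins)
for name, when in unique_checkins:
    print(when, name)
```

In this toy data the 2014 vintage is dropped because it was logged one minute after the base beer, while the 2013 vintage survives because it was logged months later.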

Next, I had to take out data points that didn’t contain the information I needed. Probably one of the most notable attributes for our purposes is the user’s rating of the check-in. So, I also filtered out all check-ins that did not have a user rating. (My husband is ridiculously diligent about rating his beer but says it’s cheating to rate your own homebrew. I politely disagree and believe they all deserve a 5! Nonetheless, there are a few check-ins with no rating that needed to be removed.)

What does the data look like now? Of the original 2,166 total check-ins, we are now left with 2,087 check-ins. The user’s rating average is 4.09 with a standard deviation of 0.47. We only lost 3.6% of our data points. Not bad.

Then, I looked at the global distribution of the check-in ratings. That’s when I noticed an odd bimodal peak. It turns out that in April 2015, the Untappd rating system changed from 0.5 increments to 0.25 increments. The change introduces an inconsistency in the data. Consequently, I filtered the data to only include check-ins after April 2015. Now we have a more normal-looking histogram with mean 4.18 and standard deviation 0.38.

The resulting dataset’s user ratings consist of 1,490 data points, which means we retained 68.8% of our original data points. While it’s always unfortunate to lose more than 30% of your original data, it is all too common in the process and emphasizes the importance of having a large dataset to start with. Fortunately, this dataset is still fairly large and we can continue to use it.
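As a quick sanity check on the numbers so far, here is a minimal sketch of the rating and date filters plus the retention arithmetic. The sample rows are invented, and the exact cut-off date is an assumption on my part: the text only says "after April 2015", so the first of May is a guess at the boundary.

```python
from datetime import date

# Hypothetical rows: (check-in date, rating, or None when the beer was unrated).
rows = [
    (date(2014, 6, 1), 4.0),     # before the rating-scale change
    (date(2015, 8, 15), None),   # unrated (for example, a homebrew)
    (date(2016, 3, 9), 4.25),
    (date(2017, 12, 24), 4.5),
]

# Assumed cut-off: the post only says "after April 2015".
CUTOFF = date(2015, 5, 1)

# Keep rated check-ins made on or after the cut-off.
cleaned = [(day, rating) for day, rating in rows
           if rating is not None and day >= CUTOFF]


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Counts quoted in the post.
original, after_first_pass, after_cutoff = 2166, 2087, 1490

lost_first_pass = percent(original - after_first_pass, original)
retained_final = percent(after_cutoff, original)
print(f"lost in first pass: {lost_first_pass}%")
print(f"retained after cut-off: {retained_final}%")
```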

Whew! Now we have a good clean dataset. I strongly considered eliminating the data cleaning details in this blog post. It’s pretty lengthy and doesn’t contain all the fun stuff. Ultimately, it would have been disingenuous. Data cleaning is a huge and important part of the process. Now on to the fun stuff. To infinity and beyond!

Beer Style Analysis:

Untappd Beer Types

The first thing I wanted to know was whether my husband has a preferred type of beer. I looked at how many different types of beer appear in this Untappd dataset and discovered that Colin has logged 140 different types of beer. Wait, there are 140 different types of beer?! (In all seriousness though, as an avid craft beer fan, I totally believe this.) There is actually no universally agreed-upon beer style classification guide. I’m not sure how Untappd decides what beer styles to include, but as of August 2018, Untappd categorized its beers into 189 different styles. That’s a lot! So, I looked at the distribution of check-ins for the different Untappd styles. Not surprisingly, most of the styles were sampled only a handful of times.

I made the assumption that more meaningful data could be extracted from larger populations and took a closer look at beer types with more than 30 check-ins, leaving me with 11 beer styles. The violin plot shows the distribution of user ratings for the 11 beer styles. If you aren’t familiar with violin plots, they are pretty cool. You can think of them as sideways and mirrored histograms. The wider portions indicate that more entries exist at that location. For example, in the plot below, the global distribution is widest at 4.25, indicating that’s the rating most beers received.

There is loads of information contained in this plot, but here are a couple of high-level takeaways:

  • American IPAs have the lowest median at 4.0.
  • American Imperial/Double Stouts, Imperial Milk/Sweet Stouts, American Wild Ales, Imperial/Double Stouts and Russian Imperial Stouts tie for the highest median at 4.5. The five-way tie is easy to do if your rating system is discretized to 0.25 increments. Notice a trend yet? I’m thinking Colin really likes stouts!
  • Saison/Farmhouse Ales have the largest range. The long narrow tail suggests there were one or two negative-experience outliers.
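The "more than 30 check-ins" filter and the per-style medians behind these takeaways can be sketched with the standard library alone. The function name and the sample data are made up for illustration; only the threshold of 30 comes from the text.

```python
from collections import defaultdict
from statistics import median

MIN_CHECKINS = 30  # threshold from the post: styles with more than 30 check-ins


def medians_for_common_styles(checkins, min_checkins=MIN_CHECKINS):
    """Group ratings by style and return {style: median rating} for every
    style that has more than `min_checkins` ratings."""
    by_style = defaultdict(list)
    for style, rating in checkins:
        by_style[style].append(rating)
    return {
        style: median(ratings)
        for style, ratings in by_style.items()
        if len(ratings) > min_checkins
    }


# Made-up (style, rating) pairs, not Colin's actual check-ins.
checkins = (
    [("IPA - American", 4.0)] * 20
    + [("IPA - American", 3.75)] * 15
    + [("Stout - Russian Imperial", 4.5)] * 31
    + [("Shandy / Radler", 3.5)] * 3   # too few check-ins, filtered out
)

style_medians = medians_for_common_styles(checkins)
print(style_medians)
```

Because ratings sit on a 0.25 grid, medians computed this way land on only a few distinct values, which is why ties between styles are so easy to get.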

So, is there a statistically significant relationship between Untappd beer types and Colin’s ratings? At first, I wanted to conduct an analysis of variance (ANOVA) to test whether the sample means differ. Before that, I checked one of ANOVA’s assumptions: that every population being compared is normally distributed. I used the Shapiro-Wilk test, whose null hypothesis is that a group’s ratings are normally distributed. Unfortunately, at a 95% confidence level, the null hypothesis is rejected for every group, so none of them can be treated as normally distributed. I’m not really surprised, and here is why: beer is yummy, and we purposely try to drink only the yummiest beer, which skews the ratings high and away from a normal distribution. (Side note: when data is not normally distributed, it is often more appropriate to describe it by the median rather than the mean. That is why you see the median discussed so often in this work.) Consequently, we have to rely on non-parametric testing. The Kruskal-Wallis test tests the null hypothesis that the medians of all groups are equal. In this case, the null hypothesis is rejected: the medians of the categories do indeed differ from each other with statistical significance. Woot!
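For readers who want to try the same two tests, here is a rough sketch using SciPy's `shapiro` and `kruskal` functions. The ratings are synthetic numbers I generated on a 0.25 grid, not Colin's data, and the group names, centers and sample sizes are assumptions chosen only to make the example run.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # corresponds to the 95% confidence level used in the post

# Synthetic ratings snapped to 0.25 steps and capped at 5; purely illustrative.
rng = np.random.default_rng(0)


def fake_ratings(center, size):
    raw = rng.normal(center, 0.3, size)
    return np.clip(np.round(raw * 4) / 4, 0, 5)


groups = {
    "IPA - American": fake_ratings(4.0, 60),
    "Stout - Russian Imperial": fake_ratings(4.5, 60),
    "Saison / Farmhouse Ale": fake_ratings(4.1, 60),
}

# Shapiro-Wilk, one group at a time: null hypothesis is "this group is normal".
for style, ratings in groups.items():
    statistic, p_value = stats.shapiro(ratings)
    print(f"{style}: Shapiro-Wilk p = {p_value:.4f}, "
          f"reject normality: {p_value < ALPHA}")

# Kruskal-Wallis across all groups at once (rank-based, no normality needed).
h_statistic, kw_p_value = stats.kruskal(*groups.values())
print(f"Kruskal-Wallis H = {h_statistic:.2f}, p = {kw_p_value:.3g}, "
      f"groups differ: {kw_p_value < ALPHA}")
```

Swapping the synthetic arrays for the real per-style rating lists is the only change needed to repeat the check on actual data.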

Aggregating Untappd Categories

How different are these styles though? It might be difficult to pull enough meaningful trends out of that many categories. I looked for an appropriate way to merge some of the styles and noticed several instances where there appeared to be a category and subcategory separated by a hyphen. For example, in the plot here, there are four different types of stouts: Stout – American Imperial/Double, Stout – Russian Imperial, Stout – Imperial Milk/Sweet, and Stout – Imperial/Double. It turns out this is a pretty common occurrence. In this dataset alone, there are 2 blonde ales, 4 bocks, 4 brown ales, 5 ciders, 14 IPAs, 10 lagers, 5 lambics, 3 pale ales, 3 pilsners, 5 porters, 4 red ales, 7 sours, 14 stouts and 3 strong ales. So, I tried to aggregate the Untappd beer styles into categories based on labels preceding the hyphen.
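Grouping by the label before the hyphen is a one-line string split. A minimal sketch follows, assuming style names look like the ones quoted above; the separator pattern accepts either a plain hyphen or an en dash with spaces around it, since I am not sure which one the raw export uses.

```python
import re
from collections import Counter

# Split on the first hyphen or en dash that has whitespace on both sides.
# Hyphens inside a word (no surrounding spaces) are left alone.
SEPARATOR = re.compile(r"\s+[-–]\s+")


def parent_category(style):
    """Return the part of an Untappd style name that precedes the separator,
    or the whole name when there is no separator."""
    return SEPARATOR.split(style, maxsplit=1)[0].strip()


# Style names quoted in the post, plus one without a subcategory.
styles = [
    "Stout - American Imperial/Double",
    "Stout - Russian Imperial",
    "Stout – Imperial Milk/Sweet",
    "Stout - Imperial/Double",
    "IPA - American",
    "Shandy / Radler",
]

category_counts = Counter(parent_category(style) for style in styles)
print(category_counts)
```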

Before looking at the full picture, I checked to see if this method seemed to make sense. Can we assume that these categories are viewed similarly enough in Colin-land to group them together? Stouts and IPAs are the groups with the largest number of ratings. The plots above show the distribution of ratings for these two groups. While the stout violin plot shows overwhelming agreement, the IPA plot is a little less clear. The only way to be truly certain is to check the statistical significance of the relationship. Once again, the ratings were not normally distributed. For stouts, the Kruskal-Wallis test could not reject the null hypothesis. (Normal People Language: there is no evidence that the medians of the groups differ, so it is reasonable to assume the types of stouts can be grouped together.) There are only two different IPAs large enough to reasonably test, so the Mann-Whitney U test is employed. However, the null hypothesis was rejected, meaning that the Imperial/Double IPA ratings are stochastically greater than the American IPA ratings. So, if you declared that the user ratings had to show uniformity to consider this categorization method valid, you would likely reject the method.
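The two-group IPA comparison can be sketched with SciPy's `mannwhitneyu`. The ratings below are invented for illustration, and the one-sided `alternative="greater"` setting is my assumption about how "stochastically greater" was tested; a two-sided test would be the more conservative choice.

```python
from scipy import stats

ALPHA = 0.05

# Invented ratings for the two IPA sub-styles; not Colin's actual check-ins.
american_ipa = [3.75, 4.0, 4.0, 3.5, 4.0, 4.25, 3.75, 4.0, 3.75, 4.0]
imperial_ipa = [4.25, 4.5, 4.25, 4.0, 4.5, 4.25, 4.5, 4.25, 4.0, 4.5]

# One-sided test: is the first sample stochastically greater than the second?
result = stats.mannwhitneyu(imperial_ipa, american_ipa, alternative="greater")

print(f"U = {result.statistic}, p = {result.pvalue:.4f}")
print("Imperial/Double rated higher:", result.pvalue < ALPHA)
```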

However, if you were a tiny bit stubborn like me (ok, very stubborn) and believed that they were aptly named based on brewing processes and continued on, you might get a new beer style category distribution like the one pictured above. Just as with the original Untappd style categories, the Kruskal-Wallis analysis shows that the medians of the categories do indeed differ from each other with statistical significance. A couple of other interesting notes about the dataset as grouped in this fashion:

  • IPAs and pale ales are consistently rated lower than the other frequently sampled categories with a median of 4.0.
  • Even with all the stouts grouped together, they still have the highest median, tied with American wild ales at 4.5.
  • Barleywines (which are now a large enough category to be observed) have a median similar to the other styles but have a much smaller distribution range (i.e., they are more consistently liked).

Ale vs. Lager

I wanted to take one more look at the beer types from a different angle. While there appears to be very little in the way of a standard beer categorization scheme, one overarching categorization seems to hold true: ale vs. lager. The difference between ales and lagers is how and where the yeast ferments during the brewing process. In ales, yeast ferments at warmer temperatures and near the surface. The yeast in lagers, on the other hand, ferments at lower temperatures and tends to collect at the bottom. So, for the final categorization of beer types, I decided to look at the differences in this dataset between lagers and ales.

Unfortunately, the Untappd dataset wasn’t sorted into ales and lagers, so I had to do it manually. I used the Brewers Association Beer Style Guidelines to categorize as many of the beer styles in the dataset as possible. Here is some fun trivia I learned along the way:

  • A shandy is a beer mixed with a clear carbonated lemon-lime drink.
  • Lichtenhainer is a type of German beer that has references dating back to at least the mid-19th century. It appears to have gone extinct in 1983, but was revived in 1997. Brewing has a long history, and some craft breweries have made interesting efforts to reproduce ancient beers. For example, Dogfish Head reverse engineered a 2,700-year-old beer from molecular evidence found in a Turkish tomb and another beer from an ingredient list discovered in a 9,000-year-old tomb in China.
  • Bière de Champagne is a Belgian ale that undergoes a process similar to that of making champagne. In some cases it is even cave-aged in the Champagne region of France!

The results of the ale vs. lager sort are as follows: Colin drank 1,270 ales and 44 lagers. There were 165 check-ins labeled as unknown because they are either not recognized as a beer style by the Brewers Association (e.g., cider and shandy) or the Untappd style did not exclusively fall into either the ale or lager category (e.g., pumpkin/yam beer). At any rate, that’s a lot more ales than lagers! Lagers consisted of lagers (of course), pilsners and bocks. Ales encompassed pretty much everything else. Furthermore, while ales had a mean rating of 4.22, lagers had a mean rating of 3.56. Since the global mean rating was 4.18, it’s safe to say that lagers are rated considerably lower here. The Mann-Whitney U test confirms that the difference is statistically significant.
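Here is a small sketch of what the manual sort might look like in code, followed by the lager gap expressed in global standard deviations. The lookup table is hypothetical and deliberately incomplete: the real sort followed the Brewers Association guidelines style by style. The means and standard deviation are the figures quoted in this post.

```python
# Hypothetical, partial mapping from style category to fermentation family.
FAMILY_BY_CATEGORY = {
    "Lager": "lager",
    "Pilsner": "lager",
    "Bock": "lager",
    "IPA": "ale",
    "Stout": "ale",
    "Pale Ale": "ale",
    "Porter": "ale",
}


def fermentation_family(style):
    """Classify an Untappd style as 'ale', 'lager' or 'unknown' using the
    category that precedes ' - ' in the style name."""
    category = style.split(" - ")[0]
    return FAMILY_BY_CATEGORY.get(category, "unknown")


for style in ["Pilsner - Czech", "Stout - Russian Imperial", "Cider - Dry"]:
    print(style, "->", fermentation_family(style))

# Figures quoted in the post (ratings after the April 2015 filter).
global_mean, global_sd = 4.18, 0.38
lager_mean = 3.56

# How far below the global mean the lager mean sits, in standard deviations.
gap_in_sd = (global_mean - lager_mean) / global_sd
print(f"lager mean is {gap_in_sd:.2f} standard deviations below the global mean")
```

Anything missing from the lookup falls through to "unknown", which mirrors how ciders, shandies and mixed-fermentation styles were handled above.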


I wanted to use this analysis to observe beer style trends in ratings and make recommendations accordingly. Here are the conclusions about this Untappd user from this work:

  • Beer style has a statistically significant impact on user ratings.
  • Stouts and IPAs are the most frequently consumed beers. Stouts are consistently rated higher than the global average; IPAs are consistently rated lower.
  • I can tell you that my husband–statistically–does not prefer blondes! Haha.
  • Lagers were the least liked beer style with a mean rating more than 1.5 standard deviations lower than the global mean. I do not recommend this style for this user!

Feel free to take a look at the data/code for yourself on my GitHub page. Cheers!