Step-by-step walkthroughs of the Kaggle Titanic competition


One of the “Getting Started” competitions on Kaggle, the site that hosts predictive modelling and analytics competitions, is the Titanic: Machine Learning from Disaster competition.

Recently there have been two in-depth blog posts on how to analyse the data sets and build predictive models for this competition with R and RStudio.

I highly recommend working through these step-by-step tutorials in RStudio as they will help you understand each author’s analytic process and choice of R code and libraries.

The best place to start is Titanic: Getting Started With R by Trevor Stephens, which has an introduction followed by 5 parts that work through:

  • An introduction
  • Booting up R
  • Building & submitting a gender-class model
  • Decision trees
  • Feature engineering
  • Random forests & conditional inference trees

The other excellent tutorial is Handicapping Passengers on the Unsinkable Ship by Curt Wehrley who uses the following approaches:

  • Logistic regression
  • Random forest
  • Support Vector Machines (SVM)

My big data & analytics influencers on Twitter



Everyone with a Twitter account is now an expert & thought leader. I know I follow hundreds of big data & analytics Twitter accounts & there’s lots of repetition and links to the same articles.

I’ve started keeping a Twitter list of the small number of accounts that I think post original content and points of view.

My current list includes:

@FiveThirtyEight – FiveThirtyEight
The home of Nate Silver’s FiveThirtyEight on Twitter. Politics, Economics, Science, Life, Sports. New York, NY

@TeachTheMachine – Jason Brownlee
Making Programmers Awesome at Machine Learning. Melbourne, Australia

@ResearchMark – Research Wahlberg
Research. Stats. Sup.

@trevs – Trevor Stephens
Living, Learning & Loving Analysis: 2014 MS Analytics Graduate. San Francisco, CA

@jp_isson – JP Isson
Author, Business Analytics Executive and Keynotes speaker. Global Vice President business intelligence and predictive analytics at Monster Worldwide

@KirkDBorne – Kirk Borne
PhD DataScientist Astrophysicist, Top #BigData Influencer. Passions: #DataScience, #DataMining, Astroinformatics, #CitizenScience. George Mason University

@IAPA_org_au – IAPA
The Institute of Analytics Professionals of Australia (IAPA) is the professional organisation for the analytics industry in Australia. Sydney, AU

@AndrewYNg – Andrew Ng
Chief Scientist of Baidu; Chairman and Co-Founder of Coursera; Stanford CS faculty. #machinelearning, #deeplearning #MOOCs, #edtech. Mountain View, CA

@Doug_Laney – Doug Laney
Gartner VP Research, Analytics, Info Innovation & Big Data | Originator, discipline of Infonomics | Competitive tennis & non-competitive golf. Chicago

@hadleywickham – Hadley Wickham
R, data, visualisation. Houston, TX

@brianwilt – Brian Wilt
Making data human @Jawbone. @VisionVBC coach. @Stanford and @MIT physics. San Francisco, CA

@mjcavaretta – Michael Cavaretta
Data Science Leader – Ford Motor Co. #BigData, #DataScience, #DataViz, #IoT. Opinions are my own. Top Big Data Influencer and Speaker. Michigan

@lauramclay – Laura McLay
Professor of operations research in ISyE dept at University of Wisconsin-Madison, Punk Rock OR blogger, mother of 3, runner, bracketologist, aspiring Jedi. Madison, Wisconsin

@avinash – Avinash Kaushik
Author, Web Analytics 2.0 & Web Analytics: An Hour A Day | Digital Marketing Evangelist, Google | Co-Founder, Market Motive

@flowingdata – Nathan Yau
Statistician, background in eating and beer

@hmason – Hilary Mason
Founder at @FastForwardLabs. Data Scientist in Residence at @accel. I ♥ data and cheeseburgers. NYC

That’s the end of my small list – who are your big data & analytics influencers?

39 Great Data Mining Tutorials from Andrew W. Moore

Photo by infocux Technologies

I was refreshing my memory on statistical decision trees when I came across the Statistical Data Mining Tutorials by Andrew W. Moore from Carnegie Mellon University.

This is a series of 39 PDF slide sets that go through basic concepts on topics including decision trees, neural networks, Bayesian networks & support vector machines. I highly recommend these slides as you’ll learn something even if you are only refreshing your existing knowledge.

I haven’t been through every slide set, so I’m interested in others’ thoughts.


Has the NFL Combine’s 40 yard dash gotten faster?

Last week on one of my favourite podcasts, ESPN’s Football Today, Matt Williamson & Kevin Weidl discussed the standout prospects from the NFL Combine. A lot of the conversation was around how the 40 yard dash times have improved year on year due to better training technique and specific training for the combine activities.

I wanted to see for myself and found Combine results for all participants going back to 1999, including last week’s 2013 results. This data set has key data for all 4,283 participants during this period and is a gold mine for analysis. The data needed a bit of cleaning up to get it into a data frame, but if you’d like a copy then leave a comment or message me via Twitter (@minimalrblog) – I haven’t spent the time to work out how to use GitHub to share data sets.

I compared the 40 yard dash times of 1999 and 2013 and initially didn’t see real improvements, as the 5 best times were:

Name              College             Position  Draft Year  40 Yard Time
Rondel Menendez   Eastern Kentucky    WR        1999        4.24
Marquise Goodwin  Texas               WR        2013        4.27
Champ Bailey      Georgia             CB        1999        4.28
Jay Hinton        Morgan State (MD)   RB        1999        4.29
Karsten Bailey    Auburn              WR        1999        4.33

The Combine class of 1999 had 6 of the best 10 times. However, looking at the quartiles and plotting the 2 distributions showed a real improvement over the 14 years: while the fastest runners didn’t get faster, the rest of the field did benefit from improved training and technique.

Draft Year  Fastest Time  1st Quartile  Median  3rd Quartile  Slowest Time
1999        4.24          4.61          4.78    5.09          5.84
2013        4.27          4.55          4.71    4.99          5.65
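
For anyone wanting to build a quartile table like the one above, here is a minimal sketch. The real Combine data isn't reproduced in this post, so the data frame below is simulated; only the column names Year and X40Yard. are taken from the plotting code, and everything else (the simulated times, the variable names) is an assumption for illustration.

```r
# Hypothetical stand-in for the Combine data: these times are simulated,
# they are NOT the real 1999 and 2013 results.
set.seed(1)
CombineData <- data.frame(
  Year     = rep(c(1999, 2013), each = 50),
  X40Yard. = round(c(rnorm(50, mean = 4.85, sd = 0.30),
                     rnorm(50, mean = 4.78, sd = 0.28)), 2)
)

# Split the 40 yard times by year, then take the minimum, the three
# quartiles and the maximum of each group
times_by_year  <- split(CombineData$X40Yard., CombineData$Year)
quartile_table <- t(sapply(times_by_year, quantile,
                           probs = c(0, 0.25, 0.5, 0.75, 1)))

# One row per draft year, one column per summary statistic
colnames(quartile_table) <- c("Fastest", "Q1", "Median", "Q3", "Slowest")
print(quartile_table)
```

Swapping in the real data frame should give one row per draft year in the same layout as the table above.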


The overlapping distribution was generated using the ggplot2 library.

library(ggplot2)
CombineData19992013 <- CombineData[CombineData$Year %in% c(1999, 2013), ]
# factor() makes sure each year gets its own fill colour even if Year is stored as a number
ggplot(CombineData19992013, aes(X40Yard., fill = factor(Year))) + geom_density(alpha = 0.2)

More visualisation of 2012 NFL Quarterback performance with R

In last week’s post I used R heatmaps to visualise the performance of NFL Quarterbacks in 2012. This was done in a 2-step process:

  1. Clustering QB performance based on the 12 performance metrics using hierarchical clustering
  2. Plotting the performance clusters using R’s pheatmap library

An output of step 1 is the cluster dendrogram, which represents the clusters and how far apart they are. Reading the dendrogram from the top, it first splits the 33 QBs into 2 clusters. Moving down, it then splits into 4 clusters and so on. This is useful as you can move down the diagram, stop when you have the number of clusters you want to analyse or show, and easily read off the members of each cluster.
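
Reading cluster members off the dendrogram can also be done in code. The sketch below is only an illustration: the matrix is made up (8 hypothetical players with 3 hypothetical metrics) because the real QB data isn't reproduced here, and it reuses the names QBscaled and hc.rows purely for continuity with the other snippets.

```r
# Hypothetical stand-in for the scaled QB metrics (random numbers, not real stats)
set.seed(42)
QBscaled <- scale(matrix(rnorm(24), nrow = 8,
                         dimnames = list(paste0("QB", 1:8),
                                         c("PassYds", "RunYds", "Sacks"))))

# Hierarchical clustering on Euclidean distances, using hclust's default linkage
hc.rows <- hclust(dist(QBscaled))

# "Stop" the dendrogram at 4 clusters and list the members of each one
clusters <- cutree(hc.rows, k = 4)
members  <- split(names(clusters), clusters)
print(members)

# plot(hc.rows) would draw the dendrogram itself
```

Changing k in cutree() is the code equivalent of moving up or down the diagram.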


An alternative way to visualise clusters is to use the distance matrix and transform it into a 2 dimensional representation using R’s multidimensional scaling function cmdscale().

QBdist <- as.matrix(dist(QBscaled))
QBdist.cmds <- cmdscale(QBdist,eig=TRUE, k=2) # k is the number of dimensions
x <- QBdist.cmds$points[,1]
y <- QBdist.cmds$points[,2]
plot(x, y, main="Metric MDS", type="n")
text(x, y, labels = row.names(QBscaled), cex=.7)


This works well when the clusters are visually well defined. When they’re not, as in this case, it just raises questions about why certain data points belong to one cluster rather than another: Ben Roethlisberger and Matt Ryan in the plot above, for example. Unfortunately Mark Sanchez is still unambiguously in a special class with Brady Quinn and Matt Cassel.

Visualising 2012 NFL Quarterback performance with R heat maps

With only 24 hours remaining in the 2012 NFL season, this is a good time to review how the league's QBs performed during the regular season using performance data from KFFL and the heat mapping capabilities of R.

#load the pheatmap library
library(pheatmap)

#scale data to mean=0, sd=1 and convert to matrix
QBscaled <- as.matrix(scale(QB2012))

#create heatmap and don't reorder columns
pheatmap(QBscaled, cluster_cols=F, legend=FALSE, fontsize_row=12, fontsize_col=12, border_color=NA)


Instead of using R's default heatmap function, I've used the pheatmap function from the pheatmap library.

The analysis includes KFFL's data on Passes per Game, Passes Completed per Game, Pass Completion Rate, Pass Yards per Attempt, Pass Touchdowns per Attempt, Pass Interceptions per Attempt, Runs per Game, Run Yards per Attempt, Run Touchdowns per Attempt, 2 Point Conversions per Game, Fumbles per Game, Sacks per Game.

#cluster rows
hc.rows <- hclust(dist(QBscaled))


This cluster dendrogram shows 4 broad performance clusters of QBs who started at least half the regular season (8 games) plus Colin Kaepernick (7 games). It's important to remember this analysis does not include any playoff games. Our assessment of playoff QBs is also easily biased by the results of these games – just because Joe Flacco made Super Bowl XLVII does not mean he has consistently outperformed Tom Brady.

Cluster 1 – The top tier passers

#draw heatmap for first cluster
pheatmap(QBscaled[cutree(hc.rows,k=4)==1,], cluster_cols=F, legend=FALSE, fontsize_row=12, fontsize_col=12, border_color=NA)


Pass-first QBs with good passing stats who kept out of trouble (low interceptions, sacks & fumbles). Within the group, Brees, Peyton Manning, Brady and Ryan have the best results, with Carson Palmer a surprise inclusion.

Cluster 2 – Successful run & pass QBs

#draw heatmap for second cluster
pheatmap(QBscaled[cutree(hc.rows,k=4)==2,], cluster_cols=F, legend=FALSE, fontsize_row=12, fontsize_col=12, border_color=NA)


Strong outcomes in both the passing and running game including the 3 QBs who led in run attempts per game – Newton, RG III and Kaepernick. RG III & Kaepernick also had surprisingly few interceptions per game given their propensity to aggressively throw deep.

Cluster 3 – The Middle

#draw heatmap for third cluster
pheatmap(QBscaled[cutree(hc.rows,k=4)==3,], cluster_cols=F, legend=FALSE, fontsize_row=12, fontsize_col=12, border_color=NA)


Not great but not the worst either. This group includes Joe Flacco.

Cluster 4 – A year of fumbles, interceptions and sacks

#draw heatmap for fourth cluster
pheatmap(QBscaled[cutree(hc.rows,k=4)==4,], cluster_cols=F, legend=FALSE, fontsize_row=12, fontsize_col=12, border_color=NA)


As a NY Jets supporter this is painful.

Speed up for loops in R

Are your for loops too slow in R? Are loops that should take seconds actually taking hours?

As I found out recently, how you structure your code can make a huge difference in execution times. Fortunately, making a few small changes to your code can speed up these loops by several orders of magnitude.

This Stack Overflow post goes through a number of ways to optimise your for loops – I only implemented the first method and my loop run time went from over an hour to less than 10 seconds!

The secret? Loop over a vector rather than a data frame, as R is optimised for vector and matrix operations.
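
To make the idea concrete, here is a small sketch of the two loop styles. It is not the code from the Stack Overflow post or from my own analysis; the data frame, its columns and the doubling operation are all made up for illustration, and the timings will vary by machine and R version.

```r
# Hypothetical example data: one input column and one output column
n  <- 20000
df <- data.frame(x = runif(n), y = 0)

# Slower style: reading and writing data frame cells inside the loop
slow_time <- system.time(
  for (i in seq_len(n)) df$y[i] <- df$x[i] * 2
)

# Faster style: pull the column out as a plain vector, loop over that,
# and only touch the data frame once at the end
x <- df$x
y <- numeric(n)
fast_time <- system.time(
  for (i in seq_len(n)) y[i] <- x[i] * 2
)
df_fast <- data.frame(x = x, y = y)

# Where the operation allows it, skipping the loop altogether is simpler still
y_vectorised <- x * 2

print(rbind(slow = slow_time, fast = fast_time)[, "elapsed"])
```

All three versions produce the same numbers; only the amount of work R does per iteration changes.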

Heat maps using R

One of the great things about following blogs on R is seeing what others are doing & being able to replicate and try out things on my own data sets.


For example, some great links on rapidly creating heat maps using R.

The basic steps in the process are to (i) scale the numeric data using the scale function, (ii) create a Euclidean distance matrix using the dist function and (iii) plot the heat map using the heatmap function.
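
The three steps can be sketched as follows. This is only one way of wiring them together, not the code from the linked posts: it assumes R's built-in mtcars data set as a stand-in for your own numeric data, and it passes the distance matrix to heatmap via an explicit hclust step.

```r
# (i) scale each numeric column to mean 0, sd 1
# mtcars is a built-in example data set, used here only as a placeholder
scaled <- scale(as.matrix(mtcars))

# (ii) Euclidean distance matrix between rows (dist's default method)
d <- dist(scaled)

# Cluster the rows on those distances so the heat map can order them
hc <- hclust(d)

# (iii) plot the heat map: rows ordered by the dendrogram, columns left in
# their original order, and no further scaling since step (i) already did it
heatmap(scaled, Rowv = as.dendrogram(hc), Colv = NA, scale = "none")
```

If you skip the explicit dist and hclust calls, heatmap computes its own row and column clustering by default.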


tolower() – error catching unmappable characters

The tolower() function returns an error where it can’t map to the Unicode character set of the input data – a common occurrence when analysing social media data with emoticons.

Emoticons are those symbols that are commonly used on mobile phones but aren’t always recognised on all platforms.

For example, when converting tweets sent to @delta (Delta Air Lines) to lower case, I got the following error:

Error in tolower(text) :
invalid input '@ActualALove: First time I've seen a foot-rest in first class! Oh @Delta, how I love thee \ud83d\ude0a✈\ud83d\udc78' in 'utf8towcs'

When I looked up the actual tweet, it looked like this.


The two Unicode characters that weren’t recognised were \ud83d\ude0a (SMILING FACE WITH SMILING EYES) and \ud83d\udc78 (PRINCESS).

Gaston Sanchez has posted a solution to this problem in his blog Data Analysis Visually Enforced. I’ve used the code and it works well. When I have time, I’ll extend it to replace the offending characters instead of returning NA for the entire string.
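
As a rough sketch of both ideas (this is my own illustration, not Gaston Sanchez's code): the first helper wraps tolower() in tryCatch() and returns NA on error, and the second strips characters that can't be converted to ASCII before lower-casing, so the rest of the string survives. The function names safe_tolower and strip_then_tolower and the sample tweets are hypothetical, and the exact behaviour of tolower() and iconv() on emoji can differ between platforms and R versions.

```r
# Hypothetical helper: lower-case a string, returning NA instead of stopping
# with an error when tolower() cannot handle the input
safe_tolower <- function(x) {
  tryCatch(tolower(x), error = function(e) NA_character_)
}

# Hypothetical extension: drop anything that cannot be represented in ASCII
# (emoji included) and then lower-case what is left
strip_then_tolower <- function(x) {
  cleaned <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
  tolower(cleaned)
}

# Made-up sample input; \U0001F60A is SMILING FACE WITH SMILING EYES
tweets <- c("Oh @Delta, how I love thee \U0001F60A", "First Class FOOT-REST")

lowered  <- vapply(tweets, safe_tolower, character(1), USE.NAMES = FALSE)
stripped <- strip_then_tolower(tweets)
print(stripped)
```

The trade-off with the stripping approach is that it also removes legitimate non-ASCII text such as accented letters, so it suits English-language tweets better than multilingual data.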

100 most read R posts in 2012

R-bloggers is the source for R news and tutorials. Posts are aggregated from 425 R bloggers with daily updates.

I use it when I'm looking for help on a particular subject and also to see what cool things people are doing in the R community.

A great way to start is to check out the recent post 100 most read R posts in 2012 (stats from R-bloggers) – big data, visualization, data manipulation, and other languages.