The 2015 Rugby World Cup finished last weekend, but Ireland’s contribution came to an end two weeks earlier at the quarter-final stage with a loss to Argentina. A number of players were injured in the previous game against France, and I built a plot showing the difference between the team that started against France and the team that was on the field against Argentina at the 15-minute mark after another injury.
The change amounted to a big loss in terms of match experience and scoring power. Commentators in Ireland began talking about the need for lighter, faster players like New Zealand has, and I wanted to check whether this was indeed the case. I built a scatter plot showing each of the four semi-finalist squads plotted by height and weight, and another showing player positions by height and weight.
Seeing how cleanly separated some positions were, such as prop and lock, I wondered if it would be possible to build a model to classify and identify player positions based on the information available on the Rugby World Cup website. I created an IPython Notebook to pull down the information needed for the more prominent rugby playing nations and save it as .csv files. The notebooks and files can be found on my GitHub account. Notebook 5 deals with building the classifier, at the end of which I use the model to predict the position of two retired Irish rugby players.
With the very small number of observations and the little data available, I was happy with the accuracy score of 62%, considering there are 9 classes to predict. Height and weight, as the above plot shows, are very important in predicting what position a player plays in. By dividing points and tries by the number of games played I created Average Points per Game and Average Tries per Game variables to see if these could be used to pull out high scoring players. Considering that this was all the data that I had, it was going to be hard to pull players out of the group on the bottom left of the plot. Centre is a very hard position to predict due to how spread out the points are in the above graph, and centres are often mispredicted as Wings. Full Back is another difficult position to predict as there are so few of them.
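As a rough illustration of the per-game variables described above, here is a minimal sketch in plain Python. The field names (`games`, `points`, `tries`) and the sample figures are made up for the example and are not taken from the notebooks; the guard for players with zero games is also my own assumption.

```python
def per_game_averages(player):
    """Return a copy of the player record with per-game averages added."""
    games = player["games"]
    # Assumed guard: a player with no games gets 0.0 rather than a division error.
    avg_points = player["points"] / games if games else 0.0
    avg_tries = player["tries"] / games if games else 0.0
    return {
        **player,
        "avg_points_per_game": avg_points,
        "avg_tries_per_game": avg_tries,
    }


# Hypothetical records for illustration only, not real player statistics.
players = [
    {"name": "Player A", "games": 50, "points": 600, "tries": 10},
    {"name": "Player B", "games": 0, "points": 0, "tries": 0},
]

enriched = [per_game_averages(p) for p in players]
for p in enriched:
    print(p["name"], p["avg_points_per_game"], p["avg_tries_per_game"])
```

The same calculation would be a one-line column division in a data frame; the plain version just makes the zero-games case explicit.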
There are a number of things I didn’t try, such as normalizing and scaling the values before creating a combined data frame. Instead, I created a combined data frame of all the players and then did the normalizing and scaling. Ireland’s highest scoring player is Johnny Sexton with around 600 pts. New Zealand’s is Dan Carter with 1,500. They both play Fly-Half, but when scaled together Carter and Sexton end up with greatly different values. By normalizing and scaling each country first and then combining them, each country’s highest scoring player would end up with a value near 1.0. This would allow the use of data from all of the nations who took part in the World Cup. The addition of every nation tended to bring the accuracy score down; in some countries it is still an amateur, part-time sport, and even the high scoring players there would tend to score significantly less than players of the bigger nations.
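To make the per-country idea concrete, here is a small sketch that scales points within each country before the squads are combined. It assumes simple max-scaling (dividing by the country's top score), which is only one possible choice and not necessarily what the notebooks would use. The two Fly-Half totals are the approximate figures quoted above; the other two records and all field names are hypothetical.

```python
from collections import defaultdict


def scale_within_country(players, field="points"):
    """Scale `field` by the highest value found in each player's own country."""
    # First pass: find the top value of `field` per country.
    max_by_country = defaultdict(float)
    for p in players:
        country = p["country"]
        max_by_country[country] = max(max_by_country[country], p[field])

    # Second pass: divide each player's value by their country's top value.
    scaled = []
    for p in players:
        top = max_by_country[p["country"]]
        value = p[field] / top if top else 0.0
        scaled.append({**p, field + "_scaled": value})
    return scaled


players = [
    {"name": "Johnny Sexton", "country": "Ireland", "points": 600},
    {"name": "Dan Carter", "country": "New Zealand", "points": 1500},
    # Hypothetical squad-mates, included only to show the scaling.
    {"name": "Player X", "country": "Ireland", "points": 150},
    {"name": "Player Y", "country": "New Zealand", "points": 300},
]

scaled = scale_within_country(players)
by_name = {p["name"]: p["points_scaled"] for p in scaled}
print(by_name)
```

With this approach both Fly-Halves end up at 1.0, whereas scaling the combined data would put Sexton at roughly 600 / 1500 = 0.4.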
I created a mass variable by multiplying a player’s height in metres by their weight. In the final model I left this out in favour of using height and weight (it would have been unwise to use all three), but in future I will try a variable of KG per Metre by dividing a player’s weight by their height. Mass can’t differentiate between a tall, light player and a short, heavy player (at certain ranges and figures), whereas KGperM might. It’s worth a try!
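A quick worked example shows why the two variables behave differently. The two players below are invented so that their height times weight products match; the function and field names are my own and not from the notebooks.

```python
def mass(height_m, weight_kg):
    """The 'mass' variable from the post: height in metres times weight in kg."""
    return height_m * weight_kg


def kg_per_metre(height_m, weight_kg):
    """The proposed alternative: weight in kg divided by height in metres."""
    return weight_kg / height_m


# Hypothetical players chosen so that height * weight is the same for both.
tall_light = {"height_m": 2.00, "weight_kg": 90.0}
short_heavy = {"height_m": 1.80, "weight_kg": 100.0}

for label, p in (("tall, light", tall_light), ("short, heavy", short_heavy)):
    print(label, round(mass(**p), 1), round(kg_per_metre(**p), 1))
```

Both players come out at a mass of 180, but their KG per Metre values are 45 and about 55.6, so the second variable separates them where the first cannot.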
Obviously this data can only be used after the fact. As we need average points and tries scored, we can’t use this on teenagers or players just starting out to see where they would be best placed, though we can use height and weight as an indication: shorter, lighter players becoming a Scrum-Half or a Fly-Half, taller, heavier players becoming props, etc. External data sets such as weight lifting personal bests or times for sprinting short distances could be combined with the height and weight data above to predict the potential position of a player starting out.
I’ve added a plots section and have included the above plots there as well as a box plot of the semi-finalists.