Blog

I recently gave a talk at my university’s R User group on how to publish packages to CRAN (slides here). This isn’t an easy topic to distill into a 60-minute slot, so I had to abandon my original idea of a hands-on workshop with examples in favour of a condensed summary of the main challenges in the submission process. The talk mostly focused on namespaces, a rather complex topic if you’re coming from a non-software engineering background, since it doesn’t come up in day-to-day statistical analysis.

CONTINUE READING

I’ve always been curious to know whether any of the four major European leagues (Serie A, Bundesliga, Premiership, La Liga) are more predictable than the others. La Liga certainly has a reputation for being dull and predictable, although this is due to the sheer dominance of Barcelona and Real Madrid in recent years. This summer I’ve expanded my database of football matches in order to improve my football prediction bot, and so I now have sufficient data to investigate.

CONTINUE READING

At ECSG (Epidemiology and Cancer Statistics Group), we primarily work with myeloid and lymphoid disease registries. Through our successful collaborative research project, HMRN (Haematological Malignancy Research Network), we have access to a large observational dataset of haematological malignancies across Yorkshire. From this we can estimate various measures of interest, such as the effect of standard demographic factors (mainly age and sex) on incidence rates and any longitudinal incidence trends, as well as numerous statistics related to survival, for example identifying any clinical or demographic factors associated with a high level of risk.

CONTINUE READING

This post summarises Predictaball’s performance in the 2015-2016 season. I’ll look at overall performance, accuracy per week, how it fared in terms of making a profit, and finally the annual comparison with Lawro. Compared to last year, when it achieved 48% overall, Predictaball has fared less well this season with 43%. This isn’t particularly surprising, since this season has been full of upsets to say the least, with Leicester beating out the traditional top four for the title, and Spurs doing their best to break the monopoly (despite failing in typical Spurs fashion).

CONTINUE READING

I’ve been wanting to play with Markov Chains for a while now, and now that I’m starting to get into Bayesian analysis I’m going to need to use them more often. One fun use of them is to generate text which can (at a stretch) pass as written by a drunk person. For nice examples of them in action, have a look at Garkov (generated Garfield strips) or even an entire subreddit generated with them.
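
As a rough illustration of the idea (a minimal sketch in Python, not the code from the full post), a word-level Markov chain text generator can be built by recording which words follow each word in a source text and then walking that table at random. The function names `build_chain` and `generate`, the `order` parameter and the toy sentence are all assumptions made for this example.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each run of `order` consecutive words to the words seen after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return dict(chain)


def generate(chain, length=20, seed=None):
    """Random walk over the chain, producing at most `length` words."""
    rng = random.Random(seed)
    # Start from a randomly chosen state (sorted so a seed is reproducible).
    state = rng.choice(sorted(chain))
    output = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:
            # Dead end: this state was never followed by anything in the source.
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


# Toy usage (hypothetical input text).
chain = build_chain("the cat sat on the mat and the dog sat on the cat")
print(generate(chain, length=12, seed=42))
```

Repeated followers are kept in the lists on purpose, so a word that often follows another is proportionally more likely to be picked. Raising `order` makes the output more coherent but closer to the source text.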

CONTINUE READING

While writing the previous post on the two ‘cultures’ of statistical modeling for prediction and inference, I realised that I was glossing over an extremely important area of predictive modeling, and, judging by frequent StackExchange posts, one that is often misunderstood. As you will have surmised from the title, I’m talking about cross-validation. If done correctly, cross-validation (CV) will provide a thorough assessment of a predictive model, providing you with: unbiased, publishable results; a means of selecting the final model instance to use for your application; and an accurate estimate of the model’s performance on future data.
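
To make the basic mechanics concrete, here is a minimal sketch of k-fold cross-validation in plain Python. It is an illustration under my own assumptions rather than the procedure from the full post: the helper names (`k_fold_indices`, `cross_validate`, `fit_majority`, `predict_majority`), the use of accuracy as the score, and the toy data are all hypothetical. It only covers the performance estimate; selecting a model and estimating its performance at the same time generally needs an additional, separate layer of validation.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once.

    Assumes n_samples >= k so that no fold is empty.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def cross_validate(X, y, fit, predict, k=5, seed=0):
    """Return the mean accuracy over k held-out folds."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        # Fit only on the training folds; the test fold stays unseen.
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = [predict(model, X[i]) for i in test_idx]
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


# Toy baseline "model" that always predicts the most common training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    return model


X = list(range(10))
y = [0] * 7 + [1] * 3
print(cross_validate(X, y, fit_majority, predict_majority, k=5))
```

The key point is that any step which learns from the data (including preprocessing or feature selection) belongs inside `fit`, so that it never sees the held-out fold.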

CONTINUE READING

“Two Cultures”

One aspect of statistical modeling which can be taken for granted by those with a bit of experience, but may not be immediately obvious to newcomers, is the difference between modeling for explanation and modeling for prediction. When you’re a newbie to modeling you may think that this only affects how you interpret your results and what conclusions you’re aiming to draw, but it has a far bigger impact than that, from influencing the way you form the models, to the types of learning algorithms you use, and even how you evaluate their performance.

CONTINUE READING

I’ve tinkered around with Predictaball a bit recently in an effort to increase its accuracy, with the overall goal of beating Paul Merson and Lawro so that I can claim ‘human competitiveness’. I’ve mentioned in previous posts that I envisage two potential ways to achieve this:

1. Include more player data
2. Incorporate bookies’ odds

Adding more player data (such as a variable for each player indicating whether they are in the squad or not) would allow the model to account for situations when a player who is strongly associated with the team winning is now injured - for an example see City’s abysmal record when Kompany isn’t playing.

CONTINUE READING

It’s been a while since I’ve posted anything as I’ve spent my summer in a thesis-related haze, which I’m starting to come out of now, so expect more frequent updates - particularly as I work my way through the backlog of ideas I’ve been meaning to write about. I’ll start with assessing Predictaball’s performance last season. Just to summarise, this was a classification task attempting to predict the outcome (W/L/D) of every Premier League match from the end of September onwards.

CONTINUE READING

One thing that struck me as odd when I was studying for my teaching award was the way in which teachers must consider the varied learning styles of their students when planning lessons. Clearly this is common sense and good teaching practice: what works for one person does not necessarily work for another. The issue I have with it, however, is that the students themselves, in my experience, are never taught to consider their own learning method, despite it having a huge impact upon an individual’s performance.

CONTINUE READING