I’ve finally managed to combine my two main hobbies of data science and running into one project: a Shiny web-app that lets you explore your Strava fitness data, while also providing a local database of all your activities that you can analyse with R.
My initial motivation was that I wanted quick access to certain visualisations and metrics that Strava either doesn’t provide or makes awkward to get, so I wrote a basic app for my own use.
Deep learning has attracted considerable attention for its near-human performance on a variety of complex problems, such as image recognition, playing games, and recently conversational AI through large language models. Each of these applications requires vast volumes of data and computational resources beyond the reach of all but the richest companies. This resource-hungry nature, coupled with the huge hype that accompanies any deep-learning application, makes it challenging to gain a realistic assessment of its real-world potential for less demanding use-cases, such as scientific time-series modelling.
Just a quick update, more to test that the website infrastructure is still running than anything else, as it’s been 4 years since my last post. Last month I ran a workshop (slides here) at the University’s Research Coding Club on speeding up data analysis in R that might be useful for anyone who stumbles across this page in the future.
The Research Coding Club is an informal collective of people from across the entire University who write software to aid their research.
In the previous entry of what has evidently become a series on modelling binary mixtures with Dirichlet Processes (part 1 discussed using PyMC3 and part 2 detailed writing custom Gibbs samplers), I ended by stating that I’d like to look into writing a Gibbs sampler using the stick-breaking formulation of the Dirichlet Process, in contrast to the Chinese Restaurant Process (CRP) version I’d just implemented.
Actually coding this up was rather straightforward and took less time than I expected, but I found the differences and similarities between these two ways of expressing the same mathematical model interesting enough for a post of its own.
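For readers unfamiliar with it, the stick-breaking formulation builds the mixture weights by repeatedly breaking pieces off a unit-length stick. Here is a minimal numpy sketch of the (truncated) construction, with an illustrative concentration parameter and truncation level rather than anything taken from the original posts:

```python
import numpy as np

rng = np.random.default_rng(42)

def stick_breaking_weights(alpha, n_components):
    """Truncated stick-breaking construction of Dirichlet Process weights.

    Each Beta(1, alpha) draw breaks off a fraction of the stick that
    remains, so weight k is beta_k * prod_{j<k} (1 - beta_j).
    """
    betas = rng.beta(1.0, alpha, size=n_components)
    remaining_stick = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining_stick

weights = stick_breaking_weights(alpha=2.0, n_components=20)
print(weights.sum())  # approaches 1 as the truncation level grows
```

Smaller values of the concentration parameter alpha concentrate mass on the first few sticks, which is what lets the model favour a small number of clusters.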
Back at the start of the year (which really doesn’t seem that long ago) I was looking at using Dirichlet Processes to cluster binary data with PyMC3. I was unable to get the PyMC3 mixture model API working with the general-purpose Gibbs sampler, but after some tweaking of a custom likelihood function I got something reasonable-looking working with Variational Inference (VI). While this was still useful for exploratory analysis, I’d prefer to use MCMC sampling so that I’d have more confidence in the groupings (since VI only approximates the posterior) in case I wanted to use them to generate further research questions.
While I’ve been quite happy with the performance of my Predictaball football rating system, one thing that’s bothered me since its inception last summer is the reliance on hard-coded parameters.
Like many other football rating methods, it’s an adaptation of the Elo system designed for chess matches by Arpad Elo in the 1950s. His aim was to devise an easily implementable system for rating competitors in a two-player zero-sum game.
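For context, the classic Elo update compares each result against an expected score and nudges the rating by a hard-coded step size K, exactly the kind of parameter mentioned above. A minimal sketch of the standard chess formulation (not Predictaball’s actual football adaptation):

```python
def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_rating(rating, expected, actual, k=32):
    """Nudge the rating towards the observed result; k sets the step size."""
    return rating + k * (actual - expected)

# A 1500-rated player beats a 1600-rated opponent (win = 1, draw = 0.5, loss = 0)
exp = expected_score(1500, 1600)            # ~0.36
new_rating = update_rating(1500, exp, 1.0)  # rises by 32 * (1 - 0.36) ~ 20
```

Football adaptations typically extend this to handle draws, home advantage, and margin of victory, each of which introduces further parameters to tune.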
I came across a tweet from Piers Morgan this morning in which he suggested that the BBC is favouring women, since 43 of the 53 paper reviewers on The Andrew Marr Show in 2019 were women. Unfortunately I was a day late to this hot take; fortunately, that’s because I don’t follow Piers Morgan. However, I knew there must be more to it than a single PC-baiting statistic, and knowing I had a ~3 hour train journey coming up this evening, I thought I’d look into it a bit more.
Back in March I rewrote thepredictaball.com from its original R Shiny implementation into a static website using the Vue Javascript framework. I intended to write about it at the time, but I’d been busy and didn’t make time for it until now, which is handy given that the football season has just finished!
Excuse the clickbait title, but I genuinely couldn’t think of a better way of organising this post.
A second post in 2 days on mixture modelling? No prizes for guessing what type of analysis I’ve been preoccupied with recently!
Today’s post provides an ugly hack to fix a bug in the R flexmix package for likelihood-based mixture modelling, along with a cautionary tale about environments. In short, I’ve encountered problems when trying to predict the cluster membership of out-of-sample data with this package, and judging from a couple of posts I found online, I’m not the only one.
I’ve been spending a lot of time over the last week getting Theano working on Windows and playing with Dirichlet Processes for clustering binary data using PyMC3. While there is a great tutorial for mixtures of univariate distributions, there isn’t a lot out there for multivariate mixtures, and Bernoulli mixtures in particular.
This notebook shows an example of fitting such a model using PyMC3, highlights the importance of careful parameterisation, and demonstrates that variational inference can prove advantageous over standard sampling methods like NUTS for such problems.
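To give a flavour of the approach, here is a minimal sketch of a marginalised Bernoulli mixture likelihood in PyMC3, in the spirit of the custom likelihood described above. The toy data, number of components, and ADVI settings are all illustrative assumptions, not the notebook’s actual code:

```python
import numpy as np
import pymc3 as pm
import theano.tensor as tt

# Hypothetical toy data: 200 observations of 5 binary features
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(200, 5)).astype("float64")
K, D = 3, data.shape[1]  # assumed (truncated) number of components

with pm.Model() as model:
    w = pm.Dirichlet("w", a=np.ones(K))               # mixture weights
    theta = pm.Beta("theta", 1.0, 1.0, shape=(K, D))  # per-component Bernoulli probs

    # Marginal log-likelihood of each row under the mixture:
    # log sum_k w_k * prod_d theta_kd^x_d * (1 - theta_kd)^(1 - x_d)
    x = tt.constant(data[:, None, :])  # (N, 1, D) broadcasts against theta (K, D)
    log_comp = tt.sum(x * tt.log(theta) + (1 - x) * tt.log1p(-theta), axis=-1)
    pm.Potential("loglik", tt.sum(pm.math.logsumexp(tt.log(w) + log_comp, axis=1)))

    # ADVI, which proved more reliable than NUTS for this model;
    # pm.sample() would run NUTS instead
    approx = pm.fit(n=20000, method="advi")
    trace = approx.sample(1000)
```

Marginalising out the discrete cluster assignments like this, rather than sampling them directly, is what makes the model amenable to both gradient-based samplers and variational inference.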