The three-year PhD deadline by which I intend to submit is gradually getting closer, meaning I’ve got two and a half months not only to finish up my experiments but also to write up my entire thesis. To help motivate myself to work on this huge document (and definitely not as a form of procrastination) I’ve started recording my writing progress and am publicly displaying the data here. The idea is that I won’t want people (family, supervisors, colleagues) to notice that I’m slacking.
Today I gave a talk introducing Python to early-stage researchers in my department. It’s always hard deciding what material to include in an hour’s talk, particularly when the subject is so vast. This wasn’t helped by the fact that my department has a wide range of programming experience, from researchers with backgrounds in Computer Science to Electronic Engineers who are only comfortable with Matlab. I attempted to address both of these groups by introducing Python as a language in terms of its syntax, data structures and control flow, before discussing how you can emulate Matlab by using the SciPy stack.
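As a rough illustration of the Matlab-style workflow mentioned above, here is a minimal sketch using NumPy, one part of the SciPy stack. It is not taken from the talk itself; the variable names and values are made up for the example, and the Matlab equivalents are given in the comments.

```python
import numpy as np

# Matlab: A = [1 2; 3 4];  b = [5; 6];
A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([5.0, 6.0])

# Matlab: x = A \ b   (solve the linear system A x = b)
x = np.linalg.solve(A, b)

# Matlab: A .* A is element-wise, A * A is the matrix product.
# In NumPy the roles are swapped: * is element-wise, @ is the matrix product.
elementwise = A * A
matprod = A @ A

# Matlab: t = linspace(0, 1, 5);
t = np.linspace(0.0, 1.0, 5)

# Matlab: A(1, :)   (Python indexing starts at zero, not one)
first_row = A[0, :]

print(x, elementwise, matprod, t, first_row, sep="\n")
```

The two things that tend to trip up people coming from Matlab are the zero-based indexing and the fact that `*` is element-wise by default.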
Now that I’ve finished my teaching qualification (the York Learning and Teaching award) I’ve had some time to get back into research. I’ve been updating various bits of software that I haven’t used much over the last month or so, one of which was PyPy, which I updated to version 2.5 from 2.3, skipping a version in the process. I expected that I might get a few speed bonuses but that there wouldn’t be a significant improvement from 2.
Receiver Operating Characteristic (ROC) curves are increasingly used in machine learning, as they offer a valuable insight into how your model is performing that isn’t captured by log-loss alone, which helps with diagnosing any issues. I won’t go into much detail about what ROC actually is here, as this post is intended more to help people who are looking for a MAUC Python implementation. If, however, you are looking for an overview of ROC then I’d recommend Fawcett’s tutorial here.
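For readers who just want to see the shape of the calculation, the following is a minimal sketch and not necessarily the implementation this post goes on to describe. It assumes MAUC refers to Hand and Till's multi-class generalisation of AUC, which averages the pairwise AUC over every pair of classes. The function names `auc_pair` and `mauc`, and the input layout (a list of labels plus a list of per-class probability dictionaries), are choices made for this example.

```python
from itertools import combinations


def auc_pair(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def mauc(y_true, y_prob):
    """Mean pairwise AUC over all class pairs (assumed Hand and Till definition).

    y_true: list of class labels.
    y_prob: list of dicts mapping each class label to its predicted probability.
    """
    classes = sorted(set(y_true))
    c = len(classes)
    if c < 2:
        raise ValueError("MAUC needs at least two classes")

    total = 0.0
    for i, j in combinations(classes, 2):
        # A(i|j): how well the class-i probability separates class i from class j.
        a_ij = auc_pair(
            [p[i] for y, p in zip(y_true, y_prob) if y == i],
            [p[i] for y, p in zip(y_true, y_prob) if y == j],
        )
        # A(j|i): the same question asked using the class-j probability.
        a_ji = auc_pair(
            [p[j] for y, p in zip(y_true, y_prob) if y == j],
            [p[j] for y, p in zip(y_true, y_prob) if y == i],
        )
        total += (a_ij + a_ji) / 2.0

    return 2.0 * total / (c * (c - 1))


labels = ["a", "b", "c"]
perfect = [
    {"a": 0.8, "b": 0.1, "c": 0.1},
    {"a": 0.1, "b": 0.8, "c": 0.1},
    {"a": 0.1, "b": 0.1, "c": 0.8},
]
print(mauc(labels, perfect))
```

The nested loops in `auc_pair` are quadratic in the number of samples, which keeps the sketch readable; a rank-based calculation would be the usual choice for larger datasets.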
I haven’t blogged in a while, mainly because I’ve been so busy with teaching work. It’s fantastic experience and very rewarding, but at the same time I find myself sometimes wishing I had more time to do my research, especially now that I’m in my third year. The other day I came across a very well done hierarchical Bayesian modelling approach for football games. I’ve been thinking a lot recently about what area I want to go into for my first post-doc research, and learning standard statistical techniques (including Bayesian methods) is something I’ve been considering.
I’ve been working on another paper today and decided to update my previous xtable function (as described here) to use dplyr, as I want to fully get to grips with Hadley Wickham’s wonderful ecosystem of packages including dplyr (and its predecessor plyr), ggplot2 and tidyr (and its predecessor reshape2). I mentioned this before Christmas but have only got round to it now, which included a few hours of struggling with tidyr to make it do what I want!
I’ve recently decided to start using Sweave for producing my publications since I already use R for the data analysis side and LaTeX for the markup, so it seems natural to combine them. In a nutshell, Sweave lets you embed R output directly into your documents, allowing for a more organised workflow. You mark a section as containing R code, then run your analyses, and the output, be it text, a table, or a chart, is formatted directly into LaTeX markup.
I’ve never fully taught myself R, just dipped in and out when necessary. I’ve primarily used it for standard data analysis and visualisation, although I have been meaning to get to grips with one of the numerous available machine learning packages. Dealing with datasets tends to involve a lot of hacky manipulation until it’s in a useful format for your analysis. Initially I was just trying to use standard library functions, although once I came upon the essential reshape2 package and the ease with which you could convert your dataframe between wide and long formats I knew I was going to have to use a different approach.
In my last blog post I touched on the fact that I’m an extrinsically motivated learner: I need to be working towards a goal rather than learning for learning’s sake. Thus, at times when I’m not working towards a specific deliverable such as a report or a publication, I can find it challenging to maintain focus and motivation, and frequently end up bored. This also happens when I’ve been working on the same thing for a while (which is what is happening at the moment!
Thanks to a year-long course I’m taking in Higher Education teaching I’ve been thinking a lot recently about the education I received, not only as an undergraduate but throughout my schooling. One issue that I keep coming back to is that I believe I’ve developed disadvantageous working habits due to the use of exams as the primary form of assessment in the UK education system. I’ve always excelled in exams, whereas with coursework I lack the discipline to stay focused without the tight time constraint of an exam.