I’ve never fully taught myself R, just dipped in and out when necessary. I’ve primarily used it for standard data analysis and visualisation, although I have been meaning to get to grips with one of the numerous machine learning packages available. Dealing with datasets tends to involve a lot of hacky manipulation until the data is in a useful format for your analysis. Initially I tried to stick to standard library functions, but once I came upon the essential reshape2 package, and saw how easily it converts a dataframe between wide and long formats, I knew I was going to need a different approach. I also spent a while trying to understand all the ways of doing split-apply-combine in core R, but after getting confused by aggregate and the apply family I just wished for a single function that worked well with dataframes, as that was all I was using. I found the answer to my problems in the plyr package, which contains all kinds of useful functions for summarising datasets. Both of these packages were authored by the renowned Hadley Wickham, who has also produced the ggplot2 plotting library (which in my opinion looks better and is more user friendly than the standard R lattice package), so I’ve decided to learn his ecosystem (the so-called Hadleyverse) rather than using the core R library for basic data munging, at least for now.
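To make the wide/long conversion and the split-apply-combine idea a bit more concrete, here is a minimal sketch using reshape2 and plyr. The data frame and its column names (subject, group, t1, t2, score) are invented purely for illustration, so treat this as one possible way of using these functions rather than a recipe.

```r
library(reshape2)
library(plyr)

# Hypothetical wide-format data: one row per subject, one column per time point
wide <- data.frame(
  subject = c("a", "b", "c"),
  group   = c("x", "x", "y"),
  t1      = c(1, 2, 3),
  t2      = c(2, 4, 6)
)

# Wide -> long: one row per subject/time-point measurement
long <- melt(wide,
             id.vars       = c("subject", "group"),
             variable.name = "time",
             value.name    = "score")

# Long -> wide again, spreading the time points back into columns
wide_again <- dcast(long, subject + group ~ time, value.var = "score")

# Split-apply-combine with plyr: split by group, take the mean score of
# each piece, and combine the results into a single dataframe
group_means <- ddply(long, .(group), summarise, mean_score = mean(score))
```

The ddply call takes a dataframe in and hands a dataframe back, which is the single function for dataframes that aggregate and the apply family never quite gave me.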
Hadley released a successor to plyr, dplyr, only last year. It not only offers a simplified set of functions and promises better performance, but also introduces a piping operator into the mix. This allows for code that reads very much like ggplot2 code, and as you’d expect from sharing a common author, the two interact beautifully. Have a look at this overview of the main functions from dplyr if this interests you. On their own they aren’t much use, but when piped together and used in tandem with ggplot2 they allow for very quick data analysis and visualisation with a very readable syntax.
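As a rough sketch of what this piping looks like in practice, the snippet below chains a few dplyr verbs and hands the result to ggplot2. It uses the mtcars dataset that ships with R; the choice of columns, the hp > 100 filter and the derived names (wt_kg, mean_mpg, mean_wt_kg, n_cars) are my own assumptions for illustration only.

```r
library(dplyr)
library(ggplot2)

# Chain dplyr verbs with the pipe: each step receives the previous result
summary_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                 # keep the more powerful cars
  mutate(wt_kg = wt * 453.592) %>%     # wt is recorded in units of 1000 lb
  group_by(cyl) %>%                    # one group per cylinder count
  summarise(mean_mpg   = mean(mpg),
            mean_wt_kg = mean(wt_kg),
            n_cars     = n()) %>%
  arrange(desc(mean_mpg))              # most economical group first

# Pipe the summary straight into ggplot2; layers are then added with +
p <- summary_by_cyl %>%
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean miles per gallon")

# Evaluating p at the console (or calling print(p)) draws the chart
```

Each verb does very little by itself, but read top to bottom the pipeline says exactly what happens to the data, and the ggplot2 layers continue in the same style.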