I’ve just released a new package to CRAN, and while it doesn’t perform
any complex calculations or fill a statistical niche, it may be one of
the most useful everyday libraries I’ll write.
In short, epitab provides a framework for building descriptive tables
by extending contingency tables with additional functionality.
I initially developed it for my work in epidemiology, as I kept coming
across situations where I wanted to programmatically generate tables
containing various descriptive statistics to facilitate reproducible
research, but I could not find any existing software that met my
requirements. I tried Epi::stat.table, but found it limited by not
being able to display multiple independent variables; adding a third
variable builds a 3-way table instead. It also lacks the ability to
calculate statistics that aren’t from a cross-tabulated combination of
covariate and outcome. My final requirement was that I wanted a way of
tidying the table for publication in various formats.
This post will provide a brief introduction to epitab and detail how
to use its basic functionality. For further guidance see the vignette,
or the reference manual.
Installation
The current version (0.2.1 as of the time of writing) is hosted on CRAN and can easily be installed in the usual manner.
install.packages("epitab")
Development is managed on GitHub, so the latest development version can
be installed with devtools.
devtools::install_github("stulacy/epitab")
Basic usage
The mtcars data set will be used to demonstrate the types of tables
that can be built with epitab. Note that discrete variables are
coerced into factors to simplify subsequent analysis; in epidemiology it
is even common to discretise continuous variables.
library(epitab)
library(dplyr)
library(knitr)
facs <- c('cyl', 'am', 'gear', 'carb', 'vs')
mtcars[facs] <- lapply(mtcars[facs], factor)
head(mtcars) %>%
    kable()
| | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mazda RX4 | 21.0 | 6 | 160 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| Mazda RX4 Wag | 21.0 | 6 | 160 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| Datsun 710 | 22.8 | 4 | 108 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| Hornet 4 Drive | 21.4 | 6 | 258 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| Hornet Sportabout | 18.7 | 8 | 360 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
| Valiant | 18.1 | 6 | 225 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
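As a quick sanity check on the preparation step, the sketch below confirms the coercion on a copy of the data and shows one way a continuous variable could be discretised with base R's cut. The hp_band column and its cut points are arbitrary choices for illustration and are not used elsewhere in this post.

```r
# Work on a copy so the built-in data set is left untouched.
cars <- datasets::mtcars
facs <- c("cyl", "am", "gear", "carb", "vs")
cars[facs] <- lapply(cars[facs], factor)

# Every listed column should now be a factor.
all_factors <- all(sapply(cars[facs], is.factor))
print(all_factors)
print(levels(cars$cyl))

# Discretise horsepower into three bands (illustrative cut points only).
cars$hp_band <- cut(cars$hp,
                    breaks = c(0, 100, 200, Inf),
                    labels = c("<=100", "101-200", ">200"))
print(table(cars$hp_band))
```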
A standard contingency table built with epitab is displayed below. The
covariates (displayed on the table rows) are defined in the independents
argument; outcomes specifies the outcome variables (columns).
These are both provided in the form of named lists, with the names
giving the row/column labels. The crosstab_funcs argument defines
statistics to calculate for each covariate/outcome combination; in this
example the freq function (provided in epitab) calculates the
frequency, shown alongside the row percentage in parentheses.
contingency_table(independents=list("Cylinders"="cyl", "Num gears"="gear"),
                  outcomes=list("Manual transmission"="am"),
                  data=mtcars,
                  crosstab_funcs = list(freq()))
## | | |Manual transmission | |
## | |All |0 |1 |
## -------------------------------------------------------------------------
## | | | | |
## |Total |32 |19 (59) |13 (41) |
## | | | | |
## Cylinders |4 |11 |3 (27) |8 (73) |
## |6 |7 |4 (57) |3 (43) |
## |8 |14 |12 (86) |2 (14) |
## | | | | |
## Num gears |3 |15 |15 (100) |0 (0) |
## |4 |12 |4 (33) |8 (67) |
## |5 |5 |0 (0) |5 (100) |
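The cell values can be cross-checked in base R. The sketch below rebuilds the Cylinders block with table and prop.table, on the assumption (consistent with the figures shown, e.g. 3 of 11 is 27%) that the bracketed numbers are row percentages; it does not use epitab itself.

```r
# Counts of cylinders by transmission type, taken from the unmodified data.
counts <- table(Cylinders = datasets::mtcars$cyl,
                Manual = datasets::mtcars$am)

# Row percentages: each row of the table sums to 100.
row_pct <- round(100 * prop.table(counts, margin = 1))

print(counts)
print(row_pct)
```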
The above table is suitable for use in an interactive R console.
However, if the table is to be shared with others then a clean, easily
exportable version is required. The function neat_table produces a
knitr::kable object that can be exported to either HTML or LaTeX. This
allows for the generation of descriptive tables alongside the analysis
in RMarkdown, thereby facilitating reproducible research.
contingency_table(independents=list("Cylinders"="cyl", "Num gears"="gear"),
                  outcomes=list("Manual transmission"="am"),
                  data=mtcars,
                  crosstab_funcs = list(freq())) %>%
    neat_table() %>%
    kableExtra::kable_styling(bootstrap_options = c('striped', 'hover'),
                              full_width=FALSE)
| | All | Manual transmission: 0 | Manual transmission: 1 |
|---|---|---|---|
| Total | 32 | 19 (59) | 13 (41) |
| Cylinders | | | |
| 4 | 11 | 3 (27) | 8 (73) |
| 6 | 7 | 4 (57) | 3 (43) |
| 8 | 14 | 12 (86) | 2 (14) |
| Num gears | | | |
| 3 | 15 | 15 (100) | 0 (0) |
| 4 | 12 | 4 (33) | 8 (67) |
| 5 | 5 | 0 (0) | 5 (100) |
Multiple outcomes can also be specified in contingency_table.
contingency_table(independents=list("Carburetors"="carb", "Num gears"="gear"),
                  outcomes=list("Manual transmission"="am", "Cylinders"="cyl"),
                  data=mtcars,
                  crosstab_funcs = list(freq())) %>%
    neat_table() %>%
    kableExtra::kable_styling(bootstrap_options = c('striped', 'hover'),
                              full_width=FALSE)
| | All | Manual transmission: 0 | Manual transmission: 1 | Cylinders: 4 | Cylinders: 6 | Cylinders: 8 |
|---|---|---|---|---|---|---|
| Total | 32 | 19 (59) | 13 (41) | 11 (34) | 7 (22) | 14 (44) |
| Carburetors | | | | | | |
| 1 | 7 | 3 (43) | 4 (57) | 5 (71) | 2 (29) | 0 (0) |
| 2 | 10 | 6 (60) | 4 (40) | 6 (60) | 0 (0) | 4 (40) |
| 3 | 3 | 3 (100) | 0 (0) | 0 (0) | 0 (0) | 3 (100) |
| 4 | 10 | 7 (70) | 3 (30) | 0 (0) | 4 (40) | 6 (60) |
| 6 | 1 | 0 (0) | 1 (100) | 0 (0) | 1 (100) | 0 (0) |
| 8 | 1 | 0 (0) | 1 (100) | 0 (0) | 0 (0) | 1 (100) |
| Num gears | | | | | | |
| 3 | 15 | 15 (100) | 0 (0) | 1 (7) | 2 (13) | 12 (80) |
| 4 | 12 | 4 (33) | 8 (67) | 8 (67) | 4 (33) | 0 (0) |
| 5 | 5 | 0 (0) | 5 (100) | 2 (40) | 1 (20) | 2 (40) |
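To make the layout concrete: each outcome contributes its own two-way cross-tabulation against the covariate, placed side by side, rather than all variables being crossed into a single multi-way table. The base R sketch below reproduces the counts in the Carburetors block this way and contrasts it with a 3-way table; it is only an illustration of the idea and says nothing about how epitab builds the table internally.

```r
cars <- datasets::mtcars

# Two independent two-way tables that share the same row variable.
by_am  <- table(cars$carb, cars$am)
by_cyl <- table(cars$carb, cars$cyl)

# Binding the columns gives one row per carburetor count and one column
# per outcome level (0, 1, 4, 6, 8).
side_by_side <- cbind(by_am, by_cyl)
print(side_by_side)

# Crossing all three variables at once yields a 3-dimensional array instead.
three_way <- table(cars$carb, cars$am, cars$cyl)
print(dim(three_way))
```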
Other summary statistics
Alongside the cross-tabulated frequencies, summary statistics that are
dependent on only the covariates or the outcomes can be displayed.
Column-wise functions are those that act on every outcome column,
independently from the covariates. This behaviour is useful for
identifying a relationship between a continuous independent variable and
the outcome(s). Two column-wise functions are supplied with epitab:
summary_mean and summary_median, which calculate the mean and median
values of a specified covariate for each outcome level. The table below
shows that in this data set, manual cars have less power but greater
fuel economy than automatics.
contingency_table(independents=list("Cylinders"="cyl", "Num gears"="gear"),
                  outcomes=list("Manual transmission"="am"),
                  data=mtcars,
                  crosstab_funcs = list(freq()),
                  col_funcs=list("Mean MPG"=summary_mean("mpg"),
                                 "Mean horsepower"=summary_mean("hp"))) %>%
    neat_table() %>%
    kableExtra::kable_styling(bootstrap_options = c('striped', 'hover'),
                              full_width=FALSE)
| | All | Manual transmission: 0 | Manual transmission: 1 |
|---|---|---|---|
| Total | 32 | 19 (59) | 13 (41) |
| Cylinders | | | |
| 4 | 11 | 3 (27) | 8 (73) |
| 6 | 7 | 4 (57) | 3 (43) |
| 8 | 14 | 12 (86) | 2 (14) |
| Num gears | | | |
| 3 | 15 | 15 (100) | 0 (0) |
| 4 | 12 | 4 (33) | 8 (67) |
| 5 | 5 | 0 (0) | 5 (100) |
| Mean MPG | | 17.15 | 24.39 |
| Mean horsepower | | 160.26 | 126.85 |
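The two summary rows are group means of a continuous variable within each outcome level, so they can be cross-checked with base R's tapply. This is a minimal sketch of the calculation only, not of how summary_mean is implemented.

```r
cars <- datasets::mtcars

# Mean of each continuous variable within the two transmission groups
# (0 = automatic, 1 = manual), rounded as in the table.
mean_mpg <- round(tapply(cars$mpg, cars$am, mean), 2)
mean_hp  <- round(tapply(cars$hp, cars$am, mean), 2)

print(mean_mpg)
print(mean_hp)
```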
Row-wise functions are instead calculated for every covariate level and
displayed in an additional column. An example is the epitab::odds_ratio
function, which obtains the odds ratios for each
covariate level from a logistic regression on a specified outcome.
epitab also provides functionality for displaying hazard ratios for
time-to-event outcomes. Note that in the example below the odds ratios
are obtained from univariate models of each covariate in turn, although
functionality is provided for adjusting for other factors. See the help
page for odds_ratio for further details.
contingency_table(independents=list("Cylinders"="cyl", "Engine shape"="vs"),
                  outcomes=list("Manual transmission"="am"),
                  data=mtcars,
                  crosstab_funcs = list(freq()),
                  col_funcs=list("Mean MPG"=summary_mean("mpg"),
                                 "Mean horsepower"=summary_mean("hp")),
                  row_funcs=list("OR"=odds_ratio("am"))) %>%
    neat_table() %>%
    kableExtra::kable_styling(bootstrap_options = c('striped', 'hover'),
                              full_width=FALSE)
| | All | Manual transmission: 0 | Manual transmission: 1 | OR |
|---|---|---|---|---|
| Total | 32 | 19 (59) | 13 (41) | |
| Cylinders | | | | |
| 4 | 11 | 3 (27) | 8 (73) | 1 |
| 6 | 7 | 4 (57) | 3 (43) | 0.28 (0.03 - 1.99) |
| 8 | 14 | 12 (86) | 2 (14) | 0.06 (0.01 - 0.39) |
| Engine shape | | | | |
| 0 | 18 | 12 (67) | 6 (33) | 1 |
| 1 | 14 | 7 (50) | 7 (50) | 2.00 (0.48 - 8.76) |
| Mean MPG | | 17.15 | 24.39 | |
| Mean horsepower | | 160.26 | 126.85 | |
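The point estimates in the OR column can be reproduced with one univariate logistic regression per covariate using base R's glm. The sketch below only recovers the odds ratios themselves; how odds_ratio derives its confidence intervals is not covered here, so no attempt is made to match them.

```r
cars <- datasets::mtcars
cars$cyl <- factor(cars$cyl)
cars$vs  <- factor(cars$vs)

# One univariate logistic regression per covariate, with am (0/1) as outcome.
fit_cyl <- glm(am ~ cyl, family = binomial, data = cars)
fit_vs  <- glm(am ~ vs, family = binomial, data = cars)

# Exponentiated coefficients (intercept dropped) are odds ratios relative
# to the first level of each factor, which is why that level shows as 1.
or_cyl <- exp(coef(fit_cyl))[-1]
or_vs  <- exp(coef(fit_vs))[-1]

print(round(c(or_cyl, or_vs), 2))
```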
Custom functions
This brief introduction has shown the use of inbuilt functions for
displaying the three types of summary statistics, such as freq,
odds_ratio, and summary_mean. However, the main strength of the
package as I see it is that any user-defined function can be used in
each of these three roles, provided it meets the required
parameterisation. Please see the vignette for details.
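The built-ins above are configured by calling them (for example summary_mean("mpg")) and placing the result in a list, which is the usual function-factory pattern in R. The base R sketch below illustrates that general pattern only: make_group_summary, its arguments, and the signature of the function it returns are hypothetical and are not the parameterisation epitab requires, which is documented in the vignette.

```r
# Hypothetical factory: fixes a variable and a statistic, and returns a
# function that summarises that variable within each level of a grouping
# column. The (data, group) signature is an assumption for illustration.
make_group_summary <- function(variable, stat = median) {
  function(data, group) {
    tapply(data[[variable]], data[[group]], stat)
  }
}

# Configure once, then apply to a data set.
median_wt <- make_group_summary("wt")
result <- median_wt(datasets::mtcars, "am")
print(result)
```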
Further work
It’s a rather simple package but one that I’ve already incorporated quite frequently into my exploratory analysis workflow. I’m always keen to improve it and would appreciate any comments and feedback (contact me here). Particular features I’d like to add in future releases include the ability to calculate statistics related to each covariate and outcome but not necessarily calculated for each level, such as displaying the output of a chi-square test between the covariate and outcome. I’d also welcome feedback on different options for formatting the table; the layout that is currently used is one that makes sense to me but others may have different preferences.
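For reference, the kind of covariate-level statistic described above can already be computed separately in base R. The sketch below runs a chi-square test between cylinders and transmission type with chisq.test; it is shown only to illustrate the statistic and is not part of epitab.

```r
# Cross-tabulate the covariate against the outcome.
counts <- table(Cylinders = datasets::mtcars$cyl,
                Manual = datasets::mtcars$am)

# Some expected counts are small, so R warns that the chi-square
# approximation may be inaccurate; the warning is silenced here.
chisq_res <- suppressWarnings(chisq.test(counts))

print(chisq_res$statistic)
print(chisq_res$parameter)
print(chisq_res$p.value)
```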