Personalized Data Summary Function Using "data.table"

One function I miss about Stata is its tabstat. By using just one line code, it can produce very useful summary statistics such as mean, and standard error by groups by conditions. R has its own built-in summary function – summary(), too, but in most cases in my research, I found the summaries produced is barely useful. Consider the following pseudo-data:

library(data.table)
set.seed(10)
N = 120
DT = data.table(x = rnorm(N,1), y = rnorm(N,2),
                category = sample(letters[1:3], N, replace = T))
DT[1:10]
##              x         y category
##  1:  1.0187462 1.5186344        c
##  2:  0.8157475 2.2028818        a
##  3: -0.3713305 1.9682603        c
##  4:  0.4008323 0.8044197        a
##  5:  1.2945451 2.6236812        c
##  6:  1.3897943 1.0851955        c
##  7: -0.2080762 2.2487580        b
##  8:  0.6363240 0.9373772        b
##  9: -0.6266727 1.6360178        c
## 10:  0.7435216 0.7930051        a

If we summarize the variable x using summary(), it gives:

summary(DT$x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.1853  0.2380  0.9101  0.9235  1.7119  3.2205

In most the case, I want to have a sense of the dispersion of the mean, number of non-mising observations. More importantly, I want to have a data table from which I can generate a barchart, which is very common in analyzing experimental data. Since I almost need this types of summary function for all of my on-going project, why not make a personalized one for myself and potential other users? Here it is.

SumFunOne = function(Data, Var, Group, StatList){
  arguments <- as.list(match.call())
  x = eval(arguments$Var, Data)
  category = eval(arguments$Group, Data)
  keep = c("category",StatList)
  result = Data[,  .(Mean = mean(x, na.rm=TRUE), 
                     N = sum(!is.na(x)), 
                     SE = sd(x, na.rm=TRUE)/sqrt(sum(!is.na(x))),
                     median = median(x),
                     max = max(x),
                     min = min(x),
                     Missing = sum(is.na(x))),
                     by = .(category)][,..keep][order(category)]
  return(result)
}
(Data.Summary = SumFunOne(DT, x, category, c("Mean","N","SE")))
##    category      Mean  N        SE
## 1:        a 1.0233245 36 0.1561983
## 2:        b 0.9516208 41 0.1353133
## 3:        c 0.8131563 43 0.1560230

The function SumFunOne() has 4 inputs: Data (should be in data.table), Var – variable to be summarized, Group – the group variable I want to condition on, and StatList – the statistics I want to show. We can also subset the data using DT[y>0] in the input. Given the results, I can easily draw a barchart with standard errors and number of observations in ggplot2:

Well, what if I want to summarize multiple variables? The goal is to have a counterpart of Stata’s tabstat in R, isn’t it? It is straightforward, too. We just need to use the powerful .SD in data.table to apply the summary function to multiple variables. But we define the summary function first outside of the data.table. For simplicity, I only show three statistics: Mean, N and SE. varList is the variable list we want to summrize.

SumFunMult = function(Data, varList, Group){
  arguments <- as.list(match.call())
  category = eval(arguments$Group, Data)
  my.summary <- function(x){
    c(Mean = mean(x, na.rm=TRUE), 
      N = sum(!is.na(x)), 
      SE = sd(x, na.rm=TRUE)/sqrt(sum(!is.na(x))))
  }
  result = Data[, lapply(.SD, my.summary), by=.(category), .SDcols= varList]
  Stats = rep(c("Mean","N","SE"),length(unique(category)))
  summary = cbind(Stats,result)
  return(summary)
}

SumFunMult(DT, c("x","y"),category)
##    Stats category          x          y
## 1:  Mean        c  0.8131563  1.6547724
## 2:     N        c 43.0000000 43.0000000
## 3:    SE        c  0.1560230  0.1491735
## 4:  Mean        a  1.0233245  1.8356193
## 5:     N        a 36.0000000 36.0000000
## 6:    SE        a  0.1561983  0.1652287
## 7:  Mean        b  0.9516208  2.0720581
## 8:     N        b 41.0000000 41.0000000
## 9:    SE        b  0.1353133  0.1404608

Next task is the develop it into a package so that I can easily call the function to summarize and visualize the summary statistics…

Related