One feature I miss from *Stata* is *tabstat*. With just one line of code, it produces very useful summary statistics, such as the `mean` and `standard error`, by group and by condition. R has its own built-in summary function, `summary()`, too, but in most cases in my research I find the summaries it produces barely useful. Consider the following pseudo-data:

```
library(data.table)
set.seed(10)
N = 120
DT = data.table(x = rnorm(N, 1), y = rnorm(N, 2),
                category = sample(letters[1:3], N, replace = TRUE))
DT[1:10]
```

```
## x y category
## 1: 1.0187462 1.5186344 c
## 2: 0.8157475 2.2028818 a
## 3: -0.3713305 1.9682603 c
## 4: 0.4008323 0.8044197 a
## 5: 1.2945451 2.6236812 c
## 6: 1.3897943 1.0851955 c
## 7: -0.2080762 2.2487580 b
## 8: 0.6363240 0.9373772 b
## 9: -0.6266727 1.6360178 c
## 10: 0.7435216 0.7930051 a
```

If we summarize the variable `x` using `summary()`, it gives:

`summary(DT$x)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.1853 0.2380 0.9101 0.9235 1.7119 3.2205
```

In most cases, I want a sense of the dispersion of the mean and the number of non-missing observations. More importantly, I want a data table from which I can generate a bar chart, which is very common in analyzing experimental data. Since I need this type of summary function in almost all of my ongoing projects, why not write a personalized one for myself and other potential users? Here it is.

```
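SumFunOne = function(Data, Var, Group, StatList){
  arguments <- as.list(match.call())
  # Evaluate the unquoted column names inside Data and copy the results into
  # a temporary table. Without this step, `x` in the grouped call below would
  # always resolve to a column of Data literally named "x", whatever Var is.
  tmp = data.table(x = eval(arguments$Var, Data),
                   category = eval(arguments$Group, Data))
  keep = c("category", StatList)
  # Compute every statistic by group, then keep only the requested ones
  result = tmp[, .(Mean = mean(x, na.rm=TRUE),
                   N = sum(!is.na(x)),
                   SE = sd(x, na.rm=TRUE)/sqrt(sum(!is.na(x))),
                   median = median(x, na.rm=TRUE),
                   max = max(x, na.rm=TRUE),
                   min = min(x, na.rm=TRUE),
                   Missing = sum(is.na(x))),
               by = .(category)][, ..keep][order(category)]
  return(result)
}
(Data.Summary = SumFunOne(DT, x, category, c("Mean","N","SE")))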
```

```
## category Mean N SE
## 1: a 1.0233245 36 0.1561983
## 2: b 0.9516208 41 0.1353133
## 3: c 0.8131563 43 0.1560230
```

The function `SumFunOne()` has four inputs: `Data` (which should be a `data.table`), `Var` (the variable to be summarized), `Group` (the grouping variable I want to condition on), and `StatList` (the statistics I want to show). We can also subset the data by passing, for example, `DT[y>0]` as the `Data` argument. Given the results, I can easily draw a bar chart with standard errors and the number of observations in `ggplot2`.
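A minimal sketch of such a chart, assuming `ggplot2` is installed, might look like the following. To keep the block self-contained, it re-enters the summary table printed above; the fill colour, label position, and axis titles are illustrative choices only.

```r
library(data.table)
library(ggplot2)

# Summary table copied from the SumFunOne() output above
Data.Summary = data.table(category = c("a", "b", "c"),
                          Mean = c(1.0233245, 0.9516208, 0.8131563),
                          N    = c(36, 41, 43),
                          SE   = c(0.1561983, 0.1353133, 0.1560230))

p = ggplot(Data.Summary, aes(x = category, y = Mean)) +
  # One bar per group, with height equal to the group mean
  geom_col(fill = "grey70", width = 0.6) +
  # Error bars spanning one standard error on either side of the mean
  geom_errorbar(aes(ymin = Mean - SE, ymax = Mean + SE), width = 0.2) +
  # Number of non-missing observations, printed near the base of each bar
  geom_text(aes(y = 0.05, label = paste0("N = ", N)), vjust = 0) +
  labs(x = "Category", y = "Mean of x") +
  theme_minimal()
print(p)
```

Because the plot only needs the columns `category`, `Mean`, `N`, and `SE`, any table returned by `SumFunOne()` with those statistics requested can be passed to it directly.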

Well, what if I want to summarize multiple variables? The goal is to have a counterpart of Stata’s *tabstat* in R, isn’t it? It is straightforward, too. We just need the powerful `.SD` in `data.table` to apply the summary function to multiple variables, but we define the summary function first, outside of the `data.table` call. For simplicity, I show only three statistics: `Mean`, `N` and `SE`. `varList` is the list of variables we want to summarize.

```
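SumFunMult = function(Data, varList, Group){
  arguments <- as.list(match.call())
  # Evaluate the unquoted grouping column inside Data
  category = eval(arguments$Group, Data)
  # Summary statistics computed for one variable within one group
  my.summary <- function(x){
    c(Mean = mean(x, na.rm=TRUE),
      N = sum(!is.na(x)),
      SE = sd(x, na.rm=TRUE)/sqrt(sum(!is.na(x))))
  }
  # Three rows (one per statistic) per group, one column per variable
  result = Data[, lapply(.SD, my.summary), by=.(category), .SDcols= varList]
  # Row labels: the three statistics repeat once per group
  Stats = rep(c("Mean","N","SE"), length(unique(category)))
  out = cbind(Stats, result)  # named `out` to avoid masking base::summary()
  return(out)
}
SumFunMult(DT, c("x","y"), category)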
```

```
## Stats category x y
## 1: Mean c 0.8131563 1.6547724
## 2: N c 43.0000000 43.0000000
## 3: SE c 0.1560230 0.1491735
## 4: Mean a 1.0233245 1.8356193
## 5: N a 36.0000000 36.0000000
## 6: SE a 0.1561983 0.1652287
## 7: Mean b 0.9516208 2.0720581
## 8: N b 41.0000000 41.0000000
## 9: SE b 0.1353133 0.1404608
```
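For plotting, it can be handier to have one row per group and variable, with each statistic in its own column. The following is a minimal sketch of that reshaping, assuming the `SumFunMult()` output shown above (re-entered by hand here so the block runs on its own; the names `res`, `long`, and `wide` are illustrative). `melt()` stacks the variables and `dcast()` spreads the statistics into columns.

```r
library(data.table)

# Output of SumFunMult(DT, c("x","y"), category), copied from the printout above
res = data.table(
  Stats    = rep(c("Mean", "N", "SE"), 3),
  category = rep(c("c", "a", "b"), each = 3),
  x = c(0.8131563, 43, 0.1560230, 1.0233245, 36, 0.1561983,
        0.9516208, 41, 0.1353133),
  y = c(1.6547724, 43, 0.1491735, 1.8356193, 36, 0.1652287,
        2.0720581, 41, 0.1404608))

# Stack x and y into a single "value" column, keeping the statistic labels
long = melt(res, id.vars = c("Stats", "category"), variable.name = "variable")

# Spread the statistics into columns: one row per category and variable
wide = dcast(long, category + variable ~ Stats, value.var = "value")
print(wide)
```

The resulting table has the same `Mean`, `N`, and `SE` columns as the single-variable summary, plus a `variable` column that can be used for faceting or grouping in a plot.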

The next task is to develop it into a package so that I can easily call the function to summarize and visualize the summary statistics…