One function I miss from Stata is tabstat. With a single line of code, it produces very useful summary statistics, such as the mean and standard error, by group and by condition. R has its own built-in summary function, summary(), but in most of my research I have found the summaries it produces barely useful. Consider the following pseudo-data:
library(data.table)
set.seed(10)
N = 120
DT = data.table(x = rnorm(N, 1), y = rnorm(N, 2), category = sample(letters[1:3], N, replace = T))
DT[1:10]
##     x y category
## 1: 1.
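For comparison, a tabstat-style summary by group and condition can be sketched directly with data.table's grouped aggregation. This is only one possible approach, not necessarily the one this post goes on to use; the helper `se`, the result name `stats`, and the condition `y > 1` are assumptions made for illustration.

```r
library(data.table)

# Recreate the pseudo-data from above.
set.seed(10)
N <- 120
DT <- data.table(x = rnorm(N, 1), y = rnorm(N, 2),
                 category = sample(letters[1:3], N, replace = TRUE))

# Hypothetical helper: standard error of the mean.
se <- function(v) sd(v) / sqrt(length(v))

# Count, mean and standard error of x within each category,
# restricted to rows where y > 1 (an arbitrary example condition).
stats <- DT[y > 1,
            .(n = .N, mean_x = mean(x), se_x = se(x)),
            by = category]
print(stats)
```

The `i` argument (`y > 1`) plays the role of Stata's `if` condition, and `by = category` plays the role of tabstat's `by()` option.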
In this paper on redefining statistical significance, Benjamin et al. (2017) propose changing the default p-value threshold for statistical significance from 0.05 to 0.005 for claims of new discoveries. That is, the proposed threshold is one tenth of the conventional one! Suppose the world switched to p = 0.005. Would we need a sample 10 times larger? As researchers without sufficient funding, we care about how much additional sample we would need, supposing our hypothesis is true.
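A rough way to get a feel for the answer is to compare the required sample sizes from base R's power.t.test() at the two thresholds. This is a minimal sketch under assumed settings: a two-sample t-test, an effect of 0.5 standard deviations, and 80% power are illustrative choices, not values taken from the paper.

```r
# Required sample size per group for a two-sample, two-sided t-test.
# delta = 0.5, sd = 1 and power = 0.8 are illustrative assumptions.
n_05  <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05,  power = 0.8)$n
n_005 <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.005, power = 0.8)$n

# How many times larger the sample must be under the stricter threshold.
ratio <- n_005 / n_05
print(c(n_05 = n_05, n_005 = n_005, ratio = ratio))
```

Under these assumptions, the ratio stays below 2, so a tenfold stricter threshold does not translate into a tenfold larger sample.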
When requesting individual-level data from others (a company or a government agency), we usually need to properly anonymize the individuals to protect their privacy. The following is an example:
(Data = data.frame(Name = c("John Smith", "Jenny Ford", "Vivian Lee"), Secret = c("Hate dog", "Afraid of ghost", "A bathroom dancer")))
##         Name            Secret
## 1 John Smith          Hate dog
## 2 Jenny Ford   Afraid of ghost
## 3 Vivian Lee A bathroom dancer

One simple way is to drop the Name and keep only the Secret, since we are more interested in their secrets.
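Dropping the identifying column might look like this in base R. It is a minimal sketch; the name `Anon` is introduced here for illustration and does not come from the original post.

```r
# Recreate the example data.
Data <- data.frame(
  Name   = c("John Smith", "Jenny Ford", "Vivian Lee"),
  Secret = c("Hate dog", "Afraid of ghost", "A bathroom dancer")
)

# Keep only the Secret column; drop = FALSE keeps the result a data frame
# instead of collapsing it to a plain vector.
Anon <- Data[, "Secret", drop = FALSE]
print(Anon)
```

Once the Name column is gone, rows can no longer be linked back to a person, which also means records for the same individual cannot be matched across tables.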