Posts

U-Shape Test using "Two Lines" - A Simple Solution for Discrete IV

In this paper by Uri Simonsohn (2017), the author proposes a novel method to test for U-shaped relationships. In the literature, the popular way of testing for a U-shaped relationship between x and y is to add a quadratic term to the regression $y=\beta_0+\beta_1 x + \beta_2 x^2 +\epsilon$ (where $\epsilon$ is i.i.d. noise). If $\beta_2$ is statistically significant, the relationship between x and y is declared U-shaped.
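A minimal sketch of the conventional quadratic-term approach described above (not Simonsohn's two-lines test itself), using simulated data with a true U-shape:

```r
# Simulated data with a true U-shaped relationship (minimum at x = 5)
set.seed(1)
n <- 500
x <- runif(n, 0, 10)
y <- (x - 5)^2 + rnorm(n)

# The conventional test: fit a quadratic and inspect the x^2 coefficient
fit <- lm(y ~ x + I(x^2))
coef(summary(fit))["I(x^2)", ]  # estimate, std. error, t value, p value
```

With a strongly U-shaped data-generating process like this one, the quadratic coefficient comes out positive and highly significant; Simonsohn's point is that this test can also fire when the true relationship is monotone but concave, which motivates the two-lines alternative.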

Fake News Consumption and Segregation on Twitter

To form accurate beliefs about the world (e.g., whether the earth is flat or a sphere, whether vaccination causes autism, etc.), people must encounter diverse views and opinions that will sometimes contradict their pre-existing beliefs. Many scholars are concerned that the emergence of the internet, and especially of recent social media, reduces the cost of acquiring information from a wide range of sources, making it easy for consumers to self-segregate and limit themselves to information sources that are likely to confirm their views.

Personalized Data Summary Function Using "data.table"

One function I miss from Stata is tabstat. With just one line of code, it produces very useful summary statistics, such as the mean and standard error, by group and by condition. R has its own built-in summary function, summary(), but in most cases in my research I find the summaries it produces barely useful. Consider the following pseudo-data: library(data.table) set.seed(10) N = 120 DT = data.table(x = rnorm(N,1), y = rnorm(N,2), category = sample(letters[1:3], N, replace = T)) DT[1:10] ## x y category ## 1: 1.
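A sketch of what a tabstat-like one-liner can look like in data.table, using the pseudo-data from the excerpt (the chosen statistics, mean and standard error by group, are just one illustration):

```r
library(data.table)

# Pseudo-data from the excerpt
set.seed(10)
N  <- 120
DT <- data.table(x = rnorm(N, 1), y = rnorm(N, 2),
                 category = sample(letters[1:3], N, replace = TRUE))

# One line: mean and standard error of x and y, by category
summ <- DT[, .(x_mean = mean(x), x_se = sd(x) / sqrt(.N),
               y_mean = mean(y), y_se = sd(y) / sqrt(.N)),
           by = category]
summ
```

The `.N` special symbol is the group size, so the same expression works regardless of how unbalanced the groups are; wrapping this pattern in a function gives a personalized tabstat replacement.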

Is it the end of the world if $\alpha=0.005$ is the new norm?

In this paper by Benjamin et al. (2017) on redefining statistical significance, the authors propose changing the default p-value threshold for claims of new discoveries from 0.05 to 0.005. That is, the proposed threshold is one tenth of the conventional one! Suppose the world switched to p = 0.005: would we need ten times the sample? As researchers without unlimited funding, we care about how much additional sample we would need, supposing our hypothesis is true.
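A quick back-of-the-envelope answer with base R's power.t.test, comparing the per-group sample size needed for 80% power at the two thresholds (the effect size d = 0.4 is an assumption chosen purely for illustration):

```r
# Per-group n for a two-sample t-test, 80% power, effect size 0.4
n_05  <- power.t.test(delta = 0.4, sd = 1, sig.level = 0.05,  power = 0.8)$n
n_005 <- power.t.test(delta = 0.4, sd = 1, sig.level = 0.005, power = 0.8)$n

c(n_05 = n_05, n_005 = n_005, ratio = n_005 / n_05)
```

The ratio comes out around 1.7, not 10: tightening alpha tenfold roughly doubles, rather than decuples, the required sample, because required n scales with the square of the sum of the normal quantiles for alpha and power.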

Anonymize Individuals using digest()

When requesting individual-level data from others (a company or a government agency), we usually need to properly anonymize the individuals to protect their privacy. The following is an example: (Data = data.frame(Name = c("John Smith", "Jenny Ford","Vivian Lee"), Secret = c("Hate dog","Afraid of ghost","A bathroom dancer"))) ## Name Secret ## 1 John Smith Hate dog ## 2 Jenny Ford Afraid of ghost ## 3 Vivian Lee A bathroom dancer One simple way is to just drop the Name and keep only the Secret, since we are more interested in their secrets.
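A sketch of the hashing approach using the digest package (assumed to be installed): replace each name with a cryptographic hash, so rows remain distinguishable but identities are hidden. The salt string here is a hypothetical placeholder; prepending a private salt makes it harder to recover names by hashing a dictionary of candidate names.

```r
library(digest)

Data <- data.frame(Name   = c("John Smith", "Jenny Ford", "Vivian Lee"),
                   Secret = c("Hate dog", "Afraid of ghost", "A bathroom dancer"))

salt <- "replace-with-a-private-string"  # hypothetical; keep it secret

# SHA-256 hash of salt + name, one ID per row; then drop the raw names
Data$ID   <- vapply(paste0(salt, Data$Name), digest,
                    FUN.VALUE = character(1), algo = "sha256")
Data$Name <- NULL
Data
```

The same name always maps to the same ID (given the same salt), so the IDs can still be used to link records across data sets without ever revealing who the individuals are.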

Limited Attention in Response to Email Scam -- A Toy Model

About a year ago, I started a project on consumer fraud protection, especially on how to protect consumers from falling prey to phishing emails, i.e., emails sent by scammers to obtain sensitive information such as passwords or credit card numbers. How should we model a consumer's response to the scam? Naturally, I would assume that consumers have limited attention. That is, paying attention is effortful and costly: the more effort the consumer exerts, the more accurate the information the consumer acquires.
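A toy sketch of that trade-off (the functional forms below are my assumptions for illustration, not the project's actual model): accuracy rises in attention with diminishing returns, attention has a linear cost, and the consumer picks the attention level that maximizes net benefit.

```r
# Accuracy of spotting a scam as a function of attention e: concave, in [0, 1)
accuracy <- function(e) e / (1 + e)

# Net benefit: accuracy minus linear attention cost c * e
net <- function(e, c = 0.1) accuracy(e) - c * e

# Interior optimum; analytically e* = sqrt(1/c) - 1 for this specification
opt <- optimize(net, interval = c(0, 20), maximum = TRUE)
opt$maximum
```

Even this crude setup delivers the key comparative static: when attention is more costly (larger c), the chosen attention level falls, so scams that raise the cost of scrutiny catch more victims.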

How Much We Can Learn from Google Search Data

I just finished the book Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are by Seth Stephens-Davidowitz, which is a highly rated book. The author devotes a great amount of text to Google Trends data. The fun part of reading this book was that I could dig up the results from the Google Trends website myself. Here is one example: in the book, the author argues that Google searches reveal that contemporary American parents are far more focused on their sons' intelligence than on their daughters'.