U-Shape Test using "Two Lines" - A Simple Solution for Discrete IV

Motivation

In this paper, Uri Simonsohn (2017) proposed a novel method to test for U-shaped relationships. In the literature, the popular way of testing for a U-shaped relationship between x and y is to add a quadratic term to the regression \(y=\beta_0+\beta_1 x + \beta_2 x^2 +\epsilon\) (where \(\epsilon\) is i.i.d. noise). If \(\beta_2\) is statistically significant (with the appropriate sign), the relationship between x and y is declared U-shaped. There are plenty of real-world examples of such a relationship (consider x = the amount of sugar in ice cream, and y = the taste). Simonsohn pointed out that using the quadratic regression leads to a high false-positive rate.

I am directly borrowing the example he gave in the paper. Suppose the underlying relationship is \(y=\log{x} +\epsilon\), where \(x\sim U[0,1]\). We first simulate the data and plot it:

set.seed(111)
obs = 10000  # sample size
x = runif(n = obs)^2  # the ^2 is unnecessary, but it makes the quadratic
                      # fit look more striking in the plot
x2 = x * x
y = log(x) + rnorm(obs, 0, 1)  # true relationship: y = log(x) + noise
plot(x, y)
curve(log(x), from = 0, to = 1, col = 'red', lwd = 2, add = TRUE)

The red line is the curve \(y=\log{x}\); to the naked eye, there is no way it is close to a U-shape. Now let's fit the quadratic model \(y=\beta_0+\beta_1 x + \beta_2 x^2 +\epsilon\), and we get:

summary(lm(y~x+x2))
## 
## Call:
## lm(formula = y ~ x + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4851  -0.7435   0.0933   0.9304   4.5570 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.59872    0.02702 -170.19   <2e-16 ***
## x            14.09972    0.16903   83.42   <2e-16 ***
## x2          -10.57034    0.18953  -55.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.445 on 9997 degrees of freedom
## Multiple R-squared:  0.5841, Adjusted R-squared:  0.584 
## F-statistic:  7020 on 2 and 9997 DF,  p-value: < 2.2e-16

From the results, the coefficients on both \(x\) and \(x^2\) are highly significant, so the quadratic test would (wrongly) conclude an inverted U-shape. Plotting the fitted quadratic over the data, we have:
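A minimal sketch to reproduce that figure, overlaying the fitted quadratic (in blue, my choice of color) on the scatter along with the true curve:

fit = lm(y ~ x + x2)
plot(x, y)
curve(log(x), from = 0, to = 1, col = 'red', lwd = 2, add = TRUE)   # true curve
curve(coef(fit)[1] + coef(fit)[2]*x + coef(fit)[3]*x^2,
      from = 0, to = 1, col = 'blue', lwd = 2, add = TRUE)          # fitted quadratic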

In such cases, the regression with a quadratic term may not be a good choice for testing the underlying relationship between x and y. How can we fix it?

The Two-Lines Approach

The idea behind Professor Simonsohn's two-lines approach is rather simple and intuitive: if x and y truly have a U-shaped relationship (in the example, an inverted U-shape), we should be able to find a cutoff point such that x and y have opposite relationships before and after that point.

Consider the example above. The fitted quadratic implies that if \(x < \frac{14.1}{2 \times 10.57} = 0.67\) (the axis of symmetry), \(x\) has a positive relationship with \(y\), and if \(x > 0.67\), \(x\) has a negative relationship with \(y\), which we know from the data-generating process is untrue. Intuitively (the intuition is going to be revisited shortly), we can simply run two linear regressions:

  • \(y=\beta_0^{1}+\beta_1^{1}\cdot x+\epsilon\), when \(x<0.67\)
  • \(y=\beta_0^{2}+\beta_1^{2}\cdot x+\epsilon\), when \(x>0.67\)

If \(\beta_1^{1}\) is positive, \(\beta_1^{2}\) is negative, and both coefficients are significant, then we can be more confident that we have found an inverted U-shaped relationship.
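As a quick sketch on the simulated data from above (my own illustration, using the axis of symmetry 0.67 as the cutoff):

dat = data.frame(x, y)
summary(lm(y ~ x, data = dat[dat$x < 0.67, ]))   # left segment
summary(lm(y ~ x, data = dat[dat$x >= 0.67, ]))  # right segment

Since \(\log{x}\) is increasing everywhere, both slopes should come out positive, so this check correctly fails to find an inverted U.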

Interestingly, as the author shows in the paper, the quadratic function's axis of symmetry is not a good choice of cutoff point. He proposed a "Robin Hood" algorithm that automatically finds the cutoff. In a horse-race simulation test, it gives the lowest Type I error rate and the highest power (there is an online app and corresponding R code, so researchers can simply upload their data and run the two-lines test themselves).
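For instance, once Simonsohn's R code has been sourced (it defines the twolines() function; the call below mirrors the usage later in this post), the test is a one-liner:

twolines(y ~ x, data = data.frame(x, y))  # two-lines test with the Robin Hood cutoff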

Issue with a Discrete Independent Variable

One problem with the method (which the author also pointed out in the supplementary materials) is that it is based on the assumption that the independent variable is continuous. However, in many circumstances the independent variable is discrete. Will that create problems? The following example shows that a Type II error may arise.

Suppose this is the data.

set.seed(343)
N = 30
SD = 6
# group means rise from 1 to 3, then fall back toward 1: an inverted U in expectation
DT1 = data.frame(x = rep(1, N), y = rnorm(n = N, mean = 1, sd = SD))
DT2 = data.frame(x = rep(2, N), y = rnorm(n = N, mean = 2, sd = SD))
DT3 = data.frame(x = rep(3, N), y = rnorm(n = N/2, mean = 3, sd = SD))  # y (length N/2) is recycled to length N by data.frame()
DT4 = data.frame(x = rep(4, N), y = rnorm(n = 20*N, mean = 2, sd = SD)) # x is recycled to length 20*N: x = 4 is heavily oversampled
DT5 = data.frame(x = rep(5, N), y = rnorm(n = N, mean = 1.8, sd = SD))
DT6 = data.frame(x = rep(6, N), y = rnorm(n = N, mean = 1.5, sd = SD))
DT7 = data.frame(x = rep(7, N), y = rnorm(n = N, mean = 1, sd = SD))
DT = rbind(DT1, DT2, DT3, DT4, DT5, DT6, DT7)

If we plot the data, we have:
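A minimal sketch of that plot (the raw points plus each group's mean, which traces out the inverted U):

plot(DT$x, DT$y, xlab = "x", ylab = "y")
points(1:7, tapply(DT$y, DT$x, mean), col = 'red', pch = 19)  # group means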

Pretty good "U-shape" results, huh? But using a= twolines(y ~ x, data = DT), the twoline test yields insignificant results:

The reason is that the data at \(x=4\) were split onto the left line, which results in a flatter slope and hence an insignificant coefficient. Ideally, the left line should only include \(x=1,2,3\), and the right line only \(x=4,5,6,7\). In such a case, perhaps the simpler fix is to just run two regressions:

summary(lm(y~x, data=DT[DT$x<=3,]))
## 
## Call:
## lm(formula = y ~ x, data = DT[DT$x <= 3, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9096  -3.3686  -0.1478   3.0677  13.5946 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  -1.7686     1.3892  -1.273   0.2063  
## x             1.6543     0.6431   2.573   0.0118 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.981 on 88 degrees of freedom
## Multiple R-squared:  0.06994,    Adjusted R-squared:  0.05938 
## F-statistic: 6.618 on 1 and 88 DF,  p-value: 0.01177
summary(lm(y~x, data=DT[DT$x>=4,]))
## 
## Call:
## lm(formula = y ~ x, data = DT[DT$x >= 4, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.0009  -3.7641   0.0881   3.9437  16.0451 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.4296     1.3282   4.088 4.87e-05 ***
## x            -0.8765     0.3072  -2.853  0.00446 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.933 on 688 degrees of freedom
## Multiple R-squared:  0.0117, Adjusted R-squared:  0.01026 
## F-statistic: 8.142 on 1 and 688 DF,  p-value: 0.004455

This yields significant results for both lines: a positive slope on the left and a negative slope on the right.

There are also other limitations of the original two-lines method. For example:

  • There may be a bunch of fixed effects to control for. My economist friends usually have hundreds or thousands of fixed effects (like county-specific effects in the US).
  • Errors may be serially correlated within certain segments, so standard errors need to be adjusted.

Both are less of an issue if we run the regressions with felm from the lfe package, which absorbs high-dimensional fixed effects and supports clustered standard errors.
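As a hedged sketch, assuming a data frame df with hypothetical county and year columns (not part of the simulated data above); felm's formula separates covariates, fixed effects to absorb, instruments, and cluster variables with |:

library(lfe)
# hypothetical panel: df has columns y, x, county, year
left  = felm(y ~ x | county + year | 0 | county, data = df[df$x <= 3, ])  # absorb FEs, cluster by county
right = felm(y ~ x | county + year | 0 | county, data = df[df$x >= 4, ])
summary(left)
summary(right)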
