8  Endogeneity

8.1 The Linear Model and Exogeneity

So far we have written the conditional mean of an outcome Y_i as a linear function of observed covariates \boldsymbol X_i: \begin{align*} &Y_i = \boldsymbol X_i'\boldsymbol \beta + u_i, \\ &E[u_i\mid \boldsymbol X_i]=0 \tag{A1} \end{align*} If (A1) holds, then E[Y_i\mid \boldsymbol X_i] = \boldsymbol X_i'\boldsymbol \beta, which makes \boldsymbol X_i'\boldsymbol \beta the best predictor of Y_i given \boldsymbol X_i in the mean squared error sense. Each coefficient \beta_j is a conditional marginal effect:

Interpretation: “Among individuals who share the same values of all included control variables, those whose X_{ij} is higher by one unit have, on average, a Y_i that is higher by \beta_j.”

So far the course has provided three empirical tactics to narrow the gap between correlation and causation:

  • Add observed confounders. Whenever economic theory identifies a variable that influences both X_{ij} and Y_i, we try to measure it and augment \boldsymbol X_i.

  • Exploit panel structure. With panel data we include individual and time fixed effects to control for unobserved factors that are constant over time for a given individual or common to all individuals in a given period.

  • Use flexible functional forms. Polynomials, interactions, or other transformations can absorb nonlinearities that would otherwise leak into u_i.

Even after taking these steps, important issues remain. For example, there may be reverse causality, which occurs when Y_i feeds back into X_i. Additionally, there may be control variables with a dual role that act as both confounders and mediators/colliders simultaneously.

Nothing in (A1) – nor in the additional assumptions (A2)–(A4) about i.i.d. sampling, finite moments, and full rank – guarantees that \beta_j is causal. It represents only a conditional correlative relationship unless X_{ij} is uncorrelated with all unobserved determinants of Y_i.


8.2 Conditional vs Causal Effects: Price Elasticities

Economists often want causal price effects, not merely conditional associations. Consider the following structural system in a competitive market written in logs so that slopes are elasticities:

\begin{align*} \text{Demand:}&\quad \log(Q_i) = \beta_1 + \beta_2 \log(P_i) + u_i,\\ \text{Supply (pricing rule):}&\quad \log(P_i)=\gamma_1 + \gamma_2 \log(C_i) + \gamma_3 u_i + \eta_i. \end{align*} We have \beta_2<0 by theory.

  • Index i denotes a market (e.g., city or store) observed at a single point in time; the data are cross‑sectional and i.i.d.
  • Q_i is the total quantity demanded in market i.
  • P_i is price.
  • C_i is the exogenous wholesale cost of the product.
  • u_i captures consumers’ taste shocks unobserved by the econometrician (though retailers may infer them and respond when setting prices); \eta_i captures supply‑side shocks.

Because higher demand (large u_i) in a particular store leads retailers to charge higher prices (\gamma_3>0), we have Cov(\log(P_i), u_i) > 0. Hence, (A1) is violated in the demand equation.
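The sign of this covariance can be checked directly from the pricing rule. As a short sketch, assume (this is an added assumption, in line with C_i being exogenous) that u_i is uncorrelated with both \log(C_i) and \eta_i. Then \begin{align*} Cov(\log(P_i), u_i) &= Cov(\gamma_1 + \gamma_2 \log(C_i) + \gamma_3 u_i + \eta_i,\ u_i) \\ &= \gamma_2\, Cov(\log(C_i), u_i) + \gamma_3\, Var(u_i) + Cov(\eta_i, u_i) \\ &= \gamma_3\, Var(u_i), \end{align*} which is strictly positive whenever \gamma_3>0 and Var(u_i)>0.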

Suppose a researcher estimates \begin{align*} \log(Q_i) &= \alpha_1 + \alpha_2 \log(P_i) + \varepsilon_i \\ \text{or}\quad \log(Q_i) &= \theta_1 + \theta_2 \log(P_i) + \theta_3 \log(C_i) + v_i. \end{align*} Both regressions (one simple, one controlling for wholesale cost) deliver conditional marginal effects, \alpha_2 and \theta_2 respectively. They answer:

“Among markets with the same wholesale cost (and any other included controls), how does observed quantity co‑move with observed price?”

But the policy‑relevant question is different:

“By how much would quantity fall if we exogenously raised price – say, via a 1\% tax – holding everything else constant?”

That causal elasticity is \beta_2. Because P_i responds to u_i, the OLS estimates suffer from simultaneity bias, and both \alpha_2 and \theta_2 generally differ from \beta_2.
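A small simulation can make the gap between \beta_2 and the OLS slopes concrete. The sketch below draws data from the demand and pricing equations above. The parameter values, the normal distributions, the sample size and the function name `simulate_demand_ols` are illustrative assumptions, not part of the model.

```python
import numpy as np

def simulate_demand_ols(n=100_000, beta1=2.0, beta2=-1.5,
                        gamma1=0.5, gamma2=0.8, gamma3=0.6, seed=0):
    """Simulate the demand/pricing system and return the two OLS price slopes.

    All parameter values and distributions are assumptions chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    log_c = rng.normal(0.0, 1.0, n)   # exogenous wholesale cost
    u = rng.normal(0.0, 1.0, n)       # demand (taste) shock
    eta = rng.normal(0.0, 0.5, n)     # supply-side shock

    # Pricing rule: price responds to the demand shock through gamma3
    log_p = gamma1 + gamma2 * log_c + gamma3 * u + eta
    # Structural demand equation with true elasticity beta2
    log_q = beta1 + beta2 * log_p + u

    # Simple regression: log Q on a constant and log P
    X_short = np.column_stack([np.ones(n), log_p])
    alpha = np.linalg.lstsq(X_short, log_q, rcond=None)[0]

    # Regression that additionally controls for log C
    X_long = np.column_stack([np.ones(n), log_p, log_c])
    theta = np.linalg.lstsq(X_long, log_q, rcond=None)[0]
    return alpha[1], theta[1]

alpha2_hat, theta2_hat = simulate_demand_ols()
print(f"true beta2 = -1.5, alpha2_hat = {alpha2_hat:.3f}, theta2_hat = {theta2_hat:.3f}")
```

Under these assumed values both slopes are less negative than \beta_2 = -1.5. The population values are about -1.02 without the cost control and about -0.52 with it. Controlling for \log(C_i) does not help here because it removes exogenous price variation and leaves the variation driven by u_i.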

Endogeneity arises because we want the parameter to be causal, not because the regression is mechanically misspecified. Even if the conditional mean of \log(Q_i) given \log(P_i) is exactly linear, the coefficient we care about is the structural \beta_2, and the error that belongs to it, u_i, satisfies Cov(\log(P_i),u_i)\neq 0.


8.3 Measurement Error

Another important source of endogeneity arises from measurement error. Suppose we consider the structural model:

Y_i^0 = \beta_1 + \beta_2 X_i^0 + u_i^0, \quad i = 1, \dots, n, \quad u_i^0 \sim \text{i.i.d.}(0, \sigma^2),

but we do not observe the latent variables Y_i^0 and X_i^0 directly. Instead, we observe:

Y_i = Y_i^0 + \eta_{i}, \quad X_i = X_i^0 + \zeta_{i},

where \eta_{i} \sim \text{i.i.d.} (0, \sigma_\eta^2) and \zeta_{i} \sim \text{i.i.d.} (0, \sigma_\zeta^2) denote classical measurement errors that are assumed independent of each other and of X_i^0, Y_i^0, and u_i^0.

Plugging the observed variables into the structural equation yields:

Y_i - \eta_{i} = \beta_1 + \beta_2 (X_i - \zeta_{i}) + u_i^0,

which can be rearranged as:

Y_i = \beta_1 + \beta_2 X_i + \underbrace{(u_i^0 + \eta_{i} - \beta_2 \zeta_{i})}_{\text{composite error term}}.

The composite error term is problematic:

E[u_i^0 + \eta_{i} - \beta_2 \zeta_{i} \mid X_i] \neq 0,

because X_i contains \zeta_{i}, which also appears in the error term. This violates the exogeneity condition, resulting in a biased and inconsistent OLS estimator. Specifically, the bias tends to attenuate the coefficient estimate \hat{\beta}_2 toward zero (known as attenuation bias). For positive true coefficients, this leads to underestimation; for negative coefficients, overestimation.
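The size of the attenuation can be made explicit. The following is a sketch under the classical assumptions above, assuming in addition that X_i^0 is uncorrelated with u_i^0 and writing \sigma_{X^0}^2 = Var(X_i^0) (a symbol introduced here for illustration). The OLS slope converges in probability to Cov(X_i, Y_i)/Var(X_i), where \begin{align*} Cov(X_i, Y_i) &= Cov(X_i^0 + \zeta_i,\ \beta_1 + \beta_2 X_i^0 + u_i^0 + \eta_i) = \beta_2\, \sigma_{X^0}^2, \\ Var(X_i) &= \sigma_{X^0}^2 + \sigma_\zeta^2, \\ \operatorname{plim}\, \hat{\beta}_2 &= \beta_2 \cdot \frac{\sigma_{X^0}^2}{\sigma_{X^0}^2 + \sigma_\zeta^2}. \end{align*} The ratio lies strictly between 0 and 1 whenever \sigma_\zeta^2 > 0, so the probability limit has the same sign as \beta_2 but is smaller in absolute value.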

By contrast, if only the dependent variable Y_i is measured with error, OLS remains unbiased, although the variance of the error term increases.
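The contrast between measurement error in X_i and in Y_i can be checked numerically. The sketch below simulates the structural model and the classical errors \eta_i and \zeta_i. The normal distributions, unit variances, coefficient values and the helper name `ols_slope` are assumptions made for illustration.

```python
import numpy as np

def ols_slope(x, y):
    """Slope from a simple regression of y on a constant and x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

rng = np.random.default_rng(1)
n = 200_000
beta1, beta2 = 1.0, 2.0            # assumed structural coefficients

x0 = rng.normal(0.0, 1.0, n)       # latent regressor X^0
u0 = rng.normal(0.0, 1.0, n)       # structural error u^0
y0 = beta1 + beta2 * x0 + u0       # latent outcome Y^0
eta = rng.normal(0.0, 1.0, n)      # measurement error in Y
zeta = rng.normal(0.0, 1.0, n)     # measurement error in X

slope_clean = ols_slope(x0, y0)              # no measurement error
slope_error_in_y = ols_slope(x0, y0 + eta)   # only Y mismeasured
slope_error_in_x = ols_slope(x0 + zeta, y0)  # only X mismeasured

print(f"no error:   {slope_clean:.3f}")
print(f"error in Y: {slope_error_in_y:.3f}")
print(f"error in X: {slope_error_in_x:.3f}")
```

With these assumed variances the first two slopes stay close to the true value 2, while the slope with a mismeasured regressor is close to 1. The regressor variance doubles, so the coefficient is roughly halved.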


8.4 Endogeneity as a Violation of (A1)

Formally, a regressor X_{ij} is endogenous if it correlates with the structural error term: Cov(X_{ij},u_i)\neq 0 \quad \Rightarrow \quad E[u_i\mid \boldsymbol X_i]\neq 0. When this happens, OLS estimates remain descriptive but lose their causal interpretation. Whether you care depends on your goal:

| Purpose | Is (A1) needed? | Parameter meaning |
| --- | --- | --- |
| Prediction / description | No. Bias relative to causal truth is irrelevant if forecasting is the aim. | Conditional marginal effect |
| Causal policy evaluation | Yes! You need E[u_i\mid \boldsymbol X_i]=0 in the causal sense, or an alternative identification strategy. | Structural (causal) effect |

8.5 Sources of Endogeneity

Besides the functional-form misspecification that we have already discussed in previous sections, there are four other common sources of endogeneity in practice:

| Mechanism | Typical manifestation |
| --- | --- |
| Omitted‑variable bias | Unobserved ability affects both schooling (X) and wages (Y) |
| Simultaneity / reverse causality | Price and quantity determined jointly in markets |
| Measurement error in X | Measurement error inflates the variance of the regressor, so OLS slopes are biased toward zero (attenuation bias) |
| Dual role controls | A variable (e.g., health) acts as both confounder and mediator/collider |

All four cases yield E[\boldsymbol u|\boldsymbol X]\neq 0 and threaten causal inference.

As a consequence, OLS is biased for the structural parameter: E[\widehat{\boldsymbol \beta}|\boldsymbol X] = \boldsymbol \beta + (\boldsymbol X' \boldsymbol X)^{-1} \boldsymbol X' E[\boldsymbol u|\boldsymbol X] \neq \boldsymbol \beta.
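The bias can be illustrated with the omitted-variable case, using the schooling and ability example. The sketch below is a hypothetical data-generating process. Every number in it, the variable names and the helper `ols` are assumptions chosen for illustration, not estimates from real data.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical data-generating process (all numbers are illustrative assumptions)
ability = rng.normal(0.0, 1.0, n)                        # unobserved by the researcher
schooling = 12.0 + 2.0 * ability + rng.normal(0.0, 1.0, n)
beta_schooling, beta_ability = 0.08, 0.10                # assumed structural effects
log_wage = (1.0 + beta_schooling * schooling
            + beta_ability * ability + rng.normal(0.0, 0.3, n))

def ols(y, *regressors):
    """OLS coefficients of y on a constant and the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]

slope_short = ols(log_wage, schooling)[1]           # ability omitted: it ends up in u
slope_long = ols(log_wage, schooling, ability)[1]   # ability included as a control

print(f"ability omitted:  {slope_short:.3f}")
print(f"ability included: {slope_long:.3f}")
```

In this setup the slope with ability omitted settles near 0.08 + 0.10 \cdot Cov(schooling, ability)/Var(schooling) = 0.08 + 0.10 \cdot 2/5 = 0.12, while the regression that includes ability recovers a value close to the assumed 0.08. In practice ability is unobserved, so the second regression is not available.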