Regression cheat sheet: general topics

Designing predictor terms

In the linear regression model $Y = β_{0} + β_{1} X_{1} + \dots$, each coefficient $β_{i}$ relates its predictor $X_{i}$ to $Y$.

Predictor variables $X_{i}$ can take any form:

Continuous: $β$ captures change in $Y$ per unit change in $X$
Indicator ($X \in \{0, 1\}$): $β$ is the difference in mean $Y$ between the two categories
- Make dummy variables for categorical predictors
Interaction ($X_{1} X_{2}$): captures modification of the effect of one variable by another
Polynomial ($X^{a}$): allows for nonlinear effects
- Natural cubic splines are a common flexible choice
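As an illustration of the term types listed above, here is a minimal sketch that builds a design matrix by hand with NumPy. The variable names (`age`, `group`), the choice of the first level as the reference category, and the quadratic term are all assumptions made for the example, not part of the original notes.

```python
import numpy as np

def design_matrix(age, group, levels):
    """Columns: intercept, age, age^2, group dummies, age-by-dummy interactions."""
    age = np.asarray(age, dtype=float)
    group = np.asarray(group)
    # Dummy coding: the first level is the reference category and gets no column.
    dummies = np.column_stack([(group == lev).astype(float) for lev in levels[1:]])
    # Interaction terms: product of the continuous predictor with each dummy.
    interactions = dummies * age[:, None]
    # Polynomial term: age squared allows a curved relationship.
    return np.column_stack([np.ones_like(age), age, age**2, dummies, interactions])

# Hypothetical data: four subjects in three categories.
age = [30, 45, 60, 50]
group = ["a", "b", "c", "b"]
X = design_matrix(age, group, levels=["a", "b", "c"])
print(X.shape)
```

A categorical predictor with three levels yields two dummy columns, so each dummy coefficient is a contrast against the reference level.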

Regression goodness of fit

Ways to test goodness of fit:

Tests of the global null hypothesis, which is that the intercept-only model fits as well as the full model
- The intercept-only model is nested within the full model, so a likelihood ratio test can be used
- Wald and score tests are also available but are less reliable, particularly in small samples
Information criteria
- These balance underfitting against overfitting: an additional parameter lowers an information criterion only when the improvement in fit outweighs the added complexity penalty
Explained variance
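The three approaches above can be sketched for an ordinary (Gaussian) linear model using only NumPy and the standard library. The simulated data, the function name `gaussian_fit_stats`, and the use of AIC and BIC as the information criteria are assumptions for illustration. The p-value shortcut is valid only for a test with one degree of freedom.

```python
import math
import numpy as np

def gaussian_fit_stats(X, y):
    """Least-squares fit with maximized Gaussian log-likelihood, AIC, BIC, and RSS."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    sigma2 = rss / n  # maximum likelihood estimate of the error variance
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    k = p + 1  # regression coefficients plus the variance parameter
    return {"loglik": loglik, "aic": 2 * k - 2 * loglik,
            "bic": k * math.log(n) - 2 * loglik, "rss": rss}

# Simulated example data (assumed for illustration).
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

null = gaussian_fit_stats(np.ones((n, 1)), y)                # intercept-only model
full = gaussian_fit_stats(np.column_stack([np.ones(n), x]), y)

# Likelihood ratio test of the global null (one extra parameter, so 1 df).
lr_stat = 2 * (full["loglik"] - null["loglik"])
p_value = math.erfc(math.sqrt(lr_stat / 2))  # P(chi-square with 1 df > lr_stat)

# Explained variance: R^2 relative to the intercept-only model.
r_squared = 1 - full["rss"] / null["rss"]
print(lr_stat, p_value, full["aic"], null["aic"], r_squared)
```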

Regression diagnostics

Detecting outliers using residuals:

Pearson residuals
- Sum of squared Pearson residuals is the Pearson $χ^{2}$ statistic
Deviance residuals
- Sum of squared deviance residuals is the deviance of the model
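A small sketch of both residual types, assuming a Poisson model whose fitted means `mu` are already available. The Poisson family, the toy counts, and the function name are assumptions chosen for the example; other GLM families use different variance and deviance formulas.

```python
import numpy as np

def poisson_residuals(y, mu):
    """Pearson and deviance residuals for a Poisson model with fitted means mu."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Pearson residual: raw residual scaled by the model standard deviation.
    pearson = (y - mu) / np.sqrt(mu)
    # Unit deviance; the term y * log(y / mu) is taken as 0 when y == 0.
    ratio = np.where(y > 0, y / mu, 1.0)
    unit_dev = 2 * (y * np.log(ratio) - (y - mu))
    # Deviance residual: signed square root of the unit deviance.
    deviance = np.sign(y - mu) * np.sqrt(np.maximum(unit_dev, 0.0))
    return pearson, deviance

# Hypothetical observed counts and fitted means.
y = np.array([0, 2, 3, 10])
mu = np.array([1.0, 2.0, 4.0, 5.0])
pearson, dev = poisson_residuals(y, mu)

chi2 = np.sum(pearson**2)        # Pearson chi-square statistic
model_deviance = np.sum(dev**2)  # deviance of the model
print(chi2, model_deviance)
```

Observations with a large residual of either type in absolute value are candidates for outliers.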

Detecting influential points:

Leverage
- How far an observation's predictor values lie from those of the other observations
Dfbeta
- Change in each parameter estimate when an observation is removed
Cook’s distance
- Overall influence of an observation on the fitted model
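For ordinary least squares, all three influence measures have closed forms based on the hat matrix. The sketch below assumes a linear model and reports unscaled dfbeta values; the toy data, with one extreme predictor value, are invented for illustration. Generalized linear models use weighted analogues of these formulas.

```python
import numpy as np

def influence_measures(X, y):
    """Leverage, Cook's distance, and (unscaled) dfbeta for an OLS fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # Leverage: diagonal of the hat matrix X (X'X)^-1 X'.
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    s2 = resid @ resid / (n - p)  # residual variance estimate
    # Cook's distance combines residual size and leverage.
    cooks = resid**2 * leverage / (p * s2 * (1 - leverage) ** 2)
    # dfbeta row i: full-data estimate minus the estimate without observation i.
    dfbeta = (X @ xtx_inv) * (resid / (1 - leverage))[:, None]
    return leverage, cooks, dfbeta

# Hypothetical data: the last point sits far from the others in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 20.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 10.0])
X = np.column_stack([np.ones_like(x), x])
leverage, cooks, dfbeta = influence_measures(X, y)
print(leverage, cooks)
```

The leverages sum to the number of parameters, so a value well above the average (parameters divided by sample size) flags an unusual predictor pattern.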

GEE

Generalized estimating equations (GEE) allow valid estimation when there are clusters of correlated observations. The method involves specifying a working covariance (correlation) structure for the set of repeated outcome observations $y_{i}$ within each cluster.

Ordinary regression assumes independence. With three measurements per subject/cluster, the independence correlation matrix is:

$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$

With GEE we can instead specify any working covariance structure for the outcomes. With robust (sandwich) standard errors, the coefficient estimates remain consistent and the standard errors remain valid even if that covariance structure is wrong, provided the mean model is correct. GEE is also an option to account for overdispersion in Poisson regression.
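To make the robust standard error idea concrete, here is a sketch of the simplest special case: a linear model with an independence working correlation, where the GEE point estimates coincide with ordinary least squares and only the standard errors change. This is not a full GEE implementation. The function name, the simulated clustered data, and the omission of any small-sample correction are assumptions for illustration.

```python
import numpy as np

def cluster_robust_ols(X, y, cluster_ids):
    """OLS estimates with cluster-robust (sandwich) standard errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    # "Meat": sum over clusters of the outer product of cluster score vectors,
    # which lets residuals within a cluster be correlated.
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster_ids):
        idx = cluster_ids == g
        score = X[idx].T @ resid[idx]
        meat += np.outer(score, score)
    cov = bread @ meat @ bread  # sandwich covariance estimate
    return beta, np.sqrt(np.diag(cov))

# Simulated data: 50 clusters of 3 measurements sharing a cluster effect.
rng = np.random.default_rng(1)
n_clusters, per_cluster = 50, 3
cluster_ids = np.repeat(np.arange(n_clusters), per_cluster)
cluster_effect = np.repeat(rng.normal(size=n_clusters), per_cluster)
x = rng.normal(size=n_clusters * per_cluster)
y = 1.0 + 0.5 * x + cluster_effect + rng.normal(size=x.size)
X = np.column_stack([np.ones_like(x), x])

beta, robust_se = cluster_robust_ols(X, y, cluster_ids)
print(beta, robust_se)
```

Choosing a working structure closer to the truth (for example exchangeable) would not change the validity of the robust standard errors, but it can improve efficiency.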

Parameter interpretation

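One way to compare parameter estimates is to check the standardized beta, which rescales each estimate by the standard deviation of its predictor (and, for a continuous outcome, of the outcome) so that effects measured in different units are comparable.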

Profile likelihood confidence intervals can be more accurate than Wald confidence intervals with small sample sizes.
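The small-sample difference between the two interval types can be seen in a deliberately simple case: a single binomial proportion, where there are no nuisance parameters and the profile likelihood is just the likelihood. The counts (2 successes in 10 trials), the helper names, and the bisection solver are assumptions for this sketch; in a regression the same idea is applied to each coefficient while re-maximizing over the others.

```python
import math

CHI2_1DF_95 = 3.841459  # 95th percentile of the chi-square distribution, 1 df

def loglik(p, x, n):
    """Binomial log-likelihood up to an additive constant."""
    return x * math.log(p) + (n - x) * math.log(1 - p)

def wald_ci(x, n, z=1.959964):
    """Wald interval: estimate plus or minus z standard errors."""
    p_hat = x / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def likelihood_ci(x, n, tol=1e-10):
    """Likelihood-based interval; requires 0 < x < n."""
    p_hat = x / n

    def g(p):
        # Likelihood ratio statistic minus the chi-square cutoff.
        return 2 * (loglik(p_hat, x, n) - loglik(p, x, n)) - CHI2_1DF_95

    def bisect(lo, hi):
        # g has opposite signs at lo and hi; shrink the bracket around the root.
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if (g(mid) > 0) == (g(lo) > 0):
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    return bisect(1e-12, p_hat), bisect(p_hat, 1 - 1e-12)

wald = wald_ci(2, 10)
lik = likelihood_ci(2, 10)
print(wald, lik)
```

In this example the Wald interval extends below zero, which is impossible for a proportion, while the likelihood-based interval stays inside (0, 1) and is asymmetric around the estimate.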

Multiple imputation

For multiple imputation to work, data must be either missing completely at random (MCAR), or missing at random (MAR) with the relevant predictors available to use for imputation.

Procedure:

Fit a regression with the variable that has missing values as the outcome, using the cases where it is observed
Impute predicted values using $x = x_{\text{pred}} + ϵ$, where $ϵ$ is a random normal term with the same variance as the regression residuals
Create multiple imputed datasets (more than 5)
Fit the analysis model to each dataset and estimate parameters by pooling the results

Note: If multiple variables have missing values, they can be modeled jointly with a multivariate normal distribution that is sampled using MCMC. If the pattern of missingness is monotone (as for unfinished surveys), the values can be imputed sequentially instead.
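The procedure for a single variable with missing values can be sketched as follows. Everything here is illustrative: the simulated data, the function names, the choice of 10 imputations, and the use of both the auxiliary variable `z` and the outcome `y` as imputation predictors are assumptions. Pooling uses Rubin's rules. A fully proper implementation would also draw the imputation-model coefficients from their sampling distribution, which this simplified sketch omits.

```python
import numpy as np

def impute_once(x, predictors, rng):
    """Regress x on predictors (complete cases), fill gaps with prediction + noise."""
    miss = np.isnan(x)
    Z = np.column_stack([np.ones(len(x)), predictors])
    coef, *_ = np.linalg.lstsq(Z[~miss], x[~miss], rcond=None)
    resid = x[~miss] - Z[~miss] @ coef
    resid_sd = np.sqrt(resid @ resid / ((~miss).sum() - Z.shape[1]))
    filled = x.copy()
    # x = x_pred + eps, with eps drawn at the residual standard deviation.
    filled[miss] = Z[miss] @ coef + rng.normal(0.0, resid_sd, size=int(miss.sum()))
    return filled

def ols_slope(x, y):
    """Slope of y on x and its squared standard error."""
    X = np.column_stack([np.ones(len(x)), x])
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)
    return beta[1], s2 * xtx_inv[1, 1]

def pool(estimates, variances):
    """Rubin's rules: pooled estimate and standard error across imputations."""
    m = len(estimates)
    within = np.mean(variances)          # average within-imputation variance
    between = np.var(estimates, ddof=1)  # between-imputation variance
    return np.mean(estimates), np.sqrt(within + (1 + 1 / m) * between)

# Simulated data with about 30% of x missing completely at random.
rng = np.random.default_rng(42)
n = 300
z = rng.normal(size=n)
x_true = 0.8 * z + rng.normal(size=n)
y = 1.0 + 2.0 * x_true + rng.normal(size=n)
x_obs = x_true.copy()
x_obs[rng.random(n) < 0.3] = np.nan

m = 10  # number of imputed datasets
fits = [ols_slope(impute_once(x_obs, np.column_stack([z, y]), rng), y) for _ in range(m)]
estimates = np.array([f[0] for f in fits])
variances = np.array([f[1] for f in fits])
pooled_slope, pooled_se = pool(estimates, variances)
print(pooled_slope, pooled_se)
```

The pooled standard error is larger than the average per-dataset standard error because it also carries the between-imputation variability, which reflects the uncertainty due to missingness.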
