GLM links and distributions

Nick Griffiths · Sep 09, 2020

Generalized linear models (GLMs) relate the expected outcome $y$ to a linear combination of the predictors through a link function $\eta$:

$$\eta(y) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$

When we fit the model by maximum likelihood, the predicted mean $y = \eta^{-1}(\beta_0 + \beta_1 x_1 + \dots)$ is one of the two pieces used to calculate the likelihood, $\mathcal{L} = P(y_{obs} \mid y)$. The other piece is the assumed distribution of $y_{obs}$.
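As a minimal sketch of the inverse link step, the snippet below computes the linear predictor and maps it back to the outcome scale under three common links. The coefficient and predictor values are made up for illustration, and the helper names (`linear_predictor`, `inv_logit`) are hypothetical rather than taken from any particular library.

```python
import math

def logit(p):
    """Link: map a probability in (0, 1) to the log-odds scale."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Inverse link: map a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def linear_predictor(betas, xs):
    """beta_0 + beta_1 * x_1 + ...; betas[0] is the intercept."""
    return betas[0] + sum(b * x for b, x in zip(betas[1:], xs))

# Hypothetical coefficients and a single predictor value.
betas = [-1.0, 0.5]
xs = [2.0]

eta = linear_predictor(betas, xs)   # -1.0 + 0.5 * 2.0

y_logit = inv_logit(eta)    # predicted mean under a logit link
y_log = math.exp(eta)       # predicted mean under a log link
y_identity = eta            # predicted mean under an identity link

print(eta, y_logit, y_log, y_identity)
```

The same linear predictor gives a different predicted mean under each link, which is why the link has to be fixed before the likelihood can be evaluated.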

The distribution

When analyzing binary data with a logit link, $y_{obs}$ is naturally given a Bernoulli distribution. Each observation with $y_{obs} = 1$ contributes likelihood $y$, and each observation with $y_{obs} = 0$ contributes $1 - y$ (here $y$ is a probability, not an odds or a logit).
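To make the Bernoulli likelihood concrete, here is a small sketch that sums the log-likelihood contributions over a handful of observations. The observed outcomes and predicted probabilities are invented for the example, and `bernoulli_log_likelihood` is a hypothetical helper name.

```python
import math

def bernoulli_log_likelihood(y_obs, y_pred):
    """Sum of log(y) for observed 1s and log(1 - y) for observed 0s."""
    total = 0.0
    for obs, p in zip(y_obs, y_pred):
        total += math.log(p) if obs == 1 else math.log(1.0 - p)
    return total

# Made-up data: three observations and their predicted probabilities.
y_obs = [1, 0, 1]
y_pred = [0.8, 0.3, 0.6]

log_lik = bernoulli_log_likelihood(y_obs, y_pred)

# On the probability scale this is the product 0.8 * (1 - 0.3) * 0.6.
likelihood = math.exp(log_lik)
print(log_lik, likelihood)
```

Working on the log scale turns the product of per-observation likelihoods into a sum, which is what fitting routines typically maximize.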

Implementations of log-binomial regression (a binomial distribution with a log link) are less robust, and the models often fail to converge. So with a log link people often use a Poisson distribution instead. However, the Poisson variance equals the mean, $y = \eta^{-1}(\beta_0 + \beta_1 x_1 + \dots)$, which overestimates the variance of a Bernoulli outcome, $p(1-p) = y(1-y)$. The model-based Poisson confidence intervals are therefore too wide, and the usual solution is to use a robust variance estimator instead.
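The size of the mismatch can be seen by comparing the two variance functions directly. This sketch only compares the assumed variances at a few illustrative probabilities; it does not fit a model or compute a robust (sandwich) estimate.

```python
def bernoulli_variance(y):
    """Variance of a 0/1 outcome with success probability y."""
    return y * (1.0 - y)

def poisson_variance(y):
    """Variance assumed by a Poisson model with mean y."""
    return y

# Illustrative predicted probabilities.
probabilities = [0.05, 0.3, 0.6]

ratios = []
for y in probabilities:
    ratio = poisson_variance(y) / bernoulli_variance(y)  # equals 1 / (1 - y)
    ratios.append(ratio)
    print(y, bernoulli_variance(y), poisson_variance(y), ratio)
```

The ratio is $1/(1-y)$, so the Poisson assumption is close to the Bernoulli variance for rare outcomes and increasingly too large as the outcome becomes common.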

Offsets

When modeling count data, an offset term is used to account for different amounts of exposure, such as different follow-up times. For example, we can model the number of hypoglycemic events over time like so:

$$y = \lambda t$$

$$\log(y) = \beta_0 + \beta_1 x_1 + \dots + \log(t)$$

Here $y$ is the expected number of events, $\lambda$ is the event rate, and $t$ is the time the person was followed. The outcome is a count variable, which can be modeled with a Poisson distribution. With the log link, predicted counts are always positive.
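A short sketch of how the offset enters the prediction: the log of the follow-up time is added to the linear predictor with its coefficient fixed at 1, so the expected count scales in proportion to time. The coefficients, predictor value, and follow-up times are hypothetical, as is the helper name `expected_count`.

```python
import math

def expected_count(betas, xs, t):
    """exp(beta_0 + beta_1 * x_1 + ... + log(t)), with betas[0] the intercept."""
    eta = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))
    return math.exp(eta + math.log(t))  # log(t) is the offset

# Hypothetical coefficients and one predictor value.
betas = [-2.0, 0.7]
xs = [1.0]

count_1yr = expected_count(betas, xs, t=1.0)
count_2yr = expected_count(betas, xs, t=2.0)

# Doubling follow-up time doubles the expected count; the rate is unchanged.
print(count_1yr, count_2yr, count_2yr / 2.0)
```

Because the exponential of any real number is positive, the predicted count stays positive whatever values the coefficients take.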

Interpreting parameters

If the outcome is binary, then each $y_{obs}$ is 0 or 1 and $E(y_{obs})$ is a probability. When interpreting the coefficients, an identity link corresponds to a linear relationship between the probability and the predictor, so the coefficient is a risk difference (though this link is rarely used). A log link corresponds to a log relative risk: $\beta = \log(y_1) - \log(y_2) = \log(y_1 / y_2)$, where $y_1$ and $y_2$ are the predicted probabilities at predictor values one unit apart. A logit link corresponds to a log odds ratio.
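The three interpretations can be compared on a single pair of probabilities. In the sketch below, the two risks are invented numbers standing in for the predicted probabilities at predictor values one unit apart.

```python
import math

# Hypothetical predicted probabilities one unit apart in the predictor.
y1 = 0.2
y2 = 0.1

# Identity link: the coefficient is a risk difference.
beta_identity = y1 - y2

# Log link: the coefficient is a log relative risk.
beta_log = math.log(y1) - math.log(y2)
relative_risk = math.exp(beta_log)

# Logit link: the coefficient is a log odds ratio.
beta_logit = math.log(y1 / (1 - y1)) - math.log(y2 / (1 - y2))
odds_ratio = math.exp(beta_logit)

print(beta_identity, relative_risk, odds_ratio)
```

With these numbers the relative risk is 2 while the odds ratio is 2.25, which illustrates that the two measures differ unless the outcome is rare.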

With a time offset, $\beta_0 + \beta_1 x_1 + \dots$ equals $\log(y/t)$, the log of the event rate. So the coefficients in Poisson regression with an offset are log rate ratios, not log risk ratios.
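To spell out the rate-ratio reading, consider two people, labeled $a$ and $b$ here purely for illustration, who differ by one unit in $x_1$ and are otherwise identical. Their follow-up times $t_a$ and $t_b$ may differ:

$$
\begin{aligned}
\log(y_a / t_a) &= \beta_0 + \beta_1 (x_1 + 1) + \dots \\
\log(y_b / t_b) &= \beta_0 + \beta_1 x_1 + \dots \\
\beta_1 &= \log(y_a / t_a) - \log(y_b / t_b) = \log\left(\frac{y_a / t_a}{y_b / t_b}\right)
\end{aligned}
$$

The follow-up times appear only inside the rates $y/t$, so $\exp(\beta_1)$ compares events per unit time rather than the probability of an event.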