Logistic Regression

Aflatoxin B1 was fed to lab animals at various doses and the number responding with liver cancer recorded.

1. Build a model to predict the occurrence of liver cancer.

The data aflatoxin includes only 6 observations with 3 variables:
- dose: dose in ppb
- total: number of test animals
- tumor: number with liver cancer

data("aflatoxin")
kable(aflatoxin)

dose	total	tumor
0	18	0
1	22	2
5	22	1
15	21	4
50	25	20
100	28	28

For this data, we only could use one predictor which is dose to fit a logistic regression to predict the occurrence of liver cancer. The summary of the logistic regression is:

summary(afl_fit <- glm(cbind(tumor, total-tumor) ~ dose, 
                       data=aflatoxin, family = "binomial"))

## 
## Call:
## glm(formula = cbind(tumor, total - tumor) ~ dose, family = "binomial", 
##     data = aflatoxin)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6  
## -1.2995   0.7959  -0.4814   0.4174  -0.1629   0.3774  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.03604    0.48226  -6.295 3.07e-10 ***
## dose         0.09009    0.01456   6.189 6.04e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 116.524  on 5  degrees of freedom
## Residual deviance:   2.897  on 4  degrees of freedom
## AIC: 17.685
## 
## Number of Fisher Scoring iterations: 5

ggplot(aflatoxin, aes(x=dose, y=tumor/total)) + 
  geom_point() + 
  geom_smooth(method="glm",method.args = list(family = "binomial"),se=FALSE) + 
  ylim(0,1) +
  labs(x="", y="Prob of tumor")

Dose vs. Probability of getting tumor with fitted curve of logistic regression

In glm(), the response to be a two-column matrix where the first column is the number successes (tumor) and the second column is the number of failures (total - tumor). With small p-value, dose is significant to predict the occurrence of liver cancer. The fitted model is:

\[log(odds) = log{\frac{p}{1-p}} = -3.036 + 0.09 x\]

\[odds = e^{-3.036} e^{0.09 x}\]

The exponential coefficients are:

exp(coef(afl_fit))

## (Intercept)        dose 
##  0.04802486  1.09427143

A unit increase in dose decreases the log-odds of getting tumor by 0.09 or the odds of getting tumor (liver cancer) decrease by 9.4% with each additional ppb in dose level.

We can interpret the relative risk of having liver cancer for a patient getting 1 dose and a patient who does not take any treatment is:

ilogit(sum(coef(afl_fit)*c(1,1)))/ilogit(sum(coef(afl_fit)*c(1,0)))

## [1] 1.089565

The relative risk of 1.0896 is not far different from the odds ratio of 1.094. For low probability outcomes, the relative risk and the odds ratio will be very similar.

Confidence intervals for parameters:

confint(afl_fit)

##                   2.5 %     97.5 %
## (Intercept) -4.10594724 -2.1868361
## dose         0.06440816  0.1217981

The deviance is a measure of how well the model fit the data. The model performance could also be evaluated by Chi-square test goodness of fit, the p-value is:

pchisq(deviance(afl_fit),df.residual(afl_fit),lower=FALSE)

## [1] 0.5752128

Since this p-value is well in excess of 0.05, we conclude that this model fits sufficiently well. Of course, this does not mean that this model is correct. Or we can test with the model without dose is just the null model and the difference in degrees of freedom or parameters is one:

pchisq(113.62738,1,lower=FALSE)

## [1] 1.57248e-26

Since this p-value is so small, we are confident that the effect of dose level is statistically significant, which we also see the small p-value of Z-test in the above summary.

Diagnostics:

With extremely small sample size, it is not helpful to look at residual analysis (Figure 13) since there could be assumption violation or outliers exist, but transformation or outlier removal only would make the model worse.

par(mfrow=c(2,2))
plot(afl_fit)

Residual Diagnostics

Another way to access how good the fitted values are, we can compute residuals as a difference between observed and fitted (or predicted) values. Six residuals are:

(res_afl <- residuals(afl_fit, type="response"))

##            1            2            3            4            5            6 
## -0.045824161  0.040980707 -0.024616887  0.034002714 -0.012814432  0.002540524

We could compare the fitted values with the observe number of liver cancer incidents:

aflatoxin <- aflatoxin %>% 
  mutate(prob=afl_fit$y, 
         prob_fitted=afl_fit$fitted.values, 
         residuals = res_afl,
         tumor_fitted=round(total*prob_fitted))

aflatoxin[,c(3,7)]

##   tumor tumor_fitted
## 1     0            1
## 2     2            1
## 3     1            2
## 4     4            3
## 5    20           20
## 6    28           28

2. Estimate the ED50, the dose level at which 50% of lab animals exposed to the that dose level will develop liver cancer.

When p = 50%, \(log{\frac{p}{1-p}} = 0\), so \(dose = \frac{-\hat{\beta}_0}{\hat{\beta}_1}\), which is:

(ed50 <- -afl_fit$coef[1]/afl_fit$coef[2])

## (Intercept) 
##     33.7005

95% confidence intervals for dose:

dr <- c(-1/afl_fit$coef[2],afl_fit$coef[1]/afl_fit$coef[2]^2)
se <- sqrt(dr %*% summary(afl_fit)$cov.un %*% dr)[,]
c(ed50-1.96*se, ed50+1.96*se)

## (Intercept) (Intercept) 
##    26.25231    41.14868

3. Estimate the dose level at which only 1% of the animals will develop liver cancer.

We estimate the dose level by equation: \[dose = \frac{-\hat{\beta}_0+log{\frac{p}{1-p}}}{\hat{\beta}_1}\]

With \(p = 1\%\), the dose level is \(\frac{3.036+log{\frac{0.01}{1-0.01}}}{0.09} = -17.3\)
Dose level could not be a negative value, so there is not any dose level could make 1% of the animals will develop liver cancer. By looking at the plot with confidence intervals (Figure 14), we could see that the lowest value of p could be is 1.8%.

afl_x <- data.frame(dose = seq(min(aflatoxin$dose), 
                               max(aflatoxin$dose),length = 100))
afl_pred <- predict(afl_fit, afl_x, type = "link", se.fit = TRUE)
df <- bind_cols(afl_x,afl_pred) %>% 
  select(-4) %>% 
  mutate(p_fitted = faraway::ilogit(fit), 
         p_upper = faraway::ilogit(fit + (1.96* se.fit)),
         p_lower = faraway::ilogit(fit - (1.96 * se.fit)))
ggplot(aflatoxin, aes(x = dose, y = tumor/total)) +
  geom_ribbon(data = df, aes(ymin = p_lower, ymax =p_upper, x = dose),
                fill = "steelblue2", alpha = 0.2, inherit.aes = FALSE) +
  geom_line(data = df, aes(y = p_fitted, x = dose)) +
  geom_point() +
  scale_y_continuous(breaks = c(0.018, 0.25, 0.5, 1)) +
  labs(y = "Probability")

Logistic regression with confidence bands

Logistic Regression

Ngan Pham

Aflatoxin B1 was fed to lab animals at various doses and the number responding with liver cancer recorded.

1. Build a model to predict the occurrence of liver cancer.

2. Estimate the ED50, the dose level at which 50% of lab animals exposed to the that dose level will develop liver cancer.

3. Estimate the dose level at which only 1% of the animals will develop liver cancer.