The data aflatoxin includes only 6 observations with 3 variables:
- dose: dose in ppb
- total: number of test animals
- tumor: number with liver cancer
data("aflatoxin")
kable(aflatoxin)
dose | total | tumor |
---|---|---|
0 | 18 | 0 |
1 | 22 | 2 |
5 | 22 | 1 |
15 | 21 | 4 |
50 | 25 | 20 |
100 | 28 | 28 |
For this data, we only could use one predictor which is dose to fit a logistic regression to predict the occurrence of liver cancer. The summary of the logistic regression is:
summary(afl_fit <- glm(cbind(tumor, total-tumor) ~ dose,
data=aflatoxin, family = "binomial"))
##
## Call:
## glm(formula = cbind(tumor, total - tumor) ~ dose, family = "binomial",
## data = aflatoxin)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## -1.2995 0.7959 -0.4814 0.4174 -0.1629 0.3774
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.03604 0.48226 -6.295 3.07e-10 ***
## dose 0.09009 0.01456 6.189 6.04e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 116.524 on 5 degrees of freedom
## Residual deviance: 2.897 on 4 degrees of freedom
## AIC: 17.685
##
## Number of Fisher Scoring iterations: 5
ggplot(aflatoxin, aes(x=dose, y=tumor/total)) +
geom_point() +
geom_smooth(method="glm",method.args = list(family = "binomial"),se=FALSE) +
ylim(0,1) +
labs(x="", y="Prob of tumor")
In glm(), the response to be a two-column matrix where the first column is the number successes (tumor) and the second column is the number of failures (total - tumor). With small p-value, dose is significant to predict the occurrence of liver cancer. The fitted model is:
\[log(odds) = log{\frac{p}{1-p}} = -3.036 + 0.09 x\]
OR
\[odds = e^{-3.036} e^{0.09 x}\]
The exponential coefficients are:
exp(coef(afl_fit))
## (Intercept) dose
## 0.04802486 1.09427143
A unit increase in dose decreases the log-odds of getting tumor by 0.09 or the odds of getting tumor (liver cancer) decrease by 9.4% with each additional ppb in dose level.
We can interpret the relative risk of having liver cancer for a patient getting 1 dose and a patient who does not take any treatment is:
ilogit(sum(coef(afl_fit)*c(1,1)))/ilogit(sum(coef(afl_fit)*c(1,0)))
## [1] 1.089565
The relative risk of 1.0896 is not far different from the odds ratio of 1.094. For low probability outcomes, the relative risk and the odds ratio will be very similar.
Confidence intervals for parameters:
confint(afl_fit)
## 2.5 % 97.5 %
## (Intercept) -4.10594724 -2.1868361
## dose 0.06440816 0.1217981
The deviance is a measure of how well the model fit the data. The model performance could also be evaluated by Chi-square test goodness of fit, the p-value is:
pchisq(deviance(afl_fit),df.residual(afl_fit),lower=FALSE)
## [1] 0.5752128
Since this p-value is well in excess of 0.05, we conclude that this model fits sufficiently well. Of course, this does not mean that this model is correct. Or we can test with the model without dose is just the null model and the difference in degrees of freedom or parameters is one:
pchisq(113.62738,1,lower=FALSE)
## [1] 1.57248e-26
Since this p-value is so small, we are confident that the effect of dose level is statistically significant, which we also see the small p-value of Z-test in the above summary.
With extremely small sample size, it is not helpful to look at residual analysis (Figure 13) since there could be assumption violation or outliers exist, but transformation or outlier removal only would make the model worse.
par(mfrow=c(2,2))
plot(afl_fit)
Another way to access how good the fitted values are, we can compute residuals as a difference between observed and fitted (or predicted) values. Six residuals are:
(res_afl <- residuals(afl_fit, type="response"))
## 1 2 3 4 5 6
## -0.045824161 0.040980707 -0.024616887 0.034002714 -0.012814432 0.002540524
We could compare the fitted values with the observe number of liver cancer incidents:
aflatoxin <- aflatoxin %>%
mutate(prob=afl_fit$y,
prob_fitted=afl_fit$fitted.values,
residuals = res_afl,
tumor_fitted=round(total*prob_fitted))
aflatoxin[,c(3,7)]
## tumor tumor_fitted
## 1 0 1
## 2 2 1
## 3 1 2
## 4 4 3
## 5 20 20
## 6 28 28
When p = 50%, \(log{\frac{p}{1-p}} = 0\), so \(dose = \frac{-\hat{\beta}_0}{\hat{\beta}_1}\), which is:
(ed50 <- -afl_fit$coef[1]/afl_fit$coef[2])
## (Intercept)
## 33.7005
95% confidence intervals for dose:
dr <- c(-1/afl_fit$coef[2],afl_fit$coef[1]/afl_fit$coef[2]^2)
se <- sqrt(dr %*% summary(afl_fit)$cov.un %*% dr)[,]
c(ed50-1.96*se, ed50+1.96*se)
## (Intercept) (Intercept)
## 26.25231 41.14868
We estimate the dose level by equation: \[dose = \frac{-\hat{\beta}_0+log{\frac{p}{1-p}}}{\hat{\beta}_1}\]
With \(p = 1\%\), the dose level is \(\frac{3.036+log{\frac{0.01}{1-0.01}}}{0.09} = -17.3\)
Dose level could not be a negative value, so there is not any dose level could make 1% of the animals will develop liver cancer. By looking at the plot with confidence intervals (Figure 14), we could see that the lowest value of p could be is 1.8%.
afl_x <- data.frame(dose = seq(min(aflatoxin$dose),
max(aflatoxin$dose),length = 100))
afl_pred <- predict(afl_fit, afl_x, type = "link", se.fit = TRUE)
df <- bind_cols(afl_x,afl_pred) %>%
select(-4) %>%
mutate(p_fitted = faraway::ilogit(fit),
p_upper = faraway::ilogit(fit + (1.96* se.fit)),
p_lower = faraway::ilogit(fit - (1.96 * se.fit)))
ggplot(aflatoxin, aes(x = dose, y = tumor/total)) +
geom_ribbon(data = df, aes(ymin = p_lower, ymax =p_upper, x = dose),
fill = "steelblue2", alpha = 0.2, inherit.aes = FALSE) +
geom_line(data = df, aes(y = p_fitted, x = dose)) +
geom_point() +
scale_y_continuous(breaks = c(0.018, 0.25, 0.5, 1)) +
labs(y = "Probability")