Markov Chain Monte Carlo (MCMC) Simulations

Based on the talk “From Simulation to Markov Chain Monte Carlo Methods” of Professor Lih Y Deng - Department of Mathematical Sciences, The University of Memphis

Outline:

Simulated Annealing
Markov Chain Monte Carlo
- Rejection method and Metropolis algorithm
- Metropolis-Hastings Method
- Application
Extension to d-dimensions
- Gibbs sampler
- Slice sampler

Simulated annealing

Given that generating direct simulations from the target function H is often a major difficulty, a different stochastic approach to the maximization of h is to explore the surface of h in a local manner (that is, by defining a sequence $\{\theta_j\}_j$j by moving from $\theta_j$ to $\theta_{j+1}$ in a dependent step)
The dependence of $\theta_{j+1}$ on $\theta_j$ is often chosen to be linear: \[\theta_{j+1}=\theta_j+\epsilon_j\]

+This method simulates the local perturbation of the current value ($\epsilon_j$) by defining a sequence $\{\pi_t\}$ of densities whose argmax are confounded with $argmax(h)$. Each $\theta_t$ in the sequence is then simulated from the density $\pi_t$ according to a specific update mechanism.

The most standard choice is based on the Boltzman-Gibbs transform of $h$: \[\pi_t(\theta) \propto exp(h(\theta)/T_t\] where the sequence of temperatures $\{T_t\}$ is decreasing.
A metal manufactured by a slow decrease of temperature (annealing) is stronger than a metal manufactured by a fast decrease of temperature.
The fundamental idea of simulated annealing methods is that a change of scale, called temperature, allows for faster moves on the surface of the function h to maximize.
Alogrithm:
Starting from $\theta_0$, $\zeta$ is generated from Uniform distribution in a neighborhood of $\theta_0$.
The new value, $\theta_1$, is generated as: \[ \theta = \begin{cases} \zeta & \text{with probability $\rho=exp(\Delta{h}/T \land 1)$} \\ \theta_0 & \text{with probability $1-\rho$} \end{cases} \]

where

\[\begin{align*} &\Delta{h} = h(\zeta) - h(\theta_0) \\ &\text{If $h(\zeta) \ge h(\theta_0)$, $\zeta$ is accepted} \\ &\text{If $h(\zeta) < h(\theta_0)$, $\zeta$ may still be accepted with $\rho > 0$} \\ &\text{This allows escape from local maxima} \end{align*}\]

The algorithm rendering of the simulated annealing:
At iteration i,

Simulate $\zeta$ from an instrumental distribution with density $g(|\zeta-\theta_i|)$,
Accept $\theta_{i+1}=\zeta$ with probability $\rho_i=exp\{\Delta{h_i}/T_i\} \land 1$}
take $\theta_{i+1}=\theta_i$ otherwise
Update $T_i$ to $T_{i+1}$

Example:

Recall the function $h(x)=[\cos(50x)+\sin(20x)]^2$.
The sepcific algorithm we use: Starting at iteration i, the iteration is at $(x^{(i)}, h_x^{(i)})$:

Simulate $u \sim U(a_i,b_i)$ where $a_i=max(x^{(i)}-r,0)$ and $b_i=min(x^{(i)}+r,1)$
Accept $x^{(i+1)}=u$ with probability $p^{(i)} = min \left \{ exp\left (\frac{h(u)-h(x^{(i)})}{T_i},1 \right) \right\}$, take $x^{(i+1)}=x^{(i)}$ otherwise.
Update $T_i$ to $T_{i+1}$
where r controls the size of the interval around the corrent point (staying in (0,1)) and $T_i$ controls the cooling.

R implementation:

#From 5.3
h <- function(x){(cos(50*x)+sin(20*x))^2} 
plot(function(x)h(x), xlim=c(0,1),ylim=c(0,4),lwd=2,main='function h(x)')

optimize(h, c(0, 1), tol = 0.0001, maximum=TRUE)

## $maximum
## [1] 0.3791249
## 
## $objective
## [1] 3.832543

nsim <- 2500
u <- runif(nsim)
#Simulated annealing
xval <- array(0,c(nsim,1))
r <- .5
for(i in 2:nsim){
  #Generate u from U(a_t, b_t)
  test <- runif(1, min=max(xval[i-1]-r,0),max=min(xval[i-1]+r,1))
  #Calculate probability
  delta <- h(test)-h(xval[i-1])
  rho <- min(exp(delta*log(i)/1),1)
  xval[i] <- test*(u[i]<rho)+xval[i-1]*(u[i]>rho)
}
h(xval[nsim])

## [1] 3.803225

plot(xval,h(xval),type="l",lwd=2,main='Simulated Annealing Trajectory')

Recall the function:

\[h(x,y)=(xsin(20y)+ysin(20x))^2cosh(sin(10x)x)+(xcos(10y)-ysin(10x))^2cosh(cos(20y)y)\] * We will apply simulated Annealing with different choices of $T_i$: + Results dependent on choice of $T_i$ + $T_i \propto 1/log(i+1)$ + $g \sim U(-0.1, 0.1)$ + Starting point is (0.5, 0.4)

For $T_i = 100/log(i+1)$

h <- function(x,y){(x*sin(20*y)+y*sin(20*x))^2 * cosh(sin(10*x)*x) + (x*cos(10*y)-y*sin(10*x))^2  *cosh(cos(20*y)*y)}

Nsim <- 5000
XY <- matrix(data = NA, nrow = Nsim, ncol=2)
#Starting point
XY[1,1] <- .5
XY[1,2] <- .4
m <-  0
for(i in 2:Nsim) {
  T_i <- 100/log(i+1)
  U <- runif(1, max(XY[i-1, 1]-.1, -2), min(XY[i-1, 1]+.1, 2))
  V <- runif(1, max(XY[i-1, 2]-.1,-2), min(XY[i-1, 2]+.1, 2))
  rho <- min(exp(-(h(U,V)-h(XY[i-1,1], XY[i-1,2]))/T_i), 1)
  
  if (rho>runif(1)) x <- U; y <- V; m <- m+1
  XY[i,1] <- x
  XY[i,2] <- y
}

plot(XY[,1], XY[,2], type='l', lwd=.5, main='Ti=100/log(i+1)')

Markov Chain Monte Carlo

Markov Chain Monte Carlo is the approach using Markov Chain processes to learn about a distribution by random sampling especially in high-dimension. The interest is the estimation of the expected value of a function which is hard to generate: $E[h(X)]$ where $X \sim p(x)$ and $h(X)$ is a function of $X$.
To perform a MCMC, we generate the sequence $X_0, X_1, X_2, ...$ by a Markov Chain processes: $X_i$ depends on $X_0,...,X_{i-2}, X_{i-1}$ only through $X_{i-1}$
which means $h(x_i|x_0,...,x_{i-2}, x_{i-1}) = h(x_i|x_{i-1})$ under some general conditions, this Markov Chain will converge to a stationary distribution with probability density function is $p(x)$
$E[h(X)] = \displaystyle \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n}h(X_i)$
when n is large,
$V[\displaystyle \sum_{i=1}^{n}h(X_i)] \approx \frac{1 + \rho_1}{1 - \rho_1} \frac{V[h(X)]}{n}$
where $\rho_1$ is the lag 1 autocorrelation of the Markov Chain. For most MCMC methods, $\rho_1 \ge 0$. The best choice is $\rho_1=0$ then $V[\displaystyle \sum_{i=1}^{n}h(X_i)] = \frac{V[h(X)]}{n}$
In general, we would like to make $\rho_1$ as small as possible, i.e. to closest to 0 as possible. The closer $\rho_1$ to 0, the better estimation for $E[h(X)]$.

Rejection method and Metropolis algorithm

The goal is to generate a random variate $X$ with p.d.f $p(x)$, assuming we know how to generate $Y$ with p.d.f $g(y)$ and we can find a constant $M$ such that $p(x)\le Mg(x)$.

$g(x)$ is the majorizing, trial, or instrumental density.

Algorithm:

Generate $y$ from $g(y)$.
Genertae $u$ from $U(0,1)$.
if $u \le \frac{p(y)}{Mg(y)}$, then take $X=y$,
else goto (1).

Probability of acceptance

\[pr(acceptance)=\frac{1}{M}\]

Example: Generation of $p_X(x)=\frac{10}{7}(x+x^4)$

Choose $g(x)=1, 0<x<1$ and we find $M=\frac{20}{7}$ because \[p_X(x)=\frac{10}{7}(x+x^4)\le \frac{20}{7}\] and \[\frac{p_X(x)}{Mg(x)}=\frac{1}{2}(x+x^4)\].

Algorithm:

Generate $y\sim U(0,1)$
Generate $u\sim U(0,1)$
If $u\le \frac{1}{2}(y+y^4)$, then deliver $X=y$
else goto (1).

\[pr(acceptance)=\frac{7}{20}=0.35\]

Nsim <- 10^4
v <- runif(Nsim);y <- v
u <- runif(Nsim)
x <- na.omit(ifelse(u <= (1/2)*(y+y^4),y,NA))
#hist(X, freq = F)
curve((1/2)*(x+x^4),col='green')
abline(h = 1, col = "red")
title (main="Plot of p(x)/Mg(x), g(x) =1",font.main=4)
text (0.8, 0.6, "p(x)/Mg(x)", adj=c (-.1,-.1)) # adding text to plot
# add accept/reject line segments to plot 
segments(x0=0.4, x1=0.4, y0=0.0, y1=0.2, col="blue",lty=1)
segments(x0=0.4, x1=0.4, y0=0.2, y1=1.0, col="yellow",lty=1)
text (0.41, 0.1, "Accept", adj=c (-.1,-.1)) # adding “accept” text to plot
text (0.41, 0.8, "Reject", adj=c (-.1,-.1)) # adding “reject” text to plot

Example:Generate $p_X(x)=\frac{10}{7}(x+x^4)$ using another majorizing function.

Choose $g(x)=2x, 0<x<1$ and we find $M=\frac{10}{7}$ because \[\frac{p(x)}{g(x)}=\frac{5}{7}(1+x^3)\le \frac{10}{7}\]. Hence, $M=\frac{5}{7}$ and $\frac{p_X(x)}{Mg(x)}=\frac{1}{2}(1+x^3)$.

Algorithm:

Generate $v\sim U(0,1)$, let $y=v^\frac{1}{2}$
Generate $u\sim U(0,1)$
If $u\le \frac{1}{2}(1+y^3)$, then deliver $X=y$
else goto (1).

\[pr(acceptance)=\frac{7}{10}=0.70\]

Nsim <- 10^4
v <- runif(Nsim);y <- v^(1/2)
u <- runif(Nsim)
x <- na.omit(ifelse(u <= (1/2)*(1+y^3),y,NA))
#hist(X, freq = F)
curve((1/2)*(1+x^3),col='green')
abline(h = 1, col = "red")
title (main="Plot of p(x)/Mg(x), g(x) =2x",font.main=4)
text (0.8, 0.8, "p(x)/Mg(x)", adj=c (-.1,-.1)) # adding text to plot
# add accept/reject line segments to plot 
segments(x0=0.6, x1=0.6, y0=0.0, y1=0.6, col="blue",lty=1)
segments(x0=0.6, x1=0.6, y0=0.6, y1=1.0, col="yellow",lty=1)
text (0.61, 0.52, "Accept", adj=c (-.1,-.1)) # adding “accept” text to plot
text (0.61, 0.9, "Reject", adj=c (-.1,-.1)) # adding “reject” text to plot

Metropolis-Hastings Method

In Rejection Method, there is no clear rule to find $g(y)$ and constant “M” may not be easy to find. In these cases, we need to use Metropolis-Hastings Method.

Algorithm:

Initialize $y_0$, $X_0=y_0$.
At the $(i+1)^{th}$ stage, generate: $y_{i+1} = X_i + s$ where $s \sim U(-a,a)$.
Generate $u \sim U(0,1)$
If $\frac{p(y_{i+1})}{p(X_i)} \ge u$, then $X_{i+1}=y_{i+1}$, else $X_{i+1}=X_i$

When Markov Chain processes is applied in Metropolis-Hastings Method, we have a nice implementaion to sample a desired function since $P\{X_i|X_0,...,X_{i-2}, X_{i-1}\} = P\{X_i|X_{i-1}\}$ and when n is large, $X_n \to X \sim p(x)$. We can achieve our goal without finding the constant $M$ and $g(x)$, also no normalizing constant of $p(x)$ is needed.

Algorithm:

Initialize $y_0$, $X_0=y_0$.
At the $(i+1)^{th}$ stage, generate: $y_{i+1}$ from $g(y|x_i)$ - the transition probability from $x_i$ to $y$.
Generate $u \sim U(0,1)$.
$r_i = min \left(1, \frac{p(y)}{p(x_i)} \frac{g(x_i|y)}{g(y|x_i)} \right)$.
If $r \ge u$, then $X_{i+1}=y_{i+1}$, else $X_{i+1}=X_i$

Random walk chain
A simple random walk Markov chain satisfies: $Y = X_i + \epsilon$, where $\epsilon \sim g(.)$, g(.) is a symmetric function around 0.
Independence chain
Since $Y$ and $X_i$ are independent, $g(y|x_i) = g(y)$. To have the good estimates, we need to choose $g(y)$ as close approximate to $p(y)$ as possible and $g(x)$ needs to be easy to generate.

Example: Posterior distribution (with uniform prior) of the linkage model:

\[p(\theta | \mathbf{y}) \propto (2+\theta)^{125} (1-\theta)^{38} \theta^{34}\] given that $\theta \sim U(0,1)$

First, let look at the distribution of posterior.

Nsim <- 10^4
p <- function(x) dbeta(x,55.75994892,33.77023722)
x <- runif(Nsim)
plot(x, p(x), col='red',cex=0.05, 
    main= bquote('p('~theta ~'|y)=c'(2+theta)^125*~(1-theta)^38 ~theta^34),ylab='')

Compare the distribution of $p(\theta|y)$ to 3 different Beta distributions. As we can see, the posterior distribution has a close form to Beta(56,34).

x <- runif(1000)
x_seq <- seq(from=min(x), to=max(x), length = 1000)
plot(x_seq, p(x_seq), col='red', 
     main=bquote('p'~(theta~'|y') ~ 'vs. Beta(55,30), Beta(30,30), Beta(56, 34)'),
     ylab='',xlab=expression(theta),type='l')
lines(x_seq, dbeta(x_seq,55,30),type='l',col='green')
lines(x_seq, dbeta(x_seq,30,30),type='l',col='blue')
lines(x_seq, dbeta(x_seq,56,34),type='l',col='sienna')
legend(0,7, c('p(theta|y)','B(50,50)','B(30,30)','B(56,34)'),lty=c(1,1,1),col=c('red','green','blue','sienna'),cex=0.5)

Application: We can use Metropolis-Hastings Algorithm to estimate the posterior mean. After large number of iterations, the sample mean converges to the true mean.
Example for $g \sim Beta(55,30)$

#Beta(55,30)
Nsim <- 10000
f <- function(x) (2+x)^125*(1-x)^38*x^34
g <- function(x) rbeta(1,55,30)
TH <- rep(0,Nsim)
TH[1] <- 0.5

for (i in 2:Nsim) {
  th_current <- TH[i-1]
  th_proposed <-  g()
  r <- (f(th_proposed)/f(th_current)) * 
    (dbeta(th_current, 55, 30)/dbeta(th_proposed, 55, 30))
  if(r >= runif(1)) TH[i] <- th_proposed      
  else TH[i] <- th_current
}

mean_th_1 <- cumsum(TH)/(1:Nsim)

plot(TH, type='p',col='blue',cex=0.2,main='MCMC with g~Beta(55,30)',xlab='iteration', ylab = expression(theta))
abline(h=0.6228061319, col='black')
lines(mean_th_1,col='red',type='l')

Example for $g \sim Beta(30,30)$

#Beta(30,30)
Nsim <- 10000
g <- function(x) rbeta(1,30,30)

for (i in 2:Nsim) {
  th_current <- TH[i-1]
  th_proposed <-  g()
  r <- (f(th_proposed)/f(th_current)) * 
    (dbeta(th_current, 30, 30)/dbeta(th_proposed, 30, 30))
  if(r >= runif(1)) TH[i] <- th_proposed      
  else TH[i] <- th_current
}

mean_th_2 <- cumsum(TH)/(1:Nsim)

plot(TH, type='p',col='green',cex=0.2,main='MCMC with g~Beta(30,30)',xlab='iterations')
abline(h=0.6228061319, col='black')
lines(mean_th_2,col='red',type='l')

Example for $g \sim Beta(56,34)$

#Beat(56,34)
g <- function(x) rbeta(1,50,30)
for (i in 2:Nsim) {
  th_current <- TH[i-1]
  th_proposed <-  g()
  r <- (f(th_proposed)/f(th_current)) * 
    (dbeta(th_current, 56, 34)/dbeta(th_proposed, 56, 34))
  if(r >= runif(1)) TH[i] <- th_proposed      
  else TH[i] <- th_current
}

mean_th_3 <- cumsum(TH)/(1:Nsim)

plot(TH, type='p',col='sienna',cex=0.2,main='MCMC with g~Beta(56,34)',xlab='iterations')
abline(h=0.6228061319, col='black')
lines(mean_th_3,col='red',type='l')

As we can see, means in three MCMC with different parameters for Beta distribution all converge to the true mean (0.6228061319)

plot(NULL, xlim=c(0,Nsim), ylim=c(0,1), ylab="", xlab="", main='All three MCMC')
abline(h=0.6228061319, col='black')
lines(mean_th_1, col='blue')
lines(mean_th_2, col='green')
lines(mean_th_3, col='red')
legend(0,0.4, c('true mean','g~B(55,30)','g~B(30,30)','g~B(56,34)'),lty=c(1,1,1),col=c('black','blue','green','red'),cex=0.5)

Extension to d-dimensions

It is difficult to apply Metropolis-Hastings Algorithm and/or rejection method for high dimensional multivariate random vectors.
The joint p.d.f. $p(x_1,x_2,...,x_d)$ can not be uniquely determined by its marginal p.d.f. $p_1(x_1),p_2(x_2),...p_d(x_d)$, but by its full conditionals:

\[p(x_1|x_2,...,x_d)\] \[p(x_2|x_1,x_3,...,x_d)\] \[p(x_d|x_1,...,x_{d-1})\]

Gibbs Sampler

Gibbs sampler is a method which generates random variables from a marginal distribution directly without having to calculate the density $f(x),$ with the assumption that is is easier to calculate these marginals. It is based on elementary properties of Markov Chains.

The two-stage Gibbs sampler

If two random variables $X$ and $Y$ have joint density $f(x,y)$ with corresponding densitites $f_{y/x}$ and $f_{x/y},$ the two-stage Gibbs sampler generates a Markov chain $(X_t, Y_t)$ according to the following steps:

Algorithm: Two-stage Gibbs Sampler

Take $X_0=x_0$

For $t=1,2,...,$ generate

$Y_t \sim f_{y|x}(.|x_{t-1})$
$X_t \sim f_{x|y}(.|y_{t})$

There are many examples given in the text but i have choosen example 7.2 for the implementation of this algorithm.

The multistage-stage Gibbs sampler

Natural extension from the two-stage Gibbs sampler.

Suppose that for $p>1,$ the random variable $X \in x$ can be written as $X=(X_1, ...,X_p),$ where the $X_{i}\,'s$ are either unidemensional or multidimensional components.

Moreover, suppose that we can simulate from the corresponding conditional densities $f_1,f_2,...,f_p,$ that is, we can simulate \[X_i|x_1,x_2,...,x_{i-1},x_{i+1},...,x_p \sim f_i(x_i|x_1,x_2,...,x_{i-1},x_{i+1},...,x_p)\] for $i=1,2,...,p$

The associated Gibbs sampler is given by the following transition from $X^{(t)}$ to $X^{(t+1)};$

Algorithm: The multistage Gibbs Sampler

At iteration $t=1,2,...,$ given $X^{(t)}=(x_1^{(t)},...,x_p^{(t)}),$ generate

$X_1^{(t+1)}\sim f_1(x_1|x_2^{(t)},...,x_p^{(t)});$

$X_2^{(t+1)}\sim f_2(x_1|x_1^{(t+1)},x_3^{(t)},...,x_p^{(t)});$

$X_p^{(t+1)}\sim f_p(x_p|x_1^{(t+1)},...,x_{p-1}^{(t+1)})$

The densities $f_1, ...,f_p$ are called the full conditionals, and a particular feature of the Gibbs sampler is that they are the only densities used for simulation. Thus, even in a high-dimensional problem, all of the simulations may be univariate, which is usually an advantage.

Missing data and Latent Variables

In a general missing-data setting \[g(x)=\int_z \mathrm{f(x,z)}\,\mathrm{d}z\] for $p\ge2,$ we write $y=(x,z)=(y_1,...,y_p)$ and denote the conditional densities of $f(y)=f(y_1,...,y_n)$ by \[Y_1|y_2,...,y_p \sim f_1(y_1|y_2,....y_p)\] \[Y_2|y_1, y_3,...,y_p \sim f_2(y_2|y_1, y_3,....y_p)\]

\[Y_p|y_1, y_2,...,y_{p-1} \sim f_p(y_1|y_1, y_2,....y_{p-1})\] Applying a multistage Gibbs sampler to those full conditionals and assuming they can be simulated leads to a Markov $(Y^{(t)})_t$ that converges to $f$ and therefore a subchain $(X^{(t)})_t$ that converges to $g.$

Gibbs sampler is useful because It turns out that under reasonable general conditions, with large enough number of iterations, the distribution of $X_k$ converges to $f(x)$ which is the true marginal distribution of X. So the final point $X_k$ is effectively a sample point from $f(x)$

Gibbs sampling can be used to estimate the density itself by averaging the final conditional densities from each Gibbs sequence. Just as values $X_k=x_k$ yeild a realization of $X_1,..., X_m \sim f(x)$, the values $Y_k=y_k$ yeild a realization of $Y_1,...,Y_m \sim f(y)$. The average of the conditional densities $f(x|Y_k=y_k$ will be a close approximation to $f(x)$ as \[\hat {f(x)} = \frac{1}{m} \sum_{i=1}^{m}f(x|y_i)\] where $y_1,...,y_m$ is the sequence of relized values of final $Y$ observations from each Gibbs sequence.

The Gibbs sampler provides better estimators since they are evaluated by $f(x|y_1),...,f(x|y_m)$ using the simulated values $y_1,...,y_m$ which carry more information about $f(x)$ than $x_1,...,x_m$.

Applications of Gibbs sampler

Data augmentation is a special case of Gibbs sampler.
Gibbs sampler is most useful when its full conditionals are easy to find and simulate.
- multivariate normal distribution
- multivariate exponential model
- Ising model (Geman and Geman (1984))
- Semi-conjugate priors.
- censored data or missing data.
- hyperpriors and hierarchical models.
- data augmentation/auxiliary variables.
For some univariate problems, we can “complete” it to d-dimensional problems and then apply Gibbs sampler.

Examples:

Beta Binomial

Considering the pair of distributions

\[X|\theta \sim Bin(n, \;\theta),\quad \theta\sim Be(a\;b)\] leads to the joint distribution \[f(x,\;\theta)=\binom{n}{x}\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{x+a-1}(1-\theta)^{n-x+b-1}\]

The corresponding conditional distribution of $X|\theta$ is given above, while \[\theta|x \sim Be(x+a,\;n-x+b).\]

The associated Gibbs sampler can be implemented as

  betabi=function(x,a,b,n){
    beta(x+a,n-x+b)/(beta(a,b)*beta(x+1,n-x+1)*(n+1))}

nsim=10^4

n=15;a=3;b=7

X=T=rep(0,nsim)

T[1]=rbeta(1,a,b)               #initialize the chain

X[1]=rbinom(1,n,T[1])           #initialize the chain

for (i in 2:nsim){
           X[i]=rbinom(1,n,T[i-1])
           T[i]=rbeta(1,a+X[i],n-X[i]+b)
   }

par(mfrow=c(1,2),mar=c(4,4,2,1))

hist(X[2000:nsim],nclass=16,col="grey",freq=FALSE, xlim=c(0,15),main="",xlab="X")

  
curve(betabi(x,a,b,n),from=0, to=15,col="gold4",lwd=2,add=TRUE)

hist(T[2000:nsim],nclass=134,col="grey",freq=FALSE,xlim=c(0,.8),main="", xlab=expression(theta))

curve(dbeta(x,shape1=a,shape2=b),from=0, to=.8,col="sienna",lwd=2,add=TRUE)

since this is a toy example, the closed form marginals are available and thus produced on top of the histogram, and they show a good fit for both Gibbs sampler.

Bivariate normal distribution.

As we know, the marginal distributions of X and Y are $N(\mu_X, \sigma_X)$ and $N(\mu_Y, \sigma_Y)$. The conditional distributions are:
\[Y|X=x \sim N \left( \mu_Y, \frac{\rho \sigma_Y}{\sigma_X}(x - \mu_X), (1 - \rho^2) \sigma_Y^2 \right) \] and \[X|Y=y \sim N \left( \mu_X, \frac{\rho \sigma_X}{\sigma_Y}(y - \mu_Y), (1 - \rho^2) \sigma_X^2 \right) \] where $\rho$ is the correlation of X and Y

R implementation

Nsim <- 10000
r=0.9;mu_x=1;mu_y=2;sd_x=1.2;sd_y=0.75
s1 <- sqrt(1-r^2)*sd_x
s2 <- sqrt(1-r^2)*sd_y
#output matrix
XY <- matrix(nrow=Nsim,ncol=2)
XY[1,] <- c(mu_x,mu_y)
Y <- XY[1,2]
for(i in 1:Nsim-1){
  mx <- mu_x+r*(Y-mu_y)*sd_x/sd_y
  #get X with new mean
  X <- rnorm(1,mx,s1)
  #update X into output matrix
  XY[i+1,1] <- X 
  my <- mu_y+r*(X-mu_x)*sd_y/sd_x
  #sample Y with new mean
  Y <- rnorm(1,my,s2)
  #update Y into output matrix
  XY[i+1,2] <- Y
}
#plot X and Y with burn-in = 500. Both X and Y converged.
par(mfrow=c(1,2))
plot(XY[501:10000,1],type='l',col='darkblue',main='Plot of X', ylab='X',xlab='iteration')
plot(XY[501:10000,2],type='l',col='wheat3',main='Plot of y', ylab='Y',xlab='iteration')

Gibbs sampling works since clearly, both X and Y converge to specific values.

Multisatge Gibbs sampler

Given the normal target $N_p(0,(1-\rho)I+\rho J)$, we produce a Gibbs sampler using the conditional distribution \[X_i|x_{(-i)} \sim N\left( \frac{(p-1)\rho}{1+(p-2)\rho} \overline{x}_{-i}, \frac{1+(p-2)\rho-(p-1)p^2)}{1+(p-2)\rho} \right)\] where $x_{(-i)} = (x_1, x_2, ..., x_{i-1}, x_{i+1},..., x_p)$ and $\overline{x}_{-i}$ is the mean of vector $x_{(-i)}$, $p=5, \rho=0.25$.

The R implementation:

T=5000 ;p=5 ;r=0.25
X=cur=rnorm(p)
for (t in 1 :T){
  for (j in 1 :p){
    m=sum(cur[-j])/(p-1)
    cur[j]=rnorm(1,(p-1)*r*m/(1+(p-2)*r), sqrt((1+(p-2)*r-(p-1)*r^2)/(1+(p-2)*r)))
  }
  X=cbind(X,cur)
}
#Distributions of X's
par(mfrow=c(2,3))
for (i in 1:p){
  hist(X[i,],prob=TRUE,col="wheat2",xlab="",main="")
  curve(dnorm(x),add=TRUE,col="sienna",lwd=2)
}
#Whether X's converge
cl <- c('darkblue', 'darkgreen', 'darkred', 'sienna', 'wheat')
par(mfrow=c(2,3))

for (i in 1:p){
  plot(X[i,501:5000],type='l',col=cl[i],main=paste('Plot of X_',i), ylab=paste('X_',i),xlab='iteration')
}

We could see that all 5 marginals have a close form of a standard normal distribution and X’s converged

Missing data with $N(\theta,1)$

The likelihood function: \[g(x|\theta)=L(\theta|x) \propto \prod_{i=1}^{m}e^{\frac{-(x_i-\theta)^2}{2}}\]
and \[f(x,z|\theta)=L(\theta|x,z) \propto \prod_{i=1}^{m}e^{\frac{-(x_i-\theta)^2}{2}} \prod_{i=m+1}^{n}e^{\frac{-(z_i-\theta)^2}{2}}\]

is the complete-data likelihood. Given a prior distribution $\pi(\theta)$ on $\theta$, we can then create a Gibbs sampler that iterates between the conditional distributions $\pi(\theta|x,z)$ and $f(z|x,\theta)$ and will have stationary distribuion $\pi(\theta,z|x)$, the posterior distribution of $(\theta,z)$.
Taking a non informative prior $\pi(\theta)=1$, the conditional distribution of $\theta|x,z$ is given by: \[\theta|x,z \sim N \left(\frac{m.\overline{x}+(n-m)\overline{z}}{n},\frac{1}{n}\right)\] while the conditional distribution of $Z|x,\theta$ is the product of the truncated normals $Z_i|x,\theta \sim \varphi(z-\theta) / \{1-\Phi(a-\theta)\}$ where $\varphi$ and $\Phi$ are the normal pdf and cdf, respectively. Each $Z_i$ must be greater than the truncated point $a$.

Here are the R implementation for using Gibbs sampler to estimate the posterior distribution of $\theta.$

x <- rnorm(20, mean=3)
m <- length(x)
n <- 30;a <- 3.5 #1/3 missing data
nsim <- 10^4
xbar <- mean(x)
#outputs
th <- array(xbar,dim=c(nsim,1))
zbar <- array(a,dim=c(nsim,1))

for (i in 2:nsim){
  #get probabilities to sample z values
  temp <- runif(n-m, min=pnorm(a,mean=th[i-1],sd=1), max=1)
  #calculate zbar from sampling of z
  zbar[i] <- mean(qnorm(temp, mean=th[i-1], sd=1))
  #sample theta with posterior distribution
  th[i] <- rnorm(1, mean=(m*xbar+(n-m)*zbar[i])/n, sd=sqrt(1/n))
}
par(mfrow=c(1,2))
#burn-in is 500, we start drawing histogram from the 500th iteration
hist(th[500:nsim],col="grey",breaks=25,xlab=expression(theta),main="",freq=FALSE)
curve(dnorm(x,mean(th),sd=sd(th)),add=T,lwd=2)

#Latent variables
hist(zbar[500:nsim],col="wheat",breaks=25,main="",xlab= expression(bar(Z)),freq=FALSE)
curve(dnorm(x,mean(zbar),sd=sd(zbar)),add=T,lwd=2, col='wheat4')

Both $\theta$ and $Z$ have normal distributiona.

Slice Sampler

Given a density of interest, $f_X(x),$ we can always represent it as the marginal density of the joint density \[f(x,u)=I\;\{0\,< u\,<f_X(x)\}\] Since integrating the above in $u$ returns $f_X.$

The associated conditional densities are \[f_{X|U}(x|u)=\frac{I\;\{0\,< u\,<f_X(x)\}}{\int\mathrm{I\;\{0\,< u\,<f_X(x)\}\mathrm{d}x} },\]\[f_{U|X}(u|x)=\frac{I\;\{0\,< u\,<f_X(x)\}}{\int\mathrm{I\;\{0\,< u\,<f_X(x)\}\mathrm{d}u} }\] Which means they are both uniform.

Those two conditionals then define the slice sampler as the associated Gibbs sampler.

Algorithm: 2D Slice-sampler:

At iteration $t,$ simulate

$U^{t+1} \sim U_{[0,\;f(x^{(t)})];}$
$X^{(t+1)\; \sim \; U_{A^{(t+1)}}}$, with \[ A^{(t+1)} \; = \; \{ x:\;f(x) \ge \; u^{(t+1)}\}.\]

The appeal of this algorithm is that it formally applies to any density known up to a multiplicative constant with no restriction on its shape or dimension. Obviously, its implementation may be hindered by the uniform simulation over the set $A^{(t)}.$

Example: Slice sampler

Consider the density $f(x)=\frac{1}{2}e^{-\sqrt(x)}$defined for $x\;>\;0$. While it can be directly simulated, it also yields easily to the slice sampler. Indeed, applying the formulas above, we have \[U|x \sim U\;(0,\;\frac{1}{2}e^{-\sqrt(x)}),\quad X|u \sim U\;(0,\;[log(2u)]^2).\]

We implement the sampler to generate 5000 variates and plot them along with the density in Figure below, which shows that the agreement is very good. The right panel does show some strong autocorrelations, which is typical of the slice sampler.

par(mfrow=c(1,1))
  fff=function(y)exp(-sqrt(y))/2;

nsim=5000;x=rep(pi,nsim);x[1]=-log(runif(1))

for (i in 2:nsim){
  w=runif(1,min=0,max=fff(x[i-1]));x[i]=runif(1,min=0,max=(-log(2*w))^2)
  }

X11(h=3.5);par(mfrow=c(1,2),mar=c(4,4,2,1))

## Warning in system2("otool", c("-L", shQuote(DSO)), stdout = TRUE): running
## command ''otool' -L '/Library/Frameworks/R.framework/Resources/modules/
## R_X11.so'' had status 1

hist(x,xlab="Slice Sampler",main="",xlim=c(0,40),ylim=c(0,.25),freq=FALSE,col="grey",breaks=250)

curve(fff,add=TRUE,lwd=3,col="sienna",xlim=c(0,40))

acf(x,xlab="Autocorrelation",ylab="",lwd=3)

Slice Sampler Algorithm

If \[p(x) \propto \prod_{j=1}^{k}\;f_j(x),\qquad f_j(x)\ge 0, \qquad 1\le j\le k,\] then $p(x)$ can be “completed” into \[g(Z_1, Z_2,...,Z_k, X)\; \propto \; \prod_{j=i}^{k} \mathbf{I}_{0\le z_j \ge f_j(x)}.\]

Each conditional p.d.f is a uniform distiribution!

Algorithm:

Generate $Z_1^{(i+1)} \sim U(0, f_1(X^{(i)}))$
Generate $Z_2^{(i+1)} \sim U(0, f_2(X^{(i)}))$
…
…
Generate $Z_k^{(i+1)} \sim U(0, f_k(X^{(i)}))$
Generate $X^{(i+1)} \sim U(A^{(i+1)}),$ where

\[A^{(i+1)}=\{y|f_j(y)\;\ge\;Z_j^{(i+1)},\; j=1,2,...,k\}\]

Slice Sampler Application

Example: Posterior distribution (with uniform prior) of the linkage model:

\[p(\theta|\mathbf{y}) \propto (2+\theta)^{125} (1-\theta)^{38} \theta^{34}\] Goal: Estimation of its posterior mean via Slice/Gibbs Sampler

\[f_1(\theta)\;=\;(2+\theta)^{125}, \qquad f_2(\theta)\;=\;(1-\theta)^{38}, \qquad f_3(\theta)\;=\;\theta^{34}.\]

Slice Sampler Implementation

Startung with any $0 < \theta = \theta_0 < 1,$ repeat steps $(1)-(5)$

Generate $Z_1 \sim U(0, (2+\theta)^{125})$
Generate $Z_ \sim U(0, (1-\theta)^{38})$
Generate $Z_3 \sim U(0, \theta^{34})$
For $Z_1,\; Z_2,\; Z_3$ generated, find the range of $\theta:$ \[\theta \ge Z_1^{1/125}-2, \qquad \theta \le 1-Z_2^{1/38}, \qquad \theta \ge Z_3^{1/34}.\]
Generate $\theta^{new} \sim U(A, B),$ where $(A, B)$ is \[A\;=\;max(Z_1^{1/125}\;-2,\; Z_3^{1/34}), \qquad B\;=\;1-Z_2^{1/38}.\]

Three initial values of $\theta$ tested: $0.05,\; 0.50, \; 0.95.$$

Nsim <- 5000
slice_sampler <- function (th) {
  Th <- rep(0,Nsim)
  Th[1] <- th
  for (i in 2:Nsim) {
    z1 <- runif(1,0,(2+Th[i-1])^125)
    z2 <- runif(1,0,(1-Th[i-1])^38)
    z3 <- runif(1,0,(Th[i-1])^34)
    Th[i] <- runif(1, max(z1^(1/125)-2, z3^(1/34)), 1-z2^(1/38))
  }
  return(Th)
}

Th_1 <- slice_sampler(0.05)
Th_2 <- slice_sampler(0.5)
Th_3 <- slice_sampler(0.95)
plot(NULL, xlim=c(0,Nsim), ylim=c(0,1), ylab="", xlab="", main='Slice Sampler with three initial values')
abline(h=0.6228061319, col='black')
lines(cumsum(Th_1)/(1:Nsim), col='blue')
lines(cumsum(Th_2)/(1:Nsim), col='green')
lines(cumsum(Th_3)/(1:Nsim), col='red')
legend(3700,0.3, c('true mean','initial value=0.05','initial value=0.5','initial value=0.95'),lty=c(1,1,1),col=c('black','blue','green','red'),cex=0.7)

All three values converge to:

\[E(\theta|\mathbf{y})=\frac{\int_0^1 (2+\theta)^{125}(1-\theta)^{38}\theta^{34}\theta\, \mathrm{d}\theta}{{\int_0^1 (2+\theta)^{125}(1-\theta)^{38}\theta^{34}\, \mathrm{d}\theta}} \;=\; 0.6228061319\]

Markov Chain Monte Carlo (MCMC) Simulations

Ngan Pham

Outline:

Simulated annealing

Example:

Recall the function:

Markov Chain Monte Carlo

Rejection method and Metropolis algorithm

Example: Generation of \(p_X(x)=\frac{10}{7}(x+x^4)\)

Example:Generate \(p_X(x)=\frac{10}{7}(x+x^4)\) using another majorizing function.

Metropolis-Hastings Method

Example: Posterior distribution (with uniform prior) of the linkage model:

Extension to d-dimensions

Gibbs Sampler

The two-stage Gibbs sampler

The multistage-stage Gibbs sampler

Missing data and Latent Variables

Applications of Gibbs sampler

Examples:

Slice Sampler

Slice Sampler Algorithm

Slice Sampler Application

Example: Posterior distribution (with uniform prior) of the linkage model:

Slice Sampler Implementation