A Blog Entry on Bayesian Computation by an Applied Mathematician
1 Problem
1.1 Starting Point
As a starting point, consider a simple univariate linear regression model with 3 parameters, i.e., \(\alpha,\beta,\sigma^2\): \[ Y=\alpha+\beta X+\epsilon,\qquad\epsilon\sim N(0,\sigma^2). \tag{1}\] Since (1) is just another expression for \(Y|X\sim N(\alpha+\beta X,\sigma^2)\), the likelihood function is given by \[ p(y|\alpha,\beta,\sigma)=\phi(y|\alpha+\beta x,\sigma^2), \] given a data point \(x\in\mathbb{R}\), where \(\phi(\,\cdot\,|\mu,\sigma^2)\) denotes the Gaussian density of \(N(\mu,\sigma^2)\).
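To fix ideas, here is a minimal numerical sketch of this likelihood, using SciPy's Gaussian density. All parameter values and the data point are illustrative assumptions, not taken from any real data:

```python
# Evaluating the likelihood p(y | alpha, beta, sigma) = phi(y | alpha + beta*x, sigma^2)
# for a single observation. All values are illustrative assumptions.
from scipy.stats import norm

alpha, beta, sigma = 1.0, 2.0, 0.5   # assumed parameters
x, y = 0.3, 1.8                      # assumed data point

likelihood = norm.pdf(y, loc=alpha + beta * x, scale=sigma)   # scale is the standard deviation
print(likelihood)
```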
1.2 Hierarchical Model
To make our model (1) more realistic, suppose we would like to model variation in the intercept \(\alpha\) by imposing another regression structure (you might call it a super-population structure): \[ \alpha=\mu_\alpha+\epsilon_\alpha,\qquad\epsilon_\alpha\sim N(0,\sigma_\alpha^2). \tag{2}\]
Now, what does the likelihood function \(p(y|\mu_\alpha,\sigma_\alpha,\beta,\sigma)\) of this model look like?
The answer: it is still a normal likelihood.
1.3 Likelihood Calculation
First, rewrite (1) and (2) as \[ Y|X,\alpha\sim N(\alpha+\beta X,\sigma^2), \] \[ \alpha|\mu_\alpha,\sigma_\alpha\sim N(\mu_\alpha,\sigma_\alpha^2). \]
Using the formula (note this requires a conditional-independence assumption; see Section 2.4) \[ p(y|\mu_\alpha,\sigma_\alpha,\beta,\sigma)=\int_\mathbb{R}p(y|\alpha,\beta,\sigma)p(\alpha|\mu_\alpha,\sigma_\alpha)\,d\alpha, \] we get a normal density \[ p(y|\mu_\alpha,\sigma_\alpha,\beta,\sigma)=\phi(y|\widetilde{\mu},\widetilde{\sigma}^2), \] where \[ \widetilde{\mu}(\beta,\mu_\alpha):=\beta x+\mu_\alpha, \] \[ \widetilde{\sigma}^2(\sigma,\sigma_\alpha):=\sigma^2+\sigma_\alpha^2. \]
This also follows from the reproducing property of the normal distribution under convolution: \[ N(\beta x,\sigma^2)*N(\mu_\alpha,\sigma_\alpha^2)=N(\beta x+\mu_\alpha,\sigma^2+\sigma_\alpha^2). \]
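As a numerical sanity check, we can integrate \(\alpha\) out by quadrature and compare the result against the closed form \(\phi(y|\beta x+\mu_\alpha,\sigma^2+\sigma_\alpha^2)\). This is only a sketch; all values below are illustrative assumptions:

```python
# A numerical sanity check of the calculation above: integrating alpha out by
# quadrature should reproduce phi(y | beta*x + mu_alpha, sigma^2 + sigma_alpha^2).
# All values are illustrative assumptions.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

beta, sigma = 2.0, 0.5               # assumed lower-level parameters
mu_alpha, sigma_alpha = 1.0, 0.8     # assumed hyperparameters
x, y = 0.3, 1.8                      # assumed data point

integrand = lambda a: norm.pdf(y, loc=a + beta * x, scale=sigma) \
                      * norm.pdf(a, loc=mu_alpha, scale=sigma_alpha)
numeric, _ = quad(integrand, -np.inf, np.inf)

closed_form = norm.pdf(y, loc=beta * x + mu_alpha,
                       scale=np.sqrt(sigma**2 + sigma_alpha**2))
print(numeric, closed_form)          # the two values should agree
```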
2 Background in Probability
2.1 Core Proposition
Abstracting away by setting \[ \theta:=(\beta,\sigma),\qquad\varphi:=(\mu_\alpha,\sigma_\alpha), \] the core identity is the following: \[ p(y|\theta,\varphi)=\int p(y|\theta,\alpha)p(\alpha|\varphi)\,d\alpha. \tag{3}\]
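Before proving it, note what (3) says computationally: the marginal likelihood is the average of \(p(y|\theta,\alpha)\) over draws \(\alpha\sim p(\alpha|\varphi)\). Here is a Monte Carlo sketch of that reading in the normal setting of Section 1.3, with illustrative values:

```python
# Equation (3) read computationally: the marginal likelihood is the average of
# p(y | theta, alpha) over alpha ~ p(alpha | varphi). Monte Carlo sketch in the
# normal setting of Section 1.3, with illustrative values.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
beta, sigma = 2.0, 0.5               # theta (assumed)
mu_alpha, sigma_alpha = 1.0, 0.8     # varphi (assumed)
x, y = 0.3, 1.8                      # assumed data point

alpha_draws = rng.normal(mu_alpha, sigma_alpha, size=100_000)       # alpha ~ p(alpha | varphi)
mc_estimate = norm.pdf(y, loc=alpha_draws + beta * x, scale=sigma).mean()
exact = norm.pdf(y, loc=beta * x + mu_alpha, scale=np.sqrt(sigma**2 + sigma_alpha**2))
print(mc_estimate, exact)            # should agree up to Monte Carlo error
```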
With such suggestive notation, this formula may look obvious to statisticians, but how do we actually prove it?
2.2 Conditional Probability
Essentially, Equation 3 comes from the tower property of conditional expectation: \[ \operatorname{E}\bigl[\operatorname{E}[Z|X,Y]\,\big|\,X\bigr]=\operatorname{E}[Z|X]. \]
Let’s see how we apply this tower property to our problem: \[ \operatorname{P}[A|X]=\operatorname{E}[1_A|X]=\operatorname{E}\bigl[\operatorname{E}[1_A|X,Y]\,\big|\,X\bigr]=\operatorname{E}\bigl[\operatorname{P}[A|X,Y]\,\big|\,X\bigr]. \tag{4}\]
As for the definition of conditional probability: we first have the definition of conditional expectation, through which conditional probability is defined as1 \[ \operatorname{P}[A|X]:=\operatorname{E}[1_A|X]. \]
2.3 Conditional Density
The last piece we need is the definition of conditional density: \(p(y|x)\) is the conditional density of \(Y\) given \(X=x\) if \[ \operatorname{E}[g(X,Y)|X=x]=\int_{\mathbb{R}^d}g(x,y)\,p(y|x)\,dy \] for every bounded measurable function \(g\).
Plugging this characterization into (4), we get \[ \operatorname{P}[A|X=x]=\int_{\mathbb{R}^d}\operatorname{P}[A|X=x,Y=y]\,p(y|x)\,dy. \] Denoting the density of \(A\mapsto\operatorname{P}[A|X=x]\) by \(p(a|x)\), we have \[ p(a|x)=\int_{\mathbb{R}^d}p(a|x,y)\,p(y|x)\,dy. \]
2.4 Last Piece
Returning to the notation in (3), we have \[ p(y|\theta,\varphi)=\int p(y|\theta,\varphi,\alpha)p(\alpha|\varphi)\,d\alpha. \]
This is slightly different from (3) in that there is \(p(y|\theta,\varphi,\alpha)\) instead of \(p(y|\theta,\alpha)\).
So the last piece we need is the modeling assumption in the regression setting of (1), namely conditional independence between \(Y\) and the hyperparameters \((\mu_\alpha,\sigma_\alpha)\) given \(\alpha\) (and \(\theta\)): \[ Y\perp\!\!\!\perp\varphi\mid(\alpha,\theta). \]
Under this assumption, we have2 \[ p(y|\theta,\alpha,\varphi)=p(y|\theta,\alpha). \]
In the regression setting of (1), \[ p(y|\beta,\sigma,\alpha,\mu_\alpha,\sigma_\alpha)=p(y|\beta,\sigma,\alpha). \]
3 Real-world Example: Ideal Point Model
3.1 Specification
Let’s consider the following hierarchical model with 6 parameters, i.e., \(\alpha,\beta,\gamma,\delta,\sigma^2\) and \(X\): \[ Y\sim\operatorname{Bernoulli}(\mu), \] \[ \operatorname{logit}(\mu)=\alpha+\beta X, \] \[ X=\gamma+\delta Z+\epsilon,\qquad\epsilon\sim N(0,\sigma^2). \]
We see the hierarchical structure \[ Y|X\sim\operatorname{Bernoulli}(\operatorname{logit}^{-1}(\alpha+\beta X)), \] \[ X|Z\sim N(\gamma+\delta Z,\sigma^2), \] just as in Section 1.3.
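For concreteness, here is a minimal simulation from this hierarchical model. The parameter values and the covariate \(z\) are assumptions made only for this sketch:

```python
# A minimal simulation from the ideal point model above. The parameter values
# and the covariate z are assumptions made only for this sketch.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = -0.5, 1.2               # assumed logistic-regression parameters
gamma, delta, sigma = 0.0, 0.7, 1.0   # assumed upper-level parameters
z = rng.normal(size=200)              # assumed covariate

x = gamma + delta * z + rng.normal(scale=sigma, size=z.size)   # X | Z ~ N(gamma + delta Z, sigma^2)
mu = 1.0 / (1.0 + np.exp(-(alpha + beta * x)))                 # logit^{-1}(alpha + beta X)
y = rng.binomial(1, mu)                                        # Y | X ~ Bernoulli(mu)

print(y[:10])
```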
This time, however, we cannot bypass the tower-property calculation: unlike in Section 1.3, where the reproducing property of the normal distribution let us sanity-check the result, no such closed form is available here.
3.2 Likelihood Calculation
Through the core Equation 3,
\[\begin{align*} p(y|\alpha,\beta,\gamma,\delta,\sigma)&=\int_\mathbb{R}p(y|\alpha,\beta,x)\,p(x|\gamma,\delta,\sigma)\,dx\\ &=1_{\{1\}}(y)\int_\mathbb{R}\operatorname{logit}^{-1}(\alpha+\beta x)\,\phi(x|\gamma+\delta z,\sigma^2)\,dx\\ &\qquad+1_{\{0\}}(y)\int_\mathbb{R}\Bigl(1-\operatorname{logit}^{-1}(\alpha+\beta x)\Bigr)\phi(x|\gamma+\delta z,\sigma^2)\,dx\\ &=\cdots \end{align*}\]
We will not proceed further, since the integral is not analytically tractable.
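Still, the integral is one-dimensional, so for any fixed parameter values it can be approximated numerically. A minimal sketch (all values are illustrative assumptions):

```python
# A numerical approximation of the marginal likelihood above for one observation,
# using one-dimensional quadrature. All parameter and data values are assumptions.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad
from scipy.special import expit   # logit^{-1}

alpha, beta = -0.5, 1.2              # assumed lower-level parameters
gamma, delta, sigma = 0.0, 0.7, 1.0  # assumed upper-level parameters
z, y = 0.4, 1                        # assumed data point

def integrand(x):
    p1 = expit(alpha + beta * x)                          # P(Y = 1 | x, alpha, beta)
    p_y = p1 if y == 1 else 1.0 - p1
    return p_y * norm.pdf(x, loc=gamma + delta * z, scale=sigma)

marginal, _ = quad(integrand, -np.inf, np.inf)
print(marginal)   # p(y | alpha, beta, gamma, delta, sigma), approximated numerically
```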
But if every evaluation of the likelihood involves such an intractable integral, how can we carry out maximum likelihood or Bayesian estimation?
3.3 Data Augmentation
We can still find the mode of the likelihood function by the EM algorithm (Dempster et al., 1977).
The situation is similar in the Bayesian approach: we enlarge the parameter space by treating \(x\) as a parameter (or a latent variable) and sample \(x\) as well.
This is called the data augmentation approach; it originated with the EM algorithm and was later applied to Bayesian computation in (Tanner and Wong, 1987).
Although \(p(y|\alpha,\beta,\gamma,\delta,\sigma)\) involves a possibly intractable integral over \(x\), \(p(y|\textcolor{red}{x},\alpha,\beta,\gamma,\delta,\sigma)\) does not, since \[ p(y|x,\alpha,\beta,\gamma,\delta,\sigma)=p(y|x,\alpha,\beta), \] given that \(y\) and \((\gamma,\delta,\sigma)\) are conditionally independent given \(x\) (and \(\alpha,\beta\)).
Therefore, the posterior distribution of \((x,\alpha,\beta,\gamma,\delta,\sigma)\) is given by
\[ p(x,\alpha,\beta,\gamma,\delta,\sigma|y)\,\propto\,p(y|x,\alpha,\beta,\gamma,\delta,\sigma)p(x,\alpha,\beta,\gamma,\delta,\sigma) \] \[ =p(y|x,\alpha,\beta)p(x|\gamma,\delta,\sigma)p(\alpha,\beta,\gamma,\delta,\sigma). \tag{5}\]
Basically, the higher-level parameters \(\gamma,\delta,\sigma\) are treated as part of the machinery that systematically defines a prior for the lower-level parameter \(x\). In fact, this was the motivation when the hierarchical structure was first introduced in (Lindley and Smith, 1972).
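To make (5) concrete, here is a sketch of the corresponding unnormalized log posterior for a single observation \((y,z)\). The flat prior on \((\alpha,\beta,\gamma,\delta,\sigma)\) and all numerical values are assumptions made purely for illustration:

```python
# A sketch of the unnormalized log posterior implied by (5) for one observation
# (y, z). The flat prior on (alpha, beta, gamma, delta, sigma) is an assumption
# made only for illustration, as are the data values.
import numpy as np
from scipy.stats import norm, bernoulli
from scipy.special import expit   # logit^{-1}

y, z = 1, 0.4   # assumed single observation

def log_posterior(x, alpha, beta, gamma, delta, sigma):
    if sigma <= 0:
        return -np.inf                                              # outside the support
    log_lik   = bernoulli.logpmf(y, expit(alpha + beta * x))        # log p(y | x, alpha, beta)
    log_level = norm.logpdf(x, loc=gamma + delta * z, scale=sigma)  # log p(x | gamma, delta, sigma)
    log_prior = 0.0                                                 # flat prior (assumption)
    return log_lik + log_level + log_prior

print(log_posterior(0.5, -0.5, 1.2, 0.0, 0.7, 1.0))  # evaluate at an arbitrary point
```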
3.4 Bayesian Computation
Using the decomposition (5), we can readily run MCMC algorithms to sample from the posterior distribution.
Introducing another latent variable, we may also run a Gibbs sampler based on the Pólya-Gamma data augmentation of (Polson et al., 2013).
Alternatively, we can run the HMC (Hamiltonian Monte Carlo) algorithm based on the gradient of the log posterior density.
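As an illustration that the decomposition (5) is all an MCMC algorithm needs, here is a minimal random-walk Metropolis sketch targeting the augmented posterior. It is neither the Pólya-Gamma Gibbs sampler nor HMC mentioned above, and the data, the weakly informative priors, and the tuning constants are all assumptions for this sketch:

```python
# A minimal random-walk Metropolis sketch targeting the augmented posterior (5).
# It is written only to show that the decomposition suffices for MCMC; the data,
# the N(0, 10^2) priors, and the tuning constants are assumptions for this sketch.
import numpy as np
from scipy.stats import norm, bernoulli
from scipy.special import expit

rng = np.random.default_rng(1)
y, z = 1, 0.4   # assumed single observation

def log_post(theta):
    x, alpha, beta, gamma, delta, log_sigma = theta
    sigma = np.exp(log_sigma)                                   # sample log(sigma) to keep sigma > 0
    lp  = bernoulli.logpmf(y, expit(alpha + beta * x))          # log p(y | x, alpha, beta)
    lp += norm.logpdf(x, loc=gamma + delta * z, scale=sigma)    # log p(x | gamma, delta, sigma)
    lp += norm.logpdf(theta, scale=10.0).sum()                  # weak N(0, 10^2) priors (assumption)
    return lp

theta = np.zeros(6)                                  # (x, alpha, beta, gamma, delta, log_sigma)
samples = []
for _ in range(5000):
    proposal = theta + 0.3 * rng.normal(size=6)      # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
        theta = proposal                             # accept
    samples.append(theta)

print(np.asarray(samples).mean(axis=0))              # crude posterior-mean estimates
```

Sampling \(\log\sigma\) rather than \(\sigma\) removes the positivity constraint from the proposal; the same reparameterization is convenient for gradient-based samplers such as HMC.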
We have prepared code for gradient-based Bayesian estimation, including HMC via Stan and the Zig-Zag sampler via PDMPFlux. See the following articles (in Japanese) for more details:
References
Footnotes
For this definition of conditional probability, see (Kallenberg, 2021, p. 167) and (Dudley, 2002, p. 347). If we were to define conditional probability first, we would need the concept of regular conditional probability, which is not discussed here.↩︎
See, for example, (Kallenberg, 2021, Theorem 8.9, p. 170).↩︎
Citation
@online{shiba2024,
author = {Shiba, Hirofumi},
title = {Likelihood of {Hierarchical} {Models}},
date = {2024-12-23},
url = {https://162348.github.io/posts/2024/Probability/likelihood.html},
langid = {en},
abstract = {We examine how to find \& formally determine the
likelihood function of hierarchical models. As a real-world example,
we consider the ideal point model, also known as the 2-parameter
logistic item response model.}
}