class: center, middle, inverse, title-slide .title[ # VanML 2024: Stan Workshop ] .date[ ### February 22, 2024 ] --- # Overview ### 1. Example Dataset ### 2. Proposed Statistical Model ### 3. The Model as a Stan Program ### 4. Implementation in R ### 5. Analyzing Results --- class: inverse, center, middle # Example Problem Statement --- # Gene Expression Data Suppose we have gene expression data in the form of RNA counts taken from bronchial epithelial tissue from 192 subjects for a particular gene. Some of these subjects had lung cancer while others did not. We would like to make inference about the mean expression level for this gene which accounts for potential difference in means between those with and those without lung cancer. A full description of the dataset can be found [here](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE4115) <!-- --> --- class: inverse, center, middle # Proposed Statistical Model --- # Finite Gaussian Mixture Model A *Gaussian mixture model* is a linear combination of Gaussian random variables, where each component corresponds to a sub-population and each weight is the probability of belonging to each sub-population. - In general, have `\(K\)` components (sub-populations) - A randomly sampled individual belongs to component `\(k\in\{1,\ldots,K\}\)` with probability `\(\alpha_k\)` - Random variable `\(Z\)` used to indicate sub-population: `\(Z\sim\text{Categorical}(K,\vec\alpha)\)` - An observation on an individual belonging to component `\(k\)` will be a `\(Y|Z=k\sim\mathcal N(\mu_k,\sigma)\)` -- <br> - Since `\(Z\)` is discrete it is incompatible with HMC - The workaround is to marginalize out this indicator by evaluating the likelihood as a sum `$$Y\sim\sum_{k=1}^K\alpha_k\mathcal N(\mu_k,\sigma)$$` --- # Bayesian GMM To make this model Bayesian, we need to assign prior distributions over the component proportions, means, and variance: `\begin{align*} Y|\vec\alpha,\vec\mu,\sigma &\sim \sum_{k=1}^K\alpha_k \mathcal N(\mu_k,\sigma) \\ \vec\alpha &\sim \text{Dirichlet}(1,K) \\ \vec\mu &\sim \text{Normal}(10,10) \\ \sigma^2 &\sim \text{InvGamma}(1, 1) \end{align*}` - Dirichlet is a multivariate extension of the Beta distribution - Used for random vectors whose components take values in (0,1) and sum to 1 -- The resulting posterior distribution is: `$$\pi(\vec\alpha,\vec\mu,\sigma|\vec y) \propto\prod_{i=1}^n\left[\sum_{k=1}^K\alpha_k\mathcal N(y_i|\mu_k,\sigma)\right]\mathcal N(\vec\mu;10,10)\mathcal{IG}(\sigma^2,1,1)\mathcal D(\vec\alpha;1,K)$$` --- class: inverse, center, middle # The Model as a Stan Program --- # Basic Components of a Stan Program Our model is specified in a separate .stan file - Written in the Stan probabilistic programming language - Contains up to seven blocks, each serving a different purpose - We only need the three basic blocks: 1. `data{...}`: Specify all user input 1. `parameters{...}`: Specify any parameters of the model which need to be estimated 1. `model{...}`: Evaluates the log-posterior - Stan uses this file to automatically construct a Hamiltonian Monte Carlo sampler for this model --- # Variable Definition The syntax for defining a variable in Stan is: ```{eval=F} type<lower=a><upper=b>[dimension] name; ``` - `type`: Specifies type of variables - Basic types are `int` and `real` - Can also have `vector`, `matrix`, and others - `<lower=a>`: Specifies the lower bound of the variable; default is unbounded below - `<upper=b>`: Specifies the upper bound of the variable; default is unbounded above - `dimension`: Specifies the dimension of the variable - For `vector` it is the length - For `\(n\times m\)` `matrix` is specified by `[n,m]` --- # GMM: Data and Parameters Now we can specify the `data` and `parameters` section of our RStan code: ```{echo=F} data { int<lower=1> n; // number of observations int<lower=2> K; vector[n] y; // observations } parameters { simplex[K] alpha; // mixing proportions ordered[K] mu; // means of components real<lower=0> sigma; // standard deviations of components } ``` Identifiability: Without ordering `\(\vec\mu\)`, the model is unidentifiable - E.g. between two iterations, swap `\(\mu_1\)` and `\(\mu_2\)` - This would leave the likelihood unchanged, and therefore the model is unidentifiable - Imposing an ordering prevents this swapping --- # Evaluating the Posterior The log-posterior can be specified in two ways - Suppose we want to model `\(Y\sim\mathcal N(\mu,\sigma)\)` 1. Distributional statement: `y ~ normal(mu, sigma);` 1. Increment statement: `target += normal_lpdf(y | mu, sigma);` - `target` is a protected keyword and requires special functions to access <br> - These are equivalent statements; can have combination in your `model` block - Both are vectorized, so if `y` is a vector, then it sums over all log-pdf evaluations --- # Note on Log-Sum-Exponential Recall in our posterior, we needed to evaluate `\(\sum_{k=1}^K\alpha_k\mathcal N(y|\mu_k,\sigma)\)` - Since Stan operates on the log-posterior, we need the log of this quantity - In terms of log densities this would be written as: `$$\log\left(\sum_{k=1}^K\exp[\log(\alpha_k)+\log(\mathcal N(y|\mu_k,\sigma))]\right)$$` - In Stan, if we have a vector `x=[a,b,c]`, the function `log_sum_exp(x)` will evaluate `\(\log(\exp(a)+\exp(b)+\exp(c))\)` - This will be used to evaluate the log-posterior --- # GMM: Model Recall that the posterior we want to encode is: `$$\pi(\vec\alpha,\vec\mu,\sigma|\vec y) \propto\prod_{i=1}^n\left[\sum_{k=1}^K\alpha_k\mathcal N(y_i|\mu_k,\sigma)\right]\mathcal N(\vec\mu;10,10)\mathcal{IG}(\sigma^2,1,1)\mathcal D(\vec\alpha;1,K)$$` ```{eval=F} model { // priors target += inv_gamma_lpdf(sigma^2 | 1, 1); target += normal_lpdf(mu | 10, 10); target += dirichlet_lpdf(alpha | rep_vector(1, K)); // likelihood for(i in 1:n){ vector[K] temp = log(alpha); for(k in 1:K){ temp[k] += normal_lpdf(y[i] | mu[k], sigma); } target += log_sum_exp(temp); } } ``` --- class: inverse, center, middle # Implementation in R --- # The RStan Package The Stan software is accessed in R through the `rstan` package <br> General flow of implementation is 1. Create .stan file specifying our model 1. Read data into R and reformat to match the Stan program 1. Execute the `stan` function - Automatically generate a Hamiltonian Monte Carlo sampler - Produce posterior samples, plus diagnostics 1. Generate plots, numerical summaries, etc. <br> Note on Compute Canada installation: may require manual installation of `V8`, `Rcpp`, or `RcppEigen` packages --- # The `stan` function The required arguments to the `stan` function are: - `file`: A string indicating the path of the .stan file containing the model code - `data`: A named list with elements matching those specified in the data block of the Stan file <br> Some optional, but often used arguments: - `chains`: The number of MCMC chains used (default 4) - `iter`: Total length of each chain (default 2000) - `init`: A list or function used to initialize each chain (default "random") - `seed`: The seed used for random number generation <br> A full list of parameters can be found in the [CRAN documentation](https://cran.r-project.org/web/packages/rstan/rstan.pdf) --- # Sampling from the Bayesian GMM ```r # load the RStan library library(rstan) # read in the data Expression <- read_csv("expression.csv")$Expression # enable parallel computing options(mc.cores=5) # set up data n <- length(Expression) K <- 2 stan_data <- list(n=n, K=K, y=Expression) ``` --- # Sampling from the Bayesian GMM ```r # for the rdirichlet and rinvgamma functions library(extraDistr) # initial values init_func <- function(){ list(alpha = as.numeric(rdirichlet(1, rep(1, K))), mu = sort(rnorm(K, 10, 10)), sigma = 1 / sqrt(rgamma(1, 1, 1))) } # run stan fit <- stan(file = "gaussian_mixture.stan", data = stan_data, chains = 5, iter = 1e4, init = init_func, seed = 2024) } ``` Note: if `seed` not supplied then Stan will set it randomly, overriding previously set seeds --- class: inverse, center, middle # Analyzing the Results --- # Examples of Plots .pull-left[ ```r library(bayesplot) mcmc_areas(fit, regex_pars="alpha", prob=0.95) + theme(axis.text = element_text(size=20)) ``` <!-- --> ] .pull-right[ ```r mcmc_areas(fit, regex_pars="mu", prob=0.95) + theme(axis.text = element_text(size=20)) ``` <!-- --> ] --- # Numerical Summary of Results RStan also has a summary function for `stan-fit` objects: ```r summ <- summary(fit)$summary colnames(summ) ``` ``` [1] "mean" "se_mean" "sd" "2.5%" "25%" "50%" "75%" [8] "97.5%" "n_eff" "Rhat" ``` ```r round(summ[rownames(summ)!="lp__", c("2.5%", "mean", "97.5%")],2) ``` ``` 2.5% mean 97.5% alpha[1] 0.51 0.61 0.72 alpha[2] 0.28 0.39 0.49 mu[1] 5.72 5.81 5.91 mu[2] 6.55 6.68 6.80 sigma 0.30 0.34 0.42 ``` --- # Extracting the Data It might be preferable to just extract the samples from the `stanfit` object using the `extract` function ```r par_fit <- extract(fit, pars="lp__", include=F) ``` - This returns a list of named matrices or vectors of posterior samples - E.g. `par_fit$alpha` is a matrix with 2 columns, one for each `\(\alpha_k\)` - `lp__` is the value of the log-posterior at each iteration - `include` can be used to invert the meaning of `pars` (i.e. `lp__` is excluded from the extraction) <br> This can then be converted to a data frame ```r par_fit <- as.data.frame(par_fit) ``` --- # Concluding Remarks RStan provides an appealing Bayesian analysis software - Probabilistic programming allows for flexibility; any model that can be written in the Stan language can be sampled from - A Hamiltonian Monte Carlo sampler is automatically created and executed to sample from the posterior - The package is well-maintained; no debugging fo the algorithm itself necessary