Final project topics

  • Post your topic to Blackboard!
  • Some example topics are in this Google Drive folder
    • Feel free to add more!
    • Bonus if you find your own data
  • Please let me know which group you are in before Tuesday 11:00

Outline

  • Covariance of non-normal data
    • Continuous but not normal
    • Ordinal
  • Estimating Markov Random Fields
    • Recap
    • Toy example
    • Estimating sparse models with the LASSO

Covariance of Non-normal Data

Non-normal Data

  • The Gaussian Random Field assumes normally distributed data
    • What if this is not the case?
  • Two types of data that might not be normal:
    • Continuous data that is not normally distributed
    • Ordinal data with relatively few possible outcomes
      • Likert scales

Continuous Non-normal Data

  • If data is continuous but not normal, the nonparanormal transformation can be used to transform it to (approximate) normality
    • Semi-parametric transformation
    • huge.npn function in the huge package

# Simulate non-normal data: x is a 50/50 mixture of a normal and an exponential
# distribution, y follows a U-shaped beta distribution
DF <- data.frame(
  x = ifelse(sample(c(TRUE,FALSE), 1000, TRUE), 
             rnorm(1000,-10,4), rexp(1000,20)),
  y = rbeta(1000, 0.5, 0.5)
)

library('ggplot2')
ggplot(DF, aes(x = x)) + geom_density()

[Figure: density plot of x]

ggplot(DF, aes(x = y)) + geom_density()

[Figure: density plot of y]

ggplot(DF, aes(x = x, y = y )) + geom_point()

[Figure: scatterplot of x against y]

Transform to normal data:

library('huge')
DFnormal <- as.data.frame(huge.npn(DF))
## Conducting the nonparanormal (npn) transformation via shrunkun ECDF....done.

library('ggplot2')
ggplot(DFnormal, aes(x = x)) + geom_density()

[Figure: density plot of transformed x]

ggplot(DFnormal, aes(x = y)) + geom_density()

[Figure: density plot of transformed y]

ggplot(DFnormal, aes(x = x, y = y )) + geom_point()

[Figure: scatterplot of transformed x against y]

Ordinal Data

  • If data is ordinal and consists of only a few levels of measurement, the nonparanormal transformation does not work
  • In this case, threshold models should be used
  • It is then assumed that underlying each response is a normally distributed latent variable
  • The covariance between this latent variable and other such latent variables or other continuous variables can be estimated (a brief sketch follows this list)
    • Polychoric correlation if both variables are ordinal
    • Polyserial correlation if one item is ordinal and the other is continuous
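
A rough sketch of this idea (not part of the original example; the psych package and the simulated items are used only for illustration): the polychoric correlation recovers the correlation of the latent variables from the discretized responses.

library("mvtnorm")
library("psych")
set.seed(1)

# Two latent normal variables with correlation 0.5
lat <- rmvnorm(500, c(0, 0), matrix(c(1, 0.5, 0.5, 1), 2, 2))

# Discretize into 5-point Likert-type items using fixed thresholds
items <- apply(lat, 2, cut, breaks = c(-Inf, -1, 0, 1, 2, Inf), labels = FALSE)

cor(items)              # Pearson correlation: biased towards zero
polychoric(items)$rho   # polychoric correlation: close to the latent 0.5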

Likert Scale

I see myself as someone who is talkative…


Likert Scale

I see myself as someone who is talkative…

Compute Correlations

  • The cor_auto() function in the qgraph package can be used to compute a correlation matrix of variables of which some are ordinal
    • By default, it searches for variables with integer values and at most \(7\) different levels
  • Computes polychoric, polyserial, and Pearson correlations
  • Can be used as a drop-in replacement for cor!

# True correlation matrix:
Cor <- matrix(c(1,0.5,0.25,
                0.5,1,0.5,
                0.25,0.5,1),3,3)

Cor
##      [,1] [,2] [,3]
## [1,] 1.00  0.5 0.25
## [2,] 0.50  1.0 0.50
## [3,] 0.25  0.5 1.00

# Simulate true scores:
library(mvtnorm)
Ylat <- rmvnorm(1000,c(0,0,0),Cor)

# Discretize variables 1 and 2
Y <- Ylat
Y[,1:2] <- as.numeric(cut(Ylat[,1:2], breaks = c(-Inf,0,1,Inf)))

head(Y,10)
##       [,1] [,2]     [,3]
##  [1,]    1    1  0.02917
##  [2,]    2    2  1.73107
##  [3,]    1    1 -1.58392
##  [4,]    1    1 -0.01996
##  [5,]    1    2  0.61200
##  [6,]    2    3 -0.30646
##  [7,]    3    3  0.51575
##  [8,]    3    1 -1.46357
##  [9,]    2    3  0.79041
## [10,]    2    1  1.27870

# True correlation matrix:
Cor
##      [,1] [,2] [,3]
## [1,] 1.00  0.5 0.25
## [2,] 0.50  1.0 0.50
## [3,] 0.25  0.5 1.00
# Pearson correlations:
cor(Y)
##        [,1]   [,2]   [,3]
## [1,] 1.0000 0.3563 0.2276
## [2,] 0.3563 1.0000 0.4682
## [3,] 0.2276 0.4682 1.0000
  • Pearson correlations underestimate the true correlation!

# True correlation matrix:
Cor
##      [,1] [,2] [,3]
## [1,] 1.00  0.5 0.25
## [2,] 0.50  1.0 0.50
## [3,] 0.25  0.5 1.00
# Polychoric correlations
cor_auto(Y)
## Variables detected as ordinal: V1; V2
##        V1     V2     V3
## V1 1.0000 0.4499 0.2576
## V2 0.4499 1.0000 0.5393
## V3 0.2576 0.5393 1.0000
  • Polychoric correlations work much better!

Estimating Markov Random Fields

Markov Random Fields

  • Ising Model
    • All nodes are binary
      • \(-1\) or \(1\)
    • Combination of multiple logistic regression models
  • Gaussian Random Field
    • All nodes assumed normally distributed
    • Graph structure directly related to the inverse variance-covariance matrix
    • Graph usually standardized to partial correlations
    • Combination of multiple linear regression models

  • Because the Ising Model is a combination of multiple logistic regression models and the Gaussian Random Field is a combination of multiple linear regression models, we can estimate MRFs in the following way (a sketch in R follows the steps):

    1. Predict each node from all other nodes in a multiple (logistic) regression
    2. This gives you two parameters for each edge in the network. For example:
      • The regression parameter from predicting \(A\) from \(B\) and the regression parameter from predicting \(B\) from \(A\).
      • In logistic regression, these should be about equal. In linear regression, they can be standardized to be about equal
    3. Take the mean of these two parameters as edge weight
      • If the regression parameter is \(0\), there is no edge!
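
A minimal sketch of these three steps for binary data, using ordinary logistic regressions via glm(); the simulated data and the 0/1 coding are only for illustration (dedicated packages discussed later automate this):

set.seed(1)
# Hypothetical binary data for three nodes A, B and C
simData <- data.frame(A = rbinom(200, 1, 0.5),
                      B = rbinom(200, 1, 0.5),
                      C = rbinom(200, 1, 0.5))

# Step 1: predict each node from all other nodes
fitA <- glm(A ~ B + C, family = binomial, data = simData)
fitB <- glm(B ~ A + C, family = binomial, data = simData)

# Step 2: two parameters for the A--B edge
beta_AB <- coef(fitA)["B"]   # B predicting A
beta_BA <- coef(fitB)["A"]   # A predicting B

# Step 3: average them to obtain the edge weight (0 means no edge)
edge_AB <- mean(c(beta_AB, beta_BA))
edge_AB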

Ising Model Estimation


Ising Model Estimation


\[ \Pr(X_1 = 1) \propto \exp\left( \tau_1 + \beta_{12} x_2 + \beta_{13} x_3 + \beta_{14} x_4 \right) \]

Ising Model Estimation


\[ \Pr(X_2 = 1) \propto \exp\left( \tau_2 + \beta_{21} x_1 + \beta_{23} x_3 + \beta_{24} x_4 \right) \]

Ising Model Estimation


\[ \Pr(X_3 = 1) \propto \exp\left( \tau_3 + \beta_{31} x_1 + \beta_{32} x_2 + \beta_{34} x_4 \right) \]

Ising Model Estimation


\[ \Pr(X_4 = 1) \propto \exp\left( \tau_4 + \beta_{41} x_1 + \beta_{42} x_2 + \beta_{43} x_3 \right) \]

Ising Model Estimation


Ising Model Estimation


\[ \omega_{ij} = \beta_{ij} = \beta_{ji} \]

Gaussian Random Field Estimation


Gaussian Random Field Estimation


\[ X_1 = \tau_1 + \beta_{12} x_2 + \beta_{13} x_3 + \beta_{14} x_4 + \varepsilon_1 \]

Gaussian Random Field Estimation


\[ X_2 = \tau_2 + \beta_{21} x_1 + \beta_{23} x_3 + \beta_{24} x_4 + \varepsilon_2 \]

Gaussian Random Field Estimation


\[ X_3 = \tau_3 + \beta_{31} x_1 + \beta_{32} x_2 + \beta_{34} x_4 + \varepsilon_3 \]

Gaussian Random Field Estimation


\[ X_4 = \tau_4 + \beta_{41} x_1 + \beta_{42} x_2 + \beta_{43} x_3 + \varepsilon_4 \]

Gaussian Random Field Estimation


Gaussian Random Field Estimation


\[ \rho_{ij} = \frac{\beta_{ij} \sqrt{\mathrm{Var}\left(\varepsilon_j\right)}}{\sqrt{\mathrm{Var}\left(\varepsilon_i\right)}} = \frac{\beta_{ji} \sqrt{\mathrm{Var}\left(\varepsilon_i\right)}}{\sqrt{\mathrm{Var}\left(\varepsilon_j\right)}} \]

Gaussian Random Field Estimation


\[ \rho_{ij} = -\frac{ \theta_{ij} }{\sqrt{ \theta_{ii} \theta_{jj} }} \hspace{35pt} \boldsymbol{\Theta} = \boldsymbol{\Sigma}^{-1} \]

In which \(\boldsymbol{\Theta}\) is the precision matrix; the inverse of the variance-covariance matrix.

Estimating Markov Random Fields

  • In these slides I will show how to estimate a Gaussian Random Field on a simulated toy dataset:
    • \(W\): Worrying
    • \(C\): Concentration problems
    • \(F\): Fatigue
    • \(I\): Insomnia
  • The same logic applies to the Ising Model


Data <- data.frame(
  W = c(4, 1, 5, 4, 2, 2, 8, 1, 4, 4),
  C = c(2, 6, 4, 2, 9, 6, 2, 5, 4, 2),
  F = c(3, 2, 4, 8, 1, 2, 4, 4, 5, 5),
  I = c(1, 1, 6, 8, 3, 5, 8, 1, 3, 4)
)

Data
##    W C F I
## 1  4 2 3 1
## 2  1 6 2 1
## 3  5 4 4 6
## 4  4 2 8 8
## 5  2 9 1 3
## 6  2 6 2 5
## 7  8 2 4 8
## 8  1 5 4 1
## 9  4 4 5 3
## 10 4 2 5 4

Association

round(cor(Data),2)
##       W     C     F     I
## W  1.00 -0.67  0.40  0.70
## C -0.67  1.00 -0.73 -0.38
## F  0.40 -0.73  1.00  0.52
## I  0.70 -0.38  0.52  1.00

Association

library("qgraph")
qgraph(cor(Data), esize = 20)

[Figure: correlation network of the four variables]

Concentration

  • We can construct a concentration network by predicting each variable from all other variables:
    • \(W = \beta_{12} C + \beta_{13} F + \beta_{14} I + \varepsilon_W\)
    • \(C = \beta_{21} W + \beta_{23} F + \beta_{24} I + \varepsilon_C\)
    • \(F = \beta_{31} W + \beta_{32} C + \beta_{34} I + \varepsilon_F\)
    • \(I = \beta_{41} W + \beta_{42} C + \beta_{43} F + \varepsilon_I\)
  • In which all errors \(\varepsilon_i \sim N(0, \theta_i)\) are assumed to be independent of the other nodes.

fitW <- lm(W ~ C + F + I, data = Data)
fitW
## 
## Call:
## lm(formula = W ~ C + F + I, data = Data)
## 
## Coefficients:
## (Intercept)            C            F            I  
##       6.606       -0.723       -0.564        0.518
fitC <- lm(C ~ W + F + I, data = Data)
fitF <- lm(F ~ W + C + I, data = Data)
fitI <- lm(I ~ W + C + F, data = Data)

\[ \begin{aligned} W &= \beta_{12} C + \beta_{13} F + \beta_{14} I + \varepsilon_W \\ C &= \beta_{21} W + \beta_{23} F + \beta_{24} I + \varepsilon_C \\ F &= \beta_{31} W + \beta_{32} C + \beta_{34} I + \varepsilon_F \\ I &= \beta_{41} W + \beta_{42} C + \beta_{43} F + \varepsilon_I \end{aligned} \]

  • The regression coefficients have a symmetrical relationship:
    • \(\beta_{ij} \mathrm{Var}\left( \varepsilon_j \right) = \beta_{ji} \mathrm{Var}\left( \varepsilon_i \right)\)

beta_WC <- coef(fitW)[2]
beta_CW <- coef(fitC)[2]
errorVariance_W <- summary(fitW)$sigma^2
errorVariance_C <- summary(fitC)$sigma^2

beta_WC * errorVariance_C
##      C 
## -1.163
beta_CW * errorVariance_W
##      W 
## -1.163

\[ \begin{aligned} W = {\color{red} \beta_{12}} C + {\color{blue} \beta_{13}} F + {\color{green}\beta_{14}} I + \varepsilon_W \\ C = {\color{red}\beta_{21}} W + {\color{purple} \beta_{23} } F + {\color{orange} \beta_{24}} I + \varepsilon_C \\ F = {\color{blue} \beta_{31}} W + {\color{purple} \beta_{32} } C + {\color{brown}\beta_{34}} I + \varepsilon_F \\ I = {\color{green}\beta_{41}} W + {\color{orange} \beta_{42} } C + {\color{brown} \beta_{43}} F + \varepsilon_I \\ \end{aligned} \]

  • These regression parameters can be standardized to obtain the partial correlation coefficient:
    • \(\rho_{ij} = \frac{\beta_{ij} \sqrt{ \mathrm{Var}\left( \varepsilon_j \right)}}{\sqrt{ \mathrm{Var}\left( \varepsilon_i \right)}} = \frac{\beta_{ji} \sqrt{ \mathrm{Var}\left( \varepsilon_i \right)}}{\sqrt{ \mathrm{Var}\left( \varepsilon_j \right)}}\)
  • Are equal to standardized elements of the negative inverse covariance matrix!
    • \(\rho_{ij} = -\frac{ \theta_{ij} }{\sqrt{ \theta_{ii} \theta_{jj} }} \hspace{35pt} \boldsymbol{\Theta} = \boldsymbol{\Sigma}^{-1}\)

beta_WC * sqrt(errorVariance_C) / sqrt(errorVariance_W)
##       C 
## -0.7651
beta_CW * sqrt(errorVariance_W) / sqrt(errorVariance_C)
##       W 
## -0.7651
-cov2cor(solve(cov(Data)))
##         W       C       F       I
## W -1.0000 -0.7651 -0.5889  0.7754
## C -0.7651 -1.0000 -0.7993  0.5871
## F -0.5889 -0.7993 -1.0000  0.6468
## I  0.7754  0.5871  0.6468 -1.0000

These partial correlations give us the concentration graph:

library("qgraph")
qgraph(cor(Data), graph = "concentration")

[Figure: concentration (partial correlation) network]

  • Why does this still not look good?

N = 10!

Data
##    W C F I
## 1  4 2 3 1
## 2  1 6 2 1
## 3  5 4 4 6
## 4  4 2 8 8
## 5  2 9 1 3
## 6  2 6 2 5
## 7  8 2 4 8
## 8  1 5 4 1
## 9  4 4 5 3
## 10 4 2 5 4

Model Selection

  • We are specifically interested in identifying which partial correlations are zero
    • These elements correspond to missing edges in the graph
  • We could identify zeroes by significance testing or stepwise model selection, but…
    • Problem of multiple comparisons
    • Intractable model space
    • \(p\)-values and NHST are evil

LASSO Estimation

  • A well-accepted approach for joint model selection and parameter estimation is the least absolute shrinkage and selection operator (LASSO)
    • Limits the sum of absolute regression weights, which causes weak edges to shrink to exactly zero
      • Regularization
    • Useful in generalized linear regression models!
      • Multiple linear regression
      • Multiple logistic regression
    • Even usable when you have more variables than observations!

LASSO Estimation

\[ Y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon_Y \]

LASSO Estimation

  • When estimating regression coefficients, we typically find the regression coefficients for which the likelihood is the highest:
    • \(\max_{\boldsymbol{\beta}}\left[ \Pr\left( Y = y \mid \boldsymbol{X} = \boldsymbol{x}, \boldsymbol{\beta} \right) \right]\)
  • In LASSO estimation, we maximize the penalized likelihood instead (see the sketch below):
    • \(\max_{\boldsymbol{\beta}}\left[ \Pr\left( Y = y \mid \boldsymbol{X} = \boldsymbol{x}, \boldsymbol{\beta} \right) - \lambda \sum_i |\beta_i| \right]\)
  • The \(\lambda\) tuning parameter controls how strongly the likelihood is penalized for the sum of absolute regression parameters
    • The higher \(\lambda\), the better it is to have many weak regression parameters that are close to or exactly \(0\)
    • Higher values for \(\lambda\) directly correspond to setting a stricter upper bound to the sum of absolute regression parameters
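
A small illustration of this shrinkage using the glmnet package (not used elsewhere in these slides; the simulated data are made up for the example):

library("glmnet")
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)   # five predictors
y <- 0.5 * x[, 1] + rnorm(100)        # only the first predictor matters

# A mild penalty shrinks weak coefficients towards zero;
# a stronger penalty sets most of them to exactly zero
coef(glmnet(x, y, lambda = 0.05))
coef(glmnet(x, y, lambda = 0.5))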

  • Using LASSO in (logistic) regression causes a model to be sparse
    • Many regression parameters of exactly 0
  • Sparse models are very interpretable
  • Cross-validation often shows that sparse models work better
    • No over-fitting
  • Because Markov Random Fields are a combination of (logistic) regression models, the LASSO can be used to estimate a sparse network
    • This gives us edge weights of exactly zero, indicating conditional independence!
  • For the Gaussian Graphical Model, a variant called the Graphical LASSO (glasso) exists that applies the LASSO directly to the inverse covariance matrix
    • Fast and requires only a covariance matrix as input (see the sketch below)
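
A minimal sketch using the glasso package directly on the toy data introduced earlier (the penalty value of 0.1 is arbitrary):

library("glasso")
S   <- cov(Data)              # covariance matrix of the toy data
fit <- glasso(S, rho = 0.1)   # rho is the LASSO penalty (lambda)

round(fit$wi, 2)              # regularized precision matrix: some entries exactly 0
round(-cov2cor(fit$wi), 2)    # corresponding (sparse) partial correlations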

[Figures: the estimated network for increasing penalty values \(\lambda = 0, 0.1, 0.25, 0.3, 0.4, 0.5, 1, 2, 3, 4\); as \(\lambda\) increases, more and more edge weights are shrunk to exactly zero and the network becomes sparser]

How to choose which \(\lambda\) to use?

  • Extended Bayesian Information Criterion (EBIC)
    • Weights the likelihood of a model against the number of nonzero parameters (edges in the network)
    • Occam's Razor
      • If two models are comparable, the one that is more sparse should be preferred
    • Theoretically driven
  • Cross-validation
    • Under which \(\lambda\) does a network estimated on one part of the data best predict another part of the data?
    • Data driven

Extended Bayesian Information Criterion (EBIC)

  • We can compute an information criterion for each penalty parameter, and choose the model with the best (lowest) value. In Markov Random Fields the Extended Bayesian Information Criterion (EBIC) works well:
    • \(EBIC = -2 L + |E| \log{n} + 4 |E| \gamma \log{ p }\)
    • \(L\) is the log-likelihood, \(|E|\) the number of edges in the graph, \(n\) the sample size and \(p\) the number of variables in the graph.
  • The \(\gamma\) parameter is another tuning parameter, but much better defined than \(\lambda\)
    • Set \(\gamma\) to zero to obtain more edges, but also more false positives
    • Set \(\gamma\) to \(1\) to obtain a very sparse model
    • Typical packages use a default of \(0.25\) or \(0.5\), which works well (see the sketch below)
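
In qgraph, this EBIC-based selection over a path of \(\lambda\) values is implemented in the EBICglasso() function; a brief sketch with the toy data (the \(\gamma\) values are only examples):

library("qgraph")
# Lower gamma: more liberal, retains more edges (but more false positives)
EBICglasso(cor(Data), n = nrow(Data), gamma = 0)
# Higher gamma: sparser network
EBICglasso(cor(Data), n = nrow(Data), gamma = 0.5)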

Find a sparse concentration graph using glasso:

qgraph(cor(Data), graph = "glasso", sampleSize = 10)

[Figure: network estimated with glasso]

To estimate a Markov Random Field using EBIC to select the tuning parameter \(\lambda\), use the following (an example follows this list):

  • Gaussian Graphical Model
    • qgraph function in the qgraph package with arguments graph = "glasso" and sampleSize
      • Requires correlation or covariance matrix as input
      • tuning argument sets the \(\gamma\) hyper tuning parameter
  • Ising Model
    • IsingFit function in the IsingFit package
      • Requires raw data as input
      • gamma argument sets the \(\gamma\) hyper tuning parameter
      • The weiadj element of the resulting list gives the weights matrix
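
For example (a hedged sketch; binaryData stands for a hypothetical data frame of binary responses, not data used in these slides):

# Gaussian Graphical Model: glasso with EBIC selection of lambda
library("qgraph")
qgraph(cor(Data), graph = "glasso", sampleSize = nrow(Data), tuning = 0.5)

# Ising Model: IsingFit on raw binary data (binaryData is hypothetical)
library("IsingFit")
fit <- IsingFit(binaryData, gamma = 0.25)
qgraph(fit$weiadj)   # plot the estimated weights matrix as a network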

Thank you for your attention!

Good luck with the final project!