Final project topics

  • Post your topic to Blackboard!
  • Some example topics are in this Google Drive folder
    • Feel free to add more!
    • Bonus if you find your own data
  • Please let me know which group you are in before Tuesday 11:00

Outline

  • Covariance of non-normal data
    • Continuous but not normal
    • Ordinal
  • Estimating Markov Random Fields
    • Recap
    • Toy example
    • Estimating sparse models with the LASSO

Covariance of Non-normal Data

Non-normal Data

  • The Gaussian Random Field assumes normally distributed data
    • What if this is not the case?
  • Two types of data that might not be normal:
    • Continuous data that is not normally distributed
    • Ordinal data with relatively few possible outcomes
      • Likert scales

Continuous Non-normal Data

  • If data is continuous but not normal, the nonparanormal transformation can be used to transform it to (approximate) normality
    • Semi-parametric transformation
    • huge.npn function in the huge package

# Simulate non-normal data: x is a 50/50 mixture of a normal and an exponential
# distribution, y follows a U-shaped beta distribution
DF <- data.frame(
  x = ifelse(sample(c(TRUE,FALSE), 1000, TRUE), 
             rnorm(1000,-10,4), rexp(1000,20)),
  y = rbeta(1000, 0.5, 0.5)
)

library('ggplot2')
ggplot(DF, aes(x = x)) + geom_density()

[Figure: density plot of x]

ggplot(DF, aes(x = y)) + geom_density()

[Figure: density plot of y]

ggplot(DF, aes(x = x, y = y )) + geom_point()

[Figure: scatterplot of x against y]

Transform to normal data:

library('huge')
DFnormal <- as.data.frame(huge.npn(DF))
## Conducting the nonparanormal (npn) transformation via shrunkun ECDF....done.

library('ggplot2')
ggplot(DFnormal, aes(x = x)) + geom_density()

[Figure: density plot of transformed x]

ggplot(DFnormal, aes(x = y)) + geom_density()

[Figure: density plot of transformed y]

ggplot(DFnormal, aes(x = x, y = y )) + geom_point()

[Figure: scatterplot of transformed x against y]

Ordinal Data

  • If data is ordinal and consists of only a few levels of measurement, the nonparanormal transformation does not work
  • In this case, threshold models should be used
  • It is then assumed that underlying each response is a normally distributed latent variable
  • The covariance between this latent variable and other such latent variables or other continuous variables can be estimated (a brief sketch follows this list)
    • Polychoric correlation if both variables are ordinal
    • Polyserial correlation if one item is ordinal and the other is continuous
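
A rough sketch of this idea (not part of the original example; the psych package and the simulated items are used only for illustration): the polychoric correlation recovers the correlation of the latent variables from the discretized responses.

library("mvtnorm")
library("psych")
set.seed(1)

# Two latent normal variables with correlation 0.5
lat <- rmvnorm(500, c(0, 0), matrix(c(1, 0.5, 0.5, 1), 2, 2))

# Discretize into 5-point Likert-type items using fixed thresholds
items <- apply(lat, 2, cut, breaks = c(-Inf, -1, 0, 1, 2, Inf), labels = FALSE)

cor(items)              # Pearson correlation: biased towards zero
polychoric(items)$rho   # polychoric correlation: close to the latent 0.5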

Likert Scale

I see myself as someone who is talkative…


Likert Scale

I see myself as someone who is talkative…

Compute Correlations

  • The cor_auto() function in the qgraph package can be used to compute a correlation matrix of variables of which some are ordinal
    • By default, it searches for variables with integer values and at most \(7\) different levels
  • Computes polychoric, polyserial, and Pearson correlations
  • Can be used as a drop-in replacement for cor!

# True correlation matrix:
Cor <- matrix(c(1,0.5,0.25,
                0.5,1,0.5,
                0.25,0.5,1),3,3)

Cor
##      [,1] [,2] [,3]
## [1,] 1.00  0.5 0.25
## [2,] 0.50  1.0 0.50
## [3,] 0.25  0.5 1.00

# Simulate true scores:
library(mvtnorm)
Ylat <- rmvnorm(1000,c(0,0,0),Cor)

# Discretize variables 1 and 2
Y <- Ylat
Y[,1:2] <- as.numeric(cut(Ylat[,1:2], breaks = c(-Inf,0,1,Inf)))

head(Y,10)
##       [,1] [,2]     [,3]
##  [1,]    1    1  0.02917
##  [2,]    2    2  1.73107
##  [3,]    1    1 -1.58392
##  [4,]    1    1 -0.01996
##  [5,]    1    2  0.61200
##  [6,]    2    3 -0.30646
##  [7,]    3    3  0.51575
##  [8,]    3    1 -1.46357
##  [9,]    2    3  0.79041
## [10,]    2    1  1.27870

# True correlation matrix:
Cor
##      [,1] [,2] [,3]
## [1,] 1.00  0.5 0.25
## [2,] 0.50  1.0 0.50
## [3,] 0.25  0.5 1.00
# Pearson correlations:
cor(Y)
##        [,1]   [,2]   [,3]
## [1,] 1.0000 0.3563 0.2276
## [2,] 0.3563 1.0000 0.4682
## [3,] 0.2276 0.4682 1.0000
  • Pearson correlations underestimate the true correlation!

# True correlation matrix:
Cor
##      [,1] [,2] [,3]
## [1,] 1.00  0.5 0.25
## [2,] 0.50  1.0 0.50
## [3,] 0.25  0.5 1.00
# Polychoric correlations
cor_auto(Y)
## Variables detected as ordinal: V1; V2
##        V1     V2     V3
## V1 1.0000 0.4499 0.2576
## V2 0.4499 1.0000 0.5393
## V3 0.2576 0.5393 1.0000
  • Polychoric correlations work much better!

Estimating Markov Random Fields

Markov Random Fields

  • Ising Model
    • All nodes are binary
      • \(-1\) or \(1\)
    • Combination of multiple logistic regression models
  • Gaussian Random Field
    • All nodes assumed normally distributed
    • Graph structure directly related to the inverse variance-covariance matrix
    • Graph usually standardized to partial correlations
    • Combination of multiple linear regression models

  • Because the Ising Model is a combination of multiple logistic regression models and the Gaussian Random Field is a combination of multiple linear regression models, we can estimate MRFs in the following way (a sketch in R follows the steps):

    1. Predict each node from all other nodes in a multiple (logistic) regression
    2. This gives you two parameters for each edge in the network. For example:
      • The regression parameter from predicting \(A\) from \(B\) and the regression parameter from predicting \(B\) from \(A\).
      • In logistic regression, these should be about equal. In linear regression, they can be standardized to be about equal
    3. Take the mean of these two parameters as edge weight
      • If the regression parameter is \(0\), there is no edge!
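
A minimal sketch of these three steps for binary data, using ordinary logistic regressions via glm(); the simulated data and the 0/1 coding are only for illustration (dedicated packages discussed later automate this):

set.seed(1)
# Hypothetical binary data for three nodes A, B and C
simData <- data.frame(A = rbinom(200, 1, 0.5),
                      B = rbinom(200, 1, 0.5),
                      C = rbinom(200, 1, 0.5))

# Step 1: predict each node from all other nodes
fitA <- glm(A ~ B + C, family = binomial, data = simData)
fitB <- glm(B ~ A + C, family = binomial, data = simData)

# Step 2: two parameters for the A--B edge
beta_AB <- coef(fitA)["B"]   # B predicting A
beta_BA <- coef(fitB)["A"]   # A predicting B

# Step 3: average them to obtain the edge weight (0 means no edge)
edge_AB <- mean(c(beta_AB, beta_BA))
edge_AB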

Ising Model Estimation


Ising Model Estimation


\[ \Pr(X_1 = 1) \propto \exp\left( \tau_1 + \beta_{12} x_2 + \beta_{13} x_3 + \beta_{14} x_4 \right) \]

Ising Model Estimation


\[ \Pr(X_2 = 1) \propto \exp\left( \tau_2 + \beta_{21} x_1 + \beta_{23} x_3 + \beta_{24} x_4 \right) \]

Ising Model Estimation


\[ \Pr(X_3 = 1) \propto \exp\left( \tau_3 + \beta_{31} x_1 + \beta_{32} x_2 + \beta_{34} x_4 \right) \]

Ising Model Estimation


\[ \Pr(X_4 = 1) \propto \exp\left( \tau_4 + \beta_{41} x_1 + \beta_{42} x_2 + \beta_{43} x_3 \right) \]

Ising Model Estimation


Ising Model Estimation


\[ \omega_{ij} = \beta_{ij} = \beta_{ji} \]

Gaussian Random Field Estimation


Gaussian Random Field Estimation


\[ X_1 = \tau_1 + \beta_{12} x_2 + \beta_{13} x_3 + \beta_{14} x_4 + \varepsilon_1 \]

Gaussian Random Field Estimation


\[ X_2 = \tau_2 + \beta_{21} x_1 + \beta_{23} x_3 + \beta_{24} x_4 + \varepsilon_2 \]

Gaussian Random Field Estimation


\[ X_3 = \tau_3 + \beta_{31} x_1 + \beta_{32} x_2 + \beta_{34} x_4 + \varepsilon_3 \]

Gaussian Random Field Estimation


\[ X_4 = \tau_4 + \beta_{41} x_1 + \beta_{42} x_2 + \beta_{43} x_3 + \varepsilon_4 \]

Gaussian Random Field Estimation


Gaussian Random Field Estimation


\[ \rho_{ij} = \frac{\beta_{ij} \sqrt{\mathrm{Var}\left(\varepsilon_j\right)}}{\sqrt{\mathrm{Var}\left(\varepsilon_i\right)}} = \frac{\beta_{ji} \sqrt{\mathrm{Var}\left(\varepsilon_i\right)}}{\sqrt{\mathrm{Var}\left(\varepsilon_j\right)}} \]

Gaussian Random Field Estimation


\[ \rho_{ij} = -\frac{ \theta_{ij} }{\sqrt{ \theta_{ii} \theta_{jj} }} \hspace{35pt} \boldsymbol{\Theta} = \boldsymbol{\Sigma}^{-1} \]

In which \(\boldsymbol{\Theta}\) is the precision matrix; the inverse of the variance-covariance matrix.

Estimating Markov Random Fields

  • In these slides I will show how to estimate a Gaussian Random Field on a simulated toy dataset:
    • \(W\): Worrying
    • \(C\): Concentration problems
    • \(F\): Fatigue
    • \(I\): Insomnia
  • The same logic applies to the Ising Model


Data <- data.frame(
  W = c(4, 1, 5, 4, 2, 2, 8, 1, 4, 4),
  C = c(2, 6, 4, 2, 9, 6, 2, 5, 4, 2),
  F = c(3, 2, 4, 8, 1, 2, 4, 4, 5, 5),
  I = c(1, 1, 6, 8, 3, 5, 8, 1, 3, 4)
)

Data
##    W C F I
## 1  4 2 3 1
## 2  1 6 2 1
## 3  5 4 4 6
## 4  4 2 8 8
## 5  2 9 1 3
## 6  2 6 2 5
## 7  8 2 4 8
## 8  1 5 4 1
## 9  4 4 5 3
## 10 4 2 5 4

Association

round(cor(Data),2)
##       W     C     F     I
## W  1.00 -0.67  0.40  0.70
## C -0.67  1.00 -0.73 -0.38
## F  0.40 -0.73  1.00  0.52
## I  0.70 -0.38  0.52  1.00

Association

library("qgraph")
qgraph(cor(Data), esize = 20)

[Figure: correlation network of the four variables]

Concentration

  • We can construct a concentration network by predicting each variable from all other variables:
    • \(W = \beta_{12} C + \beta_{13} F + \beta_{14} I + \varepsilon_W\)
    • \(C = \beta_{21} W + \beta_{23} F + \beta_{24} I + \varepsilon_C\)
    • \(F = \beta_{31} W + \beta_{32} C + \beta_{34} I + \varepsilon_F\)
    • \(I = \beta_{41} W + \beta_{42} C + \beta_{43} F + \varepsilon_I\)
  • In which all errors \(\varepsilon_i \sim N(0, \theta_i)\) are assumed to be independent of the other nodes.

fitW <- lm(W ~ C + F + I, data = Data)
fitW
## 
## Call:
## lm(formula = W ~ C + F + I, data = Data)
## 
## Coefficients:
## (Intercept)            C            F            I  
##       6.606       -0.723       -0.564        0.518
fitC <- lm(C ~ W + F + I, data = Data)
fitF <- lm(F ~ W + C + I, data = Data)
fitI <- lm(I ~ W + C + F, data = Data)

\[ \begin{aligned} W &= \beta_{12} C + \beta_{13} F + \beta_{14} I + \varepsilon_W \\ C &= \beta_{21} W + \beta_{23} F + \beta_{24} I + \varepsilon_C \\ F &= \beta_{31} W + \beta_{32} C + \beta_{34} I + \varepsilon_F \\ I &= \beta_{41} W + \beta_{42} C + \beta_{43} F + \varepsilon_I \end{aligned} \]

  • The regression coefficients have a symmetrical relationship:
    • \(\beta_{ij} \mathrm{Var}\left( \varepsilon_j \right) = \beta_{ji} \mathrm{Var}\left( \varepsilon_i \right)\)

beta_WC <- coef(fitW)[2]
beta_CW <- coef(fitC)[2]
errorVariance_W <- summary(fitW)$sigma^2
errorVariance_C <- summary(fitC)$sigma^2

beta_WC * errorVariance_C
##      C 
## -1.163
beta_CW * errorVariance_W
##      W 
## -1.163

\[ \begin{aligned} W = {\color{red} \beta_{12}} C + {\color{blue} \beta_{13}} F + {\color{green}\beta_{14}} I + \varepsilon_W \\ C = {\color{red}\beta_{21}} W + {\color{purple} \beta_{23} } F + {\color{orange} \beta_{24}} I + \varepsilon_C \\ F = {\color{blue} \beta_{31}} W + {\color{purple} \beta_{32} } C + {\color{brown}\beta_{34}} I + \varepsilon_F \\ I = {\color{green}\beta_{41}} W + {\color{orange} \beta_{42} } C + {\color{brown} \beta_{43}} F + \varepsilon_I \\ \end{aligned} \]

  • These regression parameters can be standardized to obtain the partial correlation coefficient:
    • \(\rho_{ij} = \frac{\beta_{ij} \sqrt{ \mathrm{Var}\left( \varepsilon_j \right)}}{\sqrt{ \mathrm{Var}\left( \varepsilon_i \right)}} = \frac{\beta_{ji} \sqrt{ \mathrm{Var}\left( \varepsilon_i \right)}}{\sqrt{ \mathrm{Var}\left( \varepsilon_j \right)}}\)
  • Are equal to standardized elements of the negative inverse covariance matrix!
    • \(\rho_{ij} = -\frac{ \theta_{ij} }{\sqrt{ \theta_{ii} \theta_{jj} }} \hspace{35pt} \boldsymbol{\Theta} = \boldsymbol{\Sigma}^{-1}\)

beta_WC * sqrt(errorVariance_C) / sqrt(errorVariance_W)
##       C 
## -0.7651
beta_CW * sqrt(errorVariance_W) / sqrt(errorVariance_C)
##       W 
## -0.7651
-cov2cor(solve(cov(Data)))
##         W       C       F       I
## W -1.0000 -0.7651 -0.5889  0.7754
## C -0.7651 -1.0000 -0.7993  0.5871
## F -0.5889 -0.7993 -1.0000  0.6468
## I  0.7754  0.5871  0.6468 -1.0000

These partial correlations give us the concentration graph:

library("qgraph")
qgraph(cor(Data), graph = "concentration")

[Figure: concentration (partial correlation) network]

  • Why does this still not look good?

N = 10!

Data
##    W C F I
## 1  4 2 3 1
## 2  1 6 2 1
## 3  5 4 4 6
## 4  4 2 8 8
## 5  2 9 1 3
## 6  2 6 2 5
## 7  8 2 4 8
## 8  1 5 4 1
## 9  4 4 5 3
## 10 4 2 5 4

Model Selection

  • We are specifically interested in identifying which partial correlations are zero
    • These elements correspond to missing edges in the graph
  • We could identify zeroes by significance testing or stepwise model selection, but…
    • Problem of multiple comparisons
    • Intractable model space
    • \(p\)-values and NHST are evil

LASSO Estimation

  • A well-accepted approach for joint model selection and parameter estimation is the least absolute shrinkage and selection operator (LASSO)
    • Limits the sum of absolute regression weights, which causes weak edges to shrink to exactly zero
      • Regularization
    • Useful in generalized linear regression models!
      • Multiple linear regression
      • Multiple logistic regression
    • Even usable when you have more variables than observations!

LASSO Estimation

\[ Y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon_Y \]

LASSO Estimation

  • When estimating regression coefficients, we typically find the regression coefficients for which the likelihood is the highest:
    • \(\max_{\boldsymbol{\beta}}\left[ \Pr\left( Y = y \mid \boldsymbol{X} = \boldsymbol{x}, \boldsymbol{\beta} \right) \right]\)
  • In LASSO estimation, we maximize the penalized likelihood instead (see the sketch below):
    • \(\max_{\boldsymbol{\beta}}\left[ \Pr\left( Y = y \mid \boldsymbol{X} = \boldsymbol{x}, \boldsymbol{\beta} \right) - \lambda \sum_i |\beta_i| \right]\)
  • The \(\lambda\) tuning parameter controls how strongly the likelihood is penalized for the sum of absolute regression parameters
    • The higher \(\lambda\), the better it is to have many weak regression parameters that are close to or exactly \(0\)
    • Higher values for \(\lambda\) directly correspond to setting a stricter upper bound to the sum of absolute regression parameters
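
A small illustration of this shrinkage using the glmnet package (not used elsewhere in these slides; the simulated data are made up for the example):

library("glmnet")
set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)   # five predictors
y <- 0.5 * x[, 1] + rnorm(100)        # only the first predictor matters

# A mild penalty shrinks weak coefficients towards zero;
# a stronger penalty sets most of them to exactly zero
coef(glmnet(x, y, lambda = 0.05))
coef(glmnet(x, y, lambda = 0.5))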

  • Using LASSO in (logistic) regression causes a model to be sparse
    • Many regression parameters of exactly 0
  • Sparse models are very interpretable
  • Cross-validation often shows that sparse models work better
    • No over-fitting
  • Because Markov Random Fields are a combination of (logistic) regression models, the LASSO can be used to estimate a sparse network
    • This gives us edge weights of exactly zero, indicating conditional independence!
  • For the Gaussian Graphical Model, a variant called the Graphical LASSO (glasso) exists that applies the LASSO directly to the inverse covariance matrix
    • Fast and requires only a covariance matrix as input (see the sketch below)
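
A minimal sketch using the glasso package directly on the toy data introduced earlier (the penalty value of 0.1 is arbitrary):

library("glasso")
S   <- cov(Data)              # covariance matrix of the toy data
fit <- glasso(S, rho = 0.1)   # rho is the LASSO penalty (lambda)

round(fit$wi, 2)              # regularized precision matrix: some entries exactly 0
round(-cov2cor(fit$wi), 2)    # corresponding (sparse) partial correlations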

[Figures: the estimated network for increasing penalty values \(\lambda = 0, 0.1, 0.25, 0.3, 0.4, 0.5, 1, 2, 3, 4\); as \(\lambda\) increases, more and more edge weights are shrunk to exactly zero and the network becomes sparser]

How to choose which \(\lambda\) to use?

  • Extended Bayesian Information Criterion (EBIC)
    • Weights the likelihood of a model against the number of nonzero parameters (edges in the network)
    • Occam's Razor
      • If two models are comparable, the one that is more sparse should be preferred
    • Theoretically driven
  • Cross-validation
    • Under which \(\lambda\) does a network estimated on one part of the data best predict another part of the data?
    • Data driven

Extended Bayesian Information Criterion (EBIC)

  • We can compute an information criterion for each penalty parameter, and choose the model with the best (lowest) value. In Markov Random Fields the Extended Bayesian Information Criterion (EBIC) works well:
    • \(EBIC = -2 L + |E| \log{n} + 4 |E| \gamma \log{ p }\)
    • \(L\) is the log-likelihood, \(|E|\) the number of edges in the graph, \(n\) the sample size and \(p\) the number of variables in the graph.
  • The \(\gamma\) parameter is another tuning parameter, but much better defined than \(\lambda\)
    • Set \(\gamma\) to zero to obtain more edges, but also more false positives
    • Set \(\gamma\) to \(1\) to obtain a very sparse model
    • Typical packages use a default of \(0.25\) or \(0.5\), which works well (see the sketch below)
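
In qgraph, this EBIC-based selection over a path of \(\lambda\) values is implemented in the EBICglasso() function; a brief sketch with the toy data (the \(\gamma\) values are only examples):

library("qgraph")
# Lower gamma: more liberal, retains more edges (but more false positives)
EBICglasso(cor(Data), n = nrow(Data), gamma = 0)
# Higher gamma: sparser network
EBICglasso(cor(Data), n = nrow(Data), gamma = 0.5)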

Find a sparse concentration graph using glasso:

qgraph(cor(Data), graph = "glasso", sampleSize = 10)

[Figure: network estimated with glasso]

To estimate a Markov Random Field using EBIC to select the tuning parameter \(\lambda\), use the following (an example follows this list):

  • Gaussian Graphical Model
    • qgraph function in the qgraph package with arguments graph = "glasso" and sampleSize
      • Requires correlation or covariance matrix as input
      • tuning argument sets the \(\gamma\) hyper tuning parameter
  • Ising Model
    • IsingFit function in the IsingFit package
      • Requires raw data as input
      • gamma argument sets the \(\gamma\) hyper tuning parameter
      • The weiadj element of the resulting list gives the weights matrix
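
For example (a hedged sketch; binaryData stands for a hypothetical data frame of binary responses, not data used in these slides):

# Gaussian Graphical Model: glasso with EBIC selection of lambda
library("qgraph")
qgraph(cor(Data), graph = "glasso", sampleSize = nrow(Data), tuning = 0.5)

# Ising Model: IsingFit on raw binary data (binaryData is hypothetical)
library("IsingFit")
fit <- IsingFit(binaryData, gamma = 0.25)
qgraph(fit$weiadj)   # plot the estimated weights matrix as a network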

Thank you for your attention!

Good luck with the final project!