Statistical Models of Suicidal Behavior and Brain Biology Using Large Data Sets

Project 6

At the heart of the Conte Center is the effort to improve understanding of the causes and the biological basis of suicidal behavior. Toward that aim, the component projects will collect valuable and complex data sets. This project proposes to leverage the unique research opportunity offered by such “big” data.

This project aims to build, fit, and validate statistical models to study biological factors related to suicidal behavior, with the eventual goal of improving suicide risk prediction.

We propose to design statistical models that make efficient use of all available data by employing recently developed techniques for high-dimensional data analysis, with particular emphasis on functional data analysis. There is a long tradition (Occam’s razor) of preferring relatively simple models to more complicated ones, provided they are comparable in predictive accuracy. However, because both the causes of suicide and the data being collected are quite complex, very simple models will generally not suffice. A number of computationally intensive procedures (e.g., machine learning algorithms) are now in widespread use for problems with large numbers of predictors. Although their predictive performance can be quite good, these procedures generally provide little insight into the relationship between the predictors and the response variables, which is where our interest lies.

In short, we aim to

  • adapt and extend existing methods for using very high-dimensional data as predictors, i.e., the imaging and genomic data we are gathering;
  • construct statistical models with good predictive accuracy;
  • build interpretable models that can give meaningful insight into the relationship between the various predictors and suicide risk; and
  • ensure stability of the models through validation studies comparing the various approaches using both simulated and real data.