Discovery at the VICB

Beware the False Assumption in Systems Biology
By: Carol A. Rouzer, VICB Communications
Published: July 2, 2018
Standard approaches used to generate quantitative models of biological processes may frequently fail due to false assumptions about data distribution.
A major goal of systems biology is to develop quantitative models of complex biological behavior to enable the prediction of that behavior in new situations. Model development comprises two essential steps. The first, fitting, involves selecting an appropriate model framework, assembling an experimental dataset to serve as the "training set", and then using the training set to identify the numerical parameters that minimize the differences between values calculated by the model and the data. Once fitting is complete, the second step is to use the model to predict the response of the system under a new set of conditions and determine how well the predictions match a new set of experimental data. As the sophistication of our statistical and computational approaches has increased, the fitting step of model development has become almost routine. However, we are also finding that despite excellent fits to the training set, many models have very poor ability to predict new experimental results. This led Vanderbilt Institute of Chemical Biology member Gregor Neuert, his collaborator Brian Munsky (Colorado State University), and their laboratories to determine the factors that lead to the development of models with high predictive capability [B. Munsky, et al., Proc. Natl. Acad. Sci. U.S.A., (2018) published online June 29, DOI: 10.1073/pnas.1804060115].
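The two steps described above can be sketched in a few lines of code. The following is a minimal, hypothetical illustration, not the authors' model: a simple first-order mRNA decay model is fit to synthetic training data by least squares, and the fitted parameters are then used to predict values at new timepoints.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical model: first-order mRNA decay, m(t) = m0 * exp(-g * t).
def decay(t, m0, g):
    return m0 * np.exp(-g * t)

rng = np.random.default_rng(0)
t_train = np.linspace(0, 30, 16)              # training timepoints (min)
true_m0, true_g = 80.0, 0.12                  # "unknown" true parameters
y_train = decay(t_train, true_m0, true_g) + rng.normal(0, 2, t_train.size)

# Step 1 (fitting): choose parameters that minimize model-vs-data residuals.
popt, pcov = curve_fit(decay, t_train, y_train, p0=(50.0, 0.05))

# Step 2 (prediction): evaluate the fitted model under new conditions.
t_new = np.linspace(30, 60, 8)
y_pred = decay(t_new, *popt)
print(popt)  # fitted (m0, g) should land close to (80.0, 0.12)
```

A good fit to `t_train` is no guarantee that `y_pred` will match new measurements; the study summarized here asks why that second step so often fails.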
Many attribute the poor predictive ability of mathematical models to the complexity and variability of biological systems. The solution, they say, is to have more and better data. To achieve this goal, we now have methods that enable us to measure biochemical events with amazing resolution at the single cell level over many thousands of cells at a time. The resulting huge, information-rich datasets have addressed the need for better data, yet problems with poor model predictive ability remain. Another proposed solution is the use of methods to assess the statistical uncertainty in all of the generated model parameters. In theory, this provides critical information regarding potential sources of error and the magnitude of those errors. However, a major problem with these and all statistical approaches routinely used for model development and analysis is that they assume that the experimental data follow a symmetrical Gaussian (normal) distribution or, for non-Gaussian data, that the sample mean follows a Gaussian distribution. Unfortunately, despite the large size of many datasets, they frequently do not satisfy this criterion. In response, the Neuert and Munsky labs hypothesized that the major reason most quantitative models predict the behavior of biological systems so poorly is that the statistical foundation upon which they are structured is flawed.
A fundamental process in biology is transcription, which has been shown to be highly variable from one cell to another. Measurements of transcription in organisms from viruses to humans frequently result in asymmetrically distributed datasets. Thus, to test their hypothesis the researchers began with prior investigations they had carried out on the transcriptional responses of the yeast Saccharomyces cerevisiae upon exposure to osmotic stress. In that work, they succeeded in developing a predictive mathematical model describing transcription of genes regulated by the high osmolarity glycerol kinase (Hog1, a p38 mitogen-activated protein kinase homologue). In this new study, they sought to extend this model by including mRNA elongation, transport from the nucleus to the cytosol, and degradation (Figure 1). However, their major goal was to understand a fundamental unsolved problem in biology: why some approaches lead to a model with high predictive ability while others do not.
FIGURE 1. Experimental design. Yeast cells are exposed to osmotic stress in the form of medium containing a high salt concentration. This triggers activation of the Hog1 kinase, which translocates to the nucleus. Phosphorylation of target proteins leads to changes in chromatin, resulting in the transcription of target genes. High resolution measurements obtained at the single cell level over time enable assessments of transcription initiation, mRNA export from nucleus to cytoplasm, and mRNA degradation. Figure reproduced with permission from B. Munsky, et al., Proc. Natl. Acad. Sci. U.S.A., (2018) published online June 29, DOI: 10.1073/pnas.1804060115. Copyright 2018, B. Munsky, et al.
The investigators exposed S. cerevisiae to osmotic stress in the form of medium containing 0.2 M or 0.4 M NaCl and used single molecule fluorescence in situ hybridization to monitor the transcription of two Hog1-regulated genes, STL1 and CTT1, over time in >65,000 individual cells (Figure 2). They then used four approaches to fit the gene transcription data in search of an extended model of transcription. The first approach used conventional statistical techniques to fit the means of the data at each time point. In the second approach, they used the same techniques but fit both the means and variances of the data. In the third approach, they included two additional statistical characteristics (moments) in the fitting process. In the fourth approach, they used an entirely different method, known as finite state projection (FSP). Notably, the first three approaches are based on data parameters that are assumed to follow a Gaussian distribution. In contrast, FSP is based on the chemical master equation, which assumes that the data are derived from a Markov process, that is, a series of random transitions in which each step depends only on the current state of the system. FSP uses all of the data for fitting and does not assume a Gaussian or any other form of distribution. The researchers hypothesized that FSP would offer a distinct advantage, as initial evaluation of the dataset confirmed that it was neither symmetrical nor Gaussian.
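To give a concrete sense of the FSP idea, here is a minimal sketch. It is not the authors' Hog1 model, and all rates are hypothetical: the chemical master equation for a simple birth-death mRNA model (constitutive transcription plus first-order degradation) is solved on a truncated state space, yielding the full probability distribution over mRNA counts without assuming any distributional form.

```python
import numpy as np
from scipy.linalg import expm

# Minimal FSP-style illustration for a birth-death mRNA model
# (production rate k, degradation rate g per molecule). Hypothetical
# rates; not the authors' Hog1 model.
k, g = 10.0, 0.5
N = 60                          # truncate the state space at N mRNA copies

# Generator matrix A for states n = 0..N: dp/dt = A @ p
A = np.zeros((N + 1, N + 1))
for n in range(N + 1):
    if n < N:
        A[n + 1, n] += k        # birth: n -> n+1
        A[n, n] -= k
    if n > 0:
        A[n - 1, n] += g * n    # death: n -> n-1
        A[n, n] -= g * n

p0 = np.zeros(N + 1)
p0[0] = 1.0                     # every cell starts with zero mRNA

# Full probability distribution at t = 10 (near steady state).
p = expm(A * 10.0) @ p0
mean = np.dot(np.arange(N + 1), p)
print(mean)  # approaches k/g = 20 (Poisson) at steady state
```

Because `p` is the entire distribution, every statistic (mean, variance, median, skew) can be compared against single-cell data directly, which is what lets FSP avoid the Gaussian assumption entirely.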
FIGURE 2. Examples of data used for fitting. Each example represents the results of single molecule fluorescence in situ hybridization of cells at the indicated times after exposure to 0.2 M NaCl. Nuclei are stained blue and outlined in white. Cells are outlined in gray. STL1 mRNA is in green, and CTT1 mRNA is in red. Very bright staining in a nucleus indicates a transcription site (TS). Figure reproduced with permission from B. Munsky, et al., Proc. Natl. Acad. Sci. U.S.A., (2018) published online June 29, DOI: 10.1073/pnas.1804060115. Copyright 2018, B. Munsky, et al.
All four approaches resulted in excellent fits to the data, but they also yielded different model parameters. Furthermore, the excellent fit in each case was restricted to the data actually used for fitting. For example, the first approach provided parameters that resulted in a very good fit to the data means, but not the variances or the overall data distributions. The second approach yielded a model that fit both means and variances, but nothing else. The third approach, despite using more data characteristics, yielded a good fit only to the means. In contrast, the FSP-derived model provided a good fit for all aspects of the data. In addition, the FSP model yielded parameters for transcription initiation and mRNA decay that agreed well with previously published values.
The investigators used various statistical approaches to show that the failure of the first three models was not due to a theoretical inability to identify correct parameters or to inadequate constraint of the parameters during fitting. A closer look showed that the medians of both the real data and data simulated from the FSP-derived model fell well below the corresponding means. They concluded that this mismatch, which was most pronounced at later time points, resulted from the highly asymmetric distributions of the data being sampled. This supported their hypothesis that attempts to generate models using approaches that assume a symmetrical distribution will be thwarted by systematic biases.
Having demonstrated the superiority of the FSP approach, the investigators expanded this model by including data regarding the location of the mRNA (nucleus versus cytoplasm) within the cells. They found that, despite the increased complexity, this model exhibited reduced parameter bias and uncertainty. Finally, they evaluated the various models for their ability to predict elongation dynamics of individual mRNAs at transcription sites (Figure 3). They found that the FSP-derived model predicted mRNA elongation rates that agreed remarkably well with experimental values. They then compared all four modeling approaches for their ability to predict the average number of full-length mRNAs generated per active transcription site. Only the FSP approach made predictions comparable to experimentally measured values; the standard modeling approaches missed by several orders of magnitude. The FSP model also correctly predicted the distribution of full-length mRNAs generated per active transcription site and the fraction of cells with an active transcription site over time.
FIGURE 3. Examples of nascent mRNA data used for model validation and predictions. Each example represents the results of single molecule fluorescence in situ hybridization of cells after exposure to 2 M NaCl. Nuclei are stained blue and outlined in white. Cells are outlined in gray. STL1 mRNA is in green, and CTT1 mRNA is in red. Very bright staining in a nucleus indicates a transcription site (TS). Figure reproduced with permission from B. Munsky, et al., Proc. Natl. Acad. Sci. U.S.A., (2018) published online June 29, DOI: 10.1073/pnas.1804060115. Copyright 2018, B. Munsky, et al.
The results demonstrate the importance of selecting the correct approach when developing quantitative models to describe complex biological phenomena. All mathematical approaches are based on certain simplifying assumptions regarding the characteristics of the data. Such assumptions render manageable what would otherwise be a massive computational undertaking. In this context, it is very tempting to apply approaches based on familiar assumptions, such as the assumption that data are normally distributed. However, as the Neuert and Munsky labs clearly demonstrate, failure to confirm that this is, indeed, the case may lead to a model that beautifully describes the current set of data but has no applicability to any other. Such efforts not only waste resources but can also lead to highly misleading conclusions.
View Proc Natl Acad Sci USA article: Distribution shapes govern the discovery of predictive models for gene regulation