There are two data attributes which are nearly ubiquitous but often under-appreciated and misunderstood. The first is time: a dimension which is physically unavoidable, but often ignored to the detriment of predictive power and insight. The second is missing data: a true nuisance for analysis which can also easily become a major pitfall. This is especially true if, as is often done, overly simplistic methods are relied on and major statistical bias is introduced as a result.
These ubiquitous and misunderstood attributes have fascinated me on countless projects, but they stand at the forefront of an analysis I developed to unravel climate signals in a multi-decadal series of environmental data. In this project, I analyzed how nutrient pollution had changed over decades in Narragansett Bay, across policy changes, and how this had affected the algae (aka phytoplankton) that form the base of the food web and are responsible for eutrophication and toxic bloom events.
Though the series is unprecedented in length (>60 years to date), the data have major challenges that require careful design. Missing data are present throughout the series, in some cases with durations > 1 year, posing a challenge for unbiased inference. Further, changes to the underlying system mean we expect the dependence structure to evolve through time. Plainly, I mean that coefficients, covariances, variances, etc. are unlikely to be static along the time dimension, and need to be able to change to describe changing relationships and structure.
Though I will briefly summarize my work here, you can read my 133-page thesis about it (embedded below) or my publication in Statistics and Its Interface, both of which cover the work in greater breadth and depth.
The data for this study come from the University of Rhode Island Long-term Narragansett Bay Monitoring Series (2003-2020). Within this period, from 2005-2012, Rhode Island law mandated a 50% reduction in nutrient pollution from coastal wastewater treatment centers, and a major question is how effective this action was.
Temperature, NH_{4} (ammonium), NO_{3} + NO_{2} (nitrate + nitrite), chlorophyll (chl) < 20 μm, and chl > 20 μm were the measured features of the series. These chemical forms of nitrogen are pollutants known to affect algal growth, and the algae themselves are measured via their dominant pigment (chlorophyll a). The algal measurements are divided by size because of the major effect of size on ecosystem function, in everything from food-web interactions to carbon sequestration. Thus, through these measures we hope to see how the relevant nutrient pollution has changed, and the effect on algae.
The only processing done before model development and fitting was a natural log transformation of nutrient and chlorophyll levels, simply to reduce skew in the data and meet Gaussian distribution assumptions for model errors. All other attributes of the data (e.g., missingness, heteroskedasticity in time) I handled with the model architecture.
As described above, there are several key applied goals for the environmental series which are made difficult, in particular, by missing data. The technique I developed is a multistage model: the first stage analyzes univariate time-series traits and imputes missing data with uncertainty, and the second stage targets questions which are multivariate in nature (i.e., how algal biomass is affected seasonally and in the long term by nutrient pollution).
The core of the modeling approach is the Bayesian dynamic linear model (DLM), a popular time-series tool which models our data as observations of a latent state evolving through time according to an observation equation and a state equation (eqns. 1, 2). These models have an inherent ability to impute missing data through forward filtering with the Kalman filter followed by backward sampling (Kalman 1960).
Equation 1. Observation equation:
\[Y_t = F_t \Theta_t + v_t,\quad v_t \sim N(0,V_t)\]
Equation 2. State equation:
\[\Theta_t = G_t \Theta_{t-1} + w_t,\quad w_t \sim N(0, W_t)\]
Kalman Filtering and Smoothing:
One-step-ahead predictive distribution of the latent state, \(f(\Theta_t \vert y_{1:t-1}) = N(a_t, R_t)\), where:
\[a_t = E(\Theta_t \vert y_{1:t-1}) = G_t m_{t-1}\] \[R_t = Var(\Theta_t \vert y_{1:t-1}) = G_t C_{t-1} G'_t + W_t\]
One-step-ahead predictive distribution of the observation, \(f(Y_t \vert y_{1:t-1}) = N(f_t, Q_t)\), where:
\[f_t = E(Y_t \vert y_{1:t-1}) = F_t a_t\] \[Q_t = Var(Y_t \vert y_{1:t-1}) = F_t R_t F'_t + V_t\]
The filtered distribution of the latent state, \(f(\Theta_t \vert y_{1:t}) = N(m_t, C_t)\), where:
\[m_t = E(\Theta_t \vert y_{1:t}) = a_t + R_t F'_t Q^{-1}_t e_t\] \[C_t = Var(\Theta_t \vert y_{1:t}) = R_t - R_t F'_t Q^{-1}_t F_t R_t\] \[e_t = Y_t - f_t\]
The smoothed distribution of the latent state, \(f(\Theta_t \vert y_{1:T})=N(s_t, S_t)\), where:
\[s_t = E(\Theta_t \vert y_{1:T}) = m_t + C_t G'_{t+1} R^{-1}_{t+1}(s_{t+1} - a_{t+1})\] \[S_t = Var(\Theta_t \vert y_{1:T}) = C_t - C_t G'_{t+1}R^{-1}_{t+1}(R_{t+1} - S_{t+1})R^{-1}_{t+1}G_{t+1}C_t\]
These models also allow any parameter of the model to vary in time, allowing us to study temporally changing relationships. This includes covariance matrices, through a method called discount factoring. This method works extremely well for dynamic covariance when data are mostly complete. The method, outlined in West and Harrison 1997, sets the covariance matrix \(W_t\) of the state equation so that information about the latent state decays by a fixed proportion from one time step to the next.
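To make the link between filtering and missing data concrete, here is a minimal sketch of the forward filter for the simplest possible DLM, a scalar local-level model with \(F = G = 1\). This is an illustration under stated assumptions and not the model used in the study: the function name, the `None` convention for missing values, and the values of `V`, `W`, `m0`, and `C0` are all hypothetical.

```python
def kalman_filter_local_level(y, V, W, m0=0.0, C0=1e6):
    """Forward filter for a scalar local-level DLM (F = G = 1).

    Missing observations are passed as None: the update step is skipped,
    so the filtered moments equal the one-step-ahead prior (a_t, R_t).
    """
    m, C = m0, C0
    means, variances = [], []
    for obs in y:
        # Prior for the state at time t: a_t = m_{t-1}, R_t = C_{t-1} + W
        a, R = m, C + W
        if obs is None:
            m, C = a, R          # no data: carry the prior forward
        else:
            f, Q = a, R + V      # one-step-ahead forecast of Y_t
            e = obs - f          # forecast error e_t
            K = R / Q            # R_t F' Q_t^{-1} in the scalar case
            m = a + K * e
            C = R - K * R        # R_t - R_t F' Q_t^{-1} F R_t
        means.append(m)
        variances.append(C)
    return means, variances


# Hypothetical series with a two-step gap in the middle.
y = [1.0, 1.2, None, None, 0.9, 1.1]
means, variances = kalman_filter_local_level(y, V=0.5, W=0.1)
```

Across the gap the filtered mean stays where it was and the variance grows by `W` at each missing step, which is how the filter carries uncertainty through missing data.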
The DLM itself, like a neural network or regression, is just a general framework which we can adapt to a very specific architecture. Through the general system of equations shown above, very detailed solutions are possible; for example, ARIMAX and dynamic regression are popular architectures. In our case, we build architectures specifically to answer our questions in each of the stages.
In stage 1, we set out to answer questions about long-term patterns such as seasonality and the long-term trend. Thus we can build a structure which specifically includes these components. In the parameterization outlined below, each time series is modeled with a dynamic intercept (long-term trend, \(\mu\)) and Fourier-form seasonal components with periods from 1 year to 23 weeks to capture complex seasonal patterns.
\[F_i^Q = [1,\ (1,\ 0),\ (1,\ 0),\ \dots,\ (1,\ 0)_J]\] \[G_i^Q = \begin{bmatrix} 1 & \\ & G_s^Q \end{bmatrix}\] \[G_s^Q = \begin{bmatrix} H_1 & & \\ & \ddots & \\ & & H_J \end{bmatrix}\] \[H_j = \begin{bmatrix} \cos(\omega_j) & \sin(\omega_j)\\ -\sin(\omega_j) & \cos(\omega_j) \end{bmatrix}\] \[\omega_j = 2\pi j/s,\quad j=1, \dots, J\] \[\theta_{i,t}^Q = \begin{bmatrix} \mu_{i,t} \\ S_{i,1,t} \\ S_{*i,1,t} \\ \vdots \\ S_{i,J,t} \\ S_{*i,J,t} \end{bmatrix}\]
In the second stage, the goal is to determine the dependence structure between nutrients and algal biomass. Rather than try to model the series entirely with trend and seasonal components, we try to quantify the impact of an exogenous predictor. The second-stage model also includes a dynamic intercept, an annual cycle with a period of 1 year, and a dynamic regression component on nitrogen sources. In this way, the model tells us how nitrogen predicts the anomaly from bulk seasonal and long-term patterns, which are affected by a plethora of other factors.
\[F_Z = [1,\ g(X),\ (1,\ 0)]\] \[G_Z = \begin{bmatrix} 1 & & \\ & 1 & \\ & & G_s^Z \end{bmatrix}\] \[G_s^Z = H\] \[H = \begin{bmatrix} \cos(\omega) & \sin(\omega)\\ -\sin(\omega) & \cos(\omega) \end{bmatrix}\] \[\omega = 2\pi/s\] \[\theta_{t}^Z = \begin{bmatrix} \mu_{i,t} & \dots & \mu_{z,t}\\ \beta_{i,t} & \dots & \beta_{z,t} \\ S_{i,t} & \dots & S_{z,t} \\ S_{*i,t} & \dots & S_{*z,t} \end{bmatrix}\]
The model was fit with Markov chain Monte Carlo (MCMC) to sample the posterior distribution of each parameter. It’s worth noting that dividing the modeling into two stages allowed the MCMC inference to be carried out independently for each stage. This separation meant that the posteriors from stage 1 could be sampled in stage 2, and the inference on missing data did not need to be repeated in the regression analysis. This added efficiency to the model fitting process.
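The Fourier-form seasonal structure used in both stages can be assembled mechanically from the period \(s\) and the number of harmonics \(J\). The sketch below builds the observation vector and the block-diagonal evolution matrix for a dynamic intercept plus \(J\) harmonics. The function names and the example values (`s=52` for weekly data, `J=2`) are assumptions for illustration and not the settings used in the study.

```python
import math


def harmonic_block(omega):
    """2x2 rotation block H_j for one Fourier harmonic."""
    return [[math.cos(omega), math.sin(omega)],
            [-math.sin(omega), math.cos(omega)]]


def seasonal_dlm_matrices(s, J):
    """Return (F, G) for a dynamic intercept plus J harmonics of period s."""
    n = 1 + 2 * J
    F = [1.0] + [1.0, 0.0] * J            # picks the intercept and each S_j
    G = [[0.0] * n for _ in range(n)]
    G[0][0] = 1.0                          # random-walk intercept
    for j in range(1, J + 1):
        H = harmonic_block(2 * math.pi * j / s)
        r = 1 + 2 * (j - 1)                # top-left index of block j
        for a in range(2):
            for b in range(2):
                G[r + a][r + b] = H[a][b]
    return F, G


# Hypothetical example: weekly data (s = 52) with two harmonics.
F, G = seasonal_dlm_matrices(s=52, J=2)
```

Each harmonic contributes two state elements, so the state dimension is \(1 + 2J\); the stage 2 annual cycle corresponds to the `J=1` case.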
The study presents a two-stage Dynamic Linear Model (DLM) as a versatile tool for handling noisy, incomplete, and non-monotonic time-series data in long-term environmental monitoring. In this work:
Decomposition of the DIN (dissolved inorganic nitrogen) series DLM (2003–2019), fit with the stage 1 model structure: a. the dynamic intercept, b. the seasonal trend, c. the posterior predicted mean with the true data (red). The median (black), 80% (dark grey shading), and 95% (light grey shading) pointwise credible intervals are shown. Blue dotted lines denote the beginning and end years of policy-mandated nutrient remediation.
Dependence structure between components of the stage 1 and stage 2 model for the Narragansett Bay ecological model. A bivariate model was run for small and large chl. a as well as nitrate + nitrite to describe long-term patterns among all series. Examination of prewhitened cross-correlations between the imputed series after stage 1 led to the use of DIN as a predictor in stage 2 to explore the influence of nitrogen on size structure of phytoplankton. Stage 2 used latent levels of ammonia and nitrate + nitrite.
Practical discount methods were found critical for the evolution covariance matrix, preventing over-parameterization and mixing issues in the Markov Chain Monte Carlo (MCMC) algorithm, which is used for estimation and inference in the DLM.
The multistage DLM allows for the MCMC inference to be carried out independently for each stage, improving computational efficiency by not repeating the inference on missing data in the regression analysis.
Ultimately, this model answered the key question of how algae blooms were affected by the policy changes. Results suggest that smaller phytoplankton (<20 μm) are relatively unaffected by changes in DIN levels, while larger phytoplankton (>20 μm) show a negative relationship with DIN. This relationship varies seasonally, with the strongest associations occurring in the winter.
The dynamic regression coefficient, \(\beta^{DIN}_t\), for a. the small chl. a series and b. the large chl. a series. c. Posterior distribution of the dynamic regression coefficient, \(\beta^{DIN}_t\), on DIN for the large chl. a series, plotted by week on the x–axis and by year as denoted by color shading. The median (black), 80% (dark grey shading), and 95% (light grey shading) pointwise credible intervals are shown.
The multistage DLM can accommodate data with significant missing points, disparate data streams, and multiple modeling goals.
The multi-stage state-space model architecture shows value for environmental monitoring and similar long-term analyses.
Policy changes did not have a statistically distinguishable effect on nitrogen levels at the study site (though other research has found localized effects closer to the sources).
The size of algal organisms shifted from being dominated by large to being dominated by small organisms, which could potentially impact everything from food-web structures to carbon sequestration.
The dependence on nitrogen has not significantly changed and is highly seasonal, meaning seasonally targeted efforts, especially in the winter, could have the greatest effect on algal growth.
Practical discounting methods, though previously untested, show superior accuracy for imputation in cases of non-static covariance.
The multi-stage architecture is particularly advantageous for modeling efforts with multiple goals and extensive missing data. With this architecture, the first stage can serve to both characterize the time series and provide advanced multiple imputation, leaving the flexibility to experiment in the second stage with a more parsimonious model.
Time series analysis and time series data are powerful for building our understanding of a topic, but there is almost invariably a big challenge: missing data. Measuring devices break, data collection is interrupted, and periodically the funding may dry up. The way that we handle this missing data can have large effects on, and introduce potential biases into, our inference.
Bayesian state-space models, and the particular case of the dynamic linear model, are popular time-series tools which model our data as observations of a latent state evolving through time according to an observation equation and a state equation (eqns. 1, 2). These models have an inherent ability to impute missing data through forward filtering with the Kalman filter followed by backward sampling (Kalman 1960).
Equation 1. Observation equation:
\[Y_t = F_t \Theta_t + v_t,\quad v_t \sim N(0,V_t)\]
Equation 2. State equation:
\[\Theta_t = G_t \Theta_{t-1} + w_t,\quad w_t \sim N(0, W_t)\]
Kalman Filtering and Smoothing:
One-step-ahead predictive distribution of the latent state, \(f(\Theta_t \vert y_{1:t-1}) = N(a_t, R_t)\), where:
\[a_t = E(\Theta_t \vert y_{1:t-1}) = G_t m_{t-1}\] \[R_t = Var(\Theta_t \vert y_{1:t-1}) = G_t C_{t-1} G'_t + W_t\]
One-step-ahead predictive distribution of the observation, \(f(Y_t \vert y_{1:t-1}) = N(f_t, Q_t)\), where:
\[f_t = E(Y_t \vert y_{1:t-1}) = F_t a_t\] \[Q_t = Var(Y_t \vert y_{1:t-1}) = F_t R_t F'_t + V_t\]
The filtered distribution of the latent state, \(f(\Theta_t \vert y_{1:t}) = N(m_t, C_t)\), where:
\[m_t = E(\Theta_t \vert y_{1:t}) = a_t + R_t F'_t Q^{-1}_t e_t\] \[C_t = Var(\Theta_t \vert y_{1:t}) = R_t - R_t F'_t Q^{-1}_t F_t R_t\] \[e_t = Y_t - f_t\]
The smoothed distribution of the latent state, \(f(\Theta_t \vert y_{1:T})=N(s_t, S_t)\), where:
\[s_t = E(\Theta_t \vert y_{1:T}) = m_t + C_t G'_{t+1} R^{-1}_{t+1}(s_{t+1} - a_{t+1})\] \[S_t = Var(\Theta_t \vert y_{1:T}) = C_t - C_t G'_{t+1}R^{-1}_{t+1}(R_{t+1} - S_{t+1})R^{-1}_{t+1}G_{t+1}C_t\]
These models also allow any parameter of the model to vary in time, allowing us to study temporally changing relationships. This includes covariance matrices, through a method called discount factoring. This method works extremely well for dynamic covariance when data are mostly complete. The method, outlined in West and Harrison 1997, sets the covariance matrix \(W_t\) of the state equation so that information about the latent state decays by a fixed proportion from one time step to the next.
Discounting Covariance:
\[R_t = Var(\Theta_t \vert y_{1:t-1}) = G_t C_{t-1} G'_t + W_t\] \[R_t = P_t + W_t\] \[W_t = \frac{1-\delta}{\delta}P_t\]
Here \(P_t = G_t C_{t-1} G'_t\) and \(\delta\) is the discount factor, so that \(R_t = P_t/\delta\). Practically, this equation represents how information is lost from one time step to the next.
The issue is that for data with extended periods of missingness, standard discounting methods would have the loss of information grow at an exponential rate in the forward filter. If we forecast \(k\) steps ahead through a gap in the data, the covariance \(k\) steps ahead becomes:
\[C_t(k) = \frac{G^k C_t G'^{k}}{\delta^k}\]
This is as opposed to the linear rate of a static covariance specification with the Kalman filter above, where a fixed \(W\) is added at each step:
\[C_t(k) = G^k C_t G'^{k} + \sum_{j=0}^{k-1} G^j W G'^{j}\]
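As a rough numerical illustration of the two rates, the sketch below assumes a scalar state with \(G = 1\), so the discounted forecast variance is \(C_t/\delta^k\) and the static one is \(C_t + kW\). The function names and the values of `C`, `delta`, and `W` are hypothetical, chosen only so that both specifications agree at one step ahead.

```python
def forecast_variance_discount(C, delta, k):
    """k-step-ahead state variance under discounting with G = 1."""
    return C / delta ** k


def forecast_variance_static(C, W, k):
    """k-step-ahead state variance with a fixed evolution variance W and G = 1."""
    return C + k * W


C, delta = 1.0, 0.9
W = C * (1 - delta) / delta   # matches the discounted variance at k = 1

# Compare the growth in uncertainty over gaps of increasing length (in steps).
for k in (1, 4, 12, 52):
    print(k,
          round(forecast_variance_discount(C, delta, k), 2),
          round(forecast_variance_static(C, W, k), 2))
```

With weekly data, a gap of a year is 52 steps, and at that horizon the discounted variance is far larger than the static one even though they start out identical.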
Intuition suggests that an exponential loss of information in missing data may be overly conservative, but what covariance specification gives us the most accurate results for periods of extensive missing data?
Further, what is the optimal criterion for selecting a discount factor?
Identify the optimal discounting strategy for prolonged periods of missingness
Identify the criteria with the strongest statistical power to compare models with different configurations such as fixed discount factors.
This theoretical investigation was spurred by real time series data from the Narragansett Bay Long-term Time Series. I will use the log-transformed chlorophyll (algal pigment) data, because it is the subject of another time series analysis I have been working on. The data are from 2003 to 2020, collected at weekly resolution. While a multivariate analysis would be optimal for imputation, I use a univariate series to focus on the question of covariance specification.
Raw data can be found here.
While it might seem counter-intuitive, I am using the real data from the time series to parameterize known data generation models. I chose to do this over completely random data generation models because this theoretical investigation is directly tied to the applied problem of analyzing real data with heteroskedastic behavior and prolonged missingness.
The data were first log transformed due to their highly positively skewed nature. Second, a dynamic linear model was fit to the data. To capture long-term trend and seasonal behavior, the latent state contained a dynamic intercept and Fourier-form seasonal components. The models were fit with different levels of fixed discount factors. The posterior means of \(V\) and \(W_t\) were calculated. New data series were generated from each model fit, with known parameterization. In each copy of the simulated series, missingness was randomly introduced with a frequency distribution matching the original data.
With the simulated data, DLMs were fit with different discount factor levels. The idea is to see if we can recover the discount factors of the data generation model, and which performance criterion helps us make this recovery with the highest accuracy and statistical power. Six performance metrics were calculated and compared to identify which most strongly identified the correct data generation structure.
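The simulation step can be sketched as follows. This is a simplified stand-in and not the procedure used here: it assumes a scalar local-level model instead of the intercept-plus-seasonal state, and the function names, the variances, and the list of gap lengths are all hypothetical placeholders for quantities that would be taken from the fitted model and the observed data.

```python
import random


def simulate_local_level(n, V, W, level0=0.0, rng=None):
    """Draw n observations from a scalar local-level DLM with known V and W."""
    rng = rng or random.Random()
    level, y = level0, []
    for _ in range(n):
        level += rng.gauss(0.0, W ** 0.5)           # state equation
        y.append(level + rng.gauss(0.0, V ** 0.5))  # observation equation
    return y


def add_missing_gaps(y, gap_lengths, n_gaps, rng=None):
    """Blank out n_gaps runs whose lengths are resampled from observed gap lengths."""
    rng = rng or random.Random()
    out = list(y)
    for _ in range(n_gaps):
        length = rng.choice(gap_lengths)
        start = rng.randrange(0, len(out) - length + 1)
        for i in range(start, start + length):
            out[i] = None                           # mark as missing
    return out


rng = random.Random(1)
series = simulate_local_level(200, V=0.5, W=0.05, rng=rng)
# Hypothetical gap lengths (in weeks); in practice these come from the real series.
with_gaps = add_missing_gaps(series, gap_lengths=[1, 1, 2, 4, 26], n_gaps=5, rng=rng)
```

Resampling gap lengths from the observed ones is one simple way to match the frequency distribution of missingness; gaps may overlap, so the total number of missing points can be smaller than the sum of the sampled lengths.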
Performance metrics were compared between fits with practical and standard discounting.
For the data generation model of high discount factors our performance metrics are the following for each model fit, where the x-axis is a set of discount factors used in a model fit:
For the data generation model of low discount factors our performance metrics are the following for each model fit, where the x-axis is a set of discount factors used in a model fit:
Comparing the standard and practical discounting methods we find the following for each data generation model:
Kalman, R. E. 1960. “A New Approach to Linear Filtering and Prediction Problems.” Journal of Basic Engineering, Transactions of the ASME 82 (1): 35–45. https://doi.org/10.1115/1.3662552.
West, Mike, and Jeff Harrison. 1997. Bayesian Forecasting and Dynamic Models. 2nd ed. New York: Springer-Verlag. https://doi.org/10.1007/b98971.
Marine microbes are some of the most numerous organisms in the world. Although you may have never seen or thought about them, they make our planet survivable in many ways (not the least of which are producing 50% of the atmospheric oxygen we breathe and forming the base of the food web in all the world’s oceans). Despite being so numerous and critical to life on our planet, there are major gaps in our understanding of how these tiny organisms are shaped by their environment.
Under a fellowship with NASA Space Grant, I studied how temperature impacts these critical organisms. In this project, by creating mathematical models of microbial growth and size as a function of temperature, I provided a small piece of NASA’s large numerical models that work to model and predict our own planetary function.
If you would like the full details of the published study, you can read it here:
Measure the growth and cellular size traits of a globally relevant marine alga.
Mathematically model how the cellular traits are impacted by temperature. Use interpretable statistical models that can be used for prediction and ecological interpretation. From the start, we had several targeted questions:
Use ocean thermal data and mathematical models of thermal responses to estimate the magnitude of thermal effects on microbial function.
Cells and Culture Growth
As a case study for the biological response of algae to temperature and temperature change, I chose an alga which is both common and toxic, Heterosigma akashiwo. The cells were kept healthy and in exponential growth. To avoid confounding the thermal response with the response to new media, cultures were only transferred to new media more than 1 d before or more than 1 d after a change in temperature.
Temperature treatments
With little prior information as to which features of changing temperature might influence growth, two major traits were examined in the experimental design: (1) the direction of temperature change (increasing or decreasing) and (2) the magnitude of temperature change (small shifts vs. larger cumulative changes). To address these features, and represent realistic rates of change, cultures were shifted sequentially to new growth temperatures outward from 15\(^{\circ}\)C (Fig. 1). Including the control culture, growth rate and acclimation were measured at 10 temperatures: 6\(^{\circ}\)C, 8\(^{\circ}\)C, 10\(^{\circ}\)C, 12\(^{\circ}\)C, 15\(^{\circ}\)C, 18\(^{\circ}\)C, 22\(^{\circ}\)C, 25\(^{\circ}\)C, 28\(^{\circ}\)C, and 31\(^{\circ}\)C. As each incubator had a static temperature, we used small discrete shifts in temperature over time. Beginning with the culture that was acclimated to 15\(^{\circ}\)C, every 4 d a triplicate set of the cultures growing at the highest and lowest current temperatures was split, with one fraction retained at its current temperature and the other fraction shifted one temperature step outward (i.e., further toward the temperature extremes of 6\(^{\circ}\)C and 31\(^{\circ}\)C).
Figure 1: Experimental temperature treatment design and temporal component to data collection.
Data Collection
To quantify changes in cell size, population growth rate (cell numbers), and volumetric growth rate, the abundance and cell size distribution were measured with a Beckman Coulter Multisizer 3 for 15 d following the initial transfer to the target temperature.
The measurements taken in this study, from which all statistics are derived, are cell size distributions. The Coulter counter is a machine which counts the number of particles in size bins from ~1 to 100 \(\mu\)m. Measuring the number of cells in culture with automated instruments can produce surprisingly noisy data that need to be cleaned. In particular, decaying material and undesirable cell populations in the culture may show signals in the data.
While normally the data are processed by hand, I automated the processing of thousands of cell distributions by fitting a mixture model of Gaussian and exponential components to the cell size distribution data. From the fit, I quantified the mean and variance of the size distribution and the number of cells in the target population. These population measures served as the key independent variables in the desired mathematical models of growth. Below you can see an example of a cell size distribution with overlaid Gaussian density curves:
Figure 2 : A particle and cell size distribution is fit with a mixture of distributions to identify the true cell count and population characteristics. The raw data collected from the Coulter counter is the underlying histogram of cell size with frequency on the y-axis and cell size on the x-axis. To fit the mixture distribution, the frequency data is first converted to raw measurements of individual cells. The density distributions from the mixture model are overlaid for graphical reference to the model fit. Unlike most clustering cases where the true number of clusters cannot be known, microscopy was used to verify the number of true cell populations and possibly multiple cell size groups of the same population.
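The idea behind this automated cleaning step can be sketched with a small expectation-maximization (EM) routine for a two-component mixture, one exponential component for small debris and one Gaussian component for the cell population. This is an illustrative reconstruction and not the code used in the study: the function names, the starting values, the use of a single Gaussian, and the synthetic data are all assumptions.

```python
import math
import random


def expand_histogram(bin_centers, counts):
    """Convert binned counts into one size value per counted particle."""
    return [c for c, n in zip(bin_centers, counts) for _ in range(n)]


def fit_exp_gauss_mixture(x, n_iter=200):
    """EM for a two-component mixture: exponential (debris) + Gaussian (cells)."""
    x_sorted = sorted(x)
    # Crude starting values: cells are assumed to sit in the upper half of sizes.
    rate = 1.0 / (sum(x) / len(x))
    upper = x_sorted[len(x) // 2:]
    mu = sum(upper) / len(upper)
    var = sum((v - mu) ** 2 for v in upper) / len(upper) + 1e-6
    w = 0.5  # mixing weight of the Gaussian component
    for _ in range(n_iter):
        # E-step: responsibility of the Gaussian component for each particle
        resp = []
        for v in x:
            g = w * math.exp(-0.5 * (v - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)
            e = (1 - w) * rate * math.exp(-rate * v)
            resp.append(g / (g + e))
        # M-step: weighted maximum-likelihood updates
        ng = sum(resp)
        ne = len(x) - ng
        mu = sum(r * v for r, v in zip(resp, x)) / ng
        var = sum(r * (v - mu) ** 2 for r, v in zip(resp, x)) / ng + 1e-9
        rate = ne / sum((1 - r) * v for r, v in zip(resp, x))
        w = ng / len(x)
    return {"weight": w, "mean": mu, "var": var, "rate": rate, "n_cells": ng}


# Synthetic example: 600 debris particles and 1400 cells near 12 um.
rng = random.Random(0)
sizes = ([rng.expovariate(0.5) for _ in range(600)]
         + [rng.gauss(12.0, 1.5) for _ in range(1400)])
fit = fit_exp_gauss_mixture(sizes)
```

The Gaussian component's mean and variance describe the cell size distribution, and the summed responsibilities (`n_cells`) estimate how many counted particles belong to the target population. Real Coulter counter output would first pass through something like `expand_histogram` to turn bin counts into individual measurements.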
Population growth rates were calculated by fitting an exponential growth curve to the population data:
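\[P_t = P_0 e^{rt}\]where \(P_t\) is the population at time \(t\), \(P_0\) is the initial population, and \(r\) is the specific growth rate.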
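One common way to estimate the growth rate from such a curve is to regress the log of the cell counts on time, since \(\ln P_t = \ln P_0 + rt\) is linear in \(t\). The sketch below takes that route as an assumption; the fitting method actually used is not stated here, and the function name and synthetic counts are hypothetical.

```python
import math


def fit_exponential_growth(times, counts):
    """Least-squares fit of ln(P_t) = ln(P_0) + r * t; returns (P_0, r)."""
    logs = [math.log(c) for c in counts]
    n = len(times)
    t_bar = sum(times) / n
    l_bar = sum(logs) / n
    sxy = sum((t - t_bar) * (l - l_bar) for t, l in zip(times, logs))
    sxx = sum((t - t_bar) ** 2 for t in times)
    r = sxy / sxx                        # slope = specific growth rate
    p0 = math.exp(l_bar - r * t_bar)     # intercept back-transformed
    return p0, r


days = [0, 1, 2, 3, 4]
cells = [1000 * math.exp(0.6 * d) for d in days]   # synthetic, noise-free counts
p0, r = fit_exponential_growth(days, cells)
```

With noisy counts, the log-linear fit and a direct nonlinear fit weight the observations differently, so the two need not give identical estimates.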
1.) How does temperature affect growth rate?
With maximum likelihood estimation, I used the calculated specific growth rates to fit a standard thermal reaction curve used in biology:
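\[k(T) = a e^{bT}\left[1 - \left(\frac{T-z}{w/2}\right)^2\right]\]where \(k(T)\) is the specific growth rate at temperature \(T\), \(w\) is the width of the tolerable thermal range, \(z\) is the temperature at the center of that range, and \(a\) and \(b\) are fitted coefficients of the exponential term.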
Figure 3 : The thermal reaction norm showed a clear thermal dependence with a tolerable range (\(w = 25.9^{\circ}C\)) spanning from \(7^{\circ}C\) to \(33^{\circ}C\). This upper temperature limit suggests that the marine alga can survive some of the hottest temperatures seen in the ocean, and is likely to tolerate further warming with continued positive growth rates. However, warming beyond the optimum (\(z=20^{\circ}C\)) will result in thermal stress.
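The thermal reaction curve is straightforward to evaluate once its parameters are known. The sketch below uses the reported \(w = 25.9\) and \(z = 20\); the values of `a` and `b` are not given in this summary, so the ones used here are hypothetical placeholders, as is the function name.

```python
import math


def thermal_performance(T, a, b, z, w):
    """Growth rate k(T) = a * exp(b*T) * (1 - ((T - z) / (w / 2))**2)."""
    return a * math.exp(b * T) * (1.0 - ((T - z) / (w / 2.0)) ** 2)


z, w = 20.0, 25.9     # values reported in the text
a, b = 0.2, 0.05      # hypothetical, for illustration only

# The quadratic term makes k(T) cross zero at z - w/2 and z + w/2.
lower, upper = z - w / 2.0, z + w / 2.0
rates = {T: thermal_performance(T, a, b, z, w) for T in (6, 10, 15, 20, 25, 31)}
```

The zero crossings at roughly 7 and 33 degrees match the tolerable range quoted in the caption, and they depend only on `z` and `w`, not on the placeholder values of `a` and `b`.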
2.) How are the sizes of cells impacted by temperature?
An analysis of cell size showed that both the division rate and cell size were closely tied to temperature:
Figure 4: ESD (Cell diameter) measurements across temperature. Results show a clear relationship with generally decreasing cell size as temperature increases.
While the relationship between cell size and temperature is strong, nuance here is important. From the graphical outputs, I noted that the cell size relationships and aberrations from linearity seemed to match an inverse of the temperature growth curve. After noting some nonlinearity in the relationship and conducting other exploratory comparisons, I tested whether there was another explanatory reason for cell size changes.
After comparing multiple models (AIC) with a type II regression, I showed that cell size was best explained by cell division rates. This is important to our understanding because the evidence suggests that it’s not temperature directly, and potentially any influences on division rates will affect cell size.
Figure 5 : A type II regression fit between ESD (cell size) and division rate (population growth rate). The temperature treatment where each measurement was recorded is shown in color.
3.) How much does change itself impact population growth and cell size?
To quantify the effect of time (acclimation) on specific growth rate, the final rates (\(\mu_f\), 7 ≤ \(\Delta\) time ≤ 15 d) were subtracted from the initial rates, measured over the first 3 d (\(\mu_0\); \(\Delta\) time ≤ 3 d) for each temperature treatment.
Break-point detection and Dunnett’s test showed three distinct response groups: 1.) small thermal changes, which lowered growth rate immediately; 2.) medium changes, which accelerated growth; and 3.) extreme changes, which again lowered growth. To be brief, there are biological mechanisms we hypothesize can explain these patterns, but further detailed molecular and metabolic research would be needed to understand these response patterns fully. Nonetheless, we have evidence that changing temperature dramatically affects growth rates.
Figure 6: Growth rates were dramatically impacted by the cumulative magnitude of temperature change itself.
Last, I examined patterns in the variability of growth rate among replicates as a function of the cumulative temperature change (\(\Delta\)Temperature) and time (translated into a binary variable; \(t_0\) and \(t_f\) to 0 and 1, respectively). After graphically examining the variance in growth rate relative to temperature changes, an exponential relationship was fit to the SD of the specific growth rate (σ). The following regression was again fit with maximum likelihood estimation (MLE):
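\[\sigma = c * e^{d*\vert\Delta T\vert - e*time}\]where \(\sigma\) is the SD of the specific growth rate among replicates, \(\Delta T\) is the cumulative temperature change, \(time\) is the binary time variable, and \(c\), \(d\), and \(e\) (the coefficient on \(time\), not the base of the exponential) are fitted parameters.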
Temperature change caused an exponential increase in biological variability, and the exponent decreased with time. This may seem esoteric, but the fundamental origins of biological variability are an open question in biology, and any evidence that can explain why organisms vary is valuable. Here our data show how environmental perturbations can rapidly diversify a population.
Figure 7: Standard deviation in division rate (between replicates) as a function of the magnitude of temperature change. The exponential fit suggests that biological variability increases exponentially as a result of environmental variation.
How much can thermal variability impact estimates from a standard thermal performance curve?
To illustrate the impact of the thermal response on microbial growth, as well as the potential magnitude of the newly found acclimation effects, I extracted seasonal data from varied oceanographic sites within the thermal range of our study organism. By inputting these data into the fitted thermal performance curve, we made a prediction of population growth and production.
Collecting enough laboratory data to fit any complex function or non-parametric model to the acclimation data was impossible. As you can see from the raw data of figure 6 above, limited sample size limits the modeling options. Nevertheless, we would be remiss not to make an estimate of potential effects. Using a moving window to quantify the total temperature change an organism would experience, a scaling factor was used based on the response shown in figure 6. Consistent with the results where different magnitudes of temperature change elicited different acclimation responses, we applied a correction equal to the difference in initial and final growth rates observed at each magnitude in temperature change. Specifically, three temperature ranges in the thermal history window elicited different responses: \(3–5^{\circ}C\), \(5–13^{\circ}C\), or greater than \(13^{\circ}C\) (see “Results” section). The average specific growth rate differences (i.e., acclimation) for these three thermal ranges were −0.14 \(d^{−1}\), 0.10 \(d^{−1}\), and −0.18 \(d^{−1}\), respectively. These differences were added to the acclimated rate interpolated from the thermal performance curve. That is, when the thermal history window had a temperature range of \(3–5^{\circ}C\), \(5–13^{\circ}C\), or greater than \(13^{\circ}C\), a −0.14, 0.10, or −0.18 \(d^{−1}\) correction was added to the growth rate inferred from the thermal performance curve. Cases where no growth was observed were omitted. The percent difference between final and initial growth responses was compared for each day.
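The correction described above can be sketched as a lookup on the temperature range seen in a trailing window. The three corrections come from the text; everything else is an assumption made for illustration, including the function names, the window length (not stated here), the handling of the shared boundaries at 5 and 13 degrees, and the choice of no correction below 3 degrees.

```python
def acclimation_correction(temp_range):
    """Growth-rate correction (d^-1) for the temperature range seen in the window."""
    if temp_range > 13:
        return -0.18
    if temp_range >= 5:       # assumed boundary handling for the 5-13 band
        return 0.10
    if temp_range >= 3:
        return -0.14
    return 0.0                # assumed: below 3 degrees, no correction


def corrected_growth(temps, acclimated_rate, window=7):
    """Add the correction to the acclimated rate at each time step.

    temps: temperature series; acclimated_rate: callable T -> rate from the
    thermal performance curve; window: hypothetical history length in steps.
    """
    out = []
    for i, T in enumerate(temps):
        history = temps[max(0, i - window + 1): i + 1]
        temp_range = max(history) - min(history)
        out.append(acclimated_rate(T) + acclimation_correction(temp_range))
    return out


# Hypothetical example with a constant acclimated rate of 0.5 d^-1.
example = corrected_growth([10, 10, 14, 14], lambda T: 0.5, window=3)
```

In the example, the first two steps see no temperature change and keep the acclimated rate, while the last two see a 4 degree range within the window and receive the −0.14 correction.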
Figure 7: With real-world data used for simulation, within the habitat of our target organism, temperature frequently varied to levels where we had seen a measurable response to the variability itself.
Figure 8: Comparing cases where acclimation was and wasn’t accounted for, there was potential for major discrepancies in growth estimates. The x-axis shows the magnitude of population growth differences in percent. The y-axis shows the number of days (out of 1 year) falling in each histogram bin. These results suggest that in any environment with high seasonality, further research to quantify variability responses will be necessary for accurate estimates.
During the COVID-19 pandemic, I had the privilege to work on the Northeast Big Data Innovation Hub project CritCOVIDView: A Critical Care Visualization Tool for COVID-19. The project was led with the clinical and research expertise of Todd Brothers PhD, PharmD and Mohammad Al-Mamun PhD. As part of the team, I created data processing pipelines, conducted statistical analysis for insight, and trained statistical models for predicting patient outcomes and the clinical influences on these outcomes. The core goal of this project was to develop predictive models and statistical insights from high-dimensional patient data to help clinicians make data-driven medical decisions during the COVID crisis. Recently, some of the survival (aka time-to-event) modeling work was published in the medical journal SAGE Open Medicine. In this post, I will summarize the modeling components of this paper that I worked on as part of the team.
This project analyzes the prevalent and dangerous condition of Acute Kidney Injury (AKI) and its interaction with medications, patient demographics, and COVID infection.
Because of the sensitivity of the data I can neither share the data nor analysis code, but I can and will provide a description of the analysis process, results, and implications.
Acute Kidney Injury (AKI) is a dangerous and unfortunately common condition in the ICU, impacting millions of patients every year and associated with a wide range of adverse patient outcomes, including death. The goal of this analysis is to model the outcomes of these patients (both mortality and recovery) as well as to identify the treatments and patient conditions that may impact the patient outcome.
This study received all proper ethical approval (see publication details), and used deidentified, retrospective data.
226 ICU patients were included in this study. Data included all available records of vitals, demographics, medications received, laboratory readings (from blood work), and oxygenation. Complete patient data were available for the entirety of each patient stay.
AKI as a condition can be classified into 3 stages (stage 3 being the most severe), and was calculated via the criteria of the Kidney Disease Improving Global Outcomes standard. To accomplish this, I created a data pipeline which used rolling windows, applied over the temporally structured data, to detect the AKI criteria during the patient staty. By applying the AKI criteria in rolling windows of the patient data, important temporal traits were retained such as the time to onset of AKI, worsening/improving of condition over time, and time between diagnose and outcome (recovery/mortality).
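Since I cannot share the actual pipeline, here is a minimal sketch of the rolling-window idea using only the serum creatinine part of the criteria. It is not the code from the study. The function name `kdigo_stage`, the choice of the lowest reading in the trailing window as the reference value, and the example readings are all assumptions made for illustration. Urine output and renal replacement therapy criteria are left out.

```python
from datetime import datetime, timedelta


def kdigo_stage(times, creatinine):
    """Return a creatinine-based stage (0-3) for each reading.

    Assumption: the reference value is the lowest reading in the trailing
    window (48 h for the absolute rise, 7 days for the relative rise).
    """
    stages = []
    for i, (t, value) in enumerate(zip(times, creatinine)):
        history = list(zip(times[: i + 1], creatinine[: i + 1]))
        # Readings inside each trailing window, including the current one.
        last_48h = [c for s, c in history if t - s <= timedelta(hours=48)]
        last_7d = [c for s, c in history if t - s <= timedelta(days=7)]
        ratio = value / min(last_7d)   # relative rise over 7 days
        rise = value - min(last_48h)   # absolute rise over 48 hours (mg/dL)
        has_aki = ratio >= 1.5 or rise >= 0.3
        if ratio >= 3.0 or (has_aki and value >= 4.0):
            stage = 3
        elif ratio >= 2.0:
            stage = 2
        elif has_aki:
            stage = 1
        else:
            stage = 0
        stages.append(stage)
    return stages


# Hypothetical readings (mg/dL) taken every 24 hours.
start = datetime(2021, 1, 1)
times = [start + timedelta(hours=h) for h in (0, 24, 48, 72, 96)]
readings = [0.8, 0.9, 1.3, 1.7, 2.6]

stages = kdigo_stage(times, readings)
# Time from ICU entry to the first reading that meets the AKI criteria.
onset = next((t - times[0] for t, s in zip(times, stages) if s > 0), None)
print(stages, onset)
```

Because a stage is assigned at every reading, temporal traits such as time to onset or worsening between readings fall out of the per-reading stages directly.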
First, the descriptive statistics were calculated for each study group. The mean and interquartile range were calculated for each continuous variable, and for categorical variables, the count and percentage of patients meeting the condition were calculated. To statistically test differences between the study groups, pairwise t-tests were used for continuous variables, and a chi-square test was used for categorical features. Fisher’s exact test was used to statistically test differences in count data for medication classes received, due to the small sample sizes for patients receiving certain medication categories.
To prepare the data for survival modeling, the data had to be processed into a time-to-event format. In this format, each row was occupied by the data for a given patient and condition status. For example, a row may designate the AKI stage 1 status of patient X. If patient X was also non-AKI for part of their stay, a separate row would designate their data for the non-AKI period. Each row contained an event (recovery, worsening AKI condition, mortality), the time from diagnosis or ICU entry to the event, the patient demographics, and average laboratory, oxygen, and vitals data. In addition, a dummy encoding was used for the medication classes received during their stay. Altogether, the transformed data gave information on the time to event, the event, and the patient conditions leading up to the event.
The first basic question is how our AKI vs non-AKI patients compare. A descriptive table and pairwise comparisons (chi-square and t-test for categorical and continuous variables respectively) were used to describe the differences between the cohorts of AKI and non-AKI patients.
Outcomes, including time in the ICU, time on mechanical ventilation, and mortality rates, were also compared between the cohorts.
After establishing medical differences between our AKI and non-AKI patients, we then ask what might predict a patient becoming an AKI patient. AKI classification (a binary response variable) was predicted with a LASSO logistic model, a model which penalizes parameters such that it can also serve for feature selection. The features selected by the model provide information on what predicts AKI classification, and the odds ratios comparing the AKI and non-AKI groups inform us of the effect size.
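To make the row layout concrete, here is a small sketch of the transformation for a single patient. It is a simplified illustration rather than the study code. The function name `to_time_to_event`, the event labels, and the example timeline are hypothetical, and the demographic, laboratory, and medication columns are omitted.

```python
def to_time_to_event(patient_id, timeline, final_event, end_time):
    """Build one row per condition status for a single patient.

    timeline: sorted list of (hours_since_icu_entry, aki_stage) pairs,
              one entry each time the stage changes (stage 0 = non-AKI).
    final_event: what ended the last period, e.g. "recovery" or "mortality".
    end_time: hours since ICU entry at which the final event occurred.
    """
    rows = []
    for k, (start, stage) in enumerate(timeline):
        if k + 1 < len(timeline):
            # The period ends when the patient moves to the next status.
            stop, next_stage = timeline[k + 1]
            if next_stage > stage:
                event = "worsening"
            elif next_stage == 0:
                event = "recovery"
            else:
                event = "improving"
        else:
            stop, event = end_time, final_event
        rows.append({
            "patient": patient_id,
            "status": stage,
            "event": event,
            "duration": stop - start,  # hours spent in this status
        })
    return rows


# Hypothetical patient: non-AKI at entry, stage 1 at hour 30, stage 2 at hour 80.
rows = to_time_to_event("X", [(0, 0), (30, 1), (80, 2)], "mortality", 200)
for row in rows:
    print(row)
```

Each status period becomes its own row, so the same patient can contribute to the non-AKI group and to one or more AKI stage groups.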
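For readers who want the objective written out, a standard form of the LASSO-penalized logistic regression is sketched below. The notation is my own for this post: \(y_i\) is the binary AKI label, \(X_i\) the feature vector, and \(\alpha\) the penalty strength (written as \(\alpha\) here only to avoid confusion with the hazard \(\lambda\) used later).

\[\hat{\beta} = \arg\min_{\beta_0, \beta} \left\{ -\frac{1}{n}\sum_{i=1}^{n}\left[ y_i\left(\beta_0 + X_i^\top \beta\right) - \log\left(1 + e^{\beta_0 + X_i^\top \beta}\right) \right] + \alpha \sum_{j=1}^{p} \vert \beta_j \vert \right\}\]

The absolute-value penalty shrinks some coefficients exactly to zero, which is what makes the model usable for feature selection. For a retained feature \(j\), the odds ratio is \(\exp(\hat{\beta}_j)\).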
We also asked which medication classes might influence the AKI outcome (recovery, persistence of AKI, or mortality). Because the sample size limited statistical power in survival models, I employed Fisher’s exact test to compare the number of patients receiving a given medication class across patient outcomes. That is, the test compared whether more patients with a certain outcome received a given medication category. Fisher’s exact test was used because it remains valid at small sample sizes.
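As a rough illustration of why this test suits small counts, here is a sketch of the two-sided Fisher’s exact test for a 2x2 table, computed directly from the hypergeometric distribution. This is a simplification: the function name `fisher_exact_2x2` and the counts are hypothetical, and a comparison across three outcomes would need the larger-table generalization of the test.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at least as unlikely as the observed one.
    # The small tolerance guards against floating point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )


# Hypothetical counts: rows = received the medication class (yes / no),
# columns = outcome (recovered / did not recover).
p_value = fisher_exact_2x2(3, 1, 1, 3)
print(round(p_value, 4))
```

No large-sample approximation is involved, so the p-value is exact even when a cell holds only a handful of patients.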
I calculated the Kaplan-Meier curve, which shows survival probability by each AKI class. It is a standard in survival modeling and we chose to show it as a reference. However, this curve is often used in settings that violate its statistical assumptions, and the same is true here (later, we show with non-proportional hazards modeling how survival differences become apparent with the correct model structure). While the Kaplan-Meier estimates of empirical survival probability show little difference among the AKI groups, the estimator does not account for competing outcomes, meaning that multiple events can happen which prevent a patient from experiencing a single target event. For example, in our data, a patient may be diagnosed with AKI stage 1. The event we are monitoring may be recovery, but they may also become censored from the study because they progressed to a more severe AKI stage, or died. Further, the often-cited log-rank statistic, which tests for differences in survival curves by group, is poorly suited to time-varying differences in survival probability. Clearly, these are major limitations that must be addressed.
As a consequence of the competing hazards in our data, a Cox regression was used to test the difference in the hazards for recovery and mortality between each of the AKI-classified patient groups and the control group. Recovery and mortality were the target variables in separate models. The included predictors were AKI status, demographic traits, and medication class received.
The Cox proportional hazards model is a model for the hazard rate, or the instantaneous rate at which an event occurs among patients who have not yet experienced it. Obviously, in the real world, instantaneous rates are difficult to interpret, but if this rate is integrated over time, it gives the cumulative hazard, from which the probability of an event occurring up to that point in time follows. Further, the parameterization of the Cox model for the instantaneous hazard will give us valuable information about what influences the probability of an event (like recovery or mortality) occurring. The Cox proportional hazards model is configured as such:
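For reference, the Kaplan-Meier estimator itself is simple enough to sketch in a few lines. The version below is a bare-bones illustration with hypothetical data, not the analysis code. Note that every patient whose target event was not observed is treated as censored, which is exactly the step that becomes questionable when the "censoring" is really a competing event such as worsening AKI or death.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct event time.

    durations: time to event or censoring for each patient.
    observed: 1 if the target event occurred at that time, 0 if censored.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if events:
            # Multiply by the fraction of at-risk patients surviving time t.
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        # Everyone with an event or censoring at t leaves the risk set.
        at_risk -= leaving
    return curve


# Hypothetical durations (days) with event indicators.
curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
for t, s in curve:
    print(t, round(s, 3))
```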
\[\lambda(t \vert X_i) = \lambda_0(t)\exp(\beta_1 X_{i1} + \dots + \beta_p X_{ip})\]
Here, \(\lambda_0(t)\) is the baseline hazard, \(X_{i1}, \dots, X_{ip}\) are the covariates for patient \(i\), and \(\beta_1, \dots, \beta_p\) are the coefficients. However, analysis of the Schoenfeld residuals showed significant time variation in the covariate effects. Therefore, a non-proportional hazards model was fit in which features that varied significantly over time, as indicated by the Schoenfeld residuals, were given temporal flexibility.
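One common way to write such a model, given here only as a sketch and not necessarily the exact form used in the paper, is to let the coefficients of the flagged features depend on time. Writing \(P\) for the set of features that satisfy proportional hazards and \(V\) for the set flagged as time-varying (both labels are my own notation):

\[\lambda(t \vert X_i) = \lambda_0(t)\exp\left(\sum_{j \in P} \beta_j X_{ij} + \sum_{k \in V} \beta_k(t) X_{ik}\right)\]

With a constant \(\beta_j\), the hazard ratio \(\exp(\beta_j)\) is the same at every time point. With \(\beta_k(t)\), the hazard ratio \(\exp(\beta_k(t))\) can rise or fall over the course of a patient’s stay, which is what allows a risk that grows over time to show up in the model.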
The pairwise analysis of AKI and non-AKI patients showed that AKI patients varied significantly in many traits, including vitals, laboratory results, and comorbidities.
Pairwise comparisons of outcomes showed significantly longer time in the ICU and time on mechanical ventilation for patients who were AKI classified.
LASSO logistic regression identified an increased risk of developing AKI for patients with higher BMI, with hypo-osmolality/hyponatremia, and on certain classes of medications, especially diuretics, anti-infectives, and gastrointestinal agents.
The non-proportional hazards model predicting mortality showed that severe AKI patients have a mortality risk that dramatically rises over time, as compared to stage 1 and stage 2 patients who have static mortality risk. This highlights the severe danger of entering stage 3 AKI without recovering.
The non-proportional hazards model predicting recovery showed that stage 1 AKI patients have a recovery hazard (that is, a rate of recovery) that dramatically rises over time, as compared to more severe stages. This suggests that patients who develop AKI and can be kept to a low stage are likely to recover quickly. This emphasizes the importance of mitigation and of reducing the potential for further kidney damage (such as from stressful medication regimens) that could worsen the AKI condition and thus the potential for adverse outcomes.
Having just completed this website, I thought it would be fitting to first post what I learned about building a static website, and how it can help you in your professional journey. A personal website can be a great tool to teach others, build your network, and hey, show off some of your hard work. These are all reasons why I built this site. If you’re also in industry like me, probably most of what you do is confidential, but it can still be a good place to share techniques and open source examples.
This post will be about how to set up a free website of your own using Github Pages, how to customize it as much as your heart desires with Jekyll and Ruby, and finally how to add some extra flair.
What is Github Pages?
Github Pages are public web pages any user can utilize that will be freely hosted on Github. Given this is Github based, you’ll need a Github account. It would also be helpful to have git on your desktop.
What is Jekyll?
Jekyll is an open source static site generator with which you can write content like this in basic Markdown and use HTML and CSS for structure and presentation. Jekyll does all the hard work of compiling this into HTML so you don’t have to. If you’ve ever had to write in HTML, you’ll quickly learn to appreciate Jekyll.
What is Ruby?
Ruby is the programming language Jekyll is written in. If you’re doing anything ordinary you probably won’t need to write anything in Ruby, but it is helpful to know that Ruby underlies Jekyll for building the site.
It is possible to build your site either with Jekyll or without. We will go over both options here:
Option 1: If you want the easiest route that gives pretty good results, you can do everything from Github without really having to think about Jekyll. To do this, the following will get you set up:
Voila, you have made a site under the url https://username.github.io . If you want to make additions, such as to add more pages, you can do so with Jekyll. If you are taking this option 1 route, you can see the github documentation here for additional details. Do note however, that it may take a few minutes for changes to update.
Option 2:
This second option is for those who want more control. With this, I will cover how to use custom themes, adapt them to your needs, add pages, add posts, and even get a custom domain name for your website.
The number of open source themes for Jekyll is astounding. With little effort, you can make use of themes built with the design skill of professional web designers. While there is a wealth of high quality themes, it can be a little intimidating at first to know how to update and implement them for your own needs, especially if, like me, you were not familiar with Jekyll and web design beforehand. Do not worry, I will take you through how to build a beautiful site with the example of the Minimal Mistakes theme. If you are interested in another theme, you can find a gallery of downloadable, Github Pages compatible themes here.
There are actually several ways to implement the Minimal Mistakes theme, but I recommend forking the repository and using this as a template. With your own forked version of the template, you have access to all the code, which lets you make more changes down the line.
To build your Github page from the repository via forking, simply navigate to the Minimal Mistakes repository. Click the fork icon on the top right menu like so:
As in option 1, we need to rename this repository on our own profile with the name username.github.io. Under the forked repository, you can do this simply by clicking on settings on the menu bar, and typing the new name of your repository in the top box:
Under “Code & operations” go to pages. Your github site should now be up and running on https://username.github.io.
Of course, what you see are all the default pages and content for this theme. How do we include our information and customize?
Because we are really starting with a blank template of a site, and you are going to want to visualize your changes as you go, it is best to add your content locally. Adding everything directly from your profile on Github is possible, but it will be particularly painful because commits will take several minutes to take effect.
For your local environment, you will need git installed and an SSH key properly set up. You will also need Jekyll and Ruby. For installation and setup, I will point you to an excellent tutorial for both Windows and Mac. Last, you will need a text editor. I recommend Visual Studio Code because it facilitates fast staging, commits, and pushes to your Github. This means you’ll be able to develop quickly and easily. Let’s get started.
First, clone your Github repo locally. You can do this from Git Bash via _git clone
From your command line, navigate to your cloned repo and enter bundle exec jekyll serve. This will serve the site locally on your PC. It should default to localhost:4000, which you can type in any browser. Changes to posts and pages will take place immediately, but config changes will require quitting and restarting the serve command. Nevertheless, this will let you start making additions and changes to your site while seeing how they impact the webpage.
With your PC serving your site locally, go ahead and navigate to your repository folder in VS Code. There are several major folders and files you can use to adapt the site to your needs:
So far so good. But if you’re like me you might be bothered by having to stick with the standard domain name for your site. If you want to get creative or maybe just don’t want to have .github.io in the address, not to fear, Github has made this easy.
If you don’t already own a domain name, this is pretty simple. There are a number of domain name services. I chose GoDaddy.com, but it doesn’t matter which you use.
Once you buy a domain name, you’ll have to make a few changes with your domain management and Github to get it working, but it’s a cinch.
On the DNS management page, you need to make the following changes:
In the Github repository for your site, add a file “CNAME” in the root directory. In CNAME, add your purchased domain name.
]]>