Analyzing complex relationships in time series: clustering, causal discovery, and root cause analysis
Date
2024-12
Authors
Adesunkanmi, Rahmat
Major Professor
Khokhar, Ashfaq
Committee Member
Carriquiry, Alicia
Ramamoorthy, Aditya
Dogandzic, Aleksandar
Weber, Eric
Abstract
\textbf{Time series analysis} studies the dependence among observations at different points in time. What distinguishes time series analysis from general multivariate analysis is the temporal order imposed on the observations. Over the years, research on analyzing time series has presented challenges and methods for uncovering the underlying patterns, relationships, and dynamics that govern complex systems. Recently, the application of machine learning (ML) to classification, forecasting, intervention, and prediction tasks on temporal data has become popular. This Ph.D. research draws on techniques from advanced machine learning, applied mathematics, and statistical modeling to address some of these challenges and extract meaningful insights from time series. The research is structured around three interconnected areas of analysis: classification through robust clustering of noisy data, forecasting through causal discovery in non-linear systems, and root cause analysis through causal models.
Firstly, we present a \textbf{robust clustering} approach to identify and group similar patterns within noisy datasets. Traditional clustering methods such as $K$-means and $K$-medoids operate on raw data and are not designed to be robust to noise. However, data is inherently affected by the random nature of the physical generation process, measurement inaccuracies, sampling discrepancies, outdated data sources, and other errors, making it prone to noise and uncertainty. To address this challenge, we propose a novel statistical metric, the expectation distance (ED) between random variables, which we show to be a valid statistical distance measure. Using this newly proposed metric alongside the well-known $2$-Wasserstein ($W_2$) distance, we develop noise-robust clustering algorithms that operate over data distributions rather than raw data points. By extending the traditional $K$-means and $K$-medoids algorithms with these statistical metrics, our approach proves more effective in handling noisy data. Our research shows that while the $W_2$ distance relies only on marginal distributions and ignores correlation information, the proposed ED metric captures this correlation, leading to superior noise-robust clustering results. Clustering noisy data is particularly critical in fields where data uncertainty and noise are prevalent, such as finance, healthcare, and environmental monitoring. However, clustering alone offers limited interpretability and does not exploit the temporal ordering of the data, which motivates the further time series analyses in this work.
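To illustrate the idea of clustering over distributions rather than raw points, the following is a minimal sketch of $K$-means under the $W_2$ metric for one-dimensional Gaussians, where $W_2$ has a closed form and the cluster barycenter reduces to averaging the (mean, std) parameters. The ED metric itself is defined in the dissertation and is not reproduced here; all function names below are illustrative.

```python
import numpy as np

def w2_gauss(a, b):
    # Closed-form 2-Wasserstein distance between 1-D Gaussians,
    # each given as a (mean, std) pair.
    return np.sqrt((a[0] - b[0])**2 + (a[1] - b[1])**2)

def w2_kmeans(points, k, iters=50, seed=0):
    """K-means over 1-D Gaussian distributions under the W2 metric.

    points: array of shape (n, 2), each row a (mean, std) pair.
    For 1-D Gaussians the W2 barycenter of a cluster is the
    component-wise average of (mean, std), so the K-means update
    stays closed-form.
    """
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        d = np.array([[w2_gauss(p, c) for c in centers] for p in points])
        labels = d.argmin(axis=1)
        new = np.array([points[labels == j].mean(axis=0)
                        if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```

On two well-separated groups of noisy Gaussians, e.g. `w2_kmeans(np.array([[0.0, 1.0], [0.2, 1.1], [10.0, 1.0], [10.3, 0.9]]), 2)`, the algorithm recovers the two groups regardless of which points seed the centers.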
The second part of this research is centered on \textbf{causality}, specifically causal discovery, a process that uncovers causal relationships between variables in time series data. Understanding these causal relationships is essential not only for accurately modeling the data but also for making reliable predictions about future behavior, which is a step beyond clustering. This part of the research emphasizes Granger causality (GC) to uncover these causal connections. To address the limitations of traditional linear GC methods, we introduce a framework called NeuroKoopman Dynamic Causal Discovery (NKDCD). This framework is inspired by Koopman operator theory and harnesses the computational power of neural networks to learn the Koopman basis functions automatically. These basis functions lift the non-linear dynamics inherent in time series data into a higher-dimensional space, where they can be analyzed more effectively using linear system techniques. NKDCD employs an autoencoder architecture that facilitates this transformation, enabling the application of GC in non-linear settings.
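The lifting idea can be sketched in a few lines. In NKDCD the basis functions are learned by an autoencoder; as a simplified stand-in, the sketch below uses a fixed polynomial basis, then scores Granger causality by the ratio of least-squares prediction errors with and without lifted lags of the candidate cause. This is a regression-error proxy for GC, not the dissertation's method; all names are illustrative.

```python
import numpy as np

def lift(x):
    # Hand-picked polynomial basis functions, standing in for the
    # neural-network-learned Koopman basis in NKDCD.
    return np.stack([x, x**2, x**3], axis=-1)

def granger_score(cause, effect, lag=1):
    """Ratio of one-step prediction SSE for `effect` without vs. with
    lifted lags of `cause`; a ratio well above 1 suggests `cause`
    Granger-causes `effect` in the lifted (linear) space."""
    T = len(effect)
    y = effect[lag:]
    Xr = lift(effect[:-lag]).reshape(T - lag, -1)                   # restricted
    Xf = np.hstack([Xr, lift(cause[:-lag]).reshape(T - lag, -1)])   # full
    def sse(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ beta
        return r @ r
    return sse(Xr) / max(sse(Xf), 1e-12)
```

For a series driven non-linearly, e.g. `y[t+1] = 0.5*y[t] + x[t]**2 + noise`, the score for the true driver `x` is far above 1, while an independent series scores near 1.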
Finally, the research concludes with the task of \textbf{root cause analysis}, using causal models to identify the causal variables that influence hydrogen bond separation in molecular dynamics simulation (MDS) data. Specifically, we treat the breaking of hydrogen bonds as an "intervention" and use graphical causal models to represent both the bonding and separation events in the MDS data. To do this, we employ a dynamic Bayesian network (DBN) tailored to time-series data, allowing us to extract probabilistic directed acyclic graphs (DAGs) that capture causal relationships.
This causality-based framework models interatomic and dynamical interactions in molecular systems by inferring causal relationships among atoms from observational data. The causal models are built using a variational autoencoder (VAE) architecture, which facilitates the inference of causal relationships across samples with diverse underlying causal graphs while leveraging shared dynamic information.
Once the causal models are learned, the analysis progresses further by inferring the root causes of changes in the joint distribution of the causal models, capturing the long-range dynamics of the data. By constructing probabilistic causal models that track shifts in the conditional distributions of molecular interactions during bond formation or breaking, this framework offers a novel approach to root cause analysis in molecular dynamic systems.
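The core of this attribution step can be illustrated on a toy causal chain. The sketch below fits linear-Gaussian mechanisms $P(X)$ and $P(Y \mid X)$ on a reference window and measures the per-sample log-likelihood drop of each mechanism on a new window; the mechanism with the larger drop is flagged as the root cause of the distribution shift. This is a deliberately simplified stand-in for the dissertation's DBN/VAE-based models, and every name in it is illustrative.

```python
import numpy as np

def avg_loglik(x, m, v):
    # Average log-likelihood of samples x under N(m, v).
    return -0.5 * np.mean((x - m)**2 / v) - 0.5 * np.log(2 * np.pi * v)

def mechanism_drops(ref, new):
    """Attribute a distribution shift in a chain X -> Y to one of its
    causal mechanisms, P(X) or P(Y|X).

    Fits linear-Gaussian mechanisms on the reference window and reports
    the drop in per-sample log-likelihood on the new window for each
    mechanism; the larger drop points at the root cause."""
    xr, yr = ref
    xn, yn = new
    # Mechanism P(X): Gaussian marginal fitted on the reference window.
    mx, vx = xr.mean(), xr.var() + 1e-12
    # Mechanism P(Y|X): linear-Gaussian regression fitted on reference.
    a, b = np.polyfit(xr, yr, 1)
    rr, rn = yr - (a * xr + b), yn - (a * xn + b)
    mr, vr = rr.mean(), rr.var() + 1e-12
    return {
        "P(X)":   avg_loglik(xr, mx, vx) - avg_loglik(xn, mx, vx),
        "P(Y|X)": avg_loglik(rr, mr, vr) - avg_loglik(rn, mr, vr),
    }
```

If the new window shifts only the $Y$ mechanism (say, an added offset in $Y$ given $X$), the drop for `P(Y|X)` dominates while the drop for `P(X)` stays near zero, correctly localizing the change.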
Throughout this research, the proposed methodologies are evaluated on both synthetic and real-world datasets to assess their implementation, performance, and robustness. By integrating these advanced techniques into a cohesive analytical framework, this research offers significant contributions to the field of time series analysis.
Type
dissertation