Interpretable Time Series Similarity with Hidden Markov Models

How to find similarities intuitively and accurately, with an added example using the Dow 30 stocks (code included)

Haniel Campos
Towards Data Science


The need to measure the distance between two time series is a common occurrence for many professionals working in data science and machine learning. Collection of temporally correlated data happens everywhere, from the finance industry to ecology, and since the only constraint to getting another data point is usually time, such data can be rich with information we can exploit for new insights.

An often encountered stepping stone in this process is the ability to measure how similar one time series is to another. After all, there is a massive amount of insight to be gained if we can determine that two stocks or animal populations behave alike in time, the key factor here being time. Most well-known similarity/distance measures, such as the Euclidean, Minkowski, Hamming, and Jaccard distances, compare individual data points and carry no notion of temporal ordering. We, however, are interested in a distance measure between collections of data points that are all correlated with each other in a very interesting (i.e. temporal) way. There are, unsurprisingly, a myriad of methods for doing this, each applicable to a certain subset of problems, but the one I’ll promote here uses Hidden Markov Models (HMMs). This approach focuses on comparing the temporal dynamics of two time series from a global and Bayesian point of view.

A TLDR on approaches to time series distances

Shape-based measures

This type of similarity measure generally seeks to compare time series directly, so that time series with similar shapes are assigned lower distances. The canonical example of a shape-based measure is the Euclidean distance. For two time series X = (X_1, X_2, … , X_n) and Y = (Y_1, Y_2, … , Y_n), the Euclidean distance is given by

d(X, Y) = sqrt( (X_1 − Y_1)² + (X_2 − Y_2)² + … + (X_n − Y_n)² )

Note that the time series must have equal length and identical time indexing. If X and Y have similar values, and by extension similar shapes, then the distance will be small. These measures are great for short time series and are easily interpretable, but they often struggle with noise. For instance, suppose X is a linear trend, X_i = b · t_i at each time point t_i, and Y is identically zero. We could then construct a new time series Z made purely of Gaussian noise, Z_i ~ N(0, σ), with σ chosen so that d(X, Y) = d(Y, Z): the trend and the noise end up equally far from the flat series. The extent to which this is an issue depends on your problem at hand, of course.
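
Here is a quick sketch in R of that noise issue; the trend slope b, the noise level σ and the series length are arbitrary choices of mine:

```r
# The noise-robustness issue for the Euclidean distance:
# X is a linear trend, Y is identically zero, and Z is pure Gaussian noise.
set.seed(42)
n   <- 100
t_i <- seq(0, 1, length.out = n)

euclid <- function(x, y) sqrt(sum((x - y)^2))

b <- 0.5
X <- b * t_i                          # the linear trend
Y <- rep(0, n)                        # the flat series
sigma <- sqrt(mean(X^2))              # chosen so d(Y, Z) matches d(X, Y) in expectation
Z <- rnorm(n, mean = 0, sd = sigma)   # pure Gaussian noise

euclid(X, Y)   # distance from the trend to the flat series
euclid(Y, Z)   # roughly the same, even though Z is nothing but noise
```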

Feature-based measures

This type of distance measure seeks to find a mapping from the time series space to a feature space equipped with a simple distance measure (or metric!). A great example is the method by Antoniadis et al. [1], where they use the Discrete Wavelet Transform to decompose a time series into a time-frequency spectrum and use the frequency energies as features.

This is a great approach to a variety of time series problems, especially if there exists an intuitive or well-tested feature representation of the data, such as using wavelets to encode audio signals or using Fourier series to decompose periodic product demands. However, they still mostly suffer from noise sensitivity issues and generally require careful attention to the choice of hyperparameters.
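
As a toy illustration of the feature-based idea, here is a sketch that swaps the wavelet decomposition of Antoniadis et al. for a plain Fourier magnitude spectrum, so it runs with base R alone; the sinusoidal series are made up for the example:

```r
# Map each series to the magnitudes of its first few DFT coefficients and
# measure distance in that feature space (a Fourier stand-in for the wavelet
# energies used by Antoniadis et al.).
fourier_features <- function(x, k = 5) {
  Mod(fft(x))[2:(k + 1)]   # drop the DC term, keep the first k frequency magnitudes
}

feature_distance <- function(x, y, k = 5) {
  sqrt(sum((fourier_features(x, k) - fourier_features(y, k))^2))
}

t_i <- (0:199) / 200                 # 200 evenly spaced points on [0, 1)
x <- sin(2 * pi * 3 * t_i)           # 3 cycles
y <- sin(2 * pi * 3 * t_i + 0.5)     # same frequency, shifted phase
z <- sin(2 * pi * 7 * t_i)           # different frequency

feature_distance(x, y)   # essentially zero: identical spectra despite the phase shift
feature_distance(x, z)   # large: the energy sits at a different frequency
```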

Structure-based measures

This type of measure seeks to identify differences in the underlying structure and dynamics of a time series, and it is where our HMM approach lies. In our particular case, the objective is to find stochastic models that accurately describe the time series and then define the time series distance as the distance between the models. The first time I came across this idea for HMMs specifically was in a paper by Ghassempour et al. [2], but I can’t be certain of its origin.

A very short TLDR on Hidden Markov Models

A Hidden Markov Model, in its most basic form, simply models the data using a collection of probability distributions, where the distribution in effect at each time is chosen by an unobserved (hidden) Markov chain.

As an example, consider the task of modeling a univariate discrete-time series Y of continuous real values, and suppose we’d like to use an HMM to do this. The model consists of two main components: the hidden states and the emission distributions. The hidden states form a Markov chain with N states, where we have to choose N. This means that at each time t the model is in a single state X_t belonging to the state set S = {s_1, s_2, … , s_N}, and the probability of X_{t+1} being in any state depends solely on the current state X_t and no other past information. In succinct terms, we describe this independence as the Markov property

P(X_{t+1} = s_j | X_t, X_{t−1}, … , X_1) = P(X_{t+1} = s_j | X_t)

This allows us to define a transition probability matrix A, where

A_{ij} = P(X_{t+1} = s_j | X_t = s_i)

Here we’ll only consider stationary hidden states, meaning the transition probabilities are constant in time, which is sufficient in most applications given proper preprocessing.

To generate Y_t, the HMM simply samples from a univariate Gaussian N(μ_i, σ_i) whose parameters are determined by the current state X_t = s_i. This structure allows HMMs to decompose the underlying behavior of the time series Y into N possible states, where in each state it follows a different Gaussian distribution.

If Y were, for instance, the growth rates of an animal population, we could set N=2 to try to capture the effects of a mating season, so that X_t = s_1 corresponds to the mating season where Y_t ~ N(μ_1, σ_1) with μ_1 being large, and X_t = s_2 corresponds to the rest of the year with Y_t ~ N(μ_2, σ_2) and μ_2 being very small.

We’ve used a univariate Gaussian as an example, but HMMs can work with any kind of distribution, and with mixtures of such distributions as well. Thus, if you’ve got a multidimensional time series with normal, categorical, and log-normal variables, an HMM will still work just as well. The parameter estimation process is rather lengthy, so I’ve chosen to omit it since it’s handled by effectively any HMM package out there. One thing to note is that we can easily define the AIC and BIC for these models, which takes some of the bias out of selecting the number of hidden states.
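
To make the mating-season example concrete, here is a minimal sketch in R, assuming the depmixS4 package (the article doesn’t prescribe a particular HMM library) and a simulated growth-rate series of my own invention:

```r
# A minimal sketch, assuming depmixS4: simulate a growth-rate series with a short
# "mating season" regime each year, then fit a 2-state Gaussian HMM.
library(depmixS4)

set.seed(7)
season <- rep(rep(1:2, times = c(3, 9)), 10)   # 3 mating months + 9 regular months, over 10 years
y <- ifelse(season == 1,
            rnorm(length(season), mean = 2.0, sd = 0.5),   # high growth in the mating season
            rnorm(length(season), mean = 0.1, sd = 0.3))   # near-zero growth otherwise
df <- data.frame(y = y)

mod <- depmix(y ~ 1, data = df, nstates = 2, family = gaussian())
fm  <- fit(mod)

summary(fm)        # estimated transition matrix and per-state (mean, sd)
AIC(fm); BIC(fm)   # handy when choosing the number of hidden states
```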

A distance measure for Hidden Markov Models

Knowing behaviors at infinity

“Ok, this HMM thing is interesting, but how does it relate to time series similarity?”

The amazing thing about HMMs is that we get a stochastic process that models the temporal dynamics of the data for all times in the past, present and future. Thus, if we can fit HMMs to any two time series, we can equate the time series similarity to how similar the output of the two HMMs would look were we to sample them an infinite number of times into the future.

The Markov property of the hidden states allows us to do just that. In practice, almost every Markov chain you’ll fit (any finite, irreducible, aperiodic chain) has a stationary distribution π over its states, which gives the long-run probability of finding the process in each state. We can express this as

π_j = lim_{t→∞} P(X_t = s_j)

and the super convenient property of Markov chains is that, as a row vector, π is the solution to the eigenproblem

π A = π,   with π_1 + π_2 + … + π_N = 1,

where A is our transition probability matrix. Thus, we can know how the processes derived from our time series behave at infinity without having to go to infinity itself! I’m glossing over the derivation details of course, but for those I recommend you check out a book on stochastic processes.
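
In R, the stationary distribution can be recovered from a fitted transition matrix with a few lines of base code; the 2-state matrix below is just a made-up example:

```r
# Recover the stationary distribution pi of a transition matrix A by solving the
# left-eigenvector problem pi A = pi with base R's eigen().
stationary_distribution <- function(A) {
  eig <- eigen(t(A))                                        # left eigenvectors of A = right eigenvectors of t(A)
  v   <- Re(eig$vectors[, which.min(abs(eig$values - 1))])  # pick the eigenvector for eigenvalue 1
  v / sum(v)                                                # rescale so the probabilities sum to 1
}

A <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)
stationary_distribution(A)   # approximately (0.667, 0.333)
```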

The distance measure

The stationary distribution is central to the HMM distance measure created by Sahraeian and Yoon[3], which is better understood after seeing its definition.

Let λ and λ* be two HMMs fitted to two time series Y and Y*, with respective transition probability matrices A and A*, stationary distributions π and π*, and emission distributions B = {b_i} and B* = {b*_i*}, where b_i and b*_i* are the distributions at state i of λ and state i* of λ* respectively. We define ES(λ, λ*) as the expected emission similarity over all possible state pairs, given as

ES(λ, λ*) = Σ_i Σ_i* π_i π*_i* S_e(b_i, b*_i*)

where S_e is a positive, monotonically decreasing function of your probability distribution distance measure of choice for b_i and b*_i*, such as the KL divergence, the JS distance, etc. Next, we define the correspondence matrix Q as

Q_{i,i*} = π_i π*_i* S_e(b_i, b*_i*) / ES(λ, λ*)

This matrix quantifies how similar the distributions of any pair of HMM states are, weighted by how likely the two HMMs are to occupy that state pair. Thus, if the two HMMs are very likely to be in states whose emission distributions are more similar than usual, the correspondence matrix Q will be saturated on a few entries and therefore sparse. Conversely, if the state pairs with similar distributions are unlikely, or if the emission distributions simply aren’t similar, Q will not saturate anywhere and will be dense. There are many matrix sparsity measures out there, but the Gini index is a popular and almost universally versatile choice. As such, our time series similarity measure is given by

S(λ, λ*) = Gini(Q)

In short, the intuition is that if two time series can be modeled by Hidden Markov Models that are likely to be in states with similar distributions, the time series will have a higher similarity score. Else, if the HMMs have different dynamics or distributions, the score will be lower.
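
Putting the pieces together, here is a sketch of the measure for univariate Gaussian HMMs. The emission similarity S_e here is the reciprocal of the symmetric KL divergence and the sparsity measure is the standard (Hurley-Rickard) Gini index; both are my stand-ins, and the exact normalization in Sahraeian and Yoon’s paper may differ slightly:

```r
# Sketch of the HMM similarity: each HMM is a plain list with a transition matrix A,
# state means mu and state standard deviations sigma (my own convention, not a package's).
stationary_distribution <- function(A) {
  eig <- eigen(t(A))                                        # pi A = pi  <=>  t(A) pi = pi
  v   <- Re(eig$vectors[, which.min(abs(eig$values - 1))])
  v / sum(v)
}

kl_gauss <- function(mu1, s1, mu2, s2) {
  log(s2 / s1) + (s1^2 + (mu1 - mu2)^2) / (2 * s2^2) - 0.5
}
sym_kl <- function(mu1, s1, mu2, s2) kl_gauss(mu1, s1, mu2, s2) + kl_gauss(mu2, s2, mu1, s1)

gini_sparsity <- function(q) {
  c_sorted <- sort(as.vector(q))   # ascending
  N <- length(c_sorted)
  1 - 2 * sum((c_sorted / sum(c_sorted)) * ((N - seq_len(N) + 0.5) / N))
}

hmm_similarity <- function(hmm1, hmm2, eps = 1e-6) {
  pi1 <- stationary_distribution(hmm1$A)
  pi2 <- stationary_distribution(hmm2$A)
  S_e <- outer(seq_along(pi1), seq_along(pi2),
               Vectorize(function(i, j)
                 1 / (sym_kl(hmm1$mu[i], hmm1$sigma[i], hmm2$mu[j], hmm2$sigma[j]) + eps)))
  Q <- (pi1 %o% pi2) * S_e   # weight each state pair by how often the chains visit it
  Q <- Q / sum(Q)            # divide by the expected similarity ES so the entries sum to 1
  gini_sparsity(Q)           # sparse Q -> matching dynamics -> high similarity
}

# Two toy 2-state HMMs with matching dynamics and emissions: Q is dominated by the
# two "matched" state pairs, so the similarity comes out high.
hmm_a <- list(A = matrix(c(0.9, 0.1, 0.2, 0.8), 2, byrow = TRUE),
              mu = c(0, 2), sigma = c(0.5, 0.5))
hmm_b <- list(A = matrix(c(0.85, 0.15, 0.25, 0.75), 2, byrow = TRUE),
              mu = c(0.1, 1.9), sigma = c(0.5, 0.6))
hmm_similarity(hmm_a, hmm_b)
```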

Some useful properties/things you might want to consider

  1. The correspondence matrix doesn’t need to be square, so we can still compute HMM similarities even if they have different numbers of hidden states, i.e. different complexities!
  2. We can also compare time series sampled at different times, so long as they have the same sampling frequency. The stochastic structure of the underlying process is invariant in time after all.
  3. If you choose to use the normalized Gini index as a sparsity measure, the measure is also bounded between 0 and 1. Similarity of 0 means the time series are nothing alike (more so than noise), while similarity of 1 means they’re deterministically alike.
  4. It’s robust to noise.
  5. The HMM parameters are fitted with Bayesian methods, so very short time series will result in noise-like models, which reflects how we can’t distinguish them from noise for certain. The desirability of this HEAVILY depends on your problem at hand.

A practical example using the Dow 30

To exemplify how this HMM time series similarity can provide valuable insight, we’ll use it to build a network of the Dow 30 stocks and identify stocks with similar daily log-return dynamics over the year 2020. All data was retrieved from Yahoo Finance using the tidyquant R package. The purpose of using log-returns is the same as that of differencing in ARIMA: we must de-trend the time series and put them on a common scale so that we may appropriately model and compare them with HMMs. The process for creating the network can be summarized as follows.

  1. For each log-return series in the Dow 30 listings, we fit univariate Gaussian HMMs with an increasing number of hidden states until the AIC starts to increase, at which point we stop.
  2. We compute the similarity measures between all models and store the results in an adjacency matrix. For the Gaussian emission distribution similarity S_e, we use the reciprocal Fisher-Rao distance with an added epsilon to avoid division by 0, and for the sparsity measure, we use the normalized Gini Index as defined by Sahraeian and Yoon in their original paper.
  3. With the adjacency matrix built, we import it into Gephi to perform modularity clustering and output the network visualizations.

The result is the network shown below.

Each node label is colored according to its modularity cluster (i.e. node groups that maximize the total edge weight within the group and minimize the total weight to other groups); the label size is determined by the node’s weighted degree; and the edges are drawn darker and thicker for stronger similarity weights. This network is already useful, especially if you’d like to use it as a basis for a spatial model, but it’s cluttered, and we’re more interested in stocks that really behave alike, as opposed to somewhat alike.

Thus, we filter the network edges, keeping only those with weights above the 75th percentile, and redo the clustering. What we get is the much more interpretable network below.
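
In code, this thresholding step might look like the following; the adjacency matrix `adj` is assumed to hold the pairwise similarities, as in the sketch at the end of the post:

```r
# Keep only the edges whose weights lie above the 75th percentile of the
# off-diagonal similarities (a small random stand-in matrix is built here so
# the snippet runs on its own).
set.seed(1)
adj <- matrix(runif(25), 5, 5)
adj <- (adj + t(adj)) / 2
diag(adj) <- 0

w_threshold  <- quantile(adj[upper.tri(adj)], probs = 0.75)
adj_filtered <- ifelse(adj >= w_threshold, adj, 0)   # weaker edges are dropped entirely
```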

From here you could dive into hundreds of different analyses, but none of it would be possible had we not used the amazing tool of HMM similarity. It leverages the strong theoretical foundation of HMMs and couples it with the intuitive idea of comparing time series by comparing their stochastic models, resulting in an easily interpretable similarity measure. I hope you’ve found the information here useful, and maybe you’ll get to use this measure in one of your future projects. It is not a one-size-fits-all solution, but a professional always knows the right tool for the right job.

References

[1] Antoniadis, A., Brossat, X., Cugliari, J., & Poggi, J. M. (2013). Clustering functional data using wavelets. International Journal of Wavelets, Multiresolution and Information Processing, 11(01), 1350003.

[2] Ghassempour, S., Girosi, F., & Maeder, A. (2014). Clustering multivariate time series using hidden Markov models. International Journal of Environmental Research and Public Health, 11(3), 2741–2763.

[3] Sahraeian, S. M. E., & Yoon, B. J. (2010). A novel low-complexity HMM similarity measure. IEEE Signal Processing Letters, 18(2), 87–90.

R code used to fetch data, find similarities and build the network matrix
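
A minimal sketch of this pipeline, assuming the depmixS4 package for the HMMs (the article doesn’t prescribe one), reusing hmm_similarity() and its helpers from the earlier sketch (so the emission similarity is a reciprocal symmetric KL rather than the reciprocal Fisher-Rao distance described above), and using only a handful of tickers for brevity:

```r
# A sketch, not the article's original code: fetch prices with tidyquant, compute
# daily log-returns, fit Gaussian HMMs with AIC-based state selection, and fill a
# similarity (adjacency) matrix using hmm_similarity() defined earlier in the post.
library(tidyquant)
library(depmixS4)

tickers <- c("AAPL", "MSFT", "JPM", "KO", "CVX")   # a small illustrative subset of the Dow 30

prices <- tq_get(tickers, from = "2020-01-01", to = "2020-12-31")

# Daily log-returns per ticker (tq_get returns rows ordered by symbol and date).
log_returns <- lapply(split(prices$adjusted, prices$symbol), function(p) diff(log(p)))

# Step 1: increase the number of hidden states until the AIC stops improving.
fit_hmm <- function(r, max_states = 6) {
  df <- data.frame(r = r)
  best <- NULL
  best_aic <- Inf
  for (N in 2:max_states) {
    mod <- depmix(r ~ 1, data = df, nstates = N, family = gaussian())
    fm  <- tryCatch(fit(mod), error = function(e) NULL)
    if (is.null(fm) || AIC(fm) >= best_aic) break
    best     <- fm
    best_aic <- AIC(fm)
  }
  best
}
hmms <- lapply(log_returns, fit_hmm)

# Pull (A, mu, sigma) out of a fitted depmixS4 model. NOTE: this relies on my reading
# of the getpars() ordering (initial probs, transition probs row by row, then per-state
# intercept and sd); sanity-check it against summary(fm) before trusting the output.
depmix_params <- function(fm) {
  N    <- fm@nstates
  pars <- getpars(fm)
  A    <- matrix(pars[(N + 1):(N + N^2)], nrow = N, byrow = TRUE)
  resp <- matrix(pars[(N + N^2 + 1):length(pars)], nrow = N, byrow = TRUE)
  list(A = A, mu = resp[, 1], sigma = resp[, 2])
}

# Step 2: pairwise similarities into a symmetric adjacency matrix.
n   <- length(hmms)
adj <- matrix(0, n, n, dimnames = list(names(hmms), names(hmms)))
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    s <- hmm_similarity(depmix_params(hmms[[i]]), depmix_params(hmms[[j]]))
    adj[i, j] <- adj[j, i] <- s
  }
}
write.csv(adj, "dow_hmm_similarity.csv")   # import into Gephi for modularity clustering
```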


I’m a NYC Financial Engineer and Mathematician writing about whatever makes my mental gears turn. Follow my Twitter @thatguyhaniel for thoughts of the moment!