Complexity Measures and Features for Times Series classification

Additional material for "Complexity Measures and Features for Times Series classification" (CMFTS)

This is supplementary material for the article "Complexity Measures and Features for Times Series classification", including the datasets used, theoretical information on the measures, the experimental results, and the source code.

Introduction

Complexity Measures and Features for Times Series classification

Abstract:
Keywords: Time series features, Complexity measures, Time series interpretation, Classification

Theoretical definition of the measures used

This section includes the theoretical definitions of the measures proposed in this paper.

Lempel-Ziv Complexity

The Kolmogorov complexity K(x) of an object x is the length, in bits, of the smallest program that, when run on a Universal Turing Machine (U), outputs x and then halts [12]. This measure was independently developed by Andrey N. Kolmogorov in the late 1960s. A good introduction to Kolmogorov complexity (hereafter KC) can be found in [5], and a comprehensive description in [11].

On the basis of Kolmogorov's idea, Lempel and Ziv [10] developed an algorithm (LZA), which is often used in assessing the randomness of finite sequences as a measure of their disorder.

The Kolmogorov complexity of a time series $\{x_i\}$, $i=1,2,3,\dots,N$, by this algorithm (LZA) can be summarized as follows:

[Step A]: Encode the time series by constructing a sequence $S$ consisting of the characters 0 and 1, written as $\{s(i)\}$, $i=1,2,3,\dots,N$, according to the rule
$$s(i)=\begin{cases} 0 & x_{i} < x_{*} \\ 1 & x_{i} \geq x_{*} \end{cases}$$

Here $x_*$ is a threshold that should be properly chosen. The mean value of the time series has often been used as the threshold [21]. Depending on the application, other encoding schemes are also available [14].

[Step B]: Calculate the complexity counter $c(N)$, which is defined as the minimum number of distinct patterns contained in a given character sequence [7]; $c(N)$ is a function of the length $N$ of the sequence. The value of $c(N)$ approaches an ultimate value $b(N)$ as $N$ approaches infinity, i.e.

$$c(N)=O(b(N)), \qquad b(N)=\frac{N}{\log_2 N}$$

[Step C]: Calculate the normalized complexity measure $C_k(N)$, which is defined as

$$C_k(N)= \frac{c(N)}{b(N)} = c(N)\,\frac{\log_2 N}{N}$$

The quantity $C_k(N)$ represents the information content of a time series: if $N$ is large enough, it approaches 0 for a periodic or regular time series and 1 for a random time series. For a non-linear time series, $C_k(N)$ lies between 0 and 1. Hu et al. [9] derived analytic expressions for $C_k$ for regular and random sequences. In addition, they showed that the shorter the time series, the larger the $C_k$ value; correspondingly, the complexity of a short random sequence can be considerably larger than 1.
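
As an illustration of Steps A–C, the following minimal Python sketch computes $C_k(N)$ using the mean of the series as the threshold $x_*$ (as suggested in [21]). It is an illustrative implementation, not the CMFTS code.

```python
import numpy as np

def lempel_ziv_complexity(x):
    """Normalized Lempel-Ziv complexity C_k(N) of a numeric time series."""
    x = np.asarray(x, dtype=float)
    s = ''.join('1' if v >= x.mean() else '0' for v in x)  # Step A: binarize around the mean (x_*)
    n = len(s)
    c, i, k = 0, 0, 1
    # Step B: LZ76 parsing -- grow the candidate word until it has not been seen before
    while i + k <= n:
        if s[i:i + k] in s[:i + k - 1]:
            k += 1              # word already seen as a substring, extend it
        else:
            c += 1              # new distinct pattern found
            i += k
            k = 1
    if i < n:                   # count the unfinished trailing word
        c += 1
    return c * np.log2(n) / n   # Step C: C_k(N) = c(N) * log2(N) / N

# A periodic series yields a small value; white noise yields a value close to 1.
rng = np.random.default_rng(0)
print(lempel_ziv_complexity(np.sin(np.linspace(0, 20 * np.pi, 1000))))
print(lempel_ziv_complexity(rng.normal(size=1000)))
```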

Approximate Entropy

Approximate entropy (ApEn) was introduced to quantify the concept of changing complexity [13]; it is a technique used to quantify the amount of regularity and the unpredictability of fluctuations in time-series data. ApEn can be computed for any time series, chaotic or otherwise. The measure is computed as follows:

Given a time series u(1), u(2), …, u(N) of measurements equally spaced in time:

Fix $m$, a positive integer, and $r$, a positive real number. The value of $m$ represents the length of the compared runs of data, and $r$ specifies a filtering level.

Form a sequence of vectors x(1), x(2), …, x(N − m + 1) in $\mathbb{R}^m$, defined by x(i) = [u(i), u(i + 1), …, u(i + m − 1)].

Define, for each $i$, $1 \le i \le N - m + 1$,
$$C_i^m(r)= \frac{\text{number of } j \text{ such that } d[x(i), x(j)] \le r}{N - m + 1}$$

The distance $d[x(i), x(j)]$ between the vectors $x(i)$ and $x(j)$ must be defined; following Takens [18], it is taken as

$$d[x(i), x(j)] = \max_{k=1,2,\dots,m} |u(i + k - 1) - u(j + k - 1)|$$

Define

$$\Phi^m(r)= (N-m+1)^{-1} \sum_{i=1}^{N-m+1} \log C^m_i(r)$$

Finally

$$ApEn(m,r,N)= \Phi^m(r) - \Phi^{m+1}(r)$$

where $\log$ denotes the natural logarithm, for fixed $m$ and $r$.
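
The construction above translates almost directly into code. The following Python sketch is illustrative (not the CMFTS implementation); interpreting $r$ as a fraction of the standard deviation of the series is a common convention assumed here.

```python
import numpy as np

def approximate_entropy(u, m=2, r=0.2):
    """ApEn(m, r, N) following the steps above; r is taken here as a
    fraction of the standard deviation of u (a common convention)."""
    u = np.asarray(u, dtype=float)
    N = len(u)
    tol = r * u.std()

    def phi(mm):
        # Vectors x(i) = [u(i), ..., u(i + mm - 1)], i = 1, ..., N - mm + 1
        x = np.array([u[i:i + mm] for i in range(N - mm + 1)])
        # Chebyshev distance d[x(i), x(j)] between every pair of vectors
        d = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=2)
        C = np.mean(d <= tol, axis=1)   # C_i^mm(r); self-matches are included
        return np.mean(np.log(C))       # Phi^mm(r)

    return phi(m) - phi(m + 1)

# Regular signals give lower ApEn values than noisy ones.
rng = np.random.default_rng(1)
print(approximate_entropy(np.sin(np.linspace(0, 8 * np.pi, 300))))
print(approximate_entropy(rng.normal(size=300)))
```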

ApEn is most useful when the means and standard deviations of evolving systems show little change with system evolution. Different values of $r$ must be chosen for different systems, to give the ApEn statistics a good likelihood of distinguishing versions of each system from one another.

Sample Entropy

Sample entropy (SampEn) statistics have reduced bias [15]: Richman and Moorman developed them to be free of the bias caused by self-matching. The name refers to the applicability to time series data sampled from a continuous process. In addition, the algorithm suggests ways to employ sample statistics to evaluate the results, as explained below.

There are two major differences between SampEn and ApEn statistics. First, SampEn does not count self-matches. They justified discounting self-matches on the grounds that entropy is conceived as a measure of the rate of information production, and in this context comparing data with themselves is meaningless. Furthermore, self-matches are explicitly dismissed in the later work of Grassberger and co-workers. Second, SampEn does not use a template-wise approach when estimating conditional probabilities. To be defined, SampEn requires only that one template find a match of length $m+1$.

They defined $B_i^m(r)$ as $(N - m - 1)^{-1}$ times the number of vectors ${\bf x}_m(j)$ within $r$ of ${\bf x}_m(i)$, where $j$ ranges from 1 to $N - m$ and $j \neq i$ to exclude self-matches. They then defined $B^m(r) = (N - m)^{-1} \sum^{N-m}_{i=1} B^m_i(r)$. Similarly, they defined $A_i^m(r)$ as $(N - m - 1)^{-1}$ times the number of vectors ${\bf x}_{m+1}(j)$ within $r$ of ${\bf x}_{m+1}(i)$, where $j$ ranges from 1 to $N - m$ ($j \neq i$), and set $A^m(r) = (N - m)^{-1} \sum^{N-m}_{i=1} A^m_i(r)$. $B^m(r)$ is then the probability that two sequences will match for $m$ points, whereas $A^m(r)$ is the probability that two sequences will match for $m + 1$ points. They then defined the parameter $SampEn(m, r)=\lim_{N \rightarrow \infty} \{-\ln [A^m(r)/B^m(r)]\}$, which is estimated by the statistic $SampEn(m, r, N)= -\ln [A^m(r)/B^m(r)]$. Where there is no confusion about the parameter $r$ and the length $m$ of the template vector, one sets $B = \{[(N - m - 1)(N - m)]/2\}\, B^m(r)$ and $A = \{[(N - m - 1)(N - m)]/2\}\, A^m(r)$, so that $B$ is the total number of template matches of length $m$ and $A$ is the total number of forward matches of length $m + 1$. Note that $A/B = A^m(r)/B^m(r)$, so $SampEn(m, r, N)$ can be expressed as $-\ln(A/B)$.
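
For comparison with ApEn, the following Python sketch computes $SampEn(m, r, N) = -\ln(A/B)$ with self-matches excluded; as before, interpreting $r$ as a fraction of the standard deviation is an assumption of this sketch, not a prescription from the paper.

```python
import numpy as np

def sample_entropy(u, m=2, r=0.2):
    """SampEn(m, r, N) = -ln(A / B), with self-matches excluded;
    r is again interpreted as a fraction of the standard deviation (assumption)."""
    u = np.asarray(u, dtype=float)
    N = len(u)
    tol = r * u.std()

    def total_matches(length):
        # The same N - m starting points are used for templates of length m and m + 1
        x = np.array([u[i:i + length] for i in range(N - m)])
        d = np.max(np.abs(x[:, None, :] - x[None, :, :]), axis=2)
        hits = d <= tol
        np.fill_diagonal(hits, False)   # j != i: self-matches are not counted
        return hits.sum() / 2           # each unordered pair of templates counted once

    B = total_matches(m)                # total template matches of length m
    A = total_matches(m + 1)            # total forward matches of length m + 1
    return -np.log(A / B)               # undefined when A or B is zero

# SampEn of white noise is higher than that of a regular oscillation.
rng = np.random.default_rng(2)
print(sample_entropy(np.sin(np.linspace(0, 8 * np.pi, 300))))
print(sample_entropy(rng.normal(size=300)))
```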

Permutation Entropy

Permutation entropy is a complexity measure that is easily calculated for any type of time series, be it regular, chaotic, noisy, or empirical [2].

Consider a time series $\{x_t\}_{t=1,\dots,T}$. They study all $n!$ permutations $\pi$ of order $n$, which are considered here as possible order types of $n$ different numbers. For each $\pi$ they determine the relative frequency

$$p(\pi)=\frac{\#\{t \mid 0\leq t\leq T - n,\ (x_{t+1},\dots,x_{t+n}) \text{ has type } \pi\}}{T-n+1}$$

This estimates the frequency of $\pi$ as well as possible for a finite series of values. To determine $p(\pi)$ exactly, one has to assume an infinite time series $x_1, x_2, \dots$ and take the limit for $T \rightarrow \infty$ in the above formula. This limit exists with probability 1 when the underlying stochastic process fulfils a very weak stationarity condition: for $k \leq n$, the probability of $x_t < x_{t+k}$ should not depend on $t$. The permutation entropy of order $n \geq 2$ is defined as

$$H(n) = -\sum p(\pi)\log p(\pi)$$

where the sum runs over all $n!$ permutations $\pi$ of order $n$. This is the information contained in comparing $n$ consecutive values of the time series. Clearly, $0 \leq H(n) \leq \log n!$, where the lower bound is attained for an increasing or decreasing sequence of values, and the upper bound for a completely random system (an i.i.d. sequence), where all $n!$ possible permutations appear with the same probability. The time series presents some sort of dynamics when $H(n) < \log n!$. In their experiments with chaotic time series, $H(n)$ increased at most linearly with $n$, so it is useful to define the permutation entropy per symbol of order $n$, dividing by $n-1$ since comparisons start with the second value: $h_{n} = \frac{H(n)}{n-1}$
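
A compact Python sketch of $H(n)$ and the per-symbol variant $h_n$ follows; breaking ties by order of appearance is an implementation choice of this sketch, not part of the definition above.

```python
import numpy as np
from collections import Counter
from math import log, factorial

def permutation_entropy(x, n=3, per_symbol=False):
    """Permutation entropy H(n); set per_symbol=True for h_n = H(n) / (n - 1)."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    # Ordinal pattern of each of the T - n + 1 overlapping windows
    patterns = Counter(
        tuple(np.argsort(x[t:t + n], kind="stable")) for t in range(T - n + 1)
    )
    total = sum(patterns.values())
    H = -sum((c / total) * log(c / total) for c in patterns.values())
    return H / (n - 1) if per_symbol else H

# An i.i.d. sequence approaches the upper bound log(n!); a monotone series gives 0.
rng = np.random.default_rng(3)
print(permutation_entropy(rng.normal(size=5000), n=3), log(factorial(3)))
print(permutation_entropy(np.arange(100.0), n=3))
```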

Shannon Entropy: Chao-Shen Entropy Estimator

The Chao-Shen entropy estimator (2003) is a Horvitz-Thompson (1952) estimator applied to the problem of entropy estimation, with the additional coverage correction proposed by Good (1953). Note that the Chao-Shen estimator is not a plug-in estimator; hence there are no explicit underlying bin frequencies.

Assume that there are $S$ species in a community, labeled from 1 to $S$. Denote the probabilities of species discovery (or relative abundances) by $\pi_1, \pi_2,\dots, \pi_S$, where $\sum_{i=1}^S \pi_i=1$.

The coefficient of variation (CV) is defined as $CV=\left[\sum_{i=1}^S(\pi_i- \bar{\pi})^2/S \right]^{1/2}/\bar{\pi}$, where $\pi_1, \pi_2,\dots, \pi_S$ are the species abundances and $\bar{\pi}=\sum_{i=1}^S \pi_i /S$. The value of the CV characterizes the degree of heterogeneity among the species abundances [3]. The CV is zero if and only if all species have equal abundance; the larger the CV, the greater the degree of heterogeneity.
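
The following Python sketch shows how the Chao-Shen estimator can be applied to a time series: the series is first discretized into equal-width bins (an assumption of this sketch; CMFTS may discretize differently), and Good's coverage correction plus the Horvitz-Thompson adjustment are then applied to the observed bin counts.

```python
import numpy as np

def chao_shen_entropy(x, bins=10):
    """Chao-Shen entropy estimate of a series discretized into `bins` equal-width bins."""
    x = np.asarray(x, dtype=float)
    counts = np.histogram(x, bins=bins)[0]
    counts = counts[counts > 0]            # keep only observed "species"
    n = counts.sum()
    p_ml = counts / n                      # maximum-likelihood (plug-in) frequencies
    f1 = np.sum(counts == 1)               # number of singletons
    if f1 == n:                            # avoid an estimated coverage of zero
        f1 = n - 1
    C = 1.0 - f1 / n                       # Good's estimate of sample coverage
    p_cs = C * p_ml                        # coverage-adjusted frequencies
    # Horvitz-Thompson correction for the species missed by the sample
    return -np.sum(p_cs * np.log(p_cs) / (1.0 - (1.0 - p_cs) ** n))

rng = np.random.default_rng(4)
print(chao_shen_entropy(rng.normal(size=500)))
```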

Shannon Entropy: Schürmann-Grassberger Entropy Estimator

We consider one-sided infinite sequences $s_1, s_2,\dots$ where $s_i \in \{0,1,\dots,d-1\}$. In most examples we deal with $d = 2$, but everything also holds for $d > 2$ with minor modifications [16]. We assume that these are realizations of a stochastic process ${\bf s}_1, {\bf s}_2,\dots$ with probabilities $p(s_1,\dots,s_n)=\mathrm{prob}({\bf s}_{t+1}=s_1,\dots,{\bf s}_{t+n}=s_n)$.

The block entropies $H_n = -\sum_{s_1,\dots,s_n} p(s_1,\dots,s_n)\log p(s_1,\dots,s_n)$ measure the average amount of information contained in a word of length $n$. The differential entropies, $h_n= H_n-H_{n-1}= - \sum_{s_1,\dots,s_n} p(s_1,\dots,s_n)\log p(s_n \mid s_1,\dots,s_{n-1})$, give the new information of the $n$-th symbol when the preceding $n-1$ symbols are known. Here, $p(s_n \mid s_1,\dots,s_{n-1})$ is the conditional probability for ${\bf s}_n$ being $s_n$, conditioned on the previous symbols $s_1,\dots,s_{n-1}$. The Shannon entropy is $h=\lim_{n\rightarrow \infty} h_n$. It measures the average amount of information per symbol when all correlations and constraints are taken into account. This limit is approached monotonically from above, i.e., all $h_n$ are upper bounds on $h$.

For the numerical estimation of $h$ from a finite sequence of length $N$, one usually estimates all word probabilities $p(s_1,\dots,s_n)$ up to some fixed $n$ by the standard likelihood estimate $\hat{p}(s_1,\dots,s_n)= \frac{n_{s_1,\dots,s_n}}{N}$, where $n_{s_1,\dots,s_n}$ is the number of occurrences of the word $s_1,\dots,s_n$ (strictly, the denominator should be $N-n+1$, but this difference is in general negligible). From these one computes estimates $\hat{H}_n$ by inserting them into the definition of $H_n$. Finally, an estimator $\hat{h}$ is obtained either by computing $\hat{h}_n$ and extrapolating, or simply as $\hat{h}=\lim_{n \rightarrow \infty} \hat{H}_n /n$. The latter is less influenced by statistical errors but shows slower convergence.
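
The following Python sketch implements only the naive likelihood (plug-in) estimates $\hat{H}_n$ and $\hat{h}_n = \hat{H}_n - \hat{H}_{n-1}$ described above, for a series binarized around its mean ($d = 2$, an assumption of this sketch); it is not the bias-corrected estimator developed in [16].

```python
import numpy as np
from collections import Counter
from math import log

def block_entropy(symbols, n):
    """Naive likelihood (plug-in) estimate of the block entropy H_n."""
    words = Counter(tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1))
    total = sum(words.values())
    return -sum((c / total) * log(c / total) for c in words.values())

def differential_entropy_estimate(x, n=3):
    """h_n = H_n - H_{n-1} for the series binarized around its mean (d = 2 symbols)."""
    x = np.asarray(x, dtype=float)
    s = (x >= x.mean()).astype(int)
    return block_entropy(s, n) - block_entropy(s, n - 1)

rng = np.random.default_rng(5)
print(differential_entropy_estimate(rng.normal(size=2000), n=3))
```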

Spectral Entropy

Power spectral entropy (PSE), based on Shannon information theory, is a complexity index for uncertain systems. It captures changes of nonlinear dynamic states well and requires only a small amount of data [20]. PSE is defined by the following steps (a short code sketch is given after the list):

  1. Calculate the spectrum $X(\omega_i)$ of the signal.

  2. Calculate the power spectral density (PSD) of the signal by squaring the amplitude of the spectrum and normalizing by the number of bins: $P(\omega_i)=\frac{1}{N}|X(\omega_i)|^2$

  3. Normalize the calculated PSD so that it can be viewed as a probability density function (its integral equals 1): $p_i=\frac{P(\omega_i)}{\sum_i P(\omega_i)}$

  4. The power spectral entropy can then be calculated using the standard entropy formula: $PSE=-\sum_{i=1}^n p_i \ln p_i$
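
A minimal Python sketch of the four steps, using a plain FFT periodogram (no windowing, an implementation choice of this sketch):

```python
import numpy as np

def spectral_entropy(x):
    """Power spectral entropy (PSE) following steps 1-4 above."""
    x = np.asarray(x, dtype=float)
    X = np.fft.rfft(x)                  # 1. spectrum X(w_i)
    psd = np.abs(X) ** 2 / len(x)       # 2. power spectral density
    p = psd / psd.sum()                 # 3. normalize into a probability distribution
    p = p[p > 0]                        # drop empty bins (0 * log 0 := 0)
    return -np.sum(p * np.log(p))       # 4. Shannon entropy of the normalized PSD

# A pure tone concentrates power in a few bins (low PSE); white noise spreads it (high PSE).
rng = np.random.default_rng(6)
print(spectral_entropy(np.sin(np.linspace(0, 50 * np.pi, 1024))))
print(spectral_entropy(rng.normal(size=1024)))
```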

Forbidden Patterns

Number of forbidden patterns [1]. When analyzing a time series of length $N$ with ordinal patterns of length $d$, we obtain $N-d+1$ overlapping groups of adjacent values, each with a corresponding ordinal pattern [19]. If the series has random behavior, any permutation can appear, and therefore no pattern is forbidden. Moreover, their probability distribution should be flat, since every permutation has the same probability of occurrence when the dataset is long enough to exclude statistical fluctuations. Nevertheless, when the series corresponds to a chaotic variable, there are some patterns that cannot be encountered in the data due to the underlying deterministic structure: these are the so-called forbidden patterns. It has been demonstrated that most chaotic systems exhibit forbidden patterns, and that in many cases (ergodic finite-alphabet information sources) the number of these patterns is related to other classic metric entropy rates (e.g., the Lyapunov exponent).
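
A simple Python sketch that counts how many of the $d!$ possible ordinal patterns never occur in a series; the pattern length $d$ and the tie-breaking rule are choices of this sketch.

```python
import numpy as np
from math import factorial

def forbidden_patterns(x, d=4):
    """Number of ordinal patterns of length d that never occur in the series."""
    x = np.asarray(x, dtype=float)
    observed = {
        tuple(np.argsort(x[i:i + d], kind="stable"))
        for i in range(len(x) - d + 1)          # N - d + 1 overlapping windows
    }
    return factorial(d) - len(observed)

# The fully chaotic logistic map forbids the strictly decreasing pattern of length 3,
# whereas long i.i.d. noise eventually shows all 3! = 6 patterns.
x = np.empty(2000)
x[0] = 0.4
for i in range(1, len(x)):
    x[i] = 4.0 * x[i - 1] * (1.0 - x[i - 1])
print(forbidden_patterns(x, d=3))               # at least 1
```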

Kurtosis

For symmetric unimodal distributions, positive kurtosis indicates heavy tails and peakedness relative to the normal distribution, whereas negative kurtosis indicates light tails and flatness [6].

Skewness

Skewness characterizes the degree of asymmetry of a given distribution around its mean [4]. If the distribution of the data is symmetric, the skewness will be close to 0. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values; negative skewness indicates a distribution with an asymmetric tail extending toward more negative values.
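
Both statistics are standard moments. The following Python sketch uses the plain moment-based estimators and reports kurtosis as excess kurtosis, so that a normal distribution gives a value near 0 (this convention is an assumption of the sketch; CMFTS may use a different estimator).

```python
import numpy as np

def skewness(x):
    """Moment-based sample skewness (third standardized moment)."""
    x = np.asarray(x, dtype=float)
    m = x - x.mean()
    return np.mean(m ** 3) / np.mean(m ** 2) ** 1.5

def excess_kurtosis(x):
    """Moment-based sample kurtosis minus 3, so a normal sample gives roughly 0."""
    x = np.asarray(x, dtype=float)
    m = x - x.mean()
    return np.mean(m ** 4) / np.mean(m ** 2) ** 2 - 3.0

# Symmetric data give skewness near 0; heavy-tailed data give positive excess kurtosis.
rng = np.random.default_rng(7)
print(skewness(rng.normal(size=10000)))
print(excess_kurtosis(rng.standard_t(df=4, size=10000)))
```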

Bibliography

[1] J. Amigó. Permutation complexity in dynamical systems: ordinal patterns, permutation entropy and all that. Springer Science & Business Media, 2010.
[2] C. Bandt and B. Pompe. Permutation entropy: a natural complexity measure for time series. Physical review letters, 88(17):174102, 2002.
[3] A. Chao and T.-J. Shen. Nonparametric estimation of Shannon's index of diversity when there are unseen species in sample. Environmental and Ecological Statistics, 10(4):429–443, 2003.
[4] P. Čisar and S. M. Čisar. Skewness and kurtosis in function of selection of network traffic distribution. Acta Polytechnica Hungarica, 7(2):95–106, 2010.
[5] T. M. Cover and J. A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.
[6] L. T. DeCarlo. On the meaning and use of kurtosis. Psychological methods, 2(3):292, 1997.
[7] R. Ferenets, T. Lipping, A. Anier, V. Jäntti, S. Melto, and S. Hovilehto. Comparison of entropy and complexity measures for the assessment of depth of sedation. Biomedical Engineering, IEEE Transactions on, 53(6):1067–1077, 2006.
[8] J. Hausser and K. Strimmer. R-package entropy, 2014.
[9] J. Hu, J. Gao, and J. C. Principe. Analysis of biomedical signals by the lempel-ziv complexity: the effect of finite data size. Biomedical Engineering, IEEE Transactions on, 53(12):2606–2609, 2006.
[10] A. Lempel and J. Ziv. On the complexity of finite sequences. Information Theory, IEEE Transactions on, 22(1):75–81, 1976.
[11] M. Li and P. Vitányi. An introduction to Kolmogorov complexity and its applications. Springer, 1997.
[12] D. T. Mihailović, G. Mimić, N. Drešković, and I. Arsenić. Kolmogorov complexity based information measures applied to the analysis of different river flow regimes. Entropy, 17(5):2973–2987, 2015.
[13] S. M. Pincus. Approximate entropy as a measure of system complexity. Proceedings of the National Academy of Sciences, 88(6):2297–2301, 1991.
[14] N. Radhakrishnan, J. D. Wilson, and P. C. Loizou. An alternate partitioning technique to quantify the regularity of complex time series. International Journal of Bifurcation and Chaos, 10(07):1773–1779, 2000.
[15] J. S. Richman and J. R. Moorman. Physiological time-series analysis using approximate entropy and sample entropy. American Journal of Physiology-Heart and Circulatory Physiology, 278(6):H2039–H2049, 2000.
[16] T. Schürmann and P. Grassberger. Entropy estimation of symbol sequences. Chaos: An Interdisciplinary Journal of Nonlinear Science, 6(3):414–427, 1996.
[18] F. Takens. Atas do 13. col brasiliero de matematicas. Rio de Janeiro, Brazil, 1983.
[19] M. Zanin. Forbidden patterns in financial time series. Chaos: An Interdisciplinary Journal of Nonlinear Science, 18(1):013119, 2008.
[20] A. Zhang, B. Yang, and L. Huang. Feature extraction of eeg signals using power spectral entropy. In 2008 International Conference on BioMedical Engineering and Informatics, volume 2, pages 435–439, May 2008.
[21] X.-S. Zhang, R. J. Roy, and E. W. Jensen. Eeg complexity as a measure of depth of anesthesia for patients. Biomedical Engineering, IEEE Transactions on, 48(12):1424–1433, 2001.

Source code

This section includes the software developed in this work.
CMFTS:
https://github.com/fjbaldan/CMFTS
(The source code will be available when the work is published)

Datasets

We have used the complete UCR Time Series Classification Archive, a repository of 128 time series classification datasets:
https://www.cs.ucr.edu/~eamonn/time_series_data_2018

Results

A .csv file containing all the results of the experimentation carried out:
https://github.com/fjbaldan/CMFTS/results/results.csv

Authors

List of authors and affiliations:

Francisco J. Baldán (fjbaldan@decsai.ugr.es), University of Granada.

José M. Benítez (J.M.Benítez@decsai.ugr.es), University of Granada.