The desire to predict the future and understand the past drives the search for laws that explain the behaviour of observed phenomena; examples range from the irregularity in a heartbeat to the volatility of a currency exchange rate. If there are known underlying deterministic equations, in principle they can be solved to forecast the outcome of an experiment based on knowledge of the initial conditions.
To make a forecast if the equations are not known, one must find both the rules governing system evolution and the actual state of the system. In this chapter we will focus on phenomena for which underlying equations are not given; the rules that govern the evolution must be inferred from regularities in the past. For example, the motion of a pendulum or the rhythm of the seasons carry within them the potential for predicting their future behaviour from knowledge of their oscillations without requiring insight into the underlying mechanism. We will use the terms understanding and learning to refer to two complementary approaches taken to analyze an unfamiliar time series. Understanding is based on explicit mathematical insight into how systems behave, and learning is based on algorithms that can emulate the structure in a time series.
In both cases, the goal is to explain observations; we will not consider the important related problem of using knowledge about a system for controlling it in order to produce some desired behaviour. Time series analysis has three goals: forecasting, modelling, and characterization. The aim of forecasting (also called predicting) is to accurately predict the short-term evolution of the system; the goal of modelling is to find a description that accurately captures features of the long-term behaviour of the system.
These are not necessarily identical: finding governing equations with proper long-term properties may not be the most reliable way to determine parameters for good short-term forecasts, and a model that is useful for short-term forecasts may have incorrect long-term properties. The third goal, system characterization, attempts with little or no a priori knowledge to determine fundamental properties, such as the number of degrees of freedom of a system or the amount of randomness. This overlaps with forecasting but can differ: the complexity of a model useful for forecasting may not be related to the actual complexity of the system. Before the 1920s, forecasting was done by simply extrapolating the series through a global fit in the time domain.
The beginning of modern time series prediction might be set at 1927 when Yule invented the autoregressive technique in order to predict the annual number of sunspots. His model predicted the next value as a weighted sum of previous observations of the series. In order to obtain interesting behaviour from such a linear system, outside intervention in the form of external shocks must be assumed. For the half-century following Yule, the reigning paradigm remained that of linear models driven by noise.
However, there are simple cases for which this paradigm is inadequate. For example, a simple iterated map, such as the logistic equation (Eq. (11) in Section 3.2), can generate a broadband power spectrum that cannot be obtained by a linear approximation. The realization that apparently complicated time series can be generated by very simple equations pointed to the need for a more general theoretical framework for time series analysis and prediction. Two crucial developments occurred around 1980; both were enabled by the general availability of powerful computers that permitted much longer time series to be recorded, more complex algorithms to be applied to them, and the data and the results of these algorithms to be interactively visualized.
The first development, state-space reconstruction by time-delay embedding, drew on ideas from differential topology and dynamical systems to provide a technique for recognizing when a time series has been generated by deterministic governing equations and, if so, for understanding the geometrical structure underlying the observed behaviour. The second development was the emergence of the field of machine learning, typified by neural networks, that can adaptively explore a large space of potential models. With the shift in artificial intelligence from rule-based methods towards data-driven methods,[1] the field was ready to apply itself to time series, and time series, now recorded with orders of magnitude more data points than were available previously, were ready to be analyzed with machine-learning techniques requiring relatively large data sets.
The realization of the promise of these two approaches has been hampered by the lack of a general framework for the evaluation of progress. Because time series problems arise in so many disciplines, and because it is much easier to describe an algorithm than to evaluate its accuracy and its relationship to mature techniques, the literature in these areas has become fragmented and somewhat anecdotal. The breadth (and the range in reliability) of relevant material makes it difficult for new research to build on the accumulated insight of past experience (researchers standing on each other's toes rather than on each other's shoulders).
Global computer networks now offer a mechanism for the disjoint communities to attack common problems through the widespread exchange of data and information. In order to foster this process and to help clarify the current state of time series analysis, we organized the Santa Fe Time Series Prediction and Analysis Competition under the auspices of the Santa Fe Institute during the fall of 1991. To explore the results of the competition, a NATO Advanced Research Workshop was held in the spring of 1992; workshop participants included members of the competition advisory board, representatives of the groups that had collected the data, participants in the competition, and interested observers.
Although the participants came from a broad range of disciplines, the discussions were framed by the analysis of common data sets and it was (usually) possible to find a meaningful common ground. In this overview chapter we describe the structure and the results of this competition and review the theoretical material required to understand the successful entries; much more detail is available in the articles by the participants in this volume.
The Competition
The planning for the competition emerged from informal discussions at the Complex Systems Summer School at the Santa Fe Institute in the summer of 1990; the first step was to assemble an advisory board to represent the interests of many of the relevant fields. The second step, taken with the help of this group, was to gather roughly 200 megabytes of experimental time series for possible use in the competition. This volume of data reflects the growth of techniques that use enormous data sets (where automatic collection and processing is essential) over traditional time series (such as quarterly economic indicators, where it is possible to develop an intimate relationship with each data point).
In order to be widely accessible, the data needed to be distributed by ftp over the Internet, by electronic mail, and by floppy disks for people without network access. The latter distribution channels limited the size of the competition data to a few megabytes; the final data sets were chosen to span as many of a desired group of attributes as possible given this size limitation (the attributes are shown in Figure 2). The final selection was:
-
A clean physics laboratory experiment. 1,000 points of the fluctuations in a far-infrared laser, approximately described by three coupled nonlinear ordinary differential equations.
-
Physiological data from a patient with sleep apnea. 34,000 points of the heart rate, chest volume, blood oxygen concentration, and EEG state of a sleeping patient. These observables interact, but the underlying regulatory mechanism is not well understood.
-
High-frequency currency exchange rate data. Ten segments of 3,000 points each of the exchange rate between the Swiss franc and the U.S. dollar. The average time between two quotes is between one and two minutes.
-
A numerically generated series designed for this competition. A driven particle in a four-dimensional nonlinear multiple-well potential (nine degrees of freedom) with a small non-stationary drift in the well depths. (Details are given in the Appendix.)
-
Astrophysical data from a variable star. 27,704 points in 17 segments of the time variation of the intensity of a variable white dwarf star, collected by the Whole Earth Telescope (Clemens, this volume). The intensity variation arises from a superposition of relatively independent spherical harmonic multiplets, and there is significant observational noise.
The amount of information available to the entrants about the origin of each data set varied from extensive (Data Sets B and E) to blind (Data Set D). The original files will remain available. The data sets are graphed in Figure 1, and some of the characteristics are summarized in Figure 2. The appropriate level of description for models of these data ranges from low-dimensional stationary dynamics to stochastic processes. After selecting the data sets, we next chose competition tasks appropriate to the data sets and research interests.
The participants were asked to: predict the (withheld) continuations of the data sets with respect to given error measures, characterize the systems (including aspects such as the number of degrees of freedom, predictability, noise characteristics, and the nonlinearity of the system), infer a model of the governing equations, and describe the algorithms employed. The data sets and competition tasks were made publicly available on August 1, 1991, and competition entries were accepted until January 15, 1992.
Participants were required to describe their algorithms. (Insight in some previous competitions was hampered by the acceptance of proprietary techniques.) One interesting trend in the entries was the focus on prediction, for which three motivations were given:
-
because predictions are falsifiable, insight into a model used for prediction is verifiable;
-
there are a variety of financial incentives to study prediction; and
-
the growth of interest in machine learning brings with it the hope that there can be universally and easily applicable algorithms that can be used to generate forecasts.
Another trend was the general failure of simplistic black-box approaches: in all successful entries, exploratory data analysis preceded the application of the algorithm.
It is interesting to compare this time series competition to the previous state of the art as reflected in two earlier competitions. In these, a very large number of time series was provided (111 and 1001, respectively), taken from business (forecasting sales), economics (predicting recovery from the recession), finance, and the social sciences. However, all of the series used were very short, generally less than 100 values long. Most of the algorithms entered were fully automated, and most of the discussion centered around linear models.[4] In the Santa Fe Competition all of the successful entries were fundamentally nonlinear and, even though significantly more computer power was used to analyze the larger data sets with more complex models, the application of the algorithms required more careful manual control than in the past.
Figure 3 Predictions of Data Set A submitted by Sauer and by Wan. Predicted values are indicated by markers, predicted error bars by vertical lines. The true continuation (not available at the time when the predictions were received) is shown in grey (the points are connected to guide the eye).
Figure 4 Predictions obtained by the same two models as in the previous figure, but continued 500 points further into the future. The solid line connects the predicted points; the grey line indicates the true continuation.
As an example of the results, consider the intensity of the laser (Data Set A; see Figure 1). On the one hand, the laser can be described by a relatively simple correct model of three nonlinear differential equations, the same equations that Lorenz (1963) used to approximate weather phenomena. On the other hand, since the 1,000-point training set showed only three or four collapses, it is difficult to predict the next collapse based on so few instances. For this data set we asked for predictions of the next 100 points as well as estimates of the error bars associated with these predictions. We used two measures to evaluate the submissions.
The first measure (normalized mean squared error) was based on the predicted values only; the second measure used the submitted error predictions to compute the likelihood of the observed data given the predictions. The Appendix to this chapter gives the definitions and explanations of the error measures as well as a table of all entries received. We would like to point out a few interesting features. Although this single trial does not permit fine distinctions to be made between techniques with comparable performance, two techniques clearly did much better than the others for Data Set A; one used state-space reconstruction to build an explicit model for the dynamics and the other used a connectionist network (also called a neural network).
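As an aside, the first of these measures is straightforward to compute. The following is a minimal sketch in Python, assuming the common definition of normalized mean squared error (the mean squared prediction error divided by the variance of the observed continuation, so that a value of 1 corresponds to simply predicting the mean); the competition's exact definition is the one given in the Appendix.

```python
import numpy as np

def nmse(observed, predicted):
    """Normalized mean squared error: mean squared prediction error divided by
    the variance of the observed continuation (1.0 = predicting the mean)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean((observed - predicted) ** 2) / np.var(observed)
```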
Incidentally, a prediction based solely on visually examining and extrapolating the training data did much worse than the best techniques, but also much better than the worst. Figure 3 shows the two best predictions. Sauer (this volume) attempts to understand and develop a representation for the geometry in the system's state space, which is the best that can be done without knowing something about the system's governing equations, while Wan (this volume) addresses the issue of function approximation by using a connectionist network to learn to emulate the input-output behaviour. Both methods generated remarkably accurate predictions for the specified task.
In terms of the measures defined for the competition, Wan's squared errors are one-third as large as Sauer's, and, taking the predicted uncertainty into account, Wan's model is four times more likely than Sauer's. According to the competition scores for Data Set A, this puts Wan's network in first place. A different picture, which cautions the hurried researcher against declaring one method to be universally superior to another, emerges when one examines the evolution of these two prediction methods further into the future. Figure 4 shows the same two predictors, but now the continuations extend 500 points beyond the 100 points submitted for the competition entry (no error estimates are shown).
The neural network's class of potential behaviour is much broader than what can be generated from a small set of coupled ordinary differential equations, but the state-space model is able to reliably forecast the data much further because its explicit description can correctly capture the character of the long-term dynamics. In order to understand the details of these approaches, we will detour to review the framework for (and then the failure of) linear time series analysis.
Linear Time Series Models
Linear time series models have two particularly desirable features: they can be understood in great detail and they are straightforward to implement. The penalty for this convenience is that they may be entirely inappropriate for even moderately complicated systems. In this section we will review their basic features and then consider why and how such models fail. The literature on linear time series analysis is vast; a good introduction is the very readable book by Chatfield (1989), many derivations can be found (and understood) in the comprehensive text by Priestley (1981), and a classic reference is the book by Box and Jenkins (1976). Historically, the general theory of linear predictors can be traced back to Kolmogorov (1941) and to Wiener (1949). Two crucial assumptions will be made in this section: the system is assumed to be linear and stationary. In the rest of this chapter we will say a great deal about relaxing the assumption of linearity; much less is known about models that have coefficients that vary with time. To be precise, unless explicitly stated otherwise (such as for Data Set D), we assume that the underlying equations do not change in time, i.e., that the system is time invariant.
ARMA, FIR, and All That
There are two complementary tasks that need to be discussed: understanding how a given model behaves and finding a particular model that is appropriate for a given time series. We start with the former task. It is simplest to discuss separately the role of external inputs (moving average models) and internal memory (autoregressive models).
Properties of a Given Linear Model
Moving average (MA) models. Assume we are given an external input series {e_t} and want to modify it to produce another series {x_t}. Assuming linearity of the system and causality (the present value of x is influenced by the present and N past values of the input series e), the relationship between the input and output is

\[ x_t = \sum_{n=0}^{N} b_n e_{t-n} . \tag{1} \]
This equation describes a convolution filter: the new series x is generated by an Nth-order filter with coefficients b_0, ..., b_N from the series e. Statisticians and econometricians call this an Nth-order moving average model, MA(N). The origin of this (sometimes confusing) terminology can be seen if one pictures a simple smoothing filter which averages the last few values of series e. Engineers call this a finite impulse response (FIR) filter, because the output is guaranteed to go to zero at N time steps after the input becomes zero. Properties of the output series x clearly depend on the input series e. The question is whether there are characteristic features independent of a specific input sequence. For a linear system, the response of the filter is independent of the input. A characterization focuses on properties of the system, rather than on properties of the time series. (For example, it does not make sense to attribute linearity to a time series itself, only to a system.)
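As a concrete illustration, here is a minimal sketch of such a filter in Python; the coefficients are arbitrary illustrative values, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
e = rng.standard_normal(1000)       # external input series {e_t}
b = np.array([0.5, 0.3, 0.2])       # illustrative filter coefficients b_0, b_1, b_2

# x_t = sum_n b_n * e_{t-n}: a second-order moving average (FIR) filter
x = np.convolve(e, b, mode="full")[: len(e)]
```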
We will give three equivalent characterizations of an MA model: in the time domain (the impulse response of the filter), in the frequency domain (its spectrum), and in terms of its autocorrelation coefficients. In the first case, we assume that the input is nonzero only at a single time step t_0 and that it vanishes for all other times. The response (in the time domain) to this impulse is simply given by the b's in Eq. (1): at each time step the impulse moves up to the next coefficient until, after N steps, the output disappears. The series b_0, b_1, ..., b_N is thus the impulse response of the system.
The response to an arbitrary input can be computed by superimposing the responses at appropriate delays, weighted by the respective input values (convolution). The transfer function thus completely describes a linear system, i.e., a system where the superposition principle holds: the output is determined by the impulse response and the input. Sometimes it is more convenient to describe the filter in the frequency domain. This is useful (and simple) because a convolution in the time domain becomes a product in the frequency domain. If the input to an MA model is an impulse (which has a flat power spectrum), the discrete Fourier transform of the output is given by

\[ X(f) = \sum_{n=0}^{N} b_n e^{-2\pi i n f} \]
(see, for example, Box & Jenkins, 1976, p. 69). The power spectrum is given by the squared magnitude of this:

\[ |X(f)|^2 = \Bigl| \sum_{n=0}^{N} b_n e^{-2\pi i n f} \Bigr|^2 . \]
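A minimal sketch of this frequency-domain view, using the same illustrative coefficients as above: the transfer function is obtained by a discrete Fourier transform of the coefficients, and the power spectrum is its squared magnitude.

```python
import numpy as np

b = np.array([0.5, 0.3, 0.2])       # illustrative MA coefficients
X = np.fft.rfft(b, n=512)           # transfer function sampled on a frequency grid
power = np.abs(X) ** 2              # power spectrum: squared magnitude
freqs = np.fft.rfftfreq(512)        # frequencies in cycles per time step
```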
The third way of representing the same information is in terms of the autocorrelation coefficients, defined in terms of the mean µ = ⟨x_t⟩ and the variance σ² = ⟨(x_t − µ)²⟩ as

\[ \rho_\tau = \frac{1}{\sigma^2} \bigl\langle (x_t - \mu)(x_{t+\tau} - \mu) \bigr\rangle . \tag{3} \]
The autocorrelation coefficients describe how much, on average, two values of a series that are τ time steps apart co-vary with each other. (We will later replace this linear measure with mutual information, which is suited also to describe nonlinear relations.) If the input to the system is a stochastic process with input values at different times uncorrelated, ⟨e_i e_j⟩ = 0 for i ≠ j, then all of the cross terms will disappear from the expectation value in Eq. (3), and the resulting autocorrelation coefficients of the MA(N) model are

\[ \rho_\tau = \frac{\sum_{n=0}^{N-\tau} b_n b_{n+\tau}}{\sum_{n=0}^{N} b_n^2} \quad \text{for } 0 \le \tau \le N, \]

and ρ_τ = 0 for τ > N.
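A minimal sketch of estimating these coefficients directly from a finite series (a plain implementation of the definition; no corrections for finite-sample bias are attempted):

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Estimate rho_tau for tau = 0..max_lag directly from the definition."""
    x = np.asarray(x, dtype=float)
    mu, var = x.mean(), x.var()
    return np.array([np.mean((x[: len(x) - tau] - mu) * (x[tau:] - mu)) / var
                     for tau in range(max_lag + 1)])
```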
The Breakdown of Linear Models
We have seen that ARMA coefficients, power spectra, and autocorrelation coefficients contain the same information about a linear system that is driven by uncorrelated white noise. Thus, if and only if the power spectrum is a useful characterization of the relevant features of a time series, an ARMA model will be a good choice for describing it. This appealing simplicity can fail entirely for even simple nonlinearities if they lead to complicated power spectra (as they can). Two time series can have very similar broadband spectra but can be generated from systems with very different properties, such as a linear system that is driven stochastically by external noise, and a deterministic (noise-free) nonlinear system with a small number of degrees of freedom. One of the key problems addressed in this chapter is how these cases can be distinguished; linear operators definitely will not be able to do the job.
Let us consider two nonlinear examples of discrete-time maps (like an AR model, but now nonlinear):
The first example can be traced back to Ulam (1957): the next value of a series is derived from the present one by a simple parabola,

\[ x_t = \lambda x_{t-1} (1 - x_{t-1}) . \tag{11} \]
Popularized in the context of population dynamics as an example of a simple mathematical model with very complicated dynamics (May, 1976), it has been found to describe a number of controlled laboratory systems, such as hydrodynamic flows and chemical reactions, because of the universality of smooth unimodal maps (Collet & Eckmann, 1980). In this context, this parabola is called the logistic map or quadratic map. The value x_t deterministically depends on the previous value x_{t-1}; λ is a parameter that controls the qualitative behaviour, ranging from a fixed point (for small values of λ) to deterministic chaos. For example, for λ = 4, each iteration destroys one bit of information.
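A minimal sketch of this loss of information: iterate the map from two initial conditions that differ by 10^-10 and watch the separation grow by roughly a factor of two per step until it saturates (λ = 4 is assumed, as in the text).

```python
# Logistic map x_t = 4 * x_{t-1} * (1 - x_{t-1}) from two nearby starting points.
lam = 4.0
x, y = 0.4, 0.4 + 1e-10
for t in range(1, 41):
    x = lam * x * (1.0 - x)
    y = lam * y * (1.0 - y)
    if t % 10 == 0:
        print(t, abs(x - y))    # separation grows until it is of order one
```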
Consider that, by plotting x_t against x_{t-1}, each value of x_t has two equally likely predecessors or, equally well, that the average magnitude of the slope is two: if we know the location to within ε before the iteration, we will on average know it only to within 2ε afterwards. This exponential increase in uncertainty is the hallmark of deterministic chaos (the divergence of nearby trajectories). The second example is equally simple: consider the time series generated by the map
Understanding and Learning
Strong models have strong assumptions. They are usually expressed in a few equations with a few parameters, and can often explain a plethora of phenomena. In weak models, on the other hand, there are only a few domain-specific assumptions. To compensate for the lack of explicit knowledge, weak models usually contain many more parameters (which can make a clear interpretation difficult). It can be helpful to conceptualize models in the two-dimensional space spanned by the axes data-poor vs. data-rich and theory-poor vs. theory-rich. Due to the dramatic expansion of the capability for automatic data acquisition and processing, it is increasingly feasible to venture into the theory-poor and data-rich domain.
Strong models are clearly preferable, but they often originate in weak models. (However, if the behaviour of an observed system does not arise from simple rules, they may not be appropriate.) Consider planetary motion (Gingerich, 1992). Tycho Brahe's (1546-1601) experimental observations of planetary motion were accurately described by Johannes Kepler's (1571-1630) phenomenological laws; this success helped lead to Isaac Newton's (1642-1727) simpler but much more general theory of gravity, from which these laws could be derived; Henri Poincaré's (1854-1912) inability to solve the resulting three-body gravitational problem helped lead to the modern theory of dynamical systems and, ultimately, to the identification of chaotic planetary motion (Sussman & Wisdom, 1988, 1992).
As in the previous section on linear systems, there are two complementary tasks: discovering the properties of a time series generated from a given model, and inferring a model from observed data. We focus here on the latter, but there has been comparable progress for the former. Exploring the behaviour of a model has become feasible in interactive computer environments, such as Cornell's dstool (available by anonymous ftp from macomb.tn.cornell.edu in pub/dstool),[9] and the combination of traditional numerical algorithms with algebraic, geometric, symbolic, and artificial intelligence techniques is leading to automated platforms for exploring dynamics (Abelson, 1990; Yip, 1991; Bradley, 1992).
For a nonlinear system, it is no longer possible to decompose an output into an input signal and an independent transfer function (and thereby find the correct input signal to produce a desired output), but there are adaptive techniques for controlling nonlinear systems (Hübler, 1989; Ott, Grebogi & Yorke, 1990) that make use of techniques similar to the modelling methods that we will describe. The idea of weak modelling (data-rich and theory-poor) is by no means new; an ARMA model is a good example.
What is new is the emergence of weak models (such as neural networks) that combine broad generality with insight into how to manage their complexity. For such models with broad approximation abilities and few specific assumptions, the distinction between memorization and generalization becomes important. Whereas the signal-processing community sometimes uses the term learning for any adaptation of parameters, we need to contrast learning without generalization with learning with generalization. Let us consider the widely and wildly celebrated fact that neural networks can learn to implement the exclusive OR (XOR). But what kind of learning is this? When four out of four cases are specified, no generalization exists! Learning a truth table is nothing but rote memorization: learning XOR is as interesting as memorizing the phone book. More interesting and more realistic are real-world problems, such as the prediction of financial data. In forecasting, nobody cares how well a model fits the training data; only the quality of future predictions counts, i.e., the performance on novel data, or the generalization ability.
Learning means extracting regularities from training examples that do transfer to new examples. Learning procedures are, in essence, statistical devices for performing inductive inference. There is a tension between two goals. The immediate goal is to fit the training examples, suggesting devices as general as possible so that they can learn a broad range of problems. In connectionism, this suggests large and flexible networks, since networks that are too small might not have the complexity needed to model the data. The ultimate goal of an inductive device is, however, its performance on cases it has not yet seen, i.e., the quality of its predictions outside the training set.
This suggests, at least for noisy training data, networks that are not too large, since networks with too many high-precision weights will pick out idiosyncrasies of the training set and will not generalize well. An instructive example is polynomial curve fitting in the presence of noise. On the one hand, a polynomial of too low an order cannot capture the structure present in the data. On the other hand, a polynomial of too high an order, going through all of the training points and merely interpolating between them, captures the noise as well as the signal and is likely to be a very poor predictor for new cases. This problem of fitting the noise in addition to the signal is called overfitting.
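A minimal sketch of the effect, on an assumed synthetic example (a noisy sine, not one of the competition data sets): polynomials of increasing order are fitted to noisy training points, and the error on the training points is compared with the error on noise-free new points.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0.0, 1.0, 30))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(30)   # signal plus noise
x_new = np.linspace(0.0, 1.0, 200)
y_new = np.sin(2 * np.pi * x_new)                           # noise-free "new cases"

for order in (1, 3, 9):
    coeffs = np.polyfit(x, y, order)                        # least-squares polynomial fit
    fit_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    new_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(order, round(fit_err, 4), round(new_err, 4))
```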
By employing a regularizer (i.e., a term that penalizes the complexity of the model) it is often possible to fit the parameters and to select the relevant variables at the same time. Neural networks, for example, can be cast in such a Bayesian framework (Buntine & Weigend, 1991). To clearly separate memorization from generalization, the true continuation of the competition data was kept secret until the deadline, ensuring that the continuation data could not be used by the participants for tasks such as parameter estimation or model selection.
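Returning to the regularizer idea: a minimal sketch of its simplest linear version, a quadratic penalty on the parameters (ridge-style least squares). This only illustrates the principle of penalizing complexity; it is not the Bayesian network framework cited above.

```python
import numpy as np

def penalized_fit(X, y, penalty):
    """Minimize ||y - X w||^2 + penalty * ||w||^2; a larger penalty shrinks the
    weights and so restrains the effective complexity of the model."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(d), X.T @ y)
```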
Successful forecasts of the withheld test set (also called out-of-sample predictions) from the provided training set (also called the fitting set) were produced by two general classes of techniques: those based on state-space reconstruction (which make use of explicit understanding of the relationship between the internal degrees of freedom of a deterministic system and an observable of the system's state in order to build a model of the rules governing the measured behaviour of the system), and connectionist modelling (which uses potentially rich models along with learning algorithms to develop an implicit model of the system). We will see that neither is uniquely preferable: the domains of applicability are not the same, and the choice of which to use depends on the goals of the analysis (such as an understandable description vs. accurate short-term forecasts).
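As an illustration of the first ingredient of the state-space approach, a minimal sketch of time-delay embedding: a scalar series is mapped into vectors of lagged values, on which a geometric model of the dynamics can then be built (the choice of dimension and lag is a separate question, not addressed here).

```python
import numpy as np

def delay_embed(x, dim, lag):
    """Return an array whose rows are the delay vectors
    (x_t, x_{t+lag}, ..., x_{t+(dim-1)*lag})."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * lag
    return np.column_stack([x[i * lag : i * lag + n] for i in range(dim)])
```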
The Future
We have surveyed the results of what appears to be a steady progress of insight over ignorance in analyzing time series. Is there a limit to this development? Can we hope for the discovery of a universal forecasting algorithm that will predict everything about all time series? The answer is emphatically no.