Financial time series forecasting with support vector machines



Figure 1: The setup of the model. First indicators are calculated, then the resulting time series are preprocessed. With these the support vector machine is trained and the parameters found with cross-validation on a training set. Finally the prediction is calculated and tested.

Stock return predictability has been a subject of great controversy, with the debate ranging from market efficiency to the number of factors that carry information on future stock returns. Support vector regression, on the other hand, has gained great momentum in its ability to predict time series in various applications, including finance (Smola and Schölkopf, 1998).

The construction of a prediction model requires factors that are believed to have some intrinsic explanatory power. These explanatory factors fall largely into two categories: fundamental and technical. Fundamental factors include, for example, macroeconomic indicators, which, however, are usually published only infrequently. Technical factors are based solely on the properties of the underlying time series and can therefore be calculated at the same frequency as the time series. Since this study applies support vector regression to high-frequency data, only technical factors are considered.

The goal of this study is to predict stock price movements solely from the statistical properties of the underlying financial time series. Therefore, financial indicators are extracted from the time series and used by a support vector regression (SVR) to predict market movement. The indicators are chosen somewhat arbitrarily from the large variety of available financial indicators; they include price differences, moving averages, relative strength and the so-called stochastic indicator, as shown in the figure. These indicator time series are then preprocessed by subtracting the mean and dividing by the standard deviation, so that each indicator has zero mean and unit variance. Before the SVR model is trained, its parameters are optimized using a cross-validation procedure on a training set. After that, the optimized model is used to predict financial market movement.
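The standardization step can be sketched as follows; the indicator matrix here is random placeholder data, not the actual indicator series:

```python
import numpy as np

# Hypothetical indicator matrix: rows = time points, columns = indicators.
rng = np.random.default_rng(0)
indicators_raw = rng.normal(loc=5.0, scale=2.0, size=(1000, 4))

# Subtract the column means and divide by the column standard deviations,
# yielding indicator series with zero mean and unit variance.
standardized = (indicators_raw - indicators_raw.mean(axis=0)) / indicators_raw.std(axis=0)
```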

In the process of model selection, models are chosen only on the basis of performance over out-of-sample data, in order to avoid the critique of judging the model on the basis of in-sample performance. The model selection is based on a cross-validation procedure commonly used in Data Mining.

Our main results show that stock market prediction based on support vector regression significantly outperforms a random prediction. However, the prediction is on average correct only 50.69 percent of the time, with a standard deviation of 0.26 percent.


Indicator calculation

Figure 2: Kernel densities of the indicators for a random time series. The RDP and EMA indicators seem to be rather Gaussian distributed, while the RSI and Stochastic indicators have several modes.

Several financial indicators are calculated in order to reduce the dimensionality of the time series:

  • RDP_t = \frac{p(t) - p(t-1)}{p(t)}: The relative price difference of the prices p(t) at time t and p(t-1) at time t-1
  • EMA_t = \sum_{i=0}^{n} \exp(-i)\, p(t-i): The exponential moving average of the prices p(t)
  • RSI_t = \frac{U_{[t-n;t]}}{D_{[t-n;t]}}: The relative strength indicator, the ratio of the number of upward movements U_{[t-n;t]} to downward movements D_{[t-n;t]} in the period from t-n until t
  • Stochastic_t = \frac{p(t) - L_{[t-n;t]}}{H_{[t-n;t]} - L_{[t-n;t]}}: The stochastic indicator of the stock price p(t), the lowest stock price L_{[t-n;t]} and the highest stock price H_{[t-n;t]} in the period from t-n until t
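A minimal NumPy sketch of how these four indicators could be computed; the window length n=14 is a hypothetical choice, not taken from the study:

```python
import numpy as np

def indicators(p, n=14):
    """Compute the four indicators defined above for a price series p.

    The window length n=14 is a placeholder value.
    """
    t = np.arange(n, len(p))
    # RDP: relative price difference.
    rdp = (p[t] - p[t - 1]) / p[t]
    # EMA: exponentially weighted sum of the last n+1 prices.
    weights = np.exp(-np.arange(n + 1))
    ema = np.array([weights @ p[i - np.arange(n + 1)] for i in t])
    # RSI: ratio of upward to downward moves in the window.
    diffs = np.diff(p)
    ups = np.array([np.sum(diffs[i - n:i] > 0) for i in t])
    downs = np.array([np.sum(diffs[i - n:i] < 0) for i in t])
    rsi = ups / np.maximum(downs, 1)  # guard against division by zero
    # Stochastic: position of p(t) within the window's low-high range.
    lo = np.array([p[i - n:i + 1].min() for i in t])
    hi = np.array([p[i - n:i + 1].max() for i in t])
    sto = (p[t] - lo) / (hi - lo)
    return rdp, ema, rsi, sto
```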

The figure illustrates some of the properties of the indicators derived as above from a random time series. The kernel densities are estimated for each indicator with a bandwidth of 0.001. Note, that the RDP and EMA indicator are rather gaussian distributed, while the RSI and Stochastic indicators have several modi and especially the Stochastic indicator seems to be a mixture of two different gaussian distributions.
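The density estimation can be sketched with a plain Gaussian kernel density estimator; the bandwidth of 0.001 is the one stated above, while the sample data in the test are hypothetical:

```python
import numpy as np

def gaussian_kde(samples, grid, bandwidth=0.001):
    """Gaussian kernel density estimate evaluated on a grid.

    Each sample contributes one Gaussian bump of width `bandwidth`;
    the bumps are averaged and normalized to integrate to one.
    """
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).sum(axis=1) / (
        len(samples) * bandwidth * np.sqrt(2.0 * np.pi)
    )
```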


Figure 3: The machine parameters. The kernel transforms the original space into a usually higher-dimensional space, in which the classification problem becomes trivial in this example. Another parameter of the machine is the e of the e-insensitive loss function.

The method of support vector regression includes several parameters, which can, for example, be optimized using cross-validation. These parameters include the chosen kernel with its parameter \gamma, the e of the e-insensitive loss function, the cost of error c and the number of training samples. The advantage of using a kernel is that cases which are not linearly separable may become so after the transformation, as shown at the top of the figure: the black and white labeled points on the left side are not linearly separable, but after the kernel transformation into the new space the classification problem becomes trivial. Choosing the kernel is therefore of high importance, as is the parameter \gamma of the kernel function.

Another parameter is the e of the e-insensitive loss function, illustrated at the bottom of the figure. The support vector regression model is trained by penalizing predictions that are off target. The penalty is determined by the e-insensitive loss function with parameter e: deviations are penalized only if their absolute value exceeds e.
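The e-insensitive loss itself is straightforward to write down; the default eps=0.1 below is only a placeholder value:

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """e-insensitive loss: deviations smaller than eps cost nothing,
    larger deviations are penalized linearly beyond the eps tube."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)
```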

Given the kernel K(x_i,x_j) = \phi(x_i)^T\phi(x_j) and the training set of instance-label pairs (x_i,y_i), i = 1,...,l, where x_i \in R^n and y_i \in \{1, -1\}, the optimization problem of the support vector machine can be formulated as

\min_{w,b,E} \; \frac{1}{2} w^Tw + c \sum_{i=1}^l E_i

subject to
y_i(w^T \phi(x_i) + b) \ge 1 - E_i,
E_i \ge 0.

The support vector machine then maximizes the margin 2/\|w\| of the separating hyperplane between the classes, which is equivalent to minimizing \frac{\|w\|}{2} and therefore also to minimizing \frac{\|w\|^2}{2}.
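In practice such a machine need not be implemented from scratch; for instance, scikit-learn's SVR exposes exactly the parameters discussed here (kernel, gamma, C and epsilon). The data and parameter values below are illustrative only, not those of the study:

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical smooth target to regress on.
rng = np.random.default_rng(3)
X = rng.uniform(-3.0, 3.0, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(0.0, 0.05, 300)

# RBF kernel with parameter gamma, cost of error C and the
# e-insensitive tube epsilon -- the quantities tuned by cross-validation.
model = SVR(kernel="rbf", gamma=1.0, C=1000.0, epsilon=0.01)
model.fit(X, y)
r2 = model.score(X, y)  # in-sample coefficient of determination
```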


Figure 4: Cross-validation setup. Several parameter values are tested for their prediction accuracy on a training set, and the optimal parameter combination is then chosen for prediction on the test set.

Since the SVR parameters can easily be controlled manually, the optimal set of parameters is chosen on the training set and then used on the subsequent test set. Cross-validation is applied as illustrated in the figure. The total data set is divided into two parts, one for cross-validation and one for testing. A third part of the data set for optimizing the structure of the model, such as the choice of indicators, is omitted in this study.

In order to optimize the number of training samples, the cost of error c, the kernel parameter and the parameter e of the e-insensitive loss function, k-fold cross-validation is used as follows: the data set is divided into k folds of equal size; subsequently, a model is built on each of the k combinations of k-1 folds, and each time the remaining fold is used for validation. The best model is the one that performs best on average over the k validation folds. The benefit of a cross-validation procedure is that, by construction, model selection is based entirely on out-of-sample rather than in-sample performance. Thus, the search for the best support vector regression model is immune to the critique of drawing conclusions about the merits of a factor model from its in-sample performance.

In this study, a 10-fold cross-validation procedure was used for each of the parameters above. In each validation loop, different values for one parameter are chosen while the other parameters are held constant. The SVR model is then trained with this set of parameters and the prediction accuracy is calculated. This is done for all parameter combinations, and the combination with the maximal prediction accuracy is chosen.
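This parameter search can be sketched with scikit-learn's GridSearchCV, which performs the k-fold procedure described above; the grid values and data below are placeholders, not the values used in the study:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Hypothetical standardized indicator matrix X and next-period returns y.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 4))
y = X @ np.array([0.5, -0.3, 0.1, 0.0]) + rng.normal(0.0, 0.1, 200)

# 10-fold cross-validation over a small illustrative grid of
# C, gamma and epsilon; the best combination is kept.
grid = {"C": [1.0, 100.0], "gamma": [0.1, 1.0], "epsilon": [0.01, 0.1]}
search = GridSearchCV(SVR(kernel="rbf"), grid, cv=10)
search.fit(X, y)
best_params = search.best_params_
```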

Basic model

Figure 5: The basic model. The machine is trained on past values of the indicators. The resulting model is used to predict the movement on the next day (= 108 data points). After that, the model window is shifted and the procedure is repeated.

The basic simulation consists of two steps. First, at time t, all historical values of all explanatory factors, together with the differences in returns for the periods t - n1 until t - 1, are used to build a support vector regression model. The dependent variable is the return of the stock in the period from t until t + n2. The variable n2 is arbitrarily set to 108 in order to decrease calculation time. The independent variables are the technical indicators described above.

Second, once the prediction is calculated, the model window is shifted by 108 data points and the model is built again in order to predict the next 108 stock price movements.
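The two-step walk-forward scheme can be sketched as follows. The training length and the fit/predict callables are placeholders for the actual SVR training; a rolling training window is assumed here, which is one possible reading of the setup:

```python
import numpy as np

def walk_forward(X, y, fit, predict, train_len=500, step=108):
    """Walk-forward evaluation: train on the past `train_len` points,
    predict the next `step` points, then shift the window by `step`."""
    preds = []
    t = train_len
    while t + step <= len(y):
        model = fit(X[t - train_len:t], y[t - train_len:t])
        preds.append(predict(model, X[t:t + step]))
        t += step
    return np.concatenate(preds) if preds else np.array([])
```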

Using only historically available data ensures that the trading strategy is carried out without the benefit of foresight, in the sense that investment decisions are not based on data that became available after any of the to-be-predicted periods. Moreover, investment decisions for the to-be-predicted periods are always based on the entire factor set of historical data, ensuring that no variable-selection procedures based on extensive manipulation of the whole available data have been carried out. At any rate, the cross-validation procedure used for model selection ensures that the best candidate model is selected on the basis of performance on the training set and not on external validation samples.

Results and discussion

The data set

Figure 6: 5 minute log returns of 28 DAX shares above the market average.

The data set consists of 5-minute closing prices p(t) for 28 stocks of the Deutsche Aktien Index (DAX). The missing stocks are TUI and Hypo Real Estate, due to data unavailability. With a time frame of more than five years between April 2001 and August 2006, the data set comprises about 140,000 data points per stock.

From this data set, the log return of each stock i is calculated as

 x_i(t) = \log\left(\frac{p_i(t)}{p_i(t-1)}\right)

with price p_i(t) of stock i at time t, as well as the market average as

 x_{market}(t) = \frac{1}{28} \sum_{i=1}^{28} x_i(t)

over all stocks i. From this the log return above market is calculated as  x_i'(t) = x_i(t) - x_{market}(t) for each stock i.
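These transformations can be sketched as follows; the price matrix is simulated placeholder data, and the market average is the mean over the 28 stocks:

```python
import numpy as np

# Hypothetical price matrix: rows = 5-minute time points, columns = 28 stocks.
rng = np.random.default_rng(5)
prices = 100.0 * np.exp(np.cumsum(rng.normal(0.0, 0.001, size=(500, 28)), axis=0))

# Log returns per stock, the market average, and returns above market.
log_ret = np.log(prices[1:] / prices[:-1])
market = log_ret.mean(axis=1, keepdims=True)   # average over the 28 stocks
above_market = log_ret - market
```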


Figure 7: Cross-validation on the training set. Several values are tested for each of the machine parameters. The cost and training-length parameters show linear dependencies, while the kernel parameter gamma shows a quadratic dependency. The e parameter is rather nonlinearly related to the prediction accuracy.

Several parameter conditions were tested on the first half of the data set. The figure shows the tested values for each parameter. The optimality criterion used here is the cumulated return: the model is trained with a given parameter set, the prediction is calculated, and the return resulting from the prediction is cumulated over time. The parameter values are tested on half of the data set, that is, between May 2001 and July 2003.

At the top left of the figure, the results for different values of the cost parameter are shown. With an increasing cost value, the cumulated return increases. This seems plausible, since with an increasing cost the model is trained longer. However, the parameter exploration is stopped at a cost value of 1000, since higher values increase computation time dramatically.

The top right of the figure shows different values for the parameter e of the e-insensitive loss function. Here the relation to the cumulated return appears rather nonlinear: the cumulated return decreases with increasing e only on the whole, but smaller values of e generally seem to be more successful. Since this value controls the penalty of the training algorithm, a small value corresponds to a quick penalty for values off target.

The kernel parameter gamma, plotted for different values at the bottom left, seems to approach an optimal value around 1. The parameter controls the shape of the kernel. With high parameter values, the kernel becomes rather flat and the model increasingly predicts future movements only linearly, which is obviously insufficient. With small parameter values, the kernel becomes very narrow and the training data are increasingly overfitted, with decreasing generalization performance. This again results in low prediction performance.

Last, the prediction performance increases with an increasing number of training points; the quality of the trained model thus grows with the number of training samples.

Prediction accuracy

The optimized parameters were tested with the basic model approach described above on the second half of the data set. The prediction accuracy over all 28 stocks reached a mean of 50.69 percent with a standard deviation of 0.26 percent. With this performance, the reported approach significantly outperformed a random prediction. Even though a gain of 0.69 percentage points might already be valuable for trading, note that this approach is market neutral and operates only on the basic statistical properties of market movements.
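As a rough plausibility check of the significance claim, one can treat the 28 per-stock accuracies as independent observations (an assumption not stated in the text) and compute a one-sample t-statistic against the 50 percent chance level:

```python
import numpy as np

# Mean accuracy, its across-stock standard deviation, and stock count
# as reported above; the independence assumption is ours.
mean_acc, std_acc, n_stocks = 0.5069, 0.0026, 28
t_stat = (mean_acc - 0.5) / (std_acc / np.sqrt(n_stocks))
```

A t-statistic this large would indeed be highly significant under the stated assumption.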


This study reported on financial time series forecasting with support vector machines. The underlying time series were derived from the Deutsche Aktien Index. The support vector machine was trained to predict the movement of 28 stocks of the index against the market average. Features for training were extracted directly from the statistical properties of the time series; no fundamental information was used.

Model selection was based on performance on out-of-sample data via cross-validation, in order to avoid the critique of foresight. The main result of this study is that the movement of stocks can be predicted significantly better than chance using only technical indicators with support vector regression.


References

  • Bauer, R. and R. Molenaar, Is the Value Premium Predictable in Real Time?, Working Paper 02-003, Limburg Institute of Financial Economics, 2002
  • Fama, E. and K. French, The cross-section of expected stock returns, Journal of Finance 47, pp. 427-465, 1992
  • Müller, K., A. Smola, G. Rätsch, B. Schölkopf, J. Kohlmorgen and V. Vapnik, Predicting time series with support vector machines, Proceedings of the International Conference on Artificial Neural Networks, Springer Lecture Notes in Computer Science, Springer, 1997
  • Nalbantov G., Short-horizon value-growth style rotation with support vector machines, Final thesis, Maastricht University, 2003
  • Smola A. and B. Schölkopf, A tutorial on support vector regression, NeuroCOLT2 Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998
  • Vapnik, V., The Nature of Statistical Learning Theory, Springer, New York, 1995; 2nd edition, 2000

