Special Issue on Data Mining in Finance © World Scientific Publishing Company

A CONSTRAINED NEURAL NETWORK KALMAN FILTER FOR PRICE ESTIMATION IN HIGH FREQUENCY FINANCIAL DATA

In this paper we present a neural network extended Kalman filter for modeling noisy financial time series. The neural network is employed to estimate the nonlinear dynamics of the extended Kalman filter. Conditions for the neural network weight matrix are provided to guarantee the stability of the filter. The extended Kalman filter presented is designed to filter three types of noise commonly observed in financial data: process noise, measurement noise, and arrival noise. The erratic arrival of data (arrival noise) results in the neural network predictions being iterated into the future. Constraining the neural network to have a fixed point at the origin produces better iterated predictions and more stable results. The performance of constrained and unconstrained neural networks within the extended Kalman filter is demonstrated on "Quote" tick data from the $/DM exchange rate (1993-1995).


Introduction
The study of financial tick data (trade data) is becoming increasingly important as the financial industry trades on shorter and shorter time scales. Tick data has many problematic features: it is often heavy tailed (Dacorogna 1995, Butlin and Connor 1996), it is prone to data corruption and outliers (Chung 1991), and its variance is heteroscedastic with a seasonal pattern within each day (Dacorogna 1995). However, the most serious problem with applying conventional methodologies to tick data is its erratic arrival. The focus of this study is the prediction of erratic time series with neural networks. The issues of robust prediction and non-stationary variance are explored in Bolland and Connor (1996a) and Bolland and Connor (1996b).
There are three distinct types of noise found in real-world time series such as financial tick data. Process noise represents the shocks that drive the dynamics of the stochastic process. The distribution of the process/system noise is generally assumed to be Gaussian, although for financial data the noise distributions can often be heavy tailed.
Measurement noise is the noise encountered when observing and measuring the time series. The measurement error is usually assumed to be Gaussian, but the measurement of financial data is often corrupted by gross outliers.
Arrival noise reflects uncertainty concerning whether an observation will occur at the next time step.
Foreign exchange quote data is strongly affected by erratic data arrival. At times quotes are missing for forty seconds; at other times several ticks are contemporaneously aggregated.
These three types of noise have been widely studied in the engineering field for the case of a known deterministic system. The Kalman filter was invented to estimate the state vector of a linear deterministic system in the presence of process, measurement, and arrival noise. The Kalman filter has also been applied in the field of econometrics, where the deterministic system is unknown and must be estimated from the data; see for example Engle and Watson (1987). In Sec. 2, we give a brief description of the workings of the Kalman filter on linear models.
Neural networks have been successfully applied to the prediction of time series by many practitioners in a diverse range of problem domains (Weigend 1990). In many cases neural networks have been shown to yield significant performance improvements over conventional statistical methodologies. Neural networks have many attractive features compared to other statistical techniques: they make no parametric assumption about the functional relationship being modeled, they are capable of modeling interaction effects, and the smoothness of fit is locally conditioned by the data. However, neural networks are generally designed under careful laboratory conditions which, while taking into consideration process noise, usually ignore the presence of measurement and arrival noise. In Sec. 3, we show how the extended Kalman filter can be used with neural network models to produce reliable predictions in the presence of process, measurement, and arrival noise.
When observations are missing, the neural network's predictions are iterated. Because the iterated feedforward network's predictions are based on its own previous predictions, it acts as a discrete-time recurrent neural network. The dynamics of the recurrent network may converge to a stable point; however, it is equally possible that the recurrent network could oscillate or become chaotic. Section 4 describes a neural network model which is constrained to have a stable point at the origin. Conditions are determined under which the neural network will always converge to a single fixed point. This is the neural network analogue of a stable linear system, which will always converge to the fixed point of zero. The constrained neural network is useful for two reasons: (1) the extended Kalman filter is guaranteed to have bounded errors if the neural network model is observable and controllable (discussed in Sec. 3), and the existence of a stable fixed point will help make this the case; (2) it reflects our belief that price increments beyond a certain horizon are unpredictable for the Dollar-DM foreign exchange rates.
For different problems, a neural network with a fixed point at zero may not make sense, in which case we do not advocate the constrained neural network.However, for modeling foreign exchange data, this constrained neural network should yield better results.
In Sec. 5, the performance of the extended Kalman filter with a constrained neural network is shown on Dollar-Deutsche Mark foreign exchange tick data.The performance of the constrained neural network is shown to be both quantitatively and qualitatively superior to the unconstrained neural network.

Linear Kalman Filter
Kalman filters originated in the engineering community with Kalman and Bucy (1960) and have been extensively applied to filtering, smoothing and control problems. The modeling of time series in state space form has advantages over other techniques both in interpretability and estimation. The Kalman filter lies at the heart of state space analysis and provides the basis for likelihood estimation. The general state space form for a multivariate time series is expressed as follows. The observable variables, an n × 1 vector y_t, are related to an m × 1 vector x_t, known as the state vector, via the observation equation

$$y_t = H_t x_t + \varepsilon_t, \qquad (1)$$

where H_t is the n × m observation matrix and ε_t, which denotes the measurement noise, is an n × 1 vector of serially uncorrelated disturbance terms with mean zero and covariance matrix R_t. The states are unobservable and are known to be generated by a first-order Markov process,

$$x_t = \Phi_t x_{t-1} + \Gamma_t \eta_t, \qquad (2)$$

where Φ_t is the m × m state matrix, Γ_t is an m × g matrix, and η_t a g × 1 vector of serially uncorrelated disturbance terms with mean zero and covariance matrix Q_t. A linear AR(p) time series model can be represented in the general state space form by

$$x_t = \begin{pmatrix} \phi_1 & \phi_2 & \cdots & \phi_{p-1} & \phi_p \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix} x_{t-1} + \Gamma_t \eta_t, \qquad (3)$$

$$y_t = (1, 0, \ldots, 0)\, x_t + \varepsilon_t, \qquad (4)$$

where Γ_t is 1 for the element (1, 1) and zero elsewhere. The state space model given by (3) and (4) is known as the phase canonical form and is not unique for AR(p) models. There are three other forms which differ in how the parameters are displayed, but all representations give equivalent results; see for example Aoki (1987) or Akaike (1975). The Kalman filter is a recursive procedure for computing the optimal estimate of the state vector at time t, based on the information available at time t. It enables the estimate of the state vector to be continually updated as new observations become available. For linear system equations with normally distributed disturbance terms, the Kalman filter produces the minimum mean squared error estimate of the state x_t. The filtering equation, written in the prediction error correction format, is

$$\hat{x}_{t+1|t+1} = \hat{x}_{t+1|t} + K_{t+1}\, \bar{y}_{t+1|t},$$

where $\hat{x}_{t+1|t} = \Phi_t \hat{x}_{t|t}$ is the predicted state vector based on information available at time t, K_{t+1} is the m × n Kalman gain matrix, and $\bar{y}_{t+1|t}$ denotes the innovations process given by $\bar{y}_{t+1|t} = y_{t+1} - H_{t+1} \hat{x}_{t+1|t}$. The estimate of the state at t + 1 is produced from the prediction based on information available at time t and a correction term based on the observed prediction error at time t + 1. The Kalman gain matrix is specified by the set of relations

$$P_{t+1|t} = \Phi_t P_{t|t} \Phi_t' + \Gamma_t Q_t \Gamma_t',$$
$$K_{t+1} = P_{t+1|t} H_{t+1}' \left( H_{t+1} P_{t+1|t} H_{t+1}' + R_{t+1} \right)^{-1},$$
$$P_{t+1|t+1} = (I - K_{t+1} H_{t+1}) P_{t+1|t},$$

where P_{t|t} is the filter error covariance. Given the initial conditions, P_{0|0} and $\hat{x}_{0|0}$, the Kalman filter gives the optimal estimate of the state vector as each new observation becomes available. The parameters of the system, represented by the vector ψ, can be estimated using maximum likelihood. For the typical system the parameters ψ would consist of the elements of Q_t and R_t, autoregressive parameters within Φ_t, and sometimes parameters from within the observation matrix H_t. The parameters will depend upon the specific formulation of the system being modeled. For normally distributed disturbance terms, writing the likelihood function L(y_t; ψ) in terms of the density of y_t conditioned on the information set at time t − 1, Y_{t−1} = {y_{t−1}, . . ., y_1}, the log likelihood function is

$$\log L(y; \psi) = -\frac{nT}{2}\log 2\pi - \frac{1}{2} \sum_{t=1}^{T} \log \left| N_{t|t-1}(\psi) \right| - \frac{1}{2} \sum_{t=1}^{T} \bar{y}_{t|t-1}(\psi)'\, N_{t|t-1}(\psi)^{-1}\, \bar{y}_{t|t-1}(\psi), \qquad (9)$$

where $\bar{y}_{t|t-1}(\psi)$ is the innovations process and $N_{t|t-1}(\psi)$ is the covariance of that process,

$$N_{t|t-1}(\psi) = H_t P_{t|t-1} H_t' + R_t. \qquad (10)$$

The log likelihood function given in (9) depends on the initial conditions P_{0|0} and $\hat{x}_{0|0}$. de Jong (1989) showed how the Kalman filter can be augmented to model the initial conditions as diffuse priors, which allows the initial conditions to be estimated without filtering implications. After τ time steps, where τ is specified by the modeller, a proper prior for the state vectors will be estimated and the log likelihood can then be described as in (9) and (10). The Kalman filter described is discrete (as tick data is quantized into time steps, i.e. seconds); however, the methodology could be extended to continuous time problems with the Kalman-Bucy filter (Meditch 1969).
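As a concrete illustration of the recursions above, the predict/correct cycle can be sketched in a few lines of NumPy. The variable names follow the notation of this section; this is a minimal sketch, not a production filter.

```python
import numpy as np

def kalman_step(x, P, y, Phi, H, Q, R):
    """One predict/correct cycle of the linear Kalman filter.
    x, P : filtered state estimate and error covariance at time t
    y    : observation at time t+1
    Returns the filtered estimate and covariance at time t+1."""
    # Prediction based on information available at time t
    x_pred = Phi @ x
    P_pred = Phi @ P @ Phi.T + Q
    # Innovations process and its covariance
    innov = y - H @ x_pred
    N = H @ P_pred @ H.T + R
    # Kalman gain and prediction error correction
    K = P_pred @ H.T @ np.linalg.inv(N)
    x_filt = x_pred + K @ innov
    P_filt = (np.eye(len(x)) - K @ H) @ P_pred
    return x_filt, P_filt
```

For an AR(p) model in phase canonical form, Phi is the companion matrix of (3) and H the selection vector of (4).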

Missing data
Irregular (erratic) time series present a serious problem to conventional modeling methodologies. Several methodologies for dealing with erratic data have been suggested in the literature. Muller et al. (1990) suggest methods of linear interpolation between erratic observations to obtain a regular homogeneous time series. Other authors (Ghysels and Jasiak 1995) have favored nonlinear time deformation ("business-time" or "tick-time"). The Kalman filter can be simply modified to deal with observations that are either missing or subject to contemporaneous aggregation. The choice of time discretization (i.e. seconds, minutes, hours) will depend on the specific irregular time series. Too fine a discretization and the data set will be composed mainly of missing observations; too coarse a discretization and non-synchronous data will be aggregated. The timing (time-stamp) of financial transactions is only measured to a precision of one second. A time discretization of one second is therefore a natural choice for modeling financial time series, as aggregation will be across synchronous data and missing observations will be in the minority.
Augmenting the Kalman filter to deal with erratic time series is achieved by allowing the dimensions of the observation vector y_t and the observation errors ε_t to vary at each time step (n_t × 1). The observation equation dimensions now vary with time:

$$y_t = W_t H_t x_t + \varepsilon_t, \qquad (11)$$

where W_t is an n_t × n matrix of fixed weights. This gives rise to several possible situations:
• Contemporaneous aggregation of the first component of the observation vector, in addition to the other components. The weight matrix is ((n + 1) × n) and has a row augmented for the first component, giving rise to the two values $y^1_{t,1}$ and $y^1_{t,2}$.
• All components of the observation equation occur, where the weight matrix is square (n × n) and an identity.
• The ith component of the observation vector is unobserved, so n t is (n − 1) and the weight matrix (n × (n − 1)) has a single row removed.
• All components of the observations vector are unobserved, n t is zero, and the weight matrix is undefined.
The dimensions of the innovations $\bar{y}_{t|t-1}(\psi)$ and their covariance $N_{t|t-1}(\psi)$ also vary with n_t. Equation (10) is undefined when n_t is zero. When there are no observations at a given time t, the Kalman filter updating equations can simply be skipped. The resulting predictions for the states and filtering error covariances are $\hat{x}_t = \hat{x}_{t|t-1}$ and $P_t = P_{t|t-1}$. For consecutive missing observations (n_t = 0, . . ., n_{t+l} = 0) a multiple step prediction is required. With repeated substitution into the transition Eq. (2), multiple step predictions are obtained by

$$x_{t+l} = \Phi^l x_t + \sum_{j=0}^{l-1} \Phi^j \Gamma \eta_{t+l-j}. \qquad (16)$$

The conditional expectation at time t of (16) is given by

$$\hat{x}_{t+l|t} = \Phi^l \hat{x}_{t|t}, \qquad (17)$$

and similarly the expectation of the multiple step ahead prediction error covariance, for the case of a time invariant system, is given by

$$P_{t+l|t} = \Phi^l P_{t|t} (\Phi^l)' + \sum_{j=0}^{l-1} \Phi^j \Gamma Q \Gamma' (\Phi^j)'. \qquad (18)$$

The multiple step predictions $\hat{y}_{t+l|t}$ and $\hat{x}_{t+l|t}$ can be shown to be the minimum mean square error estimators of E_t(y_{t+l}) and E_t(x_{t+l}).
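The skip-update logic for missing observations can be sketched as follows; passing y = None represents a time step with n_t = 0. Names are chosen to mirror the notation above; this is an illustrative sketch.

```python
import numpy as np

def kalman_missing_step(x, P, y, Phi, H, Q, R):
    """One Kalman recursion tolerant of erratic arrival (y is None when n_t = 0)."""
    # The prediction step always runs
    x_pred = Phi @ x
    P_pred = Phi @ P @ Phi.T + Q
    if y is None:
        # No observation: skip the updating equations, keep the prediction
        return x_pred, P_pred
    innov = y - H @ x_pred
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    return x_pred + K @ innov, (np.eye(len(x)) - K @ H) @ P_pred
```

Iterating the prediction step l times with no data reproduces the multiple step prediction of Φ applied l times to the last filtered state.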

Non-Gaussian Kalman filter
The Gaussian estimation methods outlined above can produce very poor results in the presence of gross outliers. The distributions of the process and measurement noise for the standard Kalman filter are assumed to be Gaussian, but the filter can be adapted to systems where either the process or measurement noise is non-Gaussian (Mazreliez 1973). Mazreliez showed that the Kalman filter update equations depend on the distribution of the innovations and its score function. For a known distribution the optimal minimum variance estimator can be produced. In the situation where the distribution is unknown, piecewise linear methods can be employed to approximate it (Kitagawa 1987); since the filtering equations depend on the derivative of the distribution, such methods can be inaccurate. Other methods take advantage of higher moments of the density functions and require no assumptions about the shape of the probability densities (Hilands and Thomoploulos 1994). Carlin et al. (1992) demonstrate a Kalman filter robust to non-Gaussian state noise. For robustness, heavy tailed distributions can be considered as a mixture of Gaussians, or as an ε-contaminated distribution. If the bulk of the data behaves in a Gaussian manner then we can employ a density function which is not dominated by the tails. The Huber distribution (Huber 1980) can be shown to be the least informative distribution in the case of ε-contaminated data.
In this paper we achieve robustness to measurement outliers by the methodology first described in Bolland and Connor (1996a). A Huber function is employed as the score function of the residuals. For spherically ε-contaminated normal distributions in $\mathbb{R}^3$ the score function g_0 for the Huber function is given by

$$g_0(r) = \begin{cases} r, & \|r\| \le c, \\ c\, r / \|r\|, & \|r\| > c. \end{cases} \qquad (19)$$

The Huber density behaves as a Gaussian for the center of the distribution and as an exponential in the tails. The heavy tails allow the robust Kalman filter to down-weight large residuals and provide a degree of insurance against the corrupting influence of outliers. The degree of insurance is determined by the distribution parameter c in (19).
The parameter can be related to the level of assumed contamination as shown by Huber (1980).
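In the scalar case the Huber score amounts to clipping the residual, which is what down-weights outliers. A minimal sketch:

```python
import numpy as np

def huber_score(r, c):
    """Score function of the Huber density: the Gaussian score r in the centre
    of the distribution, clipped to +/- c in the (exponential) tails."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= c, r, c * np.sign(r))
```

Residuals within ±c are passed through unchanged; larger residuals contribute only ±c, bounding the influence of any single observation.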

Sufficient conditions
For a Kalman filter to estimate the time varying states of the system given by (1) and (2), the system must be both observable and controllable.
An observable system allows the states to be uniquely estimated from the data. For example, in the no-noise condition (ε_t = 0) the state x_t can be determined uniquely from future observations if the observability matrix given by

$$O = \begin{pmatrix} H \\ H\Phi \\ \vdots \\ H\Phi^{m-1} \end{pmatrix}$$

has rank m, where m is the number of elements in the state vector x_t. The condition for observability is also equivalent to (see Aoki, 1990)

$$G_O > 0, \qquad (20)$$

where the observability Grammian G_O satisfies the following Lyapunov equation,

$$G_O = \Phi' G_O \Phi + H' H. \qquad (21)$$

The notion of controllability emerged from the control theory literature, where v_t denotes an action that can be made by an operator of a plant. If the system is controllable, any state can be reached with the correct sequence of actions. A system Φ is controllable if the reachability matrix

$$C = (\Gamma, \Phi\Gamma, \ldots, \Phi^{m-1}\Gamma)$$

has rank m. This is equivalent to requiring G_C > 0 (see Aoki, 1990), where the controllability Grammian G_C satisfies the following Lyapunov equation,

$$G_C = \Phi G_C \Phi' + \Gamma \Gamma'.$$

When a state space model is in one of the four canonical variate forms such as (3) and (4), the Grammians G_O and G_C are identical and the controllability and observability requirements are the same. If Φ is non-singular and the absolute values of its eigenvalues are less than one, the observability and controllability requirements of (20) and (21) will be met and the Kalman filter will estimate a unique sequence of underlying states. In the next two sections a neural network analogue of the stable linear system is introduced which will be used within the extended Kalman filter. If the eigenvalues are greater than one, the Kalman filter may still estimate a unique sequence of underlying states, but this must be evaluated on a system by system basis; see for example the work by Harvey on the random trend model.
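For a concrete system these conditions are easy to check numerically. The sketch below tests the rank condition on the observability matrix and the eigenvalue condition on Φ; the example matrices are illustrative, corresponding to an AR(2) model in phase canonical form.

```python
import numpy as np

def is_observable(Phi, H):
    """Rank test on the observability matrix [H; H Phi; ...; H Phi^(m-1)]."""
    m = Phi.shape[0]
    O = np.vstack([H @ np.linalg.matrix_power(Phi, k) for k in range(m)])
    return bool(np.linalg.matrix_rank(O) == m)

def is_stable(Phi):
    """All eigenvalues of Phi strictly inside the unit circle."""
    return bool(np.all(np.abs(np.linalg.eigvals(Phi)) < 1))

# AR(2) model x_t = 0.5 x_{t-1} + 0.3 x_{t-2} + noise, in phase canonical form
Phi = np.array([[0.5, 0.3],
                [1.0, 0.0]])
H = np.array([[1.0, 0.0]])
```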

Extended Kalman Filter
The Kalman filter can be extended to filter nonlinear state space models. These models are not generally conditionally Gaussian, and so an approximate filter is used. The state space model's observation function h_t(x_t) and state update function f_t(x_{t−1}) are no longer linear functions of the state vector. Using a Taylor expansion of these nonlinear functions around the predicted state vector and the filtered state vector, we have

$$h_t(x_t) \approx h_t(\hat{x}_{t|t-1}) + \frac{\partial h_t}{\partial x}\Big|_{\hat{x}_{t|t-1}} (x_t - \hat{x}_{t|t-1}),$$
$$f_t(x_{t-1}) \approx f_t(\hat{x}_{t-1|t-1}) + \frac{\partial f_t}{\partial x}\Big|_{\hat{x}_{t-1|t-1}} (x_{t-1} - \hat{x}_{t-1|t-1}).$$

The extended Kalman filter (Jazwinski 1970) is produced by modeling the linearized state space model using a modified Kalman filter. The observation and state update equations are approximated by these first-order terms, and the corresponding Kalman prediction equation and state update equation (in prediction error correction format) are

$$\hat{x}_{t|t-1} = f_{t-1}(\hat{x}_{t-1|t-1}),$$
$$\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t \left( y_t - h_t(\hat{x}_{t|t-1}) \right).$$

The quality of the approximation depends on the smoothness of the nonlinearity, since the extended Kalman filter is only a first order approximation of E{x_t | Y_{t−1}}. The extended Kalman filter can be augmented as described in Eqs. (11)-(15) to deal with missing data and contemporaneous aggregation.
The functional form of h t (x t ) and f t (x t−1 ) are estimated using a neural network described in Sec. 5.
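A single EKF step can be sketched by linearizing the state map numerically. Here f stands in for the fitted neural network state update; the finite-difference Jacobian is an illustrative choice, since with a network the Jacobian would normally be computed analytically.

```python
import numpy as np

def ekf_step(x, P, y, f, H, Q, R, eps=1e-6):
    """EKF predict/correct with the state map f linearized by finite differences.
    f plays the role of the (possibly neural network) state update function."""
    m = len(x)
    fx = f(x)
    # Jacobian of f at the filtered state (the first order Taylor term)
    F = np.zeros((m, m))
    for j in range(m):
        dx = np.zeros(m)
        dx[j] = eps
        F[:, j] = (f(x + dx) - fx) / eps
    x_pred = fx
    P_pred = F @ P @ F.T + Q
    innov = y - H @ x_pred
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    return x_pred + K @ innov, (np.eye(m) - K @ H) @ P_pred
```

With a linear f the step reduces to the linear Kalman filter of Sec. 2.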

Sufficient conditions
The nonlinear analogue of the AR(p) model expressed in phase canonical form given by (3) and (4) is

$$x_{t,1} = f(x_{t-1,1}, \ldots, x_{t-1,p}) + \eta_t, \qquad (27)$$
$$x_{t,i} = x_{t-1,i-1}, \quad i = 2, \ldots, p, \qquad (28)$$
$$y_t = x_{t,1} + \varepsilon_t, \qquad (29)$$

with the states stacked as in the linear time series model of (3) and (4).
The extended Kalman filter has a long history of working in practice, but only recently has theory produced bounds on the resulting error dynamics. Baras et al. (1988) and Song and Grizzle (1995) have shown bounds on the EKF error dynamics of deterministic systems in continuous and discrete time respectively. La Scala, Bitmead, and James (1995) have shown how the error dynamics of an EKF on a general nonlinear stochastic discrete-time system are bounded. The nonlinear system considered by La Scala et al. has a linear output map, which is also true of our system described in (27)-(29). As in the case of the linear Kalman filter, the results of La Scala et al. require that the nonlinear system be both observable and controllable.
The nonlinear observability Grammian is more complicated than its linear counterpart in (21) and must be evaluated upon the trajectory of interest (t_1 → t_2). The nonlinear observability Grammian as defined by La Scala et al. and applied to the system given by (27)-(29) is

$$G_O(t, t+M) = \sum_{k=t}^{t+M} S(k, t)'\, H' H\, S(k, t), \qquad (30)$$

where S(k, t) is the state transition matrix of the linearized system, S(k, t) = F_y(k−1) · · · F_y(t), and F_y(t) = ∂f/∂x(y(t)). The nonlinear controllability Grammian as defined by La Scala et al. and applied to the system given by (27)-(29) is

$$G_C(t, t+M) = \sum_{k=t}^{t+M} S(t+M, k)\, \Gamma \Gamma'\, S(t+M, k)'. \qquad (31)$$

As in the linear case of Sec. 2.3, the requirements that the nonlinear observability and controllability Grammians given by (30) and (31) are positive definite are identical with the appropriate choices of t and M. If the Grammians of (30) and (31) are both positive definite and finite, then the system is said to be controllable (observable).
Because of the similarity between the observability and controllability criteria, only the observability criterion is examined closely. If the observability Grammian is positive definite and finite for some value of M, the system is observable. For the nonlinear auto-regressive system given by (27)-(29), the linearized transition over Δ steps is

$$S(t+\Delta, t) = F_y(t+\Delta-1) \cdots F_y(t). \qquad (32)$$

The observability Grammian (30) can be shown to be finite if S(t+Δ, t) converges exponentially to zero, which holds if all the eigenvalues of F_y(k) are inside the unit circle for all possible values of y. For any fixed choice of y, Eq. (32) is equivalent to an AR(p) updating equation in state space form; this equivalence is easily understood because F_y(k) corresponds to the linearization of the nonlinear AR(p) model given by (27)-(29). If the corresponding AR(p) model converges exponentially to zero for all values of y, then S(t+Δ, t) will converge exponentially to zero also for this y. An AR(p) time series model given by $w_t = \sum_{i=1}^{p} a_i w_{t-i} + e_t$ will converge exponentially to zero if all of the roots of the polynomial

$$g(z) = 1 - \sum_{i=1}^{p} a_i z^{-i} \qquad (33)$$

are inside the unit circle; see for example Roberts and Mullis (1987). Thus the nonlinear autoregressive system S(t+Δ, t) will converge exponentially to zero provided all the roots of the polynomial given in (33) are inside the unit circle. If, in addition, ∂f(y)/∂x_i is non-zero, G_O is finite positive definite and there exists an optimal choice of states which can be estimated with the EKF. In Sec. 4, conditions are discussed which guarantee the observability of a neural network system. The notions of observability and controllability are not unique to Kalman filtering; both are used extensively within control theory. For information relating neural networks and control theory, see for example Levin and Narendra (1993) and Levin and Narendra (1996).
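The root condition on (33) is straightforward to check numerically; a small sketch:

```python
import numpy as np

def ar_decays(a):
    """True if the AR(p) model w_t = sum_i a_i w_{t-i} + e_t converges
    exponentially to zero, i.e. all roots of z^p - a_1 z^(p-1) - ... - a_p
    (the roots of g(z) in the z-plane) lie strictly inside the unit circle."""
    a = np.asarray(a, dtype=float)
    char_poly = np.concatenate(([1.0], -a))
    return bool(np.all(np.abs(np.roots(char_poly)) < 1))
```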

Constrained Neural Networks
Often data generating processes have symmetries or constraints which can be exploited during the estimation of a neural network model.The problem is how to constrain a neural network to better exploit these symmetries.One of the key strengths of neural networks is the ability to approximate any function to any desired level of accuracy whether the function has or lacks symmetries, see for example Cybenko (1989).In this section a constraint is examined which is both natural to the financial problems investigated and desirable from a Kalman filtering perspective.
A neural network embedded within a Kalman filter is used in an iterative fashion in the event of missing data. As mentioned earlier, iterated linear models will either diverge if they are unstable or go to zero if they are stable. In addition, the Kalman filter will converge on a unique state sequence if the system is stable; if it is unstable, the Kalman filter may or may not converge on a unique state sequence. It is thus natural when dealing with linear systems to constrain the system to have stable behavior. Constraining a linear model to be stable is as simple as ensuring the eigenvalues of the state transition matrix lie within the unit circle.
For neural networks the dynamics of the system can exhibit complicated behavior: the state trajectories can be cyclical, chaotic, or converge to one of multiple fixed point attractors. For the problems we are interested in, a limiting point of zero is desirable. From the perspective of the Kalman filter, such a system is stable. From the view of the financial example presented later, a limiting point of zero corresponds to future price increments being unpredictable beyond a certain distance into the future.
There are two ways of imposing symmetry constraints on neural networks: a hard constraint, obtained by directly embedding the symmetry in the weight space, and a soft constraint, which pushes a neural network towards a symmetric solution without enforcing the final solution to be symmetric.
The soft constraint of biasing towards symmetric solutions can be viewed from two perspectives: providing hints and regularization. Abu-Mostafa (1990, 1993, 1995) showed, among other things, that it is possible to augment a data set with examples obtained by generating new data under the symmetry constraint. The neural network is then trained on the augmented training set and is likely to have a solution that is closer to the desired symmetric solution than would otherwise be the case. The soft constraint can come in the form of extra data generated from a constraint or as a constraint within the learning algorithm itself. Alternatively, neural networks which drift from the symmetric solution can be penalized by a regularization term; see for example the tangent prop of Simard, Victorri, Le Cun, and Denker (1992). Both the hint and the regularization approach to soft constraints were shown to be related by Leen (1995).
The alternative of hard constraints was first proposed by Giles et al. (1990), in which a neural network was made invariant under geometric transformations of the input space. We propose to incorporate a hard constraint in which the neural network is forced to have a fixed point at the origin, producing a forecast of zero when the past observations y_{t−i}, for i = 1, . . ., p, are all equal to zero.
The imposition of a fixed point in the neural network will have the largest effect when the predictor is being iterated. The iterated predictor is, as will be described in Sec. 4.1, a recurrent neural network. A fixed point need not alter the estimated neural network significantly; Jin et al. (1994) show there must be at least one fixed point in a recurrent network anyway. As the example in Sec. 5 demonstrates, an unconstrained iterated predictor will often converge to a fixed point in any case; the problem is that the resulting fixed point is often undesirable.
Other possible dangers exist with recurrent neural networks. The existence of a fixed point does not preclude chaos, and the fixed point may be unstable or only locally stable. Recurrent neural networks can often exhibit oscillations or more complicated types of stable or chaotic patterns; see for example Marcus and Westervelt (1989), Pearlmutter (1989), Pineda (1989), Cohen (1992), Blum and Wang (1992), and many others. Under some circumstances this complicated behavior in an iterated predictor could be desirable; however, it is the view of this paper that any advantages are outweighed by the dangers of using a Kalman filter on a poorly understood, non-stable system.
A fixed point at ŷ_t = 0 is achieved by augmenting a neural network with a constrained output bias parameter. For a network consisting of H hidden units with activation function f, the functional form of the constrained network is given by

$$\hat{y}_t = \sum_{i=1}^{H} W_i\, f\Big( \sum_{j=1}^{p} w_{ij}\, y_{t-j} + \theta_i \Big) - \sum_{i=1}^{H} W_i\, f(\theta_i), \qquad (34)$$

where the parameters of the network W_i, w_{ij}, and θ_i represent the output weights, input weights, and input biases. The first term of (34) describes a standard feedforward neural network and the second term represents the "hard wired" constraint. Conditions which ensure that the fixed point is a stable attractor will be discussed in Sec. 4.2. The fixed point will only be guaranteed to be a local attractor; outside of the local area surrounding the origin, the behavior of the neural network model will be determined by the data used for training.
The estimation of the parameters can be achieved by using a slightly augmented back-propagation algorithm. For a mean squared error cost function, the fixed point constraint leads to a slightly more complicated learning rule based on the following derivatives,

$$\frac{\partial \hat{y}_t}{\partial W_i} = f\Big( \sum_{j=1}^{p} w_{ij}\, y_{t-j} + \theta_i \Big) - f(\theta_i),$$
$$\frac{\partial \hat{y}_t}{\partial \theta_i} = W_i \Big[ f'\Big( \sum_{j=1}^{p} w_{ij}\, y_{t-j} + \theta_i \Big) - f'(\theta_i) \Big],$$

with the derivative of the prediction with respect to the input weights, ∂ŷ_t/∂w_ik, being unaffected by the constraint on the neural network. The neural network training algorithm assumes that the initial weight matrix satisfies the fixed point constraint.
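A forward pass of the constrained network can be sketched directly; the hard-wired second term guarantees the fixed point at the origin for any weight values. The weights below are illustrative random values, not estimated ones.

```python
import numpy as np

def constrained_net(y_past, W, w, theta, f=np.tanh):
    """Constrained feedforward network: the output bias is hard-wired to
    -sum_i W_i f(theta_i), so a zero input window yields a zero forecast."""
    h = f(w @ y_past + theta)          # hidden unit activations
    return float(W @ h - W @ f(theta)) # constrained output

rng = np.random.default_rng(0)
H_units, p = 3, 2
W = rng.normal(size=H_units)          # output weights W_i
w = rng.normal(size=(H_units, p))     # input weights w_ij
theta = rng.normal(size=H_units)      # input biases theta_i
```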

Iterated behavior of neural network
The neural network in (34) is a simple feedforward neural network with no feedback connections. As the neural network stops receiving new data and begins running in an iterated manner, its predictions will be based on past neural network predictions: recurrent connections are implicitly added when the neural network runs iteratively within the extended Kalman filtering system. After p time steps have been processed without receiving any data, the system is no longer receiving data from outside the system. The iterated system is as depicted in Fig. 1. The recurrent activations of the neural units can be divided into three sets: the activations of the hidden units, s_i for 1 ≤ i ≤ H; the activation of the output unit, s_{H+1}; and the activations of the units storing past predictions, s_{H+j} for 1 ≤ j ≤ p. The dynamics of the iterated system are

$$s_i(t+1) = f\Big( \sum_{j=1}^{p} w_{ij}\, s_{H+j}(t) + \theta_i \Big), \quad 1 \le i \le H, \qquad (38)$$
$$s_{H+1}(t+1) = \sum_{i=1}^{H} W_i\, s_i(t+1) - \sum_{i=1}^{H} W_i\, f(\theta_i), \qquad (39)$$
$$s_{H+j}(t+1) = s_{H+j-1}(t), \quad 2 \le j \le p, \qquad (40)$$

where the parameters W and w are identical to those of the system given by (34). The recurrent neural network differs from the feedforward neural network in the addition of p neurons, s_{H+1}(t), . . ., s_{H+p}(t), which enable the previous predictions to be stored in the system via (39) and (40) and used as a basis for further predictions. The resulting recurrent connections are much more limited than those considered in most recurrent network studies.
Whether the behavior of the recurrent network given by (38)-(40) is stable, oscillatory, or chaotic is determined by the weights. The next section will state under what conditions the neural network will converge to a fixed point.
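The iteration itself can be sketched as a shift register that feeds each prediction back into the input window. The toy one-step predictor used in the test is hypothetical; it simply illustrates convergence to the fixed point when the map is a contraction.

```python
import numpy as np

def iterate_predictor(net, y_window, steps):
    """Run the one-step predictor on its own outputs, as required during
    consecutive missing observations."""
    window = np.array(y_window, dtype=float)
    preds = []
    for _ in range(steps):
        y_hat = net(window)
        preds.append(float(y_hat))
        # Store the new prediction and shift the older inputs
        window = np.concatenate(([y_hat], window[:-1]))
    return preds
```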

Absolute stability of discrete recurrent neural networks
The strongest results on the stability of recurrent networks are for the continuous time versions. The popular Hopfield network (Hopfield 1984) is guaranteed to be stable because the weight matrix is constrained to be symmetric. The convergence of non-symmetric recurrent neural networks in continuous time has been investigated by Kelly (1990) and others. Matsuoka (1992) in particular has shown how the weights of a non-symmetric recurrent neural network can be extremely large while convergence is still guaranteed in some cases.
The guarantees for convergence of discrete time recurrent networks are not as strong as those for continuous time neural networks. Many weight configurations which lead to convergence in continuous time do not produce convergence for discrete time recurrent networks.
For a discrete time recurrent network to be absolutely stable, it must converge to a fixed point independent of the initial starting conditions. The results of Jin et al. (1994) for fully recurrent networks will be reviewed and then applied to the special case of the constrained network. Jin et al. (1994) showed that a fully recurrent network of the form

$$x_i(t+1) = f\Big( \sum_{j=1}^{N} \omega_{ij}\, x_j(t) + \theta_i \Big), \quad i = 1, \ldots, N, \qquad (41)$$

will converge to a fixed point if all of the eigenvalues of the matrix W of network weights ω_ij fall within the unit circle. Jin et al. then generalize the results by noting that the eigenvalues are not changed by the transformation P^{−1}WP, where P is a non-singular matrix of equivalent dimensions. Considering the transformation P = diag(p_1, . . ., p_N), where the p_i are all positive, leads to the following guarantee of stability,

$$\big( R_{p_i} \big)^{\gamma} \big( C_{p_i} \big)^{1-\gamma} < 1 \quad \text{for all } i,$$

with c_i denoting the maximum slope of the ith nonlinear neuron transfer function, γ ∈ [0, 1], and the weighted row and column sums R_{p_i} and C_{p_i} given by

$$R_{p_i} = \frac{1}{p_i} \sum_{j=1}^{N} p_j\, c_j\, |\omega_{ij}|, \qquad C_{p_i} = \frac{c_i}{p_i} \sum_{j=1}^{N} p_j\, |\omega_{ji}|,$$

where (p_1, . . ., p_N) are all positive.

The general case
Stability guarantees using (41) are typically quoted for two cases, γ = 0 and γ = 1. The fully recurrent network given by (41) is more general than the iterated neural network given in (38)-(40); for the iterated neural network, the stability guarantee reduces to the following two cases. With the choice γ = 0, the condition is equivalent to requiring that the weighted sum of the absolute values of the weights leaving each neuron, multiplied by the maximum slope of the neuron nonlinearity, be less than one. With the choice γ = 1, the condition is equivalent to requiring that the weighted sum of the absolute values of the weights entering each neuron, multiplied by the maximum slope of the neuron nonlinearity, be less than one. Any choice of positive (p_1, . . ., p_N) is allowable; two natural choices for the iterated neural network are now listed.
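The two norm conditions can be checked mechanically for a given weight matrix. This sketch assumes a common maximum slope c for all neurons and unit scaling p_i = 1; note that these are sufficient conditions only, so a False result does not prove instability.

```python
import numpy as np

def norm_condition(W_mat, c, gamma=0):
    """gamma = 0: largest column sum (weights leaving each neuron) times c < 1.
    gamma = 1: largest row sum (weights entering each neuron) times c < 1."""
    A = c * np.abs(np.asarray(W_mat, dtype=float))
    if gamma == 0:
        return bool(A.sum(axis=0).max() < 1)
    return bool(A.sum(axis=1).max() < 1)
```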

Scale invariance
A transformation which does not alter the dynamics of the recurrent network is obtained by taking advantage of its autoregressive structure. The output of the network can be scaled by a constant k as long as the weights connected to past predictions are divided by the same constant; this leads to a network with equivalent dynamics and weights given by W_i' = kW_i. The guarantee for stability given by (51) and (52) can easily be optimized by choosing k^{-1} = Σ^H_{j=1} |W_j| + ε, with ε vanishingly small and positive, and checking that (51) still holds.
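The choice of k can be sketched numerically; the feedback weights W_j below are hypothetical:

```python
import numpy as np

# Hypothetical weights W_j connecting past predictions back into the net.
W_feedback = np.array([0.4, -0.3, 0.2, 0.1])

# Choose k^{-1} = sum_j |W_j| + eps, with eps small and positive, so the
# rescaled weights k*W_j have an absolute sum just under one.
eps = 1e-9
k = 1.0 / (np.abs(W_feedback).sum() + eps)
W_rescaled = k * W_feedback
abs_sum = np.abs(W_rescaled).sum()  # just below 1, as tight as possible
```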

The relationship between absolute stability and observability
In Sec. 3 the extended Kalman filter was said to be observable if the eigenvalues of the transition matrix lie inside the unit circle. The stability results for the constrained neural network presented in Sec. 4.1 are equivalent to guaranteeing that the roots of the polynomial g(z) given in (33) lie inside the unit circle.
In addition, if (32) is invertible, a constrained neural network will be observable.

Stability of the constrained neural network
When the constrained neural network given in (34) is iterated, it has the same form as the iterated neural network given in (38)-(40), with the exception of a bias term. Bias terms have no effect on the stability of an iterated constrained neural network, so the above stability arguments also apply to the constrained neural network.

Application to "Tick" Data
Financial institutions are continually striving for a competitive edge. With the increase in computer power and the advent of computer trading, the financial markets have dramatically increased in speed.
Technology is creating new trading horizons which give access to possible inefficiencies of the market. The volume of trading has increased hugely, and the price series (tick data) of these trades offers a rich source of information about the dynamics of the markets and the behavior of their practitioners. A major problem for financial institutions that trade at high frequency is estimating the true value of the commodity being traded at the instant of a trade. This problem is faced by both sides of the transaction, the market maker and the investor. The uncertainty arises from several sources of additive noise. The state noise dominates, but bid-ask bounce, price quantization, and typographic errors result in observation noise. In addition to these uncertainties, the estimate of the value has to be based on information that may be several time periods old. The methodology we present addresses these problems and produces an estimate of the "true mid-price" irrespective of the data's erratic arrival.
The Dollar/Deutsche Mark exchange rate was modeled. The data was obtained from the Reuters FXFX pages and covers the period March 1993-April 1995. As the data is composed only of quotes, the possible sources of noise are amplified. Compared to brokers' data or actual transaction data, the quality of quote data is poor: the price spreads (bid/ask) are larger, the data is prone to typing errors, and some of the quotes are simply advertisements by the market makers. In general, quotation data has more process noise and more sources of measurement noise. However, quotation data is readily available and provides the most liquid and accessible view of the market.
To eliminate some of the bid-ask noise in the series, the changes in the mid-price were modeled. The filtered states represent the estimated "true" mid-prices. For an NAR(p) model the one-step-ahead prediction is Δx̂_{t+1} = g(Δx_t, . . ., Δx_{t-p+1}), where g is the predicted change in state based on the changes in previous time periods. The function g was estimated by both a constrained feedforward neural network and an unconstrained neural network. For the results shown here a simple NAR(1) model was estimated, using a parsimonious specification: a four-hidden-unit feedforward network with sigmoidal activation functions.
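One way to impose a fixed point at the origin on such a network is to subtract its zero-input response. The sketch below illustrates this construction for a four-hidden-unit NAR(1) network with randomly drawn (hypothetical) parameters; it is not necessarily the paper's exact parameterization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters of a 4-hidden-unit NAR(1) network.
rng = np.random.default_rng(0)
w_in = rng.normal(size=4)   # input-to-hidden weights
b = rng.normal(size=4)      # hidden biases
w_out = rng.normal(size=4)  # hidden-to-output weights

def g_unconstrained(dx):
    # Standard feedforward prediction of the next mid-price change.
    return w_out @ sigmoid(w_in * dx + b)

def g_constrained(dx):
    # Subtracting the zero-input response pins g(0) = 0 exactly, so the
    # origin is a fixed point of the iterated forecast.
    return w_out @ (sigmoid(w_in * dx + b) - sigmoid(b))
```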
An expectation maximization (EM) algorithm is employed at the center of a robust estimation procedure based on filtered data (for full details see Bolland and Connor 1996). The EM algorithm, see Dempster, Laird, and Rubin (1977), is the standard approach when estimating model parameters with missing data. The EM algorithm has been used in the neural network community before, see for example Jordan and Jacobs (1993) or Connor, Martin, and Atlas (1994). During the estimation step, the missing data, namely the x_t, ε_t, and η_t of (1) and (2), must be estimated. This amounts to estimating parameters of the state update function f and the noise variance matrices Q_t and R_t. With the estimated missing data assumed to be true, the parameters are then chosen by maximizing the likelihood. This procedure is iterative, with new parameter estimates giving rise to new estimates of the missing data, which in turn give rise to newer parameter estimates. The iterative estimation procedure was initialized by constructing a contiguous data set (no arrival noise) and estimating a linear auto-regressive model. The variances of the disturbance terms are non-stationary; to remove some of this non-stationarity, the intra-day seasonal pattern of the variances was estimated (Bolland and Connor 1996). The parameters of the state update function were assumed to be stationary across the length of the data set. Table 1 gives the performance of the two models for non-iterated forecasts. The constraints on the network are not detrimental to the overall performance, with the percentage variance explained (r-squared) and the correlation being very similar.
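The filter-then-maximize alternation can be illustrated on a toy scalar linear state-space model, x_t = a·x_{t-1} + ε_t, y_t = x_t + η_t, with the noise variances q and r treated as known. The paper's actual procedure is far richer (neural state update, robust steps, seasonal variances); this sketch only shows the EM loop structure:

```python
import numpy as np

def kalman_filter(y, a, q, r):
    # Scalar Kalman filter: predict, then update with each observation.
    x_hat = np.zeros(len(y))
    x_prev, p = 0.0, 1.0
    for t in range(len(y)):
        x_pred = a * x_prev           # state prediction
        p_pred = a * a * p + q        # prediction variance
        gain = p_pred / (p_pred + r)  # Kalman gain
        x_hat[t] = x_pred + gain * (y[t] - x_pred)
        p = (1.0 - gain) * p_pred
        x_prev = x_hat[t]
    return x_hat

def em_ar1(y, q=0.01, r=0.0025, n_iter=20):
    a = 0.0  # the paper initializes from a linear AR fit instead
    for _ in range(n_iter):
        x = kalman_filter(y, a, q, r)                 # E-step: states
        a = (x[1:] @ x[:-1]) / (x[:-1] @ x[:-1])      # M-step: AR coeff
    return a

# Simulate data with a = 0.8 and check the loop roughly recovers it.
rng = np.random.default_rng(0)
x_true = np.zeros(2000)
for t in range(1, 2000):
    x_true[t] = 0.8 * x_true[t - 1] + rng.normal(scale=0.1)
y = x_true + rng.normal(scale=0.05, size=2000)
a_est = em_ar1(y)
```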
Figure 2 shows the fitted function of a simple NAR(1) model for the constrained neural network and the unconstrained neural network.The qualitative difference in the models estimated function are only slight.
Figure 3 shows the estimated function around the origin. At the origin the constrained network exhibits a bias relative to the data, as it has been restrained from learning the mean of the estimation set. Although this bias is very small (for linear regression the bias is 5.12 × 10^−7 with a t-statistic of 0.872), its effect is large because it is compounded by iterating the forecast.
The filter produces estimated states (shown in Fig. 4) which can be viewed as the "true mid-prices"; the noise due to market frictions (bid-ask bounce, price quantization, etc.) has been estimated and filtered out. The iterated forecasts reach the stable point after only a small number of iterations (approximately 5).
Figure 5 shows a close-up of the iterated forecasts of the two networks. For the simple NAR(1), the value of the stable point is the final gradient of the iterated forecasts. The stable point of the constrained neural network is zero, while the stable point of the unconstrained network is 1.57 × 10^−6, the result of a small bias in the model. When the forecast is iterated this bias accumulates, so the unconstrained network's predictions trend. For the constrained network the iterated forecasts soon reach the stable point of zero, reflecting our prior belief in the long-term unpredictability of the series. The mean squared error (MSE) and the median absolute deviation (MAD) of the constrained and unconstrained networks are given in Table 2 and shown in Fig. 6. As the forecast is iterated, the MSE for the unconstrained network grows rapidly because of its trending forecast.
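The drift mechanism is simple arithmetic: a constant bias in the predicted change accumulates linearly into the iterated price path. The bias value below matches the unconstrained network's stable point quoted above; the starting price is hypothetical:

```python
# A fixed-point bias in the predicted *change* becomes a linear trend
# when the forecast is iterated.
bias = 1.57e-6    # stable point of the unconstrained network
n_steps = 40
price0 = 1.5000   # hypothetical starting mid-price

# Each iteration adds the (biased) predicted change to the price.
price_path = [price0 + t * bias for t in range(n_steps + 1)]
drift = price_path[-1] - price0  # accumulated trend after 40 steps

# A constrained network (stable point exactly zero) accumulates no drift.
```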
It is clear that constraining the neural network improves performance. The MSE for the constrained neural network remains relatively constant with the prediction horizon. The accuracy of an iterated prediction would be expected to decrease as the forecast is iterated, yet from Fig. 6 it is clear that the MSE does not increase with the number of iterations. However, forecasts are iterated for 40 time steps only in periods of very low trading activity, and in such periods the variance of the time series is also low. The errors in these periods are therefore small even though the time between observations can be large.

Conclusion and Discussion
Using neural networks within an extended Kalman filter is desirable because of the measurement and arrival noise associated with foreign exchange tick data. The desirability of using a stable system within a Kalman filter motivated the development of a "stable neural network" for use within an extended Kalman filter. The stable network was obtained by constraining the neural network to have a fixed point at the origin: zero input gives zero output. This fixed point also reflects our belief that price increments beyond a certain horizon are unknowable, so that a predicted price increment of zero is best (a random walk). The constrained neural network is tailored to foreign exchange modeling; for other problems a fixed point at zero may be undesirable.
Under normal operating conditions the behavior of the constrained neural network within the extended Kalman filter is roughly the same as that of the unconstrained network. In the presence of missing data, however, the iterated predictions of the constrained neural network far outperformed those of the unconstrained network.

Table 2 .
Test set performance.