Stochastic analysis of neural network modeling and identification of nonlinear memoryless MIMO systems
© Ibnkahla; licensee Springer. 2012
Received: 6 December 2011
Accepted: 13 July 2012
Published: 21 August 2012
Neural network (NN) approaches have been widely applied for modeling and identification of nonlinear multiple-input multiple-output (MIMO) systems. This paper proposes a stochastic analysis of a class of these NN algorithms. The class of MIMO systems considered in this paper is composed of a set of single-input nonlinearities followed by a linear combiner. The NN model consists of a set of single-input memoryless NN blocks followed by a linear combiner. A gradient descent algorithm is used for the learning process. Here we give analytical expressions for the mean squared error (MSE), explore the stationary points of the algorithm, evaluate the misadjustment error due to weight fluctuations, and derive recursions for the mean weight transient behavior during the learning process. The paper shows that in the case of independent inputs, the adaptive linear combiner identifies the linear combining matrix of the MIMO system (to within a scaling diagonal matrix) and that each NN block identifies the corresponding unknown nonlinearity to within a scale factor. The paper also investigates the particular case of linear identification of the nonlinear MIMO system. It is shown in this case that, for independent inputs, the adaptive linear combiner identifies a scaled version of the unknown linear combining matrix. The paper is supported with computer simulations which confirm the theoretical results.
Keywords: Nonlinear system identification; Neural networks; Gradient descent; Statistical analysis
Neural network approaches have been used extensively in the past few years for nonlinear MIMO system modeling, identification and control, where they have shown very good performance compared to classical techniques [2–6].
If these NN approaches are to be used in real systems, it is important for the algorithm designer and the user to understand their learning behavior and performance capabilities. Several authors have analyzed NN algorithms during the last two decades, which has considerably helped the neural network community to better understand the mechanisms of neural networks [1, 7–15]. For example, one study considered a simple structure consisting of two inputs and a single neuron. Another studied a memoryless single-input single-output (SISO) system identification model for the single-neuron case. A stochastic analysis has also been proposed for gradient adaptive identification of nonlinear Wiener systems composed of a linear filter followed by a zero-memory nonlinearity; the model was composed of a linear adaptive filter followed by an adaptive parameterized version of the nonlinearity. This study was later generalized to the analysis of stochastic gradient tracking of time-varying polynomial Wiener systems. NN identification of nonlinear SISO Wiener systems with memory has been analyzed for the case where the adaptive nonlinearity is a memoryless NN with an arbitrary number of neurons. The case of a nonlinear SISO Wiener-Hammerstein system (i.e., an adaptive filter followed by an adaptive zero-memory NN followed by an adaptive filter) has also been analyzed.
The purpose of this paper is to provide a stochastic analysis of NN modeling of this class of MIMO systems. The paper provides a general methodology that may be used to solve other problems in stochastic NN learning analysis. The methodology consists of splitting the study into simple structures before studying the complete structure. As a first step, we analyze a simple linear adaptive MIMO scheme (consisting of an adaptive matrix) that identifies the nonlinear MIMO system (i.e., the problem of linear identification of a nonlinear MIMO system). Then we analyze a nonlinear adaptive system in which the nonlinearities are assumed to be known and frozen during the learning process; only the linear combiner is made adaptive. Finally, the complete adaptive scheme is analyzed, taking into account the insights given by the analysis of the simpler structures. In our analytical approach, we derive the general formulas and recursions, which we then apply to an illustrative case study.
The paper is organized as follows. The problem statement is given in Section 2. The study of the simple structures is detailed in Section 3. Section 4 presents the analysis for the complete structure. Simulation results and illustrations are given in Section 5. Finally, conclusions and future work are given in Section 6.
Nonlinear MIMO system
The class of nonlinear MIMO systems discussed in this paper is presented in Figure 1. Each input x_i(n) (i = 1,…,M) is nonlinearly transformed by a memoryless nonlinearity g_i(.). The outputs of these nonlinearities are then linearly combined by an L × M matrix H = [h_ji] (assumed in this paper to be constant). Matrix H is defined by the unknown system to be identified. For example, in wireless MIMO communication systems, H is the propagation matrix representing the channel between M transmitting antennas and L receiving antennas.
The j-th output of the system (j = 1,…,L) is y_j(n) = Σ_{i=1}^{M} h_ji g_i(x_i(n)) + N_j(n), where N_j is a white Gaussian noise with variance σ_0^2.
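The system described above (one memoryless nonlinearity per input, a combining matrix H, and additive white Gaussian noise on each output) can be simulated with a few lines of code. The following is a minimal sketch, not the author's implementation: the function name `mimo_system_output`, the numerical values of H, and the two placeholder nonlinearities are assumptions chosen only for illustration.

```python
import random

def mimo_system_output(x, H, nonlinearities, sigma0, rng):
    """Return the L noisy outputs for one input vector x of length M."""
    # Each input goes through its own memoryless nonlinearity g_i.
    g = [g_i(x_i) for g_i, x_i in zip(nonlinearities, x)]
    # Each output is one row of H applied to g, plus white Gaussian noise
    # with standard deviation sigma0.
    return [sum(h_ji * g_i for h_ji, g_i in zip(row, g)) + rng.gauss(0.0, sigma0)
            for row in H]

rng = random.Random(0)
H = [[1.0, 0.5], [-0.3, 0.8]]               # hypothetical 2 x 2 combining matrix
nonlinearities = [lambda v: v ** 3, abs]    # placeholder nonlinearities (assumption)
y = mimo_system_output([0.5, -2.0], H, nonlinearities, sigma0=0.0, rng=rng)
print(y)
```

Setting `sigma0=0.0` removes the noise, which makes it easy to check the combiner by hand before running noisy experiments.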
Neural Network identification structure and algorithm
Weights w_jk will be represented by an L × M matrix: W = [w_jk].
where μ is a small positive constant and represents the derivative:
After the derivation of the general formulas, it is important to apply them to special cases in order to obtain closed-form expressions of the different recursions. We have chosen a case study that illustrates our results well. In this case study, the inputs x_i(n) are assumed to be uncorrelated zero-mean Gaussian variables with variance σ_xi^2. The NN activation function is taken as the erf function. The unknown nonlinear transfer functions are taken from a parameterized family of nonlinear functions with positive constants α_i and β_i. These nonlinear functions are reasonable models for amplitude conversions of nonlinear high power amplifiers (HPA) used in digital communications [12, 25, 26]. Note that other nonlinear functions may be considered; however, explicit closed-form solutions of the different derivations may not be possible.
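To make the learning rule concrete, the sketch below performs one stochastic gradient step on the instantaneous cost 0.5 Σ_j e_j² for a model made of single-input erf blocks followed by a linear combiner W. It is a hedged illustration only: the block parameterization Σ_i c_i erf(a_i x + b_i), the names `a`, `b`, `c`, `sgd_step`, and the step-size convention (a plain factor μ in front of the gradient) are assumptions, since the exact update equations are not reproduced here.

```python
import math

def nn_block(x, a, b, c):
    """One single-input NN block: sum_i c[i] * erf(a[i] * x + b[i])."""
    return sum(ci * math.erf(ai * x + bi) for ai, bi, ci in zip(a, b, c))

def erf_prime(u):
    # Derivative of erf: 2 / sqrt(pi) * exp(-u^2).
    return 2.0 / math.sqrt(math.pi) * math.exp(-u * u)

def sgd_step(x, d, W, blocks, mu):
    """One in-place gradient step; returns the error vector before the update.

    x: M inputs, d: L desired outputs, W: L x M list of lists,
    blocks: list of M dicts with keys 'a', 'b', 'c' (lists of equal length).
    """
    s = [nn_block(xk, blk['a'], blk['b'], blk['c']) for xk, blk in zip(x, blocks)]
    e = [dj - sum(w * sk for w, sk in zip(row, s)) for dj, row in zip(d, W)]
    for k, (xk, blk) in enumerate(zip(x, blocks)):
        # Error back-propagated through column k of W (W is not yet updated).
        delta = sum(e[j] * W[j][k] for j in range(len(W)))
        for i in range(len(blk['a'])):
            u = blk['a'][i] * xk + blk['b'][i]
            slope = blk['c'][i] * erf_prime(u)
            blk['c'][i] += mu * delta * math.erf(u)
            blk['a'][i] += mu * delta * slope * xk
            blk['b'][i] += mu * delta * slope
    for j in range(len(W)):
        for k in range(len(s)):
            W[j][k] += mu * e[j] * s[k]
    return e

# Hypothetical initial weights and one training pair.
W = [[0.5, 0.1], [0.2, 0.4]]
blocks = [{'a': [0.5, 1.0], 'b': [0.0, 0.1], 'c': [0.3, 0.2]},
          {'a': [0.8, 0.4], 'b': [0.0, -0.1], 'c': [0.1, 0.5]}]
x, d = [0.7, -1.2], [0.4, -0.3]
e_before = sgd_step(x, d, W, blocks, mu=0.01)
e_after = sgd_step(x, d, W, blocks, mu=0.0)   # mu = 0 only re-evaluates the error
print(sum(v * v for v in e_before), sum(v * v for v in e_after))
```

Calling the function a second time with `mu=0.0` re-evaluates the error on the same sample without changing any weight, which is a simple way to confirm that a small step reduces the instantaneous squared error.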
Study of simplified structures: Linear adaptation
The adaptive system is composed of an adaptive linear combiner W (Section 3.1).
The adaptive system is composed of W and scaled versions of the unknown nonlinearities (Section 3.2).
Linear adaptive system
Mean weight behavior and Wiener solution
Since matrix W is linear, it will not be able to identify the nonlinear blocks. However, we will see that it is able to identify matrix H to within a diagonal scaling matrix if the inputs are zero-mean and independent.
If μ is sufficiently small, the first term converges to 0 and the second term converges to.
where λmax is the largest eigenvalue of the covariance matrix R_XX.
In this case, the linear adaptation allows the identification of matrix H to within a scaling matrix, which depends on the nonlinearities and the input signals. As expected, the scaling matrix reduces to the identity matrix if g_k(x_k) = x_k.
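The diagonal scaling can be checked numerically. For zero-mean independent inputs, the cross-moments E[g_k(x_k) x_i] vanish for i ≠ k, so the least-squares (Wiener) solution R_YX R_XX⁻¹ equals H with column k scaled by E[x_k g_k(x_k)] / E[x_k²]. The sketch below estimates both sides from samples. The nonlinearity α x / (1 + β x²) is an assumed example of an amplifier amplitude model, and the values of H are hypothetical; neither is taken from the paper.

```python
import random

def g(x, alpha, beta):
    # ASSUMED nonlinearity, used only for illustration: alpha * x / (1 + beta * x^2).
    return alpha * x / (1.0 + beta * x * x)

rng = random.Random(1)
H = [[1.0, 0.5], [-0.3, 0.8]]        # hypothetical combining matrix
params = [(1.0, 1.0), (1.0, 2.0)]    # (alpha_k, beta_k) for each input
N = 100000

Rxx = [[0.0, 0.0], [0.0, 0.0]]       # sample estimate of E[x x^T]
Ryx = [[0.0, 0.0], [0.0, 0.0]]       # sample estimate of E[y x^T]
xg = [0.0, 0.0]                      # sample estimate of E[x_k g_k(x_k)]
for _ in range(N):
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    gx = [g(x[k], *params[k]) for k in range(2)]
    y = [H[j][0] * gx[0] + H[j][1] * gx[1] for j in range(2)]   # noise-free outputs
    for k in range(2):
        xg[k] += x[k] * gx[k] / N
        for i in range(2):
            Rxx[k][i] += x[k] * x[i] / N
            Ryx[k][i] += y[k] * x[i] / N

# Wiener solution W0 = Ryx Rxx^{-1}, using the closed-form 2 x 2 inverse.
det = Rxx[0][0] * Rxx[1][1] - Rxx[0][1] * Rxx[1][0]
inv = [[Rxx[1][1] / det, -Rxx[0][1] / det],
       [-Rxx[1][0] / det, Rxx[0][0] / det]]
W0 = [[sum(Ryx[j][m] * inv[m][k] for m in range(2)) for k in range(2)]
      for j in range(2)]

# Predicted form: column k of H scaled by E[x_k g_k(x_k)] / E[x_k^2].
scale = [xg[k] / Rxx[k][k] for k in range(2)]
predicted = [[H[j][k] * scale[k] for k in range(2)] for j in range(2)]
print(W0, predicted)
```

The two matrices agree up to sampling error; with g_k(x_k) = x_k the scale factors would equal 1 and W0 would coincide with H.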
Application to the case study:
Transient MSE and Wiener MSE
It is clear from this equation that if the unknown functions are linear, then the Wiener MSE reduces to the noise power. The MSE is always larger than ζ0 because of the misadjustment error introduced by the weight fluctuations.
Derivation of the misadjustment:
Thus, as expected, if μ is sufficiently small, E(V(n)) converges to 0.
The additional terms are due to the nonlinearities and they should be calculated specifically for each nonlinearity.
Application to the case study:
Adaptive W, the nonlinearities are frozen and known with scale factors
Mean weight behavior and stationary points
The stability condition on μ is:
where λmax is the largest eigenvalue of the covariance matrix Ω² R_g(X)g(X).
Thus, if each nonlinear function g_k(.) is known to within a scaling factor η_k, then weights h_jk will be identified by w_jk (to within the inverse of the scaling factor).
Therefore the Wiener MSE is equal to the noise floor: there are no terms due to the nonlinearities. This is expected since the nonlinearities are known to within a scaling matrix Ω (we have seen that the scaling matrix is canceled by W0, since W0 = HΩ⁻¹).
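The relation W0 = HΩ⁻¹ can be illustrated with a short LMS run in which the nonlinearities are frozen and known up to scale factors η_k. The sketch below is noise-free so that the weights converge to the stationary point itself rather than fluctuating around it. The nonlinearity α x / (1 + β x²), the values of H, η_k and μ, and the update convention W ← W + μ e Zᵀ are assumptions made for illustration.

```python
import random

def g(x, alpha, beta):
    # ASSUMED nonlinearity, used only for illustration: alpha * x / (1 + beta * x^2).
    return alpha * x / (1.0 + beta * x * x)

rng = random.Random(2)
H = [[1.0, 0.5], [-0.3, 0.8]]        # hypothetical combining matrix
params = [(1.0, 1.0), (1.0, 2.0)]    # (alpha_k, beta_k) for each input
eta = [2.0, 0.5]                     # hypothetical scale factors (diagonal of Omega)
mu = 0.5                             # hypothetical step size
W = [[0.0, 0.0], [0.0, 0.0]]         # adaptive combiner, initialised at zero

for _ in range(20000):
    x = [rng.gauss(0.0, 1.0) for _ in range(2)]
    gx = [g(x[k], *params[k]) for k in range(2)]
    z = [eta[k] * gx[k] for k in range(2)]                # frozen, scaled nonlinearities
    d = [sum(H[j][k] * gx[k] for k in range(2)) for j in range(2)]   # noise-free outputs
    for j in range(2):
        e_j = d[j] - (W[j][0] * z[0] + W[j][1] * z[1])    # error before the update
        for k in range(2):
            W[j][k] += mu * e_j * z[k]                    # LMS update of row j

# Expected stationary point: H Omega^{-1}, i.e. column k of H divided by eta_k.
target = [[H[j][k] / eta[k] for k in range(2)] for j in range(2)]
print(W, target)
```

Because the model is exactly realizable in this configuration, the error goes to zero and no term due to the nonlinearities remains, in line with the Wiener MSE discussion above.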
Derivation of the misadjustment:
Thus, as expected, if μ is sufficiently small, E(V(n)) converges to 0.
This expression cannot be further simplified because R_ZZ is not necessarily of the form.
Therefore, tr(R_ZZ K_VjVj(∞)) should be calculated for each nonlinearity and for each Ω.
Here the value of the misadjustment is similar to that of linear identification of a linear system (LMS algorithm). This is expected since in this case there are no errors due to the approximation of the nonlinearities.
Study of the full structure
Mean weight transient behavior
We take the following notations for the weights:.
These matrices are time-dependent since they depend on the NN block weights which are updated through time.
These equations hold for any nonlinearity. In the following, we will calculate them explicitly for the case study described in Section 2.3.
Application to the case study:
The explicit expressions of the different derivatives are detailed in Appendix II.
We obtain the stationary points by setting to 0 the expectations of the updating gradient terms in (64) and (4.5-7).
The above equations are nonlinear in the NN variables. They can be solved numerically, but they are very difficult to solve analytically.
Convergence of the algorithm to the stationary points:
It is always interesting to show whether an algorithm is capable of converging to its stationary points or not. In our case it is difficult to establish this, since the updating equations of the weights are nonlinear, except for W.
In the case where the NN weights are frozen we can establish the convergence condition for W.
The covariance matrices are fixed, since in this case the NN weights are frozen.
where λmax is the largest eigenvalue of the correlation matrix R_NN(X)NN(X).
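As a rough numerical companion to this condition, the sketch below estimates the correlation matrix of the NN block outputs for frozen weights, computes its largest eigenvalue, and evaluates the classical LMS mean-convergence bound 2/λmax. Two assumptions should be kept in mind: the bound takes this form only if the combiner update is written as W(n+1) = W(n) + μ e(n) NN(X)ᵀ (with a factor 2μ in the update the bound would be 1/λmax), and the frozen block weights below are hypothetical.

```python
import math
import random

def nn_block(x, a, b, c):
    # Single-input NN block with erf activation (parameter names are hypothetical).
    return sum(ci * math.erf(ai * x + bi) for ai, bi, ci in zip(a, b, c))

rng = random.Random(3)
# Hypothetical frozen NN weights (a, b, c) for the two blocks.
blocks = [([0.5, 1.0], [0.0, 0.1], [0.3, 0.2]),
          ([0.8, 0.4], [0.0, -0.1], [0.1, 0.5])]
N = 50000
R = [[0.0, 0.0], [0.0, 0.0]]         # sample correlation matrix of the block outputs
for _ in range(N):
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    s = [nn_block(x[k], *blocks[k]) for k in range(2)]
    for i in range(2):
        for k in range(2):
            R[i][k] += s[i] * s[k] / N

# Largest eigenvalue of a symmetric 2 x 2 matrix in closed form.
trace = R[0][0] + R[1][1]
det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
lam_max = 0.5 * (trace + math.sqrt(max(trace * trace - 4.0 * det, 0.0)))

# Along each eigen-direction the mean weight error is multiplied by (1 - mu * lambda)
# at every iteration, so it decays if and only if 0 < mu < 2 / lambda.
mu_bound = 2.0 / lam_max
print(lam_max, mu_bound)
```

In practice a step size well below this bound is used, since the bound only concerns convergence in the mean and says nothing about the misadjustment.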
Application to the case study:
This indicates that weights w_jk are scaled versions of the unknown weights h_jk. The scale factor γ_k is the same for all the weights connecting the k-th NN block to the outputs, and it depends only on the weights of block k. If the error is sufficiently small, the k-th NN block will approximate the k-th nonlinearity to within the inverse of the scale factor.
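The scale factor reflects an ambiguity that is built into the structure: multiplying column k of W by γ_k while dividing the output of block k by γ_k leaves every model output unchanged. The sketch below verifies this invariance numerically. The block parameterization Σ_i c_i erf(a_i x + b_i) and all numerical values are assumptions chosen for illustration.

```python
import math

def nn_block(x, a, b, c):
    # Single-input NN block with erf activation (parameter names are hypothetical).
    return sum(ci * math.erf(ai * x + bi) for ai, bi, ci in zip(a, b, c))

def model_output(x, W, blocks):
    # NN blocks followed by the linear combiner W.
    s = [nn_block(xk, *blk) for xk, blk in zip(x, blocks)]
    return [sum(w * sk for w, sk in zip(row, s)) for row in W]

W = [[1.0, 0.5], [-0.3, 0.8]]                    # hypothetical combiner weights
blocks = [([0.5, 1.0], [0.0, 0.1], [0.3, 0.2]),  # hypothetical (a, b, c) per block
          ([0.8, 0.4], [0.0, -0.1], [0.1, 0.5])]
gamma = [1.5, 0.7]                               # arbitrary non-zero scale factors

# Scale column k of W by gamma_k and the output weights of block k by 1 / gamma_k.
W_scaled = [[W[j][k] * gamma[k] for k in range(2)] for j in range(2)]
blocks_scaled = [(a, b, [ci / gamma[k] for ci in c])
                 for k, (a, b, c) in enumerate(blocks)]

x = [0.7, -1.2]
y = model_output(x, W, blocks)
y_scaled = model_output(x, W_scaled, blocks_scaled)
print(y, y_scaled)
```

This is why H and the nonlinearities can only be identified up to per-block scale factors: the output error cannot distinguish between the two parameter sets.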
Application to the case study:
The first term represents the noise power, the second term is the signal power of the j-th MIMO output, the third term is the sum of the individual contributions of the neurons weighted by the W and H weights, and the fourth term represents the sum of the coupling terms between neurons inside the same block weighted by W. Note that since the inputs are zero-mean and independent, there are no coupling terms between neurons in different blocks (as in Eq. (89)).
Case of frozen NN weights:
It is interesting to see the behavior of the MSE in the case where the NN weights are frozen.
Here the minimum MSE depends on the noise floor and on the NN approximation error of the nonlinearities. It is clear from this equation and from Section 3.2 that, if the NN blocks ideally identify the nonlinearities (to within scale factors), then ζ0 reduces to the noise floor.
The misadjustment can be derived as in Sections 3.1 and 3.2. We obtain an equation similar to (53) by replacing R_ZZ with R_NN(X)NN(X). The equation cannot be simplified further.
In this section we present some simulation results for the case study described in Section 2.3. In these simulations, we have considered a 2 × 2 MIMO system (i.e., M = L = 2). For the parameterized nonlinearities we have chosen α1 = α2 = 1, β1 = 1, β2 = 2. Unless otherwise specified, the inputs are uncorrelated zero-mean white Gaussian processes with σ_xi = 1. In the simulations, the unknown combining matrix H was held fixed. For example, in a MIMO communication system, H can be seen as the propagation matrix between 2 transmitting antennas and 2 receiving antennas.
Matrix W converges to a scaled version of H. Note the typical behavior of the LMS algorithm: a single time constant controls the transient part of the learning curve and of the mean weight curve. This is fundamentally different from the learning of the full NN system, which is governed by several time constants and presents plateau regions (Section 5.2). It should be noted that the steady-state MSE is high because the nonlinearities are not approximated (they are in effect modeled by the identity function) (Equation (25)).
MSE surface for the full NN algorithm
Figure 11b shows the MSE evolution during the learning process for different values of the noise variance σ0. It can be seen that, as the noise variance decreases, the MSE decreases. However, below a certain value of σ0 (here σ0=0.0005), the MSE curves are almost identical. This is because in this case, the weight misadjustment error (for the linear part) and the nonlinear approximation error (of the nonlinear memoryless part) are much higher than the error caused by the presence of noise (see Eqs. 92-93).
Figure 11c investigates the influence of the learning rate μ. It can be seen that as μ increases (up to μ=0.002), the algorithm is faster and the MSE is lower at the end of the simulation time. However, for μ>0.002, as μ increases, the algorithm is faster at the beginning of the learning process, but the MSE is higher at the end of the simulation time. This is due to the misadjustment error which is higher for higher μ (see, e.g., Eq. 95).
Mean weight transient behavior for the full NN algorithm
Notice that, in Figure 14, the W weights evolve quickly at the beginning of the learning process (with values approaching H×U(n), where U is a diagonal matrix). They then evolve slowly until the end of the learning process. The slow evolution is explained by the plateau regions of the MSE surface. At the end of this simulation, matrix U was close to a diagonal matrix. This result is expected since the inputs are uncorrelated (Equations 84-85).
Figures 12 and 13 show that functions g1(x) and g2(x) have been correctly identified by the corresponding NN blocks (the NN functions are normalized by the scaling factors γ1 = 1/1.2702 and γ2 = 1/1.0946, respectively).
Impact of correlated inputs
Table: Effect of correlated inputs. Recoverable entries: ρ = 1 (same input); E(U(n)) for n = 2 × 10^5.
Conclusion and future work
The paper provides a statistical analysis of NN modeling and identification of a class of nonlinear MIMO systems. The study investigates the MSE, mean weight behavior, stationary points, misadjustment error, and stability conditions. The unknown system is composed of a set of single-input memoryless nonlinearities followed by a combining matrix. The NN model is composed of a set of single-input memoryless NN blocks followed by an adaptive linear combiner. The paper is supported with simulation results which show good agreement between the theoretical recursions and Monte Carlo (MC) simulations. Future work will focus on three research directions. The first will build on the theoretical findings in order to express the effect of the number of neurons on the transient and steady-state behavior of the algorithm. The second will investigate the case where matrix H is time-varying and/or has memory (this may have applications, for example, in adaptive control of nonlinear dynamical MIMO systems). Finally, we will study the algorithm behavior and performance for specific inputs (such as space-time coded signals used in wireless communications and their impact on the system capacity).
Calculation of F_kk: Let x_1 and x_2 be two zero-mean Gaussian variables such that Therefore, using Price’s theorem we have: Let Then:
Thus, using the fact that x_1 and x_2 are uncorrelated for ρ = 0, we have: Thus: We have:
where Combining the terms in the exponentials and completing the squares, the integrals can be calculated: Note that in the biasless case (i.e., all the bias terms are set to 0) this expression reduces to: The integral is then simple to calculate:
On the other hand, since, then: When the bias terms are not set to 0, a Taylor series expansion on the bias terms can be used in order to avoid the calculation of the integral.
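For the biasless case, a standard closed form that follows from the same Price's theorem argument is E[erf(a x_1) erf(b x_2)] = (2/π) arcsin( 2ab c / sqrt((1 + 2a²σ_1²)(1 + 2b²σ_2²)) ), where c = ρ σ_1 σ_2 is the covariance of the two zero-mean Gaussian variables. The sketch below checks this identity by Monte Carlo simulation. How this expectation maps onto the quantity F_kk defined in the appendix is an assumption; the function name `erf_correlation` and the numerical values are hypothetical.

```python
import math
import random

def erf_correlation(a, b, s1, s2, rho):
    """Closed form of E[erf(a*x1) * erf(b*x2)] for zero-mean jointly Gaussian
    x1, x2 with standard deviations s1, s2 and correlation coefficient rho."""
    c = rho * s1 * s2                                   # covariance E[x1 x2]
    num = 2.0 * a * b * c
    den = math.sqrt((1.0 + 2.0 * a * a * s1 * s1) * (1.0 + 2.0 * b * b * s2 * s2))
    return 2.0 / math.pi * math.asin(num / den)

rng = random.Random(4)
a, b, s1, s2, rho = 0.8, 1.3, 1.0, 1.5, 0.6             # hypothetical values
N = 200000
acc = 0.0
for _ in range(N):
    u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    x1 = s1 * u
    x2 = s2 * (rho * u + math.sqrt(1.0 - rho * rho) * v)   # correlation rho with x1
    acc += math.erf(a * x1) * math.erf(b * x2)
mc = acc / N
closed = erf_correlation(a, b, s1, s2, rho)
print(mc, closed)
```

For ρ = 0 the closed form gives zero, which is consistent with the absence of coupling terms between blocks driven by independent inputs.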
- 2) Calculation of K: The inside integral can be eliminated by integrating by parts on variable x. The integral is then evaluated by combining the terms in the exponentials and completing the squares. This yields: Again, a Taylor series expansion can be used to simplify this expression. Note that in the biasless case we have:
- (96) is then expressed as: