Open Access

Group lassoing change-points in piecewise-constant AR processes

EURASIP Journal on Advances in Signal Processing20122012:70

DOI: 10.1186/1687-6180-2012-70

Received: 31 October 2011

Accepted: 21 March 2012

Published: 21 March 2012

Abstract

Regularizing the least-squares criterion with the total number of coefficient changes, it is possible to estimate time-varying (TV) autoregressive (AR) models with piecewise-constant coefficients. Such models emerge in various applications including speech segmentation, biomedical signal processing, and geophysics. To cope with the inherent lack of continuity and the high computational burden when dealing with high-dimensional data sets, this article introduces a convex regularization approach enabling efficient and continuous estimation of TV-AR models. To this end, the problem is cast as a sparse regression one with grouped variables, and is solved by resorting to the group least-absolute shrinkage and selection operator (Lasso). The fresh look advocated here permeates benefits from advances in variable selection and compressive sampling to signal segmentation. An efficient block-coordinate descent algorithm is developed to implement the novel segmentation method. Issues regarding regularization and uniqueness of the solution are also discussed. Finally, an alternative segmentation technique is introduced to improve the detection of change instants. Numerical tests using synthetic and real data corroborate the merits of the developed segmentation techniques in identifying piecewise-constant TV-AR models.

1. Introduction

Autoregressive (AR) models have been the workhorse for parametric spectral estimation since they form a dense set in the class of continuous spectra and, in many cases, they approximate parsimoniously the spectrum of a given random process [1], Chap. 3]. These are among the main reasons why AR models have been widely adopted in various applications as diverse as speech modeling [25], electroencephalogram (EEG) signal analysis [6], and geophysics [7]. While AR modeling of stationary random processes is well appreciated, a number of signals encountered in real life are non-stationary. This justifies the growing interest toward non-stationary signal analysis and time-varying (TV) AR models, which arise naturally in speech analysis due to the changing shape of the vocal tract as well as in EEG signal analysis due to the changes in the electrical activity of neurons. If the TV-AR coefficient trajectories can be well approximated by superimposing a small number of basis sequences, non-stationary modeling reduces to estimating via, e.g., least-squares (LS), the basis expansion coefficients [7]. On the other hand, it has been well-documented that piecewise-constant AR systems excited by white Gaussian noise are capable of modeling real-world signals such as speech and EEG [68]. Piecewise-constant AR models constitute a subset of TV-AR models wherein AR coefficients change abruptly. In this case, basis expansion techniques fall short in estimating the change points [8].

Exploiting the piecewise constancy of TV-AR models, several methods are available to detect the changing instants of the AR coefficients, and thus facilitate what is often referred to as signal segmentation. The literature on signal segmentation is large since the topic is of interest in signal processing, applied statistics, and several other branches of science and engineering. Recent advances can be mainly divided in two categories. The first class adopts regularized LS criteria in order to impose piecewise-constant AR coefficients. To avoid "oversegmentation," the LS cost is typically regularized with the total number of changes [6]. The resulting estimator can be implemented via dynamic programming (DP), which incurs computational burden that scales quadratically with the signal dimension. For large data sets, such as those considered in speech processing, this burden refrains practitioners from applying DP to segmentation, and heuristics are pursued instead based on the generalized likelihood ratio test (GLRT), or, approximations of the maximum likelihood approach [2], [[7], p. 401], [9]. The second class of methods relies on Bayesian inference and Markov Chain Monte Carlo (MCMC) methods [35]. A distinct advantage of this class is that model order selection can be performed automatically, and a variable model order can be chosen per segment. However, Bayesian techniques are known to require large computational resources.

The algorithm for change detection of piecewise-constant AR models developed in this article belongs to the first class of methods, and its first novelty consists in developing a new regularization function which encourages piecewise-constant TV-AR coefficients while being convex and continuous; hence, it can afford efficient convex optimization solvers. To this end, it is shown that the segmentation problem can be recast as a sparse regression problem. The regularization function in [6] is then relaxed with its tightest convex approximation. It turns out that the resultant change detector is a modification of the group least-absolute shrinkage and selection operator (Lasso) [10].

With the emphasis placed on large data sets, a candidate algorithm for implementing the developed change detector is a block-coordinate descent iteration, which is provably convergent to the group Lasso solution. Surprisingly, it turns out that each iteration of the block-coordinate descent can be implemented at complexity that scales linearly with the signal dimension, thus encouraging its application to large data sets. Regularization tuning and uniqueness of the group Lasso solution are also discussed.

The second novelty of the present study is an alternative change-point retrieval algorithm based on the smoothly-clipped absolute deviation (SCAD) regularization. The associated non-convex problem is tackled by resorting to a local linear approximation (LLA), which yields iterated weighted group Lasso minimization problems that can be solved via block-coordinate descent. Numerical tests using synthetic and real (speech and sound) data are performed to corroborate the capability of the developed algorithms to identify piecewise-constant TV-AR models.

The remainder of the article is structured as follows. Section 2 deals with piecewise-constant TV-AR model estimation preliminaries. In Section 3, the problem at hand is recast as a sparse linear regression, and the novel group Lasso approach is introduced. An efficient block-coordinate descent algorithm is developed in Section 4, while tuning issues and uniqueness of the group Lasso solution are addressed in Section 5. Section 6 introduces a non-convex segmentation method based on the SCAD regularization to enhance the sparsity of the solution, which translates to retrieving more precisely the change instants. Numerical tests are presented in Section 7, and concluding remarks are summarized in Section 8. The Appendix is devoted to technical proofs. Notation: Column vectors (matrices) are denoted using lower-case (upper-case) boldface letters; calligraphic letters are reserved for sets; (·) T stands for transposition, N ( μ , σ 2 ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq1_HTML.gif denotes the Gaussian probability density function with mean μ and variance σ2; denotes the Kronecker product; 0 L is the L-dimensional column vector with all zeros, and I L is the L-dimensional identity matrix. The p norm of x : = [ x 1 , , x L ] T L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq2_HTML.gif is defined as x p : = l = 1 L x l p 1 p . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq3_HTML.gif.

2. Preliminaries and problem statement

Let y n n = - L N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq4_HTML.gif denotes the realization of an L th order TV-AR process obeying the discrete-time input-output relationship
y n = l = 1 L a l , n y n - l + v n , n = 0 , 1 , , N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ1_HTML.gif
(1)
where v n denotes the zero-mean white input noise at time n with variance σ 2 : = E v n 2 < + https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq5_HTML.gif, and aℓ,nis the th TV-AR coefficient at time n. With h n : = [ y n - 1 , y n - 2 , , y n - L ] T L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq6_HTML.gif and a n : = [ a 1 , n , a 2 , n , , a L , n ] T L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq7_HTML.gif, (1) can be rewritten as
y n = h n T a n + v n , n = 0 , 1 , , N . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ2_HTML.gif
(2)
Suppose that abrupt changes in the spectrum of {y n } occur, due to piecewise-constant changes of an; that is,
a n = a k , n k n n k + 1 - 1 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ3_HTML.gif
(3)

for k = 0,1,..., K, where K denotes the number of abrupt changes in the TV-AR spectrum, and n k the time instant of the k th abrupt change. The interval [n k , nk+1- 1] is referred to as the kth segment. Without loss of generality, n0 = 0 and nK+1- 1 = N.

The goal is to estimate the instants { n k } k = 1 K https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq8_HTML.gif where the given time series {y n } is split into K + 1 (stationary) segments, and also the constant AR coefficients per segment, i.e., { a k } k = 0 K https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq9_HTML.gif. The number of abrupt changes, namely K, is not necessarily known.

2.1. Optimum segmentation of TV-AR processes

Modeling real world signals using AR processes is well motivated because, for a given continuous spectral density S(f), it is possible to find an AR process (of high enough order) whose spectral density is arbitrarily close to S(f) [11], p. 130]. On the other hand, depending on the underlying non-stationary phenomena, variations of the AR coefficients can be either slow or abrupt. The problem stated in Section 2 is often referred to as signal segmentation, and emerges in numerous applications ranging from speech processing [25] to EEG signal analysis [6]. Regularized LS has been the workhorse approach for analyzing this kind of non-stationary processes [1215]. Denoting with μ a positive tuning constant, a Schwarz-like regularization is typically adopted to estimate jointly the change points and the AR coefficients, i.e.,
{ â n } n = 0 N : = arg min { a n } n = 0 N 1 2 n = 0 N ( y n - h n T a n ) 2 + μ n = 1 N δ 0 L ( a n - a n - 1 ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ4_HTML.gif
(4)
where δ 0 L ( ) : L { 0 , 1 } https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq10_HTML.gif is defined as
δ 0 L ( a ) : = 0 , if a = 0 L 1 , otherwise . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ5_HTML.gif
(5)
The non-convex regularization term n = 1 N δ 0 L ( a n - a n - 1 ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq11_HTML.gif not only captures the total number of changes, but also encourages piecewise-constant { â n } n = 0 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq12_HTML.gif. Clearly, the larger the μ, the smaller the total number of changes. The estimator in (4) is optimal in the maximum a posteriori (MAP) sense when the change occurrences are modeled as Bernoulli random variables, and v n ~ N ( 0 , σ 2 ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq13_HTML.gif[6]. In some problems, the total number of changes is known, and the following constrained version of (4) is adopted instead:
{ â n } n = 0 N = arg min { a n } n = 0 N n = 0 N ( y n - h n T a n ) 2 s .t . n = 1 N δ 0 L ( a n - a n - 1 ) = K . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ6_HTML.gif
(6)

From a practical point of view, the minimization in (4) or (6) is challenging since an exhaustive search over all possible sets of change instants has to be performed. However, several techniques based on DP, simulated annealing and interactive conditional model algorithms have been developed to evaluate (4) [6, 16]. Despite the fact that DP approaches solve (4) in polynomial time, the computational complexity is quadratic in N, which limits their applicability to signal segmentation in practice. In typical applications, N can be very large (up to several thousands), and even quadratic complexity cannot be afforded. On the other hand, when applied to real data, the performance of the estimator in (4) is not satisfactory [17].

To overcome these limitations of (4), heuristic approaches based on the GLRT are used in real world applications [[7], p. 401], [9, 18, 19]. However, GLRT-based change detectors are sensitive to modeling errors, and require fine tuning of the associated detection thresholds.

In what follows, a convex relaxation of the cost in (4) is advocated based on recent advances in sparse linear regression and compressive sampling. To this end, (4) is first reformulated to a sparse regression problem with non-convex regularization that is successively relaxed through its tightest convex approximation. The consequent optimization rule will yield sparse vector estimators which result in surprisingly accurate retrieval of change-points. Those are obtained by an efficient block-coordinate descent iteration that incurs only linear computational burden and memory storage. Unlike (4), based on well-established results in statistics, it will be further argued that the resultant TV-AR model estimates are a continuous function of the data.

3. Sparse linear regression and group lassoing

Let y : = [ y 0 , y 1 , , y N ] T N + 1 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq14_HTML.gif denotes the observation vector, a : = a 0 T ,a 1 T , , a N T T ( N + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq15_HTML.gif, m n : = [ 0 L T , , 0 L T n ,h n T , 0 L T , , 0 L T N - n ] T ( N + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq16_HTML.gif for n = 0,1,..., N, and M : = [ m 0 , m 1 , , m N ] T N + 1 × ( N + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq17_HTML.gif, such that
n = 0 N ( y n - h n T a n ) 2 = y - Ma 2 2 . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ7_HTML.gif
(7)
Define the "difference" vector d n L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq18_HTML.gif as
d n = a n , if n = 0 a n - a n - 1 , otherwise https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ8_HTML.gif
(8)
and d : = d 0 T , d 1 T , , d N T T ( N + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq19_HTML.gif. Observe that d n = 0 L for n > 0 if and only if there is no change in the TV-AR coefficients between time instants n - 1 and n. Clearly, it is possible to recover { a n } n = 0 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq20_HTML.gif from { d n } n = 0 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq21_HTML.gif since
a n = n n d n . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ9_HTML.gif
(9)
Let T N + 1 × N + 1 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq22_HTML.gif denotes a lower triangular matrix with all nonzero entries equal to one and X : = M ( T I L ) ( N + 1 ) × ( N + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq23_HTML.gif, having the following structure:
X = h 0 T 0 L T 0 L T h 1 T h 1 T 0 L T 0 L T h N - 1 T h N - 1 T h N - 1 T h N - 1 T 0 N T h N T h N T h N T h N T h N T . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ10_HTML.gif
(10)
Since a = (T I L )d, an equivalent formulation of (4) in terms of { d n } n = 0 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq21_HTML.gif can be given as [cf. (7)]
{ d ^ n } n = 0 N : = arg min { d n } n = 0 N 1 2 y - Xd 2 2 + μ n = 1 N δ 0 L ( d n ) . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ11_HTML.gif
(11)

What makes the formulation in (11) attractive but also challenging is the non-convex and discontinuous Schwarz-like regularization term. The latter "pushes" most of the { d n } n = 1 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq24_HTML.gif vectors toward 0 L , while d0 is not penalized. As a consequence, the vector d ^ : = d ^ 0 T , d ^ 1 T , , d ^ N T T https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq25_HTML.gif is group sparse, and the non-zero group indexes correspond to the change instants of the TV-AR coefficients. Recently, a convex model selector with grouped variables was put forth by [10], and successfully applied to biostatistics and compressive sampling [20]. It generalizes the (non-grouped) least-absolute shrinkage and selection operator (Lasso) [21] to regression problems where the unknown vector exhibits sparsity in groups; hence, its name group Lasso. The crux of group Lasso is to relax the Schwarz-like regularization in (11) with its tightest convex approximation.

The group Lasso is advocated here for catching change points by estimating the difference vectors as
{ d ^ n } n = 0 N = arg min { d n } n = 0 N 1 2 y - Xd 2 2 + λ n = 1 N d n 2 1 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ12_HTML.gif
(12)

where λ is a positive tuning parameter. It is known that the group Lasso regularization encourages group sparsity; that is, d ^ n = 0 L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq26_HTML.gif for most n > 0 [10]. Again, the larger the λ, the sparser the d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq27_HTML.gif.

The role of the regularization term in (12) is illustrated next through a simple example. Select L = 2 for simplicity, and let d:= [d1, d2] T . Consider the family of penalties d 2 p = ( d 1 2 + d 2 2 ) p 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq28_HTML.gif, where 0 < p ≤ 2. Figure 1 depicts the penalties d 2 p https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq29_HTML.gif for p = 2,1,0.5, and 0.1. Clearly, d 2 p https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq29_HTML.gif is convex for 1 ≤ p ≤ 2. On the other hand, d 2 p https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq29_HTML.gif is non-convex for 0 < p < 1 but it comes close to δ 0 L ( d ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq30_HTML.gif(d) as p approaches 0. This demonstrates that ║d n 2 is the tightest convex approximation of δ 0 L ( d n ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq31_HTML.gif (d n ). Furthermore, ║d n 2 is non-differentiable at d n = 0 L , which enables group Lasso to encourage group sparsity. Needless to say that convexity of the regularizing functions avoids the presence of local minima, and allows for solving the resulting optimization problem efficiently. To this end, an efficient block-coordinate descent algorithm is developed next, with computational complexity per iteration that scales linearly with N. But first two remarks are in order.
https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Fig1_HTML.jpg
Figure 1

Regularization family d 2 p https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq32_HTML.gif for L = 2 and (a) p = 2, (b) p = 1, (c) p = 0.5, and (d) p = 0.1.

Remark 1. Different from Schwarz-like regularization, Figure 1 illustrates that the group Lasso one grows unbounded. This makes the resultant estimator biased. Nevertheless, unlike the Schwarz-like regularization, the group Lasso one is continuous which renders the resulting estimator more stable when applied to real data; see also [10, 22]. A continuous regularization function that reduces the bias of the group Lasso will be discussed in Section 6.

Remark 2. Convex relaxation for detecting changes in the mean of non-stationary processes was recently mentioned in [12], and analyzed in [17]. For the mean-change problem, the tightest convex approximation of the Schwarz-like regularized LS is provided by the Lasso, which can afford efficient solvers such as the least-angle regression (LARS) algorithm [23]. However, for the group Lasso cost proposed here to catch changes in TV-AR models, an exact LARS-like solver is not available [10]; thus, the pursuit of efficient algorithms for solving (12) is well motivated. This is the theme of the ensuing section.

4. Block-coordinate descent solver

The crux of block-coordinate descent is to iterate minimizations of the function of interest over a group of variables, while keeping the rest fixed. Consider the objective function
J ( d ) : = 1 2 y - Xd 2 2 + λ n = 1 N d n 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ13_HTML.gif
(13)
and let d ( i - 1 ) : = [ d 0 ( i - 1 ) T , d 1 ( i - 1 ) T , , d N ( i - 1 ) T ] T https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq33_HTML.gif denotes the provisional solution at iteration i- 1. The n th step of the i th block-coordinate descent iteration entails minimization of J(d) only with respect to d n , while retaining the provisional estimates at iteration i-1, namely { d n ( i - 1 ) } n = n + 1 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq34_HTML.gif, and the newly updated blocks at iterations i, namely { d n ( i ) } n = 0 n - 1 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq35_HTML.gif. Thus, block-coordinate descent at the n th step of the i th iteration yields
d n ( i ) = arg min d n J d 0 ( i ) , , d n - 1 ( i ) , d n , d n + 1 ( i - 1 ) , , d N ( i - 1 ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ14_HTML.gif
(14)
for n = 0,1,..., N, and i > 0. Skipping constant terms, J(d) in (13) can be rewritten as
J ( d ) = 1 2 d T X T Xd - d T X T y + λ n = 1 N d n 2 = 1 2 d T Rd - d T r + λ n = 1 N d n 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ15_HTML.gif
(15)
where R := X T X, and r := X T y. Upon defining R n : n : = m = n n h m h m T https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq36_HTML.gif and r n : n : = m = n n h m y m https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq37_HTML.gif for n'n , it holds that
R = R 0 : N R 1 : N R N - 1 : N R N : N R 1 : N R 1 : N R N - 1 : N R N : N R N - 1 : N R N - 1 : N R N - 1 : N R N : N R N : N R N : N R N : N R N : N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ16_HTML.gif
(16)
and
r = r 0 : N r 1 : N r N - 1 : N r N : N . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ17_HTML.gif
(17)
While for n = 0 (14) reduces to an LS problem, for n > 0, omitting again irrelevant terms, it can be rewritten as
d n ( i ) = arg min d n L 1 2 d n T R n : N d n + d n T g n ( i ) + λ d n 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ18_HTML.gif
(18)
with
g n ( i ) : = R n : N n = 0 n - 1 d n ( i ) + n = n + 1 N R n : N d n ( i = 1 ) - r n : N . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ19_HTML.gif
(19)
The problem in (18) is a convex second-order cone program (SOCP). Typically, L N and (18) can be solved with fast optimization solvers based on interior point methods [24], at worst-case complexity O ( L 3 . 5 ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq38_HTML.gif. Recently, it has been shown that the solution of (18) can be obtained as a function of the solution of the following scalar problem
γ n ( i ) : = arg min γ 0 [ γ ( 1 1 2 g n ( i ) T ( γ R n : N + λ 2 2 I L ) 1 g n ( i ) ) ] https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ20_HTML.gif
(20)
whose solution is given by [25]
γ n ( i ) = 0 , if g n ( i ) 2 λ γ > 0 : λ 2 γ R n : N + λ 2 2 I L - 1 g n ( i ) 2 2 = 1 , otherwise . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ21_HTML.gif
(21)
Finally, d n ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq39_HTML.gif in (18) can be obtained from γ n ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq40_HTML.gif in (21) as
d n ( i ) = 0 L , if g n ( i ) 2 λ - γ n ( i ) γ n ( i ) R n : N + λ 2 2 I L - 1 g n ( i ) , otherwise . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ22_HTML.gif
(22)

Notice that if g n ( i ) 2 λ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq41_HTML.gif, the solution of (18) is d n ( i ) = 0 L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq42_HTML.gif. Since it is expected that the solution of (12) is sparse, solving (18) is trivial most of the time. If g n ( i ) 2 > λ , d n ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq43_HTML.gif can be obtained via interior point methods or by (numerically) solving the scalar equation in (21), which admits fast solvers via, e.g., Newton-Raphson iterations, as in [25].

Despite the fact that block-coordinate descent is typically adopted for large-size sparse linear regression, what makes it particularly appealing for catching change-points is the fact that the vector g n ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq44_HTML.gif can be updated recursively in n due to the special structure of R in (16). Upon defining
c n ( i ) : = n = 0 n = 1 d n ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ23_HTML.gif
(23)
s n ( i ) : = n = n + 1 N R n : N d n ( i - 1 ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ24_HTML.gif
(24)
it follows from (19) that
g n ( i ) = R n : N c n ( i ) + s n ( i ) - r n : N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ25_HTML.gif
(25)
which shows that evaluating g n ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq44_HTML.gif requires the vectors c n ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq45_HTML.gif and s n ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq46_HTML.gif Given { d n ( i - 1 ) } n = 0 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq47_HTML.gif from the (i - 1)st iteration, and initializing c n ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq45_HTML.gif and s n ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq46_HTML.gif at n = 0 as c 0 ( i ) = 0 L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq48_HTML.gif and s 0 ( i ) = n = 1 N R n : N d n ( i - 1 ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq49_HTML.gif, it is possible to recursively evaluate c n ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq45_HTML.gif and s n ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq46_HTML.gif given c n - 1 ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq50_HTML.gif, s n - 1 ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq51_HTML.gif and d n - 1 ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq52_HTML.gif from step n-1 for n > 0 as
c n ( i ) = c n - 1 ( i ) + d n - 1 ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ26_HTML.gif
(26)
s n ( i ) = s n - 1 ( i ) - R n : N d n ( i - 1 ) . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ27_HTML.gif
(27)

The block-coordinate descent algorithm is summarized in Algorithm 1. Interestingly, matrix X ( N + 1 ) × ( N + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq53_HTML.gif in (12) does not have to be stored since only { R n : N } n = 0 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq54_HTML.gif and { r n : N } n = 0 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq55_HTML.gif suffice to implement Algorithm 1. Thus, the memory storage and complexity to perform one block-coordinate descent iteration grow linearly with N. This attribute renders the block-coordinate descent appealing especially for large-size problems where DP approaches tend to be too expensive.

Regarding convergence, the ensuing assertion is a direct consequence of the results in [26].

Proposition 1. The iterates d ( i ) : = d 0 ( i ) T , d 1 ( i ) T , , d N ( i ) T T https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq56_HTML.gif obtained by Algorithm 1 converge to the global minimum of (12); that is, lim i d ( i ) = d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq57_HTML.gif.

Block-coordinate descent will also be the basic building block for solving the non-convex problem introduced in Section 6 to improve the retrieval of change-points. But first, it is useful to consider two issues of the group Lasso change detector for TV-AR models.

Given { R n : N , r n : N } n = 0 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq58_HTML.gif

Initialize with d n ( 0 ) = 0 L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq59_HTML.gif for n = 1, ... , N

for i > 0 do

  for n = 0,1,..., N do

    if n = 0 then

       c 0 ( i ) = 0 L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq48_HTML.gif

       s 0 ( i ) = n = 1 N R n : N d n - 1 ( i - 1 ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq60_HTML.gif

       g 0 ( i ) = s 0 ( i ) - r 0 : N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq61_HTML.gif

       d 0 ( i ) = - R 0 : N - 1 g 0 ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq62_HTML.gif

    else

       c n ( i ) = c n - 1 ( i ) + d n - 1 ( i ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq63_HTML.gif

       s n ( i ) = s n - 1 ( i ) - R n : N d n ( i - 1 ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq64_HTML.gif

       g n ( i ) = R n : N c n ( i ) + s n ( i ) - r n : N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq65_HTML.gif

      if g n ( i ) 2 λ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq66_HTML.gifthen

         d n ( i ) = 0 L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq67_HTML.gif

      else

         d n ( i ) = arg mi n d n L 1 2 d n T R n : N d n + d n T g n ( i ) + λ d n 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq68_HTML.gif

         Algorithm 1: Block-coordinate descent algorithm

5. Regularization and uniqueness issues

Performance of model selection with grouped variables via group Lasso and related approaches has been analyzed in [10, 20, 27], while asymptotic analysis has been pursued in [28, 29]. In this section, a couple of issues are investigated regarding the group Lasso cost function, and the uniqueness of its minimum.

5.1. Tuning the regularization parameter

Selection of λ is a critical issue since larger λ's promote sparser solutions, which translate to fewer changes in the TV-AR spectrum. However, larger λ's increase the estimator bias as well. If the number of changes present are known a priori by other means, or, if a certain level of segmentation can be afforded, λ can be tuned accordingly by 'trial and error,' or by cross-validation. But in general, analytic methods to automatically choose the "best" value of λ are not available. In essence, selecting the regularization parameters is more a matter of engineering art, rather than systematic science.

In this section, heuristic but useful guidelines are provided to choose λ based on rigorously established lower bounds of this parameter. Given X N + 1 × ( N + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq69_HTML.gif in (10), define X n N + 1 × L , n = 0 , 1 , , N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq70_HTML.gif such that X = [X0, X1,..., X N ]. To bound λ, we will rely on the following result; see Appendix 1 for the proof.

Proposition 2. If X0has full column rank, then d ^ = [ d 0 , c T , 0 L T , , 0 L T ] T https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq71_HTML.gifwith d 0 , c : = ( X 0 T X 0 ) - 1 X 0 T y https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq72_HTML.gif, if and only if λ λ * : max n = 1 , , N X n T ( X 0 d 0 , c - y ) 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq73_HTML.gif

If λ exceeds a threshold, which is specified by the regression matrix and the observations, Proposition 2 asserts that d ^ 0 = d 0 , c https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq74_HTML.gif and d ^ n = 0 L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq75_HTML.gif for n = 1,..., N. This, along with (9), implies that a ^ n = d 0 , c https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq76_HTML.gif for n = 0, 1,..., N; that is, no change occurs in the coefficients of the TV-AR process. To avoid this trivial (change-free) solution, the guideline provided by Proposition 2 is that λ must be chosen strictly less than λ*. Our extensive simulations suggest that setting λ equal to a small percentage of λ*, say 5-20%, results in satisfactory estimates.

5.2. Uniqueness of the sparse solution

Uniqueness of sparse linear regression with non-grouped variables has been studied in [3032]. Next, uniqueness issues in recovering sparse vectors with group-variables are explored by exploiting the deterministic structure of the regression matrix in (12). The cost function in (12) is not strictly convex since X is a fat matrix, and the regularization term is not strictly convex; see also Figure 1. On the other hand, the block-coordinate descent algorithm developed in Section 4 is guaranteed to converge to a global minimum. In the following, a criterion is introduced to check a posteriori whether the obtained solution is unique for a given group-sparsity level.

Traditionally, the support of a sparse vector is defined as the set of indexes corresponding to the non-zero entries. In the group-sparsity framework herein, a different definition of support is required. Indeed, the vector of interest here, namely d = [ d 0 T , d 1 T , , d N T ] T https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq77_HTML.gif comprises N + 1 groups of L-dimensional variables. Since the term d0 is not penalized in (12), d ^ 0 0 L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq78_HTML.gif almost surely. Define the group support (g-support) of d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq79_HTML.gif to be the set containing the indexes relative to the nonzero group of variables; that is, g - supp ( d ^ ) : = { n { 1 , , N } : d ^ n 0 L } https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq80_HTML.gif In the following, when G : = { s 1 , , s G } { 1 , , N } https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq81_HTML.gif denotes the g-support of d, the set https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq82_HTML.gif is assumed ordered; i.e., s j < s k for each j < k.

The following lemma establishes a property of the matrix X in (10); see Appendix 2 for the proof.

Lemma 1. If any L out of N + 1 vectors { h n } n = 0 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq83_HTML.gifare linearly independent, for any g-support G = { s 1 , , s G } https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq84_HTML.gifsuch that ( G + 1 ) L N + 1 , s 1 L , s G N - L + 1 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq85_HTML.gif, and |s j - s k | ≥ L for each j ≠ k, the matrix X G : = [ X 0 , X s 1 , , X s G ] N + 1 × ( G + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq86_HTML.gifhas full column rank.

Lemma 1 asserts full column rank of the submatrix X G https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq87_HTML.gif, if it is formed by the columns of X corresponding to the non-zero indexes of any sparse vector whose g-support is sufficiently small, and the non-zero groups are sufficiently distant from each other.

Next, Lemma 1 is exploited to establish an interesting property for the solutions of (12); see Appendix 3 for the proof.

Proposition 3. If any L out of N+1 vectors { h n } n = 0 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq83_HTML.gifare linearly independent, for any g-support G = { s 1 , , s G } { 1 , , N } https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq88_HTML.gifsuch that ( G + 1 ) L N + 1 , s 1 L , s G N - L + 1 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq85_HTML.gif, and |s j - s k |L for each j ≠ k, there exists at most one solution of (12) g-supported in https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq82_HTML.gif .

Proposition 3 ensures that if d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq79_HTML.gif is g-supported in https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq82_HTML.gif , and is sufficiently sparse with non-zero groups sufficiently far apart, then d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq79_HTML.gif is the only solution of (12) g-supported in https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq82_HTML.gif .

Remark 3. Analysis of the group Lasso and its modifications has revealed that its performance can be close to the Schwarz-regularized LS either when the regression matrix is sufficiently block incoherent, or, when the block restricted isometry property holds [10, 27]. If the regression matrix can be chosen by the designer, and it is randomly drawn from selected distributions (e.g., Gaussian or Bernoulli), these analyzes provide useful connections between problems (11) and (12). In the problem at hand however, the regression matrix is fixed, and its blocks [X0, X1,..., X N ] are highly correlated. In this case, the relationship between the solutions of (11) and (12) is much less understood, and constitutes an interesting future research direction.

6. Continuity, bias and the group SCAD

As already pointed out, convex relaxation of the Schwarz-like regularization was developed in [12, 17] for the mean-change problem using the (non-grouped) Lasso. Numerical results in [17] reveal that the Lasso tends to detect a "cloud" of small change points around an actual change. Post-processing via DP to select a few of the estimated change instants was proposed in [17]. Moreover, due to the bias introduced by the Lasso, once the change points are obtained, another step is required to re-estimate the mean within a segment. In the following, a novel change detector is developed based on recent advances in model selection via non-convex regularization. The resulting estimator reduces the bias of group Lasso and can afford a convergent optimization solver. The corresponding algorithm is based on iterative instantiations of weighted group Lasso, which is capable of enhancing the sparsity of the solution [33, 34], and thus improving the precision of the detected change points.

Attributes of a "good" regularization function are delineated in [22], and three properties are identified to this end:

  • Unbiasedness. The estimator has to be unbiased when the true unknown parameter has large amplitude.

  • Sparsity. The estimator has to set small-amplitude coefficients to zero to reduce the number of variables.

  • Continuity. The estimator has to be continuous in the data to avoid instability when estimating (non-)zero variables.

To appreciate these properties, a simple setting is presented next. Consider estimating a scalar parameter, call it d, in additive noise v, based on the scalar observation y = d + υ. In its general form, the regularized LS approach yields
d ^ = arg min d 1 2 ( y - d ) 2 + p λ ( d ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ28_HTML.gif
(28)
where pλ(|d|) is the regularization function. The Lasso regularization is p λ Lasso d = λ d https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq89_HTML.gif while the Schwarz-like regularization is
p λ Schwarz d = 0 , if d = 0 λ , otherwise . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ29_HTML.gif
(29)
In both cases, the estimate d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq90_HTML.gif can be found in closed form, and the dependence of d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq90_HTML.gif on y is given as
d ^ Lasso = 0 , if y < λ sign ( y ) y - λ , otherwise https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ30_HTML.gif
(30)
d ^ Schwarz = 0 , if y < 2 λ y , otherwise . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ31_HTML.gif
(31)
The non-linear estimation rules in (30) and (31) are depicted in Figure 2a,b, respectively, for λ = 2. Observe that both regularization functions effect sparsity, since coefficients with small amplitude are set to 0. The Schwarz-like regularization yields unbiased estimates, but the solution is not continuous with respect to y. Hence, small variations of y or λ may result in large variations of d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq90_HTML.gif (this happens when one is uncertain whether to set the coefficient to 0 or not). On the other hand, the Lasso regularization possesses continuity but the estimates are biased, because in addition to small, large-amplitude coefficients are "shrunk" too.
https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Fig2_HTML.jpg
Figure 2

Regularized LS estimation rules in the scalar case: (a) (group) Lasso, (b) Schwarz-type regularization, and (c) SCAD.

To overcome the limitations of these regularization functions, the following smoothly clipped absolute deviation (SCAD) regularization can be used with a > 2 [22]
p λ SCAD d = λ d , if d λ - d 2 - 2 d a λ + λ 2 2 ( a - 1 ) , if λ < d a λ λ 2 2 ( a + 1 ) , d > a λ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ32_HTML.gif
(32)
to obtain estimates given by
d ^ SCAD = 0 if y λ sign ( y ) ( y - λ ) , if λ < d 2 λ ( a - 1 ) y - sign ( y ) a λ a - 2 , if 2 λ < y a λ y , y > a λ . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ33_HTML.gif
(33)

The data dependence of d ^ SCAD https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq91_HTML.gif is depicted in Figure 2c for λ = 2 and a = 3.7. Observe that the SCAD enjoys the three aforementioned attributes of a desirable regularization function.

Motivated by this scalar example, we propose as an alternative to (12), the following group SCAD approach for catching change-points:
{ d ^ n } n = 0 N = arg min { d n } n = 0 N 1 2 y - Xd 2 2 + n = 1 N p λ SCAD ( d n 2 ) . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ34_HTML.gif
(34)
The problem in (34) is non-convex and its exact minimization is challenging due to the presence of local minima. Nevertheless, it is possible to generalize the iterated local linear approximation (LLA) of [34] in order to ensure converge to a stationary point of the cost in (34). Let p λ SCAD ( d ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq92_HTML.gif denote the derivative of p λ SCAD ( d ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq93_HTML.giffor d≥ 0; that is,
p λ SCAD ( d ) : = λ , if d λ a λ - d ( a - 1 ) , if λ < d a λ 0 , d > a λ . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ35_HTML.gif
(35)
The idea behind the LLA is to approximate p λ SCAD ( d ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq94_HTML.gif with its linear expansion around do; that is
p λ SCAD ( d ) p λ SCAD ( d o ) + p λ SCAD' ( d o ) ( d - d o ) . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ36_HTML.gif
(36)
Given a provisional estimate of d n n = 0 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq95_HTML.gif at iteration j - 1, namely d ^ n [ j - 1 ] n = 0 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq96_HTML.gif, the iterated LLA of (34) is
d ^ n [ j ] n = 0 N = arg min { d n } n = 0 N 1 2 y - Xd 2 2 + n = 1 N p λ SCAD d ^ n [ j - 1 ] 2 d n 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ37_HTML.gif
(37)
for j = 1,..., J. Since p λ SCAD d ^ n [ j - 1 ] 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq97_HTML.gif are non-negative constants, the cost in (37) is convex, and can be minimized using the block-coordinate descent algorithm of Section 4. The role of the weights p λ SCAD d ^ n [ j - 1 ] 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq97_HTML.gif is to avoid penalizing terms that, most likely, are non-zero. Function p λ SCAD ( d ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq94_HTML.gif is depicted in Figure 3 for λ = 2, a = 3.7, and d ≥ 0. It is clearly a decreasing function of its argument. More precisely, if 0 ≤ d ≤ λ, then p λ SCAD ( d ) = λ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq98_HTML.gif hence, the regularization parameter is as large as the group Lasso. If λ ≤ da λ, the regularization parameter is linearly decreasing until da λ, where the regularization parameter is zero. Thus, the expression in (37) represents an iterated weighted group Lasso. Furthermore, if initialized with d ^ [ 0 ] = 0 ( N + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq99_HTML.gif = 0(N+ 1)L, the first iteration of (37) corresponds to the (unweighted) group Lasso, and few iterations of the type in (37) are required for convergence.
https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Fig3_HTML.jpg
Figure 3

Derivative of the SCAD function.

Remark 4. In principle, one could also apply a block-coordinate descent iteration to minimize (34) directly. The resulting iterates converge to a local minimum of (34) that depends on the starting points; see also [26]. In general, it is impossible to assess properties of this solution. Instead, the solution of the LLA in (37) has provable merits in estimating the true support of sparse signals [34].

Remark 5. Recently, greedy algorithms such as the matching pursuit and the orthogonal matching pursuit have been shown to approach the performance of (11) and (12) when the regression matrix is sufficiently block incoherent, or, when the block restricted isometry property holds [27]. In the problem at hand, wherein the regression matrix exhibits correlation among blocks, simulated tests have shown that greedy algorithms suffer from severe error propagation. In fact, a cloud of change points is typically declared around a true change point. For these reasons, greedy algorithms will not be considered hereafter.

Remark 6. As already mentioned in the Introduction, one advantage of Bayesian techniques is that automatic model selection can be performed on a per-segment basis at the price of increasing computational cost. The convex relaxation developed in Section 3 can be modified to perform model selection too. Indeed, setting L to a prescribed upper bound on the model order, one may further regularize the cost in (12) to impose sparse AR coefficient vectors; that is,
{ a ^ n } n = 0 N = arg min { a n } n = 0 N 1 2 y - Ma 2 2 + λ n = 1 N a n - a n - 1 2 + γ a 1 . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ38_HTML.gif
(38)

The cost function in (38) is convex and effects a piecewise-constant and sparse TV-AR model. It is different from model order selection criteria in the sense that the selected non-zero AR coefficients do not necessarily have to be consecutive. A challenge associated with the optimization in (38) is that block-coordinate descent algorithms do not converge, since the differentiable part is not separable group-wise [26]. The cost function in (38) resembles the fused Lasso developed in [35]. Efficient algorithms exploiting this link, and the structure of the problem at hand are currently under investigation.

7. Simulated tests

The merits of the novel approaches to catching change-points in TV-AR processes are assessed via numerical simulations using synthetic and real data.

7.1. Synthetic data

The signal of interest here is a realization of a TV-AR process with N + 1 = 500, order L = 4, v n ~ N ( 0 , σ 2 ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq13_HTML.gif, and σ 2 = 10-2, exhibiting K = 2 abrupt changes in the spectrum at time n1 = 100, and n2 = 350. The AR model during the first segment (n [0,99]) has coefficients a0 = [-0.8000, -0.1500, 0.1940, -0.0280] T , while in the second segment (n [100,349]) the AR coefficients are a1 = [0.1200, 0.0245, -0.2787, -0.0693] T . In the third segment (n [350, 499]), a0 is in act. Figure 4 shows the true signal along with the group Lasso, group SCAD, and DP-based estimates of the TV-AR coefficients obtained via (12), (37), and (6) with K = 2, respectively. The regularization parameter λ of the group Lasso and group SCAD was selected to return 2 change points. The block-coordinate descent algorithm of Section 4 was used to solve (12) up to a maximum of 103 iterations or when d ( i ) - d ( i - 1 ) 2 2 d ( i ) 2 2 1 0 - 8 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq100_HTML.gif The same stopping rule is used for each iteration in (37), and J = 5 outer iterations are run (the same stopping rules are adopted henceforth). Figure 5 depicts the variation of d ^ n 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq101_HTML.gif across time. Clearly, the instants where d ^ n 2 > 0 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq102_HTML.gif represents the change points retrieved. While the DP has detected change-points at n ^ 1 = 105 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq103_HTML.gif and n ^ 2 = 349 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq104_HTML.gif the group Lasso and the group SCAD have detected changes at n ^ 1 = 103 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq105_HTML.gif and n ^ 2 = 356 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq106_HTML.gif confirming that the results returned by the DP programming, by the group Lasso, and by group SCAD are comparable. Since {y n } was generated as in (1), any L out of N + 1 vectors { h n } n = 0 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq83_HTML.gif are linearly independent almost surely, which means that Proposition 3 can be invoked to ensure uniqueness of the group Lasso TV-AR model estimate. Notice from Figure 4 and Figure 5 that the d ^ n 2 s https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq107_HTML.gif estimated by the group Lasso are much smaller than the true ones due to the bias, which results in poor estimates of the AR coefficients. However, those estimated by the group SCAD are closer to the true one.
https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Fig4_HTML.jpg
Figure 4

Synthetic data. From top to bottom: True signal, estimated of the TV-AR coefficients by group Lasso, group SCAD, and DP, respectively.

https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Fig5_HTML.jpg
Figure 5

Synthetic data. From top to bottom: True TV-AR coefficients change, changes detected by group Lasso, group SCAD, and DP, respectively.

7.2. Real data: piano sound

Next, a piano sound of 0.5 s comprising three monochromatic notes is sampled at 8 kHz to obtain N + L + 1 = 4000 samples. The signal is depicted in Figure 6. A TV-AR model with L = 1 is adopted, and λ is selected to be λ*/10. Figure 7 depicts the TV-AR coefficients estimated by the group Lasso and the group SCAD. The estimated d ^ n 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq101_HTML.gif over time is displayed in Figure 8. Observe that the group Lasso captures the correct changes at time instants n ^ 1 = 1632 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq108_HTML.gif and n ^ 2 = 3155 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq109_HTML.gif along with a small false change at time instant n f = 2770. Instead, the group SCAD retrieves two changes at time instants n ^ 1 = 1632 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq108_HTML.gif and n ^ 2 = 3155 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq109_HTML.gif Notice also that group SCAD exhibits a reduced bias in the estimated d ^ n 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq101_HTML.gif which results in better estimates of the TV-AR coefficients.
https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Fig6_HTML.jpg
Figure 6

Piano sound comprising three monochromatic notes.

https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Fig7_HTML.jpg
Figure 7

Piano sound. Estimated of the TV-AR coefficients by group Lasso (top), and group SCAD (bottom).

https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Fig8_HTML.jpg
Figure 8

Piano sound. Changes detected by group Lasso (top), and group SCAD (bottom).

7.3. Real data: speech

Here, a speech signal of 0.5 s is sampled at 8 kHz, to obtain N + L + 1 = 4000 samples. The resultant time series depicted in Figure 9 comprises a descent diphthong /ai/ followed by an /o/, pronounced by another party. A TV-AR model with L = 8 is adopted, and λ is selected to be λ*/10. The change of vocoid in the diphthong occurs approximately at time instant n1 = 1500, while the /o/ occurs approximately at n2 = 3000. Figure 10 shows the TV-AR coefficients estimated by the group Lasso and group SCAD. In agreement with [17], the group Lasso tends to declare a cloud of change points in the proximity of an actual change, while the jumps of the group SCAD estimates are very sharp. Figure 11 depicts ║d n 2 over time. The group SCAD detects four segments with change points at n ^ 1 = 1065 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq110_HTML.gif, n ^ 2 = 1606 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq111_HTML.gif and n ^ 3 = 2993 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq112_HTML.gif. Clearly, the first segment corresponds to the /a/, the second, which is the shortest, to the transition of the diphthong, the third to the /i/, and the forth to the /o/. Observe that the group Lasso exhibits peaks around the actual change instants, and requires post-processing via either DP as advocated in [17], or, by simply peak picking.
https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Fig9_HTML.jpg
Figure 9

Speech signal: /ai//o/.

https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Fig10_HTML.jpg
Figure 10

Speech signal. Estimated of the TV-AR coefficients by group Lasso (top), and group SCAD (bottom).

https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Fig11_HTML.jpg
Figure 11

Speech signal. Changes detected by group Lasso (top), and group SCAD (bottom).

Further tests are performed on a sampled speech that has been widely adopted for TV-AR speech segmentation [35, 7]. According to [7, p. 401], this signal belongs to a database designed by the French National Agency for Telecommunications (CNET) for testing and evaluating speech recognition algorithms. It consists of a noisy French speech recorded in a car at sampling frequency 8 kHz, prefiltered by a highpass filter with cutoff frequency equal to 150 Hz, and quantized with 16 bits per sample. The time series is shown in Figure 12 along with the changes caught by the group SCAD change detector, and the GLR algorithm of [18] whose results are reported in [35] for L = 2. The group SCAD is tuned to return the same number of changes as the GLR. The detected change instants are listed in Table 1.
https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Fig12_HTML.jpg
Figure 12

French speech signal and changes detected by the group SCAD (dot) and GLR (dash) change TV-AR detectors for L = 2.

Table 1

Change instants detected by the group SCAD and GLR algorithm

Group SCAD

799

1217

1570

1742

2017

2359

2814

3668

GLR

445

645

1550

1750

2151

2797

3400

3626

The first change detected by the GLR is at sample 445, while this change is not detected by the group SCAD. Interestingly, it is reported in [3] that this change is not relevant for segmentation purposes, and this fact is apparent by inspection of the true signal. Both algorithms have detected changes around samples 1750, 2100, 2800, and 3650. The group SCAD successfully removed the false change detected by the GLR at sample 3400. By inspecting the original signal, this change does not seem to be relevant. The group SCAD had detected a change instant around sample 1300 unlike the GLR. This change is also detected by advanced Bayesian techniques reported in [35]. The group SCAD has detected a change at sample 779, while the GLR at sample 645. Indeed, inspection of the original signal suggests that the detection of the GLR is preferable. Surprisingly, the group SCAD, unlike the GLR and the Bayesian techniques of [35], has detected a change at sample 2359. Observing the original signal around this point, there is a clear amplitude modulation that may cause a change in the TV-AR coefficients, which existing algorithms have passed undetected.

A way to univocally compare the two algorithms is via the segmented prediction error (SPE). Assuming that K changes have been detected at instants { n ^ k } k = 1 K https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq113_HTML.gif, let â k https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq114_HTML.gif denotes the LS estimates of the AR model in the k th segment, i.e., â k = arg min a L n = n ^ k n ^ k + 1 - 1 y n - h n a 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq115_HTML.gif. The SPE is defined as SPE : = k = 0 K n = n ^ n ^ k + 1 - 1 ( y n - h n â k ) 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq116_HTML.gif, and represents the error in approximating the original signal {y n } with a TV-AR model exhibiting abrupt changes at instants { n ^ k } k = 1 K https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq113_HTML.gif. The GLR segmentation entails SPEglr = 0.2638, while SPEg-scad = 0.2578 for the group SCAD. Clearly, the group SCAD based segmentation seems preferable to that of the GLR algorithm.

Finally, since general analysis of the convergence rate for the block-coordinate descent algorithm is not available, a simulated test assessing its converge speed is performed. Figure 13 depicts the normalized step size amplitude variations, namely d ( i ) - d ( i - 1 ) 2 2 / ( d ( i ) 2 2 ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq117_HTML.gif, across the iteration index (i) for the first LLA of the group SCAD (which amounts to a group Lasso since d ^ [ 0 ] = 0 ( N + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq99_HTML.gif for the speech signal in Figure 12. Observe that after a few hundred iterations, the normalized step amplitude size drops below 10-6. For practical purposes, the solution at this stage might be acceptable. Progressively, the speed of convergence slows down since the algorithm has to decide whether some components are truly zero or have small amplitudes, and large jumps in d ( i ) - d ( i - 1 ) 2 2 / ( d ( i ) 2 2 ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq117_HTML.gif correspond to iteration indexes where the small components are set to zero. Once the correct vector support is determined, the speed of converge is fast. It is worth pointing out that only 10 min are required to perform 5,000 iterations of the block-coordinate descent (implemented with Matlab) for the problem at hand, whereas advanced Bayesian techniques might require several hours (see e.g., [3]).
https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Fig13_HTML.jpg
Figure 13

Speed of the block-coordinate descent for the first LLA of the group SCAD.

8. Concluding remarks

Novel estimators were developed in this article for identification of piecewise-constant TV-AR models by exploiting recent advances in variable selection and compressive sampling. While traditional techniques consist in regularizing a LS criterion with the total number of coefficient changes, the novel estimator relies on a convex regularization function, which resembles the group Lasso and can afford efficient implementation using block-coordinate descent iterations. The latter incurs computational burden that scales linearly with the number of data samples, thus being particularly attractive for large-size problems. Regularization tuning issues are discussed along with conditions for uniqueness of the estimated piecewise-constant AR model. An alternative group smoothly-clipped absolute deviation regularization is also introduced, and an algorithm based on iterative weighted group Lasso minimizations is developed. Numerical tests using synthetic and real data confirm that the developed algorithms can effectively identify piecewise-constant AR models of large size at manageable complexity, and outperform heuristic alternatives that are based on the GLRT.

Appendix 1: proof of proposition 2

The necessary and sufficient first-order optimality condition for d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq79_HTML.gif to be the (unconstrained) minimum of (12), is that the subgradient of J(d) in (13) evaluated at d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq79_HTML.gif contains the zero vector [[36], p. 92]; i.e.,
J ( d ^ ) 0 ( N + 1 ) L . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ39_HTML.gif
(39)
Defining w ( N + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq118_HTML.gif as w := X T (Xd - y) and w = [ w 0 T , w 1 T , , w N T ] T https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq119_HTML.gif, with w n L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq120_HTML.gif for n = 0,1,..., N, the subgradient of J(d) evaluated at d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq79_HTML.gif is given by
J ( d ^ ) = w + λ b https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ40_HTML.gif
(40)
where b : = [ b 0 T , b 1 T , , b N T ] T ( N + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq121_HTML.gif and
b n : = 0 L , n = 0 d ^ n d ^ n 2 , n = 1 , , N , d ^ n 0 L s n , n = 1 , , N . d ^ n 0 L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ41_HTML.gif
(41)
with s n L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq122_HTML.gif such that ║s n 2 ≤ 1. Using (40) and (41), Equation (39) translates to the following conditions:
  1. (c1)

    w0 = 0 l ; and,

     
  2. (c2)

    w n + λ d ^ n d ^ n 2 = 0 L , if d ^ n 0 L w n 2 λ if d ^ n 0 L for n = 1 , , N . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq123_HTML.gif

     

The change-free solution corresponds to having d ^ n = 0 L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq75_HTML.gif for n = 1,..., N. Thus, (c1) implies that X 0 T ( X 0 d ^ 0 - y ) = 0 L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq124_HTML.gif, which is uniquely satisfied by d ^ 0 = d 0 , c https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq74_HTML.gif, since X0 has full column rank. Hence, d ^ 0 = d 0 , c https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq74_HTML.gif and d ^ n = 0 L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq75_HTML.gif for n = 1,..., N hold if and only if (c2) is satisfied, which corresponds to ║w n 2 ≤ λ for n = 1,..., N. Since w n = X n T ( X 0 d 0 , c - y ) https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq125_HTML.gif, condition (c2) is satisfied if and only if λ λ * : = max n = 1 , , N X n T ( X 0 d 0 , c - y ) 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq126_HTML.gif.

Appendix 2: proof of lemma 1

Observe that
X G T X G = R s G : N R s G : N R s G : N R s G : N R s G : N R s G : N R s G : N R s G : N R s G : N + R s G - 1 : s G - 1 R s G - 1 : s G - 1 0 L × L R s G - 1 : s G - 1 R s G - 1 : s G - 1 0 L × L 0 L × L 0 L × L 0 L × L + + R 0 : s 1 - 1 0 L × L 0 L × L 0 L × L 0 L × L 0 L × L 0 L × L 0 L × L 0 L × L 0 L × L 0 L × L 0 L × L = h s G h s G h s G h s G h s G h s G h s G h s G T + + h N h N h N h N h N h N h N h N T N - s G + 1 + h s G - 1 h s G - 1 h s G - 1 0 L h s G - 1 h s G - 1 h s G - 1 0 L T + + h s G - 1 h s G - 1 h s G - 1 0 L h s G - 1 h s G - 1 h s G - 1 0 L T s G - s G - 1 + + h 0 0 L 0 L 0 L h 0 0 L 0 L 0 L T + + h s 1 - 1 0 L 0 L 0 L h s 1 - 1 0 L 0 L 0 L T s 1 . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ42_HTML.gif
(42)

Notice that the first sub-sum in (42) comprises N - s G + 1 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq127_HTML.gif rank-1 matrices, the last sub-sum comprises s1 rank-1 matrices, while the g th sub-sum comprises s G - g + 2 - s G - g + 1 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq128_HTML.gif rank-1 matrices for g = 2 , , G https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq129_HTML.gif. Since https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq82_HTML.gif is such that s 1 L , s G N - L + 1 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq130_HTML.gif, and |s j - s k | ≥ L for each jk, and any L out of N + 1 vectors { h n } n = 0 N https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq131_HTML.gif are linearly independent, each of the summands in (42) has rank L. Thus, it is possible to find L linearly independent vectors { h 1 , } = 1 L L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq132_HTML.gif such that the first sub-sum in (42) equals to = 1 L h ̃ 1, h ̃ 1, T https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq133_HTML.gif with h ̃ 1 , : = [ h 1 , T , , h 1 , T ] T ( | G | + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq134_HTML.gif. Analogously, it is possible to find L linearly independent vectors { h g , } = 1 L L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq135_HTML.gif such that the g th sub-sum in (42) can be written as = 1 L h ̃ g , h ̃ g , T https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq136_HTML.gif with h ̃ g , : = [ h g , T , , h g , T G - g + 2 , 0 L T , , 0 L T g - 1 ] T ( G + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq137_HTML.gif for g = 2 , , G https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq129_HTML.gif. Finally, it is possible to find L linearly independent vectors { h G + 1 , } = 1 L L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq138_HTML.gif such that the last sub-sum in (42) can be written as = 1 L h ̃ G + 1 , h ̃ G + 1 , T https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq139_HTML.gif with h ̃ G + 1 , : = [ h G + 1 , T , 0 L T , , 0 L T ] T ( G + 1 ) L https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq140_HTML.gif. Thus, X G T X G = g = 1 G + 1 = 1 L h ̃ g , h ̃ g , T https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq141_HTML.gif, and since { h ̃ g , } https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq142_HTML.gif are (| https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq82_HTML.gif | + 1) linearly independent vectors, X G https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq87_HTML.gif has full-column rank.

Appendix 3: proof of proposition 3

Suppose that d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq79_HTML.gif and d ^ ' https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq143_HTML.gif are two solutions of (12) with the same g-support https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq82_HTML.gif . Let X G https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq87_HTML.gif denotes the matrix obtained by selecting the columns of X relative to the nonzero indexes dictated by the set { 0 } G https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq144_HTML.gif. The minimization of (13) over the vector having g-support in https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq82_HTML.gif amounts to
u ^ : = arg min u : = [ u 0 T , u 0 T , , u G T ] T ( G + 1 ) L 1 2 y - X G u 2 2 + λ s = 1 G u s 2 . https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_Equ43_HTML.gif
(43)

From Lemma 1, X G https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq87_HTML.gif has full column rank which implies that 1 2 y - X G u 2 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq145_HTML.gif is strictly convex, and so is 1 2 y - X G u 2 2 + λ s = 1 G u s 2 https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq146_HTML.gif. Thus, (43) admits a unique solution.

Since both d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq79_HTML.gif and d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq143_HTML.gif are g-supported in https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq82_HTML.gif by hypothesis, and their restrictions to { 0 } G https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq144_HTML.gif are equal to û, it follows readily that d ^ = d ^ https://static-content.springer.com/image/art%3A10.1186%2F1687-6180-2012-70/MediaObjects/13634_2011_Article_224_IEq147_HTML.gif.

Declarations

Acknowledgements

This work was supported by MURI (AFOSR FA9550-10-1-0567) grant.

Authors’ Affiliations

(1)
Asea Brown Boveri (ABB) Corporate Research Center
(2)
Department of Electrical and Computer Engineering, University of Minnesota

References

  1. Stoica P, Moses RL: Introduction to Spectral Analysis. Prentice-Hall, New Jersey; 1997.MATH
  2. Djuric PM: A MAP solution to off-line segmentation of signals. In Proc of the International Conference on Acoustics, Speech, and Signal Processing. Volume 4. Adelaide, Australia; 1994:505-508.
  3. Dobigeon N, Tourneret J-Y, Davy M: Joint segmentation of piecewise constant autoregressive processes by using a hierarchical model and a Bayesian sampling approach. IEEE Trans Signal Process 2007, 55(4):1251-1263.MathSciNetView Article
  4. Fearnhead P: Exact Bayesian curve fitting and signal segmentation. IEEE Trans Signal Process 2005, 53(6):2160-2166.MathSciNetView Article
  5. Punskaya E, Andrieu C, Doucet A, Fitzgerald WJ: Bayesian curve fitting using MCMC with applications to signal segmentation. IEEE Trans Signal Process 2002, 50(3):747-758.View Article
  6. Lavielle M: Optimal segmentation of random processes. IEEE Trans Signal Process 1998, 46(5):1365-1373.View Article
  7. Basseville M, Nikiforov IV: Detection of Abrupt Changes: Theory and Application. Prentice-Hall, Englewood Cliffs, NJ, USA; 1993.
  8. Hall MG, Oppenheim AV, Willsky AS: Time-varying parametric modeling of speech. Signal Process 1983, 5(3):267-285.View Article
  9. Rudoy D, Quatieri TF, Wolfe PJ: Time-varying autoregressive tests for multiscale speech analysis. In Proceedings of Interspeech. Volume 1. Brighton, UK; 2009:2839-2842.
  10. Yuan M, Lin Y: Model selection and estimation in regression with grouped variables. J Royal Stat Soc, Ser B 2006, 68(1):49-67.MathSciNetView ArticleMATH
  11. Brockwell PJ, Davis RA: Time Series: Theory and Methods. Springer-Verlag, New York, NY, USA; 1990.MATH
  12. Boysen L, Kempe A, Liebscher V, Munk A, Wittich O: Consistencies and rates of convergence of jump-penalized least-squares estimators. Annals Stat 2009, 37(1):157-183.MathSciNetView ArticleMATH
  13. Lavielle M: Using penalized contrasts for the change-point problem. Signal Process 2005, 85(8):1501-1510.MathSciNetView ArticleMATH
  14. Lavielle M, Moulines E: Least-squares estimation of an unknown number of shifts in a time series. J Time Series Anal 2000, 21(1):33-59.MathSciNetView ArticleMATH
  15. Lebarbier E: Detecting multiple change-points in the mean of Gaussian process by model selection. Signal Process 2005, 85(4):717-736.View ArticleMATH
  16. Winkler G, Liebscher V: Smoothers for discontinuous signals, J. Nonparametric Stat 2002, 14(1-2):203-222.MathSciNetView ArticleMATH
  17. Harchaoui Z, Levy-Leduc C: Catching change-points with Lasso. In Proceedings of the Advanced Neural Information Processes Systems. Volume 20. Vancouver, Canada; 2008:161-168.
  18. Appel U, Brandt AV: Adaptive sequential segmentation of piecewise stationary time series. Inf Sci 1983, 29(1):27-56.View ArticleMATH
  19. Willsky A, Jones H: A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Trans Autom Control 1976, 21(1):108-112.MathSciNetView ArticleMATH
  20. Eldar YC, Mishali M: Block sparsity and sampling over a union of subspaces. In Proc of the 16th International Conference on Digital Signal Processing. Volume 1. Santorini, Greece; 2009:1-8.
  21. Tibshirani R: Regression shrinkage and selection via the Lasso. J Royal Stat Soc Ser B 1996, 58(1):267-288.MathSciNetMATH
  22. Fan J, Li R: Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 2001, 96: 1348-1360.MathSciNetView ArticleMATH
  23. Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Annals Stat 2004, 32: 407-499.MathSciNetView ArticleMATH
  24. Sturm JF: Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optim Methods Softw 1999, 11-12: 625-653.MathSciNetView ArticleMATH
  25. Puig A, Wiesel A, Hero A: A multidimensional shrinkage-thresholding operator. In Proceedings of the 15th Workshop on Statistical Signal Processing. Volume 18. Cardiff, UK; 2009:363-366.
  26. Tseng P: Convergence of a block coordinate descent method for nondifferentiable minimization. J Optim Theory Appl 2001, 109(3):475-494.MathSciNetView ArticleMATH
  27. Eldar YC, Kuppinger P, Bölcskei H: Block-sparse signals: Uncertainty relations and efficient recovery. IEEE Trans Signal Process 2010, 58: 3042-3054.MathSciNetView Article
  28. Bach FR: Consistency of the group Lasso and multiple kernel learning. J Mach Learn Res 2008, 9: 1179-1225.MathSciNetMATH
  29. Nardi Y, Rinaldo A: On the asymptotic properties of the group Lasso estimator for linear models. Electron J Stat 2008, 2: 605-633.MathSciNetView ArticleMATH
  30. Fuchs J-J: On sparse representations in arbitrary redundant bases. IEEE Trans. Inf Theory 2004, 50(6):1341-1344.View ArticleMATH
  31. Gorodnitsky I, Rao B: Sparse signal reconstruction from limited data using FOCUSS: a re-weighted minimum norm algorithm. IEEE Trans Signal Process 1997, 45(3):600-616.View Article
  32. Tropp J: Just relax: convex programming methods for identifying sparse signals in noise. IEEE Trans Inf Theory 2006, 52(3):1030-1051.MathSciNetView ArticleMATH
  33. Candes EJ, Wakin MB, Boyd S: Enhancing sparsity by reweighted 1minimization. J Fourier Anal Appl 2008, 14(5):877-905.MathSciNetView ArticleMATH
  34. Zou H, Li R: One-step sparse estimates in nonconcave penalized likelihood models. Annals Stat 2008, 36: 1509-1533.MathSciNetView ArticleMATH
  35. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight k: Sparsity and smoothness via the fused Lasso. J Royal Stat Soc Ser B 2005, 67(1):91-108.MathSciNetView ArticleMATH
  36. Ruszczynski A: Nonlinear Optimization. Princeton University Press, Princeton, NJ, USA; 2006.MATH

Copyright

© Angelosante and Giannakis; licensee Springer. 2012

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://​creativecommons.​org/​licenses/​by/​2.​0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.