Audio video based fast fixed-point independent vector analysis for multisource separation in a room environment
© Liang et al.; licensee Springer. 2012
Received: 10 April 2012
Accepted: 27 July 2012
Published: 22 August 2012
Fast fixed-point independent vector analysis (FastIVA) is an improved independent vector analysis (IVA) method, which can achieve faster and better separation performance than the original IVA. As an IVA method, it is designed to solve the permutation problem in frequency domain independent component analysis by retaining the higher order statistical dependency between frequencies during learning. However, the performance of all IVA methods is limited by the dimensionality of the parameter space commonly encountered in practical frequency-domain source separation problems and by the spherical symmetry assumed in the source model. In this article, a particular permutation problem encountered in using the FastIVA algorithm is highlighted, namely the block permutation problem. To address it, a new audio video based fast fixed-point independent vector analysis algorithm is proposed, which uses video information to provide a smart initialization for the optimization problem. The method can not only avoid the ill convergence resulting from the block permutation problem but also improve the separation performance even in noisy and highly reverberant environments. Different multisource datasets, including the real audio video corpus AV16.3, are used to verify the proposed method. For the evaluation of the separation performance on real room recordings, a new pitch based evaluation criterion is also proposed.
The cocktail party problem was first described by Colin Cherry in 1953. Cherry and Taylor worked further on this problem, which is captured by the question: “How do we recognize what one person is saying when others are speaking at the same time (the ‘cocktail party problem’)?” The problem relates to the situation where several people are talking simultaneously in a room and we want to focus on only one of them. For human beings, it is easy to focus attention on a target speaker; for a machine, it is much more difficult. Solving the machine cocktail party problem requires a method that focuses on the desired speech signal while suppressing or ignoring all the competing speech sounds. Attempts to solve the machine cocktail party problem have come from the signal processing community in the form of blind source separation (BSS) and from the computer science community in the form of computational auditory scene analysis (CASA). CASA is motivated by understanding human auditory scene analysis, whereas our focus in this article is on signal processing based approaches such as blind source separation.
To address the BSS problem, many methods have been proposed. Herault and Jutten appear to have been the first to address the problem of blind source separation, in 1985. In their study, as in the standard BSS problem, the mixtures are assumed to be instantaneous, which means that the sound is assumed to travel directly from the sources to the microphones without any delay. Comon formally established an instantaneous linear mixing model and clearly defined the term independent component analysis in 1994. He also proposed an algorithm that measures independence by capturing higher-order statistics of the sources.
However, for a real room environment, the problem becomes more complicated because the acoustic signals take multiple paths to the microphone sensors due to reflections from the floor, ceiling and walls. A convolutive model is therefore required to describe the sound propagation in a real room environment, and the practical speech separation problem becomes a convolutive blind source separation (CBSS) problem. During the last several decades, considerable effort has been devoted to this problem. Initially, solutions were posed in the time domain. However, since real room impulse responses are typically on the order of thousands of samples in length, the computational cost of these time domain methods renders them impractical. To mitigate this problem, a frequency domain solution was proposed by Parra. As convolution in the time domain corresponds to multiplication in the frequency domain, the transformation into the frequency domain converts the convolutive mixing problem into a set of independent complex instantaneous mixing operations, one at each frequency bin, provided the transform block length is not too large. Transformation into the frequency domain reduces the computational cost, but the two indeterminacies inherent to BSS, namely the scaling and permutation ambiguities, are magnified in the frequency domain operation.
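The conversion from convolutive mixing to one instantaneous mixture per frequency bin can be checked numerically. The following is a minimal sketch, assuming circular convolution over a single block so that the identity holds exactly; with real room responses and windowed frames it is only an approximation. The filter length, block length and random signals are illustrative choices, not values from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_fft = 64                      # transform length (illustrative)
n_src, n_mic, taps = 2, 2, 8    # 2x2 case with short toy filters

s = rng.standard_normal((n_src, n_fft))          # one block of source samples
h = rng.standard_normal((n_mic, n_src, taps))    # toy impulse responses h[j, i, :]

# Time domain: each microphone is a sum of circularly convolved sources.
x = np.zeros((n_mic, n_fft))
for j in range(n_mic):
    for i in range(n_src):
        for tau in range(taps):
            x[j] += h[j, i, tau] * np.roll(s[i], tau)

# Frequency domain: one instantaneous complex mixture per bin, X(k) = H(k) S(k).
S = np.fft.fft(s, axis=-1)               # shape (n_src, n_fft)
H = np.fft.fft(h, n=n_fft, axis=-1)      # shape (n_mic, n_src, n_fft)
X_bins = np.einsum("jik,ik->jk", H, S)   # per-bin matrix-vector product

# Largest deviation between the two routes; it is at rounding-error level.
max_err = np.max(np.abs(np.fft.fft(x, axis=-1) - X_bins))
```

Each bin k then has its own 2×2 complex mixing matrix `H[:, :, k]`, which is why the scaling and permutation ambiguities have to be resolved consistently across all bins.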
The scaling ambiguities across frequencies can be managed by matrix normalization [4, 10–13]. The permutation ambiguities, on the other hand, are more challenging to solve, and various methods have been proposed. All of these methods need either prior knowledge about the locations of the sources or post-processing that exploits some feature of the separated signals [14, 15]. A new algorithmic approach to mitigate the permutation problem, named independent vector analysis (IVA), was proposed by Kim et al. This approach can potentially preserve the higher order statistical dependencies and structures of signals across frequencies and thereby mitigate the permutation problem. It avoids the need for post-processing, and thus it is a natural way to overcome the permutation problem. Based on the original IVA method, several extended IVA methods have recently been proposed. An adaptive step size IVA method was proposed to improve the convergence speed by controlling the learning step size. A fast fixed-point IVA method, which applies Newton’s method to a contrast function of complex-valued variables, was also proposed and achieves fast and good separation performance.
Although FastIVA can achieve fast convergence, it can still suffer from a special permutation problem which we term the “block permutation problem”. Block permutation is different from the classical permutation problem. Block permutation means that the whole frequency range is divided into several blocks, each containing several frequencies, where the permutation is consistent within a block but differs between blocks. In the classical permutation problem, by contrast, the permutation is likely to be different for each frequency bin. In a recent study, a similar problem with the convergence of IVA was termed “partial permutation”, but without an analysis of why this problem can occur. In this article, this special problem is first highlighted and analytically demonstrated. We then show that such ill-convergence can be mitigated by a good initialization of the unmixing matrix.
Initialization is important for the optimization problem because it can improve the convergence speed by ensuring a short convergence path that avoids local minimum points which yield poor separation. Source position information is important prior knowledge for setting a good initialization, and it can be obtained by audio localization or video localization. Audio localization for a single active speaker is difficult because human speech is an intermittent signal and contains much of its energy in the low-frequency bins, where spatial discrimination is imprecise. Audio localization can also be affected by noise and the room environment, and it becomes more complex still in the case of multiple concurrent speakers. Therefore, the accuracy of audio localization is degraded in a multisource real room environment with noise and reverberation, whereas video localization is robust in such an environment. On the other hand, video localization is not always effective, especially when the face of a speaker is not visible to at least two cameras because of obstacles, for example when the environment is cluttered, the camera angle is wide, or the illumination conditions vary. Human beings use not only their ears to solve the cocktail party problem, but also their eyes. Thus, it is natural to bring video information into the solution. For audio-video combined source separation methods, besides the direction of arrival information, another type of combination uses lip reading for separation. For example, Wang et al. and Rivet et al. used this type of audio-video combination to help the separation. However, for a room environment such as AV16.3, lip reading is not possible because of the practical recording conditions. Therefore, we use cameras to capture the locations of the speakers in this article. The positions can then be used to obtain a smart initialization for the convergence of the learning algorithm.
Thus, we propose a new audio video based fast fixed-point independent vector analysis (AVIVA) method, which uses video information to initialize the algorithm. The issue of combined audio-video localization to provide more robust input to the smart initialization is left for future study.
In order to verify the advantages of AVIVA, datasets containing multiple speech and noise signals are used in its evaluation. Most speech separation evaluations have been done by using artificial recordings. Few of them use real room recordings due to the practical constraints. However, in this article, the proposed AVIVA method is tested with real room recordings, i.e., the AV16.3 corpus, which not only confirms the advantages of the proposed method, but also confirms the practical advantage of this study.
For a real dataset, evaluating the separation performance becomes a problem. No objective evaluation method has been proposed for such real room recordings. Traditional evaluations are all based on prior knowledge such as the mixing filters or source signals. For instance, the performance index (PI) needs the mixing filters, and the signal-to-interference ratio (SIR) and signal-to-distortion ratio (SDR) require the original speech signals. However, for a real recorded dataset, the only information we have is the audio mixtures. Therefore, a new evaluation method is needed that does not require any other prior knowledge. In this article, we employ a new evaluation method based on pitch information. It detects the pitches of all the separated signals, calculates the pitch differences between them, and thereby provides an objective relative evaluation between methods.
The article is organized as follows. In Section ‘Fast fixed-point independent vector analysis’, a brief summary of the FastIVA algorithm is provided. The reason for the block permutation problem of FastIVA is analyzed in Section ‘Block permutation problem of FastIVA’. Then the AVIVA approach is proposed in Section ‘Audio video based fast fixed-point independent vector analysis’. The pitch based evaluation method for the real dataset is introduced in Section ‘Pitch based evaluation for real recordings’, and the experimental results using different multisource datasets are discussed in Section ‘Experiments and results’. Finally, conclusions are drawn in Section ‘Conclusions’.
Fast fixed-point independent vector analysis
The basic noise-free blind source separation generative model in the time domain is $x(t) = H s(t)$, where, omitting the time index $t$ for convenience, $x = [x_1, \dots, x_m]^T$ is the observed mixed signal vector, $s = [s_1, \dots, s_n]^T$ is the source signal vector, $H$ is the $m \times n$ mixing matrix, and $(\cdot)^T$ denotes the transpose operator. In this article, we focus on the exactly determined case, i.e., $m = n$. Our target is to find the inverse matrix $W$ of the mixing matrix $H$. Due to the scaling and permutation ambiguities, we cannot generally obtain $W$ uniquely. In practice, $W = P D H^{-1}$ and therefore $\hat{s} = W x = P D s$, where $P$ is a permutation matrix, $D$ is a diagonal scaling matrix, and $\hat{s}$ is the estimate of the source signal vector $s$.
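The scaling and permutation ambiguities in the unmixing matrix can be made concrete with a small numerical sketch. The mixing matrix, the sources, and the particular choices of P and D below are hypothetical; the point is only that any W of the form P D H⁻¹ returns the sources up to order and scale.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2
H = rng.standard_normal((n, n))          # hypothetical mixing matrix
s = rng.laplace(size=(n, 1000))          # toy super-Gaussian sources
x = H @ s                                # instantaneous mixtures

P = np.array([[0.0, 1.0], [1.0, 0.0]])   # permutation: swap the two outputs
D = np.diag([2.0, -0.5])                 # arbitrary nonzero scaling
W = P @ D @ np.linalg.inv(H)             # one of many valid unmixing matrices

s_hat = W @ x                            # equals P D s
```

Here the first output is the second source scaled by -0.5 and the second output is the first source scaled by 2, which is as valid a separation as the identity ordering.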
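In the frequency domain, the corresponding separation model is $\hat{s}^{(k)} = W^{(k)} x^{(k)}$, where $x^{(k)} = [x_1^{(k)}, \dots, x_m^{(k)}]^T$ is the observed signal vector in the frequency domain, and $\hat{s}^{(k)} = [\hat{s}_1^{(k)}, \dots, \hat{s}_n^{(k)}]^T$ is the estimated signal vector in the frequency domain. The index $k$ denotes the $k$th frequency bin. It is a multivariate model.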
Independent vector analysis
Traditionally, independent component analysis (ICA) is the central tool for the blind source separation problem. However, ICA cannot solve the permutation ambiguity by itself; it needs prior knowledge of the source positions or post-processing that exploits certain features of the sources. To retain the dependency between different frequency bins, one method is joint blind source separation based on multiset canonical correlation analysis; another widely used method is independent vector analysis, which is the focus of this article. Independent vector analysis is a modification of independent component analysis that adopts multivariate quantities. It can preserve the higher order statistical dependencies between frequency bins and remove the dependencies between sources. Thus it can address the permutation problem during learning without the help of other prior knowledge or post-processing.
where $E[\cdot]$ denotes the statistical expectation operator, $\det(\cdot)$ is the matrix determinant operator, and $K$ is the number of frequency bins. When the cost function is minimized, the dependency between the source vectors is removed while the dependency between the components of each vector is retained.
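For reference, the IVA cost function in the form commonly given in the IVA literature (a Kullback-Leibler divergence between the joint density of the outputs and the product of multivariate source priors) can be sketched as follows. The prior $q(\cdot)$, the constant term, and the exact notation are assumptions here rather than a quotation of this article's own equation.

```latex
J_{\mathrm{IVA}}
  = \mathrm{const}
  - \sum_{k=1}^{K} \log \left| \det W^{(k)} \right|
  - \sum_{i=1}^{n} E\!\left[ \log q\!\left( \hat{s}_i^{(1)}, \dots, \hat{s}_i^{(K)} \right) \right]
```

Because each prior $q$ takes the whole vector $(\hat{s}_i^{(1)}, \dots, \hat{s}_i^{(K)})$ as a single argument, the dependency across frequency bins of one source enters the cost, while different sources appear only through the outer sum over $i$.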
Fast fixed-point independent vector analysis
We next discuss an important convergence problem which is encountered in the FastIVA algorithm.
Block permutation problem of FastIVA
Although it is claimed that IVA algorithms can solve the permutation problem in frequency domain source separation, convergence can be affected by the dimensionality of the parameter space as in time domain algorithms. Moreover, the spherical symmetry of the source model adopted by IVA algorithms can be a weakness. This kind of source model assumes that the dependencies between all frequency bins are the same. However, it is highly likely that the dependencies between frequency bins which are far away from each other are weak. Thus, such spherical symmetry is a constraint which can lead to a block permutation problem. The block permutation problem means the separation alignment is different for blocks of frequency bins, which is a different problem from the conventional permutation problem in the frequency domain independent component analysis approaches. In this section, we will analyze the block permutation problem based on a 2×2 exactly determined case when using the FastIVA algorithm.
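The difference between the classical and the block permutation problem can be illustrated by looking at which output ordering is chosen in each frequency bin. The sketch below is only an illustration: the per-bin labels (0 for the correct ordering, 1 for swapped outputs) are hypothetical, and `permutation_runs` is a helper introduced here, not part of FastIVA.

```python
import numpy as np

def permutation_runs(perm_per_bin):
    """Group consecutive frequency bins that share the same permutation label."""
    runs = []
    start = 0
    for k in range(1, len(perm_per_bin) + 1):
        if k == len(perm_per_bin) or perm_per_bin[k] != perm_per_bin[start]:
            runs.append((start, k - 1, int(perm_per_bin[start])))
            start = k
    return runs

K = 16  # number of frequency bins (illustrative)

# Block permutation: consistent inside each block, different between blocks.
block = np.array([0] * 6 + [1] * 10)
block_runs = permutation_runs(block)          # [(0, 5, 0), (6, 15, 1)]

# Classical permutation: the ordering may change at any bin.
rng = np.random.default_rng(2)
classical = rng.integers(0, 2, size=K)
classical_runs = permutation_runs(classical)
```

In the block case each output signal is assembled from the low band of one source and the high band of the other, so the time-domain outputs remain mixtures even though every individual bin is well separated.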
where $G'(\cdot)$ denotes the derivative of $G(\cdot)$.
is satisfied, the cost function has the same value, i.e., $J = J_1$. This indicates that FastIVA incurs no penalty for converging to a block permutation solution, which is then as much a global minimum as the correct solution. For the case of more sources, a similar analysis can be used to confirm that block permutation can happen.
To confirm that the problem occurs regularly, we chose different speech signals randomly from Hiroshi Sawada’s dataset (http://www.kecl.ntt.co.jp/icl/signal/sawada/demo/bss2to4/index.html) and positioned them at a variety of locations in a room environment to generate microphone measurements using the image method. The FastIVA method was then used to separate them. We found that approximately 30% of the mixtures suffer from the block permutation problem, which justifies the need to overcome this ill-convergence. Moreover, when the block permutation problem happens and the separated signals in the frequency domain are transformed back into the time domain, the mixtures are not separated at all. It is therefore a significant problem for FastIVA, and a good initialization is needed for FastIVA to converge to the correct global minimum point.
Audio video based fast fixed-point independent vector analysis
Based on the analysis and discussion in the above section, it is necessary to set a proper initialization for the FastIVA algorithm to mitigate the block permutation problem. Moreover, a proper initialization can also achieve faster convergence and better performance, as is common for optimization problems. In addition, such a video localization based algorithm can improve the separation performance especially when there is background noise and a highly reverberant room environment, because audio localization can be seriously affected by such noise and reverberation.
First, video localization based on face and head detection is used to obtain the visual location of each speaker. The location is approximated by processing the 2D image information obtained from at least two synchronized color video cameras, using calibration parameters and an optimization method.
where the three speaker coordinates give the 3D position of speaker i, and the three reference coordinates are the Cartesian coordinates of the center of the microphone array.
and κ = k/c, where c is the speed of sound in air at room temperature. The three microphone coordinates give the 3D position of the i-th microphone.
where Q is the whitening matrix. The above estimate, albeit biased, can be used as the initialization of the unmixing matrix of FastIVA rather than an identity matrix or a random matrix. Real room recordings will be used to test the proposed method, and an evaluation criterion for real room recordings is presented in the following section.
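The idea of turning source positions into an initial unmixing matrix can be sketched as follows. This is a minimal illustration under stated assumptions, not the article's exact formulation: it uses a free-field (direct path only) propagation model with 1/distance attenuation, takes the wavenumber as 2πf/c with c = 343 m/s, inverts the geometric mixing estimate directly, and omits the whitening matrix Q. The speaker positions and the function names are hypothetical; the microphone positions, sampling frequency and DFT length are those of the simulated experiments.

```python
import numpy as np

C_SOUND = 343.0  # m/s, assumed speed of sound at room temperature

def steering_matrix(src_pos, mic_pos, freq_hz):
    """Free-field mixing estimate H[j, i] at one frequency (an assumed model)."""
    src_pos = np.asarray(src_pos, dtype=float)   # shape (n_src, 3)
    mic_pos = np.asarray(mic_pos, dtype=float)   # shape (n_mic, 3)
    # Distance from every source i to every microphone j.
    dist = np.linalg.norm(mic_pos[:, None, :] - src_pos[None, :, :], axis=-1)
    kappa = 2.0 * np.pi * freq_hz / C_SOUND      # wavenumber
    return np.exp(-1j * kappa * dist) / dist

def init_unmixing(src_pos, mic_pos, freqs_hz):
    """Per-bin initial unmixing matrices as (pseudo-)inverses of the estimate."""
    return np.stack([np.linalg.pinv(steering_matrix(src_pos, mic_pos, f))
                     for f in freqs_hz])

mics = [[3.48, 2.50, 1.50], [3.52, 2.50, 1.50]]     # microphone positions (m)
srcs = [[2.50, 3.50, 1.50], [4.50, 3.50, 1.50]]     # hypothetical speaker positions (m)
fs, n_fft = 8000, 1024
freqs = np.arange(1, n_fft // 2 + 1) * fs / n_fft   # positive-frequency bins, DC skipped
W0 = init_unmixing(srcs, mics, freqs)               # shape (512, 2, 2)
```

Because every bin is initialized from the same geometry, all bins start with the same source ordering, which is the property that helps the iteration stay away from block permutation solutions.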
Pitch based evaluation for real recordings
In this article, we use real datasets with multiple signals to test the algorithm, so evaluating the separation performance becomes an issue. For a real recording, the only measurements we obtain are the mixed signals captured by the microphone array. We can access neither the mixing matrix nor the clean source signals. Thus, we cannot evaluate the separation performance by traditional methods, such as the performance index, which is based on prior knowledge of the mixing matrix, or the SIR and SDR, which require prior knowledge of the source signals. Objectively evaluating the separation performance on real recordings is therefore a difficult problem. We can listen to the separated speech signals, but that is only a form of subjective evaluation. In order to evaluate the results objectively, features of the separated signals should be used. Pitch information is one such feature, because different speech sections at different time slots have different pitches, provided that the original sources do not have substantially overlapping pitch characteristics. We adopt the sawtooth waveform inspired pitch estimator (SWIPE) method, which performs better than traditional pitch estimators.
The separation performance improves as the separation rate increases. We highlight that the separation rate cannot evaluate the absolute quality of a separated speech signal, but it can be used to compare the separation performance of different separation methods.
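One plausible way to turn pitch tracks into a separation rate is sketched below. It is an assumption made for illustration, not the article's definition: a frame counts as separated when the pitches of all voiced outputs differ by more than a threshold, and the rate is the fraction of such frames among the frames where at least two outputs are voiced. The 5 Hz threshold, the function name, and the toy pitch values are hypothetical; in practice the tracks would come from a pitch estimator such as SWIPE.

```python
import numpy as np

def separation_rate(pitch_tracks, min_diff_hz=5.0):
    """Fraction of comparable frames in which all output pitches are distinct."""
    tracks = np.asarray(pitch_tracks, dtype=float)   # (n_outputs, n_frames), NaN = unvoiced
    n_out, n_frames = tracks.shape
    separated = np.ones(n_frames, dtype=bool)
    compared = np.zeros(n_frames, dtype=bool)
    for a in range(n_out):
        for b in range(a + 1, n_out):
            both = ~np.isnan(tracks[a]) & ~np.isnan(tracks[b])
            # Frames where one output is unvoiced cannot show a shared pitch.
            diff = np.where(both, np.abs(tracks[a] - tracks[b]), np.inf)
            separated &= diff > min_diff_hz
            compared |= both
    if not compared.any():
        return float("nan")
    return float(np.mean(separated[compared]))

nan = float("nan")
toy_tracks = [[100.0, 102.0, nan, 110.0],     # output 1 pitch per frame (Hz)
              [150.0, 155.0, 160.0, 111.0]]   # output 2 pitch per frame (Hz)
rate = separation_rate(toy_tracks)            # 2 of the 3 comparable frames are separated
```

As with the criterion described above, such a number is only meaningful relative to another method evaluated on the same recording and time slot.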
Experiments and results
In this section, we present three kinds of experimental results, obtained with different multisource datasets, to show the advantages of the proposed AVIVA algorithm. The first experiment shows that the proposed AVIVA algorithm can successfully avoid the block permutation problem. The second experiment demonstrates the advantage of AVIVA in terms of convergence speed and separation performance in a noisy environment and in a highly reverberant environment. The positions of the source speech signals are assumed known in these two experiments, and the initialization is based on these positions. The last experiment applies the proposed method to a real multisource dataset, where the 3D video localizer is used to capture the source positions.
Experimental demonstration of the block permutation problem
For the real room recordings, we cannot obtain the mixing filters, and therefore we cannot observe the block permutation visually. In the first simulation, we assume that we know the source signals and mixing filters in order to demonstrate the block permutation problem experimentally. The speech signals are from Hiroshi Sawada’s dataset, available at http://www.kecl.ntt.co.jp/icl/signal/sawada/demo/bss2to4/index.html. Each speech signal is approximately 7 s long. The image method was used to generate the room impulse responses, and the size of the room is [7, 5, 3] m (length, width and height, respectively). The DFT length is 1024, and RT60 = 200 ms. We use a 2×2 mixing case, for which the microphone positions are [3.48, 2.50, 1.50] and [3.52, 2.50, 1.50], respectively, in Cartesian coordinates. The sampling frequency is 8 kHz. The separation performance is evaluated objectively by the performance index (PI), the signal-to-distortion ratio (SDR) and the signal-to-interference ratio (SIR). The toolbox used for calculating the SDR and SIR can be obtained from http://sisec2010.wiki.irisa.fr/tiki-index.php.
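When the mixing filters are known, the performance index can be computed per frequency bin from the global matrix G = W H. The sketch below uses a widely used Amari-style form of the index, which is an assumption here rather than a quotation of the article's formula; `is_swapped` is a hypothetical helper for the 2×2 case that reports which ordering dominates in a bin, which is what makes block permutation visible.

```python
import numpy as np

def performance_index(G):
    """Amari-style performance index of G = W H (0 means perfect separation)."""
    A = np.abs(np.asarray(G))
    row_term = np.mean(np.sum(A / A.max(axis=1, keepdims=True), axis=1) - 1.0)
    col_term = np.mean(np.sum(A / A.max(axis=0, keepdims=True), axis=0) - 1.0)
    return float(row_term + col_term)

def is_swapped(G):
    """True when the anti-diagonal of a 2x2 global matrix dominates."""
    A = np.abs(np.asarray(G))
    return bool(A[0, 1] * A[1, 0] > A[0, 0] * A[1, 1])

perfect = np.array([[0.0, 2.0], [-0.5, 0.0]])    # permuted and scaled identity
unseparated = np.ones((2, 2))                     # every output contains both sources equally
pi_perfect = performance_index(perfect)           # 0.0
pi_unseparated = performance_index(unseparated)   # 2.0
```

The index is insensitive to permutation and scaling, so a block-permuted solution can score well in every bin; plotting `is_swapped` against the bin index is what reveals the inconsistent ordering between blocks.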
Separation performance comparison when block permutation problem happens
These simulations confirm that the block permutation problem can happen, and the experimental results verify that the AVIVA algorithm can successfully avoid it by using a proper initialization.
Experiments in noisy and reverberant room environment
In the second simulation, we show the separation performance of the AVIVA approach in a noisy environment, which represents a multisource case. Moreover, we also show that the AVIVA approach can achieve better separation performance in a highly reverberant environment. The positions of the sources and microphones are assumed known, and different reverberant environments are generated by changing the absorption coefficients of the image method. We use a 2 × 2 mixing case, for which the microphone positions are [3.48, 2.50, 1.50] and [3.52, 2.50, 1.50], respectively. The noise is assumed to be Gaussian distributed, and its standard deviation is set to 2.5% of the maximum magnitude of the speech signal. We chose different speech signals from the TIMIT dataset. This simulation shows that the AVIVA algorithm is suitable for different kinds of mixtures and can achieve better separation performance with faster convergence in a noisy environment. All the experimental parameters are the same as in the 2 × 2 case in experiment 1. The separation performance is again evaluated by SDR and SIR.
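The noise level described above, a Gaussian standard deviation equal to 2.5% of the maximum signal magnitude, can be generated as in the following sketch. The sinusoid is only a stand-in for a speech signal, and the function name and the seed are illustrative choices.

```python
import numpy as np

def add_gaussian_noise(signal, ratio=0.025, rng=None):
    """Add zero-mean Gaussian noise with std = ratio * max |signal|."""
    rng = np.random.default_rng() if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    sigma = ratio * np.max(np.abs(signal))
    noisy = signal + rng.normal(0.0, sigma, size=signal.shape)
    return noisy, sigma

fs = 8000                                    # sampling frequency used in the experiments
t = np.arange(fs) / fs                       # one second of samples
toy = 0.8 * np.sin(2 * np.pi * 200 * t)      # stand-in for a speech signal
noisy, sigma = add_gaussian_noise(toy, rng=np.random.default_rng(3))
```

With a peak magnitude of 0.8, the resulting noise standard deviation is 0.02.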
Table 2 Separation performance comparison in a noisy environment
The results shown in Table 2 confirm the advantage of the proposed AVIVA algorithm in that it achieves faster convergence and better separation performance in a noisy environment. FastIVA is already a fast algorithm; nevertheless, AVIVA improves the convergence speed by approximately 60%. The separation performance is also generally improved. Compared with the FastIVA algorithm, the average further improvement in SDR is approximately 0.75 dB, and the average further improvement in SIR is approximately 1.4 dB.
Experiments by using the real room recordings
Table 3 Separation performance for the real room recordings, by time slot (s)
Figure 12 shows that the pitches of the mixed signals are all mixed. Figure 13 shows the separation result obtained with FastIVA. Although the pitches are separated to some extent, there are still many mixed pitches. Figure 14 shows the separation result obtained with AVIVA, in which the pitches are separated better than with FastIVA. The objective separation rate is shown in Table 3. We then chose different time slots to repeat the experiment, and the results are also shown in Table 3. It should be highlighted that all three speakers in this experiment are male, and the proposed pitch based evaluation method still works well. The experimental results indicate that the proposed AVIVA algorithm can be used successfully in a real multisource room environment, with faster convergence and better separation performance than FastIVA.
Conclusions
In this article, we first analyzed the block permutation problem of independent vector analysis methods. We then proposed an AVIVA algorithm which uses the geometric information obtained from video to set a proper initialization. The proposed algorithm can avoid the block permutation problem of independent vector analysis methods. Moreover, it can also achieve faster and better separation performance in a noisy environment and a highly reverberant environment when compared with FastIVA. We also proposed a pitch based evaluation method for real multisource datasets, which does not need any prior information such as the mixing filters or source signals. The experimental results confirm the advantages of the proposed AVIVA algorithm and also verify that the proposed pitch based evaluation method can be used for comparing separation performance.
References
1. Cherry C: Some experiments on the recognition of speech, with one and with two ears. J. Acoust. Soc. Am. 1953, 25: 975-979. 10.1121/1.1907229
2. Cherry C, Taylor W: Some further experiments upon the recognition of speech, with one and with two ears. J. Acoust. Soc. Am. 1954, 26: 554-559. 10.1121/1.1907373
3. Haykin S, Chen Z: The cocktail party problem. Neural Comput. 2005, 17: 1875-1902. 10.1162/0899766054322964
4. Comon P, Jutten C: Handbook of Blind Source Separation: Independent Component Analysis and Applications. (Academic Press, San Diego, CA, 2009)
5. Cooke M, Ellis D: The auditory organization of speech and other sources in listeners and computational models. Speech Commun. 2001, 35: 141-177. 10.1016/S0167-6393(00)00078-9
6. Herault J, Jutten C, Ans B: Detection de grandeurs primitives dans un message composite par une architecture de calcul neuromimetique en apprentissage non supervisé. In Proc. GRETSI, vol 2. (Nice; 1985):pp. 1017-1022.
7. Comon P: Independent component analysis, a new concept? Signal Process. 1994, 36: 287-314. 10.1016/0165-1684(94)90029-9
8. Pedersen MS, Larsen J, Kjems U, Parra LC: A survey of convolutive blind source separation methods. In Springer Handbook on Speech Processing and Speech Communication. (Springer, New York; 2007):pp. 1-34.
9. Parra L, Spence C: Convolutive blind separation of non-stationary sources. IEEE Trans. Speech Audio Process. 2000, 8: 320-327. 10.1109/89.841214
10. Cichocki A, Amari S: Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. (Wiley, New York, 2003)
11. Hyvarinen A, Karhunen J, Oja E: Independent Component Analysis. (Wiley, New York, 2001)
12. Naqvi SM, Zhang Y, Tsalaile T, Sanei S, Chambers JA: A multimodal approach for frequency domain independent component analysis with geometrically-based initialization. In Proc. EUSIPCO. (Lausanne, Switzerland; 2008).
13. Rivet B, Girin L, Jutten C: Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures. IEEE Trans. Audio Speech Lang. Process. 2007, 15: 96-108.
14. Parra L, Alvino C: Geometric source separation: merging convolutive source separation with geometric beamforming. IEEE Trans. Speech Audio Process. 2002, 10: 352-362. 10.1109/TSA.2002.803443
15. Murata N, Ikeda S, Ziehe A: An approach to blind source separation based on temporal structure of speech signals. Neurocomputing. 2001, 41: 1-24. 10.1016/S0925-2312(00)00345-3
16. Kim T, Lee I, Lee TW: Independent vector analysis: definition and algorithms. In Fortieth Asilomar Conference on Signals, Systems and Computers 2006. (Asilomar, USA; 2006):pp. 1393-1396.
17. Kim T, Attias H, Lee S, Lee T: Blind source separation exploiting higher-order frequency dependencies. IEEE Trans. Audio Speech Lang. Process. 2007, 15: 70-79.
18. Liang Y, Naqvi S, Chambers J: Adaptive step size independent vector analysis for blind source separation. In 17th International Conference on Digital Signal Processing. (Corfu, Greece; 2011):pp. 1-6.
19. Lee I, Kim T, Lee TW: Fast fixed-point independent vector analysis algorithms for convolutive blind source separation. Signal Process. 2007, 87: 1859-1871. 10.1016/j.sigpro.2007.01.010
20. Itahashi T, Matsuoka K: Stability of independent vector analysis. Signal Process. 2012, 93: 1809-1820.
21. Maganti HK, Gatica-Perez D, McCowan I: Speech enhancement and recognition in meetings with an audio-visual sensor array. IEEE Trans. Audio Speech Lang. Process. 2007, 15: 2257-2269.
22. Wang W, Cosker D, Hicks Y, Sanei S, Chambers JA: Video assisted speech source separation. In Proc. ICASSP, vol 5. (Philadelphia, USA; 2005):pp. 425-428.
23. Lathoud G, Odobez J, Gatica-Perez D: AV16.3: an audio-visual corpus for speaker localization and tracking. In Proceedings of the MLMI’04 Workshop. (Martigny, Switzerland; 2004). LNCS 3361, pp. 182–195
24. Vincent E, Fevotte C, Gribonval R: Performance measurement in blind audio source separation. IEEE Trans. Audio Speech Lang. Process. 2006, 14: 1462-1469.
25. Li YU, Adali T, Wang W, Calhoun VD: Joint blind source separation by multiset canonical correlation analysis. IEEE Trans. Signal Process. 2009, 57(10):3918-3929.
26. Nesta F, Svaizer P, Omologo M: Convolutive BSS of short mixtures by ICA recursively regularized across frequencies. IEEE Trans. Audio Speech Lang. Process. 2011, 19: 624-639.
27. Lee I, Jang GJ, Lee TW: Independent vector analysis using densities represented by chain-like overlapped cliques in graphical models for separation of convolutedly mixed signals. Electron. Lett. 2009, 45: 710-711. 10.1049/el.2009.0945
28. Naqvi SM, Yu M, Chambers JA: A multimodal approach to blind source separation of moving sources. IEEE J. Sel. Topics Signal Process. 2010, 4: 895-910.
29. Tsai RY: A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE J. Robot. Autom. 1987, RA-3: 323-344.
30. Hartley R, Zisserman A: Multiple View Geometry in Computer Vision. (Cambridge University Press, Cambridge, 2001)
31. Shabani H, Kahaei MH: Missing feature mask generation in BSS outputs using pitch frequency. In 17th International Conference on Digital Signal Processing. (Corfu, Greece; 2011):pp. 1-6.
32. Camacho A, Harris JG: A sawtooth waveform inspired pitch estimator for speech and music. J. Acoust. Soc. Am. 2008, 124(3):1638-1652. 10.1121/1.2951592
33. Garofolo JS, Lamel LF, Fisher WM, Fiscus JG, Pallett DS, Dahlgren NL, Zue V: TIMIT acoustic-phonetic continuous speech corpus. (Linguistic Data Consortium, Philadelphia, 1993)
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.