The cocktail party problem was first described by Colin Cherry in 1953[1]. Cherry and Taylor[2] further worked on this problem, which is captured by the question: “How do we recognize what one person is saying when others are speaking at the same time (the “cocktail party problem”)?”. The problem relates to the situation where there are several people talking simultaneously in a room environment, and we only want to focus on one of them. For human beings, it is easy to focus attention on a target speaker. However, for a machine, it is much more difficult to achieve this goal. Solving the machine cocktail party problem requires the design of a method to focus on the desired speech signal while suppressing or ignoring all the other competing speech sounds[3]. Attempts to solve the machine cocktail party problem have come from the signal processing community in the form of blind source separation (BSS)[4] and generally from the computer science community in the form of computational auditory scene analysis (CASA)[5]. CASA is motivated by understanding the human auditory scene analysis. While our focus in this article is signal processing based approaches such as blind source separation.

To address the BSS problem, many methods have been proposed. Herault and Jutten seem to have been the first who addressed the problem of blind source separation in 1985[6]. In their study, the mixtures are assumed to be instantaneous in the standard BSS problem, which means that the sound would only be transmitted directly from the sources to the microphones without any delay. Common formally established an instantaneous linear mixing model and clearly defined the term independent component analysis in 1994[7]. Meanwhile, he also proposed an algorithm which can measure the independence by capturing higher-order statistics of the sources.

However, for a real room environment, the problem becomes more complicated because the acoustic sources take multiple paths to the microphone sensors due to the reflections from the ground, ceiling and walls. As such a convolutive model is required to describe the sound propagation in a real room environment. Thus the practical speech separation problem becomes a convolutive blind source separation (CBSS) problem. During the last several decades, many efforts have been made on overcoming this problem[8]. Initially, solutions were posed in the time domain. However, since real room impulse responses are typically on the order of thousands of samples in length, the computational cost of these time domain methods renders them impractical. To mitigate this problem, a frequency domain solution was proposed by Parra[9]. As convolution in the time domain corresponds to multiplication in the frequency domain, the transformation into the frequency domain converts the convolutive mixing problem to that of independent complex instantaneous mixing operations at each frequency bin provided the transform block length is not too large. Transformation into the frequency domain reduces the computational cost, but there are two indeterminacies which are inherent to BSS, namely the scaling and permutation ambiguities, which are magnified in the frequency domain operation.

The scaling ambiguities across frequencies can be managed by matrix normalization[4, 10–13]. On the other hand, the permutation ambiguities are more challenging to solve and various methods have been proposed[8]. All of these methods need prior knowledge about the locations of the sources or post-processing exploiting some feature of the separated signals[14, 15]. A new algorithmic approach to mitigate the permutation problem, named independent vector analysis (IVA), was proposed by Kim[16]. This approach can potentially preserve the higher order statistical dependencies and structures of signals across frequencies and thereby mitigate the permutation problem[17]. It avoids the need for post-processing, and thus it is a natural way to overcome the permutation problem. Based on the original IVA method, several extended IVA methods have been recently proposed. An adaptive step size IVA method was proposed to improve the convergence speed by controlling the learning step size[18]. A fast fixed-point IVA method which applies Newton’s method to a contrast function of complex-valued variables was given in[19] which achieves a fast and good separation performance.

For the FastIVA, although it can achieve fast convergence, sometimes it can still suffer a special permutation problem which we term as the “block permutation problem”. Block permutation is different from the classical permutation problem. Block permutation means that the whole frequency range is divided into several blocks, each block containing several frequencies, and the intra-block permutation is consistent, but the inter-block permutation is different. However, the classical permutation problem means that the permutation is likely to be different for each frequency bin. In recent research study[20], a similar problem with the convergence of IVA is termed as “partial permutation”, but without analysis about why this problem can occur. In this article, this special problem is first highlighted and analytically demonstrated, we show that such ill-convergence can be mitigated by setting a good initialization of the unmixing matrix.

Initialization is important for the optimization problem because it can improve the convergence speed by ensuring a short cut convergence path avoiding local minimum points which yield poor separation. Source position information is important prior knowledge for setting a good initialization, and it can be obtained by audio localization or video localization. Audio localization for a single active speaker is difficult because human speech is an intermittent signal and contains much of its energy in the low-frequency bins where spatial discrimination is imprecise. Audio localization can also be affected by noise and room environment. Additionally, audio localization is not always effective due to the complexity in the case of multiple concurrent speakers[21]. Therefore, the accuracy of the audio localization would be degraded in a multisource real room environment with noise and reverberations, but video localization is robust in such an environment. On the other hand, video localization is not always effective, especially when the face of a human being is not visible to at least two cameras due to some obstacles, for example when the environment is cluttered, camera angle is wide, or illumination conditions are varying. For human beings, we not only use our ears to solve the cocktail party problem, but also our eyes. Thus, it is natural to combine video information into the solution. For audio-video combined source separation method, besides the direction of arrival information, another type of combination is using lip reading for separation. For example, Wang et al.[22] and Rivet et al.[13] used this type of audio-video combination to help the separation. However, for a room environment such as AV16.3, it is not possible to do the lip reading due to practical environment. Therefore, we use cameras to capture the locations of the speakers in this article. Then the positions can be used to obtain a smart initialization for the convergence of the learning algorithm. Thus, we propose a new audio video based fast fixed-point independent vector analysis (AVIVA) method, which uses video information to initialize the algorithm. The issue of combined audio-video localization to provide more robust input to the smart initialization is left as future study.

In order to verify the advantages of AVIVA, datasets containing multiple speech and noise signals are used in its evaluation. Most speech separation evaluations have been done by using artificial recordings. Few of them use real room recordings due to the practical constraints. However, in this article, the proposed AVIVA method is tested with real room recordings, i.e., the AV16.3 corpus[23], which not only confirms the advantages of the proposed method, but also confirms the practical advantage of this study.

For real dataset, the separation performance evaluation becomes a problem. There is no objective evaluation method proposed to evaluate such real room recordings. Traditional evaluations are all based on prior knowledge such as the mixing filters or source signals. For instance, the performance index (PI) needs the mixing filters[10], and the signal-to-interference ratio (SIR) or signal-to-distortion ratio (SDR) require the original speech signals[24]. However, for a real recorded dataset, the only information we have is the audio mixtures. Therefore, a new evaluation method is needed without requiring any other prior knowledge. In this article, we employ a new evaluation method based on pitch information. It detects the pitches of all the separated signals, and then calculates the pitch differences between them, and thereby provides an objective relative evaluation between methods.

The article is organized as follows, in Section ‘Fast fixed-point independent vector analysis’, a brief summary of the FastIVA algorithm is provided. The reason for the block permutation problem of FastIVA is analyzed in Section ‘Block permutation problem of FastIVA’. Then the AVIVA approach is proposed in Section ‘Audio video based fast fixed-point independent vector analysis’. The pitch based evaluation method for the real dataset is introduced in Section ‘Pitch based evaluation for real recordings’, and the experimental results by using different multisource datasets are discussed in Section ‘Experiments and results’. Finally, conclusions are drawn in Section ‘Conclusions’.