Research Article | Open Access
Speaker Separation and Tracking System
EURASIP Journal on Advances in Signal Processing volume 2006, Article number: 029104 (2006)
Abstract
Replicating human hearing in electronics is nontrivial under two constraints: only two microphones are available (even when more than two people are speaking), and the user carries the device at all times (i.e., a mobile device weighing less than 100 g). Our novel contribution in this area is a two-microphone system that combines blind source separation with speaker tracking. The system handles more than two speakers and overlapping speech in a mobile environment, and it supports a feedback loop from the speaker-tracking stage back to the blind source separation, which can improve separation performance. To develop and optimize this system, we established a novel benchmark, which we present here. Using the introduced complexity metrics, we quantify the tradeoffs between system performance and computational load. Our results show that, in our case, source separation depended significantly more on frame duration than on sampling frequency.
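The tradeoff between frame duration, sampling frequency, and computational load mentioned above can be made concrete with a back-of-the-envelope model of frame-based (STFT) processing: frequency resolution is set by the frame duration alone, while the per-second FFT cost grows with both frame length and sampling rate. The sketch below is a toy illustration of such a complexity metric, not the article's actual benchmark; `stft_cost` and its parameter choices (50% overlap, power-of-two FFT) are assumptions for the example.

```python
import math

def stft_cost(fs_hz, frame_ms, overlap=0.5):
    """Rough FFT operation count per second of audio, one channel.

    Toy complexity metric: frames/second times an ~N log2 N
    per-frame FFT cost. Illustrative only.
    """
    frame_len = int(fs_hz * frame_ms / 1000)      # samples per frame
    nfft = 2 ** math.ceil(math.log2(frame_len))   # next power of two
    hop = int(frame_len * (1 - overlap))          # hop size in samples
    frames_per_sec = fs_hz / hop
    ops_per_fft = nfft * math.log2(nfft)          # ~N log N butterflies
    return frames_per_sec * ops_per_fft

for fs in (8000, 16000):
    for frame_ms in (16, 32, 64):
        print(f"fs={fs:5d} Hz, frame={frame_ms:2d} ms -> "
              f"{stft_cost(fs, frame_ms) / 1e6:6.2f} Mops/s")
```

Under this model, doubling the sampling rate roughly doubles the load, while doubling the frame duration adds only a logarithmic factor (longer frames are costlier individually but occur proportionally less often), which is one way to see why frame duration can be varied to tune separation quality at comparatively modest computational expense.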
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Anliker, U., Randall, J. & Tröster, G. Speaker Separation and Tracking System. EURASIP J. Adv. Signal Process. 2006, 029104 (2006). https://doi.org/10.1155/ASP/2006/29104