Research Article | Open Access
Speaker Separation and Tracking System
EURASIP Journal on Advances in Signal Processing volume 2006, Article number: 029104 (2006)
Abstract
Replicating human hearing in electronics is nontrivial under two constraints: only two microphones are available (even when more than two people are speaking), and the user carries the device at all times (i.e., a mobile device weighing less than 100 g). Our novel contribution in this area is a two-microphone system that combines blind source separation with speaker tracking. The system handles more than two speakers and overlapping speech in a mobile environment, and it supports a feedback loop from the speaker-tracking stage back to the blind source separation, which can improve separation performance. To develop and optimize this system, we established a novel benchmark, which we present here. Using the introduced complexity metrics, we quantify the tradeoffs between system performance and computational load. Our results show that, in our case, source separation depended significantly more on frame duration than on sampling frequency.
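The tradeoff between frame duration, sampling frequency, and computational load mentioned above can be made concrete with a back-of-the-envelope model of frame-based (STFT) processing: frequency resolution is set by the frame duration alone, while the per-second FFT cost grows with both frame length and sampling rate. The sketch below is a toy illustration of such a complexity metric, not the article's actual benchmark; `stft_cost` and its parameter choices (50% overlap, power-of-two FFT) are assumptions for the example.

```python
import math

def stft_cost(fs_hz, frame_ms, overlap=0.5):
    """Rough FFT operation count per second of audio, one channel.

    Toy complexity metric: frames/second times an ~N log2 N
    per-frame FFT cost. Illustrative only.
    """
    frame_len = int(fs_hz * frame_ms / 1000)      # samples per frame
    nfft = 2 ** math.ceil(math.log2(frame_len))   # next power of two
    hop = int(frame_len * (1 - overlap))          # hop size in samples
    frames_per_sec = fs_hz / hop
    ops_per_fft = nfft * math.log2(nfft)          # ~N log N butterflies
    return frames_per_sec * ops_per_fft

for fs in (8000, 16000):
    for frame_ms in (16, 32, 64):
        print(f"fs={fs:5d} Hz, frame={frame_ms:2d} ms -> "
              f"{stft_cost(fs, frame_ms) / 1e6:6.2f} Mops/s")
```

Under this model, doubling the sampling rate roughly doubles the load, while doubling the frame duration adds only a logarithmic factor (longer frames are costlier individually but occur proportionally less often), which is one way to see why frame duration can be varied to tune separation quality at comparatively modest computational expense.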
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Anliker, U., Randall, J. & Tröster, G. Speaker Separation and Tracking System. EURASIP J. Adv. Signal Process. 2006, 029104 (2006). https://doi.org/10.1155/ASP/2006/29104