Strategies for distant speech recognition in reverberant environments

EURASIP Journal on Advances in Signal Processing

Table 6 Results for the evaluation set using an acoustic model trained with extended training data

	Proc.	LM	Adap.	SimData							RealData
				Room1		Room2		Room3		Ave.	Room1		Ave.
				Near	Far	Near	Far	Near	Far	-	Near	Far	-
	Clean/Headset mic	RNN	✓	-	-	-	-	-	-	3.5	-	-	6.1
	Lapel mic	RNN	✓	-	-	-	-	-	-	-	-	-	7.3
0 - a	Distant	TRI	-	5.0	5.7	6.3	11.1	7.2	11.2	7.8	25.8	26.5	26.1
b			✓	4.5	5.4	6.1	9.9	7.0	10.2	7.2	19.7	22.1	20.9
c		RNN	-	4.2	5.0	5.3	9.2	5.7	9.7	6.5	23.0	24.4	23.7
d			✓	3.8	4.9	4.9	8.1	5.6	8.6	6.0	18.5	19.9	19.2
I - a	WPE (1ch)	TRI	-	4.8	5.4	6.2	8.8	6.1	8.6	6.6	20.3	20.3	20.3
b			✓	4.7	5.0	5.9	8.2	5.9	8.1	6.3	16.8	17.3	17.0
c		RNN	-	3.8	4.3	4.7	7.6	5.3	7.4	5.5	19.2	18.7	19.0
d			✓	3.6*	3.9*	4.4*	6.6*	4.8*	6.8*	5.0*	15.2*	16.7*	15.9*
II - a	WPE (2ch)	TRI	-	4.9	5.1	6.0	7.2	6.0	7.3	6.1	18.0	18.1	18.1
b			✓	4.6	4.9	5.8	6.8	5.8	6.7	5.8	14.5	16.0	15.2
c		RNN	-	4.0	4.3	4.9	6.1	5.0	5.8	5.0	16.5	16.5	16.5
d			✓	3.7*	4.0*	4.4	5.7	4.7	5.4	4.6	13.4	13.1	13.2
III - a	II + MVDR	TRI	-	4.9	5.1	5.8	6.8	5.7	6.7	5.8	15.6	15.9	15.8
b			✓	4.6	4.9	5.4	6.2	5.6	6.6	5.5	12.9	14.2	13.5
c		RNN	-	4.2	4.4	4.2	5.5	4.9	5.4	4.8	14.3	14.8	14.6
d			✓	3.9	4.2	4.1*	5.0*	4.5	5.0*	4.4*	11.6	12.8	12.2
IV - a	III + DOL	TRI	-	4.8	5.1	5.6	6.5	5.8	6.4	5.7	15.8	15.9	15.8
b			✓	4.7	5.0	5.4	6.1	5.6	6.2	5.5	12.8	14.0	13.4
c		RNN	-	4.1	4.7	4.2	5.2	4.9	5.4	4.7	14.4	14.6	14.5
d			✓	3.7	4.1	4.3	5.0*	4.4*	5.2	4.4*	11.1*	12.7*	11.9*
V - a	WPE (8ch)	TRI	-	4.6	5.1	5.8	6.5	5.8	6.8	5.8	16.7	17.0	16.9
b			✓	4.42	4.9	5.8	6.2	5.5	6.5	5.5	13.5	13.9	13.7
c		RNN	-	3.8	4.2	4.6	5.5	4.8	5.4	4.7	16.0	15.7	15.8
d			✓	3.5*	3.9*	4.1	5.0	4.3	5.3	4.3	12.7	13.1	12.9
VI - a	V + MVDR	TRI	-	4.8	5.2	5.2	5.5	5.5	5.9	5.3	11.9	12.4	12.2
b			✓	4.6	4.9	5.0	5.2	5.5	5.8	5.2	10.0	10.2	10.1
c		RNN	-	4.0	4.2	3.8	4.3	4.4	4.8	4.2	10.4	11.9	11.1
d			✓	3.7	3.9*	3.5*	4.2*	4.2*	4.9	4.1*	9.0	9.6	9.3
VII - a	VI + DOL	TRI	-	5.0	5.3	5.2	5.5	5.7	5.8	5.4	11.1	12.3	11.7
b			✓	4.8	4.8	5.0	5.3	5.6	5.6	5.2	9.7	9.9	9.8
c		RNN	-	4.1	4.2	3.9	4.3	4.3	4.8	4.3	10.0	11.4	10.7
d			✓	3.9	4.0	3.7	4.2*	4.3	4.6*	4.1*	8.9*	9.3*	9.1*

*Best performance within each of the 1ch, 2ch, and 8ch configurations
The results are presented for the different components of the SE front-end and for different configurations of the ASR back-end: the language model (LM) used, either trigram (TRI) or RNNLM (RNN), and with (✓) or without (-) adaptation (Adap.).
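As a reading aid, the gap between two rows of the table can be expressed as a relative reduction. The sketch below is only illustrative: it assumes the scores are error rates in percent (lower is better), copies the "Ave." entries of rows 0-a and VII-d from the table, and uses hypothetical helper names (relative_reduction, compare) that do not come from the paper.

```python
# Minimal sketch: relative reduction between two rows of Table 6.
# Assumption: the table entries are error rates in percent (lower is better).

def relative_reduction(baseline: float, improved: float) -> float:
    """Return the reduction of `improved` relative to `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


# "Ave." columns copied from the table for two rows:
# 0-a   = distant microphone, trigram LM, no adaptation
# VII-d = VI + DOL, RNN LM, with adaptation
AVERAGES = {
    "0-a": {"SimData": 7.8, "RealData": 26.1},
    "VII-d": {"SimData": 4.1, "RealData": 9.1},
}


def compare(baseline_id: str, improved_id: str) -> dict:
    """Relative reduction per condition, rounded to one decimal place."""
    base = AVERAGES[baseline_id]
    new = AVERAGES[improved_id]
    return {
        condition: round(relative_reduction(base[condition], new[condition]), 1)
        for condition in base
    }


if __name__ == "__main__":
    for condition, value in compare("0-a", "VII-d").items():
        print(f"{condition}: {value}% relative reduction")
```

Other row pairs can be compared the same way by adding their "Ave." entries to AVERAGES.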