
Table 6 Results for the evaluation set using an acoustic model trained with extended training data

From: Strategies for distant speech recognition in reverberant environments

 

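| | Proc. | LM | Adap. | SimData Room1 Near | SimData Room1 Far | SimData Room2 Near | SimData Room2 Far | SimData Room3 Near | SimData Room3 Far | SimData Ave. | RealData Room1 Near | RealData Room1 Far | RealData Ave. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Clean/Headset mic | RNN | | - | - | - | - | - | - | 3.5 | - | - | 6.1 |
| | Lapel mic | RNN | | - | - | - | - | - | - | - | - | - | 7.3 |
| 0 - a | Distant | TRI | - | 5.0 | 5.7 | 6.3 | 11.1 | 7.2 | 11.2 | 7.8 | 25.8 | 26.5 | 26.1 |
| b | | TRI | ✓ | 4.5 | 5.4 | 6.1 | 9.9 | 7.0 | 10.2 | 7.2 | 19.7 | 22.1 | 20.9 |
| c | | RNN | - | 4.2 | 5.0 | 5.3 | 9.2 | 5.7 | 9.7 | 6.5 | 23.0 | 24.4 | 23.7 |
| d | | RNN | ✓ | 3.8 | 4.9 | 4.9 | 8.1 | 5.6 | 8.6 | 6.0 | 18.5 | 19.9 | 19.2 |
| I - a | WPE (1ch) | TRI | - | 4.8 | 5.4 | 6.2 | 8.8 | 6.1 | 8.6 | 6.6 | 20.3 | 20.3 | 20.3 |
| b | | TRI | ✓ | 4.7 | 5.0 | 5.9 | 8.2 | 5.9 | 8.1 | 6.3 | 16.8 | 17.3 | 17.0 |
| c | | RNN | - | 3.8 | 4.3 | 4.7 | 7.6 | 5.3 | 7.4 | 5.5 | 19.2 | 18.7 | 19.0 |
| d | | RNN | ✓ | 3.6* | 3.9* | 4.4* | 6.6* | 4.8* | 6.8* | 5.0* | 15.2* | 16.7* | 15.9* |
| II - a | WPE (2ch) | TRI | - | 4.9 | 5.1 | 6.0 | 7.2 | 6.0 | 7.3 | 6.1 | 18.0 | 18.1 | 18.1 |
| b | | TRI | ✓ | 4.6 | 4.9 | 5.8 | 6.8 | 5.8 | 6.7 | 5.8 | 14.5 | 16.0 | 15.2 |
| c | | RNN | - | 4.0 | 4.3 | 4.9 | 6.1 | 5.0 | 5.8 | 5.0 | 16.5 | 16.5 | 16.5 |
| d | | RNN | ✓ | 3.7* | 4.0* | 4.4 | 5.7 | 4.7 | 5.4 | 4.6 | 13.4 | 13.1 | 13.2 |
| III - a | II + MVDR | TRI | - | 4.9 | 5.1 | 5.8 | 6.8 | 5.7 | 6.7 | 5.8 | 15.6 | 15.9 | 15.8 |
| b | | TRI | ✓ | 4.6 | 4.9 | 5.4 | 6.2 | 5.6 | 6.6 | 5.5 | 12.9 | 14.2 | 13.5 |
| c | | RNN | - | 4.2 | 4.4 | 4.2 | 5.5 | 4.9 | 5.4 | 4.8 | 14.3 | 14.8 | 14.6 |
| d | | RNN | ✓ | 3.9 | 4.2 | 4.1* | 5.0* | 4.5 | 5.0* | 4.4* | 11.6 | 12.8 | 12.2 |
| IV - a | III + DOL | TRI | - | 4.8 | 5.1 | 5.6 | 6.5 | 5.8 | 6.4 | 5.7 | 15.8 | 15.9 | 15.8 |
| b | | TRI | ✓ | 4.7 | 5.0 | 5.4 | 6.1 | 5.6 | 6.2 | 5.5 | 12.8 | 14.0 | 13.4 |
| c | | RNN | - | 4.1 | 4.7 | 4.2 | 5.2 | 4.9 | 5.4 | 4.7 | 14.4 | 14.6 | 14.5 |
| d | | RNN | ✓ | 3.7 | 4.1 | 4.3 | 5.0* | 4.4* | 5.2 | 4.4* | 11.1* | 12.7* | 11.9* |
| V - a | WPE (8ch) | TRI | - | 4.6 | 5.1 | 5.8 | 6.5 | 5.8 | 6.8 | 5.8 | 16.7 | 17.0 | 16.9 |
| b | | TRI | ✓ | 4.42 | 4.9 | 5.8 | 6.2 | 5.5 | 6.5 | 5.5 | 13.5 | 13.9 | 13.7 |
| c | | RNN | - | 3.8 | 4.2 | 4.6 | 5.5 | 4.8 | 5.4 | 4.7 | 16.0 | 15.7 | 15.8 |
| d | | RNN | ✓ | 3.5* | 3.9* | 4.1 | 5.0 | 4.3 | 5.3 | 4.3 | 12.7 | 13.1 | 12.9 |
| VI - a | V + MVDR | TRI | - | 4.8 | 5.2 | 5.2 | 5.5 | 5.5 | 5.9 | 5.3 | 11.9 | 12.4 | 12.2 |
| b | | TRI | ✓ | 4.6 | 4.9 | 5.0 | 5.2 | 5.5 | 5.8 | 5.2 | 10.0 | 10.2 | 10.1 |
| c | | RNN | - | 4.0 | 4.2 | 3.8 | 4.3 | 4.4 | 4.8 | 4.2 | 10.4 | 11.9 | 11.1 |
| d | | RNN | ✓ | 3.7 | 3.9* | 3.5* | 4.2* | 4.2* | 4.9 | 4.1* | 9.0 | 9.6 | 9.3 |
| VII - a | VI + DOL | TRI | - | 5.0 | 5.3 | 5.2 | 5.5 | 5.7 | 5.8 | 5.4 | 11.1 | 12.3 | 11.7 |
| b | | TRI | ✓ | 4.8 | 4.8 | 5.0 | 5.3 | 5.6 | 5.6 | 5.2 | 9.7 | 9.9 | 9.8 |
| c | | RNN | - | 4.1 | 4.2 | 3.9 | 4.3 | 4.3 | 4.8 | 4.3 | 10.0 | 11.4 | 10.7 |
| d | | RNN | ✓ | 3.9 | 4.0 | 3.7 | 4.2* | 4.3 | 4.6* | 4.1* | 8.9* | 9.3* | 9.1* |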

  1. *Best performance for each of the 1ch, 2ch, and 8ch settings
  2. Results are shown for the different components of the SE front-end and for different configurations of the ASR back-end: the language model (LM) used, either trigram (TRI) or RNNLM (RNN), and with (✓) or without (-) adaptation