
RWCP Sound Scene Database
Research introduction:
Sound Source Identification and Speech Recognition Based on HMM Using a Microphone Array

ATR Spoken Language Translation Research Laboratories, Dept. 1, Satoshi NAKAMURA, and Takanobu NISHIURA
http://www.slt.atr.co.jp/dept1/index.html

The latest information on this page is located at
http://www.slt.atr.co.jp/~tnishi/DB/micarray/researche.htm
1. Abstract

We studied sound source identification and speech recognition with a microphone array, using the RWCP database. Identification of "speech" versus "non-speech" sources and speech recognition were both evaluated on signals obtained by convolving RWCP-DB impulse responses with a dry speech source from ATR-DB and a dry non-speech source from RWCP-DB.
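
The data generation described above (convolving dry sources with measured impulse responses and combining them at a given SNR) can be sketched roughly as follows. This is a minimal illustration, not the authors' actual tooling: the function names `convolve_multichannel` and `mix_at_snr`, the random toy signals standing in for the ATR-DB and RWCP-DB material, and the choice to average power over all channels are assumptions of the sketch.

```python
import numpy as np

def convolve_multichannel(dry, impulse_responses):
    """Convolve one dry source with one impulse response per microphone.

    dry: shape (n_samples,); impulse_responses: shape (n_mics, ir_len).
    Returns an array of shape (n_mics, n_samples + ir_len - 1).
    """
    return np.stack([np.convolve(dry, h) for h in impulse_responses])

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db, then add.

    Both inputs have shape (n_mics, n_samples). Power is averaged over all
    channels and samples (an assumption of this sketch).
    Returns the mixture and the gain that was applied to the noise.
    """
    n = min(speech.shape[1], noise.shape[1])
    speech, noise = speech[:, :n], noise[:, :n]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise, gain

# Toy signals standing in for the database material (hypothetical data).
rng = np.random.default_rng(0)
dry_speech = rng.standard_normal(1600)
dry_noise = rng.standard_normal(1600)
ir_front = rng.standard_normal((14, 64)) * 0.1   # one response per transducer
ir_right = rng.standard_normal((14, 64)) * 0.1

speech_obs = convolve_multichannel(dry_speech, ir_front)
noise_obs = convolve_multichannel(dry_noise, ir_right)
mixture, gain = mix_at_snr(speech_obs, noise_obs, snr_db=0.0)
```

Taking channel 0 of `mixture` would correspond to a single-transducer observation, while the full array of channels would be passed to a beamformer.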

2. Sound source identification algorithm

We assume that speech arrives from the front and non-speech arrives from the right, as shown in the left illustration of Figure 1. Capturing distant-talking speech with high quality in this situation requires estimating the DOAs (Directions Of Arrival) and beamforming toward them. However, although the DOAs can be estimated, DOA estimation alone cannot tell which of the estimated directions is the talker. Sound source identification is therefore needed to capture high-quality speech at a distance. The right illustration of Figure 1 shows the flow of the algorithm. Each sound source is identified as talker or noise using statistical speech and non-speech models: the sound captured by beamforming toward each DOA is scored against both models, and the source is assigned to the model that gives the maximum likelihood. The statistical speech model was trained on a dry speech source from ATR-DB, and the statistical non-speech model on a dry non-speech source from RWCP-DB.


Figure 1 : Sound source identification algorithm
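
The likelihood comparison in this algorithm can be illustrated with a small sketch. It is a simplification under stated assumptions: each model is reduced to a single diagonal-covariance Gaussian mixture (the original work uses HMMs of Gaussian mixture type), the parameters and two-dimensional features are toy values rather than trained MFCC models, and the names `gmm_log_likelihood` and `identify` are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(frames, weights, means, variances):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM.

    frames: (n_frames, dim); weights: (n_mix,); means, variances: (n_mix, dim).
    """
    diff = frames[:, None, :] - means[None, :, :]                # (n_frames, n_mix, dim)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)
    log_comp = log_comp + np.log(weights)                        # (n_frames, n_mix)
    # Log-sum-exp over mixture components for numerical stability.
    peak = log_comp.max(axis=1, keepdims=True)
    frame_ll = peak[:, 0] + np.log(np.sum(np.exp(log_comp - peak), axis=1))
    return float(frame_ll.sum())

def identify(frames, speech_model, nonspeech_model):
    """Label a beamformed signal by whichever model scores its features higher."""
    ll_speech = gmm_log_likelihood(frames, *speech_model)
    ll_nonspeech = gmm_log_likelihood(frames, *nonspeech_model)
    return "speech" if ll_speech > ll_nonspeech else "non-speech"

# Toy models: (weights, means, variances). Real models would be trained
# on feature vectors from the speech and non-speech databases.
speech_model = (np.array([0.5, 0.5]),
                np.array([[0.0, 0.0], [1.0, 1.0]]),
                np.ones((2, 2)))
nonspeech_model = (np.array([1.0]),
                   np.array([[4.0, 4.0]]),
                   np.ones((1, 2)))

# Toy feature sequences standing in for the output of two beams.
rng = np.random.default_rng(0)
beam_front = rng.normal(0.5, 1.0, size=(50, 2))
beam_right = rng.normal(4.0, 1.0, size=(50, 2))

labels = {"front": identify(beam_front, speech_model, nonspeech_model),
          "right": identify(beam_right, speech_model, nonspeech_model)}
```

In this sketch the beam whose features are labeled "speech" would be taken as the talker direction.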

3. Evaluation experiment

Figure 2 shows the experimental environment and Table 1 shows the experimental conditions. We assumed that the desired sound is located in front of the array and the noise on the right. In this situation, we evaluated sound source identification performance at each SNR for both a single transducer and the microphone array. The directivity of the microphone array was steered toward the desired sound direction. We also conducted speech recognition experiments using the 1998 IPA (Information-technology Promotion Agency) phonetic Al models.


Figure 2 : Experimental environment

Table 1 : Experimental conditions

Microphone array                      14 transducers, 2.83 cm spacing
Beamformer                            Delay-and-sum beamformer
Sampling frequency                    16 kHz
Frame length                          32 msec (Hamming window)
Feature vector                        MFCC, ΔMFCC, Δpower
HMM for sound source discrimination   Gaussian mixture type
The number of phoneme models          1 model for speech
                                      1 model for non-speech
Speech DB                             ATR database SetA
Speech model training data            Female 8 subjects and male 8 subjects, total 400 words
Non-speech DB                         RWCP-DB
Non-speech model training data        92 kinds of environmental sounds * 20 subjects
Test data (Open)                      Speech : Phoneme balanced 216 words in ATR-DB for speaker MHT
                                      Non-speech : 92 kinds of environmental sounds in RWCP-DB
Impulse response                      RWCP-DB
Reverberation time                    0.0, 0.3, 1.3 sec
SNR                                   -5, 0, 5, 10, 15, 20, 25, 30 dB, and clean

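
As an illustration of the delay-and-sum beamformer listed in Table 1, the following sketch steers a 14-element array with 2.83 cm spacing at 16 kHz. Several points are assumptions not stated in the source: the array is treated as a uniform linear array, the sources as far-field plane waves, the speed of sound as 340 m/s, and angles are measured from the forward direction (0 degrees in front, 90 degrees to the right). The function names and the pure-tone test signals are hypothetical.

```python
import numpy as np

SOUND_SPEED = 340.0   # m/s (assumed; not given in the source)
FS = 16000            # sampling frequency in Hz, from Table 1
N_MICS = 14           # number of transducers, from Table 1
SPACING = 0.0283      # element spacing in metres, from Table 1

def steering_delays(angle_deg, n_mics=N_MICS, spacing=SPACING, c=SOUND_SPEED):
    """Arrival delay (seconds) of a far-field plane wave at each microphone,
    relative to microphone 0, for a uniform linear array (assumed geometry)."""
    positions = np.arange(n_mics) * spacing
    return positions * np.sin(np.deg2rad(angle_deg)) / c

def simulate_plane_wave(source, angle_deg, fs=FS):
    """Toy array observation: the source delayed per microphone.

    Delays are applied in the frequency domain, so they are circular.
    """
    n = source.size
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    delays = steering_delays(angle_deg)
    phase = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(np.fft.rfft(source)[None, :] * phase, n=n, axis=1)

def delay_and_sum(signals, angle_deg, fs=FS):
    """Undo the per-microphone delays for `angle_deg`, then average the channels."""
    n_mics, n_samples = signals.shape
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    delays = steering_delays(angle_deg, n_mics)
    spectra = np.fft.rfft(signals, axis=1)
    aligned = spectra * np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(aligned.mean(axis=0), n=n_samples)

# Toy scene: a 1 kHz tone from the front and a 2 kHz tone from the right.
t = np.arange(1600) / FS
target = np.sin(2 * np.pi * 1000.0 * t)
interferer = np.sin(2 * np.pi * 2000.0 * t)
observed = simulate_plane_wave(target, 0.0) + simulate_plane_wave(interferer, 90.0)

enhanced = delay_and_sum(observed, 0.0)   # steer toward the desired sound
residual = enhanced - target              # what is left of the interferer
```

Steering toward the front keeps the target aligned across channels while the contributions of the off-axis source partially cancel in the average, which is the effect the experiment relies on when comparing the array with a single transducer.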
3.1 Performance evaluation

In this experiment, we evaluated sound source identification performance on 216 words for speech and 92 sounds for non-speech, by calculating the sound source discrimination rate for each. We also evaluated speech recognition performance using 216 words.

3.2 Experimental results

Figure 3(a) shows the results obtained with a single transducer and Figure 3(b) shows the results obtained with the microphone array. In Figure 3, the bar graph represents the sound source identification rate and the line graph represents the speech recognition rate. Comparing Figures 3(a) and 3(b), both the sound source identification rate and the speech recognition rate are clearly improved by using the microphone array. This confirms that a microphone array is an effective tool for capturing high-quality speech at a distance. Furthermore, sound sources can be identified accurately even in low-SNR environments: the identification rate at 0 dB SNR is almost identical to that at 20 dB SNR. However, the speech recognition rate at 0 dB SNR is only 65%, even with the microphone array in the anechoic room, which is much lower than the rate at 20 dB SNR. The beamformer therefore needs further improvement to capture distant-talking speech with high quality. Nevertheless, because identification performance remained high even in a highly reverberant environment (T60 = 1.3 sec), we were able to confirm that, if the DOAs are known, this algorithm can identify whether each sound source is speech or noise.



(a) A single microphone

(b) A microphone array
Figure 3 : Sound source identification and speech recognition rates

4. Conclusion

We studied the discrimination of "speech" and "non-speech" using statistical speech and non-speech HMMs in order to localize the talker. The results indicate that this algorithm is very effective even in a highly reverberant environment. The effectiveness of a microphone array was also confirmed by evaluating sound source identification and speech recognition performance under several reverberation and SNR conditions. In future work, we will continue to study performance in conjunction with DOA estimation.

RWCP Sound Scene Database in Real Acoustical Environments
Copyright (c) 1998-2001 Satoshi Nakamura, and Takanobu Nishiura, ATR Spoken Language Translation Research Laboratories.