RWCP Sound Scene Database in Real Acoustical Environments
Voice Activity Detection in Noisy Environments
Takeshi Yamada, Multimedia Laboratory, Institute of Information Sciences and Electronics, University of Tsukuba
| 1. Introduction |
Voice activity detection (VAD) is essential for speech communication applications such as speech recognition, hands-free telephony and speech coding. When noise-free speech is acquired, a suitable threshold on the signal level makes it relatively easy to detect the speech period. However, real speech is distorted by background noise such as computer fans, air conditioners and many other environment sounds, especially in distant-talking situations. Inaccurate detection of the speech period causes serious problems such as degraded recognition performance and deteriorated speech quality. It is therefore highly desirable to develop a robust and reliable VAD method.
Our research group is studying a VAD method in the framework of pattern matching [1]. In this method, the speech period is detected as a result of speech recognition. With appropriate environment sound models, it achieves a distinct performance advantage over the energy-based method. However, the method has a problem when the environment sound overlaps with the speech, especially in low SNR (signal-to-noise ratio) conditions. One way to address this is to prepare multiple mixture sound models, but the number of combinations of speech and environment sounds is large. To solve this problem, we have proposed a new VAD method using HMM composition [2] and environment sound models [3]. This method is introduced below.
| 2. Proposed method |
Fig. 1 shows typical patterns of mixture of speech and environment sound.

Fig. 1 Typical patterns of mixture of the speech and the environment sound.
The environment sound overlaps with the beginning and ending part of the speech in Fig. 1 (A), with the whole of the speech in Fig. 1 (B) and with the inside of the speech in Fig. 1 (C). By using appropriate environment sound models, the period where only the environment sound exists, as in Fig. 1 (A) and (B), can be easily detected. The mixture sound period can then also be detected by using composed models of the speech and the detected environment sound. The algorithm of the proposed method is as follows.
[STEP 1] Detect the speech period and the environment sound period by using the Viterbi algorithm with the speech model and the environment sound models as shown in Fig. 2. S is the speech model, and N1 and N2 are the environment sound models. In Fig. 2, the number of environment sounds is 2 and the number of states in each HMM is 1.

Fig. 2 Concatenation of the speech model and the environment sound models.
[STEP 2] Assume that the environment sound detected immediately before the speech period overlaps with the speech, and compose the speech model with that environment sound model for several SNR values.
[STEP 3] Detect the speech period, the environment sound period and the mixture sound period by using the Viterbi algorithm with the speech model, the environment sound models and the composed models as shown in Fig. 3. S' is the composed model. In Fig. 3, the number of SNR values is 1.

Fig. 3 Concatenation of the speech model, the environment sound models and the composed models.
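The frame-level decoding used in STEP 1 and STEP 3 can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes that per-frame log-likelihoods for each single-state model are already available (in practice they would come from mixture densities evaluated on MFCC frames), and it replaces the transition structure of Fig. 2 and Fig. 3 with a uniform, hypothetical `switch_penalty`. The function names and the toy scores are invented for the example.

```python
def viterbi_segment(log_lik, switch_penalty=5.0):
    """Return the best model index per frame.

    log_lik[t][k] is the log-likelihood of frame t under single-state model k.
    Staying in a model costs nothing; switching costs switch_penalty
    (an assumed value, not taken from the source).
    """
    n_frames, n_models = len(log_lik), len(log_lik[0])
    score = list(log_lik[0])
    back = []  # back[t - 1][j]: best predecessor of model j at frame t
    for t in range(1, n_frames):
        new_score, pointers = [], []
        for j in range(n_models):
            cands = [score[i] - (0.0 if i == j else switch_penalty)
                     for i in range(n_models)]
            best_i = max(range(n_models), key=lambda i: cands[i])
            pointers.append(best_i)
            new_score.append(cands[best_i] + log_lik[t][j])
        score = new_score
        back.append(pointers)
    # Backtrace from the best final model.
    path = [max(range(n_models), key=lambda i: score[i])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def to_segments(path, names):
    """Collapse a frame-level path into (label, first_frame, last_frame) runs."""
    segments, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            segments.append((names[path[start]], start, t - 1))
            start = t
    return segments


# Toy scores: N1 fits frames 0-3, S fits frames 4-8, N2 fits frames 9-11.
names = ["S", "N1", "N2"]
log_lik = [[-10.0, -10.0, -10.0] for _ in range(12)]
for t in range(0, 4):
    log_lik[t][1] = -1.0
for t in range(4, 9):
    log_lik[t][0] = -1.0
for t in range(9, 12):
    log_lik[t][2] = -1.0
log_lik[6][1] = -0.5  # one-frame glitch that the switch penalty suppresses

path = viterbi_segment(log_lik)
segments = to_segments(path, names)
```

In STEP 3 the composed models S' would simply appear as additional columns of `log_lik`, so the same search also labels mixture sound periods.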
In the proposed method, an efficient and reliable search is realized by restricting the number of combinations of the speech model and the environment sound models. Also, the proposed method can estimate the environment sound which overlaps with speech and its SNR value. This additional information might be used for robust speech recognition.
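To give a feel for what composing models "for several SNR values" involves, the sketch below combines one speech mean vector and one environment sound mean vector in the log power spectral domain. It is a deliberately simplified stand-in: the HMM composition of [2] operates on full model parameters, whereas this example only adds means in the linear power domain after rescaling the noise. The helper name, the toy vectors and the way the gain is derived from the total power are all assumptions made for illustration.

```python
import math


def compose_log_spectral_means(speech_mean, noise_mean, snr_db):
    """Combine two log-power-spectrum mean vectors at a target SNR.

    Hypothetical helper: the noise is rescaled so that total speech power
    divided by total noise power matches snr_db, then both are summed in
    the linear power domain and mapped back to the log domain.
    """
    speech_power = [math.exp(m) for m in speech_mean]
    noise_power = [math.exp(m) for m in noise_mean]
    # Power gain that brings the noise to the requested SNR.
    gain = sum(speech_power) / (sum(noise_power) * 10 ** (snr_db / 10))
    return [math.log(s + gain * n) for s, n in zip(speech_power, noise_power)]


# Toy mean vectors (invented values), composed once per candidate SNR.
speech_mean = [1.0, 2.0, 0.5]
noise_mean = [0.2, 0.1, 1.5]
composed = {snr: compose_log_spectral_means(speech_mean, noise_mean, snr)
            for snr in (20, 10, 0, -10, -20)}
```

Each entry of `composed` would parameterize one candidate mixture model; the candidate that wins the search indicates the estimated SNR.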
| 3. Experiment |
Experimental conditions are shown in Tbl. 1.
Tbl. 1 Experimental conditions
| Speech database | ETL speech database [4] |
| Training data | 1050 words from speaker S0001 |
| Test data | 492 words from speaker S0001 |
| Environment sound database | RWCP Sound Scene Database |
| Training data | even-numbered data from candybowl, clock1, cymbals, pan, pipong, spray, toy, trashbox, whistle1 |
| Test data | odd-numbered data from candybowl, clock1, cymbals, pan, pipong, spray, toy, trashbox, whistle1 |
| Sampling frequency | 16 kHz |
| Frame length | 25 msec |
| Frame period | 10 msec |
| Pre-emphasis | $1 - 0.97z^{-1}$ |
| Parameter | MFCC |
| Speech and silence models | 1 state, 64 mixture densities |
| Environment sound models | 1 state, 16 mixture densities |
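The front-end settings in Tbl. 1 translate into concrete sample counts: at 16 kHz, a 25 msec frame is 400 samples and a 10 msec frame period is 160 samples. The sketch below applies the pre-emphasis filter and cuts a signal into frames under those settings. The function names are illustrative, the handling of the first sample and of the incomplete final frame are assumptions, and the windowing and MFCC computation that would follow are not shown.

```python
def pre_emphasis(signal, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is passed through."""
    if not signal:
        return []
    return [signal[0]] + [signal[n] - coeff * signal[n - 1]
                          for n in range(1, len(signal))]


def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a signal into overlapping frames, dropping the incomplete tail."""
    frame_len = sample_rate * frame_ms // 1000  # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000          # 160 samples at 16 kHz
    if len(signal) < frame_len:
        return []
    n_frames = 1 + (len(signal) - frame_len) // hop
    return [signal[i * hop:i * hop + frame_len] for i in range(n_frames)]


# One second of a constant signal, just to check the bookkeeping.
one_second = [1.0] * 16000
emphasized = pre_emphasis(one_second)
frames = frame_signal(emphasized)
```

With these settings one second of audio yields 98 full frames, since the last frame must end within the signal.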
Evaluation data was prepared by adjusting the signal level of the environment sound and then adding it to the speech. The environment sound begins 5 or more frames before the speech. As a result, the environment sound overlaps with the beginning part or the whole of the speech (see Fig. 1 (A) and (B)). In STEP 2 of the proposed method, the SNR values are set to 20, 10, 0, -10 and -20 dB.
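The preparation of the evaluation data can be illustrated with a short sketch. It assumes that the SNR is defined from the mean power of the whole speech and environment sound signals, which the text does not state, and it uses a lead of 5 frames of 160 samples (800 samples) as the smallest offset the text allows. The function names and the synthetic signals are hypothetical, and the environment sound is assumed to be non-silent.

```python
import math


def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db, lead_samples=800):
    """Scale the noise to the target SNR and add the speech after a lead-in.

    lead_samples=800 corresponds to 5 frames at a 10 msec frame period and
    a 16 kHz sampling frequency.
    """
    # Amplitude gain so that power(speech) / power(gain * noise) = 10^(snr/10).
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    length = max(lead_samples + len(speech), len(noise))
    mixed = [0.0] * length
    for n, v in enumerate(noise):
        mixed[n] += gain * v
    for n, v in enumerate(speech):
        mixed[lead_samples + n] += v
    return mixed, gain


# Synthetic stand-ins for a spoken word and an environment sound.
speech = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
noise = [0.5 if n % 2 == 0 else -0.5 for n in range(3200)]
mixed, gain = mix_at_snr(speech, noise, snr_db=10)
achieved_snr = 10 * math.log10(power(speech) / power([gain * v for v in noise]))
```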
Fig. 4 shows an example of the speech period detected by the conventional method (STEP 1 of the proposed method) and the proposed method.

Fig. 4 An example of the speech period detected by the conventional method and the proposed method.
In Fig. 4, the environment sound "whistle1" overlaps with the speech "ruigigo" at an SNR of 0 dB. The answer period and the periods detected by the conventional method and the proposed method are shown at the top and bottom of the waveform, respectively. The conventional method mistook the speech period for the environment sound period. The proposed method, on the other hand, correctly detected the mixture sound period.
Fig. 5 shows the VAD accuracy for each environment sound at SNRs of 20, 10 and 0 dB.

(1) SNR 20 dB

(2) SNR 10 dB

(3) SNR 0 dB
Fig. 5 VAD accuracy for each environment sound.
The VAD accuracy is defined by the following equation.
VAD accuracy = ((the number of words for which the beginning frame of the speech is correctly detected) / (the number of all words)) × 100 (%)
The answer period is derived from the labels attached to the speech database, and a detection error within ±5 frames is allowed.
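The definition above can be written as a few lines of code. In this sketch the function name and the frame numbers are invented for illustration, and a word with no detected speech period is represented by `None` and counted as an error, which is an assumption rather than something the text specifies.

```python
def vad_accuracy(reference_starts, detected_starts, tolerance=5):
    """Percentage of words whose detected beginning frame lies within
    +/- tolerance frames of the labeled beginning frame."""
    correct = sum(
        1
        for ref, det in zip(reference_starts, detected_starts)
        if det is not None and abs(det - ref) <= tolerance
    )
    return 100.0 * correct / len(reference_starts)


# Made-up beginning frames for four words.
reference = [40, 52, 37, 61]
detected = [42, 60, 32, None]
accuracy = vad_accuracy(reference, detected)
```

Here the first and third words fall within the tolerance (errors of 2 and 5 frames), the second is off by 8 frames and the fourth has no detection, so two of the four words count as correct.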
Fig. 5 shows that the VAD accuracy of the proposed method is improved by a maximum of 40 % compared to that of the conventional method. In particular, the lower the SNR value, the larger the improvement of the proposed method over the conventional method. The reason is that the conventional method detects the mixture sound period as the environment sound period when the SNR value is low.
On the other hand, the VAD accuracy of the proposed method is lower than or equal to that of the conventional method in some cases. The major reason is that the prediction of the environment sound that overlaps with the speech fails. Fig. 6 shows the prediction accuracy for each environment sound at SNRs of 20, 10 and 0 dB.

Fig. 6 Prediction accuracy for each environment sound.
The prediction accuracy is defined by the following equation.
Prediction accuracy = ((the number of words for which the environment sound that overlaps with the speech is correctly predicted) / (the number of all words)) × 100 (%)
Figs. 5 and 6 show that the VAD accuracy of the proposed method is not always improved when the prediction accuracy is low. The prediction method therefore needs to be improved.
| References |
[1] Lawrence Rabiner, Biing-Hwang Juang, "Fundamentals of speech recognition," PTR Prentice-Hall, Inc., 1993.
[2] F. Martin, K. Shikano, Y. Minami, "Recognition of noisy speech by composition of speech and noise," Proc. European Conference on Speech Communication and Technology, pp. 1031-1034, 1993.
[3] Takeshi Yamada, Narimasa Watanabe, Futoshi Asano, Nobuhiko Kitawaki, "Voice activity detection using non-speech models and HMM composition," Proc. Workshop on Hands-free Speech Communication, Apr. 2001.
[4] Kazuyo Tanaka, Satoru Hayamizu, The Journal of the Acoustical Society of Japan, Vol. 48, No. 12, pp. 883-887, 1992.