RWCP Sound Scene Database in Real Acoustical Environments
Non-speech sound recognition with microphone array
Kazuo Hiyane and Jun Iio, Mitsubishi Research Institute, Inc.
RWCP Autonomous Learning MRI Research Laboratory (Mitsubishi Research Institute) developed a non-speech sound recognition system that uses a microphone array. The system can identify seven types of sounds, such as a bell ringing and hand clapping, with an accuracy of 80% or higher. Thirty-two feature values are extracted from the spectra at the power peak and at the stage where the power has decayed to 1/10, and are compared with previously learned data to discriminate the sound source. Sixteen microphones installed at equal intervals on a circle 30 cm in diameter allow the direction of the sound source to be estimated in 10-degree steps, based on the differences in the time at which the sound reaches each microphone.
We obtain many kinds of information from sounds other than speech. We expect that recognizing everyday sounds, such as an object falling, a telephone ringing, or a car approaching, can add new functions in fields such as social welfare, security, and mobile robots.
| Demonstration at RWC2000 (Jan. 2000) | Kyodo News reported our system (Feb. 9, 2000) |
|---|---|
| 1. Recognition of single-collision impulsive sounds |
A single impulsive sound is a non-speech sound caused by a single collision, such as the everyday sounds made by striking or dropping an object. The system recognizes seven types of such sounds: the striking of a wooden board, a metal can, a glass bottle, and a drum, as well as a handclap, a handbell, and a bell.
The average recognition rate is about 80% in a calm office environment, and recognition takes 1 sec or less on a 500 MHz Pentium PC running Linux. Future work includes improving the recognition rate and extending the range of sounds that can be discriminated.
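The two steps this page mentions, locating the power peak and the point where the power has decayed to 1/10, then comparing feature values with learned data, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the frame-wise power profile as input, and the Euclidean nearest-template rule are all assumptions, since the page does not say how the 32 feature values are computed or compared.

```python
import math

def find_peak_and_decay(power, ratio=0.1):
    """Return (peak_index, decay_index) for a frame-wise power profile.

    decay_index is the first frame after the peak whose power has fallen
    to ratio * peak power or below (None if that never happens).
    """
    peak = max(range(len(power)), key=lambda i: power[i])
    threshold = power[peak] * ratio
    for i in range(peak + 1, len(power)):
        if power[i] <= threshold:
            return peak, i
    return peak, None

def nearest_template(features, templates):
    """Return the label of the learned template closest to `features`.

    Euclidean distance is an assumed stand-in for the unspecified
    comparison used by the real system.
    """
    return min(templates, key=lambda label: math.dist(features, templates[label]))

# Toy usage with 2 feature values instead of the 32 used by the system.
power_profile = [0.1, 4.0, 10.0, 6.0, 2.5, 0.9, 0.3]
peak, decay = find_peak_and_decay(power_profile)
learned = {"handclap": [1.0, 0.0], "bell": [0.0, 1.0]}
label = nearest_template([0.9, 0.2], learned)
```

In a fuller version, the spectra taken at the `peak` and `decay` frames would supply the feature values passed to `nearest_template`.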
| Temporal power profile and spectrum of single impulsive sound |
|---|
| 2. Technology for estimating the direction of a sound source using a microphone array |
| Circular microphone array |
|---|
In non-speech sound recognition, information about the place where the sound was produced is important. Moreover, sounds from considerably distant places generally have to be recognized, which makes it necessary to estimate the direction of the sound source and to reduce background noise.
We employed a super-directional filter (a directional former) that accentuates only the sound arriving from a chosen direction: the signals from the 16 microphones are summed after compensating for the differences in the arrival time of a sound wave from that direction.
Direction estimation achieved a resolution of 10 degrees, and noise was reduced by about 10 dB. Future work includes improving the noise reduction performance and making it possible to separate multiple sound sources, in addition to estimating direction.
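The directional former described above, which equalizes arrival-time differences and then sums the channels, matches what is commonly called delay-and-sum beamforming. The sketch below illustrates that idea for 16 microphones on a 30 cm circle. It is not the authors' filter: the speed of sound (343 m/s), the far-field plane-wave model, whole-sample delays, and all function names are assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed value for room-temperature air

def mic_positions(n_mics=16, diameter=0.30):
    """(x, y) coordinates in metres of microphones equally spaced on a circle."""
    r = diameter / 2.0
    return [(r * math.cos(2 * math.pi * k / n_mics),
             r * math.sin(2 * math.pi * k / n_mics)) for k in range(n_mics)]

def steering_delays(positions, angle_deg, fs):
    """Whole-sample delay per channel that aligns a plane wave from angle_deg."""
    a = math.radians(angle_deg)
    # Projection of each microphone onto the unit vector pointing at the source.
    proj = [x * math.cos(a) + y * math.sin(a) for x, y in positions]
    # A microphone with a larger projection hears the wave earlier, so it
    # needs a longer delay to line up with the farthest microphone.
    p_min = min(proj)
    return [round((p - p_min) / SPEED_OF_SOUND * fs) for p in proj]

def delay_and_sum(channels, delays):
    """Delay each channel by its steering delay and average the results."""
    n = len(channels[0])
    out = [0.0] * n
    for signal, d in zip(channels, delays):
        for t in range(d, n):
            out[t] += signal[t - d]
    return [v / len(channels) for v in out]
```

Steering the same recording toward several candidate angles and comparing the output power of each beam gives a direction estimate, because only the beam aimed at the source adds the channels coherently. The sampling rate `fs` is a free parameter here; the page does not state the rate used by the system.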
| 3. Non-speech sound recognition system using a microphone array |
The system consists of a microphone array, a microphone amplifier, an A/D converter, and two personal computers (one for recognition and one for display). The microphone array consists of 16 omnidirectional electret condenser microphones installed on a circle with a diameter of 30 cm.
The acoustic signal measured by the microphone array is fed to the recognition PC through the microphone amplifier and the A/D converter.
Virtual directional formers for 13 directions at 10-degree intervals (-60 to +60 degrees) are computed simultaneously to yield an acoustic signal for each direction. The signal from the direction with the largest power is selected, and when its power stays at a certain level, an acoustic event is assumed to have occurred. Once the signal has been judged to be a single impulsive sound, spectrum matching is carried out against each previously registered sound source.
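The selection and event-detection step can be sketched as below, under stated assumptions. The input is taken to be a table of per-frame power for each of the 13 beams, and "power stays at a certain level" is interpreted as "the largest beam power stays at or above a threshold for a minimum number of consecutive frames". Both the interpretation and the names `detect_event`, `threshold`, and `min_frames` are hypothetical; the page gives no concrete values or rule.

```python
DIRECTIONS = list(range(-60, 61, 10))  # the 13 steering angles in degrees

def detect_event(beam_power, threshold, min_frames):
    """Find the first acoustic event in a table of beam powers.

    beam_power[d][t] is the power of the beam steered to DIRECTIONS[d]
    in frame t. An event is declared when the loudest beam stays at or
    above `threshold` for `min_frames` consecutive frames.
    Returns (angle_deg, start_frame), or None if no event is found.
    """
    n_frames = len(beam_power[0])
    run_start, run_dir = None, None
    for t in range(n_frames):
        # Beam with the largest power in this frame.
        d = max(range(len(beam_power)), key=lambda i: beam_power[i][t])
        if beam_power[d][t] >= threshold:
            if run_start is None:
                # Remember where the run began and which beam was loudest then.
                run_start, run_dir = t, d
            if t - run_start + 1 >= min_frames:
                return DIRECTIONS[run_dir], run_start
        else:
            run_start = None
    return None
```

The reported direction is the one that was loudest when the run started, which is a simplification; a real system might instead average the direction over the run.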
The direction and type of the sound source obtained from this processing are sent over a LAN to the display PC, where they are shown in a spectrogram window and a recognition-result window, both written in Java.
In a calm office environment, direction estimation achieves a resolution of 10 degrees with noise reduction of about 6 dB. The average recognition rate for the sound source type is about 80%. Recognition runs in nearly real time, and the result is displayed within 1 sec.
| Processing flow of non-speech sound recognition system | Overview of hardware |
|---|---|
| Screen shot |
|---|