RWCP Sound Scene Database in Real Acoustical Environments
Non-speech sound recognition with microphone array

Kazuo Hiyane and Jun Iio, Mitsubishi Research Institute, Inc.

See http://tosa.mri.co.jp/nonspeech/index-e.htm for the latest information.

The RWCP Autonomous Learning MRI Research Laboratory (Mitsubishi Research Institute) has developed a non-speech sound recognition system that uses a microphone array. The system can identify seven types of sounds, such as a bell ringing and hands clapping, with an accuracy of 80% or higher. Thirty-two feature values are extracted from the spectra at the peak and at the stage where the sound has decayed to 1/10, and are compared with previously learned data to discriminate the sound source. Sixteen microphones installed at equal intervals on a circle with a diameter of 30 cm allow the direction of the sound source to be estimated in 10-degree units from the differences in the times at which the sound reaches the individual microphones.

We obtain many kinds of information from so-called "sounds" other than speech. We expect that recognizing everyday sounds, such as an object falling, a telephone ringing, or a car approaching, will add new functions in fields such as social welfare, security, and mobile robots.

Demonstration at RWC2000 (Jan. 2000). Kyodo News reported on our system (Feb. 9th, 2000).

1. Recognition for Single Collision Impulsive Sounds

(1) What is a single impulsive sound?

A single impulsive sound is a non-speech sound caused by a single collision, such as the everyday sounds made by striking or dropping an object. The system recognizes seven types of such sounds: the striking of a wooden board, a metal can, a glass bottle, and a drum, plus a handclap, a handbell, and a bell.

(2) Method of recognition

(a) Judging a single impulsive sound
To judge whether a sound is a single impulsive sound, the system checks whether the temporal change in the power of the input acoustical signal follows an exponentially decaying profile, which is characteristic of a single impulsive sound.
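The check described in (a) can be illustrated with a minimal sketch. It is not the system's actual implementation: it assumes that "exponential reduction" is tested by fitting a straight line to the logarithm of the frame power after the peak, and the names `frame_power` and `looks_like_single_impulse`, the frame length, and the thresholds `min_frames`, `min_decay` and `min_r2` are all hypothetical.

```python
import math

def frame_power(signal, frame_len):
    """Mean power of consecutive non-overlapping frames."""
    return [
        sum(x * x for x in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]

def looks_like_single_impulse(power, min_frames=5, min_decay=0.05, min_r2=0.9):
    """Return True if the power after its peak decays roughly exponentially.

    An exponential decay is a straight line with negative slope in
    log(power), so a least-squares line is fitted to the log power after
    the peak. All thresholds are illustrative assumptions.
    """
    peak = max(range(len(power)), key=power.__getitem__)
    tail = []
    for p in power[peak:]:
        if p <= 0.0:          # log is undefined; stop at silence
            break
        tail.append(p)
    if len(tail) < min_frames:
        return False
    ys = [math.log(p) for p in tail]
    n = len(ys)
    mean_x = (n - 1) / 2.0
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    if syy == 0.0:
        return False          # perfectly constant power: no decay
    slope = sxy / sxx         # log-power change per frame
    r2 = (sxy * sxy) / (sxx * syy)
    return slope < -min_decay and r2 >= min_r2
```

With these assumed thresholds, a decaying tone such as `exp(-n/200) * sin(0.2*pi*n)` passes the check, while a steady tone of the same frequency does not, because its fitted slope is essentially zero.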
(b) Spectrum matching
Next, the spectra at the peak and during the attenuation of the power of the single impulsive sound are calculated and each divided into 16 frequency regions, giving 32 characteristic quantities in total. For each single impulsive sound, these quantities are pattern-matched against distributions of characteristic quantities prepared in advance from various samples, and the best-matching kind of sound is output as the discrimination result.
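The following is a rough sketch of how 32 characteristic quantities and a distribution-based match could be computed. The text does not state the band layout, the scaling, or the matching rule, so equal-width bands, log energies, and a variance-normalized distance to per-class means are assumptions made here for illustration, and the function names are hypothetical. A naive DFT is used only to keep the example free of external libraries.

```python
import cmath
import math

N_BANDS = 16  # frequency regions per spectrum, as in the text

def band_log_energies(frame, n_bands=N_BANDS):
    """Log energy in n_bands equal-width bands (naive DFT, assumed layout)."""
    n = len(frame)
    half = n // 2
    energies = []
    for k in range(1, half + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        energies.append(abs(acc) ** 2)
    per_band = half // n_bands
    return [
        math.log(sum(energies[b * per_band:(b + 1) * per_band]) + 1e-12)
        for b in range(n_bands)
    ]

def features(peak_frame, decay_frame):
    """16 band values at the peak plus 16 during attenuation: 32 in total."""
    return band_log_energies(peak_frame) + band_log_energies(decay_frame)

def train(samples_by_class):
    """Per-class mean and variance of each feature (diagonal model)."""
    models = {}
    for label, samples in samples_by_class.items():
        dim = len(samples[0])
        mean = [sum(s[i] for s in samples) / len(samples) for i in range(dim)]
        var = [
            sum((s[i] - mean[i]) ** 2 for s in samples) / len(samples) + 1e-3
            for i in range(dim)  # small floor avoids division by zero
        ]
        models[label] = (mean, var)
    return models

def classify(feat, models):
    """Return the label whose feature distribution is closest to feat."""
    def distance(model):
        mean, var = model
        return sum((f - m) ** 2 / v for f, m, v in zip(feat, mean, var))
    return min(models, key=lambda label: distance(models[label]))
```

In use, `train` would be given a dictionary mapping each sound type to a list of 32-element feature vectors from recorded samples, and `classify` would then return the type whose distribution best matches a new feature vector.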

(3) Performance of recognition

The average recognition ratio is about 80% in a calm office environment, and recognition takes 1 sec or less on a 500 MHz Pentium PC running Linux. Future subjects of our research are improving the recognition ratio and extending the set of sounds that can be discriminated.

Temporal power profile and spectrum of single impulsive sound

2. Technology for estimating the direction of a sound source by using a microphone array
Circular microphone array

(1) Background

In non-speech sound recognition, information about the "place" where the sound was produced is important. Moreover, sounds from a considerable distance generally have to be recognized, which makes estimation of the sound source direction and reduction of background noise necessary.

(2) Delay-sum array filter

We employed a highly directional filter (a directional former, i.e. a beamformer) that accentuates only the sound arriving from a given direction: the differences in the arrival time of a sound wave from that direction at the 16 microphones are equalized, and the acoustical signals are then summed.
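A minimal delay-and-sum sketch for a 16-microphone circular array of 30 cm diameter is given below. It assumes a far-field (plane-wave) source in the plane of the array, a speed of sound of 343 m/s, and delays rounded to whole samples; the sampling rate is not given in the text and is passed in as a parameter. The function names are hypothetical, and this is an illustration of the general technique rather than the filter actually used.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value
N_MICS = 16             # from the text
RADIUS = 0.15           # m, half of the 30 cm diameter in the text

def arrival_delays(azimuth_deg, fs):
    """Arrival delay in samples at each microphone, relative to the array
    centre, for a far-field source at the given azimuth."""
    theta = math.radians(azimuth_deg)
    delays = []
    for m in range(N_MICS):
        phi = 2.0 * math.pi * m / N_MICS  # angular position of microphone m
        # Microphones nearer the source hear the wavefront earlier,
        # which shows up as a negative delay.
        tau = -RADIUS * math.cos(theta - phi) / SPEED_OF_SOUND
        delays.append(round(tau * fs))
    return delays

def delay_and_sum(channels, azimuth_deg, fs):
    """Undo the per-microphone delays for one look direction and average."""
    delays = arrival_delays(azimuth_deg, fs)
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for x, d in zip(channels, delays):
            idx = t + d
            if 0 <= idx < n:      # samples outside the buffer count as zero
                acc += x[idx]
        out.append(acc / len(channels))
    return out
```

When the look direction matches the true source direction, the 16 shifted signals add coherently; for other directions they are misaligned and partly cancel, which is what gives the array its directivity and noise reduction.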

(3) Estimation performance

Direction estimation achieved a resolution of 10 degrees, and noise reduction of about 10 dB was realized. Future subjects are to improve not only direction estimation but also noise reduction performance, and to make the separation of multiple sound sources possible.

3. Non-speech sound recognition system using a microphone array

(1) System configuration

The system consists of a microphone array, a microphone amplifier, an A/D converter, and two personal computers (one for recognition and one for display). The microphone array consists of 16 omnidirectional electret condenser microphones installed on a circle with a diameter of 30 cm.

(2) Processing flow

The acoustical signal measured with the microphone array is fed to the personal computer for recognition through the microphone amplifier and A/D converter.

Virtual directional formers (beamformers) for 13 directions at 10-degree intervals (-60 to +60 degrees) are calculated simultaneously to yield an acoustical signal for each direction. The signal from the direction with the largest power is selected, and when its power stays at or above a certain value, an acoustical event is assumed to have occurred. After the acoustical signal is judged to be a single impulsive sound, spectrum matching is carried out against each previously registered sound source.
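The direction scan and event detection step might look like the sketch below. It assumes that the per-frame power of each of the 13 beam outputs has already been computed, and it interprets "power continues in a certain value" as the loudest beam staying at or above a threshold for a minimum number of frames. `detect_event`, `threshold` and `min_frames` are hypothetical names and tuning parameters, not taken from the system.

```python
DIRECTIONS = list(range(-60, 61, 10))  # 13 look directions, as in the text

def detect_event(power_by_direction, threshold, min_frames):
    """Find the first run where the loudest beam stays above threshold.

    power_by_direction maps a look direction in degrees to a list of
    per-frame powers of that beam's output. Returns a tuple
    (start_frame, direction) or None if no event is found.
    """
    n_frames = len(next(iter(power_by_direction.values())))
    run_start = None
    for f in range(n_frames):
        # Direction whose beam output is loudest in this frame.
        best = max(power_by_direction, key=lambda d: power_by_direction[d][f])
        if power_by_direction[best][f] >= threshold:
            if run_start is None:
                run_start = f
            if f - run_start + 1 >= min_frames:
                # Report the direction with the most power over the run.
                totals = {
                    d: sum(p[run_start:f + 1])
                    for d, p in power_by_direction.items()
                }
                return run_start, max(totals, key=totals.get)
        else:
            run_start = None  # power dropped: start over
    return None
```

The returned start frame and direction would then be handed to the single-impulse judgment and spectrum matching stages.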

The direction and type of the sound source obtained from this processing are sent over a LAN to the personal computer for display, where they are shown in a spectrogram window and in a window for the sound source type recognition result, both developed in Java.

(3) Performance of recognition

In a calm office environment, direction estimation shows a resolution of 10 degrees and noise reduction is about 6 dB. In addition, the average sound source type recognition ratio is about 80%. Recognition runs in almost real time, and the recognition result is displayed in 1 sec or less.

Processing flow of non-speech sound recognition system
Overview of hardware
Screen shot
Copyright (c) 1998-2001 Mitsubishi Research Institute, Inc.