RWCP Autonomous Learning Functions MRI Laboratory

Research Results

[Index]
Non-speech sound recognition technology
(1) Relationship between non-speech sounds and Japanese onomatopoeia
(2) Spectrum structure of non-speech sounds
(3) Non-speech sound recognition system
Real World Speech and Acoustic Database
(1) Dry source of non-speech sounds
(2) Impulse response by microphone array
Publications

Non-speech sound recognition technology

In 1997 and 1998, we developed a non-speech sound recognition system that recognizes short impulse sounds as onomatopoeia.

(1) Relationship between non-speech sounds and Japanese onomatopoeia

There are more than 200 onomatopoeic words (words that imitate sounds) in Japanese. We classified the words for short sounds into six types and analyzed them linguistically.

--chi    --shi    --n     Type of sound
-        -        ki-n    Metallic impact (high)
ka-chi   ka-shi   ka-n    Metallic impact (middle)
ko-chi   -        ko-n    Metallic impact (low), General impact
-        -        gi-n    Metallic impact (high, voiced)
ga-chi   ga-shi   ga-n    Metallic impact (middle, voiced)
go-chi   -        go-n    Metallic impact (low, voiced)
-        -        ta-n    General impact (middle)
-        -        to-n    General impact (low)
-        do-shi   do-n    General impact (very low)
-        -        bo-n    General impact (very low)
-        zu-shi   zu-n    General impact (extremely low)
pi-chi   pi-shi   pi-n    Bouncing (high)
pa-chi   pa-shi   pa-n    Bouncing (middle), Explosion
-        -        po-n    Bouncing (low)
ba-chi   ba-shi   ba-n    Strong impact, Explosion

(A) Impact and vanishing sound
Short sounds that consist of a single impact and vanish very quickly are expressed by words such as `ko-n' and `pa-chi'. These words basically have two letters. The first consonant is one of /k/, /g/, /t/, /d/, /p/, /b/. The first vowel is one of /i/, /a/, /o/, and the order of their second formant frequencies (/i/ > /a/ > /o/) corresponds to the order of the pitch of the impulse sound. The second letter is one of three types: `chi', `shi', and `n'. Compared with `chi' and `shi', `n' expresses slight reverberation. The basic rules for impact and vanishing sounds are shown in the table above.

There are also irregular impact and vanishing sounds: `chi-n', `ri-n', `sha-n', and `ja-n' express ringing bells. As the duration becomes shorter, it becomes impossible to distinguish between sounds. In this case, all sounds are expressed as `cha' or `chi'.

The other types of onomatopoeia for short sounds are as follows.

(B) Impact and reverberating sound
The same type as (A), but with a longer duration: `ka-a-n', `po-o-n', etc.
(C) Double impact sound
Sounds of bouncing or mechanical movement: `ka-ta', `go-to-n', etc.
(D) Multiple impact sound
Intermittent sounds: `ka-ta-ta', `ko-to-ko-to', etc.
(E) Destructive sounds
Sounds of crunching or snapping: `ka-ri', `pa-ki', etc.
(F) Frictional sound
Sounds of rubbing or scratching: `gi', `kyu', etc.


We also carried out cognitive experiments on onomatopoeia using synthesized sounds generated with a Gammatone function, in order to analyze three parameters quantitatively: central frequency, reverberation time, and frequency perturbation. For attenuating sounds such as `ka-n' or `chi-n', the first vowel expresses the central frequency: /o/ corresponds to frequencies up to 1 kHz, /a/ to frequencies from 1 kHz to 2 kHz, and /i/ to frequencies above 2 kHz. The ending of the word expresses the reverberation time: at 4 kHz, `chi' corresponds to up to 100 ms, `chi-n' to 100 ms to 200 ms, and `chi-i-n' to above 300 ms. As the frequency perturbation increases, pure Gammatone sounds (`ta-n' or `ka-n') come to be recognized as explosive sounds such as `pa-n' or `ba-n'.

Figure: Central frequency [Hz] (left) and onomatopoeia (right): `don', `bon', `ton', `pon', `kon', `pan', `kan', `kin', `pin', `chin'

Figure: Central frequency [Hz] (left) and vowels (right)

Figure: Reverberation time [ms] (left) and onomatopoeia (right): `chi', `chi-n', `chi-i-n'
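
As a minimal sketch, the frequency and reverberation ranges quoted above could be written as two lookup functions. The function names are hypothetical and are not taken from the laboratory's system. The text does not say how exact boundary values or the 200 ms to 300 ms gap are handled, so the choices made for them here are assumptions.

```python
def kn_word(center_hz):
    """Attenuating `k-n' word chosen by central frequency.

    /o/ up to 1 kHz, /a/ from 1 kHz to 2 kHz, /i/ above 2 kHz.
    Assumption: the boundary values 1 kHz and 2 kHz go to the lower range.
    """
    if center_hz <= 1000.0:
        vowel = "o"
    elif center_hz <= 2000.0:
        vowel = "a"
    else:
        vowel = "i"
    return "k" + vowel + "-n"


def chi_word(reverb_ms):
    """`chi' series word chosen by reverberation time (stated for 4 kHz).

    Returns None between 200 ms and 300 ms, which the text does not cover.
    """
    if reverb_ms <= 100.0:
        return "chi"
    if reverb_ms <= 200.0:
        return "chi-n"
    if reverb_ms > 300.0:
        return "chi-i-n"
    return None


if __name__ == "__main__":
    print(kn_word(500.0), kn_word(1500.0), kn_word(3000.0))
    print(chi_word(50.0), chi_word(150.0), chi_word(400.0))
```

Keeping the two rules separate mirrors the text, which gives the vowel rule and the ending rule independently and does not state how they combine for other consonants.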

(2) Spectrum structure of non-speech sounds

We measured various types of non-speech sounds and examined their spectral structures according to their onomatopoeia. These sound data are included in the Real World Speech and Acoustic Database.

[ka-n] Sound of a metallic impact
Exponential attenuation of several narrow-band peaks.
[chi-n] Sound of a ringing bell
Same as the metallic sound, but all peaks are concentrated at one frequency.
[pa-n] Sound of clapping hands
Rounded peak at 500-2000 Hz.
[do-n] Sound of a heavy object falling
Low frequencies are dominant, and the duration is shorter at high frequencies.
[ka-ra-ra] Sound of a bouncing hard object
The same spectral profile appears at intervals of 10-100 ms.
[gi] Sound of friction
Many random peaks spread along both the time and frequency axes.
[shu-u] Sound of spouting gas
Random peaks in all bands, temporally stable.
[pi-i] Sound of a whistle or an electronic sound
Continuous oscillation of a narrow-band peak.

Figure: Spectral structure of non-speech sounds (Y-axis: frequency [Hz], X-axis: time [ms]). Panels: ka-n, chi-n, pa-n, do-n, ka-ra-ra, gi, shu-u, pi-i.

(3) Non-speech sound recognition system

System configuration

We developed a prototype system that recognizes ten types of impulse sounds as onomatopoeia.

The sound signal from the microphone passes through a low-pass filter and is sampled by an A/D board. A 256-point FFT spectrum is calculated by a DSP at intervals of 4 ms.
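
The framing step can be sketched as follows. This is only an illustration: the sampling rate of the A/D board is not stated on this page, so 16 kHz is an assumption, as is the Hann window. The function name `spectrogram` is hypothetical.

```python
import numpy as np


def spectrogram(signal, sample_rate=16000, n_fft=256, hop_ms=4.0):
    """Power spectrogram from a 256-point FFT taken every 4 ms.

    Assumptions: 16 kHz sampling rate and a Hann window (neither is
    specified in the text). Returns an array of shape (frames, n_fft//2 + 1).
    """
    signal = np.asarray(signal, dtype=float)
    hop = int(round(sample_rate * hop_ms / 1000.0))  # 64 samples at 16 kHz
    if len(signal) < n_fft:
        raise ValueError("signal is shorter than one FFT frame")
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


if __name__ == "__main__":
    t = np.arange(16000) / 16000.0
    spec = spectrogram(np.sin(2.0 * np.pi * 2000.0 * t))
    print(spec.shape)
```

With these assumed settings each FFT bin is 62.5 Hz wide and consecutive frames overlap by 192 samples.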

A sound event is detected by extracting a period of high power. The spectrogram is emphasized at high frequencies and transformed to a log scale; it is then divided into 16 bands of 1/4-octave width, and its envelope is calculated. The maximum peak is searched for in the smoothed spectrogram, and its feature values are calculated: central frequency [Hz], reverberation time [ms], and bandwidth [octave]. Then the classification rules obtained from the cognitive experiments classify the sound into an onomatopoeia. The total recognition time is about 500 ms.
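
Part of this process can be sketched as below: event detection by frame power, grouping into 16 quarter-octave bands on a log scale, and locating the maximum peak. This is a simplified illustration, not the laboratory's implementation. The -20 dB threshold, the 500 Hz lowest band edge, and the 16 kHz sampling rate are assumptions, all function names are hypothetical, and the high-frequency emphasis, smoothing, reverberation time, and bandwidth steps are omitted.

```python
import numpy as np


def detect_event(power_spec, rel_threshold_db=-20.0):
    """First and last frame whose total power is within rel_threshold_db
    of the loudest frame (the threshold value is an assumption)."""
    frame_power = power_spec.sum(axis=1)
    level_db = 10.0 * np.log10(frame_power / (frame_power.max() + 1e-12) + 1e-12)
    active = np.flatnonzero(level_db >= rel_threshold_db)
    return int(active[0]), int(active[-1])


def quarter_octave_bands(power_spec, sample_rate=16000, f_low=500.0, n_bands=16):
    """Sum FFT bins into 16 bands of 1/4-octave width and take the log.

    Assumption: bands start at 500 Hz, so that 16 bands end at 8 kHz.
    """
    n_bins = power_spec.shape[1]
    freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    edges = f_low * 2.0 ** (np.arange(n_bands + 1) / 4.0)
    bands = np.stack(
        [
            power_spec[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
            for lo, hi in zip(edges[:-1], edges[1:])
        ],
        axis=1,
    )
    return 10.0 * np.log10(bands + 1e-12), edges


def peak_center_frequency(log_bands, edges):
    """Geometric centre [Hz] of the band holding the maximum peak."""
    _, band = np.unravel_index(np.argmax(log_bands), log_bands.shape)
    return float(np.sqrt(edges[band] * edges[band + 1]))
```

The input `power_spec` is expected to be a (frames, bins) power spectrogram covering 0 Hz to half the sampling rate.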

Recognition process of the system


Real World Speech and Acoustic Database

In 1998, we measured non-speech sound sources and impulse responses of microphone arrays for studies such as non-speech sound recognition and microphone array signal processing. See the following page for detailed information about recording conditions, the list of sounds, and distribution.

(1) Dry source of non-speech sounds

We measured 80 types of sound sources in an anechoic room using a standard microphone, as dry source data. Sufficient samples (100 samples per source) were collected for the study of learning algorithms.

Recording information for dry sources of non-speech sounds
Environments        Anechoic room, echoic room
Equipment           Standard microphone (B&K 4134), filter & amplifier (B&K 2636), DAT recorder (SONY DTC-77ES)
Sampling            48 kHz, 16 bit
Number of samples   9370 (anechoic room), 805 (echoic room)
Amount of data      840 MB (16-bit RAW format)
SNR                 40 to 50 dB

Sounds in various rooms can be reproduced by convolving a dry source with the impulse response of the room. In this way, the variety of sounds needed to develop robust algorithms can be obtained easily.
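
The convolution itself is a one-line operation. The sketch below assumes that the dry source and the impulse response are already loaded as arrays at the same sampling rate; the function name `reverberate` is hypothetical.

```python
import numpy as np


def reverberate(dry, impulse_response):
    """Simulate a room recording by convolving a dry source with the
    room's impulse response. Output length is len(dry) + len(ir) - 1."""
    dry = np.asarray(dry, dtype=float)
    impulse_response = np.asarray(impulse_response, dtype=float)
    return np.convolve(dry, impulse_response)


if __name__ == "__main__":
    # Toy example: a unit impulse followed by one echo at half amplitude.
    print(reverberate([1.0, 0.0, 0.0], [1.0, 0.5]))
```

For long recordings an FFT-based convolution such as `scipy.signal.fftconvolve` would usually be faster than direct convolution.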

The collected sounds are divided into three types from the viewpoint of the sound source, as follows.

a. Impactive sound sources
Sounds caused by the impact of objects.
b. Movement sound sources
Sounds with a characteristic tone, mostly caused by some human action, whose source cannot be identified by name from the sound alone.
c. Characteristic sound sources
Sounds whose tone identifies the type of source.

Collected sounds of dry sources of non-speech sounds
Type of source             # of samples   Names of sources [onomatopoeia]
Impactive sound sources
a1. Impactive (woody)      1200           wood panel, wood bar [kon, poku]
a2. Impactive (metallic)   1000           metal panel, metal bowl [kan, kin]
a3. Impactive (plastic)    600            plastic case [kacha, karara]
a4. Impactive (ceramic)    800            glass, china [chin]
Movement sound sources
b1. Falling particle       200            pouring beans into box [za-a, bara-bara]
b2. Spouting gas           200            spray, pump [shu-u]
b3. Frictional             500            filing, sawing [gi-i]
b4. Explosive              200            snapping chopsticks, opening cap [baki, puchi]
b5. Bouncing               700            clapping hands [pan, pon]
Characteristic sound sources
c1. Metallic articles      950            small bells, coins [chirin, shan]
c2. Paper                  400            falling books, tearing paper [basa, bi-i]
c3. Instrumental           900            drum, whistle, bugle [pon, pi-i, pafu]
c4. Electronic             450            telephone, electronic toys [pipi, bu-u]
c5. Mechanical             500            winding spring, stapler [ji-i, gacha]

About half of the dry source data (50 samples for each source) is included in the distributed CD-ROM. Signals down-sampled to 16 kHz are also included for the user's convenience.

(2) Impulse response by microphone array

We have started to measure the fundamental characteristics of microphone arrays. A TSP (time-stretched pulse) signal is used to measure the impulse responses of the microphone arrays. We prepared two types of microphone array and four environments: an anechoic room and an echoic room with three different reverberation times. Beamforming signal processing makes use of inverse filters of these impulse responses.
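
In general, an impulse response can be recovered from a known excitation and its recording by applying an inverse filter of the excitation in the frequency domain. The sketch below shows that idea with a regularized spectral division. It is not the laboratory's measurement procedure: the TSP design is not given on this page, so any broadband excitation can stand in for it, and the function name and the regularization constant `rel_eps` are assumptions.

```python
import numpy as np


def estimate_impulse_response(excitation, recording, rel_eps=1e-8):
    """Estimate h from recording = excitation * h (linear convolution).

    Applies the regularized inverse filter conj(X) / (|X|^2 + eps) to the
    recording spectrum; rel_eps (an assumed value) limits noise gain at
    frequencies where the excitation has little energy.
    """
    excitation = np.asarray(excitation, dtype=float)
    recording = np.asarray(recording, dtype=float)
    n = len(recording)
    X = np.fft.rfft(excitation, n)  # zero-padded to the recording length
    Y = np.fft.rfft(recording, n)
    power = np.abs(X) ** 2
    H = Y * np.conj(X) / (power + rel_eps * power.max())
    return np.fft.irfft(H, n)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1024)          # placeholder excitation, not a TSP
    h_true = np.array([1.0, 0.5, 0.25])    # toy impulse response
    y = np.convolve(x, h_true)
    print(np.round(estimate_impulse_response(x, y)[:3], 3))
```

A TSP excitation has a flat magnitude spectrum, which keeps the division well conditioned; the noise excitation in the usage example is only a convenient substitute.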

Recording information for impulse responses by microphone array
Environments          Anechoic room, echoic room (reverberation time: 0.128, 0.313, 0.383 s)
Recording equipment   AD/DA (SDS Corp. S-RTP station), microphone amplifier & filter (ONSOKU OAF-411), speaker amplifier (Yamaha P4050)
Microphone arrays     16ch circular array (ONSOKU), 54ch spherical array (ONSOKU)
Sound sources         Speaker (DIATONE DS-7), head torso (B&K Type4128)
Receiver direction    Speaker: 0, 10, ..., 350 degrees; head torso: 90 degrees (front, right side 45 degrees)
Source signal         TSP signal
Sampling              48 kHz, 16 bit

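Collected impulse responses by microphone arrays
Type of array    Environment     Reverberation time   Sound source   Direction
16ch circular    Anechoic room   0.0 s                Speaker        0, 10, ..., 360 degrees
16ch circular    Anechoic room   0.0 s                Head torso     90 degrees (front, right side 45 degrees)
16ch circular    Echoic room     0.128 s              Speaker        90 degrees
16ch circular    Echoic room     0.128 s              Head torso     90 degrees (front)
16ch circular    Echoic room     0.313 s              Speaker        0, 10, ..., 180 degrees
16ch circular    Echoic room     0.313 s              Head torso     90 degrees (front, right side 45 degrees)
16ch circular    Echoic room     0.383 s              Speaker        0, 10, ..., 180 degrees
16ch circular    Echoic room     0.383 s              Head torso     90 degrees (front, right side 45 degrees)
54ch spherical   Anechoic room   0.0 s                Speaker        0, 10, ..., 350 degrees
54ch spherical   Anechoic room   0.0 s                Head torso     90 degrees (front, right side 45 degrees)
54ch spherical   Echoic room     0.128 s              Speaker        90 degrees
54ch spherical   Echoic room     0.128 s              Head torso     90 degrees (front)
54ch spherical   Echoic room     0.313 s              Speaker        0, 10, ..., 180 degrees
54ch spherical   Echoic room     0.313 s              Head torso     90 degrees (front, right side 45 degrees)
54ch spherical   Echoic room     0.383 s              Speaker        0, 10, ..., 180 degrees
54ch spherical   Echoic room     0.383 s              Head torso     90 degrees (front, right side 45 degrees)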


Publications

[Non-speech sound recognition technology]

[Real World Speech and Acoustic Database]


RWCP Autonomous Learning Functions MRI Laboratory
Otemachi 2-3-6, Chiyoda, Tokyo, Japan, 100-8141
(in Information Research Center, Mitsubishi Research Institute, Inc.)
Director: Kazuo Hiyane <hiya@mri.co.jp>
TEL: +81-3-3277-0750   FAX: +81-3-3277-3471