RWCP Autonomous Learning Functions MRI Laboratory Research Results
[Introduction of our laboratory]
In 1997 and 1998, we developed a non-speech sound recognition system that recognizes short-time impulse sounds as onomatopoeia.
Japanese has more than 200 onomatopoeic words (words that imitate sounds). We classified the words for short-time sounds into six types and analyzed them linguistically.
| -chi | -shi | -n | Type of sound |
|---|---|---|---|
| | | ki-n | Metallic impact (high) |
| ka-chi | ka-shi | ka-n | Metallic impact (middle) |
| ko-chi | | ko-n | Metallic impact (low), General impact |
| | | gi-n | Metallic impact (high, voiced) |
| ga-chi | ga-shi | ga-n | Metallic impact (middle, voiced) |
| go-chi | | go-n | Metallic impact (low, voiced) |
| | | ta-n | General impact (middle) |
| | | to-n | General impact (low) |
| | do-shi | do-n | General impact (very low) |
| | | bo-n | General impact (very low) |
| | zu-shi | zu-n | General impact (extremely low) |
| pi-chi | pi-shi | pi-n | Bouncing (high) |
| pa-chi | pa-shi | pa-n | Bouncing (middle), Explosion |
| | | po-n | Bouncing (low) |
| ba-chi | ba-shi | ba-n | Strong impact, Explosion |
(A) Impact and vanishing sound
Short-time sounds that consist of a single impact and vanish in a very short time
are expressed by words such as `ko-n' and `pa-chi'.
These words basically consist of two letters.
The first consonant is one of /k/, /g/, /t/, /d/, /p/, /b/.
The first vowel is one of /i/, /a/, /o/; the order of their second formant
frequencies (/i/ > /a/ > /o/) corresponds to
the order of pitch of the impulse sound.
The second letter is one of three types: `chi', `shi', and `n'.
Compared with `chi' and `shi', `n' expresses slight reverberation.
The basic rules for impact and vanishing sounds are shown in the table above.
There are also irregular impact and vanishing sounds: `chi-n', `ri-n', `sha-n', and `ja-n' express ringing bells. As the duration of a sound becomes shorter, it becomes impossible to tell sounds apart; in this case, all sounds are expressed as `cha' or `chi'.
The other types of onomatopoeia for short-time sounds are as follows.
We also conducted cognitive experiments on onomatopoeia, in which subjects listened to synthesized sounds generated from gammatone functions, in order to analyze three parameters quantitatively: central frequency, reverberation time, and frequency perturbation. For attenuating sounds such as `ka-n' or `chi-n', the first vowel expresses the central frequency: /o/ means up to 1 kHz, /a/ means from 1 kHz to 2 kHz, and /i/ means above 2 kHz. The ending of the word expresses the reverberation time: at 4 kHz, `chi' means up to 100 ms, `chi-n' means from 100 ms to 200 ms, and `chi-i-n' means above 300 ms. As the frequency perturbation increases, pure gammatone sounds (`ta-n' or `ka-n') come to be recognized as explosive sounds such as `pa-n' or `ba-n'.
Figure panels: (1) central frequency [Hz] and onomatopoeia (`don', `bon', `ton', `pon', `kon', `pan', `kan', `kin', `pin', `chin'); (2) central frequency [Hz] and vowels; (3) reverberation time [ms] and onomatopoeia (`chi', `chi-n', `chi-i-n').
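The ranges reported from the cognitive experiments can be written down as a small lookup. The following is only a sketch: the function names are hypothetical, the handling of values exactly on a boundary is an assumption, and the text gives no label for reverberation times between 200 ms and 300 ms, so that range returns `None`.

```python
def vowel_for_frequency(center_hz):
    """Map central frequency [Hz] to the first vowel, using the quoted ranges.

    Which side a boundary value (1 kHz, 2 kHz) falls on is an assumption.
    """
    if center_hz <= 1000:
        return "o"
    if center_hz <= 2000:
        return "a"
    return "i"


def ending_for_reverberation(reverb_ms):
    """Map reverberation time [ms] to a `chi'-type word (ranges quoted at 4 kHz)."""
    if reverb_ms <= 100:
        return "chi"
    if reverb_ms <= 200:
        return "chi-n"
    if reverb_ms > 300:
        return "chi-i-n"
    return None  # 200 ms to 300 ms is not specified in the text


low_impact_vowel = vowel_for_frequency(500)
bell_word = ending_for_reverberation(400)
```

The frequency perturbation rule (`ta-n' or `ka-n' turning into `pa-n' or `ba-n') is left out because the text gives no numeric threshold for it.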
We measured various types of non-speech sounds and examined their spectral structures for each onomatopoeia. These sound data are contained in the Real World Speech and Acoustic Database.
Figure: Spectral structure of non-speech sounds (Y-axis: frequency [Hz], X-axis: time [ms])

| Onomatopoeia | Sound |
|---|---|
| [ka-n] | Sound of a metallic impact |
| [chi-n] | Sound of ringing bell |
| [pa-n] | Sound of clapping hands |
| [do-n] | Sound of a fallen heavy object |
| [ka-ra-ra] | Sound of bouncing hard object |
| [gi] | Sound of friction |
| [shu-u] | Sound of spouting gas |
| [pi-i] | Sound of whistle or electronic sound |
Figure: System configuration
We developed a prototype system that recognizes ten types of impulse sounds as onomatopoeia.
The sound signal from the microphone passes through a low-pass filter and is sampled by an A/D board. A DSP calculates the spectrum with a 256-point FFT at intervals of 4 ms.
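As a rough illustration of this front end, the sketch below computes a power spectrogram with a 256-point FFT every 4 ms using NumPy. The sampling rate of the prototype is not stated here, so `fs = 16000` is an assumption, as are the Hann window and the function name `power_spectrogram`.

```python
import numpy as np


def power_spectrogram(signal, fs, n_fft=256, hop_ms=4.0):
    """Power spectrum of n_fft-sample frames taken every hop_ms milliseconds."""
    hop = int(round(fs * hop_ms / 1000.0))
    if len(signal) < n_fft:
        raise ValueError("signal is shorter than one FFT frame")
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hanning(n_fft)  # window choice is an assumption
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2  # shape: (frames, n_fft/2 + 1)


fs = 16000  # assumed sampling rate [Hz]
t = np.arange(fs) / fs  # one second of samples
tone = np.sin(2 * np.pi * 2000 * t)  # 2 kHz test tone
spec = power_spectrogram(tone, fs)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * fs / 256
```

At the assumed rate, a 4 ms interval is 64 samples, so successive 256-sample frames overlap by 75%.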
A sound event is detected by extracting a period of high power. The spectrogram is emphasized at high frequencies and transformed to a log scale, then divided into 16 bands of 1/4-octave width, and the envelope of each band is calculated. The maximum peak is searched for in the smoothed spectrogram, and its feature values are calculated: central frequency [Hz], reverberation time [ms], and bandwidth [octave]. Then the classification rules obtained from the cognitive experiments classify the sound as an onomatopoeia. The total recognition time is about 500 ms.
Figure: Recognition process of the system
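The recognition process described above can be sketched roughly as follows. This is not the prototype's implementation: the sampling rate, the Hann window, the lowest band edge `f_low`, and the thresholds `threshold_db`, `decay_db`, and `width_db` are all assumptions, and the high-frequency emphasis and smoothing steps are omitted for brevity.

```python
import numpy as np


def spectrogram(x, fs, n_fft=256, hop_ms=4.0):
    """Power spectrogram: 256-point FFT every 4 ms (Hann window is an assumption)."""
    hop = int(round(fs * hop_ms / 1000.0))
    n_frames = 1 + (len(x) - n_fft) // hop
    win = np.hanning(n_fft)
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2


def detect_event(spec, threshold_db=20.0):
    """First and last frame whose total power is within threshold_db of the maximum."""
    power_db = 10.0 * np.log10(spec.sum(axis=1) + 1e-12)
    active = np.flatnonzero(power_db >= power_db.max() - threshold_db)
    return int(active[0]), int(active[-1])


def band_envelopes(spec, fs, n_fft=256, n_bands=16, f_low=500.0):
    """Log-power envelopes in n_bands bands of 1/4 octave, starting at f_low (assumed)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    edges = f_low * 2.0 ** (np.arange(n_bands + 1) / 4.0)
    env = np.empty((spec.shape[0], n_bands))
    for b in range(n_bands):
        in_band = (freqs >= edges[b]) & (freqs < edges[b + 1])
        env[:, b] = spec[:, in_band].sum(axis=1)
    centers = np.sqrt(edges[:-1] * edges[1:])  # geometric center of each band
    return 10.0 * np.log10(env + 1e-12), centers


def extract_features(env_db, centers, hop_ms=4.0, decay_db=20.0, width_db=10.0):
    """Return (central frequency [Hz], decay time [ms] or None, bandwidth [octave])."""
    frame, band = np.unravel_index(np.argmax(env_db), env_db.shape)
    peak = env_db[frame, band]
    # Decay time: first frame at or after the peak where the band is decay_db below it.
    below = np.flatnonzero(env_db[frame:, band] < peak - decay_db)
    decay_ms = float(below[0] * hop_ms) if below.size else None
    # Bandwidth: number of bands within width_db of the peak at the peak frame.
    bandwidth_oct = 0.25 * int(np.sum(env_db[frame] >= peak - width_db))
    return float(centers[band]), decay_ms, bandwidth_oct


# Synthetic impulse sound: a 1.5 kHz tone that starts at 0.1 s and decays exponentially.
fs = 16000  # assumed sampling rate [Hz]
t = np.arange(6400) / fs
x = np.zeros(8000)
x[1600:] = np.exp(-t / 0.02) * np.sin(2 * np.pi * 1500 * t)

spec = spectrogram(x, fs)
start, end = detect_event(spec)
env_db, centers = band_envelopes(spec, fs)
center_hz, decay_ms, bandwidth_oct = extract_features(env_db, centers)
```

The decay time used here (a 20 dB drop in the peak band) is only a stand-in for the reverberation time of the text, whose exact definition is not given.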
In 1998, we measured non-speech sound sources and impulse responses of microphone arrays for studies such as non-speech sound recognition and microphone array signal processing. See the following page for detailed information about the recording conditions, the list of sounds, and distribution.
We measured 80 types of sound sources with a standard microphone in an anechoic room as dry source data. A sufficient number of samples (100 samples per source) was collected for the study of learning algorithms.
| Environments | Anechoic room, Echoic room |
|---|---|
| Equipment | Standard microphone (B&K 4134), Filter & Amplifier (B&K 2636), DAT recorder (SONY DTC-77ES) |
| Sampling rate | 48 kHz, 16 bit |
| Number of samples | 9370 (Anechoic room), 805 (Echoic room) |
| Amount of data | 840 MB (16 bit RAW format) |
| SNR | 40 to 50 dB |
Sounds in various rooms can be reproduced by convolving a dry source with the impulse response of the room. In this way, the variety of sounds needed to develop robust algorithms can be obtained easily.
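A minimal sketch of this convolution step with NumPy is shown below. The toy signals and the function name `reverberate` are illustrative only; in practice `dry` would be a recorded dry source and `impulse_response` a measured room response at the same sampling rate.

```python
import numpy as np


def reverberate(dry, impulse_response):
    """Linear convolution of a dry source with a room impulse response."""
    return np.convolve(dry, impulse_response, mode="full")


dry = np.array([1.0, 0.5, 0.0, 0.0])       # toy dry source (hypothetical)
impulse_response = np.array([1.0, 0.0, 0.3])  # direct path plus one weaker echo (hypothetical)
wet = reverberate(dry, impulse_response)
```

The output is longer than the dry source by the length of the impulse response minus one sample, which is the reverberant tail.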
The collected sounds are divided into three types according to the kind of sound source, as follows.
| Category of sources | Type of sources | # of samples | Name of sources [onomatopoeia] |
|---|---|---|---|
| Impactive sound sources | a1. Impactive (woody) | 1200 | wood panel, wood bar [kon, poku] |
| Impactive sound sources | a2. Impactive (metallic) | 1000 | metal panel, metal bowl [kan, kin] |
| Impactive sound sources | a3. Impactive (plastic) | 600 | plastic case [kacha, karara] |
| Impactive sound sources | a4. Impactive (ceramic) | 800 | glass, china [chin] |
| Movement sound sources | b1. Falling particle | 200 | pouring beans into box [za-a, bara-bara] |
| Movement sound sources | b2. Spouting gas | 200 | spray, pump [shu-u] |
| Movement sound sources | b3. Frictional | 500 | filing, sawing [gi-i] |
| Movement sound sources | b4. Explosive | 200 | snapping chopsticks, opening cap [baki, puchi] |
| Movement sound sources | b5. Bouncing | 700 | clapping hands [pan, pon] |
| Characteristic sound sources | c1. Metallic articles | 950 | small bells, coins [chirin, shan] |
| Characteristic sound sources | c2. Paper | 400 | falling books, tearing paper [basa, bi-i] |
| Characteristic sound sources | c3. Instrumental | 900 | drum, whistle, bugle [pon, pi-i, pafu] |
| Characteristic sound sources | c4. Electronic | 450 | telephone, electronic toys [pipi, bu-u] |
| Characteristic sound sources | c5. Mechanical | 500 | winding spring, stapler [ji-i, gacha] |
About half of the dry source data is contained in the distributed CD-ROM (50 samples for each source). Signals downsampled to 16 kHz are also included for the user's convenience.
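For readers who want to derive 16 kHz signals from the 48 kHz originals themselves, one possible approach is sketched below: low-pass filter below the new Nyquist frequency, then keep every third sample. The filter design (a 121-tap Hamming-windowed sinc) and the function name `downsample_by_3` are assumptions; the method actually used for the distributed data is not described.

```python
import numpy as np


def downsample_by_3(x, n_taps=121):
    """Low-pass filter below the new Nyquist frequency, then keep every 3rd sample."""
    n = np.arange(n_taps) - (n_taps - 1) / 2
    # Cutoff in cycles per input sample: the new Nyquist is 1/6 (8 kHz at 48 kHz);
    # 10% is left as margin for the filter's transition band.
    cutoff = 0.9 / 6.0
    h = 2 * cutoff * np.sinc(2 * cutoff * n) * np.hamming(n_taps)
    h /= h.sum()  # unity gain at DC
    # mode="same" keeps the output aligned with the input (zero net delay).
    return np.convolve(x, h, mode="same")[::3]


fs_in = 48000
t = np.arange(4800) / fs_in  # 0.1 s at 48 kHz
tone = np.sin(2 * np.pi * 1000 * t)  # 1 kHz test tone
tone_16k = downsample_by_3(tone)     # 0.1 s at 16 kHz
```

Skipping the low-pass filter would fold components above 8 kHz back into the audible band as aliasing.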
We have started to measure the fundamental characteristics of microphone arrays. A TSP (time-stretched pulse) signal is used to measure the impulse responses of the microphone arrays. We prepared two types of microphone arrays and four environments: an anechoic room and an echoic room with three different reverberation times. Beamforming signal processing makes use of the inverse filters of these impulse responses.
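The idea behind a TSP measurement can be sketched as follows. The code uses one commonly cited frequency-domain definition of a TSP, a unit-magnitude spectrum with quadratic phase; which TSP variant and which parameters were actually used is not stated, so the length `n`, the stretch parameter `m`, and the simulated system are all hypothetical. Because the spectrum has unit magnitude, its inverse filter is simply the complex conjugate.

```python
import numpy as np


def tsp_spectrum(n, m):
    """Half spectrum (bins 0..n/2) of a TSP of even length n with stretch parameter m."""
    k = np.arange(n // 2 + 1)
    return np.exp(-1j * 4.0 * np.pi * m * k ** 2 / n ** 2)


def estimate_impulse_response(recorded, n, m):
    """Apply the inverse TSP filter (the conjugate spectrum) by circular convolution."""
    inverse = np.conj(tsp_spectrum(n, m))
    return np.fft.irfft(np.fft.rfft(recorded, n) * inverse, n)


n, m = 4096, 1024  # hypothetical length and stretch parameter
tsp = np.fft.irfft(tsp_spectrum(n, m), n)  # time-domain excitation signal

# Simulated system: direct sound plus one echo 50 samples later (hypothetical).
h_true = np.zeros(n)
h_true[0] = 1.0
h_true[50] = 0.5

# Simulated recording: the TSP circularly convolved with the system response.
recorded = np.fft.irfft(np.fft.rfft(tsp) * np.fft.rfft(h_true), n)
h_est = estimate_impulse_response(recorded, n, m)
```

A real measurement yields a linear rather than circular response, so the excitation is usually repeated or zero-padded before this step; that detail is omitted here.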
| Environments | Anechoic room, Echoic room (Reverberation time: 0.128, 0.313, 0.383 s) |
|---|---|
| Recording equipment | AD/DA (SDS Corp. S-RTP station), Microphone amplifier & filter (ONSOKU OAF-411), Speaker amplifier (Yamaha P4050) |
| Microphone array | 16ch circular array (ONSOKU), 54ch spherical array (ONSOKU) |
| Sound source | Speaker (DIATONE DS-7), Head torso (B&K Type 4128) |
| Receiver direction | Speaker: 0, 10, ..., 350 degrees; Head torso: 90 degrees (front, right side 45 degrees) |
| Source signal | TSP signal |
| Sampling rate | 48 kHz, 16 bit |
| Type of array | Environment | Reverberation time | Sound source | Direction |
|---|---|---|---|---|
| 16ch Circular | Anechoic room | 0.0 sec | Speaker | 0, 10, ..., 360 degree |
| 16ch Circular | Anechoic room | 0.0 sec | Head torso | 90 degree (front, right side 45 degree) |
| 16ch Circular | Echoic room | 0.128 sec | Speaker | 90 degree |
| 16ch Circular | Echoic room | 0.128 sec | Head torso | 90 degree (front) |
| 16ch Circular | Echoic room | 0.313 sec | Speaker | 0, 10, ..., 180 degree |
| 16ch Circular | Echoic room | 0.313 sec | Head torso | 90 degree (front, right side 45 degree) |
| 16ch Circular | Echoic room | 0.383 sec | Speaker | 0, 10, ..., 180 degree |
| 16ch Circular | Echoic room | 0.383 sec | Head torso | 90 degree (front, right side 45 degree) |
| 54ch Spherical | Anechoic room | 0.0 sec | Speaker | 0, 10, ..., 350 degree |
| 54ch Spherical | Anechoic room | 0.0 sec | Head torso | 90 degree (front, right side 45 degree) |
| 54ch Spherical | Echoic room | 0.128 sec | Speaker | 90 degree |
| 54ch Spherical | Echoic room | 0.128 sec | Head torso | 90 degree (front) |
| 54ch Spherical | Echoic room | 0.313 sec | Speaker | 0, 10, ..., 180 degree |
| 54ch Spherical | Echoic room | 0.313 sec | Head torso | 90 degree (front, right side 45 degree) |
| 54ch Spherical | Echoic room | 0.383 sec | Speaker | 0, 10, ..., 180 degree |
| 54ch Spherical | Echoic room | 0.383 sec | Head torso | 90 degree (front, right side 45 degree) |
[Non-speech sound recognition technology]
[Real World Speech and Acoustic Database]