RWCP Sound Scene Database Microphone Array Database |
ATR Spoken Language Translation Research Laboratories, Dept. 1, Satoshi NAKAMURA, and Takanobu NISHIURA
http://www.slt.atr.co.jp/dept1/index.html
| 1. Objectives |
This database was recorded so that various acoustic environments can be simulated by convolving dry sources with numerous impulse responses. Moving sound sources, which cannot be simulated by such convolution, were recorded in real environments. The database includes impulse responses collected in 9 kinds of room, 5 kinds of actual recorded speech, 5 kinds of actual recorded talkers who were moving, etc.
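The convolution-based simulation described above can be sketched as follows. This is a minimal illustration, not the authors' processing chain: the function names `simulate_channel` and `simulate_array` are hypothetical, and the impulse responses are assumed to be available as one row per transducer.

```python
import numpy as np

def simulate_channel(dry, impulse_response):
    """Convolve a dry source with the impulse response of one transducer."""
    dry = np.asarray(dry, dtype=np.float64)
    h = np.asarray(impulse_response, dtype=np.float64)
    # Full linear convolution: output length is len(dry) + len(h) - 1.
    return np.convolve(dry, h)

def simulate_array(dry, impulse_responses):
    """Simulate every transducer; impulse_responses holds one row per channel."""
    return np.stack([simulate_channel(dry, h) for h in impulse_responses])
```

For impulse responses as long as those in this database (up to 125000 samples), an FFT-based convolution would likely be much faster than the direct form used here.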
| 2. Outline of recording |
| Item | Instrument |
|---|---|
| AD/DA | MD-8000 mk2 (PAVEC) |
| Amplifier | Transducer amplifier (PAVEC MD-8000 mk2), speaker amplifier (YAMAHA P4050) |
| Microphone array | Onkyo Sokki corp. (transducer: Hoshiden KUC1333) |
| Speaker | DIATONE DS-7 |
| Moving sound source | Custom fabricated moving sound source system (Nogi corp.) |
| Location measurement | OPTOTRAK (Northern Digital Inc.) |
Table 1 shows the instruments that were used for recording the data. Two variations of microphone arrays were employed.
Circle: Circle type microphone array (16 channels)
Cirline: Circle + Linear type microphone array (30 channels)
Figure 1 depicts the location of the transducers. Number 1 is the forward direction. The circle type microphone array has a diameter of 30 cm and has 16 transducers located at equal intervals. The linear type microphone array has 14 transducers located at 2.83 cm intervals.
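As an illustration of the circular geometry, the transducer coordinates can be computed as sketched below. The orientation is an assumption, since Figure 1 is not reproduced here: transducer 1 is placed on the positive x axis (taken as the forward direction) and the numbering is assumed to run counterclockwise. The function name is hypothetical.

```python
import numpy as np

def circular_array_positions(n_mics=16, diameter_m=0.30):
    """Return (n_mics, 2) x/y coordinates in metres for a circular array.

    Assumption: transducer 1 lies on the +x axis and the remaining
    transducers follow counterclockwise at equal angular intervals.
    """
    radius = diameter_m / 2.0
    angles = 2.0 * np.pi * np.arange(n_mics) / n_mics
    return np.column_stack((radius * np.cos(angles), radius * np.sin(angles)))
```

With these values, neighbouring transducers on the circle are separated by a chord of 0.30 * sin(pi / 16) metres, roughly 5.9 cm.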


| 3. OPTOTRAK |
Real time 3-dimensional position measuring system
1. Tracks 256 infrared diodes in real time
2. Three dimensional position real time measurement
3. Resolution of 0.01 mm, RMS accuracy of 0.1 mm
4. Maximum sampling frequency of 400 Hz
System Design : Data Acquisition Unit II (by Northern Digital Inc.)
Synchronization signal input-output system
Records speech data and OPTOTRAK position data in synchronization
1. Analog data : 16ch input and 2ch output
2. Digital : 8ch input

| 4. Recording conditions |
The average ambient noise level, humidity and temperature in each recording room are shown in Table 2. Note that the ambient noise level is slightly higher for a moving talker because of the effects of OPTOTRAK. The symbol (*) in Table 2 indicates a sound level meter ''under flow''.
| Room | Reverberation time | Temperature | Humidity | dBA | dBC |
|---|---|---|---|---|---|
| Anechoic room | 0.0 | 21.4 degrees | 44.4 % | 15.9* | 29.9 |
| Echo room(cylinder) | 0.12 | - | - | 18.5* | 48.0 |
| Echo room(cylinder) | 0.31 | - | - | 18.5* | 48.3 |
| Echo room(cylinder) | 0.38 | - | - | 17.4* | 46.0 |
| Echo room(panel) | 0.3 | 22.1 degrees | 37.1 % | 18.9* | 43.8 |
| Echo room(panel) | 1.3 | 22.1 degrees | 37.1 % | 19.6* | 49.6 |
| Conference room | 0.78 | 20.7 degrees | 36.2 % | 43.2 | 50.9 |
| Tatami - floored room (L) | 0.60 | 11.5 degrees | 37.0 % | 44.7 | 55.1 |
| Tatami - floored room (S) | 0.47 | 12.0 degrees | 36.8 % | 44.8 | 56.7 |
| 5. Recording scenery |
![]() |
![]() |
![]() |
![]() |
| 6. Recording configuration |
![]() Information from each experimental instrument |
![]() |
![]() |
| 7. Recorded data |
This database includes the 5 items listed below. Impulse responses are stored as float type binary data (4 bytes) and speech data as short type binary data (2 bytes). All binary data are little endian, and the sampling frequency is 48 kHz.
2. Moving sound source
3. Diffused sound environment
4. Ambient noise
5. Piston phone
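Given the formats stated above (4-byte little endian floats for impulse responses, 2-byte little endian shorts for speech, 48 kHz sampling), the files could be read as sketched below. This assumes that each file is a headerless stream of samples for a single transducer, which the description suggests but does not state outright. The function names are hypothetical.

```python
import numpy as np

FS = 48000  # sampling frequency in Hz, as stated in the description

def read_impulse_response(path):
    """Read 4-byte little endian float samples (assumed headerless)."""
    return np.fromfile(path, dtype="<f4")

def read_speech(path):
    """Read 2-byte little endian signed integer samples (assumed headerless)."""
    return np.fromfile(path, dtype="<i2")

def duration_seconds(samples):
    """Length of a sample sequence in seconds at 48 kHz."""
    return len(samples) / FS
```

The explicit `<` in the dtype strings keeps the byte order fixed regardless of the machine that reads the data.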
Impulse response
Impulse responses were measured using the TSP (time-stretched pulse)
method [1]. The TSP length is 65536 points, and synchronous addition
was performed 16 times. The TSP used during recording and the
time-inverted TSP are provided below. The figure shows a sample
impulse response.
[TSP data], [Time inverse TSP data]
![]() A sample of impulse response |
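The measurement principle can be sketched as follows: the recorded TSP responses are averaged (synchronous addition) and then convolved with the time-inverted TSP to obtain the impulse response. This is only a sketch under stated assumptions, not the procedure actually used for the database: the repetitions are assumed to be already cut into aligned periods of equal length, the convolution is taken as circular over one period, and any amplitude normalization is ignored. The function name and argument layout are hypothetical.

```python
import numpy as np

def impulse_response_from_tsp(recorded_periods, inverse_tsp):
    """Estimate an impulse response from repeated TSP recordings.

    recorded_periods: array of shape (repetitions, n), one aligned TSP
        period per row (assumed layout, e.g. 16 rows of 65536 samples).
    inverse_tsp: time-inverted TSP with n samples.
    """
    periods = np.asarray(recorded_periods, dtype=np.float64)
    inv = np.asarray(inverse_tsp, dtype=np.float64)
    n = periods.shape[1]
    # Synchronous addition: averaging aligned repetitions suppresses
    # noise that is uncorrelated with the excitation signal.
    averaged = periods.mean(axis=0)
    # Circular convolution with the inverse TSP, computed via the FFT.
    spectrum = np.fft.rfft(averaged, n) * np.fft.rfft(inv, n)
    return np.fft.irfft(spectrum, n)
```

Averaging 16 repetitions of uncorrelated noise reduces its power by a factor of 16, which corresponds to an improvement of about 12 dB in signal-to-noise ratio.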
Speech data
Speech data were recorded for 90 sentences: 50 phoneme balanced
sentences * 1 subject from the ATR (Advanced Telecommunications
Research Institute International) database and 10 sentences * 4
subjects from the TIMIT (The DARPA TIMIT Acoustic-Phonetic Continuous
Speech Corpus) database. The speech data are compressed in ZIP
format. After unzipping, the speech data appear as individual files
(mmysda01.1, mmysda01.2,...) in the directory. Samples of speech
waveforms recorded in an anechoic room and in a variable echo room
are shown below.
![]() |
![]() |
Moving sound source
Moving sound sources were recorded with the custom fabricated moving
sound source system shown in Figure 2. The speaker is fixed at the
end of a rod, and the microphone array is placed at the base of the
center pole. Impulse responses for fixed positions, as shown in
2.(A) above, were also recorded at some points on this track. Speech
signals from a moving sound source were recorded while manually
moving the rod, and the source positions measured by OPTOTRAK were
also recorded each time. Source position data are little endian
float type binary data (4 bytes) sampled at 100 Hz. The files *.x,
*.y and *.z hold the coordinates that represent the distances from
the microphone array to the speaker in units of [mm]. The figure
below shows the track of a moving sound source recorded with the
custom fabricated moving sound source system.
![]() |
![]() |
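The position files described in this section could be read and related to the audio samples as sketched below. The sketch assumes that the three coordinate files share a common path prefix, that they are headerless, and that the position and audio streams start at the same instant. The synchronized recording described earlier makes this plausible, but it is not confirmed here. All function names are hypothetical.

```python
import numpy as np

POSITION_FS = 100   # position sampling frequency in Hz
AUDIO_FS = 48000    # audio sampling frequency in Hz

def read_track(prefix):
    """Read prefix.x, prefix.y, prefix.z into an (n, 3) array in mm."""
    axes = [np.fromfile(prefix + ext, dtype="<f4") for ext in (".x", ".y", ".z")]
    n = min(len(a) for a in axes)  # guard against unequal file lengths
    return np.column_stack([a[:n] for a in axes])

def distance_mm(track):
    """Distance from the microphone array origin to the speaker, per sample."""
    return np.linalg.norm(track, axis=1)

def audio_index(position_index):
    """Audio sample index matching a position sample (assumes a common start)."""
    return position_index * (AUDIO_FS // POSITION_FS)
```

At these rates each position sample spans 480 audio samples, so position values would need to be interpolated if a per-sample source position is required.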
Diffused sound environment
Impulse responses were measured with the speaker turned towards the
wall in the diffused sound environment. The recording configurations
are shown below.
![]() |
![]() |
Ambient noise
Ambient noise was recorded 10 times * 2 subjects in the conference room.
Piston phone
Transducer calibration signals were recorded using a piston phone
(ONSOKU TYPE2126).
| 8. A list of data |
RT: Reverberation time in room
CH: The number of transducers
ang: The number of recorded directions
len: The length of data per transducer (in speech data, the number of sentences)
Byte: The number of bytes per sample (4: float, 2: short; all data is little endian)
Total: Capacity
Room RT CH ang len Byte Total
/MICARRAY/data1 ... Impulse response
+--/circle
| +--- ANE/imp000 Anechoic room 0.00 16 36 5000 4 11.2MB
| +--- E1A/imp000 Echo room(cylinder)A 0.12 16 1 12000 4 753kB
| +--- E1B/imp000 Echo room(cylinder)B 0.31 16 19 30000 4 34.9MB
| +--- E1C/imp000 Echo room(cylinder)C 0.38 16 19 36000 4 41.8MB
|
+--/cirline
+--- E2A/imp000 Echo room(panel)A 0.30 30 9 30000 4 31.0MB
+--- E2B/imp000 Echo room(panel)B 1.30 30 9 125000 4 128.8MB
+--- OFC/imp000 Conference room 0.78 30 9 75000 4 77.3MB
+--- JR1/imp000 Tatami - floored room (L) 0.60 30 9 60000 4 61.9MB
+--- JR2/imp000 Tatami - floored room (S) 0.47 30 7 45000 4 36.1MB
/MICARRAY/data2 ...Speech( E2A in DVD-ROM Vol.1, and E2B OFC JR1 JR2 in DVD-ROM Vol.2)
+-- /cirline/
+--- E2A/mmysda01 Echo room(panel)A 0.30 30 1 90sent. 2 1.2GB
+--- E2B/mmysda01 Echo room(panel)B 1.30 30 1 90sent. 2 1.2GB
+--- OFC/mmysda01 Conference room 0.78 30 1 90sent. 2 1.2GB
+--- JR1/mmysda01 Tatami - floored room (L) 0.60 30 1 90sent. 2 1.2GB
+--- JR2/mmysda01 Tatami - floored room (S) 0.47 30 1 90sent. 2 1.2GB
/MICARRAY/data3 ...Movement (Impulse response)
+-- /cirline
+--- ANE/imp000 Anechoic room 0.00 30 5 5000 4 2.9MB
+--- E2A/imp000 Echo room(panel)A 0.30 30 5 30000 4 17.2MB
+--- E2B/imp000 Echo room(panel)B 1.30 30 5 125000 4 71.6MB
+--- OFC/imp000 Conference room 0.78 30 5 78000 4 43.0MB
+--- JR1/imp000 Tatami - floored room (L) 0.60 30 5 60000 4 34.4MB
/MICARRAY/data4 ...Movement(Speech)(in DVD-ROM Vol.3)
+-- /cirline
+--- ANE/mmysda01 Anechoic room 0.00 30 1 90sent. 2 1.2GB
+--- E2A/mmysda01 Echo room(panel)A 0.30 30 1 90sent. 2 1.2GB
+--- E2B/mmysda01 Echo room(panel)B 1.30 30 1 90sent. 2 1.2GB
+--- OFC/mmysda01 Conference room 0.78 30 1 90sent. 2 1.2GB
+--- JR1/mmysda01 Tatami - floored room (L) 0.60 30 1 90sent. 2 1.2GB
/MICARRAY/data5 ...Diffused sound environment
+-- /cirline
+--- OFC/imp_rev Conference room 0.78 30 1 75000 4 8.6MB
/MICARRAY/data6 ...Ambient noise
+-- /cirline
+--- OFC/ambient1 Conference room 0.78 30 1 480000 2 27.5MB
+--- OFC/ambient2 Conference room 0.78 30 1 480000 2 27.5MB
/MICARRAY/data7 ...Piston Phone
+--/cirline
+--- Equalize - - 30 1 144000 2 8.3MB
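For the rows whose len column is a sample count, the Total column appears to be close to the product CH * ang * len * Byte expressed in units of 2^20 bytes. The sketch below checks this arithmetic. The helper names are hypothetical, and the relation is an observation from the listed numbers rather than a documented rule. It does not apply to the speech rows, where len is a number of sentences.

```python
def total_bytes(ch, ang, length, bytes_per_sample):
    """Raw data size: transducers x directions x samples x bytes per sample."""
    return ch * ang * length * bytes_per_sample

def to_mib(n_bytes):
    """Convert a byte count to units of 2**20 bytes."""
    return n_bytes / 2**20

def duration_seconds(length, fs=48000):
    """Duration of one transducer's data at the 48 kHz sampling frequency."""
    return length / fs
```

For example, `total_bytes(30, 9, 30000, 4)` is 32,400,000 bytes, or about 30.9 in units of 2^20 bytes, against the 31.0MB listed for E2A/imp000. Each ambient noise file of 480000 samples per transducer lasts 10 seconds.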
| Reference |