We report on the design and collection of a multi-modal data corpus for cognitive acoustic scene analysis. Sounds are generated by stationary and moving sources (people), namely omni-directional speakers mounted on the subjects' heads. One or two subjects walk along predetermined systematic and random paths, both in synchrony and out of sync. Sound was captured by multiple microphone systems, including a directional array of four MEMS microphones, two electret microphones placed in the ears of a stuffed gerbil head, and a Head Acoustics head-and-shoulder unit with ICP microphones. Three micro-Doppler units operating at different frequencies were employed to capture the gait and articulatory signatures as well as the locations of the people in the scene. Three ground vibration sensors recorded the footsteps of the walking subjects. A 3D MESA camera and a web camera provided 2D and 3D visual data for system calibration and ground truth. Data were collected in three environments: a well-controlled anechoic chamber, an indoor large classroom, and a natural outdoor courtyard. A software tool has been developed for browsing and visualizing the data.