Welcome to the
Digital Audio Processing Lab

The Digital Audio Processing Lab is a signal processing research facility dedicated to speech and audio applications. Research projects involving Ph.D. and M.Tech. students include music content analysis and retrieval, speech prosody for language learning, and speech enhancement and recognition. The lab is equipped with computers, GPU servers, and recording and listening equipment. The interdisciplinary flavour of several of the projects has stimulated many interactions with musicians, musicologists and language experts. Further, audio IP developed in the lab has been incorporated into a few products and services for the entertainment industry.

Research Areas and Projects

Audio Music Information Retrieval (MIR)

Audio MIR techniques seek to bridge the gap between semantically useful music concepts and audio signal descriptions, to benefit applications such as music recommendation, performance analysis and musicology research. In an oral tradition such as the art music of India, musicology research can reap rich benefits from computational methods applied to the audio recordings of great artists. Apart from a better understanding of the common practices employed in performance, such methods can yield deeper insights into the structural and theoretical aspects of the genre. Our work has involved developing tools for the extraction of musically relevant attributes from vocal and instrumental audio recordings. Models for the culture-dependent notion of melodic similarity are refined based on musicological concepts drawn from raga grammar. The models, further tested in experiments on melodic phrase perception by musicians, reveal interesting behavioural differences between musicians trained in the genre and those who are not.
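To give a flavour of the attribute-extraction tools involved, here is a minimal sketch of single-frame pitch (f0) estimation by autocorrelation peak picking, a classic baseline for melody extraction. The function name, frequency limits and frame length are illustrative choices, not the lab's actual implementation:

```python
import numpy as np

def autocorr_f0(frame, fs, fmin=80.0, fmax=800.0):
    """Estimate the fundamental frequency of one audio frame by finding
    the autocorrelation peak within the plausible pitch-period range.
    (Illustrative baseline; hypothetical parameter choices.)"""
    frame = frame - np.mean(frame)
    # Autocorrelation at non-negative lags
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Search only lags corresponding to [fmin, fmax]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return fs / lag

fs = 8000
t = np.arange(0, 0.04, 1 / fs)          # one 40 ms frame
frame = np.sin(2 * np.pi * 220 * t)     # a 220 Hz tone
print(autocorr_f0(frame, fs))
```

Because the lag is an integer number of samples, the estimate is quantised (here it lands a couple of Hz above 220); production melody extractors interpolate the peak and track pitch across frames.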

All music has structure at a variety of time scales. Uncovering this structure can provide powerful visual representations of the concert audio that are of potential value in music appreciation and pedagogy. In the case of Indian art music, rhythm and tempo form the basis of structural segments at the largest time scales of the concert. We study the application of supervised and unsupervised segmentation methods based on musically motivated acoustic features computed on the concert audio. Source separation methods are investigated for better segmentation and visualization of concert structure.
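One widely used unsupervised segmentation approach, sketched below under the assumption of generic frame-wise features, is novelty detection: slide a checkerboard kernel along the diagonal of the feature self-similarity matrix and mark peaks as candidate boundaries. This is the textbook method, not necessarily the lab's exact pipeline:

```python
import numpy as np

def novelty_curve(features, kernel_size=8):
    """Boundary detection on a self-similarity matrix (SSM) with a
    checkerboard kernel. features: (n_frames, n_dims) array of e.g.
    rhythm or timbre features. Peaks in the returned curve suggest
    segment boundaries. (Generic sketch; kernel size is illustrative.)"""
    # Cosine self-similarity matrix
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    normed = features / np.maximum(norms, 1e-12)
    ssm = normed @ normed.T

    # Checkerboard: +1 on within-segment blocks, -1 on cross-boundary blocks
    k = kernel_size
    kernel = np.kron(np.array([[1, -1], [-1, 1]]), np.ones((k, k)))

    n = len(features)
    novelty = np.zeros(n)
    for i in range(k, n - k):
        novelty[i] = np.sum(ssm[i - k:i + k, i - k:i + k] * kernel)
    return novelty

# Toy concert: two homogeneous sections with a boundary at frame 50
feats = np.vstack([np.tile([1.0, 0.0], (50, 1)),
                   np.tile([0.0, 1.0], (50, 1))])
print(int(np.argmax(novelty_curve(feats))))  # 50
```

In practice the features would be beat-synchronous and the peak picking adaptive, but the same SSM picture is what the structural visualisations are built on.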

An important application of MIR methods is music pedagogy, where automatic feedback on instrument playing or singing can be helpful. Melodic accuracy, for instance, is judged by comparing the detected note events to a reference in terms of pitch and duration values. Our work addresses the goodness of instrumental sounds such as tabla strokes in terms of correct articulation by the learner. The precise hand gesture and position of striking the tabla membrane influence the timbre and affect the acoustic properties in a manner that can be detected by spectro-temporal analysis of the audio signal. Other work involves the flexible generation of natural-sounding musical tones based on deep learning.
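The melodic-accuracy comparison described above can be sketched as follows, assuming notes are already transcribed as (pitch, duration) pairs; the tolerances and scoring rule are hypothetical, chosen only to make the idea concrete:

```python
def melodic_accuracy(detected, reference, pitch_tol=0.5, dur_tol=0.2):
    """Score sung/played notes against a reference melody.

    Each note is a (midi_pitch, duration_seconds) pair. A detected note
    counts as correct when its pitch is within `pitch_tol` semitones and
    its duration within a `dur_tol` fraction of the reference value.
    (Illustrative thresholds, not the lab's actual criteria.)"""
    n = min(len(detected), len(reference))
    correct = 0
    for (dp, dd), (rp, rd) in zip(detected[:n], reference[:n]):
        pitch_ok = abs(dp - rp) <= pitch_tol
        dur_ok = abs(dd - rd) <= dur_tol * rd
        correct += pitch_ok and dur_ok
    return correct / len(reference)

reference = [(60, 0.5), (62, 0.5), (64, 1.0)]
detected = [(60.2, 0.52), (61.0, 0.5), (64.1, 0.9)]
print(melodic_accuracy(detected, reference))  # 2 of 3 notes accepted
```

A real system must first segment the audio into note events and align them to the reference, which is where most of the difficulty lies.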

Automating Spoken Language Assessment

Motivated by the lack of exposure to spoken English faced by children in rural schools in the country, this project targets the development of speech technology for automatic feedback on oral reading skill. Good reading involves accurate word decoding as well as fluency and expressiveness. There have been a number of studies on the effectiveness of automatic feedback on the pronunciation of words and the sounds of the language therein. However, there has been relatively little work on predicting speaking skill from detected prosody, although no language teacher would dispute the importance of prosodic fluency for intelligible and natural-sounding speech.

Our research is directed toward exploiting automatic speech recognition and prosody modeling in the assessment of speaking skills. The known challenges are the language dependence of prosody and its relatively high variability in natural speech compared to segmental variability. Our current context is the oral reading of stories by children learning English as a second language in school. The low entropy of the text facilitates the automatic segmentation of the signal, as required for the estimation of prosodic events at the word level. New methods for robustly extracting prosodic events from noisy speech are being developed. The acoustic correlates of prosodic functions such as phrasing and prominence are investigated for children of specific first languages (L1) reading English text. High-level attributes such as comprehensibility and confidence are predicted from the combination of extracted lexical and prosodic cues. This work is supported by the Tata Centre for Technology and Design (TCTD) at IIT Bombay and by the Abdul Kalam Technology Innovation National Fellowship 2020-2023.
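A deliberately simplified sketch of word-level prominence scoring from frame-wise pitch and energy, assuming word time spans are already known from forced alignment: the function name, the z-normalisation and the equal weighting are hypothetical, and real systems model many more acoustic correlates (duration, spectral balance, pauses):

```python
import numpy as np

def prominence_scores(f0, energy, word_spans):
    """Crude word-level prominence: average z-normalised log-f0 and
    log-energy within each word's frame span. (Hypothetical
    simplification for illustration only.)"""
    def znorm(x):
        return (x - np.mean(x)) / (np.std(x) + 1e-12)

    zf0 = znorm(np.log(np.maximum(f0, 1.0)))
    zen = znorm(np.log(np.maximum(energy, 1e-8)))
    return [0.5 * np.mean(zf0[a:b]) + 0.5 * np.mean(zen[a:b])
            for a, b in word_spans]

# Toy contour: the second word carries raised pitch and energy
f0 = np.array([100.0] * 10 + [150.0] * 10 + [100.0] * 10)
energy = np.array([1.0] * 10 + [2.0] * 10 + [1.0] * 10)
spans = [(0, 10), (10, 20), (20, 30)]
scores = prominence_scores(f0, energy, spans)
print(int(np.argmax(scores)))  # 1, the prominent word
```

Scores of this kind, pooled over an utterance, are the sort of prosodic cues that feed the higher-level comprehensibility and confidence predictions.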

Multi-channel Speech Enhancement

While a number of information extraction tasks, including speech and speaker recognition, are reliably implemented today, their performance degrades steeply on speech recorded with a distant microphone. A microphone array can provide the needed improvement in signal-to-noise ratio through a combination of multi- and single-channel speech enhancement techniques. Our work focuses on the task of meeting transcription, including speaker diarization, where a microphone array is used to record a meeting involving several speakers in a possibly reverberant and noisy environment. Better modeling of the room acoustic characteristics and the simultaneous exploitation of spatial and speaker-dependent cues are being applied to generate better-quality single-channel speech for automatic speech and speaker recognition.
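The simplest multi-channel enhancement technique, and a useful mental model for why arrays help, is the textbook delay-and-sum beamformer: align each channel toward the desired speaker and average, so the target adds coherently while independent noise partially cancels. A minimal time-domain sketch, assuming the steering delays are known (the lab's actual systems are considerably more sophisticated):

```python
import numpy as np

def delay_and_sum(mics, fs, delays):
    """Time-domain delay-and-sum beamformer (textbook sketch).

    mics: (n_channels, n_samples) array; delays: per-channel steering
    delays in seconds toward the desired source. Integer-sample shifts
    only; np.roll wraps at the edges, acceptable for a toy example."""
    n_ch, n = mics.shape
    out = np.zeros(n)
    for ch in range(n_ch):
        shift = int(round(delays[ch] * fs))
        out += np.roll(mics[ch], -shift)  # undo the propagation delay
    return out / n_ch

# Toy scene: a 440 Hz tone reaching three mics with known delays,
# plus independent noise on each channel
rng = np.random.default_rng(0)
fs, n = 16000, 1600
s = np.sin(2 * np.pi * 440 * np.arange(n) / fs)
delays = [0.0, 3 / fs, 5 / fs]
mics = np.stack([np.roll(s, int(d * fs)) + 0.3 * rng.standard_normal(n)
                 for d in delays])
out = delay_and_sum(mics, fs, delays)
```

With M channels the residual noise power drops by roughly a factor of M; real systems estimate the delays (or full spatial filters) from the data and contend with reverberation, which simple averaging does not remove.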