Without a doubt, the most important recent advance in machine learning (and AI) has been the success of deep neural networks. They are key to modern speech recognition and many other areas. Perhaps even more interesting is the move to systems without hand-designed features. For decades, pattern recognition, machine learning, and other forms of artificial intelligence were built on carefully engineered features. DNNs have decisively outperformed systems based on careful feature engineering, and the most recent work dispenses with even simple features: the best results are often obtained with no features at all, simply by feeding the waveform or the pixels into a deep-enough neural network.

We take up the problem of pitch estimation in noisy and polyphonic environments. For frequency estimation in noisy speech or music signals, time-domain methods based on signal processing techniques such as autocorrelation or the average magnitude difference function often do not perform well. As deep neural networks (DNNs) have become feasible, some researchers have attempted, with some success, to improve on signal-processing-based methods by learning on autocorrelation, Fourier transform, or constant-Q filter bank representations. In our approach, blocks of signal samples are input directly to a neural network for end-to-end learning. These networks appear to learn a nonlinearly spaced frequency representation in the first layer, followed by comb-like filters, strongly resembling some existing state-of-the-art signal processing algorithms.
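To make the classical time-domain baseline concrete, here is a minimal sketch of autocorrelation-based pitch estimation on a single frame. This is an illustrative assumption on our part, not the speaker's method; the function name and parameters (`fmin`, `fmax`) are hypothetical, and real systems add windowing, voicing decisions, and peak interpolation.

```python
import numpy as np

def autocorr_pitch(frame, sr, fmin=50.0, fmax=500.0):
    """Estimate the pitch (Hz) of one signal frame via autocorrelation.

    Illustrative sketch of the classical time-domain method the abstract
    mentions as a baseline; fmin/fmax bound the lag search to a plausible
    pitch range.
    """
    frame = frame - frame.mean()              # remove DC offset
    # Full autocorrelation, keeping non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sr / fmax)                       # shortest allowed period
    hi = int(sr / fmin)                       # longest allowed period
    lag = lo + np.argmax(ac[lo:hi])           # lag of the strongest peak
    return sr / lag                           # period -> frequency

# Example: a clean 200 Hz sine sampled at 16 kHz.
sr = 16000
t = np.arange(1024) / sr
estimate = autocorr_pitch(np.sin(2 * np.pi * 200 * t), sr)  # about 200 Hz
```

On clean periodic signals this works well; in noise or polyphony the autocorrelation peaks blur or compete, which is exactly the failure mode motivating the learned end-to-end approach.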
Prateek Verma is a research assistant at the Stanford Artificial Intelligence Laboratory, advised by Dan Jurafsky. His research interests include signal processing and machine learning. He graduated from the Department of Electrical Engineering at IIT Bombay in 2014.