In the first part of my talk, I will present a supervised end-to-end method for transforming audio from one style to another. For speech, by conditioning on speaker identities, we can train a single model to transform words spoken by multiple people into multiple target voices. For music, we can specify musical instruments and achieve the same results. The work draws inspiration from chatbot personalization and from recent advances in speech recognition and text-to-speech engines. In the second part of my talk, I will present an unsupervised learning approach to the same problem, one that requires no knowledge of the desired transformation or of the type of inputs present in the audio. As a demonstration, we investigate two different tasks: bandwidth expansion/compression, and timbre transfer from singing voice to musical instruments. A single architecture can generate these different types of audio style transfer using the same set of parameters, transformations that would otherwise require distinct, complex, hand-tuned signal-processing pipelines. Finally, to conclude, I will motivate the plethora of applications this framework makes possible through simple tweaks to the loss functions. This work was done in collaboration with Julius Smith, Michelle Guo, Albert Haque, and Alexander Alahi at Stanford.
Prateek Verma is a Stanford CCRMA graduate interested in the intersection of signal processing, machine learning, audio processing, and optimization. Before coming to Stanford, he graduated from IIT Bombay with a degree in Electrical Engineering in 2014. He has held research positions in the Stanford Artificial Intelligence Lab in the Computer Science Department under Dan Jurafsky, working in speech recognition, audio analysis, and synthesis. He is continuing his research at Stanford, currently working on optimization techniques for designing better learning algorithms, along with various applications.