Given a large database of sequential data, a natural problem is to find the entry in the database that is most similar to a query sequence. Warping-based similarity measures such as the dynamic time warping (DTW) distance can be prohibitively expensive when the sequences are long and/or high-dimensional. We therefore present methods for learning efficient representations of sequences using convolutional networks. Our first approach learns a mapping from sequences of feature vectors to downsampled sequences of binary vectors, providing quadratic speed gains and substantially faster distance calculations. For further speedup, we present an approximate pruning method that embeds sequences as fixed-length vectors in a Euclidean space using a form of attention which integrates over time. These techniques allow orders-of-magnitude speedup with a small decrease in accuracy. We discuss the application of these approaches to the task of matching a collection of about 150,000 unique MIDI files to the Million Song Dataset.
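As a rough illustration of why downsampling gives quadratic gains, the following minimal sketch implements a textbook DTW and counts the cells it fills. It is not the method presented in the talk: the learned convolutional mapping is omitted, the binary vectors are supplied by hand, and the names `dtw_distance`, `hamming`, and `dtw_cells`, as well as the downsampling factor of 4, are assumptions made for this example.

```python
import numpy as np


def dtw_distance(a, b, local_dist):
    """Textbook DTW cost between sequences a and b.

    Fills an (n x m) table, so the work grows with the product
    of the two sequence lengths.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = local_dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def hamming(u, v):
    """Local distance between two binary vectors: number of differing bits."""
    return int(np.count_nonzero(u != v))


def dtw_cells(n, m, downsample=1):
    """Number of table cells DTW fills after downsampling both sequences."""
    return (n // downsample) * (m // downsample)


# Hand-written binary sequences (stand-ins for learned representations).
query = np.array([[0, 1], [1, 0]])
entry = np.array([[0, 1], [0, 1], [1, 0]])  # same content, first frame repeated
distance = dtw_distance(query, entry, hamming)

# Downsampling both sequences by a factor of 4 (an assumed value)
# shrinks the table by a factor of 16.
full_cells = dtw_cells(1000, 1000)
reduced_cells = dtw_cells(1000, 1000, downsample=4)
```

Here `distance` is zero because warping absorbs the repeated frame, and `reduced_cells` is one sixteenth of `full_cells`. In practice binary vectors can also be packed into machine words so that each local distance becomes a bitwise XOR followed by a population count, which is one way the distance calculations themselves can be made cheaper.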
Colin Raffel is currently a PhD student in Electrical Engineering at Columbia University, where he works in LabROSA and is supervised by Dan Ellis. In 2010, he received a Master's in Music, Science and Technology from Stanford University's CCRMA, supervised by Julius O. Smith III. His main research focus is machine learning methods (especially convolutional and recurrent networks) and their application to sequential data (especially audio signals).