One of the fundamental goals of computer vision is to understand a scene. Towards this goal, we want the system to answer several questions pertaining to the visual scene: who, what, when, why, how much, and so on. Re-identifying persons over a network of cameras addresses questions about the identity of the persons, i.e., it deals with the question 'who'. Similarly, asking why a complex model arrives at a particular decision makes the model explainable and thus more trustworthy by making it more compatible with human reasoning.

Person re-identification is the task of identifying and monitoring people moving across a number of non-overlapping cameras. Several factors, such as significant changes in viewing angle, lighting, background clutter, and occlusion, cause features to vary considerably from camera to camera. The first part of the talk addresses the following research questions about person re-identification: Can we model the way features get transformed between cameras? Can we also learn the ways features do 'not' get transformed, and use that to tell whether an image pair (from separate cameras) comes from the same person or not? The similarity between feature histograms and time series data motivated us to apply the principle of Dynamic Time Warping to study the transformation of features by warping the feature space. The warped space allowed us not only to model feasible transformations between pairs of instances of the same target, but also to separate them from the infeasible transformations between instances of different targets.

Existing person re-identification methods are camera-pairwise: the focus is on finding similarities of persons between pairs of cameras. While this works well for a two-camera network, it introduces inconsistent re-identification results when a network of three or more cameras is considered. The next part of the talk addresses two important research questions: Can the results be made consistent, and will re-identification performance improve by enforcing consistency? We addressed the problem by posing re-identification as an optimization that minimizes the global cost of associating pairs of targets over the entire camera network, constrained by a set of consistency criteria.

Supervised deep learning methods have already enjoyed enormous success in computer vision and language research and have the potential to revolutionize robotics. Yet it remains largely unclear how such a system comes to a decision, how certain the model is about its decision, and if and when it can be trusted or has to be corrected. This opacity makes matters worse when such models fail. The next part of the talk addresses the 'why' question for a video description system. We explored a top-down approach to capture the spatio-temporally salient regions corresponding to the descriptions generated by an off-the-shelf video-to-text system. The work is motivated by the need to explain the word generation mechanism. For example, for a generated video description 'A woman is cutting a piece of meat', the quest is to see whether the word "woman" is generated because the model recognized a woman, or merely because "A woman" is a likely way to start a sentence. The saliency is estimated by measuring the drop in word probabilities when only one small part of the input video is fed into the network.
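To give a flavor of the techniques mentioned above, a few illustrative sketches follow. For the dynamic time warping idea in the first part, the snippet below computes a warping cost between two 1-D feature histograms; the function and the toy histograms are illustrative only and do not reproduce the actual re-identification pipeline.

    import numpy as np

    def dtw_align(h1, h2):
        # Dynamic-time-warping alignment cost between two 1-D feature histograms.
        # A low cost suggests one histogram can be warped into the other
        # (a 'feasible' transformation); a high cost suggests it cannot.
        n, m = len(h1), len(h2)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(h1[i - 1] - h2[j - 1])
                cost[i, j] = d + min(cost[i - 1, j],      # stretch h2
                                     cost[i, j - 1],      # stretch h1
                                     cost[i - 1, j - 1])  # direct match
        return cost[n, m]

    # Toy usage: histograms of the same person seen by two cameras should warp
    # onto each other more cheaply than histograms of two different people.
    same_cost = dtw_align([0.1, 0.4, 0.3, 0.2], [0.15, 0.35, 0.30, 0.20])
    diff_cost = dtw_align([0.1, 0.4, 0.3, 0.2], [0.50, 0.10, 0.05, 0.35])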
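For the consistency requirement in the second part, one common way to express it (an illustrative formulation, not necessarily the exact one used in the work) is with binary association variables: let x_{i,j}^{p,q} = 1 if person i in camera p is matched to person j in camera q. A consistent labelling over any triplet of cameras p, q, r must then satisfy

    x_{i,k}^{p,r} >= x_{i,j}^{p,q} + x_{j,k}^{q,r} - 1,

i.e., if i matches j and j matches k, then i must also match k. Re-identification then becomes the problem of maximizing the total pairwise similarity subject to one-to-one association constraints and these loop-consistency constraints.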
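Finally, the saliency computation in the last part can be pictured as follows: feed the captioning model only one small spatio-temporal region of the video at a time and record how much the probability of each generated word drops relative to the full video. The model interface (describe_probabilities) and the masking helper (keep_only_region) in the sketch below are hypothetical and only meant to convey the computation.

    import numpy as np

    def word_saliency(model, video, caption_words, regions):
        # 'model.describe_probabilities' and 'keep_only_region' are hypothetical
        # stand-ins for the captioning model and a spatio-temporal masking step.
        base = np.array(model.describe_probabilities(video, caption_words))
        drops = np.zeros((len(regions), len(caption_words)))
        for r, region in enumerate(regions):
            clip = keep_only_region(video, region)   # keep one region, mask the rest
            probs = np.array(model.describe_probabilities(clip, caption_words))
            # A small drop means the region alone preserves the word's probability,
            # i.e. the region is salient for that word.
            drops[r] = base - probs
        return drops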
The talk will conclude with some insight into possible future directions that leverage the strengths of explainable spatio-temporal saliency to obtain rational feedback from human agents and to use that feedback iteratively to build better and more trustworthy models.
Abir received his B.E. degree in Electrical Engineering from Jadavpur University, India, in 2007. He received his M.S. and Ph.D. degrees in the same subject from the University of California, Riverside, USA, in 2013 and 2015, respectively. He is currently a postdoctoral researcher in the Computer Science Department at Boston University, USA. His main research interests include multi-camera person re-identification, video summarization, end-to-end video description and activity detection, as well as explainable AI using machine learning based methods.