You probably remember a scene from a movie: a dark room full of big screens tracking cars, people, and objects. The protagonist walks in, looks closely at the footage, notices something, and yells, "Wait, I see something." This method of drawing a box around an object and following its movements across frames is called visual tracking, and it is a very active area of research in computer vision.
Visual tracking is a crucial part of many applications, such as autonomous driving, surveillance, and robotics. The goal is to locate, in future frames, an object that appeared in a certain frame of the video, usually the first. Occlusions, lighting changes, and other issues make it difficult to find the same object across different images. Moreover, visual tracking is usually performed on edge devices with limited computing power, such as consumer computers or mobile devices. Visual tracking is therefore a difficult task; nevertheless, a robust visual tracking system is a prerequisite for many applications.
One approach to the visual tracking problem is to use deep learning to train a model to recognize the object of interest in video frames. The model can then predict the location of the object in subsequent frames, and the tracking algorithm can use this prediction to update the object's position. Many different deep learning architectures can be used for visual object tracking, but recent advances in Siamese networks have enabled significant progress.
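The core idea behind Siamese trackers such as SiamFC is simple: embed the template crop (the object from the first frame) and the search region with the same backbone, then slide the template feature map over the search feature map and take an inner product at every offset; the peak of this cross-correlation response indicates the object's new position. The sketch below, written with hypothetical helper names and a naive loop instead of an optimized convolution, illustrates the idea, not any specific tracker's implementation.

```python
import numpy as np

def xcorr_response(template_feat, search_feat):
    """SiamFC-style cross-correlation: slide the template feature map
    (C, h, w) over the larger search feature map (C, H, W) and compute
    an inner product at every spatial offset.  The argmax of the
    returned (H-h+1, W-w+1) response map is the predicted location."""
    c, h, w = template_feat.shape
    _, H, W = search_feat.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = search_feat[:, i:i + h, j:j + w]
            out[i, j] = float(np.sum(patch * template_feat))
    return out

# Toy example: plant the "object" at offset (2, 1) of a search map.
template = np.ones((1, 2, 2))
search = np.zeros((1, 5, 5))
search[:, 2:4, 1:3] = 1.0
response = xcorr_response(template, search)
peak = np.unravel_index(response.argmax(), response.shape)
```

In real trackers the two feature maps come from a shared convolutional backbone and the correlation is implemented as a grouped convolution on the GPU, but the matching principle is the same.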
Siamese network-based trackers can be trained offline in an end-to-end fashion, so that a single network both detects and tracks the object. This is a huge advantage over other approaches, especially in terms of complexity.
State-of-the-art visual tracking networks achieve impressive performance, but they ignore the computational cost of running these methods. Deploying them on edge devices, where computing power is limited, is therefore a difficult problem. Simply swapping in a mobile-friendly backbone does not significantly reduce the inference time of a Siamese tracking architecture, because the decoder and bounding-box prediction modules account for the majority of the memory- and time-intensive operations. Designing a mobile-friendly visual tracking method thus remains an open challenge.
Additionally, to make a tracking algorithm robust to variations in an object's appearance, such as changes in pose or lighting, it is important to incorporate temporal information. This can be done by adding specialized branches to the model or by using online learning modules. However, both of these approaches incur additional floating point operations, which can negatively impact the tracker's runtime performance.
The FEAR tracker is introduced to solve these two problems. FEAR uses a single-parameter dual-template module that allows the tracker to learn changes in the object's appearance online without increasing the complexity of the model. This helps alleviate the memory constraints that have been a problem for some online learning modules. The module predicts how close the target object is to the center of the frame, and frames where the object is well centered become candidates for updating the template image.
Additionally, FEAR uses a learnable interpolation to mix the feature map of the selected dynamic template image with the feature map of the original static template image online. This allows the model to adapt to changes in the object's appearance during inference. FEAR uses an optimized neural network architecture that can be over ten times faster than many current Siamese trackers. The resulting lightweight FEAR model runs at 205 FPS on an iPhone 11, which is much faster than existing models.
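The dual-template idea can be sketched as a convex combination of the static (first-frame) template features and the dynamic (recently selected) template features, weighted by a single learnable scalar. The function name and the sigmoid parameterization below are assumptions for illustration; the exact parameterization in the FEAR paper may differ.

```python
import numpy as np

def fuse_templates(static_feat, dynamic_feat, w_raw):
    """Mix the static first-frame template feature map with the dynamic
    template feature map using one learnable scalar parameter w_raw.
    A sigmoid squashes w_raw into (0, 1) so the mix stays a convex
    combination of the two templates (hypothetical parameterization)."""
    w = 1.0 / (1.0 + np.exp(-w_raw))
    return (1.0 - w) * static_feat + w * dynamic_feat

# With w_raw = 0 the sigmoid gives 0.5, i.e. an equal blend of the two
# templates; training can push w_raw toward either template as needed.
static = np.zeros((2, 2))
dynamic = np.full((2, 2), 2.0)
fused = fuse_templates(static, dynamic, 0.0)
```

Because the update adds only one parameter and one elementwise blend, it contributes almost nothing to the model's FLOP count, which is what makes this form of appearance adaptation affordable on mobile hardware.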
Check out the Paper and GitHub. All credit for this research goes to the researchers on this project. Also, don't forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Ekrem Çetinkaya obtained his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis on image denoising using deep convolutional networks. He is currently pursuing a Ph.D. at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.