Google DeepMind's New AI Model - TAPIR: Seeing Through the Lens of AI
TAPIR: A New AI Model for Video Analysis
In this video, the speaker discusses TAPIR, a new AI model developed by researchers at Google DeepMind. The model is designed to track any point on any physical surface throughout a video sequence with high accuracy.
How Computer Vision Systems Function
- Computer vision systems use deep learning techniques to learn from large amounts of data and extract features that are relevant for the task at hand.
- To recognize a person's face in a photo, you need a model that can learn to identify the key characteristics of a face such as the shape of the eyes, nose, mouth, etc.
- If you want to track a specific point on a person's face or any other object in a video sequence, you have to deal with challenges like occlusion, motion blur, illumination changes, scale variations and so on.
Introducing TAPIR
- TAPIR stands for Tracking Any Point with Per-frame Initialization and Temporal Refinement.
- It is a new AI model that can track any point on any physical surface throughout a video sequence.
- It was developed by researchers from Google DeepMind and the Visual Geometry Group (VGG) in the Department of Engineering Science at the University of Oxford.
- The model uses a two-stage algorithm consisting of a matching stage and a refinement stage.
Matching Stage
- In this stage, the model analyzes each video frame separately and looks for suitable candidate matches for the query point (the point that needs to be tracked).
- It uses cosine similarity to compare the feature vector at the query point with the feature vectors of all possible points in each frame.
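The matching idea above can be sketched in NumPy. This is only an illustration, not the researchers' implementation: the function name `match_query_per_frame`, the array shapes, and the choice to keep just the single best match per frame are all assumptions made for the example, and the features are taken as given rather than produced by a real network.

```python
import numpy as np

def match_query_per_frame(query_feat, frame_feats, eps=1e-8):
    """Find the best cosine-similarity match for one query in every frame.

    query_feat:  (C,) feature vector sampled at the query point (assumed given).
    frame_feats: (T, H, W, C) per-frame feature maps (assumed given).
    Returns (T, 2) integer (row, col) positions and (T,) similarity scores.
    """
    # Normalize so that a dot product equals cosine similarity.
    q = query_feat / (np.linalg.norm(query_feat) + eps)
    f = frame_feats / (np.linalg.norm(frame_feats, axis=-1, keepdims=True) + eps)

    sim = f @ q                      # (T, H, W) similarity map per frame
    num_frames, height, width = sim.shape
    flat = sim.reshape(num_frames, -1)

    # Each frame is handled independently: take its highest-scoring location.
    best = flat.argmax(axis=1)
    rows, cols = np.unravel_index(best, (height, width))
    return np.stack([rows, cols], axis=1), flat.max(axis=1)
```

Because every frame is scored on its own, a point that disappears behind an occluder can be picked up again as soon as it reappears, at the cost of noisy per-frame estimates that a later refinement step has to clean up.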
Refinement Stage
- In this stage, TAPIR updates both the trajectory (the path followed by the query point) and the query features (the feature vectors representing its appearance) based on local correlations.
- It uses another deep neural network that takes an image patch around the candidate match in each frame as input and outputs a displacement vector indicating how far the candidate should be shifted to match the query point more precisely.
- It updates the query features by averaging the feature vectors of the refined point matches over time.
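A rough sketch of the refinement loop described above, under stated assumptions. The learned network is not reproduced here: `predict_offset` is a hypothetical callable standing in for it, and `sample_features`, `refine_track`, the nearest-pixel sampling, and the fixed iteration count are all illustrative choices rather than details from the video.

```python
import numpy as np

def sample_features(frame_feats, positions):
    """Read the feature vector nearest to each per-frame position.

    frame_feats: (T, H, W, C); positions: (T, 2) as (row, col) floats.
    """
    num_frames, height, width, _ = frame_feats.shape
    rows = np.clip(np.rint(positions[:, 0]), 0, height - 1).astype(int)
    cols = np.clip(np.rint(positions[:, 1]), 0, width - 1).astype(int)
    return frame_feats[np.arange(num_frames), rows, cols]   # (T, C)

def refine_track(positions, frame_feats, query_feat, predict_offset, num_iters=4):
    """Alternate between shifting the track and re-averaging the query features.

    predict_offset is a HYPOTHETICAL stand-in for the learned network; it must
    map (positions, frame_feats, query_feat) to a (T, 2) array of offsets.
    """
    positions = np.asarray(positions, dtype=float).copy()
    for _ in range(num_iters):
        # Shift every per-frame estimate by its predicted displacement.
        positions = positions + predict_offset(positions, frame_feats, query_feat)
        # Update the query appearance as the mean feature along the refined track.
        query_feat = sample_features(frame_feats, positions).mean(axis=0)
    return positions, query_feat
```

For a quick experiment, any function with the same signature can play the role of `predict_offset`, for example one that moves each estimate a fixed fraction of the way toward a known target.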
TAPIR Performance
- The researchers evaluated TAPIR on TAP-Vid, a benchmark for evaluating point tracking in videos.
- According to the speaker, TAPIR outperformed other state-of-the-art models on this benchmark and can handle videos of various sizes and quality while tracking multiple points simultaneously.
TAPIR Outperforms Baseline Methods on TAP-Vid Benchmark
In this section, the speaker discusses how TAPIR outperformed all baseline methods by a significant margin on the TAP-Vid benchmark. The speaker explains what the Average Jaccard (AJ) metric is and how it was used to evaluate the performance of different methods.
TAPIR's Performance on TAP-Vid Benchmark
- The speaker describes Average Jaccard (AJ) as the average intersection over union between predicted point locations and ground-truth point locations.
- As reported in the video, TAPIR outperformed all baseline methods by a significant margin on the TAP-Vid benchmark with an AJ score of 0.64. That is 0.20 higher in absolute terms (roughly 45% in relative terms) than the second-best method, D2Net, which scored 0.44.
- TAPIR also performed well on another benchmark called DAVIS, achieving an AJ score of 0.59, again 0.20 higher in absolute terms than D2Net, which scored 0.39.
- The researchers used TAPIR to track ten points per video on DAVIS and computed the AJ score in the same way.
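To make the metric more concrete, here is a minimal sketch of a Jaccard-style score for point tracks. It is written under assumptions that the video does not spell out: a prediction counts as correct only when the point is visible in both prediction and ground truth and lies within a pixel threshold, the score is TP / (TP + FP + FN), and the result is averaged over the thresholds 1, 2, 4, 8 and 16 pixels. The function name and argument layout are made up for this example.

```python
import numpy as np

def average_jaccard(pred_xy, pred_visible, gt_xy, gt_visible,
                    thresholds=(1, 2, 4, 8, 16)):
    """Jaccard score averaged over several pixel distance thresholds.

    pred_xy, gt_xy:           (N, 2) point coordinates in pixels.
    pred_visible, gt_visible: (N,) booleans, True where the point is visible.
    The threshold values are an ASSUMPTION, not taken from the video.
    """
    pred_visible = np.asarray(pred_visible, dtype=bool)
    gt_visible = np.asarray(gt_visible, dtype=bool)
    dist = np.linalg.norm(np.asarray(pred_xy) - np.asarray(gt_xy), axis=-1)

    scores = []
    for thr in thresholds:
        within = dist < thr
        # Correct: visible in both and close enough.
        tp = np.sum(gt_visible & pred_visible & within)
        # Predicted visible, but truly occluded or too far away.
        fp = np.sum(pred_visible & ~(gt_visible & within))
        # Truly visible, but predicted occluded or too far away.
        fn = np.sum(gt_visible & ~(pred_visible & within))
        denom = tp + fp + fn
        scores.append(tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))
```

Under this definition a tracker is penalized both for placing a point too far from the truth and for getting its visibility wrong, so a single number summarizes position accuracy and occlusion handling together.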
Two Online Demos for Running TAPIR
In this section, the speaker talks about two online demos that the researchers provide for running TAPIR on your own videos.
Two Online Demos
- The first demo, which the speaker calls the TAP-Vid demo, lets you upload your own video or choose one from YouTube, then select any point on any object in the first frame to track throughout the video.
- It then runs TAPIR on your video and shows you the results in real time.
- The second demo, the webcam demo, lets you use your own webcam as the input source and select any point on your face, or on any other object in front of you, to track live as you move around.
- It then runs TAPIR on your webcam feed and shows you the results in real time.
Conclusion
In this section, the speaker concludes by summarizing TAPIR's strengths and the computer vision applications it could enable.
Final Thoughts
- According to the speaker, TAPIR can track any point on any object with high accuracy, even in the presence of occlusions, motion blur, illumination changes, and scale variations.
- The speaker considers TAPIR a major breakthrough for computer vision and looks forward to seeing what applications the model enables in the future.