IEEE Workshop/Winter Conference on Applications of Computer Vision / 2024

Transferable-Guided Attention Is All You Need for Video Domain Adaptation

André Sacilotti, S. Santos, N. Sebe, Jurandy Almeida

Categories: Computer Vision, Large Language Models, Multimodal Learning, Popular and Landmark Papers

Unsupervised domain adaptation (UDA) in videos is a challenging task that remains underexplored compared with image-based UDA. Although vision transformers (ViT) achieve state-of-the-art performance in many computer vision tasks, their use in video UDA has received little attention. Our key idea is to use transformer layers as a feature encoder and incorporate spatial and temporal transferability relationships into the attention mechanism. A Transferable-guided Attention (TransferAttn) framework is then developed to exploit the capacity of the transformer to adapt cross-domain knowledge across different backbones. To improve the transferability of ViT, we introduce a novel and effective module, named Domain Transferable-guided Attention Block (DTAB). DTAB compels ViT to focus on the spatio-temporal transferability relationship among video frames by replacing the self-attention mechanism with a transferability attention mechanism. Extensive experiments were conducted on the UCF-HMDB, Kinetics-Gameplay, and Kinetics-NEC Drone datasets, with different backbones, such as ResNet101, I3D, and STAM, to verify the effectiveness of TransferAttn compared with state-of-the-art approaches. We also demonstrate that DTAB yields performance gains when applied to other state-of-the-art transformer-based UDA methods from both the video and image domains. Our code is available at https://github.com/Andre-Sacilotti/transferattn-project-code.
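The abstract does not say how DTAB computes or applies transferability, so the following is only an illustrative sketch of the general idea of steering attention toward transferable frames, not the authors' formulation. It assumes a hypothetical per-frame domain discriminator output (`domain_probs`), uses the binary entropy of that output as a transferability score (a frame the discriminator cannot assign to a domain is treated as more transferable), and reweights ordinary scaled dot-product attention by that score. The function names, the entropy-based score, and the toy inputs are all assumptions made for illustration.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) domain prediction.

    1.0 means the (hypothetical) discriminator cannot tell the domains apart.
    """
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def transfer_guided_attention(frames, domain_probs):
    """Attention over frames, reweighted by an assumed transferability score.

    frames: list of T feature vectors, used directly as queries, keys, and
        values (learned projections are omitted to keep the sketch short).
    domain_probs: assumed per-frame discriminator outputs P(frame is source).
    Returns (outputs, weights), where each row of weights sums to 1.
    """
    dim = len(frames[0])
    transfer = [binary_entropy(p) for p in domain_probs]
    outputs, weights = [], []
    for query in frames:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        attn = softmax(scores)
        # Down-weight keys whose domain is easy to identify, then renormalize.
        guided = [a * t for a, t in zip(attn, transfer)]
        norm = sum(guided)
        # Fall back to plain attention if every frame has zero transferability.
        guided = [g / norm for g in guided] if norm > 0 else attn
        weights.append(guided)
        outputs.append(
            [sum(w * v[j] for w, v in zip(guided, frames)) for j in range(dim)]
        )
    return outputs, weights


# Toy input: three frames; the second is confidently classified as source
# (p = 0.99), so it is treated as domain-specific and receives less attention.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
domain_probs = [0.5, 0.99, 0.6]
outputs, weights = transfer_guided_attention(frames, domain_probs)
```

In this toy run, every query places less weight on the second frame than on the first, because the second frame's domain is easy to identify. A real implementation would use learned query, key, and value projections and a trained discriminator; the released code at the repository linked in the abstract is the authoritative reference for the actual mechanism.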

4 citations, 0 influential

Paper URL: https://arxiv.org/abs/2407.01375v2
Open PDF: https://arxiv.org/pdf/2407.01375v2

Learning resources:
- PDF: arXiv PDF (https://arxiv.org/pdf/2407.01375v2)
- arXiv: arXiv abstract page (https://arxiv.org/abs/2407.01375v2)
- Google Scholar: Google Scholar references (https://scholar.google.com/scholar?q=Transferable-Guided%20Attention%20Is%20All%20You%20Need%20for%20Video%20Domain%20Adaptation)
- Papers with Code: Papers with Code search (https://paperswithcode.com/search?q=Transferable-Guided%20Attention%20Is%20All%20You%20Need%20for%20Video%20Domain%20Adaptation)
- Semantic Scholar: Semantic Scholar paper page (https://www.semanticscholar.org/paper/7ceeef9f027a20582248cc34e389eba77169a2a4)
- YouTube: YouTube explanations (https://www.youtube.com/results?search_query=Transferable-Guided%20Attention%20Is%20All%20You%20Need%20for%20Video%20Domain%20Adaptation+paper+explained)