Posts by Tags

CLIP

ImageBind: One Embedding Space To Bind Them All

12 minute read

Published:

The proposed idea is simple: map six different modalities to a joint embedding space. This allows data samples from different modalities that share the same semantic meaning to be mapped to similar vectors (i.e., vectors that are close under the cosine similarity metric) within the embedding space. Embedding modalities using explicitly aligned training data has been proposed before. For instance, the seminal works of CLIP[1] and ALIGN[2] map images and text to a joint embedding space, AudioCLIP[3] extends CLIP by adding an audio modality, and “Everything at Once”[4] embeds audio, images, and text into a joint embedding space to leverage the audio modality in video retrieval. However, embedding six different modalities - images, text, audio, depth, thermal, and Inertial Measurement Unit (IMU) data - particularly ones without explicitly aligned training data, or even datasets where they naturally coexist (e.g., you probably wouldn’t find a dataset of depth images paired with sounds), is a challenge that hasn’t been tackled before. In my opinion, this opens the door to many new applications.
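
To make the joint-space idea concrete, here is a minimal, self-contained sketch (my own toy code, not ImageBind’s) of how two modality-specific encoders can project into a shared space and be aligned with an InfoNCE-style contrastive loss over cosine similarities; all module names, dimensions, and the audio pairing are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

d = 512  # assumed shared embedding dimension (illustrative)

class DummyEncoder(torch.nn.Module):
    """Stand-in for a modality-specific encoder (e.g., a ViT for images, a transformer for audio)."""
    def __init__(self, in_dim, out_dim=d):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, out_dim)

    def forward(self, x):
        z = self.proj(x)
        return F.normalize(z, dim=-1)  # unit norm, so dot products are cosine similarities

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric contrastive loss pulling matching pairs together in the joint space."""
    logits = z_a @ z_b.t() / temperature      # pairwise cosine similarities as logits
    targets = torch.arange(z_a.size(0))       # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

image_encoder = DummyEncoder(in_dim=768)      # placeholder feature dimensions
audio_encoder = DummyEncoder(in_dim=128)

images = torch.randn(8, 768)                  # toy batch of 8 paired (image, audio) samples
audio = torch.randn(8, 128)
loss = info_nce(image_encoder(images), audio_encoder(audio))
```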

Text-Only Training for Image Captioning using Noise-Injected CLIP

5 minute read

Published:

Cool paper, simple and elegant. Training image captioning models commonly requires supervision in the form of image-text pairs. Given a text-only dataset (so no images, and certainly no pairs), can we leverage CLIP’s [1] strong image and text embedding capabilities to train an image captioning model using text alone? Turns out we can.
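
For intuition, here is a rough sketch of the noise-injection trick as I read it: during training, a caption decoder is conditioned on CLIP text embeddings perturbed with Gaussian noise, so that at inference a nearby CLIP image embedding can be plugged in instead. The function, noise scale, and 512-dimensional embedding below are illustrative placeholders, not the authors’ code.

```python
import torch
import torch.nn.functional as F

def inject_noise(text_embed: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Perturb a normalized CLIP text embedding with isotropic Gaussian noise (sigma is assumed)."""
    noisy = text_embed + sigma * torch.randn_like(text_embed)
    return F.normalize(noisy, dim=-1)

# Toy demonstration with a random "text embedding" standing in for CLIP's output:
text_embed = F.normalize(torch.randn(1, 512), dim=-1)
conditioning = inject_noise(text_embed)   # what the caption decoder would see during training
# At inference, `conditioning` would instead come from the CLIP image encoder.
```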

Computer Vision

ImageBind: One Embedding Space To Bind Them All

12 minute read

Published:

The proposed idea is simple: map six different modalities to a joint embedding space. This allows data samples from different modalities that share the same semantic meaning to be mapped to similar vectors (i.e., vectors that are close under the cosine similarity metric) within the embedding space. Embedding modalities using explicitly aligned training data has been proposed before. For instance, the seminal works of CLIP[1] and ALIGN[2] map images and text to a joint embedding space, AudioCLIP[3] extends CLIP by adding an audio modality, and “Everything at Once”[4] embeds audio, images, and text into a joint embedding space to leverage the audio modality in video retrieval. However, embedding six different modalities - images, text, audio, depth, thermal, and Inertial Measurement Unit (IMU) data - particularly ones without explicitly aligned training data, or even datasets where they naturally coexist (e.g., you probably wouldn’t find a dataset of depth images paired with sounds), is a challenge that hasn’t been tackled before. In my opinion, this opens the door to many new applications.

DeiT III: Revenge of the ViT

8 minute read

Published:

As hinted by the title, the paper is a follow-up to DeiT (Training data-efficient image transformers & distillation through attention [1]) and is co-authored by several of DeiT’s authors. The paper was presented at ECCV 2022, and the code is available here under the original DeiT repository. The goal of the paper is to provide an improved training recipe for “vanilla” Vision Transformers (ViT) [2], in order to achieve a stronger baseline for the vanilla ViT without any architectural changes. I find this approach very interesting: there is a large body of work offering various architectural changes to the vanilla ViT (some motivated by ConvNets), e.g., PVT[3], Swin[4], CvT[5], Focal Transformer[6], CoAtNet[7], and here the authors steer away from any changes to the architecture and focus only on the training recipe. This work is also similar to a previous paper by several of the same authors, “ResNet strikes back: An improved training procedure in timm” [8], which offers an improved training recipe for ResNets to achieve a stronger baseline for simple “vanilla” ResNets. Fun fact: there is no DeiT 2!

Fast Vision Transformers with HiLo Attention

5 minute read

Published:

The paper proposes a novel, efficient ViT architecture designed with throughput in mind, aiming to mitigate ViT’s high computational cost, which stems from the quadratic memory and time complexity of the attention mechanism. First, the paper argues (and in my opinion rightfully so) that although many improved, more efficient ViT architectures have been proposed, in practice they do not offer high processing speed. This claim might seem contradictory, but previous works usually report metrics such as the number of FLOPs, memory usage, and asymptotic computational complexity (important in themselves); these metrics do not capture actual running time or throughput, and those works do not measure them directly. Moreover, an architecture with a low FLOP count, small memory footprint, or lower asymptotic complexity may actually run slowly on a GPU due to specific operations that are not hardware-friendly or cannot be parallelized. To this end, the paper directly benchmarks FLOPs and memory consumption as well as throughput (on GPU), and proposes a ViT architecture that performs favourably on those metrics while achieving high accuracy when used as a backbone for classification and various downstream vision tasks.
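
As a concrete illustration of the kind of measurement the paper argues for, below is a minimal throughput-benchmark sketch (my own toy code; the model, batch size, and iteration counts are placeholders), which times actual GPU forward passes with warm-up and synchronization rather than counting FLOPs.

```python
import time
import torch

@torch.no_grad()
def measure_throughput(model, batch_size=64, image_size=224, iters=50, warmup=10):
    """Return images/second for `model` on random inputs (illustrative benchmark, not the paper's script)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(batch_size, 3, image_size, image_size, device=device)

    for _ in range(warmup):            # warm-up: let cuDNN pick kernels, fill caches
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()       # ensure queued kernels finish before timing starts

    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return batch_size * iters / elapsed

# Example usage: measure_throughput(torchvision.models.vit_b_16())
```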

Text-Only Training for Image Captioning using Noise-Injected CLIP

5 minute read

Published:

Cool paper, simple and elegant. Training image captioning models commonly requires supervision in the form of image-text pairs. Given a text-only dataset (so no images, and certainly no pairs), can we leverage CLIP’s [1] strong image and text embedding capabilities to train an image captioning model using text alone? Turns out we can.

CAN - A simple, efficient and scalable contrastive masked autoencoder for learning visual representations

5 minute read

Published:

In my opinion, a very neat paper that combines several ideas from self-supervised learning (SSL): a contrastive loss (most notably SimCLR [1]), reconstruction of masked patches (most notably “Masked Auto-encoders Are Scalable Vision Learners” [2]), and a denoising autoencoder. Their pipeline is summarized in the figure below and works as follows: given an input image, generate two different views by applying augmentations, mask 50% of the patches, add Gaussian noise to the unmasked patches, and feed the resulting noisy masked image to a ViT encoder. We then take the encodings of the unmasked patches, perform mean pooling, pass the result through a light MLP, and apply a contrastive loss (hence the “contrastive” in the title). The encoded “patch image” is then passed to a ViT decoder to reconstruct the masked patches and denoise the unmasked noisy patches, which gives us both the reconstruction loss and the denoising loss.
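
To make the pipeline concrete, here is a schematic, runnable sketch of one training step under toy assumptions: linear layers stand in for the ViT encoder, decoder, and projection MLP, masked patches are simply replaced by a learned token (the actual method, like MAE, drops them from the encoder input), and the mask ratio, noise level, and loss weighting are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, N, D = 4, 16, 32                     # batch size, patches per image, patch dim (toy sizes)
mask_ratio, sigma = 0.5, 0.05

encoder = torch.nn.Linear(D, D)         # stand-in for the ViT encoder
decoder = torch.nn.Linear(D, D)         # stand-in for the ViT decoder
proj_mlp = torch.nn.Linear(D, D)        # light MLP feeding the contrastive loss
mask_token = torch.nn.Parameter(torch.zeros(D))

def process_view(patches):
    """Mask roughly 50% of the patches, add noise to the rest, encode, pool, decode."""
    mask = torch.rand(B, N) < mask_ratio                      # True = masked patch
    noise = sigma * torch.randn_like(patches)
    noisy = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, D), patches + noise)
    encoded = encoder(noisy)

    visible = ~mask
    pooled = proj_mlp((encoded * visible.unsqueeze(-1)).sum(1) / visible.sum(1, keepdim=True))

    decoded = decoder(encoded)
    recon_loss = F.mse_loss(decoded[mask], patches[mask])          # reconstruct masked patches
    denoise_loss = F.mse_loss(decoded[visible], patches[visible])  # denoise visible patches
    return pooled, recon_loss + denoise_loss

def info_nce(z1, z2, t=0.1):
    """SimCLR-style contrastive loss between the pooled representations of the two views."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / t
    return F.cross_entropy(logits, torch.arange(B))

view1, view2 = torch.randn(B, N, D), torch.randn(B, N, D)     # two augmented views (toy data)
p1, l1 = process_view(view1)
p2, l2 = process_view(view2)
total_loss = l1 + l2 + info_nce(p1, p2)
```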

Deep Learning

ImageBind: One Embedding Space To Bind Them All

12 minute read

Published:

The proposed idea is simple: map six different modalities to a joint embedding space. This allows data samples from different modalities that share the same semantic meaning to be mapped to similar vectors (i.e., vectors that are close under the cosine similarity metric) within the embedding space. Embedding modalities using explicitly aligned training data has been proposed before. For instance, the seminal works of CLIP[1] and ALIGN[2] map images and text to a joint embedding space, AudioCLIP[3] extends CLIP by adding an audio modality, and “Everything at Once”[4] embeds audio, images, and text into a joint embedding space to leverage the audio modality in video retrieval. However, embedding six different modalities - images, text, audio, depth, thermal, and Inertial Measurement Unit (IMU) data - particularly ones without explicitly aligned training data, or even datasets where they naturally coexist (e.g., you probably wouldn’t find a dataset of depth images paired with sounds), is a challenge that hasn’t been tackled before. In my opinion, this opens the door to many new applications.

DeiT III: Revenge of the ViT

8 minute read

Published:

As hinted by the title, the paper is a follow-up to DeiT (Training data-efficient image transformers & distillation through attention [1]) and is co-authored by several of DeiT’s authors. The paper was presented at ECCV 2022, and the code is available here under the original DeiT repository. The goal of the paper is to provide an improved training recipe for “vanilla” Vision Transformers (ViT) [2], in order to achieve a stronger baseline for the vanilla ViT without any architectural changes. I find this approach very interesting: there is a large body of work offering various architectural changes to the vanilla ViT (some motivated by ConvNets), e.g., PVT[3], Swin[4], CvT[5], Focal Transformer[6], CoAtNet[7], and here the authors steer away from any changes to the architecture and focus only on the training recipe. This work is also similar to a previous paper by several of the same authors, “ResNet strikes back: An improved training procedure in timm” [8], which offers an improved training recipe for ResNets to achieve a stronger baseline for simple “vanilla” ResNets. Fun fact: there is no DeiT 2!

Fast Vision Transformers with HiLo Attention

5 minute read

Published:

The paper proposes a novel, efficient ViT architecture designed with throughput in mind, aiming to mitigate ViT’s high computational cost, which stems from the quadratic memory and time complexity of the attention mechanism. First, the paper argues (and in my opinion rightfully so) that although many improved, more efficient ViT architectures have been proposed, in practice they do not offer high processing speed. This claim might seem contradictory, but previous works usually report metrics such as the number of FLOPs, memory usage, and asymptotic computational complexity (important in themselves); these metrics do not capture actual running time or throughput, and those works do not measure them directly. Moreover, an architecture with a low FLOP count, small memory footprint, or lower asymptotic complexity may actually run slowly on a GPU due to specific operations that are not hardware-friendly or cannot be parallelized. To this end, the paper directly benchmarks FLOPs and memory consumption as well as throughput (on GPU), and proposes a ViT architecture that performs favourably on those metrics while achieving high accuracy when used as a backbone for classification and various downstream vision tasks.

Text-Only Training for Image Captioning using Noise-Injected CLIP

5 minute read

Published:

Cool paper, simple and elegant. Training image captioning models commonly requires supervision in the form of image-text pairs. Given a text-only dataset (so no images, and certainly no pairs), can we leverage CLIP’s [1] strong image and text embedding capabilities to train an image captioning model using text alone? Turns out we can.

CAN - A simple, efficient and scalable contrastive masked autoencoder for learning visual representations

5 minute read

Published:

In my opinion, a very neat paper that combines several ideas from self-supervised learning (SSL): a contrastive loss (most notably SimCLR [1]), reconstruction of masked patches (most notably “Masked Auto-encoders Are Scalable Vision Learners” [2]), and a denoising autoencoder. Their pipeline is summarized in the figure below and works as follows: given an input image, generate two different views by applying augmentations, mask 50% of the patches, add Gaussian noise to the unmasked patches, and feed the resulting noisy masked image to a ViT encoder. We then take the encodings of the unmasked patches, perform mean pooling, pass the result through a light MLP, and apply a contrastive loss (hence the “contrastive” in the title). The encoded “patch image” is then passed to a ViT decoder to reconstruct the masked patches and denoise the unmasked noisy patches, which gives us both the reconstruction loss and the denoising loss.

Multi-modal

ImageBind: One Embedding Space To Bind Them All

12 minute read

Published:

The proposed idea is simple: map six different modalities to a joint embedding space. This allows data samples from different modalities that share the same semantic meaning to be mapped to similar vectors (i.e., vectors that are close under the cosine similarity metric) within the embedding space. Embedding modalities using explicitly aligned training data has been proposed before. For instance, the seminal works of CLIP[1] and ALIGN[2] map images and text to a joint embedding space, AudioCLIP[3] extends CLIP by adding an audio modality, and “Everything at Once”[4] embeds audio, images, and text into a joint embedding space to leverage the audio modality in video retrieval. However, embedding six different modalities - images, text, audio, depth, thermal, and Inertial Measurement Unit (IMU) data - particularly ones without explicitly aligned training data, or even datasets where they naturally coexist (e.g., you probably wouldn’t find a dataset of depth images paired with sounds), is a challenge that hasn’t been tackled before. In my opinion, this opens the door to many new applications.

NLP

ImageBind: One Embedding Space To Bind Them All

12 minute read

Published:

The proposed idea is simple: map six different modalities to a joint embedding space. This allows data samples from different modalities that share the same semantic meaning to be mapped to similar vectors (i.e., vectors that are close under the cosine similarity metric) within the embedding space. Embedding modalities using explicitly aligned training data has been proposed before. For instance, the seminal works of CLIP[1] and ALIGN[2] map images and text to a joint embedding space, AudioCLIP[3] extends CLIP by adding an audio modality, and “Everything at Once”[4] embeds audio, images, and text into a joint embedding space to leverage the audio modality in video retrieval. However, embedding six different modalities - images, text, audio, depth, thermal, and Inertial Measurement Unit (IMU) data - particularly ones without explicitly aligned training data, or even datasets where they naturally coexist (e.g., you probably wouldn’t find a dataset of depth images paired with sounds), is a challenge that hasn’t been tackled before. In my opinion, this opens the door to many new applications.

Text-Only Training for Image Captioning using Noise-Injected CLIP

5 minute read

Published:

Cool paper, simple and elegant. Training image captioning models commonly requires supervision in the form of image-text pairs. Given a text-only dataset (so no images, and certainly no pairs), can we leverage CLIP’s [1] strong image and text embedding capabilities to train an image captioning model using text alone? Turns out we can.

Papers

ImageBind: One Embedding Space To Bind Them All

12 minute read

Published:

The proposed idea is simple: map six different modalities to a joint embedding space. This allows data samples from different modalities that share the same semantic meaning to be mapped to similar vectors (i.e., vectors that are close under the cosine similarity metric) within the embedding space. Embedding modalities using explicitly aligned training data has been proposed before. For instance, the seminal works of CLIP[1] and ALIGN[2] map images and text to a joint embedding space, AudioCLIP[3] extends CLIP by adding an audio modality, and “Everything at Once”[4] embeds audio, images, and text into a joint embedding space to leverage the audio modality in video retrieval. However, embedding six different modalities - images, text, audio, depth, thermal, and Inertial Measurement Unit (IMU) data - particularly ones without explicitly aligned training data, or even datasets where they naturally coexist (e.g., you probably wouldn’t find a dataset of depth images paired with sounds), is a challenge that hasn’t been tackled before. In my opinion, this opens the door to many new applications.

DeiT III: Revenge of the ViT

8 minute read

Published:

As hinted by the title, the paper is a follow-up to DeiT (Training data-efficient image transformers & distillation through attention [1]) and is co-authored by several of DeiT’s authors. The paper was presented at ECCV 2022, and the code is available here under the original DeiT repository. The goal of the paper is to provide an improved training recipe for “vanilla” Vision Transformers (ViT) [2], in order to achieve a stronger baseline for the vanilla ViT without any architectural changes. I find this approach very interesting: there is a large body of work offering various architectural changes to the vanilla ViT (some motivated by ConvNets), e.g., PVT[3], Swin[4], CvT[5], Focal Transformer[6], CoAtNet[7], and here the authors steer away from any changes to the architecture and focus only on the training recipe. This work is also similar to a previous paper by several of the same authors, “ResNet strikes back: An improved training procedure in timm” [8], which offers an improved training recipe for ResNets to achieve a stronger baseline for simple “vanilla” ResNets. Fun fact: there is no DeiT 2!

Fast Vision Transformers with HiLo Attention

5 minute read

Published:

The paper proposes a novel, efficient ViT architecture designed with throughput in mind, aiming to mitigate ViT’s high computational cost, which stems from the quadratic memory and time complexity of the attention mechanism. First, the paper argues (and in my opinion rightfully so) that although many improved, more efficient ViT architectures have been proposed, in practice they do not offer high processing speed. This claim might seem contradictory, but previous works usually report metrics such as the number of FLOPs, memory usage, and asymptotic computational complexity (important in themselves); these metrics do not capture actual running time or throughput, and those works do not measure them directly. Moreover, an architecture with a low FLOP count, small memory footprint, or lower asymptotic complexity may actually run slowly on a GPU due to specific operations that are not hardware-friendly or cannot be parallelized. To this end, the paper directly benchmarks FLOPs and memory consumption as well as throughput (on GPU), and proposes a ViT architecture that performs favourably on those metrics while achieving high accuracy when used as a backbone for classification and various downstream vision tasks.

Text-Only Training for Image Captioning using Noise-Injected CLIP

5 minute read

Published:

Cool paper, simple and elegant. Training image captioning models commonly requires supervision in the form of image-text pairs. Given a text-only dataset (so no images, and certainly no pairs), can we leverage CLIP’s [1] strong image and text embedding capabilities to train an image captioning model using text alone? Turns out we can.

CAN - A simple, efficient and scalable contrastive masked autoencoder for learning visual representations

5 minute read

Published:

In my opinion, a very neat paper that combines several ideas from self-supervised learning (SSL): a contrastive loss (most notably SimCLR [1]), reconstruction of masked patches (most notably “Masked Auto-encoders Are Scalable Vision Learners” [2]), and a denoising autoencoder. Their pipeline is summarized in the figure below and works as follows: given an input image, generate two different views by applying augmentations, mask 50% of the patches, add Gaussian noise to the unmasked patches, and feed the resulting noisy masked image to a ViT encoder. We then take the encodings of the unmasked patches, perform mean pooling, pass the result through a light MLP, and apply a contrastive loss (hence the “contrastive” in the title). The encoded “patch image” is then passed to a ViT decoder to reconstruct the masked patches and denoise the unmasked noisy patches, which gives us both the reconstruction loss and the denoising loss.

Research

ImageBind: One Embedding Space To Bind Them All

12 minute read

Published:

The proposed idea is simple: map six different modalities to a joint embedding space. This allows data samples from different modalities that share the same semantic meaning to be mapped to similar vectors (i.e., vectors that are close under the cosine similarity metric) within the embedding space. Embedding modalities using explicitly aligned training data has been proposed before. For instance, the seminal works of CLIP[1] and ALIGN[2] map images and text to a joint embedding space, AudioCLIP[3] extends CLIP by adding an audio modality, and “Everything at Once”[4] embeds audio, images, and text into a joint embedding space to leverage the audio modality in video retrieval. However, embedding six different modalities - images, text, audio, depth, thermal, and Inertial Measurement Unit (IMU) data - particularly ones without explicitly aligned training data, or even datasets where they naturally coexist (e.g., you probably wouldn’t find a dataset of depth images paired with sounds), is a challenge that hasn’t been tackled before. In my opinion, this opens the door to many new applications.

DeiT III: Revenge of the ViT

8 minute read

Published:

As hinted by the title, the paper is a follow-up to DeiT (Training data-efficient image transformers & distillation through attention [1]) and is co-authored by several of DeiT’s authors. The paper was presented at ECCV 2022, and the code is available here under the original DeiT repository. The goal of the paper is to provide an improved training recipe for “vanilla” Vision Transformers (ViT) [2], in order to achieve a stronger baseline for the vanilla ViT without any architectural changes. I find this approach very interesting: there is a large body of work offering various architectural changes to the vanilla ViT (some motivated by ConvNets), e.g., PVT[3], Swin[4], CvT[5], Focal Transformer[6], CoAtNet[7], and here the authors steer away from any changes to the architecture and focus only on the training recipe. This work is also similar to a previous paper by several of the same authors, “ResNet strikes back: An improved training procedure in timm” [8], which offers an improved training recipe for ResNets to achieve a stronger baseline for simple “vanilla” ResNets. Fun fact: there is no DeiT 2!

Fast Vision Transformers with HiLo Attention

5 minute read

Published:

The paper proposes a novel, efficient ViT architecture designed with throughput in mind, aiming to mitigate ViT’s high computational cost, which stems from the quadratic memory and time complexity of the attention mechanism. First, the paper argues (and in my opinion rightfully so) that although many improved, more efficient ViT architectures have been proposed, in practice they do not offer high processing speed. This claim might seem contradictory, but previous works usually report metrics such as the number of FLOPs, memory usage, and asymptotic computational complexity (important in themselves); these metrics do not capture actual running time or throughput, and those works do not measure them directly. Moreover, an architecture with a low FLOP count, small memory footprint, or lower asymptotic complexity may actually run slowly on a GPU due to specific operations that are not hardware-friendly or cannot be parallelized. To this end, the paper directly benchmarks FLOPs and memory consumption as well as throughput (on GPU), and proposes a ViT architecture that performs favourably on those metrics while achieving high accuracy when used as a backbone for classification and various downstream vision tasks.

Text-Only Training for Image Captioning using Noise-Injected CLIP

5 minute read

Published:

Cool paper, simple and elegant. Training image captioning models commonly requires supervision in the form of image-text pairs. Given a text-only dataset (so no images, and certainly no pairs), can we leverage CLIP’s [1] strong image and text embedding capabilities to train an image captioning model using text alone? Turns out we can.

CAN - A simple, efficient and scalable contrastive masked autoencoder for learning visual representations

5 minute read

Published:

In my opinion, a very neat paper that combines several ideas from self-supervised learning (SSL): a contrastive loss (most notably SimCLR [1]), reconstruction of masked patches (most notably “Masked Auto-encoders Are Scalable Vision Learners” [2]), and a denoising autoencoder. Their pipeline is summarized in the figure below and works as follows: given an input image, generate two different views by applying augmentations, mask 50% of the patches, add Gaussian noise to the unmasked patches, and feed the resulting noisy masked image to a ViT encoder. We then take the encodings of the unmasked patches, perform mean pooling, pass the result through a light MLP, and apply a contrastive loss (hence the “contrastive” in the title). The encoded “patch image” is then passed to a ViT decoder to reconstruct the masked patches and denoise the unmasked noisy patches, which gives us both the reconstruction loss and the denoising loss.

Self Supervised Learning

CAN - A simple, efficient and scalable contrastive masked autoencoder for learning visual representations

5 minute read

Published:

In my opinion, a very neat paper that combines several ideas from self-supervised learning (SSL): a contrastive loss (most notably SimCLR [1]), reconstruction of masked patches (most notably “Masked Auto-encoders Are Scalable Vision Learners” [2]), and a denoising autoencoder. Their pipeline is summarized in the figure below and works as follows: given an input image, generate two different views by applying augmentations, mask 50% of the patches, add Gaussian noise to the unmasked patches, and feed the resulting noisy masked image to a ViT encoder. We then take the encodings of the unmasked patches, perform mean pooling, pass the result through a light MLP, and apply a contrastive loss (hence the “contrastive” in the title). The encoded “patch image” is then passed to a ViT decoder to reconstruct the masked patches and denoise the unmasked noisy patches, which gives us both the reconstruction loss and the denoising loss.

Vision Transformer

DeiT III: Revenge of the ViT

8 minute read

Published:

As hinted by the title, the paper is a follow-up to DeiT (Training data-efficient image transformers & distillation through attention [1]) and is co-authored by several of DeiT’s authors. The paper was presented at ECCV 2022, and the code is available here under the original DeiT repository. The goal of the paper is to provide an improved training recipe for “vanilla” Vision Transformers (ViT) [2], in order to achieve a stronger baseline for the vanilla ViT without any architectural changes. I find this approach very interesting: there is a large body of work offering various architectural changes to the vanilla ViT (some motivated by ConvNets), e.g., PVT[3], Swin[4], CvT[5], Focal Transformer[6], CoAtNet[7], and here the authors steer away from any changes to the architecture and focus only on the training recipe. This work is also similar to a previous paper by several of the same authors, “ResNet strikes back: An improved training procedure in timm” [8], which offers an improved training recipe for ResNets to achieve a stronger baseline for simple “vanilla” ResNets. Fun fact: there is no DeiT 2!

Fast Vision Transformers with HiLo Attention

5 minute read

Published:

The paper proposes a novel, efficient ViT architecture designed with throughput in mind, aiming to mitigate ViT’s high computational cost, which stems from the quadratic memory and time complexity of the attention mechanism. First, the paper argues (and in my opinion rightfully so) that although many improved, more efficient ViT architectures have been proposed, in practice they do not offer high processing speed. This claim might seem contradictory, but previous works usually report metrics such as the number of FLOPs, memory usage, and asymptotic computational complexity (important in themselves); these metrics do not capture actual running time or throughput, and those works do not measure them directly. Moreover, an architecture with a low FLOP count, small memory footprint, or lower asymptotic complexity may actually run slowly on a GPU due to specific operations that are not hardware-friendly or cannot be parallelized. To this end, the paper directly benchmarks FLOPs and memory consumption as well as throughput (on GPU), and proposes a ViT architecture that performs favourably on those metrics while achieving high accuracy when used as a backbone for classification and various downstream vision tasks.

Vision and Language

ImageBind: One Embedding Space To Bind Them All

12 minute read

Published:

The proposed idea is simple: map six different modalities to a joint embedding space. This allows data samples from different modalities that share the same semantic meaning to be mapped to similar vectors (i.e., vectors that are close under the cosine similarity metric) within the embedding space. Embedding modalities using explicitly aligned training data has been proposed before. For instance, the seminal works of CLIP[1] and ALIGN[2] map images and text to a joint embedding space, AudioCLIP[3] extends CLIP by adding an audio modality, and “Everything at Once”[4] embeds audio, images, and text into a joint embedding space to leverage the audio modality in video retrieval. However, embedding six different modalities - images, text, audio, depth, thermal, and Inertial Measurement Unit (IMU) data - particularly ones without explicitly aligned training data, or even datasets where they naturally coexist (e.g., you probably wouldn’t find a dataset of depth images paired with sounds), is a challenge that hasn’t been tackled before. In my opinion, this opens the door to many new applications.

Text-Only Training for Image Captioning using Noise-Injected CLIP

5 minute read

Published:

Cool paper, simple and elegant. Training image captioning models commonly requires supervision in the form of image-text pairs. Given a text-only dataset (so no images, and certainly no pairs), can we leverage CLIP’s [1] strong image and text embedding capabilities to train an image captioning model using text alone? Turns out we can.