Glimpse PoW

CLIP

Architecture

Contrastive Language-Image Pre-training (CLIP) uses a dual-encoder architecture to map images and text into a shared latent space. It works by jointly training two encoders: one for images (a ResNet or Vision Transformer) and one for text (a Transformer-based language model).

Image Encoder: The image encoder extracts salient features from the visual input. It takes an image as input and produces a high-dimensional vector representation. CLIP's image encoder is either a convolutional neural network (ResNet) or a Vision Transformer (ViT).
Text Encoder: The text encoder captures the semantic meaning of the corresponding textual description. It takes a text caption or label as input and produces another high-dimensional vector representation. It uses a Transformer-based architecture to process the token sequence.
Shared Embedding Space: The two encoders produce embeddings in a single shared vector space. This shared space allows CLIP to compare text and image representations directly and learn their underlying relationships, as sketched below.
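
The sketch below illustrates this dual-encoder layout in PyTorch. The encoders are deliberately tiny stand-ins (a small CNN and a two-layer Transformer), not CLIP's actual towers; the point is only that both map their input into the same EMBED_DIM-dimensional space, where an image and a caption can be compared directly.

```python
# Minimal dual-encoder sketch (illustrative stand-ins, not CLIP's real towers).
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512  # dimensionality of the shared embedding space (assumed)

class ImageEncoder(nn.Module):
    """Stand-in image tower: a tiny CNN followed by a projection to the shared space."""
    def __init__(self, embed_dim=EMBED_DIM):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, images):                    # images: (B, 3, H, W)
        return self.proj(self.backbone(images))   # (B, embed_dim)

class TextEncoder(nn.Module):
    """Stand-in text tower: token embeddings + Transformer layers + projection."""
    def __init__(self, vocab_size=49408, embed_dim=EMBED_DIM, width=256, layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, width)
        enc_layer = nn.TransformerEncoderLayer(d_model=width, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.proj = nn.Linear(width, embed_dim)

    def forward(self, token_ids):                 # token_ids: (B, seq_len)
        x = self.transformer(self.token_emb(token_ids))
        return self.proj(x.mean(dim=1))           # pool over tokens -> (B, embed_dim)

# Both towers land in the same space, so images and captions can be compared
# with cosine similarity after L2-normalisation.
image_features = F.normalize(ImageEncoder()(torch.randn(4, 3, 224, 224)), dim=-1)
text_features  = F.normalize(TextEncoder()(torch.randint(0, 49408, (4, 77))), dim=-1)
similarity = image_features @ text_features.T     # (4, 4) cosine similarities
```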


Step 1: Contrastive Pre-training

CLIP is pre-trained on a large-scale dataset of 400 million (image, text) pairs collected from the internet. During pre-training, the model is presented with batches of images and text captions. Within a batch, some (image, text) combinations are genuine matches (the caption accurately describes the image), while all other combinations are mismatched. From these pairs, the model learns embeddings in the shared latent space.
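
To make the pairing concrete, here is a minimal sketch of how one pre-training batch is scored, assuming the stand-in encoders defined earlier. Every image in a batch of N pairs is compared with every caption, so the N genuine pairs sit on the diagonal of the similarity matrix and the remaining N² − N combinations act as mismatched negatives.

```python
# Sketch of scoring one pre-training batch (assumes the stand-in encoders above;
# this is illustrative, not the original CLIP training code).
import torch
import torch.nn.functional as F

def batch_similarity(images, token_ids, image_encoder, text_encoder, logit_scale=100.0):
    """Return the (N, N) matrix of scaled cosine similarities for a batch of N pairs."""
    img = F.normalize(image_encoder(images), dim=-1)     # (N, embed_dim)
    txt = F.normalize(text_encoder(token_ids), dim=-1)   # (N, embed_dim)
    # Entry [i, j] compares image i with caption j: the diagonal holds the genuine
    # (matched) pairs, every off-diagonal entry is a mismatched pair.
    # logit_scale plays the role of CLIP's learnable temperature (roughly 100 after training).
    return logit_scale * img @ txt.T
```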

Step 2: Create Dataset Classifiers from Label Text

For each image, multiple text descriptions are considered: the correct one and several incorrect ones. This creates a mix of positive (matching) and negative (mismatched) pairs. The descriptions are fed into the text encoder, generating class-specific embeddings, as sketched below.
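
A hedged sketch of this step, reusing the stand-in TextEncoder from the architecture sketch and a hypothetical tokenize() function (the real CLIP release ships its own BPE tokenizer): each class label is wrapped in a caption-style prompt, encoded once, and the resulting embeddings act as the rows of a classifier.

```python
# Turning class labels into a text-based classifier (illustrative sketch;
# the prompt template and tokenize() helper are assumptions).
import torch
import torch.nn.functional as F

def build_classifier(labels, text_encoder, tokenize):
    # One caption per class, e.g. "a photo of a dog".
    prompts = [f"a photo of a {label}" for label in labels]
    token_ids = tokenize(prompts)                        # (num_classes, seq_len)
    with torch.no_grad():
        class_embeddings = text_encoder(token_ids)       # (num_classes, embed_dim)
    return F.normalize(class_embeddings, dim=-1)         # rows act as classifier weights
```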

At this stage, one crucial component comes into play: the contrastive loss function. This function penalizes the model for incorrectly matched (image, text) pairs and rewards it for correctly matched pairs in the latent space. It encourages the model to learn representations that accurately capture the similarities between visual and textual information.
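
The loss itself fits in a few lines. The sketch below follows the symmetric cross-entropy formulation given as pseudocode in the CLIP paper, applied to the similarity matrix from the earlier batch sketch; it is illustrative rather than the exact training code.

```python
# Symmetric contrastive loss over a batch of N (image, text) pairs.
# `logits` is the (N, N) scaled similarity matrix with genuine pairs on the diagonal.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(logits):
    # The i-th image should match the i-th caption, so the target for row i is i.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_img = F.cross_entropy(logits, targets)      # images -> captions direction
    loss_txt = F.cross_entropy(logits.T, targets)    # captions -> images direction
    return (loss_img + loss_txt) / 2
```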

Step 3: Zero-shot Prediction

Now the trained text encoder is used as a zero-shot classifier. Given a new image, CLIP can make a zero-shot prediction by passing the image through the image encoder and comparing the result against the dataset classifier, without any fine-tuning.

CLIP computes the cosine similarity between the embeddings of all image and text description pairs. During training, it optimizes the parameters of both encoders to increase the similarity of the correct pairs while decreasing the similarity of the incorrect ones.

In this way, CLIP learns a multimodal embedding space where semantically related images and texts are mapped close to each other. At prediction time, the predicted class is the one with the highest logit value.
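
Putting the pieces together, a hedged end-to-end sketch of zero-shot prediction under the same assumptions as above (stand-in encoders, the build_classifier() helper, and a fixed logit scale of 100 standing in for the learned temperature):

```python
# Zero-shot prediction sketch: no fine-tuning, just cosine similarity
# between the image embedding and each class text embedding.
import torch
import torch.nn.functional as F

def zero_shot_predict(image, image_encoder, class_embeddings, logit_scale=100.0):
    with torch.no_grad():
        img = F.normalize(image_encoder(image), dim=-1)   # (1, embed_dim)
    logits = logit_scale * img @ class_embeddings.T       # scaled cosine similarities per class
    probs = logits.softmax(dim=-1)                        # optional per-class probabilities
    return probs.argmax(dim=-1)                           # class with the highest logit
```

Because the classifier is built purely from label text, swapping in a new set of class names requires only re-running build_classifier(); neither encoder needs to be retrained.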