Multimodal Embeddings (CLIP)
- Understand how CLIP learns joint image-text representations using contrastive learning
- Run your first CLIP similarity queries and interpret the shared embedding space (a minimal query sketch follows this list)
- Practice prompt engineering with images and see how wording shifts retrieval results
- Build text-to-image and image-to-image retrieval systems using cosine similarity
- Experiment with visual vector arithmetic by applying analogies to embeddings (see the second sketch below)
- Explore advanced tasks such as visual question answering (VQA) and image captioning
- Compare multimodal architectures (CLIP, ViLT, ViT-GPT2) and how each handles modality fusion
- Learn how modality-specific encoders (image and audio) integrate into transformer models
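To make the first few objectives concrete, here is a minimal sketch of a CLIP similarity query and text-to-image retrieval with cosine similarity. It assumes the Hugging Face `transformers` library and the `openai/clip-vit-base-patch32` checkpoint; the image file names and query strings are placeholders, not part of the original material.

```python
# Minimal CLIP similarity query / text-to-image retrieval sketch.
# Assumes: transformers, torch, Pillow installed; image paths are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p) for p in ["cat.jpg", "dog.jpg", "car.jpg"]]  # placeholder files
queries = ["a photo of a cat", "a photo of a sports car"]

with torch.no_grad():
    image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    text_emb = model.get_text_features(**processor(text=queries, return_tensors="pt", padding=True))

# L2-normalize both sides so the dot product equals cosine similarity in the shared space
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Rows: text queries, columns: images; the argmax per row is the retrieved image
similarity = text_emb @ image_emb.T
print(similarity)
print(similarity.argmax(dim=-1))
```

Image-to-image retrieval works the same way: embed a query image instead of a text prompt and rank the collection by cosine similarity.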
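The vector-arithmetic objective can be sketched in the same shared space: subtract the text embedding of one concept from an image embedding and add another, then retrieve the nearest images. This is an illustrative sketch only; the analogy texts, image paths, and checkpoint are assumptions, and results vary with the prompts chosen.

```python
# Visual vector arithmetic (analogy) sketch in CLIP's shared embedding space.
# Assumes the same transformers checkpoint as above; paths and texts are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    imgs = [Image.open(p) for p in paths]
    with torch.no_grad():
        emb = model.get_image_features(**processor(images=imgs, return_tensors="pt"))
    return emb / emb.norm(dim=-1, keepdim=True)

def embed_text(text):
    with torch.no_grad():
        emb = model.get_text_features(**processor(text=[text], return_tensors="pt", padding=True))
    return emb[0] / emb[0].norm()

gallery = embed_images(["cat.jpg", "dog.jpg", "car.jpg"])  # placeholder image collection

# Analogy: (embedding of a cat photo) - "a cat" + "a dog" should land near dog images
query = gallery[0] - embed_text("a cat") + embed_text("a dog")
query = query / query.norm()

# Rank the collection by cosine similarity to the arithmetic result
print((gallery @ query).argsort(descending=True))
```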