Multimodal Finetuning (Mini Project 6)

- Understand what CLIP is and how contrastive learning aligns the image and text modalities
- Fine-tune CLIP for classification (e.g., pizza types) or regression (e.g., solar prediction)
- Add heads on top of CLIP embeddings for specific downstream tasks
- Compare zero-shot performance against fine-tuned model accuracy
- Apply domain-specific LoRA tuning to the vision/text encoders
- Explore regression/classification heads, cosine-similarity scoring, and decision layers
- Learn how diffusion models extend CLIP-like embeddings for text-to-image and video generation
- Understand how video generation differs through temporal modeling and spatiotemporal coherence
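To make the first objective concrete, here is a minimal numpy sketch of the symmetric contrastive objective CLIP trains with: L2-normalize both embedding batches, take temperature-scaled cosine-similarity logits, and apply cross-entropy in both the image-to-text and text-to-image directions. The embeddings below are random placeholders standing in for real encoder outputs; the function name and temperature value are illustrative choices, not CLIP's exact training code.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine-similarity logits (CLIP-style)."""
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    labels = np.arange(len(img))                # matching pairs sit on the diagonal

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
paired = rng.normal(size=(8, 512))
aligned = clip_contrastive_loss(paired, paired)                  # perfectly matched pairs
mismatched = clip_contrastive_loss(paired, rng.normal(size=(8, 512)))
print(aligned, mismatched)  # aligned pairs yield the lower loss
```

The same normalized-dot-product machinery gives zero-shot classification for free: embed one text prompt per class, and the class whose text embedding has the highest cosine similarity to the image embedding wins.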
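For the head-on-top objectives, the usual recipe is to freeze the CLIP encoder and train only a small linear layer on its embeddings. The sketch below uses synthetic clustered vectors as stand-ins for frozen image embeddings (shapes, cluster construction, and hyperparameters are all illustrative); a regression head would be the same idea with a squared-error loss instead of softmax cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for frozen CLIP image embeddings: 3 classes clustered around
# random centers. Real embeddings would come from the vision encoder.
centers = rng.normal(size=(3, 64))
X = np.vstack([c + 0.1 * rng.normal(size=(50, 64)) for c in centers])
y = np.repeat(np.arange(3), 50)

W = np.zeros((64, 3))                       # the only trainable parameters
b = np.zeros(3)
for _ in range(200):                        # plain softmax regression on top
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1.0          # gradient of cross-entropy wrt logits
    W -= 0.1 * X.T @ p / len(y)
    b -= 0.1 * p.mean(axis=0)

acc = ((X @ W + b).argmax(axis=1) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

Because only `W` and `b` are updated, this trains in seconds even on CPU, which is why a linear probe is the standard first baseline before any full fine-tuning.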
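For the LoRA objective, the core idea fits in a few lines: keep the pretrained weight matrix frozen and learn a low-rank update `B @ A`, scaled by `alpha / r`. The dimensions below are a plausible ViT-sized layer and the rank/alpha values are common defaults, not prescribed by the source; in practice a library such as PEFT wires this into the attention projections for you.

```python
import numpy as np

d_in, d_out, r, alpha = 768, 768, 8, 16     # illustrative layer size and rank
rng = np.random.default_rng(2)

W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x):
    # base path plus low-rank update, scaled by alpha / r
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B zero-initialized, the adapter starts as an exact no-op
assert np.allclose(lora_forward(x), W @ x)

full = d_out * d_in                         # params if the layer were unfrozen
lora = r * (d_in + d_out)                   # params LoRA actually trains
print(f"trainable params: {lora} vs {full} ({100 * lora / full:.1f}%)")
```

The zero-initialized `B` means training starts from the pretrained model's exact behavior, and the trainable-parameter count drops by roughly two orders of magnitude at this rank, which is what makes domain-specific tuning of the vision or text encoder cheap.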