https://youtu.be/989aKUVBfbk?si=uzoaSLQZlqQAJg1r

OpenAI's CLIP (Contrastive Language-Image Pretraining) is a multimodal model. Given an image, it predicts the most relevant text snippet.

It is made up of two main models:
Vision Transformer (encodes the image)
Text Transformer (encodes the text)

https://github.com/openai/CLIP
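To make the two-encoder idea concrete, here is a minimal sketch of the matching step only, in plain Python. It assumes the Vision Transformer and Text Transformer have already produced embedding vectors; the toy 3-dimensional vectors, the captions, and the function name `match_text_to_image` are made up for illustration and are not part of the CLIP library. The scale factor of 100 follows the usage example in the CLIP README, but treat it as an assumption here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_text_to_image(image_embedding, text_embeddings, logit_scale=100.0):
    """Hypothetical helper: score each text embedding against one image embedding."""
    image = l2_normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text = l2_normalize(text_embedding)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(logit_scale * cosine)
    return softmax(logits)


# Toy stand-ins for encoder outputs (real CLIP embeddings are much larger).
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.3],
    "a diagram": [0.2, 0.1, 0.9],
}

captions = list(text_embeddings)
probs = match_text_to_image(image_embedding, [text_embeddings[c] for c in captions])
best_caption = captions[probs.index(max(probs))]
print(best_caption)
```

The caption whose embedding points in the most similar direction to the image embedding gets the highest probability. In the real model both encoders are trained together so that matching image-text pairs end up close in this shared embedding space.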