Fast Intro to image and text Multi-Modal with OpenAI CLIP

inspirit941 2024. 3. 26. 17:52

https://youtu.be/989aKUVBfbk?si=uzoaSLQZlqQAJg1r

Notes on OpenAI's CLIP, one of the multimodal models.

It is built from two main models:

Vision Transformers
Text Transformers

https://github.com/openai/CLIP

Each image and text in a pair is embedded separately -> the model is trained so that the two output embedding vectors end up as close to each other as possible.
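
To make "as close as possible" concrete, here is a minimal pure-Python sketch of a CLIP-style symmetric contrastive loss. It is an illustration under stated assumptions, not OpenAI's training code: the function names are made up for this note, the vectors are toy values, and the fixed temperature of 0.07 is an assumed value (in CLIP the temperature is a learned parameter).

```python
import math
from typing import List

Vector = List[float]


def l2_normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_entropy(logits: List[float], target: int) -> float:
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[target]


def clip_style_loss(image_embs: List[Vector], text_embs: List[Vector],
                    temperature: float = 0.07) -> float:
    """Symmetric contrastive loss: pair k (image k, text k) is the positive,
    every other combination in the batch is a negative."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j] = scaled cosine similarity between image i and text j
    sim = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # image -> text direction: each row should peak on the diagonal
    loss_img = sum(cross_entropy(sim[k], k) for k in range(n)) / n
    # text -> image direction: each column should peak on the diagonal
    loss_txt = sum(cross_entropy([sim[r][k] for r in range(n)], k)
                   for k in range(n)) / n
    return (loss_img + loss_txt) / 2


aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is small when matching pairs point in the same direction and large when they are swapped, which is what pulls paired embeddings together during training.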

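In other words, CLIP takes an image and a text and places each pair close together in a shared vector space.
- text: returns a single 512-dimensional embedding vector
This enables tasks such as image & text classification and image & text search. Combining images and text opens up many more things to try.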
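
Because both modalities live in one vector space, search reduces to a nearest-neighbor lookup by cosine similarity. The sketch below assumes toy 3-dimensional vectors standing in for the 512-dimensional embeddings; the file names, the query vector, and the `search` helper are hypothetical and only illustrate the idea.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query: List[float], index: Dict[str, List[float]],
           top_k: int = 1) -> List[Tuple[str, float]]:
    """Rank every indexed item by similarity to the query embedding."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical image embeddings (toy values, not real CLIP output).
image_index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.0],
    "car.jpg": [0.0, 0.1, 0.9],
}
# Pretend this is the text embedding of "a photo of a cat".
text_query = [0.8, 0.2, 0.1]

for name, score in search(text_query, image_index, top_k=2):
    print(name, round(score, 3))
```

Zero-shot classification works the same way with the roles reversed: the image embedding is the query and the candidate label texts form the index.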

https://huggingface.co/openai/clip-vit-base-patch32

There is an official GitHub repo, but for implementation work the Hugging Face model linked above is the more convenient choice,

because its interface is aligned with the Hugging Face library.

A simple example of text -> image mapping with CLIP.
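
A minimal sketch of such a mapping with the `openai/clip-vit-base-patch32` checkpoint, assuming the `transformers`, `torch`, and `Pillow` packages are installed. It is not the code from the video. The helper names `best_image_per_text` and `match_texts_to_images` are hypothetical, and the heavy imports are placed inside the function only so the small ranking helper can be used on its own.

```python
from typing import List, Sequence


def best_image_per_text(logits_per_text: Sequence[Sequence[float]]) -> List[int]:
    """For each text (row), return the index of the highest-scoring image (column)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits_per_text]


def match_texts_to_images(texts: List[str], image_paths: List[str]) -> List[str]:
    """Return, for every text, the path of the image CLIP scores highest."""
    # Imported lazily: these are large optional dependencies.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model_id = "openai/clip-vit-base-patch32"
    model = CLIPModel.from_pretrained(model_id)
    processor = CLIPProcessor.from_pretrained(model_id)

    images = [Image.open(path).convert("RGB") for path in image_paths]
    # The processor tokenizes the texts and resizes/normalizes the images.
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)

    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_text has shape (len(texts), len(images)):
    # scaled similarity between each text and each image.
    scores = outputs.logits_per_text.tolist()
    return [image_paths[i] for i in best_image_per_text(scores)]
```

A call might look like `match_texts_to_images(["a photo of a cat", "a photo of a dog"], ["cat.jpg", "dog.jpg"])`, where the file names are placeholders for local images. The first call downloads the model weights from the Hugging Face Hub.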
