Zhe Gan (@zhegan4)

2024-10-04 | ❤️ 927 | 🔁 154

💡CLIP is the default choice for most multimodal LLM research. But, we know CLIP is not perfect. It is good at high-level semantics, but not for capturing fine-grained info.

🤩🤩 We present CLOC ⏰, our next-generation image encoder, with enhanced localization capabilities, and serves as a drop-in replacement for CLIP.

🚀🚀How to do that? We conduct large-scale pre-training with region-text supervision pseudo-labelled on 2B images.

🎁As a result, CLOC is indeed a better image encoder, not only for zero-shot image/region tasks, but also for multimodal LLM.

📚 세현's Vault

🌍 도메인

📄 Papers

clip-is-the-default-choice-for-most-multimodal-llm-research-but-we-know-clip-is-1

Zhe Gan (@zhegan4)

미디어

Tags

그래프 뷰

목차

백링크