Zhe Gan (@zhegan4)

2024-10-04 | ❤️ 927 | 🔁 154


💡 CLIP is the default choice for most multimodal LLM research. But we know CLIP is not perfect: it is good at high-level semantics, but not at capturing fine-grained info.

🤩🤩 We present CLOC ⏰, our next-generation image encoder, which has enhanced localization capabilities and serves as a drop-in replacement for CLIP.

🚀🚀 How to do that? We conduct large-scale pre-training with region-text supervision pseudo-labelled on 2B images.

๐ŸŽAs a result, CLOC is indeed a better image encoder, not only for zero-shot image/region tasks, but also for multimodal LLM.

Media

image


Tags

domain-ai-ml domain-genai domain-dev-tools domain-vlm