Zhe Gan (@zhegan4)
2024-10-04 | ❤️ 927 | 🔁 154
💡 CLIP is the default choice for most multimodal LLM research. But we know CLIP is not perfect: it captures high-level semantics well, yet struggles with fine-grained information.
🤩🤩 We present CLOC ⏰, our next-generation image encoder with enhanced localization capabilities, which serves as a drop-in replacement for CLIP.
How do we do that? We conduct large-scale pre-training with region-text supervision, pseudo-labelled on 2B images.
As a result, CLOC is indeed a better image encoder, not only for zero-shot image and region tasks, but also for multimodal LLMs.
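For readers curious what region-text contrastive pre-training looks like in general, here is a minimal sketch of a symmetric InfoNCE loss between pooled region embeddings and text embeddings. This is a generic illustration, not CLOC's actual objective; the function names and the toy data are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit hypersphere before computing cosine similarity
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def region_text_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE: region i should match text i (hypothetical sketch)."""
    r = l2_normalize(region_feats)
    t = l2_normalize(text_feats)
    logits = r @ t.T / temperature  # (N, N) cosine-similarity matrix, scaled

    def xent(lg):
        # Cross-entropy with the diagonal (matched pair) as the target class
        lg = lg - lg.max(axis=1, keepdims=True)         # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        idx = np.arange(len(lg))
        return -logp[idx, idx].mean()

    # Average the region-to-text and text-to-region directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
regions = rng.normal(size=(8, 512))                    # toy pooled region embeddings
texts = regions + 0.01 * rng.normal(size=(8, 512))     # near-matching region captions
print(region_text_contrastive_loss(regions, texts))    # small loss for matched pairs
```

With matched pairs the diagonal dominates the similarity matrix and the loss approaches zero; with mismatched text the loss rises, which is the signal that drives localization-aware training.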
