Chris Paxton (@chris_j_paxton)
2024-10-09 | ❤️ 212 | 🔁 38
How do we represent knowledge for the next generation of in-home robots?
We want generally useful robots that can execute multi-step, open-vocabulary queries. This means I should be able to tell my robot something like "check to see if the dog has enough food" or "see if the water on the stove is boiling." That requires a robot with rich knowledge of the world: where objects are, what actions it can take, and so on.
This is not something end-to-end models are doing yet. Instead, the way I see it, there are three main options:
- Use something like retrieval-augmented generation and a large vision-language model to make decisions
- Generate some rich 3d feature-based representation
- Build complex open-vocabulary scene graphs (see the sketch below)
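To make the third option concrete, here is a minimal sketch of an open-vocabulary scene graph: objects are nodes carrying a language-aligned embedding (in real systems, typically a CLIP-style feature fused over 3D observations), edges encode spatial relations, and a free-form query is grounded by cosine similarity against the node embeddings. Everything here is illustrative: `ObjectNode`, `SceneGraph`, and the toy `embed_text` stub are assumptions for the sketch, not any particular system's API.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class ObjectNode:
    name: str             # human-readable label, e.g. "dog bowl"
    position: np.ndarray  # 3D position in the map frame
    embedding: np.ndarray # language-aligned feature (e.g. a CLIP image embedding)

@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)  # (i, j, relation), e.g. (0, 1, "near")

    def add_object(self, node: ObjectNode) -> int:
        self.nodes.append(node)
        return len(self.nodes) - 1

    def query(self, text_embedding: np.ndarray, top_k: int = 1):
        """Return the top-k nodes whose embeddings best match an
        open-vocabulary text query, ranked by cosine similarity."""
        q = text_embedding / np.linalg.norm(text_embedding)
        sims = [float(q @ (n.embedding / np.linalg.norm(n.embedding)))
                for n in self.nodes]
        order = np.argsort(sims)[::-1][:top_k]
        return [(self.nodes[i], sims[i]) for i in order]

# Hypothetical stand-in for a real text encoder (CLIP, SigLIP, ...).
# Deterministic within one run, so identical strings map to identical vectors.
def embed_text(text: str, dim: int = 8) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(dim)

if __name__ == "__main__":
    graph = SceneGraph()
    bowl = graph.add_object(
        ObjectNode("dog bowl", np.array([1.0, 0.5, 0.0]), embed_text("dog bowl")))
    stove = graph.add_object(
        ObjectNode("stove", np.array([3.0, 1.0, 0.9]), embed_text("stove")))
    graph.edges.append((bowl, stove, "near"))
    # Grounding step for a query like "check the dog's food":
    print(graph.query(embed_text("dog bowl"), top_k=1))
```

In a real system the node embeddings would come from fusing vision-language features over many 3D observations, and the relations would be estimated from geometry; the stub above only makes the query-grounding path concrete.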
Short review of what's happening in this space right now ↓

Tags
domain-vision-3d domain-robotics domain-ai-ml domain-genai domain-llm domain-vlm