Progress has been made in all aspects of the proposal, according to the planned research. In particular, we have developed novel deep-learning methods for modeling complex inputs, and applied them successfully to challenging problems in text and image understanding. Additionally, we have advanced our theoretical understanding of when deep-learning models are expected to work.
Some of the results achieved during the project are:
1. Differentiable Scene Graphs (WACV 2020): A natural way to describe an image is via the set of objects in it, their attributes and their relations, known as a scene graph (SG). A key goal of machine vision is to map images into such scene graphs. However, there is typically no data available to “teach” AI models to perform this task. Here we showed that this learning process can be achieved even without explicit supervision, by using SGs as an intermediate representation when performing simpler tasks. We showed that models trained in this way can generate accurate SGs without explicit SG supervision.
2. Modeling entities in text (EMNLP 2020): Texts typically describe multiple entities. A key task is to discover all parts of the text that correspond to a given entity. In particular, we would like to teach models to achieve this with limited supervision. Here we showed that such models can be significantly improved by allowing them to “read” large volumes of text and build entity representations that can predict entity properties, such as the correct pronoun or the name of the entity. We showed how such models can be trained from unlabeled data, resulting in state-of-the-art performance.
3. Improved Question Answering via Recurring Entities (published in ACL 2021): Answering questions about text is an important application of natural language processing. Current systems typically require large amounts of manually annotated data, which are expensive to collect. Here we showed that systems can be trained with far less data, by again focusing on the notion of entities in text. In particular, we find recurring entities in text, hide one occurrence of such an entity, and “ask” the model to complete it. Since this task is very similar to question answering, a model that learns it can quickly adapt to answering questions.
4. Theoretical understanding of deep-learning (e.g. two ICML 2021 papers, ICML 2022, UAI 2022): Artificial neural networks have achieved great empirical success. But from a theoretical perspective, it is not sufficiently understood why they can be optimized effectively, and why they generalize well even when trained on relatively little data. In this project we have advanced our understanding of this problem, by providing theoretical results that rigorously characterize the outcome of deep-learning for certain models that are key to semantic understanding.
5. The visual prompting approach to pre-training visual models (NeurIPS 2022): This is one of the first illustrations of how visual models can be trained so that they can be applied to multiple different vision tasks (e.g. segmentation).
6. Understanding in-context learning in language and vision models: Current models have a striking ability to learn from a few examples presented in context. How this is achieved has been unclear thus far. In an EMNLP Findings paper (2023) we presented a mechanism that may underlie this phenomenon, and showed empirically how this is achieved. In an ECCV 2024 paper we further showed how these ideas also apply to visual models.
7. Using scene structure within transformer architectures: SGs are an effective formalism for describing complex scenes, and there are some rich datasets that contain scene graph annotations. However, it was not clear how to use these effectively within transformer architectures and for improving other tasks. We developed various adaptations of transformers that can be applied to SGs, and showed resulting improvements in a wide range of visual tasks (e.g. NeurIPS 2022, EMNLP 2023).
8. Analysis of implicit bias in RNNs: Temporal models are very important for learning from text, vision and other signals. It has been observed that these models can generalize well (in both interpolation and extrapolation) even with relatively little training data. We provided a theoretical analysis of this in two papers (AISTATS 2022, ICLR 2023), which used a new type of analysis that focused on the moment-matching aspect of these models.
The results obtained during the project were published at the top venues in the field (e.g. NeurIPS, ICLR, ICML, CVPR, ECCV, ACL, NAACL, EMNLP) and have inspired follow-up work in both the theory community and the applied language and vision communities.
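
As an illustration of the recurring-entity idea in item 3 (hide one occurrence of a recurring entity and "ask" the model to complete it), the following is a minimal Python sketch of how such training examples might be built from unlabeled text. It is not the published method: the capitalized-word heuristic standing in for entity detection, the function name mask_recurring_span and the [MASK] placeholder are all assumptions made for illustration.

```python
import re
from collections import Counter

MASK = "[MASK]"  # hypothetical placeholder token, chosen for illustration


def mask_recurring_span(text, mask_token=MASK):
    """Build one (masked_text, answer) pair, or return None.

    Assumption: capitalized words serve as a crude stand-in for entity
    mentions; a real system would use a proper entity or span detector.
    """
    candidates = re.findall(r"\b[A-Z][a-z]+\b", text)
    counts = Counter(candidates)
    # Keep only candidates that occur at least twice in the text.
    recurring = [w for w in candidates if counts[w] >= 2]
    if not recurring:
        return None
    answer = recurring[0]
    # Hide only the last occurrence, so earlier mentions remain as context.
    matches = list(re.finditer(r"\b" + re.escape(answer) + r"\b", text))
    last = matches[-1]
    masked = text[:last.start()] + mask_token + text[last.end():]
    return masked, answer


if __name__ == "__main__":
    example = "Marie Curie won the Nobel Prize. Later Curie won it again."
    print(mask_recurring_span(example))
```

A model trained to fill in the hidden span from the remaining context is solving a task that resembles extractive question answering, which is the intuition the item describes.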