Periodic Reporting for period 4 - HOLI (Deep Learning for Holistic Inference)
Reporting period: 2023-08-01 to 2025-01-31
The goal of this project is to create a general methodology for semantic interpretation of input streams, and to apply it to challenges such as text and video understanding.
Our approach consists of several key elements. First, we focus on deep-learning architectures, since these are highly expressive and have the potential for modeling complex semantic structures. Second, we ask how deep-learning models should be constructed so that they can discover entities in input streams. Third, we ask how interactions between entities can be modeled in a way that also improves our overall understanding of the scene. In other words, we expect our system to integrate information from across the input (i.e. in a top-down, holistic manner) in order to achieve better semantic understanding. Fourth, we seek a theoretical understanding of when such models are expected to work well, and how they can be learned from data.
The project also places emphasis on the practical performance of the methods developed, and in particular on improving capabilities in the fields of language and video understanding.
The potential impact of the project is considerable. First, the methods developed here can improve technologies such as robotics, medical diagnostics, and automated dialogue. Second, from a scientific perspective, understanding the design principles and theoretical aspects of deep learning can help improve AI methodology in general.
Some of the results achieved during the project are:
1. Differentiable Scene Graphs (WACV 2020): A natural description of an image is via the set of objects in it, their attributes, and their relations, known as a scene graph (SG). A key goal of machine vision is to map images into such scene graphs. However, there is typically no data available to “teach” AI models to perform this task. Here we showed that this mapping can be learned even without explicit supervision, by using SGs as an intermediate representation when performing simpler tasks. Models trained in this way generate accurate SGs despite never being directly supervised to do so (a schematic sketch appears after this list).
2. Modeling entities in text (EMNLP 2020): Texts typically describe multiple entities, and a key task is to discover all parts of the text that correspond to a given entity, ideally with limited supervision. Here we showed that such models can be significantly improved by allowing them to “read” large volumes of text and build entity representations that can predict entity properties, such as the correct pronoun or the name of the entity. We showed how such models can be trained from unlabeled data, resulting in state-of-the-art performance.
3. Improved Question Answering via Recurring Entities (ACL 2021): Answering questions about text is an important application of natural language processing. Current systems typically require large amounts of manually annotated data, which are expensive to collect. Here we showed that systems can be trained with far less data, by again focusing on the notion of entities in text. In particular, we find recurring entities in a text, hide one such occurrence, and “ask” the model to complete it (a schematic sketch of this data-construction step appears after this list). Since this task is very similar to question answering, a model that learns it can quickly adapt to answering questions.
4. Theoretical understanding of deep learning (e.g. two ICML 2021 papers, ICML 2022, UAI 2022): Artificial neural networks have achieved great empirical success, but from a theoretical perspective it is not sufficiently understood why they can be optimized effectively, and why they generalize well even when trained on relatively little data. In this project we advanced the understanding of this problem by providing theoretical results that rigorously characterize the outcome of deep learning for certain models that are key to semantic understanding.
5. The visual prompting approach to pre-training visual models (NeurIPS 2022): This is one of the first illustrations of how visual models can be trained so that they can be applied to multiple different vision tasks (e.g. segmentation).
6. Understanding in-context learning in language and vision models: Current models have a striking ability to learn from a few examples presented in context, but how this is achieved has been unclear thus far. In an EMNLP Findings 2023 paper we proposed a mechanism that may underlie this phenomenon and supported it empirically. In ECCV 2024 we further showed how these ideas also apply to visual models.
7. Using scene structure within transformer architectures: SGs are an effective formalism for describing complex scenes, and there are rich datasets that contain scene graph annotations. However, it was not clear how to use these effectively within transformer architectures, or how to leverage them to improve other tasks. We developed various adaptations of transformers that can be applied to SGs (a simple illustration of the interface appears after this list), and showed resulting improvements on a wide range of visual tasks (e.g. NeurIPS 2022, EMNLP 2023).
8. Analysis of implicit bias in RNNs: Temporal models are very important for learning from text, video, and other signals. It has been observed that these models can generalize well (in both interpolation and extrapolation) even with relatively little training data. We provided a theoretical analysis of this in two papers (AISTATS 2022, ICLR 2023), using a new type of analysis focused on the moment-matching behavior of these models.
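To make result 1 more concrete, below is a minimal PyTorch sketch of a scene-graph-like latent representation (nodes for objects, edges for relations) trained end-to-end through a downstream objective only. All names, dimensions, and the toy pairwise-scoring task are illustrative assumptions, not the architecture used in the paper.

```python
# Minimal sketch (assumptions, not the paper's code): a latent SG whose
# nodes/edges are shaped purely by gradients from a downstream task.
import torch
import torch.nn as nn

class DifferentiableSG(nn.Module):
    def __init__(self, feat_dim=256, sg_dim=64):
        super().__init__()
        self.node_enc = nn.Linear(feat_dim, sg_dim)    # latent SG nodes
        self.edge_enc = nn.Linear(2 * sg_dim, sg_dim)  # latent SG edges
        self.head = nn.Linear(sg_dim, 1)               # downstream scorer

    def forward(self, obj_feats):
        # obj_feats: (batch, n_objects, feat_dim) region features, e.g. from a detector
        nodes = torch.relu(self.node_enc(obj_feats))
        b, n, d = nodes.shape
        pairs = torch.cat([nodes.unsqueeze(2).expand(b, n, n, d),
                           nodes.unsqueeze(1).expand(b, n, n, d)], dim=-1)
        edges = torch.relu(self.edge_enc(pairs))
        # Training this pairwise score with downstream labels shapes the
        # latent nodes/edges into an SG-like structure without SG annotations.
        return self.head(edges).squeeze(-1)  # (batch, n_objects, n_objects)

scores = DifferentiableSG()(torch.randn(2, 10, 256))
print(scores.shape)  # torch.Size([2, 10, 10])
```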
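Result 3 describes a self-supervised data-construction step: find a span that recurs in a passage, hide one occurrence, and treat it as the answer the model must recover from the remaining occurrences. The following is a minimal, hypothetical sketch of that step; the heuristics (capitalization as an entity cue, a [QUESTION] placeholder token) are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of recurring-span masking for QA-style pretraining.
from collections import Counter

MASK = "[QUESTION]"  # placeholder standing in for the hidden mention

def recurring_spans(text, min_words=2, max_words=6):
    """Return capitalized word n-grams that appear at least twice."""
    tokens = text.split()
    counts = Counter()
    for n in range(min_words, max_words + 1):
        for i in range(len(tokens) - n + 1):
            span = " ".join(tokens[i:i + n])
            if span[0].isupper():  # crude stand-in for entity detection
                counts[span] += 1
    return [s for s, c in counts.items() if c >= 2]

def make_example(text):
    """Mask the first occurrence of a recurring span; the span is the target."""
    spans = sorted(recurring_spans(text), key=len, reverse=True)
    if not spans:
        return None
    answer = spans[0]
    context = text.replace(answer, MASK, 1)  # hide exactly one occurrence
    return {"context": context, "answer": answer}

print(make_example(
    "Marie Curie studied in Paris. Later, Marie Curie won two Nobel Prizes."
))
# {'context': '[QUESTION] studied in Paris. Later, Marie Curie won ...',
#  'answer': 'Marie Curie'}
```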
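For result 7, one simple (hypothetical) baseline conveys the interface between SGs and transformers: flatten the (subject, relation, object) triplets into a token sequence the model can embed. The special tokens below are illustrative; the papers develop richer, structure-aware adaptations.

```python
# Hypothetical baseline: serialize an SG into tokens for a transformer.
def serialize_scene_graph(triplets):
    """Flatten SG triplets into a flat token list with separators."""
    tokens = ["[SG]"]
    for subj, rel, obj in triplets:
        tokens += [subj, rel, obj, "[SEP]"]
    return tokens

print(serialize_scene_graph([("man", "riding", "horse"),
                             ("horse", "on", "beach")]))
# ['[SG]', 'man', 'riding', 'horse', '[SEP]', 'horse', 'on', 'beach', '[SEP]']
```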
The results obtained during the project were published at the top venues in the field (e.g. NeurIPS, ICLR, ICML, CVPR, ECCV, ACL, NAACL, EMNLP) and have inspired follow-up work in both the theory community and the applied language and vision communities. Among the results that advanced the state of the art:
1. Our paper in ACL 2021 presented novel question answering models that can be trained on far less data than previous models.
2. Our paper in EMNLP 2020 presented coreference resolution models (i.e. entity discovery in text) with state-of-the-art accuracy, achieved by training on large amounts of free text.
3. Our DETReg model (CVPR 2022) achieved state-of-the-art results on object detection in images.
4. Our results in WACV 2020 achieved state-of-the-art performance on the task of referring relationships in images.
5. Our paper in ECCV 2020 presented state-of-the-art results for generating images from semantic descriptions, in particular images containing a relatively large number of objects.
6. In another ECCV 2020 paper we obtained state-of-the-art results on the problem of tracking objects in video, including cases where objects are occluded, or even carried by other objects, during parts of the video.
7. Our theoretical results in UAI 2021 were the first to show a complete optimization and generalization guarantee for a max-pooling architecture, which is a core component of deep learning systems.
8. The visual prompting paper (NeurIPS 2022) showed a surprising and important result that exceeded our expectations, both in its empirical findings and in its impact on the field. It essentially showed how the success of LLM pretraining could potentially be replicated in the visual domain, and it has already influenced subsequent work.
9. Our results on implicit bias in temporal models (e.g. ICLR 2023) were among the first to show that large models can in fact learn simple temporal rules without overfitting. This involved a novel use of moment-matching results and sparsity-based methods (a simple illustration of the moment-matching view appears below).
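To give a flavor of the moment-matching view mentioned in results 8 and 9, the following is a minimal expository illustration in the simplest linear-RNN setting; this setup is an assumption chosen for clarity and is far simpler than the models analyzed in the papers.

```latex
% Expository assumption: a linear RNN with state h_t and scalar readout.
\[
  h_t = W h_{t-1} + U x_t, \qquad y_T = v^{\top} h_T, \qquad h_0 = 0 .
\]
% Unrolling the recursion expresses the output through the
% impulse-response coefficients c_k = v^{\top} W^{k} U:
\[
  y_T = \sum_{t=1}^{T} v^{\top} W^{\,T-t} U \, x_t
      = \sum_{k=0}^{T-1} c_k \, x_{T-k} .
\]
% Two such RNNs agree on all length-T inputs exactly when their first T
% coefficients ("moments") match; extrapolation to longer sequences is
% governed by the remaining c_k, which the implicit bias of training can
% drive toward simple (e.g. sparse, short-memory) patterns.
```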