Work package 2
Task 2.1 – Business and technical requirements specification (Completed)
Deliverable D2.1 offers a very detailed and solid description of the specification, requirements, and reference architecture, and is used as a foundation for future work.
Task 2.2 – Data management plan (Completed)
The corresponding deliverable D2.2 – Data Management Report led by CUT has been submitted on time.
Task 2.3 – Reference Architecture (Completed)
The submitted deliverable D2.1 describes the aiD components and architectures in detail, along with the modelling pipelines and model integration that are used by the aiD project.
Work package 4
Task 4.1 - Dataset curation (Completed)
In the context of this task, Deliverable D4.1 was submitted. We created, collected and curated the largest open annotated SL footage dataset in the literature. This dataset is publicly available on Zenodo.org.
Task 4.2 - SL-to-text algorithm (Completed)
We developed a Transformer model for text generation from SL (SLT), published at the top-tier ICCV conference. The model was subsequently refined by incorporating trajectory features extracted with tools such as AlphaPose and OpenPose; the final work was published at ICCV 2023. The final contribution of the task is contained in D4.2 and is used in the pilots.
Task 4.3 - Text-to-speech algorithm (Completed)
We started the development phase of the text-to-speech (TTS) task by implementing and exploring the performance of the well-known Tacotron-2. This is part of D4.2.
Work package 5
Task 5.1 - Dataset curation (Completed)
Deliverable D5.1 provides a curated dataset for learning to generate realistic-looking videos of SL translation by means of generative AI. The outcome represents the largest dataset for training deep generative models for Text-to-SL video (SLG).
Task 5.2 - Speech-to-text algorithm (Completed)
The aiD consortium has developed new libraries for both speech-to-text and text-to-speech tasks, building upon state-of-the-art deep neural networks, including Jasper. This is part of D5.2.
Task 5.3 & Task 5.4 - Text-to-SL video (Completed)
We initially experimented with Transformer generative models, mainly the Progressive Transformer, for learning to generate trajectories from text; these trajectories would then be fed to an expensive and complicated graphics engine. The resulting accuracy was disturbingly low, although commensurate with the best methods reported in the literature.
Drawing on these results, we took the bold step of considering bleeding-edge denoising diffusion probabilistic models (DDPMs) for text-to-video generation. We fine-tuned them on the SL video generation task and obtained a groundbreaking outcome that is both more efficient and more accurate than any existing alternative (publication pending; part of D5.2).
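For readers unfamiliar with DDPMs, the forward (noising) process that such models learn to invert can be sketched as follows. This is a minimal illustration of the standard formulation, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, and not the consortium's implementation. The function names, the linear beta schedule, its default values, and the toy three-element "frame" are all assumptions made for the example.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (assumed schedule, for illustration only)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar_t, noise):
    """Sample x_t from x_0 in closed form, given Gaussian noise of the same length."""
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]


# Toy usage: a hypothetical 3-value "frame" noised at an intermediate step.
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
frame = [0.2, -0.5, 0.9]
eps = [random.gauss(0.0, 1.0) for _ in frame]
noisy_frame = add_noise(frame, alpha_bars[499], eps)
```

A denoising network is trained to predict the noise from the noised sample and the step index; generation then runs this process in reverse, conditioned in the text-to-video case on the input text.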
Work package 6
Task 6.1 – Bayesian Inference Mechanisms (Completed)
The developed techniques achieve a compression rate of more than 70% for the SLT model.
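To make the figure concrete, the short sketch below shows one common way to compute a compression rate, as the fraction of parameters removed. This definition is an assumption, since the report does not state whether the rate refers to parameter count, storage size or another measure, and the parameter counts used are purely hypothetical.

```python
def compression_rate(original_params, compressed_params):
    """Fraction of parameters removed by compression (assumed definition)."""
    if original_params <= 0:
        raise ValueError("original_params must be positive")
    if not 0 <= compressed_params <= original_params:
        raise ValueError("compressed_params must lie in [0, original_params]")
    return 1.0 - compressed_params / original_params


# Hypothetical counts, not measurements from the aiD SLT model.
rate = compression_rate(10_000_000, 2_900_000)
print(f"Compression rate: {rate:.0%}")
```

Under this definition, a rate above 70% means the compressed model retains fewer than 30% of the original parameters.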
Task 6.2 – Network Distillation approaches (Completed)
We experimented with distillation on DDPMs to yield scalable models for SLG.
Work package 7 – Pilots and Evaluation (Completed)
We had a very active and open discussion and collaboration among the researchers who worked in WP7, WP4 and WP5. All methodology developments have been disseminated to pilot development in a timely and accurate manner and incorporated into the pilots.