Semi-Automatic Project Classification - Explainability Notice
1. Introduction
The Community Research and Development Information Service (CORDIS) uses a semi-automatic project classification system to categorise EU-funded research projects according to their respective scientific fields. This system, which leverages artificial intelligence (AI) techniques, specifically natural language processing (NLP) and machine learning (ML), enables users to quickly find projects related to specific areas of research. The classification system uses the European Science Vocabulary (EuroSciVoc) taxonomy, which provides a hierarchical tree structure for categorising scientific fields. EuroSciVoc is a multilingual taxonomy that covers over 1 000 scientific fields, making it an essential tool for researchers, policymakers and other stakeholders looking to explore developments in science and technology. Scientific journalists writing for CORDIS, project beneficiaries and other CORDIS users are encouraged to verify classifications and provide feedback, so that the classification system can be continuously improved.
2. Terms and definitions
| Term | Definition |
|---|---|
| Artificial intelligence (AI) | Technology that enables computers and machines to simulate or to imitate human intelligence and problem-solving capabilities. |
| Classification model | A type of machine learning model that assigns input data to one or more predefined classes or labels. |
| Fields of Research and Development (FoRD) | The OECD classification developed for measurement purposes, following primarily a content approach, which is used as the backbone of EuroSciVoc. |
| Machine Learning (ML) | A type of AI that allows software applications to 'learn' from past practice and feedback, becoming more accurate at predicting outcomes without being explicitly programmed. |
| Natural Language Processing (NLP) | A field of AI used to analyse, understand and process human language. |
| Organisation for Economic Co-operation and Development (OECD) | An intergovernmental organisation, founded in 1961 to stimulate economic progress and world trade. |
| Semi-Automatic Classification System (SACS) | The software used on CORDIS for content classification and taxonomy maintenance. |
| Simple Knowledge Organisation System (SKOS) | A W3C standard for representing controlled vocabularies. |
| World Wide Web Consortium (W3C) | The main international standards organisation for the World Wide Web. |
3. EU-funded research project information on CORDIS
CORDIS provides detailed information on EU-funded research projects, including project objectives, funding, outcomes and results. To manage the large volume and the complexity of project subjects, CORDIS employs a classification system that leverages AI techniques, specifically NLP and ML, to order the content by fields of science as listed in EuroSciVoc.
4. What is semi-automatic project classification?
Semi-automatic project classification is a feature of the SACS software that classifies EU-funded research projects by applying the EuroSciVoc taxonomy. SACS leverages AI techniques, specifically NLP and ML, to match projects with the relevant scientific fields in EuroSciVoc. The aim is to ensure efficient classification of all projects on CORDIS. The system is "semi-automatic" because it allows human intervention, including the validation of classifications and the maintenance of the taxonomy.
5. How does the classification process work?
The classification process involves two main stages: automated classification, followed by human review of part of the results, which is what makes the overall process semi-automatic.
5.1. The automated classification
The classification software uses a classification model trained on a sample of project descriptions covering a broad range of scientific fields. The model takes project-related texts as input and provides categories and quality indicators as output. The automated steps are:
- Preprocessing: texts relating to the project are assembled and cleaned of unnecessary or disruptive elements (HTML tags, formatting codes, extra spaces, line breaks, etc.).
- Domain detection: the high-level scientific domain of the content is identified using an integrated NLP tool.
- Collocation (combinations of words) and keyword extraction: the text is annotated with an integrated NLP tool to identify relevant keywords and collocated keywords.
- Classification: business rules are used to weight extracted keywords and domains, ranking categories by relevance.
- Category selection and prioritisation: the ranked list of categories is processed using a combination of business rules and hierarchical logic to select categories with the highest combined relevance scores.
- Assignment of categories: from the categories that exceed the minimum relevance threshold, the five highest-ranked are recommended for each project.
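The automated steps above can be illustrated with a minimal sketch. It is not the SACS implementation: the keyword rules, weights, domain boost and threshold value are all hypothetical, the category labels are illustrative placeholders rather than verified EuroSciVoc entries, and simple substring matching stands in for the integrated NLP tool. The hierarchical logic of the category selection step is not modelled. Only the limit of five recommended categories is taken from this notice.

```python
import html
import re

# Hypothetical values: the notice does not publish the real threshold or weights.
MIN_RELEVANCE = 1.0   # assumed minimum relevance threshold
MAX_CATEGORIES = 5    # "five highest-ranked" is stated in the notice
DOMAIN_BOOST = 1.5    # assumed bonus when a keyword's domain matches the detected domain

# Hypothetical business rules: keyword -> (domain, category, weight).
RULES = {
    "machine learning": ("natural sciences", "machine learning", 2.0),
    "neural network": ("natural sciences", "machine learning", 1.5),
    "photovoltaic": ("engineering and technology", "solar energy", 2.0),
    "vaccine": ("medical and health sciences", "vaccines", 2.0),
}


def preprocess(raw):
    """Strip HTML tags, decode entities, collapse whitespace and lower-case."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip().lower()


def extract_keywords(text):
    """Count occurrences of each known keyword (stand-in for NLP annotation)."""
    return {kw: text.count(kw) for kw in RULES if kw in text}


def detect_domain(hits):
    """Pick the domain with the most keyword hits, or None if there are no hits."""
    totals = {}
    for kw, count in hits.items():
        domain = RULES[kw][0]
        totals[domain] = totals.get(domain, 0) + count
    return max(totals, key=totals.get) if totals else None


def rank_categories(hits, domain):
    """Weight keywords and the detected domain, then rank categories by score."""
    scores = {}
    for kw, count in hits.items():
        kw_domain, category, weight = RULES[kw]
        boost = DOMAIN_BOOST if kw_domain == domain else 1.0
        scores[category] = scores.get(category, 0.0) + count * weight * boost
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def classify(raw):
    """Run all steps and return at most MAX_CATEGORIES (category, score) pairs."""
    text = preprocess(raw)
    hits = extract_keywords(text)
    domain = detect_domain(hits)
    ranked = rank_categories(hits, domain)
    return [(cat, score) for cat, score in ranked if score >= MIN_RELEVANCE][:MAX_CATEGORIES]
```

Calling `classify()` on a project objective returns a ranked list of `(category, score)` pairs, which corresponds to the "categories and quality indicators" output described above. Keeping each step in its own function mirrors the list of steps, so any one of them could be swapped for a real NLP component without touching the others.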
5.2. Human review and sampling
The classification of projects is partially reviewed by humans. A sampling‑based approach is used to maintain quality and performance, including:
- Classical sampling (model training): The initial classification model is trained on a large, representative sample of projects, providing a baseline for accuracy.
- Opportunistic sampling (ongoing validation): A validated document set is maintained, consisting of a selected subset of documents that are manually classified. Human reviews support the sampling process by providing ongoing, practical feedback on system output. This is, for example, done by:
  - Scientific journalists and project beneficiaries, who are encouraged to verify classifications in the communications sent when articles about the projects are published.
  - Registered CORDIS website users, who can suggest new categories via the "Suggest new fields of science" feature. These suggestions are moderated by the CORDIS team, and the users' feedback is used to validate the classifications.
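As an illustration of how a validated document set could be used for ongoing validation, the sketch below compares automatic output with manually classified categories. This notice does not state which quality measures CORDIS uses; per-document precision and recall are an assumption chosen here for simplicity, and the function names and identifiers are hypothetical.

```python
def agreement(predicted, validated):
    """Return (precision, recall) of predicted categories against validated ones."""
    predicted, validated = set(predicted), set(validated)
    if not predicted or not validated:
        return 0.0, 0.0
    overlap = len(predicted & validated)
    return overlap / len(predicted), overlap / len(validated)


def review_sample(automatic, validated_set):
    """Average precision and recall over every document in the validated set.

    automatic:     dict mapping document id -> categories suggested by the model
    validated_set: dict mapping document id -> categories confirmed by a human
    """
    pairs = [
        agreement(automatic.get(doc_id, []), categories)
        for doc_id, categories in validated_set.items()
    ]
    if not pairs:
        return 0.0, 0.0
    count = len(pairs)
    mean_precision = sum(p for p, _ in pairs) / count
    mean_recall = sum(r for _, r in pairs) / count
    return mean_precision, mean_recall
```

A falling average on the validated set would be one possible signal that the business rules or the taxonomy need attention.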
The classification status is shown to end users in the web interface. A human-validated classification carries a green badge with the text:
"CORDIS classifies projects with EuroSciVoc, a multilingual taxonomy of fields of science, through a semi-automatic process based on NLP techniques. The classification of this project has been human-validated."
In contrast, a simple blue information tooltip indicates that the classification has not been human-validated.
6. What data are used in the classification?
The classification process uses publicly available data on the CORDIS platform, including:
- Titles and objectives of research projects;
- Reported summaries of the projects' progress;
- Validated keywords and categories from past classification rounds.
7. Is personal data used in the process?
No personal data is used to train or improve the classification system.
8. Current limitations
- The semi-automatic classification is limited by the scope and granularity of EuroSciVoc.
- The quality of the automatic suggestions varies with the richness of project information available.
- Due to the high volume of projects (more than 400 per month) and the complexity of the related information, not all projects can be quickly validated by humans.
9. Disclaimer - liability aspects
The semi-automatic project classification system is based on NLP and ML. It is automated and partially verified and validated by humans. Although all necessary measures are taken to ensure content quality, accuracy cannot be guaranteed. The classification is provided for informational purposes only and should not be relied upon for specific purposes without verification of accuracy and completeness.
Any liability of the Publications Office of the European Union and of the EU institutions for errors or omissions in the outcome resulting from applying AI tools and techniques of classification on CORDIS is hereby disclaimed. No responsibility can be assumed for any consequence of relying solely upon such AI-generated content. Users are advised to use the content with caution and to exercise due diligence.