SPARQL

A guide to CORDIS Linked Open Data

What is Linked Open Data?

Linked Open Data (LOD) is a combination of Linked Data and Open Data. Linked Data refers to machine-readable data shared on the Web, while Open Data allows for data to be used and distributed freely.

Linked Open Data is a method of accessing the decentralized web in a centralized way. It provides users with the means and services to discover the most relevant and accurate information. By combining the Linked Data design principles with machine-readable structured data, LOD can offer more useful information, interlinked with other data for further discovery.

The FAIR principles (Findable, Accessible, Interoperable, Reusable) and the 5-star deployment scheme for Open Data as described by Tim Berners-Lee ensure that data can be freely shared and distributed on the web.

As part of the Linked Open Data initiative, the Resource Description Framework (RDF) is the primary language and technology for expressing and publishing information about data and for interlinking it on the Web. RDF structures data as subject-predicate-object triples.

EURIO Knowledge Graph

A knowledge graph represents real-world entities (e.g., projects, organisations, project results such as project deliverables) along with their relationships (e.g., an organisation’s participation in a project) and attributes (e.g., the start date of a project or the VAT number of an organisation) as an interconnected network comprising nodes and edges.

Knowledge graphs provide a structured, machine-readable representation of data, promoting integration, linking, and reuse of knowledge. The EURIO Knowledge Graph uses the knowledge graph representation paradigm to transform the CORDIS data into machine-readable interlinked data.

The data is published in the form of Resource Description Framework (RDF) triples, following the Linked Open Data principles. The meaning of the entities described is formally defined by the EUropean Research Information Ontology (EURIO). The resulting EURIO Knowledge Graph is a network of interconnected RDF triples that encode the original CORDIS data, and it can be queried using SPARQL, the standardized language for retrieving and manipulating data in RDF format.

The EURIO ontology

To improve the visibility, reusability, and accessibility of CORDIS content, and boost its semantic interoperability, the Publications Office of the European Commission has developed the EUropean Research Information Ontology (EURIO). EURIO is a conceptual data model that draws on a network of existing ontologies (e.g., schema.org, DINGO, etc.) and reference data (e.g., the EuroSciVoc taxonomy, the NUTS code list, etc.). It provides the means to describe, among other things, administrative information associated with research projects and their grants, such as start and end dates, total cost and funding received, information about the organisations and persons involved, as well as the produced project results, such as the list of authors, title and journal information of a publication.

EURIO uses the OWL 2 Web Ontology Language to formally define the meaning of the domain terms used to describe the CORDIS entities (e.g., projects, organisations, etc.), their attributes (e.g., title, acronym, legal name, etc.) and interrelations (e.g., the relation between a project and the participating organisations, etc.).

The EURIO ontology and its documentation can be accessed on the EU Vocabularies website.

Using SPARQL to query the EURIO Knowledge Graph

SPARQL is a standard query language for retrieving and manipulating data stored in RDF format. Its development and evolution are overseen by the SPARQL Working Group within W3C and it is fully documented and publicly available.

SPARQL queries are based around graph pattern matching, i.e., the matching of sets of triple patterns forming conjunctive (AND) or disjunctive (OR) conditions. Triple patterns are like RDF triples except that each of the subject, predicate and object may be a variable. A given SPARQL query graph pattern matches a subgraph of the queried RDF data when RDF terms from that subgraph may be substituted for the variables.

For example, the SPARQL query to find the start date of the H2020 project "Knowledge-Based Information Agent with Social Competence and Human Interaction Capabilities" given the EURIO Knowledge Graph would be:

PREFIX eurio:<http://data.europa.eu/s66#>
SELECT DISTINCT ?startDate
WHERE
{
  ?project a eurio:Project.
  ?project eurio:title "Knowledge-Based Information Agent with Social Competence and Human Interaction Capabilities" .
  ?project eurio:startDate ?startDate .
}

Running this query, in our example through the Virtuoso SPARQL interface, returns the start date of the project in question, i.e., 1 March 2015.

The PREFIX keyword assigns a prefix label (i.e., an abbreviation) to the IRI that denotes the namespace of the terms used in the query; in the running example, we used the terms “Project”, “title” and “startDate”, all of which are defined in the EURIO ontology, whose IRI is http://data.europa.eu/s66#.
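
As a minimal sketch of what the prefix label stands for, the same query can be written without any PREFIX declaration by spelling out each term as a full IRI (the keyword a is SPARQL shorthand for rdf:type and needs no prefix):

SELECT DISTINCT ?startDate
WHERE
{
  # Each eurio: term is replaced by the namespace IRI followed by the term name
  ?project a <http://data.europa.eu/s66#Project> .
  ?project <http://data.europa.eu/s66#title> "Knowledge-Based Information Agent with Social Competence and Human Interaction Capabilities" .
  ?project <http://data.europa.eu/s66#startDate> ?startDate .
}

Both forms are equivalent; the prefixed form is simply shorter to read and write.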

The query consists of two parts:

the SELECT clause that identifies the variables to appear in the query results, and which in our running example is the variable (?startDate) that stands for the requested start date value
the WHERE clause that provides the graph pattern to match against the EURIO Knowledge Graph, and which in our running example consists of three conjunctive triple patterns, i.e., three patterns that must all be matched, namely:
- a triple pattern stating that the variable (?project), which stands for the referenced project, is an instance of the class eurio:Project
- a triple pattern that gives the title of the referenced project
- a triple pattern with the (?startDate) variable in the object position.

In addition to triple patterns, SPARQL provides several operators and constructs that make it possible, among other things, to express optional patterns, filter the matched triple patterns against some condition, and aggregate or order the retrieved results.
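
As an illustration of an optional pattern, the following sketch lists projects with their titles and, where one is recorded, their acronym. The property name eurio:acronym is an assumption made for this example, based on the acronym attribute mentioned earlier; the EURIO documentation should be checked for the exact term.

PREFIX eurio:<http://data.europa.eu/s66#>
SELECT ?project ?title ?acronym
WHERE
{
  ?project a eurio:Project .
  ?project eurio:title ?title .
  # eurio:acronym is assumed here; projects without an acronym are still returned
  OPTIONAL { ?project eurio:acronym ?acronym . }
}
LIMIT 10

Projects that have no acronym still appear in the results, with the ?acronym variable left unbound.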

Consider another simple query, where this time we want to retrieve all projects contained in the EURIO Knowledge Graph along with their titles. Using the SELECT clause, we express this as “SELECT ?project ?title”, and we use the WHERE clause to specify the conditions that need to be met, namely that the variable (?project) used to denote the requested project entities should belong to the class eurio:Project and that the variable (?title) should denote the title value of these project entities. The resulting query would be:

PREFIX eurio:<http://data.europa.eu/s66#>
SELECT ?project ?title
WHERE
{
  ?project a eurio:Project.
  ?project eurio:title ?title. 
}
ORDER BY ?title
LIMIT 100

The use of the ORDER BY clause enables us to additionally order the retrieved results based on the alphabetical ordering of their titles.

Ascending order can be indicated using the ASC() modifier or by using no modifier, while descending order can be indicated using the DESC() modifier.

The example query also showcases the LIMIT clause, which sets an upper limit on the number of returned results; in this case, at most 100 pairs of projects and their respective titles will be shown.
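
As a sketch of the descending case, the following variant of the query lists the most recently started projects first by applying DESC() to the start date, and uses the title as a secondary ordering key for projects that share a start date:

PREFIX eurio:<http://data.europa.eu/s66#>
SELECT ?project ?title ?startDate
WHERE
{
  ?project a eurio:Project .
  ?project eurio:title ?title .
  ?project eurio:startDate ?startDate .
}
# Latest start dates first; ties are ordered alphabetically by title
ORDER BY DESC(?startDate) ?title
LIMIT 100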

To demonstrate the use of another common operator, namely FILTER, let’s continue the example, assuming that we are interested in retrieving only those projects whose start date is between 2021 and 2023 along with their respective titles. The SELECT clause still requests projects and their titles, with the (?startDate) variable added so that the matched dates are visible, and the WHERE clause is extended with a further triple pattern and a FILTER that reflect the condition on the start date. The extended query would be as follows.

PREFIX eurio:<http://data.europa.eu/s66#>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
SELECT ?project ?title ?startDate
WHERE
{
  ?project a eurio:Project.
  ?project eurio:title ?title. 
  ?project eurio:startDate ?startDate .
  FILTER ((?startDate >= "2021-01-01"^^xsd:date) && (?startDate<="2023-12-31"^^xsd:date))
}

As illustrated, the FILTER operator expresses the condition that the (?startDate) variable needs to satisfy, namely that it falls between 1 January 2021 and 31 December 2023, with both limit values included.
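
Filters combine naturally with the aggregates mentioned earlier. As a rough sketch, the following variant counts the projects that satisfy the same date condition instead of listing them; the variable name ?projectCount is chosen for this example only.

PREFIX eurio:<http://data.europa.eu/s66#>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
# COUNT(DISTINCT ...) counts each matching project once
SELECT (COUNT(DISTINCT ?project) AS ?projectCount)
WHERE
{
  ?project a eurio:Project .
  ?project eurio:startDate ?startDate .
  FILTER ((?startDate >= "2021-01-01"^^xsd:date) && (?startDate <= "2023-12-31"^^xsd:date))
}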

The SELECT queries described above are one of the four query forms defined by SPARQL. These forms use the solutions from pattern matching to form result sets or RDF graphs. They are:

SELECT, which, as presented above, returns all, or a subset of, the variables bound in a query pattern match.
CONSTRUCT, which returns an RDF graph constructed by substituting variables in a set of triple templates.
ASK, which returns a boolean indicating whether a query pattern matches or not.
DESCRIBE, which returns an RDF graph that describes the resources found.
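
As a brief sketch of two of the other forms, the first query below uses ASK to check whether the project from the first example exists in the queried data; it should return true whenever the corresponding SELECT query returns a result.

PREFIX eurio:<http://data.europa.eu/s66#>
ASK
WHERE
{
  ?project a eurio:Project .
  ?project eurio:title "Knowledge-Based Information Agent with Social Competence and Human Interaction Capabilities" .
}

The second query uses CONSTRUCT to build a small RDF graph in which each project title is re-expressed with the standard rdfs:label property. The choice of rdfs:label as the target property is an assumption made purely for illustration.

PREFIX eurio:<http://data.europa.eu/s66#>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
# The template below is instantiated once per solution of the WHERE clause
CONSTRUCT { ?project rdfs:label ?title . }
WHERE
{
  ?project a eurio:Project .
  ?project eurio:title ?title .
}
LIMIT 10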

For further information on the use of the different query forms, as well as a comprehensive description of the overall features of the SPARQL query language, the official SPARQL 1.1 Query Language documentation should be consulted.

How to trigger federated queries?

In all the examples presented above, the queries were executed over the data contained in the EURIO Knowledge Graph.

However, as a growing number of data providers publish their data as Linked Open Data and expose SPARQL query services (SPARQL endpoints), it becomes possible to query these distributed LOD datasets jointly.

To allow this, SPARQL 1.1 provides the SERVICE keyword as part of its Federated Query extension. It directs a portion of a query to a particular SPARQL endpoint and combines the returned results with the results of the rest of the query.

The following query shows an example of the SPARQL federated query syntax, where we are looking for the Human Development Index (HDI) of the country in which the organisation “UNIVERSIDAD POMPEU FABRA”, one of the organisations participating in the EU-funded projects contained in the EURIO KG, is located. To get the HDI information, we need to jointly query the EURIO KG, which contains information about the country in which the given organisation is located, and the external KG of DBpedia, which contains, among other things, information regarding a country’s HDI.

PREFIX eurio:<http://data.europa.eu/s66#> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX owl: <http://www.w3.org/2002/07/owl#> 
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?location ?name ?hdi
WHERE {  
   ?org a eurio:Organisation . 
   ?org eurio:legalName "UNIVERSIDAD POMPEU FABRA" . 
   ?org eurio:hasSite ?site .
   ?site eurio:hasGeographicalLocation ?location .
   ?location a eurio:Country .
   ?location eurio:name ?name . 
   SERVICE <http://dbpedia.org/sparql> {
        ?dbpedia_country a dbo:Country .
        ?dbpedia_country dbp:commonName ?dbname .
        ?dbpedia_country dbp:hdi ?hdi .
        FILTER (lang(?dbname) = "en")
   }
   # The name comparison sits outside SERVICE, where ?name is in scope
   FILTER (STR(?name) = STR(?dbname))
}

As illustrated, a federated query places a SERVICE clause inside the WHERE clause, followed by the IRI of the external endpoint (here, http://dbpedia.org/sparql) and the triple patterns to be evaluated there. In this example, the patterns look for a country in the DBpedia KG (the ?dbpedia_country variable) whose English name (?dbname) is the same as the name (?name) of the country of the “UNIVERSIDAD POMPEU FABRA” organisation in the EURIO KG, and retrieve its HDI value through the respective DBpedia property (i.e., dbp:hdi).

Federated queries must be used with caution to avoid excessive requests to remote SPARQL endpoints as well as inefficient query patterns, as both can severely impact query execution time, often leading to query timeouts and the inability to retrieve any result at all. Given this, and since no guarantee can be provided on the stability, availability, and performance of external SPARQL endpoints, it is highly recommended to opt instead for local data dumps of the KGs (or their sub-graphs) of interest and to rely on local federated query deployments.
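
When a remote endpoint has to be used nonetheless, one possible precaution is the SILENT keyword of SPARQL 1.1 Federated Query, sketched below on the DBpedia part of the example. With SILENT, a failure of the remote endpoint does not make the whole query fail; the SERVICE clause is instead treated as having returned a single solution with no bindings. The LIMIT value is an arbitrary choice for this illustration.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?dbname ?hdi
WHERE {
   # SILENT: errors raised by the remote endpoint are ignored
   SERVICE SILENT <http://dbpedia.org/sparql> {
        ?dbpedia_country a dbo:Country .
        ?dbpedia_country dbp:commonName ?dbname .
        ?dbpedia_country dbp:hdi ?hdi .
        FILTER (lang(?dbname) = "en")
   }
}
LIMIT 10

SILENT only prevents the query from failing; it does not make the remote data available, so the results may simply be empty when the endpoint cannot be reached.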

For a comprehensive overview of the features and specifications pertinent to SPARQL’s support of federated queries, the SPARQL 1.1 Federated Query documentation should be consulted.

Data dumps

The latest EURIO KG dump can be downloaded from the European data portal, where you can also find sub-graphs of the EURIO KG. Each sub-graph is a self-contained snapshot of the most relevant relations and attributes for one of the main EURIO KG entity types, allowing finer-grained access to the EURIO KG contents. The sub-graphs are published as distinct Named Graphs, i.e., as subsets of the EURIO KG graph, each with its own distinct label.
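
As a sketch, assuming the dump has been loaded into a local triple store that preserves the Named Graphs, the labels of the available graphs can be listed with the GRAPH keyword:

SELECT DISTINCT ?graph
WHERE
{
  # ?graph is bound to the IRI of each Named Graph that contains at least one triple
  GRAPH ?graph { ?s ?p ?o . }
}
LIMIT 50

A specific sub-graph can then be queried by replacing the ?graph variable with the IRI of that graph and placing the triple patterns of interest inside the GRAPH block.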