Fundamental concepts

In this section we will present some of the main concepts behind SimPhoNy.

General notions

Degrees of interoperability

There is a multitude of tools and programs out there, all with their own formats and protocols.

Every time a user wants to use one of these tools, they must familiarise themselves with the software. Furthermore, if they want to integrate multiple tools in one workflow, they must, in most cases, take care of the conversion between formats on their own.

Based on how tools communicate with other tools, we can define three levels:

Compatibility

rectangle A
rectangle B
rectangle C
rectangle D

A <-> B
A <-[hidden]- C
C <-> D

Compatibility

When we say two tools are compatible, we mean that they are able to communicate with each other on a one-to-one basis. This means the tools must either use the same format, or be able to convert to the format of the other.

If we compare this to spoken languages, you could say that A and B speak one language, and C and D speak another. However, A has no way to talk with C or D, for example.

De Facto Standard

rectangle A
rectangle B
rectangle C
rectangle D

A <--> B
A <-> C
C <-[hidden]- D
B <-[hidden]- D
A <-> D

De Facto Standard

In this case, the level of operability is higher. All tools know how to communicate with a tool whose format has become a de facto standard.

To continue with our language simile, A would be a translator that speaks the languages of B, C and D. If B wants to talk to C, they must first relay the message to A, and A will convert it to a format that C understands.

Interoperability

usecase x as "open standard"
rectangle A
rectangle B
rectangle C
rectangle D
A <-down-> x
B <-right-> x
C <-left-> x
D <-up-> x

Interoperability

The highest level of operability is interoperability. Here there is no need for all tools to go through the de facto standard, because there is a format that is known by all of them and enables all components to communicate among themselves.

This final stage could be compared to all parties using an instant translator that can convert text from one language into any other.

Interoperability between software tools is one of the most important objectives of the SimPhoNy framework.
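The idea of a shared format can be illustrated with a minimal sketch. Everything in it is hypothetical: the two tool formats, the function names and the use of a plain dictionary as the stand-in "open standard" are assumptions made for illustration, not part of SimPhoNy.

```python
import json

# Hypothetical tools: each one only knows how to translate between its own
# native format and a shared dictionary layout (the stand-in "open standard").
def a_to_common(text):
    # Tool A stores "key=value" pairs separated by semicolons.
    return dict(pair.split("=", 1) for pair in text.split(";") if pair)

def common_to_a(data):
    return ";".join(f"{key}={value}" for key, value in data.items())

def b_to_common(text):
    # Tool B stores a JSON object; values are normalised to strings.
    return {key: str(value) for key, value in json.loads(text).items()}

def common_to_b(data):
    return json.dumps(data, sort_keys=True)

READERS = {"A": a_to_common, "B": b_to_common}
WRITERS = {"A": common_to_a, "B": common_to_b}

def convert(payload, source, target):
    """Translate payload between two tools by going through the shared format."""
    return WRITERS[target](READERS[source](payload))

converted = convert("steps=100;dt=0.5", "A", "B")
print(converted)
```

With this layout, adding a new tool means writing one reader and one writer, so n tools need 2n translators. Direct pairwise conversion between all tools would need n(n-1) one-way converters.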

Semantic vs. syntactic

We can interpret a word as a specific sequence of characters without caring about the meaning itself. This way, a simulation engine parsing an input file will know that the integer written after the keyword step will be used to set the number of iterations the execution loop will run. The engine attaches no further meaning to the keyword, which could just as easily be the sequence ppp.

However, for a person, the word step will be a sign representing a specific concept. It could be the number of rounds in a simulation, but also the consecutive instructions in an algorithm, the individual steps of a staircase or the motion a person makes when walking. Based on the domain, a person can also list other relevant concepts and relationships (e.g. when thinking of a staircase, the material or the width).

Being able to know the semantic meaning of an instance, and hence its connection to other concepts, is one of the principles of SimPhoNy. To achieve this goal, ontologies play a major role.
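The purely syntactic view described above can be sketched with a toy parser. The function name and the input layout are assumptions made for illustration; no real simulation engine is being reproduced here.

```python
def read_iterations(text, keyword="step"):
    """Return the integer that follows `keyword` in a whitespace-separated input."""
    tokens = text.split()
    # The last token cannot be a keyword, because nothing follows it.
    for position, token in enumerate(tokens[:-1]):
        if token == keyword:
            return int(tokens[position + 1])
    raise ValueError(f"keyword {keyword!r} not found")

# The parser only compares character sequences, so any keyword works equally well.
iterations = read_iterations("step 100")
same_result = read_iterations("ppp 100", keyword="ppp")
print(iterations, same_result)
```

Nothing in the parser knows what a step is. That knowledge is exactly what a semantic description adds.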

Ontology

Important

An ontology is a formal specification of a shared conceptualization [Borst, 1997].

Let’s look at the individual components of this definition, starting from the end.

  • Conceptualization: an ontology works on the ideas and relationships in an area of interest.

  • Shared: the ideas and concepts are perceived and agreed upon by multiple people.

  • Specification: it defines and describes them in detail, following some predetermined rules and format.

  • Formal: it follows a machine-readable syntax.

Put more simply, an ontology can be seen as the definition of the concepts relevant to a given domain, as well as the relationships between them, in a way that a machine can interpret.

For a deeper, more detailed analysis of the definition, refer to [Guarino, 2009].

Ontologies are more elaborate than taxonomies in that they can include multiple kinds of relationships (not just parent-child) between complex concepts in big domains.
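The difference between a taxonomy and an ontology can be sketched with plain Python data structures. The toy food domain, the concept names and the relationship names below are invented for illustration and do not come from any real ontology.

```python
# A taxonomy only records parent-child links (child -> parent).
taxonomy = {"Pizza": "Food", "Pasta": "Food", "TomatoSauce": "Food"}

# An ontology-like structure can hold several kinds of relationships,
# stored here as (subject, relationship, object) triples.
ontology = {
    ("Pizza", "isA", "Food"),
    ("Pasta", "isA", "Food"),
    ("TomatoSauce", "isA", "Food"),
    ("Pizza", "hasPart", "TomatoSauce"),
    ("Pizza", "originatesFrom", "Italy"),
}

def related(concept, relationship):
    """Return everything linked to `concept` through `relationship`."""
    return sorted(o for s, p, o in ontology if s == concept and p == relationship)

relationship_kinds = sorted({p for _, p, _ in ontology})
print(relationship_kinds)
print(related("Pizza", "hasPart"))
```

A real ontology also carries formal semantics (for example, constraints on which classes a relationship may connect), which this sketch leaves out.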

EMMO

The European Materials Modelling Ontology (EMMO) is an ontology developed by the European Materials Modelling Council (EMMC). EMMO’s goal is to define a universal representational system for scientists in the field of materials modelling, in order to enable interoperability.

It has been designed from the bottom up, starting with the concepts of different domains and application fields and generalising into middle and top-level layers. It is currently being further developed in multiple projects of the European Union.

SimPhoNy is being developed with the intention of being compatible with EMMO, and an easy installation of the ontology is available (further explained here).

There is also documentation available for developing an EMMO compliant ontology (requires login).

CUDS

CUDS, or Common Universal Data Structure, is the ontology-compliant data format of OSP-core:

  • CUDS is an ontology individual: each CUDS object is an instantiation of a class in the ontology. If we assume a food ontology that describes classes like pizza or pasta, a CUDS object could represent one specific pizza or pasta dish that exists in the real world. Similar to ontology individuals, CUDS objects can be related to other individuals/CUDS by relations defined in the ontology, for example a pizza that ‘hasPart’ tomato sauce.

  • CUDS is an API: To allow users to interact with the ontology individuals and their data, CUDS provides a CRUD (create, read, update, delete) API.

  • CUDS is a container: Depending on the relationship connecting two CUDS objects, a certain instance can be seen as a container of other instances. We call a relationship that expresses containment an ‘active relationship’. In the pizza example, ‘hasPart’ would be an ‘active relationship’. Anyone sharing the pizza CUDS object with others would also want to share the tomato sauce.

  • CUDS is RDF: Internally a CUDS object is only an interface to an RDF-based triple store that contains the data of all CUDS objects.

  • CUDS is a node in a graph: CUDS being individuals in an RDF graph implies that each CUDS object can also be seen as a node in a graph. This does not conflict with the container perspective; instead, we see them as two different views on the data.
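The perspectives listed above can be illustrated with a toy model. This is a sketch under stated assumptions and not the OSP-core implementation: the class ToyCuds, its method names, the in-memory set used as a triple store and the relationship names are all hypothetical. The update operation of a CRUD API is omitted for brevity.

```python
import uuid

TRIPLES = set()                      # shared store of (subject, predicate, object)
ACTIVE_RELATIONSHIPS = {"hasPart"}   # relationships that express containment

class ToyCuds:
    """A thin view on the shared triple store; it holds no data of its own."""

    def __init__(self, oclass):
        self.uid = str(uuid.uuid4())
        TRIPLES.add((self.uid, "type", oclass))   # individual of an ontology class

    @property
    def oclass(self):
        return next(o for s, p, o in TRIPLES if s == self.uid and p == "type")

    def add(self, other, rel):       # create a relationship
        TRIPLES.add((self.uid, rel, other.uid))

    def get(self, rel):              # read related uids
        return sorted(o for s, p, o in TRIPLES if s == self.uid and p == rel)

    def remove(self, other, rel):    # delete a relationship
        TRIPLES.discard((self.uid, rel, other.uid))

    def contained(self):
        """Uids reachable through active relationships (the container view)."""
        found, stack = set(), [self.uid]
        while stack:
            current = stack.pop()
            for s, p, o in TRIPLES:
                if s == current and p in ACTIVE_RELATIONSHIPS and o not in found:
                    found.add(o)
                    stack.append(o)
        return found

pizza = ToyCuds("Pizza")
sauce = ToyCuds("TomatoSauce")
chef = ToyCuds("Person")
pizza.add(sauce, rel="hasPart")   # active relationship: the sauce travels with the pizza
pizza.add(chef, rel="madeBy")     # not active: the chef is only a neighbouring node
print(pizza.contained() == {sauce.uid})
```

Sharing the pizza in this model would mean sharing everything returned by contained(), while the full set of triples still describes the graph view.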

Technologies and frameworks

RDF

RDF (Resource Description Framework) is a formal language for describing structured information used in the Semantic Web. Its first specification was published in 1999 and extended in 2004.

Knowledge is represented in directed graphs where the nodes are either ontological classes, instances of those classes or literals, and the edges are the relationships connecting them.

The graph is serialised as triples of the form “subject-predicate-object”:

  • Subject: The IRI of the entity the triple refers to. Blank nodes have no IRI, but they are outside the scope of this document.

  • Predicate: IRI of the relationship from subject to object.

  • Object: Literal or IRI of an entity

The following is an example of an RDF triple. This example will also be used to show the different serialisation formats of RDF. For the IRIs, DBpedia’s namespaces were used.

(dbr:J._R._R._Tolkien) as tolkien
(dbr:The_Lord_of_the_Rings) as lotr
lotr -> tolkien : dbo:author

RDF triple sample
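The sample triple above can be modelled in plain Python as a minimal sketch. The Triple class and the constant names are assumptions made for illustration; a real application would normally rely on an RDF library instead.

```python
from typing import NamedTuple

DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

class Triple(NamedTuple):
    subject: str    # IRI of the entity being described
    predicate: str  # IRI of the relationship
    obj: str        # IRI of another entity, or a literal value

# A graph is a set of triples; here it holds only the sample triple.
graph = {
    Triple(DBR + "The_Lord_of_the_Rings", DBO + "author", DBR + "J._R._R._Tolkien"),
}

# Each triple is one directed edge: subject --predicate--> object.
edges = [(t.subject, t.obj) for t in graph]
print(edges)
```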

The most used formats for storing RDF data are:

  • XML: Historically the most common format given the amount of libraries for handling it. It was released hand in hand with the RDF specification. Unfortunately, XML is best used with tree-like structures rather than graphs, which also makes it harder for humans to read.

    The example triple in XML is:

      <?xml version="1.0" encoding="utf-8"?>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
              xmlns:dbo="http://dbpedia.org/ontology/">
          <rdf:Description rdf:about="http://dbpedia.org/resource/The_Lord_of_the_Rings">
              <dbo:author rdf:resource="http://dbpedia.org/resource/J._R._R._Tolkien"/>
          </rdf:Description>
      </rdf:RDF>
    
  • N3: Notation3 is designed with human readability as a motivator. The RDF triples are written one per line, with the possibility to define common prefixes and other directives for simplicity.

    The previous example in N3 would be:

        @prefix dbo: <http://dbpedia.org/ontology/> .
        @prefix dbr: <http://dbpedia.org/resource/> .
        dbr:The_Lord_of_the_Rings  dbo:author  dbr:J._R._R._Tolkien .
    
  • Turtle: Based on N3, it strips some of its syntax, making it easier to parse for machines. The recurring example would be exactly the same in Turtle as in N3.

  • N-Triples: N-Triples are even simpler, without any of the syntactic sugar from N3 or Turtle. The triples are written one per line without prefixes. This makes the format very easy to parse, but hard for a human to read and maintain.

    The following representation should be on one line (it has been split for readability):

      <http://dbpedia.org/resource/The_Lord_of_the_Rings>
        <http://dbpedia.org/ontology/author>
        <http://dbpedia.org/resource/J._R._R._Tolkien> .
    
  • JSON-LD: uses JSON, the widely adopted web data format, for serialising RDF triples. JSON is easier for humans to read than XML and has standard libraries in practically all programming languages.

    The example in JSON-LD is:

      {"@id": "http://dbpedia.org/resource/The_Lord_of_the_Rings",
        "http://dbpedia.org/ontology/author":
          [{"@id": "http://dbpedia.org/resource/J._R._R._Tolkien"}]
        }
    

SimPhoNy supports all the previous formats (plus a simpler custom YAML) as inputs in the ontology installation.
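To show how the prefixed notation of N3 and Turtle relates to the prefix-free N-Triples form, here is a minimal sketch. The helper names expand and to_ntriples are hypothetical, and the code only handles triples made of IRIs (no literals, blank nodes or character escaping), so it is not a complete serialiser.

```python
PREFIXES = {
    "dbo": "http://dbpedia.org/ontology/",
    "dbr": "http://dbpedia.org/resource/",
}

def expand(name):
    """Turn a prefixed name such as dbo:author into a full IRI."""
    prefix, local = name.split(":", 1)
    return PREFIXES[prefix] + local

def to_ntriples(triples):
    """Serialise IRI-only triples as N-Triples, one statement per line."""
    return "\n".join(
        f"<{expand(s)}> <{expand(p)}> <{expand(o)}> ." for s, p, o in triples
    )

line = to_ntriples(
    [("dbr:The_Lord_of_the_Rings", "dbo:author", "dbr:J._R._R._Tolkien")]
)
print(line)
```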

SPARQL

SPARQL (a recursive acronym for SPARQL Protocol and RDF Query Language) is the most common query language for RDF. Queries are graph patterns (similar to the triples of Turtle) with variables for the parts of the pattern that make up the result.

Variables start with the character ? and stand for concrete values that will be matched in the query process. They can appear in multiple locations in the patterns, and those present in the SELECT clause will be returned as the query result.

The query for the author of The Lord of the Rings from our sample triples in SPARQL is:

  PREFIX dbo: <http://dbpedia.org/ontology/>
  PREFIX dbr: <http://dbpedia.org/resource/>
  SELECT ?person WHERE {
      dbr:The_Lord_of_the_Rings  dbo:author  ?person .
  }

The SPARQL query language offers multiple types of result sets and clauses, most of which are beyond the scope of this document. One that should be mentioned is the FILTER keyword, which restricts the results to those for which the expression inside the parentheses evaluates to true. For instance (omitting the prefix declarations for simplicity):

  SELECT ?character WHERE {
      ?character dbp:affiliation dbr:The_Lord_of_the_Rings .
      ?character dbo:age ?age .
      FILTER(?age >= 100)
  } 

The previous query would return the characters from the book series whose age is greater than or equal to 100. (Note that while the query is correct, the result is empty, as this information is not stored on DBpedia.)
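How pattern matching and FILTER work can be sketched with a toy evaluator in plain Python. This is not a SPARQL engine: the functions match and select are hypothetical, the data is invented for illustration (it is not taken from DBpedia), and only basic graph patterns with a single filter condition are supported.

```python
# Toy data invented for illustration; ages and character IRIs are made up.
TRIPLES = {
    ("dbr:The_Lord_of_the_Rings", "dbo:author", "dbr:J._R._R._Tolkien"),
    ("ex:CharacterA", "dbp:affiliation", "dbr:The_Lord_of_the_Rings"),
    ("ex:CharacterA", "dbo:age", 150),
    ("ex:CharacterB", "dbp:affiliation", "dbr:The_Lord_of_the_Rings"),
    ("ex:CharacterB", "dbo:age", 33),
}

def match(pattern, triple, bindings):
    """Extend bindings so that pattern equals triple, or return None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if isinstance(term, str) and term.startswith("?"):
            # A variable must keep the same value everywhere it appears.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result

def select(variable, patterns, condition=lambda row: True):
    """Evaluate the patterns one after another, then apply a FILTER-like condition."""
    rows = [{}]
    for pattern in patterns:
        rows = [
            extended
            for row in rows
            for triple in TRIPLES
            if (extended := match(pattern, triple, row)) is not None
        ]
    return sorted(row[variable] for row in rows if condition(row))

authors = select("?person", [("dbr:The_Lord_of_the_Rings", "dbo:author", "?person")])
characters = select(
    "?character",
    [
        ("?character", "dbp:affiliation", "dbr:The_Lord_of_the_Rings"),
        ("?character", "dbo:age", "?age"),
    ],
    condition=lambda row: row["?age"] >= 100,
)
print(authors, characters)
```

Because ?character appears in both patterns, only bindings that satisfy both triples survive, which is how a shared variable joins the patterns.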

For a very interesting and comprehensive introduction to RDF and SPARQL, see [Hitzler, 2009].