Optimizing Semantic Data Transformation using High Performance Computing Techniques

José Antonio Bernabé-Díaz1, María del Carmen Legaz-García2, José M. García1, and Jesualdo Tomás Fernández-Breis1

1 Faculty of Informatics, University of Murcia, 30100 Murcia, Spain
{joseantonio.bernabe1, jfernand, jmgarcia}@um.es
2 Biomedical Informatics and Bioinformatics Platform, IMIB-Arrixaca, Calle Luis Fontes Pagán, nº 9, 30003 Murcia, Spain
mcarmen.legaz@ffis.es

Abstract. The growth of the Life Science Semantic Web is illustrated by the increasing number of resources available in the Linked Open Data Cloud. Our SWIT tool supports the generation of semantic repositories, and it has been successfully applied in the field of orthology resources, helping to achieve objectives of the Quest for Orthologs consortium. However, our experience with SWIT reveals that the time required for the generation of datasets is longer than desired. In this work we present the application of High Performance Computing techniques, mainly memory optimization and parallelization, to speed up SWIT.

Keywords: Semantic Web, Data transformation, High Performance Computing

1 Optimizing SWIT

The Semantic Web Integration Tool (SWIT) [1] transforms and integrates heterogeneous biomedical data to generate open semantic repositories by defining mapping rules between an input schema and an OWL ontology. The SWIT inputs are:

– Data instances: The tool is able to process XML and relational data.
– OWL ontology: It supplies the domain knowledge for the transformation and the constraints applied for the generation of logically consistent data.
– Transformation rules: They define how the content of the input dataset is transformed into a semantic format: (1) mapping rules provide the links between the entities of the input data schema and the entities of the OWL ontology; and (2) identity rules prevent the creation of redundant entities by defining what makes an individual unique.

The application of SWIT to the transformation of large datasets has revealed that, although the computational complexity increases linearly with the number of entities to be transformed, the tool is slower than expected. We have identified four performance limitations: (1) the use of an interpreted language; (2) inefficient memory management; (3) the execution of identity rules through SPARQL queries, which creates an execution bottleneck; and (4) the sequential execution of the transformation. Hence, a code modernization process was carried out3:

1. The SWIT kernel has been re-implemented in C/C++.

2. We have optimized the way SWIT manages the ontology individuals during the transformation process. We now use two hash maps that store pointers to individuals rather than copies, which preserves coherence and reduces memory consumption. One map guarantees that a search for an existing individual never fails, while the other map may miss in some searches since it acts as a greedy shortcut. Together, the two maps speed up searches by up to 2x (see the first sketch after this list).

3. The optimization of memory management also affects the identification of equivalent individuals through identity rules. Identity rules are defined using AND and OR conditions, and the new method uses one hash map of vectors per condition type, in which pointers to hashed individuals are stored. The hashing is carried out differently depending on the map: for AND conditions, a single hash is computed over all the properties of the entity, whereas for OR conditions each property is hashed separately, so several keys in the map point to the same individual (see the second sketch after this list).

4. SWIT is parallelized with the GNU parallel tool4. The parallel design assigns one input file and one SWIT instance per core, so parallelization is only effective when multiple input files are provided. When transforming a single large file, we therefore split it into several smaller files to enable parallelization (see the example after this list). This step provides the largest speed-up.
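The double-map scheme of item 2 can be sketched as follows. This is only a minimal illustration under assumptions: the paper does not describe the SWIT kernel classes, so the Individual type, the choice of keys (a full canonical identity key for the exact map and a cheaper, possibly ambiguous key for the greedy map) and all names below are hypothetical. The point is that both maps store pointers to the same objects instead of copies, and that the greedy map is probed first.

    #include <memory>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Hypothetical stand-in for an ontology individual; the real SWIT
    // kernel classes are not described in the paper.
    struct Individual {
        std::string canonical_key;  // full identity key (assumption)
        std::string quick_key;      // cheaper, possibly ambiguous key (assumption)
    };

    class IndividualIndex {
    public:
        // Both maps hold raw pointers into pool_, so no individual is copied.
        Individual* find(const Individual& probe) {
            // Greedy map first: a hit avoids the more expensive lookup,
            // but a miss does not prove that the individual is absent.
            auto g = greedy_.find(probe.quick_key);
            if (g != greedy_.end() && g->second->canonical_key == probe.canonical_key)
                return g->second;  // cheap path succeeded
            // Exact map: never misses if the individual was registered.
            auto e = exact_.find(probe.canonical_key);
            if (e != exact_.end()) return e->second;
            return nullptr;  // genuinely new individual
        }

        Individual* insert(Individual ind) {
            pool_.push_back(std::make_unique<Individual>(std::move(ind)));
            Individual* p = pool_.back().get();
            exact_[p->canonical_key] = p;
            greedy_[p->quick_key] = p;  // may overwrite an ambiguous entry
            return p;
        }

    private:
        std::vector<std::unique_ptr<Individual>> pool_;       // single owner
        std::unordered_map<std::string, Individual*> exact_;  // complete index
        std::unordered_map<std::string, Individual*> greedy_; // fast, may miss
    };

Because both maps only hold pointers, any update made to an individual through one map is immediately visible through the other, which is the coherence property mentioned in item 2.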
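The identity-rule indexing of item 3 can be sketched in the same spirit. Again, the data structures below are assumptions rather than SWIT's actual code; what the paper states is that there is one hash map of vectors per condition type, that AND conditions hash all the properties of an entity together, and that OR conditions hash each property separately so that several keys point to the same individual.

    #include <cstddef>
    #include <functional>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Reduced, hypothetical stand-in: only the property values covered by
    // the identity rule are kept here.
    struct Individual {
        std::vector<std::string> rule_properties;
    };

    using Bucket = std::vector<Individual*>;

    // AND rule: one key per individual, computed over all properties together,
    // so two individuals share a bucket only if every hashed property matches.
    std::size_t and_key(const Individual& ind) {
        std::size_t h = 0;
        for (const auto& value : ind.rule_properties)
            h ^= std::hash<std::string>{}(value) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }

    void index_and(std::unordered_map<std::size_t, Bucket>& and_map, Individual* ind) {
        and_map[and_key(*ind)].push_back(ind);
    }

    // OR rule: each property is hashed on its own, so the same individual is
    // reachable from several keys and a match on any one property finds it.
    void index_or(std::unordered_map<std::size_t, Bucket>& or_map, Individual* ind) {
        for (const auto& value : ind->rule_properties)
            or_map[std::hash<std::string>{}(value)].push_back(ind);
    }

In either case the buckets only narrow the search down to a few candidates; a final property-by-property comparison still decides whether an incoming entity is the same individual.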
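Item 4 can be illustrated with a hedged command-line example. GNU parallel runs one job per core by default; the "swit" executable name, its options and the line-oriented split are hypothetical placeholders (in practice an XML input has to be cut into well-formed fragments, not by line count).

    # Split one large input into smaller chunks (illustrative only; the
    # "swit" binary and its flags are hypothetical placeholders).
    split -l 100000 -d input_records.txt chunk_

    # One SWIT instance per chunk; GNU parallel starts one job per core by
    # default (use -j N to set the number of jobs explicitly).
    parallel ./swit --input {} --output {}.owl ::: chunk_*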
2 Results

Our tests show speed-ups of 1000x, 4100x and 7800x for the InParanoid [2] datasets E. coli, H. arabidopsidis and H. sapiens, respectively. The executions were run on a high-performance server with two Intel Xeon E5-2698 v4 processors of 20 cores each (2 threads per core with hyper-threading), for a total of 40 physical cores or 80 virtual cores, running at 2.2 GHz with 128 GB of DDR4 RAM.

References

1. Legaz-García, M.D.C., Miñarro-Giménez, J.A., Tortosa, M.M., Fernández-Breis, J.T.: Generation of open biomedical datasets through ontology-driven transformation and integration processes. J. Biomedical Semantics 7 (2016) 32
2. O'Brien, K.P., Remm, M., Sonnhammer, E.L.: InParanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Research 33(suppl 1) (2005) D476–D480

3 https://software.intel.com/en-us/articles/what-is-code-modernization
4 http://www.gnu.org/s/parallel