VisionKG: Towards A Unified Vision Knowledge Graph

Anh Le-Tuan1, Trung-Kien Tran2, Manh Nguyen-Duc1, Jicheng Yuan1, Manfred Hauswirth1,3, and Danh Le-Phuoc1,3

1 Open Distributed Systems, Technical University of Berlin
2 Bosch Center for Artificial Intelligence, Renningen, Germany
3 Fraunhofer Institute for Open Communication Systems, Berlin, Germany

Abstract. Computer Vision (CV) has recently achieved significant improvements thanks to the evolution of deep learning. Along with advanced architectures and optimisations of deep neural networks, the CV data used for (cross-dataset) training, validation, and testing contributes greatly to the performance of CV models. Many CV datasets have been created for different tasks, but they are available in heterogeneous data formats and semantic representations. It is therefore challenging to combine different datasets for training or testing purposes. This paper proposes a unified framework based on Semantic Web technologies that provides a novel way to interlink and integrate labelled data across different data sources. We demonstrate its advantages via various scenarios, with the framework accessible both online and via APIs at https://vision.semkg.org.

Keywords: Semantic Web · Knowledge Graph · Computer Vision Dataset

1 Motivation and Contributions

Image datasets (e.g., ImageNet [6], COCO [8], etc.) contribute greatly to the current success of deep learning in computer vision (CV). The quality of a trained deep neural network (DNN) is influenced not only by the advanced architecture and optimisation of the DNN but also by the annotations and images used for training, validation and testing [12]. The number of labelled datasets has been growing rapidly, and working with different datasets is desirable (e.g., to address the out-of-distribution problem and to increase the robustness of CV models [4, 7]). However, the labels come in heterogeneous formats and are not consistent across datasets. As illustrated in Figure 1, the pedestrian in the KITTI dataset [3] or the man in the Visual Genome dataset [5] is annotated as person in the COCO dataset. It is therefore challenging to combine different datasets for training or testing purposes.

Recently, Semantic Web technologies have offered a flexible and powerful mechanism to integrate data from different sources [1]. However, such technologies have been used only in very limited settings to manage CV datasets. Prominent CV datasets, such as ImageNet or Visual Genome, only use a light form of taxonomy (e.g., WordNet, https://wordnet.princeton.edu/) to label their images. Even when these datasets can be queried, e.g., using SPARQL, the lack of interoperability leads to complex queries that must cover all possible cases to unify the labels across different datasets (the left query in Figure 1). Such shortcomings motivate us to build a unified knowledge graph (KG) to realise the FAIR principles [10] for CV datasets. Our vision is that the ability to interlink labels across label spaces under a shared semantic understanding will not only enable a more convenient way to organise training data (e.g., see the right query in Figure 1) but also a more robust way to analyse and test trained DNNs. Moreover, this KG can pave the way towards interpretability and explainability of the resulting models, e.g., [11, 9].

Fig. 1: An example of two equivalent queries to obtain images that contain Person from the COCO, KITTI, and Visual Genome datasets.
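To make the contrast in Figure 1 concrete, the sketch below issues both query styles with Python and SPARQLWrapper. It is a minimal illustration under assumptions: the endpoint path and the prefixes and predicates (cv:hasAnnotation, cv:label, coco:person, kitti:pedestrian, vg:man) are placeholder names and do not necessarily match the actual VisionKG vocabulary.

```python
# Minimal illustration of the two query styles contrasted in Figure 1.
# Assumptions: the /sparql endpoint path and the cv:, coco:, kitti:, vg:
# namespaces and predicates are illustrative names only.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://vision.semkg.org/sparql"  # assumed endpoint path

# Without interlinked labels: every dataset-specific spelling must be enumerated.
QUERY_PER_DATASET = """
PREFIX cv:    <http://vision.semkg.org/onto#>
PREFIX coco:  <http://vision.semkg.org/coco#>
PREFIX kitti: <http://vision.semkg.org/kitti#>
PREFIX vg:    <http://vision.semkg.org/vg#>
SELECT DISTINCT ?image WHERE {
  ?image cv:hasAnnotation ?ann .
  { ?ann cv:label coco:person }
  UNION { ?ann cv:label kitti:pedestrian }
  UNION { ?ann cv:label vg:man }
}
"""

# With VisionKG: a single pattern over the shared class hierarchy; once the
# subclass taxonomy is materialised in the KG, matching cv:Person directly
# would also be sufficient.
QUERY_UNIFIED = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cv:   <http://vision.semkg.org/onto#>
SELECT DISTINCT ?image WHERE {
  ?image cv:hasAnnotation ?ann .
  ?ann   cv:label         ?label .
  ?label rdfs:subClassOf* cv:Person .
}
"""

client = SPARQLWrapper(ENDPOINT)
client.setQuery(QUERY_UNIFIED)
client.setReturnFormat(JSON)
for row in client.query().convert()["results"]["bindings"]:
    print(row["image"]["value"])
```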
As a step towards the above vision, we propose a unified framework, called VisionKG, that facilitates a novel way to organise CV datasets. Our ongoing implementation of VisionKG employs Semantic Web technologies to interlink labelled data across different data sources. This demo paper shows its advantages via three scenarios: (i) exploring labelled images across datasets, (ii) building training pipelines with mixed datasets, and (iii) validating and testing a trained DNN.

2 Vision Semantic Knowledge Graph

Figure 2 gives an overview of our VisionKG framework and the process of creating and enriching our unified KG for CV datasets. In step ①, we analyse the structure of the collected CV datasets and propose a unified data model to integrate them. Following the FAIR principles, to make the data findable, in step ② we add metadata and semantic annotations to the data, e.g., what it is about and where it comes from. To make the data accessible, we use RDF and provide a query interface with SPARQL in step ④. To enhance interoperability, in step ② we also link the data with WordNet and Wikidata (https://www.wikidata.org/) to reuse the taxonomy of the labels defined by the original sources; and in step ③ we use a reasoner to expand the taxonomy by materialising the labels in each dataset along the ontology hierarchy, e.g., pedestrian or man is SubClassOf person. This makes the two queries in Figure 1 equivalent and thus helps users avoid complex queries such as the one on the left. Additionally, in our data model we reuse existing standardised ontologies whenever possible, which makes the data reusable, i.e., the metadata and data are well described and ready to be used in different settings.

Fig. 2: The overview of VisionKG

Additionally, the VisionKG system includes a front-end web interface ⑦ that allows users to explore the KG, as shown in the first scenario of our demonstration. Furthermore, our framework contains a DNN Training Engine ⑤ and an Evaluator ⑥. The image data, serving as tensor inputs for the Evaluator and the DNN Training Engine, can be stored in a tensor storage (i.e., TensorDB), and the labels can be retrieved with SPARQL, as demonstrated in the second scenario. The trained models are stored in our Model Zoo Directory and are evaluated by the Evaluator. Tutorials for the training pipelines based on [2] are available online at https://github.com/cqels/vision. Most of the data preparation and configuration steps are automated so that Semantic Web developers familiar with SPARQL can easily try out the pipelines.

In the current version (as of August 2021), VisionKG contains 67 million triples covering the Visual Genome, COCO, and KITTI datasets, with a total of 239k images, a million labels (including bounding boxes), and hundreds of object categories. These categories are reused from the original sources and aligned with Wikidata concepts/classes. VisionKG also contains millions of detection results obtained with popular pretrained models such as YOLOv3, YOLOv4, EfficientDet, FRCNN, etc.
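The following sketch illustrates how labels retrieved with SPARQL could feed the DNN Training Engine ⑤ by converting query results into COCO-style annotations that toolboxes such as MMDetection [2] accept. It is not the API provided at https://github.com/cqels/vision; the endpoint path and the predicate names are assumptions for illustration only.

```python
# A minimal sketch (NOT the cqels/vision API): labels for a mixed "person"
# training set are retrieved with SPARQL and written out as a COCO-style
# annotation file. The endpoint path and the predicates (cv:hasAnnotation,
# cv:label, cv:bboxX, ...) are assumed names.
import json
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://vision.semkg.org/sparql"  # assumed endpoint path

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cv:   <http://vision.semkg.org/onto#>
SELECT ?img ?x ?y ?w ?h WHERE {
  ?img  cv:hasAnnotation ?ann .
  ?ann  cv:label ?label ;
        cv:bboxX ?x ; cv:bboxY ?y ; cv:bboxW ?w ; cv:bboxH ?h .
  ?label rdfs:subClassOf* cv:Person .
}
"""

def to_coco(bindings):
    """Group per-box query results into COCO's images/annotations/categories layout."""
    images, annotations, img_ids = [], [], {}
    for i, row in enumerate(bindings):
        uri = row["img"]["value"]
        if uri not in img_ids:
            img_ids[uri] = len(img_ids)
            # Real COCO files also need image width/height; omitted for brevity.
            images.append({"id": img_ids[uri], "file_name": uri})
        x, y, w, h = (float(row[k]["value"]) for k in ("x", "y", "w", "h"))
        annotations.append({"id": i, "image_id": img_ids[uri], "category_id": 1,
                            "bbox": [x, y, w, h], "area": w * h, "iscrowd": 0})
    return {"images": images, "annotations": annotations,
            "categories": [{"id": 1, "name": "person"}]}

if __name__ == "__main__":
    client = SPARQLWrapper(ENDPOINT)
    client.setQuery(QUERY)
    client.setReturnFormat(JSON)
    rows = client.query().convert()["results"]["bindings"]
    with open("mixed_person_train.json", "w") as f:
        json.dump(to_coco(rows), f)
```

The resulting JSON file can then be referenced from an off-the-shelf detector configuration, so merging labels from several datasets requires no change to the training code itself.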
3 Feature Demonstrations

The demonstration session consists of three scenarios, with a demonstration video at https://vision.semkg.org/iswc2021-demo.html. For all scenarios, we provide both Python APIs for developers and a Web interface for end-users, for the training/testing phase and the data exploration phase respectively.

Fig. 3: Screenshots of the demonstration with VisionKG

ⓐ Graph-based Exploration Across Visual Label Spaces: In this scenario, we demonstrate our web-based image explorer, which retrieves images using SPARQL (Figure 3 ⓐ). The demonstration shows that with VisionKG, users can search for images containing different labels, e.g., images that contain a cat and a person, or images that have 10 cars (a hedged query sketch for the latter is given after scenario ⓒ).

ⓑ Building Training Pipelines with Mixed Datasets: In the second part, we demonstrate how to obtain a mixed dataset for training purposes (Figure 3 ⓑ). A user starts a training pipeline by writing a SPARQL query to retrieve the desired images and labels. This includes advanced settings such as merging training data with the same label, e.g., Person, from different datasets (as shown in the right query of Figure 1).

ⓒ Cross-dataset Validation and Testing: The third part demonstrates how to obtain a mixed dataset for validation purposes (Figure 3 ⓒ). Similar to the training scenario, this includes the case where test data with the same labels are combined from different datasets. In advanced settings, one can test different models on specific labels in one specific dataset or across different datasets. This is particularly useful when developers target specific applications. For example, one can test the trained models for detecting Car on images of cars in crowded traffic scenes or in mountain areas.
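As an example of the queries behind scenario ⓐ, the sketch below retrieves images that contain exactly 10 cars using a SPARQL aggregate. As before, the endpoint path and vocabulary are illustrative assumptions rather than the published VisionKG schema.

```python
# Hedged sketch of the counting query from scenario (a): images with exactly
# 10 cars. The endpoint path and the names cv:hasAnnotation, cv:label and
# cv:Car are illustrative assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

QUERY_TEN_CARS = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cv:   <http://vision.semkg.org/onto#>
SELECT ?image (COUNT(?ann) AS ?numCars) WHERE {
  ?image cv:hasAnnotation ?ann .
  ?ann   cv:label         ?label .
  ?label rdfs:subClassOf* cv:Car .
}
GROUP BY ?image
HAVING (COUNT(?ann) = 10)
"""

client = SPARQLWrapper("https://vision.semkg.org/sparql")  # assumed endpoint path
client.setQuery(QUERY_TEN_CARS)
client.setReturnFormat(JSON)
for row in client.query().convert()["results"]["bindings"]:
    print(row["image"]["value"])
```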
4 Next Steps

Our proposed framework opens up various new research avenues. First, we plan to build DNNs with a unified label space powered by VisionKG. Next, we will extend VisionKG to analyse the robustness of DNN models using test samples with semantic similarities obtained via KG embeddings. Such KG embeddings can be combined with the visual features of DNN-based models to investigate the interpretability and explainability of these models, e.g., [9, 11].

Acknowledgments

This work was funded by the German Research Foundation (DFG) under the COSMO project (ref. 453130567), the German Ministry for Education and Research via the Berlin Institute for the Foundations of Learning and Data (BIFOLD, ref. 01IS18025A and ref. 01IS18037A), and the German Academic Exchange Service (DAAD, ref. 57440921).

References

1. Benjelloun, O., Chen, S., Noy, N.F.: Google Dataset Search by the numbers. In: 19th International Semantic Web Conference (ISWC). Springer (2020)
2. Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., et al.: MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
3. Geiger, A., Lenz, P., Stiller, C., Urtasun, R.: Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research (2013)
4. Huang, R., Li, Y.: MOS: Towards scaling out-of-distribution detection for large semantic space. In: CVPR (2021)
5. Krishna, R., et al.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. IJCV (2017)
6. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. NeurIPS (2012)
7. Lambert, J., Liu, Z., Sener, O., Hays, J., Koltun, V.: MSeg: A composite dataset for multi-domain semantic segmentation. In: CVPR (2020)
8. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV (2014)
9. Park, J.S., Bhagavatula, C., Mottaghi, R., Farhadi, A., Choi, Y.: VisualCOMET: Reasoning about the dynamic context of a still image. In: ECCV (2020)
10. Wilkinson, M.D., et al.: The FAIR guiding principles for scientific data management and stewardship. Scientific Data 3(1), 1–9 (2016)
11. Zareian, A., Karaman, S., Chang, S.: Bridging knowledge graphs to generate scene graphs. In: ECCV (2020)
12. Zhu, X., et al.: Do we need more training data? IJCV (2016)