Towards Efficient Annotation Databases

René Heinzl¹, Markus Nissl² and Emanuel Sallinger²,³
¹ Building Digital Solutions 421 GmbH, Vienna, Austria
² TU Wien, Vienna, Austria
³ University of Oxford, Oxford, United Kingdom

15th Alberto Mendelzon International Workshop on Foundations of Data Management, May 22–26, 2023, Santiago, Chile.
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

Abstract
Recent advances in machine learning have increased organizations' demand for efficient annotation data management for machine learning applications. In this paper, we address this challenge through an industrial collaboration centered around the unification of data for training and prediction workflows, enabling fast analytical processing through summarization. Beyond this specific solution, we provide a very concrete real-world scenario and solution to the data management community as inspiration for further theoretical and practical research. Finally, we report on the open scientific challenges that remain in this field.

1. Introduction

Answering the call specifically pushing for “papers in real-world contexts”, we present a paper on a real-world application in the area of waste separation, that is, in the context of the pressing societal issues of the circular economy and meeting the UN Sustainable Development Goals (SDGs). It presents ongoing research based on an award-winning, in-production, large-scale deployment in multiple countries.

Context. The core of this paper is annotation data management for machine learning, a critical part of data management for machine learning [1]. Industrial implementations such as Amazon SageMaker [2], the VGG Image Annotator [3] or Anafora [4] exist, but they stop at the level of annotating data for training purposes or at managing the training process itself. They offer only limited metadata management and lack support for real-time data management and analytical querying. Yet, in the data management community, we have ample studies on metadata management [5, 6, 7, 8, 9] and annotation databases [10, 11, 12] – though in quite different contexts than what is required for annotation data management in machine learning.

In this paper, we describe the concrete solution to this problem that we developed for this widely deployed real-world application. Our solution is centered around the unification of data for training and prediction workflows, enabling fast analytical processing through summarization. This is especially important when real-time data is used in reporting systems and automated machine learning processes. Beyond this specific solution, the most important aspect of this paper is giving a very concrete real-world scenario and solution to the data management community as inspiration for further theoretical and practical research.

Application. In the following, we provide the core use cases of our business partner for the domain of interest, demonstrating the need for advanced annotation and metadata storage for machine learning processes.

Use Case (Object storage). The waste separation business is interested in detecting impurities in plastic waste such as batteries, metals or cardboard. The company has the requirement that (i) each image should be stored as a possible candidate for training for at least one year for subsequent analysis requests, (ii) each detected label for each version of a machine learning model applied to an image should be stored for (real-time) analytical purposes, and (iii) for statistical evidence of correct labeling, the created labels are stored per user.

Storing the image data alone, with an average of 100 GB (or 20,000 Full-HD images) per device, already generates a large amount of data for this use case. While the image data is typically stored in an object storage, the metadata for each device still exceeds 7 million entries per year, without considering details such as the number of labels per model or user.
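To put these figures into perspective, the following back-of-envelope calculation is a minimal Python sketch; the capture interval and per-image size are our own assumptions derived from the numbers above and from the capture rate described in Section 3, not values reported by the company.

```python
# Back-of-envelope estimate of the per-device data volume (illustrative assumptions only).

SECONDS_PER_DAY = 24 * 60 * 60
CAPTURE_INTERVAL_S = 4.5      # assumed: one image captured every few seconds (see Section 3)
IMAGE_SIZE_MB = 5.0           # assumed: ~5 MB per Full-HD image (100 GB / 20,000 images)

images_per_day = SECONDS_PER_DAY / CAPTURE_INTERVAL_S         # ~19,200 images
raw_data_per_day_gb = images_per_day * IMAGE_SIZE_MB / 1024   # ~94 GB of raw image data
metadata_entries_per_year = images_per_day * 365              # ~7 million entries

print(f"images per device and day:            {images_per_day:,.0f}")
print(f"raw image data per device and day:    {raw_data_per_day_gb:,.0f} GB")
print(f"metadata entries per device and year: {metadata_entries_per_year:,.0f}")
```

Under these assumptions, one device produces roughly 20,000 images (about 100 GB) per day and more than 7 million metadata entries per year, consistent with the figures above – and this does not yet include the per-label and per-user records from requirements (ii) and (iii).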
Moreover, additional metadata is usually stored, as demonstrated by the following use case:

Use Case (Metadata). Next to annotation data for training and analytical purposes, the company is interested in storing information regarding the device, such as the model number, camera metadata or location data. This allows the company, among other things, to correlate specific waste information with trucks and household areas for optimization purposes.

There exist different approaches to storing such metadata in database systems. Typically, annotation databases are built on top of relational databases or NoSQL stores, keeping annotations either in separate annotation tables, in additional fields of the document table, or as binary data such as serialized JSON or XML. In some cases, annotations are stored in object stores, with a reference to their location kept in the database. While the first two methods allow more efficient analytical queries, the latter two methods can handle more complex annotation scenarios, such as frame-series annotations, where several thousand records ranging from several MB to several hundred MB are required at once [13]. To the best of our knowledge, systems and theory that support both scenarios do not exist.

Contribution. In this paper, we address this challenge by reporting on
• a real-world contemporary use case in the context of the pressing societal issues of the circular economy;
• ongoing work on an efficient annotation data storage solution that leverages both database systems and object storage;
• key requirements that an annotation database for machine learning purposes has to fulfill.

Outline. In the remainder of this paper, we first discuss the requirements, then present the solution, and finally conclude by discussing open challenges.

2. Requirements

In this section, we establish several key requirements that an annotation database¹ has to fulfill in order to manage annotation data effectively. Our requirements are based on our use cases from the waste separation company, extended with knowledge from different scenarios established over several years of hands-on experience in the field of machine learning.

¹ Note that we concentrate here on the annotation database, not on the annotation management system, which also includes additional functionality such as user management, visualization tools and an advanced user interface.

The requirements are:
• Integration with machine learning workflows. An annotation database should integrate seamlessly with machine learning workflows, allowing the use of annotation data in the training and evaluation of machine learning models.
• Support for search and analysis. An annotation database should store data in an optimized format that can be efficiently queried and analyzed in real time. One should be able to navigate through the data, extract insights and trends, and find specific annotations.
• Performance and scalability. An annotation database should be able to handle large volumes of data and support high levels of concurrent access.
• Flexibility and extensibility. An annotation database should support a wide range of annotation types and workflows, as well as custom annotation types to cover highly specialized annotation tasks.
• Support of annotation metadata. An annotation database should allow the storage and management of annotation metadata, such as annotator details, timestamps, model information, location data and additional information related to the annotation (see the sketch after this list).
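To make the last requirement concrete, the following is a minimal sketch of what a single annotation record together with its metadata could look like. All field names, types and values are illustrative assumptions on our part, not the schema of the deployed system.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class BoundingBox:
    """A single labeled region within an image (illustrative annotation type)."""
    label: str                            # e.g. "battery", "metal", "cardboard"
    x: float
    y: float
    width: float
    height: float
    confidence: Optional[float] = None    # set for model predictions, None for human labels

@dataclass
class AnnotationRecord:
    """One annotation together with the metadata listed in Section 2 (hypothetical schema)."""
    image_uri: str                        # reference to the raw image in the object storage
    created_at: datetime                  # timestamp of the annotation
    annotator: Optional[str]              # user id for manual labels
    model_name: Optional[str]             # model identifier for predicted labels
    model_version: Optional[str]
    device_id: str                        # capturing device (model number, camera, ...)
    location: Optional[tuple[float, float]] = None   # GPS coordinates, if available
    boxes: list[BoundingBox] = field(default_factory=list)

# Example of a model-generated annotation for one image (all values hypothetical):
record = AnnotationRecord(
    image_uri="s3://waste-images/device-42/2023-03-08/img-000123.jpg",
    created_at=datetime(2023, 3, 8, 10, 0, 0),
    annotator=None,
    model_name="impurity-detector",
    model_version="3.1",
    device_id="device-42",
    location=(48.2082, 16.3738),
    boxes=[BoundingBox(label="battery", x=0.31, y=0.55, width=0.05, height=0.08, confidence=0.92)],
)
```

In practice, such a record is split between the object storage (the image and, for complex scenarios, the full annotation payload) and the meta storage (the remaining fields), as described in the next section.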
3. Solution

In this section, we present our solution for the use case, addressing the requirements established in the previous section. Our approach is structured into three components: (i) base data ingestion, (ii) machine learning data ingestion, and (iii) a real-time analytical component. In the following, we provide an overview of each component, referring to Figure 1.

Figure 1: Overview of our Annotation Database Setup

Base data ingestion. In our use case, multiple end-user devices (referenced as “Data Collector” in the figure) capture new (image) data in real time (each device every few seconds) and insert it into our annotation database. We distinguish between raw data (e.g., the image), which is stored in an object storage, and metadata (e.g., timestamps, locations, the path of the raw data in the object storage), which is stored in our meta storage. Already in this step, it is crucial to use an efficient bucketing scheme for the metadata to optimize for the real-time analytical component – a key shortcoming of some approaches discussed in the introduction.

Machine Learning Data Ingestion. Here, our main goal is to overcome the – inefficient and costly – separation between training data storage and operational data storage. Operationally, the machine learning process is initiated when different triggers fire: for already deployed models, insertion triggers that compute new annotations (labels) in real time; for newly trained models, an on-demand execution over the existing raw data in the object storage after deployment of the model. The resulting annotations are written to the object storage, and a summary of those annotations is provided to the meta storage. Storing the annotation data in the object storage allows us to handle complex annotation scenarios – a key shortcoming of the other approaches discussed in the introduction – while the summarization provides the foundation for efficient real-time querying of the meta storage, the second key point raised in the introduction.

Real-time Analytical Component. The last part of the system is the ability to efficiently subscribe to a query over the annotation results in the meta storage. Here, we encountered different queries from the business domain, such as how many annotations of one specific label, or of a combination of labels, have been found per day for specific metadata criteria (device, location, machine learning model, and so on). This yields a high number of possible query combinations, but only a limited number of queries are actively requested at any given time. By combining an efficient analytical real-time database with bucketing (we use buckets based on the timestamp), we are able to cache “old” results and only have to (re)compute the changes in the newest bucket. With a subscription to database changes, the solution can even clear and recompute the cached results affected by a change and notify the currently subscribed queries of the newest updates in real time. This ensures that the solution meets the second key point raised in the introduction, more efficient analytical queries, as well as one of the key requirements.
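To illustrate how the summarization and the bucketed caching fit together, the following is a minimal, self-contained Python sketch. Plain in-memory data structures stand in for the object storage and the analytical meta storage, and all names and interfaces (ingest_annotations, LabelCountQuery, the daily bucket) are our own illustrative assumptions, not those of the deployed system; the subscription-based cache invalidation and notification is omitted for brevity.

```python
from collections import Counter, defaultdict
from datetime import date, datetime

# --- Meta storage -------------------------------------------------------------------------
# One summary row per image and label; the full annotation payload is assumed to live in the
# object storage under `annotation_uri`, and only counts are kept here for analytical queries.
# A bucket is one calendar day in this sketch; other granularities are possible.
meta_storage: list[dict] = []


def ingest_annotations(device_id: str, model: str, captured_at: datetime,
                       labels: list[str], annotation_uri: str) -> None:
    """Write a per-label summary of one image's annotations to the meta storage."""
    for label, count in Counter(labels).items():
        meta_storage.append({
            "bucket": captured_at.date(),
            "device_id": device_id,
            "model": model,
            "label": label,
            "count": count,
            "annotation_uri": annotation_uri,   # reference back to the object storage
        })


# --- Real-time analytical component --------------------------------------------------------
class LabelCountQuery:
    """Daily count of one label for a given device and model, caching closed buckets."""

    def __init__(self, device_id: str, model: str, label: str):
        self.key = (device_id, model, label)
        self._cache: dict[date, int] = {}   # totals of buckets that can no longer change

    def results_per_day(self, today: date) -> dict[date, int]:
        fresh: dict[date, int] = defaultdict(int)
        for row in meta_storage:
            if (row["device_id"], row["model"], row["label"]) != self.key:
                continue
            if row["bucket"] < today and row["bucket"] in self._cache:
                continue                     # closed and cached: no need to rescan these rows
            fresh[row["bucket"]] += row["count"]
        for bucket, total in fresh.items():
            if bucket < today:               # newly closed bucket: remember its total
                self._cache[bucket] = total
        return {**self._cache, **{b: t for b, t in fresh.items() if b >= today}}


# Example: three images ingested, then a subscribed query is evaluated.
ingest_annotations("device-42", "model-v3", datetime(2023, 3, 7, 9, 0),
                   ["battery"], "s3://annotations/device-42/a.json")
ingest_annotations("device-42", "model-v3", datetime(2023, 3, 8, 10, 0),
                   ["battery", "metal"], "s3://annotations/device-42/b.json")
ingest_annotations("device-42", "model-v3", datetime(2023, 3, 8, 10, 1),
                   ["battery"], "s3://annotations/device-42/c.json")

query = LabelCountQuery("device-42", "model-v3", "battery")
print(query.results_per_day(today=date(2023, 3, 8)))
# {datetime.date(2023, 3, 7): 1, datetime.date(2023, 3, 8): 2}
```

The essential point of the sketch is that only the newest bucket is ever rescanned, while all closed buckets are served from the cache until a subscription event invalidates them.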
Evaluation. This approach has been evaluated by the stakeholders of the company in real-world production in multiple countries and satisfies all requirements.

4. Conclusion

We conclude by raising open challenges for our community:

Open Challenges (theory). While the presented approach provides an effective solution for the use case, in the data management community we lack (1) a systematic study of this combination of annotation storages and summarization, and (2) theoretical results on the limits of such techniques.

Open Challenges (practice). Here, we lack (1) a systematic analysis of different database technologies for the meta storage, and (2) the development of optimized data management systems in the context of resource-limited environments. In addition, we are particularly interested in exploring this topic in more detail in the setting of Knowledge Graphs [14, 15, 16, 17] and our Vadalog system [18, 19, 20].

Acknowledgments

This work has been funded by the Vienna Science and Technology Fund (WWTF) [10.47379/VRG18013, 10.47379/NXT22018, 10.47379/ICT2201]; and the Christian Doppler Research Association (CDG) JRC LIVE.

References

[1] M. Schlegel, K. Sattler, Management of machine learning lifecycle artifacts: A survey, SIGMOD Rec. 51 (2022) 18–35.
[2] D. Nigenda, Z. Karnin, M. B. Zafar, R. Ramesha, A. Tan, M. Donini, K. Kenthapadi, Amazon SageMaker Model Monitor: A system for real-time insights into deployed machine learning models, in: KDD, ACM, 2022, pp. 3671–3681.
[3] A. Dutta, A. Zisserman, The VIA annotation software for images, audio and video, in: ACM Multimedia, ACM, 2019, pp. 2276–2279.
[4] W. Chen, W. Styler, Anafora: A web-based general purpose annotation tool, in: HLT-NAACL, The Association for Computational Linguistics, 2013, pp. 14–19.
[5] P. G. Kolaitis, Schema mappings, data exchange, and metadata management, in: PODS, ACM, 2005, pp. 61–75.
[6] P. A. Bernstein, S. Melnik, Model management 2.0: Manipulating richer mappings, in: SIGMOD Conference, ACM, 2007, pp. 1–12.
[7] M. Arenas, J. Pérez, J. L. Reutter, C. Riveros, Foundations of schema mapping management, in: PODS, ACM, 2010, pp. 227–238.
[8] P. G. Kolaitis, Reflections on schema mappings, data exchange, and metadata management, in: PODS, ACM, 2018, pp. 107–109.
[9] P. Edara, M. Pasumansky, Big metadata: When metadata is big data, Proc. VLDB Endow. 14 (2021) 3083–3095.
[10] D. Bhagwat, L. Chiticariu, W. C. Tan, G. Vijayvargiya, An annotation management system for relational databases, VLDB J. 14 (2005) 373–396.
[11] P. Senellart, Provenance and probabilities in relational databases, SIGMOD Rec. 46 (2017) 5–15.
[12] P. Buneman, W. Tan, Data provenance: What next?, SIGMOD Rec. 47 (2018) 5–16.
[13] How to efficiently manage storage for high-volume data annotation projects, https://medium.com/multisensory-data-training/storage-e7f37afba24c, 2023. Accessed: 2023-03-08.
[14] L. Bellomarini, M. Benedetti, S. Ceri, A. Gentili, R. Laurendi, D. Magnanimi, M. Nissl, E. Sallinger, Reasoning on company takeovers during the COVID-19 crisis with knowledge graphs, in: RuleML+RR (Supplement), volume 2644 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 145–156.
[15] L. Bellomarini, L. Bencivelli, C. Biancotti, L. Blasi, F. P. Conteduca, A. Gentili, R. Laurendi, D. Magnanimi, M. S. Zangrandi, F. Tonelli, S. Ceri, D. Benedetto, M. Nissl, E. Sallinger, Reasoning on company takeovers: From tactic to strategy, Data Knowl. Eng. 141 (2022) 102073.
[16] L. Bellomarini, E. Sallinger, S. Vahdati, Knowledge graphs: The layered perspective, in: Knowledge Graphs and Big Data Processing, volume 12072 of Lecture Notes in Computer Science, Springer, 2020, pp. 20–34.
[17] L. Bellomarini, E. Sallinger, S. Vahdati, Reasoning in knowledge graphs: An embeddings spotlight, in: Knowledge Graphs and Big Data Processing, volume 12072 of Lecture Notes in Computer Science, Springer, 2020, pp. 87–101.
[18] L. Bellomarini, L. Blasi, M. Nissl, E. Sallinger, The temporal Vadalog system, in: RuleML+RR, volume 13752 of Lecture Notes in Computer Science, Springer, 2022, pp. 130–145.
[19] L. Bellomarini, R. R. Fayzrakhmanov, G. Gottlob, A. Kravchenko, E. Laurenza, Y. Nenov, S. Reissfelder, E. Sallinger, E. Sherkhonov, S. Vahdati, L. Wu, Data science with Vadalog: Knowledge graphs with machine learning and reasoning in practice, Future Gener. Comput. Syst. 129 (2022) 407–422.
[20] L. Bellomarini, D. Benedetto, G. Gottlob, E. Sallinger, Vadalog: A modern architecture for automated reasoning with large knowledge graphs, Inf. Syst. 105 (2022) 101528.