<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Efficient Annotation Databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>René Heinzl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Markus Nissl</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuel Sallinger</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Building Digital Solutions 421 GmbH</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TU Wien</institution>
          ,
          <addr-line>Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Oxford</institution>
          ,
          <addr-line>Oxford</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recent advances in machine learning have increased the demand among organizations for efficient annotation data management in machine learning applications. In this paper, we address this challenge through an industrial collaboration centered around the unification of data for training and prediction workflows by enabling fast analytical processing through summarization. Beyond this specific solution, we offer a concrete real-world scenario and solution to the data management community as inspiration for further theoretical and practical research. Finally, we report on the open scientific challenges that remain in this field.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Answering the call that specifically pushes for “papers in real-world contexts”, we present a
paper on a real-world application in the area of waste separation, that is, in the context of the
pressing societal issues of circular economy and meeting the UN Sustainable Development Goals
(SDGs). It presents ongoing research based on an award-winning, in-production, large-scale
deployment in multiple countries.</p>
      <p>
        Context. The core of this paper is focused on annotation data management for machine learning,
a critical part of data management for machine learning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Industrial implementations, such as
Amazon SageMaker [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], VGG Image Annotator [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or Anafora [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] exist, but stop at the level of
annotating data for training purposes or at the management of the training process itself. They
offer limited support for metadata management and lack support for real-time data management
and analytical querying. Yet the data management community has produced ample
studies on metadata management [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8 ref9">5, 6, 7, 8, 9</xref>
        ] and annotation databases [10, 11, 12], though
in quite different contexts than what is required for annotation data management in machine
learning.
      </p>
      <p>In this paper, we describe the concrete solution to this problem that we developed for
this widely deployed real-world application. Our solution is centered around the unification
of data for training and prediction workflows by enabling fast analytical processing through
summarization. This is especially important when real-time data is used in reporting systems
and automated machine learning processes. Beyond this specific solution, the most important
aspect of this paper is that it gives the data management community a concrete real-world
scenario and solution as inspiration for further theoretical and practical research.</p>
      <p>Application. In the following, we provide the core use cases of our business partner for the
domain of interest, demonstrating the need for advanced annotation and metadata storage
for machine learning processes.</p>
      <p>Use Case (Object storage). The waste separation business is interested in detecting
impurities in plastic waste, such as batteries, metals or cardboard. The company
requires that (i) each image be stored as a possible candidate for training for at
least one year for subsequent analysis requests, (ii) each label detected by each version of
a machine learning model applied to an image be stored for (real-time) analytical
purposes, and (iii) for statistical evidence of correct labeling, the created labels be stored
per user.</p>
      <p>The storage of the image data alone for this use case, with an average of 100 GB (or 20,000
Full-HD images) per device, generates a large amount of data. While the image data
is typically stored in an object storage, the metadata for each device still exceeds 7 million entries per
year, even without considering details such as the number of labels per model or user. Moreover,
additional metadata is usually stored, as demonstrated by the following use case:</p>
      <p>Use Case (Metadata). In addition to annotation data for training and analytical purposes,
the company is interested in storing information regarding the device, such as the model
number, camera metadata or location data. Among other things, this allows the company to
correlate specific waste information with trucks and household areas for optimization
purposes.</p>
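      <p>The data volume figures quoted above can be cross-checked with a short calculation. The following Python sketch assumes that the 100 GB (20,000 images) figure is a daily average per device, an interpretation that is not stated explicitly but is consistent with the figure of more than 7 million metadata entries per year. It also assumes 1 GB = 1000 MB; all constant names are illustrative.</p>
      <preformat><![CDATA[
```python
# Back-of-the-envelope check of the per-device data volume quoted in the text.
# Assumption (not stated explicitly in the paper): the 100 GB / 20,000 images
# figure is a daily average per device, and 1 GB = 1000 MB.

GB_PER_DEVICE_PER_DAY = 100
IMAGES_PER_DEVICE_PER_DAY = 20_000
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365

# Average size of a single image in MB.
mb_per_image = GB_PER_DEVICE_PER_DAY * 1000 / IMAGES_PER_DEVICE_PER_DAY

# Average time between two captured images on one device.
seconds_between_images = SECONDS_PER_DAY / IMAGES_PER_DEVICE_PER_DAY

# One metadata entry per image, before counting labels per model or user.
metadata_entries_per_year = IMAGES_PER_DEVICE_PER_DAY * DAYS_PER_YEAR

# Raw image volume per device and year in TB.
tb_per_device_per_year = GB_PER_DEVICE_PER_DAY * DAYS_PER_YEAR / 1000

print(f"{mb_per_image:.1f} MB per image")                      # 5.0
print(f"one image every {seconds_between_images:.2f} s")      # 4.32
print(f"{metadata_entries_per_year:,} metadata entries/year")  # 7,300,000
print(f"{tb_per_device_per_year:.1f} TB of images/year")       # 36.5
```
]]></preformat>
      <p>Under this assumption, each device produces roughly one 5 MB image every 4.3 seconds and about 7.3 million metadata entries per year, in line with the figure given above.</p>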
      <p>There exist different approaches to storing such metadata in database systems. Typically,
annotation databases are built on top of relational databases or NoSQL stores, keeping annotations either in separate
annotation tables, in additional fields of the document table, or as binary data such as serialized
JSON or XML. In some cases, annotations are stored in object stores, with the database holding a reference to
their location. While the first two methods allow more efficient analytical queries, the
latter two make it possible to handle more complex annotation scenarios, such as frame series
annotations, where several thousand records ranging from several MB to several hundred
MB are required at once [13]. To the best of our knowledge, systems and theory that support both
scenarios do not exist.</p>
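      <p>To illustrate the trade-off between these storage methods, the following minimal sketch contrasts a separate annotation table with annotations serialized as JSON. It uses Python's built-in sqlite3 module purely as a stand-in; the table names, columns and sample rows are hypothetical and do not reflect the schema of the deployed system.</p>
      <preformat><![CDATA[
```python
import json
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Layout A: one row per annotation in a separate table.
    CREATE TABLE annotation (image_id INTEGER, label TEXT, model TEXT);
    -- Layout B: all annotations of an image serialized into one column.
    CREATE TABLE image_blob (image_id INTEGER PRIMARY KEY, annotations TEXT);
""")

# Hypothetical sample data: (image_id, label, model version).
rows = [
    (1, "battery", "v1"),
    (1, "metal", "v1"),
    (2, "battery", "v1"),
    (3, "cardboard", "v1"),
]
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?)", rows)

# Group the same annotations per image and store them as serialized JSON.
by_image = {}
for image_id, label, model in rows:
    by_image.setdefault(image_id, []).append({"label": label, "model": model})
conn.executemany(
    "INSERT INTO image_blob VALUES (?, ?)",
    [(image_id, json.dumps(items)) for image_id, items in by_image.items()],
)

# Layout A: the database answers the analytical query directly.
counts_table = dict(
    conn.execute("SELECT label, COUNT(*) FROM annotation GROUP BY label")
)

# Layout B: every serialized record is fetched and decoded by the client.
counts_blob = Counter()
for (payload,) in conn.execute("SELECT annotations FROM image_blob"):
    counts_blob.update(item["label"] for item in json.loads(payload))

print(counts_table)
print(dict(counts_blob))
```
]]></preformat>
      <p>Both layouts yield the same label counts, but in the first the aggregation runs inside the database, whereas in the second every serialized record has to be fetched and decoded before it can be counted.</p>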
      <sec id="sec-1-1">
        <title>Contribution</title>
        <p>In this paper, we address this challenge by reporting on</p>
        <list list-type="bullet">
          <list-item><p>a real-world contemporary use case in the context of the pressing societal issues of circular economy;</p></list-item>
          <list-item><p>ongoing work for an efficient annotation data storage solution that leverages both database systems and object storage;</p></list-item>
          <list-item><p>key requirements that an annotation database for machine learning purposes has to fulfil.</p></list-item>
        </list>
        <p>Outline. In the remainder of this paper, we first discuss the requirements, then present the
solution, and finally conclude by discussing open challenges.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Requirements</title>
      <p>In this section, we establish several key requirements that an annotation database<sup>1</sup> has to
fulfill in order to manage annotation data effectively. Our requirements are based on our use
cases from the waste separation company, extended with knowledge from different scenarios
gathered over several years of hands-on experience in the field of machine learning. The
requirements are:</p>
      <list list-type="bullet">
        <list-item><p>Integration with machine learning workflows. An annotation database should integrate
seamlessly with machine learning workflows, allowing the use of annotation data in the
training and evaluation of machine learning models.</p></list-item>
        <list-item><p>Support for search and analysis. An annotation database should store data in an optimized
format that can be efficiently queried and analyzed in real time. One should be able to
navigate through the data, extract insights and trends, and find specific annotations.</p></list-item>
        <list-item><p>Performance and scalability. An annotation database should be able to handle large
volumes of data and support high levels of concurrent access.</p></list-item>
        <list-item><p>Flexibility and extensibility. An annotation database should support a wide range of
annotation types and workflows as well as custom annotation types to cover highly
specialized annotation tasks.</p></list-item>
        <list-item><p>Support of annotation metadata. An annotation database should allow the storage and
management of annotation metadata, such as annotator details, timestamps, model
information, location data and additional information related to the annotation.</p></list-item>
      </list>
    </sec>
    <sec id="sec-3">
      <title>3. Solution</title>
      <p>In this section, we present our solution for the use case, addressing the requirements
established in the previous section. Our approach is structured into three components: (i) base
data ingestion, (ii) machine learning data ingestion, and (iii) a real-time analytical component.
In the following, we give an overview of each component with reference to Figure 1.</p>
      <p>Base data ingestion. In our use case, multiple end-user devices (referenced in the figure as
“Data Collector”) capture new (image) data in real time (each device every few seconds)
and insert it into our annotation database. We distinguish between raw data
(e.g., the image), which is stored in an object storage, and metadata (e.g., timestamps, locations,
the path of the raw data in the object storage), which is stored in our meta storage. Already
in this step, it is crucial to utilise an efficient bucketing schema for the metadata in order to optimise
for the real-time analytical component, a key shortcoming of some approaches discussed
in the introduction.</p>
      <p><sup>1</sup>Note that we concentrate here on the annotation database, not on the annotation management system, which
also includes additional functionality such as user management, visualization tools and an advanced user interface.</p>
      <p>Machine learning data ingestion. Here, our main goal is to overcome the inefficient and
costly separation between training data storage and operational data storage. Operationally,
the machine learning process is initiated when one of several triggers fires. For already
deployed models, these are insertion triggers that compute new annotations (labels) in real time;
for newly trained models, it is an on-demand execution over the existing raw data in the object storage
after deployment of the model. The resulting annotations are written to the object storage, and a
summary of those annotations is provided to the meta storage. Storing the annotation data
in the object storage allows us to handle complex
annotation scenarios, a key shortcoming of the other approaches discussed in the introduction.
The summarization, in turn, provides the foundation for efficient real-time
querying of the meta storage, the second key point raised in the introduction.</p>
      <p>Real-time analytical component. The last part of the system is the ability to efficiently
subscribe to a query over annotation results from the meta storage. We encountered
different queries from the business domain, such as how many annotations of one specific label
or a combination of labels have been found per day for specific metadata criteria (device, location,
machine learning model, and so on). This yields a high number of possible query combinations, of which
only a limited number are actively requested at any given time. By combining an
efficient analytical real-time database with bucketing (we use buckets based on the timestamp),
we are able to cache “old” results and only have to (re)compute the changes in the newest bucket.
With a subscription to database changes, the solution is also able to clear and recompute the
cache when changes affect currently subscribed queries, and to notify those queries
in real time of the newest updates. This ensures that the solution meets the second
key point raised in the introduction, more efficient analytical queries, which is also one of the key
requirements.</p>
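      <p>The two ingestion steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the deployed implementation: a plain Python dictionary and list stand in for the object storage and the meta storage, and all function names, paths and record fields are hypothetical.</p>
      <preformat><![CDATA[
```python
import json
from collections import Counter
from datetime import datetime, timezone

object_storage = {}  # path -> bytes (stand-in for an object store)
meta_storage = []    # metadata and summary records (stand-in for the meta storage)


def ingest_image(device_id, captured_at, image_bytes):
    """Base data ingestion: raw data to the object store, metadata to the meta storage."""
    path = f"raw/{device_id}/{captured_at:%Y%m%dT%H%M%S}.jpg"
    object_storage[path] = image_bytes
    record = {"kind": "image", "device": device_id,
              "timestamp": captured_at, "path": path}
    meta_storage.append(record)
    return record


def ingest_annotations(image_record, model_version, detections):
    """ML data ingestion: full annotations to the object store, a summary to the meta storage."""
    path = image_record["path"].replace("raw/", f"annotations/{model_version}/", 1) + ".json"
    object_storage[path] = json.dumps(detections).encode("utf-8")
    summary = {
        "kind": "annotation_summary",
        "device": image_record["device"],
        "timestamp": image_record["timestamp"],
        "model": model_version,
        # Only the per-label counts are kept for analytical querying.
        "label_counts": dict(Counter(d["label"] for d in detections)),
        "path": path,
    }
    meta_storage.append(summary)
    return summary


# Hypothetical device, timestamp and detections; the bytes are a placeholder.
image = ingest_image(
    "device-7",
    datetime(2023, 3, 8, 9, 30, 0, tzinfo=timezone.utc),
    b"\xff\xd8",
)
summary = ingest_annotations(
    image,
    "v2",
    [
        {"label": "battery", "box": [10, 20, 50, 60], "score": 0.93},
        {"label": "battery", "box": [70, 20, 90, 45], "score": 0.81},
        {"label": "metal", "box": [5, 5, 30, 30], "score": 0.77},
    ],
)
print(summary["label_counts"])
```
]]></preformat>
      <p>In this sketch, the detailed detections (boxes and scores) live only in the object storage, while the meta storage holds the compact label counts together with the path to the full annotation.</p>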
      <p>Evaluation. This approach has been evaluated by the stakeholders of the company in
real-world production in multiple countries and satisfies all requirements.</p>
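      <p>The bucketed caching of the real-time analytical component can be illustrated with a small sketch. It assumes daily buckets and per-image label-count summaries; the class and its methods are hypothetical, and the subscription and notification mechanism is omitted.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from datetime import date


class BucketedLabelCounter:
    """Counts labels per day bucket and caches the results of closed buckets."""

    def __init__(self):
        self.summaries = {}      # day -> list of label-count dicts
        self.cache = {}          # day -> aggregated Counter for that bucket
        self.recomputations = 0  # bookkeeping for this demonstration only

    def insert(self, day, label_counts):
        self.summaries.setdefault(day, []).append(label_counts)
        # A change invalidates only the bucket it falls into.
        self.cache.pop(day, None)

    def _aggregate(self, day):
        self.recomputations += 1
        total = Counter()
        for counts in self.summaries.get(day, []):
            total.update(counts)
        return total

    def count(self, label, today):
        result = 0
        for day in self.summaries:
            if day == today:
                # The newest bucket still receives data: always recompute it.
                bucket = self._aggregate(day)
            else:
                # Older buckets are aggregated once and then served from the cache.
                if day not in self.cache:
                    self.cache[day] = self._aggregate(day)
                bucket = self.cache[day]
            result += bucket[label]
        return result


store = BucketedLabelCounter()
store.insert(date(2023, 3, 7), {"battery": 2, "metal": 1})
store.insert(date(2023, 3, 8), {"battery": 1})
today = date(2023, 3, 8)

first = store.count("battery", today)   # aggregates both buckets
store.insert(today, {"battery": 3})
second = store.count("battery", today)  # re-aggregates only today's bucket
print(first, second, store.recomputations)
```
]]></preformat>
      <p>The second query reuses the cached result of the older bucket and recomputes only the newest one, so three aggregations are performed in total instead of four.</p>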
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <sec id="sec-4-1">
        <p>We conclude by raising open challenges for our community:</p>
        <p>Open Challenges (theory). While the presented solution is effective
for the use case, the data management community still lacks (1) a systematic study of this
combination of annotation storages and summarization, and (2) theoretical results on the
limits of such techniques.</p>
        <p>Open Challenges (practice). Here, we lack (1) a systematic analysis of different database
technologies for the meta storage, and (2) the development of optimized data management
systems in the context of resource-limited environments.</p>
        <p>In addition, we are particularly interested in exploring this topic in more detail in the setting
of Knowledge Graphs [14, 15, 16, 17] and our Vadalog system [18, 19, 20].</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work has been funded by the Vienna Science and Technology Fund (WWTF)
[10.47379/VRG18013, 10.47379/NXT22018, 10.47379/ICT2201]; and the Christian Doppler
Research Association (CDG) JRC LIVE.</p>
      <p>[10] D. Bhagwat, L. Chiticariu, W. C. Tan, G. Vijayvargiya, An annotation management system for relational databases, VLDB J. 14 (2005) 373–396.</p>
      <p>[11] P. Senellart, Provenance and probabilities in relational databases, SIGMOD Rec. 46 (2017) 5–15.</p>
      <p>[12] P. Buneman, W. Tan, Data provenance: What next?, SIGMOD Rec. 47 (2018) 5–16.</p>
      <p>[13] How to efficiently manage storage for high-volume data annotation projects, https://medium.com/multisensory-data-training/storage-e7f37afba24c, 2023. Accessed: 2023-03-08.</p>
      <p>[14] L. Bellomarini, M. Benedetti, S. Ceri, A. Gentili, R. Laurendi, D. Magnanimi, M. Nissl, E. Sallinger, Reasoning on company takeovers during the COVID-19 crisis with knowledge graphs, in: RuleML+RR (Supplement), volume 2644 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 145–156.</p>
      <p>[15] L. Bellomarini, L. Bencivelli, C. Biancotti, L. Blasi, F. P. Conteduca, A. Gentili, R. Laurendi, D. Magnanimi, M. S. Zangrandi, F. Tonelli, S. Ceri, D. Benedetto, M. Nissl, E. Sallinger, Reasoning on company takeovers: From tactic to strategy, Data Knowl. Eng. 141 (2022) 102073.</p>
      <p>[16] L. Bellomarini, E. Sallinger, S. Vahdati, Knowledge graphs: The layered perspective, in: Knowledge Graphs and Big Data Processing, volume 12072 of Lecture Notes in Computer Science, Springer, 2020, pp. 20–34.</p>
      <p>[17] L. Bellomarini, E. Sallinger, S. Vahdati, Reasoning in knowledge graphs: An embeddings spotlight, in: Knowledge Graphs and Big Data Processing, volume 12072 of Lecture Notes in Computer Science, Springer, 2020, pp. 87–101.</p>
      <p>[18] L. Bellomarini, L. Blasi, M. Nissl, E. Sallinger, The temporal vadalog system, in: RuleML+RR, volume 13752 of Lecture Notes in Computer Science, Springer, 2022, pp. 130–145.</p>
      <p>[19] L. Bellomarini, R. R. Fayzrakhmanov, G. Gottlob, A. Kravchenko, E. Laurenza, Y. Nenov, S. Reissfelder, E. Sallinger, E. Sherkhonov, S. Vahdati, L. Wu, Data science with vadalog: Knowledge graphs with machine learning and reasoning in practice, Future Gener. Comput. Syst. 129 (2022) 407–422.</p>
      <p>[20] L. Bellomarini, D. Benedetto, G. Gottlob, E. Sallinger, Vadalog: A modern architecture for automated reasoning with large knowledge graphs, Inf. Syst. 105 (2022) 101528.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schlegel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sattler</surname>
          </string-name>
          ,
          <article-title>Management of machine learning lifecycle artifacts: A survey</article-title>
          ,
          <source>SIGMOD Rec</source>
          .
          <volume>51</volume>
          (
          <year>2022</year>
          )
          <fpage>18</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nigenda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Karnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Zafar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramesha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Donini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kenthapadi</surname>
          </string-name>
          ,
          <article-title>Amazon sagemaker model monitor: A system for real-time insights into deployed machine learning models</article-title>
          ,
          <source>in: KDD, ACM</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>3671</fpage>
          -
          <lpage>3681</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dutta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>The VIA annotation software for images, audio and video</article-title>
          , in: ACM Multimedia, ACM,
          <year>2019</year>
          , pp.
          <fpage>2276</fpage>
          -
          <lpage>2279</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Styler</surname>
          </string-name>
          ,
          <article-title>Anafora: A web-based general purpose annotation tool</article-title>
          , in: HLT-NAACL, The Association for Computational Linguistics,
          <year>2013</year>
          , pp.
          <fpage>14</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Kolaitis</surname>
          </string-name>
          ,
          <article-title>Schema mappings, data exchange, and metadata management</article-title>
          ,
          <source>in: PODS, ACM</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Melnik</surname>
          </string-name>
          ,
          <article-title>Model management 2.0: manipulating richer mappings</article-title>
          , in: SIGMOD Conference, ACM,
          <year>2007</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Arenas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Reutter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Riveros</surname>
          </string-name>
          ,
          <article-title>Foundations of schema mapping management</article-title>
          ,
          <source>in: PODS, ACM</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>227</fpage>
          -
          <lpage>238</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. G.</given-names>
            <surname>Kolaitis</surname>
          </string-name>
          ,
          <article-title>Reflections on schema mappings, data exchange, and metadata management</article-title>
          ,
          <source>in: PODS, ACM</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>107</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>P.</given-names>
            <surname>Edara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pasumansky</surname>
          </string-name>
          ,
          <article-title>Big metadata: When metadata is big data</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>3083</fpage>
          -
          <lpage>3095</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>