<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Towards More Robust Fashion Recognition by Combining Deep-Learning-Based Detection with Semantic Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Achim Reiz</string-name>
          <email>achim.reiz@uni-rostock.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohamad Albadawi</string-name>
          <email>mohamad.albadawi@igd-r.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kurt Sandkuhl</string-name>
          <email>kurt.sandkuhl@uni-rostock.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Vahl</string-name>
          <email>matthias.vahl@igd-r.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dennis Sidin</string-name>
          <email>dennis.sidin@igd-r.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Neural Network</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer IGD</institution>
          ,
          <addr-line>Joachim-Jungius-Straße 11, 18059 Rostock</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Image Classification</institution>
          ,
          <addr-line>Ontology, Semantic Augmentation, Deep Learning, Convolutional</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>In A. Martin, K. Hinkelmann</institution>
          ,
          <addr-line>H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.)</addr-line>
          ,
          <institution>Proceedings of the AAAI 2021 Spring</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Rostock University</institution>
          ,
          <addr-line>18051 Rostock</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Symposium on Combining Machine Learning and Knowledge Engineering (AAAI-MAKE 2021) - Stanford University</institution>
          ,
          <addr-line>Palo Alto, California</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>2</volume>
      <fpage>2</fpage>
      <lpage>24</lpage>
      <abstract>
        <p>The company FutureTV produces and distributes self-produced videos in the fashion domain. It creates revenue through the placement of relevant advertising. The placement of apposite ads, though, requires an understanding of the contents of the videos. Until now, this tagging has been created manually in a labor-intensive process. We believe that image recognition technologies can significantly decrease the need for manual involvement in the tagging process. However, the tagging of videos comes with additional challenges: first, new deep-learning models need to be trained on vast amounts of data obtained in a labor-intensive data-collection process. We suggest a new approach that combines deep-learning-based recognition with a semantic reasoning engine. Through the explicit declaration of knowledge fitting the fashion categories present in the training data of the recognition system, we argue that it is possible to refine the recognition results and gain additional knowledge beyond what is found in the neural net.</p>
      </abstract>
      <kwd-group>
        <kwd>Image Classification</kwd>
        <kwd>Ontology</kwd>
        <kwd>Semantic Augmentation</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Convolutional Neural Network</kwd>
      </kwd-group>
      <conference>
        <conf-date>March 2021</conf-date>
        <conf-name>AAAI 2021 Spring Symposium on Combining Machine Learning and Knowledge Engineering (AAAI-MAKE 2021)</conf-name>
        <conf-loc>Stanford University, Palo Alto, California, USA</conf-loc>
      </conference>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Traditional, linear television is on a steady decline due to the rise of free and paid online content and
video on demand. Increasing bandwidths combined with mobile flat rates, the possibility of interacting
with users, and the rise of new innovative media formats amplify this industry trend
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Free online videos in particular have a high reach in the advertising-relevant target group of 14- to
39-year-olds [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. FutureTV, a content-marketing enterprise specialized in the creation and distribution
of online videos, meets the rising demand for such short videos with self-produced, high-quality
productions. These videos contain several scenes and show mostly fashion-related content. FutureTV
creates monetary value through the placement of specific content-related advertising. For example, if
the video shows a female face close-up wearing sunglasses and earrings, advertising should be placed
for these specific items. Knowing the kind of fashion objects in the video has a direct impact on revenue
and economic success. Historically, this tagging has heavily depended on manual work, which is
labor-intensive, costly, and challenging to scale up because of the need to hire and train a workforce.
Automatic image detection technologies can reduce the need for manual labor and enable a more
efficient and scalable tagging process.
      </p>
      <p>In the last eight years, the field of object recognition has been witnessing a revolution in terms of
achieved recognition accuracy, mainly led by deep neural networks. However, this came with additional
costs: these models need expensive computational power and vast amounts of costly annotated
training data to deliver good performance. This data also needs to be balanced; that is, target objects
need to be equally represented in the data. In a context where fine-grained categories are highly
desirable (e.g., the FutureTV use case), it may be challenging to meet that condition of balance for objects
that are by nature uncommon. In fashion categories, for instance, under the main category
‘coat’, a tailcoat is not as common as a raincoat.</p>
      <p>The approach presented in this work aims at eliminating the need for annotated data for
fine-grained categories. As a result, less overall data (and hence less effort) is needed for the
fine-grained recognition task, and no imbalance problems arise from the scarcity of examples of the fine
categories. The paper is structured as follows. The next section introduces the related work, followed
by an overview of our new approach's technical architecture. Sections four and five are then concerned
with the evaluation of exemplary results. The approach is further motivated by an economic business case in
section six. The paper concludes with an outlook on further research prospects.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>This related-work section is structured in two parts. First, we survey the image recognition
technologies used and show the advantages of a hierarchical classification approach. Second, we establish
the state of the art regarding the connection of semantic technologies with deep learning.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1. Image Recognition Technologies</title>
      <p>
        In recent years, several neural network architectures have been proposed for image classification
and object detection. The ResNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] family represents one of the most widely used approaches for
image classification. ResNet exhibits a simple but effective strategy of stacking a large number of
convolutional blocks coupled with shortcut identity connections; this enabled the
building of increasingly deeper models with excellent performance. ResNeXt [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is based on
ResNet; it integrates a technique initially used by Inception [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] known as split-transform-merge: the
input of an Inception module is split into lower-dimensional embeddings and then transformed by a set
of filters before the results are merged by concatenation. These aggregated transformations outperform
the original ResNet modules even under the restricted condition of maintaining model and
computational complexity. Most architectures are used in flat classifiers; some works [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7–9</xref>
        ] suggested
using convolutional classifiers in a hierarchical fashion for better separation between visually similar
objects. That way, a classifier's capacity is concentrated on such objects rather than being distributed
among a large number of object categories as in the flat approach.
      </p>
      <p>
        Recently a competing family of neural networks called EfficientNets [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] has evolved. Remarkably,
these architectures have not been handcrafted but discovered using neural architecture search. The
aforementioned classification models are typically used as backbone models for modern object
detection networks. These networks follow two main approaches: two-stage architectures leverage a
proposal-driven mechanism to generate a set of candidate object locations that are then classified, while
one-stage detectors directly regress bounding-box coordinates and class scores. Faster R-CNN [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is a
popular two-stage detector and is often used together with ResNet as the backbone model. It uses a
sophisticated region proposal network to generate candidate object locations. One-stage
detectors such as SSD [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], YOLO [13], and RetinaNet [14] have been designed to be more efficient and
thus faster than two-stage detector networks at the cost of some accuracy. Another approach to
increasing network efficiency has been research on anchor-free network architectures such as FCOS [15],
which uses a fully convolutional network architecture. Recently a new family of one-stage architectures
called EfficientDets [16] has been proposed, which uses the EfficientNets mentioned above as a
backbone. This approach achieved similar accuracy with significantly fewer model parameters.
      </p>
      <p>
        Over the last years, several scientific publications have researched classification and object detection
in the context of fashion understanding and analysis [17–19]. Because of the great variety of
clothing types and appearances, and difficulties due to occlusion, the classification and detection of fashion
objects remain a challenging problem. Several datasets like DeepFashion [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], DeepFashion2 [20],
and ModaNet [21] of varying sizes have been made publicly available. However, depending on the
specific type of fashion elements that need to be recognized, the usefulness of these datasets in real-world
use cases is limited, and the synthesis of a suitable dataset becomes inevitable.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Connecting Semantic and Image Recognition</title>
      <p>With the rising maturity of image recognition technologies, the connection of these technologies with
semantic capabilities has seen some attention from the scientific community in recent years. Two
literature reviews by Bhandari and Kulikajevas in 2018 [22] and by Ding et al. in 2019 [23] already
collected the state of the art regarding the connection of semantic and deep learning technologies.</p>
      <p>[22] first considers three different application areas for the interdisciplinary approach: The
increasing accuracy in segmentation tasks, the automatic creation of image labels to annotate large
libraries of former unstructured video and image content, and the recognition of part-of relationships of
larger objects. Further, the paper names three domain-specific application scenarios: robotics, to reduce
the required computational capacity and the number of detected false positives; geographic
information science, to translate imagery into a GIS-ready format; and sports events, to improve
complex keyword searches.</p>
      <p>[23] distinguishes between single-object image recognition and multi-object image recognition. For the
former, Ding et al. present examples of connecting recognition algorithms with high-level
semantics. This allows, e.g., the iterative detection of bird species on changing backgrounds or the more
accurate classification of buildings. For the latter, the authors describe the opportunity to analyze the
relationships between the targets in the multi-object environment for better analysis accuracy through
the connection with WordNet, or user-behavior identification in videos. Further examples of
connecting semantics and deep learning are the categorization and storing of information inside an
ontology [25] or the rule-based indexing of CCTV footage [24].</p>
      <p>At a high level, the related works share a similar core. From a detailed perspective, though,
these approaches differ widely. The analysis of CCTV does not require a high granularity and detailed
hierarchy for similar-looking objects. The connection of labels through WordNet does not function in
a domain-specific environment with a specific business-related task.</p>
      <p>Our work is concerned with fine-grained fashion categories, which need hard-to-find training data
that may result in a highly imbalanced recognition problem. For that, we start with popular coarse
fashion categories and leverage existing large-scale datasets to refine those. Our approach exploits the
correlation between our coarse categories and the existing general-purpose large-scale recognition
dataset. A semantic augmenter will be analyzing fashion elements in the light of knowledge extracted
from the input imagery based on the Places dataset [25]. Back to the example ‘coat’ from the previous
subsection, it will be enough to detect a coat in the image; the semantic augmenter will take on from
that point and infer a tailcoat, knowing that the scene is a concert hall. To the best of our knowledge,
there are currently no approaches that use ontologies to maximize the extracted information and reduce
the training costs of deep learning methodologies.</p>
    </sec>
    <sec id="sec-5">
      <title>3. Connecting Semantic Reasoning with Image Recognition</title>
      <p>The project aims at developing an innovative approach by combining techniques from the symbolic
and subsymbolic sub-disciplines of AI research. The aim is to apply knowledge captured in an ontology
to improve object recognition in videos, which is based on an artificial neural network (ANN) and a
deep-learning approach.</p>
      <p>Figure 1 presents an overview of the architecture and orchestration of the newly created approach.
The image recognition unit provides access to the ANN database containing available models. The
models can either be for a single concept in the ontology or a combined model for several concepts.</p>
      <p>The semantic augmenter unit provides access to the ontology. The ontology is supposed to capture
the relevant knowledge for the application field of discovering fashion items in videos. There exists an
ontological twin for every classification model in the deep learning part of the system. These twins are
embedded in contextual knowledge like a taxonomy of fashion items, environments suitable for specific
fashion categories (mountain, skiing, outdoor), social contexts relevant for fashion categories
(weddings, parties), and more. See 3.2 for an extensive description of the ontology structure.</p>
      <p>Uploading a video to the deep learning servers triggers an initial analysis, resulting in a scene
classification (e.g., airfield, bar, concert_hall), the detection of the gender (male, female), and the body
area/fashion category. For each classified scene, a set of concepts from the ontology is considered
relevant for this purpose. This "starter set" usually includes high-level concepts in the ontology (e.g.,
fashion category TOP for the gender female in the scene outdoor). Rather than searching for all potential
fashion items, which would require running many classifiers, the search starts with a subset of 2 to 8 objects.
These results are returned to the analysis framework and forwarded to the semantic servers. The
semantic service utilizes the taxonomy shared between the project partners and can infer contextualized
knowledge fitting the requested items. While the detection of a bikini or bra is likely in a swimming
pool, the same item in a hardware_store or ski_resort is unlikely. The semantic service will, therefore,
filter out such elements.</p>
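      <p>To make this filtering step concrete, the following minimal Python sketch shows how such a
scene-based plausibility filter could look. The scene-to-item mapping and all names are illustrative
assumptions; in the actual system, this knowledge is captured in the OWL ontology and evaluated by a
reasoner.</p>
      <preformat>
# Minimal sketch of the semantic plausibility filter (illustrative only).
# The mapping below is a hypothetical stand-in for the OWL ontology.
PLAUSIBLE_ITEMS = {
    "swimming_pool":  {"BIKINI", "BRA", "SWIM_SHORTS"},
    "hardware_store": {"TOP", "TROUSERS", "SHOES"},
    "ski_resort":     {"SKI_JACKET", "TOP", "TROUSERS"},
}

def filter_detections(scene: str, detected: set) -> set:
    """Drop detected fashion concepts that are implausible for the scene."""
    plausible = PLAUSIBLE_ITEMS.get(scene, detected)  # unknown scene: keep all
    return detected.intersection(plausible)

# A bikini is plausible at a swimming pool but filtered in a hardware store.
print(filter_detections("swimming_pool", {"BIKINI", "SHOES"}))   # {'BIKINI'}
print(filter_detections("hardware_store", {"BIKINI", "SHOES"}))  # {'SHOES'}
      </preformat>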
      <p>A refined iteration can be triggered after a higher-level concept is detected, utilizing semantic
reasoning based on the previous analysis. The iterative improvement process takes place on the common
terminology that is defined for all project partners. Figure 2 shows an excerpt from this shared ontology.
As an example, we could assume a medium shot of a woman in a swimming pool. The image
recognition unit would recognize the scene swimming_pool_indoor and the fashion category TOP. The
semantic component now can derive that no leaf item in the TOPLAYER and MIDLAYER category fits
the classified situation and returns only the LOWERLAYER category as the next possible item in the
given situation. For the next refinement iteration, two of the three available classifiers can be omitted
in the image analysis, thus saving computational resources. With this approach, we reduce the effort
required for object recognition compared to the brute-force strategy of trying all existing concepts, while
still expecting to reach high tagging quality. However, this expectation has to be validated in subsequent
experiments.</p>
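      <p>Under assumed interfaces, such a refinement iteration can be sketched as a guided descent through
the classifier tree; the subtree, the plausibility test, and the run_classifier callback below are
hypothetical stand-ins for the image recognition service and the semantic component.</p>
      <preformat>
# Sketch of the iterative refinement loop (hypothetical interfaces).
hierarchy = {
    "TOP": ["TOPLAYER", "MIDLAYER", "LOWERLAYER"],   # illustrative subtree
    "LOWERLAYER": ["T_SHIRT", "BIKINI_TOP", "BRA"],
}

def plausible(concept, scene):
    # Stand-in for the semantic service: e.g., only LOWERLAYER items
    # fit a woman in an indoor swimming pool.
    return not (scene == "swimming_pool_indoor"
                and concept in ("TOPLAYER", "MIDLAYER"))

def refine(concept, scene, run_classifier):
    """Descend the classifier tree, skipping implausible branches."""
    while concept in hierarchy:
        candidates = [c for c in hierarchy[concept] if plausible(c, scene)]
        if not candidates:
            break
        concept = run_classifier(concept, candidates)  # image recognition call
    return concept
      </preformat>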
      <p>Further, we can infer additional possible sub-categories of the detected fashion items through
reasoning after reaching the image recognition service's finest search category. The shared ontology
contains 62 classes. If the algorithm reaches a leaf element, no further iteration can be triggered at the
image-recognition servers. To enable the derivation of possible, more accurate results beyond the image
classifiers, the leaf elements of the shared ontology contain a link towards a more extensive, non-shared
fashion vocabulary containing 693 elements. This larger ontology is based on the EU-funded
FashionBrain project [26]. More information on the created fashion ontology, its evolution, and the underlying
design decisions can be found in [27].</p>
      <p>The integration of the more extensive fashion vocabulary with the semantic twins enables the
inference of new sub-classes without newly trained recognition classifiers through the use of semantic
reasoning. The ontology stores matching sub-concepts for the detected classes and relates them to the
already classified items, allowing the filtering of sub-concepts that might fit into a
given situation. The latter is validated in the evaluation section.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1. Structure of the Image Recognition Unit</title>
      <p>
        The recognition unit consists of three main components, each performing one of two main visual tasks:
classification and detection. The components are described next in the order of their use in the
recognition unit. The first component is a ResNet18 classification model trained on the Places 365
dataset [25]; the model is trained and provided by the team responsible for Places. This network can
distinguish between 365 scene categories covering a broad spectrum of environments seen in the real
world. This component operates independently from the other two, which operate together to produce
their results. The next component is an object detector that is trained to detect seven main types of
objects, namely Male_head, Female_head, Top, Bottom, Dress (clothes covering the full body),
Accessory (hat, tie, bag …), and Shoes. As for the detector's architecture, we leverage the more complex
but more accurate ResNet50 as the backbone model for a Faster R-CNN detection model [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. A dataset
of 6000 images was prepared and annotated accordingly for the training of the model. The detector is
trained only once and will then detect its seven categories regardless of which specific fashion element
is present. It will always detect the region in an image representing a top object irrespective of which
top is that (T-shirt, jacket…). The detected objects are then to be further classified by the third
component, which is a hierarchy of classifiers. We went for hierarchical classification because it offers
the system a significant advantage compared to a flat classification approach; it delivers better accuracy
with regard to objects with a similar appearance, as shown in works [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7–9,28</xref>
        ]. That is required in our
use case as many fashion elements are very similar (e.g., bra vs. bikini top), and an appropriate
differentiation is necessary for the semantic augmenter to work well. The enhanced accuracy comes
with extra computational time; however, that is not a problem, as the system does not need to conform
to any specific runtime requirements. Each classifier in the hierarchy corresponds to a non-leaf node in
the tree of a fashion ontology shared with the semantic augmenter. Figure 2 shows an excerpt of the
ontology. In total, 22 classifiers were trained; they can be controlled/run separately or as one entity.
      </p>
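      <p>A compressed sketch of how the three components could be chained is shown below; all model
wrappers and method names are placeholders assumed for illustration, as the actual unit is exposed
through a REST API (see below).</p>
      <preformat>
# Sketch of the recognition pipeline: scene classifier, detector,
# classifier hierarchy. All wrapper objects are hypothetical.
def analyze_frame(frame, scene_model, detector, hierarchy_root):
    """Run the three components of the recognition unit on one frame."""
    scene = scene_model.classify(frame)      # 1) one of the 365 Places categories
    results = []
    for box in detector.detect(frame):       # 2) one of the seven coarse categories
        crop = frame.crop(box.coordinates)
        node = hierarchy_root                # 3) descend the classifier tree
        while not node.is_leaf():
            node = node.classify(crop)       # pick the best-fitting child
        results.append((box.category, node.label))
    return scene, results
      </preformat>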
      <p>Leaf nodes in that tree represent the final categories that the hierarchy can differentiate. In total, the
tree has 37 leaf nodes corresponding to 37 fashion categories. These are the coarse fashion categories
that the semantic augmenter can obtain from the recognition unit along with the gender information
from the detector and the scene category from the first component for further refinement. The semantic
augmenter also has access to the results of the intermediate classifiers in the hierarchy. To train the
classifiers in our hierarchy, a dataset of around 60,000 images was prepared from freely available
online images. For each of the 37 end-categories, about 1500 images were collected. For training the
intermediate classifiers, images from leaf categories were merged to represent parent categories. The
images in this dataset were cropped as tightly around the objects as possible; this yields training
examples that imitate the image areas under the bounding boxes delivered by the detector.</p>
      <p>In the recognition unit, we followed a cloud-based computer-vision-as-a-service approach and
provided our trained classification and detection models through a custom-built REST API. This API
allows uploading a video or image file, selecting a trained neural network model for inference, and
requesting a classification or detection process. A feedback mechanism was also implemented
in the recognition unit. Either the augmenter or a human user can mark up false recognition results; the
system can accumulate that information, and upon request, a training process can be triggered for
arbitrary networks. The training process is terminated with the help of the early-stopping technique
based on the validation loss.</p>
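      <p>A hypothetical client interaction with such an API could look as follows; the endpoint paths,
parameters, and response fields are assumptions for illustration, as the paper does not document the
actual API surface.</p>
      <preformat>
# Hypothetical client calls against the recognition REST API.
import requests

BASE = "https://recognition.example.org/api"  # placeholder host

# 1) Upload an image file.
with open("frame_0042.jpg", "rb") as f:
    upload = requests.post(f"{BASE}/media", files={"file": f}).json()

# 2) Request detection with a selected trained model.
job = requests.post(
    f"{BASE}/detect",
    json={"media_id": upload["id"], "model": "fashion_detector_v1"},
).json()

# 3) Mark up a false result, feeding the accumulated retraining data.
requests.post(f"{BASE}/feedback",
              json={"job_id": job["id"], "box": 0, "correct_label": "Top"})
      </preformat>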
      <p>Due to the fast-changing nature of computer vision research, we could not leverage the most recent
model architectures like EfficientDets, as such models were not yet available at the time we implemented
our solution. However, we developed our server and machine learning infrastructure with extensibility
in mind. Therefore, it is possible to easily integrate new model architectures, and we are currently
researching suitable candidates.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2. Structure of the Semantic Image Augmentation Ontology</title>
      <p>In this section, the semantic augmenting unit is described in more detail. The ontology is built in
OWL and utilizes an extensive class structure without depending on individuals. Figure 3 presents the
structure of the semantic augmentation ontology. In the following example, JEANS_SHORTS is
detected by the image recognition unit. It is associated with the gender classes female and male. The
fashion items are not connected directly to the scenes but through an intermediary concept, occasion. The
ontology utilizes object properties for the connection of the various classes, modeled as subclass
relationships. The ontology is publicly available at [29].</p>
      <p>As the ontology consists of 365 recognizable scenes and over 750 fashion items, creating a direct
connection between all of these items would heavily increase the size of the ontology and the modeling
effort. The 56 occasions in total reduce the complexity and ease the maintenance of the ontology. Taking
the example of JEANS_SHORTS, they allow excluding occasions like winter, formal-business,
or high-class-events, with the associated detectable scenes like courthouse, ski_resort, office,
or dining_hall.</p>
      <p>Taking the example of Figure 3, the IMAGERECOGNITION class represents the iterative
improvements of the image recognition service. For the example JEANS_SHORTS, at first the body
area is detected (BOTTOM), then the kind of trousers (in this case TROUSERS; other possibilities would
be SKIRT and BUSINESS_PANTS). Finally, the clothing element itself is detected. At this point, the
image-recognition classifier does not offer additional knowledge; it has reached the finest available
granularity. The leaf element of the IMAGERECOGNITION's ontological twin now points to the more
extensive fashion-knowledge base. As these elements are connected to the same describing attributes,
the semantic engine can infer additional fashion items that fit the given situation using the same gender
and occasion/scene constraints. In our example, the larger, non-shared fashion ontology contains
additional shorts elements like casual_shorts, cutoffs, bermuda-shorts, culottes, and more (30
different kinds of shorts in total).</p>
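      <p>A minimal sketch of how such an inference could be queried with rdflib is given below; the
namespace and the property names (fitsOccasion, fitsGender) are illustrative assumptions and not
necessarily the vocabulary of the published ontology [29].</p>
      <preformat>
# Sketch: query the fashion ontology for additional items that share the
# gender and occasion constraints of a detected leaf concept.
from rdflib import Graph, Namespace

FASHION = Namespace("http://example.org/fashion#")  # assumed namespace
g = Graph().parse("fashion-ontology.owl", format="xml")

QUERY = """
SELECT ?item WHERE {
    fx:JEANS_SHORTS fx:fitsOccasion ?occ ;
                    fx:fitsGender   ?gender .
    ?item fx:fitsOccasion ?occ ;
          fx:fitsGender   ?gender .
    FILTER(?item != fx:JEANS_SHORTS)
}
"""
for row in g.query(QUERY, initNs={"fx": FASHION}):
    print(row.item)  # e.g., casual_shorts, cutoffs, bermuda-shorts, ...
      </preformat>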
    </sec>
    <sec id="sec-8">
      <title>4. Evaluation of Image Recognition Unit</title>
      <p>Improving on the accuracy of state-of-the-art machine learning models is out of the scope of this
work. However, an evaluation of our models is important to make sure they are working properly. The
ResNet18 classification model trained on Places365 was provided by the authors [25]. They provided
multiple ResNet models with different numbers of layers; accuracy figures for the 18-layer model were
not reported. We evaluated the model on the test set of Places365 and reached a top-1 accuracy of
53.2% and a top-5 accuracy of 83.8%. Our 7-class detection model was evaluated on a test set of
1000 images and reached an mAP of 51.7% using an intersection-over-union threshold of 0.5. Classifiers
in the classification hierarchy were individually evaluated. Each classifier was assessed on a test set
that has around 100 images per class. The results are summarized in Table 1.</p>
      <p>[Table fragment: the linked sub-items of pants include cargo-pants, casual-pants, chinos-and-khakis,
corduroy, corduroys, cropped-pants, dress-formal, joggers, knits, overalls, overalls-pants, and
wide-leg-pants.]</p>
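      <p>For reference, the intersection-over-union criterion behind the reported mAP can be stated in a
few lines of Python; this is the standard definition, not project-specific code.</p>
      <preformat>
# Intersection over union between two axis-aligned boxes (x1, y1, x2, y2).
# A detection counts as correct for the reported mAP when the IoU with the
# ground-truth box is at least 0.5.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
      </preformat>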
    </sec>
    <sec id="sec-9">
      <title>5. Evaluation of Semantic Augmentation</title>
      <p>The current results of the new service look promising. For the evaluation, we chose a total of 12
different pictures. As the evaluation focuses on the performance of the semantic augmentation engine,
we assumed "perfect" image-detection results. Table 1 shows the detected scenes, the corresponding
image-detection items in their finest granularity, and the additional items inferred by the semantic
augmentation engine. The last row, containing the inferred elements, has to be read as follows:</p>
      <p>If the semantic engine cannot infer any results, this is indicated by NA. Otherwise, the linked item
is presented, followed by its fitting sub-items, if applicable. For example, in result #2, the
image-recognition item t-shirt is linked with a t-shirt item in the more extensive fashion knowledge base. This
linked item also has more detailed sub-items like uni-t-shirt, pattern-t-shirt, and print-t-shirt. The
scenes constrain these subclasses. An example of this constraint can be found in #4. For the scenes
stadium_soccer and stadium_baseball, there are no sub-items of t-shirt linked as a proper fit.</p>
    </sec>
    <sec id="sec-10">
      <title>6. Expected Economic Benefits</title>
      <p>The hybrid approach's economic potential can be illustrated by comparing the object recognition
solely using machine learning with the combined semantic and deep learning (DL) approach. Let us
assume that we have M videos and want to determine the relevant objects in video k. Furthermore, there are N
classifiers available that have been trained in the DL approach, one for each object. Which objects are
relevant in which scene is further specified in the semantic net. The relevant objects in k can be determined
either with the DL approach alone or with the combined semantic net and DL approach.</p>
      <p>In the first case (without a semantic net), to identify all relevant objects in video k, all classifiers
have to be applied because there is no information from the semantic net about what scenes exist, what
objects characterize these situations, and what additional objects are relevant in a scene. In this case, the
effort Ek of identifying the relevant objects for video k consists of running all N available
classifiers.</p>
      <p>When using the semantic net, the classifiers of the DL approach first have to be used to identify the
scene. The number of scenes is much smaller than the total number of objects, and the number
of classifiers required to determine a specific scene is also much smaller than the total number of
objects. In the second step, only the objects related to the identified scene are relevant, i.e., only this
specified set of classifiers of the DL approach has to be used.</p>
      <p>The effort required to determine the relevant objects in k consists of the effort for detecting the scene
plus the effort for running the classifiers for the objects relevant in the detected situation. This is
illustrated in Figure 5, where e0 is the effort required to prepare scene recognition for video k on the
initial analysis of the semantic net, ei is the effort required to run an individual classifier, and N' is the
number of classifiers required to determine all relevant objects. If for video k all objects (and available
classifiers) are relevant, N is equal to N'. In that case, the effort of object recognition without the
semantic net is lower than with it, since the combined approach additionally incurs the scene-recognition
overhead e0.</p>
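      <p>As a worked illustration of this comparison, a small sketch with hypothetical numbers and a
uniform per-classifier effort (all values are assumptions, not measurements):</p>
      <preformat>
# Hypothetical effort comparison following the model above: without the
# semantic net, all N classifiers run; with it, scene recognition (e0)
# plus only the N' scene-relevant classifiers run.
N, N_prime = 37, 6    # total vs. scene-relevant classifiers (assumed)
e_i, e_0 = 1.0, 3.0   # per-classifier and scene-recognition effort (assumed)

effort_dl_only = N * e_i                  # all available classifiers
effort_combined = e_0 + N_prime * e_i     # e0 plus the N' relevant ones

print(effort_dl_only, effort_combined)    # 37.0 vs. 9.0
# The combined approach only loses when N' approaches N, because of e0.
      </preformat>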
    </sec>
    <sec id="sec-11">
      <title>7. Discussion</title>
      <p>The usage of automated object detection has the potential to replace the manual tagging of video
contents and can, therefore, lead to significant monetary savings. However, the specific characteristics
of the analysis of fashion-related videos present a challenge for implementing a classical object
detection analysis. Due to the nature of moving pictures, many frames need to be analyzed, which requires
enormous computing power. Some fashion objects look similar to each other and need additional
context to be distinguished. In this paper, we proposed the combination of a deep-learning CNN with
a hierarchical semantic network. We argue that this approach has the potential to lower the
computational requirements and enhance precision. Furthermore, extending the detection requires less
effort, and additional information beyond the image recognition data can be derived
through semantically linked knowledge.</p>
      <p>In this work, we evaluated the semantic augmentation performance and showed how it could help
bring us beyond what we can explicitly detect with typical recognition techniques. While the semantic
augmenter and the image recognition system are now deployed online and the productive end-software
is imminent (currently in the debugging phase), a full evaluation of the novel approach (image recognition + semantics)
is still pending. Therefore, this research endeavor's next steps are concerned with the numeric analysis
of the characteristics in terms of computational requirements, end accuracy (recognition error +
semantic error), training effort, and response times.</p>
    </sec>
    <sec id="sec-12">
      <title>Acknowledgements</title>
      <p>We would like to thank FutureTV GmbH for the successful collaboration and look forward to
working together on further innovative approaches relevant to the research and industry community.
The results shown here are based on the joint project "KI-based object recognition and
semantic-supported content analysis (KOSInA)", which is co-financed by the European Union from the
European Regional Development Fund (Operational Programme Mecklenburg-Vorpommern 2014 to 2020 –
Investments in growth and employment).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>I. Irving</surname>
          </string-name>
          ,
          <source>Two Happy Ladies on top of Le Crete</source>
          ,
          <year>2012</year>
          . https://flic.kr/p/dHrvi8 (accessed
          <issue>30</issue>
          <year>November 2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Goldmedia</surname>
            <given-names>GmbH</given-names>
          </string-name>
          , Grugel Productions,
          <source>WEB-TV-MONITOR</source>
          <year>2019</year>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>SevenOne</given-names>
            <surname>Media</surname>
          </string-name>
          ,
          <source>View Time Report: Neue Perspektiven der Videonutzung</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          ,
          <source>in: 29th IEEE Conference on Computer Vision</source>
          and Pattern Recognition, Las Vegas,
          <string-name>
            <surname>NV</surname>
          </string-name>
          , USA, IEEE, Piscataway, NJ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Fixing the train-test resolution discrepancy</article-title>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          ,
          <article-title>Going deeper with convolutions</article-title>
          ,
          <source>in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , Boston, MA, USA, IEEE, Piscataway, NJ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Piramuthu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Jagadeesh</surname>
          </string-name>
          , D. DeCoste, W. Di,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          , HD-CNN:
          <article-title>Hierarchical Deep Convolutional Neural Networks for Large Scale Visual Recognition</article-title>
          , in: 2015
          <source>IEEE International Conference on Computer Vision</source>
          , Santiago, Chile,
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          , Piscataway, NJ,
          <year>2015</year>
          , pp.
          <fpage>2740</fpage>
          -
          <lpage>2748</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Murdock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , T. Duerig, Blockout:
          <article-title>Dynamic Model Selection for Hierarchical Deep Networks</article-title>
          ,
          <source>in: 29th IEEE Conference on Computer Vision</source>
          and Pattern Recognition, Las Vegas,
          <string-name>
            <surname>NV</surname>
          </string-name>
          , USA, IEEE, Piscataway, NJ,
          <year>2016</year>
          , pp.
          <fpage>2583</fpage>
          -
          <lpage>2591</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhanga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <article-title>Embedding Visual Hierarchy with Deep Networks for Large-Scale Visual Recognition</article-title>
          ,
          <source>IEEE Trans. Image Process</source>
          . (
          <year>2018</year>
          ). https://doi.org/10.1109/TIP.
          <year>2018</year>
          .
          <volume>2845118</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <surname>Q. Le V</surname>
          </string-name>
          ,
          <article-title>EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks</article-title>
          , ArXiv abs/
          <year>1905</year>
          .11946 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <surname>Faster</surname>
            <given-names>R-CNN</given-names>
          </string-name>
          :
          <article-title>Towards Real-Time Object Detection with Region Proposal Networks</article-title>
          ,
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>39</volume>
          (
          <year>2017</year>
          )
          <fpage>1137</fpage>
          -
          <lpage>1149</lpage>
          . https://doi.org/10.1109/TPAMI.
          <year>2016</year>
          .
          <volume>2577031</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.C.</given-names>
            <surname>Berg</surname>
          </string-name>
          , SSD: Single Shot MultiBox Detector, in: B.
          <string-name>
            <surname>Leibe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Matas</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Sebe</surname>
          </string-name>
          , M. Welling (Eds.),
          <source>Computer vision - ECCV</source>
          <year>2016</year>
          :
          <article-title>14th European conference</article-title>
          , Amsterdam, The Netherlands,
          <source>October 11-14</source>
          ,
          <year>2016</year>
          proceedings, Springer, Cham,
          <year>2016</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, in: 29th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, IEEE, Piscataway, NJ, 2016, pp. 779–788.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollar, Focal Loss for Dense Object Detection, in: 2017 IEEE International Conference on Computer Vision, Venice, IEEE, Piscataway, NJ, 2017, pp. 2999–3007.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] Z. Tian, C. Shen, H. Chen, T. He, FCOS: Fully Convolutional One-Stage Object Detection, in: Proceedings, 2019 International Conference on Computer Vision, Seoul, Korea (South), IEEE Computer Society, Conference Publishing Services, Los Alamitos, California, 2019, pp. 9626–9635.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] M. Tan, R. Pang, Q.V. Le, EfficientDet: Scalable and Efficient Object Detection, in: CVPR 2020: Computer Vision and Pattern Recognition, 2020, pp. 10781–10790.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] W. Yang, P. Luo, L. Lin, Clothing Co-parsing by Joint Image Segmentation and Labeling, in: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 3182–3189.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] J. Huang, R. Feris, Q. Chen, S. Yan, Cross-Domain Image Retrieval with a Dual Attribute-Aware Ranking Network, in: 2015 IEEE International Conference on Computer Vision, Santiago, Chile, IEEE, Piscataway, NJ, 2015, pp. 1062–1070.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] M.H. Kiapour, X. Han, S. Lazebnik, A.C. Berg, T.L. Berg, Where to Buy It: Matching Street Clothing Photos in Online Shops, in: 2015 IEEE International Conference on Computer Vision, Santiago, Chile, IEEE, Piscataway, NJ, 2015, pp. 3343–3351.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] Y. Ge, R. Zhang, X. Wang, X. Tang, P. Luo, DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, IEEE Computer Society, Los Alamitos, CA, 2019, pp. 5332–5340.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] S. Zheng, F. Yang, M.H. Kiapour, R. Piramuthu, ModaNet, in: Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, ACM, 2018, pp. 1670–1678.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] S. Bhandari, A. Kulikajevas, Ontology based image recognition: A review, CEUR Workshop Proceedings 2145 (2018).</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] Z. Ding, L. Yao, B. Liu, J. Wu, Review of the Application of Ontology in the Field of Image Object Recognition, in: Proceedings of the 11th International Conference on Computer Modeling and Simulation - ICCMS 2019, North Rockhampton, QLD, Australia, ACM Press, New York, NY, USA, 2019, pp. 142–146.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] A. Zambrano, C. Toro, C. Sanín, E. Szczerbicki, M. Nieto, R. Sotaquira, Video Semantic Analysis Framework based on Run-time Production Rules - Towards Cognitive Vision. https://doi.org/10.3217/jucs-021-06-0856.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, A. Torralba, Places: A 10 Million Image Database for Scene Recognition, IEEE Trans. Pattern Anal. Mach. Intell. 40 (2018) 1452–1464. https://doi.org/10.1109/TPAMI.2017.2723009.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] A. Checco, G. Demartini, A. Loeser, I. Arous, M. Khayati, M. Dantone, R. Koopmanschap, S. Stalinov, M. Kersten, Y. Zhang, FashionBrain Project: A Vision for Understanding Europe's Fashion Data Universe, 2017.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] A. Reiz, K. Sandkuhl, Design Decisions and Their Implications: An Ontology Quality Perspective, in: R.A. Buchmann, A. Polini, B. Johansson, D. Karagiannis (Eds.), Perspectives in Business Informatics Research, Springer International Publishing, Cham, 2020, pp. 111–127.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] X. Zhu, M. Bain, B-CNN: Branch Convolutional Neural Network for Hierarchical Classification, arXiv preprint arXiv:1709.09890 (2017).</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>[29] A. Reiz, Fashion-Ontology for the Connection of Semantic with Deep Learning, Rostock University, 2021. https://doi.org/10.5281/zenodo.4519359.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>