<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An interactive atomic-cluster watershed-based system for lifelog moment retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Van-Luon Tran</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Trong-Dat Phan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anh-Vu Mai-Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anh-Khoa Vo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Son Dao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Koji Zettsu</string-name>
          <email>zettsug@nict.go.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Institute of Information and Communications Technology</institution>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCMC</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we introduce a new interactive atomic-cluster watershed-based system for lifelog moment retrieval. We investigate three essential components that help improve accuracy and support both amateur and professional users in refining their queries based on different content and context hypotheses. These components are (1) the atomic cluster function, which clusters the dataset into sets of time-consecutive images that share the same content and context constraints, (2) the text-to-sample image generation, which helps to overcome the gap between users' textual queries and the visual-feature-vector database, and (3) the interactive interface, which assists users in better imagining what they are looking for. The system is customized to meet the lifelog moment retrieval challenge of imageCLEFlifelog2020. The evaluation and comparison of our method against others confirm the stability of our method when users want to retrieve a large number of results within the top 100 results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Finding a moment in our past with a few hints or cues is an activity we probably
carry out almost every day. Except for extraordinary people with a fantastic memory
who can recall every moment of their lives within a split second, ordinary people need
more time to narrow their search from a very abstract level down to details. The same
situation happens when people want to find a historical moment in their lifelog data.
Hence, if people have an interactive system that can help them turn their queries from
an amateur sketch into an artist's painting, they will retrieve their moment faster and
more precisely [1], [2].</p>
      <p>Besides, turning the few keywords and low-semantic content of users' text
queries into something that can be understood by a search engine is another
challenge [3]. There is still a big gap between the natural language spoken by
users and the machine language designed for search engines [4], which can hinder
improvements in accuracy. Feature selection is another factor that can assist in
bridging this gap [5], [6], and in supporting a well-organized dataset.</p>
      <p>Based on the discussion above, we design an interactive atomic-cluster
watershed-based system for lifelog moment retrieval. This system is customized to
meet the lifelog moment retrieval (LMRT) challenge of imageCLEFlifelog2020 [7], a
lab task of imageCLEF2020 [8].</p>
      <p>Our system's main contributions are:
1. We introduce the atomic cluster, a set of time-consecutive images that share
the same content and context constraints.
2. We build the text-to-sample image generation to overcome the gap between
users' textual queries and the visual-feature-vector database.
3. We create an interactive interface that helps users better imagine what they
are looking for.</p>
      <p>We organize this paper as follows: Section 2 describes our method in
detail, Section 3 discusses the challenge and evaluates our results, and Section 4
concludes our paper and sketches our future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Methodology</title>
      <p>The principal idea of our method is based on the following observations:
1. Daily activities of people can be divided into sequential atomic moments that
have a consensus of both content and context. In other words, lifelog data recorded
during a day can be divided into sequential atomic clusters whose content reflects a
unique semantic meaning with a consensus along the spatiotemporal dimension. These
atomic clusters cannot be divided into smaller clusters. Hence, if we can find one
image that matches the query (i.e., a seed), we can count in the atomic cluster the
image belongs to and the neighbors of that atomic cluster (i.e., watershed).
2. People can decide which data do not satisfy their queries. In other words,
people can remove irrelevant data and modify their queries to get more relevant
data. Hence, if we provide people a friendly interface for interactive querying,
they can improve the quality of the querying system.</p>
      <p>
        We call the system built upon these observations an interactive atomic-cluster
watershed system. The system has five vital components: (1) atomic-cluster clustering
(Cluster function), (2) text-to-sample image generation (Attention function),
(3) querying by text-to-sample images (Query function), (4) interaction (Interactive
function), and (5) querying by user's images (Query function). Algorithm 1 and
Figure 1 describe and illustrate how the system works, respectively.
There are several notations and feature vectors used throughout our system. We define them below:
- Let Q denote the query sentence.
- Let I = {I_i}, i = 1..N, denote the set of given images (e.g., the lifelog).
- Let C = {C_k} denote the set of atomic clusters.
- Let S denote the set of samples for each query, and S_i denote the status of S at time i.
- Let BoV = {V_i^k}, i = 1..N, k = 1..m, denote the set of feature vectors of objects extracted from I, where V_i^k denotes the feature vector of the k-th object of I_i and BoV_i denotes the set of all object vectors of I_i.
- Let BoVDB denote the database that stores all object vectors of all images in I.
- Let Seed and LMRT denote the set of seeds and the set of lifelog moments, respectively.
- Let Vo_i denote the 1024-D vector representation of the i-th object region in the photos.
- Let p_i denote the output vector of the i-th image.
- Let Vw_i denote the word embedding vector of the i-th word.
      </p>
      <p>Algorithm 1: Query-to-Sample Attention-based Search Engine</p>
      <p>In this subsection, we give a detailed explanation of our Interactive
Multimodal Lifelog Retrieval System, depicted in Figure 1 and described in Algorithm 1.</p>
      <p>
        There are two stages in our system's workflow: (1) the offline stage and (2) the online stage.
      </p>
      <p>The former stage is for data preprocessing. Firstly, we divide the lifelog images
into atomic clusters by utilizing the Clustering function, described in 2.4. Then, all
lifelog images are converted into Vsample by using the Feature Extraction (FE)
function, described in 2.3. In other words, Vsample contains the feature vectors extracted
from images. To make full use of FAISS [9], we embed these Vsample into a unified
database by applying FAISS's functions.</p>
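      <p>To make this indexing step concrete, the following minimal sketch (our own illustration in Python, assuming L2-normalized vectors and a flat inner-product index; the exact index type and dimensions used by our system may differ) builds the unified FAISS database over Vsample:</p>
      <preformat>
import numpy as np
import faiss

def build_index(feature_vectors):
    """Embed all Vsample vectors into a unified FAISS database (offline stage)."""
    vectors = np.asarray(feature_vectors, dtype="float32")
    faiss.normalize_L2(vectors)                # cosine similarity via inner product
    index = faiss.IndexFlatIP(vectors.shape[1])
    index.add(vectors)                         # row id corresponds to the vector id
    return index

# Hypothetical usage: 1024-D object vectors produced by the FE function
vsample = np.random.rand(1000, 1024).astype("float32")  # stand-in for real FE vectors
index = build_index(vsample)
      </preformat>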
      <p>The latter stage is for textual and visual querying. For textual querying, our
system activates the Attention function, described in 2.5, to generate sample images
from texts. Then, the sample images (and the input images, if users carry out visual
querying) are fed into the FE function to create the related Vsample. The Vsample
is used to find the most similar feature vectors in the FAISS-based database
with a predefined similarity threshold. Next, we enrich Vsample by adding these
found feature vectors and re-querying the FAISS-based database until no new
feature vectors are found. All images whose feature vectors appear in this
set are considered queried results and set as seeds. The final results are all
atomic clusters containing these seeds. Then, users use the interactive tools described
in 2.6 to polish the output until they receive the desired results.</p>
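      <p>The iterative enrichment loop can be sketched as follows (our simplified illustration, assuming a flat FAISS index that supports vector reconstruction and a fixed similarity threshold; these are assumptions, not the exact parameters of our system): query with the current sample vectors, add newly found neighbors above the threshold, and repeat until no new vectors appear. The matched image ids then become the seeds.</p>
      <preformat>
import numpy as np
import faiss

def expand_seeds(index, query_vectors, threshold=0.8, top_k=50):
    """Iteratively enrich Vsample with similar vectors until no new ones are found."""
    frontier = np.asarray(query_vectors, dtype="float32")
    faiss.normalize_L2(frontier)
    seeds = set()
    while len(frontier) > 0:
        scores, ids = index.search(frontier, top_k)      # query the FAISS database
        new_ids = {int(i) for srow, irow in zip(scores, ids)
                   for s, i in zip(srow, irow)
                   if i != -1 and s >= threshold and int(i) not in seeds}
        if not new_ids:
            break
        seeds.update(new_ids)                             # enrich Vsample
        frontier = np.vstack([index.reconstruct(i) for i in new_ids])
    return seeds  # ids of matched images; their atomic clusters form the final result
      </preformat>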
      <sec id="sec-2-1">
        <title>Feature Extraction</title>
        <p>- Vo_i is extracted by using an object detection model (Faster-RCNN with a
ResNet backbone) trained on a scaled Visual Genome dataset [10] (removing semantically
overlapping classes).
- p_i is extracted by utilizing the place detection model described in 2.4.
- Vw_i is built as follows: hidden-state 768-D vectors extracted from BERT [11]
are combined with one linear Conditional Random Field layer to construct a
seq2seq model [12] that outputs keywords (from a long input query sentence)
together with their representation vectors.</p>
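        <p>As a minimal sketch of obtaining the 768-D hidden-state vectors Vw_i (our illustration using the HuggingFace transformers library; the pretrained model name is an assumption, and the CRF layer and seq2seq keyword decoder are omitted):</p>
        <preformat>
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def word_vectors(query):
    """Return 768-D hidden-state vectors, one per token of the query sentence."""
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # shape: (num_tokens, 768)

vw = word_vectors("having beers in a bar")
        </preformat>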
      </sec>
      <sec id="sec-2-2">
        <title>Atomic-Cluster Clustering</title>
        <p>
          As mentioned in previous sections, an atomic cluster contains a set of
consecutive lifelog images (and related metadata) whose content reflects a particular
activity constrained by location, time, and semantic meaning. We cluster the whole
dataset into atomic clusters in two steps: (1) enhancing the quality of the metadata
and (2) clustering the multimodal data. The former applies a self-supervised
learning method to regenerate metadata: utilizing the SimCLR method [13], we
manually label place names for about 20k images and then train a new model
to label the remaining images in the dataset automatically. Finally, we strengthen
the metadata's location constraints by having more precise place names than the
original metadata. The latter step feeds the updated metadata and the feature vectors
extracted from images into the clustering method proposed in [14] to
form the atomic clusters.
        </p>
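        <p>The clustering method of [14] combines several modalities; purely as a rough, simplified sketch of the atomic-cluster idea, the snippet below starts a new cluster whenever the place label changes or the time gap exceeds a threshold. The field names and the five-minute gap are our assumptions, not parameters of the actual method.</p>
        <preformat>
from datetime import timedelta

def atomic_clusters(images, max_gap=timedelta(minutes=5)):
    """Group time-consecutive images sharing the same place label (simplified).

    `images` is a list of dicts with 'id', 'time' (datetime), and 'place' keys,
    already sorted by time; a real implementation would also use visual features.
    """
    clusters, current = [], []
    for img in images:
        if current and (img["place"] != current[-1]["place"]
                        or img["time"] - current[-1]["time"] > max_gap):
            clusters.append(current)
            current = []
        current.append(img)
    if current:
        clusters.append(current)
    return clusters
        </preformat>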
      </sec>
      <sec id="sec-2-3">
        <title>Text-to-sample Image Generation</title>
        <p>The essential idea of this function is to replace a textual query with a set of
visual queries. First, we create a dataset of objects using open image datasets
(e.g., COCO, image365). This means we have a set of object names, and each
object name links to a set of images that contain the object. Then, we parse a
textual query to extract the object names, which are replaced by the linked images.
Notably, we utilize the attention mechanism [15] to build our function, as described in
Algorithm 2. We first utilize the Top-Down Attention LSTM in a two-layer LSTM
model for captioning images from the feature vectors of regions detected by the
object detection model [16]. We then determine a useful feature transformation
from the word vector space to the visual space using a well-trained Bottom-Up
Attention model.</p>
        <p>Algorithm 2: Text-to-sample Image Generation
Input: word set {Word_i}, i = 1..M; object set {Obj_j}, j = 1..N
Output: {Word_i : Obj_j}, a map from each word to its most relevant object
1: {Vw_i}, i = 1..M ← WordEmb(Word)
2: {Vo_j}, j = 1..N ← FE(Obj)
3: Train the bottom-up attention model as in [15]
4: for all Vw_k do
5:   v̂_k ← Σ_{j=1..N} α_{k,j} v_j
6:   j0 ← argmax_j α_{k,j}
7:   v̂_k ← v_{j0}
8:   v̂_k is the optimized representation of Word_k in the visual space
9: end for
10: return {Word_i : Obj_j}, where i = 1..M, j = 1..N</p>
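        <p>As a minimal numerical sketch of steps 4-7 above (assuming the attention weights α_{k,j} are softmax-normalized dot products between projected word vectors and object vectors; the projection matrix W stands in for the trained bottom-up attention model and is our assumption):</p>
        <preformat>
import numpy as np

def map_words_to_objects(Vw, Vo, W):
    """Vw: (M, d_w) word vectors; Vo: (N, d_v) object vectors;
    W: (d_w, d_v) projection standing in for the trained attention model."""
    logits = (Vw @ W) @ Vo.T                            # (M, N) word-object scores
    alpha = np.exp(logits - logits.max(axis=1, keepdims=True))
    alpha = alpha / alpha.sum(axis=1, keepdims=True)    # attention weights alpha_{k,j}
    v_hat = alpha @ Vo                                  # step 5: attended vector per word
    j0 = alpha.argmax(axis=1)                           # step 6: most attended object
    v_hat = Vo[j0]                                      # step 7: snap to that object's vector
    return dict(enumerate(j0)), v_hat                   # word index to object index, plus vectors
        </preformat>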
      </sec>
      <sec id="sec-2-4">
        <title>Interaction</title>
        <p>After the first results are generated by the query-by-sample function, users
can filter the results using other metadata such as visible objects, places, and
time. These metadata are saved as text files and stored in PostgreSQL on the
Logic Server, as described in 2.7. Besides, users can re-query by manually
selecting samples from the results visualized on the system's interface or by adding
more query categories as text. Moreover, users can delete images they consider
inappropriate; these images are taken into account by the system and marked as
outliers or unnecessary items for the next query. Algorithm 3 explains how the
interaction works.</p>
        <p>Algorithm 3: Interactive Algorithm
Input: PostgreSQL database for metadata P, I ← LMRT
Output: I
1: while Interactive do
2:   I ← Remove(I) {continue if there is no removed image}
3:   F ← input filters
4:   I ← P.select(I, F) {continue if F is none}
5:   I ← I ∪ Re-query(input images or text)
6: end while
7: return I</p>
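        <p>A minimal sketch of one interaction round could look as follows (our illustration, assuming the metadata lives in a PostgreSQL table named metadata with image_id and place columns; table and column names are hypothetical): removed images are excluded, and the remaining ids are filtered by the user's criteria.</p>
        <preformat>
import psycopg2

def filter_results(conn, result_ids, removed_ids, place=None):
    """One interactive round: drop removed images, then apply an optional place filter."""
    kept = [i for i in result_ids if i not in set(removed_ids)]
    if place is None or not kept:
        return kept
    with conn.cursor() as cur:
        cur.execute(
            "SELECT image_id FROM metadata WHERE image_id = ANY(%s) AND place = %s",
            (kept, place),
        )
        return [row[0] for row in cur.fetchall()]

# conn = psycopg2.connect(dbname="lifelog")   # hypothetical connection
        </preformat>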
      </sec>
      <sec id="sec-2-5">
        <title>Interactive System Architecture</title>
        <p>To build a flexible system, we design our system following a three-tier,
three-layer architecture, depicted in Figure 2. The first layer is the presentation layer
on the User Client, the second is the logic layer on the Logic Server, and the last one
is the core layer on the Core Server.</p>
        <p>The first layer is a convenient web-based interface through which users
interact with our system. This interface can easily be installed on a wide range
of operating systems. It first allows users to type text queries, select filters, and
input sample images, which is a powerful way for users to describe which images
they would like to retrieve. Then, these data, along with the IDs of removed images
(in case users delete queried results from the previous interaction), are
pushed to the Logic Server. Next, the interface is responsible for presenting the
images sent back from the Logic Server. Before users re-query, they can modify their text
query, adjust filters, choose images from other sources, and remove unwanted
images. They can re-query until the presented images satisfy their demands. Finally,
users use the export function to download images or image IDs.</p>
        <p>
          At the second layer, the Logic Server is responsible for processing requests
from the User Client. Firstly, this server converts the query into a suitable form and
sends it to the Core Server. Then, the Logic Server receives the results, consisting of
image IDs and the IDs of the related atomic clusters. The result is saved directly to
the Cache, a temporary memory on the Logic Server. In the following steps, depending
on the type of filter, this server applies the filters either to the whole dataset or only
to the results stored in the Cache. There are two types of filter: (1) the Extend Filter
and (2) the Narrow Filter. With the former, the Logic Server finds all images whose
metadata match the filter before adding these images' IDs to the Cache. With the
latter, from the IDs in the Cache, the Logic Server selects the images whose metadata
fit the filter. Finally, the server returns the filtered images and ranked clusters
to the User Client.
        </p>
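        <p>The sketch below contrasts the two filter types under our own simplifying assumption of in-memory metadata (rather than the actual Cache and PostgreSQL setup): an Extend Filter searches the whole dataset and adds matching ids to the cache, while a Narrow Filter keeps only cached ids whose metadata fit.</p>
        <preformat>
def apply_filter(cache_ids, all_metadata, predicate, mode):
    """all_metadata: dict image_id to metadata dict; predicate: metadata to bool."""
    if mode == "extend":
        # Extend Filter: search the whole dataset and add matching ids to the Cache
        matches = {i for i, m in all_metadata.items() if predicate(m)}
        return set(cache_ids) | matches
    if mode == "narrow":
        # Narrow Filter: keep only cached ids whose metadata fit the filter
        return {i for i in cache_ids if predicate(all_metadata[i])}
    raise ValueError("mode must be 'extend' or 'narrow'")
        </preformat>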
        <p>In terms of the Core Server, it receives input from the Logic Server and sends the
results back after processing is complete. The Core Server is an always-on server where
the AI components are deployed.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>In this section, we present our system's experimental results when applied to the
CLEF2020 lifelog dataset.</p>
      <sec id="sec-3-1">
        <title>Dataset and Evaluation Metrics</title>
        <p>
          The CLEF2020 dataset was captured by one active lifelogger over 114 days
between 2015 and 2018. It contains not only over 191,000 lifelog images but also
metadata, including visual concepts, attributes, and semantic content, to name a
few. The training set has ten topics, and each topic is described by a title and a
description. These titles are: (1) Having beers in a bar, (2) Building Personal
Computer, (3) In A Toy Shop, (4) Television Recording, (5) Public Transport In
Home Country, (6) Seaside Moments, (7) Grocery Stores, (8) Photograph of The
Bridge, (9) Car Repair, (10) Monsters. The topic descriptions explain
in detail the content and context of each query. Similar to the training
set, the testing set has ten topics, which are: (1) Praying Rite, (2) Recall, (3)
Bus to work - Bus to home, (4) Bus at the Airport, (5) Medicine cabinet, (6)
Order Food in the Airport, (7) Seafood at Restaurant, (8) Meeting with people,
(9) Eating Pizza, (10) Socialising.
        </p>
        <p>The evaluation metrics are defined by ImageCLEFlifelog 2020 as follows:
- Cluster Recall at X (CR@X): a metric that assesses how many different
clusters from the ground truth are represented among the top X results;
- Precision at X (P@X): measures the number of relevant photos among the
top X results;
- F1-measure at X (F1@X): the harmonic mean of the previous two.</p>
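        <p>For clarity, these metrics can be computed as in the following sketch (our own illustration, not the official evaluation script); image_to_cluster maps each ground-truth image to its cluster id and is an assumed structure:</p>
        <preformat>
def precision_at_x(retrieved, relevant, x):
    """P@X: fraction of relevant photos among the top X results."""
    top = retrieved[:x]
    return sum(1 for img in top if img in relevant) / float(x)

def cluster_recall_at_x(retrieved, image_to_cluster, gt_clusters, x):
    """CR@X: fraction of ground-truth clusters represented in the top X results."""
    found = {image_to_cluster[img] for img in retrieved[:x] if img in image_to_cluster}
    return len(found.intersection(gt_clusters)) / float(len(gt_clusters))

def f1_at_x(retrieved, relevant, image_to_cluster, gt_clusters, x):
    """F1@X: harmonic mean of P@X and CR@X."""
    p = precision_at_x(retrieved, relevant, x)
    cr = cluster_recall_at_x(retrieved, image_to_cluster, gt_clusters, x)
    return 0.0 if p + cr == 0 else 2 * p * cr / (p + cr)
        </preformat>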
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation and Comparison</title>
        <p>
          The ImageCLEFlifelog challenge has five participating teams: (1) RRibeiro,
(2) FatmaBA RegimLab, (3) DCU Team, (4) BIDAL HCMUS (ourselves), and (5)
HCMUS. We are ranked in the second position. Tables 1 and 2 show our results
on the training and testing sets, while Tables 3 and 4 present the comparison
to the other teams. Figures 3-12 illustrate our results for the testing stage.
        </p>
        <p>When comparing the results evaluated by the F1@10 and F1@50 metrics, we
found that our scores fluctuate less than the others' (some other teams show
a massive reduction in their scores), as described in Tables 3 and 4. This suggests
that our proposed method is stable, especially when the user wants to retrieve a
large number of images.</p>
        <p>In some queries, we have worse scores because we misunderstood the content
and context of the queries. For instance, query 5 has the title 'Medicine cabinet'
and the description 'Find the moment when u1 was looking inside the medicine
cabinet in the bathroom at home'. We were very confused when trying to confirm whether
the lifelogger really looked inside the medicine cabinet or only appeared nearby (i.e., the
medicine cabinet is captured by the lifelog camera, but u1 does not look at it).
The result of query 5 is shown in Figure 7.</p>
        <p>Furthermore, we found that the ground truth could have some incorrect
points. We have verified with the organizers that the ground truth might not
be precise. For example, the image ID b00000986 21i6bq 20150225 161718e (in
query 9) and the image ID 20160904 120624 000 (in query 5) should have been
in the ground truth. Figures 7 and 11 illustrate the results of queries 5 and 9,
where the red rectangle denotes the mentioned images. That probably makes our
results less precise than we expected.</p>
        <p>Fig. 7: The top ten results of query 5 "Medicine cabinet" (F1@10 = 0.74)</p>
        <p>Fig. 10: The top ten results of query 8 "Meeting with people" (F1@10 = 0.75)</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>We introduced a new interactive atomic-cluster watershed-based system for
lifelog moment retrieval. The system is specially customized to meet the
requirements of the imageCLEFlifelog2020 challenge. The system first indexes the
database into atomic clusters that contain similar data according to our
similarity measure; the idea behind the atomic clusters is that whenever one image
is found, its whole atomic cluster counts in. We store the feature vectors extracted from
the data in a FAISS database for further querying, and we convert all textual queries into
visual queries using the attention mechanism approach. The system provides
a friendly interactive interface that allows users to select precise results and
re-query with modifications. Our results are evaluated and compared to the other
participants' with positive accuracy. In the future, we will investigate the atomic clustering
function to improve the consensus and compactness of atomic clusters.
Moreover, we will consider wrapping spatiotemporal information into the
querying engine by strengthening the semantic constraints. Last but not least, we will
focus on feature engineering and similarity measures to achieve higher querying
accuracy.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgement</title>
      <p>This research is conducted under the Collaborative Research Agreement
between the National Institute of Information and Communications Technology and the
University of Science, Vietnam National University at Ho-Chi-Minh City.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>S.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Park</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y. M.</given-names>
            <surname>Ro</surname>
          </string-name>
          , \Ivist:
          <article-title>Interactive video search tool in vbs 2020,"</article-title>
          <source>in International Conference on Multimedia Modeling</source>
          . Springer,
          <year>2020</year>
          , pp.
          <volume>809</volume>
          {
          <fpage>814</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>B.</given-names>
            <surname>Jonsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. S.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Koelma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rudinac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zahalka</surname>
          </string-name>
          , \
          <article-title>Exquisitor at the video browser showdown 2020,"</article-title>
          in MultiMedia Modeling,
          <string-name>
            <given-names>Y. M.</given-names>
            <surname>Ro</surname>
          </string-name>
          , W.-H. Cheng, J. Kim, W.-T. Chu,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-W.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.-C. Hu</surname>
          </string-name>
          , and W. De Neve, Eds. Cham: Springer International Publishing,
          <year>2020</year>
          , pp.
          <volume>796</volume>
          {
          <fpage>802</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>S.</given-names>
            <surname>Andreadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moumtzidou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Apostolidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gkountakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Galanopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Michail</surname>
          </string-name>
          , I. Gialampoukidis,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vrochidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          ,
          <string-name>
            <surname>and I. Kompatsiaris</surname>
          </string-name>
          , \Verge in vbs
          <year>2020</year>
          ,
          <article-title>"</article-title>
          <source>in International Conference on Multimedia Modeling</source>
          . Springer,
          <year>2020</year>
          , pp.
          <volume>778</volume>
          {
          <fpage>783</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          , \
          <article-title>Spatial keyword search: a survey,"</article-title>
          <source>Geoinformatica</source>
          , vol.
          <volume>24</volume>
          , p.
          <volume>85</volume>
          {
          <issue>106</issue>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and X.</given-names>
            <surname>Wei</surname>
          </string-name>
          , \
          <article-title>Feature selection with multi-view data: A survey,"</article-title>
          <source>Information Fusion</source>
          , vol.
          <volume>50</volume>
          , pp.
          <volume>158</volume>
          {
          <issue>167</issue>
          ,
          <year>2019</year>
          . [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1566253518303841
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>R.</given-names>
            <surname>Sheikhpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Sarram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gharaghani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. A. Z.</given-names>
            <surname>Chahooki</surname>
          </string-name>
          , \
          <article-title>A survey on semi-supervised feature selection methods,"</article-title>
          <source>Pattern Recognition</source>
          , vol.
          <volume>64</volume>
          , pp.
          <volume>141</volume>
          {
          <issue>158</issue>
          ,
          <year>2017</year>
          . [Online]. Available: http://www.sciencedirect.com/science/ article/pii/S0031320316303545
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. V.-T. Ninh,
          <string-name>
            <surname>T.-K. Le</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          l Halvorsen, M.-
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gurrin</surname>
          </string-name>
          , and D.
          <string-name>
            <surname>-T.</surname>
          </string-name>
          Dang-Nguyen, \Overview of ImageCLEF Lifelog 2020:
          <article-title>Lifelog Moment Retrieval and Sport Performance Lifelog," in CLEF2020 Working Notes, ser</article-title>
          .
          <source>CEUR Workshop Proceedings</source>
          . Thessaloniki, Greece: CEURWS.org &lt;http://ceur-ws.
          <source>org&gt;</source>
          ,
          <source>September</source>
          <volume>22</volume>
          -25
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , H. Muller,
          <string-name>
            <given-names>R.</given-names>
            <surname>Peteri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Datla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          , D. DemnerFushman,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kozlovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Liauchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. D.</given-names>
            <surname>Cid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
          </string-name>
          , V.-T. Ninh,
          <string-name>
            <surname>T.-K. Le</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          l Halvorsen, M.-
          <string-name>
            <given-names>T.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gurrin</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.-T.</surname>
            Dang-Nguyen,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chamberlain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Campello</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Fichou</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Berari</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Brie</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Dogariu</surname>
            ,
            <given-names>L. D.</given-names>
          </string-name>
          <string-name>
            <surname>Stefan</surname>
            , and
            <given-names>M. G.</given-names>
          </string-name>
          <string-name>
            <surname>Constantin</surname>
          </string-name>
          , \
          <article-title>Overview of the ImageCLEF 2020: Multimedia retrieval in lifelogging, medical, nature, and internet applications," in Experimental IR Meets Multilinguality, Multimodality, and Interaction, ser</article-title>
          .
          <source>Proceedings of the 11th International Conference of the CLEF Association (CLEF</source>
          <year>2020</year>
          ), vol.
          <volume>12260</volume>
          .
          <string-name>
            <surname>Thessaloniki</surname>
          </string-name>
          ,
          <source>Greece: LNCS Lecture Notes in Computer Science</source>
          , Springer, September
          <volume>22</volume>
          -25
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , M. Douze, and
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          , \
          <article-title>Billion-scale similarity search with gpus,"</article-title>
          <source>arXiv preprint arXiv:1702.08734</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kravitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kalantidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          , and L.
          <string-name>
            <surname>Fei-Fei</surname>
          </string-name>
          , \
          <article-title>Visual genome: Connecting language and vision using crowdsourced dense image annotations,"</article-title>
          <year>2016</year>
          . [Online]. Available: https://arxiv.org/abs/1602.07332
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>J. Devlin</surname>
            , M.-
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            , and
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Toutanova</surname>
          </string-name>
          , \Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding,"</article-title>
          arXiv preprint arXiv:
          <year>1810</year>
          .04805,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. I.
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Vinyals</surname>
            , and
            <given-names>Q. V.</given-names>
          </string-name>
          <string-name>
            <surname>Le</surname>
          </string-name>
          , \
          <article-title>Sequence to sequence learning with neural networks,"</article-title>
          <source>in Advances in neural information processing systems</source>
          ,
          <year>2014</year>
          , pp.
          <volume>3104</volume>
          {
          <fpage>3112</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. T. Chen,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          , and G. Hinton, \
          <article-title>A simple framework for contrastive learning of visual representations,"</article-title>
          arXiv preprint arXiv:
          <year>2002</year>
          .05709,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. T. Phan,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Zettsu</surname>
          </string-name>
          , \
          <article-title>An interactive watershed-based approach for lifelog moment retrieval,"</article-title>
          <source>in 2019 IEEE Fifth International Conference on Multimedia Big Data (BigMM)</source>
          ,
          <source>Sep</source>
          .
          <year>2019</year>
          , pp.
          <volume>282</volume>
          {
          <fpage>286</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>15. P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang, "Bottom-up and top-down attention for image captioning and visual question answering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077-6086.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>16. S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91-99.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>