<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>VC-I2R@ImageCLEF2017: Ensemble of Deep Learned Features for Lifelog Video Summarization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ana Garcia del Molino</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bappaditya Mandal</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jie Lin</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joo Hwee Lim</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vigneshwaran Subbaraju</string-name>
          <email>Subbaraju@sbic.a-star.edu.sg</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vijay Chandrasekhar</string-name>
          <email>vijayg@i2r.a-star.edu.sg</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>STAR</institution>
          ,
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science and Engineering</institution>
          ,
          <addr-line>NTU</addr-line>
          ,
          <country country="SG">Singapore</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Singapore Bioimaging Consortium</institution>
          ,
          <addr-line>A</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Visual Computing Department, Institute for Infocomm Research</institution>
          ,
          <addr-line>A</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we describe our approach to the ImageCLEFlifelog summarization task. A total of ten runs were submitted, using only visual features, only metadata information, or both. In the first step, a set of relevant frames is drawn from the whole lifelog. Such frames must be of good visual quality and match the given task semantically. For the automatic runs, this subset of images is clustered into events, and the key-frames are selected from the clusters iteratively. In the interactive runs, the user can select which frames to keep or discard in each interaction, and the clustering is adapted accordingly. We observe that which features are most relevant depends on the context and the nature of the input lifelog.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>With the rising availability of affordable wearable recording devices in the market
(e.g. SenseCam or Narrative Clip), as well as the presence of countless mobile
apps for fitness and lifestyle tracking, one may resort to personal lifelogging
solutions to create memory collections or monitor their own life. However, little
support is available for browsing such digital memories, and as a result,
our phones and computers can get filled with personal information we may never
revisit or analyze.</p>
      <p>
        To solve this problem, ImageCLEF LifeLog Task [
        <xref ref-type="bibr" rid="ref13 ref6">6, 13</xref>
        ] aims to bring the
attention of researchers from diverse fields to study, evaluate and propose new
methodologies to address the challenging problems in lifelog video
summarization tasks. This rigorous comparative benchmarking would help the various
research groups evaluate their existing methodologies against each other on a
common platform, and would also spur new thinking for solving long-standing key
problems. In the following section we discuss these key challenges and related
work in the literature.
      </p>
      <p>Fig. 1: Flow of the proposed approaches. Quality assessment combines color diversity, edge and blurriness scores. Relevance to task uses objects, places and people (CNN) and location and activity (metadata), windowed at ±0', ±1' and ±2'. For diversity, the automatic pipeline clusters with k-means (images sorted by distance to the cluster center) or a hierarchical tree (images sorted by relevance), while the interactive pipeline re-clusters after each user iteration; key-frame selection then takes the top nk images per cluster.</p>
      <p>
Summarization of egocentric videos has become a problem of much interest. In a
recent comprehensive survey on the summarization of high temporal resolution
videos [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], we note the importance of context dependency to generate better
summaries. For low resolution videos, two recent surveys review the methods for
better summarizing Lifelogs for memory augmentation [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and storytelling [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
The authors of [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] observe that our memory recall accuracy is directly related
to how different that episode is from the rest of our memories.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] four steps in the process are identified: informative picture filtering
(removal of blurred or dark images and those with useless content [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]), episode
segmentation (using low-level features with k-means, eigenvalues or graphs, and
time-dependent methods [
        <xref ref-type="bibr" rid="ref5 ref9">5, 9</xref>
        ]), summarization (based on representativeness, or
relying in the presence of important people/objects [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) and retrieval (by means of
encoded context, people, objects or activities [
        <xref ref-type="bibr" rid="ref10 ref15">10, 15</xref>
        ]).
      </p>
      <p>All the aforementioned surveys conclude that richer semantic-level features
are needed to encode the different episodes. Moreover, the key-frames included
in a summary should be diverse, informative, and good memory triggers.</p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Methodologies</title>
      <p>We compare a series of summarization methods that (i) filter out uninformative
images; (ii) rank the remaining images according to how well they match the
given query; (iii) cluster the top-ranked images into a series of events; and (iv)
select, in an iterative manner, as many images per cluster as needed to fill the length
budget. Fig. 1 shows the flow of our proposed approaches.</p>
      <p>Fig. 2: Example summary: (a) input stream over the full month, (b) relevant images (non-relevant images filtered out, shown in dark grey), (c) final summary.</p>
      <p>
The incoming frame stream is pre-processed to evaluate the quality and
informativeness of the images. All frames below a certain quality threshold are then
discarded. The quality rate is obtained by combining the following scores:
Blurriness assessment. We use two different methodologies:</p>
      <p>Modified Laplacian: This method applies a non-linear filter operation on an
image and filters out the prominent edges from the input image using a Gaussian
kernel along the x- and y-directions. The edge score is taken as the mean value
of all absolute values.</p>
      <p>Variance of Laplacian: The input image is convolved with the Laplacian
operator (3×3 kernel). Then, the variance (i.e. standard deviation squared) of
the response is computed. All values over a certain threshold are averaged to
obtain a blurriness score.</p>
      <p>Color diversity. Images with very homogeneous colors are deemed to be
uninformative, since this usually means there are few different objects in the image.
We compute the histograms of the quantized RGB values, and find the color
diversity score based on the frequency of the predominant color.</p>
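      <p>A sketch of this color-diversity score (our own illustration; the number of quantization bins per channel is an assumption, as the paper does not specify it):</p>

```python
import numpy as np

def color_diversity(rgb, bins_per_channel=4):
    """Diversity score from the quantized-RGB histogram: 1 minus the
    relative frequency of the predominant quantized color."""
    rgb = np.asarray(rgb)
    # Quantize each 0-255 channel into `bins_per_channel` levels.
    q = (rgb // (256 // bins_per_channel)).reshape(-1, 3).astype(np.int64)
    # One histogram bin per quantized (r, g, b) triple.
    idx = (q[:, 0] * bins_per_channel + q[:, 1]) * bins_per_channel + q[:, 2]
    counts = np.bincount(idx, minlength=bins_per_channel ** 3)
    return 1.0 - counts.max() / counts.sum()
```

      <p>A single-color image scores 0 (fully uninformative), while an image whose pixels spread over many quantized colors scores close to 1.</p>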
      <p>Task Relevance Retrieval. Since the summaries are query-driven, we first need to evaluate the relevance of
each image to the given task. For each task, we define a set of objects and places
to be found or avoided in the target images (as listed in Table 1). Additionally,
from the location and activity information available in the metadata, we define
the relevant locations and activities, and the ones to strictly avoid.</p>
      <p>Fig. 3: Retrieval performance on the training data for several parameter configurations,
where q = quality threshold, w = [wcoco, wobjy, wobjn, wply, wpln, wloc, wact, wppl],
and win = size of the smoothing window. (Best viewed in color.)</p>
      <sec id="sec-2-1">
        <title>Objects and Places</title>
        <p>Table 1: For each task, use of WordNet to find all related (and to avoid) ImageNet classes, manual selection of relevant (and to avoid) places classes, and objects to detect. Relevant MSCOCO objects include, e.g., laptop, keyboard, tv, remote, fork, sandwich, bottle, wine glass, bus, train, oven and refrigerator; relevant places include, e.g., living room, coffee shop, restaurant, bar, pub, bus interior, subway station, pantry, kitchen, store and supermarket; classes to avoid include, e.g., office, conference room, lecture room, car interior, taxi and shopping mall.</p>
      </sec>
      <sec id="sec-2-3">
        <p>For each image, a relevance score is given by the presence of such key objects,
places, locations and activities, and, for tasks In a meeting
and Social drinking, the number of people. To fuse all these aspects, each one is given a different weight,
as described in section 3.1. The N images with the highest relevance scores are drawn
and used in the following steps.</p>
        <p>
          Image Features Used: Image understanding is crucial for lifelog data analysis:
knowing what objects are present in the images and where the images were taken
can link lifelog images to certain topics/events. Here, our objective is
to estimate the "what" and "where" of the lifelog images, using deep convolutional
neural networks (DCNN) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>
          Objects and Places: We use DCNN respectively trained on ImageNet1K [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and
Places365 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to identify objects and places depicted in the Lifelog images. The
ImageNet1K training set contains 1000 object categories from WordNet and 1.2
million images, the Places365 train set has 365 place categories and around 1.8
million images. A separate ResNet152 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] is pre-trained on each dataset (termed
as ResNet152-ImageNet1K and ResNet152-Places365). At test time, by passing a
lifelog image through ResNet152-ImageNet1K (ResNet152-Places365), a 1000-D
(365-D) probability vector is extracted from the last layer (after Softmax). As
in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], data augmentation is performed to generate scaled and rotated versions
for each lifelog image. The maximum activation value (instead of the average) is
chosen for each class. These probability vectors serve as object and place features
for the retrieval stage.
        </p>
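        <p>The max-over-augmentations fusion reduces to a per-class maximum over the probability vectors of the augmented copies (a minimal sketch; generating the scaled and rotated copies and the network forward pass are assumed to have produced the probability vectors already):</p>

```python
import numpy as np

def fuse_augmented_probs(prob_vectors):
    """Fuse (n_augmentations, n_classes) softmax outputs into one descriptor.

    The per-class maximum (rather than the average) is kept, so a class that
    fires strongly on any scaled or rotated copy of the image survives into
    the final object/place feature."""
    return np.asarray(prob_vectors, dtype=np.float64).max(axis=0)

# Two augmented copies, three classes: class 0 fires only on the second copy.
aug = np.array([[0.1, 0.7, 0.2],
                [0.8, 0.1, 0.1]])
fused = fuse_augmented_probs(aug)  # per-class maxima: [0.8, 0.7, 0.2]
```
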
        <p>
          Besides image-level object recognition, we also perform object detection to
locate objects in lifelog images. A Faster R-CNN [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] is pre-trained on MSCOCO [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]
training dataset, containing bounding box annotations for more than 200K
images over 80 object categories. Most of the categories are common in lifelog
images (e.g. laptop and tv). Given a lifelog image as test input to Faster R-CNN,
we compute the maximum probability for each category over the top 20 detections
(a number chosen empirically). The maximum probabilities serve as detection features for the
subsequent relevance stage.
        </p>
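        <p>Turning the top detections into a fixed-length feature can be sketched as follows (our own illustration; representing detections as (class_id, probability) pairs is an assumed interface, not the actual Faster R-CNN output format):</p>

```python
import numpy as np

def detection_features(detections, n_classes=80, top_k=20):
    """Build an n_classes-dim feature from object detections.

    Only the top_k highest-probability detections are kept; the feature
    stores the maximum probability observed per MSCOCO category (0 when
    the category is never detected)."""
    feat = np.zeros(n_classes)
    top = sorted(detections, key=lambda d: d[1], reverse=True)[:top_k]
    for class_id, prob in top:
        feat[class_id] = max(feat[class_id], prob)
    return feat
```
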
        <p>Human Detections &amp; Counting: Detecting and counting the number of persons
in an image may provide vital information which may be useful to determine
the relevance of the image for a particular query. The most popular method
for detecting people in an image uses the histogram-of-oriented-gradients (HOG)
approach. We tried this approach; however, due to the lack of a sufficient and
representative number of training samples from this database, and the huge
manual effort their collection would involve, good training could not be achieved.</p>
        <p>
          Several commercial entities also provide cloud-based APIs that perform the
task of detecting and counting the humans in an image. Many of these entities
use proprietary deep learning based approaches to perform the task of human
detection. We selected the person detection API provided by Sighthound, Inc.
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], as it provided convenient features such as detecting and counting people, as
well as providing the coordinates of the bounding box of the detected people.
The performance of the Sighthound API on several computer vision tasks on
benchmark datasets has been studied and a superior performance has been
reported [
          <xref ref-type="bibr" rid="ref18 ref7">7, 18</xref>
          ]. The pre-trained model used by this API requires that the person
in the image occupy at least 96×40 pixels for upper/full-body detection
and at least 72×64 pixels for head and shoulders. In general, we found the API
to be more accurate on images with good lighting conditions and when the head
and shoulders of the person are clearly visible.
        </p>
        <p>Event Clustering. The N images with the best relevance scores are then described with the
concatenation of all the available features. Each feature (deep learned features, locations,
activities, day and time) is given a weight which can be 0, 1, the feature
frequency score tf (for the deep learned features), or the inverse of its maximum
(for the metadata).</p>
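        <p>The weighted concatenation can be sketched as below (a simplified illustration with only deep features, location and activity; the function name and the per-image id inputs are ours):</p>

```python
import numpy as np

def build_descriptors(deep_feats, location_ids, activity_ids, w_deep=1.0):
    """Concatenate weighted features into one descriptor per image.

    Deep-learned features are scaled by a single weight here (the tf-style
    frequency score in the paper); each metadata column is scaled by the
    inverse of its maximum so no id range dominates the clustering distance."""
    deep = np.asarray(deep_feats, dtype=np.float64) * w_deep
    loc = np.asarray(location_ids, dtype=np.float64)
    act = np.asarray(activity_ids, dtype=np.float64)
    if loc.max() > 0:
        loc = loc / loc.max()
    if act.max() > 0:
        act = act / act.max()
    return np.hstack([deep, loc[:, None], act[:, None]])
```
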
        <p>The images are then clustered into events using either k-means or a
hierarchical tree.</p>
        <p>
          Image Features Used:
Objects and Places: Apart from the object and place features described in
section 2.2, we also test describing the images with low-level deep descriptors.
Using the widely used VGG16 architecture [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] pre-trained on ImageNet1K data
set, we extract 512 feature maps from the last pooling layer (i.e. pool5) for
each augmented image, followed by nested invariance pooling over all feature
maps [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. This results in a 512-D global descriptor representing each lifelog
image. Post-processing techniques (e.g. PCA whitening) can be applied to further
enhance the discriminative power of the pooled descriptors.
        </p>
        <p>Locations and Activities: Each location in the user's lifelog is given a unique id.
The same process is done for the activity tag.</p>
        <p>Day and Time: From the image TimeStamp, we extract the day of the month
and the hour it was taken. Alternatively, we quantize the hour into morning,
afternoon, evening or night.</p>
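        <p>The day-and-time feature can be sketched as follows (the day-part boundaries below are our assumption; the paper does not list them):</p>

```python
from datetime import datetime

def time_features(timestamp, quantize=True):
    """Extract (day of month, hour) from an image timestamp; optionally
    quantize the hour into morning / afternoon / evening / night."""
    day, hour = timestamp.day, timestamp.hour
    if not quantize:
        return day, hour
    if 6 <= hour < 12:
        part = "morning"
    elif 12 <= hour < 18:
        part = "afternoon"
    elif 18 <= hour < 22:
        part = "evening"
    else:
        part = "night"
    return day, part
```
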
        <p>Key-Frame Selection. To select the key-frames, all frames in each cluster c_i = {f_ik} are ranked
according to their distance to the cluster center (for k-means clustering) or their relevance
score (for hierarchical trees), and the summary is initialized
empty, S = []. The following process is repeated iteratively until reaching the
desired summary length X: the first available image in each cluster is selected
to be part of the final summary, s = {f_ik | ∀i, k = 0}, and discarded from the
bag of available frames. Then, the selection s is sorted according to each frame's
relevance score, so that the most relevant come first in the generated summary,
and the sorted sequence is appended at the end of the summary S. Note that,
to ensure that each event is represented even when selecting a summary shorter than
X, the sequence to be sorted is the newly drawn s, and never S. If, at the end
of the drawing process, the cardinality of S is greater than X, the last elements of
S are discarded.</p>
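        <p>The iterative selection above can be sketched as a round-robin draw over the ranked clusters (our own illustration of the described procedure; the data-structure choices are ours):</p>

```python
def select_keyframes(clusters, relevance, X):
    """Round-robin key-frame selection over ranked clusters.

    clusters: cluster id -> list of frame ids, already ranked (by distance
    to the cluster center for k-means, by relevance for hierarchical trees).
    In each pass the first remaining frame of every cluster is drawn, the
    drawn batch (never the whole summary) is sorted by relevance, and the
    batch is appended to the summary, so every event is represented even if
    the summary is cut short. The result is truncated to the budget X."""
    queues = [list(frames) for frames in clusters.values()]
    summary = []
    while len(summary) < X and any(queues):
        batch = [q.pop(0) for q in queues if q]
        batch.sort(key=lambda f: relevance[f], reverse=True)
        summary.extend(batch)
    return summary[:X]
```
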
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>For this task, we submit two different sets of runs: automatic and interactive.
In the interactive runs, we give the user the opportunity to remove, replace and
add frames from the automatically generated summary. The bag of frames
from which the user can replace images is the same set of relevant images as
in the automatic approach. In this section, we first present the parameters
used to find the set of relevant images, and then explain in more detail the two
types of submitted runs.</p>
      <p>
        Learning the Best Parameters. We use the development set [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to train the relevance weights and select the best
features to fuse into the image descriptor. For the five test tasks that do not
match any of the development ones, we have split the test set into two parts,
and used the smaller of them as training data. All parameters are learned for
each run and task separately.
      </p>
      <p>Quality Filtering. For most tasks, the quality threshold is defined by the 35th
or 50th percentile, depending on whether the run uses only image features, or also
metadata (note that since the quality assessment is visual, the threshold is set
to zero when only using metadata). For task 6 (Social Drinking), where images
are usually taken in dark places and are thus of bad quality, the threshold is set
at the 10th percentile.</p>
      <p>Computing the Relevance Score. The relevance score is a fusion of three
modules: the visual score, obtained from the DCNN activations; the location
and activity relevance from tags; and the locations and activities to remove.</p>
      <p>
        The visual relevance score for each frame is the dot product between its
descriptor and the reference query descriptor. For this purpose, the frame
descriptor is defined as the 1365-D vector of activations. To define the reference
query descriptor, relevant object classes are found by using the WordNet [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
structure on two or three key concepts, and places are selected manually
(Table 1). The reference query descriptor is initialized to zeros. Then, all classes
present among the wanted objects are given a constant value wobjy, the objects
to avoid are set to wobjn, and the same is done for places. Additionally, we
perform object detection as described in section 2.2, and a weight wcoco is applied
over the score of the relevant items.</p>
      <p>Fig. 4: Average F1 (*) and Precision (dotted) for all submitted runs: (a) average for all tasks, and per-task results (Task 1 to Task 10).</p>
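      <p>The reference query descriptor and the dot-product relevance can be sketched as follows (a minimal illustration; the class-index sets and weight values are hypothetical inputs, and the avoid weights would typically be negative):</p>

```python
import numpy as np

# Dimensions from the paper: 1000 ImageNet object classes + 365 Places classes.
N_OBJ, N_PLACE = 1000, 365

def query_descriptor(obj_yes, obj_no, place_yes, place_no,
                     w_objy, w_objn, w_ply, w_pln):
    """1365-D reference query descriptor: zeros everywhere except the
    wanted/avoided object and place classes, set to constant weights."""
    ref = np.zeros(N_OBJ + N_PLACE)
    ref[list(obj_yes)] = w_objy
    ref[list(obj_no)] = w_objn
    ref[[N_OBJ + p for p in place_yes]] = w_ply
    ref[[N_OBJ + p for p in place_no]] = w_pln
    return ref

def visual_relevance(frame_activations, ref):
    """Dot product between a frame's 1365-D activations and the query."""
    return float(np.dot(frame_activations, ref))
```
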
      <p>Next, a value wloc is added to the relevance score of all the images
with a relevant location label, and the score of frames matching the relevant
activity is increased by wact.</p>
      <p>Finally, all the frames with location or activity label to avoid are given a
relevance score of 0, and thus removed from the pool of frames.</p>
      <p>Additionally, for tasks 1 (In a meeting ) and 6 (Social Drinking ), where the
presence of other people is a task-relevance indicator, we count people as
described in section 2.2, and increase the relevance score of those images with
enough people by wppl. We observe that, given that relevant images in task 1
have many occlusions, and that images in task 6 have poor lighting conditions,
the performance of the people detector for such tasks is not good enough.</p>
      <p>The final relevance value is smoothed using a triangular window of size win,
which ranges between 1 (0 extra frames) and 11 (5 frames to each side). The
optimal values of wcoco, wobjy, wobjn, wply, wpln, wloc, wact, wppl and win are
found heuristically by analyzing the retrieval performance on the training data,
as shown in Fig. 3. The objective is the best recall at X = 400, so as to have the greatest
number of events brought forward for the next step in the summarization process.</p>
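      <p>The triangular smoothing can be sketched as a normalized triangular kernel convolved over the per-frame scores (our own illustration; the edge padding at the sequence boundaries is an assumption):</p>

```python
import numpy as np

def smooth_relevance(scores, win):
    """Smooth per-frame relevance with a triangular window of odd size win
    (win = 1 leaves the scores untouched; win = 11 spans 5 frames per side)."""
    scores = np.asarray(scores, dtype=np.float64)
    if win <= 1:
        return scores
    half = win // 2
    # Triangular weights, e.g. win = 3 -> [0.5, 1.0, 0.5], normalized to sum 1.
    kernel = 1.0 - np.abs(np.arange(-half, half + 1)) / (half + 1)
    kernel /= kernel.sum()
    padded = np.pad(scores, half, mode="edge")
    return np.convolve(padded, kernel, mode="valid")
```
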
      <p>Automatic Runs. We submitted seven different automatic runs. These runs are compiled in Table 2,
and are defined by the range of features used: only image (with and without
object detection), only metadata, and mixed.</p>
      <p>Clustering the Lifelog Images into Events. The weights for each feature in
the image descriptor for each task are defined by the best combination in the test
set. The images are then clustered into M events using k-means, or hierarchical
trees when only using metadata. The number of events M is set to be equal to or
smaller than the summary length budget. When using k-means, the frames with
relevance below the 50th percentile of each cluster are discarded. The selection
and ranking of keyframes from the clusters is described in section 2.4.</p>
      <p>Interactive Runs. We submitted three different interactive runs, as compiled in Table 2. The
interaction time per task is of 30 on average, ranging between 103000 and 404000.</p>
      <p>Clustering the Lifelog Images into Events. For the interactive runs, k-means
is chosen for clustering the relevant images into M events, where M is
greater than the summary length budget. In each iteration, the user can select
which images to preserve for the final summary, which frames to remove from
the bag of relevant images, and whether all other images in the cluster should
be removed (Fig. 5). Two methodologies for updating the summary are
proposed: first, re-clustering the remaining frames; second, using the same initial
clustering for all iterations.</p>
      <p>Fig. 5: GUI for Interactive Summarization. Frames are shown with information on
their location and timestamp, and those already selected for the final summary
are marked "In Summary". The user can select the frames to be included (green
Y), to be removed (red N), or to remove all frames in the same cluster (*N).</p>
      <p>In the first approach, the frames are clustered into 20% more clusters than
additional keyframes needed (note that this number changes at each iteration).
For the second approach, two configurations are tested: clustering into either
20% or 100% more clusters than the length of the final summary. Once clustered,
the frames closest to each cluster center are chosen as candidates to be added to
the summary. The most relevant ones (as many as needed to fill the summary
length budget) are then included in the proposed summary.</p>
      <p>Discussion. Looking at the results in Fig. 4, we can observe that the use of mixed features
generally improves the performance. Using only metadata (run 3) is, on average,
worse than using only visual features (runs 1 and 8). We noted that the location
metadata was not very precise in terms of starting and finishing times. It is
interesting to see that interactive approaches (runs 4, 5 and 10) do not necessarily
perform better than automatic ones for low values of X. This may be due to the
subjectivity of this kind of retrieval task. One may think that, since the images are
handpicked (and sorted in drawing order), the precision score should be much
higher than for the automatic runs, and close to 1 for low values of X (those being
the first selected keyframes). However, this only happens for some tasks.</p>
      <p>Analyzing each task independently, we observe that visual features are not
very precise for tasks 1 (In a meeting) and 4 (Working at home). Surprisingly,
using only visual features yields the best results for task 6 (Social drinking) for low
values of X, especially if also using object detection, which performs best for larger
X. An outstanding retrieval performance (precision greater than 80% for all X)
is achieved with a mix of visual and metadata features for tasks 7 (Sightseeing),
8 (Transporting) and 10 (Shopping), possibly due to the quality of the metadata
(although the performance of using only visual features is competitive with the
mixed approaches). The best results for tasks 2 (Watching TV), 4 (Working
at home), 5 (Eating), 9 (Preparing meals) and 10 (Shopping) are obtained with
the interactive approaches, since recall can be improved when manually selecting
images from different clusters.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this paper, we have presented a generic framework for the summarization of
lifelogs given a target task. It can be used both in an automatic or interactive
way, with the user providing feedback on the retrieved frames. The proposed
approaches require that the user selects the relevant locations for each task. In
order to ease this forced manual input, the metadata obtained with lifelog apps
could contain additional info on, e.g., the nature of each location.</p>
      <p>We have observed that different tasks require different summarization
methodologies (e.g. different weights), which may not be completely consistent when
changing the lifelog input. Trained on the development set, we have obtained a
best averaged F1 score of 0.497 @X = 10 (the averaged F1 of the best run for
each task is 0.563), meaning there is still a lot of room for improvement. We
encourage other researchers to participate in such competitions in the future.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Wordnet</surname>
          </string-name>
          . Princeton University (
          <year>2010</year>
          ), http://wordnet.princeton.edu
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <article-title>Sighthound detection api</article-title>
          . Sighthound, Inc. (
          <year>2017</year>
          ), https://www.sighthound.com/ docs/cloud/detection/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lapedriza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliva</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Places: An image database for deep scene understanding</article-title>
          .
          <source>arXiv:1610.02055</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bolanos</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimiccoli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radeva</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Toward storytelling from visual lifelogging: An overview</article-title>
          .
          <source>IEEE Trans. HumanMach. Syst</source>
          .
          <volume>47</volume>
          (
          <issue>1</issue>
          ),
          <fpage>77</fpage>
          -
          <lpage>90</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bolanos</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mestre</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Talavera</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giro-i Nieto</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radeva</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Visual summary of egocentric photostreams by representative keyframes</article-title>
          .
          <source>In: IEEE ICME Workshops</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Overview of ImageCLEFlifelog 2017: Lifelog Retrieval and Summarization</article-title>
          .
          <source>In: CLEF 2017 Labs Working Notes. CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Dublin, Ireland (September 11-14,
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dehghan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ortiz</surname>
            ,
            <given-names>E.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masood</surname>
            ,
            <given-names>S.Z.</given-names>
          </string-name>
          :
          <article-title>Dager: Deep age, gender and emotion recognition using convolutional neural network</article-title>
          .
          <source>arXiv preprint arXiv:1702.04280</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fei-Fei</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>ImageNet: A Large-Scale Hierarchical Image Database</article-title>
          .
          <source>In: CVPR</source>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Doherty</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conaire</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blighe</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeaton</surname>
            ,
            <given-names>A.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>O'Connor</surname>
            ,
            <given-names>N.E.</given-names>
          </string-name>
          :
          <article-title>Combining image descriptors to effectively retrieve events from visual lifelogs</article-title>
          .
          <source>In: Proceedings of the 1st ACM MIR</source>
          . pp.
          <fpage>10</fpage>
          -
          <lpage>17</lpage>
          . ACM
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Elsweiler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruthven</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Towards memory supporting personal information management tools</article-title>
          .
          <source>J. Assoc. Inf. Sci. Technol</source>
          .
          <volume>58</volume>
          (
          <issue>7</issue>
          ),
          <fpage>924</fpage>
          -
          <lpage>946</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Harvey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langheinrich</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Remembering through lifelogging: A survey of human memory augmentation</article-title>
          .
          <source>Pervasive Mob Comput</source>
          .
          <volume>27</volume>
          ,
          <fpage>14</fpage>
          -
          <lpage>26</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In: CVPR</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Villegas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arenas</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dicente Cid</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia Seco de Herrera</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Islam</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwall</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Overview of ImageCLEF 2017: Information extraction from images</article-title>
          .
          <source>In: Experimental IR Meets Multilinguality, Multimodality, and Interaction 8th International Conference of the CLEF Association, CLEF 2017. Lecture Notes in Computer Science</source>
          , vol.
          <volume>10456</volume>
          . Springer, Dublin, Ireland (September 11-14,
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          :
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In: NIPS</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dey</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          :
          <article-title>Providing good memory cues for people with episodic memory impairment</article-title>
          .
          <source>In: ACM SIGACCESS ASSETS</source>
          . pp.
          <fpage>131</fpage>
          -
          <lpage>138</lpage>
          . ACM
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>Y.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ghosh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grauman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Discovering important people and objects for egocentric video summarization</article-title>
          .
          <source>In: CVPR</source>
          vol.
          <volume>2</volume>
          , p.
          <fpage>6</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bourdev</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>Microsoft COCO: common objects in context</article-title>
          .
          <source>In: ECCV</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Masood</surname>
            ,
            <given-names>S.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dehghan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ortiz</surname>
            ,
            <given-names>E.G.</given-names>
          </string-name>
          :
          <article-title>License plate detection and recognition using deeply learned convolutional neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1703.07330</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Garcia del Molino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>A.H.</given-names>
          </string-name>
          :
          <article-title>Summarization of egocentric videos: A comprehensive survey</article-title>
          .
          <source>IEEE Trans. Human-Mach. Syst</source>
          .
          <volume>47</volume>
          (
          <issue>1</issue>
          ),
          <fpage>65</fpage>
          -
          <lpage>76</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Garcia del Molino</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>Describing lifelogs with convolutional neural networks: A comparative study</article-title>
          .
          <source>In: LTA Workshop</source>
          . pp.
          <fpage>39</fpage>
          -
          <lpage>44</lpage>
          . ACM
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Morere</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veillard</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>L.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chandrasekhar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Poggio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Nested invariance pooling and RBM hashing for image instance retrieval</article-title>
          .
          <source>In: ACM ICMR</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Faster R-CNN: Towards real-time object detection with region proposal networks</article-title>
          .
          <source>In: NIPS</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>In: ICLR</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grauman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Detecting snap points in egocentric video with a web photo prior</article-title>
          .
          <source>In: ECCV</source>
          . pp.
          <fpage>282</fpage>
          -
          <lpage>298</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>