<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Image Processing in Collaborative Open Narrative Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Petr Pulc</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric Rosenzveig</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Holeňa</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Information Technology, Czech Technical University in Prague, Thákurova 9</institution>
          ,
          <addr-line>160 00 Prague, petr.pulc@fit.cvut.cz</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Film and TV School, Academy of Performing Arts in Prague, Smetanovo nábřeží 2</institution>
          ,
          <addr-line>116 65 Prague, eric.rosenzveig@famu.cz</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2</institution>
          ,
          <addr-line>182 07 Prague, martin@cs.cas.cz</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>1649</volume>
      <fpage>155</fpage>
      <lpage>162</lpage>
      <abstract>
        <p>The open narrative approach enables creators of multimedia content to build multi-stranded, navigable narrative environments. The viewer can navigate such a space within the author's predetermined constraints, or even browse the open narrative structure arbitrarily based on their interests. This philosophy is used to great advantage in the collaborative open narrative system NARRA. The platform makes it possible for documentary makers, journalists, activists and other artists to link their own audiovisual material to clips of other authors and thus create a navigable space of individual multimedia pieces. To help authors focus on building the narratives themselves, a set of automated tools has been proposed. The most obvious ones, such as speech-to-text, are already incorporated in the system. However, other, more complicated authoring tools, primarily focused on creating metadata for the media objects, are yet to be developed. The most complex of them involve the description of objects in media (with unrestricted motion, action or other features) and the detection of near-duplicates of video content, which is the focus of our current interest. In our approach, we try to use motion-based features and register them across the whole clip. Using the Grid Cut algorithm to segment the image, we then select only those parts of the moving picture that are of interest for further processing. For the selection of suitable description methods, we are developing a meta-learning approach. This should enable automatic annotation based not only on clip similarity per se, but rather on the objects detected in the shot.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The amount of multimedia content in the archives of documentarists and other multimedia content creators has always been large, even in the era of analogue film. With the higher availability and much lower price of capturing devices suitable for cinema- or television-grade multimedia production, much more content is stored archivally and only a fraction is later published as a typical “closed narrative”, i.e. a traditional media work of, say, 30, 60 or feature-length 90 minutes.</p>
      <p>With wider access to broadband internet connections and higher participation of individual users in the creation of internet content, the publication of such archives is now theoretically possible, yet they are usually difficult to navigate for users unfamiliar with the structure proposed by the author. Even the authors themselves tend to lose track of the entirety of their own content, and the media archives of many time-constrained or longer-term projects lack any structure at all.</p>
      <p>To enable the creation of a structure maintainable by a group of authors, the open narrative principle can be used. Although the term originally refers to soap operas and other works of art with no foreseeable end, the main idea of a multi-stranded narrative is easily transferable to other environments, such as documentaries.</p>
      <p>In our example system, NARRA, which will be described in section 2, multiple strands of narrative created by multiple authors are combined and structured using data visualizations into coherent multiple narratives and can be mapped to a single graph, thereby extending the viewpoint of a single author as opposed to more traditional narratives. However, such an approach to discovering connections between multimedia clips may be insufficient in certain cases.</p>
      <p>One of them involves the discovery of near-identical video clips created by editing the original (raw) footage. Authors tend to lose track across multiple iterated versions (including cropping, colour corrections, visual effects, retouching, soundtrack alterations or “sweetening”, etc.) before arriving at a sequence used in the final edit. This brings a need for automated moving picture processing, which will be discussed in section 3.</p>
      <p>To be able to work efficiently with only a relatively small set of interest points instead of the whole image, common image feature extraction algorithms will be briefly presented in subsection 3.3. These algorithms will then be compared on a basic motion detection task.</p>
      <p>In subsection 3.4, we will present an idea of motion-based image segmentation. The basic notion follows a similar approach used in object recognition from static images; however, instead of using just the image itself for segmentation, hints from object movement will be used to determine the objects.</p>
      <p>As most of the topics are still open, further research in these areas will be briefly discussed in section 4. Based on that direction of research, not only the recognition of objects, but also the recognition of their properties should become possible. In this area, we would like to use a meta-learning approach, which will be outlined in subsection 4.1.</p>
    </sec>
    <sec id="sec-2">
      <title>2 NARRA</title>
      <p>
        Open narrative systems were usually created as one-of-a-kind tools that enabled the user to browse authored content in a somewhat open manner. The first approaches similar to open narrative platforms stemmed from multimedia archives at the end of the 20th century, with annotations and connections curated by hand. David Blair’s Waxweb, besides being the first streaming video on the web, is often cited as the first online video-based navigable narrative [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>One of the major projects of the second author, Eric Rosenzveig, on which we are building, is playListNetWork, a system developed from 2001 to 2003 in collaboration with Willy LeMaitre and other media artists and programmers. This software enabled multiple users in different locations to work simultaneously with an underlying audiovisual database, annotating the media clips and joining them into branching playlists. The publicly accessible part of the software, disPlayList, enabled a 3D visualization of the playlist structure created by playListNetWork and a subsequent unique “run”, or cinematic experience, through the material.</p>
      <p>NARRA is an evolution of playListNetWork concepts,
brought to a new world of hyper-linked media and direct
audiovisual playback, as opposed to the more complicated
multimedia streaming approaches of the past. With the
increasing processing power of computers, it has been
proposed that some parts of media annotation or linking can
be handed over to automated processing tools.</p>
      <p>The main task of NARRA is to create a platform for the collaboration of multiple artists, and therefore the system is being built modularly, with an extensible API. While using NARRA on multiple projects, we discovered that authors have diverse ideas about multimedia collaboration and need different kinds of annotations. To this end, NARRA uses a NoSQL database to avoid any possible limitations in the future.</p>
      <p>Modules themselves are of three distinct types. Connectors are used to ingest the multimedia data; however, because NARRA is not a multimedia archive, only a preview and a proxy are stored alongside basic metadata. Generators are automated tools that process the multimedia and create a set of new metadata; an example of such a module uses an AT&amp;T speech recognition API for the automated transcription of human speech. Synthesizers find structure in the (meta-)data already present in storage to link the items together; for example, a synthesizer looks for keyword similarity between two items, or is used to create and enhance links between clips used in stored video sequences.</p>
      <p>NARRA can then be used for the presentation of generated multimedia sequences, allowing for media discovery through navigation during sequence playback, or to show any user interface or visualization created in Processing.js or P5.js scripts.</p>
      <p>This article proposes a generator that creates annotations based on motion vectors in the video. Further research is intended to create a synthesizer that will automatically link similar audiovisual clips.</p>
      <p>Detection and description of objects is proposed as another metadata generator. Currently, motion vectors can be used for the detection of individual objects in unconstrained moving pictures. Evolving rules that connect the detected objects with salient features contained in their description is a goal of our further research.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Moving Picture Processing</title>
      <p>Computer vision, moving picture processing and still image processing are interconnected areas that use a very similar set of processing techniques: using edge detection to create outlines of objects in the scene, detecting occurrences of previously defined shapes, detecting interest points and registering them across multiple pictures, and so on.</p>
      <p>As opposed to a static image, a moving picture brings the possibility of motion detection, but on the other hand also the problem of the large amounts of data that need to be dealt with.</p>
      <sec id="sec-3-1">
        <title>3.1 State of the Art</title>
        <p>Many of the traditional approaches analyse individual multimedia frames, and the extracted data is treated as a discrete time sequence; sometimes only statistical properties of such a sequence are used for further processing.</p>
        <p>
          Examples of single-frame processing methods include
the classification of textures [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], bag-of-features
classification [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], text recognition [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], object recognition [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] or
face recognition [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <p>
          The method created by Lukáš Neumann and Jiří Matas [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] has also been further extended to text transcription from live video. However, as opposed to the later-mentioned approach of Fragoso et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], frames were still processed one by one.
        </p>
        <p>Other systems process pairs of frames, but have to introduce certain limitations to the acquisition process, such as limiting the motion of either the camera or the object. The camera motion limitation is acceptable, for example, in security camera applications, the latter in static object or environment scanning.</p>
        <p>A static camera in particular is widely used, as it allows a very simple motion detection concept: if many pixels change significantly between frames, it can be assumed that motion has occurred. The location of the changed pixels tells us the position of the motion, and the difference between positions in individual frames can be interpreted as a motion vector.</p>
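        <p>As an illustration of this concept, the following is a minimal sketch assuming OpenCV in Python; the clip name and thresholds are illustrative placeholders. It flags a frame as containing motion when many pixels change significantly between consecutive frames.</p>
        <preformat>
import cv2

cap = cv2.VideoCapture("clip.mp4")                 # hypothetical input clip
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev_gray)            # per-pixel change between frames
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    if cv2.countNonZero(mask) &gt; 0.01 * mask.size:  # many changed pixels: motion assumed
        print("motion detected")
    prev_gray = gray
cap.release()
</preformat>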
        <p>If we have enough information about the background or
gather it during the processing, it can be subtracted from
all frames to enable not only the detection of movement,
but detection of whole objects. Yet still, the camera has to
be static and the gathered background has to be as invariant
as possible, which is not always achievable.</p>
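        <p>A possible sketch of this background subtraction idea, assuming OpenCV's MOG2 background model (parameter values are illustrative); as noted above, it only works reliably when the camera is static.</p>
        <preformat>
import cv2

cap = cv2.VideoCapture("clip.mp4")                 # hypothetical input clip
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)              # learns the background on the fly
    # connected components of the foreground mask approximate whole moving objects
    n_labels, labels = cv2.connectedComponents(fg_mask)
cap.release()
</preformat>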
        <p>
          To enhance the information available for image segmentation, other specialised sensors or camera arrays can be used to gather depth information; however, distance sensors usually do not have a high enough resolution, and scene reconstruction from multiple sources is costly. A new method developed by Disney Research Zurich [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] aims to eliminate such problems, yet it is still based on processing individual pixels into 3D point clouds.
        </p>
        <p>
          Another problem that is currently addressed mostly by still image comparison is measuring the similarity between individual clips. Existing approaches try to gather similar patches from two sets of frames and compare them, tolerating very little or no editing [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Interest Point Based Image Processing</title>
        <p>A very different approach to image processing can be based on the detection and registration of interest points across a set of individual multimedia frames. This brings the advantage of much smaller data processing requirements, with only a slight compromise in quality and precision. Technically, the worst type of error is the detection of similar, yet unrelated, points of interest, but these outliers can be filtered out later on.</p>
        <p>
          In contrast to the previously mentioned methods, we try to use primarily the information about interest points, especially their motion. An example of such use of motion tracking can be seen in the already mentioned translation application [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]: the image is sent to the recognition service only once, and the returned result is kept aligned with the moving picture thanks to the extracted motion vectors.
        </p>
        <p>Image segmentation, as another example of a widely used image processing technique, still has to be based on the image information itself, yet the motion information can be used to discover and track the position of the detected object.</p>
        <p>In our use case, the motion vectors extracted from all frames can be divided into two basic groups: motion of the camera itself and motion of the objects in the scene. For both groups, we can make some basic assumptions that help us distinguish them. In the case of object motion, we can safely assume that singular motion vectors exceeding some interframe distance are false detections and can be discarded. We can also assume that the motion of an object is at least to some extent smooth; rapid movement of an object is therefore impossible without a jump-cut introduced in post-production. Footage with a higher frame rate should be able to rely on this property even more.</p>
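        <p>The first assumption can be sketched as follows; the displacement limit is an illustrative value that would in practice depend on frame rate and resolution.</p>
        <preformat>
import numpy as np

def drop_implausible(motion_vectors, max_displacement=40.0):
    """motion_vectors: array of rows (x, y, dx, dy); keep only plausibly smooth motion."""
    lengths = np.hypot(motion_vectors[:, 2], motion_vectors[:, 3])
    return motion_vectors[lengths &lt;= max_displacement]
</preformat>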
        <p>The camera motion can be modelled as the smallest deviation from a global motion model. However, several problems arise, as the camera can not only translate and rotate, but also change focus and, with some lenses, zoom. The detection model therefore needs to incorporate all possible deformations of the field.</p>
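        <p>One way to realise such a global model, assuming matched interest point coordinates are already available, is to fit a homography robustly with RANSAC: inliers then approximate the camera-induced motion, while outliers are candidates for object motion. The sketch below makes that assumption explicit; the reprojection threshold is illustrative.</p>
        <preformat>
import cv2

# pts_prev, pts_next: N x 2 float arrays of matched interest point coordinates
H, inlier_mask = cv2.findHomography(pts_prev, pts_next, cv2.RANSAC,
                                    ransacReprojThreshold=3.0)
inliers = inlier_mask.ravel() == 1
camera_motion = pts_next[inliers] - pts_prev[inliers]   # consistent with the global model
object_candidates = pts_prev[~inliers]                  # vectors deviating from the model
</preformat>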
        <p>We currently plan to incorporate such a moving picture description into NARRA as a more robust computation of item similarity. By combining this approach with meta-learned rules concerning item description, we should then be able to correctly describe both the environment where the action takes place and the objects themselves. However, to validate the applicability of such a complex description, more experimentation with the extracted image features and segmentation needs to be performed.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Experiments with Image Descriptor Matching</title>
        <p>
          Because of the distinct properties of currently used image feature descriptors and the specificity of our use case, we tested the following image descriptors with two distinct matching algorithms. Brute force (BF) searches for the closest descriptors directly, in linear time. The more elaborate Fast Library for Approximate Nearest Neighbours (FLANN) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] first creates a set of binary trees and indexes all descriptors; during search, the trees are recursively traversed many times to increase match precision – currently 50 times, which is possibly excessive. In both cases we perform the ratio check proposed by Lowe [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
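        <p>For float-valued descriptors such as SIFT or SURF, the two matching strategies can be sketched as follows, assuming OpenCV; desc_a and desc_b stand for the descriptor matrices of the two compared images, and the parameter values mirror the ones mentioned above.</p>
        <preformat>
import cv2

def ratio_filter(knn_matches, ratio=0.7):
    # Lowe's ratio check: keep a match only if clearly better than the runner-up
    return [m for m, n in knn_matches if m.distance &lt; ratio * n.distance]

bf = cv2.BFMatcher(cv2.NORM_L2)                             # brute force, linear search
bf_matches = ratio_filter(bf.knnMatch(desc_a, desc_b, k=2))

flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5),   # randomized k-d trees
                              dict(checks=50))              # 50 recursive traversals
flann_matches = ratio_filter(flann.knnMatch(desc_a, desc_b, k=2))
</preformat>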
        <p>
          The scale-invariant feature transform (SIFT) is an algorithm for the detection and description of local features in images, published by David Lowe in 1999 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. This algorithm takes the input image and returns a description of each interest point as 8-bin gradient direction histograms of the 16 × 16 surrounding block, collected over 4 × 4 sub-blocks. SIFT therefore creates a vector of 128 numbers for each interest point.
        </p>
        <p>Speeded Up Robust Features (SURF) is essentially an enhancement of the SIFT descriptor. The Laplacian of Gaussian used in SIFT is approximated with a box filter, and both the orientation assignment and the feature description are gathered from wavelet responses. Around the interest point, 4 × 4 sub-regions are considered, each described by four properties of the wavelet responses. The SURF descriptor therefore creates, by default, a vector of 64 values for each interest point.</p>
        <p>
          ORB is a fairly new image feature descriptor presented by Rublee et al. in 2011 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], which uses the Binary Robust Independent Elementary Features (BRIEF) descriptor of the detected points of interest.
        </p>
        <p>Due to the binary nature of ORB, the search for matching points of interest is much faster with either matching algorithm, as can be seen in Table 2, yet the resulting set of matches consists of much fewer points. The number of resulting vectors is shown in Table 1. [Tables 1 and 2 and Figure 2 report the number of vectors and the average matching time per vector in milliseconds for resolutions of 480 × 270, 960 × 540, 1920 × 1080 and 3840 × 2160.]</p>
        <p>A visual comparison of the motion detected by all three algorithms is shown in Figure 1. It indicates that ORB would be useful for direct classification of actions in the image and possibly also for multimedia clip comparison, whereas SURF, as the most time-consuming method, with results exceeding those of SIFT by many more detected motion vectors, would be beneficial for detailed image segmentation. However, as we are working on a proof of concept only, the much faster ORB descriptors will be the focus of our further interest.</p>
        <p>The graph in Figure 2 also shows that SIFT does not scale well and that simple brute-force matching has better time performance. Yet the visual comparison of outputs in Figure 1 shows that the vectors matched by the FLANN algorithm are more precise, meaning that fewer false motion vectors (long green lines) are detected. The FLANN algorithm can also be tuned considerably, for example by reducing the number of checks. Therefore, image segmentation will be tested on vectors obtained from ORB descriptors matched with FLANN, as sketched below.</p>
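        <p>A sketch of this chosen pipeline on two consecutive greyscale frames could look as follows, assuming OpenCV; because ORB descriptors are binary, the FLANN matcher is configured with an LSH index, and all parameter values are illustrative.</p>
        <preformat>
import cv2

orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(frame1_gray, None)
kp2, des2 = orb.detectAndCompute(frame2_gray, None)

flann = cv2.FlannBasedMatcher(
    dict(algorithm=6, table_number=6, key_size=12, multi_probe_level=1),  # LSH index
    dict(checks=50))

motion_vectors = []
for pair in flann.knnMatch(des1, des2, k=2):
    if len(pair) &lt; 2:
        continue                                   # LSH may return fewer than 2 candidates
    m, n = pair
    if m.distance &lt; 0.7 * n.distance:              # Lowe's ratio check
        x1, y1 = kp1[m.queryIdx].pt
        x2, y2 = kp2[m.trainIdx].pt
        motion_vectors.append((x1, y1, x2 - x1, y2 - y1))   # start point and displacement
</preformat>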
        <p>It must be said that the current time performance of any of these algorithms is insufficient for real-time processing.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4 Image Segmentation</title>
        <p>Image segmentation itself is a very important discipline in computer vision, as it enables us to focus on the narrow details of a particular part of the image, as opposed to a complicated description of the whole scene.</p>
        <p>
          Basic image segmentation may be derived from the detection of connected components in the image, providing a set of areas that ideally correspond to the local texture of the image. Such approaches, partially discussed in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], make it possible to categorize these areas and thereby describe the whole image.
        </p>
        <p>
          A somewhat more sophisticated image segmentation algorithm uses the principle of minimal-energy cuts in the space of the image, where the inlets and outlets of the graph are assigned by rather imprecise scribbles. More precisely, we will be using a speeded-up version of the Boykov-Kolmogorov algorithm, Grid Cut [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>For better segmentation, the image is converted into an edge representation. To this end, a convolution with the Laplacian of Gaussian kernel is performed. The inlets are then generated from the clustered motion vectors.</p>
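        <p>A short sketch of this preprocessing step, assuming OpenCV; the kernel sizes and sigma are illustrative. The Laplacian of Gaussian is obtained by Gaussian smoothing followed by the Laplacian operator.</p>
        <preformat>
import cv2

blurred = cv2.GaussianBlur(frame_gray, (5, 5), sigmaX=1.4)   # suppress noise first
edges = cv2.Laplacian(blurred, cv2.CV_32F, ksize=3)          # edge representation of the frame
</preformat>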
        <p>Such clustering is crucial, as we need to assign inlets corresponding to whole objects, not to individual motion vectors. To this end, all motion vectors are represented as 6-dimensional data points storing the frame number, the location, and the motion vector as the sine and cosine of its angle and its length.</p>
        <p>
          For clustering, we have used a partially normalised data representation, where the position of the starting pixel in the image was divided by the image resolution and the frame number was made relative to the length of the processed clip. As a consequence, the role of those features in the performed hierarchical clustering decreased, in favour of the motion vector length and direction. Ward's linkage [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] yielded the dendrogram shown in Figure 3. [Figure 1 panels: (a) ORB, exact match; (b) ORB, FLANN; (c) SIFT, 2-NN selection; (d) SIFT, FLANN 2-NN; (e) SURF, 2-NN selection; (f) SURF, FLANN 2-NN.]
        </p>
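        <p>A sketch of this clustering step, assuming SciPy and a list of extracted motion vectors (frame, x, y, dx, dy) together with the clip length and frame resolution; the feature layout follows the 6-dimensional representation described above.</p>
        <preformat>
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def encode(vectors, n_frames, width, height):
    feats = []
    for frame, x, y, dx, dy in vectors:
        angle = np.arctan2(dy, dx)
        feats.append([frame / n_frames,            # frame number relative to clip length
                      x / width, y / height,       # start position normalised by resolution
                      np.sin(angle), np.cos(angle),
                      np.hypot(dx, dy)])           # motion vector length
    return np.asarray(feats)

X = encode(vectors, n_frames, width, height)
Z = linkage(X, method="ward")                      # hierarchical clustering (dendrogram)
labels = fcluster(Z, t=5, criterion="maxclust")    # cut into 5 clusters, as in Figure 4
</preformat>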
        <p>Based on this dendrogram, a division of the motion vectors into 5 clusters has been performed; the resulting clusters are shown in Figure 4. The cyan and red points correctly represent the background, and the yellow and green points mostly represent the moving objects. Unfortunately, both objects have similar motion vectors, and the normalization of point positions reduced the possibility to discriminate between them.</p>
        <p>Any cluster detected in this space is then assigned a unique descriptor that is used as a scribble index. Scribble pixels (min-cut inlets) are assigned from the neighbourhood of the start points of the clustered motion vectors.</p>
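        <p>A minimal sketch of how such scribbles could be painted, assuming OpenCV and the cluster labels from the previous step; the neighbourhood radius is an illustrative assumption.</p>
        <preformat>
import numpy as np
import cv2

scribbles = np.zeros((height, width), dtype=np.uint8)   # 0 = no scribble
for (x, y), label in zip(start_points, labels):
    # paint a small disc of the cluster's scribble index around each start point
    cv2.circle(scribbles, (int(x), int(y)), radius=5, color=int(label), thickness=-1)
</preformat>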
        <p>Although we have used the approach resulting in the minimal number of detected motion vectors, the preliminary result of segmentation in Figure 5 shows that this approach is valid and can be used at least for the motion description of both the background and the foreground objects. Investigating other clustering and segmentation algorithms is part of our further research interests.</p>
        <p>So far, the gathered information can be used for simple indexing tasks, for example recording the number of objects present in the scene, their shape, colour histograms, present textures and points of interest. The motion vectors can also be indexed for a later comparison of multimedia clips. Such an index will be invariant to scale, crop, colour edits and other, more complex modifications of the multimedia, as the final descriptor would be deduced only from the motion vectors and their relative displacement.</p>
        <p>The final goal of our research is, however, to enable an automatic description of objects and environments in an unconstrained multimedia item. For such a description, we may propose a custom baseline classifier that would use the information about the segment contour, relative colour histogram and/or texture. However, we aim for the utilisation of some already existing, previously mentioned single-frame processing methods. As the content of each multimedia segment should now, in the ideal case, be composed of only a single object, only the classification part of such methods may be used.</p>
        <p>Yet we have no prior information about the type of the recognized object, so a custom classifier would be difficult to train. If we ran all of the already existing classifiers and combined their outputs to deduce the final class of the object, a large amount of noise and possibly contradictory information would be introduced. It also makes no sense to run the recognition algorithms on all media frames, as the ones with blurred or highly occluded objects would just confuse the classifiers.</p>
        <p>
          Therefore, we are currently studying meta-learning approaches that select only the several best-performing classification algorithms, based on meta-features describing the considered video, such as a coarsely binned colour histogram and edge information. Although meta-learning itself has been used on text corpora [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] for several decades, its application to the classification of multimedia content is rather novel.
        </p>
        <p>We are currently investigating two levels on which meta-learning can be applied to multimedia. The first, higher level introduces processing method recommendation: a classifier on the meta level chooses the most appropriate of a set of available processing methods, based on easily extractable meta-features. In our current case, the computed boundary of the segmented object, its histogram and other meta-features will be used to select more complex and thorough extraction and classification methods, such as face description or texture processing. A set of methods is used to enable an evolution of the meta-learner; to accomplish that, the best-performing method is associated with the input meta-features for the next rounds of meta-learning.</p>
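        <p>A highly simplified sketch of this method recommendation idea, assuming scikit-learn; the meta-features, the choice of a random forest as the meta-level classifier, and all names are illustrative assumptions rather than the actual NARRA implementation.</p>
        <preformat>
from sklearn.ensemble import RandomForestClassifier

# meta_X: one row of meta-features (e.g. coarse colour histogram, edge statistics)
#         per already processed segment
# best_method: index of the best-performing processing method for that segment
meta_learner = RandomForestClassifier(n_estimators=100)
meta_learner.fit(meta_X, best_method)

def recommend(segment_meta_features):
    # choose which more expensive extraction/classification method to run next
    return meta_learner.predict(segment_meta_features.reshape(1, -1))[0]
</preformat>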
        <p>This approach can even be stacked into multiple layers. An example of such a situation is a more precise recognition of people, where the meta-learning classifier recognizes the shape as a human, and subsequent classification, possibly also obtained through meta-learning, brings information about the recognized face, clothes, eye-wear, carried objects, types of movement and other features.</p>
        <p>However, using more methods also introduces much higher time complexity. To eliminate this problem, meta-learning with multiobjective optimization can be introduced. Such meta-learning will then try to select methods both from the point of view of predictive accuracy and from the point of view of computational demands.</p>
        <p>The second level will aim at the optimization of the individual media processing units on their own. As some of the data description methods incorporate trainable and tunable methods (such as regression or classification), we can either trust their recommended settings during training, or consider multiple methods and/or their set-ups. This way, we would like to increase the precision and also possibly discover a wider variety of classes, reflecting any drift in the input data.</p>
      </sec>
      <sec id="sec-3-5">
        <title>Acknowledgements</title>
        <p>The research reported in this paper has been carried out at the Academy of Performing Arts, within the project “NARRA”, supported by the Institutional Support for Long-term Conceptual Development of Research Organization programme provided by the Ministry of Education, Youth and Sports of the Czech Republic.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Brazdil</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Giraud-Carrier</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Soares</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ; et al.
          <source>Metalearning. Cognitive Technologies</source>
          , Berlin, Heidelberg: Springer Berlin Heidelberg,
          <year>2009</year>
          , ISBN 978-3-540-73262-4. Available from: http://link.springer.com/10.1007/978-3-540-73263-1
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Duygulu</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Barnard</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>J. F. G.</given-names>
          </string-name>
          ; et al.
          <source>Computer Vision - ECCV 2002: 7th European Conference on Computer Vision</source>
          Copenhagen, Denmark, May
          <volume>28</volume>
          -31,
          <year>2002</year>
          Proceedings,
          <string-name>
            <surname>Part</surname>
            <given-names>IV</given-names>
          </string-name>
          ,
          <article-title>chapter Object Recognition as Machine Translation: Learning a Lexicon for a Fixed Image Vocabulary</article-title>
          . Berlin, Heidelberg: Springer Berlin Heidelberg,
          <year>2002</year>
          ,
          <source>ISBN 978-3-540-47979-6</source>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>112</lpage>
          . Available from: http://dx.doi.org/10.1007/3-540-47979-1_7
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Fragoso</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gauglitz</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zamora</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; et al.
          <article-title>Translatar: A mobile augmented reality translator</article-title>
          .
          <source>In Applications of Computer Vision (WACV)</source>
          ,
          <source>2011 IEEE Workshop on, IEEE</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>497</fpage>
          -
          <lpage>502</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Jamriška</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sýkora</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hornung</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Cache-efficient graph cuts on structured grids</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2012</year>
          ,
          <year>June 2012</year>
          , ISSN 1063-
          <issue>6919</issue>
          , pp.
          <fpage>3673</fpage>
          -
          <lpage>3680</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Klose</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bazin</surname>
          </string-name>
          , J.-C.; et al.
          <article-title>Sampling based scene-space video processing</article-title>
          .
          <source>ACM Transactions on Graphics (TOG)</source>
          , volume
          <volume>34</volume>
          , no.
          <issue>4</issue>
          ,
          <year>2015</year>
          : p.
          <fpage>67</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Kuncheva</surname>
            ,
            <given-names>L. I.</given-names>
          </string-name>
          <article-title>Combining Pattern Classifiers</article-title>
          . John Wiley &amp; Sons, Inc.,
          <year>2004</year>
          , ISBN 0471210781.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D. G.</given-names>
          </string-name>
          <article-title>Object recognition from local scale-invariant features</article-title>
          .
          <source>In Computer vision</source>
          ,
          <year>1999</year>
          .
          <source>The proceedings of the seventh IEEE international conference on</source>
          , volume
          <volume>2</volume>
          ,
          <string-name>
            <surname>Ieee</surname>
          </string-name>
          ,
          <year>1999</year>
          , pp.
          <fpage>1150</fpage>
          -
          <lpage>1157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; et al.
          <article-title>A multimedia information fusion framework for web image categorization</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          , volume
          <volume>70</volume>
          , no.
          <issue>3</issue>
          , jun
          <year>2014</year>
          : pp.
          <fpage>1453</fpage>
          -
          <lpage>1486</lpage>
          , ISSN 1380-7501. Available from: http://link.springer.com/10.1007/s11042-012-1165-2
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] Meyer, T.;
          <string-name>
            <surname>Blair</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hader</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>WAXweb: a MOO-based collaborative hypermedia system for WWW</article-title>
          .
          <source>Computer Networks and ISDN Systems</source>
          , volume
          <volume>28</volume>
          , no.
          <issue>1</issue>
          ,
          <issue>1995</issue>
          : pp.
          <fpage>77</fpage>
          -
          <lpage>84</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Muja</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lowe</surname>
            ,
            <given-names>D. G.</given-names>
          </string-name>
          <article-title>Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration</article-title>
          .
          <source>VISAPP (1)</source>
          , volume
          <volume>2</volume>
          ,
          <year>2009</year>
          : pp.
          <fpage>331</fpage>
          -
          <lpage>340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Matas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <source>Computer Vision - ACCV 2010: 10th Asian Conference on Computer Vision, Queenstown, New Zealand, November 8-12, 2010, Revised Selected Papers, Part III</source>
          ,
          <article-title>chapter A Method for Text Localization and Recognition in Real-World Images</article-title>
          . Berlin, Heidelberg: Springer Berlin Heidelberg,
          <year>2011</year>
          ,
          <source>ISBN 978-3-642-19318- 7</source>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>783</lpage>
          . Available from: http://dx.doi.org/10.1007/978-3-642-19318-7_60
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Nowak</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Jurie</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Triggs</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>Sampling strategies for bag-of-features image classification</article-title>
          .
          <source>In Computer VisionECCV 2006</source>
          , Springer,
          <year>2006</year>
          , pp.
          <fpage>490</fpage>
          -
          <lpage>503</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Rublee</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rabaud</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Konolige</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ; et al.
          <article-title>ORB: an efficient alternative to SIFT or SURF</article-title>
          .
          <source>In Computer Vision</source>
          (ICCV),
          <source>2011 IEEE International Conference on, IEEE</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>2564</fpage>
          -
          <lpage>2571</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Selvan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ramakrishnan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>SVD-based modeling for image texture classification using wavelet transformation</article-title>
          .
          <source>Image Processing</source>
          , IEEE Transactions on, volume
          <volume>16</volume>
          , no.
          <issue>11</issue>
          ,
          <year>2007</year>
          : pp.
          <fpage>2688</fpage>
          -
          <lpage>2696</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Shang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ; et al.
          <article-title>Real-time Large Scale Near-duplicate Web Video Retrieval</article-title>
          .
          <source>In Proceedings of the 18th ACM International Conference on Multimedia, MM '10</source>
          , New York, NY, USA: ACM,
          <year>2010</year>
          ,
          <source>ISBN 978-1-60558-933-6</source>
          , pp.
          <fpage>531</fpage>
          -
          <lpage>540</lpage>
          . Available from: http://doi.acm.org/10.1145/1873951.1874021
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Turk</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pentland</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>Face recognition using eigenfaces</article-title>
          .
          <source>In Computer Vision and Pattern Recognition</source>
          ,
          <year>1991</year>
          . Proceedings CVPR '
          <fpage>91</fpage>
          ., IEEE Computer Society Conference on,
          <source>Jun</source>
          <year>1991</year>
          , ISSN 1063-
          <issue>6919</issue>
          , pp.
          <fpage>586</fpage>
          -
          <lpage>591</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>J. H., Jr.</given-names>
          </string-name>
          <article-title>Hierarchical grouping to optimize an objective function</article-title>
          .
          <source>Journal of the American statistical association</source>
          , volume
          <volume>58</volume>
          , no.
          <issue>301</issue>
          ,
          <year>1963</year>
          : pp.
          <fpage>236</fpage>
          -
          <lpage>244</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>