<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Superpixel Group Mining for Manipulation Action Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tianjun Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephen McKenna</string-name>
<email>s.j.z.mckenna@dundee.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CVIP, School of Science and Engineering, University of Dundee</institution>
          ,
          <addr-line>Dundee DD1 4HN</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Manipulation action recognition is a challenging problem in computer vision. We previously reported a system based on matching groups of superpixels. In this paper, we modify the superpixel group mining algorithm and report results on two datasets. Recognition accuracies are comparable with those reported using deep learning. The representation used in our approach is amenable to interpretation. Specifically, visualisation of matched groups provides a level of explanation for recognition decisions and insights into the likely generalisation ability of action representations.</p>
      </abstract>
      <kwd-group>
<kwd>Superpixel group mining</kwd>
        <kwd>Vision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
Manipulation actions usually contain fine-grained motions involving both
actor and manipulated objects, in contrast to actions such as running and
jumping. One approach to recognition of manipulation actions is to build object
and human body part detectors and analyse the relationships between them [
        <xref ref-type="bibr" rid="ref1 ref2 ref6">6,
1, 2</xref>
        ]. However, supervised training of detectors for all objects of interest can
require extensive manual image annotation [<xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5">1-5</xref>]. Another shortcoming is that
object transformations arising from manipulation are not always sufficiently
represented [
        <xref ref-type="bibr" rid="ref14 ref4">4, 14</xref>
        ]. Actions such as those involved in food preparation can markedly
change object appearance (e.g., mixing ingredients) and topology (e.g., cutting
into pieces). This situation is not well handled by spatio-temporal tube methods,
for example. Other methods relied on pose trackers (e.g., [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]) and assumed
that most of the human body appears in the camera view. Yang et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] used
an unsupervised method to segment objects for recognising manipulation actions
against a clear, uncluttered background.
      </p>
      <p>
        We proposed an action recognition system based on discriminative superpixel
group mining which avoids the need for manual object annotations and which
can represent object transformations [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. However, to select the best
representations for each action, representativeness should also be considered
in the mining process [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. In this work, we modify the discriminative
group mining algorithm to include representativeness. We report results on two
datasets: 50 Salads [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and Actions for Cooking Eggs (ACE) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We illustrate
that the method learns representations that are amenable to interpretation via
visualisation, providing insights into recognition decisions and generalisation.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Proposed Method</title>
      <sec id="sec-2-1">
        <title>On-line Spatio-temporal Superpixel Grouping</title>
        <p>
We briefly introduce our superpixel grouping algorithm. More details can be
found in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Each frame is first over-segmented into superpixels using Depth
SEEDS [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. RANSAC [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] is applied to find the plane of the work surface.
Superpixels above this surface are connected spatially and temporally based on
colour similarity and optical flow to sequentially build spatio-temporal superpixel
groups. These groups can contain temporal bifurcations and loops so that they
are able to represent complex object transformations in actions such as cutting
and mixing.
        </p>
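        <p>As a rough illustration of this linking step, the following Python sketch (not our released implementation; the data layout, helper names, histogram size and similarity threshold are illustrative assumptions) connects a superpixel to superpixels in the next frame when its flow-displaced centroid lands inside a candidate region and their colour histograms intersect strongly. Because one superpixel may link to several successors and vice versa, the resulting groups can bifurcate and merge over time.</p>
        <preformat>
# Minimal sketch: linking superpixels across frames by colour
# similarity and optical flow. Data layout and threshold are
# illustrative assumptions, not the paper's implementation.
import numpy as np

def colour_histogram(pixels, bins=25):
    # Per-channel histogram, concatenated and L1-normalised.
    hists = [np.histogram(pixels[:, c], bins=bins, range=(0, 255))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / max(h.sum(), 1e-9)

def histogram_intersection(h1, h2):
    return float(np.minimum(h1, h2).sum())

def link_superpixels(sp_prev, sp_next, flow, sim_thresh=0.6):
    # sp_prev, sp_next: lists of dicts with 'centroid' (y, x),
    # 'hist' (colour histogram) and 'mask' (boolean H x W array).
    # flow: dense optical flow field of shape (H, W, 2), (dx, dy).
    H, W = flow.shape[:2]
    links = []
    for i, sp in enumerate(sp_prev):
        cy, cx = (int(round(v)) for v in sp['centroid'])
        ty = int(np.clip(cy + flow[cy, cx, 1], 0, H - 1))
        tx = int(np.clip(cx + flow[cy, cx, 0], 0, W - 1))
        for j, sq in enumerate(sp_next):
            if sq['mask'][ty, tx] and \
               histogram_intersection(sp['hist'], sq['hist']) > sim_thresh:
                links.append((i, j))  # one-to-many links allow bifurcations
    return links
        </preformat>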
      </sec>
      <sec id="sec-2-2">
        <title>Group Representation and Matching</title>
        <p>We use colour, motion and texture to represent each superpixel group. Colour
is represented by a histogram (25 bins per channel), motion by an optical flow
orientation histogram weighted by flow magnitudes (30 bins), and texture by a
histogram of oriented gradients (30 bins). Let a(gi, gj), m(gi, gj) and h(gi, gj)
denote respectively the intersections of the colour, flow, and texture histograms of
two superpixel groups gi and gj. These groups' similarity k(gi, gj) is computed
as in Eqn. (1), where λ3 = 1 - λ1 - λ2 and the parameters λ1 and λ2 are tuned
during the training process.</p>
        <p>k(gi, gj) = λ1 a(gi, gj) + λ2 m(gi, gj) + λ3 h(gi, gj)    (1)</p>
      </sec>
      <sec id="sec-2-3">
        <title>Mining and Recognition</title>
        <p>
          Previously [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], mining used the seeding algorithm in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] which considers
discriminability. Here we also include representativeness in the mining process [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
The idea is that, for example, mined superpixel groups for action A should only
tend to appear in instances of action A (discriminability) and that they should
appear in many instances of action A (representativeness). To achieve this, for
each group gi in the training set, we select the M most similar groups from each
of the other subjects who performed the manipulation action. The total number of
selected groups is then K = M(P - 1), where P is the number of subjects
in the training set. We compute the mining score for a group by summing its
discriminability and representativeness scores. The former is the proportion of
selected groups with the same label as that group. The latter is the proportion
of subjects with at least one selected group with the same label.
        </p>
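        <p>This scoring can be sketched as follows, assuming similarities between all training groups have been precomputed with Eqn. (1) and that labels and subject identities are stored in parallel arrays; these names and the array layout are our assumptions.</p>
        <preformat>
# Sketch of the mining score: discriminability + representativeness.
# sims, labels, subjects are assumed precomputed parallel arrays.
import numpy as np

def mining_score(g, sims, labels, subjects, M):
    # sims: (n_groups, n_groups) similarity matrix from Eqn. (1);
    # labels, subjects: per-group action label and subject id.
    own_label = labels[g]
    other_subjects = [s for s in np.unique(subjects) if s != subjects[g]]
    per_subject_hits = []
    n_same = 0
    for s in other_subjects:
        idx = np.where(subjects == s)[0]
        top = idx[np.argsort(sims[g, idx])[-M:]]  # M most similar groups
        same = labels[top] == own_label
        n_same += same.sum()
        per_subject_hits.append(same.any())
    K = M * len(other_subjects)                   # K = M(P - 1)
    discriminability = n_same / K
    representativeness = np.mean(per_subject_hits)
    return discriminability + representativeness
        </preformat>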
        <p>
          A video frame is assigned an action label based on a fixed-duration temporal
window centred on that frame. Max-N pooling [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] is used to generate the feature
vector for a temporal window. Implementation details can be found in [
          <xref ref-type="bibr" rid="ref14 ref8">14, 8</xref>
          ].
Windows are classified using support vector machines trained with LIBLINEAR [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
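        <p>As an illustrative sketch of this step (the value of N, the data structures, and the use of scikit-learn's LinearSVC, which is itself backed by LIBLINEAR, are our assumptions), max-N pooling keeps, for each mined group, its N highest similarities to the groups occurring in the window; concatenating these values over all mined groups gives the window's feature vector.</p>
        <preformat>
# Sketch of max-N pooling over a temporal window, followed by a
# linear SVM. N and the data structures are illustrative assumptions.
import numpy as np
from sklearn.svm import LinearSVC  # LIBLINEAR-backed linear SVM

def max_n_pool(window_groups, mined_groups, similarity, N=3):
    # For each mined group, keep its N highest similarities (Eqn. 1)
    # to the superpixel groups present in the temporal window.
    feats = []
    for m in mined_groups:
        s = np.sort([similarity(m, g) for g in window_groups])[::-1][:N]
        if N > len(s):                     # pad windows with few groups
            s = np.pad(s, (0, N - len(s)))
        feats.append(s)
    return np.concatenate(feats)

# Training sketch: X stacks pooled features for all training windows,
# y holds their action labels.
# X = np.stack([max_n_pool(w, mined, group_similarity) for w in windows])
# clf = LinearSVC(C=1.0).fit(X, y)
# predictions = clf.predict(X_test)
        </preformat>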
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>
        We used two datasets: 50 Salads and ACE. The 50 Salads dataset contains 50
videos. It has 25 subjects; each subject made two mixed salads, with the order
of steps varying between videos. There are 10 actions in this dataset: add pepper, add oil, mix
dressing, peel cucumber, cut ingredient, place ingredient into bowl, mix
ingredients, serve salad onto plate, dress salad and NULL, where NULL represents all
times when none of the other 9 actions is occurring. Following the protocol in
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the dataset is split into 5 folds. Each fold contains 10 videos made by 5
subjects. Five-fold cross validation is used to estimate performance.
      </p>
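      <p>A minimal sketch of this subject-wise protocol (the video-to-subject indexing here is an illustrative assumption) is:</p>
      <preformat>
# Sketch of the five-fold, subject-wise split on 50 Salads:
# 25 subjects, 2 videos each, 5 subjects (10 videos) per fold.
import numpy as np

subjects = np.arange(25)
folds = np.array_split(subjects, 5)          # 5 folds of 5 subjects

for k, test_subjects in enumerate(folds):
    train_subjects = np.setdiff1d(subjects, test_subjects)
    # train on the videos of train_subjects, test on test_subjects
    print(f"fold {k}: test subjects {test_subjects.tolist()}")
      </preformat>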
      <p>The ACE dataset was proposed in the contest "Kitchen Scene Context based
Gesture Recognition" at ICPR 2012. There are seven subjects. Each of them
was required to cook five recipes. Nine actions were annotated in the dataset:
breaking, mixing, baking, turning, cutting, boiling, seasoning, peeling and NULL,
where NULL represents all times when none of the other 8 actions is occurring.
There are 25 videos in the training set and 10 videos in the testing set.</p>
      <p>We randomly selected 10,000 superpixel groups from 4,000 temporal windows
in each action class for group mining. Each temporal window has a duration of
155 frames.</p>
      <p>Fig. 1 shows examples of mined superpixel groups; red regions are superpixels
in the mined groups. The mined groups provide interpretable representations for
the different actions. For instance, in the 50 Salads examples, groups representing
the pepper container and the hand motion suggest the action add pepper; groups
representing food ingredients together with groups on the bowl suggest the action mix
ingredients. In the ACE examples, mined groups capture the eggs in the bowl as the
representation for the action mixing; superpixel groups of human arm and spoon
together suggest the action seasoning.</p>
      <p>By visualising mined groups, we can discover whether they provide a representation
that is likely to generalise. For instance, the third group in Fig. 1(d) seasoning
captures clothing rather than anything inherently associated with the action
class. This indicates overfitting and may cause failure to generalise.</p>
      <p>
        Table 1 compares the modified mining method with the previous method [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
on both datasets. Accuracies on 50 Salads are similar. As reported in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], this
accuracy is better than that of competing methods using deep learning. The
modified mining method slightly improved accuracy on the ACE dataset.
      </p>
      <p>Fig. 1. Examples of mined superpixel groups: (a) add pepper, (b) mix ingredients, (c) mixing, (d) seasoning.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>
        We modified the superpixel group mining used in our previously proposed method
for manipulation action recognition. Experiments on two datasets showed its
effectiveness in terms of accuracy. We also highlighted the interpretable nature of
the learned representation, in contrast to many deep learning methods. Visualisation
of matched superpixel groups can provide a level of explanation for the recognition
decisions made. It can also provide insights into likely generalisation ability,
enabling identification of groups that represent aspects of the video that are not
relevant to the actions of interest.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Prest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ferrari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          :
          <article-title>Explicit modeling of human-object interactions in realistic videos</article-title>
          .
          <source>IEEE Trans. PAMI</source>
          ,
          <fpage>835</fpage>
          -
          <lpage>848</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Moulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          :
          <article-title>Pipelining localized semantic features for fine-grained action recognition</article-title>
          .
          <source>In ECCV</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>B.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Paramathayalan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Moulin</surname>
          </string-name>
          :
          <article-title>Multiple granularity modeling: a coarse-to-fine framework for fine-grained action analysis</article-title>
          .
          <source>Int. J. Computer Vision</source>
          ,
          <fpage>28</fpage>
          -
          <lpage>43</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>B.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gao</surname>
          </string-name>
          :
          <article-title>Progressively parsing interactional objects for fine-grained action detection</article-title>
          .
          <source>In CVPR</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. C. Fermuller,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zampogiannis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Barranco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pfeiffer</surname>
          </string-name>
          :
          <article-title>Prediction of Manipulation Actions</article-title>
          .
          <source>Int. J. Computer Vision</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>B.</given-names>
            <surname>Packer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Saenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Koller</surname>
          </string-name>
          :
          <article-title>A combined pose, object, and feature model for action understanding</article-title>
          .
          <source>In CVPR</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fermuller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Aloimonos</surname>
          </string-name>
          :
          <article-title>Detection of manipulation action consequences (MAC)</article-title>
          .
          <source>In CVPR</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>McKenna</surname>
          </string-name>
          :
          <article-title>Sequential recognition of manipulation actions using discriminative superpixel group mining</article-title>
          .
          <source>In ICIP</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>B.</given-names>
            <surname>Fernando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fromont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuytelaars</surname>
          </string-name>
          :
          <article-title>Mining mid-level features for image classification</article-title>
          .
          <source>Int. J. Computer Vision</source>
          ,
          <fpage>186</fpage>
          -
          <lpage>203</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>S.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>McKenna</surname>
          </string-name>
          :
          <article-title>Combining embedded accelerometers with computer vision for recognizing food preparation activities</article-title>
          .
          <source>In ACM UbiComp</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>A.</given-names>
            <surname>Shimada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kondo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Morin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sterna</surname>
          </string-name>
          :
          <article-title>Kitchen scene context based gesture recognition: a contest in ICPR2012</article-title>
          .
          <source>Advances in Depth Image Analysis and Applications</source>
          ,
          <fpage>168</fpage>
          -
          <lpage>185</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>M.</given-names>
            <surname>Van den Bergh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Van Gool</surname>
          </string-name>
          :
          <article-title>Depth SEEDS: Recovering incomplete depth data using superpixels</article-title>
          .
          <source>In WACV</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Fischler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. C.</given-names>
            <surname>Bolles</surname>
          </string-name>
          :
          <article-title>Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography</article-title>
          .
          <source>Comm. ACM</source>
          ,
          <fpage>381</fpage>
          -
          <lpage>395</lpage>
          (
          <year>1981</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tian</surname>
          </string-name>
          :
          <article-title>Interaction part mining: a mid-level approach for fine-grained action recognition</article-title>
          .
          <source>In CVPR</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>R.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hsieh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lin</surname>
          </string-name>
          :
          <article-title>LIBLINEAR: a library for large linear classification</article-title>
          .
          <source>J. Machine Learning Research</source>
          ,
          <fpage>1871</fpage>
          -
          <lpage>1874</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>