<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Video Processing for Judicial Applications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konstantinos Avgerinakis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexia Briassouli</string-name>
          <email>abria@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Kompatsiaris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Informatics and Telematics Institute, Centre for Research and Technology Hellas</institution>
          ,
          <addr-line>Thessaloniki</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <fpage>58</fpage>
      <lpage>64</lpage>
      <abstract>
        <p>The use of multimedia data has expanded into many domains and applications beyond technical usage, such as surveillance, home monitoring, health supervision, and judicial applications. This work is concerned with the application of video processing techniques to judicial trials in order to extract useful information from them. The automated processing of the large amounts of digital data generated in court trials can greatly facilitate their browsing and access. Video information can provide clues about the state of mind and emotion of the speakers, information which cannot be derived from the textual transcripts of the trial, nor even from the audio recordings. For this reason, we focus on analyzing the motions taking place in the video, and mainly on tracking gestures or head movements. A wide range of methods is examined, in order to find which one is most suitable for judicial applications.</p>
      </abstract>
      <kwd-group>
        <kwd>video analysis</kwd>
        <kwd>recognition</kwd>
        <kwd>judicial applications</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The motion analysis of the judicial videos is based on various combinations of
video processing algorithms, in order to achieve reliable localization and tracking
of significant features in the video. Initially, optical flow is applied to the video, in
order to extract the moving pixels and their activity. Numerous algorithms exist
for the estimation of optical flow, such as the Horn-Schunck algorithm, developed in 1981 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and
the Lucas-Kanade algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. A more recent and sophisticated approach was developed
by Bouguet in 2000 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This method implements a sparse iterative version of
Lucas-Kanade optical flow in pyramids. By applying the optical flow algorithm,
we separate the video frame pixels into a set of points that are moving and a
static set of points. As some of the optical flow estimates may be caused by
measurement noise, a more accurate separation of the pixels into static and
active can be obtained by applying higher-order statistics, namely the kurtosis,
to the optical flow data.
      </p>
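<p>As an illustration of this first step, a minimal single-scale Horn-Schunck iteration can be sketched as follows. This is a NumPy sketch for illustration only, not the implementation used in this work; the smoothness weight alpha and the iteration count are assumed values.</p>

```python
import numpy as np

def horn_schunck(frame1, frame2, alpha=0.3, n_iters=300):
    """Minimal single-scale Horn-Schunck optical flow (illustrative sketch).

    Returns per-pixel flow (u, v) approximately satisfying
    Ix*u + Iy*v + It = 0 under a smoothness constraint weighted by alpha.
    """
    I1 = frame1.astype(float)
    I2 = frame2.astype(float)
    # Spatial derivatives averaged over the two frames; temporal difference.
    Ix = 0.5 * (np.gradient(I1, axis=1) + np.gradient(I2, axis=1))
    Iy = 0.5 * (np.gradient(I1, axis=0) + np.gradient(I2, axis=0))
    It = I2 - I1
    u = np.zeros_like(I1)
    v = np.zeros_like(I1)

    def local_avg(f):
        # 4-neighbour average: the smoothness coupling of the update rule.
        return 0.25 * (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
                       np.roll(f, 1, 1) + np.roll(f, -1, 1))

    for _ in range(n_iters):
        u_bar, v_bar = local_avg(u), local_avg(v)
        common = (Ix * u_bar + Iy * v_bar + It) / (alpha**2 + Ix**2 + Iy**2)
        u = u_bar - Ix * common
        v = v_bar - Iy * common
    return u, v
```

<p>On a synthetic pair where a smooth blob translates one pixel to the right, the recovered horizontal flow u is positive over the moving region, while it stays near zero over the static background.</p>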
      <p>
        After the pixel motion is estimated and the active pixels are separated from
the static ones, only the active pixels are processed. This allows the system to
operate with fewer errors that would be caused by confusing the noise in static
pixels with true motion. (Post-proceedings of the 2nd International Conference
on ICT Solutions for Justice, ICT4Justice 2009.) The next step in the motion analysis is interest point
tracking in the active pixels. The goal of this stage is to obtain the points which will
define the human action. First, we want to identify the object that appears to
move through the frames. This object can be characterized using specific feature
points that describe the image and the object distinctively. These feature
points then need to be matched from frame to frame so as to achieve interest
point tracking. Several state-of-the-art algorithms have been examined for the
detection and description of these features: the SIFT (Scale
Invariant Feature Transform) algorithm, published by David Lowe in
1999 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; the SURF (Speeded Up Robust Features) algorithm, presented
by Herbert Bay in 2006 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which is based on sums of 2D Haar wavelet
responses and makes efficient use of integral images; and the Harris-Stephens corner
detection algorithm, developed in 1988, which finds corners where the local image structure has two large eigenvalues [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
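<p>To make the corner criterion concrete, the following sketch computes the Harris-Stephens corner response R = det(M) - k trace(M)^2 from the windowed structure tensor M. This is a NumPy illustration; the window radius and k = 0.04 are conventional values, not parameters taken from this work.</p>

```python
import numpy as np

def window_sum(f, r=2):
    """Sum of f over the (2r+1) x (2r+1) window around each pixel."""
    out = np.zeros_like(f)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += np.roll(np.roll(f, dy, axis=0), dx, axis=1)
    return out

def harris_response(img, k=0.04, r=2):
    """Harris-Stephens corner response: large where both eigenvalues
    of the windowed structure tensor are large."""
    Iy, Ix = np.gradient(img.astype(float))
    Sxx = window_sum(Ix * Ix, r)
    Syy = window_sum(Iy * Iy, r)
    Sxy = window_sum(Ix * Iy, r)
    det = Sxx * Syy - Sxy * Sxy   # product of the two eigenvalues
    trace = Sxx + Syy             # sum of the two eigenvalues
    return det - k * trace * trace
```

<p>On a bright square against a dark background, the response peaks near the square's corners and is negative along its straight edges, which is exactly the behaviour the detector exploits.</p>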
      <p>After the feature points are detected, interest point matching is needed. This
can be accomplished using a kd-tree algorithm, and specifically the BBF (Best
Bin First) algorithm. The BBF algorithm finds the closest neighbour of each feature in the
next frame based on the distance between the two features’ descriptors. In the rest of the
paper, the algorithms and their combinations that are used are explained in
detail. The results of each technique are shown to demonstrate which ones give
the best results.</p>
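<p>The matching criterion can be illustrated as follows. BBF accelerates the nearest-neighbour query with an approximate kd-tree search; the sketch below uses brute-force distances plus Lowe's ratio test, which keeps a match only when the best neighbour is clearly closer than the second best. The 0.8 threshold is a common choice, not a value taken from this work.</p>

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Nearest-neighbour descriptor matching with the ratio test.

    desc_a, desc_b: (n, d) arrays of feature descriptors from two frames.
    Returns a list of (index_in_a, index_in_b) pairs.
    """
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)
        order = np.argsort(dists)
        best, second = order[0], order[1]
        # Accept only if the best distance is well below the second best.
        if ratio * dists[second] > dists[best]:
            matches.append((i, int(best)))
    return matches
```

<p>Replacing the brute-force distance scan with a kd-tree traversal that visits bins in order of closeness, and stopping early, yields the BBF approximation of the same matches at a fraction of the cost.</p>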
    </sec>
    <sec id="sec-2">
      <title>Kurtosis-based Activity Area</title>
      <p>
        In some of the methods used in this work, the optical flow estimates can be used
immediately, or further processed by higher order statistics in order to obtain a
more accurate estimate of the active pixels [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The optical flow values may be
caused by true motion in each pixel, or by measurement noise, expressed by the
following two hypotheses:
      </p>
      <p>H0 : v_k^0(r¯) = z_k(r¯)
H1 : v_k^1(r¯) = u_k(r¯) + z_k(r¯),
(1)
where v_k^i(r¯) expresses the flow estimate at frame k and pixel r¯ and i = 0, 1
depending on the corresponding hypothesis. Also, u_k(r¯) is the illumination
variation caused by true motion and z_k(r¯) is the additive measurement noise. In
the literature, additive noise is often modeled by a Gaussian distribution. Thus
the separation of the active from the static pixels is reduced to the detection
of Gaussianity in the flow estimates. A classical measure of Gaussianity is the
kurtosis, which is zero for a Gaussian random variable and is given by:
kurt(y) = E[y^4] - 3(E[y^2])^2.
(2)
The kurtosis is applied to the flow estimates in order to separate the motion-induced from the
noise-induced flow values. This leads to a binary mask showing pixel activity,
called the Activity Area, which can be used to isolate the active from the static
pixels.</p>
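<p>A sketch of how such an Activity Area could be computed from a stack of per-pixel flow values follows. The raw-moment kurtosis estimator matches the zero-mean Gaussian noise model above; the threshold value is an illustrative assumption, not one reported in this work.</p>

```python
import numpy as np

def activity_area(flow_seq, thresh=1.5):
    """Binary Activity Area from per-pixel flow values over time.

    flow_seq: (T, H, W) array of optical flow values at each pixel.
    Pixels whose flow is pure Gaussian noise have kurtosis near zero;
    pixels with true motion deviate, so large |kurtosis| marks activity.
    """
    y = flow_seq.astype(float)
    # kurt(y) = E[y^4] - 3 (E[y^2])^2, estimated per pixel over time.
    kurt = (y ** 4).mean(axis=0) - 3.0 * ((y ** 2).mean(axis=0) ** 2)
    return np.abs(kurt) > thresh
```

<p>For example, a pixel whose flow is Gaussian noise stays outside the mask, while a pixel whose flow alternates between rest and motion is strongly non-Gaussian and is flagged as active.</p>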
      <p>Combination of Features and Flow for Active Feature Localization</p>
      <sec id="sec-2-1">
        <title>HS Optical Flow, Kurtosis, SIFT, BBF Matching</title>
        <p>
          In the first approach examined, the Horn-Schunck algorithm is used to estimate
the optical flow of the video [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The optical flow values are processed using the
kurtosis-based technique of Sec. 2 in order to extract the Activity Area for that
video. The Activity Area is shown in Fig. 1(a), and a video frame masked by
the Activity Area is shown in Fig. 1(b), from where it can be seen that the
moving arm is correctly masked. The Activity Area is used to separate active
pixels from the static ones, which are ignored in the rest of the processing. The
SIFT algorithm is applied to the active pixels in order to extract features of
interest from them [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. This feature detector is chosen due to its robustness,
and in particular because it has been shown to be invariant to image scale and
rotation, as well as robust to local occlusion. The resulting features are matched
between successive frames, and the results are shown in Fig. 2. In Fig. 2 the
blue points indicate which SIFT features have not been matched and the purple
lines show the matching between two frames. These results are good but not
entirely accurate, as a small number of features is found, and the matching does
not provide rich enough information about the activity taking place. Therefore,
more methods are examined for increased accuracy in the sequel.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Harris-Stephens Corner Detection, Pyramidal Lucas-Kanade Optical Flow, Kurtosis</title>
        <p>
          As features of interest often appear near corners, in this set of experiments the
Harris-Stephens corner detection is initially used to detect feature points [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. As
Fig. 3 shows, this method provides more feature points than the previously used
SIFT feature point detector, and can therefore be considered to give more reliable
and robust results. The motion of these feature points also needs to be estimated,
to better understand the activity taking place - in this case a gesture. A more
recent and sophisticated method for optical flow estimation is used here, namely
the pyramidal Lucas-Kanade optical flow, which can deal with both large and
small flow values. The kurtosis method is applied again, in order to accurately
isolate the truly active features. As Fig. 4 shows, the resulting masked feature
points provide a good localization of the activity of interest in the video, and can
therefore be used in subsequent stages for activity classification or recognition.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>SIFT Features, Pyramidal Lucas-Kanade Optical Flow, Kurtosis</title>
        <p>
          In this case, the features are detected using the SIFT algorithm initially, with the
results shown in Fig. 5. Afterwards, a more sophisticated method for estimating
the motion is used. Namely, the pyramidal Lucas-Kanade is applied to them
in order to find their motion throughout the video [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The kurtosis of these
flow values is estimated so as to eliminate the feature points that are not truly
moving, and only keep the active ones. Fig. 6 shows that in this case, fewer
features of interest (i.e. moving feature points) are detected than with the
previous method based on Harris-Stephens corners.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Horn-Schunck Optical Flow, Kurtosis, SURF Features</title>
        <p>
          In this case, Horn-Schunck optical flow is applied to the pixels where features
were detected, and the truly active pixels are extracted, as before, by applying
the kurtosis method described above. Afterwards, the resulting Activity Areas
are used to mask the video, and the resulting, smaller sequence is input to the
SURF algorithm. The SURF [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] method is examined for feature extraction as
it has been shown to be more robust than the SIFT algorithm against a wider
range of image transformations. Additionally, it runs significantly faster, which
is important when processing many long video streams. Features on the active
pixels are then found by SURF and, as Fig. 7 shows, they can successfully isolate
the most interesting active parts of the video.
        </p>
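<p>The per-point flow estimation underlying the pyramidal tracker of [3] can be illustrated at a single pyramid level: for each feature, the flow is the least-squares solution of the brightness-constancy equations over a small window. This is a sketch under stated assumptions (window radius and test point are illustrative); the full method repeats this estimate coarse-to-fine and iteratively.</p>

```python
import numpy as np

def lucas_kanade_point(frame1, frame2, y, x, r=6):
    """Single-level Lucas-Kanade flow for one feature point.

    Solves, in the least-squares sense, Ix*u + Iy*v + It = 0 over the
    (2r+1) x (2r+1) window centred on (y, x). The pyramidal tracker
    repeats this estimate from coarse to fine resolution.
    """
    I1 = frame1.astype(float)
    I2 = frame2.astype(float)
    Iy, Ix = np.gradient(I1)
    It = I2 - I1
    win = (slice(y - r, y + r + 1), slice(x - r, x + r + 1))
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    b = -It[win].ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```

<p>For a smooth blob translated one pixel to the right, a window placed on the blob's flank recovers a horizontal flow close to one pixel and a near-zero vertical flow.</p>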
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>A variety of video processing algorithms has been applied to judicial videos
containing characteristic gestures, in order to determine which ones are most
appropriate for isolating the activity of interest. The optical flow is extracted, as
the moving points are of interest in this work. The active pixels are separated
from the static ones using a kurtosis-based method. The SIFT algorithm is
initially tested for feature detection, as it is known to be robust, but is shown to provide too
sparse a set of feature points for an accurate depiction of the activity taking
place. The Harris-Stephens detector and the SIFT algorithm, combined with a
pyramidal version of the Lucas-Kanade algorithm, provide an accurate
representation of the features of interest in the video. Finally, the SURF algorithm
is examined, as it is one of the current state-of-the-art methods for feature
extraction, being robust to a wide range of transformations and computationally
efficient. The results of applying SURF to the active pixel areas are very
satisfactory and can be considered reliable information for activity classification
at later stages. Future work includes examining additional feature detectors,
such as the STAR detector, as well as performing feature point matching for the
detectors examined.</p>
      <sec id="sec-3-1">
        <title>Acknowledgements</title>
        <p>The research leading to these results has received funding from the
European Community’s Seventh Framework Programme FP7/2007-2013 under grant
agreement FP7-214306 - JUMAS.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Horn</surname>
            <given-names>B.K.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schunck</surname>
            <given-names>B. G.</given-names>
          </string-name>
          :
          <article-title>Determining Optical Flow</article-title>
          .
          <source>Artificial Intelligence</source>
          <volume>17</volume>
          ,
          <fpage>185</fpage>
          -
          <lpage>203</lpage>
          (
          <year>1981</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Lucas</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanade</surname>
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>An Iterative Image Registration Technique with an Application to Stereo Vision</article-title>
          .
          <source>In: Proc. of 7th International Joint Conference on Artificial Intelligence (IJCAI)</source>
          , pp.
          <fpage>674</fpage>
          -
          <lpage>679</lpage>
          . (
          <year>1981</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bouguet</surname>
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Pyramidal Implementation of the Lucas Kanade Feature Tracker</article-title>
          . In: OpenCV distribution. (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Lowe</surname>
            <given-names>D. G.</given-names>
          </string-name>
          :
          <article-title>Distinctive Image Features from Scale-Invariant Keypoints</article-title>
          .
          <source>International Journal of Computer Vision</source>
          .
          <volume>60</volume>
          ,
          <fpage>91</fpage>
          -
          <lpage>110</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Bay</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ess</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tuytelaars</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Gool</surname>
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>SURF: Speeded Up Robust Features</article-title>
          .
          <source>Computer Vision and Image Understanding</source>
          .
          <volume>110</volume>
          ,
          <fpage>346</fpage>
          -
          <lpage>359</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Harris</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stephens</surname>
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>A combined corner and edge detector</article-title>
          .
          <source>In: Proceedings of the 4th Alvey Vision Conference</source>
          , pp.
          <fpage>147</fpage>
          -
          <lpage>151</lpage>
          . (
          <year>1988</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Briassouli</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kompatsiaris</surname>
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Robust Temporal Activity Templates Using Higher Order Statistics</article-title>
          .
          <source>IEEE Transactions on Image Processing</source>
          . To appear.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>