Automatic Selection of Live User Generated Content

Stefanie Wechtitsch, Marcus Thaler, Albert Hofmann, Andras Horti, Werner Bailer
JOANNEUM RESEARCH, DIGITAL
Steyrergasse 17, 8010 Graz, Austria
{firstname.lastname}@joanneum.at

Wolfram Hofmeister, Jameson Steiner, Reinhard Grandl
Bitmovin
Lakeside B01, 9020 Klagenfurt, Austria
{firstname.lastname}@bitmovin.com

4th International Workshop on Interactive Content Consumption at TVX'16, June 22, 2016, Chicago, IL, USA. Copyright is held by the author/owner(s).

ABSTRACT
User generated content (UGC) is a valuable source for improving the coverage of events such as concerts, festivals or sports events. Integrating UGC in existing professional production workflows is particularly challenging in live productions. UGC needs to be checked for quality in this case, and metadata captured by the mobile device and extracted from the content are relevant for filtering the UGC streams that go into a live production system. We propose a system for capturing live audio and video streams on a mobile device, performing automatic metadata extraction in real-time and indexing the metadata for access by a production system. The system receives an audio, video and metadata stream from the mobile device, and creates additional metadata from the ingested audiovisual content. The metadata (e.g., location, quality) are then used to automatically select and rank streams, either selecting a stream to show to a viewer or a list of streams from which a human operator can select.

ACM Classification Keywords
H.5.1 Information Interfaces and Presentation: Multimedia Information Systems; I.4.1 Image Processing and Computer Vision: Digitization and Image Capture

Author Keywords
user generated content, content selection, sensor, mobile, content analysis, live

INTRODUCTION
User generated content (UGC) is a valuable source for improving the coverage of events such as concerts, festivals or sports events. In order to integrate user generated content into existing production workflows, the quality of UGC needs to be checked and metadata needs to be extracted. Such metadata, together with sensor information from the mobile device, will help the production team to assess the context, quality and relevance of the user contribution.

A particularly challenging scenario are live productions, where such metadata needs to be available with small latency. Live streaming of UGC from mobile devices has recently gained popularity, among others through the use of apps like Meerkat (https://meerkatapp.co) or Periscope (https://www.periscope.tv). However, these apps provide a stream "as is" for viewing on the web, without integration in production workflows. The end users manually need to select a particular stream and have to discover themselves whether there are alternative streams of the event available, in case the one they are watching becomes boring or turns out to be of insufficient quality (both are unfortunately not so uncommon on today's live streaming platforms). Thus, a system that integrates professional and user generated content of an event needs to provide support for content selection. Content selection can be supported by metadata either captured on the mobile device (e.g., capture location) or extracted from the content (e.g., content quality).

We propose a system for capturing live audio and video streams on a mobile device, performing automatic metadata extraction in real-time and indexing the metadata for access by a production system. The system receives an audio, video and metadata stream from the mobile device, and creates additional metadata from the audiovisual content. All metadata are available as a stream (with low latency from the extraction), and are indexed in a metadata store. Metadata needed in the real-time process can be read directly from the stream, and earlier metadata can be queried from the store. The metadata are used to automatically filter content that matches defined quality levels, to select the best stream among alternative ones and to provide a set of content options.

The rest of this paper is organised as follows. The Section Capture and Analysis System describes the capture tools and the analysis framework and modules. The approach to content selection and the results are discussed in Section Content Selection, followed by a Conclusion.
CAPTURE AND ANALYSIS SYSTEM

System Overview
Figure 1 shows an overview of the proposed system. The system consists of a dedicated capture app, which sends video, audio and metadata as separate streams. This saves the muxing/demuxing effort and also facilitates distributed processing of different modalities on different machines in the cloud. All data are provided as RTP streams. The processing system (dashed box in the diagram) performs the necessary decoding and transformation for the content, and also includes a set of interconnected analysis modules. These modules may not only use the content as input, but may also use metadata from the device or from other modules. All extracted metadata are provided as streams again, and a logging module listens to these streams and indexes the data in the metadata store. The audiovisual streams can be connected to viewers or to an editing system. A web application performs content selection and displays the audiovisual data together with the extracted metadata.

[Figure 1. Overview of the proposed system: the client app (capture, quality analysis, sensor metadata) streams video, audio and metadata over 3G/4G to the ingest and decoding components, which feed the analysis modules; a metadata logger indexes the extracted metadata in the metadata store, which is accessed by the editing and web applications.]

We decided to build on an existing framework with many standard components which is able to handle the decoding of the commonly used media formats. Thus, the GStreamer (http://gstreamer.freedesktop.org) open source multimedia framework is used for this purpose.

Content Capture
The integrated capture application for Android enables users to perform quality analysis while capturing sensor data and streaming captured video. The main features are: (a) audio and video recording, via the built-in microphone and camera respectively, (b) metadata capturing from different sensors available on the device, (c) on-device analysis of captured essence to meet quality constraints, (d) en-/transcoding and packaging of recorded content and (e) up-streaming functionality to servers for processing.

Raw video and audio data is captured through the camera and microphone of the device and encoded using Android's MediaCodec API, while at the same time the quality of video frames is analysed. As encoding of video frames is typically more time consuming than encoding of audio frames, a buffer synchronizes both streams. Once the encoding of a frame has finished, it is committed into the buffer and/or sent to an RTP packager. In parallel, a live on-device preview, containing visual quality-related notifications as discussed in the following sections, is presented. Synchronization is done by keeping track of the latest PTS for each stream.

During initialisation of the capture application, various types of static metadata (such as properties and technical parameters of the mobile device) are sent to the processing system. Moreover, together with the content, metadata from the on-device sensors are captured to support real-time quality analysis, by recording the following sensors: location, accelerometer, gyroscope, magnetic field, orientation, rotation, ambient light, proximity and pressure. For example, the accelerometer can be used to detect fast and shaky movements of the mobile device.

For the analysis of video frames, several lightweight algorithms, which identify defects in the captured content, were implemented. Thus, contributing users who have this app installed are capable of performing visual quality analysis on the mobile device while capturing video and obtaining direct feedback about the quality of the captured content. The application continuously measures sharpness, noise, luminance and exposure, and detects the use of brightness compensation before streaming the captured video [9]. This way, users are notified during capture if one of the quality measurements falls outside the target range. For each quality measure, an overlay including a descriptive icon and message is displayed to immediately notify the user, so that the quality impairment can be avoided.

Algorithms for sharpness, noise and over-/underexposure detection have been implemented in the app; details on these quality algorithms can be found in [9]. For sharpness estimation we use the Laplace operator for edge detection. By subsampling the response image into equally sized blocks, the sharpness value for each block is represented by the maximum slope response of the corresponding edges. The blocks with the highest values (strongest slopes) are selected to obtain the global sharpness value. For noise estimation the luminance component of each analysed image is calculated, and the block scores from the sharpness estimation are reused to find the most homogeneous blocks (those with few edges). For the remaining blocks, the average absolute differences between the original and the median-filtered image are computed, representing the block's noise score. The global noise level is then estimated by taking the median of these block values. To detect the use of brightness compensation, the average brightness progression of images within a certain time frame is approximated. If the summed positive or negative brightness variation values exceed a predefined threshold, the algorithm reports overexposure or underexposure, respectively. Using a Samsung Galaxy S5, the runtime of all proposed quality analysis algorithms for one frame of the captured HD image sequence is about 200 ms. Due to gradual temporal changes of image quality problems (e.g., noise), it is sufficient to process every sixth frame, enabling real-time operation.
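To make the block-based measures described above concrete, the following sketch derives a global sharpness and a noise score from a single luminance frame. It is a minimal approximation in Python using NumPy and SciPy, not the app's actual Android implementation; the function name, block size and block fraction are assumed values for illustration.

```python
import numpy as np
from scipy.ndimage import laplace, median_filter

def frame_quality(luma, block=32, top_frac=0.25):
    """Block-based sharpness and noise scores for one luminance frame (sketch).

    Sharpness is taken from the strongest Laplacian responses; noise from the
    difference to a median-filtered copy in the most homogeneous blocks.
    Block size and fractions are assumptions, not the paper's exact values.
    """
    h = (luma.shape[0] // block) * block
    w = (luma.shape[1] // block) * block
    frame = luma[:h, :w].astype(np.float32)

    # Maximum edge response per block approximates the block sharpness.
    resp = np.abs(laplace(frame))
    block_sharp = resp.reshape(h // block, block, w // block, block).max(axis=(1, 3)).ravel()

    # Global sharpness: average over the blocks with the strongest edges.
    n_top = max(1, int(top_frac * block_sharp.size))
    sharpness = float(np.sort(block_sharp)[-n_top:].mean())

    # Noise: mean absolute difference to the median-filtered frame,
    # evaluated only in the most homogeneous blocks (weakest edges).
    diff = np.abs(frame - median_filter(frame, size=3))
    block_noise = diff.reshape(h // block, block, w // block, block).mean(axis=(1, 3)).ravel()
    homogeneous = np.argsort(block_sharp)[:n_top]
    noise = float(np.median(block_noise[homogeneous]))

    return sharpness, noise
```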
Content and Metadata Streaming
In order to perform the RTP streaming, the encoded audio and video frames are pushed into a buffer and wrapped into the ISOBMFF file format. During the entire recording session of a user, each video segment is uploaded to the processing system. After the capture is finished, the full video is accessible via built-in Android functions. In order to stream packets over RTP, a packetizer is used which generates the RTP headers and splits the data into several packets if necessary. Every encoded audio/video frame is pushed into the respective packetizer. To ensure synchronization, a buffer-based approach similar to the one described in Section Content Capture is applied.
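The sketch below illustrates the general packetizer idea for one encoded frame: the payload is split to fit the network MTU and a basic RTP header is prepended, with the marker bit set on the last packet of the frame. This is a simplified Python illustration of standard RTP packetization, not the capture app's actual code; the payload type, SSRC and MTU values are assumptions.

```python
import struct

MTU = 1400            # assumed maximum payload size per packet (bytes)
PAYLOAD_TYPE = 96     # assumed dynamic RTP payload type for the video stream
SSRC = 0x1234ABCD     # assumed synchronisation source identifier

def packetize_frame(frame: bytes, seq: int, timestamp: int):
    """Split one encoded frame into RTP packets (simplified sketch).

    Returns the packets and the next sequence number; the marker bit is set
    on the last packet to signal the end of the frame.
    """
    packets = []
    chunks = [frame[i:i + MTU] for i in range(0, len(frame), MTU)] or [b""]
    for i, chunk in enumerate(chunks):
        marker = 0x80 if i == len(chunks) - 1 else 0x00
        header = struct.pack(
            "!BBHII",
            0x80,                    # version 2, no padding/extension/CSRC
            marker | PAYLOAD_TYPE,   # marker bit + payload type
            seq & 0xFFFF,            # sequence number
            timestamp & 0xFFFFFFFF,  # media timestamp of the frame
            SSRC,                    # stream identifier
        )
        packets.append(header + chunk)
        seq += 1
    return packets, seq
```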
The captured metadata, such as device and sensor metadata, are accumulated locally, with records indexed by time and type of metadata. Incremental metadata are made available as segments of the metadata stream. The segments, available as strings in JSON format (following the format defined in [2]), are sent periodically in chunks as a UDP stream from the mobile device to the processing system. For the analysis chain, a dedicated module was developed to receive the UDP packets, reassemble the string chunks into the complete JSON message and forward it to the metadata store handler.

The analysis chain receives the media stream over the RTP protocol as a UDP stream. The GStreamer standard components take care of the reception, management and decoding of the audio and video streams, and finally convert the video frames to RGB8 images.
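As a rough illustration of such a reception and decoding chain built from standard GStreamer elements, the sketch below uses the GStreamer Python bindings to receive an RTP/UDP video stream, decode it and hand RGB frames to an application sink where analysis components could hook in. The pipeline shown (H.264 payload, port number, caps) is an assumed example, not the exact configuration used in the system.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# Assumed receive/decode chain: RTP over UDP -> depayload -> decode -> RGB frames.
pipeline = Gst.parse_launch(
    "udpsrc port=5004 caps=\"application/x-rtp,media=video,"
    "clock-rate=90000,encoding-name=H264,payload=96\" "
    "! rtph264depay ! avdec_h264 ! videoconvert "
    "! video/x-raw,format=RGB ! appsink name=analysis emit-signals=true"
)

def on_frame(sink):
    """Called for every decoded RGB frame; analysis modules would process it here."""
    sample = sink.emit("pull-sample")
    buf = sample.get_buffer()
    ok, info = buf.map(Gst.MapFlags.READ)
    if ok:
        # info.data holds the raw RGB8 frame bytes for the analysis chain.
        buf.unmap(info)
    return Gst.FlowReturn.OK

appsink = pipeline.get_by_name("analysis")
appsink.connect("new-sample", on_frame)
pipeline.set_state(Gst.State.PLAYING)
# A full application would additionally run a GLib main loop and watch the bus.
```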
Analysis Framework
The analysis components are implemented as GStreamer plugins, and a flexible and powerful analysis chain is created by combining the standard GStreamer modules with our components. The GStreamer framework's messaging concept assures the optimal configuration for each plugin. For our purposes we applied a simplified configuration scheme to create simple plugins with an arbitrary number of input and output pads having different formats, though in some cases manual hints are necessary. Since the standard RTP stream contains only relative timestamps, the synchronization of audio and video content from different devices is realized on the basis of the timestamps of the RTCP stream. A custom plugin handles the extraction of these timestamps and the calculation of the difference between the internal clock and the absolute timestamp (e.g., synchronized with a PTP [7] clock).

Analysis Modules
The visual quality of a user generated video is a good indicator for an early decision whether the video might be useful, e.g. in a production, or whether it can conversely be sorted out due to insufficient quality. In particular, quality is an important decision criterion when a huge amount of data is available and should be reduced automatically. In order to obtain an overall quality measure of a user generated video, all available individual quality indicators are considered. The metadata received directly from the mobile device as well as the more complex quality measures obtained after transmission to the server are fused as described in the following.

As mentioned earlier, the mobile device provides quality estimates of how blurry the content is, how much noise it contains, whether there are parts suffering from over- or underexposure and whether the video was recorded under shaky conditions. On the server side, we may get additional measures by using more complex algorithms for the blurriness and the contained noise. Furthermore, an estimate for macro-blocking artefacts is determined. At first we compute one representative value for each measure and then combine these values by fusion. All of these measures are optional and may or may not be available. The sampling steps for each measure are individual but constant over the whole duration of the video. Noise and blurriness can be measured on both the mobile device and the server. Since a more complex algorithm can be used on the server, the results may differ a bit. Depending on the use case and the computational complexity that can be afforded, a subset of measures is computed and used for the overall quality measure. Under real-time requirements, we rather use the blur and noise estimation from the mobile device. Independent of the source of the blur and noise measures, the computation of the overall quality measure stays the same.

All sampled values are collected individually for each measure and are sorted in increasing order. An appropriate subset of the sorted list is chosen to compute the average value for each measure. It was empirically established that a well correlated measure emerges if the subset is chosen from the higher values (high values indicate lower quality in this case), causing bad quality frames to have a higher influence on the result than good quality frames (i.e., bad quality frames are overweighted compared to good quality frames). Thus, a video where only parts appear very blurry or have a high level of noise will be rated as being of poor quality. The quality is represented by a floating point number in the range of 0 to 1, where 0 indicates excellent quality and 1 corresponds to very poor quality. This representation is used for each individual measure as well as for the overall quality value.

We have chosen to use the upper 25% of quality scores (i.e., representing the 25% of segments with worst quality) to compute an average value for each of the involved quality measures. Finally, those values have to be combined. Simply averaging the individual measures is not a good strategy, since having one or two bad quality measures out of our set of five metrics would result in an inappropriate overall quality measure, distorted by the good quality measures. The measure which causes the highest impact on the content quality should have the highest impact on the final quality measure. Thus, we apply a weighted sum in which the highest values are disproportionately weighted higher.
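To make the fusion step concrete, the following sketch computes a per-measure score from the worst 25% of its sampled values and then combines the available measures with a weighted sum whose weights grow with the score, so that the worst measure dominates the result. The exact weighting function is not specified in the paper, so the power weighting and its exponent are assumptions for illustration.

```python
def measure_score(samples, worst_fraction=0.25):
    """Average over the worst (highest) fraction of sampled values of one measure.

    Values are in [0, 1], where 0 is excellent and 1 is very poor quality.
    """
    ordered = sorted(samples)
    n_worst = max(1, int(worst_fraction * len(ordered)))
    worst = ordered[-n_worst:]
    return sum(worst) / len(worst)

def overall_quality(measure_samples, exponent=3.0):
    """Fuse the available per-measure scores into one overall quality value.

    measure_samples maps a measure name (e.g. 'blur', 'noise', 'exposure',
    'shakiness', 'blocking') to its sampled values; missing measures are
    simply absent. Weights grow with the score (assumed power weighting),
    so the worst measure has the strongest influence on the result.
    """
    scores = {m: measure_score(v) for m, v in measure_samples.items() if v}
    if not scores:
        return None
    weights = {m: (s + 1e-6) ** exponent for m, s in scores.items()}
    total = sum(weights.values())
    return sum(weights[m] * scores[m] for m in scores) / total
```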
Metadata Store
The data exchange between the analysis platform and the production system is realised via a metadata store. This metadata store is a persistent hybrid repository accessible over a REST interface. Short-term data are kept in a Redis (http://redis.io) in-memory data structure store, whereas long-term data are archived in a MySQL (http://www.mysql.com) database. The repository type is transparent for the client; the difference is only noticeable in the query response time.

The extracted metadata are used for automatic content filtering of the UGC streams, e.g., discarding streams based on overall quality metadata or their location. By querying the metadata store with the appropriate criteria, the relevant streams can be selected for live editing.

CONTENT SELECTION
When multiple concurrent live streams are available for an event, automatic and real-time selection of the best quality content is advocated. The selection strategies implemented so far are rule-based. They use the metadata available in the metadata store as input, i.e., the metadata captured on the device and the results from quality analysis on the mobile device and the server. The metadata do not only contain raw sensor and analysis data, but also annotations of segments where pre-defined minimum quality limits have been violated.

Prior work on automatic video production in [1] and [4] aims at automating the selection of captured content, but these approaches have been developed for professional content and therefore do not exploit video quality as a cue for selection. For supporting or automating home video editing, some specific approaches for quality-based video production and selection have been studied, e.g., in [10] and [6]. Although they address some quality detection requirements specific to user generated content, these approaches are intended to be applied in an off-line fashion on pre-recorded video. An approach for creating mashups of multiple-camera concert recordings using video quality cues has been proposed in [8], which comes closest to our requirements. Signal quality measures extracted from the individual recordings are used for selecting the best quality segments. However, the approach is applied in a file-based off-line scenario; an on-line real-time scenario has not been investigated.

Approach
In our approach, content is discarded when quality metrics violate thresholds for minimum quality, and the same thresholds are applied for all streams. In addition, the average quality measure determined as described in Section Analysis Modules is compared against a threshold. Temporal filtering of selection decisions is applied in order to avoid switching streams on and off when quality values fluctuate around the thresholds. The choice of the size of the temporal filter is a trade-off between more frequent switching between streams and more robust decisions that come at the cost of higher latency of the analysis result. If the system is used in a semi-automatic mode, an operator may override automatic filtering decisions based on quality if the clip is the only one showing content that should be included.

After filtering, ranking of the remaining streams is applied. For content-based ranking, we use a strategy that is similar to approaches that boost diversity in search results: (i) we prefer streams showing a different area of the event over more of the same, and (ii) from a group of similar streams we select the one with the best quality. We use location information, where available from the metadata of the stream, and/or additionally determine the visual overlap between streams as described in [3]. The spatial distance and the visual similarity are used to determine a pairwise measure of diversity between two streams, in analogy to the affinity graph described in [11]. However, as we do not start from a specific query, we always rank the entire set of streams available at the current time segment. In the current implementation, we only update the location metadata when streams end or are added. The ranked list of streams can be provided as input to a user interface, or an automatic method can be used to select from the top entries in the list, such as the virtual director approach proposed in [5].
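The rule-based filtering and diversity-aware ranking could look roughly like the sketch below: streams are dropped when their temporally filtered quality violates the threshold, and the remaining streams are ranked greedily so that each newly selected stream is both of good quality and diverse (distant or dissimilar) with respect to the already selected ones. The threshold, window size, trade-off weight and the Stream/rank helpers are assumptions for illustration, not the system's exact rules.

```python
from collections import deque
from dataclasses import dataclass, field

QUALITY_THRESHOLD = 0.6   # assumed maximum acceptable quality score (0 = best, 1 = worst)
FILTER_WINDOW = 10        # assumed temporal filter length (number of recent decisions)

@dataclass
class Stream:
    stream_id: str
    quality: float                      # fused overall quality score in [0, 1]
    recent_ok: deque = field(default_factory=lambda: deque(maxlen=FILTER_WINDOW))

    def accept(self):
        """Threshold check with temporal filtering to avoid on/off flicker."""
        self.recent_ok.append(self.quality <= QUALITY_THRESHOLD)
        return sum(self.recent_ok) > len(self.recent_ok) / 2

def rank(streams, diversity, trade_off=0.5):
    """Greedy quality/diversity ranking of the accepted streams.

    diversity(a, b) is a pairwise dissimilarity in [0, 1], assumed to be
    derived from spatial distance and visual overlap between the streams.
    """
    candidates = [s for s in streams if s.accept()]
    ranked = []
    while candidates:
        def gain(s):
            div = min((diversity(s, r) for r in ranked), default=1.0)
            return trade_off * (1.0 - s.quality) + (1.0 - trade_off) * div
        best = max(candidates, key=gain)
        ranked.append(best)
        candidates.remove(best)
    return ranked
```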
Results
The content selection application is implemented as a web application, which implements the selection rules and also includes an HTML5 metadata viewer (see Figure 2). The metadata store is polled at defined intervals for recent data, the selection rules are executed and the UI is updated accordingly. As described in Section Analysis Modules, both the quality annotations done at the mobile client and those done at the server are used. For each annotation type, a chart with the continuous quality measure is shown, and an additional event view displays segments that do not meet predefined quality standards. This level of detail is only shown for the currently selected video stream. For other concurrent streams, the overall quality metrics are additionally retrieved from the metadata store and visualised in a compact form. When switching to another stream, the views are switched accordingly. To provide audiovisual content to the HTML5 viewer, the incoming media stream is re-streamed by the analysis platform. This can be done as an RTP stream with very low latency (requiring a browser plugin) or by providing a stream for consumption by an HTML video player, with possibly higher latency.

[Figure 2. Web-based content and metadata visualisation. For each quality metric, a line chart with the continuous evolution of the measurement is shown. An additional event view on top of each quality metric highlights segments that do not meet predefined quality standards (indicated by a red bar).]

CONCLUSION
In this paper, we have presented a framework for automating content selection in order to complement professional coverage of live events such as concerts, festivals or sports events with user generated content. We have described a system for capturing live audio and video streams on a mobile device, performing automatic metadata extraction in real-time and indexing the metadata for access by a production system. The system creates additional metadata from the audiovisual content, and all available metadata are then used for automatic filtering and ranking of streams, using a rule-based approach.

ACKNOWLEDGMENTS
The authors would like to thank Jürgen Schmidt and Mario Sieck from Technicolor. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 610370, ICoSOLE ("Immersive Coverage of Spatially Outspread Live Events", http://www.icosole.eu).

REFERENCES
1. Gulrukh Ahanger and Thomas D. C. Little. 1998. Automatic Composition Techniques for Video Production. IEEE Trans. Knowl. Data Eng. 10, 6 (1998), 967–987.
2. Werner Bailer, Gert Kienast, Georg Thallinger, Philippe Bekaert, Juergen Schmidt, David Marston, Richard Day, and Chris Pike. 2015a. Format Agnostic Scene Representation v2. Technical Report D3.1.2. ICoSOLE project.
3. Werner Bailer, Marcus Thaler, and Georg Thallinger. 2015b. Spatiotemporal Video Synchronisation by Visual Matching. In Proceedings of the 3rd International Workshop on Interactive Content Consumption, co-located with the ACM International Conference on Interactive Experiences for Television and Online Video (ACM TVX 2015).
4. Xian-Sheng Hua, Lie Lu, and Hong-Jiang Zhang. 2004. Automatic Music Video Generation Based on Temporal Pattern Analysis. In Proceedings of the 12th Annual ACM International Conference on Multimedia (MULTIMEDIA '04). 472–475.
5. Rene Kaiser and Wolfgang Weiss. 2013. Virtual Director. John Wiley & Sons, Ltd, 209–259.
6. Tao Mei, Xian-Sheng Hua, Cai-Zhi Zhu, He-Qin Zhou, and Shipeng Li. 2007. Home Video Visual Quality Assessment With Spatiotemporal Factors. IEEE Trans. Cir. and Sys. for Video Technol. 17, 6 (June 2007), 699–706.
7. The Institute of Electrical and Electronics Engineers. 2008. IEEE Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems, version 2. (2008).
8. Prarthana Shrestha, Peter H.N. de With, Hans Weda, Mauro Barbieri, and Emile H.L. Aarts. 2010. Automatic Mashup Generation from Multiple-camera Concert Recordings. In Proceedings of the 18th ACM International Conference on Multimedia (MM '10). 541–550.
9. Stefanie Wechtitsch, Hannes Fassold, Marcus Thaler, Krzysztof Kozłowski, and Werner Bailer. 2016. Quality Analysis on Mobile Devices for Real-Time Feedback. In MultiMedia Modeling. Springer, 359–369.
10. Si Wu, Yu-Fei Ma, and Hong-Jiang Zhang. 2005. Video Quality Classification Based Home Video Segmentation. In Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on.
11. Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, and Wei-Ying Ma. 2005. Improving Web Search Results Using Affinity Graph. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '05). 504–511.