Automatic Selection of Live User Generated Content

Stefanie Wechtitsch, Marcus Thaler, Albert Hofmann, Andras Horti, Werner Bailer
JOANNEUM RESEARCH, DIGITAL
Steyrergasse 17, 8010 Graz, Austria
{firstname.lastname}@joanneum.at

Wolfram Hofmeister, Jameson Steiner, Reinhard Grandl
Bitmovin
Lakeside B01, 9020 Klagenfurt, Austria
{firstname.lastname}@bitmovin.com

4th International Workshop on Interactive Content Consumption at TVX'16, June 22, 2016, Chicago, IL, USA. Copyright is held by the author/owner(s).

ABSTRACT
User generated content (UGC) is a valuable source for improving the coverage of events such as concerts, festivals or sports events. Integrating UGC in existing professional production workflows is particularly challenging in live productions. UGC needs to be checked for quality in this case, and metadata captured by the mobile device and extracted from the content are relevant for filtering the UGC streams that go into a live production system. We propose a system for capturing live audio and video streams on a mobile device, performing automatic metadata extraction in real-time and indexing the metadata for access by a production system. The system receives an audio, video and metadata stream from the mobile device, and creates additional metadata from the ingested audiovisual content. The metadata (e.g., location, quality) are then used to automatically select and rank streams, either selecting a stream to show to a viewer or a list of streams from which a human operator can select.

ACM Classification Keywords
H.5.1 Information Interfaces and Presentation: Multimedia Information Systems; I.4.1 Image Processing and Computer Vision: Digitization and Image Capture

Author Keywords
user generated content, content selection, sensor, mobile, content analysis, live

INTRODUCTION
User generated content (UGC) is a valuable source for improving the coverage of events such as concerts, festivals or sports events. In order to integrate user generated content into existing production workflows, the quality of UGC needs to be checked and metadata needs to be extracted. Such metadata, together with sensor information from the mobile device, will help the production team to assess the context, quality and relevance of the user contribution.

A particularly challenging scenario are live productions, where such metadata needs to be available with small latency. Live streaming of UGC from mobile devices has recently gained popularity, among others through the use of apps like Meerkat (https://meerkatapp.co) or Periscope (https://www.periscope.tv). However, these apps provide a stream "as is" for viewing on the web, without integration in production workflows. The end users manually need to select a particular stream and have to discover themselves whether there are alternative streams of the event available, in case the one they are watching becomes boring or turns out to be of insufficient quality (both are unfortunately not so uncommon on today's live streaming platforms). Thus, a system that integrates professional and user generated content of an event needs to provide support for content selection. Content selection can be supported by metadata either captured on the mobile device (e.g., capture location) or extracted from the content (e.g., content quality).

We propose a system for capturing live audio and video streams on a mobile device, performing automatic metadata extraction in real-time and indexing the metadata for access by a production system. The system receives an audio, video and metadata stream from the mobile device, and creates additional metadata from the audiovisual content. All metadata are available as a stream (with low latency from the extraction), and are indexed in a metadata store. Metadata needed in the real-time process can be read directly from the stream, and earlier metadata can be queried from the store. The metadata are used to automatically filter content that matches defined quality levels, to select the best stream among alternative ones and to provide a set of content options.

The rest of this paper is organised as follows. The Section Capture and Analysis System describes the capture tools and the analysis framework and modules. The approach to content selection and the results are discussed in Section Content Selection, followed by a Conclusion.
CAPTURE AND ANALYSIS SYSTEM

System Overview
Figure 1 shows an overview of the proposed system. The system consists of a dedicated capture app, which sends video, audio and metadata as separate streams. This saves the muxing/demuxing effort and also facilitates distributed processing of different modalities on different machines in the cloud. All data are provided as RTP streams. The processing system (dashed box in the diagram) performs the necessary decoding and transformation for the content, and also includes a set of interconnected analysis modules. These modules may not only use the content as input, but may also use metadata from the device or from other modules. All extracted metadata are provided as streams again, and a logging module listens to these streams and indexes the data in the metadata store. The audiovisual streams can be connected to viewers or to an editing system. A web application performs content selection and displays the audiovisual data together with the extracted metadata.

[Figure 1. Overview of the proposed system: the client app (capture, quality analysis, sensor metadata) streams video, audio and metadata over 3G/4G to the ingest and decoding components, which feed the analysis modules; a metadata logger indexes the extracted metadata in the metadata store, which is accessed by the editing and web applications.]

We decided to build on an existing framework with many standard components which is able to handle the decoding of the commonly used media formats. Thus, the GStreamer (http://gstreamer.freedesktop.org) open source multimedia framework is used for this purpose.

Content Capture
The integrated capture application for Android enables users to perform quality analysis while capturing sensor data and streaming captured video. The main features are: (a) audio and video recording, via the built-in microphone and camera respectively, (b) metadata capturing from different sensors available on the device, (c) on-device analysis of captured essence to meet quality constraints, (d) en-/transcoding and packaging of recorded content and (e) up-streaming functionality to servers for processing.

Raw video and audio data is captured through the camera and microphone of the device and encoded using Android's MediaCodec API, while at the same time the quality of video frames is analysed. As encoding of video frames is typically more time consuming than encoding of audio frames, a buffer synchronizes both streams. Once the encoding of a frame has finished, it is committed into the buffer and/or sent to an RTP packager. In parallel, a live on-device preview, containing visual quality-related notifications as discussed in the following sections, is presented. Synchronization is done by keeping track of the latest PTS for each stream.

During initialisation of the capture application, various types of static metadata (such as properties and technical parameters of the mobile device) are sent to the processing system. Moreover, together with the content, metadata from the on-device sensors are captured to support real-time quality analysis, by recording the following sensors: location, accelerometer, gyroscope, magnetic field, orientation, rotation, ambient light, proximity and pressure. For example, the accelerometer can be used to detect fast and shaky movements of the mobile device.

For the analysis of video frames, several lightweight algorithms, which identify defects in the captured content, were implemented. Thus, contributing users who have this app installed are capable of performing visual quality analysis on the mobile device while capturing video and obtaining direct feedback about the quality of the captured content. The application continuously measures sharpness, noise, luminance and exposure, and detects the use of brightness compensation before streaming the captured video [9]. This way, users are notified during capture if one of the quality measurements falls outside the target range. For each quality measure, an overlay including a descriptive icon and message is displayed to immediately notify the user, so that the quality impairment can be avoided.

Algorithms for sharpness, noise and over-/underexposure detection have been implemented in the app; details on these quality algorithms can be found in [9]. For sharpness estimation we use the Laplace operator for edge detection. By subsampling the response image into equally sized blocks, the sharpness value for each block is represented by the maximum slope response of the corresponding edges. The blocks with the highest values (strongest slopes) are selected to obtain the global sharpness value. For noise estimation the luminance component of each analysed image is calculated, and the block scores from the sharpness estimation are reused to find the most homogeneous blocks (those with few edges). For the remaining blocks, the average absolute differences between the original and the median-filtered image are computed, representing the block's noise score. The global noise level is then estimated by taking the median of these block values. To detect the use of brightness compensation, the average brightness progression of images within a certain time frame is approximated. If the summed positive or negative brightness variation values exceed a predefined threshold, the algorithm reports overexposure or underexposure, respectively. Using a Samsung Galaxy S5, the runtime of all proposed quality analysis algorithms for one frame of the captured HD image sequence is about 200 ms. Due to gradual temporal changes of image quality problems (e.g., noise), it is sufficient to process every sixth frame, enabling real-time operation.
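To make the block-based measures described above concrete, the following sketch derives a global sharpness and a noise score from a single luminance frame. It is a minimal approximation in Python using NumPy and SciPy, not the app's actual Android implementation; the function name, block size and block fraction are assumed values for illustration.

```python
import numpy as np
from scipy.ndimage import laplace, median_filter

def frame_quality(luma, block=32, top_frac=0.25):
    """Block-based sharpness and noise scores for one luminance frame (sketch).

    Sharpness is taken from the strongest Laplacian responses; noise from the
    difference to a median-filtered copy in the most homogeneous blocks.
    Block size and fractions are assumptions, not the paper's exact values.
    """
    h = (luma.shape[0] // block) * block
    w = (luma.shape[1] // block) * block
    frame = luma[:h, :w].astype(np.float32)

    # Maximum edge response per block approximates the block sharpness.
    resp = np.abs(laplace(frame))
    block_sharp = resp.reshape(h // block, block, w // block, block).max(axis=(1, 3)).ravel()

    # Global sharpness: average over the blocks with the strongest edges.
    n_top = max(1, int(top_frac * block_sharp.size))
    sharpness = float(np.sort(block_sharp)[-n_top:].mean())

    # Noise: mean absolute difference to the median-filtered frame,
    # evaluated only in the most homogeneous blocks (weakest edges).
    diff = np.abs(frame - median_filter(frame, size=3))
    block_noise = diff.reshape(h // block, block, w // block, block).mean(axis=(1, 3)).ravel()
    homogeneous = np.argsort(block_sharp)[:n_top]
    noise = float(np.median(block_noise[homogeneous]))

    return sharpness, noise
```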
Content and Metadata Streaming
In order to perform the RTP streaming, the encoded audio and video frames are pushed into a buffer and wrapped into the ISOBMFF file format. During the entire recording session of a user, each video segment is uploaded to the processing system. After the capture is finished, the full video is accessible via built-in Android functions. In order to stream packets over RTP, a packetizer is used which generates the RTP headers and splits the data into several packets if necessary. Every encoded audio/video frame is pushed into the respective packetizer. To ensure synchronization, a buffer-based approach similar to the one described in Section Content Capture is applied.
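The sketch below illustrates the general packetizer idea for one encoded frame: the payload is split to fit the network MTU and a basic RTP header is prepended, with the marker bit set on the last packet of the frame. This is a simplified Python illustration of standard RTP packetization, not the capture app's actual code; the payload type, SSRC and MTU values are assumptions.

```python
import struct

MTU = 1400            # assumed maximum payload size per packet (bytes)
PAYLOAD_TYPE = 96     # assumed dynamic RTP payload type for the video stream
SSRC = 0x1234ABCD     # assumed synchronisation source identifier

def packetize_frame(frame: bytes, seq: int, timestamp: int):
    """Split one encoded frame into RTP packets (simplified sketch).

    Returns the packets and the next sequence number; the marker bit is set
    on the last packet to signal the end of the frame.
    """
    packets = []
    chunks = [frame[i:i + MTU] for i in range(0, len(frame), MTU)] or [b""]
    for i, chunk in enumerate(chunks):
        marker = 0x80 if i == len(chunks) - 1 else 0x00
        header = struct.pack(
            "!BBHII",
            0x80,                    # version 2, no padding/extension/CSRC
            marker | PAYLOAD_TYPE,   # marker bit + payload type
            seq & 0xFFFF,            # sequence number
            timestamp & 0xFFFFFFFF,  # media timestamp of the frame
            SSRC,                    # stream identifier
        )
        packets.append(header + chunk)
        seq += 1
    return packets, seq
```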
The captured metadata, such as device and sensor metadata, are accumulated locally, with records indexed by time and type of metadata. Incremental metadata are made available as segments of the metadata stream. The segments, available as strings in JSON format (following the format defined in [2]), are sent periodically in chunks as a UDP stream from the mobile device to the processing system. For the analysis chain, a dedicated module was developed to receive the UDP packets, reassemble the string chunks into the complete JSON message and forward it to the metadata store handler.

The analysis chain receives the media stream over the RTP protocol as a UDP stream. The GStreamer standard components take care of the reception, management and decoding of the audio and video streams, and finally convert the video frames to RGB8 images.
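As a rough illustration of such a reception and decoding chain built from standard GStreamer elements, the sketch below uses the GStreamer Python bindings to receive an RTP/UDP video stream, decode it and hand RGB frames to an application sink where analysis components could hook in. The pipeline shown (H.264 payload, port number, caps) is an assumed example, not the exact configuration used in the system.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# Assumed receive/decode chain: RTP over UDP -> depayload -> decode -> RGB frames.
pipeline = Gst.parse_launch(
    "udpsrc port=5004 caps=\"application/x-rtp,media=video,"
    "clock-rate=90000,encoding-name=H264,payload=96\" "
    "! rtph264depay ! avdec_h264 ! videoconvert "
    "! video/x-raw,format=RGB ! appsink name=analysis emit-signals=true"
)

def on_frame(sink):
    """Called for every decoded RGB frame; analysis modules would process it here."""
    sample = sink.emit("pull-sample")
    buf = sample.get_buffer()
    ok, info = buf.map(Gst.MapFlags.READ)
    if ok:
        # info.data holds the raw RGB8 frame bytes for the analysis chain.
        buf.unmap(info)
    return Gst.FlowReturn.OK

appsink = pipeline.get_by_name("analysis")
appsink.connect("new-sample", on_frame)
pipeline.set_state(Gst.State.PLAYING)
# A full application would additionally run a GLib main loop and watch the bus.
```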
Analysis Framework
The analysis components are implemented as GStreamer plugins, and a flexible and powerful analysis chain is created by combining the standard GStreamer modules with our components. The GStreamer framework's messaging concept assures the optimal configuration for each plugin. For our purposes we applied a simplified configuration scheme to create simple plugins with an arbitrary number of input and output pads having different formats, though in some cases manual hints are necessary. Since the standard RTP stream contains only relative timestamps, the synchronization of audio and video content from different devices is realized on the basis of the timestamps of the RTCP stream. A custom plugin handles the extraction of these timestamps and the calculation of the difference between the internal clock and the absolute timestamp (e.g., synchronized with a PTP [7] clock).

Analysis Modules
The visual quality of a user generated video is a good indicator for an early decision whether the video might be useful, e.g. in a production, or whether it can conversely be sorted out due to insufficient quality. In particular, quality is an important decision criterion when a huge amount of data is available and should be reduced automatically. In order to obtain an overall quality measure of a user generated video, all available individual quality indicators are considered. The metadata received directly from the mobile device as well as the more complex quality measures obtained after transmission to the server are fused as described in the following.

As mentioned earlier, the mobile device provides quality estimates of how blurry the content is, how much noise it contains, whether there are parts suffering from over- or underexposure and whether the video was recorded under shaky conditions. On the server side, we may get additional measures by using more complex algorithms for the blurriness and the contained noise. Furthermore, an estimate for macro-blocking artefacts is determined. At first we compute one representative value for each measure and then combine these values by fusion. All of these measures are optional and may or may not be available. The sampling steps for each measure are individual but constant over the whole duration of the video. Noise and blurriness can be measured on both the mobile device and the server. Since a more complex algorithm can be used on the server, the results may differ a bit. Depending on the use case and the computational complexity that can be afforded, a subset of measures is computed and used for the overall quality measure. Under real-time requirements, we rather use the blur and noise estimation from the mobile device. Independent of the source of the blur and noise measures, the computation of the overall quality measure stays the same.

All sampled values are collected individually for each measure and are sorted in increasing order. An appropriate subset of the sorted list is chosen to compute the average value for each measure. It was empirically established that a well correlated measure emerges if the subset is chosen from the higher values (high values indicate lower quality in this case), causing bad quality frames to have a higher influence on the result than good quality frames (i.e., bad quality frames are overweighted compared to good quality frames). Thus, a video where only parts appear very blurry or have a high level of noise will be rated as being of poor quality. The quality is represented by a floating point number in the range of 0 to 1, where 0 indicates excellent quality and 1 corresponds to very poor quality. This representation is used for each individual measure as well as for the overall quality value.

We have chosen to use the upper 25% of quality scores (i.e., representing the 25% of segments with worst quality) to compute an average value for each of the involved quality measures. Finally, those values have to be combined. Simply averaging the individual measures is not a good strategy, since having one or two bad quality measures out of our set of five metrics would result in an inappropriate overall quality measure, distorted by the good quality measures. The measure which causes the highest impact on the content quality should have the highest impact on the final quality measure. Thus, we apply a weighted sum in which the highest values are disproportionately weighted higher.
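To make the fusion step concrete, the following sketch computes a per-measure score from the worst 25% of its sampled values and then combines the available measures with a weighted sum whose weights grow with the score, so that the worst measure dominates the result. The exact weighting function is not specified in the paper, so the power weighting and its exponent are assumptions for illustration.

```python
def measure_score(samples, worst_fraction=0.25):
    """Average over the worst (highest) fraction of sampled values of one measure.

    Values are in [0, 1], where 0 is excellent and 1 is very poor quality.
    """
    ordered = sorted(samples)
    n_worst = max(1, int(worst_fraction * len(ordered)))
    worst = ordered[-n_worst:]
    return sum(worst) / len(worst)

def overall_quality(measure_samples, exponent=3.0):
    """Fuse the available per-measure scores into one overall quality value.

    measure_samples maps a measure name (e.g. 'blur', 'noise', 'exposure',
    'shakiness', 'blocking') to its sampled values; missing measures are
    simply absent. Weights grow with the score (assumed power weighting),
    so the worst measure has the strongest influence on the result.
    """
    scores = {m: measure_score(v) for m, v in measure_samples.items() if v}
    if not scores:
        return None
    weights = {m: (s + 1e-6) ** exponent for m, s in scores.items()}
    total = sum(weights.values())
    return sum(weights[m] * scores[m] for m in scores) / total
```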
Metadata Store
The data exchange between the analysis platform and the production system is realised via a metadata store. This metadata store is a persistent hybrid repository accessible over a REST interface. Short-term data are kept in a Redis (http://redis.io) in-memory data structure store, whereas long-term data are archived in a MySQL (http://www.mysql.com) database. The repository type is transparent for the client; the difference is only noticeable in the query response time.

The extracted metadata are used for automatic content filtering of the UGC streams, e.g., discarding streams based on overall quality metadata or their location. By querying the metadata store with the appropriate criteria, the relevant streams can be selected for live editing.

CONTENT SELECTION
When multiple concurrent live streams are available for an event, automatic and real-time selection of the best quality content is advocated. The selection strategies implemented so far are rule-based. They use the metadata available in the metadata store as input, i.e., the metadata captured on the device and the results from quality analysis on the mobile device and the server. The metadata do not only contain raw sensor and analysis data, but also annotations of segments where pre-defined minimum quality limits have been violated.

Prior work on automatic video production in [1] and [4] aims at automating the selection of captured content, but these approaches have been developed for professional content and therefore do not exploit video quality as a cue for selection. For supporting or automating home video editing, some specific approaches for quality-based video production and selection have been studied, e.g., in [10] and [6]. Although they address some quality detection requirements specific to user generated content, these approaches are intended to be applied in an off-line fashion on pre-recorded video. An approach for creating mashups of multiple-camera concert recordings using video quality cues has been proposed in [8], which comes closest to our requirements. Signal quality measures extracted from the individual recordings are used for selecting the best quality segments. However, the approach is applied in a file-based off-line scenario; an on-line real-time scenario has not been investigated.

Approach
In our approach, content is discarded when quality metrics violate thresholds for minimum quality, and the same thresholds are applied for all streams. In addition, the average quality measure determined as described in Section Analysis Modules is compared against a threshold. Temporal filtering of selection decisions is applied in order to avoid switching streams on and off when quality values fluctuate around the thresholds. The choice of the size of the temporal filter is a trade-off between more frequent switching between streams and more robust decisions that come at the cost of higher latency of the analysis result. If the system is used in a semi-automatic mode, an operator may override automatic filtering decisions based on quality if the clip is the only one showing content that should be included.

After filtering, ranking of the remaining streams is applied. For content-based ranking, we use a strategy that is similar to approaches that boost diversity in search results: (i) we prefer streams showing a different area of the event over more of the same, and (ii) from a group of similar streams we select the one with the best quality. We use location information, where available from the metadata of the stream, and/or additionally determine the visual overlap between streams as described in [3]. The spatial distance and the visual similarity are used to determine a pairwise measure of diversity between two streams, in analogy to the affinity graph described in [11]. However, as we do not start from a specific query, we always rank the entire set of streams available at the current time segment. In the current implementation, we only update the location metadata when streams end or are added. The ranked list of streams can be provided as input to a user interface, or an automatic method can be used to select from the top entries in the list, such as the virtual director approach proposed in [5].
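The rule-based filtering and diversity-aware ranking could look roughly like the sketch below: streams are dropped when their temporally filtered quality violates the threshold, and the remaining streams are ranked greedily so that each newly selected stream is both of good quality and diverse (distant or dissimilar) with respect to the already selected ones. The threshold, window size, trade-off weight and the Stream/rank helpers are assumptions for illustration, not the system's exact rules.

```python
from collections import deque
from dataclasses import dataclass, field

QUALITY_THRESHOLD = 0.6   # assumed maximum acceptable quality score (0 = best, 1 = worst)
FILTER_WINDOW = 10        # assumed temporal filter length (number of recent decisions)

@dataclass
class Stream:
    stream_id: str
    quality: float                      # fused overall quality score in [0, 1]
    recent_ok: deque = field(default_factory=lambda: deque(maxlen=FILTER_WINDOW))

    def accept(self):
        """Threshold check with temporal filtering to avoid on/off flicker."""
        self.recent_ok.append(self.quality <= QUALITY_THRESHOLD)
        return sum(self.recent_ok) > len(self.recent_ok) / 2

def rank(streams, diversity, trade_off=0.5):
    """Greedy quality/diversity ranking of the accepted streams.

    diversity(a, b) is a pairwise dissimilarity in [0, 1], assumed to be
    derived from spatial distance and visual overlap between the streams.
    """
    candidates = [s for s in streams if s.accept()]
    ranked = []
    while candidates:
        def gain(s):
            div = min((diversity(s, r) for r in ranked), default=1.0)
            return trade_off * (1.0 - s.quality) + (1.0 - trade_off) * div
        best = max(candidates, key=gain)
        ranked.append(best)
        candidates.remove(best)
    return ranked
```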
Results
The content selection application is implemented as a web application, which implements the selection rules and also includes an HTML5 metadata viewer (see Figure 2). The metadata store is polled at defined intervals for recent data, the selection rules are executed and the UI is updated accordingly. As described in Section Analysis Modules, both the quality annotations done at the mobile client and those done at the server are used. For each annotation type, a chart with the continuous quality measure is shown, and an additional event view displays segments that do not meet predefined quality standards. This level of detail is only shown for the currently selected video stream. For other concurrent streams, the overall quality metrics are additionally retrieved from the metadata store and visualised in a compact form. When switching to another stream, the views are switched accordingly. To provide audiovisual content to the HTML5 viewer, the incoming media stream is re-streamed by the analysis platform. This can be done as an RTP stream with very low latency (requiring a browser plugin) or by providing a stream for consumption by an HTML video player, with possibly higher latency.

[Figure 2. Web-based content and metadata visualisation. For each quality metric, a line chart with the continuous evolution of the measurement is shown. An additional event view on top of each quality metric highlights segments that do not meet predefined quality standards (indicated by a red bar).]

CONCLUSION
In this paper, we have presented a framework for automating content selection in order to complement professional coverage of live events such as concerts, festivals or sports events with user generated content. We have described a system for capturing live audio and video streams on a mobile device, performing automatic metadata extraction in real-time and indexing the metadata for access by a production system. The system creates additional metadata from the audiovisual content, and all available metadata are then used for automatic filtering and ranking of streams, using a rule-based approach.

ACKNOWLEDGMENTS
The authors would like to thank Jürgen Schmidt and Mario Sieck from Technicolor. The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 610370, ICoSOLE ("Immersive Coverage of Spatially Outspread Live Events", http://www.icosole.eu).

REFERENCES
1. Gulrukh Ahanger and Thomas D. C. Little. 1998. Automatic Composition Techniques for Video Production. IEEE Trans. Knowl. Data Eng. 10, 6 (1998), 967–987.
2. Werner Bailer, Gert Kienast, Georg Thallinger, Philippe Bekaert, Juergen Schmidt, David Marston, Richard Day, and Chris Pike. 2015a. Format Agnostic Scene Representation v2. Technical Report D3.1.2. ICoSOLE project.
3. Werner Bailer, Marcus Thaler, and Georg Thallinger. 2015b. Spatiotemporal Video Synchronisation by Visual Matching. In Proceedings of the 3rd International Workshop on Interactive Content Consumption, co-located with the ACM International Conference on Interactive Experiences for Television and Online Video (ACM TVX 2015).
4. Xian-Sheng Hua, Lie Lu, and Hong-Jiang Zhang. 2004. Automatic Music Video Generation Based on Temporal Pattern Analysis. In Proceedings of the 12th Annual ACM International Conference on Multimedia (MULTIMEDIA '04). 472–475.
5. Rene Kaiser and Wolfgang Weiss. 2013. Virtual Director. John Wiley & Sons, Ltd, 209–259.
6. Tao Mei, Xian-Sheng Hua, Cai-Zhi Zhu, He-Qin Zhou, and Shipeng Li. 2007. Home Video Visual Quality Assessment With Spatiotemporal Factors. IEEE Trans. Cir. and Sys. for Video Technol. 17, 6 (June 2007), 699–706.
7. The Institute of Electrical and Electronics Engineers. 2008. IEEE Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems, version 2. (2008).
8. Prarthana Shrestha, Peter H.N. de With, Hans Weda, Mauro Barbieri, and Emile H.L. Aarts. 2010. Automatic Mashup Generation from Multiple-camera Concert Recordings. In Proceedings of the 18th ACM International Conference on Multimedia (MM '10). 541–550.
9. Stefanie Wechtitsch, Hannes Fassold, Marcus Thaler, Krzysztof Kozłowski, and Werner Bailer. 2016. Quality Analysis on Mobile Devices for Real-Time Feedback. In MultiMedia Modeling. Springer, 359–369.
10. Si Wu, Yu-Fei Ma, and Hong-Jiang Zhang. 2005. Video Quality Classification Based Home Video Segmentation. In Multimedia and Expo, 2005. ICME 2005. IEEE International Conference on.
11. Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, and Wei-Ying Ma. 2005. Improving Web Search Results Using Affinity Graph. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '05). 504–511.