EUMSSI: Multilayered analysis of multimedia content using UIMA, MongoDB and Solr

Jens Grivolla and Maite Melero and Toni Badia ^1

^1 Universitat Pompeu Fabra, Spain, email: .@upf.edu

Abstract. We present a scalable platform that allows for distributed processing of large quantities of multimedia content. The EUMSSI platform provides support for both synchronous and asynchronous analysis processes and thus allows for on-demand services as well as long-running batch processes. Analysis services for speech, video and text are integrated in the platform, as well as transversal services that combine and enrich the existing outputs from various modalities. It builds on established open source projects such as UIMA, MongoDB and Solr, and the project outcomes are published under permissive open source licenses.

1 Introduction

For reasoning with and about the multimedia data, the EUMSSI platform needs to recognize entities, such as actors, places, topics, dates and genres. A core idea is that metadata resulting from analyzing one medium helps reinforce the aggregation of information from other media. For example, an important issue in speech recognition is the transcription of previously unknown (out-of-vocabulary) words. This is particularly important when dealing with current news content, where person and organization names, and other named entities that may not appear in older training corpora, are among the most critical parts of the transcription. Existing text, tags and other metadata, as well as information automatically extracted from these sources, are used to improve and adapt the language models. Further, OCR on video data, speech analysis and speaker recognition mutually reinforce one another.

The combined and integrated results of the audio, video and text analysis significantly enhance the existing metadata, which can be used for search, visualization and exploration. In addition, the extracted entities and other annotations are exploited for identifying specific video fragments in which a particular person speaks, a new topic begins, or an entity is mentioned. Figure 1 illustrates some of the different layers of analysis that may exist for a video content item.

Figure 1. Video Mining Analysis

The EUMSSI system currently includes a wide variety of analysis components (many of which leverage and improve upon existing open source systems), such as automatic speech transcription (ASR), person identification (combining voice and face recognition, OCR on subtitles for naming, and Named Entity Recognition and Linking), and many natural language processing approaches applied to speech transcripts as well as original written content or social media, e.g. NER (Stanford NLP), Entity Linking (DBpedia Spotlight), keyphrase extraction (KEA), quote extraction, topic segmentation, sentiment analysis, etc.

2 Architecture overview

The EUMSSI platform has been developed using UIMA, MongoDB, Solr and other Open Source technologies to manage complex workflows involving online (on-demand) and offline (batch) processing, with mutual dependencies between the different modalities. The three main challenges of the core platform are:

• Enabling the integration and combination of different annotation layers using UIMA and its CAS format
• Managing the processing workflow using MongoDB and UIMA
• Providing efficient and scalable access to the analyzed content for applications and demonstrators using Solr

The EUMSSI architecture was designed with a few core principles and requirements in mind:

• Simplicity: The platform should not be overly complex, in order to make it maintainable as well as to rapidly have a working system that all involved parties can build on
• Robustness: Failures, even hardware failures, should not have disastrous consequences
• Portability: It should be possible to easily migrate the platform to a different system
• Flexibility: It must be possible to quickly extend the platform, in particular by adding new analysis processes or content sources
• Scalability: The platform must be able to support large-scale content processing, as well as efficiently provide results to end users

As a result, the EUMSSI platform relies on open source technologies with a proven track record of reliability and scalability as its foundation.

The EUMSSI platform functions as a set of loosely coupled components that only interact through a common data back-end (MongoDB), which ensures that the system state is persisted and can be robustly recovered after failures of individual components or even the whole platform (including hardware failures).

All components run independently and can be seen as basically "stateless" in that they maintain only the information necessary for immediate execution. As such it is possible to restart individual components without affecting the overall system, making it relatively easy to ensure the overall reliability of the platform.

All new content coming into the system is first normalized to a common metadata schema (based on schema.org) and stored in a MongoDB database to make it available for further processing. Analysis results, as well as the original metadata, are stored in UIMA's CAS format^2 to allow integration of different aligned layers of analysis, as well as in a simplified format that is then indexed with Solr. The applications use the Solr indexes for efficient and scalable access to the analyzed content, as well as statistical metrics over the whole document collection or specific subsets that can be used for exploration and visualization.

Note that this architecture design mainly depicts the data analysis part of the EUMSSI system. The applications for end users are built upon the Solr indexes that are automatically synchronized with the analysis results.

Crawlers, preprocessors and the API layer are maintained as part of the core EUMSSI platform. The MongoDB database is installed separately and managed from within the platform components (with little or no specific configuration and setup), and the same goes for some external dependencies such as having a Tomcat server on which to run the API layer. Analysis components for video and audio are fully external and independent and communicate with the platform through the API layer. Text analysis and cross-modality components are implemented as UIMA components and run as pipelines integrated into the platform using custom input (CollectionReader) and output (CASConsumer) modules that read an existing CAS representation of the document from the MongoDB back-end, and write back a modified CAS with added annotations (and possibly layers/views) as well as extracted or "flattened" metadata that can be used by other components (e.g. a list of all detected entities in the document).

Crawlers make external data sources available to the platform. Some crawler components are run only once to import existing datasets, whereas others feed continuously into the platform. Preprocessing takes original metadata from the different sources and transforms it into a unified representation with a common metadata vocabulary.

The EUMSSI API abstracts away from the underlying storage (MongoDB and CAS data representation) to facilitate access for external components such as video and audio processing. It acts as a light-weight layer that translates between the internal data structure and REST-like operations tailored to the needs of the components.

Indexing takes care of making the metadata (from the original source as well as automatically extracted) available to demonstrators and applications by mirroring the data on a Solr server that is accessible to those applications. It is performed using mongo-connector^3, leveraging built-in replication features of MongoDB for low-latency real-time indexing of new (even partial) content, as well as content updates.

Components that are part of the core platform can be found on GitHub and are organized into directories corresponding to the type of component. More detailed information about those components may be found in their respective README.md files.

Figure 2. Architecture design

The process flow, pictured in Figure 2, can be summarized as follows:

1. new data arrives (or gets imported)
2. preprocessing stage
   (a) make content available through unique internal identifier
   (b) create initial CAS with aligned metadata / text content and content URI
   (c) mark initial processing queue states
3. processing / content analysis
   (a) distributed analysis systems query queue when they have processing capacity
   (b) retrieve CAS with existing data (or get relevant metadata from wrapper API)
   (c) retrieve raw content based on content URI
   (d) process
   (e) update CAS (possibly through wrapper API)
   (f) create simplified output for indexing
   (g) update queues
      i. mark item as processed by the given queue
      ii. mark availability of data to be used by other analysis processes
4. updating the Solr indexes whenever updated information is available for a content item

^2 Unstructured Information Management Architecture: http://uima.apache.org/
^3 https://github.com/mongodb-labs/mongo-connector

2.1 Design decisions and related content analysis platforms

Apart from integrating a wide variety of analysis components working on text, audio, video, social media, etc., at different levels of semantic abstraction, a key aspect of EUMSSI is the integration and combination of those different information layers. This is the main motivation for using UIMA as the main underlying framework, as described in section 3. This also has the advantage of providing a platform for building processing pipelines that has low overhead when running on a single machine (all information is passed in-memory), while still enabling distributed and scaled-out processing when necessary.

On the other hand, it quickly became apparent that not all kinds of analysis are a good fit for such a workflow, leading to the hybrid approach described in section 4. Having the workflow control in the same database as the data itself eliminates some of the potential failures of more complex queue management systems by ensuring consistency between the stored data and its analysis status. It also means that efforts in guaranteeing availability and performance can focus on optimizing and allocating resources for the MongoDB database (for which best practices are well established).

While there are commercial content management systems on the market, some of which allow for the integration of some automatic content analysis, none of them have the flexibility of the EUMSSI platform, and in particular none are aimed at facilitating cross-modality integration. Some recent research projects approach similar goals.
MultiSensor^4 combines analysis services through distributed RESTful services based on NIF as an interchange format, incurring higher communication overheads in exchange for greater independence of services (compared to the UIMA-based parts of EUMSSI). LinkedTV^5 has a similar approach to EUMSSI (also using MongoDB and Solr), integrating the outputs of different analysis processes in a common MPEG-7 representation in the consolidation step, however (it appears) with far less mutual integration of outputs from different modalities. MediaMixer^6 focuses on indexing Media Fragments^7 with metadata to improve retrieval in media production, and BRIDGET^8 provides means to link (bridge) from broadcast content to related items, partly based on automatic video analysis.

^4 http://multisensorproject.eu/
^5 http://linkedtv.eu
^6 http://mediamixer.eu
^7 https://www.w3.org/2008/WebVideo/Fragments/
^8 http://ict-bridget.eu

3 Aligned data representation

Much of the reasoning and cross-modal integration depends on an aligned view of the different annotation layers, e.g., in order to connect person names detected from OCR with corresponding speakers from the speaker recognition component, or faces detected by the face recognition.

The Apache UIMA^9 CAS (common analysis structure) representation is a good fit for the needs of the EUMSSI project as it has a number of interesting characteristics:

• Annotations are stored "stand-off", meaning that the original content is not modified in any way by adding annotations. Rather, the annotations are entirely separate and reference the original content by offsets
• Annotations can be defined freely by defining a "type system" that specifies the types of annotations (such as Person, Keyword, Face, etc.) and the corresponding attributes (e.g. dbpediaUrl, canonicalRepresentation, ...)
• Source content can be included in the CAS (particularly for text content) or referenced as external content via URIs (e.g. for multimedia content)
• While each CAS represents one "document" or "content item", it can have several Views that represent different aspects of that item, e.g. the video layer, audio layer, metadata layer, transcribed text layer, etc., with separate source content (SofA or "subject of annotation") and separate sets of annotations
• CASes can be passed efficiently in-memory between UIMA analysis engines
• CASes can be serialized in a standardised OASIS format^10 for storage and interchange

^9 http://uima.apache.org/
^10 http://docs.oasis-open.org/uima/v1.0/uima-v1.0.html

Annotations based directly on multimedia content (video and audio) naturally refer to that content via timestamps, whereas text analysis modules normally work with character offsets relative to the text content. It is therefore fundamental that any textual views created from multimedia content (e.g. via ASR or OCR) refer back to the timestamps in the original content. This is done by creating annotations, e.g. tokens or segments, that include the original timestamps as attributes in addition to the character offsets.

As an example, we may have a CAS with an audio view which contains the results of automatic speech recognition (ASR), providing the transcription as a series of tokens/words with a timestamp for each word as an additional feature. In this way it is possible to apply standard text analysis modules (that rely on character offsets) to the textual representation, while maintaining the possibility to later map the resulting annotations back onto the temporal scale.

So-called SofA-aware UIMA components are able to work on multiple views, whereas "normal" analysis engines only see one specific view that is presented to them. This means that e.g. standard text analysis engines don't need to be aware that they are being applied to an ASR view or an OCR view; they just see a regular text document. SofA-aware components, however, can explicitly work on annotations from different views and can therefore be used to integrate and combine the information coming from different sources or layers, and create new, integrated views with the output from that integration and reasoning process.

4 Synchronous and asynchronous workflow management

In EUMSSI we decided to use a dual approach to workflow management, allowing for synchronous (and even on-demand) analysis pipelines as well as the execution of large batch jobs which need to be run asynchronously, possibly scheduled according to the availability of computational resources.

We opted for UIMA as the basis for synchronous workflows, as well as the data representation used for integrating different analysis layers. On the other hand, a web-based API allows other analysis processes, such as audio and video analysis, to retrieve content and upload results independently, giving them complete freedom to schedule their work according to their specific needs.

4.1 Analysis pipelines using UIMA

UIMA provides a platform for the execution of analysis components (Analysis Engines or AEs), as well as for managing the flow between those components. CPE or uimaFIT^11 [2] can be used to design and execute pipelines made up of a sequence of AEs (and potentially some more complex flows), and UIMA-AS^12 (Asynchronous Scale-out) permits the distribution of the process among various machines or even a cluster (with the help of UIMA DUCC^13).

^11 https://uima.apache.org/uimafit.html
^12 http://uima.apache.org/doc-uimaas-what.html
^13 http://uima.apache.org/doc-uimaducc-whatitam.html

Within the EUMSSI project we have developed and integrated a number of UIMA analysis components, mostly dealing with text analysis and semantic enrichment. Whenever possible, components from the UIMA-based DKPro project [1] were used, especially for the core analysis components (tokenization, part-of-speech, parsing, etc.). In addition to a large number of ready-to-use components, DKPro Core provides a unified type system to ensure interoperability between components from different sources. Other components developed or integrated in EUMSSI were made compatible with this type system.
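The view and annotation mechanics of sections 3 and 4.1 can be illustrated in miniature. The following is an illustrative sketch only (the platform itself implements this with Java, UIMA type systems and real AEs; the dict-based "CAS", the engine names and the Keyword type are all invented for the example): a mock ASR engine creates a text view whose tokens carry both character offsets and media timestamps, a "normal" text engine then adds an annotation using character offsets only, and a final SofA-aware-style step maps that annotation back onto the timeline.

```python
# Illustrative sketch (not the EUMSSI API): a minimal "pipeline" in the
# spirit of UIMA analysis engines, operating on a CAS-like dict that is
# passed in-memory from one engine to the next.

def asr_engine(cas):
    """Create a text view from (mock) ASR output, one token per word.

    Each token records character offsets AND the original media
    timestamps, as described in section 3.
    """
    words = [("hello", 0.0, 0.4), ("world", 0.5, 0.9)]  # (word, t0, t1)
    text, tokens, pos = "", [], 0
    for w, t0, t1 in words:
        if text:
            text += " "
            pos += 1
        tokens.append({"begin": pos, "end": pos + len(w),
                       "start_time": t0, "end_time": t1})
        text += w
        pos += len(w)
    cas["views"]["asr"] = {"sofa": text, "tokens": tokens, "annotations": []}
    return cas

def keyword_engine(cas):
    """A 'normal' text engine: sees only the text, emits char-offset spans."""
    view = cas["views"]["asr"]
    target = "world"
    begin = view["sofa"].find(target)
    view["annotations"].append({"type": "Keyword", "begin": begin,
                                "end": begin + len(target)})
    return cas

def align_engine(cas):
    """A SofA-aware-style step: map char spans back to media timestamps."""
    view = cas["views"]["asr"]
    for ann in view["annotations"]:
        overlapping = [t for t in view["tokens"]
                       if t["begin"] < ann["end"] and ann["begin"] < t["end"]]
        ann["start_time"] = min(t["start_time"] for t in overlapping)
        ann["end_time"] = max(t["end_time"] for t in overlapping)
    return cas

def run_pipeline(cas, engines):
    for engine in engines:  # engines share the CAS in memory, as in UIMA
        cas = engine(cas)
    return cas

cas = run_pipeline({"views": {}}, [asr_engine, keyword_engine, align_engine])
```

After the run, the keyword found purely on character offsets (6-11) also carries the time interval 0.5-0.9 s recovered from the underlying ASR tokens, which is exactly the alignment that lets text-level results be projected back onto the video timeline.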
4.2 Managing content analysis with MongoDB

There are some components of the EUMSSI platform, however, that do not integrate easily in this fashion. This is the case of computationally expensive processes that are optimized for batch execution. A UIMA AE needs to expose a process() method that operates on a single CAS (= document), and is therefore not compatible with batch processing. This is particularly true for processes that need to be run on a cluster, with significant startup overhead, such as many video and audio analysis tasks.

It is therefore necessary to have an alternative flow mechanism for offline or batch processes, which needs to integrate with the processing performed within the UIMA environment.

The main architectural and integration issues revolve around the data flow, rather than the computation. In fact, the computationally complex and expensive aspects are specific to the individual analysis components, and should not have an important impact on the design of the overall platform.

As such, the design of the flow management is presented in terms of transformations between data states, rather than from the procedural point of view. The resulting system should only rely on the robustness of those data states to ensure the reliability and robustness of the overall system, protecting against potential problems from server failures or other causes. At any point, the system should be able to resume its function purely from the state of the persisted data.

Figure 3. Data flow and transformations
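To make this data-state approach concrete, here is a minimal illustrative sketch of keeping each item's queue state alongside the item itself and deriving the pending-work list from it. Plain Python dicts stand in for MongoDB documents, and the helper names and status values are invented; in the platform itself this selection is expressed as a MongoDB query over each item's "processing.queues" field.

```python
# Illustrative sketch: per-item processing state stored with the data
# itself, so the pending-work "queue" is just a query, not a separate
# queue service that could drift out of sync with the stored data.

def mark_processed(item, queue):
    """Record that `queue` has finished with this item."""
    item.setdefault("processing", {}).setdefault("queues", {})[queue] = "processed"
    return item

def pending_items(items, queue, dependencies=()):
    """Items not yet processed by `queue` whose prerequisites are done."""
    result = []
    for item in items:
        queues = item.get("processing", {}).get("queues", {})
        if queues.get(queue) == "processed":
            continue  # already handled by this queue
        if all(queues.get(dep) == "processed" for dep in dependencies):
            result.append(item)
    return result

items = [
    {"_id": "video-1", "processing": {"queues": {"asr": "processed"}}},
    {"_id": "video-2", "processing": {"queues": {}}},
]
# NER depends on the ASR transcript being available first:
todo = [i["_id"] for i in pending_items(items, "ner", dependencies=["asr"])]
```

Because workers only read and write this persisted state, a crashed worker (or a full restart of the platform) loses nothing: re-running the query reconstructs exactly the work that remains.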
To ensure reliability and performance of the data persistence, we use the well-established and widely used database system MongoDB, which provides great flexibility as well as proven scalability and robustness. Figure 3 shows the general flow of the EUMSSI system, focusing on the data states needed for the system to function.

In order to avoid synchronization issues, the state of the data processing is stored together with the data within each content item, and the list of pending tasks can be extracted at any point through simple database queries. We therefore only depend on the MongoDB database (which can be replicated across several machines or even a large cluster for performance and reliability) to fully establish the processing state of all items. For example, the queues for analysis processes can be constructed directly from the "processing.queues.queue_name" field of an item by selecting (for a given queue) all items that have not yet been processed by that queue and that fulfill all prerequisites (dependencies).

The analysis results are stored in CAS format (optionally with compression). In order to avoid potential conflicts or race conditions between components (most analysis processes run independently of one another), the different layers are stored in separate database fields as independent CASes. Components that work across layers then merge the separate CASes into a single one (as separate Views) in order to combine the information. The "meta.extracted" section of a document is used to store the simplified analysis results that are automatically synchronized with the Solr index; it can also be used as input to other annotators (such as detected Named Entities as input to speech recognition), to avoid the overhead of extracting that information from the CAS on demand.

In its simplest form, the processes responsible for the data transitions are fully independent and poll the database periodically to retrieve pending work. Those processes can then be implemented in any language that can communicate comfortably with MongoDB.

4.3 Multimodal multilayer data integration and enrichment

The integration of data from different analysis layers is usually done by loading the CAS representations generated by different prior processes and merging them as individual Views in a single CAS. Layers that work on different representations, e.g. speaker recognition, audio transcript and OCR, are aligned by using timestamps associated with the segments or tokens. As a result, new integrated views can be created, combining the different information layers. Metadata is also enriched by adding information to existing annotations or creating new ones, e.g. with information obtained from SPARQL DBpedia lookups.

4.4 Indexing for scalable data-driven applications

The final applications do not use the information stored in MongoDB directly, but rather access Solr indexes created from that information to respond specifically to the types of queries needed by the applications. Those indexes are updated whenever new analysis results are available for a given item, through the use of mongo-connector, which keeps the indexes always up-to-date with the content of the "meta.source" and "meta.extracted" sections.

5 Standards and interoperability

EUMSSI uses established protocols and builds on freely available and widely used open source software, in addition to publishing in-project developments under permissive open source licenses through popular platforms such as GitHub.

The API for external analysis components is REST-like and uses JSON for communication, whereas the end applications access the data through Solr's REST-like API (which supports various result formats). Metadata is represented using a vocabulary built upon schema.org, and the internal representations in UIMA use the DKPro type system as a core. Entity linking is performed against DBpedia, thus yielding Linked Open Data URIs for entities and allowing for the use of SPARQL and RDF to access additional information.

There is now a starting initiative to establish a "standard" type system for UIMA, with initial conversations pointing towards building upon the DKPro type system for this purpose. Various institutions have expressed their interest in endorsing such a type system, leading to a major step forward in improving interoperability between UIMA components from different sources.
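As a sketch of how an application might address the Solr index through its REST-like API, consider the following; the core name, endpoint and index fields (text, entities, source) are invented for the example, while the query parameters themselves (q, fq, facet, facet.field, wt) are standard Solr.

```python
# Illustrative sketch: building a Solr /select URL as an end application
# might issue it against the index of analysis results. Only the URL is
# constructed here; issuing it requires a running Solr server.
from urllib.parse import urlencode

def solr_select_url(base, core, text, entity=None, facet_field=None):
    params = [("q", f"text:{text}"), ("wt", "json")]
    if entity:
        # filter query: restrict results to items mentioning a linked entity
        params.append(("fq", f'entities:"{entity}"'))
    if facet_field:
        # facets give the collection-wide statistics used for exploration
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base}/{core}/select?" + urlencode(params)

url = solr_select_url("http://localhost:8983/solr", "eumssi", "election",
                      entity="http://dbpedia.org/resource/Angela_Merkel",
                      facet_field="source")
```

Because entity fields hold Linked Open Data URIs, the same DBpedia identifier can be used both to filter the index and to fetch additional information via SPARQL.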
6 Demonstrators and applications

Using the data stored in the system and made available through Solr indexes, as well as on-demand text analysis services, a variety of applications can be built that provide access to the content and information. Two very different applications were built within the EUMSSI project, serving as technology demonstrators as well as having real-world use.

The storytelling tool provides a web interface that allows a journalist to work on an article in a rich editor when writing a news article or preparing a report. The system then analyses the text the journalist is writing in order to provide relevant background information. In particular, the journalist can directly access the Wikipedia pages of entities that appear in the text, or find related content in the archives (including content from outside and social media sources). A variety of graphical widgets then allow the journalist to explore the content collection, finding relevant video snippets and quotes, or presenting relevant entities and the relations between them.

The tool is built on web technologies (HTML, Javascript, ...) and in particular AJAX-Solr^14, which manages the communication with the Solr backend that provides the data. Most of the functionality in the widgets is based on using Solr queries (automatically generated or manually specified) to select a relevant set of items (articles, videos, etc.) from the index.

The second-screen application, on the other hand, is aimed at viewers of TV or streaming content at home who would like to easily access background information, or have their viewing augmented with entertainment (or edutainment) activities such as automatically generated quizzes relating to the currently viewed content. The EUMSSI second screen application is implemented as a server-side application that connects a 'first screen' client, which shows the video in an HTML5 player, with one or more 'second screen' clients (which may be on the same machine, or on a separate laptop, tablet or smartphone). All possible information and questions that can be shown to the user are stored in JSON format in a WebVTT file (WebVTT is a W3C standard for displaying timed text, such as subtitles, in connection with HTML5 video^15).

All second screen clients that are logged in using the same identifier as the video show the same content at the same time. The video client sends the relevant content to the server at the moment specified in the VTT file. The server-side agent forwards the content to the second screen clients.

^14 https://github.com/evolvingweb/ajax-solr
^15 https://w3c.github.io/webvtt/

7 Conclusions

In the EUMSSI project we have developed a platform capable of handling large amounts of multimedia, with support for online and offline processing as well as the alignment and combination of different information layers. The system includes many interactions between different modalities, such as doing text analysis on speech recognition output, or adding Named Entities from surrounding text to the vocabulary known to the ASR system, among others.

The platform has proven capable of handling millions of content items on modest hardware, and is designed to allow for easily adding capacity through horizontal scaling.

The source code of the platform is publicly available at https://github.com/EUMSSI/. Additional documentation can be found in the corresponding wiki at https://github.com/EUMSSI/EUMSSI-platform/wiki.

ACKNOWLEDGEMENTS

The work presented in this article is being carried out within the FP7-ICT-2013-10 STREP project EUMSSI under grant agreement no. 611057, receiving funding from the European Union's Seventh Framework Programme managed by the REA-Research Executive Agency (http://ec.europa.eu/research/rea).

REFERENCES

[1] Richard Eckart de Castilho and Iryna Gurevych, 'A broad-coverage collection of portable NLP components for building shareable analysis pipelines', in Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT, pp. 1-11, Dublin, Ireland (August 2014). Association for Computational Linguistics and Dublin City University.

[2] Philip V. Ogren and Steven J. Bethard, 'Building test suites for UIMA components', in SETQA-NLP '09: Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp. 1-4 (June 2009).
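As a closing illustration of the second-screen mechanism from section 6 (timed JSON payloads carried in a WebVTT file), here is a small self-contained sketch. The cue payload and its fields are invented for the example; only the WebVTT cue-timing syntax ("start --> end") is the standard format.

```python
# Illustrative sketch: quiz/info payloads stored as JSON inside WebVTT
# cues, as used to drive second-screen clients in sync with the video.
import json

VTT = """WEBVTT

00:01:10.000 --> 00:01:20.000
{"type": "quiz", "question": "Who is on screen?", "answer": "Angela Merkel"}
"""

def parse_cues(vtt):
    """Return (start, end, payload) triples for JSON-bearing cues."""
    cues = []
    blocks = [b for b in vtt.strip().split("\n\n") if "-->" in b]
    for block in blocks:
        lines = block.splitlines()
        start, _, end = lines[0].partition(" --> ")
        payload = json.loads("\n".join(lines[1:]))  # cue text is JSON
        cues.append((start.strip(), end.strip(), payload))
    return cues

cues = parse_cues(VTT)
```

A server-side agent in this style would forward each payload to the logged-in second-screen clients when the first-screen player reaches the cue's start time.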