EUMSSI: Multilayered analysis of multimedia content using UIMA, MongoDB and Solr

Jens Grivolla and Maite Melero and Toni Badia ^1

^1 Universitat Pompeu Fabra, Spain, email: .@upf.edu

Abstract. We present a scalable platform that allows for distributed processing of large quantities of multimedia content. The EUMSSI platform provides support for both synchronous and asynchronous analysis processes and thus allows for on-demand services as well as long-running batch processes. Analysis services for speech, video and text are integrated in the platform, as well as transversal services that combine and enrich the existing outputs from various modalities. It builds on established open source projects such as UIMA, MongoDB and Solr, and the project outcomes are published under permissive open source licenses.

1 Introduction

For reasoning with and about the multimedia data, the EUMSSI platform needs to recognize entities, such as actors, places, topics, dates and genres. A core idea is that metadata resulting from analyzing one medium helps reinforce the aggregation of information from other media. For example, an important issue in speech recognition is the transcription of previously unknown (out-of-vocabulary) words. This is particularly important when dealing with current news content, where person and organization names, and other named entities that may not appear in older training corpora, are among the most critical parts of the transcription. Existing text, tags and other metadata, as well as information automatically extracted from these sources, are used to improve and adapt the language models. Further, OCR on video data, speech analysis and speaker recognition mutually reinforce one another.

The combined and integrated results of the audio, video and text analysis significantly enhance the existing metadata, which can be used for search, visualization and exploration. In addition, the extracted entities and other annotations are exploited for identifying specific video fragments in which a particular person speaks, a new topic begins, or an entity is mentioned. Figure 1 illustrates some of the different layers of analysis that may exist for a video content item.

Figure 1. Video Mining Analysis

The EUMSSI system currently includes a wide variety of analysis components (many of which leverage and improve upon existing open source systems), such as automatic speech transcription (ASR), person identification (combining voice and face recognition, OCR on subtitles for naming, and Named Entity Recognition and Linking), and many natural language processing approaches applied to speech transcripts as well as original written content or social media, e.g. NER (Stanford NLP), Entity Linking (DBpedia Spotlight), keyphrase extraction (KEA), quote extraction, topic segmentation, sentiment analysis, etc.

2 Architecture overview

The EUMSSI platform has been developed using UIMA, MongoDB, Solr and other Open Source technologies to manage complex workflows involving online (on-demand) and offline (batch) processing, with mutual dependencies between the different modalities. The three main challenges of the core platform are:

• Enabling the integration and combination of different annotation layers using UIMA and its CAS format
• Managing the processing workflow using MongoDB and UIMA
• Providing efficient and scalable access to the analyzed content for applications and demonstrators using Solr

The EUMSSI architecture was designed with a few core principles and requirements in mind:

• Simplicity: The platform should not be overly complex, in order to make it maintainable as well as to rapidly have a working system that all involved parties can build on
• Robustness: Failures, even hardware failures, should not have disastrous consequences
• Portability: It should be possible to easily migrate the platform to a different system
• Flexibility: It must be possible to quickly extend the platform, in particular by adding new analysis processes or content sources
• Scalability: The platform must be able to support large-scale content processing, as well as efficiently provide results to end users

As a result, the EUMSSI platform relies on open source technologies with a proven track record of reliability and scalability as its foundation.

The EUMSSI platform functions as a set of loosely coupled components that only interact through a common data back-end (MongoDB), which ensures that the system state is persisted and can be robustly recovered after failures of individual components or even the whole platform (including hardware failures).

All components run independently and can be seen as basically "stateless" in that they maintain only the information necessary for immediate execution. As such it is possible to restart individual components without affecting the overall system, making it relatively easy to ensure the overall reliability of the platform.

All new content coming into the system is first normalized to a common metadata schema (based on schema.org) and stored in a MongoDB database to make it available for further processing. Analysis results, as well as the original metadata, are stored in UIMA's CAS format^2 to allow integration of different aligned layers of analysis, as well as in a simplified format that is then indexed with Solr. The applications use the Solr indexes for efficient and scalable access to the analyzed content, as well as statistical metrics over the whole document collection or specific subsets that can be used for exploration and visualization.

Note that this architecture design mainly depicts the data analysis part of the EUMSSI system. The applications for end users are built upon the Solr indexes that are automatically synchronized with the analysis results.

Crawlers, preprocessors and the API layer are maintained as part of the core EUMSSI platform. The MongoDB database is installed separately and managed from within the platform components (with little or no specific configuration and setup), and the same goes for some external dependencies such as having a Tomcat server on which to run the API layer. Analysis components for video and audio are fully external and independent and communicate with the platform through the API layer. Text analysis and cross-modality components are implemented as UIMA components and run as pipelines integrated into the platform using custom input (CollectionReader) and output (CASConsumer) modules that read an existing CAS representation of the document from the MongoDB back-end, and write back a modified CAS with added annotations (and possibly layers/views) as well as extracted or "flattened" metadata that can be used by other components (e.g. a list of all detected entities in the document).

Crawlers make external data sources available to the platform. Some crawler components are run only once to import existing datasets, whereas others feed continuously into the platform. Preprocessing takes original metadata from the different sources and transforms it into a unified representation with a common metadata vocabulary.

The EUMSSI API abstracts away from the underlying storage (MongoDB and CAS data representation) to facilitate access for external components such as video and audio processing. It acts as a light-weight layer that translates between the internal data structure and REST-like operations tailored to the needs of the components.

Indexing takes care of making the metadata (from the original source as well as automatically extracted) available to demonstrators and applications by mirroring the data on a Solr server that is accessible to those applications. It is performed using mongo-connector^3, leveraging built-in replication features of MongoDB for low-latency real-time indexing of new (even partial) content, as well as content updates.

Components that are part of the core platform can be found on GitHub and are organized into directories corresponding to the type of component. More detailed information about those components may be found in their respective README.md files.

Figure 2. Architecture design

The process flow, pictured in Figure 2, can be summarized as follows:

1. new data arrives (or gets imported)
2. preprocessing stage
   (a) make content available through unique internal identifier
   (b) create initial CAS with aligned metadata / text content and content URI
   (c) mark initial processing queue states
3. processing / content analysis
   (a) distributed analysis systems query queue when they have processing capacity
   (b) retrieve CAS with existing data (or get relevant metadata from wrapper API)
   (c) retrieve raw content based on content URI
   (d) process
   (e) update CAS (possibly through wrapper API)
   (f) create simplified output for indexing
   (g) update queues
      i. mark item as processed by the given queue
      ii. mark availability of data to be used by other analysis processes
4. updating the Solr indexes whenever updated information is available for a content item

^2 Unstructured Information Management Architecture: http://uima.apache.org/
^3 https://github.com/mongodb-labs/mongo-connector

2.1 Design decisions and related content analysis platforms

Apart from integrating a wide variety of analysis components working on text, audio, video, social media, etc., at different levels of semantic abstraction, a key aspect of EUMSSI is the integration and combination of those different information layers. This is the main motivation for using UIMA as the main underlying framework, as described in section 3. This also has the advantage of providing a platform for building processing pipelines that has low overhead when running on a single machine (all information is passed in-memory), while still enabling distributed and scaled-out processing when necessary.

On the other hand, it quickly became apparent that not all kinds of analysis are a good fit for such a workflow, leading to the hybrid approach described in section 4. Having the workflow control in the same database as the data itself eliminates some of the potential failures of more complex queue management systems by ensuring consistency between the stored data and its analysis status. It also means that efforts in guaranteeing availability and performance can focus on optimizing and allocating resources for the MongoDB database (for which best practices are well established).

While there are commercial content management systems on the market, some of which allow for the integration of some automatic content analysis, none of them have the flexibility of the EUMSSI platform, and in particular none are aimed at facilitating cross-modality integration. Some recent research projects approach similar goals.
MultiSensor^4 combines analysis services through distributed RESTful services based on NIF as an interchange format, incurring higher communication overheads in exchange for greater independence of services (compared to the UIMA-based parts of EUMSSI). LinkedTV^5 has a similar approach to EUMSSI (also using MongoDB and Solr), integrating the outputs of different analysis processes in a common MPEG-7 representation in the consolidation step, however (it appears) with far less mutual integration of outputs from different modalities. MediaMixer^6 focuses on indexing Media Fragments^7 with metadata to improve retrieval in media production, and BRIDGET^8 provides means to link (bridge) from broadcast content to related items, partly based on automatic video analysis.

^4 http://multisensorproject.eu/
^5 http://linkedtv.eu
^6 http://mediamixer.eu
^7 https://www.w3.org/2008/WebVideo/Fragments/
^8 http://ict-bridget.eu

3 Aligned data representation

Much of the reasoning and cross-modal integration depends on an aligned view of the different annotation layers, e.g., in order to connect person names detected from OCR with corresponding speakers from the speaker recognition component, or faces detected by the face recognition.

The Apache UIMA^9 CAS (common analysis structure) representation is a good fit for the needs of the EUMSSI project as it has a number of interesting characteristics:

• Annotations are stored "stand-off", meaning that the original content is not modified in any way by adding annotations. Rather, the annotations are entirely separate and reference the original content by offsets
• Annotations can be defined freely by defining a "type system" that specifies the types of annotations (such as Person, Keyword, Face, etc.) and the corresponding attributes (e.g. dbpediaUrl, canonicalRepresentation, ...)
• Source content can be included in the CAS (particularly for text content) or referenced as external content via URIs (e.g. for multimedia content)
• While each CAS represents one "document" or "content item", it can have several Views that represent different aspects of that item, e.g. the video layer, audio layer, metadata layer, transcribed text layer, etc., with separate source content (SofA or "subject of annotation") and separate sets of annotations
• CASes can be passed efficiently in-memory between UIMA analysis engines
• CASes can be serialized in a standardised OASIS format^10 for storage and interchange

^9 http://uima.apache.org/
^10 http://docs.oasis-open.org/uima/v1.0/uima-v1.0.html

Annotations based directly on multimedia content (video and audio) naturally refer to that content via timestamps, whereas text analysis modules normally work with character offsets relative to the text content. It is therefore fundamental that any textual views created from multimedia content (e.g. via ASR or OCR) refer back to the timestamps in the original content. This is done by creating annotations, e.g. tokens or segments, that include the original timestamps as attributes in addition to the character offsets.

As an example, we may have a CAS with an audio view which contains the results of automatic speech recognition (ASR), providing the transcription as a series of tokens/words with a timestamp for each word as an additional feature. In this way it is possible to apply standard text analysis modules (that rely on character offsets) to the textual representation, while maintaining the possibility to later map the resulting annotations back onto the temporal scale.

So-called SofA-aware UIMA components are able to work on multiple views, whereas "normal" analysis engines only see one specific view that is presented to them. This means that e.g. standard text analysis engines don't need to be aware that they are being applied to an ASR view or an OCR view; they just see a regular text document. SofA-aware components, however, can explicitly work on annotations from different views and can therefore be used to integrate and combine the information coming from different sources or layers, and create new, integrated views with the output from that integration and reasoning process.

4 Synchronous and asynchronous workflow management

In EUMSSI we decided to use a dual approach to workflow management, allowing for synchronous (and even on-demand) analysis pipelines as well as the execution of large batch jobs which need to be run asynchronously, possibly scheduled according to the availability of computational resources.

We opted for UIMA as the basis for synchronous workflows, as well as the data representation used for integrating different analysis layers. On the other hand, a web-based API allows other analysis processes, such as audio and video analysis, to retrieve content and upload results independently, giving them complete freedom to schedule their work according to their specific needs.

4.1 Analysis pipelines using UIMA

UIMA provides a platform for the execution of analysis components (Analysis Engines or AEs), as well as for managing the flow between those components. CPE or uimaFIT^11 [2] can be used to design and execute pipelines made up of a sequence of AEs (and potentially some more complex flows), and UIMA-AS^12 (Asynchronous Scale-out) permits the distribution of the process among various machines or even a cluster (with the help of UIMA DUCC^13).

^11 https://uima.apache.org/uimafit.html
^12 http://uima.apache.org/doc-uimaas-what.html
^13 http://uima.apache.org/doc-uimaducc-whatitam.html

Within the EUMSSI project we have developed and integrated a number of UIMA analysis components, mostly dealing with text analysis and semantic enrichment. Whenever possible, components from the UIMA-based DKPro project [1] were used, especially for the core analysis components (tokenization, part-of-speech, parsing, etc.). In addition to a large number of ready-to-use components, DKPro Core provides a unified type system to ensure interoperability between components from different sources. Other components developed or integrated in EUMSSI were made compatible with this type system.
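The view and annotation mechanics of sections 3 and 4.1 can be illustrated in miniature. The following is an illustrative sketch only (the platform itself implements this with Java, UIMA type systems and real AEs; the dict-based "CAS", the engine names and the Keyword type are all invented for the example): a mock ASR engine creates a text view whose tokens carry both character offsets and media timestamps, a "normal" text engine then adds an annotation using character offsets only, and a final SofA-aware-style step maps that annotation back onto the timeline.

```python
# Illustrative sketch (not the EUMSSI API): a minimal "pipeline" in the
# spirit of UIMA analysis engines, operating on a CAS-like dict that is
# passed in-memory from one engine to the next.

def asr_engine(cas):
    """Create a text view from (mock) ASR output, one token per word.

    Each token records character offsets AND the original media
    timestamps, as described in section 3.
    """
    words = [("hello", 0.0, 0.4), ("world", 0.5, 0.9)]  # (word, t0, t1)
    text, tokens, pos = "", [], 0
    for w, t0, t1 in words:
        if text:
            text += " "
            pos += 1
        tokens.append({"begin": pos, "end": pos + len(w),
                       "start_time": t0, "end_time": t1})
        text += w
        pos += len(w)
    cas["views"]["asr"] = {"sofa": text, "tokens": tokens, "annotations": []}
    return cas

def keyword_engine(cas):
    """A 'normal' text engine: sees only the text, emits char-offset spans."""
    view = cas["views"]["asr"]
    target = "world"
    begin = view["sofa"].find(target)
    view["annotations"].append({"type": "Keyword", "begin": begin,
                                "end": begin + len(target)})
    return cas

def align_engine(cas):
    """A SofA-aware-style step: map char spans back to media timestamps."""
    view = cas["views"]["asr"]
    for ann in view["annotations"]:
        overlapping = [t for t in view["tokens"]
                       if t["begin"] < ann["end"] and ann["begin"] < t["end"]]
        ann["start_time"] = min(t["start_time"] for t in overlapping)
        ann["end_time"] = max(t["end_time"] for t in overlapping)
    return cas

def run_pipeline(cas, engines):
    for engine in engines:  # engines share the CAS in memory, as in UIMA
        cas = engine(cas)
    return cas

cas = run_pipeline({"views": {}}, [asr_engine, keyword_engine, align_engine])
```

After the run, the keyword found purely on character offsets (6-11) also carries the time interval 0.5-0.9 s recovered from the underlying ASR tokens, which is exactly the alignment that lets text-level results be projected back onto the video timeline.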
4.2 Managing content analysis with MongoDB

There are some components of the EUMSSI platform, however, that do not integrate easily in this fashion. This is the case of computationally expensive processes that are optimized for batch execution. A UIMA AE needs to expose a process() method that operates on a single CAS (= document), and is therefore not compatible with batch processing. This is particularly true for processes that need to be run on a cluster, with significant startup overhead, such as many video and audio analysis tasks.

It is therefore necessary to have an alternative flow mechanism for offline or batch processes, which needs to integrate with the processing performed within the UIMA environment.

The main architectural and integration issues revolve around the data flow, rather than the computation. In fact, the computationally complex and expensive aspects are specific to the individual analysis components, and should not have an important impact on the design of the overall platform.

As such, the design of the flow management is presented in terms of transformations between data states, rather than from the procedural point of view. The resulting system should only rely on the robustness of those data states to ensure the reliability and robustness of the overall system, protecting against potential problems from server failures or other causes. At any point, the system should be able to resume its function purely from the state of the persisted data.

Figure 3. Data flow and transformations
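To make this data-state approach concrete, here is a minimal illustrative sketch of keeping each item's queue state alongside the item itself and deriving the pending-work list from it. Plain Python dicts stand in for MongoDB documents, and the helper names and status values are invented; in the platform itself this selection is expressed as a MongoDB query over each item's "processing.queues" field.

```python
# Illustrative sketch: per-item processing state stored with the data
# itself, so the pending-work "queue" is just a query, not a separate
# queue service that could drift out of sync with the stored data.

def mark_processed(item, queue):
    """Record that `queue` has finished with this item."""
    item.setdefault("processing", {}).setdefault("queues", {})[queue] = "processed"
    return item

def pending_items(items, queue, dependencies=()):
    """Items not yet processed by `queue` whose prerequisites are done."""
    result = []
    for item in items:
        queues = item.get("processing", {}).get("queues", {})
        if queues.get(queue) == "processed":
            continue  # already handled by this queue
        if all(queues.get(dep) == "processed" for dep in dependencies):
            result.append(item)
    return result

items = [
    {"_id": "video-1", "processing": {"queues": {"asr": "processed"}}},
    {"_id": "video-2", "processing": {"queues": {}}},
]
# NER depends on the ASR transcript being available first:
todo = [i["_id"] for i in pending_items(items, "ner", dependencies=["asr"])]
```

Because workers only read and write this persisted state, a crashed worker (or a full restart of the platform) loses nothing: re-running the query reconstructs exactly the work that remains.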
To ensure reliability and performance of the data persistence, we use the well-established and widely used database system MongoDB, which provides great flexibility as well as proven scalability and robustness. Figure 3 shows the general flow of the EUMSSI system, focusing on the data states needed for the system to function.

In order to avoid synchronization issues, the state of the data processing is stored together with the data within each content item, and the list of pending tasks can be extracted at any point through simple database queries. We therefore only depend on the MongoDB database (which can be replicated across several machines or even a large cluster for performance and reliability) to fully establish the processing state of all items. For example, the queues for analysis processes can be constructed directly from the "processing.queues.queue_name" field of an item by selecting (for a given queue) all items that have not yet been processed by that queue and that fulfill all prerequisites (dependencies).

The analysis results are stored in CAS format (optionally with compression). In order to avoid potential conflicts or race conditions between components (most analysis processes run independently of one another), the different layers are stored in separate database fields as independent CASes. Components that work across layers then merge the separate CASes into a single one (as separate Views) in order to combine the information. The "meta.extracted" section of a document is used to store the simplified analysis results that are automatically synchronized with the Solr index; it can also be used as input to other annotators (such as detected Named Entities as input to speech recognition), to avoid the overhead of extracting that information from the CAS on demand.

In its simplest form, the processes responsible for the data transitions are fully independent and poll the database periodically to retrieve pending work. Those processes can then be implemented in any language that can communicate comfortably with MongoDB.

4.3 Multimodal multilayer data integration and enrichment

The integration of data from different analysis layers is usually done by loading the CAS representations generated by different prior processes and merging them as individual Views in a single CAS. Layers that work on different representations, e.g. speaker recognition, audio transcript and OCR, are aligned by using timestamps associated with the segments or tokens. As a result, new integrated views can be created, combining the different information layers. Metadata is also enriched by adding information to existing annotations or creating new ones, e.g. with information obtained from SPARQL DBpedia lookups.

4.4 Indexing for scalable data-driven applications

The final applications do not use the information stored in MongoDB directly, but rather access Solr indexes created from that information to respond specifically to the types of queries needed by the applications. Those indexes are updated whenever new analysis results are available for a given item, through the use of mongo-connector, which keeps the indexes always up-to-date with the content of the "meta.source" and "meta.extracted" sections.

5 Standards and interoperability

EUMSSI uses established protocols and builds on freely available and widely used open source software, in addition to publishing in-project developments under permissive open source licenses through popular platforms such as GitHub.

The API for external analysis components is REST-like and uses JSON for communication, whereas the end applications access the data through Solr's REST-like API (which supports various result formats). Metadata is represented using a vocabulary built upon schema.org, and the internal representations in UIMA use the DKPro type system as a core. Entity linking is performed against DBpedia, thus yielding Linked Open Data URIs for entities and allowing for the use of SPARQL and RDF to access additional information.

There is now a starting initiative to establish a "standard" type system for UIMA, with initial conversations pointing towards building upon the DKPro type system for this purpose. Various institutions have expressed their interest in endorsing such a type system, leading to a major step forward in improving interoperability between UIMA components from different sources.
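As a sketch of how an application might address the Solr index through its REST-like API, consider the following; the core name, endpoint and index fields (text, entities, source) are invented for the example, while the query parameters themselves (q, fq, facet, facet.field, wt) are standard Solr.

```python
# Illustrative sketch: building a Solr /select URL as an end application
# might issue it against the index of analysis results. Only the URL is
# constructed here; issuing it requires a running Solr server.
from urllib.parse import urlencode

def solr_select_url(base, core, text, entity=None, facet_field=None):
    params = [("q", f"text:{text}"), ("wt", "json")]
    if entity:
        # filter query: restrict results to items mentioning a linked entity
        params.append(("fq", f'entities:"{entity}"'))
    if facet_field:
        # facets give the collection-wide statistics used for exploration
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base}/{core}/select?" + urlencode(params)

url = solr_select_url("http://localhost:8983/solr", "eumssi", "election",
                      entity="http://dbpedia.org/resource/Angela_Merkel",
                      facet_field="source")
```

Because entity fields hold Linked Open Data URIs, the same DBpedia identifier can be used both to filter the index and to fetch additional information via SPARQL.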
6 Demonstrators and applications

Using the data stored in the system and made available through Solr indexes, as well as on-demand text analysis services, a variety of applications can be built that provide access to the content and information. Two very different applications were built within the EUMSSI project, serving as technology demonstrators as well as having real-world use.

The storytelling tool provides a web interface that allows a journalist to work on an article in a rich editor when writing a news article or preparing a report. The system then analyses the text the journalist is writing in order to provide relevant background information. In particular, the journalist can directly access the Wikipedia pages of entities that appear in the text, or find related content in the archives (including content from outside and social media sources). A variety of graphical widgets then allow the journalist to explore the content collection, finding relevant video snippets and quotes, or presenting relevant entities and the relations between them.

The tool is built on web technologies (HTML, Javascript, ...) and in particular AJAX-Solr^14, which manages the communication with the Solr backend that provides the data. Most of the functionality in the widgets is based on using Solr queries (automatically generated or manually specified) to select a relevant set of items (articles, videos, etc.) from the index.

The second-screen application, on the other hand, is aimed at viewers of TV or streaming content at home who would like to easily access background information, or have their viewing augmented with entertainment (or edutainment) activities such as automatically generated quizzes relating to the currently viewed content. The EUMSSI second screen application is implemented as a server-side application that connects a 'first screen' client, which shows the video in an HTML5 player, with one or more 'second screen' clients (which may be on the same machine, or on a separate laptop, tablet or smartphone). All possible information and questions that can be shown to the user are stored in JSON format in a WebVTT file (WebVTT is a W3C standard for displaying timed text, such as subtitles, in connection with HTML5 video^15).

All second screen clients that are logged in using the same identifier as the video show the same content at the same time. The video client sends the relevant content to the server at the moment specified in the VTT file. The server-side agent forwards the content to the second screen clients.

^14 https://github.com/evolvingweb/ajax-solr
^15 https://w3c.github.io/webvtt/

7 Conclusions

In the EUMSSI project we have developed a platform capable of handling large amounts of multimedia, with support for online and offline processing as well as the alignment and combination of different information layers. The system includes many interactions between different modalities, such as doing text analysis on speech recognition output, or adding Named Entities from surrounding text to the vocabulary known to the ASR system, among others.

The platform has proven capable of handling millions of content items on modest hardware, and is designed to allow for easily adding capacity through horizontal scaling.

The source code of the platform is publicly available at https://github.com/EUMSSI/. Additional documentation can be found in the corresponding wiki at https://github.com/EUMSSI/EUMSSI-platform/wiki.

ACKNOWLEDGEMENTS

The work presented in this article is being carried out within the FP7-ICT-2013-10 STREP project EUMSSI under grant agreement no. 611057, receiving funding from the European Union's Seventh Framework Programme managed by the REA-Research Executive Agency (http://ec.europa.eu/research/rea).

REFERENCES

[1] Richard Eckart de Castilho and Iryna Gurevych, 'A broad-coverage collection of portable NLP components for building shareable analysis pipelines', in Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT, pp. 1-11, Dublin, Ireland (August 2014). Association for Computational Linguistics and Dublin City University.

[2] Philip V. Ogren and Steven J. Bethard, 'Building test suites for UIMA components', in SETQA-NLP '09: Proceedings of the Workshop on Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp. 1-4 (June 2009).
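As a closing illustration of the second-screen mechanism from section 6 (timed JSON payloads carried in a WebVTT file), here is a small self-contained sketch. The cue payload and its fields are invented for the example; only the WebVTT cue-timing syntax ("start --> end") is the standard format.

```python
# Illustrative sketch: quiz/info payloads stored as JSON inside WebVTT
# cues, as used to drive second-screen clients in sync with the video.
import json

VTT = """WEBVTT

00:01:10.000 --> 00:01:20.000
{"type": "quiz", "question": "Who is on screen?", "answer": "Angela Merkel"}
"""

def parse_cues(vtt):
    """Return (start, end, payload) triples for JSON-bearing cues."""
    cues = []
    blocks = [b for b in vtt.strip().split("\n\n") if "-->" in b]
    for block in blocks:
        lines = block.splitlines()
        start, _, end = lines[0].partition(" --> ")
        payload = json.loads("\n".join(lines[1:]))  # cue text is JSON
        cues.append((start.strip(), end.strip(), payload))
    return cues

cues = parse_cues(VTT)
```

A server-side agent in this style would forward each payload to the logged-in second-screen clients when the first-screen player reaches the cue's start time.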