Visual Computing and Data Analytics

A Conceptual Architecture for AI-based Big Data Analysis and Visualization Supporting Metagenomics Research

Thoralf Reis, Thomas Krause, Marco X. Bornschlegl, and Matthias L. Hemmje

University of Hagen, Faculty of Mathematics and Computer Science, 58097 Hagen, Germany
{thoralf.reis, thomas.krause, marco-xaver.bornschlegl, matthias.hemmje}@fernuni-hagen.de

Abstract. This paper introduces an architecture for metagenomics research supported by Artificial Intelligence (AI) based Big Data Analysis and Visualization. The architecture is based on the AI2VIS4BigData Reference Model. Metagenomics research covers the examination of huge amounts of data to improve the understanding of microbial communities. Technological and methodical improvements in Big Data Analysis drive progress in metagenomics research and thereby support practical applications such as the analysis of cattle rumen with the research goal of reducing the negative impact of cattle breeding on global warming. AI2VIS4BigData is a reference model for the combined application areas of Big Data Analysis, AI, and Visualization. Its purpose is to support scientific and industrial activities with guidelines and a common terminology to enable efficient exchange of knowledge and information and thereby prevent “reinventing the wheel”. The general applicability of the AI2VIS4BigData model for metagenomics has been validated in a previous publication. As a next step, this paper derives a conceptual architecture that specifies a possible adaptation of AI2VIS4BigData for metagenomics. For this, three further metagenomic publications utilizing AI and Visualization are assessed.

Keywords: Metagenomics · Big Data · AI · Visualization · AI2VIS4BigData.

1 Introduction and Motivation

Metagenomics research analyzes relationships within whole microbial communities, while genomics research focuses on the analysis of genes or the genome of a single organism [1].
A practical example of metagenomics research is the investigation of the rumen microbiota with regard to its influence on cattle greenhouse gas emissions and food conversion efficiency [2], as cattle are a major contributor to climate change and relevant for food security, two significant challenges society is facing [2]. The demand for data in metagenomics research is significantly bigger than for regular genomics research: the investigation of relationships and coherence between organisms or genes in and between metagenomic samples [1] requires biological researchers to process, store, and exchange large amounts of data, e.g. via specialized bioinformatics databases [3]. Hence, metagenomics research benefits greatly from progress and development in Big Data Analysis such as decreasing costs for storage and processing of huge amounts of data.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CERC 2020.

With the EU-funded MetaPlat¹ project, scientists from different research institutions with either a Big Data Analysis or a bioinformatics background worked together to develop the MetaPlat platform. This cloud-based Big Data Analysis platform is specialized in analyzing metagenomics data such as rumen microbiota [2]. For an effective analysis of Big Data, the platform empowers researchers to utilize cutting-edge technology such as Artificial Intelligence (AI) [2] and various forms of Information Visualization (IVIS) [4], which provide them with visual feedback on their activities and enable them to identify new insights.

To define the vague term Big Data, a popular approach is to follow the data management challenges outlined by Doug Laney [5]. These challenges comprise three dimensions (the three V's): variety (ambiguous data manifestations regarding e.g.
data format, data structure or data semantics), volume (large amounts of data), and velocity (high-frequency data inflow) [5]. By this definition, the sheer volume of data in metagenomics research justifies labeling it as Big Data. The collective term AI summarizes techniques and methods such as symbolic AI or Machine Learning (ML) that implement intelligence for machines (in contrast to the natural intelligence of humans or animals) [6]. Example application scenarios of AI in metagenomics research on the rumen are the analysis of data through clustering [3] or the training of classifiers to categorize data samples [2].

Big Data and AI are closely connected to each other [6]: Big Data is very useful to derive, validate, apply, and enhance AI models, while AI-driven algorithms enable the exploration of Big Data and its potential. Visualization of data, of processing steps, and of AI model development is an important link between both application areas. It enhances comprehension and lowers entry barriers for new users. In addition, visualization offers the chance to meet the growing demand for explainability and transparency of AI.

The AI2VIS4BigData Reference Model for research and practical applications in the application areas Big Data Analysis, AI, and Visualization was introduced in [7]. Its objective is to provide a common specification as well as a common basis for discussion and thereby reduce the risk of inefficiency through reinventing the wheel and solving problems that have already been solved elsewhere. The reference model's theoretical applicability was evaluated in an expert round table workshop featuring presentations from three practical application domains: health care, economics, and metagenomics [8]. Until now, the reference model has been validated for only one metagenomics research application [9]. This paper assesses three further metagenomics research applications from the MetaPlat project to determine whether an architecture can be derived.
The remainder of this paper is structured as follows: the AI2VIS4BigData Reference Model (Section 1.1) and the three assessed metagenomics research publications from the MetaPlat project (Section 1.2) are introduced, and the pursued architecture modeling approach is presented (Section 1.3). A multi-layered conceptual architecture is introduced in Section 2 together with an initial validation in Section 3, before the paper concludes by outlining its contributions and providing an outlook (Section 4).

¹ https://metaplat.eu

[Figure 1: diagram of the reference model showing the Model Deployment layer, the processing steps Data Management & Curation, Analytics, and Interaction & Perception, and the Data Intelligence layer]
Fig. 1. AI2VIS4BigData - A Reference Model for AI Supporting Big Data Analysis

1.1 AI2VIS4BigData Reference Model

The AI2VIS4BigData Reference Model (Figure 1) was derived by projecting the AI lifecycle phases of AIGO's AI System Lifecycle [10] onto the IVIS4BigData Reference Model [11], considering the different AI models (ML or statistical AI models as well as symbolic AI models) for supporting Big Data Analysis, AI data, and AI user stereotypes [7].
It contains the three processing steps Data Management & Curation, Analytics, and Interaction & Perception, accompanied by a data intelligence layer for user interaction and User Interfaces (UI), all taken from IVIS4BigData [11], a reference model that aims to “close the gap in research with regard to information visualization challenges of Big Data Analysis as well as context awareness” [11]. AI2VIS4BigData introduces a model deployment layer that spreads over the three processing steps [7]. The deployed AI models are executed directly within the data and information loop, which supplies them with the input data they need for execution and feeds their output data back into the Big Data Analysis system [7]. The remaining activities of the AI system lifecycle phases are displayed within the analytics layer as Design, Implementation & Training, Data Selection, Verification & Validation as well as Operation & Monitoring. They are interconnected through bidirectional arrows that emphasize the iterative nature of AI model design [7]. The different reference model elements are linked to six clearly distinguished AI user stereotypes (model designer, domain expert, model deployment engineer, model operator, model end user, and model governance officer) and four clearly distinguished Big Data Analysis user stereotypes (system owner, data scientists, management consultants as well as directors including C-levels) [7].

1.2 Assessed Metagenomic Use Cases

In [9], a conceptual workflow for metagenomic studies was presented and demonstrated using two previously published metagenomic use cases. The first of these use cases [12] was the visualization of gene dependencies using a whole-genome approach and a new framework for improved correlation measurement between genes.
The second publication [13] analyzed the relationship between microbiomes in feces and rumen using a taxonomic analysis of partial genome sequences (barcode sequences). Together they cover the two main branches of metagenomic analyses (taxonomic and functional). It was shown in [9] that the metagenomics analysis workflow extracted from these publications can be mapped directly onto the AI2VIS4BigData Reference Model, thereby validating the model's relevance for the field of metagenomics.

This section introduces three additional publications in metagenomics research that serve as a base to further validate this conceptual workflow and transform it into a generic architecture. The publications were selected because they are practical examples of metagenomic analysis (which can represent Big Data Analysis applications depending on the sample size), were carried out by different researchers and, most importantly, describe the usage of statistical methods or ML as well as Visualization. Although all selected publications originate from the MetaPlat project, they represent different research approaches, e.g. the analysis of genes or the analysis of OTUs. In addition, the homogeneous MetaPlat terminology eases the architecture derivation.

The publication A Metagenomics Analysis of Rumen Microbiome [2] by P. Walsh et al. demonstrates a metagenomic analysis of the “Bos taurus” rumen microbiome using ML models in a cloud-based environment. For optimal performance and scalability, a queueing system is used between individual components, thus enabling asynchronous and parallel execution. After raw sequence data has been imported into the system, it is written to one of these processing queues, which feed into a similar metagenomic analysis workflow that uses the QIIME toolset to perform data cleanup and clustering of sequences into Operational Taxonomic Units (OTUs). The workflow assigns taxonomic labels to these OTUs.
In an analytics step, various ML models are used to classify the samples into phenotypes using the taxonomic data of the previous steps as an input. Finally, the publication showcases various visualizations ranging from a taxonomic composition chart to plots of algorithmic accuracy and other AI metrics.

In Analysis of Rumen Microbial Community in Cattle through the Integration of Metagenomic and Network-based Approaches [3], H. Wang et al. functionally analyze the rumen microbial community in cattle by applying a network-based approach: the authors construct a co-abundance network utilizing the “relative abundance of 1570 microbial genes” [3] that enables them to identify functional modules. In doing so, they present a method to automatically determine the cutoff threshold value needed to generate the co-abundance network in the first place [3]. While the first publication [2] uses partial sequences sufficient to identify and analyze the taxonomic composition, this publication is based on whole genome data, which enables the analysis of genes. Together they cover the two main branches of metagenomic studies. To construct the co-abundance network used in the publication, the short reads generated by next-generation sequencing platforms are assembled into longer sequences. These sequences are then matched to the KEGG² database to identify genes (and associated metadata) present in the samples. Using the relative abundances of these genes, correlations can then be computed by analyzing how the abundance of one gene affects the abundance of other genes across the various samples. Since the presence or absence of a correlation is not always distinguishable from statistical noise, a suitable cutoff value is then determined using an automated computational method.
Using the cutoff values, a network graph can be constructed that represents genes as nodes and the correlation strength as the length of the edges connecting these nodes.

As the third and last assessed publication, M. Wang et al., the authors of Understanding the relationships between rumen microbiome genes and metabolites to be used for prediction of cattle phenotypes [4], combined metabolomics with metagenomics in order to identify differences in diets and methane emissions from rumen metabolites and microbial genes. They analyzed 36 rumen samples and identified differences in the response of rumen microbes to different basal diets, which eventually affect cattle methane emissions [4]. The study starts from gene abundance data of cattle rumen obtained from previous studies on the experiment designed by Roehe et al. [14]. The abundance data was cleansed and transformed before multiple activities were conducted to determine correlations between genes and metabolites related to the differences in diets in the experiment design. The correlation data was then used to build correlation networks as well as various other plots and result tables.

1.3 Discussion, Conclusion, and Identification of Remaining Architectural Challenges

In order to arrive at a generic architecture that enables the management, analysis, and visualization of metagenomic data as well as the fusion with other health-related data and knowledge, the first step is mapping the introduced metagenomics publications to the generic stages of the AI2VIS4BigData Reference Model (“Data Management & Curation”, “Analytics”, and “Interaction & Perception”). This mapping is easy to validate: all three publications include steps to ingest, manage or clean up metagenomic sequences, all of them include statistical or ML methods for analytics, and all of them produce one or more visualizations. The same was already demonstrated in [9].
Therefore, it is proposed that an architectural model should explicitly model these stages.

² Kyoto Encyclopedia of Genes and Genomes, https://kegg.jp

Looking at the papers in detail, further requirements for a comprehensive architectural model can be derived. The first publication [2] describes the importance of using individual components that communicate through asynchronous mechanisms such as queuing systems to achieve high performance and scalability. The impact of Big Data and ML in metagenomics is also mentioned as a challenge in [9]. A suitable architecture should therefore aim to separate individual parts and components of the system where possible so that they can operate and scale individually. The second publication [3] shows the need for additional knowledge sources such as gene databases for the analysis of metagenomic sequences. The proposed architecture should therefore support the ingestion and persistence of these additional data sources into a knowledge network that can be used by metagenomic workflows. The third publication [4] is important as it does not start from raw sequence data but from intermediate results obtained from other studies. The architecture should be able to reuse the same intermediate results for several distinct analyses, which requires the persistence of these intermediate results. This requirement also partially addresses the challenge of “Reproducibility” mentioned in [9] and the area of “AI Transparency, Explanation & Data Privacy” of the AI2VIS4BigData model. All three publications differ significantly in the exact steps executed in the analysis phase and the visualizations produced. It is therefore important that the analysis is done in a modular fashion where the order and type of steps are dynamic, and that a wide range of visualizations is supported.
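As an illustration of the kind of modular analysis step such an architecture has to accommodate, the co-abundance network construction of the second publication [3] (Section 1.2) can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of [3]: it uses plain Pearson correlation, it takes the cutoff as a caller-supplied parameter instead of determining it automatically, and the function names, gene names, and abundance values are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation of two equally long series (0.0 if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def co_abundance_network(
    abundance: Dict[str, List[float]], cutoff: float
) -> Dict[Tuple[str, str], float]:
    """Return an edge {(gene_a, gene_b): r} for every gene pair with |r| >= cutoff.

    `abundance` maps each gene to its relative abundance per sample.
    The cutoff is an assumption supplied by the caller; the automated
    cutoff determination described in [3] is not reproduced here.
    """
    edges = {}
    for gene_a, gene_b in combinations(sorted(abundance), 2):
        r = pearson(abundance[gene_a], abundance[gene_b])
        if abs(r) >= cutoff:
            edges[(gene_a, gene_b)] = r
    return edges


# Hypothetical toy values: three genes measured across four samples.
abundance = {
    "geneA": [0.1, 0.2, 0.3, 0.4],
    "geneB": [0.2, 0.4, 0.6, 0.8],
    "geneC": [0.3, 0.1, 0.3, 0.1],
}
network = co_abundance_network(abundance, cutoff=0.9)
```

With these toy values only the pair geneA and geneB is strongly correlated, so the network contains a single edge. How the stored correlation is mapped onto an edge length is left to the visualization component.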
2 AI2VIS4BigData Conceptual Architecture supporting Metagenomics Research

This paper introduces the AI2VIS4BigData architecture (Figure 2) for the processing and analysis of metagenomic data in an AI and Big Data environment. It was designed by extending the Big Data Analysis and Visualization architecture of IVIS4BigData [11] with AI and metagenomic aspects in order to fulfill the metagenomics requirements outlined in Section 1.3. The architecture is vertically split into three pillars that separate the components for metagenomics data integration and processing (domain-specific input) and for AI and data science modeling and configuration (AI analysis input) from the components responsible for result visualization and data generation (output). This is based on the design principle of Separation of Concerns (SoC) [15] and makes it easier to develop, scale or exchange the components separately. Each of these three pillars is structured into three layers following the Model View Controller (MVC) pattern [16], with a shared persistence layer interconnecting all three pillars. The bottom layer represents the model, the top layers represent the view, and the middle layers contain the controllers. Metagenomics-specific architecture elements are a dedicated user, knowledge and data artifacts within the input layer, assets and knowledge networks in the persistence layer, as well as domain-specific end user interfaces. The following rough description of the individual layers and components follows the flow of data, starting at the top left with data input and ending at the top right with result visualization:

Knowledge & Data Input. Within this layer, expert users or systems ingest metagenomics-related knowledge and data into the system. This information comprises biological and genetic knowledge (e.g. protein metadata or knowledge automatically extracted from scientific publications) as well as diagnostic and subject data (e.g. metagenomic sequences).
AI Integration & Fusion. This layer contains all services and methods needed to integrate the various domain-specific inputs into the system, to perform a data fusion, and to persist the result as structured content or as a knowledge network. The semantic integration is realized through implementation of the mediator wrapper approach.

[Figure 2: architecture diagram with the top layers Knowledge & Data Input, Model & Configuration Input, and End User Interface, the middle layers AI Integration & Fusion, AI Analysis, and AI Input/Output, and the shared Persistence layer comprising Data Lake, Structured Content, and Knowledge Network]
Fig. 2. AI2VIS4BigData Conceptual Architecture Supporting Metagenomics Research

Model & Configuration Input.
The knowledge and information necessary for configuring the AI applications within the system is provided by AI and data science expert users within this layer. The input contains the knowledge required to register and schedule all AI services and to select appropriate analysis methods and algorithms. The additional AI2VIS4BigData role of the Governance Officer ensures legal compliance and the maintenance of ethical standards by providing relevant constraints.

AI Analysis. The middle layer is responsible for performing analysis tasks on behalf of the user. A workflow system together with a service registry allows for flexible configuration of the required analysis steps, while the scheduler manages the execution of these steps on distributed or local computing nodes. Intermediate and final results are stored persistently.

Persistence. The persistence layer stores various types of data and enables data exchange between the overlying layers. Raw data is stored in a data lake with little to no processing performed in order to improve reproducibility and transparency of the system. Structured data includes parsed genetic sequences, intermediate results from analysis processes, and other kinds of schema-bound data. Lastly, a knowledge network represents biological and medical knowledge as well as the semantic rules required for Symbolic AI in a machine-readable way.

AI Input/Output. The purpose of this layer is to intelligently interpret the intentions of the system's end user (e.g. by applying natural language processing) and to present the information that is relevant for them in a suitable form (e.g. after performing a dimensionality reduction or selecting appropriate visualization techniques).

End User Interface. The end user interface layer contains the multimodal interfaces through which the system's end users access its data and information.
These interfaces comprise visualizations, reports, and dialogue systems that present the domain-specific artifacts (e.g. taxonomic compositions).

3 Initial Validation and Remaining Challenges

The proposed architecture addresses all areas of the AI2VIS4BigData Reference Model. The area “Data Management & Curation” of the reference model is addressed by the left pillar. The “Analytics” area is covered by the middle pillar and especially by the “AI Analysis” layer. Finally, the “Interaction & Perception” area is implemented through the right pillar. The architecture also implements all requirements that were outlined previously. A detailed mapping of the requirements to the architecture elements would be beyond the scope of this paper, yet is planned for future work. Individual components for input, analysis, and visualization are strictly separated by the three pillars and communicate asynchronously through the persistence layer, allowing for flexible scaling and Big Data processing. Additional knowledge sources are supported by providing a data-agnostic input layer together with a mediator wrapper architecture for data integration. The persistence layer makes reproducibility and transparency possible by storing intermediate and final results. Finally, a flexible workflow system and a service registry support the heterogeneity of metagenomic studies and allow easy integration of new analysis methods.

The remaining challenges for the architecture comprise a harmonization with the IVIS4BigData architecture, a generalization for application domains beyond metagenomics research, a technical specification, as well as a proof-of-concept technical implementation. Since the selected publications were limited to the MetaPlat project, the assessment of the practical applicability of the introduced architecture in metagenomics research beyond MetaPlat is a further remaining challenge.
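The interplay of service registry, workflow system, and persistence layer summarized in this section can be sketched as follows. Since a technical specification is still a remaining challenge, this is only one possible reading of the concept: the class names `ServiceRegistry` and `WorkflowEngine`, the key scheme of the store, and the toy services are all hypothetical, a plain dictionary stands in for the persistence layer, and the scheduler and asynchronous execution are omitted.

```python
from typing import Any, Callable, Dict, List

Step = Callable[[Any], Any]


class ServiceRegistry:
    """Maps service names to callables so workflows can reference steps by name."""

    def __init__(self) -> None:
        self._services: Dict[str, Step] = {}

    def register(self, name: str, service: Step) -> None:
        self._services[name] = service

    def lookup(self, name: str) -> Step:
        return self._services[name]


class WorkflowEngine:
    """Runs a configurable sequence of registered steps and persists each result."""

    def __init__(self, registry: ServiceRegistry, store: Dict[str, Any]) -> None:
        self.registry = registry
        self.store = store  # hypothetical stand-in for the persistence layer

    def run(self, workflow_id: str, step_names: List[str], data: Any) -> Any:
        for index, name in enumerate(step_names):
            data = self.registry.lookup(name)(data)
            # Persist intermediate and final results under a per-step key.
            self.store[f"{workflow_id}/{index}-{name}"] = data
        return data


# Hypothetical toy services operating on lists of sequence reads.
registry = ServiceRegistry()
registry.register("clean", lambda reads: [r for r in reads if "N" not in r])
registry.register(
    "gc_content",
    lambda reads: [(r.count("G") + r.count("C")) / len(r) for r in reads],
)
registry.register("lengths", lambda reads: [len(r) for r in reads])

store: Dict[str, Any] = {}
engine = WorkflowEngine(registry, store)
gc = engine.run("study1", ["clean", "gc_content"], ["ACGT", "GGNC", "GGCC"])
# A second analysis starts from the persisted intermediate result, not from raw data.
lengths = engine.run("study1-lengths", ["lengths"], store["study1/0-clean"])
```

Because steps are referenced by name, a new analysis method only needs to be registered to become usable in any workflow, and the second call shows how a distinct analysis can reuse an intermediate result instead of reprocessing raw data.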
4 Conclusion and Outlook

In the course of this paper, three MetaPlat publications were assessed that analyze rumen microbiota through metagenomics research utilizing Big Data Analysis, AI, and visualization. The objective of this assessment was the derivation of an AI2VIS4BigData-based conceptual architecture for real-life application in this three-fold research area. The resulting AI2VIS4BigData conceptual architecture supporting metagenomics research was introduced in Section 2. It consists of seven layers arranged along the three levels of the MVC pattern. Future work is planned to overcome the challenges introduced in Section 3.

References

1. F. Engel, M. Fuchs, P. M. Kevitt, M. Hemmje, and P. Walsh, “A Metagenomic Content and Knowledge Management Ecosystem Platform,” 2019.
2. P. Walsh, C. Palu, B. Kelly, B. Lawor, J. T. Wassan, H. Zheng, and H. Wang, “A Metagenomics Analysis of Rumen Microbiome,” Proceedings - 2017 IEEE International Conference on Bioinformatics and Biomedicine, pp. 2077–2082, 2017.
3. H. Wang, H. Zheng, F. Browne, R. Roehe, R. J. Dewhurst, F. Engel, M. Hemmje, and P. Walsh, “Analysis of Rumen Microbial Community in Cattle through the Integration of Metagenomic and Network-based Approaches,” 2016 IEEE International Conference on Bioinformatics and Biomedicine, pp. 198–203, 2017.
4. M. Wang, H. Zheng, H. Wang, R. J. Dewhurst, and R. Roehe, “Understanding the relationships between rumen microbiome genes and metabolites to be used for prediction of cattle phenotypes,” in BIBE 2019; The Third International Conference on Biological Information and Biomedical Engineering. VDE, 2019, pp. 1–5.
5. D. Laney, “3D Data Management: Controlling Data Volume, Velocity, and Variety,” META Group, Tech. Rep., 2001.
6. ISO, “ISO/IEC JTC 1/SC 42 Artificial Intelligence,” 2018. [Online].
Available: https://isotc.iso.org/livelink/livelink/open/jtc1sc42
7. T. Reis, M. X. Bornschlegl, and M. L. Hemmje, “Towards a Reference Model for Artificial Intelligence Supporting Big Data Analysis,” To appear in: Proceedings of the 2020 International Conference on Data Science (ICDATA'20), 2020.
8. T. Reis, M. X. Bornschlegl, and M. L. Hemmje, “AI2VIS4BigData: Qualitative Evaluation of a Big Data Analysis, AI, and Visualization Reference Model,” To appear in: Lecture Notes in Computer Science, vol. LNCS 10084, 2020.
9. T. Krause, B. Andrade, H. Afli, H. Wang, H. Zheng, and M. Hemmje, “Understanding the Role of (Advanced) Machine Learning in Metagenomic Workflows,” To appear in: Lecture Notes in Computer Science, vol. LNCS 10084, 2020.
10. OECD, Artificial Intelligence in Society, 2019.
11. M. X. Bornschlegl, “Advanced Visual Interfaces Supporting Distributed Cloud-Based Big Data Analysis,” Dissertation, University of Hagen, 2019.
12. H. Zheng, H. Wang, R. Dewhurst, and R. Roehe, “Improving the Inference of Co-occurrence Networks in the Bovine Rumen Microbiome,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, p. 1, 2018.
13. B. G. N. Andrade, F. A. Bressani, R. R. C. Cuadrat, P. C. Tizioto, P. S. N. de Oliveira, G. B. Mourão, L. L. Coutinho, J. M. Reecy, J. E. Koltes, P. Walsh, A. Berndt, L. C. A., J. C. P. Palhares, and L. C. A. Regitano, “The structure of microbial populations in Nelore GIT reveals inter-dependency of methanogens in feces and rumen,” Journal of animal science and biotechnology, vol. 11, p. 6, 2020.
14. R. Roehe, R. J. Dewhurst, C.-A. Duthie, J. A. Rooke, N. McKain, D. W. Ross, J. J. Hyslop, A. Waterhouse, T. C. Freeman, M. Watson, and R. J. Wallace, “Bovine Host Genetic Variation Influences Rumen Microbial Methane Production with Best Selection Criterion for Low Methane Emitting and Efficiently Feed Converting Hosts Based on Metagenomic Gene Abundance,” PLOS Genetics, pp. 1–20, 2016.
15. E. W.
Dijkstra, “On the role of scientific thought,” in Selected writings on computing: a personal perspective. Springer, 1982, pp. 60–66.
16. M. Fowler, Patterns of enterprise application architecture. Addison-Wesley Longman Publishing Co., Inc., 2002.