Construction techniques of Baikal microbiome research
information-computational environment
                   E A Cherkashin and A O Shigarov
                   Matrosov Institute for System Dynamics and Control Theory, Siberian Branch
                   of Russian Academy of Sciences, Irkutsk, Russia,
                   Irkutsk Scientific Center, Siberian Branch of Russian Academy of Sciences, Irkutsk, Russia
                   E-mail: eugeneai@icc.ru

Abstract. A toolset and model data sources for research and development of an environment for Next
Generation Sequencing data processing are considered in the paper. The environment is constructed on
the basis of model transformations targeted at industrial-grade systems, allowing domain specialists to
carry out Next Generation Sequencing research, which includes genetic data processing, visualization, and
data integration. The integration frees the user from the restrictions that an application library imposes
through its operation set and properties. The transformation technique is based on Model Driven Architecture
principles and logical inference of the derived models and the code. The current results and the future work
are presented and discussed.




1. Introduction
In the last decade, after the invention of next generation sequencing methods and their introduction into
the practice of biological systems research, a new direction of molecular genetics has formed, referred
to as metagenomics. Its main object of study goes beyond individual microscopic cultivated organisms
to their communities, microbiomes. Total DNA (DeoxyriboNucleic Acid) is extracted from a sample,
resulting in a general image of the microbiome. The method allows one to describe a significant number
of new groups of organisms at all taxonomic levels. A comprehensive review of present sequencing
approaches and challenges is presented in [1].
    One of the types of metagenomic studies is amplicon analysis. It is applied to the investigation
of the microbiota of different environments of Lake Baikal [2]. Performing the analysis requires significant
computational resources, as well as bioinformatics skills for analysis and interpretation.
The researcher composes the computational process by combining different modules of bioinformatic
software for data conversion, data analysis, and visualization. To carry out the studies, domain specialists
need to be skilled in scripting the command shell of the operating system (Linux, Windows),
running a distributed computing environment on a cluster computing system, and programming in general-purpose
and domain-specific languages, e.g., Python and R.
    The aim of this study is to develop mathematical and software support for the processes of analyzing
the results of Next Generation Sequencing (NGS). We are to develop techniques and software for
a visual representation of the computational process of amplicon analysis so that domain specialists
would be able to compose computational pipelines, which are executed on distributed heterogeneous

computing resources (clouds). The software implementation is a cloud infrastructure built on models
representing the computational process as a set of operations, the structure and functions of
the computing resources, and scheduling algorithms for those resources.
   In [3], the problem of cloud usage for storage and computing is stated as the basic problem of NGS,
since storage capacity grows exponentially slower than the volume of NGS-generated data.
Transporting data between the mirror storages and the computer systems that process them
could easily exhaust network capacity. Data processing in the general case of whole
genome reconstruction requires terabytes of RAM and, in the case of cluster computing, special
high-performance parallel algorithms. Two classes of users are identified: the power user, who analyses the
genome, and the casual user, who deals with the power user's results, e.g., integrating gene data between
datasets and studies.
   The domain of our IT R&D is processing data within natural science research activities,
which are characterized by a variety of tasks, methods, and multidisciplinary aims. We observe a
constant increase of data obtained from field investigations; each new dataset is compared to all the
data obtained in previous years. The number of scientific problems is increasing too. Another
problem to be solved is the interest of domain specialists, biologists, in being involved in data processing
while guaranteeing a reasonable quality of processing. This processing pipeline should be automatic in general
and adjustable in terms of the mathematical and biological domains.
   We propose to construct a PaaS and DaaS cloud consisting of independent network-connected SaaS
services adapted to the MiSeq standard operational procedure (MiSeq SOP) used in microbiome research of
Lake Baikal. The PaaS will enable biologists to investigate data themselves. The DaaS will allow bioinformatics
specialists to work with data on demand when designing new data processing techniques. The SaaS services
will support the individual operations for the PaaS. A number of SaaS packages have already been made for
NGS data processing; reference [3] contains their detailed survey.

2. Automation of MiSeq standard operational procedure
To be more concrete in the further reasoning, we briefly describe the MiSeq SOP implemented with the Mothur
software [4]. This NGS data analysis process consists of individual operations on genetic data, which
are stored in files. To get acquainted with the technique, we processed the limnological data of [5] according
to the procedure presented on the Mothur website1 .
    After executing the technique manually, we observed a number of issues. The parameter structure and
output of Mothur commands are intricate: the researcher needs to trace how file names change from one
command to the next. After each execution, Mothur commands add suffixes to the input file names.
File naming depends on the input parameters, e.g., the method used to process the data. For example,
after applying align.seqs to a file named HXH779K01.shhh.trim.good.unique.fasta we obtain2
HXH779K01.shhh.trim.good.unique.align, HXH779K01...trim.good.unique.align.report,
and HXH779K01...im.good.unique.flip.accnos. Almost every operation adds new suffixes.
Repeated application of an operation to data results in repetitive suffixes. Mothur can pass correct
filenames itself if the process goes directly forward, e.g., in scripts; but after making a mistake or returning
to a previous step to refine coefficients, the researcher must trace the filenames manually.
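    The following minimal sketch (a hypothetical helper, not part of Mothur) models this naming
behaviour: an operation-to-suffix table predicts which files a command will produce, which is the kind
of bookkeeping the researcher otherwise performs by hand. The suffix table is an illustrative subset.

# Hypothetical helper modelling Mothur's suffix-based output naming.
SUFFIX_RULES = {
    # illustrative subset of the real naming behaviour
    "align.seqs": [".align", ".align.report", ".flip.accnos"],
    "unique.seqs": [".unique.fasta", ".names"],
}

def predict_outputs(input_name: str, command: str) -> list:
    """Return the file names a command is expected to create."""
    stem = input_name.rsplit(".", 1)[0]  # drop the final extension
    return [stem + suffix for suffix in SUFFIX_RULES[command]]

print(predict_outputs("HXH779K01.shhh.trim.good.unique.fasta", "align.seqs"))
# prints the three output names listed in the text above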
    After obtaining the results in the form of tables and charts, biologists usually repeat some filtering
stages, excluding additional OTUs (Operational Taxonomic Unit3 ), e.g., ones similar to mitochondria
and chloroplasts that were not recognized within the MiSeq technique. Sometimes users want to replace a
command with an analogous one from another package, e.g., QIIME2 or Usearch, to check the default
one or to take advantage of special features of the external command. In this case, data conversion must be
performed.
1
  https://www.mothur.org/wiki/MiSeq_SOP
2
  File names are truncated in the beginning for better text layout.
3
  An abstract notion of species, used when the taxon is not determined or there is no sense in determining it.
   Another possible but less frequent deviation from the technique is the involvement of previously
processed OTUs from earlier research, e.g., samples taken in the same places in the previous year during
ecological monitoring. In this case, the researcher needs to match the OTUs from different studies and
write a routine for comparing OTU contents and merging group data, as in the sketch below.
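   A minimal sketch of such a routine follows, assuming (purely for illustration) that each study
provides a mapping from OTU identifiers to representative sequences and a per-group count table;
OTUs from two studies are matched by identical representatives and their group counts merged.

# A sketch of matching OTUs between studies and merging group counts.
def merge_otu_tables(reps_a, reps_b, counts_a, counts_b):
    """reps_*: OTU id -> representative sequence;
    counts_*: OTU id -> {group: count}. Matching is by representative."""
    by_rep = {rep: otu for otu, rep in reps_a.items()}
    merged = {otu: dict(groups) for otu, groups in counts_a.items()}
    for otu_b, rep in reps_b.items():
        otu = by_rep.get(rep, otu_b)       # same sequence -> same OTU
        groups = merged.setdefault(otu, {})
        for group, n in counts_b.get(otu_b, {}).items():
            groups[group] = groups.get(group, 0) + n
    return merged

print(merge_otu_tables({"Otu001": "ACGTACGT"},
                       {"OtuX": "ACGTACGT", "OtuY": "TTGGCCAA"},
                       {"Otu001": {"st1": 10}},
                       {"OtuX": {"st2": 4}, "OtuY": {"st2": 7}}))
# {'Otu001': {'st1': 10, 'st2': 4}, 'OtuY': {'st2': 7}}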
   Visualization is partially covered by the command set of Mothur. It is able to produce SVG vector
images, but the images, in general, cannot be customized. Researchers have to use external software such as
R to build charts of the desired quality. Our experience shows that, despite the time spent studying the
chart building techniques, which is needed only once, most time is spent on converting and filtering input data
and refining the parameters of the chart building commands.

3. Related works
R&D activity in the NGS domain is divided into three main directions:
  • development of new efficient algorithms for data processing operations and building charts,
  • organizing standardized cloud computing pipelines with HPC (High-Performance Computing)
    implementations of various operations,
  • representing pipelines as workflows, as well as user interfaces to support interactive data processing
    and assessment.
    First, let us consider research devoted to the productivity of the algorithms and to improving the
quality of analysis results. In [6], the Go, C++, and Java programming languages were assessed with respect to
ease of implementation, memory consumption, and overall computation performance; Go was chosen.
The main requirements concerned big-data string processing. The system is designed to store the strings
under processing in main memory; it shares data parts between CPUs and limits input/output operations.
Interestingly, C++ was the slowest.
    In [7], an NGS analysis pipeline was developed for the investigation of viral DNA contained in
the human body. The pipeline allowed medical engineers to focus their studies on vaccine development. The
findings demonstrated that the proposed NGS data analysis pipeline identified unknown viruses from
mixed clinical samples, revealed their genetic identity and variants, and characterized their genetic
features in terms of viral evolution. The detection process is based on comparing parts of the viral
genome with the BLAST database and on coverage analysis.
    Paper [8] deals with the implementation of a heuristic algorithm for scaffolding, i.e., ordering, moving
and orienting contigs using additional information to produce longer sequences at the next stages. In [9],
a data-driven user interface and visualization are considered as part of a clinical decision support
system implementation. The user interface supports decision-making for doctors and explanations for
patients as a list of events of various kinds (medical records, NGS results over autopsy tissue, etc.)
represented as HTML5 portlets. A comprehensive review of quality control, error detection and correction
in processing NGS data is presented in [10].
    We begin the review of HPC techniques with an application [11] of the BOINC technology to
the alignment procedure, where the Novoalign algorithm was scaled. Reference [3] reviewed the
problem area and described the existing approaches and services already made, but does not mention
implemented cloud computing environments. Paper [12] contains an excellent review of current
achievements in NGS and related areas. The authors doubt that NGS software users can organize a laboratory
HPC center based on cluster computing and suggest dealing with IaaS cloud computing. The
paper has a very good review of existing commercial and open-source platforms allowing the construction
of pipelines of computing processes. Commercial software mostly implements predefined pipelines and is
inflexible, whereas open-source software tends to implement either standardized pipelines or a set of
modules for individual operations and cloud service implementations, i.e., toolsets. In [13], the Rainbow
software, a cloud implementation of NGS data processing, is considered. Rainbow is essentially a
Perl script implementing map (division) for input and reduce (join) for output data, together with
the distribution of the data pieces between Amazon EC2 cloud nodes. The cloud nodes perform only
the alignment. The paper also has a good review of Linux virtual machine cloud distributions and
bioinformatic packages. Another interesting review of cloud computing techniques is presented in [14].
The authors pay attention to the open-source cloud (OpenStack) and construction tools, such as the Common
Workflow Language (CWL) [15] used to represent the computational process in a cloud. A comprehensive
short survey can be found in [16], but at this point it adds no essential data.
    There are visual tools for genetic analysis, e.g., Galaxy [17], which implements a popular approach
(metaphor) of an interactive web page, where data are imported and processed with modules. Galaxy
can also analyze the user script and construct a dataflow representation. Its primary purpose is to
teach biologists to process single genome data, but it can be extended to implement other NGS research
procedures. It is an open-source project under active development, and we could use it as one of
the implementation platforms. Another tool is UGENE [18]: a desktop application, open-source,
written with the Qt5 framework, and also under active development. UGENE's primary function is
visualizing workflows and gene data. The main criticism of such tools is presented in [19], where it is
stated that command line utilities support more functions and have higher flexibility than visual tools.
The authors propose their own visual tool VisPro, connected to a cloud. The tool is built on the basis of
the agile approach, which assumes principal participation of the developer in the process of workflow
construction, configuration and execution.
    Summing up the review, we conclude that there is a good background for the construction of our
infrastructure, and the techniques used in Baikal microbiome research must be adapted to this background.
The technologies discussed above allow us to account for the peculiarity of our problem, which is the
requirement of greater flexibility of the computation process and adaptation to the domain users' experience.

4. A MiSeq SOP automation approach
Limnologists perform both power-user and casual-user activities [3], i.e., they process both raw sequencing
data and the results of sequence processing, visualizing, comparing and generalizing results. HPC is
usually based on two popular programming models [3]: MapReduce (Hadoop) and task programming.
The first implies that data can be split into subsets, which can be processed mostly independently.
The results of the parallel processing are then joined (reduced) into an aggregate object, as the sketch
below illustrates.
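     The following minimal Python sketch illustrates the MapReduce idea on sequence records: the data
are split into independent chunks, processed in parallel, and the partial results are reduced to one
aggregate. All names and the toy per-chunk statistic are illustrative, not part of an actual pipeline.

# A sketch of the MapReduce model: split, process in parallel, reduce.
from concurrent.futures import ProcessPoolExecutor

def split_records(records, n_chunks):
    """Map step: divide records into independent subsets."""
    size = max(1, len(records) // n_chunks)
    return [records[i:i + size] for i in range(0, len(records), size)]

def chunk_stats(chunk):
    """Per-chunk work, standing in for a filtering or counting operation."""
    return sum(len(seq) for _, seq in chunk)

if __name__ == "__main__":
    records = [("seq1", "ACGT"), ("seq2", "GGCCA"), ("seq3", "TTAAC")]
    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(chunk_stats, split_records(records, 2)))
    print(sum(partials))  # reduce step: join partial results, prints 14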
     Mothur's simple filtering commands are easily run in parallel (indeed, Mothur uses CPU cores), but their
computational complexity does not seem high enough to justify spending time on splitting
and joining. The main reason for using clouds here is the aggregation of RAM when data do not
fit in the workstation memory. Some filtering is based on classification, which processes the whole gene
dataset using subsampling. Moving the corresponding algorithms to SaaS implies substituting them with
a cluster version.
     In general, to make our cloud computing architecture simpler, at the first stage of R&D we decided
to use the task queue execution model, where computational resources execute individual tasks from a
network of modules representing a variant of the MiSeq SOP; a sketch of this model follows below.
The network of modules is constructed on the basis of a MiSeq SOP model, where each module corresponds
to a module of the Mothur package. The network is designed with Rapidminer Studio as a dataflow.
In figure 1 we represent the beginning of the MiSeq SOP. Integration with the cloud will require
transferring data between the DaaS and SaaS and storing objects with metadata, and this is also
accounted for in our model.
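     A minimal sketch of the task queue model is given below (the structure is assumed for illustration):
modules of the dataflow network become tasks, and a task is queued for execution once all of its
input dependencies are complete.

# A sketch of the task-queue execution model over a module network.
from collections import deque

def run_pipeline(modules, deps, execute):
    """modules: module names; deps: module -> set of prerequisites;
    execute: callable running one module (e.g., one Mothur command)."""
    remaining = {m: set(deps.get(m, ())) for m in modules}
    queue = deque(m for m in modules if not remaining[m])
    done = set()
    while queue:
        task = queue.popleft()
        execute(task)                  # hand the task to a compute resource
        done.add(task)
        for m, reqs in remaining.items():
            if m not in done and m not in queue and reqs <= done:
                queue.append(m)

run_pipeline(
    ["make.contigs", "screen.seqs", "unique.seqs"],
    {"screen.seqs": {"make.contigs"}, "unique.seqs": {"screen.seqs"}},
    execute=lambda name: print("running", name),
)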

5. Implementation concepts
The review of papers shows that to date there are two open-source projects related to the automation
of NGS data analysis that are in active development: Galaxy and UGENE. If we take Galaxy as the main data
processing visualization technique, we are to make Galaxy modules adapting Mothur commands, as well
as adapt Galaxy visualization techniques to the MiSeq SOP. On the other hand, we can do the same
for UGENE, allowing users to work with the more responsive dynamic interface of the UGENE desktop
application.
                  Figure 1. The first stages of the MiSeq SOP represented as dataflow modules [20]


    In [20] and [21], we proposed and implemented a technique for the dataflow representation of all Mothur
commands. We use Model Driven Architecture (MDA) to generate modules for Rapidminer Studio, a
visual dataflow editor. According to MDA, the source code of the modules is generated from the Platform
Specific Model (PSM), which represents the software under development in a notation allowing direct
code generation by means of templates and other algorithmic procedures. In our case, the PSM represents
the Java source code of the dataflow modules.
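    As an illustration, the following sketch shows template-based generation in miniature: a PSM
fragment, here reduced to a plain dictionary describing one dataflow module, is expanded by a string
template into a Java operator stub. The class and method names imitate the Rapidminer operator API,
and runMothurCommand is hypothetical; this is not our actual generator.

# A sketch of template-based code generation from a PSM fragment.
from string import Template

JAVA_MODULE = Template("""\
public class ${name}Operator extends Operator {
    ${ports}
    @Override
    public void doWork() throws OperatorException {
        runMothurCommand("${command}");  // hypothetical delegate call
    }
}""")

psm = {  # hypothetical PSM fragment for one Mothur command
    "name": "AlignSeqs",
    "command": "align.seqs",
    "ports": ["fasta", "reference"],
}

ports = "\n    ".join(
    f'private final InputPort {p}Port = getInputPorts().createPort("{p}");'
    for p in psm["ports"])
print(JAVA_MODULE.substitute(name=psm["name"], command=psm["command"],
                             ports=ports))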
    The PSM is built out of the Platform Independent Model (PIM), which represents the software on a more
abstract level than the PSM. It expresses the relations between entities, their object structures, metainformation
and so on. The transition from PIM to PSM (a model transformation) is carried out by means of logical inference
of the PSM properties on the basis of facts representing the PIM and the properties of the implementation
platform, a Platform Model (PM); in our case the platform is the Java programming language.
    Some properties of the PIM, e.g., lists of object fields, are constructed by transformation of the
Computationally Independent Model (CIM), an even more abstract model, which represents the software
as entities of the domain of Mothur commands. The transformation is also implemented as a logical
inference realizing pattern recognition. The CIM itself is obtained automatically from the analysis of the
C++ source code of Mothur. The analysis is implemented as a Python program scanning the sources for
specific structures. Each command implementation is analyzed with a set of regular expression matches
organized in a scenario, as the sketch below illustrates.
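    A minimal sketch of such a scan is shown below. The regular expression targets constructor calls of
Mothur's CommandParameter class (the exact form of these declarations may differ between Mothur
versions), and the extracted name/type pairs are the raw facts from which the CIM is assembled.

# A sketch of scanning Mothur C++ sources for command parameters.
import re
from pathlib import Path

# Targets declarations like:
#   CommandParameter pfasta("fasta", "InputTypes", ...);
PARAM_RE = re.compile(
    r'CommandParameter\s+\w+\(\s*"(?P<name>\w+)"\s*,\s*"(?P<type>\w+)"')

def extract_parameters(source_file: Path) -> list:
    """Collect (name, type) facts from one command implementation."""
    text = source_file.read_text(errors="ignore")
    return [m.groupdict() for m in PARAM_RE.finditer(text)]

# Run against a file from the Mothur source tree, e.g.:
for fact in extract_parameters(Path("alignseqscommand.cpp")):
    print(fact)  # e.g. {'name': 'fasta', 'type': 'InputTypes'}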

5.1. RDF data representation
The source model data are represented on the basis of Semantic Web technologies. The model and its
constituent structures are identified globally as resources. Relations between resources and literal values
are expressed using standard and ad hoc designed ontologies. Usage of the ontologies allowed us to direct
the research "along known spaces" of metadata, to use the experience of the designers of the ontologies,
and to narrow the search space of the solutions.
    In the cloud DaaS, we are to store files and their contents as objects with their metadata. Current
OMG (Object Management Group) standards describe specifications for converting relational,
UML (Unified Modeling Language), and SysML (System Modeling Language) metadata to an RDF (Resource
Description Framework) representation. So, we can store data in conventional relational or key-value
databases and supply the metadata on retrieval. In the simplest case, data and metadata can
be stored in metadata storages, such as ClioPatria. While designing our cloud storage we use the
JHipster Domain Language [22] and its tools to create database structures, metadata converters, and formal
representations of the ontologies of the stored data.
    Metadata of the database-stored objects mostly describe relations between a resource and its attributes.
Some attributes, namely foreign keys, are references to other resources, which are also reflected with
metadata. There are also relations between resources that are not stored in conventional databases;
they reflect, e.g., data provenance or additional special attributes of a particular data object.
Such relations are infrequent and can be added in a special research investigation, so modifying
the relational database structure for each of these cases makes no sense. For the representation of the data,
we adopted a number of standardized ontologies.
    • Friend-of-a-friend (foaf) ontology is used for agent information: individuals, legal entities, program
      agents;
    • Provenance (prov) is used for making references between documents;
    • Dublin Core (dc) is used for published resource metadata mark up;
    • DBPedia resource (dbr) refers to external globally used classes and instance objects;
    • Open annotation (oa) is used as a published document content representation ontology;
    • The Bibliographic Ontology (bibo) is used for literature reference mark up.
For the representation of the Mothur CIM and PIM, we developed two ontologies, mothur and uml. The CIM
and PIM ontologies are used to represent relations between stored objects as subjects of the input and output
of Mothur commands; a sketch of such a description follows below.
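   A minimal sketch, using the rdflib library, of how one application of a Mothur command and its
input/output files could be described with the prov and dc ontologies; the mothur namespace URI and
the resource names are placeholders, not our production identifiers.

# A sketch, with rdflib, of provenance metadata for one command application.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import DC, PROV

MOTHUR = Namespace("http://example.org/mothur#")  # placeholder namespace
g = Graph()
g.bind("prov", PROV); g.bind("dc", DC); g.bind("mothur", MOTHUR)

run = MOTHUR["run/align.seqs/1"]          # one application of a command
infile = MOTHUR["file/HXH779K01.shhh.trim.good.unique.fasta"]
outfile = MOTHUR["file/HXH779K01.shhh.trim.good.unique.align"]

g.add((run, RDF.type, PROV.Activity))
g.add((run, DC.title, Literal("align.seqs")))
g.add((run, PROV.used, infile))             # input of the command
g.add((outfile, PROV.wasGeneratedBy, run))  # output provenance

print(g.serialize(format="turtle"))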
   This approach allowed us to solve many technical problems, including providing our dataflow
visualization tools with an up-to-date set of Mothur commands and implementing an abstract engine
mapping Mothur command properties to a software environment. Within the R&D, we obtained a set of
transformation scenarios expressed as object knowledge sets represented in the Logtalk [23] programming
language, which is also used to generate the PSM and the source code representing Mothur
commands for new computation and visualization environments.

5.2. Data integration: Metadata inference
To allow a casual user to take advantage of the obtained results, they are to be represented as RDF/RDFa4 -
marked-up report documents (Word, Excel, PDF) and HTML5 (Hyper-Text Markup Language, version
5) web pages, e.g., produced by the Galaxy software. Such a format allows both the user and a software agent
to acquire the resulting data for their research. The markup for the documents is a part of our LOD5 -
based service providing integration with other NGS Internet resources. At present, we have not found a
standardized way of integration: there are only prototypes of annotation resources, like BioSearch [24],
implemented on the basis of BIO2RDF LOD technologies.
    The LOD service and the desired flexibility of scientific research software require us to associate
metadata with all pieces of NGS data. The metadata are stored for the main input data of the MiSeq
SOP and are transformed with each application of a command into metadata describing the command
output objects. For Mothur, we construct automatic metadata inference rules by analyzing its C++ sources
and filename conversion algorithms. To conserve memory, we decided to implement dynamic metadata
reconstruction when the DaaS returns queried objects. For example, as thousands of sequences are organized
in files, groups and OTUs, the metadata of the sequences are extended with all the file/group/OTU metadata.
The metadata of each sequence are generated using the context of its storage, namely the fasta-file name
and the relations to its group, its file provenance, etc.
4
    An RDF language dialect for representing semantic markup in web-published documents.
5
    Linked Open Data, an RDF technology constraint defined by usage rules.
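    A minimal sketch of this reconstruction, with assumed data structures: full metadata are not
stored per sequence but are merged on retrieval from the file, group, and OTU context.

# A sketch of rebuilding sequence metadata from its storage context.
FILE_META = {"stability.trim.contigs.fasta":
                 {"provenance": "make.contigs", "sample": "baikal-2019"}}
GROUP_META = {"G1": {"station": "South Basin"}}
OTU_META = {"Otu0042": {"taxonomy": "Bacteria;Proteobacteria"}}

def sequence_metadata(seq_id, fasta_file, group, otu):
    """Merge file, group, and OTU context into one metadata record."""
    meta = {"id": seq_id, "file": fasta_file, "group": group, "otu": otu}
    meta.update(FILE_META.get(fasta_file, {}))
    meta.update(GROUP_META.get(group, {}))
    meta.update(OTU_META.get(otu, {}))
    return meta

print(sequence_metadata("M00967_43", "stability.trim.contigs.fasta",
                        "G1", "Otu0042"))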

6. Evaluation
We have been evaluating the implemented technologies by the criteria of the expressiveness of the
RDF source model (CIM, PIM, and PSM) representation, the Logtalk programming capabilities for the
transformation scenarios, and the representation of the MiSeq SOP with the synthesized Rapidminer plug-in.
The following results were obtained.
    Semantic web technologies and knowledge graphs are a universal way of describing data, basic relations
between notions, and model structures. The most recent procedure we carried out was the analysis of the Galaxy
implementation of Mothur's MiSeq SOP. We spent about two days converting it into RDF,
resulting in a more expressive representation of 138 of the current 144 Mothur modules. The knowledge graph
representation of the source model data allows us to aggregate various model data sources
in one representation and to select model elements with SPARQL and Prolog queries, which are interpreted
in the transformation as semantically meaningful target structures.
    The Logtalk language has various structures that allow the programmer to express the transformational
knowledge base with objects, providing tools for knowledge manipulation in the object-oriented way. The
general scenario of the transformation is represented as a system of interconnected objects encapsulating
knowledge. Some of the objects are facades for SPARQL and Prolog queries to the graph data;
others generate the target PSM structures and the source code. All the necessary structures of MDA were
representable; moreover, there is a number of Logtalk syntactic structures which are still to be investigated
with respect to their applicability for knowledge representation.
    As we said before, as a testing ground we used the source limnological data of [5]. At the first stage we
repeated the investigation procedure manually to gain a better understanding of the MiSeq SOP. After
each refinement of the transformational knowledge base, we construct the procedure out of dataflow
modules, generating scripts, which are executed, and their results are compared to the manually obtained
ones. Some stages of the Mothur MiSeq SOP contain algorithms employing subsampling, so the final results
differ between runs.
    A similar technique will be used in the evaluation of the generated procedures in the Galaxy environment.
The constructed dataflow diagram will be converted into a Galaxy notebook and executed, and the results
will be evaluated against the manually constructed procedure. At the last stage of the refinement, power
and casual users will be engaged to collect their opinions on the convenience of the environment.

7. Further development
At this point, we have developed means for MiSeq SOP modeling with dataflow diagrams in Rapidminer
Studio and for the execution of the models by creating scripts in the Mothur scripting language, together with
MDA instrumentation support. The main problems to be solved in the future are as follows.
  • Develop a converter of the models into Galaxy scenarios,
  • Integrate data storage with the Galaxy data import subsystem,
  • Implement metadata storage and adapters for NGS data,
  • Create a more sophisticated source code parser or a PIM model of computation, so that we are able
    to infer metadata for synthesizing metadata conversion rules,
  • Adapt our document authoring tools to Galaxy, allowing LOD representation of results,
  • Implement integration with biological/gene databases,
  • Develop a set of predefined scenarios,
  • Create scenario templates supporting research in monitoring.
The target modular software architecture is presented in figure 2.

[Figure 2. Architecture of information-computation environment. The diagram shows: a metadata
processing and LOD server (DB-to-RDF adapter, text template generator, security-aware server,
pattern-directed inference, document edit-mode switching); a text indexing engine (Elasticsearch,
template loader, text load and query engine, RDFa-to-JSON-LD converter); an authoring tool (browser);
an MDA tool (T-modules, generated source code and data); a server of domain models (ClioPatria,
Pengines, SPARQL endpoint, inference machine, RDF converter, Mothur source analyzer, CIM/PIM/PM,
ontology T-Box, UML editor (Modelio), XMI presentation of PIM, processing template DB); NGS data
storage (text data extractor, metadata interface, converter to RDF, data exchange interface, notebook
engine); and an external ontology server (e.g., DBPedia.org).]

Conclusions
An approach to the construction of an infrastructure for supporting Lake Baikal microbiome research
based on Next Generation Sequencing is proposed. The good background of algorithms and software
already constructed by various developers allows us to implement our environment by adapting the
techniques used by biologists to this background. The main contribution of the paper is as follows.
We (1) analyzed the existing IT experience in the field of NGS data processing, and (2) constructed a dataflow
model of the technique (the MiSeq SOP as implemented by the Mothur package), which is to be converted into
software modules of various visual environments and cloud services. A conversion technique (3) has been
developed using Model Driven Architecture (MDA), where the model transformation is implemented as a
logical inference system. This allows us to move from one implementation platform to another while
conserving the gained and formalized experience.
    At this stage, we limited our implementation to the specifics of the Mothur software and did not yet
deal with data conversion. The latter will become relevant when we move to an analogous NGS data
processing platform. The next stage will deal with data conversion to other NGS processing software,
such as the open-source QIIME2 and proprietary ones like Usearch. These platforms have advantages over
Mothur in data visualization and in processing performance for special operations, as well as in the availability
of other methods and algorithms not included in Mothur. The NGS SaaS services [3, 12] could also be
integrated into the cloud under development. Such integration will supply a better ground for carrying
on experiments with data.
    The specifics of the problems stated by Baikal microbiome research relate to the construction of
mathematical models describing the interaction of microbial communities. The models are constructed on
the basis of annual monitoring, data analysis, and the structural and parametric identification and refinement of
model elements. In this case, the standard NGS procedures constructed out of applications of operations must
be extended with toolsets supporting user-friendly joining of the previous stages of a continuous research effort.
The infrastructure must support integral data representation and efficient query-based, semantically rich
access, and the proposed MDA approach allows us to quickly integrate new operations into the existing
dataflow model. As we can see from our review, most realized techniques in NGS data processing
are either fixed, single-problem-oriented software with ad hoc infrastructures or problem-oriented
packages comprising sets of individual operations.
    The problem set to be solved includes optimization of the utilization of cluster computing resources,
planning parallel computing executions based on process structure analysis and the properties of
algorithms that implement specific operations, and implementation of the control points for services.

Acknowledgment
The results related to MDA transformation and its adaptation to RDF were obtained within the framework
of the State Assignment of the Ministry of Education and Science of the Russian Federation for the
project "Methods and technologies of cloud-based service-oriented platform for collecting, storing,
and processing large volumes of multi-format interdisciplinary data and knowledge based upon the
use of artificial intelligence, model-guided approach and machine learning" using the facilities of the
Centre of collective usage "Integrated information network of Irkutsk scientific educational complex".
The development of the infrastructure for Mothur command transformation to Rapidminer dataflow
diagrams is supported by the project of the Irkutsk Scientific Center of the Siberian Branch of the Russian
Academy of Sciences, grant No 4.2.

References
[1] Pereira R, Oliveira J and Sousa M 2020 Bioinformatics and computational tools for next-generation sequencing analysis in
      clinical genetics J. Clin. Med. 9 No. 1 1–30.
[2] Bashenkhaeva M V, Zakharova Yu R, Petrova D P et al 2015 Sub-ice microalgal and bacterial communities in freshwater
      Lake Baikal, Russia Microbial Ecology 70 No. 3 751–65
[3] Guo X, Yu N, Li N and Pan Y 2016 Cloud computing for next-generation sequencing data analysis Computational Methods
      for Next Generation Sequencing Data Analysis ed I I Mandoiu and A Zelikovsky (John Wiley & Sons, Inc.) 3–24
[4] Kozich J J, Westcott S L, Baxter N T, Highlander S K and Schloss P D 2013 Development of a dual-index sequencing
      strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform Applied
      and Environmental Microbiology 79 No. 17 5112–20
[5] Mikhailov I S, Zakharova Y R, Bukin Yu S et al 2019 Co-occurrence networks among bacteria and microbial eukaryotes
      of lake Baikal during a spring phytoplankton bloom Microbial Ecology 77 96–109
[6] Costanza P, Herzeel C and Verachtert W 2019 A comparison of three programming languages for a full-fledged next
      generation sequencing tool BMC Bioinformatics 20 301
[7] Gong Y-N, Chen G-W, Yang S-L, Lee C-J et al 2016 A next-generation sequencing data analysis pipeline for detecting
      unknown pathogens from mixed clinical samples and revealing their genetic diversity
[8] Gritsenko A A, Nijkamp J F, Reinders M J T and Ridder D de 2012 GRASS: a generic algorithm for scaffolding next-
      generation sequencing assemblies Bioinformatics 28 No. 11 1429–37
[9] Müller H, Reihs R, Posch A E, Kremer A, Ulrich D and Zatloukal K 2016 Data driven GUI design and visualization for a
      NGS based clinical decision support system Procs. of 20th International Conference Information Visualization, 19–22
      July 2016, Universidade NOVA de Lisboa, Lisbon, Portugal 355–60
[10] Boekhorst R te, Naumenko F M, Orlova N G, Galieva E R, Spitsina A M et al 2016 Computational problems of analysis
      of short next generation sequencing reads Vavilovskii Zhurnal Genetiki i Selektsii = Vavilov Journal of Genetics and
      Breeding 20 No. 6 746–55
[11] Srimani J K, Wu P, Phan J H and Wang M D 2010 A distributed system for fast alignment of next-generation sequencing
      data 2010 IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW), Hong Kong 579–84
[12] Kwon T, Yoo W G, Lee W et al 2015 Next-generation sequencing data analysis on cloud computing Genes Genom 37
      489–501
[13] Zhao S, Watrous K, Zhang Ch and Zhang B 2017 Cloud computing for next-generation sequencing data analysis Cloud
      Computing – Architecture and Applications ed J Sen (IntechOpen Limited) 29–51
[14] Langmead B and Nellore A 2018 Cloud computing as a platform for genomic data analysis and collaboration Nat. Rev.
      Genet. 19 No. 4 208–19
[15] Amstutz P, Crusoe M R, Tijanic N, Chapman B et al 2016 Common workflow language, v1.0
[16] Baker Q B, Al-Rashdan W and Jararweh Y 2018 Cloud-based tools for next-generation sequencing data analysis Procs. of
      Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), Valencia 99–105
[17] Batut B, Hiltemann S, Bagnacani A, Baker D, Bhardwaj V et al 2018 Community-driven data analysis training for biology
      Cell Systems
[18] Rose R, Golosova O, Sukhomlinov D, Tiunov A and Prosperi M 2019 Flexible design of multiple metagenomics
      classification pipelines with UGENE Bioinformatics 35 No. 11 1963–5
[19] Milicchio F, Rose R, Bian J et al 2016 Visual programming for next-generation sequencing data analytics BioData Mining
      9 No. 16
[20] Cherkashin E, Shigarov A, Malkov F and Morozov A 2019 An instrumental environment for metagenomic analysis
      Information Technologies in the Research of Biodiversity. Springer Proceedings in Earth and Environmental Sciences
      ed I Bychkov and V Voronin (Springer, Cham) 151–158
[21] Cherkashin E, Shigarov A and Paramonov V 2019 Representation of MDA transformation with logical objects
      International Multi-Conference on Engineering, Computer and Information Sciences (SIBIRCON), Novosibirsk, Russia
      0913–8
[22] Halin A, Nuttinck A, Acher M, Devroey X, Perrouin G and Heymans P 2017 Yo variability! JHipster: a playground for
      web-apps analyses Procs. of the Eleventh international workshop on variability modelling of software-intensive systems,
      VAMOS’17 (ACM, New York) 44–51
[23] Moura P 2009 Programming patterns for Logtalk parametric objects Applications of Declarative Programming and
      Knowledge Management. Lecture Notes in Computer Science ed A Abreu and D Seipel 6547 (Springer, Berlin,
      Heidelberg) 52–69
[24] Hu W, Qiu H, Huang J and Dumontier M 2017 BioSearch: a semantic search engine for Bio2RDF Database 2017
      bax059