=Paper= {{Paper |id=Vol-1747/BT103_ICBO2016 |storemode=property |title=Collaborative Workspaces for Pathway Curation |pdfUrl=https://ceur-ws.org/Vol-1747/BT103_ICBO2016.pdf |volume=Vol-1747 |authors=Funda Durupinar-Babur,Metin Can Siper,Ugur Dogrusoz,Istemi Bahceci,Ozgun Babur,Emek Demir |dblpUrl=https://dblp.org/rec/conf/icbo/Durupinar-Babur16 }} ==Collaborative Workspaces for Pathway Curation == https://ceur-ws.org/Vol-1747/BT103_ICBO2016.pdf
Collaborative Workspaces for Pathway Curation

Funda Durupinar-Babur (1), Metin Can Siper (2), Ugur Dogrusoz (2), Istemi Bahceci (2), Ozgun Babur (1), Emek Demir (1)
(1) Oregon Health and Science University, Computational Biology Program, Portland, OR
(2) Bilkent University, Dept. of Computer Engineering, Ankara, Turkey

Abstract: We present a web-based visual biocuration workspace focused on curating detailed mechanistic pathways. It was designed as a flexible platform where multiple humans, NLP and AI agents can collaborate in real time on a common model using an event-driven API. We will use this platform to explore disruptive technologies that can scale up biocuration, such as NLP, human-computer collaboration, crowd-sourcing, alternative publishing and gamification. As a first step, we are designing a pilot that adds an author-curation step to scientific publishing, in which the authors of an article create formal pathway fragments representing their discovery, heavily assisted by computer agents. We envision that this “micro-curation” use-case will create an excellent opportunity to integrate multiple NLP approaches and semi-automated curation.

Keywords: biocuration, pathways

I. INTRODUCTION

Molecular biology studies the molecular components and mechanisms that control a cell’s response to stimuli. Traditionally, this was a piecemeal effort, primarily conducted through carefully designed series of experiments that isolate, elucidate and confirm a part of the mechanism under a certain context. A byproduct of this process is knowledge fragmentation: putting these components into a mechanism, often called a pathway, that can describe cellular behavior is extremely challenging. One needs to assemble these pieces across many publications, negotiate differences due to biological context and experimental setup, resolve conflicts and create a coherent model. The majority of this knowledge integration happens informally through review articles within a very limited scope, often focusing on a couple of proteins or genes.

Biology, however, is changing rapidly, primarily due to system-scale “-omic” profiling and “big data”. This, in turn, creates an urgent need for large-scale, formal models of cellular processes, far beyond the scale of a current review article. This has led to a rapid proliferation of pathway databases and curation groups, which currently number 600. Together, they have curated hundreds of thousands of detailed biochemical reactions and millions of interactions. Yet this is still only a small fraction of the knowledge in the literature (we estimate between 1% and 3%), and because our knowledge is increasing rapidly, the gap widens every day. Curation efforts, although extremely valuable, also tend to be expensive. NIH spent 1.2 billion dollars on supporting data curation in the last decade, and the need is ever increasing.

How can we scale biocuration up by two orders of magnitude? We believe that the solution lies at the intersection of NLP, Artificial Intelligence and crowd-sourcing. It might also require a social change in the way we publish and disseminate our results. There are already multiple efforts in these directions, but they are currently largely disjoint. To enable exploration of these directions, we developed a web-based collaborative workspace as a common platform. We also describe a pilot study, as an example downstream application, in which authors directly curate formal pathway snippets that they discovered, as a part of the publication process.

II. A COLLABORATIVE WORKSPACE

The platform is composed of a web-based graphical editor and an application server. Figure 1 shows the platform architecture. The graphical editor is an extension of the SBGNViz framework [1], a web application based on Cytoscape.js [4] that visualizes BioPAX [2] models represented in the SBGN Process Description notation (SBGN-PD) [3]. Cytoscape.js provides a rich set of graph visualization and manipulation features. SBGNViz adds domain-specific glyphs, layouts and tools. Finally, our platform adds collaboration, conflict resolution and a strong link to existing NLP tools.

The application server is based on Node.js, a server-side JavaScript environment with an event-driven, asynchronous I/O model. This non-blocking structure allows combining agents that operate on different time scales. For example, extracting a relevant piece of information from the literature might take minutes, whereas a visualization action can be accomplished in seconds. The server keeps track of the graphical editor’s state and updates an underlying shared JSON model. The model can be edited concurrently via an Operational Transformation (OT) library called Share.js. OT provides versioning, concurrency control and conflict resolution in a manner similar to modern distributed version control systems, and is successfully used in large collaborative editing applications. All operations are made persistent in a MongoDB database.
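As a rough illustration of the kind of conflict resolution that OT provides for the shared model described in Section II, the Python sketch below transforms two concurrent insertions into a shared list of node labels so that both replicas converge to the same state. It is not Share.js code and does not use its API: the `Insert` operation, the `site` tie-breaker and the `converge` helper are hypothetical names introduced only for this example, and a real OT type for JSON documents handles many more operation kinds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Insert:
    """Hypothetical operation: insert `node` at index `pos` of the shared list."""
    pos: int
    node: str
    site: int  # id of the issuing client; used only to break position ties


def transform(op: Insert, against: Insert) -> Insert:
    """Rewrite `op` so that it can be applied after the concurrent op `against`."""
    shifts = against.pos < op.pos or (
        against.pos == op.pos and against.site < op.site
    )
    return Insert(op.pos + 1, op.node, op.site) if shifts else op


def apply(nodes: List[str], op: Insert) -> List[str]:
    """Return a new list with `op` applied; the input list is left unchanged."""
    return nodes[:op.pos] + [op.node] + nodes[op.pos:]


def converge(
    base: List[str], op_a: Insert, op_b: Insert
) -> Tuple[List[str], List[str]]:
    """Apply both ops in opposite orders, transforming the one that arrives second."""
    replica_a = apply(apply(base, op_a), transform(op_b, op_a))
    replica_b = apply(apply(base, op_b), transform(op_a, op_b))
    return replica_a, replica_b


if __name__ == "__main__":
    base = ["EGFR", "RAS"]
    # A human curator and an NLP agent insert at the same index concurrently.
    by_human = Insert(pos=1, node="GRB2", site=1)
    by_agent = Insert(pos=1, node="SOS1", site=2)
    a, b = converge(base, by_human, by_agent)
    print(a == b, a)
```

Breaking ties by site id is one common convention for making the outcome deterministic; any total order over clients would serve the same purpose in this sketch.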
Figure 1. Platform architecture

Computer agents connect to the server through an API that uses the socket.io interface. The API provides a two-way communication channel, so agents can both make changes to the model and get notifications about model updates. The server also provides a chat interface through which human users and computer agents can send text messages or image files. We have implemented the first version of the platform as well as adapters to several NLP systems and the Pathway Commons pathway database.

III. MICROCURATION USE-CASE

As an initial use-case, we aim to use this platform to capture pathway fragments in scientific manuscripts with the help of the authors of the publications. Since authors will have little to no training in curating pathways, maintaining a low barrier to entry is crucial, without sacrificing too much of the level of detail and formality of the representation. We envision solving this through a heavily assisted, semi-automated process. First, we will use text mining agents to automatically create a draft visual diagram of the pathway information in the manuscript. We will visually flag grounding issues as well as potential ways to resolve them. The fragments will also be aligned automatically to the pathways in public databases, and users will be given the option to use existing curated fragments when possible. The authors will then fix these issues or extend existing information. The resulting artifact will be a visual diagram and a formal BioPAX model, properly grounded and linked to an existing pathway corpus. This model fragment will be published as a supplement to the paper and can then be harnessed easily by curators and algorithms to assemble increasingly large mechanistic models. We envision that such a “micro-curation” step, if successfully embedded in the publication process, can enable rapid and scalable capture of atomic facts about cellular processes. By getting the fragments directly from the scientists who discovered them, we hope to reduce the communication noise, ambiguity and interpretation errors inherent to our current scientific communication and curation process.

IV. A GENERAL TOOL FOR BIOCURATION

In parallel to this pilot study, we also plan to actively use this platform to select the most successful NLP and AI tools for pathway curation. The open, real-time nature of the platform makes it straightforward to swap, compare and combine different NLP approaches. The platform can also be used to extract information that is not related to interactions or pathways, such as diseases, drug targets and mutations, and to combine it with pathway models. We can also explore how to extract supporting information such as experimental evidence or biological context. The web-based system enables collaborative use-cases, and it can be used to collect large amounts of human curation data from which a gold standard corpus can be built. We believe that the platform can become a major tool for the biocuration and NLP communities.

ACKNOWLEDGMENT

This work was funded by the DARPA Big Mechanism program under ARO contract W911NF-14-1-0395.

REFERENCES

[1] Sari M, Bahceci I, Dogrusoz U, Sumer SO, Aksoy BA, Babur O, Demir E. SBGNViz: a tool for visualization and complexity management of SBGN process description maps. PLoS ONE. 2015;10(6):e0128985.

[2] Demir E, Cary MP, Paley S, Fukuda K, Lemer C, Vastrik I, et al. The BioPAX community standard for pathway data sharing. Nature Biotechnology. 2010;28(9):935–942. doi: 10.1038/nbt.1666. pmid:20829833

[3] Le Novère N, Hucka M, Mi H, Moodie S, Schreiber F, Sorokin A, et al. The systems biology graphical notation. Nature Biotechnology. 2009;27(8):735–741. doi: 10.1038/nbt.1558. pmid:19668183

[4] Cline MS, Smoot M, Cerami E, Kuchinsky A, Landys N, Workman C, et al. Integration of biological networks and gene expression data using Cytoscape. Nature Protocols. 2007;2(10):2366–2382. doi: 10.1038/nprot.2007.324. pmid:179479