=Paper=
{{Paper
|id=Vol-2751/short-5
|storemode=property
|title=SciBERT-based Semantification of Bioassays in the Open Research Knowledge Graph
|pdfUrl=https://ceur-ws.org/Vol-2751/short5.pdf
|volume=Vol-2751
|authors=Marco Anteghini,Jennifer D'Souza,Vitor A.P. Martins Dos Santos,Sören Auer
|dblpUrl=https://dblp.org/rec/conf/ekaw/AnteghiniDSA20
}}
==SciBERT-based Semantification of Bioassays in the Open Research Knowledge Graph==
<pdf width="1500px">https://ceur-ws.org/Vol-2751/short5.pdf</pdf>
<pre>
    SciBERT-based Semantification of Bioassays in
        the Open Research Knowledge Graph

   Marco Anteghini1,2[0000-0003-2794-3853], Jennifer D'Souza3[0000-0002-6616-9509],
 Vitor A.P. Martins dos Santos1,2[0000-0002-2352-9017], and Sören Auer3[0000-0002-0698-2864]

            1 Lifeglimmer GmbH, Markelstr. 38, 12163 Berlin, Germany
  2 Wageningen University & Research, Laboratory of Systems & Synthetic Biology,
              Stippeneng 4, 6708 WE, Wageningen, The Netherlands
                        {anteghini,vds}@lifeglimmer.com
  3 TIB Leibniz Information Centre for Science and Technology, Hannover, Germany
                     {jennifer.dsouza,soeren.auer}@tib.eu


         Abstract. As a novel contribution to the problem of semantifying
         biological assays, we propose a neural-network-based approach that
         automatically semantifies, and thereby structures, unstructured bioassay
         text descriptions. Experimental evaluations show promise: the neural
         semantification significantly outperforms a naive frequency-based
         baseline, attaining 72% F1 versus 47% F1 for the frequency-based method.
         This work aligns with the current trend in scholarly knowledge
         digitalization, which aims to convert the long-standing document-based
         format of scholarly content into knowledge graphs (KGs). Bioassays, our
         selected data domain, are a prime candidate for structuring into KGs.

         Keywords: Open Science Graphs · Bioassays · Machine Learning


1      Introduction
Biological assays are defined as standard biochemical test procedures used to
determine the concentration or potency of a stimulus (physical, chemical, or
biological) by its effect on living cells or tissues [3,4].
    In the context of the current Covid-19 pandemic, bioassays are critical, for ex-
ample, for vaccine development. They reveal the functional and biologically rele-
vant immunological responses that correlate with vaccine efficacy. However, mas-
sive volumes of bioassays are being produced and researchers are inundated with
this information. Apart from their sheer quantity, bioassay diversity presents
enormous challenges to organizing, standardizing, and integrating the data with
the goal to maximize their scientific and ultimately their public health impact
as the screening results are carried forward into drug development programs.
     Copyright © 2020 for this paper by its authors. Use permitted under Creative
     Commons License Attribution 4.0 International (CC BY 4.0).

    Against this broad societal application setting, we present a step toward
easier knowledge acquisition from bioassays for researchers: the neural-based
automated structuring of unstructured, non-standardized bioassays based on the
standardized BioAssay Ontology (BAO) [7]. Until their recent semantification in
an expert-annotated dataset [2,5,6] based on the BAO, bioassays were published
as unstructured text. Integrating their semantified counterpart in a KG
facilitates advanced computational processing. For example, bioassays can be
easily compared across their key properties, viz. Target, Perturbagen,
Participants, and Detection Technology, captured as KG nodes and links.
Nonetheless, fine-grained manual semantification of bioassays is costly and
time-intensive. Automated semantification not only alleviates this manual
effort but also potentially makes it possible to semantify the data rapidly
and in large volumes. Herein, we present our novel SciBERT-based [1] neural
BAO [7] bioassay semantification system.


2     Method
For automated bioassay semantification, we carry out the supervised machine
learning of semantic statements (i.e., subject-predicate-object triples) based on
the BioAssay Ontology (BAO) [7] for a given unstructured bioassay description.
The code for our method is publicly available at: https://github.com/MarcoAnteghini/
SciBERT-bioassays ORKG.


2.1   Dataset
Our dataset for learning comprises an expert-annotated collection of 983 semantified
bioassays [5,6]. Each assay in the data has between 5 and 92 semantic statements, with
an average of 53. To illustrate the data, Table 1 shows example annotations for a
selected bioassay.


has assay format → biochemical format
has assay format → protein format
has assay format → single protein format
assay measurement type → endpoint assay
Table 1: Four example semantic statement annotations (from 50 total) for PubChem
Assay ID 346. Note, these statements are triples with subject “bioassay.”
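
The statements in Table 1 can be held in memory as subject-predicate-object triples.
The sketch below is illustrative only: the Triple type and the triples_for helper are
hypothetical names chosen here, not part of the authors' code.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One semantic statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


def triples_for(subject: str, pairs: list[tuple[str, str]]) -> list[Triple]:
    """Attach a shared subject to a list of (predicate, object) pairs."""
    return [Triple(subject, predicate, obj) for predicate, obj in pairs]


# The four example annotations from Table 1 (PubChem Assay ID 346).
TABLE_1 = [
    ("has assay format", "biochemical format"),
    ("has assay format", "protein format"),
    ("has assay format", "single protein format"),
    ("assay measurement type", "endpoint assay"),
]

statements = triples_for("bioassay", TABLE_1)
for t in statements:
    print(f"{t.subject} | {t.predicate} -> {t.obj}")
```

Because the subject is always "bioassay", a statement is fully identified by its
predicate and object, which is why each one can serve as a single classification label.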


2.2   Problem Formulation
The dataset can be formalized as follows. Let b_i be a bioassay from the assays dataset
B. Each b_i is annotated with an annotation sequence as_i such that as_i ⊆ S, where S
is the set of all possible semantic statements seen in the training dataset. Specifically,
as_i = {s_1, s_2, s_3, ..., s_k}, where each s_x ∈ S is a semantic statement and as_i has
k different statements. In general, annotation sequences are of varying lengths. The
dataset we use has |S| = 1756 unique statements (after filtering out non-informative ones).
    In the supervised task, an input data instance is a triple (b, s; c), where
c ∈ {true, false} is the classification label. Our semantification problem is thus
formulated as a binary classification task: (b, s) is true if s is in b's annotation
sequence as, and false otherwise. False instances are formed by pairing b with any
statement not in the annotation sequence as of b. In aggregate, the semantification of
each bioassay is a multi-label, multi-class classification problem, which we have broken
up into binary classification decisions.
    Intuitively, our task formulation is meaningful because it emulates the way a
human expert annotates the data. The expert, from their memory of all semantic
statements S, simply assigns s to a given b if they deem it true; irrelevant
statements are not considered and are thus implicitly deemed false.
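
The construction of (b, s; c) instances can be sketched as follows. This is a minimal
illustration, not the authors' implementation: the function name build_instances, the
uniform random sampling of false statements, and the toy label set are all assumptions
made here.

```python
import random


def build_instances(assay_text, true_statements, all_statements, n_false, seed=0):
    """Return (assay_text, statement, label) tuples for one bioassay.

    Every statement in the annotation sequence yields a True instance.
    n_false statements drawn from the rest of S yield False instances
    (uniform sampling is an assumption of this sketch).
    """
    rng = random.Random(seed)
    true_set = set(true_statements)
    candidates = sorted(s for s in all_statements if s not in true_set)
    negatives = rng.sample(candidates, min(n_false, len(candidates)))
    positives = [(assay_text, s, True) for s in sorted(true_set)]
    return positives + [(assay_text, s, False) for s in negatives]


# Toy label set S for illustration only; the S used in the paper has 1756 statements.
S = {
    "has assay format -> biochemical format",
    "has assay format -> protein format",
    "has assay format -> single protein format",
    "assay measurement type -> endpoint assay",
    "toy predicate -> toy object 1",
    "toy predicate -> toy object 2",
}
annotation = [
    "has assay format -> biochemical format",
    "assay measurement type -> endpoint assay",
]

instances = build_instances("Unstructured assay description ...", annotation, S, n_false=3)
for _, statement, label in instances:
    print(label, statement)
```

With 2 true statements and n_false=3, the assay contributes 5 binary instances; at
prediction time, every statement in S would be paired with the assay and scored.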


2.3    SciBERT-based Machine Learning
Our machine learning system is the state-of-the-art, bidirectional transformer-based
SciBERT [1], pre-trained on millions of scientific articles. In each data instance (b, s; c),
the classifier input representation for the pair (b, s) follows the standard SciBERT format:
b and s are treated as a sentence pair separated by the special [SEP] token, and the special
classification token ([CLS]) remains the first token of every instance. The final hidden
state of [CLS] is used as the aggregate sequence representation and is fed into a linear
classification layer.
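
The sentence-pair layout described above can be sketched without any model code. The
example below is a simplification under stated assumptions: it uses whitespace tokens
instead of SciBERT's WordPiece vocabulary, and the helper name encode_pair and the choice
to truncate the assay text (rather than the statement) are hypothetical.

```python
def encode_pair(assay_tokens, statement_tokens, max_len=512):
    """Build a BERT-style sentence-pair input.

    Layout: [CLS] assay tokens [SEP] statement tokens [SEP]
    Segment ids are 0 for [CLS], the assay tokens and the first [SEP],
    and 1 for the statement tokens and the final [SEP].
    """
    # Reserve room for the three special tokens and the whole statement.
    budget = max_len - 3 - len(statement_tokens)
    if budget < 0:
        raise ValueError("statement alone exceeds max_len")
    assay_tokens = assay_tokens[:budget]  # assumption: truncate the assay side
    tokens = ["[CLS]"] + assay_tokens + ["[SEP]"] + statement_tokens + ["[SEP]"]
    segment_ids = [0] * (len(assay_tokens) + 2) + [1] * (len(statement_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = encode_pair(
    "inhibition of kinase activity".split(),
    "has assay format biochemical format".split(),
)
print(tokens)
print(segment_ids)
```

In a real pipeline the tokens would be mapped to vocabulary ids and passed to the
encoder; the hidden state at position 0 ([CLS]) is what the linear layer classifies.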


3     Experiments
3.1    Experimental Setup
For robust evaluation, we perform 3-fold cross validation (2:1 train-test split). In each
fold, the training data contains roughly 655 bioassays and the remaining 328 bioassays
are used for testing; the test assays are unique across the folds. We report standard
precision (P), recall (R), and F-score (F1). We refer the reader to the SciBERT paper [1]
for hyperparameter details. We have one additional parameter: the number of false
instances per bioassay, which we vary between 100 and 300 in increments of 10 to obtain
an optimal model.
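
The fold sizes and the parameter grid follow directly from the numbers above. The sketch
below assumes contiguous, unshuffled folds; the function name three_fold_splits is
hypothetical and the authors' actual split may differ.

```python
def three_fold_splits(n_items, n_folds=3):
    """Yield (train_indices, test_indices) with disjoint test folds."""
    indices = list(range(n_items))
    base, extra = divmod(n_items, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional item.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(three_fold_splits(983))
for train, test in splits:
    print(len(train), len(test))

# Candidate numbers of false instances per bioassay: 100, 110, ..., 300.
FALSE_COUNTS = list(range(100, 301, 10))
print(len(FALSE_COUNTS), "settings")
```

With 983 assays, the test folds hold 328, 328, and 327 assays, which matches the
"roughly 655 for training, 328 for testing" description.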


3.2    Results and Discussion
Our results are shown in Tables 2 and 3. We examine the following research question
(RQ): can advanced neural technologies be leveraged to automatically semantify bioassays?
We find that the overall F1 obtained by the SciBERT classifier out-of-the-box is 0.72
(Avg. row in Table 3), significantly higher than the 0.47 from a naive frequency-based
semantification approach. Furthermore, the difference between the neural approach and
the frequency method is clearly evident in the hit-and-miss illustration in Fig. 1. The
thin neck at the top of the curve in Fig. 1(a) indicates that, for most bioassays, the
neural approach reaches the true semantic statements sooner among its top-scoring
predictions. Thus, answering the RQ, neural technologies can indeed perform reliable
semantification of bioassays. They are also practically efficient: given the 1756 unique
statements considered as labels, each test assay is semantified in 4 seconds.
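
The paper does not spell out the frequency-based baseline, so the following is only one
plausible reading: rank statements by how many training assays carry them and predict
the same top-k statements for every test assay. The function names, the value of k, and
the toy data are assumptions of this sketch.

```python
from collections import Counter


def frequency_ranking(train_annotations):
    """Rank statements by the number of training assays that carry them."""
    counts = Counter(s for ann in train_annotations for s in set(ann))
    # Descending count, then alphabetical order so ties are deterministic.
    return [s for s, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]


def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for one assay."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    p = hits / len(predicted) if predicted else 0.0
    r = hits / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Toy annotation sequences; letters stand in for semantic statements.
train = [["A", "B", "C"], ["A", "B"], ["A", "D"]]
ranking = frequency_ranking(train)

top_k = ranking[:2]  # the same k most frequent statements for every test assay
p, r, f1 = precision_recall_f1(top_k, gold=["A", "C"])
print(ranking, p, r, f1)
```

Such a baseline ignores the assay text entirely, which is why a text-aware classifier
can be expected to place true statements higher in its ranking.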


    false labels   P       R       F1
    100            0.517   0.968   0.674
    ...            ...     ...     ...
    160            0.549   0.931   0.688
    170            0.600   0.939   0.729
    180            0.573   0.945   0.711
    ...            ...     ...     ...
    300            0.471   0.674   0.551
Table 2: Bioassay semantification results from five training optimization runs with
different numbers of false classification instances (full table in appendix)


    test set       P       R       F1
    1st fold       0.600   0.939   0.729
    2nd fold       0.573   0.956   0.713
    3rd fold       0.589   0.936   0.719
    Avg.           0.588   0.944   0.720
Table 3: Automatic bioassay semantification results from 3-fold cross validation with
the optimal number of false classification labels (170).


                                    (a) SciBERT classifier


                                 (b) Frequency-based classifier

Fig. 1: Hit-and-miss Plots for semantifying bioassays by SciBERT vs. a naive frequency-
based approach. Black dot is a hit; purple dot is a miss. For each assay, after all the
true statements are predicted, the remaining dots are white.


4      Conclusion


The discovery of cures during pandemics such as Covid-19 can be greatly expedited if
scientists are given intelligent information access tools, and our work toward
automatically semantifying bioassays is a step in this direction. We refer the reader to
the Appendix for an illustrated use case of semantified bioassay data in next-generation
digital libraries.

References
1. Beltagy, I., Lo, K., Cohan, A.: SciBERT: Pretrained language model for scientific text.
   In: EMNLP (2019)
2. Clark, A.M., Bunin, B.A., Litterman, N.K., Schürer, S.C., Visser, U.: Fast and
   accurate semantic annotation of bioassays exploiting a hybrid of machine learning
   and user confirmation. PeerJ 2, e524 (2014)
3. Hoskins, W.M., Craig, R.: Uses of bioassay in entomology. Annual review of ento-
   mology 7(1), 437–464 (1962)
4. Irwin, J.: Statistical method in biological assay. Nature 172(4386), 925–926 (1953)
5. Schürer, S.C., Vempati, U., Smith, R., Southern, M., Lemmon, V.: Bioassay ontology
   annotations facilitate cross-analysis of diverse high-throughput screening data sets.
   Journal of biomolecular screening 16(4), 415–426 (2011)
6. Vempati, U.D., Przydzial, M.J., Chung, C., Abeyruwan, S., Mir, A., Sakurai, K.,
   Visser, U., Lemmon, V.P., Schürer, S.C.: Formalization, annotation and analysis of
   diverse drug and probe screening assay datasets using the bioassay ontology (BAO).
   PloS one 7(11), e49198 (2012)
7. Visser, U., Abeyruwan, S., Vempati, U., Smith, R.P., Lemmon, V., Schürer, S.C.:
   Bioassay ontology (BAO): a semantic description of bioassays and high-throughput
   screening results. BMC bioinformatics 12(1), 257 (2011)
SciBERT-based Semantification of Bioassays in the Open Research Knowledge Graph

Automatic Semantification of Bioassays for Next-Generation Scholarly Knowledge-Graph
Digital Libraries

Marco Anteghini, Jennifer D'Souza, Vitor Martins dos Santos, Sören Auer

Lifeglimmer GmbH
Wageningen University
TIB Leibniz Information Centre for Science and Technology, Hannover

Introduction
- Comparing several bioassays is a challenging problem in today's context-dependent
  structure.
- Semantically structured scholarly knowledge can alleviate the problem.
- We do not have automatic methods to semantify the data.

Our Dataset
We based our analyses on the dataset provided by Clark et al., containing 983 bioassays
downloaded from PubChem. For each bioassay, we filtered out the non-informative labels,
including those whose values are string literals, leaving 1756 labels. In our
experiments, we varied the number of false labels between 100 and 300 to obtain an
optimal model.

Performance
3-Fold CV on a dataset of 983 bioassays

Hit-&-Miss Plots of Semantifying Bioassays by Frequency-based Classifier (top) vs.
SciBERT-based Classifier (bottom)

Survey Comparisons of Structured Bioassays
Automatically generated comparisons of semantified bioassays in the ORKG digital
library (DL). Full graph at www.orkg.org

Conclusion
We propose a neural-based semantification approach for text-based bioassays to be
integrated as a software module in next-generation digital libraries such as the Open
Research Knowledge Graph.

References
Clark, A., Bunin, B., Litterman, N., Schürer, S., Visser, U.: Fast and accurate semantic
annotation of bioassays exploiting a hybrid of machine learning and user confirmation.
PeerJ 2, e524 (08 2014).

A    Unique statements (labels) distribution
Each bioassay has 53 labels on average. The distribution is shown in Figure 2.


                      Fig. 2: Unique statements distribution


B    Snapshot of Semantified Bioassay in the Open
     Research Knowledge Graph
Figure 3 shows one semantified bioassay integrated in the ORKG DL. This bioassay was
semantified with eight semantic statements based on the BAO. Integrating
machine-actionable graphs of bioassays is essential for the ORKG DL to automatically
compute tabulated comparison surveys of several bioassays, as shown in Figure 4
in the next section.


C    Application: Comparisons of Bioassays in ORKG
Next-generation DLs target semantified scholarly knowledge. With the semantified
bioassays integrated, the ORKG automatically computes their survey comparisons for
however many machine-actionable assays the user selects to compare. Such tools must be
available to scientists to assist them in massive knowledge ingestion scenarios, so that
they can quickly grasp the highlights of scholarly knowledge, fostering faster progress
toward discoveries.

Fig. 3: An ORKG representation of a semantified bioassay with an overlaid graph view
of the assay. Accessible at: https://www.orkg.org/orkg/paper/R48146/R48147

Fig. 4: Automatically generated comparisons of semantified bioassays in the
ORKG digital library (DL). Full graph:
https://www.orkg.org/orkg/comparison?contributions=R48195,R48179,R48147

</pre>