=Paper=
{{Paper
|id=Vol-3127/paper-1
|storemode=property
|title=Creating and Exploiting the Intrinsically Disordered Protein Knowledge Graph (IDP-KG)
|pdfUrl=https://ceur-ws.org/Vol-3127/paper-1.pdf
|volume=Vol-3127
|dblpUrl=https://dblp.org/rec/conf/swat4ls/GrayPAMH22
}}
==Creating and Exploiting the Intrinsically Disordered Protein Knowledge Graph (IDP-KG)==
Alasdair J. G. Gray1 [0000-0002-5711-4872], Petros Papadopoulos1 [0000-0002-8110-7576], Imran Asif1 [0000-0002-1144-6265], Ivan Mičetić2 [0000-0003-1691-8425], and András Hatos2 [0000-0001-9224-9820]

1 Department of Computer Science, Heriot-Watt University, Edinburgh, UK
2 Department of Biomedical Sciences, University of Padua, Padova, Italy
Abstract. There are many data sources containing overlapping information about Intrinsically Disordered Proteins (IDP). IDPcentral aims to be a registry to aid the discovery of data about proteins known to be intrinsically disordered by aggregating the content from these sources. Traditional ETL approaches for populating IDPcentral require the API and data model of each source to be wrapped and then transformed into a common model.

In this paper, we investigate using Bioschemas markup as a mechanism to populate the IDPcentral registry by constructing the Intrinsically Disordered Protein Knowledge Graph (idp-kg). Bioschemas markup is a machine-readable, lightweight representation of the content of each page in a site that is embedded in the HTML. For any site, it is accessible through an HTTP request. We harvest the Bioschemas markup in three IDP sources and show that the resulting idp-kg has the same breadth of proteins available as the original sources, and can be used to gain deeper insight into their content by querying them as a single, consolidated knowledge graph.

Keywords: Knowledge Graphs · Schema.org · Bioschemas · Findable · Intrinsically Disordered Proteins
1 Introduction
One of the goals of the ELIXIR Intrinsically Disordered Protein (IDP) commu-
nity is to create a centralised registry for IDP data to support the community in
their data analyses. The registry will aggregate data contained in the commu-
nity’s numerous specialist data sources, such as DisProt [8], MobiDB [11], and
Protein Ensemble Database (PED) [9], that contain overlapping but complementary
data about IDPs. Users of the registry should be able to search for IDPs
and be presented with summary details of the protein and how it is known to
be disordered; with the specialist source consulted for more detailed data.
Bioschemas is a community effort to provide machine-readable markup within
life sciences resources to increase their discoverability [6]. The community have
developed extensions to the core Schema.org vocabulary [7] to enable the
representation of life science concepts such as proteins. Deployments of this
markup have been made in several life sciences resources, including DisProt,
MobiDB, and PED. Bioschemas also provide usage profiles that recommend which
properties should be present in the markup to represent a specific resource. The
purpose of these profiles is to simplify the consumption and use of the markup.

Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
In this paper, we demonstrate that Bioschemas markup can be harvested
to create a central repository of IDPs by creating the Intrinsically Disordered
Proteins Knowledge Graph (idp-kg3). It is not sufficient to simply harvest all
the markup into a single data repository. The concepts within the markup need
to be identified and reconciled since the sources contain overlapping, and poten-
tially conflicting, information about proteins but use different identifiers for the
proteins. Therefore, the provenance of each statement should be tracked so that
users of the registry can retrieve full details from the original data source.
2 Background
We will now discuss the background material for our work. Note that throughout
this paper we will use CURIEs to link to items in databases. These can be
resolved using Identifiers.org. Similarly, ontology terms will be given as CURIEs
that correspond to the widely used prefixes given in https://prefix.cc.
2.1 Schema.org and Bioschemas
Schema.org provides a way to add semantic markup to web pages to enable
those web pages to become more understandable by the search engines that
index them, and therefore to improve search results [7]. Markup is increasingly
being applied to web pages as it boosts a site’s ranking in search results. The
markup in web pages also enhances the search experience for end users, e.g.
enabling them to make more informed decisions when deciding between two
search results, or by providing dedicated search portals such as Google Dataset
Search [2] or ELIXIR’s training portal TeSS [1].
The Schema.org vocabulary provides types which correspond to the things we
can describe, and properties which capture the characteristics of those things.
The majority of the vocabulary is focused on generic web search, e.g. books,
movies, or places, but it also includes types relevant to science, e.g. Dataset,
and most recently types have been added for Bioinformatics4 such as Gene,
Protein, and Taxon. A major benefit of this approach is that the markup is
accessible to all through a common API, i.e. HTTP GET requests; there is no
need to learn and code for the REST API of each individual source.
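The markup-as-API idea can be sketched with nothing more than an HTTP GET and an HTML parser. The page content and identifiers below are invented for illustration (the actual harvesting in this work is done by bmuse, described in Section 3); the sketch extracts JSON-LD blocks from a page's HTML using only the Python standard library.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.blocks.append(json.loads("".join(self._buffer)))

# A minimal page with embedded markup (illustrative content, not a real page).
page = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Protein",
 "@id": "https://disprot.org/DP00003", "name": "Cellular tumor antigen p53"}
</script>
</head><body>...</body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(page)
print(extractor.blocks[0]["@type"])  # Protein
```

In a real pipeline, `page` would be the body of an HTTP GET response; the extraction step itself is identical for every site that embeds markup.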
The Bioschemas community (Bioschemas.org) promotes the use of Schema.org
markup within life sciences web resources to improve their Findability, and pro-
vide lightweight Interoperability (c.f. the FAIR Data Principles) [6]. The com-
munity achieve this by:
3 https://alasdairgray.github.io/IDP-KG/ accessed 30 Sept 2021
4 Schema.org v13.0 https://schema.org/version/13.0 accessed 30 Aug 2021.
1. Proposing extensions to the Schema.org vocabulary to include types and
properties relevant for life sciences resources; and
2. Providing recommended usage profiles over Schema.org types.
Seven types covering key life sciences areas have been included in the Schema.org
pending vocabulary. The Bioschemas community continue to work to add more
types, e.g. the annotation of genes or proteins using a SequenceAnnotation
type. The goal is not to replace existing life sciences ontologies, but to pro-
vide a lightweight vocabulary to aid discovery of resources. Once discovered it
is expected that detailed biological models, captured with rich Interoperable
ontologies, will be used to accurately describe the data.
For any given type in Schema.org, there can be a large number of proper-
ties available to use, many of which can be inherited from parent types. For
example, the Dataset type has over 100 properties due to the inheritance from
CreativeWork and Thing. This can make it difficult for developers of markup
to know which properties to use, and they are unlikely to use them all.
Bioschemas profiles5 provide usage guidelines for Schema.org types, identifying
the most critical properties to aid search (the minimal properties) and the
important properties for disambiguation (the recommended properties). This
provides a much smaller pool of properties for types relating to the life sciences.
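A consumer-side check against a profile's minimal properties might look like the following sketch. The property list here is a placeholder, not the actual Bioschemas Protein profile; the authoritative lists live in the profile documents themselves.

```python
# Placeholder minimal-property list; consult the Bioschemas profile pages
# for the real minimal/recommended sets per type.
PROTEIN_MINIMAL = {"@type", "name", "identifier"}

def missing_minimal(markup: dict, minimal: set) -> set:
    """Return the minimal-profile properties absent from a piece of markup."""
    return minimal - set(markup)

markup = {"@type": "Protein", "name": "p53"}
print(missing_minimal(markup, PROTEIN_MINIMAL))  # {'identifier'}
```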
2.2 Intrinsically Disordered Protein Data Sources
The ELIXIR IDP community6 curates and maintains many data resources that
function as the basis of the IDPcentral registry. These specialist data sources
are built around a subset of proteins having the interesting property of being
unstructured or structurally disordered. The structural and functional aspects
of such proteins are covered in three distinct resources.
DisProt [8] is a manually curated database of IDPs where structural disorder
and functional annotation is recorded directly from evidence in scientific publica-
tions. For each protein, Bioschemas markup is exposed describing all disordered
regions and their functions. Each is represented as a SequenceAnnotation
that identifies a region of the protein sequence using a SequenceRange and
associates it with a defined term from the IDPOntology [8].
MobiDB [11] is a comprehensive database with experimental and predicted
protein disorder for all known protein sequences. Although all MobiDB entries
are marked up with Bioschemas, only the most interesting subset of entries
appears in the sitemap index. This subset contains ~2k entries out of 189M
entries in the complete MobiDB. A set of SequenceAnnotation types is exposed
for each Protein identifying all consensus predicted disordered regions, with the
range of the region captured as a SequenceRange.
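The nesting described above (a SequenceAnnotation locating a region via a SequenceRange, with a DefinedTerm for the type of disorder) can be sketched as a JSON-LD-style structure. Property names reflect our reading of the draft profiles; the coordinates and term usage are illustrative, not taken from a real entry.

```python
# Illustrative annotation structure; values are invented.
annotation = {
    "@type": "SequenceAnnotation",
    "sequenceLocation": {
        "@type": "SequenceRange",
        "rangeStart": 1,
        "rangeEnd": 60,
    },
    "additionalProperty": {
        "@type": "PropertyValue",
        "value": {
            "@type": "DefinedTerm",
            "@id": "https://disprot.org/idpo/IDPO:00076",
            "name": "disorder",
        },
    },
}

loc = annotation["sequenceLocation"]
print(f"disordered region {loc['rangeStart']}-{loc['rangeEnd']}")
```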
The Protein Ensemble Database (PED) [9] is a primary database for the
deposition of protein structural assemblies which include intrinsically disordered
5 Bioschemas profiles https://bioschemas.org/profiles/ accessed 30 Aug 2021.
6 https://elixir-europe.org/communities/intrinsically-disordered-proteins accessed Sept 2021
proteins. A database entry in PED consists of an ensemble of proteins, in con-
trast to the other two resources where an entry describes a single protein. At
the protein level, the description is comparable to DisProt and MobiDB with
individual proteins annotated with a series of SequenceAnnotation types hav-
ing defined terms describing the detection method used to obtain structural
information connected to a specific SequenceRange region.
3 Knowledge Graph Generation
The creation of the idp-kg requires two steps. First we must harvest the markup
from each of the data sources. Second we need to transform the source markup
into the model for the knowledge graph, reconciling the multiple identifiers for
a specific protein into a single concept.
3.1 Data Harvesting
The markup was extracted from the three data sources using the Bioschemas
Markup Scraper and Extractor (bmuse). Markup is extracted using HTTP GET
requests, which means that resource-specific APIs do not need to be coded for.
To verify the correctness of the harvesting, we developed three datasets.
BMUSE. The Bioschemas Markup Scraper and Extractor (bmuse7 ) is a data
harvester developed specifically to extract markup embedded within web pages.
bmuse has been developed to extract markup embedded as either JSON-LD or
RDFa, and also supports the use of both in the same page. The pages to be
harvested can be static, or be single page applications (dynamic) that require
JavaScript processing on the client side to generate the page content. bmuse
harvests data from a given list of URLs or sitemaps; it does not perform web
crawling by following links embedded within pages. A maximum number of pages
to harvest per sitemap is also required.
For each page extracted, bmuse generates an n-quad file containing:
1. The extracted markup stored in an RDF named graph with the IRI of the
named graph being uniquely constructed based on the date of the scrape and
the page visited. Where the markup does not contain a subject IRI for the
data, i.e. the JSON-LD markup does not include an @id attribute, bmuse
substitutes in the page URL to avoid the use of blank nodes.
2. Provenance data about the data harvesting. This is stored in the default
graph and describes the named graph in which the data is stored. The prove-
nance data provided is:
– URL of the page visited using pav:retrievedFrom
– Date of extraction using pav:retrievedOn
– The version of bmuse used to harvest the data using pav:createdWith
7 https://github.com/HW-SWeL/BMUSE accessed Sept 2021
The pav:retrievedFrom property can be used to provide the links back
from individual pieces of data to the source from which it came. The other
two properties are primarily used for debugging purposes, although the retrieval
date can also be used to ensure that the most up to date data is available in the
generated idp-kg.
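The n-quad output described above can be sketched as follows: harvested triples go in a named graph, and the three pav properties describing that graph go in the default graph. The graph-IRI scheme below is an invented stand-in; bmuse's actual scheme is based on the scrape date and page URL, but its exact form is not specified here.

```python
import hashlib
from datetime import date

def graph_iri(page_url: str, scrape_date: date) -> str:
    """Build a graph IRI from the scrape date and page URL (invented scheme)."""
    digest = hashlib.sha256(page_url.encode()).hexdigest()[:16]
    return f"https://example.org/graph/{scrape_date.isoformat()}/{digest}"

def to_nquads(triples, page_url, scrape_date, tool="bmuse"):
    """Emit harvested triples in a named graph, plus provenance about that
    graph in the default graph (no fourth term on the provenance quads)."""
    g = graph_iri(page_url, scrape_date)
    quads = [f"{s} {p} {o} <{g}> ." for (s, p, o) in triples]
    quads += [
        f"<{g}> <http://purl.org/pav/retrievedFrom> <{page_url}> .",
        f'<{g}> <http://purl.org/pav/retrievedOn> "{scrape_date.isoformat()}" .',
        f'<{g}> <http://purl.org/pav/createdWith> "{tool}" .',
    ]
    return "\n".join(quads)

triples = [("<https://disprot.org/DP00003>",
            "<https://schema.org/name>",
            '"Cellular tumor antigen p53"')]
doc = to_nquads(triples, "https://disprot.org/DP00003", date(2021, 9, 28))
print(doc.splitlines()[1])  # the pav:retrievedFrom quad in the default graph
```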
Harvested Data. To develop and test the data processing pipeline to be applied
to the harvested data, we used a series of test datasets. These correspond to data
harvested from the three data sources on 28 September 2021. We note that in
the initial run of bmuse 13 pages produced errors due to timeouts. These pages
were harvested in a second run with just those pages listed as targets.
Test-8: This dataset consists of eight sample pages that correspond to those
used in [5]. In constructing this test dataset, we ensured that there was at
least one protein (uniprot:P03265) that was present in all three datasets.
Two additional pages have been added since the previous work which cor-
respond to the DisProt homepage and another page that exists in the Dis-
Prot sitemap but contains no markup. These were added to ensure that the
pipeline would work with pages not corresponding to protein information.
Sample-25: This dataset contains the first 25 pages harvested from each of the
sitemaps of the source databases. This corresponds to 5 to 9 pages of site
structure and then first 25 protein pages per source8 . This dataset allowed
us to check the pipeline would scale up.
Full: This dataset contains all pages that could be harvested from the sitemaps
of the three data sources. This contains the 5 to 9 pages of site structure per
data source and all protein pages listed in the sitemap. This dataset is used
to construct the idp-kg.
3.2 Data Transformation
After the data has been harvested, it is processed so that information about a
particular protein, which can come from multiple sources, is consolidated into
a single concept for the protein, with links back to where each piece of data
originated. The data transformation process is available as a Jupyter Notebook9 .
This is an extended version of the notebook presented in [5], containing bug fixes
and the ability to extract markup corresponding to more Bioschemas profiles.
The notebook uses SPARQL CONSTRUCT queries to extract the data from
the harvested pages and convert them into the idp-kg model, based on the
Bioschemas vocabulary. While the queries are based on the properties listed
in the corresponding Bioschemas profile, they make extensive use of OPTIONAL
clauses since the data does not always exactly correspond to the profile.
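An illustrative CONSTRUCT query in this style might look as follows. This is a sketch, not one of the notebook's actual queries; the hasSequenceAnnotation property name is assumed from the Bioschemas draft profiles, and the OPTIONAL clauses allow the query to match pages whose markup omits a property.

```sparql
PREFIX schema: <https://schema.org/>

CONSTRUCT {
  ?protein a schema:Protein ;
           schema:name ?name ;
           schema:hasSequenceAnnotation ?annotation .
}
WHERE {
  GRAPH ?g {
    ?protein a schema:Protein .
    OPTIONAL { ?protein schema:name ?name }
    OPTIONAL { ?protein schema:hasSequenceAnnotation ?annotation }
  }
}
```

Querying inside `GRAPH ?g` preserves the link between each extracted statement and the page it was harvested from.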
8 The sitemap of each source is split into two entries in the bmuse configuration file.
9 https://github.com/AlasdairGray/IDP-KG/blob/main/notebooks/ETLProcess.ipynb accessed Sept 2021
Bioschemas Profiles. Within the three data sources, we expected to find
markup conforming to the following Bioschemas profiles:
– DataCatalog (v0.3-RELEASE)
– Dataset (v0.3-RELEASE)
– Protein (v0.11-RELEASE)
– SequenceAnnotation (v0.1-DRAFT)
– SequenceRange (v0.1-DRAFT)
Additionally, within these profiles there are uses of the Schema.org types
PropertyValue and DefinedTerm, which must be processed separately, and ref-
erences to pages of type ScholarlyArticle.
While all the data conforms to the same data vocabulary, there are differ-
ences in the underlying usage. DisProt and MobiDB provide protein centric
representations of the data. PED provides a cluster of proteins on a single page.
These differences need to be consolidated into a coherent knowledge graph model
centred around proteins.
Instance Merging. Each of the data sources uses their own identifier scheme
to identify concepts in their data. Within the idp-kg, we need to aggregate the
data from the multiple sources into a single consolidated entry, which will need
its own identifier. In considering the different entity types, it was decided that
only the proteins would be merged, as there is no clear way to decide when two
annotations are equivalent and it is not expected that multiple instances of the
Dataset and DataCatalog data would appear in the different datasets.
The IDPcentral team have decided they will use UniProt accessions [12] as
a central spine for identifying proteins. This means that for each source web
page about a protein, where the protein is identified by the data source’s IRI,
e.g. https://disprot.org/DP00003, the conversion process needs to align and
merge this to a UniProt accession number. Fortunately each source includes a
schema:sameAs declaration to the UniProt accession, although different UniProt
namespaces were used by the different sources. Each protein was given an IRI
in the IDPcentral namespace of the form
https://idpcentral.org/id/{accession}
where {accession} is replaced by the UniProt accession for the protein.
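The alignment step can be sketched as follows. The list of UniProt namespaces is assumed for illustration; the sources' actual schema:sameAs namespaces are not enumerated in this paper.

```python
IDPC = "https://idpcentral.org/id/"

# Namespaces under which a source might declare its schema:sameAs link to
# UniProt (illustrative list; each source used a different namespace).
UNIPROT_PREFIXES = (
    "https://www.uniprot.org/uniprot/",
    "http://purl.uniprot.org/uniprot/",
    "https://purl.uniprot.org/uniprot/",
)

def idpcentral_iri(same_as_links):
    """Map a source protein to its IDPcentral IRI via its UniProt sameAs link."""
    for link in same_as_links:
        for prefix in UNIPROT_PREFIXES:
            if link.startswith(prefix):
                accession = link[len(prefix):]
                return IDPC + accession
    raise ValueError("no UniProt sameAs link found")

print(idpcentral_iri(["https://www.uniprot.org/uniprot/P03265"]))
# https://idpcentral.org/id/P03265
```

All statements about https://disprot.org/DP00003 would then be re-rooted at the IDPcentral IRI, with the original source IRI retained via the named-graph provenance.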
Knowledge Graph Construction. While constructing the IDPcentral knowl-
edge graph, it was assumed that the data sources would contain declarations of
the same pieces of information, e.g. the name of the protein. However, we
do not assume that they are consistent in their content. There are two cases to
consider. The first is that each source contains different values but these comple-
ment each other, e.g. a list of synonyms where no source will necessarily have a
complete set but by merging the data from the sources the IDPcentral knowledge
graph would have a more complete set. The second case is where two sources
have differing values for a property which should have a single specific value, e.g.
protein name. Rather than decide that a specific source’s value should be used,
we have decided to include all values available in the sources together with the
provenance. Users of the data can then decide on the correct value, and feedback
issues to the source with the erroneous value.
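A minimal sketch of this keep-everything merge follows, using invented name variants for a protein present in all three sources; the pairing of each value with the pages that asserted it stands in for the named-graph provenance described below it in the paper.

```python
from collections import defaultdict

def merge_with_provenance(statements):
    """statements: (protein_iri, property, value, source_page) tuples.
    All values are kept; each is paired with the pages that asserted it."""
    merged = defaultdict(lambda: defaultdict(set))
    for protein, prop, value, source in statements:
        merged[protein][(prop, value)].add(source)
    return merged

# Invented name variants for illustration only.
stmts = [
    ("https://idpcentral.org/id/P03265", "name",
     "DNA-binding protein", "https://disprot.org/DP00003"),
    ("https://idpcentral.org/id/P03265", "name",
     "E1B 55 kDa protein", "https://mobidb.org/P03265"),
]
merged = merge_with_provenance(stmts)
for (prop, value), sources in merged["https://idpcentral.org/id/P03265"].items():
    print(prop, value, sorted(sources))
```

Both conflicting names survive the merge; a consumer can inspect the sources and report the erroneous value upstream.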
To support providing statement level provenance, we adopted the named
graph approach that was used in the Open PHACTS platform [4]. This involves
placing data statements in named graphs based on the page where they have
been harvested from. The provenance data declared about the named graph is
stored in the default graph.
4 Data Analysis
To verify the generated knowledge graph, we performed various data analyses.
These build on the queries from [5] but go further in their analysis. The queries
are available in a Notebook10 and also through the idp-kg SPARQL endpoint11 .
4.1 Knowledge Graph Statistics
We first give an overview of the idp-kg using the statistics recommended in the
HCLS Community Profile for Dataset Descriptions [3]. A summary of some of
the key statistics can be found in Table 1, with the full statistics available in the
notebook. The basic statistics show that the key difference between our three
knowledge graphs is the number of proteins. This is shown by the number of
properties and classes being constant between the three samples. This verified
that we were getting consistent performance from our ETL process over the dif-
ferent harvested data samples, and our domain experts have verified the content
of the test-8 knowledge graph. We note that there are only two instances of the
Dataset type. This is due to an unresolved bug in bmuse, but does not affect
the retrieval of proteins.
Table 2 presents a comparison between the number of proteins found in the
original data sources and the number in the idp-kg. The comparison gives the
number of proteins in the different intersections of the data sources. The table
shows that the data harvesting completely recreates the information available in
the data sources.
4.2 IDP Analysis Queries
Now that we have verified that the idp-kg is complete with respect to the
content of the sources, we can use it to analyse the data available about IDPs.
10 https://github.com/AlasdairGray/IDP-KG/blob/main/notebooks/AnalysisQueries.ipynb accessed Sept 2021
11 We have deployed the "Snorql - Extended Edition" (https://github.com/ammar257ammar/snorql-extended) query interface at https://swel.macs.hw.ac.uk/idp with access to the same queries that are used in the analysis notebook.
Table 1. HCLS Dataset Description statistics for idp-kg.

                    Test-8   Sample-25      Full
Triples                766       7,704   278,572
Subjects               179       1,706    62,972
Properties              34          34        34
Objects                207       1,818    67,334
Classes                  8           8         8
Literals               140         715    18,177
Graphs                  10          81     4,287

                    Test-8   Sample-25      Full
DataCatalog              1           2         2
Dataset                  1           2         2
DefinedTerm             32         126     4,262
PropertyValue           57         652    17,607
Protein                  8          69     2,701
ScholarlyArticle         7          75     2,578
SequenceAnnotation      32         350    15,767
SequenceRange           32         350    15,767
Table 2. Comparison of proteins present in the idp-kg and the original data sources.

Description                        IDP Sources   IDP-KG
Harvested Pages                                    4286
DisProt Pages                                      2039
MobiDB Pages                                       2075
PED Pages                                           172
Protein Pages                                      4284
DisProt entries (from sitemap)          2038       2038
MobiDB entries (from sitemap)           2074       2074
PED entries (from sitemap)               172        172
Distinct Proteins (Union)               2701       2701
DisProt Proteins                        2038       2038
MobiDB Proteins                         2074       2074
PED Proteins                              90         90
DisProt \ (MobiDB ∪ PED)                 586        586
MobiDB \ (DisProt ∪ PED)                 624        624
PED \ (DisProt ∪ MobiDB)                  34         34
(DisProt ∪ MobiDB)                      2667       2667
(DisProt ∪ PED)                         2077       2077
(MobiDB ∪ PED)                          2115       2115
DisProt ∩ MobiDB                        1445       1445
DisProt ∩ PED                             51         51
MobiDB ∩ PED                              49         49
(DisProt ∩ MobiDB) \ PED                1401       1401
(DisProt ∩ PED) \ MobiDB                   7          7
(MobiDB ∩ PED) \ DisProt                   5          5
DisProt ∩ MobiDB ∩ PED                    44         44
The answers presented in this section are possible due to the aggregation of the
data into a single knowledge graph. We only perform the following analysis over
the full idp-kg. The full set of responses to these queries are available through
the notebook or idp-kg SPARQL endpoint.
From Table 1 we can see that there are 15,767 annotations on the proteins.
These correspond to 11,046 from DisProt, 4,488 from MobiDB, and 233 from
PED (annotations per dataset query). Using the annotations in multiple datasets
query, we can see that there are 912 proteins with annotations from more than
one dataset, with https://idpcentral.org/id/P04637 having a total of 77
annotations, contributed by all three datasets. Using the annotations per article query, we
find that there are 2,578 distinct scholarly articles referenced in the annotations,
with the article pubmed:20657787 providing 80 annotations. Finally, using the
annotations per term code query, we found that 149 codes from the Intrinsically
Disordered Protein Ontology are used, with IDPO:00076 (Disorder) being the
most common with 7,542 instances, followed by IDPO:00063 (Protein Binding)
with 1,325 instances.
5 Related Work
Schema.org markup is extensively used by search engines (Google, Microsoft,
and Yandex) to optimise search results (SEO) [7]. Rather than trying to infer
the topic and content of a page, the markup states explicitly what the page is
about. Based on this markup, search companies have been building extensive
knowledge graphs about the content of the Web, with the Google Knowledge
Graph being the most widely known. As well as improving search results, these
internal knowledge graphs are used to provide information boxes and rich snip-
pets for search results. Google have developed a dedicated Dataset Search Portal
based on the markup embedded within web pages about data on the web [2].
The work reported here uses the same approach of harvesting data from the Web
to generate a knowledge graph, but rather than doing this at the scale of the
Web, we have focused on a specific life sciences community who had a need to
aggregate their disparate data sources without needing to establish an agreed set
of web services. The ELIXIR TeSS training portal [1] uses Bioschemas markup
embedded within web pages to populate its registry. TeSS maintains a list of
sources that it gathers its data from, and as there is no overlap in the content
it does not need to reconcile the concepts that it retrieves.
The work presented in this paper relies on the ability to harvest markup
embedded within web pages. The common crawl [10] is a public dataset con-
taining content retrieved from the Web. While it contains large amounts of data
that can be utilised to imitate the search engines, it does not have the focus
required for this work. Gleaner12 is an open source tool that can be used for
harvesting markup embedded within web sites. It has been built to extract
Schema.org markup exclusively, which limits its applicability when using new
types and properties that have yet to be included in the Schema.org vocabulary. It
also does not track where content has been retrieved from.
6 Conclusions and Future Work
In this work, we have shown that Bioschemas markup can be harvested through
a standard API (i.e. HTTP GET), transformed, and used to generate a community-
focused knowledge graph. We verified that the breadth of coverage was
equivalent to the original sources, and showed that the resulting knowledge graph
can be used to gain further insight into the domain. As future work, we plan
to extend the number of sources from which we harvest data and to further
exploit the idp-kg to gain further insights into IDPs. We also intend to extend
our transformation framework so that it can be applied in other life sciences
communities with Bioschemas markup.
12 https://gleaner.io/ accessed September 2021
Acknowledgements. This work was funded through the ELIXIR Strategic Im-
plementation Study Exploiting Bioschemas Markup to Support ELIXIR Com-
munities https://elixir-europe.org/about-us/commissioned-services/
exploiting-bioschemas-markup-support-elixir-communities. Early stages
of this work were carried out during the BioHackathon Europe 2020 organized
by ELIXIR in November 2020.
References
1. Beard, N., Bacall, F., Nenadic, A., et al: TeSS: a platform for discover-
ing life-science training opportunities. Bioinformatics 36(10), 3290–3291 (2020).
https://doi.org/10.1093/bioinformatics/btaa047
2. Brickley, D., Burgess, M., Noy, N.: Google Dataset Search: Building a search engine
for datasets in an open Web ecosystem. In: WWW ’19. pp. 1365–1375 (2019).
https://doi.org/10.1145/3308558.3313685
3. Dumontier, M., Gray, A.J., Marshall, M.S., et al: The health care and life sciences
community profile for dataset descriptions. PeerJ 4 (2016)
4. Gray, A.J.G., Groth, P., Loizou, A., et al: Applying linked data approaches to
pharmacology: Architectural decisions and implementation. Semantic Web 5(2),
101–113 (2014). https://doi.org/10.3233/SW-2012-0088
5. Gray, A.J.G., Papadopoulos, P., Mičetić, I., Hatos, A.: Exploiting
Bioschemas Markup to Populate IDPcentral. Tech. rep., BioHackrXiv (2021).
https://doi.org/10.37044/osf.io/v3jct
6. Gray, A.J., Goble, C.A., Jimenez, R.: Bioschemas: From Potato Salad to Protein
Annotation. In: ISWC (Posters, Demos & Industry Tracks) (2017)
7. Guha, R.V., Brickley, D., Macbeth, S.: Big data makes common schemas even more
necessary. CACM 59(2) (2016). https://doi.org/10.1145/2844544
8. Hatos, A., Hajdu-Soltész, B., Monzon, A.M., et al: DisProt: intrinsic protein dis-
order annotation in 2020. Nucleic Acids Research 48(D1), D269–D276 (2020).
https://doi.org/10.1093/nar/gkz975
9. Lazar, T., Martínez-Pérez, E., Quaglia, F., et al: PED in 2021: a major update of
the protein ensemble database for intrinsically disordered proteins. Nucleic Acids
Research 49(D1), D404–D411 (2021). https://doi.org/10.1093/nar/gkaa1021
10. Patel, J.M.: Introduction to Common Crawl Datasets. In: Getting Structured Data
from the Internet: Running Web Crawlers/Scrapers on a Big Data Production
Scale, pp. 277–324. Apress (2020). https://doi.org/10.1007/978-1-4842-6576-5_6
11. Piovesan, D., Necci, M., Escobedo, N., et al: MobiDB: intrinsically disor-
dered proteins in 2021. Nucleic Acids Research 49(D1), D361–D367 (2021).
https://doi.org/10.1093/nar/gkaa1058
12. The UniProt Consortium: UniProt: the universal protein knowledge-
base in 2021. Nucleic Acids Research 49(D1), D480–D489 (2021).
https://doi.org/10.1093/nar/gkaa1100