=Paper= {{Paper |id=Vol-1747/IP19_ICBO2016 |storemode=property |title=Opportunities and Challenges Presented by Wikidata in the Context of Biocuration |pdfUrl=https://ceur-ws.org/Vol-1747/IP19_ICBO2016.pdf |volume=Vol-1747 |authors=Benjamin Good,Timothy Putman,Andrew Su,Andra Waagmeester,Sebastian Burgstaller-Muehlbacher,Elvira Mitraka |dblpUrl=https://dblp.org/rec/conf/icbo/GoodPSWBM16 }} ==Opportunities and Challenges Presented by Wikidata in the Context of Biocuration == https://ceur-ws.org/Vol-1747/IP19_ICBO2016.pdf
Opportunities and challenges presented by Wikidata in the context of biocuration

Benjamin M. Good¹, Sebastian Burgstaller-Muehlbacher¹, Tim Putman¹, Andrew Su¹, Andra Waagmeester², Elvira Mitraka³

¹ Department of Molecular and Experimental Medicine, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA, 92037
² Micelio, Veltwijklaan 305, 2180 Antwerp, Belgium
³ University of Maryland, School of Medicine, 655 West Baltimore Street, Baltimore, MD, 21201

Abstract: Wikidata is a world-readable and writable knowledge base maintained by the Wikimedia Foundation. It offers the opportunity to collaboratively construct a fully open-access knowledge graph spanning biology, medicine, and all other domains of knowledge. To meet this potential, social and technical challenges must be overcome, most of which are familiar to the biocuration community. These include community ontology building, high-precision information extraction, provenance, and license management. By working together with Wikidata now, we can help shape it into a trustworthy, unencumbered central node in the Semantic Web of biomedical data.

Keywords: wikidata; semantic web; ontology; crowdsourcing; wiki; biocuration; knowledge graph

I. INTRODUCTION

Wikidata is a world-readable and writable knowledge base currently maintained by the Wikimedia Foundation [1]. It is used by the many different language Wikipedias to manage inter-language links and to host data rendered in infoboxes. Its contents are available to all users under the Creative Commons CC0 1.0 Universal license¹. The data can be queried via a SPARQL endpoint², retrieved as a full database download³, and manipulated both manually and programmatically via a REST API⁴.

In addition to its function as a structured datastore for the Wikimedia projects, Wikidata is being used to integrate and distribute biomedical knowledge [2]. For example, it has been used to disseminate knowledge about drug-drug interactions [3], human genes [4], and microbial genomics [5]. Here, we suggest a few of the opportunities and associated challenges that Wikidata presents to the broad biocuration community.

II. OPPORTUNITIES

A. As a fully open public knowledge graph

Wikidata’s CC0 license, Semantic Web compatible implementation, and active community provide a unique opportunity to assemble and disseminate knowledge. Wikidata is currently the only major Semantic Web resource that supports open, collaborative editing. Further, through its association with the Wikipedias, it has thousands of editors working to improve its content. If orchestrated effectively, this combination of technology and community could produce a knowledge resource of unprecedented scale and value. In terms of distributing knowledge, its direct integration with the Wikipedias can allow its community-vetted content to be shared with literally millions of consumers in hundreds of languages. Outside of the Wikipedias, Wikidata’s CC0 license removes all barriers to reuse and redistribution of its contents in other applications. Such legal barriers to data sharing are critical blockers to scientific progress [6]. Because of its truly open-access status and its standards-compliant implementation, it could become the central component of the long-promised Semantic Web in the life sciences.

B. As a shared concept resource for information extraction

Apart from its use as a knowledge graph, Wikidata could provide great value to the text-mining community as a multilingual collection of concept labels, descriptions, and links to encyclopedic text. So-called ‘Items’ in Wikidata are roughly analogous to the concepts in the Unified Medical Language System (UMLS) [7]. Each item may have labels and descriptions in any of hundreds of different human languages as well as links to corresponding Wikipedia articles in each of these languages. In addition, Wikidata provides links to unique concept identifiers in a growing number of controlled vocabularies and ontologies, thus easing integration with and between existing knowledge bases. For example, the Wikidata item for peritonitis⁵ provides terms, aliases, and article links in approximately 50 languages. Further, it provides links to equivalent concepts in 11 different external resources including, e.g., MeSH, Disease Ontology, and ICD10. This lexical information, coupled with the growing amount of semantic information represented in the Wikidata knowledge graph, provides a powerful resource for natural language processing. Already, applications such as ContentMine are using Wikidata for this purpose [8]. Unlike the UMLS, which is centrally curated, Wikidata’s distributed curation model offers the potential for far greater scale and adaptability, at the cost of greater challenges in establishing and maintaining order.

¹ https://creativecommons.org/publicdomain/zero/1.0/
² https://query.wikidata.org/
³ https://www.wikidata.org/wiki/Wikidata:Database_download
⁴ https://www.wikidata.org/w/api.php
⁵ https://www.wikidata.org/wiki/Q223102
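
As an illustration of the lookups described in Sections I and II.B, the following is a minimal sketch (not the authors' code) of how a text-mining pipeline might ask Wikidata for the lexical data and external identifiers of the peritonitis item (Q223102). It assumes the `wbgetentities` action of the API at the footnoted `api.php` address and the `/sparql` path of the query service; the function names, the selected languages, and the query itself are illustrative choices. The sketch only builds request URLs and performs no network access.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://www.wikidata.org/w/api.php"      # API address from the footnotes
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"    # assumed path of the query service

# Illustrative query: every external-identifier value attached to one item.
EXTERNAL_ID_QUERY = """
SELECT ?property ?value WHERE {
  wd:ITEM_ID ?claim ?value .
  ?property wikibase:directClaim ?claim ;
            wikibase:propertyType wikibase:ExternalId .
}
"""


def build_entity_url(item_id, languages=("en", "de", "fr")):
    """Return a URL requesting labels, aliases and sitelinks for one item."""
    params = {
        "action": "wbgetentities",
        "ids": item_id,
        "props": "labels|aliases|sitelinks",
        "languages": "|".join(languages),
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)


def build_sparql_url(item_id):
    """Return a URL that runs the external-identifier query for one item."""
    query = EXTERNAL_ID_QUERY.replace("ITEM_ID", item_id)
    return SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


entity_url = build_entity_url("Q223102")   # peritonitis
sparql_url = build_sparql_url("Q223102")
print(entity_url)
print(sparql_url)
```

Fetching either URL with any HTTP client should return JSON. Keeping URL construction separate from the network call makes the request logic easy to inspect and test offline.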
III. CHALLENGES

A. Community ontology building

When creating a knowledge base that spans all domains of knowledge, what are the most effective patterns for representation? How can the community work most effectively together to move iteratively closer to the most useful forms? These questions are currently being tackled by a distributed, mostly-volunteer community of ontologists, technologists, domain experts, and interested citizens in discussions held in forums such as the Wikidata property proposal page⁶. Before a property (e.g. ‘part of’, ‘MeSH id’, or ‘used to treat’) can be used in Wikidata, it must be proposed and approved by community consensus. Once consensus is achieved, an elected community member with administrative powers creates the property, and it can then be used to add claims to any item. This property collection, and the guidelines associated with its use, forms a major part of the active ‘ontology’ of Wikidata.

In comparison to other efforts to build large knowledge graphs, the Wikidata approach is on the chaotic side. There is no rigid application of an upper ontology, no automated reasoning to support class inference or quality control, and no over-arching plan to govern the system’s evolution. Instead, there is a large, motivated, highly heterogeneous community doing their best to assemble useful structures one step at a time. So far, good progress has been made, as evidenced by the early applications of Wikidata content such as the new infobox for human genes in Wikipedia [4]. That being said, there is a clear need for experienced ontologists to join the conversations and help to collaboratively guide this community forward if it is to reach its full potential.

B. Establishing computable trust

A key enabling feature of the Wikidata infrastructure is the capacity to provide provenance for its claims (the triples that compose the knowledge graph) through references. Each claim can be supported by any number of references to supporting sources of information. Unfortunately, many of the claims that are currently in Wikidata were not assigned references. These unsourced claims are of uncertain quality and may weaken the chances of community uptake. Many long-time Wikipedians are hesitant to embrace Wikidata and use the lack of references as an argument against broadly deploying its contents to support infoboxes. This situation poses a challenge to the information extraction community. Given an unsourced claim (e.g. that a drug treats a particular disease), can we develop automated or semi-automated processes for finding sources to validate or invalidate these claims? Could we apply similar processes to automatically verify references that do exist to ensure high quality? If successful, such automation could greatly help drive Wikidata and other similarly open initiatives forward by allaying concerns about the trustworthiness of content.

C. Building up Wikidata with text mining

The majority of the world’s biomedical knowledge remains locked up in unstructured text. As text mining matures, it is increasingly possible to extract this knowledge automatically; however, (1) most people, even within the bioinformatics community, do not have the skills and resources to perform this work themselves, and (2) despite many advances, workflows for generating highly reliable content still require human review. If extracted knowledge could be shared through Wikidata, it would reach the broadest possible audience, eliminating the need for consumers to build and run their own extraction pipelines. However, to achieve this, the quality of such workflows would need to be at the same level as institutional biocuration processes, likely with human verification as the final step. A challenge for the text-mining research community is to identify ways to engage the thousands of Wikidata community members to define truly scalable, high-quality biocuration workflows by effectively integrating machine intelligence with community intelligence.

IV. CONCLUSION

With diligence, persistence, and patience, Wikidata could become the central hub of the Web of data, uniting all domains of knowledge. The biocuration community has an opportunity to help lead this process and, in doing so, benefit all aspects of biomedical research. The time is now.

ACKNOWLEDGMENT

This work was supported by the US National Institutes of Health (grants GM089820 and U54GM114833 to AIS) and by the Scripps Translational Science Institute with an NIH-NCATS Clinical and Translational Science Award (CTSA; 5 UL1 TR001114).

REFERENCES

[1] Vrandecic D, Krotzsch M: Wikidata: a free collaborative knowledgebase. Commun ACM 2014, 57(10):78-85.
[2] Mitraka E, Waagmeester A, Burgstaller-Muehlbacher S, Schriml LM, Su AI, Good BM: Wikidata: A platform for data integration and dissemination for the life sciences and beyond. bioRxiv 2015:031971.
[3] Pfundner A, Schonberg T, Horn J, Boyce RD, Samwald M: Utilizing the Wikidata system to improve the quality of medical content in Wikipedia in diverse languages: a pilot study. J Med Internet Res 2015, 17(5):e110.
[4] Burgstaller-Muehlbacher S, Waagmeester A, Mitraka E, Turner J, Putman T, Leong J, Naik C, Pavlidis P, Schriml L, Good BM et al: Wikidata as a semantic framework for the Gene Wiki initiative. Database (Oxford) 2016, 2016.
[5] Putman TE, Burgstaller-Muehlbacher S, Waagmeester A, Wu C, Su AI, Good BM: Centralizing content and distributing labor: a community model for curating the very long tail of microbial genomes. Database (Oxford) 2016, 2016.
[6] Himmelstein D, Jensen LJ, Smith M, Fortney K, Chung C: Integrating resources with disparate licensing into an open network. Thinklab 2015, doi:10.15363/thinklab.d107.
[7] Bodenreider O: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic acids research 2004, 32:267-270.
[8] Martone M, Murray-Rust P, Molloy J, Arrow T, MacGillivray M, Kittel C, Kasberger S, Steel G, Oppenheim C, Ranganathan A et al: ContentMine/Hypothes.is Proposal. Research Ideas and Outcomes 2016, 2(e8424).

⁶ https://www.wikidata.org/wiki/Wikidata:Property_proposal
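
As an illustration of the unsourced-claim problem raised in Section III.B, the following is a minimal sketch (not the authors' method) of a first triage step: listing the statements on an item that carry no references, so that they can be handed to a source-finding process. It assumes the entity JSON layout returned by the Wikidata API, in which `claims` maps each property ID to a list of statements and each statement may carry a `references` list. The function name is illustrative, and the sample entity is hand-made with placeholder identifiers rather than real Wikidata content.

```python
def unsourced_statements(entity):
    """Yield (property_id, statement_id) for every statement lacking references."""
    for property_id, statements in entity.get("claims", {}).items():
        for statement in statements:
            # A missing key and an empty list both count as "no reference".
            if not statement.get("references"):
                yield property_id, statement.get("id")


# Hand-made sample with placeholder IDs (hypothetical, not real Wikidata data).
sample_entity = {
    "id": "Q_EXAMPLE",
    "claims": {
        "P_EXAMPLE_A": [
            {"id": "stmt-1", "references": [{"snaks": {}}]},  # sourced
            {"id": "stmt-2"},                                 # no references key
        ],
        "P_EXAMPLE_B": [
            {"id": "stmt-3", "references": []},               # empty reference list
        ],
    },
}

missing = list(unsourced_statements(sample_entity))
print(missing)
```

A check like this only identifies where provenance is absent. Finding and validating a supporting source for each flagged statement is the open research problem the section describes.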