<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Opportunities and challenges presented by Wikidata in the context of biocuration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin M. Good</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Burgstaller-Muehlbacher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tim Putman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew Su</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Andra Waagmeester</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Molecular and Experimental Medicine, The Scripps Research Institute</institution>
          ,
          <addr-line>10550 North Torrey Pines Road, La Jolla, CA, 92037</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Micelio</institution>
          ,
          <addr-line>Veltwijklaan 305, 2180 Antwerp</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Maryland, School of Medicine</institution>
          ,
          <addr-line>655 West Baltimore Street, Baltimore, MD, 21201</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Wikidata is a world readable and writable knowledge base maintained by the Wikimedia Foundation. It offers the opportunity to collaboratively construct a fully open access knowledge graph spanning biology, medicine, and all other domains of knowledge. To realize this potential, social and technical challenges must be overcome, many of which are familiar to the biocuration community. These include community ontology building, high-precision information extraction, provenance, and license management. By working together with Wikidata now, we can help shape it into a trustworthy, unencumbered central node in the Semantic Web of biomedical data.</p>
      </abstract>
      <kwd-group>
        <kwd>wikidata</kwd>
        <kwd>semantic web</kwd>
        <kwd>ontology</kwd>
        <kwd>crowdsourcing</kwd>
        <kwd>wiki</kwd>
        <kwd>biocuration</kwd>
        <kwd>knowledge graph</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        Wikidata is a world readable and writable knowledge base
currently maintained by the Wikimedia Foundation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It is
used by the many different language Wikipedias to manage
inter-language links and to host data rendered in infoboxes.
Its contents are available to all users under the Creative
Commons CC0 1.0 Universal license<sup>1</sup>. The data can be
queried via a SPARQL endpoint<sup>2</sup>, retrieved as a full database
download<sup>3</sup>, and manipulated both manually and
programmatically via a REST API<sup>4</sup>.
      </p>
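      <p>As a rough illustration of the programmatic access described above, the following Python sketch builds a request for the SPARQL endpoint and flattens the standard SPARQL JSON result format. The endpoint path (https://query.wikidata.org/sparql), the example query, the User-Agent string, and all function names are assumptions made for illustration rather than details given in this article. The network call is isolated in run_query so that the two helper functions can be checked offline.</p>
      <preformat>
```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint path; the article only names https://query.wikidata.org/.
SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: fetch the English label of item Q223102 (peritonitis).
EXAMPLE_QUERY = """
SELECT ?label WHERE {
  wd:Q223102 rdfs:label ?label .
  FILTER(LANG(?label) = "en")
}
"""


def build_sparql_url(query, endpoint=SPARQL_ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query.strip(), "format": "json"})
    return endpoint + "?" + params


def parse_bindings(result):
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    rows = []
    for binding in result["results"]["bindings"]:
        rows.append({name: cell["value"] for name, cell in binding.items()})
    return rows


def run_query(query):
    """Send the query over HTTP and return flattened rows (needs network access)."""
    request = urllib.request.Request(
        build_sparql_url(query),
        headers={"User-Agent": "biocuration-example/0.1 (illustrative sketch)"},
    )
    with urllib.request.urlopen(request) as response:
        return parse_bindings(json.load(response))
```
      </preformat>
      <p>Calling run_query(EXAMPLE_QUERY) would send the request; the same two helpers can be reused unchanged for any other SELECT query.</p>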
      <p>
        In addition to its function as a structured datastore for the
Wikimedia projects, Wikidata is being used to integrate and
distribute biomedical knowledge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For example, it has been
used to disseminate knowledge about drug-drug interactions
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], human genes [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and microbial genomics [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Here, we
suggest a few of the opportunities and associated challenges
that Wikidata presents to the broad biocuration community.
      </p>
      <p>II. OPPORTUNITIES</p>
    </sec>
    <sec id="sec-2">
      <title>A. As a fully open public knowledge graph</title>
      <p>Wikidata’s CC0 license, Semantic Web compatible
implementation and active community provide a unique
opportunity to assemble and disseminate knowledge. Wikidata</p>
      <sec id="sec-2-1">
        <title>1 https://creativecommons.org/publicdomain/zero/1.0/</title>
      </sec>
      <sec id="sec-2-2">
        <title>2 https://query.wikidata.org/ 3 https://www.wikidata.org/wiki/Wikidata:Database_download 4 https://www.wikidata.org/w/api.php</title>
        <p>
          Apart from its use as a knowledge graph, Wikidata could
provide great value to the text-mining community as a
multilingual collection of concept labels, descriptions, and links to
encyclopedic text. So-called ‘items’ in Wikidata are roughly
analogous to the concepts in the Unified Medical Language
System (UMLS) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Each item may have labels and
descriptions in any of hundreds of different human languages
as well as links to corresponding Wikipedia articles in each of
these languages. In addition, Wikidata provides links to
unique concept identifiers in a growing number of controlled
vocabularies and ontologies, thus easing integration with and
between existing knowledge bases. For example, the Wikidata
item for peritonitis<sup>5</sup> provides terms, aliases and article links in
approximately 50 languages. Further, it provides links to
equivalent concepts in 11 different external resources,
including MeSH, the Human Disease Ontology, and
ICD-10. This lexical information, coupled with the growing
amount of semantic information represented in the Wikidata
knowledge graph, provides a powerful resource for natural
language processing. Already, applications such as
ContentMine are using Wikidata for this purpose [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Unlike
the UMLS, which is centrally curated, Wikidata’s distributed
curation model offers the potential for far greater scale and
adaptability, at the cost of greater challenges in establishing
and maintaining order.
        </p>
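        <p>To make the lexical content of an item concrete, the sketch below summarizes one entity document in the JSON shape returned by the Wikidata API: the languages that have labels, the number of aliases, the linked Wikipedia editions, and any identifier-valued claims. The function name, the sample document, and its values are hypothetical placeholders; P486 is assumed here to be the MeSH identifier property and is not taken from this article.</p>
        <preformat>
```python
def summarize_item(entity):
    """Summarize the lexical content of one Wikidata entity JSON document."""
    labels = entity.get("labels", {})
    aliases = entity.get("aliases", {})
    sitelinks = entity.get("sitelinks", {})
    external_ids = {}
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            snak = statement.get("mainsnak", {})
            # Keep only identifier-valued claims that actually carry a value.
            if snak.get("datatype") == "external-id" and snak.get("snaktype") == "value":
                external_ids.setdefault(prop, []).append(snak["datavalue"]["value"])
    return {
        "label_languages": sorted(labels),
        "alias_count": sum(len(names) for names in aliases.values()),
        "article_links": sorted(sitelinks),
        "external_ids": external_ids,
    }


# Hand-written sample in the shape of the entity JSON; all values are placeholders.
SAMPLE_ITEM = {
    "id": "Q223102",
    "labels": {
        "en": {"language": "en", "value": "peritonitis"},
        "de": {"language": "de", "value": "Peritonitis"},
    },
    "aliases": {"en": [{"language": "en", "value": "placeholder alias"}]},
    "sitelinks": {"enwiki": {"site": "enwiki", "title": "Peritonitis"}},
    "claims": {
        # Assumed identifier property with a placeholder value.
        "P486": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P486",
                    "datatype": "external-id",
                    "datavalue": {"value": "PLACEHOLDER-ID", "type": "string"},
                }
            }
        ],
        # A non-identifier claim, which the summary ignores.
        "P31": [
            {
                "mainsnak": {
                    "snaktype": "value",
                    "property": "P31",
                    "datatype": "wikibase-item",
                    "datavalue": {
                        "value": {"entity-type": "item", "id": "Q0"},
                        "type": "wikibase-entityid",
                    },
                }
            }
        ],
    },
}

summary = summarize_item(SAMPLE_ITEM)
```
        </preformat>
        <p>A text-mining pipeline could use such a summary to build a multilingual dictionary keyed on external identifiers, although the exact mapping strategy would depend on the vocabularies involved.</p>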
      </sec>
      <sec id="sec-2-3">
        <title>5 https://www.wikidata.org/wiki/Q223102</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>A. Community ontology building</title>
      <p>When creating a knowledge base that spans all domains of
knowledge, what are the most effective patterns for
representation? How can the community work most
effectively together to move iteratively closer to the most
useful forms? These questions are currently being tackled by
a distributed, mostly-volunteer community of ontologists,
technologists, domain experts, and interested citizens in
discussions held in forums such as the Wikidata property
proposal page<sup>6</sup>. Before a property (e.g. ‘part of’, ‘MeSH id’,
or ‘used to treat’) can be used in Wikidata, it must be proposed
and approved by community consensus. Once consensus is
achieved, an elected community member with administrative
powers creates the property, and it can then be used to add
claims to any item. This collection of properties, together with the
guidelines associated with their use, forms a major part of the
active ‘ontology’ of Wikidata.</p>
      <p>
        In comparison to other efforts to build large knowledge
graphs, the Wikidata approach is on the chaotic side. There is
no rigid application of an upper ontology, no automated
reasoning to support class inference or quality control, and no
over-arching plan to govern the system’s evolution. Instead,
there is a large, motivated, highly heterogeneous community
doing their best to assemble useful structures one step at a
time. So far, good progress has been made as evidenced by
the early applications of Wikidata content such as the new
infobox for human genes in Wikipedia [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. That being said,
there is a clear need for experienced ontologists to join the
conversations and help to collaboratively guide this
community forward if it is to reach its full potential.
      </p>
    </sec>
    <sec id="sec-4">
      <title>B. Establishing computable trust</title>
      <p>A key enabling feature of the Wikidata infrastructure is the
capacity to provide provenance for its claims (the triples that
compose the knowledge graph) through references. Each
claim can be supported by any number of references to
supporting sources of information. Unfortunately, many of the
claims that are currently in Wikidata were not assigned
references. These unsourced claims are of uncertain quality
and may weaken the chances of community uptake. Many
long-time Wikipedians are hesitant to embrace Wikidata and
use the lack of references as an argument against broadly
deploying its contents to support infoboxes. This situation
poses a challenge to the information extraction community.
Given an unsourced claim (e.g. that a drug treats a particular
disease), can we develop automated or semi-automated
processes for finding sources to validate or invalidate these
claims? Could we apply similar processes to automatically
verify references that do exist to ensure high quality? If
successful, such automation could greatly help drive Wikidata
and other similarly open initiatives forward by allaying
concerns about the trustworthiness of content.</p>
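      <p>A first step toward such automation is simply locating the claims that lack references. The sketch below does this for a single entity document in the JSON shape returned by the Wikidata API, where a statement carries a "references" list only when at least one source is attached. The function names and the sample document are hypothetical; P2175 is assumed here to stand for a ‘used to treat’ style property, and the statement identifiers are placeholders.</p>
      <preformat>
```python
def unsourced_claims(entity):
    """Return (property, statement id) pairs for statements with no references."""
    missing = []
    for prop, statements in sorted(entity.get("claims", {}).items()):
        for statement in statements:
            # The "references" key is absent or empty when nothing is cited.
            if not statement.get("references"):
                missing.append((prop, statement.get("id")))
    return missing


def reference_coverage(entity):
    """Fraction of an item's statements that cite at least one reference."""
    total = sum(len(group) for group in entity.get("claims", {}).values())
    if total == 0:
        return None  # No statements, so coverage is undefined.
    return 1 - len(unsourced_claims(entity)) / total


# Hand-written sample: one sourced and one unsourced statement (placeholders).
SAMPLE_ENTITY = {
    "id": "Q-EXAMPLE",
    "claims": {
        "P2175": [
            {
                "id": "Q-EXAMPLE$statement-1",
                "mainsnak": {"snaktype": "value", "property": "P2175"},
                "references": [{"snaks": {"P248": []}, "snaks-order": ["P248"]}],
            },
            {
                "id": "Q-EXAMPLE$statement-2",
                "mainsnak": {"snaktype": "value", "property": "P2175"},
            },
        ]
    },
}

todo = unsourced_claims(SAMPLE_ENTITY)
coverage = reference_coverage(SAMPLE_ENTITY)
```
      </preformat>
      <p>The resulting list could serve as a work queue for an information extraction system or for human curators; checking whether an existing reference actually supports its claim is a harder problem that this sketch does not address.</p>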
      <sec id="sec-4-1">
        <title>6 https://www.wikidata.org/wiki/Wikidata:Property_proposal</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>C. Building up Wikidata with text mining</title>
      <p>The majority of the world’s biomedical knowledge
remains locked up in unstructured text. As text mining
matures, it is increasingly possible to extract this knowledge
automatically; however (1) most people, even within the
bioinformatics community, do not have the skills and
resources to perform this work themselves and (2) despite
many advances, workflows for generating highly reliable
content still require human review. If extracted knowledge
could be shared through Wikidata, it would reach the broadest
possible audience, eliminating the need for consumers to build
and run their own extraction pipelines. However, to achieve
this, the quality of such workflows would need to be at the
same level as institutional biocuration processes – likely with
human verification as the final step. A challenge for the
text-mining research community is to identify ways to engage the
thousands of Wikidata community members to define truly
scalable, high-quality biocuration workflows by effectively
integrating machine intelligence with community intelligence.</p>
      <sec id="sec-5-1">
        <title>CONCLUSION</title>
        <p>With diligence, persistence and patience, Wikidata could
become the central hub of the Web of data, uniting all domains
of knowledge. The biocuration community has an opportunity
to help lead this process and, in doing so, benefit all aspects
of biomedical research. The time is now.</p>
      </sec>
      <sec id="sec-5-2">
        <title>ACKNOWLEDGMENT</title>
        <p>This work was supported by the US National Institutes of
Health (grants GM089820 and U54GM114833 to AIS) and by
the Scripps Translational Science Institute with an
NIH-NCATS Clinical and Translational Science Award (CTSA; 5
UL1 TR001114).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Vrandecic</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krotzsch</surname>
            <given-names>M</given-names>
          </string-name>
          :
          <article-title>Wikidata: a free collaborative knowledgebase</article-title>
          .
          <source>Commun ACM</source>
          <year>2014</year>
          ,
          <volume>57</volume>
          (
          <issue>10</issue>
          ):
          <fpage>78</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Mitraka</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Waagmeester</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burgstaller-Muehlbacher</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schriml</surname>
            <given-names>LM</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            <given-names>AI</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Good</surname>
            <given-names>BM</given-names>
          </string-name>
          :
          <article-title>Wikidata: A platform for data integration and dissemination for the life sciences and beyond</article-title>
          .
          <source>bioRxiv</source>
          <year>2015</year>
          :
          <volume>031971</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Pfundner</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schonberg</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horn</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boyce</surname>
            <given-names>RD</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Samwald</surname>
            <given-names>M</given-names>
          </string-name>
          :
          <article-title>Utilizing the Wikidata system to improve the quality of medical content in Wikipedia in diverse languages: a pilot study</article-title>
          .
          <source>J Med Internet Res</source>
          <year>2015</year>
          ,
          <volume>17</volume>
          (
          <issue>5</issue>
          ):
          <fpage>e110</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Burgstaller-Muehlbacher</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Waagmeester</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitraka</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turner</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Putman</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leong</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naik</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pavlidis</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schriml</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Good</surname>
            <given-names>BM</given-names>
          </string-name>
          et al:
          <article-title>Wikidata as a semantic framework for the Gene Wiki initiative</article-title>
          .
          <source>Database (Oxford)</source>
          <year>2016</year>
          ,
          <volume>2016</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Putman</surname>
            <given-names>TE</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burgstaller-Muehlbacher</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Waagmeester</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            <given-names>AI</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Good</surname>
            <given-names>BM</given-names>
          </string-name>
          :
          <article-title>Centralizing content and distributing labor: a community model for curating the very long tail of microbial genomes</article-title>
          .
          <source>Database (Oxford)</source>
          <year>2016</year>
          ,
          <volume>2016</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Himmelstein</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jensen</surname>
            <given-names>LJ</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fortney</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chung</surname>
            <given-names>C</given-names>
          </string-name>
          :
          <article-title>Integrating resources with disparate licensing into an open network</article-title>
          .
          <source>Thinklab</source>
          <year>2015</year>
          , doi:10.15363/thinklab.d107.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Bodenreider</surname>
            <given-names>O</given-names>
          </string-name>
          :
          <article-title>The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>
          .
          <source>Nucleic Acids Research</source>
          <year>2004</year>
          ,
          <volume>32</volume>
          :
          <fpage>D267</fpage>
          -
          <lpage>D270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Martone</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murray-Rust</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Molloy</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arrow</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>MacGillivray</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kittel</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kasberger</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steel</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oppenheim</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranganathan</surname>
            <given-names>A</given-names>
          </string-name>
          et al:
          <article-title>ContentMine/Hypothes.is Proposal</article-title>
          .
          <source>Research Ideas and Outcomes</source>
          <year>2016</year>
          ,
          <volume>2</volume>
          (
          <issue>e8424</issue>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>