Detecting Semantic Overlap and Discovering Precedents
        in the Biodiversity Research Literature
                                       Position Paper


                     Graeme Hirst*, Nadia Talent†, and Sara Scharf‡

               *Department of Computer Science, University of Toronto
               †Department of Natural History, Royal Ontario Museum
                               ‡Independent Scholar
     gh@cs.toronto.edu; nadia.talent@utoronto.ca; sara.scharf@gmail.com


       Abstract. Scientific literature on biodiversity is longevous, but even when legacy
       publications are available online, researchers often fail to search it adequately or
       effectively for prior publications; consequently, new research may replicate, or
       fail to adequately take into account, previously published research. The mecha-
       nisms of the Semantic Web and methods developed in contemporary research in
       natural language processing could be used, in the near-term future, as the basis
       for a precedent-finding system that would take the text of an author’s early draft
       (or a submitted manuscript) and find potentially related ideas in published work.
       Methods would include text-similarity metrics that take different terminologies,
       synonymy, paraphrase, discourse relations, and structure of argumentation into
       account.

       Keywords: Biodiversity literature, taxonomy, systematics, natural language pro-
       cessing, Semantic Web, paraphrase, textual entailment, text similarity, discourse
       relations, structure of scientific papers.


1   Introduction

Scientific progress comes from building on, and occasionally overturning, past results.
It is therefore a researcher’s responsibility to know the history of the topic on which
they are working, and this is so for two primary reasons: (1) to do the best possible
work, building upon the state of the art, and neither duplicating what has already been
done nor repeating the mistakes of the past; (2) to include in any publication of the work
a literature review that allows the reader to understand the work in its broader context,
compare it with cognate research, and evaluate it for quality and novelty. This requires
the researcher both to maintain a knowledge of current research (current awareness)
and to perform searches for relevant work in the legacy literature when their new work
necessitates it (finding precedents).
     Nonetheless, for a variety of reasons, researchers do not always perform these tasks
adequately, and this can lead to subsequent problems both for their own work and for
that of other researchers. This is particularly so in research in biodiversity, more
than perhaps in most other sciences. Because of its longevous literature¹ and its need,
in research on changes in biodiversity in ecosystems, to understand past conditions,
finding precedents is both more important and more difficult than in the fast-moving
don’t-look-back-or-you’ll-get-run-over sciences such as genomics.
    In this position paper, we sketch the design of a proposed system that would draw
on the mechanisms of the Semantic Web and methods in natural language processing to
facilitate a search for precedents in the legacy biodiversity literature, especially (but not
exclusively) the literature relating to systematics. It should be noted that what we are
describing here is neither conventional search nor plagiarism detection (see footnote 8
below); our approach is influenced by research in the history of ideas in systematics
on the detection of influence between authors and of independent re-invention (Scharf
2008).


2     What is a precedent, why do precedents matter, and why can they be
      hard to find?

We use the word precedent here, for want of a better term, to refer to any earlier published
work or body of work that is, in an important way, similar to, relevant to, or related
to the current work in question. This is rather vague and subjective, but we can make it a
little more concrete thus: An earlier published work is a precedent for current work if it
has affected, or should have affected, the course of the newer work. This could include
relevant methodologies, earlier attempts to solve the same problem, and earlier results
and data. The most serious examples would be earlier work that is essentially the same
as the present work (the new work is an independent re-invention), and, in particular,
when the earlier work demonstrates that the new work is doomed to failure. Of primary
interest to us in this paper are precedents in biodiversity research that, if not known and
taken into account, render the current work seriously incomplete or erroneous.
     Biodiversity research depends heavily on the legacy literature, which is the key
source of important information about former biodiversity, and which also contains the
results of massively time-consuming research that is difficult to replicate. The legacy lit-
erature of biodiversity includes a large component that is taxonomic literature. Besides
the primary descriptions of new taxa, a major component of the taxonomic literature
is synoptic volumes such as field guides, floras and faunas, synonymies, and ‘manu-
als’, which give varying levels of detail about the taxa present in a geographic area,
including newly described taxa, summaries of opinion about previously defined taxa,
and amended circumscriptions and descriptions. Modern synoptic works also include
species-occurrence databases and analyses of biodiversity.

 1 “Natural history scientists work in fragmented, highly distributed and parochial communities,
    each with domain specific requirements and methodologies [Scoble 2008]. Their output is
    heterogeneous, high volume and typically of low impact, but with a citation half-life that may
    run into centuries” (Smith et al. 2009). “The cited half-life of publications in taxonomy is
    longer than in any other scientific discipline, and the decay rate is longer than in any scientific
    discipline” (Moritz 2005). Unfortunately, we have been unable to identify the study that is the
    basis for Moritz’s remark.

     Taxonomic nomenclature is a component of systematics that functions as a gateway
to much of the taxonomic literature. It involves the application of the sets of rules that
are laid down in the codes of nomenclature (ICZN 1999; McNeill et al. 2012) and peri-
odically updated, with most provisions retroactively in force. The nomenclatural rules
determine how the correct name for each species (or taxon of a higher or lower rank)
must be determined. The principle of priority enshrined in the nomenclature rules holds
as far back as the mid–eighteenth century, and literature of that vintage may be required
to discover which name is correct. The definition of a taxon is anchored by the type
specimen and the circumscription may be expressed either as a list of characteristics or
as a list of specimens that the author considers to fit within the definition of the taxon.
The specimen list may be either a list of typical specimens, or may be chosen to illus-
trate the range of morphological variation (or, potentially, the range of DNA sequences
seen). Subsequent authors may wish to add to or subtract from the circumscription:
common cases are (1) that a specimen of the other sex or a different life stage (such as
a larva) is found, or (2) that a specimen originally cited is found to belong to a different
taxon.
     A taxonomist who wishes to create a new definitive list of the species in a geographic
area or in a taxonomic group (a new “revision”) must therefore search the legacy
literature to find previous work that lists species in the area, or describes new species
that might or might not be relevant, that amends previous descriptions, and (crucially)
that works out the relationships between new or previously known species. They will
need to find, evaluate, and cite prior publications that merge or split species (taxa), re-
classify them into different groups, or assign new names to previously described species
(taxa). All name alterations need to be re-evaluated in light of the rules of nomenclature
now in force, which in practice means that previously ignored literature may resurface
and lead the literature search into new areas. The precedents that were assumed for a
work, and even the literature that was deliberately ignored for a work, may be listed in
a way that requires a considerable sophistication in text understanding, for example in
a book preface (e.g., Bentham and Hooker’s Genera Plantarum).
     Because of what has been termed the “citation gap” in the biodiversity literature
(Payne et al. 2012), the taxonomic literature is massively undercited, and “such unin-
tended omissions are likely to result in the decline of the [taxonomic] disciplines upon
which the synoptic analyses depend” (Payne et al. 2012: p. 1350). This has occurred
because the rules of nomenclature are now considered arcane by many researchers,
and complete ignorance of the rules is common, not only among authors in ecology
and biological taxonomy,² but lately even among the editors of major journals.³ Large
databases are being developed that already reduce the need to check the older literature,
but their coverage is far from complete (Reveal 2012). Because of researchers’ ignorance and
misunderstanding of the rules of nomenclature, the legacy literature becomes incomprehensible
to ecologists and inaccessible for biodiversity studies. Yet the consequences of
mistakes, including failure to understand the older literature, can be very serious.⁴

 2 Systematics was traditionally a significant component of university biology courses, but the
   courses that provide this fundamental training have almost disappeared (Garnock-Jones 2013),
   replaced by courses that deal solely with molecular phylogenetic analysis, which is just one
   component of systematics.
 3 For an example of editorial problems, see the discussion in Taxacom at
   http://mailman.nhm.ku.edu/pipermail/taxacom/2004-December/045547.html et seq.

     Moreover, these kinds of mistakes may have a personal cost for their authors. When
nomenclatural or taxonomic changes are referred to in later works, even in brief sum-
maries, they usually carry a pointer to the authors who made the original change. There-
fore, publications that err in this regard, if not ignored completely, are likely to be cited
in a way that makes their transgressions apparent, an embarrassment for both the au-
thors and the journal editors. For example, a taxonomic name may appear with an anno-
tation such as nomen dubium, nomen invalidum, or nomen illegitimum, which indicates
that the original authors erred. A correction may be published by later authors (neo-
or lectotypification). When synonyms are listed, the authors commonly point to where
their opinion differs from that of earlier authors. For example, the entry “Synonyms:
Leptospermum flavescens sensu W.L. Wagner et al. p.p., non Sm.” means that W.L. Wagner et al.
included in the definition of Leptospermum flavescens some plants (p.p. = pro parte, ‘in
part’) that did not match Smith’s original description (non Sm.), and that the present authors
consider them to belong in another species; such a list may include implicit allegations
that mistakes were made.
     In the past, the principal problem was lack of access to the required literature,
but this problem is diminishing, in large part due to the freely accessible Biodiversity Heritage
Library⁵ (Gwinn and Rinaldo 2009) and the (pay-walled) JSTOR collection, though much
still remains inaccessible. But access helps only if researchers are willing to search this
literature and can do so effectively. Non-technical barriers to doing so, in addition to the
ignorance of the need and of the rules of nomenclature mentioned above, include time
pressure, and the “Google effect” of just searching the Web and ignoring all but the top
few results.
     But even competent and well-intentioned researchers often have difficulties searching
this literature. Simple Google-style keyword searches are frequently insufficient,⁶
because in this literature, more so perhaps than most other fields of science, related con-
cepts are often described or explained in different terms, or in completely different con-
ceptual frameworks, from those of contemporary research. As a result, interesting and
beneficial relations with legacy publications, or even with whole literatures, may remain
hidden to term-based methods. In the case of taxonomy in particular, this implies the
existence of what Nic Lughadha (2004) has called “hidden synonymies”. The problem
is compounded by ubiquitous Latin, non-obvious (to the modern reader) abbreviations,
particularly Latin abbreviations and varied abbreviations of people’s names, compact
tabulations, and misspellings and multiple spellings of the same name.

 4 “International conventions and national or regional legislation concerning threatened or
   endangered animals specify the species or subspecies name of the animals that the law intends
   to protect. Thereafter, protection goes with the name rather than the endangered species itself. Any
   subsequent change in name could therefore affect conservation measures. The Commission of-
   ten acts to protect the names of endangered species.” — From the web site of the International
   Commission on Zoological Nomenclature (http://iczn.org/content/conservation)
 5 http://www.biodiversitylibrary.org
 6 Moreover, the quality of the OCR of many scans in the Biodiversity Heritage Library is
   presently so poor that keyword searches frequently result in false negatives.

    Of course, none of this is to say that exact keyword matches are irrelevant or un-
helpful. Term overlap can play its usual roles, and matches to names of taxa and of
geographic locations are of particular importance.⁷ However, our goal in the present
work is to use semantic and structural relationships to discover the covert legacy literature
that a Google search or similar keyword-based search does not find.


3     Foundational research

Ironically, we had great difficulty finding legacy literature on the topic of the difficulty
of finding legacy literature, and on the topic of how researchers, in practice, search for
and use this literature and the extent to which they do so.
     The body of work that is perhaps most related to the former point is that of Swanson
and colleagues (e.g., Swanson 1986; 1988; 1990) on identifying undiscovered public
knowledge by analyzing the complementary but disjoint literature in two distinct fields
of research and connecting knowledge in each to create new knowledge. For example,
Swanson showed (1990; 1993) that studies on magnesium and studies on migraine, in
two different fields, had terms in common, and the discovery that the two were related
led in turn to the discovery that magnesium deficiency is connected with migraine.
Superficially, the aim of this kind of analysis is the exact opposite of ours — it is looking
at cases where, a priori, the authors are working in different research fields (rather than
the same or closely related fields), and it does not operate at the level of the individual
research paper. But methodologically it is similar nonetheless in that it is looking for
an overlap or similarity in some aspect or aspects of the research. However, this work
is limited in that the identification of related sub-fields was based simply on common
terms used in both studies, and as we noted above, identical terminology cannot be
assumed, even within a single research field. Moreover, the work needs, by its own
background assumptions, to look at all possible pairings of topics of scholarship, and
hence is prohibitively combinatorially explosive; in practice, a human must choose one
topic or question as a starting point (Swanson 1993).
     By contrast, in the approach that we will describe below, the search is constrained
by assumption to a single, but large, field. This limits it sufficiently that it is compu-
tationally feasible with contemporary computing clusters. In the future, it will surely
become computationally feasible to use our approach for Swanson’s purposes.


4     Finding precedents in taxonomy and systematics

The confluence of research in natural language processing with Semantic Web tech-
nologies suggests the possibility in the near-term future of developing systems that
would markedly improve researchers’ ability to search and use the legacy literature in
taxonomy and systematics. We assume the online availability of the literature itself:
that is, the continuing development of the Biodiversity Heritage Library (with improved
OCR), and access to the more-recent (still-in-copyright) twentieth-century literature in
JSTOR and elsewhere. In this context, a precedent-finding system would take the text
of an author’s early draft (or a submitted manuscript) and find potentially related ideas
in previously published work, matching not just words and phrases but ideas, regardless
of how they are expressed. It would integrate current and expected near-term future
research on the NLP technologies that we will describe below.⁸

 7 A barrier that remains beyond the scope of this paper is the need for translation of literature
    written in languages not spoken by the searcher. Except for the special case of Latin, we do
    not address cross-lingual issues.

     We do not expect such a system to have very high precision; many or most
of its matches would be false alarms, although the design would attempt to minimize
that. But the emphasis would be on high recall, bringing the potential matches to the
attention of the user.
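
The trade-off between precision and recall can be made concrete with a minimal sketch. The function name, the document identifiers, and the counts below are hypothetical and chosen only for illustration; they are not measurements of any existing system.

```python
def precision_and_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items.

    precision = fraction of retrieved items that are relevant
    recall    = fraction of relevant items that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical outcome: the system proposes ten candidate publications,
# of which only two are genuine precedents, and it misses none.
candidates = [f"doc{i}" for i in range(10)]
true_precedents = ["doc3", "doc7"]
precision, recall = precision_and_recall(candidates, true_precedents)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this invented case most candidates are false alarms, yet no true precedent is missed, which is the behaviour that the emphasis on recall is meant to favour.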
     In the following subsections, we look at some of the primary elements, beyond
literal keyword matching, of finding a match between new text and a potential precedent
publication. We do not attempt a formal functional specification, which is the next step
for this research, nor in the space available can we present examples, which would be
textually large. We assume, without further comment, that a component for reasonably
accurate translation of the Latin of taxonomic descriptions is available, and that the
Latin is retained for keyword matching while the translation is used by other matching
processes. We also assume that we have a component for recognizing taxonomic names
in text, such as that of Koning, Sarkar, and Moritz (2005).


4.1    Paraphrase and similarity of meaning

The first element is the identification of sentences and phrases that are close in meaning.
This has become an important research topic in computational linguistics in the last
decade. It takes three forms; the first two are these:

 1. Paraphrase recognition: identifying that two sentences or phrases are semantically
    equivalent or close to equivalent, even if very different in expression.
 2. More generally, recognizing textual entailment (RTE): determining that the mean-
    ing of one sentence is entailed by, or is a consequence of, that of another. (Sentence-
    level paraphrase, then, can be thought of as mutual textual entailment.)

Dagan et al. (2013) provide a comprehensive survey of the techniques that have been
developed for paraphrase recognition and RTE. Clearly, if we found this kind of a rela-
tionship between new work and a legacy publication, we would want to look further to
see whether the latter might be a precedent.

 8 Although there has been much research recently on plagiarism detection (see, for example, the
    evaluation lab overview by Potthast et al. (2012)), it is only peripherally relevant here, as it
    focuses primarily on finding matches for fragments of text that are precisely identical or that
    differ in relatively minor ways, as when a plagiarizing student makes small changes in an attempt
    to evade detection. These are not the kinds of matches we are looking for. Current research
    in plagiarism detection has begun to take greater amounts of rewriting (including translation)
    into account (e.g., Barrón-Cedeño et al. 2013), making the task more like paraphrase detection
    (see below).

    The third form is this:

 3. Measuring semantic text similarity (STS): identifying the degree to which two sentences,
    even if not paraphrases or entailing, are related in meaning.

Here, we are not looking for full equivalence or entailment, but rather trying to deter-
mine a degree of similarity or relatedness in meaning, and the methods that are used are
rather different. Agirre et al. (2012) summarize the varied techniques and performance
in a competitive evaluation of 35 STS systems. Even in the absence of equivalence
or entailment, a high degree of relatedness throughout the two texts could indicate a
potential precedent.
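
As a deliberately simple illustration of graded similarity that tolerates differing terminology, the sketch below maps variant terms to a canonical form and then computes a cosine similarity over term counts. It is not one of the systems evaluated by Agirre et al. (2012); the synonym table and function names are hypothetical, and a real system would draw its synonyms from ontologies or nomenclatural resources.

```python
import math
import re
from collections import Counter

# Hypothetical synonym table mapping variant terms to a canonical form.
CANONICAL = {
    "foliage": "leaves",
    "leaf": "leaves",
    "hairy": "pubescent",
    "downy": "pubescent",
}


def bag_of_terms(text):
    """Lower-case and tokenize the text, mapping each token to its canonical form."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(CANONICAL.get(tok, tok) for tok in tokens)


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-count vectors (0.0 to 1.0)."""
    a, b = bag_of_terms(text_a), bag_of_terms(text_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# The two phrases share no literal word, but match after normalization.
print(cosine_similarity("hairy leaf", "downy foliage"))
print(cosine_similarity("hairy leaf", "red flower"))
```

A plain keyword comparison would score the first pair as unrelated; the normalization step is what exposes the overlap. Bag-of-terms cosine ignores word order and sentence structure, so it stands in here only for the general idea of a similarity score.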
    We expect that precedent-finding systems would draw on all three forms of this
research. However, it should be noted that this research is presently limited to comparisons
of pairs of sentences, whereas our goal includes far broader comparisons, of long
segments or complete texts, to find these relationships. So it will be important for this
research to develop in this direction.


4.2   The low-level structure of scientific papers

The next element is the automatic analysis of the structure of scholarly discourse, es-
pecially scientific papers. Over the last decade, this has grown to become an important
area of natural language processing (e.g., Ananiadou et al. 2012). This work endeavours
to determine the structural purpose and discourse function both of individual sentences
and of larger fragments of text in a scientific paper. Purposes or functions include such
things as stating a claim, describing a gap in knowledge, criticizing or praising past
work, and asserting the novelty of the present work (e.g., Teufel and Kan 2011; An-
grosh et al. 2013a). This research also attempts to determine the purpose and scope of
each citation in a paper (e.g., Siddharthan and Teufel 2007).
    As this work becomes better and more mature, it can start to inform research on
various relationships between texts (section 4.1 above), as the kind of information that
it derives will be important in determining precedents. For example, if it is found that
two sentences in different papers that are related in meaning are both claims, or both are
statements of results, then we have a rather different situation with regard to identifying
a precedent than if the sentence in the earlier paper is a result and the one in the later
paper is a statement of the present state of the art.
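
One way that discourse function could enter the matching is as a weight on a similarity score. The sketch below is only an illustration under stated assumptions: the role labels, the weights, and the function name are hypothetical, and the direction of the weighting (a legacy result that the new draft already reports as background is treated as less likely to be an overlooked precedent) is our assumption for the example, not an established finding.

```python
# Hypothetical weights for (role of sentence in legacy paper, role in new draft).
# Assumption: matching claims or matching results suggest a possible
# re-invention, whereas a legacy result that the draft presents as background
# is probably already known to the author.
ROLE_PAIR_WEIGHT = {
    ("claim", "claim"): 1.0,
    ("result", "result"): 1.0,
    ("result", "claim"): 0.8,
    ("result", "background"): 0.3,
}
DEFAULT_WEIGHT = 0.5  # used for role pairs not listed above


def precedent_score(similarity, legacy_role, draft_role):
    """Scale a meaning-similarity score (0.0 to 1.0) by the role-pair weight."""
    weight = ROLE_PAIR_WEIGHT.get((legacy_role, draft_role), DEFAULT_WEIGHT)
    return similarity * weight


# The same degree of similarity is ranked differently depending on the roles.
print(precedent_score(0.9, "result", "result"))
print(precedent_score(0.9, "result", "background"))
```

In practice such weights would have to be learned or tuned on annotated data rather than fixed by hand.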
    The analysis of the structure of scientific texts will become more sophisticated in
the future as it starts to incorporate more-detailed analysis of the discourse and rhetor-
ical structures of text (e.g., Feng and Hirst 2012) — that is, the ability to find semantic
discourse relationships between the clauses or sentences of a text, and then, in turn, the
relationships that are built between larger fragments of text. That means not just the
similarity or entailment relationships of section 4.1, but relationships such as CAUSE,
CONTRAST, ELABORATION, and so on. And, in particular, it means finding them even
when the author has left them only implicit in the text, which authors frequently do;
in many contexts, human readers are able to recognize these relations without explicit
textual cues, and authors tend to take advantage of this. Recognizing such implicit rela-
tionships is a current topic of research (Lin, Kan, and Ng 2009; Feng and Hirst 2013).

4.3   The argumentation structure of scientific papers

Our final element also relates to the structure of scientific papers, but at a higher level
than the discourse relations. Ultimately, we would like to derive the structure of the
overall argumentation⁹ of a scientific text, and use that information too as a component
of the matching process in our precedent-finding system. This is very difficult, even
for people; a more realistic near-term goal based on current research (e.g., Lin, Kan,
and Ng 2009; Feng and Hirst 2011) is to classify sentences as to their local role in the
argumentation (e.g., premise, evidence) and use this information, and other identified
discourse relations, to recognize larger components of the argumentation of the text and
the kinds of argumentation scheme that it is using — for example, argument by analogy,
or by induction, or by appeal to authority.
    This could then allow matching of papers on the basis of the structure of the argu-
mentation and how the content relates to this structure — or, indeed, independently of
the content.¹⁰ This kind of matching is less of an issue for the primarily fact-gathering
aspects of searching the legacy literature that we described in section 2 above, but it
would be of help in many other aspects of biodiversity (and other scientific) research.

4.4   Practical realization

Last, how would all this be realized in practice? Each item in the biodiversity and systematics
legacy literature will need to be analyzed (including newly added items as they
are published and as scanning of old literature continues) and annotated with an exten-
sive representation for meaning and structure at all the levels of analysis. An important
aspect of the representation and indexing of the legacy publications is that it must facil-
itate the process of checking for matches against new text, and must make this complex
process as cheap as possible.
     We anticipate that this representation would be based on XML and ontologies that
are the topics of present-day research on mechanisms and resources for the Seman-
tic Web. The annotation of some levels of analysis will be straightforward, such as
the extraction of technical terms. Others will require further research and other design
choices, as the nature of the representation will depend in part on the technical aspects
of the methods chosen. For example, Dagan et al. (2013) list five distinct classes of
methods for recognizing textual entailment; each implies different choices in the repre-
sentation of the legacy text. One choice might involve annotating the text with details of
the filled semantic roles of each sentence (Palmer, Gildea, and Xue 2010); another (not
mutually exclusive) choice could be explicit annotation with contextually appropriate
synonyms.
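
To suggest what such an annotation might look like, the following sketch builds an XML fragment for a single sentence with Python's standard xml.etree.ElementTree module. The element and attribute names, the role labels, and the example sentence are all hypothetical; no schema has been fixed, and the choice would depend on the matching methods adopted.

```python
import xml.etree.ElementTree as ET


def annotate_sentence(text, discourse_role, semantic_roles, synonyms):
    """Build a <sentence> element carrying several layers of annotation.

    All element and attribute names are illustrative, not a proposed standard.
    """
    sentence = ET.Element("sentence", {"discourseRole": discourse_role})
    ET.SubElement(sentence, "text").text = text

    # Filled semantic roles of the sentence.
    roles = ET.SubElement(sentence, "semanticRoles")
    for role_name, filler in semantic_roles.items():
        ET.SubElement(roles, "role", {"name": role_name}).text = filler

    # Contextually appropriate synonyms for terms in the sentence.
    syns = ET.SubElement(sentence, "synonyms")
    for term, alternatives in synonyms.items():
        entry = ET.SubElement(syns, "term", {"surface": term})
        for alt in alternatives:
            ET.SubElement(entry, "synonym").text = alt
    return sentence


example = annotate_sentence(
    text="The leaves are hairy.",
    discourse_role="result",
    semantic_roles={"theme": "the leaves", "attribute": "hairy"},
    synonyms={"hairy": ["pubescent"]},
)
xml_string = ET.tostring(example, encoding="unicode")
print(xml_string)
```

Keeping the layers in separate child elements means that a matching process can read only the layer it needs, for example the synonyms for term expansion or the discourse role for filtering.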
     Practicality thus depends not only on our restriction of the domain (compared to
the combinatorial problems of Swanson’s approach, in section 3 above), but also on
developing an effective representation.

 9 We refer, somewhat hyper-correctly, to argumentation structure to prevent the misinterpretation
    that we are talking about argument structure in the sense used in sentence-level syntax.
    We nonetheless refer to kinds of argument where there can be no terminological ambiguity.
10 Retrieval of precedents by argumentation structure, without regard to the facts of any individual
    case, is also of particular concern to legal researchers (Dick 1991).

4.5   What’s not included

The attentive reader will have observed that there are two things omitted from our pro-
posal that might have been expected. The first is the use of citations and citation chains.
One of our assumptions here is that our system is looking for things that are or might
be completely disconnected, with respect to citations, from its starting point. Therefore,
citations can play only a supporting role. Nonetheless, citations, including indirect con-
nections, could still be a helpful factor in finding precedents; elaborating on this point
is beyond the scope of this paper.
    The other omission is semantic interpretation into a logical form, represented in
XML, that draws on ontologies in the style of the original Berners-Lee, Hendler, and
Lassila (2001) proposal for the Semantic Web. The problem with logical-form repre-
sentation is that it implies a degree of precision in meaning that is not appropriate for
the kind of matching we are proposing here. This is not to say that logical forms would
be useless. On the contrary, they are employed by some approaches to paraphrase and
textual entailment (section 4.1 above) and hence might appear in the system if only
for that reason; but even so, they would form only one component of a broader and
somewhat looser kind of semantic representation.


5     Conclusion

The precedent-finding system as we have sketched it here would be the culmination of
a number of threads of research in computational linguistics and natural language pro-
cessing and in document processing for the Semantic Web, and it can be thought of as
a grand challenge for these fields. Moreover, we argue that by restricting our goals to
the special case of the literature of systematic taxonomy and ecosystem biodiversity, we
can achieve useful results in the near term. But more generally, in a world in which increasingly
interdisciplinary scholars must search an increasingly large legacy literature,
precedent-finding systems would have great utility.


Acknowledgments. This work was financially supported by the Natural Sciences and
Engineering Research Council of Canada and the Canadian Newt and Eft Foundation.
We are grateful to Heike Zinsmeister for helpful discussions.


Bibliography

Agirre, Eneko; Cer, Daniel; Diab, Mona; Gonzalez-Agirre, Aitor (2012). SemEval-2012
   Task 6: A pilot on semantic textual similarity. First Joint Conference on Lexical and
   Computational Semantics (*SEM), Montreal, 385–393.
Ananiadou, Sophia; van den Bosch, Antal; Sándor, Ágnes; Shatkay, Hagit; de Waard,
   Anita (editors) (2012). Proceedings of the Workshop on Detecting Structure in Scholarly
   Discourse, 50th Annual Meeting of the Association for Computational Linguistics,
   Jeju, Korea. http://aclweb.org/anthology-new/W/W12/W12-43.pdf
Angrosh, M.A.; Cranefield, Stephen; Stanger, Nigel (2013a). Context identification of
   sentences in research articles: Towards developing intelligent tools for the research
   community. Natural Language Engineering, to appear.
Angrosh, M.A.; Cranefield, Stephen; Stanger, Nigel (2013b). Contextual information
   retrieval in research articles: Semantic publishing tools for the research community.
   Semantic Web Journal, to appear.
   http://iospress.metapress.com/content/q7j360604746l315
Barrón-Cedeño, Alberto; Vila, Marta; Martí, M. Antònia; Rosso, Paolo (2013). Plagiarism
   meets paraphrasing: Insights for the next generation in automatic plagiarism
   detection. Computational Linguistics, to appear.
Berners-Lee, Tim; Hendler, James; Lassila, Ora (2001). The Semantic Web. Scientific
   American, 284(5), May 2001, 34–43.
Dagan, Ido; Roth, Dan; Sammons, Mark; Zanzotto, Fabio Massimo (2013). Recogniz-
   ing Textual Entailment: Models and Applications. Morgan & Claypool Publishers.
Dick, Judith (1991). Representation of legal text for conceptual retrieval. Proceedings,
   Third International Conference on Artificial Intelligence and Law, Oxford, 244–252.
   http://ftp.cs.toronto.edu/pub/gh/Dick-1991.pdf
Feng, Vanessa Wei and Hirst, Graeme (2011). Classifying arguments by scheme. Pro-
   ceedings, 49th Annual Meeting of the Association for Computational Linguistics,
   Portland, Oregon, 978–996.
Feng, Vanessa Wei and Hirst, Graeme (2012). Text-level discourse parsing with rich
   linguistic features. Proceedings, 50th Annual Meeting of the Association for Compu-
   tational Linguistics, Jeju, Korea, 60–68.
Feng, Vanessa Wei and Hirst, Graeme (2013). Removing deleterious information to
   improve recognition of implicit discourse relations. Submitted.
Garnock-Jones, Phil (2013). The citation gap and its effects on taxonomy. In Blog:
   Theobrominated, 5 February 2013. http://theobrominated.blogspot.co.uk/
   2013/02/the-citation-gap-and-its-effects-on.html
Gwinn, Nancy E. and Rinaldo, Constance (2009). The Biodiversity Heritage Library:
   Sharing biodiversity literature with the world. IFLA Journal, 35(1): 25–34.
International Commission on Zoological Nomenclature (1999). International Code of
   Zoological Nomenclature, fourth edition.
   http://www.nhm.ac.uk/hosted-sites/iczn/code
Koning, Drew; Sarkar, Indra Neil; Moritz, Thomas (2005). TaxonGrab: Extracting tax-
   onomic names from text. Biodiversity Informatics, 2, 79–82.
Lin, Ziheng; Kan, Min-Yen; Ng, Hwee Tou (2009). Recognizing implicit discourse re-
   lations in the Penn Discourse Treebank. Proceedings of the 2009 Conference on Em-
   pirical Methods in Natural Language Processing (EMNLP 2009), Singapore, pages
   343–351.
Moritz, Tom (2005). “Macro-economic case for open access.” Talk at Library and Lab-
   oratory: The Marriage of Research, Data and Taxonomic Literature, London, 5–
   6 February 2005. http://barcoding.si.edu/LibraryAndLaboratory.htm or
   http://barcoding.si.edu/LibraryAndLaboratory/3-11_Moritz.pdf
McNeill, J.; et 13 al. (2012). International Code of Nomenclature for algae, fungi,
   and plants (Melbourne Code) adopted by the Eighteenth International Botanical
   Congress Melbourne, Australia, July 2011. A.R.G. Gantner Verlag KG.
Nic Lughadha, Eimear (2004). Towards a working list of all known plant species.
  Philosophical Transactions: Biological Sciences, 359(no. 1444): Taxonomy for the
  Twenty-First Century (2004-04-29), 681–687. http://www.jstor.org/stable/
  4142261
Palmer, Martha; Gildea, Dan; Xue, Nianwen (2010). Semantic Role Labeling. Morgan
  & Claypool Publishers.
Payne, Jonathan L.; et 8 al. (2012). A lack of attribution: Closing the citation gap
  through a reform of citation and indexing practices. Taxon 61(6): 1349–1351.
  http://www.ingentaconnect.com/content/iapt/tax/2012/00000061/
  00000006/art00030
Potthast, Martin; et 11 al. (2012). Overview of the 4th International Competition on
  Plagiarism Detection. Proceedings, PAN 2012 Lab: Uncovering Plagiarism, Author-
  ship and Social Software Misuse. In: Forner, Pamela; Karlgren, Jussi; and Womser-
  Hacker, Christa (editors), CLEF 2012 Evaluation Labs and Workshop — Working
  Notes Papers, Rome.
Reveal, James L. (2012). A divulgation of ignored or forgotten binomials. Phytoneuron
  2012-28: 1–64. http://www.phytoneuron.net/PhytoN-Divulgation.pdf
Scharf, Sara (2008). Multiple independent inventions of a non-functional technol-
  ogy: Combinatorial descriptive names in botany, 1640–1830. Spontaneous Gener-
  ations, 2(1):145–184. http://spontaneousgenerations.library.utoronto.
  ca/index.php/SpontaneousGenerations/article/view/3552
Scoble, Malcolm J. (2008). Networks and their role in e-taxonomy. In: Wheeler,
  Quentin D. (editor), The New Taxonomy. New York: CRC Press, 19–31.
Siddharthan, Advaith and Teufel, Simone (2007). Whose idea was this, and why does it
  matter? Attributing scientific work to citations. Proceedings, Human Language Tech-
  nologies 2007: The Conference of the North American Chapter of the Association for
  Computational Linguistics, Rochester, NY, 316–323.
Smith, Vincent S.; et 4 al. (2009). Scratchpads: a data-publishing framework to build,
  share and manage information on the diversity of life. BMC Bioinformatics, 10(Suppl
  14):S6. http://www.biomedcentral.com/1471-2105/10/S14/S6
Swanson, Don R. (1986). Fish oil, Raynaud’s Syndrome, and undiscovered public
  knowledge. Perspectives in Biology and Medicine, 30, 7–18.
Swanson, Don R. (1988). Migraine and magnesium: Eleven neglected connections. Per-
  spectives in Biology and Medicine, 31, 526–557.
Swanson, Don R. (1990). Somatomedin C and Arginine: Implicit connections between
  mutually isolated literatures. Perspectives in Biology and Medicine, 33, 157–186.
Swanson, Don R. (1993). Intervening in the life cycles of scientific knowledge. Library
  Trends, 41(4), 606–631.
Teufel, Simone; Kan, Min-Yen (2011). Robust argumentative zoning for sensemaking
  in scholarly documents. In: Bernardi, Raffaella et 4 al. (editors) Advanced Language
  Technologies for Digital Libraries, Lecture Notes in Computer Science, Volume
  6699, 154–170.