<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Extracting and matching patent in-text references to scientific publications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Suzan Verberne</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Chios</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leiden Institute of Advanced Computer Science, Leiden University</institution>
          ,
          <addr-line>Leiden</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>References in patent texts to scientific publications are valuable for studying the links between science and technology but are difficult to extract. This paper tackles this challenge: specifically, we extract references embedded in USPTO patent full texts and match them to Web of Science (WoS) publications. We approach the reference extraction problem as a sequence labelling task, training CRF and Flair models. We then match references to the WoS using regular expression patterns. We train and evaluate the reference extraction models using cross-validation on a sample of 22 patents with 1,952 manually annotated in-text references. Then we apply the models to a large collection of 33,338 biotech patents. We find that CRF obtains better results on citation extraction than Flair, with precision scores of around 90% and recall of around 85%. However, Flair extracts many more references from the large collection than CRF, and more of those can be matched to WoS publications. We find that 88% of the extracted in-text references are not listed on the patent front page, suggesting distinct roles played by in-text and front-page references. CRF and Flair collectively extract 603,457 references to WoS publications that are not listed on the front page. Relative to the 1.17 million front-page references in the collection, this is a 51% increase in identified patent–publication links compared with relying on front-page references alone.</p>
      </abstract>
      <kwd-group>
        <kwd>citations</kwd>
        <kwd>patents</kwd>
        <kwd>sequence labelling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Scientific non-patent references (sNPRs, i.e., references in patents to scientific
literature) provide a paper trail of the knowledge flow from science to
technological innovation. They have wide applications for science and innovation studies,
science policy, and innovation strategy [
        <xref ref-type="bibr" rid="ref1 ref12 ref15 ref19 ref21 ref6 ref8">15, 8, 12, 6, 1, 19, 21</xref>
        ]. However,
current practice relies exclusively on patent front-page references and neglects the
more difficult patent in-text references. Front-page references are the references
listed on the front page of the patent document, which are deemed relevant
prior art for assessing patentability by inventors, patent attorneys, or
examiners. In-text references are references embedded in the patent text, serving a very
similar role to references in scientific publications. Because of their different
generation processes, front-page and in-text references embody different information
and have low overlap [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Furthermore, several recent studies have suggested
that in-text references are a better indicator of knowledge flow than front-page
references [
        <xref ref-type="bibr" rid="ref14 ref3 ref4">14, 3, 4</xref>
        ].
      </p>
      <p>While patent front-page references are readily retrievable from the
metadata of patents, in-text references are part of the unstructured, running text.
Therefore, identifying the start and end of a reference is a non-trivial task.
Furthermore, patent in-text references are shorter and contain less information than
front-page references (e.g., the title of the publication is typically not included),
adding to the difficulty of matching in-text references to publications. For
example, the USPTO patent US8158424B2, "Primate pluripotent stem cells cultured
in medium containing gamma-aminobutyric acid, pipecolic acid and lithium",
cites a publication twice in the patent text: once as "Chan et al., Nat. Biotech.
27:1033-1037 (2009)" and the second time as "Chan et al. Nat. Biotech 2009 Nov.
27(11):1033-7". This reference also appears as a front-page reference with more
information: "Chan et al., Live cell imaging distinguishes bona fide human iPS cells
from partially reprogrammed cells, Nature, Biotechnology, vol. 27, pp. 1033-1037,
(2009)". However, most in-text references do not appear on the front page
and need to be extracted from the running text.</p>
      <p>In this paper, we take up the challenges of (1) extracting references from
patent texts and (2) matching the extracted references to a publication database.
The second step (matching) is required because we need to uniquely identify
the publications referenced in the patent for further research into the relation
between science and industry.</p>
      <p>We approach the problem of extracting in-text references as a sequence
labelling task, similar to named entity recognition (NER). Sequence labelling in
this regard is a supervised learning process in which each word in the text is
labelled as being outside or inside a reference. We create a manually labelled
training corpus and train two sequence labelling models on this corpus. We
apply the models to a large corpus of 33,338 USPTO biotech patents to extract all
scientific references. Once extracted, we match the extracted references to the
Web of Science (WoS) database of scientific publications in a rule-based
manner using regular expressions and pattern matching. We address the following
research questions:
1. With what accuracy can in-text references be extracted using sequence
labelling models?
2. What proportion of automatically extracted in-text references can with
certainty be matched to a publication database?
3. What is the overlap between patent in-text and front-page references, and
how many additional references do we discover from the full text?
We make the following contributions: (1) we deliver a solution for the challenging
and unsolved problem of extracting in-text references from patents, including an
annotated corpus of 22 patents (https://github.com/tmleiden/citation-extraction-with-flair);
(2) we show that a large number of extracted
references can be matched to the WoS publication database; (3) we show that in the
biotech domain, there is a substantial number of in-text references to scientific
papers that are not listed on the front page of the patent. The extraction of those
in-text references will advance research into the interaction between science and
innovation.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Matching patent front-page references Prior studies of sNPRs primarily use
patent front-page references. Although front-page references are relatively easy
to extract, matching them to individual publications is not a trivial task, as
these references do not have a consistent format, often miss important
information (e.g., author names, publication title, journal name, issue, volume, and
page numbers), and are prone to errors. A number of approaches for matching
front-page references to scientific publications have been proposed. Typically,
the reference string is first parsed into the relevant fields: publication year, last
name of the first author, journal title, volume, issue, beginning page, and
article title [
        <xref ref-type="bibr" rid="ref10 ref22 ref6">6, 22, 10</xref>
        ]. Then the identified fields are matched to the metadata fields of
Web of Science (WoS) publications. The title is matched using string similarity
metrics such as relative Levenshtein distance [
        <xref ref-type="bibr" rid="ref10 ref13 ref22">22, 13, 10</xref>
        ].
      </p>
      <p>
        Yang [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] reports a precision above 99% and recall above 95%, depending
on different text similarity score thresholds. Knaus and Palzenberger [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] use
the Solr full-text search engine to retrieve WoS publications for all PATSTAT
front-page references. They report a precision of 99%, and a recall of 96% and
92%, for EPO and USPTO respectively, requiring matches in at least three fields.
Marx and Fuegi [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] report recall ranging from 76% to 92% and precision from
100% to 75% for different thresholds of matching scores. These methods are
not directly applicable for extracting or matching patent in-text references, for
two reasons. First, while front-page references are readily retrievable from the
metadata in patent records, in-text references are embedded in the full text of the
patent without consistent structural cues. Second, in-text references are shorter
than front-page references. In particular, publication titles are rarely included,
precluding the use of string similarity metrics for title overlap.
      </p>
      <p>[Figure 1: overview of the pipeline. A large patent collection (HTML) is converted to plain text with html2text; front-page references are extracted from the HTML; a sequence labelling model is trained and tested on the annotated patent collection and then used to label the large collection; the resulting in-text references and the front-page references for all patents are deduplicated and matched to the WoS publication database, yielding a list of scientific publications that are cited in the patent full text and not on the front page.]</p>
      <p>
        The WoS covers more than 10 thousand journals, but only a very small share of them are cited by
patents: around 5% of WoS publications are cited on the front page [
        <xref ref-type="bibr" rid="ref1 ref19">1, 19</xref>
        ], and
for the 244 prominent journals, 10% of publications are cited on the front page or
in-text [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>Our methods entail the following steps: data pre-processing and annotation,
reference extraction, reference matching, and reference filtering, as illustrated in
Figure 1. We explain these steps in Sections 3.2–3.5, after we first
introduce our data in Section 3.1.</p>
      <sec id="sec-3-1">
        <title>Data</title>
        <p>We downloaded two collections of patent HTML files from Google Patents: (1)
as training data for manual annotation, we compiled a small collection of 22
patents with IPC class C12N, published in 2010 (these 22 were randomly selected
from the complete set of 2,365 patents with class C12N from 2010; we annotated
the patents one by one until we approached 2,000 references); (2) as full domain
collection we obtained a larger set of patents from the biotech domain, published
in the years 2006–2010. For this second set we searched for all IPC classes
associated with the biotech domain according to the OECD definition.3 The
result is a collection of 33,338 patents.</p>
        <p>The publication database comes from Web of Science (WoS), consisting of
the metadata for 22,928,875 journal articles published between 1980 and 2010
(excluding book series and non-articles, e.g., reviews, letters, and notes). Included
in this database is also a table of 19,200 journals with titles, abbreviated titles,
and unique identifiers. The same unique identifiers are used in the database of
publications to refer to the journal in which a paper was published.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Pre-processing and annotation</title>
        <p>We converted the patent HTML sources to plain text using the Python package
BeautifulSoup. We extracted all text in the HTML tags `p', `h1', `h2', `h3', and
`heading', excluding the text inside the tags `style', `script', `head', `title', and
`meta'. We manually annotated all in-text references in the 22 patents in the
training set using the BRAT annotation tool.4 The 22 patents contain 1,952
in-text references altogether.5</p>
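        <p>The conversion step above can be sketched in a few lines. We used BeautifulSoup; the standard-library html.parser version below is our own simplified illustration with the same keep/drop tag lists (class and function names are ours):</p>

```python
from html.parser import HTMLParser

KEEP = {"p", "h1", "h2", "h3", "heading"}            # tags whose text is kept
DROP = {"style", "script", "head", "title", "meta"}  # tags skipped entirely

class PatentTextExtractor(HTMLParser):
    """Collect the text of KEEP tags while ignoring everything inside DROP tags."""
    def __init__(self):
        super().__init__()
        self.depth_keep = 0
        self.depth_drop = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self.depth_keep += 1
        elif tag in DROP:
            self.depth_drop += 1

    def handle_endtag(self, tag):
        if tag in KEEP and self.depth_keep:
            self.depth_keep -= 1
        elif tag in DROP and self.depth_drop:
            self.depth_drop -= 1

    def handle_data(self, data):
        # only keep text that is inside a KEEP tag and outside all DROP tags
        if self.depth_keep and not self.depth_drop:
            text = data.strip()
            if text:
                self.chunks.append(text)

def html_to_text(html: str) -> str:
    parser = PatentTextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```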
        <p>We converted the annotated files to IOB format, the required format for
sequence labelling methods. In IOB, each word in the text has a label B, I, or O:
B means that the word is the beginning of an entity (in our case a reference), I
means that the word is inside an entity, and O means that the word is not part
of an entity. Figure 2 shows an example of IOB markup for a brief span of text
from a patent in our hand-coded set.</p>
        <p>Table 1 shows the CRF feature set. Current token: lowercased word (string),
part-of-speech tag (string), the last 3 characters (string), the last 2 characters
(string), is uppercase (boolean), starts with capital (boolean), is a number
(boolean), is punctuation (boolean), is a year (boolean, pattern match), is name
(boolean, list lookup), is page number (boolean, pattern match). Context tokens
(left 2, right 2): lowercased word (string), part-of-speech tag (string), is uppercase
(boolean), starts with capital (boolean), is a number (boolean), is punctuation
(boolean).</p>
        <p>One difference between our problem and
named entity recognition tasks is that the references are longer than common
entity types (names, places). We are interested to see to what extent the sequence
labelling models can cope with these long spans.</p>
        <p>3 Query used on Google Patents: (((A01H1/00) OR (A01H4/00) OR (A61K38/00) OR
(A61K39/00) OR (A61K48/00) OR (C02F3/34) OR (C07G11/00) OR (C07G13/00)
OR (C07G15/00) OR (C07K4/00) OR (C07K14/00) OR (C07K16/00) OR
(C07K17/00) OR (C07K19/00) OR (C12M) OR (C12N) OR (C12P) OR (C12Q) OR
(C12S) OR (G01N27/327) OR (G01N33/53) OR (G01N33/54) OR (G01N33/55) OR
(G01N33/57) OR (G01N33/68) OR (G01N33/74) OR (G01N33/76) OR (G01N33/78)
OR (G01N33/88) OR (G01N33/92) )) country:US before:publication:20101231
after:publication:20060101 status:GRANT language:ENGLISH type:PATENT
4 http://brat.nlplab.org/
5 The labelled data and our processing scripts are available at https://github.com/
tmleiden/citation-extraction-with-flair</p>
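        <p>The conversion of annotated reference spans to IOB labels can be sketched as follows. This is a hypothetical simplification of the BRAT-to-IOB step: the token representation (word with character offsets) and the helper names are our own assumptions:</p>

```python
def span_index(start, end, ref_spans):
    """Return the index of the reference span containing [start, end), or None."""
    for i, (s, e) in enumerate(ref_spans):
        if start >= s and end <= e:
            return i
    return None

def to_iob(tokens, ref_spans):
    """tokens: list of (word, start, end); ref_spans: list of (start, end) references."""
    labels, prev = [], None
    for word, start, end in tokens:
        idx = span_index(start, end, ref_spans)
        if idx is None:
            labels.append("O")       # outside any reference
        elif idx == prev:
            labels.append("I")       # continuing the same reference
        else:
            labels.append("B")       # first token of a new reference
        prev = idx
    return labels
```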
      </sec>
      <sec id="sec-3-3">
        <title>Reference extraction</title>
        <p>We experimented with two sequence labelling methods for reference extraction,
both originally developed for named entity recognition: Conditional Random
Fields (CRF) and the Flair framework.</p>
        <p>
          CRF Conditional Random Fields (CRF) is a traditional sequence labelling
method based on manually defined features [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The model finds the
optimal sequence of labels (IOB) for each sentence. Each word is represented by
a feature vector describing the word form, its part of speech (noun, verb, etc.),
and its left and right context. For part-of-speech tagging we used the
maxent treebank POS tagger of NLTK in Python. We used the CRFsuite
implementation in sklearn-crfsuite for training the sequence labelling classifier on the
hand-labelled data.6 We extended the default feature set of CRFsuite with a
few reference-specific features, such as the explicit recognition of page number
patterns. We included features for the 2 words before the current word and the 2
words after. The feature set is shown in Table 1.
        </p>
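        <p>The features of Table 1 can be sketched as a per-token feature function of the kind sklearn-crfsuite consumes, one dictionary per token. This is an illustrative simplification: the regular expressions below are our own guesses at the year and page-number patterns, and the name-list lookup is omitted:</p>

```python
import re
import string

# Assumed patterns, not the exact CRFsuite features from the paper.
YEAR = re.compile(r"^\(?(19[89]\d|20[01]\d)\)?[.,;]?$")   # e.g. "(2009)"
PAGES = re.compile(r"^\d{1,4}[-:]\d{1,4}[.,;]?$")          # e.g. "1033-1037"

def token_features(tokens, i):
    """Feature dictionary for token i, with context features for +/- 2 tokens."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word[-3:]": word[-3:],
        "word[-2:]": word[-2:],
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),     # starts with a capital
        "word.isdigit": word.isdigit(),
        "word.ispunct": all(c in string.punctuation for c in word),
        "word.isyear": bool(YEAR.match(word)),
        "word.ispages": bool(PAGES.match(word)),
    }
    # context features for the 2 tokens to the left and right
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"{offset}:word.lower"] = tokens[j].lower()
            feats[f"{offset}:word.isdigit"] = tokens[j].isdigit()
    return feats
```

A sentence is then represented as `[token_features(tokens, i) for i in range(len(tokens))]` before being passed to the CRF trainer.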
        <p>
          One potential limitation of CRF for reference extraction is its limited context
size. This motivates the use of a method that takes a larger context into account
when labelling the sequence:
Flair The Flair framework is the current state-of-the-art method for named
entity recognition (NER) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Flair combines a BiLSTM-CRF sequence labelling
model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] with pre-trained word embeddings. The Flair embeddings capture
latent syntactic-semantic information which contextualizes words by the
surrounding text. One advantage of this is that the same word will have different
embedding representations depending on its contextual use. In addition, the
context used in Flair embeddings is a complete paragraph (or sentence, depending
on the input format), which provides more information for the relatively long
references than the limited context of CRF. The Flair framework is available
online, including pre-trained models.7 The purpose of the pre-trained models is
that the labelled data can be relatively small, because transfer learning is applied
from the pre-trained model. It was shown in previous work that the knowledge
of pre-trained language models transfers across domains and that for small
labelled datasets the use of pre-trained embeddings has a larger impact [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
6 https://sklearn-crfsuite.readthedocs.io/en/latest/
        </p>
        <p>Flair processes the text input line by line. Unfortunately, long input lines in
the training data cause the Flair framework to run out of memory.8 To work
around this, the developers advise splitting paragraphs into sentences.
Unfortunately, standard sentence splitting packages (NLTK, spaCy) erroneously split
sentences in the middle of references because of punctuation marks in the
reference text. Therefore, we decided to split sentences in the training data using the
following procedure: we added a sentence split between each occurrence of a full
stop and a capital letter, but only when both tokens have the O label. This way,
we did not split in the middle of references. In addition, we used a minimum
length of 20 tokens (to prevent non-sentences from being split into short bits) and a soft
maximum of 40 (to prevent memory overload).9 In the test data, however, we
kept the full paragraphs instead of splitting sentences using the IOB information,
because this would leak ground-truth labels to the test setting.</p>
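        <p>The splitting procedure can be sketched as follows. This is our own simplified rendering; in particular, the handling of the soft maximum (splitting at the next O-labelled token once 40 tokens are exceeded) is an assumption:</p>

```python
def split_sentences(tokens, labels, min_len=20, soft_max=40):
    """Split a token/label sequence into sentences without cutting references."""
    sentences, current = [], []
    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        current.append(tok)
        # a legal boundary: full stop with label O, followed by a
        # capitalised token that also has label O
        next_ok = (i + 1 < len(tokens)
                   and tokens[i + 1][:1].isupper()
                   and labels[i + 1] == "O")
        boundary = tok.endswith(".") and lab == "O" and next_ok
        # split at a legal boundary once min_len is reached, or at any
        # O-labelled token once the soft maximum is exceeded
        if (boundary and len(current) >= min_len) or \
           (len(current) >= soft_max and lab == "O"):
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences
```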
        <p>
          We evaluated Flair with two different embeddings models, both provided in
the Flair framework distribution: the GloVe embeddings for English named
entity recognition [
          <xref ref-type="bibr" rid="ref17 ref20">17, 20</xref>
          ], and the English-language Flair embeddings that were
trained on the 1-billion-word corpus by Chelba et al. [
          <xref ref-type="bibr" rid="ref2 ref7">7, 2</xref>
          ].10 We adopted most
parameter settings from earlier work on BiLSTM-CRF models for named entity
recognition [
          <xref ref-type="bibr" rid="ref11 ref2 ref9">9, 11, 2</xref>
          ]. For the learning rate (LR), Flair uses an annealing method
that halves the parameter value after 5 epochs, based on the training loss.
Starting from an LR of 0.1, we assume that our model will converge (and it does) as a
result of the annealing process.
        </p>
        <p>Post-processing We added a post-processing step after running the sequence
labelling models, because sometimes multiple references are concatenated into
one. This happens more often in the Flair output than in the CRF output (i.e.,
the beginning of a reference is not always marked by `B'). Our post-processing
script fixes this by splitting references on `;' if there are multiple years in the
reference string with a semicolon in between.
7 https://github.com/zalandoresearch/flair
8 This out-of-memory error is reported as a known issue by the Flair developers and
will be solved in a future release: https://github.com/zalandoresearch/flair/issues/685
9 Our sentence splitting script is available at https://github.com/tmleiden/citation-extraction-with-flair
10 Information on the embeddings models in Flair can be found at https://github.com/
zalandoresearch/flair/blob/master/resources/docs/TUTORIAL_3_WORD_EMBEDDING.md</p>
        <p>Evaluation We evaluated CRF and Flair with five-fold cross validation. We
split references in such a way that (a) references from the same patent are kept
together in the same partition (in order to prevent over-fitting caused by similar
reference contexts in the same patent); and (b) the number of references is equally
distributed between the partitions. Of the five partitions, three are used for
training, one for validation (learning rate annealing is based on the validation
set loss), and one for test, in five rotating runs.</p>
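        <p>The post-processing rule described earlier, splitting concatenated references on `;' when years occur on both sides of it, can be sketched as follows (a simplified reading of the rule; the year pattern is our own assumption):</p>

```python
import re

# Assumed year pattern (1980-2019), not the paper's exact expression.
YEAR = re.compile(r"\b(19[89]\d|20[01]\d)\b")

def split_concatenated(ref: str):
    """Split a predicted reference on ';' only if every part contains a year,
    i.e. only when two (or more) references were concatenated into one."""
    parts = [p.strip() for p in ref.split(";")]
    if len(parts) > 1 and all(YEAR.search(p) for p in parts):
        return parts
    return [ref]
```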
        <p>As evaluation metrics, we report precision and recall for the B and I labels, as
well as for the complete references. For the complete references, we do substring
matching, where precision is defined as the proportion of predicted references
that are a substring of a true reference, and recall is defined as the proportion
of true references that are found as a substring of at least one predicted reference.
The substring matching ensures that the presence or absence of punctuation
marks at the end of reference strings does not influence the comparison.</p>
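        <p>The substring-based metrics can be sketched as follows, implementing the two definitions as literally stated (the function name is illustrative):</p>

```python
def substring_precision_recall(predicted, true):
    """precision: share of predicted strings that are a substring of some true
    reference; recall: share of true references that occur as a substring of
    at least one predicted string."""
    tp_pred = sum(any(p in t for t in true) for p in predicted)
    tp_true = sum(any(t in p for p in predicted) for t in true)
    precision = tp_pred / len(predicted) if predicted else 0.0
    recall = tp_true / len(true) if true else 0.0
    return precision, recall
```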
      </sec>
      <sec id="sec-3-4">
        <title>Reference matching</title>
        <p>
          A few examples of automatically extracted in-text references, illustrating their
formats, are:
- Geysen et al., J. Immunol. Meth., 102:259-274 (1987)
- Altschul (1990) J. Mol. Biol. 215:403-410.
- Caohuy, and Pollard, J. Biol. Chem. 277 (28), 25217-25225 (2002);
- D. Hess, Intern Rev. Cytol., 107:367 (1987)
Matching these references to the WoS involves two steps. First, similar to prior
work on front-page reference matching [
          <xref ref-type="bibr" rid="ref10 ref22 ref6">6, 22, 10</xref>
          ], we analyzed and parsed all
extracted references and stored separate fields: first author, second author, year,
journal title, volume/issue, and page numbers. Second, we matched these fields
to WoS publications. We counted the number of matching fields to determine the
strength of the match. For efficiency, we only read the WoS database once and
searched for all potentially matching extracted references while reading. These
are the main steps of our matching process:
1. The set Re contains all extracted reference strings. For each r ∈ Re:
(a) Skip r if it does not contain one of the years 1980–2010;
(b) Parse r to extract: last name of the first author (author_r), last name of the
second author, publication year (year_r), journal title (journal_r),
volume/issue, and page number;
(c) Try to match journal_r to the journal database using the abbreviated
title variants in the WoS. For all references from which journal_r could
be extracted and matched, store the reference per journal id (the set
Rj); for all references from which the journal id could not be deduced,
store the reference per author (the set Ra).
2. Match references to the publication database:
(a) Per year, read the corresponding WoS publication database. For each
publication record p from this database:
i. Find the references in Rj that have the same journal id as p; find the
references in Ra that have the same first author as p; store them as
references with a possible match: the set of tuples Rm: (r, p).
ii. For each (r, p) ∈ Rm, count the number of additional matching fields.
        </p>
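        <p>The parsing in step 1(b) can be sketched with regular expressions such as the following. The patterns and field names are illustrative simplifications, not the exact rules; journal-title matching against the WoS abbreviation table is omitted:</p>

```python
import re

# Illustrative patterns, assumed for this sketch.
YEAR = re.compile(r"\((\d{4})\)|\b(19[89]\d|20[01]\d)\b")
PAGES = re.compile(r"\b(\d{1,4})\s*[-–]\s*(\d{1,4})\b")
VOLUME = re.compile(r"\b(\d{1,4})\s*[:(]")
AUTHOR = re.compile(r"^([A-Z][\w'-]+)")

def parse_reference(ref: str) -> dict:
    """Extract year, page range, volume, and first-author fields from a
    reference string; fields that cannot be found are simply absent."""
    fields = {}
    m = YEAR.search(ref)
    if m:
        fields["year"] = m.group(1) or m.group(2)
    m = PAGES.search(ref)
    if m:
        fields["pages"] = (m.group(1), m.group(2))
    m = VOLUME.search(ref)
    if m:
        fields["volume"] = m.group(1)
    m = AUTHOR.search(ref)
    if m:
        fields["first_author"] = m.group(1)
    return fields
```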
        <p>The maximum number of matched fields is 6: publication year,
journal, pages, issue/volume, first author, and second author.11
(b) For each r ∈ Rm, identify the best match:
- Find p with the highest number of matching fields. If at least 4 fields
match, it is a strong match; if fewer fields match, it is a weak
match. If there are multiple publications with the highest number of
matching fields:</p>
        <p>- If r contains page numbers, match the p that has the same page
numbers;
- If r contains page numbers but there is no p with the same page
numbers, the reference does not exist in the database;</p>
        <p>- If r does not have page numbers, the reference is ambiguous.
(c) If there is no publication by author_r in year_r, or by journal_r in year_r,
the reference does not exist in the database.</p>
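        <p>The field-counting step above can be sketched as follows; the field names and record layout are illustrative, while the 4-field threshold for a strong match follows the rule stated above:</p>

```python
MATCH_FIELDS = ("year", "journal", "pages", "volume_issue",
                "first_author", "second_author")

def match_strength(ref_fields: dict, wos_record: dict):
    """Count fields that were parsed from the reference AND agree with the
    WoS record; 4 or more matching fields count as a strong match."""
    matched = sum(
        1 for f in MATCH_FIELDS
        if f in ref_fields and ref_fields.get(f) == wos_record.get(f)
    )
    return matched, ("strong" if matched >= 4 else "weak")
```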
      </sec>
      <sec id="sec-3-5">
        <title>Reference filtering</title>
        <p>We extracted all front-page references from the patent HTML, in the metadata
fields with the name attribute citation reference, using the Python package
BeautifulSoup. Then we searched whether each in-text reference is also listed on
the front page of the same patent, by looking up its first author and publication
year in the list of front-page references. This gives us information on how many
additional references we can retrieve by taking the full text into account.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>We present results for reference extraction using the sample of 22 manually
annotated patents (Section 4.1) and reference matching (Section 4.2), and then
statistics for reference extraction and matching on the large patent collection
(Section 4.3) and for the combination of CRF and Flair (Section 4.4).</p>
      <sec id="sec-4-1">
        <title>Reference extraction</title>
        <p>Cross-validation results for CRF and Flair are reported in Table 2. For Flair,
we found that the Flair embeddings reach higher precision and recall than the
GloVe embeddings, but Flair with the GloVe embeddings is 30 to 40 times faster
than Flair with the Flair embeddings. Flair extracts more references than CRF,
and a bit more than the ground truth (1,952). CRF outperforms Flair in terms
of precision and recall on the B-labels and on the complete references.
11 The first author can also be a fuzzy match with an edit distance of 1. This matches
author names that have a slight spelling variant in the publication database (typically
a missing hyphen, e.g., SCHAEFERRIDDER for Schaefer-Ridder) or a misspelling
in the reference (e.g., DEVEREUX vs. Devereaux).</p>
        <p>Table 3 (manual check of matched and unmatched references):
Matched: 136 (100.0%), of which
- true positive, correctly matched: 117 (86.0%)
- ambiguous reference text (too little information): 1 (0.7%)
- error in reference text (e.g., cites wrong journal, author, year): 3 (2.2%)
- error in reference extraction (e.g., partial or multiple references): 1 (0.7%)
- publication not in database (e.g., not a journal): 14 (10.3%)
Unmatched: 275 (100.0%), of which
- true negative, publication not in database (e.g., not a journal): 161 (58.5%)
- ambiguous reference text (too little information): 47 (17.1%)
- error in reference text (e.g., cites wrong journal, author, year): 16 (5.8%)
- error in reference parsing and matching: 34 (12.4%)
- error in reference extraction (e.g., partial or multiple references): 17 (6.2%)</p>
      </sec>
      <sec id="sec-4-2">
        <title>Reference matching</title>
        <p>
          To evaluate the performance of our reference matching method, we manually
checked 136 matched and 275 unmatched references. Table 3 shows the results of
this analysis: 86.0% of matched references are true positives, and 58.5% of unmatched
references are true negatives. Our reference extraction method is responsible
for only 0.7% of false positives and 6.2% of false negatives, while our reference matching
method is responsible for 10.3% of false positives and 12.4% of false negatives. It is
important to note that even if all in-text references were extracted perfectly with
complete information, we could not expect all of them to be matched to WoS
records. Callaert et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] found that only 58% of patent front-page references
are scientific. Our WoS database only includes journal articles and is a subset
of what they consider scientific. In addition, the coverage of the WoS is known
to be incomplete. This incomplete coverage contributes to matching
errors.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Application to the large collection of Biotech patents</title>
        <p>Statistics on extracting and matching patent in-text references from the large
patent collection are reported in Table 4. The Flair results were obtained using
the GloVe embeddings, because the Flair embeddings were 30 to 40 times slower
in generating output (see Section 4.1).</p>
        <p>Table 4 (extraction and matching on the large collection):
- extracted in-text references from 1980–2010: CRF 519,562 (100%); Flair 1,233,095 (100%)
- extracted in-text references that can be parsed: CRF 484,085 (93.2%); Flair 1,126,676 (91.4%)
- extracted in-text references with a definite match in WoS: CRF 174,899 (33.7%); Flair 671,317 (54.4%)
- extracted in-text references with a definite match and not on the front page: CRF 125,631; Flair 493,583</p>
        <p>Table 5 presents the breakdown of the
unmatched in-text references. The number of patents in the collection is 33,338;
altogether they have 1,174,661 front-page references. CRF extracts 125,631
in-text references that are not on the front page; Flair extracts many more: 493,583.</p>
        <p>Tables 4 and 5 show large differences between CRF and Flair. A much larger
number of the references extracted by Flair can be matched to WoS than of those
extracted by CRF. For CRF, the majority of references without a definite match are
ambiguous, meaning that they have multiple possible matches in WoS. One example is
"Sunamoto et al. (Bull. Chem. Soc. Jpn., 1980, 53,". There are five records in
WoS with Sunamoto as the first author, multiple other authors, and the same
journal, year, and volume, three of which even appear in the same issue. Without
additional information about the other authors, issue, and page numbers, it is
impossible to know which publication is actually being referenced. Further analysis
of these cases indicated that the ambiguity occurs because the disambiguating
information is not part of the reference text extracted by CRF: for the majority
(72%) of references extracted by CRF, only 2 of the 5 most important fields (first
author, year, journal id, issue, page numbers) can be extracted by parsing
the references, while for the majority (72%) of references extracted by Flair, at
least 4 of those fields can be extracted.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Combining the output of Flair and CRF</title>
        <p>To assess the total overlap between in-text and front-page references and the added
information value of in-text references, we combine the Flair and CRF outputs. Flair
and CRF collectively extracted 686,956 in-text references from the large patent
collection that could be matched to WoS publications, and 603,457 (88%) of
those are not listed on the patent front page.</p>
        <p>
          The collection of 33,338 biotech patents contains 1,174,661 non-patent
front-page references in total. The additionally retrieved 603,457 in-text references
constitute a 51% increase in identified patent–publication links, which is a
conservative estimate considering that only around 58% of front-page references are
actually scientific [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This paper tackles the challenge of extracting and matching patent in-text
references to scientific publications. We approach the reference extraction problem as
a sequence labelling task using CRF and Flair. We solve the reference matching
problem in a rule-based manner, using regular expressions to extract
publication fields and then matching them to the Web of Science database. Specifically,
we trained the models and developed the patterns on a small, manually labelled
sample of 22 patents with 1,952 references. Then we applied the models to a
large collection of 33,338 biotech patents.</p>
      <p>(RQ1) We trained two supervised models on the manually annotated
sample. CRF achieved the best results in cross validation: for the individual B and I
labels, precision scores are 89% and 91% respectively, and recall scores 84% and
87% respectively. For complete references, precision is 83% and recall 81%. The
state-of-the-art sequence labelling method Flair did not beat CRF on most of
the evaluation metrics (only on recall for the I-labels). This is probably due to a
mismatch between train and test settings for sentence splitting in Flair,
necessitated by known memory issues of the framework. We are currently investigating
whether we can improve our Flair model by fine-tuning the language model on
domain data (i.e., biotech patents).</p>
      <p>(RQ2) Our method is able to match a large number of the extracted
references to WoS publications. CRF extracted 519,562 in-text references from the
years 1980–2010 from the large patent collection, 33.7% (172,899) of which had
a definite match in the WoS publication database. Flair extracted many more
references (1,233,095), and 54.4% of them (671,317) had a definite match. Thus,
although Flair is less exact in extracting references than CRF, it extracts more
references, and more of those can be matched to WoS publication records,
because the extracted strings are more complete (reflected by a higher recall for
the I-labels).</p>
      <p>(RQ3) Flair and CRF collectively matched 686,956 in-text references from
the large patent collection to WoS, and 603,457 (88%) of those are not listed
on the patent front page. These additionally retrieved references constitute a
substantial increase (51%) compared to the set of front-page references. These
findings highlight the added value of patent in-text references for studying the
interaction between science and innovation.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>We thank Yuanmin Xu for making the manual annotations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ahmadpoor</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>B.F.</given-names>
          </string-name>
          :
          <article-title>The dual frontier: Patented inventions and prior scientific advance</article-title>
          .
          <source>Science</source>
          <volume>357</volume>
          (
          <issue>6351</issue>
          ),
          <fpage>583</fpage>
          –
          <lpage>587</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Akbik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blythe</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vollgraf</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Contextual string embeddings for sequence labeling</article-title>
          .
          <source>In: Proceedings of the 27th International Conference on Computational Linguistics</source>
          . pp.
          <fpage>1638</fpage>
          –
          <lpage>1649</lpage>
          . Association for Computational Linguistics, Santa Fe, New Mexico, USA (Aug
          <year>2018</year>
          ), https://www.aclweb.org/anthology/C18-1139
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bryan</surname>
            ,
            <given-names>K.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ozcan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>The impact of open access mandates on invention</article-title>
          .
          <source>Mimeo</source>
          , Toronto (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bryan</surname>
            ,
            <given-names>K.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ozcan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sampat</surname>
            ,
            <given-names>B.N.</given-names>
          </string-name>
          :
          <article-title>In-text patent citations: A user's guide</article-title>
          .
          <source>Tech. rep., National Bureau of Economic Research</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Callaert</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grouwels</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Looy</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Delineating the scientific footprint in technology: Identifying scientific publications within non-patent references</article-title>
          .
          <source>Scientometrics</source>
          <volume>91</volume>
          (
          <issue>2</issue>
          ),
          <fpage>383</fpage>
          –
          <lpage>398</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Callaert</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vervenne</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Looy</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magerman</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jeuris</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Patterns of science-technology linkage</article-title>
          .
          <source>Tech. rep., European Commission</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Chelba</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuster</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ge</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brants</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koehn</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robinson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>One billion word benchmark for measuring progress in statistical language modeling</article-title>
          .
          <source>arXiv preprint arXiv:1312.3005</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fleming</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sorenson</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Science as a map in technological search</article-title>
          .
          <source>Strategic Management Journal</source>
          <volume>25</volume>
          (
          <issue>8-9</issue>
          ),
          <fpage>909</fpage>
          –
          <lpage>928</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bidirectional LSTM-CRF models for sequence tagging</article-title>
          .
          <source>arXiv preprint arXiv:1508.01991</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Knaus</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palzenberger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Parma: A full text search based method for matching non-patent literature citations with scientific reference databases. A pilot study</article-title>
          .
          <source>Tech. rep., Max Planck Digital Library</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballesteros</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawakami</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Neural architectures for named entity recognition</article-title>
          .
          <source>arXiv preprint arXiv:1603.01360</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Azoulay</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sampat</surname>
            ,
            <given-names>B.N.</given-names>
          </string-name>
          :
          <article-title>The applied value of public investments in biomedical research</article-title>
          .
          <source>Science</source>
          <volume>356</volume>
          (
          <issue>6333</issue>
          ),
          <fpage>78</fpage>
          –
          <lpage>81</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Marx</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuegi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Reliance on science in patenting</article-title>
          . SSRN: https://ssrn.com/abstract=3331686 (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Nagaoka</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamauchi</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>The use of science for inventions and its identification: Patent level evidence matched with survey</article-title>
          .
          <source>Research Institute of Economy, Trade and Industry (RIETI)</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Narin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamilton</surname>
            ,
            <given-names>K.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olivastro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>The increasing linkage between US technology and public science</article-title>
          .
          <source>Research Policy</source>
          <volume>26</volume>
          (
          <issue>3</issue>
          ),
          <fpage>317</fpage>
          –
          <lpage>330</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Okazaki</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>CRFsuite: a fast implementation of conditional random fields (CRFs)</article-title>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          . pp.
          <fpage>1532</fpage>
          –
          <lpage>1543</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ammar</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhagavatula</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Power</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Semi-supervised sequence tagging with bidirectional language models</article-title>
          .
          <source>arXiv preprint arXiv:1705.00108</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Poege</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harhoff</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaessler</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baruffaldi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Science quality and the value of inventions</article-title>
          .
          <source>arXiv preprint arXiv:1903.05020</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Reimers</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurevych</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging</article-title>
          .
          <source>arXiv preprint arXiv:1707.09861</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Veugelers</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Scientific novelty and technological impact</article-title>
          .
          <source>Research Policy</source>
          <volume>48</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1362</fpage>
          –
          <lpage>1372</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Linking Science and Technology: Reference Matching for Co-citation Network Analysis</article-title>
          .
          <source>Master's thesis</source>
          , Leiden University (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>