<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Improving reference mining in patents with BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ken Voskuil</string-name>
          <email>k.s.voskuil@umail.leidenuniv.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Suzan Verberne</string-name>
          <email>s.verberne@liacs.leidenuniv.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leiden Institute of Advanced Computer Science, Leiden University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>78</fpage>
      <lpage>88</lpage>
      <abstract>
        <p>In this paper we address the challenge of extracting scientific references from patents. We approach the problem as a sequence labelling task and investigate the merits of BERT models for the extraction of these long sequences. References in patents to scientific literature are relevant for studying the connection between science and industry. Most prior work only uses the front-page citations for this analysis, which are provided in the metadata of patent archives. In this paper we build on prior work using Conditional Random Fields (CRF) and Flair for reference extraction. We improve the quality of the training data and train three BERT-based models on the labelled data (BERT, BioBERT, SciBERT). We find that the improved training data leads to a large improvement in the quality of the trained models. In addition, the BERT models beat CRF and Flair, with recall scores around 97% obtained with cross validation. With the best model we label a large collection of 33 thousand patents, extract the citations, and match them to publications in the Web of Science database. We extract 50% more references than with the old training data and methods: 735 thousand references in total. With these patent-publication links, follow-up research will further analyze which types of scientific work lead to inventions.</p>
      </abstract>
      <kwd-group>
        <kwd>Patent analysis</kwd>
        <kwd>Information Extraction</kwd>
        <kwd>Reference mining</kwd>
        <kwd>BERT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        References in patents to scientific literature provide relevant information for studying the
relation between science and technological inventions. These references allow us to answer
questions about the types of scientific work that lead to inventions. Most prior work analysing
the citations between patents and scientific publications focuses on the front-page citations,
which are well structured and provided in the metadata of patent archives such as Google
Patents. It has been argued that in-text references provide valuable information in addition to
front-page references: they have little overlap with front-page references [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and are a better
indication of knowledge flow between science and patents [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ].
      </p>
      <p>
        In the 2019 paper by Verberne et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the authors evaluate two sequence labelling methods
for extracting in-text references from patents: Conditional Random Fields (CRF) and Flair. In
this paper we extend that work, by (1) improving the quality of the training data and (2) applying
BERT models to the problem. We use error analysis throughout our work to find problems in
the dataset, improve our models and analyze the types of errors different models are susceptible
to.
      </p>
      <p>© 2021 Copyright for this paper by its authors. CEUR Workshop Proceedings (CEUR-WS.org)</p>
      <p>We first discuss the prior work in Section 2. We describe the improvements we make in
the dataset in Section 3, and the new models proposed for this task in Section 4. We compare
the results of our new models with previous results, both on the labelled dataset and a larger
unlabelled corpus (Section 5). We end with a discussion on the characteristics of the results of
our new models (Section 6), followed by a conclusion.</p>
      <p>Our code and improved dataset are released under an open-source license on GitHub.<sup>1</sup></p>
    </sec>
    <sec id="sec-2">
      <title>2. Prior work</title>
      <p>
        Reference analysis in patents has primarily been done using the references that are listed on the
patent’s front page. Patents often contain many more references in the patent text itself,
but these are more difficult to extract and analyze because their formatting is not standardized.
Verberne et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] introduce a new labelled dataset consisting of 22 patents and 1,952
hand-labelled references. They apply two sequence labelling methods to the reference extraction
task.
      </p>
      <p>
        Conditional Random Fields (CRF) model sequence labelling problems as an undirected graph
of observed and hidden variables, to find an optimal sequence of hidden variables (labels) given
a sequence of feature vectors [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Feature vectors usually consist of several manually designed
heuristics on the level of individual tokens and small neighborhoods of tokens. For extracting
references, Verberne et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] use a set of 11 + 6 × 4 = 35 features. This includes 11 features derived
from the current token: the part-of-speech (POS) tag (extracted with NLTK),
lexical features such as whether the token starts with a capital or is a number, and pattern-based
features to mark tokens that look like a year or a page number.<sup>2</sup> It also includes a subset of 6
features for each of the two preceding and two following tokens.
      </p>
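      <p>As an illustration of how a feature set of this shape could be organised, the following minimal sketch builds 11 current-token features plus 6 features for each of the four neighbouring tokens. Only these counts come from the text; the individual feature names, the regular expressions and the function names are hypothetical, and POS tags are passed in rather than computed with NLTK.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical patterns for tokens that look like a year or a page range.
YEAR = re.compile(r"^\(?(19|20)\d{2}[.,;)]*$")
PAGES = re.compile(r"^\d+(?::\d+)?-\d+[.,;)]*$")


def neighbour_features(token, pos):
    """Six features used for each context token (illustrative choice)."""
    return {
        "lower": token.lower(),
        "is_title": token.istitle(),
        "is_upper": token.isupper(),
        "is_digit": token.isdigit(),
        "pos": pos,
        "looks_like_year": bool(YEAR.match(token)),
    }


def current_features(token, pos):
    """Eleven features for the current token: the six above plus five more."""
    feats = neighbour_features(token, pos)
    feats.update({
        "suffix3": token[-3:],
        "prefix2": token[:2],
        "has_digit": any(c.isdigit() for c in token),
        "ends_with_period": token.endswith("."),
        "looks_like_pages": bool(PAGES.match(token)),
    })
    return feats


def token_features(tokens, pos_tags, i):
    """Feature dict for position i, with neighbours at offsets -2, -1, +1, +2."""
    feats = {"0:" + k: v for k, v in current_features(tokens[i], pos_tags[i]).items()}
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):  # neighbours outside the text are skipped
            for k, v in neighbour_features(tokens[j], pos_tags[j]).items():
                feats["%+d:%s" % (offset, k)] = v
    return feats
```
]]></preformat>
      <p>For a token with two neighbours on each side this yields 11 + 6 × 4 = 35 entries; tokens near the start or end of the text have fewer.</p>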
      <p>
        As the authors note, CRF has limited capabilities to take context into account. They chose to
compare CRF with the Flair framework, which is better able to use token contexts. Flair uses a
BiLSTM-CRF model in combination with pre-trained word embeddings [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. One downside of
Flair models is that they are memory intensive, which limits the maximum sequence length
they can process at once. Where the CRF model can analyze a complete patent at once, the Flair
models required splitting sequences into subsequences of 20 to 40 tokens [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Verberne et al.
used the IOB labels during training to prevent splitting within a reference.
      </p>
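      <p>The idea of using the labels to avoid cutting inside a reference can be sketched as follows. This is not the procedure of Verberne et al.; it is a greedy illustration under the assumption that a chunk is at least 20 tokens long and is extended, up to 40 tokens, while the next token is still labelled I. The function name and parameters are hypothetical.</p>
      <preformat><![CDATA[
```python
def split_outside_references(labels, min_len=20, max_len=40):
    """Return (start, end) index pairs that cover `labels` in order.

    A cut is postponed while the token after the cut is labelled I,
    so references are only split when they do not fit within max_len.
    """
    chunks = []
    start = 0
    n = len(labels)
    while start < n:
        end = min(start + min_len, n)
        # Extend the chunk while the cut would fall inside a reference.
        while end < n and labels[end] == "I" and end - start < max_len:
            end += 1
        chunks.append((start, end))
        start = end
    return chunks
```
]]></preformat>
      <p>A reference longer than the maximum chunk length is still split by this sketch, which is one reason such a length limit is a real constraint for long references.</p>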
      <p>The models were evaluated by measuring precision and recall using cross validation on the
labelled data. CRF performed better than Flair in all measures except the recall of I-labels. The
models were also applied to a large corpus of 33,338 unlabelled USPTO biotech patents, and
the resulting extracted references were matched against the Web of Science (WoS) database.
Here, Flair performed significantly better. Counting references with a definitive match in WoS
that were not included on the patent front page, CRF was able to find 125,631 such references
compared to 493,583 references found by Flair.</p>
      <p><sup>1</sup> https://github.com/kaesve/patent-citation-extraction</p>
      <p><sup>2</sup> The features are similar to the ones used in https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html</p>
      <p>
        Recent developments in transfer learning have improved the state of the art in numerous NLP
tasks. BERT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a large transformer model that is pre-trained on a large corpus for multiple
language modelling tasks. The resulting model can be used as a basis for new tasks on different
data sets. Even when the contents of these data sets or the task deviate significantly from the
pre-training corpus and tasks, the pre-training is still beneficial. Several authors have trained
models with the same architecture as BERT on different, more domain-specific corpora. These
include SciBERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and BioBERT [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Improving data quality</title>
      <p>While exploring the results of our models, we found that several prediction errors seemed to be
caused by mistakes in the labelled data. These mistakes result in a more pessimistic evaluation
of our models and, more importantly, could influence the effectiveness of training our models.
We noticed two types of problems: inconsistent or missing labels, and inconsistent tokenization.
We include examples of both kinds of problems below, and describe our attempts to improve
the data quality.</p>
      <sec id="sec-3-1">
        <title>3.1. Pre-processor inconsistencies</title>
        <p>The patent dataset contains text from 22 patents taken from Google Patents. Labels were added
manually by one annotator using the BRAT annotation tool<sup>3</sup>, and the text was subsequently
transformed into IOB files using a pipeline consisting of splitting the text into sentences, then
tokens and adding IOB and POS tags. Because tokenization was applied after annotation, the
labels produced by BRAT needed to be aligned with the produced tokens. In some cases, this
was done by recombining tokens. When comparing the source text with the IOB data, we found
that some sequences of tokens seemed to have been accidentally reordered. An example of this
is shown in Figure 1. After reviewing the pre-processing pipeline we were able to find the likely
cause of this problem. We chose to replace this pipeline with a simpler procedure that does not
do sentence splitting or combining of tokens. Besides sentence boundaries, our method also
ignores paragraph boundaries and white space in general.</p>
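        <p>A procedure of this kind can be sketched in a few lines. The regular expression below is a hypothetical tokenizer, not the one used for the released dataset (it splits "al." into two tokens, for example); the point is that tokens keep their character offsets, so character-span annotations such as those produced by BRAT can be mapped to IOB labels without sentence splitting or recombining tokens.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical token pattern: word-like runs (keeping forms such as
# 22:981-993 together) or single punctuation marks. Whitespace is skipped.
TOKEN = re.compile(r"\w+(?:[:\-]\w+)*|[^\w\s]")


def tokenize(text):
    """Return (token, start, end) triples in document order."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN.finditer(text)]


def iob_labels(tokens, spans):
    """Label each token B, I or O given annotated (start, end) character spans."""
    labels = []
    open_span = None  # annotation span that the previous token belonged to
    for _, start, end in tokens:
        span = next(((s, e) for s, e in spans if s <= start and end <= e), None)
        if span is None:
            labels.append("O")
        elif span == open_span:
            labels.append("I")
        else:
            labels.append("B")  # first token inside a new span
        open_span = span
    return labels
```
]]></preformat>
        <p>Because every token is produced in text order with its own offsets, tokens cannot be reordered relative to the source text.</p>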
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Inconsistent labelling</title>
        <p>After improving the pre-processing, we still found examples of label inconsistencies. Moreover,
our models found several references that were not included in the annotations. Finally, we found
multiple instances of references to patents and other non-academic literature. These are often
hard to distinguish from scientific literature references. We manually looked at each difference
between predicted and expected labels, and changed the annotations where necessary. We
repeated this process several times, with different models and after retraining on the updated
data. In this process, we labelled 330 new references, resulting in a total of 2,318 references
and 32,359 (I)nside tokens. We chose to include patent references when they included author
names or titles, and other non-literature references when the reference shares the format of
a literature reference. This simplifies the task, as the model does not have to disambiguate
references by their type. Since these extracted non-literature references will not match with the
publications in WoS, they will be filtered out in the next step of the pipeline.</p>
        <p>Figure 1. (a) Original text: (Eskildsen et al., Nuc. Acids Res. 31:3166-3173, 2003; Kakuta et al., J. Interferon &amp; Cytokine Res. 22:981-993, 2002.) (b) Original tokenization, as token/label pairs: Eskildsen/B, et/I, al.,/I, Nuc./I, Acids/I, Res./I, …, Res./I, 22:981-993,/I, 2002.)(/O. (c) New tokenization, as token/label pairs: (/O, Eskildsen/B, et/I, al./I, ,/I, Nuc/I, ./I, Acids/I, …, Res/I, ./I, 22:981-993/I, ,/I, 2002/I, ./I, )/O.</p>
        <p>While we think this process has improved the data quality significantly, our method does
introduce biases in the training and evaluation of our models. By only fixing labelling mistakes
that our models find, we may overlook unlabelled references that our models miss. This leads
to an overestimation in our evaluation, and biases in our model due to the feedback loop in the
training process. By using multiple different models for finding incorrect labels, we mitigate
the effect to some extent. Besides an intrinsic evaluation using the labelled data, we will also
evaluate our models on an extrinsic task using unlabelled data. This allows us to still compare
our model performance with previous results, without biases in the dataset or overestimations.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Extracting references with BERT, BioBERT and SciBERT</title>
      <p>We compare three different pre-trained models for extracting references from our data set:
BERT, BioBERT and SciBERT. Since our data set consists of patents from the biomedical domain,
we expect that these more domain-specific pre-training corpora will have a positive effect on
our task. Before comparing the results between these models, we describe our method for
fine-tuning the pre-trained model for reference extraction.</p>
      <sec id="sec-4-1">
        <title>4.1. Pre-processing</title>
        <p>BERT-based models have two characteristics that require additional pre-processing of our
dataset. BERT uses its own neural subword tokenization. Our dataset is already tokenized
into words, as described above, so we apply the BERT tokenizer to each token in our dataset.
Transformer-based models such as BERT also work on fixed sequence lengths, using padding for
shorter sequences, and are memory intensive. The models we train use a maximum sequence
length of 64 tokens, limited by the memory available. Though this can be configured to be
higher depending on the available hardware and the size of the model, it is infeasible to apply
these models on complete patents, which can contain tens of thousands of subword tokens.
There are several common strategies to divide text into shorter sequences. A natural approach
is to use paragraph or sentence splitting. We found this insufficient, as many sentences in our
data set run for much longer than the limit of 64 tokens. Our data set contains not only long
sentences; even references, the entities we are looking to extract, can be longer than 64 tokens.
Because of this observation we decided not to use any semantic or structural information in
splitting our text, except for our original token boundaries.</p>
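        <p>Our BERT-specific pre-processing can be summarized in the following steps:</p>
        <list list-type="order">
          <list-item><p>Collect the sequence of tokens T and their respective labels L for a given patent.</p></list-item>
          <list-item><p>Create two empty lists T′ and L′.</p></list-item>
          <list-item><p>Add the sequence start token to T′.</p></list-item>
          <list-item><p>While there are tokens left in T:</p>
            <list list-type="alpha-lower">
              <list-item><p>Get the next token t and label l.</p></list-item>
              <list-item><p>Use the word tokenizer to get sub tokens t′<sub>1</sub>, ..., t′<sub>n</sub>.</p></list-item>
              <list-item><p>If |T′| + n + 1 is larger than our limit of 64 tokens or when we reach the end of the
document, add the sequence end token to T′, pad both sequences and add them to
the data set. Set T′ and L′ to new empty lists.</p></list-item>
              <list-item><p>Add t′<sub>1</sub>, ..., t′<sub>n</sub> to T′, add l to L′.</p></list-item>
            </list>
          </list-item>
        </list>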
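        <p>The steps above can be sketched as follows. This is an illustration rather than the released implementation: the subword tokenizer is passed in as a hypothetical function, every new sequence is started with the start token, the label of a word is repeated for each of its subword pieces, and special and padding positions receive the label O. All of these are assumptions made to keep the sketch self-contained. Words whose pieces do not fit in a single sequence are not handled.</p>
        <preformat><![CDATA[
```python
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def close_sequence(pieces, labels, max_len):
    """Append the end token and pad both lists to max_len."""
    pieces = pieces + [SEP]
    labels = labels + ["O"]
    padding = max_len - len(pieces)
    return pieces + [PAD] * padding, labels + ["O"] * padding


def build_sequences(tokens, labels, subword_tokenize, max_len=64):
    """Pack word tokens into fixed-length subword sequences.

    A word is never split across two sequences; a reference can be.
    """
    sequences = []
    seq, seq_labels = [CLS], ["O"]
    for token, label in zip(tokens, labels):
        pieces = subword_tokenize(token)
        # The +1 keeps room for the sequence end token.
        if len(seq) + len(pieces) + 1 > max_len:
            sequences.append(close_sequence(seq, seq_labels, max_len))
            seq, seq_labels = [CLS], ["O"]
        seq.extend(pieces)
        seq_labels.extend([label] * len(pieces))
    if len(seq) > 1:  # flush the final, possibly short, sequence
        sequences.append(close_sequence(seq, seq_labels, max_len))
    return sequences
```
]]></preformat>
        <p>With a real BERT tokenizer, the function passed as subword_tokenize would be the tokenizer's word-to-pieces method; only the last sequence of a patent ends up with padding.</p>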
        <p>We note that the retokenization changes our task from a one-to-one to a many-to-many
sequence-to-sequence task, as there could now be multiple subword tokens associated with
one label. Another implication of these pre-processing steps is that the entities that we seek
to extract can be split across multiple sequences of 64 subword tokens. As mentioned earlier,
we have a total of 2,318 references and 32,359 tokens labelled as (I)nside. This gives us a total
of 34,677 reference tokens (labelled either B or I). We find that the average reference contains
34,677 / 2,318 ≈ 15 word tokens, and thus at least that many subword tokens. We can expect a large
number of references to be split across two or more sequences. We expect that this could have
a significant effect on the performance of our models, as the model will not always have access
to the context of a reference.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Training the BERT models</title>
        <p>We fine-tune three different BERT models on our labelled data: BERT-base, BioBERT, and
SciBERT (all cased). To fine-tune the BERT models, we use the open source BERT
implementation by HuggingFace<sup>4</sup>, with a token classification head consisting of a single linear
layer. In the case that an input sequence is shorter than 64 tokens (which only occurs at the end
of a patent), we mask out the loss for the output past the input sequence. We train the models
for three epochs through our training data, with a batch size of 32.<sup>5</sup></p>
        <p><sup>4</sup> https://huggingface.co/transformers/model_doc/bert.html#bertfortokenclassification</p>
        <p><sup>5</sup> We published the trained models on https://github.com/kaesve/patent-citation-extraction</p>
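        <p>Masking out the loss past the end of the input can be illustrated without any deep learning framework. The sketch below computes a mean cross-entropy over the positions where a mask is 1 and ignores the padded positions; the function name and the plain-list representation are hypothetical, and in practice the training framework would compute an equivalent quantity on tensors.</p>
        <preformat><![CDATA[
```python
import math


def masked_cross_entropy(logits, targets, mask):
    """Mean negative log-likelihood over positions where mask is 1.

    logits:  one list of per-class scores for each position
    targets: the index of the correct class for each position
    mask:    1 for real input positions, 0 for padding
    """
    total, count = 0.0, 0
    for scores, target, keep in zip(logits, targets, mask):
        if not keep:
            continue  # padded position: contributes nothing to the loss
        # Numerically stable log of the softmax normaliser.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_norm - scores[target]
        count += 1
    return total / count if count else 0.0
```
]]></preformat>
        <p>Changing the scores at a masked position leaves the loss unchanged, which is the property the masking is meant to guarantee.</p>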
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Intrinsic evaluation</title>
        <p>We evaluate our models using a leave-one-out training scheme. For each patent in the data set
we train a new model using the other 21 patents as the training data. Aside from the maximum
sequence length, we used the default hyperparameter configurations provided by the chosen
framework. We evaluate on both the original and updated dataset.</p>
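        <p>The evaluation scheme can be sketched as below. The split generator follows the description directly; the per-label precision and recall function assumes token-level counting over whatever gold and predicted label lists are passed in, which may differ from how the scores in this paper were aggregated. All names are hypothetical.</p>
        <preformat><![CDATA[
```python
def leave_one_out(patent_ids):
    """Yield (train_ids, held_out_id) pairs, one per patent."""
    for held_out in patent_ids:
        train = [p for p in patent_ids if p != held_out]
        yield train, held_out


def precision_recall(gold, predicted, label):
    """Token-level precision and recall for one label (for example B or I)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```
]]></preformat>
        <p>With 22 patents this produces 22 folds, each training on the other 21 patents.</p>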
        <p>
          Table 1 shows the results of evaluating the models on the labelled data using leave-one-out
validation. We also include the results of [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] as a baseline; however, the results are not directly
comparable as they used five-fold cross-validation for evaluation. Their models were therefore
trained on less data. Finally, we include the results of applying the original CRF implementation
on our updated dataset, using the same leave-one-out validation strategy.
        </p>
        <p>We see that our new models perform reasonably well on the original dataset. Comparing to
the baseline methods, we see that the BERT models consistently achieve a much higher recall.
This is especially useful for the WoS matching task, as was discussed earlier.</p>
        <p>When we compare the results of our models obtained with the updated dataset to those
obtained with the original data, we see that the changes in the dataset lead to improvements in
every metric. Especially in the precision column, we see a large jump in quality. This jump is in
part the direct result of our relabelling process. Most changes in the dataset concerned changing
labels from ‘O’ to ‘I’ or ‘B’ tokens, where our models found references that were missed during
labelling.</p>
        <p>Comparing the BERT-based models with each other, we find that the differences are small.
With the updated data the SciBERT and BioBERT models seem to perform slightly better than
the plain BERT model.</p>
        <p>Finally, we can compare the results of the CRF model on the original and updated dataset.
We again see a clear jump in performance. This comparison does suffer from the training bias
and different evaluation strategy mentioned earlier. Furthermore, the CRF model uses features
designed for the original dataset. As we changed the tokenization process, this means that some
of the pattern-based features do not work as intended. Still, we think the results do show that
the changes to the dataset make this task easier.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Extrinsic evaluation</title>
        <p>
          We also apply each model to an unlabelled data set of 33,338 patents [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. For this application,
the models are trained on the complete labelled data set. The references produced by these
models are matched against the Web of Science database, using the same procedures as reported
in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          From the set of 33,338 patents, we extract references to papers published in the years
1980–2010 (the ‘focus years’). This results in a list of extracted references. We parse them into
separate fields: first author, second author, year, journal title, volume/issue, and page numbers.
Then we match those fields to publications in the database. If we find a non-ambiguous match
for a subset of the fields, we count this as a ‘definite match’ [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
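        <p>The notion of a definite match can be illustrated with a small sketch. The actual matching procedure is the one reported in [4] and is not reproduced here; the field names, the choice of key fields and the minimum number of usable fields below are all hypothetical. The sketch only shows the principle that a match counts when exactly one publication agrees on the available fields.</p>
        <preformat><![CDATA[
```python
def definite_match(reference, publications,
                   keys=("first_author", "year", "volume", "first_page"),
                   min_fields=3):
    """Return the single publication that agrees with `reference`, else None.

    reference and publications are dicts of parsed fields. Only the keys
    that were actually parsed from the reference are compared.
    """
    usable = [k for k in keys if reference.get(k)]
    if len(usable) < min_fields:
        return None  # too little information to match unambiguously
    candidates = [
        pub for pub in publications
        if all(str(pub.get(k, "")).lower() == str(reference[k]).lower()
               for k in usable)
    ]
    # An ambiguous result (several candidates) is not a definite match.
    return candidates[0] if len(candidates) == 1 else None
```
]]></preformat>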
        <p>The results are displayed in Figure 2. There is a clear difference between the new BERT-based
models and the previous CRF and Flair models, but these results are not directly comparable
since CRF and Flair were trained on the original data. The figure also shows that the three BERT
models perform nearly identically to each other. As with the results from our intrinsic evaluation,
SciBERT seems to perform better than the other two BERT models by a small margin.</p>
        <p>We found that our models do not always produce clean sequences of IOB tokens; sometimes
the beginning is not marked as a B, or a word in the middle of a reference is labelled as O. We
extract references from sequences of I tokens starting with a B token or an I token preceded
by an O token, and ending before an O or B token. In the case that our model misses a word
in the middle of a reference, this means that we split this reference in two references during
extraction. Our matching script reports unique matches per patent, so this does not lead to
double-counting references. On the other hand, it could mean that neither part of the split
reference contains enough information to make a definite match in the WoS database.</p>
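        <p>The extraction rule described above can be sketched as follows. The function name is hypothetical, and the treatment of a B token that is not followed by any I token (kept here as a one-token reference) is an assumption, since the text does not cover that case.</p>
        <preformat><![CDATA[
```python
def extract_references(tokens, labels):
    """Group tokens into references using a lenient reading of IOB labels.

    A reference starts at a B token, or at an I token that follows an O
    token, and ends before the next O or B token.
    """
    references, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                references.append(current)
            current = [token]
        elif label == "I":
            current.append(token)  # opens a reference if none is open
        else:
            if current:
                references.append(current)
            current = []
    if current:
        references.append(current)
    return [" ".join(ref) for ref in references]
```
]]></preformat>
        <p>In the example sequence O I I O I O B I O O, the first reference has no B label and contains an O in the middle, so it is extracted as two separate references, followed by the well-formed third one.</p>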
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Our results show that our BERT-based models outperform both CRF and Flair, especially after
improving the training data. While the increased precision and recall are likely overestimated in
our intrinsic evaluation, the new models also perform better in our extrinsic evaluation, which
does not have the same training biases. Our models were able to extract roughly 240,000 more
references that could be matched with the WoS database from the unlabelled data than Flair
could, an increase of almost 50%.</p>
      <p>The difference between the numbers of matched publications found by CRF and BERT is
striking given the small differences in quality of the models measured with leave-one-out
validation (Table 1). This can for a large part be explained by the improved training data, but
also by the higher recall for the BERT models. In addition, we investigate two characteristics
of errors made by our models, and show the differences between BERT and CRF. We focus on
prediction errors within references, as these have the largest effect on the downstream task of
parsing references. Specifically, we look at cases where the model labels a token as O when
that token is labelled as B or I in the ground truth.</p>
      <p>Figure 3 shows the relative position of errors within references. This data was captured during
the leave-one-out evaluation. One major difference between BERT and CRF-based models is
that CRF explicitly learns ordered patterns in sequences. We would expect CRF models to make
errors by starting or ending a label sequence too early or too late, but we do not usually expect
errors to occur in the middle of a reference, as CRF learns that an I never follows an O. Without
this structural prior, we expect the errors to occur more uniformly across the references for the
BERT models. The histograms seem to confirm these intuitions. Leaving out mistakes in the
first word, we see that the distributions for especially the SciBERT and BioBERT models seem
uniform. The CRF model shows a clear drop in the first third of the distribution, and a steady
increase in the second half.</p>
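      <p>One way to compute such relative error positions is sketched below. The normalisation (0 for the first token of a reference, 1 for the last) is an assumption, as are the function names; the sketch only counts tokens that are B or I in the ground truth and predicted as O.</p>
      <preformat><![CDATA[
```python
def gold_references(gold):
    """Return (start, end) index pairs of references in a gold label sequence."""
    spans, start = [], None
    for i, label in enumerate(gold):
        if label == "B" or (label == "I" and start is None):
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(gold)))
    return spans


def relative_error_positions(gold, predicted):
    """Relative position (0 to 1) of each reference token predicted as O."""
    positions = []
    for start, end in gold_references(gold):
        length = end - start
        for i in range(start, end):
            if predicted[i] == "O":
                # One-token references are mapped to position 0.
                positions.append((i - start) / (length - 1) if length > 1 else 0.0)
    return positions
```
]]></preformat>
      <p>A histogram of the returned values, collected over all held-out patents, gives a plot of the kind described above.</p>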
      <p>By manually looking at references where CRF predicts an O close to the middle, we found
we could categorize these mistakes almost completely in two groups: CRF only labelled the
first or last few tokens as part of the reference, or the reference is very long and CRF finds two
references at the beginning and end of the reference. In both scenarios CRF does produce coherent
sequences of a B label followed by I labels. On the other hand, our BERT models sometimes do
not predict a B at all, or predict it in the wrong place. The models are also prone to missing an I label in
the middle of a reference.</p>
      <p>Figure 4 is another way to visualize this difference. Here we plot the lengths of sequences
of O’s found within references. The median error sequence length is one or two for the BERT
models, and four for CRF. In other words, BERT models not only make fewer mistakes than
CRF, but the mistakes are smaller on average, and more uniformly spread across the reference.
We speculate that this helps with the ultimate task of parsing and matching the references.
CRF errors almost always include the first or last few tokens, which often contain important
information for parsing the reference, such as the publication year and the author names.</p>
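      <p>The lengths of these error sequences can be collected with a single pass over the labels, as in the following sketch. The function name is hypothetical; a run is taken to be a maximal stretch of consecutive tokens inside one ground-truth reference that the model labels as O.</p>
      <preformat><![CDATA[
```python
from statistics import median


def error_run_lengths(gold, predicted):
    """Lengths of runs of predicted O labels inside gold references."""
    lengths, run = [], 0
    for g, p in zip(gold, predicted):
        if g == "B" and run:
            lengths.append(run)  # a new reference starts: close the open run
            run = 0
        if g != "O" and p == "O":
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    if run:
        lengths.append(run)
    return lengths
```
]]></preformat>
      <p>The median of the returned list, for example via statistics.median, corresponds to the median error sequence length discussed above.</p>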
      <p>Figure 3: Error positions relative to the reference (one histogram panel per model, including BaseBERT).</p>
      <p>Figure 4: Distribution of error sequence lengths for BERT, BioBERT, SciBERT and CRF (vertical axis: error sequence lengths, 0 to 40).</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We applied BERT-based models to extract references from patent texts. We found that these
models achieve better recall than CRF and Flair. We use an external database of publications to
match these references, which means that recall is more important than precision, as
imprecisions will be resolved during matching. During the development of our models, we found
that the original dataset for this task had errors in labelling and pre-processing. We used our
models interactively to find these mistakes, and repaired them.</p>
      <p>We find that the improved training data leads to a large improvement in the quality of the
trained models. In addition, the BERT models beat CRF and Flair, with recall scores around 97%
obtained with cross validation. Our models were also applied to a large unlabelled dataset, and
were able to extract 50% more references than previous methods.</p>
      <p>
        We also show that BERT models are prone to different kinds of errors than CRF models.
Combining these methods could potentially lead to a stronger model. We think that the limited
maximal sequence size that BERT can handle affects its performance, due to the average length
of references. Recent work focuses on modifying the attention architecture underlying BERT
to better accommodate longer sequences. This includes new models such as the Reformer,
Longformer, Linformer, Big Bird and the Performer [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. We think these models could achieve
even better results, with little modification to our method.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Bryan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ozcan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. N.</given-names>
            <surname>Sampat</surname>
          </string-name>
          ,
          <article-title>In-Text Patent Citations: A User's Guide</article-title>
          ,
          <source>Technical Report, National Bureau of Economic Research</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nagaoka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Yamauchi</surname>
          </string-name>
          ,
          <article-title>The use of science for inventions and its identification: Patent level evidence matched with survey</article-title>
          ,
          <source>Research Institute of Economy, Trade and Industry (RIETI)</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K. A.</given-names>
            <surname>Bryan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ozcan</surname>
          </string-name>
          ,
          <article-title>The impact of open access mandates on invention</article-title>
          ,
          <source>Mimeo</source>
          , Toronto (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Verberne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Chios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Extracting and matching patent in-text references to scientific publications</article-title>
          ,
          <source>in: Proceedings of the 4th Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2019)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <article-title>Conditional random fields: An introduction</article-title>
          ,
          <source>Technical Reports (CIS)</source>
          (
          <year>2004</year>
          )
          <fpage>22</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Akbik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bergmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blythe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rasul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schweter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vollgraf</surname>
          </string-name>
          ,
          <article-title>FLAIR: An easy-to-use framework for state-of-the-art NLP</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>54</fpage>
          -
          <lpage>59</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <article-title>SciBERT: Pretrained contextualized embeddings for scientific text</article-title>
          , CoRR abs/1903.10676 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1903.10676. arXiv:1903.10676.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. H.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kang</surname>
          </string-name>
          ,
          <article-title>BioBERT: a pre-trained biomedical language representation model for biomedical text mining</article-title>
          , CoRR abs/1901.08746 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1901.08746. arXiv:1901.08746.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bahri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <article-title>Efficient transformers: A survey</article-title>
          , arXiv preprint arXiv:2009.06732 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>