=Paper=
{{Paper
|id=Vol-3756/GenoVarDis2024_paper1
|storemode=property
|title=FRE at GenoVarDis: A sane approach to Disease and Genomic Variant NER
|pdfUrl=https://ceur-ws.org/Vol-3756/GenoVarDis2024_paper1.pdf
|volume=Vol-3756
|authors=Ander Martínez
|dblpUrl=https://dblp.org/rec/conf/sepln/Martinez24
}}
==FRE at GenoVarDis: A sane approach to Disease and Genomic Variant NER==
Ander Martínez¹
¹ AI & Computing Research Group, Fujitsu Research of Europe Ltd., Spain
email: ander.martinez@fujitsu.com, orcid: 0000-0003-2290-8194
IberLEF 2024, September 2024, Valladolid, Spain

Abstract

Fujitsu Research of Europe (FRE) participated in the GenoVarDis [1] competition on Named Entity Recognition (NER) of variants, genes and associated diseases. This competition was part of the IberLEF 2024 [2] campaign. In this paper, we describe our approach to the challenge and analyze our results. Our approach consisted of a combination of pretrained language model fine-tuning, Conditional Random Fields (CRF), Byte-Pair Encoding dropout (BPE dropout) and model ensembling. With this solution, we ranked first in the competition. We analyze the benchmark dataset and our results, and, now that the gold data for the test set has been released, we consider how our results could have been better.

Keywords: Natural Language Processing, Named Entity Recognition, Conditional Random Fields, RoBERTa

1. Introduction

Among the tasks proposed by the IberLEF 2024 [2] campaign, GenoVarDis fell into the Biomedical NLP category, with the full title GenoVarDis: NER in Genomic Variants and related Diseases [1]. The task released a much-needed Spanish benchmark dataset for Biomedical Named Entity Recognition (NER). The presentation of the dataset cited tmVar3 [3] and BERN2 [4] as similar datasets in English, both of them well known albeit limited in size.

Named Entity Recognition is one of the cornerstones of text mining, a necessary step to go from unstructured to structured data. In recent years, fine-tuning Large Language Models (LLMs) such as BERT [5] or RoBERTa [6] has become the most popular approach to NER. Compared to earlier non-LLM deep learning approaches [7], the newer LLM-based approaches require less data to train. However, hand-labeled (gold-standard) data is still necessary, and so the dataset released for this competition makes a great contribution to Spanish-language Biomedical NLP.

Our team is interested in working with unstructured biomedical data, and we recently participated [8] in the SympTEMIST challenge [9] on Spanish-language symptom NER. For the GenoVarDis competition we used a solution similar to the one we used on that occasion: a sane combination of well-known techniques that, in our experience, delivers the best results on most occasions. The techniques we used are LLM fine-tuning for NER, Conditional Random Fields (CRF), BPE dropout, and model ensembling with majority voting. Using this approach, we ranked first in the GenoVarDis competition. In the rest of this paper, we analyze the benchmark dataset and our results; now that the gold data for the test set has been released, we can consider how our results could have been better.

Table 1
Dataset text statistics: size of each partition. The rows "Mean length in characters" and "Mean length in words" give the mean size of the documents contained.

                            Train      Dev        Test
Total documents             427        70         136
Mean length in characters   1,809.65   1,885.11   1,503.50
Mean length in words        276.78     287.46     219.37

Table 2
Dataset statistics for the three partitions. The "entities" columns give the total number of entities in the partition. The "chars" and "words" columns give the average length of the mentions in characters and in space-separated words. The row Nucl...eChange stands for NucleotideChange-BaseChange.

                 |        Train          |         Dev           |         Test
category         | entities chars words  | entities chars words  | entities chars words
Disease          |   4028   15.98  2.15  |    588   15.53  2.07  |   1433   20.10  2.43
Gene             |   3093    7.27  1.33  |    550    7.95  1.38  |    514    6.46  1.22
DNAMutation      |    496   14.31  2.31  |    103   15.46  2.61  |     73    7.92  1.05
OtherMutation    |    271   25.30  4.30  |     53   21.15  3.00  |     22   19.86  2.68
DNAAllele        |    139    9.27  1.71  |     12    8.00  1.58  |     15    7.20  2.20
SNP              |    120    8.52  1.03  |     15    8.47  1.00  |     42    8.69  1.00
Nucl...eChange   |     51   11.43  2.43  |     11    9.82  2.00  |      1    3.00  1.00
Transcript       |      1   11.00  1.00  |      1   11.00  1.00  |      1   11.00  1.00

2. Dataset

The GenoVarDis dataset is a collection of texts that have been labeled with the spans of mentions of various genes, mutations and diseases. Each text or document consists of a title and a block of text, such as a paragraph, that normally contains multiple sentences. The dataset was distributed partitioned into train, development and test, with the gold data of the test partition only released after the competition concluded.

Table 1 shows some statistics of the text contained in the dataset. Of a total of 633 documents (and 1,109,153 characters), about 70% went to the training partition, 12% to the development partition and 18% to the test partition. We also observe that the documents in the training and development partitions are similar in length to each other but somewhat longer than those in the test partition.

Table 2 shows the statistics of the entities contained in each partition. The dataset contains entities of eight categories of varying frequency; in the table, we sorted the categories from most to least frequent. While the Disease and Gene categories have a fair number of mentions, mutations are spread across several categories, and some of these do not have enough data to successfully train a NER model. As an extreme example, the Transcript category contains a single mention in the training data.

We observe that the Disease and Gene mentions make up about 90% of the annotations (92.7% of the test annotations). Since the competition micro-averages the F1-scores over the different entities, these two categories account for most of the score, making the mutation annotations far less relevant.
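For reference, this is the standard definition of the micro-averaged F1-score (a textbook formula, not anything task-specific): per-category true positives, false positives and false negatives are pooled before the score is computed, so each category weighs in proportion to its support.

```latex
% Micro-averaged F1 over categories c: counts are pooled first,
% so frequent categories dominate the score.
F_1^{\mathrm{micro}} =
  \frac{2 \sum_c \mathrm{TP}_c}
       {2 \sum_c \mathrm{TP}_c + \sum_c \mathrm{FP}_c + \sum_c \mathrm{FN}_c}
```

With Disease and Gene accounting for 1,947 of the 2,101 test mentions, a system can score well on this metric even if it performs poorly on the rare mutation categories.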
The average length of the annotations shows that mentions of the OtherMutation category are particularly long. These are descriptions of mutations such as "insertion introduced eight additional amino acids" (an English translation taken from the official task description).

All samples of the Transcript category are single-word, 11-character mentions: "NM_203475.1", "NG_008724.1" and "NM_000747.2", taken from the training, development and test partitions respectively. They follow a regular pattern and could be extracted using regular expressions, as sketched below.
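A minimal sketch of such a rule-based extractor. The pattern below only generalizes the three observed examples, which look like RefSeq-style accessions; it is our assumption, not a specification given by the task.

```python
import re

# Matches RefSeq-style accessions such as "NM_203475.1": two capital
# letters, an underscore, digits, a dot and a version number.
# This pattern is inferred from the three Transcript mentions in the
# dataset and may over- or under-match on other corpora.
TRANSCRIPT_RE = re.compile(r"\b[A-Z]{2}_\d+\.\d+\b")

text = "Se analizó el transcrito NM_000747.2 en dos familias."
print(TRANSCRIPT_RE.findall(text))  # ['NM_000747.2']
```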
3. Approach

In the introduction, we described our technology as a combination of a few well-known techniques: language model fine-tuning with Conditional Random Fields, BPE dropout and model ensembling. In this section, we provide a short description of each of them and some details on how they were implemented.

3.1. Language Model Fine-tuning and Conditional Random Fields

The NER task is usually reduced to a token classification task, where each of the tokens (words or subwords) in a text (sentence or paragraph) is classified into a class of the BIO (Beginning-Inside-Outside) schema [10]. This schema represents each mention in the text as a B- label followed by zero or more I- labels; tokens that do not belong to a mention are classified as O (outside). For eight entity categories we need 8 × 2 + 1 = 17 classes: the O class plus B-Disease, I-Disease, B-Gene, and so on. Other popular schemas are SBIO and BIOES.

Fine-tuning an LLM is a very popular approach to training NER models: a classification layer is added on top of the token representations learned by the LLM. Because LLMs can be trained on raw (unannotated) text, they can leverage large amounts of data. An early example of this approach is the original BERT paper [5].

The performance of these models can be improved using Conditional Random Fields (CRFs), although CRFs were popular long before the introduction of LLMs; Souza et al. [11] is an example of combining LLM-based NER with a CRF. The contribution of the CRF is that it models the probability of transitioning from one output label to the next by training an additional matrix. This is useful because an I- label cannot appear without a preceding B- label, so the CRF can help avoid impossible transitions. At prediction time, the Viterbi algorithm [12] is used to produce the most likely sequence of labels after taking the transition probabilities into account, as sketched below.
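A minimal sketch of this decoding step, assuming per-token emission scores from a fine-tuned encoder and a learned transition matrix; the label names follow the task's categories, while the function and variable names are illustrative, not the authors' code.

```python
import numpy as np

# The 17 BIO labels for the task's eight entity categories.
LABELS = ["O"] + [f"{p}-{c}"
                  for c in ["Disease", "Gene", "DNAMutation",
                            "OtherMutation", "DNAAllele", "SNP",
                            "NucleotideChange-BaseChange", "Transcript"]
                  for p in ("B", "I")]

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """emissions: (seq_len, n_labels) per-token label scores;
    transitions: (n_labels, n_labels) learned transition scores.
    Returns the highest-scoring sequence of label indices."""
    seq_len, n_labels = emissions.shape
    score = emissions[0].copy()            # best score ending at each label
    backptr = np.zeros((seq_len, n_labels), dtype=int)
    for t in range(1, seq_len):
        # total[i, j]: best path score at label i at t-1, then label j at t
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    # Walk the back-pointers from the best final label.
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]
```

After CRF training, impossible transitions such as O → I-Gene end up with very low scores in the transition matrix, so the decoded path avoids them.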
3.2. Subword Representation and BPE Dropout

Texts can be represented as strings of characters. Characters form a closed set of symbols, whether they are Latin characters or any other set of Unicode characters, which means one can enumerate all of them in a list. However, representing texts as sequences of characters results in very long sequences. Word sequences have been used instead, resulting in shorter sequences for the same text, but words do not form a closed set of symbols, which means we can encounter words that we have never seen before (out-of-vocabulary words). This is a problem when training deep learning (DL) models, particularly for NER, where we expect to encounter many new words. A compromise is to use subwords (or wordpieces): a closed set of substrings that can be composed into words, selected so as to produce short sequences.

Byte-pair encoding (BPE) was originally formulated as a compression algorithm and was later repurposed to represent texts for DL models [13], originally in the context of Neural Machine Translation. Since then, BPE has found wide adoption and is widely used as the method to present text to DL models.

An alternative to BPE was introduced by Kudo [14]. A benefit of this approach is that it can produce multiple representations of the same text by segmenting it differently. For this, it requires training a unigram language model, and it uses the Expectation-Maximization (EM) and Viterbi [12] algorithms to sample segmentations, which adds complexity and has been a drawback to its adoption.

BPE dropout [15] was introduced as a simpler alternative to Kudo's unigram approach. It can be applied to existing BPE vocabularies, and so it can be applied to many of the pretrained language models available at HuggingFace, such as RoBERTa. Whereas the unigram language model subword regularization method needs a statistical model and dynamic programming to sample different segmentations of the same sequence, BPE dropout simply injects random noise to discard certain merge operations, generating a different sequence of subwords each time. This is possible because BPE does not store the frequencies of each subword, only the order of the merge operations. Merge operations are discarded with a probability p, usually 0.1. Provilkov et al. [15] concluded through several experiments that BPE dropout achieves better results. Our systems used BPE dropout during training, with a dropout probability p of 0.1.
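A minimal sketch of enabling BPE dropout with HuggingFace's tokenizers library, assuming a BPE-based tokenizer; the model name is illustrative and the dropout attribute follows our reading of the library, not anything specific to the task.

```python
from tokenizers import Tokenizer

# Load a BPE-based tokenizer (the model name is illustrative; any
# RoBERTa-style tokenizer with a tokenizer.json should behave the same).
tok = Tokenizer.from_pretrained("roberta-base")

# Enable BPE dropout: each merge operation is skipped with p = 0.1,
# so repeated calls can segment the same word differently.
tok.model.dropout = 0.1
print(tok.encode("hemoglobinopathy").tokens)  # segmentation varies per call

# Disable dropout again for evaluation so segmentation is deterministic.
tok.model.dropout = None
```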
3.3. Model Ensembling

Because we only fine-tune our models from pretrained LLMs (rather than training from scratch), and because the data available for training is relatively scarce, we can only train our models for a limited number of iterations before they start to overfit. Training a single model does not take long, but its predictions depend on the initialization used for the classification layer. Under these circumstances, we can easily combine a few models to make more robust predictions. We combined five models that were initialized with different seeds but share the same base LLM, using a majority-voting strategy to ensemble them: each model makes a prediction for each label position, and the label that receives the most votes is selected. We observed that this strategy improved the final F1-score.
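A minimal sketch of the voting step; the names and the tie-breaking behavior are illustrative, since the paper does not specify how ties among the five models are resolved.

```python
from collections import Counter

def majority_vote(all_preds):
    """all_preds[m][t] is model m's label for token t.
    Ties are broken by the first label encountered (an illustrative
    choice, not necessarily the authors')."""
    return [Counter(labels).most_common(1)[0][0]
            for labels in zip(*all_preds)]

all_preds = [
    ["B-Gene", "O", "B-Disease"],  # model 1
    ["B-Gene", "O", "O"],          # model 2
    ["O",      "O", "B-Disease"],  # model 3
]
print(majority_vote(all_preds))  # ['B-Gene', 'O', 'B-Disease']
```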
4. Results

The competition was held on Codalab [16]¹, with a development phase preceding the evaluation phase. During the development phase, the annotations for the development partition were not available. We trained a model based on the PlanTL-GOB-ES/bsc-bio-ehr-es model [17] available at HuggingFace², and we submitted its predictions to the Codalab system to make sure that the format of our predictions was correct. The system reports the score of a submission right away and keeps a live ranking of all submissions. We observed that submissions were ranked by the micro-average of the F1-score; with this in mind, we decided it was not worthwhile to optimize for the minority classes and focused on Gene and Disease.

After the development annotations were released, we trained five models on both the training and development partitions for one epoch and combined their predictions as described in Subsection 3.3. Our submission obtained an F1-score of 0.820977 and remained first until the completion of the competition.

The Codalab system only reports the average scores, but after the annotations of the test data were released, we could analyze the errors in our submission. Table 3 shows the scores we obtained for each category. Our submission got good scores for both Gene and Disease, which together make up 92.67% of the entities (and hence of the score), as shown in Table 2. We also got a good result for the DNAMutation category (91.39% F1-score), the third most common category in the training data with 496 mentions. For all the other categories our model did not get good results, but this had little impact on the final score. The fourth most common category in the training data was OtherMutation, with 271 mentions. Although this is more than half the number for DNAMutation, our system only reached a 16% F1-score for this category, the reason being that these mentions are considerably longer and more complex than the others, as shown in Table 2.

¹ URL: https://codalab.lisn.upsaclay.fr/competitions/17733
² URL: https://huggingface.co/PlanTL-GOB-ES/bsc-bio-ehr-es

Table 3
Full results for the submitted predictions. The categories are ordered alphabetically.

category                      precision  recall  f1-score  support
DNAAllele                        1.0000  0.0667    0.1250       15
DNAMutation                      0.8846  0.9452    0.9139       73
Disease                          0.8074  0.8193    0.8133     1433
Gene                             0.8444  0.8444    0.8444      514
NucleotideChange-BaseChange      0.0000  0.0000    0.0000        1
OtherMutation                    0.6667  0.0909    0.1600       22
SNP                              1.0000  1.0000    1.0000       42
Transcript                       1.0000  0.0000    0.0000        1
micro avg                        0.8223  0.8196    0.8210     2101
macro avg                        0.7754  0.4708    0.4821     2101
weighted avg                     0.8226  0.8196    0.8156     2101

5. Conclusions

We participated in the GenoVarDis competition and ranked first. We showed that our standard approach to NER works well across different settings when enough training data is provided. Still, there are cases where not enough annotated data is available to train a reliable model; in those cases, dictionary matching and regular expressions are a better option.

References

[1] M. M. Agüero-Torales, C. R. Abellán, M. C. Mata, J. I. D. Hernández, M. S. López, A. Miranda-Escalada, S. López-Alvárez, J. M. Prats, C. C. Moraga, D. Vilares, L. Chiruzzo, Overview of GenoVarDis at IberLEF 2024: NER of Genomic Variants and Related Diseases in Spanish, Procesamiento del Lenguaje Natural 73 (2024).
[2] L. Chiruzzo, S. M. Jiménez-Zafra, F. Rangel, Overview of IberLEF 2024: Natural Language Processing Challenges for Spanish and other Iberian Languages, in: Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2024), co-located with the 40th Conference of the Spanish Society for Natural Language Processing (SEPLN 2024), CEUR-WS.org, 2024.
[3] C.-H. Wei, A. Allot, K. Riehle, A. Milosavljevic, Z. Lu, tmVar 3.0: an improved variant concept recognition and normalization tool, Bioinformatics 38 (2022) 4449–4451. URL: https://doi.org/10.1093/bioinformatics/btac537. doi:10.1093/bioinformatics/btac537.
[4] M. Sung, M. Jeong, Y. Choi, D. Kim, J. Lee, J. Kang, BERN2: an advanced neural biomedical named entity recognition and normalization tool, 2022. arXiv:2201.02080.
[5] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019. URL: http://arxiv.org/abs/1810.04805. doi:10.48550/arXiv.1810.04805. arXiv:1810.04805 [cs].
[6] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019. URL: http://arxiv.org/abs/1907.11692. doi:10.48550/arXiv.1907.11692. arXiv:1907.11692 [cs].
[7] S. Chowdhury, X. Dong, L. Qian, X. Li, Y. Guan, J. Yang, Q. Yu, A multitask bi-directional RNN model for named entity recognition on Chinese electronic medical records, BMC Bioinformatics 19 (2018) 499. doi:10.1186/s12859-018-2467-9.
[8] A. Martínez, N. García-Santa, FRE @ BC8 SympTEMIST track: Named Entity Recognition, 2023. URL: https://doi.org/10.5281/zenodo.10103882. doi:10.5281/zenodo.10103882.
[9] S. L. López, L. G. Sánchez, E. Farré, L. V. Gimenez, M. Krallinger, SympTEMIST Corpus: Gold Standard annotations for clinical symptoms, signs and findings information extraction, 2024. URL: https://doi.org/10.5281/zenodo.10635215. doi:10.5281/zenodo.10635215. Version 4.
[10] L. A. Ramshaw, M. P. Marcus, Text Chunking using Transformation-Based Learning, 1995. URL: http://arxiv.org/abs/cmp-lg/9505040. doi:10.48550/arXiv.cmp-lg/9505040. arXiv:cmp-lg/9505040.
[11] F. Souza, R. Nogueira, R. Lotufo, Portuguese Named Entity Recognition using BERT-CRF, 2020. URL: http://arxiv.org/abs/1909.10649. doi:10.48550/arXiv.1909.10649. arXiv:1909.10649 [cs].
[12] A. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Transactions on Information Theory 13 (1967) 260–269. URL: https://ieeexplore.ieee.org/document/1054010. doi:10.1109/TIT.1967.1054010.
[13] R. Sennrich, B. Haddow, A. Birch, Neural Machine Translation of Rare Words with Subword Units, 2015. URL: http://arxiv.org/abs/1508.07909. arXiv:1508.07909 [cs].
[14] T. Kudo, Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 66–75. URL: https://www.aclweb.org/anthology/P18-1007. doi:10.18653/v1/P18-1007.
[15] I. Provilkov, D. Emelianenko, E. Voita, BPE-Dropout: Simple and Effective Subword Regularization, 2020. URL: http://arxiv.org/abs/1910.13267. doi:10.48550/arXiv.1910.13267. arXiv:1910.13267 [cs].
[16] A. Pavao, I. Guyon, A.-C. Letournel, D.-T. Tran, X. Baro, H. J. Escalante, S. Escalera, T. Thomas, Z. Xu, CodaLab Competitions: An Open Source Platform to Organize Scientific Challenges, Journal of Machine Learning Research 24 (2023) 1–6. URL: http://jmlr.org/papers/v24/21-1436.html.
[17] C. P. Carrino, J. Llop, M. Pàmies, A. Gutiérrez-Fandiño, J. Armengol-Estapé, J. Silveira-Ocampo, A. Valencia, A. Gonzalez-Agirre, M. Villegas, Pretrained Biomedical Language Models for Clinical NLP in Spanish, in: Proceedings of the 21st Workshop on Biomedical Language Processing, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 193–199. URL: https://aclanthology.org/2022.bionlp-1.19. doi:10.18653/v1/2022.bionlp-1.19.