<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lorcán Pigott-Dix</string-name>
          <email>lorcan.pigott-dix@earlham.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert P. Davey</string-name>
          <email>robert.davey@earlham.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <!-- Keywords: Deep Learning, Ontology, Named Entity Recognition, Concept Recognition -->
        <aff id="aff0">
          <label>0</label>
          <institution>Earlham Institute</institution>
          ,
          <addr-line>Norwich Research Park, Colney Lane, Norwich NR4 7UZ</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The increasing scale of scientific output necessitates the use of machine-based tools to index, interpret, and allow scientists to digest the expanding volumes of data and literature. These tools depend on rich machine-readable ontology-based metadata. The scale of this task renders manual annotation infeasible. This work compares multi-ontology deep learning-based models for identifying ontology concepts in the natural language text of scientific literature. An existing convolutional neural net (CNN) architecture was improved and compared with two attention-based variants, a CNN with a Squeeze-and-Excite (SAE) mechanism, and a self-attention architecture that had been adapted for limited training data. The models were assessed against a gold-standard dataset of 228 PubMed abstracts, annotated with Human Phenotype Ontology terms. All models exceeded the previous state-of-the-art, with the SAE model promising to be the best candidate for multi-domain ontology concept extraction.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-3">
      <title>2. Related work</title>
      <sec id="sec-3-1">
        <title>2.1. Concept recognition</title>
        <p>
          Previous tools for ontology-based concept recognition have largely been rule-based [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ]. They
typically identify potential concepts within text using string matching, coupled with heuristics
to refine the candidate concepts. These methods tend to have high precision but low recall
scores, as they struggle to identify synonyms of concepts that are not represented verbatim
in the ontology.
        </p>
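        <p>The precision–recall trade-off described above can be sketched with a toy dictionary matcher. The terms and concept IDs below are hypothetical illustrations, not entries from the tools cited:</p>

```python
# Illustrative mini-dictionary (hypothetical entries): exact string matching
# finds terms verbatim but misses synonymous phrasings absent from the ontology.

ontology_dictionary = {
    "abnormality of the eye": "HP:0000478",
    "visual impairment": "HP:0000505",
}

def dictionary_match(text):
    """Return (term, concept ID) pairs for exact dictionary hits."""
    text = text.lower()
    return [(term, cid) for term, cid in ontology_dictionary.items() if term in text]

# A verbatim mention is found (high precision)...
hits = dictionary_match("The patient presented with visual impairment.")
# ...but a synonymous phrasing is missed entirely (low recall).
misses = dictionary_match("The patient's eyesight was severely reduced.")
```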
        <p>
          Recently, ontology-based concept recognition methodologies have begun to incorporate
neural nets, typically employing recurrent neural nets (RNNs), as RNNs can learn dependencies
between words in sequences. These methods employ word embeddings, which help to address
the synonym gap as they represent meaning rather than lexical structure. However, these
methods rely on substantial manual annotation or noisy heuristic data generation. For example,
one [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] used an ontology to heuristically label a training corpus, while another [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] relied on
manual annotation carried out by medical specialists.
        </p>
        <p>
          Arbabi et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] created a method that exploits word embeddings and only requires an
ontology for training: the Neural Concept Recognizer (NCR). NCR is a “neural dictionary”
that uses a convolutional neural net (CNN) to learn associations between sequences of word
embeddings and embeddings representing ontology concepts. The neural dictionary converts
input text into the representation space of the concepts and finds the most similar concept
embeddings. It only requires an OBO format ontology, and no heuristic annotation. When the
NCR model was evaluated against text annotated with concepts from the Human Phenotype
Ontology (HPO), it achieved micro and macro F1 scores of 70.2% and 73.9%, respectively.
        </p>
        <p>“PhenoTagger” [12] combined a string-matching approach with a neural classifier for the task
of HPO concept recognition. A dictionary of concept names, synonyms, and their lemmatised
forms was created from an ontology. The dictionary was used to label a distantly supervised
training set, which, in turn, was used to fine-tune a pre-trained BioBERT model. In
deployment, PhenoTagger combines the outputs from the string-matching dictionary and BioBERT.
PhenoTagger achieves a “document-level” F-score of 75.7% – the current state-of-the-art (SOTA)
for neural dictionary methods.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Attention</title>
        <p>In recent years, models containing attention mechanisms have become the SOTA for many
natural language processing (NLP) and computer vision tasks, from machine translation [13] to
object detection [14]. In neural networks, attention mechanisms are specific trainable weights
that learn to amplify the signal of task-relevant features and diminish
less relevant features. For example, the Squeeze-and-Excite (SAE) attention mechanism [15]
emphasises or diminishes feature signals by modelling dependencies between convolutional
filters in a CNN. A more sophisticated attention mechanism, multi-headed self-attention (MHSA),
explicitly models the semantic dependencies between words in a sequence, in order to compute
updated word embeddings.</p>
        <p>
          MHSA is an integral part of SOTA transformer models [13]. However, these models require
large volumes of training data to be effective. Guo et al. [16] argue that this is because
self-attention models have a poor inductive bias, and instead rely heavily on these large volumes in
order to generalise well. This becomes a problem for models that are trained on limited datasets,
such as the text provided by an ontology. Arbabi et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] tried attention mechanisms as an
alternative to the CNN used in their model, but they were not effective. This may be explained
by either the poor inductive bias, the limited training data, or the relative efficacy of CNNs at
local feature extraction.
        </p>
        <p>Guo et al. [16] describe an alternative configuration of MHSA, called Scale-Aware
Self-Attention (SASA). With SASA, each attention head attends to a variable scale. The scaling
restricts attention to within a certain neighbourhood of each sequence position. The intuition
here is that words in close proximity within a sentence are more likely to have more
relevance to each other. This scaling forces the attention heads to attend to a smaller set
of features, so the relative differences between the remaining features are more pronounced,
improving the inductive bias. SASA models exceed the SOTA, or are competitive, for a number
of low-resource NLP tasks – requiring far fewer training examples than typical MHSA models.</p>
        <p>Other methods have been developed to improve the performance of MHSA models. Zhou
et al. [17] randomly drop entire attention heads during training to prevent a minority of
heads from dominating the model, improving the model’s ability to generalise. Wu et al. [18]
applied dropout at various points within the attention mechanism: to the attention weights
and activation layer; to the query, key, and value matrices; and to the output features prior to
the linear transformation. They also randomly removed a proportion of tokens from the input
sequence. This was found to improve the performance of MHSA models without additional
training data or computational power.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Contribution</title>
      <p>We achieve a new SOTA for neural ontology-based concept recognition. The improved
performance is obtained by incorporating an attention mechanism in the CNN architecture and by
using higher quality word embeddings. Unlike previous neural dictionaries, our approach can incorporate
multiple domain ontologies at once. We found that models trained using a combination of diverse
ontologies performed the best. This work also demonstrates that transformer-based
architectures can be modified, with variable attention scaling, to perform competitively with CNNs in
situations where training data is substantially limited. All of the models tested are available
here: https://github.com/lorcanpd/adorNER.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Methods</title>
      <sec id="sec-5-1">
        <title>4.1. Neural Concept Recognition (NCR) adapted to use ELMo Word Embeddings</title>
        <p>The NCR classifier comprises two parts: the concept embeddings and the CNN classifier.
The concepts are represented by a matrix of randomly initialised embeddings, which share
information between related concepts via multiplication by an ancestry matrix. The CNN
classifier passes filters over the sequence of word embeddings (obtained from a pre-trained
ELMo model [19]) representing the natural language descriptions of the concepts, extracts
semantic signals, and outputs them into the representational space of the concept embeddings.
As the model trains, it learns to reduce the distance between the CNN output and the correct
concept’s position in representational space.</p>
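        <p>The neural-dictionary lookup can be illustrated with a pure-Python sketch. This is not the authors’ implementation: the 2-D embeddings, the toy ancestry matrix, and the input phrase vector are hand-picked stand-ins for quantities the real model learns, but the structure (ancestry-weighted concept embeddings, dot-product similarity, softmax over concepts) follows the description above:</p>

```python
import math

# Toy neural-dictionary lookup (hypothetical numbers; learned in the real model).
# Row c of the ancestry matrix marks concept c and all of its ancestors, so
# related concepts share information when embeddings are combined.
ancestry = [
    [1, 0, 0],  # root
    [1, 1, 0],  # child of root
    [1, 1, 1],  # grandchild
]
raw_embeddings = [[0.2, 0.1], [0.4, -0.3], [-0.1, 0.5]]

def matmul(a, b):
    """Plain-Python matrix multiplication."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Each concept's embedding aggregates its ancestors' raw embeddings.
concept_embeddings = matmul(ancestry, raw_embeddings)

def classify(phrase_vector):
    """Score a phrase vector against every concept and softmax the similarities."""
    scores = [sum(p * e for p, e in zip(phrase_vector, emb)) for emb in concept_embeddings]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = classify([0.3, 0.2])              # hypothetical CNN output for a phrase
best = max(range(len(probs)), key=probs.__getitem__)  # most similar concept
```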
        <p>To perform concept recognition, input sentences are split into all of the possible n-grams they
contain, where n ∈ {1, ..., 7}. Each n-gram is passed to the classifier, returning candidates with
the highest match confidence score above a given threshold. Heuristics then resolve overlap in
the text, favouring longer – likely more specific – terms.</p>
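        <p>The n-gram generation and longest-match heuristic might be sketched as follows. The matched spans are invented for illustration; the real system ranks candidates by the classifier’s confidence scores:</p>

```python
# Generate every n-gram span with n in {1, ..., 7}, then resolve overlapping
# candidate matches in favour of longer (likely more specific) spans.

def ngrams(tokens, max_n=7):
    """Yield (start, end) token spans for all n-grams with n in {1, ..., max_n}."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield (i, i + n)

def resolve_overlaps(matches):
    """Keep longer spans first, dropping any shorter span that overlaps one kept."""
    kept = []
    for span in sorted(matches, key=lambda s: s[1] - s[0], reverse=True):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)

tokens = "short stature and abnormality of the eye".split()
candidate_spans = list(ngrams(tokens))
# Pretend the classifier accepted these three candidate spans:
resolved = resolve_overlaps([(0, 2), (0, 1), (3, 7)])
```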
      </sec>
      <sec id="sec-5-3">
        <title>4.2. Squeeze-and-Excite (SAE)</title>
        <p>The CNN model is augmented to include an SAE mechanism. Here, the methodology described
in Hu et al. [15] is adapted for a one-dimensional CNN. Average-pooling is applied to
the non-zero elements of each feature map produced by the convolutional layer. This reduces
the feature maps into a single vector, z ∈ ℛ^C, where C is the number of feature maps. Each
element of z is calculated like so:</p>
        <p>z_c = (1/N_c) ∑ u_c</p>
        <p>where z_c is the statistic for the c-th filter, u_c is the c-th feature map’s vector, and N_c is the
number of non-zero elements in that vector. The vector of all the feature map statistics is then
compressed and decompressed, as follows:</p>
        <p>s = σ(W_2 δ(W_1 z))</p>
        <p>where s is the vector of filter weights, W_1 ∈ ℛ^(C/r × C) the parameter weights of the compression
transformation, W_2 ∈ ℛ^(C × C/r) the parameter weights of the decompression transformation, σ and
δ are the sigmoid and ReLU activation functions respectively, and r is the compression ratio.
The max-pooled feature maps are then scaled element-wise by s. As the model is trained, the
SAE mechanism learns to model dependencies between feature maps.</p>
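        <p>A minimal pure-Python sketch of the 1-D squeeze-and-excite computation described above. The feature maps and weight matrices are hand-picked toy values; in the model, W_1 and W_2 are learned parameters:</p>

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def squeeze(feature_maps):
    """Average-pool the non-zero elements of each feature map into one statistic."""
    z = []
    for u in feature_maps:
        nonzero = [v for v in u if v != 0]
        z.append(sum(nonzero) / len(nonzero) if nonzero else 0.0)
    return z

def excite(z, w1, w2):
    """Compress with w1 (ReLU), decompress with w2 (sigmoid) to get filter weights."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]

feature_maps = [[0.5, 0.0, 1.5], [0.0, 2.0, 0.0]]  # C = 2 maps of length 3
w1 = [[0.5, 0.5]]                                  # compression ratio r = 2: C -> C/r
w2 = [[1.0], [-1.0]]                               # decompression: C/r -> C
s = excite(squeeze(feature_maps), w1, w2)
# Each (max-pooled) feature map is then scaled element-wise by its weight in s.
scaled = [[v * w for v in u] for u, w in zip(feature_maps, s)]
```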
      </sec>
      <sec id="sec-5-4">
        <title>4.3. Multi-Scale Self Attention (MSSA)</title>
        <sec id="sec-5-4-1">
          <title>Architecture</title>
          <p>Here, the CNN architecture is replaced entirely by an encoder based upon Guo’s
SASA transformer [16]. Given an input of word embeddings X ∈ ℛ^(N × D), where N represents
the number of embeddings and D their dimensionality, each scale-aware attention head can be
described as follows:</p>
          <p>head_j(X, ω)_i = softmax(Q_i C(K, ω)_i^⊤ / √(D/h)) C(V, ω)_i (1)</p>
          <p>where ω is the scale parameter, j corresponds to the j-th head, and i to the i-th element of the
sequence. K, Q, V are the projections of X into N × D/h subspaces, with h being the number of
heads. X is projected into these subspaces by multiplying it by the parameter matrices W^Q,
W^K, and W^V (all W ∈ ℛ^(D × D/h)):</p>
          <p>Q = XW^Q, K = XW^K, V = XW^V (2)</p>
          <p>C(x, ω)_i is the context-extraction function, which is defined as:</p>
          <p>C(X, ω)_i = [x_(i−ω), ..., x_(i+ω)] (3)</p>
          <p>When the scale parameter exceeds the range of the sequence, the context-extraction function
pads the sequence with zeros of the appropriate dimensions.</p>
          <p>The h heads are incorporated into a Multi-Scale Multi-headed Self-Attention (MSMSA) block.
This block consists of the scale-aware attention layer and a feed-forward network and is
computed as follows:</p>
          <p>MSMSA(X, Ω) = norm([head_1(X, ω_1), ..., head_h(X, ω_h)]W + X) (4)</p>
          <p>where Ω = {ω_1, ..., ω_h} is the set of scale parameters, W is a parameter matrix and norm is the
layer normalisation function. The output is then passed to a feed-forward (FF) layer. MSMSA
blocks can be stacked multiple times, with varying sets of scale parameters. The final output is
reduced into a single vector z by summing each output vector element-wise, normalised by the
square root of the sequence length:</p>
          <p>z = (1/√N) ∑_(i=1)^(N) X_i (5)</p>
          <p>z is then passed onto a final feed-forward layer that converts the sentence representation into
the representation space of the concept embeddings. This layer comprises two consecutive
linear transformations, each with GELU activations and l2 normalisation, followed by a final
linear transformation with no activation layer.</p>
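          <p>The effect of the scale parameter ω can be illustrated with a scalar toy version of one attention position. This is a deliberate simplification: real heads operate on projected query, key, and value vectors rather than the scalars used here, and out-of-range positions stand in for zero padding:</p>

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scale_aware_attention(q, k, v, i, omega):
    """Attend from position i only to positions i-omega..i+omega (zero-padded)."""
    window = range(i - omega, i + omega + 1)
    # Out-of-range positions behave like zero padding: score and value are 0.
    keys = [k[j] if 0 <= j < len(k) else 0.0 for j in window]
    values = [v[j] if 0 <= j < len(v) else 0.0 for j in window]
    weights = softmax([q[i] * kj for kj in keys])
    return sum(w * vj for w, vj in zip(weights, values))

q = k = v = [0.1, 0.9, 0.2, 0.7]
narrow = scale_aware_attention(q, k, v, i=1, omega=1)  # local context only
wide = scale_aware_attention(q, k, v, i=1, omega=3)    # effectively full sequence
```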
        </sec>
        <sec id="sec-5-4-2">
          <title>Dropout regime</title>
          <p>For each input batch, there was a 50% chance of the dropout function being
applied to the entire batch. If applied, entire embeddings were dropped from a sequence with
a probability of 20%. Sequences shorter than three tokens in length were excluded from this,
as there was a chance the input signal would be too degraded. Each attention head had a 25%
chance of being dropped out. A subsequent dropout was applied after the
activation function within the attention block’s feed-forward network, where random elements
were dropped from the matrix with a 10% probability. This dropout was also applied within
the final feed-forward layer. Inside the attention heads, a further dropout was applied to the
iteratively extracted sections of the Q, K, and V matrices. A final dropout was applied to the
attention scores prior to being scaled and the softmax function being applied.</p>
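          <p>The head-dropout step might be sketched as follows. This is an illustrative simplification (a 25% chance of zeroing a head’s output at training time); the cited methods apply scheduled or otherwise structured variants:</p>

```python
import random

def drop_heads(head_outputs, p_drop=0.25, rng=random):
    """Zero out entire attention-head outputs with probability p_drop (training only)."""
    dropped = []
    for head in head_outputs:
        if rng.random() < p_drop:
            dropped.append([0.0] * len(head))  # head removed for this batch
        else:
            dropped.append(list(head))
    return dropped

rng = random.Random(0)
heads = [[0.2, 0.4], [0.5, 0.1], [0.3, 0.3], [0.6, 0.2]]
survivors = drop_heads(heads, p_drop=0.25, rng=rng)
```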
        </sec>
        <sec id="sec-5-4-3">
          <title>Scale regimes</title>
          <p>The scale regimes for SASA were originally designed for much larger
sequences, whereas these models have a maximum input size of ten tokens. Figure 1 illustrates
the different combinations of self-attention head scaling parameters tested.</p>
        </sec>
      </sec>
      <sec id="sec-5-5">
        <title>4.4. Ontologies</title>
        <p>This work also explored combining multiple ontologies from various domains of knowledge,
rather than a single ontology. Table 1 displays an overview of the ontology combinations, the
number of unique concepts represented, and their total number of training examples. One
set contained the HPO alone, another both the HPO and the semantically similar Mammalian
Phenotype Ontology (MPO). The last set comprised three semantically distinct ontologies: the
HPO, the Cell Ontology (CLO), and the Ontology of Host-Pathogen Interactions (OHPI). Unique
concept IDs in common between ontologies were combined into single concepts.</p>
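        <p>The merging of shared concept IDs can be sketched like so. The ontology entries below are invented for illustration (including the shared ID); real OBO files would be parsed into comparable concept-to-names mappings:</p>

```python
# Combine multiple ontologies into one vocabulary: a concept ID appearing in
# more than one ontology becomes a single concept with pooled names/synonyms.

hpo = {"HP:0000478": ["Abnormality of the eye"], "HP:0001263": ["Developmental delay"]}
other = {"CL:0000540": ["neuron"], "HP:0000478": ["eye abnormality"]}

def merge_ontologies(*ontologies):
    merged = {}
    for onto in ontologies:
        for concept_id, names in onto.items():
            merged.setdefault(concept_id, [])
            for name in names:
                if name not in merged[concept_id]:
                    merged[concept_id].append(name)
    return merged

combined = merge_ontologies(hpo, other)
# "HP:0000478" now appears once, with training examples pooled from both sources.
```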
      </sec>
      <sec id="sec-5-6">
        <title>4.5. Training</title>
        <p>
          In total 81 models were trained (78 MSSA, and 3 NCR) in a Python 3.6.13 environment using
TensorFlow 2.2.0 [20]. An unresolvable compiler incompatibility prevented the use of FastText
and TensorFlow 2 together, as was used in Arbabi et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Instead, the word embeddings
were obtained from the pre-trained ELMo model (v3) using TensorFlow Hub. Nine models (all
SAE) were trained in a Python 3.9.12 environment with TensorFlow 2.9.1. A bug in the earlier
version of TensorFlow prevented the calculation of the means of only non-zero elements for
each feature map.
        </p>
        <p>All models were trained with a batch size of 256. After five epochs with no improvement in
the training loss, the NCR and SAE models would cease training and revert to the best performing
parameter weights. For the MSSA models, after five epochs of no improvement, the model
parameters reverted to the previous best parameter weights, and training resumed with 1/5th of
the previous learning rate. After the 5th learning rate change, when there was no improvement
for five epochs, training was stopped with the model reverting to the best scoring parameter
weights. Both the SAE and NCR models had an initial learning rate of 1/512. The MSSA models
had a warm-up period where the learning rate increased linearly from zero to 1/512 over</p>
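        <p>The MSSA plateau policy might be simulated as follows. The loss sequence is illustrative, and weight reversion is only noted in comments; the real training loop monitors the training loss each epoch:</p>

```python
# Sketch of the plateau policy: after five epochs with no improvement, revert
# to the best weights and divide the learning rate by 5; stop after the fifth
# such reduction fails to help.

def train_schedule(losses, initial_lr=1 / 512, patience=5, max_reductions=5):
    """Simulate the policy over a loss trace; return (final_lr, reductions made)."""
    lr, best, stale, reductions = initial_lr, float("inf"), 0, 0
    for loss in losses:
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        if stale >= patience:
            reductions += 1
            if reductions > max_reductions:
                break        # stop training, keep the best-scoring weights
            lr /= 5          # revert to best weights, shrink the step size
            stale = 0
    return lr, reductions

final_lr, n_reductions = train_schedule([1.0] + [2.0] * 5)
```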
      </sec>
      <sec id="sec-5-7">
        <title>4.6. Evaluation</title>
        <p>Once trained, the models were calibrated and then assessed against an annotated gold-standard
corpus of 228 PubMed abstracts [21] (available here: https://github.com/lasigeBioTM/IHP), an
update of the benchmarking dataset created by Groza et al. [22]. To calibrate, the models
were used to annotate 40 randomly selected abstracts using confidence thresholds
t ∈ {0.05, 0.10, ..., 0.95}. The thresholds with the highest sum of both macro and micro F-scores for each
model were then used as the threshold for annotating the remaining 188 abstracts.</p>
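        <p>The calibration sweep can be sketched as below. The scoring function is a hypothetical stand-in for evaluating a model’s annotations of the 40 calibration abstracts at a given threshold:</p>

```python
# Sweep confidence thresholds over {0.05, 0.10, ..., 0.95} and keep the one
# that maximises the sum of macro and micro F-scores on the calibration set.

def pick_threshold(score_fn):
    """score_fn(t) -> (macro_f, micro_f); return the best-scoring threshold."""
    thresholds = [round(0.05 * i, 2) for i in range(1, 20)]  # 0.05 .. 0.95
    return max(thresholds, key=lambda t: sum(score_fn(t)))

# Hypothetical evaluation whose scores peak in the middle of the range:
toy_scores = lambda t: (1 - abs(t - 0.45), 1 - abs(t - 0.45))
best_t = pick_threshold(toy_scores)
```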
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results</title>
      <p>Table 2 displays the performance metrics for the NCR and SAE models and Table 3 shows the
results of best performing MSSA models for each ontology combination. The best performing
score for each metric is highlighted in bold.</p>
      <p>
        When trained solely using the HPO, ELMo embeddings improved both F1 scores of the
NCR model by at least 4%, compared with the FastText NCR model reported in Arbabi et al.
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. However, the performance declined when additional ontologies were combined with the
HPO. The SAE model, trained with diverse domain ontologies, achieved a new SOTA, while
only slightly increasing the total number of model parameters.
      </p>
      <p>The MSSA model scores were competitive with the other models, with performance increasing
with the inclusion of additional domain ontologies in the training set. Both Tables 2 and 3 show
that the models trained using a combination of the HPO and MPO did not perform as well
as those with the HPO alone, or in combination with the CLO and OHPI. Figure 2 contains
density-contour plots of the concept embeddings for the best performing SAE models for each
ontology combination.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Discussion</title>
      <p>In addition to the SAE model setting a new SOTA for concept recognition, this work reinforces
the findings that scaled attention architectures can be competitive with CNNs in low-resource
settings.</p>
      <p>Training the SAE and MSSA models with diverse domain ontologies resulted in better
performance extracting HPO terms. The additional ontologies carry general English-language
semantics that may improve the inductive bias. Conversely, when the domains of the ontologies
have significant overlap, performance is reduced. In Figure 2 panel B, HPO and MPO terms
are less clearly separated compared to the more clearly demarcated ontologies in C. Although
HPO and MPO contain similar concepts with similar natural language names, their ancestry
structure is different, which separates these parallel concepts in representational space. To
discriminate between them, the models need to attend to more features. This is reflected in the
explained variance percentages in Figure 2. Further research regarding how the specific ontology
composition and features impact model performance is needed, and assessment may pose a
challenge. Currently, we are unaware of any other benchmark dataset annotated with terms
from another domain ontology. While the performance of each model will depend upon the
language, we expect the scaled attention architecture to be particularly sensitive to syntactical
variation. Specific efforts will need to be carried out by fluent native or multilingual experts.</p>
      <p>Qualitative analysis of our model predictions found that the heuristics preventing overlapping
matches led to a not-insignificant number of false negatives. Lobo et al. [21] found that 26% of
annotated concepts in the gold standard corpora overlap. As a result, Luo et al. [12] altered their
heuristics to allow overlap. Indeed, concept overlap is required for multi-domain annotation.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>Thank you to the reviewers for their kind feedback. This work was funded as part of the
Norwich Research Park Biosciences Doctoral Training Partnership, grant number BB/M011216/1,
reference code 2243628.</p>
      <p>Text Using Ontology-Guided Machine Learning, JMIR Medical Informatics 7 (2019) e12596.
doi:10.2196/12596.
[12] L. Luo, S. Yan, P.-T. Lai, D. Veltri, A. Oler, S. Xirasagar, R. Ghosh, M. Similuk, P. N. Robinson,
Z. Lu, PhenoTagger: A Hybrid Method for Phenotype Concept Recognition using Human
Phenotype Ontology, Bioinformatics 37 (2021) 1884–1890. doi:10.1093/bioinformatics/btab019.
[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser,
I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing
Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[14] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-End Object
Detection with Transformers, in: A. Vedaldi, H. Bischof, T. Brox, J.-M. Frahm (Eds.),
European Conference on Computer Vision, Springer International Publishing, Cham, 2020,
pp. 213–229. doi:10.1007/978-3-030-58452-8_13.
[15] J. Hu, L. Shen, G. Sun, Squeeze-and-Excitation Networks, in: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132–7141.
[16] Q. Guo, X. Qiu, P. Liu, X. Xue, Z. Zhang, Multi-scale Self-Attention for Text Classification,
in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 2020, pp.
7847–7854. doi:10.1609/aaai.v34i05.6290.
[17] W. Zhou, T. Ge, K. Xu, F. Wei, M. Zhou, Scheduled DropHead: A Regularization Method
for Transformer Models, arXiv preprint arXiv:2004.13342 (2020). doi:10.48550/arXiv.2004.13342.
[18] Z. Wu, L. Wu, Q. Meng, Y. Xia, S. Xie, T. Qin, X. Dai, T.-Y. Liu, UniDrop: A Simple
yet Effective Technique to Improve Transformer without Extra Cost, arXiv preprint
arXiv:2104.04946 (2021). doi:10.48550/arXiv.2104.04946.
[19] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep
contextualized word representations, arXiv preprint arXiv:1802.05365 (2018). doi:10.48550/arXiv.1802.05365.
[20] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis,
J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R.
Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray,
C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V.
Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu,
X. Zheng, TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed
Systems, 2016. URL: https://www.tensorflow.org/. doi:10.48550/arXiv.1603.04467, software
available from tensorflow.org.
[21] M. Lobo, A. Lamurias, F. M. Couto, Identifying Human Phenotype Terms by Combining
Machine Learning and Validation Rules, BioMed Research International 2017 (2017).
doi:10.1155/2017/8565739.
[22] T. Groza, S. Köhler, S. Doelken, N. Collier, A. Oellrich, D. Smedley, F. M. Couto, G. Baynam,
A. Zankl, P. N. Robinson, Automatic concept recognition using the Human Phenotype
Ontology reference and test suite corpora, Database 2015 (2015). doi:10.1093/database/bav005.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fortunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Bergstrom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Börner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Helbing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Milojević</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Radicchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sinatra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Uzzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vespignani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Waltman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-L.</given-names>
            <surname>Barabási</surname>
          </string-name>
          ,
          <source>Science of science, Science</source>
          <volume>359</volume>
          (
          <year>2018</year>
          )
          <article-title>eaao0185</article-title>
          . URL: https://www.science.org/doi/abs/10.1126/science.aao0185. doi:10.1126/science.aao0185.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dahlhaus</surname>
          </string-name>
          ,
          <article-title>The Role of FAIR Data towards Sustainable Agricultural Performance: A Systematic Literature Review</article-title>
          ,
          <source>Agriculture</source>
          <volume>12</volume>
          (
          <year>2022</year>
          )
          <article-title>309. doi:10.3390/agriculture12020309</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Cooper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Walls</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Elser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Gandolfo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Stevenson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Preece</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Athreya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Mungall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rensing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Z.</given-names>
            <surname>Berardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Huala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schaefer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Menda</surname>
          </string-name>
          , E. Arnaud,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shrestha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yamazaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <article-title>The Plant Ontology as a Tool for Comparative Plant Anatomy and Genomic Analyses</article-title>
          ,
          <source>Plant and Cell Physiology</source>
          <volume>54</volume>
          (
          <year>2013</year>
          )
          <fpage>e1</fpage>
          -
          <lpage>e1</lpage>
          .
          doi:10.1093/pcp/pcs163.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jupp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Burdett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. E.</given-names>
            <surname>Parkinson</surname>
          </string-name>
          ,
          <article-title>A new Ontology Lookup Service at EMBL-EBI</article-title>
          ,
          <source>in: SWAT4LS</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>118</fpage>
          -
          <lpage>119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Eine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jurisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Quint</surname>
          </string-name>
          ,
          <article-title>Ontology-Based Big Data Management</article-title>
          ,
          <source>Systems</source>
          <volume>5</volume>
          (
          <year>2017</year>
          ). URL: https://www.mdpi.com/2079-8954/5/3/45. doi:10.3390/systems5030045.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Wilkinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dumontier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. J.</given-names>
            <surname>Aalbersberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Appleton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Axton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Baak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Blomberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-W.</given-names>
            <surname>Boiten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. B. da Silva</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Bourne</surname>
          </string-name>
          , et al.,
          <article-title>The FAIR Guiding Principles for scientific data management and stewardship</article-title>
          ,
          <source>Scientific Data</source>
          <volume>3</volume>
          (
          <year>2016</year>
          ).
          doi:10.1038/sdata.2016.18.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tseytlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Legowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Corrigan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Jacobson</surname>
          </string-name>
          ,
          <article-title>NOBLE - Flexible concept recognition for large-scale biomedical natural language processing</article-title>
          ,
          <source>BMC Bioinformatics</source>
          <volume>17</volume>
          (
          <year>2016</year>
          ).
          doi:10.1186/s12859-015-0871-y.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Jonquet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Youn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callendar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Storey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Musen</surname>
          </string-name>
          ,
          <article-title>NCBO Annotator: Semantic Annotation of Biomedical Data</article-title>
          , in: International Semantic Web Conference, Poster and Demo session, volume
          <volume>110</volume>
          ,
          Washington DC, USA,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Batbaatar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. H.</given-names>
            <surname>Ryu</surname>
          </string-name>
          ,
          <article-title>Ontology-Based Healthcare Named Entity Recognition from Twitter Messages Using a Recurrent Neural Network Approach</article-title>
          ,
          <source>International Journal of Environmental Research and Public Health</source>
          <volume>16</volume>
          (
          <year>2019</year>
          )
          <fpage>3628</fpage>
          .
          doi:10.3390/ijerph16193628.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Transfer bi-directional LSTM RNN for named entity recognition in Chinese electronic medical records</article-title>
          ,
          <source>in: 2017 IEEE 19th International Conference on e-Health Networking, Applications and Services (Healthcom)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
          doi:10.1109/HealthCom.2017.8210840.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Arbabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brudno</surname>
          </string-name>
          , et al.,
          <article-title>Identifying Clinical Terms in Medical</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>