<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Generic Neural Exhaustive Approach for Entity Recognition and Sensitive Span Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohammad Golam Sohrab</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pham Minh Thang</string-name>
          <email>pham.thangg@aist.go.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Makoto Miwa</string-name>
          <email>makoto-miwa@toyota-ti.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology</institution>
          ,
          <addr-line>2-4-7 Aomi, Koto-ku, Tokyo, 135-0064</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Toyota Technological Institute</institution>
          ,
          <addr-line>2-12-1 Hisakata Tempaku-ku Nagoya</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>735</fpage>
      <lpage>743</lpage>
      <abstract>
        <p>In this work, we present a deep exhaustive framework for the MEDDOCAN shared task. The framework employs a generic named entity recognition (NER) model that captures the underlying semantic information of texts. The key idea of our model is to enumerate all possible spans as potential entity mentions and classify them with deep neural networks. We introduce different sets of learning algorithms, including base representation average (BR-Avg), BR with attention mechanism (BR-Attn), LSTM-Minus-based average (LM-Avg), and LSTM-Minus-based attention (LM-Attn), each used with or without context after the LSTM layer (Context or None), as well as an ensemble approach using maximum voting over all the approaches. We evaluate our exhaustive model on the two sub-tasks of the MEDDOCAN shared task in the medical domain using the official evaluation script. Among the five submitted runs, the best run for each sub-task achieved an F-score of 93.12% on Sub-task 1 and F-scores of 93.52% (strict) and 94.92% (merged) on Sub-task 2 without any external knowledge resources.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep learning</kwd>
        <kwd>NER</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The MEDDOCAN shared task [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is an open challenge on medical entity detection
that allows participants to use any methodology and knowledge sources to process
clinical records containing protected health information (PHI). The task allows
the comparison of the participating systems using the same benchmark
dataset and evaluation method. Named entity recognition has drawn considerable
attention as the first step towards many natural language processing (NLP)
applications, including relation extraction [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], event extraction [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], co-reference
resolution [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and entity linking [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Recently, deep neural networks have shown
impressive performance on flat named entity recognition in several domains [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Such models have achieved state-of-the-art results without requiring any
handcrafted features or external knowledge resources.
      </p>
      <p>In this paper, we present a novel neural exhaustive model that detects flat
and nested entities. The model reasons over all the regions within a specified
maximum size. The model first represents each region as the combination of
the boundary and inside representations by using the outputs of a bidirectional
long short-term memory (LSTM). The inside representation simply treats all the
tokens in a region equally by taking the average of the LSTM outputs corresponding
to the tokens inside the region. The model then classifies the regions into their entity types or
non-entity. Unlike existing models that rely on token-level labels, our model
directly employs an entity type as the label of a region.</p>
      <p>We evaluated our model on the MEDDOCAN task in the clinical domain, which
covers named entity recognition (NER), officially called NER offset
and entity type classification, and sensitive span detection. The best run for
each sub-task achieved the F-score of 93.12% on Sub-task 1 and the F-scores of
93.52% (strict) and 94.92% (merged) on Sub-task 2.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Sohrab et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] detected the inner and outermost entities using an exhaustive
approach and outperformed the previous state-of-the-art results, achieving 77.1% in
terms of F-score. Zhou et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] detected nested entities in a bottom-up way.
They detected the innermost flat entities and then found other NEs containing
the flat entities as sub-strings using rules derived from the detected entities. The
authors reported an improvement of around 3% in the F-score under certain
conditions on the GENIA corpus [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Recent studies show that conditional
random fields (CRFs) can produce significantly higher tagging accuracy in flat
or nested (stacking flat NER into a nested representation) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] NERs. Ju et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
proposed a novel neural model to address nested entities by dynamically stacking
flat NER layers until no outer entities are extracted. A cascaded CRF layer is
used after the LSTM output in each flat layer. The authors reported that the
model outperforms state-of-the-art results by achieving 74.5% in F-score.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Neural Exhaustive Approach</title>
      <p>We solve the NER and sensitive span detection (SSD) tasks using a neural
exhaustive approach that exhaustively considers all possible regions in a sentence
using a single neural network. The model detects nested entities by enumerating
all possible spans or regions. Our model is built upon a shared bidirectional
LSTM (Bi-LSTM) layer. Figure 1 shows the exhaustive model for solving entity
recognition and SSD.</p>
      <sec id="sec-3-1">
        <title>Embedding Layer</title>
        <p>In the embedding layer, each word is represented by concatenating its pretrained
word embedding and a character-based word representation that encodes the
character-level information of the word. The character-based word
representation is obtained by feeding the sequence of character embeddings comprising
a word to a Bi-LSTM layer and concatenating the forward and backward output
representations. The character embeddings in a word are randomly initialized.
Given an input sentence sequence X = {x_1, x_2, ..., x_n}, where x_i denotes the
i-th word and n denotes the number of words in the sentence, the
distributed embeddings of the words introduced above are
fed into a bidirectional LSTM (Bi-LSTM) layer. The Bi-LSTM layer computes
the hidden vector sequence in the forward (→h = {→h_1, →h_2, ..., →h_n}) and backward
(←h = {←h_1, ←h_2, ..., ←h_n}) directions. We concatenate the forward and backward
outputs as h_i = [→h_i; ←h_i], where [;] denotes concatenation.</p>
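        <p>The following is a minimal sketch of this embedding layer, written in PyTorch for illustration only (the paper's system is implemented in Chainer); the class and variable names are hypothetical, and the dimensions follow the hyper-parameters listed in Section 4.</p>
        <preformat>
import torch
import torch.nn as nn

class WordCharEmbedder(nn.Module):
    """Concatenate a word embedding with a character-level Bi-LSTM representation."""

    def __init__(self, word_vocab, char_vocab, word_dim=200, char_dim=25, char_hidden=25):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)   # pretrained in the paper
        self.char_emb = nn.Embedding(char_vocab, char_dim)   # randomly initialized
        self.char_lstm = nn.LSTM(char_dim, char_hidden, bidirectional=True, batch_first=True)

    def forward(self, word_ids, char_ids_per_word):
        # word_ids: (n,) tensor of word indices for one sentence
        # char_ids_per_word: list of n tensors, one per word, holding character indices
        reps = []
        for wid, cids in zip(word_ids, char_ids_per_word):
            chars = self.char_emb(cids).unsqueeze(0)               # (1, word_length, char_dim)
            _, (h_n, _) = self.char_lstm(chars)                    # h_n: (2, 1, char_hidden)
            char_rep = torch.cat([h_n[0, 0], h_n[1, 0]], dim=-1)   # forward ++ backward
            reps.append(torch.cat([self.word_emb(wid), char_rep], dim=-1))
        return torch.stack(reps)                                   # (n, word_dim + 2 * char_hidden)
        </preformat>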
      </sec>
      <sec id="sec-3-2">
        <title>Exhaustive Layer</title>
        <p>The exhaustive layer enumerates all possible regions by exhaustive combination.
We generate all possible regions with sizes less than or equal to the maximum
region size L, which is predefined. We use (i, k) to represent the region from i
to k inclusive, where 1 ≤ i ≤ k ≤ n and k − i &lt; L.</p>
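        <p>As a concrete illustration, the sketch below (our own, hypothetical code) enumerates all candidate regions of a sentence up to a maximum width; indices here are 0-based and inclusive.</p>
        <preformat>
def enumerate_regions(n, max_width):
    """Yield all (i, k) index pairs with region width at most max_width."""
    for i in range(n):
        for k in range(i, min(i + max_width, n)):
            yield (i, k)

# A 4-token sentence with max_width=3 gives 9 candidate regions:
# [(0, 0), (0, 1), (0, 2), (1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
print(list(enumerate_regions(4, 3)))
        </preformat>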
        <p>We represent each region using the outputs of the shared underlying LSTM
layer. We represent the region with two separate representations: the boundary
representation for region detection and the inside representation for semantic
type classification. In the latter part of this section, we first introduce the base
region representations and then explain two enhancements.</p>
        <p>Base Region Representations The boundary representation is prepared to
capture both ends of the region. We rely on the outputs of the bidirectional
LSTM layer corresponding to the boundary words of a target region for this
purpose. We obtain the left- and right-boundary representation R(i, k)_[L,R] of
the region (i, k) as follows:</p>
        <p>R(i, k)_[L,R] = [h_i; h_k].  (1)</p>
        <p>The inside representation is prepared to capture its semantic type by
encoding the whole semantic information of the region. In the base representation, we
average the outputs of the Bi-LSTM layer in the region to treat them equally.</p>
        <p>Using the boundary and inside representations, we obtain the left-boundary,
inside-with-average, and right-boundary representation R(i, k)_[L,A,R] of the region (i, k)
as follows:</p>
        <p>R(i, k)_[L,A,R] = [h_i; (1 / (k − i + 1)) Σ_{j=i..k} h_j; h_k].  (2)</p>
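        <p>A hedged sketch of this base representation, assuming the Bi-LSTM outputs for one sentence are already available as a matrix h of shape (n, d); the function name is ours:</p>
        <preformat>
import torch

def base_region_representation(h, i, k):
    """Boundary + average inside representation of region (i, k), Eqs. 1-2."""
    inside_avg = h[i:k + 1].mean(dim=0)          # (1 / (k - i + 1)) * sum of h_i..h_k
    return torch.cat([h[i], inside_avg, h[k]])   # R(i, k)_[L,A,R], shape (3 * d,)
        </preformat>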
      </sec>
      <sec id="sec-3-3">
        <title>Region Representations using Attention Mechanism</title>
        <p>
          Instead of relying only on the average of the outputs of the Bi-LSTM layer, we also try an attention
mechanism [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] over words in each region to capture the notion of headedness.
Specifically, we extend the inside representation using the attention mechanism as
follows:
        </p>
        <p>α_t = w_α · FFNN_α(x_t),  (3)</p>
        <p>a_{i,t} = exp(α_t) / Σ_{k=start(i)..end(i)} exp(α_k),  (4)</p>
        <p>x̄_i = Σ_{t=start(i)..end(i)} a_{i,t} x_t,  (5)</p>
        <p>where x_t is the concatenated output of the Bi-LSTM layer over a region and x̄_i is
a weighted sum of the word vectors in region (i, k). Instead of Eq. 2, we obtain the left-boundary,
inside-with-attention, and right-boundary representation R(i, k)_[L,A,R] of
the region (i, k) as follows:</p>
        <p>R(i, k)_[L,A,R] = [h_i; x̄_i; h_k].  (6)</p>
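        <p>A small sketch (under our own assumptions, with a hypothetical module name) of this attention-based inside representation: a feed-forward scorer produces one scalar per token, the scores are softmax-normalized within the region, and the inside vector is the weighted sum of the Bi-LSTM outputs in the region.</p>
        <preformat>
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    """Attention-based inside representation of a region (Eqs. 3-6)."""

    def __init__(self, d):
        super().__init__()
        self.ffnn = nn.Linear(d, 1)                    # scores alpha_t, Eq. 3

    def forward(self, h, i, k):
        region = h[i:k + 1]                            # Bi-LSTM outputs inside (i, k)
        scores = self.ffnn(region).squeeze(-1)         # one scalar per token
        weights = torch.softmax(scores, dim=0)         # a_{i,t}, Eq. 4
        inside_attn = weights @ region                 # weighted sum, Eq. 5
        return torch.cat([h[i], inside_attn, h[k]])    # R(i, k)_[L,A,R], Eq. 6
        </preformat>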
      </sec>
      <sec id="sec-3-4">
        <title>Region Representations using LSTM-Minus</title>
        <p>
          We also employ LSTM-Minus [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] for the boundary representation. The left boundary is computed as the
representation of the word preceding the region subtracted from the representation
of the last word of the current region. Similarly, the right boundary is computed
as the representation of the word following the region subtracted from the
representation of the first word of the current region. We obtain the representation
R(i, k)_[L,R] of the region (i, k) by concatenating the left and right boundaries
based on LSTM-Minus, computed as follows:
        </p>
        <p>R(i, k)_[L,R] = [h_k − h_{i−1}; h_i − h_{k+1}].  (7)</p>
        <p>The above region or span information is concatenated with the average embedding
of the region (i, k) to produce the LSTM-Minus-based representation:</p>
        <p>R(i, k)_[L,A,R] = [h_k − h_{i−1}; (1 / (k − i + 1)) Σ_{j=i..k} h_j; h_i − h_{k+1}].  (8)</p>
        <p>With the LSTM output h_i, we also introduce a context-level representation from the
bidirectional LSTM layer. The idea of this approach is to capture the LSTM outputs
surrounding a target region (i, k) by concatenating the previous output h_{i−1}, the
current output h_i, and the next output h_{k+1}. With contextual
region representations, we can further generate new representations from Eqs.
1-8. Figure 3 shows the architecture of contextual-level integration. We then feed
the representation of each segmented region to a rectified linear unit (ReLU)
activation function. Finally, the output of the activation layer is passed to a
softmax output layer to classify the region into a specific entity type.</p>
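        <p>A rough sketch (our own, with hypothetical names) of the LSTM-Minus representation and the ReLU/softmax classification head described above; the zero vectors used for positions outside the sentence and the hidden linear layer before the ReLU are our assumptions.</p>
        <preformat>
import torch
import torch.nn as nn

def lstm_minus_representation(h, i, k):
    """LSTM-Minus boundaries plus average inside representation (Eqs. 7-8)."""
    n, d = h.shape
    pad = torch.zeros(d)
    prev = h[i - 1] if i > 0 else pad        # word before the region (zero if none)
    nxt = h[k + 1] if n - 1 > k else pad     # word after the region (zero if none)
    inside_avg = h[i:k + 1].mean(dim=0)
    return torch.cat([h[k] - prev, inside_avg, h[i] - nxt])   # Eq. 8

class RegionClassifier(nn.Module):
    """ReLU activation followed by a softmax output layer over entity types plus non-entity."""

    def __init__(self, rep_dim, n_types):
        super().__init__()
        self.hidden = nn.Linear(rep_dim, rep_dim)      # hidden projection (our assumption)
        self.out = nn.Linear(rep_dim, n_types + 1)     # entity types + non-entity class

    def forward(self, region_rep):
        return torch.softmax(self.out(torch.relu(self.hidden(region_rep))), dim=-1)
        </preformat>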
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Settings</title>
      <sec id="sec-4-1">
        <title>Evaluation Settings</title>
        <p>
          We evaluated our exhaustive model on MEDDOCAN3 dataset to provide
empirical evidence for the e ectiveness of the exhaustive model both in NER and SSD.
Our model is implemented in Chainer4 deep learning framework. We generated
task speci c word embeddings by merging the raw text of training, development,
and test (including background set) sets, which included 200-dimensional
embeddings of 77,559 vocabulary. We used Adam [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for learning with a mini-batch
size of 10. We used the same hyper-parameters in all the experiments; we set
the dimension of word embedding to 200, the dimension of character embedding
to 25, the hidden layer size to 200, the gradient clipping to 5, and the Adam
hyper-parameters to its default values [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. We employed the official
MEDDOCAN evaluation script (https://github.com/PlanTL-SANIDAD/MEDDOCAN-CODALAB-EvaluationScript) to evaluate our system performance on both tasks.
        </p>
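        <p>For reference, the hyper-parameters listed above can be summarized as a plain configuration dictionary (a convenience sketch; the dictionary and key names are ours, not those of the original Chainer implementation):</p>
        <preformat>
# Hyper-parameters used in all experiments (Section 4.1).
CONFIG = {
    "word_embedding_dim": 200,
    "char_embedding_dim": 25,
    "hidden_layer_size": 200,
    "gradient_clipping": 5,
    "optimizer": "Adam",       # default Adam hyper-parameters [7]
    "mini_batch_size": 10,
}
        </preformat>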
      </sec>
      <sec id="sec-4-2">
        <title>Data Pre-processing</title>
        <p>We read text directly from the input text files. We learn and detect spans using the
neural exhaustive approach over the Bi-LSTM layer, creating all possible
combinations from the beginning to the end of a given sequence. Unlike traditional NER
models, our model is independent of the traditional 'BIO' tagging scheme, where
'B', 'I', and 'O' stand for 'Begin', 'Inside', and 'Outside' of named entities,
respectively. Thus, each text and annotation file is processed by several simple
rules only for tokenization. After tokenization, each text, together with its mapped
annotation file, is passed to the deep neural approach for mention detection, classification,
and sensitive token detection. Note that the offsets are restored to the original
offsets for evaluation.</p>
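        <p>As a hypothetical illustration of this BIO-free setup (the helper, data format, and label string below are our assumptions, not the authors' code), every enumerated region can be labeled directly with an entity type, or with a non-entity class when it does not match an annotated mention:</p>
        <preformat>
def label_regions(regions, gold_mentions):
    """regions: iterable of (i, k) spans; gold_mentions: dict mapping (i, k) to an entity type."""
    return [gold_mentions.get(span, "NONE") for span in regions]

# Example: one gold mention covering tokens 1..2 of a short sentence.
regions = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3), (3, 3)]
gold = {(1, 2): "NOMBRE_SUJETO_ASISTENCIA"}
print(label_regions(regions, gold))
        </preformat>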
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results and Discussions</title>
      <p>In order to evaluate the performance of NER and sensitive token detection,
we conduct experiments on different sets of learning algorithms, including base
representation (BR) average (BR-Avg), BR attention (BR-Attn),
LSTM-Minus-based average (LM-Avg), and LSTM-Minus-based attention (LM-Attn), each used with
or without context after the LSTM layer (Context or None). Table 1 shows
the five submitted results on NER in terms of F-score on the test set. In
strict matching, the ensemble approach using maximum voting over all the
approaches (BR-Avg-None, BR-Attn-None, BR-Avg-Context,
BR-Attn-Context, LM-Avg-None, LM-Attn-None, LM-Avg-Context, and
LM-Attn-Context) is very effective in improving the
system performance for NER and sensitive token detection. In contrast, BR-Avg-None shows the best performance
on NER in terms of F-score when using merged matching. Table 2 shows the
category-wise performance on the MEDDOCAN dataset.</p>
      <p>We show the differences in performance on the development set to
compare the possible scenarios of the given solutions and to report the best system
submissions for NER and SSD. Table 3 shows the performance of the different
approaches on the development set for Sub-tasks 1 and 2. For Sub-task 1, Table 3
shows that the results of the different approaches are close to each other. In
contrast, for Sub-task 2, Table 3 shows that attention and average with different
boundary representations of a region are effective in both the strict and merged
evaluations for detecting sensitive tokens.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>This paper presented neural exhaustive and neural contextual exhaustive models
that consider all possible regions exhaustively for named entity recognition and
sensitive token detection. The model obtains the
representation of each region using the outputs of the underlying shared LSTM layer,
and it represents the different regions by concatenating boundary and inside
representations of the region. Several enhancements, namely an attention mechanism,
LSTM-Minus, context from base representations, and context from LSTM-Minus,
are investigated for the representations. The model then classifies each region into an entity
type or non-entity. The model does not depend on any external NLP tools. In
the experiments, we show that our model learns to detect flat and nested entities
from the mention candidates generated from all possible regions. Among the five
submitted runs, the best run for each sub-task achieved the F-score of 93.12% on
Sub-task 1 and the F-scores of 93.52% (strict) and 94.92% (merged) on Sub-task
2 without any external knowledge resources.</p>
      <p>Acknowledgments. We thank the anonymous reviewers for their valuable
comments. This paper is based on results obtained from a project commissioned by
the New Energy and Industrial Technology Development Organization (NEDO).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>In: ICLR</source>
          <year>2015</year>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Collier</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>H.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ogata</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tateisi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nobata</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ohta</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sekimizu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Imai</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ibushi</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsujii</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>The GENIA Project: Corpus-based Knowledge Acquisition and Information Extraction from Genome Research Papers</article-title>
          .
          <source>In: Proceedings of EACL</source>
          . pp.
          <volume>171</volume>
          –
          <fpage>172</fpage>
          .
          <string-name>
            <surname>ACL</surname>
          </string-name>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qin</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>A Language-Independent Neural Network for Event Detection</article-title>
          .
          <source>In: Proceedings of the 54th Annual Meeting of the ACL (Volume 2: Short Papers)</source>
          . pp.
          <volume>66</volume>
          –
          <fpage>71</fpage>
          . Berlin, Germany (
          <year>2016</year>
          ), http://anthology.aclweb.org/P16-2011.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fragkou</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Applying named entity recognition and co-reference resolution for segmenting english texts</article-title>
          .
          <source>Progress in Artificial Intelligence</source>
          <volume>6</volume>
          (
          <issue>4</issue>
          ),
          <volume>325</volume>
          –
          <fpage>346</fpage>
          (
          <year>2017</year>
          ), https://doi.org/10.1007/s13748-017-0127-3.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Entity Linking via Joint Encoding of Types, Descriptions, and Context</article-title>
          .
          <source>In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <volume>2671</volume>
          –
          <fpage>2680</fpage>
          . ACL, Copenhagen, Denmark (
          <year>2017</year>
          ), https://www.aclweb.org/anthology/D17-1284.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ju</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miwa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>A Neural Layered Model for Nested Named Entity Recognition</article-title>
          .
          <source>In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long Papers). pp.
          <volume>1446</volume>
          –
          <fpage>1459</fpage>
          .
          <string-name>
            <surname>ACL</surname>
          </string-name>
          , New Orleans,
          <string-name>
            <surname>Louisiana</surname>
          </string-name>
          (
          <year>2018</year>
          ), http://www.aclweb.org/anthology/N18-1131
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Ba, J.:
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          . In: ICLR (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ballesteros</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Subramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kawakami</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Neural Architectures for Named Entity Recognition</article-title>
          .
          <source>In: Proceedings of the 2016 Conference of the North American Chapter of the ACL: Human Language Technologies. ACL</source>
          . vol.
          <volume>1</volume>
          , pp.
          <volume>260</volume>
          –
          <fpage>270</fpage>
          . ACL, San Diego, California (
          <year>2016</year>
          ), http://www.aclweb.org/anthology/N16-1030.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Marimon</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez-Agirre</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Intxaurrondo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez</surname>
            <given-names>Martin</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>J.A.</given-names>
            ,
            <surname>Villegas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Krallinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results</article-title>
          .
          <source>In: Proceedings of the Iberian Languages Evaluation Forum (IberLEF</source>
          <year>2019</year>
          ). vol.
          <source>TBA</source>
          , p.
          <source>TBA. CEUR Workshop Proceedings (CEUR-WS.org)</source>
          , Bilbao,
          <source>Spain (Sep</source>
          <year>2019</year>
          ), TBA
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Miwa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bansal</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>End-to-End Relation Extraction using LSTMs on Sequences and Tree Structures</article-title>
          .
          <source>In: Proceedings of the 54th Annual Meeting of the ACL</source>
          . pp.
          <volume>1105</volume>
          –
          <fpage>1116</fpage>
          .
          <string-name>
            <surname>ACL</surname>
          </string-name>
          , Berlin, Germany (
          <year>2016</year>
          ), http://aclweb.org/anthology/P16-1105
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Sohrab</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miwa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Deep exhaustive model for nested named entity recognition</article-title>
          .
          <source>In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <volume>2843</volume>
          –
          <fpage>2849</fpage>
          . Association for Computational Linguistics, Brussels, Belgium (Oct-Nov
          <year>2018</year>
          ), https://www.aclweb.org/anthology/D18-1309
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Son</surname>
            ,
            <given-names>N.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Minh</surname>
            ,
            <given-names>N.L.</given-names>
          </string-name>
          :
          <article-title>Nested Named Entity Recognition Using Multilayer Recurrent Neural Networks</article-title>
          .
          <source>In: Proceedings of PACLING 2017</source>
          . pp.
          <volume>16</volume>
          –
          <fpage>18</fpage>
          .
          <string-name>
            <surname>Sedona</surname>
            <given-names>Hotel</given-names>
          </string-name>
          , Yangon, Myanmar (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Graph-based dependency parsing with bidirectional LSTM</article-title>
          .
          <source>In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          . pp.
          <volume>2306</volume>
          –
          <fpage>2315</fpage>
          . Association for Computational Linguistics, Berlin, Germany (Aug
          <year>2016</year>
          ). https://doi.org/10.18653/v1/P16-1218, https://www.aclweb.org/anthology/P16-1218
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , Zhang, J.,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Recognizing Names in Biomedical Texts: a Machine Learning Approach</article-title>
          .
          <source>Bioinformatics</source>
          <volume>20</volume>
          (
          <issue>7</issue>
          ),
          <volume>1178</volume>
          –
          <fpage>1190</fpage>
          (
          <year>2004</year>
          ), https://doi.org/10.1093/bioinformatics/bth060.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>