
A Deep Learning-Based System for the MEDDOCAN Task

Dehuan Jiang

jiangdehuan@stu.hit.edu.cn 0

Yedan Shen

shenyedan@stu.hit.edu.cn 0

Shuai Chen

Buzhou Tang

Xiaolong Wang

wangxl@insun.hit.edu.cn 0

Qingcai Chen

Ruifeng Xu

xuruifeng@hit.edu.cn 0

Jun Yan

Jun.YAN@Yiducloud.cn 2

Yi Zhou

zhouyi@sysu.edu.cn 1

0 Department of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China, 518055
1 Sun Yat-sen University
2 Yidu Cloud (Beijing) Technology Co., Ltd, Beijing

2019

761 767

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). IberLEF 2019, 24 September 2019, Bilbao, Spain.

Keywords: De-identification, Protected Health Information, medical document anonymization, deep learning

De-identification is a prerequisite for accessing and sharing clinical records outside of hospitals, which is very important for the secondary use of clinical data. In the past few years, de-identification has attracted considerable attention, and much effort has been devoted to it, especially for clinical documents in English. Representative works are the natural language processing (NLP) challenges that include a de-identification task on clinical text, such as i2b2 (Informatics for Integrating Biology and the Bedside) 2006 [1] and 2014 [2-4], and N-GRID (the Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-scale and RDOC Individualized Domains) 2016 [5]. As these challenges are public and provide manually annotated corpora for de-identification, they have attracted many research teams to participate and to develop various kinds of systems [6-9]. According to the overview report of the N-GRID 2016 NLP challenge [5], the best system was a hybrid system based on deep learning methods [10].

In 2019, Martin Krallinger et al. organized a challenge task dedicated to the de-identification of medical documents in Spanish, called the MEDDOCAN (Medical Document Anonymization) task [11]. The organizers provided a training set of 500 clinical records, a development set of 250 clinical records, and a test set of 250 clinical records embedded in a synthetic background set of 3751 clinical records. We participated in this challenge task and developed a system based on recent deep learning methods such as BERT (Bidirectional Encoder Representations from Transformers) (https://github.com/google-research/bert) and flair (https://github.com/zalandoresearch/flair). The system, developed on the training and development sets, achieved a “strict” F1-score of 0.9646 at entity level, a “strict” F1-score of 0.97 at span level and a “merged” F1-score of 0.9821 at span level. It should be noted that the results reported here are new results obtained after we added a post-processing module to fix tokenization errors at test time.

Material and Methods

The overall architecture of our system for the MEDDOCAN task is shown in Fig. 1. We first tokenized the raw clinical texts in Spanish, and then applied two individual deep learning methods (i.e., BERT+CRF and flair) for de-identification. The system is described in detail below.

The organizers of the MEDDOCAN task provided participants with a synthetic corpus of 1000 discharge summaries and medical genetics clinical records manually annotated by medical experts according to a guideline defining 22 types of PHI. The corpus was divided into three parts: a training set of 500 records with 11,333 PHI mentions, a development set of 250 records with 5801 PHI mentions, and a test set of 250 records with 5661 PHI mentions. The test set was embedded in a background set of 3751 clinical records that had been manually split into sentences. The statistics of the corpus, including the numbers of documents, sentences and PHI mentions, are listed in Table 1, where “NA” denotes unknown.

Sentence splitting and tokenization are two important preprocessing steps for natural language processing (NLP). We developed a simple rule-based system for both. A document was split into sentences at ‘;’, ‘?’, ‘!’, ‘\n’, or a ‘.’ that is not inside a number, and each sentence was tokenized by the method proposed by Liu et al. [10].
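The splitting rule above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `split_sentences` is hypothetical, and "a ‘.’ not inside a number" is assumed here to mean a period with a digit on both sides.

```python
def split_sentences(text):
    """Split text at ';', '?', '!', newline, or '.' unless the '.' sits between two digits."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        is_boundary = ch in ";?!\n"
        if ch == ".":
            # Assumption: a period is "in a number" only when flanked by digits (e.g. 3.5).
            prev_digit = i > 0 and text[i - 1].isdigit()
            next_digit = i + 1 < len(text) and text[i + 1].isdigit()
            is_boundary = not (prev_digit and next_digit)
        if is_boundary:
            piece = text[start:i + 1].strip()
            if piece:
                sentences.append(piece)
            start = i + 1
    # Keep any trailing text that has no closing delimiter.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


print(split_sentences("Peso 3.5 kg; talla normal"))
print(split_sentences("Domicilio: Av. de Jaén, 28."))
```

Under this reading of the rule, the abbreviation "Av." in the second call ends a sentence, so a rule this simple can cut an address in two.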

Deep Learning Methods

De-identification is a typical named entity recognition problem, which is usually treated as a sequence labeling problem. In this study, we applied two deep learning methods to the MEDDOCAN task, namely BERT+CRF and flair:

BERT+CRF. A method that appends a conditional random field (CRF) layer to BERT. In our study, we compared the cases using different settings.

Flair. A sequence labeling method based on contextual string embeddings.
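To make the sequence labeling formulation concrete, the following minimal sketch converts character-offset PHI mentions into one label per token, and back, using the BIO scheme adopted in this study. The function names, the token offsets and the (start, end, type) triple format are assumptions made for illustration; the actual tokenization follows Liu et al. [10] and is not reproduced here.

```python
def spans_to_bio(token_offsets, mentions):
    """Assign a BIO tag to every token, given mentions as (start, end, type) triples."""
    tags = ["O"] * len(token_offsets)
    for m_start, m_end, label in mentions:
        inside = False
        for i, (t_start, t_end) in enumerate(token_offsets):
            # A token belongs to a mention when it lies fully within the mention span.
            if t_start >= m_start and t_end <= m_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags


def bio_to_spans(token_offsets, tags):
    """Recover (start, end, type) triples from a BIO tag sequence."""
    spans, current = [], None
    for (t_start, t_end), tag in zip(token_offsets, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and (current is None or current[2] != tag[2:])
        )
        if starts_new:
            if current:
                spans.append(tuple(current))
            current = [t_start, t_end, tag[2:]]
        elif tag.startswith("I-"):
            current[1] = t_end  # extend the open mention
        else:
            if current:
                spans.append(tuple(current))
            current = None
    if current:
        spans.append(tuple(current))
    return spans


# Hypothetical tokenization of "Domicilio: Av. de Jaén, 28."
offsets = [(0, 9), (9, 10), (11, 13), (13, 14), (15, 17), (18, 22), (22, 23), (24, 26), (26, 27)]
tags = spans_to_bio(offsets, [(11, 26, "CALLE")])
print(tags)
print(bio_to_spans(offsets, tags))
```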

Post-processing

As the clinical records in the test set had been manually split into sentences, to fix errors caused by our sentence splitting, we mapped our split sentences back to the gold ones and combined neighboring PHI mentions of the same type.
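The combination step can be illustrated with a minimal sketch. The exact criterion for "neighboring" is not specified here, so the code assumes that two mentions are neighbors when they have the same type and only whitespace lies between them; the function name and the (start, end, type) triple format are likewise hypothetical.

```python
def merge_neighbor_mentions(text, mentions):
    """Merge consecutive same-type mentions separated only by whitespace.

    mentions: iterable of (start, end, type) character-offset triples.
    """
    merged = []
    for start, end, label in sorted(mentions):
        if merged:
            p_start, p_end, p_label = merged[-1]
            gap = text[p_end:start]
            # Assumption: "neighbor" means same type and a whitespace-only gap.
            if label == p_label and start >= p_end and gap.strip() == "":
                merged[-1] = (p_start, end, label)
                continue
        merged.append((start, end, label))
    return merged


text = "Domicilio: Av. de Jaén, 28."
# Two fragments produced by a wrong sentence break after "Av."
fragments = [(11, 14, "CALLE"), (15, 26, "CALLE")]
merged = merge_neighbor_mentions(text, fragments)
print(merged, repr(text[merged[0][0]:merged[0][1]]))
```

A rule this permissive would also join two genuinely separate mentions of the same type that happen to be adjacent, which is one reason to restrict it to mentions that straddle a sentence boundary absent from the gold segmentation.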

Evaluation

All system performance was measured by micro-averaged precision (P), recall (R), and F1-score (F1) under three criteria: “strict” at entity level (track 1), “strict” at span level (track 2), and “merged” at span level. “Strict” at entity level checks whether a recognized PHI mention exactly matches a gold one of the same type; “strict” at span level checks whether a recognized PHI mention has the same span as a gold one regardless of type; and “merged” at span level is “strict” at span level after merging the spans of PHI mentions connected by non-alphanumerical characters. All evaluations were conducted on the independent test set, and the measures were calculated with the tool provided by the MEDDOCAN organizers.
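The two “strict” criteria can be illustrated as follows. This is a simplified sketch, not the organizers' evaluation tool: the function names and the (document id, start, end, type) tuple format are assumptions, and the “merged” criterion is not implemented.

```python
def prf(gold, pred):
    """Micro-averaged precision, recall and F1 over two sets of items."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def strict_entity(gold, pred):
    # Entity level: document, span and type must all match.
    return prf(set(gold), set(pred))


def strict_span(gold, pred):
    # Span level: the type is ignored, only document and offsets are compared.
    def strip(items):
        return {(doc, start, end) for doc, start, end, _ in items}
    return prf(strip(gold), strip(pred))


gold = [("d1", 0, 5, "NOMBRE"), ("d1", 10, 20, "CALLE")]
pred = [("d1", 0, 5, "NOMBRE"), ("d1", 10, 20, "TERRITORIO")]
print(strict_entity(gold, pred))
print(strict_span(gold, pred))
```

In this toy case the second prediction has the right span but the wrong type, so it counts as an error at entity level and as a match at span level.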

Experimental Setup

In this study, PHI mentions were represented with the “BIO” scheme (B: beginning of a PHI mention, I: inside a PHI mention, O: outside a PHI mention). The hyper-parameters and parameter estimation algorithm listed in Table 2 were used in the deep learning methods. The pre-trained neural language models (https://storage.googleapis.com/bert_models/2018_11_03/multilingual_L-12_H768_A-12.zip and https://github.com/zalandoresearch/flair) were used in BERT+CRF and flair respectively. The other parameters were optimized on the development set, and all models were evaluated on the independent test set.

The micro-averaged precisions, recalls and F1-scores of our system under the three criteria are listed in Table 3. BERT+CRF outperformed flair by about 0.3% in F1-score because of higher recall. When POS features were added, the performance of BERT+CRF decreased slightly. When we further fine-tuned BERT+CRF on the combination of the training and development sets, its performance did not change much. Our system achieved the highest “strict” F1-score of 0.9646 at entity level, a “strict” F1-score of 0.97 at span level and a “merged” F1-score of 0.9821 at span level.

The new results (shown in Table 3) reported here are the results of our first submissions (shown in Table 4) after post-processing. The large differences between the “strict” F1-scores (track 2) and the “merged” F1-scores prompted us to look for errors caused by sentence splitting. For example, in the sentence “Domicilio: Av. de Jaén, 28.”, “Av. de Jaén, 28” is one entity of type “CALLE”, but it was split into two “CALLE” entities, “Av.” and “de Jaén, 28”, because the sentence was split at ‘.’ into two sentences, “Domicilio: Av.” and “de Jaén, 28.”. The sentence-splitting errors result in an F1-score difference of about 0.4 between the “strict” F1-scores and the “merged” F1-scores. The post-processing module brings a “strict” F1-score gain of 0.0245 for track 1 and of 0.0243 for track 2. The differences between the “strict” F1-scores (track 2) and the “merged” F1-scores decrease sharply when the post-processing module is added.

To analyze the errors of our system, we evaluated its performance on each entity category and found that the F1-scores on “PROFESION” and “INSTITUCION” are much lower than those on the other categories, except for “OTROS_SUJETO_ASISTENCIA”, on which the F1-score is zero. There are three main reasons why these three categories of entities are not well recognized. Firstly, some categories have too few entities. For example, there are only 15 entities of “OTROS_SUJETO_ASISTENCIA” in the training and development sets combined, and only 7 in the test set. Secondly, entities of “INSTITUCION” vary greatly in format. Thirdly, there may be some annotation errors in the gold standard. For example, “militar” and “ex-operario de industria textil”, which mean “soldier” and “ex-textile industry operator” respectively, are recognized by our system but are not labeled in the gold standard.

Conclusion

In this study, we developed a deep learning-based system for the MEDDOCAN task, a challenge dedicated to the de-identification of clinical text in Spanish. The system achieves promising performance, and BERT+CRF outperforms flair. In the future, we will investigate whether BERT and flair can be combined for further improvement.

Acknowledgements

This paper is supported in part by the following grants: NSFC (National Natural Science Foundation of China) (U1813215, 61876052 and 61573118), National Key Research and Development Program of China (2017YFB0802204), Special Foundation for Technology Research Program of Guangdong Province (2015B010131010), Strategic Emerging Industry Development Special Funds of Shenzhen (JCYJ20170307150528934 and JCYJ20180306172232154), and Innovation Fund of Harbin Institute of Technology (HIT.NSRIF.2017052).

References

1. Ö. Uzuner, Luo and Szolovits, Evaluating the state-of-the-art in automatic de-identification, Journal of the American Medical Informatics Association, vol. 14, no. 5, 2007, pp. 550-563.

2. Stubbs and Ö. Uzuner, Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus, Journal of Biomedical Informatics, vol. 58, 2015, pp. S20-S29.

3. Ö. Uzuner and Stubbs, Practical applications for natural language processing in clinical research: The 2014 i2b2/UTHealth shared tasks, Journal of Biomedical Informatics, vol. 58, 2015, pp. S1-S5.

4. Stubbs, C. Kotfila and Ö. Uzuner, Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/UTHealth shared task Track 1, Journal of Biomedical Informatics, vol. 58, 2015, pp. S11-S19.

5. Stubbs, Filannino and Uzuner, De-identification of psychiatric intake records: Overview of 2016 CEGS N-GRID Shared Tasks Track 1, Journal of Biomedical Informatics, vol. 75, 2017, pp. S4-S18.

6. S.M. Meystre, F.J. Friedlin, B.R. South, Shen and M.H. Samore, Automatic de-identification of textual documents in the electronic health record: a review of recent research, BMC Medical Research Methodology, vol. 10, no. 1, 2010, pp. 70.

7. Ferrández, B.R. South, Shen, F.J. Friedlin, M.H. Samore and S.M. Meystre, Evaluating current automatic de-identification methods with Veteran's health administration clinical documents, BMC Medical Research Methodology, vol. 12, no. 1, 2012, pp. 109.

8. Deleger, Molnar, Savova, Xia, Lingren, Li, Marsolo, Jegga, Kaiser and Stoutenborough, Large-scale evaluation of automated clinical note de-identification and its impact on information extraction, Journal of the American Medical Informatics Association, vol. 20, no. 1, 2013, pp. 84-94.

9. Liu, Chen, Tang, Wang, Chen, Li, Wang, Deng and Zhu, Automatic de-identification of electronic medical records using token-level and character-level conditional random fields, Journal of Biomedical Informatics, vol. 58, 2015, pp. S47-S52.

10. Liu, Tang, Wang, et al., De-identification of clinical notes via recurrent neural network and conditional random field, Journal of Biomedical Informatics, vol. 75, 2017, pp. S34-S42.

11. Marimon, Montserrat, Gonzalez-Agirre, Aitor, et al., Automatic de-identification of medical texts in Spanish: the MEDDOCAN track, corpus, guidelines, methods and evaluation of results, Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2019).