EEKE 2021 - Workshop on Extraction and Evaluation of Knowledge Entities from Scientific Documents

Extracting Domain Entities from Scientific Papers Leveraging Author Keywords

Jiabin Peng, Jing Chen, Guo Chen
School of Economics & Management, Nanjing University of Science and Technology, Nanjing, Jiangsu, China
2542505085@qq.com, chenjinguuu@126.com, delphi1987@qq.com

ABSTRACT
Current methods of domain entity extraction from scientific texts rely heavily on manually annotated corpora and therefore generalize poorly. In this paper, we propose a two-stage methodology that makes good use of the existing author keywords of a given domain to solve this problem. First, the author keyword set is used to mark the boundaries of candidate entities; their features are then integrated to classify each entity's type. In an experiment on artificial intelligence (AI) documents from WOS, our approach obtains an F1 value of 0.753 without manual annotation, only slightly lower than a BERT-BiLSTM-CRF baseline model (F1 = 0.772) trained on a manually annotated corpus, which shows the practical usability of our approach.

CCS CONCEPTS
• Computing methodologies • Artificial intelligence • Natural language processing • Information extraction

KEYWORDS
Information Extraction, Domain Named Entity Recognition, Analysis of Scientific Papers, Author Keywords

1 Introduction
At present, there have been many studies on knowledge entity extraction from scientific papers, and the biggest problem is the lack of labeled data[1]. Scientific papers usually belong to a specific domain, so manual annotation requires corresponding domain knowledge. This makes annotation expensive, and many popular named entity recognition (NER) models cannot reach their otherwise excellent performance. To ensure the generalization ability of NER models, it is necessary to reduce their dependence on manual annotation. Thanks to the rapid development of databases and the Internet, a large number of knowledge resources have accumulated in many domains, such as knowledge bases, gazetteers, glossaries, and dictionaries. These resources are widely used in NER models based on distant supervision[2] or semi-supervised learning[3], which reduces the models' dependence on labeled data and improves their generalization ability to a certain extent.

Domain entity extraction can in fact be divided into two subtasks: entity boundary recognition and entity type classification. Taking the domain of artificial intelligence (AI) as an example, we first used a domain glossary to help identify entity boundaries and then constructed low-cost training data to classify the entities. Problems and solutions are viewed as the key insights of scientific papers[1], so we took them as the main entity types in the experiment. Following related studies, we treat the research objectives, domains, applications, and tasks described in technical papers as problems, and the methods, schemes, models, technologies, tools, software, algorithms, and theories used to solve them as solutions[4][5][6]. The experimental results show that our methodology achieves a good score without any manual annotation, with an F1-measure of 0.753.

2 Related Studies
At present, the mainstream methods of domain NER fall into two categories: methods based on statistical machine learning (ML) and methods based on deep learning (DL). ML-based NER is essentially classification: given multiple types of named entities, models are used to classify the entities in a text. There are two implementation ideas. One is to first identify the boundaries of all named entities in the text and then classify them into different types, as in CoBoost[7]. The other is sequence annotation, where each word in the text is given several candidate type labels corresponding to its position in various entities.
Classical ML-based sequence-annotation NER models include HMM[8], CRF[9], etc. DL-based NER models use pre-trained word vectors to represent words, which alleviates the data sparsity problem in high-dimensional vector spaces. Moreover, pre-trained word vectors contain more semantic information than manually selected features and provide a unified vector-space representation of heterogeneous texts, which has strongly advanced sequence annotation tasks, especially NER[10].

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The biggest problem of domain NER today is the lack of labeled corpora. When a general NER method is applied to a specific domain, corresponding adjustment strategies need to be adopted for the domain corpus. A common idea is to use transfer learning to share data and models among domains. Ni et al. projected labeled data and distributed word representations into the target domain without manual annotation[11]. Giorgi et al. transferred source-domain model parameters to the target domain for initialization and then fine-tuned them for the task[12]. Another idea is to make full use of existing domain knowledge resources to automatically build datasets and carry out distant supervision, semi-supervision, weak supervision, etc. Nooralahzadeh et al. adopted a partial-annotation technique and implemented a reinforcement learning strategy with a neural network policy for distantly supervised NER[2]. Peters et al. demonstrated a general semi-supervised approach that adds pre-trained context embeddings from bidirectional language models to NLP systems and applied it to NER[3]. Lison et al. relied on a broad spectrum of labeling functions to automatically annotate texts from the target domain[13].

From the above research, it can be seen that various domain resources are widely used to reduce the manual annotation cost as much as possible, with good results. However, domain NER models based on transfer learning, semi-supervision, etc. still cannot avoid manual participation in dataset construction. Therefore, after analyzing the essence of the NER task, we divide domain NER into two subtasks in this paper, which avoids manual annotation with the help of domain resources. In addition, new ideas such as zero-shot learning[14] and learning with noisy labels[15] have also been applied to domain NER to further reduce labor costs.

3 Methodology

3.1 Framework
Traditional NER is treated as a sequence labeling task that assigns a corresponding entity type and position label to each token in the text. In fact, NER can be regarded as two subtasks: boundary recognition and entity classification. That is, we can first identify the boundaries of named entities in the text and then classify them into different types. Sequence-labeling NER treats the two subtasks as a whole, so the same labeled data is shared by both subtasks and the quality requirements on that data are high. As a result, many classic NER methods cannot be applied in some subdivided domains. In addition, sequence-labeling NER cannot effectively exploit existing domain resources; the common practice is merely to use domain terms as auxiliary data for rough labeling. By contrast, dividing NER into boundary recognition and entity classification lets us make full use of existing domain knowledge resources.

Entity boundary recognition can be regarded as a word segmentation task, which requires a large-scale resource (i.e., a user-defined lexicon). In a given domain there usually exist domain glossaries and a large-scale author keyword set, which can help solve this segmentation task. Compared with word segmentation, entity classification requires a smaller-scale resource (i.e., training data). At present, much domain knowledge can easily be obtained through online databases or knowledge graphs, which can provide the necessary training data for entity type classification without manual annotation. To sum up, the framework of this paper is shown in Figure 1.

[Figure 1: Framework of knowledge entity extraction. Domain resources (glossary, author keywords, documents/abstracts) feed entity boundary recognition (lexicon, segmentation) and entity classification (training data, model, and features: word vector, part of speech, word case), followed by evaluation and optimization.]
The framework is divided into three parts. The first is the acquisition of domain resources. The domain resources used in this paper include a domain glossary and domain documents. The domain glossary can be obtained directly through a search engine (such as Google) or from relevant domain knowledge websites (such as Wikipedia, Baidu Baike, etc.); when constructing the glossary, we also obtained the types of the terms. Domain documents can be obtained through databases (such as WOS, CNKI, etc.), from which author keywords and abstracts are extracted. Author keywords are the indispensable large-scale resource for entity boundary recognition, and abstracts can be used to construct features for the training data. The second part is entity boundary recognition, which is regarded as a word segmentation task. The user-defined lexicon for this task is constructed by combining the domain glossary and the author keyword set, and it helps realize entity boundary recognition at a low cost. The third part is entity classification. The training data required for classification are extracted from the domain glossary, and the text features are obtained from the abstracts by training or counting.

3.2 Implementation
3.2.1 Entity Boundary Recognition. As mentioned above, entity boundary recognition is transformed into word segmentation. Inspired by Chinese automatic word segmentation, we use the forward maximum matching algorithm based on string matching. Slightly differently from Chinese word segmentation, English words must be stemmed before segmentation to avoid the influence of word forms. In addition, the segmentation lexicon introduces some noise when labeling candidate entities, so the subsequent entity classification task is actually multi-class.
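The forward maximum matching step above can be sketched in a few lines. This is a minimal illustration, not the authors' exact implementation: the toy suffix-stripping stemmer, the example lexicon, and the phrase-length cap are all assumptions.

```python
# Forward maximum matching over stemmed tokens: at each position, try the
# longest lexicon phrase first, fall back to a single token otherwise.
# A toy suffix-stripping stemmer stands in for a real one (e.g. Porter).

def stem(word):
    # Hypothetical minimal stemmer: lowercases and strips a plural "s".
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w

def forward_max_match(tokens, lexicon, max_len=5):
    """Greedily segment `tokens`, matching stemmed n-grams against `lexicon`."""
    stems = [stem(t) for t in tokens]
    lex = {tuple(stem(w) for w in phrase.split()) for phrase in lexicon}
    out, i = [], 0
    while i < len(stems):
        for n in range(min(max_len, len(stems) - i), 0, -1):
            if n > 1 and tuple(stems[i:i + n]) in lex:
                out.append("_".join(tokens[i:i + n]))  # phrase = candidate entity
                i += n
                break
        else:
            out.append(tokens[i])  # no phrase match: emit the single token
            i += 1
    return out

segments = forward_max_match(
    ["Convolutional", "neural", "networks", "classify", "sentences"],
    {"convolutional neural network"},
)
```

Stemming before matching is what lets the surface form "networks" match the lexicon entry "network", as the paper requires for English.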
3.2.2 Entity Classification. Entity classification is essentially word (phrase) classification, a typical supervised task for which training data is indispensable. At present it is difficult to construct a large amount of high-quality classification data in a given domain, but a small amount of high-quality data can usually be obtained at low cost with the help of domain knowledge bases or domain experts.
1) Construct training data. The training data consist of positive and negative samples. Positive samples consist of entities and their types; since the pre-constructed glossary contains the types of terms, we directly extracted some high-quality terms together with their types as positive samples. Negative samples, i.e., non-entities, were randomly extracted from the keyword set and the texts.
2) Construct text features. Word vectors, a part-of-speech (POS) feature, and a word case feature were constructed. Word vectors are obtained by training on a large-scale unlabeled domain corpus and give discrete words semantic information according to their context. The POS feature is obtained by counting over the corpus without word segmentation. The case feature is acquired in essentially the same way as the POS feature, but uses the segmented corpus, and the case categories need to be defined manually.
3) Model selection, training, evaluation, and optimization. The models we used include four classical machine learning models: Random Forest (RF), K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Multilayer Perceptron (MLP), plus TextCNN, which performs well in sentence classification[16]. The detailed steps of our experiment were: ① feed the training data to the models to obtain baseline results; ② optimize the word vectors according to model performance and add features during training; ③ evaluate the models to decide whether to continue optimizing.

3.3 Feature Processing
3.3.1 Word Vector. Common models for training word vectors include Word2Vec, GloVe, ELMo, GPT, BERT, etc. The word vectors trained by the first two are context-independent. Because our core task is phrase classification, which does not need context information, we chose Word2Vec. Before training, we used underscores to concatenate the words of each phrase in the segmented corpus, so that phrases are treated as single tokens when training word vectors. Word2Vec includes two algorithms: Skip-gram and CBOW. Research shows that Skip-gram captures more semantic information, while CBOW captures more grammatical information[17]. The window size is also very important for training word vectors, and the commonly used window sizes are 5 and 10 (see https://www.bbsmax.com/A/A2dm2D7zen/). We therefore first explore these two factors in the classification experiment. Following related studies[18], the other parameters are shown in Table 1. In addition, to make the word vectors more robust, we trained Word2Vec on the stemmed corpus.

Parameters          Values
sg                  1 / 0
window size (w)     5 / 10
min count           5
iteration number    20
embedding size      200

Table 1 Parameters of Word2Vec. sg=1 means the algorithm is Skip-gram, sg=0 means CBOW; w denotes the window size below.
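The underscore-joining trick described in 3.3.1 can be sketched as below. The helper name and the toy corpus are illustrative; the gensim call in the trailing comment mirrors the Table 1 parameters but is not executed here.

```python
# Rewrite a tokenized corpus so that every lexicon phrase becomes a single
# underscore-joined token; Word2Vec then learns one vector per phrase.

def join_phrases(sentence, lexicon, max_len=5):
    """Replace multi-word lexicon phrases in `sentence` with w1_w2_... tokens."""
    lex = {tuple(p.split()) for p in lexicon}
    out, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 1, -1):
            if tuple(sentence[i:i + n]) in lex:
                out.append("_".join(sentence[i:i + n]))
                i += n
                break
        else:
            out.append(sentence[i])
            i += 1
    return out

corpus = [join_phrases(["we", "train", "a", "hidden", "markov", "model"],
                       {"hidden markov model"})]
# from gensim.models import Word2Vec            # parameters as in Table 1
# model = Word2Vec(corpus, vector_size=200, sg=1, window=5,
#                  min_count=5, epochs=20)
```

After this rewrite, "hidden_markov_model" is a single vocabulary item, so the trained model yields one vector for the whole phrase rather than three word vectors.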
3.3.2 POS Feature. The POS tags of words in sentences can be obtained through the Python third-party package nltk, which distinguishes 36 POS tags, so the POS vector of a single word has length 36. In the classification experiment, the POS vector of a training item was obtained by concatenating the POS vectors of its component words. To avoid inconsistent vector lengths, we counted the lengths of the phrases (the number of words per phrase) in the segmentation lexicon to obtain the maximum phrase length. When a training item is shorter than the maximum phrase length, its POS vector is padded with zeros. The final length of each POS vector is therefore 36 * max(len(phrase in lexicon)), i.e., 36 times the maximum phrase length in the lexicon. POS vectors were used by concatenating them with the word vectors.
3.3.3 Case Feature. Three phrase case types are defined in this paper: initial uppercase, all uppercase, and all lowercase. The case vector of a training item therefore has length 3. As with the POS feature, case vectors were concatenated with the word vectors.
3.4 Classification Models
The classification models used in this paper are RF, KNN, SVM, MLP, and TextCNN. The first four were implemented with sklearn in Python. In the RF, the number of decision trees was set to 100. All KNN parameters took their default values. In the SVM, probability was set to True, i.e., probability estimation was enabled. In the MLP, the hidden layer sizes were (100, 50). TextCNN was originally designed to classify sentences, and phrases can be regarded as shorter sentences. The inputs of TextCNN (https://github.com/cjymz886/text-cnn) were the vectors generated by Word2Vec; its embedding size, sequence length, batch size, and number of training epochs were set to 200, 10, 32, and 20, respectively. Parameters not mentioned above took their default values.
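The sklearn setup above can be sketched on a toy feature matrix. Only the stated hyperparameters (100 trees, SVM probability=True, MLP hidden layers (100, 50)) come from the paper; the synthetic data, labels, and the soft-voting ensemble construction are assumptions for illustration.

```python
# Sketch of the 3.4 model setup plus a voting ensemble on synthetic data.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]] * 3   # toy feature vectors
y = ["non-entity", "non-entity", "solution", "solution"] * 3

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(),                      # all default parameters
    "SVM": SVC(probability=True, random_state=0),       # enables predict_proba
    "MLP": MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500,
                         random_state=0),
}
# Soft voting averages the per-class probabilities of the four classifiers,
# a plausible reading of the "Voting" row in Table 3.
voter = VotingClassifier([(n, m) for n, m in models.items()], voting="soft")
voter.fit(X, y)

predictions = {name: m.fit(X, y).predict([[0.95, 0.95]])[0]
               for name, m in models.items()}
```

In the paper the input vectors would be the concatenated word, POS, and case features rather than this two-dimensional toy matrix.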
4 Experiment
To verify the effectiveness of our methodology, we took the AI domain as an example.

4.1 Data Acquisition and Preprocessing
First, we obtained the bibliographic data of the AI domain from the AI category of the WOS (Web of Science) core collection. Documents were retrieved with WC = computer science and WC = artificial intelligence, with the time range set from 1996 to 2020. Abstracts and keywords were then extracted from the bibliographic data, yielding 927,675 abstracts and 161,169 keywords.
Second, we constructed a glossary of the AI domain. The data came from a knowledge website in the AI domain (https://paperswithcode.com/), from which we obtained all problem and solution entities. Problem entities came from the tasks on the Browse State-of-the-Art page, and solution entities from the machine learning components on the Methods page. After removing duplicates, 1887 problem entities and 1209 solution entities remained.
Finally, we processed the above data to obtain the final experimental data. The user-defined lexicon for English word segmentation was constructed by merging the keyword set and the domain glossary. The training data for the classifiers consisted of entities and non-entities: 360 entities of each type were manually extracted from the glossary, and 360 non-entities were manually constructed. Non-entities included phrases and words, with phrases extracted from high-frequency keywords and words constructed randomly. The ratio of phrases to words among non-entities was about 2:1, because almost all entities are phrases, so more phrase-level non-entities were needed to train the models. In total, 1080 classification examples were obtained and randomly split into training and validation sets at a ratio of 5:1. In addition, to evaluate the performance of our methodology, we set the traditional BERT-BiLSTM-CRF NER model as the baseline. A previously annotated corpus of 3000 sentences was used for the baseline model, of which 2000 randomly selected sentences served as training data and the remaining 1000 sentences were used as the common test set.

4.2 Result Analysis
The macro averages of precision, recall, and F1-measure were used to evaluate the models. Word vectors are the basic input of the models, so we first explored the influence of word vectors trained by the two algorithms with different window sizes; the results are shown in Table 2.

          sg=1            sg=0
          w=5     w=10    w=5     w=10
RF        0.672   0.681   0.666   0.655
KNN       0.550   0.672   0.151   0.146
SVM       0.736   0.701   0.709   0.679
MLP       0.672   0.685   0.428   0.478
TextCNN   0.695   0.685   0.670   0.670

Table 2 Macro F1-measure of models using different word vectors on the test set.

Comparing the results in Table 2 vertically: when sg=1, the five models achieved good results at both window sizes, and only KNN with w=5 performed poorly. When sg=0, however, the performance of all models decreased, especially KNN and MLP. A possible reason is that Skip-gram focuses on semantics, which is more conducive to the NER task than CBOW; in addition, KNN and MLP place higher demands on data quality, which word vectors trained by CBOW could not meet. Comparing the results horizontally, the best F1-measure of each model was about 0.7. In the following experiments, the word vectors that made each model perform best were used, and the POS and case features were added. The results are shown in Table 3.

Table 3 shows that adding the two features effectively improved the F1-measures, and the best results for all models were obtained when all features were fused. Among the five models, SVM performed best, with an F1-measure of 0.753; this may be because the underlying training mechanism of SVM makes it more suitable for small-sample classification. With word vector parameters sg=1 and w=10, the voting model performed best, with an F1-measure of 0.752. Either SVM or the voting model can be used in practical applications. The best F1-measure of TextCNN was 0.715, far from its performance in sentence classification; one possible reason is that phrases are much shorter than sentences.

The baseline BERT-BiLSTM-CRF (https://github.com/macanv/BERT-BiLSTM-CRF-NER) performed well on the domain NER task. Its F1-measure was 0.772, far below its performance on general NER tasks but a very good result for a subdivided domain, and 0.019 higher than our optimal model. There is thus still a gap in our methodology, but considering the cost of the experimental data, the gap is acceptable. In future work, we can further optimize the word vectors and add more features to improve performance.
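The macro-averaged metrics used throughout 4.2 can be computed as below; the label names are illustrative, and ties to the paper's exact scoring script are not implied.

```python
# Macro-averaged precision, recall, and F1: per-class scores are computed
# independently and averaged with equal weight per class.

def macro_prf(gold, pred):
    labels = sorted(set(gold) | set(pred))
    ps, rs, fs = [], [], []
    for c in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f)
    n = len(labels)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

gold = ["problem", "solution", "solution", "non-entity"]
pred = ["problem", "solution", "non-entity", "non-entity"]
p, r, f1 = macro_prf(gold, pred)
```

Macro averaging treats the problem, solution, and non-entity classes equally regardless of how many examples each has, which matters here because the classes are balanced by construction (360 each).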
5 Conclusion
Aiming at the problem that current domain NER models rely heavily on manually annotated data and therefore generalize poorly across domains, we propose a two-stage knowledge entity extraction methodology that removes the dependence on manually annotated data. Experiments on WOS documents in the AI domain showed that our approach achieves good results in extracting problem and solution entities without manual annotation.
In general, our approach generalizes well across domains because it needs no manual annotation and can be applied to many subdivided domains at low cost. However, its performance still has room for improvement. In follow-up work, we can try better word vectors and more features to improve extraction accuracy, and gradually extend the model to more knowledge types.

         f1                    f1+f2                 f1+f3                 f1+f2+f3
         P      R      F1      P      R      F1      P      R      F1      P      R      F1
RF       0.588  0.812  0.681   0.629  0.824  0.713   0.618  0.812  0.702   0.661  0.833  0.736
KNN      0.593  0.780  0.672   0.603  0.773  0.676   0.604  0.784  0.681   0.614  0.783  0.687
SVM      0.677  0.810  0.736   0.701  0.809  0.749   0.690  0.812  0.744   0.706  0.812  0.753
MLP      0.593  0.813  0.685   0.604  0.815  0.694   0.605  0.814  0.694   0.621  0.826  0.709
TextCNN  0.630  0.785  0.695   0.631  0.815  0.701   0.640  0.765  0.697   0.650  0.810  0.715
Voting   -      -      -       -      -      -       -      -      -       0.689  0.831  0.752
BERT-BiLSTM-CRF (baseline): P = 0.756, R = 0.789, F1 = 0.772

Table 3 Macro P, R, and F1-measure of models using different features on the test set. f1 is the word vector, f2 the POS feature, f3 the case feature.
ACKNOWLEDGMENTS
This study is supported by the MOE (Ministry of Education in China) Project of Humanities and Social Sciences.

REFERENCES
[1] Zara Nasar, Syed Waqar Jaffry and Muhammad Kamran Malik, 2018. Information extraction from scientific articles: a survey. Scientometrics 117, 3 (2018), 1931-1990. DOI: https://doi.org/10.1007/s11192-018-2921-5.
[2] Farhad Nooralahzadeh, Jan Tore Lønning and Lilja Øvrelid, 2019. Reinforcement-based denoising of distantly supervised NER with partial annotation. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP. 225-233.
[3] Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, et al., 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 1756-1765.
[4] Sonal Gupta and Christopher D. Manning, 2011. Analyzing the dynamics of research by extracting key aspects of scientific papers. In Proceedings of the 5th International Joint Conference on Natural Language Processing. 1-9.
[5] Mayank Singh, Soham Dan, Sanyam Agarwal, Pawan Goyal and Animesh Mukherjee, 2017. AppTechMiner: Mining Applications and Techniques from Scientific Articles. In Proceedings of the Joint Conference on Digital Libraries. 1-8.
[6] Kevin Heffernan and Simone Teufel, 2018. Identifying Problems and Solutions in Scientific Text. Scientometrics 116, 2 (2018), 1367-1382. DOI: https://doi.org/10.1007/s11192-018-2718-6.
[7] Michael Collins and Yoram Singer, 1999. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. 100-110.
[8] Guodong Zhou and Jian Su, 2002. Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 473-480.
[9] Andrew McCallum and Wei Li, 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL. 188-191.
[10] Colin Cherry and Hongyu Guo, 2015. The unreasonable effectiveness of word representations for Twitter named entity recognition. In Proceedings of the 2015 Annual Conference of the North American Chapter of the ACL. 735-745.
[11] Jian Ni, Georgiana Dinu and Radu Florian, 2017. Weakly Supervised Cross-Lingual Named Entity Recognition via Effective Annotation and Representation Projection. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 1470-1480.
[12] John M. Giorgi and Gary D. Bader, 2018. Transfer learning for biomedical named entity recognition with neural networks. Bioinformatics 34, 23 (2018), 4087-4094.
[13] Pierre Lison, Jeremy Barnes, Aliaksandr Hubin and Samia Touileb, 2020. Named Entity Recognition without Labelled Data: A Weak Supervision Approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 1518-1533.
[14] Damai Dai, et al., 2020. Inductively Representing Out-of-Knowledge-Graph Entities by Optimal Estimation Under Translational Assumptions. (2020).
[15] David Ifeoluwa Adelani, Michael A. Hedderich, Dawei Zhu, Esther van den Berg and Dietrich Klakow, 2020. Distant Supervision and Noisy Label Learning for Low Resource Named Entity Recognition: A Study on Hausa and Yorùbá. arXiv preprint arXiv:2003.08370 (2020).
[16] Yoon Kim, 2014. Convolutional Neural Networks for Sentence Classification. arXiv preprint arXiv:1408.5882 (2014).
[17] Tomas Mikolov, Kai Chen, Greg Corrado and Jeffrey Dean, 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 (2013).
[18] Siwei Lai, Kang Liu, Liheng Xu and Jun Zhao, 2016. How to Generate a Good Word Embedding. IEEE Intelligent Systems 31, 6 (2016), 5-14.