Named Entity Recognition in Albanian Based on CRFs Approach

Gridi Kono
Department of Informatics
Faculty of Natural Sciences
University of Tirana
1001 Tirana, Albania
gridi.kono@gmail.com

Klesti Hoxha
Department of Informatics
Faculty of Natural Sciences
University of Tirana
1001 Tirana, Albania
klesti.hoxha@fshn.edu.al

Abstract

Named Entity Recognition (NER) refers to the process of extracting named entities (people, locations, organizations, sport teams, etc.) from text documents. In this work we describe our NER approach for documents written in Albanian, exploring the use of Conditional Random Fields (CRFs) for this purpose. Adequate annotated training corpora are not yet publicly available for Albanian, so we created our own corpus, annotated manually by humans. Its domain is Albanian news documents published in 2015 and 2016. We tested our trained model with two test sets; the overall precision, recall and F-score are 83.2%, 60.1% and 69.7% respectively.

1 Introduction

Named Entity Recognition (NER) is an important tool in almost all Natural Language Processing (NLP) application areas. NLP systems that include some form of information extraction have gained much attention from both the academic and the business intelligence community.

Identifying and classifying words of a text into different classes is the process known as named entity recognition (NER) [ZPZ04]. In simple terms, a named entity is a group of consecutive words found in a sentence that represents an entity of the real world, such as a person, location, organization, or date. For instance, in the sentence "Matteo Renzi is an Italian politician who has been the Prime Minister of Italy since 22 February 2014 and Secretary of the Democratic Party since 15 December 2013.", "Matteo Renzi", "Italy" and "Democratic Party" can be classified as person, location and organization entities, respectively.

In this work we describe a machine learning approach for recognizing named entities in Albanian text documents. Since the Albanian language lacks publicly available annotated training corpora for NER, we created a custom annotated corpus consisting of news articles written in Albanian and published in various online news media. The corpus was built using a custom web application that allowed for n-gram based annotation sessions. Experiments were conducted using the Stanford CRF-based NER toolkit (http://nlp.stanford.edu/software/CRF-NER.html). The results were promising despite the small size of the created corpus.

The rest of this paper is structured as follows. In Section 2 we present previous work on NER and related approaches. In Section 3 the Conditional Random Fields approach is described. In Section 4 we describe our corpus and the methodology used for creating it. In Section 5 we present our experimental setup, and in Section 6 the results. Finally, Section 7 concludes the paper.

2 Related Works

NER approaches have been reported since the early 90s. One of the first works is described by Rau in [Rau91]. This paper presents the idea of a system that extracts and recognizes company names, relying on handcrafted rules and heuristics.

Since NER is language dependent, many systems have been presented for different languages. In [DBG+00] a NER system is described that recognizes named entities in texts written in Greek. This approach followed the MUC-7 NER task definition
[CR97] with certain adaptations. The entity classes captured are people, organizations, location names, date and time expressions, and percent and money expressions. The system is based on finite state machine techniques; the achieved precision and recall were 0.86 and 0.81 respectively.

An interesting study by Pathak et al. [PGJ+13] focuses on clinical named entities. It recognizes three types of named entities: Problem, Test and Treatment. The authors propose an approach that uses domain specific knowledge in the form of clinical features along with textual and linguistic features. The textual features are stemming, prefixes, suffixes and orthographic features; the linguistic features are part-of-speech (POS) tags, chunks and NP heads; the clinical features are section headers, customized stop words, dictionary search, abbreviations and acronyms. They performed experiments on the i2b2 shared task using CRF++ (https://taku910.github.io/crfpp/). The evaluation used micro-averaged precision, recall and F-score for exact and inexact matches. For exact matches they achieved 0.889 precision, 0.813 recall and 0.849 F-score; for inexact matches, 0.966 precision, 0.883 recall and 0.923 F-score.

An approach for the German language is presented by Faruqui et al. in [FPS10]. Their work consists of training an existing Stanford NER system on various German semantic generalization corpora. Semantic generalization refers to acquiring semantic similarities from large, unlabelled corpora that can support the generalization of predictions to new, unseen words in the test set while avoiding over-fitting. The corpora were evaluated on both in-domain and out-of-domain data, assessing the impact of generalization corpus size and quality. The F-score of this system improves by 6% (in-domain) and 9% (out-of-domain) over supervised training approaches.

Benajiba et al.
in [BDR+08] developed a NER system for the Arabic language. The features used are contextual, lexical and morphological features, geographical dictionaries (gazetteers), part-of-speech tags, base-phrase chunking, nationality, and the corresponding English capitalization. The system was evaluated using the ACE corpora (http://corpus.ied.edu.hk/ace/Corpus.html) and ANERcorp (http://users.dsic.upv.es/~ybenajiba/). The aggregate F-score of this system (when all the features are considered) is 82.71%.

A valuable approach for the Albanian language is presented for the first time by Skënduli and Biba in [SB13]. Their work uses a human annotated corpus whose domain is focused on politics and history documents. The corpus is a collection of three sub-corpora: a People corpus, a Locations corpus and an Organizations corpus. They performed experiments with these corpora using Apache OpenNLP (https://opennlp.apache.org/) as a framework for running their machine learning based NER approach. The People corpus produced a precision, recall and F-score of 0.85, 0.70 and 0.76 respectively; the Locations corpus 0.83, 0.66 and 0.73; and the Organizations corpus 0.69, 0.60 and 0.64.

In general, the NER approaches reported for most languages belong to one of three categories:

1. Rule based
2. Machine learning
3. Hybrid models

The first category is based on handcrafted rules, linguistic approaches and gazetteers. The second is based on statistical methods; the most used methods for statistical NER are Maximum Entropy Models [SB13], Conditional Random Fields [PGJ+13, FPS10, LMP01], Hidden Markov Models [ZS02] and Support Vector Machines [BDR+08]. The third category combines rule based and machine learning methods [Rau91, DBG+00, BDR+08]. Machine learning based methods depend on preliminary training, and the training methods can be divided into three groups: supervised, semi-supervised and unsupervised learning. Supervised methods need annotated training data to retrieve optimal results from the classifier. Semi-supervised methods require some data which are used as an aid for the training. Unsupervised methods do not depend on training data and are mostly clustering based.

3 Conditional Random Fields

In this work we used a linear chain CRF sequence classifier. Conditional Random Fields (CRFs) are a probabilistic framework for segmenting and labeling sequence data. They are undirected graphical models used to calculate the conditional probability of values on designated output nodes, given the values assigned to the input nodes. The conditional probability of a state sequence y = (y_1, ..., y_T) given an observation sequence x = (x_1, ..., x_T) is calculated as:

$$p_\theta(y|x) = \frac{1}{Z_\theta(x)} \exp\left\{\sum_{t=1}^{T}\sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t, x_t)\right\} \qquad (1)$$

where f_k(y_{t-1}, y_t, x_t) is a feature function whose weight theta_k is to be learned via training. The values of the feature functions may range from negative to positive infinity, but usually they are binary. When applying CRFs to the named entity recognition problem, the observation sequence is typically a sequence of tokens or a raw text, and the state sequence is its corresponding sequence of labels [LMP01]. By the Hammersley-Clifford theorem, the normalisation factor Z_theta(x), which ensures that the probability distribution sums up to 1, is computed over all possible state sequences:

$$Z_\theta(x) = \sum_{y \in Y^T} \exp\left\{\sum_{t=1}^{T}\sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t, x_t)\right\} \qquad (2)$$
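To make equations (1) and (2) concrete, the following is a small didactic sketch in Python, not code from Stanford NER: a toy linear chain CRF with two labels and two invented binary feature functions with hand-set weights. The normalisation factor Z_theta(x) is computed here by brute force over all label sequences; real implementations use dynamic programming (the forward algorithm) instead.

import itertools
import math

# Toy linear chain CRF illustrating equations (1) and (2). The labels,
# feature functions and weights below are invented for illustration;
# in practice the weights theta_k are learned from annotated data.
LABELS = ["O", "PER"]

def f1(y_prev, y, x):
    # fires when a capitalized token is labeled PER
    return 1.0 if y == "PER" and x[0].isupper() else 0.0

def f2(y_prev, y, x):
    # fires when two consecutive tokens are both labeled PER
    return 1.0 if y_prev == "PER" and y == "PER" else 0.0

FEATURES = [f1, f2]
THETA = [2.0, 1.0]

def score(y_seq, x_seq):
    # inner double sum of equation (1)
    total, y_prev = 0.0, "O"
    for y, x in zip(y_seq, x_seq):
        total += sum(t * f(y_prev, y, x) for t, f in zip(THETA, FEATURES))
        y_prev = y
    return total

def probability(y_seq, x_seq):
    # equation (1): exp{score}/Z, with Z from equation (2) obtained by
    # enumerating every possible label sequence (feasible only for toys)
    z = sum(math.exp(score(c, x_seq))
            for c in itertools.product(LABELS, repeat=len(x_seq)))
    return math.exp(score(y_seq, x_seq)) / z

tokens = ["Matteo", "Renzi", "viziton", "Tiranen"]
print(probability(("PER", "PER", "O", "O"), tokens))  # one candidate labeling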
4 Corpus

There are no publicly available NER-annotated corpora for Albanian texts. Hence we decided to create a corpus of Albanian based on news articles published online by different local newspapers. We used the Albanian news aggregator built by [HBN16] using Scrapy (https://scrapy.org/). News articles retrieved by this aggregator are stored in a MySQL database. We used the Python NLTK toolkit to generate all n-grams (for n = 1, 2, 3, 4) for each news article; all generated n-grams are stored in the same database together with the corresponding news articles. In this paper we have considered only unigrams. Figure 1 shows the workflow diagram of building our corpus.

Figure 1: Workflow diagram (aggregated news texts → tokenizer → n-grams generator → annotated n-grams → annotated corpus).

In order to add labels to the generated unigrams, we built a web application using ASP.NET WebForms (https://www.asp.net/web-forms/), C# (https://msdn.microsoft.com/en-us/library/67ef8sbd.aspx), jQuery/Ajax (https://jquery.com/) and JavaScript technologies. The application has two simple user interfaces. The first user interface (Figure 2) lists the titles of news articles and allows selecting each of them for n-gram labeling.

Figure 2: News User Interface.

The second user interface (Figure 3) consists of two parts: the first displays the raw content of a selected news article, and the second displays all of its unigrams. For each unigram, annotators are able to set a corresponding label from a list of predefined entity classes.

Figure 3: Unigrams User Interface.

The web application actually offers interfaces for labeling bigrams and trigrams as well, but because the NER training model that we used for our experiments depends on labeled unigrams, we were limited to these. In order to visually aid the entity identification process, each word inside the news content which starts with an uppercase character is highlighted in yellow.

The application allows annotators to work on the same news item without overriding previous n-gram labels, storing each annotation instead and thus enabling quality control processes. However, we avoided this for the experiments reported in this work, leaving experimentation with annotation quality assurance techniques for future work.

Our corpus consists of 130 documents. The selected news documents were published in two different years (2015 and 2016) and belong to eight categories: politics, economy, sports, health, technology, culture, chronicle and opinions.

The corpus was manually annotated by humans. We organized three sessions with volunteer annotators in order to annotate more n-grams. In the first and second sessions, volunteers annotated all news articles designated for the training set. In the third session we used annotators that had not participated in the previous sessions, in order to annotate the test sets. The annotation was done according to the Inside-Outside (IO) format (http://nlp.stanford.edu/software/crf-faq.html) with four tags, as described in Table 1.

Table 1: Named Entity Tagset

  NE tag   Meaning             Example
  PER      person name         George PER Bush PER
  LOC      location name       Tirana LOC
  ORG      organization name   OSCE ORG
  O        not an entity       76% O
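As an illustration of the pipeline described above, the following is a minimal sketch, not our actual corpus-building code: it generates the n-grams (n = 1..4) of a news sentence with NLTK and writes the labeled unigrams in the one-token-per-line, tab-separated format accepted by Stanford NER's CRFClassifier for training. The sample sentence and its IO labels (Table 1 tagset) are invented for illustration; in the real workflow the labels come from the human annotators.

from nltk.util import ngrams

# A whitespace split stands in for the real tokenizer used in the pipeline.
text = "OSBE pergezon zgjedhjet e zhvilluara ne Tirane"
tokens = text.split()

# n-grams for n = 1..4, as stored in the database alongside the article
for n in range(1, 5):
    print(list(ngrams(tokens, n)))

# Labeled unigrams written as tab-separated token/label pairs (Table 1 tags),
# the training format expected by Stanford NER's CRFClassifier.
labels = ["ORG", "O", "O", "O", "O", "O", "LOC"]
with open("train.tsv", "w", encoding="utf-8") as out:
    for token, label in zip(tokens, labels):
        out.write(f"{token}\t{label}\n")

A file like this can then be referenced from the trainFile property of a CRFClassifier properties file, with map = word=0,answer=1 declaring the two columns (see the Stanford CRF FAQ linked above).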
5 Experiments and Results

5.1 Experimental Set-up

We performed our experiments with Stanford NER, a Java implementation of a named entity recognizer. It comes with well-engineered feature extractors for named entity recognition and many options for defining feature extractors. Stanford NER is also known as CRFClassifier; the software provides a general implementation of (arbitrary order) linear chain Conditional Random Field (CRF) sequence models. That is, by training your own models on labeled data, you can use this code to build sequence models for NER or any other task [FGM05].

5.1.1 Evaluation Metrics

We evaluated the results of our experiments with well-accepted standard measures for trained NER models. This is performed by running the trained model on a corpus and comparing its annotations with a human-annotated gold standard. Each annotation is thus classified as being a:

1. True Positive (TP): the system provides an annotation that exists in the gold standard corpus.
2. True Negative (TN): the non-existence of an annotation is correct according to the gold standard corpus.
3. False Positive (FP): the system provides an annotation that does not exist in the gold standard corpus.
4. False Negative (FN): the system does not provide an annotation that is present in the gold standard corpus.

Concretely, we used precision, recall and F-score, as used by other authors in [DBG+00, BDR+08, SB13].

Recall measures the ability of a trained NE model to find all relevant entities, and is formulated as:

$$Recall = \frac{TP}{TP + FN}$$

Precision measures the ability of a trained NE model to present only relevant entities, and is formulated as:

$$Precision = \frac{TP}{TP + FP}$$

These two measures of performance can be combined into one performance metric, the F-score, computed as the weighted harmonic mean of precision and recall:

$$F\text{-}score = 2 \times \frac{Precision \times Recall}{Precision + Recall}$$
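For completeness, the following is a small helper, a sketch rather than the evaluation code of Stanford NER, that computes these three measures from TP, FP and FN counts according to the definitions above. The counts in the example call are invented.

def precision_recall_fscore(tp, fp, fn):
    # Precision = TP / (TP + FP); Recall = TP / (TP + FN);
    # the F-score is their harmonic mean, as defined above.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fscore = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
    return precision, recall, fscore

# Example with invented counts:
print(precision_recall_fscore(tp=90, fp=10, fn=60))  # (0.9, 0.6, 0.72)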
5.1.2 Experiments

Our corpus is divided into a training set and a test set, which contain 100 and 30 documents respectively. The training set contains news documents published in 2015, in total around 50,000 words. The test set is divided into two subsets of 15 documents each: the first contains news documents published in 2015, the second contains news documents published in 2016.

We conducted two experiments, the first using the first subset of the test data and the second using the second subset.

6 Results

The evaluation for each experiment was done using three different metrics: precision, recall and F-score. The following tables show the results for each test set; the training model is the same for both experiments. These calculations were carried out automatically by Stanford NER.

In the first experiment the NE class with the highest F-score is the Locations class (81.1%), and the class with the lowest value is the Organizations class (47.1%). Overall for the first experiment we obtained a precision of 80.8%, a recall of 64.0% and an F-score of 71.4% (see Table 2).

Table 2: Results for the first experiment.

  Entity class    Precision  Recall  F-score
  Locations       0.8219     0.8000  0.8108
  Organizations   0.6154     0.3810  0.4706
  People          0.8409     0.5441  0.6607
  Average         0.8077     0.6402  0.7143

In the second experiment the NE class with the highest F-score is the People class (78.7%), and the class with the lowest value is the Organizations class (35.3%). Overall for the second experiment we obtained a precision of 85.6%, a recall of 56.3% and an F-score of 67.9% (see Table 3).

Table 3: Results for the second experiment.

  Entity class    Precision  Recall  F-score
  Locations       0.8706     0.6379  0.7363
  Organizations   0.8333     0.2239  0.3529
  People          0.8429     0.7375  0.7867
  Average         0.8555     0.5627  0.6789

The overall average precision, recall and F-score are 83.2%, 60.1% and 69.7% respectively (see Table 4).

Table 4: Final results of the experiments.

                  Precision  Recall   F-score
  Experiment I    0.8077     0.6402   0.7143
  Experiment II   0.8555     0.5627   0.6789
  Average         0.8316     0.60145  0.6966

7 Conclusions and Future Directions

In this paper we presented the results of a machine learning approach for identifying named entities in text documents written in Albanian. It is based on Conditional Random Fields and was evaluated against two different test sets drawn from a corpus of Albanian news documents. The corpus was created by annotating news articles through a custom built web application; volunteer annotators performed this process manually using an n-gram based news visualization interface. The experiments were restricted to the recognition of three entity classes: people, locations, and organizations.

Even though the size of the annotated corpus is modest, we obtained promising results, showing that the evaluated model can be used to successfully extract named entities from Albanian text documents. The relatively low recall values for organization entities may be improved by using a larger corpus and expanding it beyond news documents written in Albanian.

In the future we intend to increase the size of the corpus in order to get more significant results. Furthermore, we aim to improve the quality of the annotated data by switching to a semiautomatic corpus creation approach [ACS14]. This would use a publicly available knowledge base of people, locations, and organizations, aiding human annotators in better recognizing possible named entities in the provided texts. We also want to improve the user interface involved in the annotation process and tweak it in order to avoid confusion and produce annotation results better suited for the NLP toolkit being used. Another aspect that we want to improve is the inclusion of a quality control scheme in the annotation process, which will allow us to avoid false or ambiguous tagging of the named entities present in the text documents in question.

Experimenting with other NER machine learning techniques such as Hidden Markov Models (HMM) and Support Vector Machines (SVM), and studying the behaviour of these approaches on documents written in Albanian, is also in our future plans. A NER tool for Albanian texts will also enable concrete applications like the creation of a knowledge base that stores facts about named entities present in news articles [HBN16].

References

[ACS14] Giuseppe Attardi, Vittoria Cozza, and Daniele Sartiano. Adapting linguistic tools for the analysis of Italian medical records. In Proceedings of the First Italian Conference on Computational Linguistics, 2014.

[BDR+08] Yassine Benajiba, Mona Diab, Paolo Rosso, et al. Arabic named entity recognition: An SVM-based approach. In Proceedings of the 2008 Arab International Conference on Information Technology (ACIT), pages 16-18, 2008.

[CR97] Nancy Chinchor and Patricia Robinson. MUC-7 named entity task definition. In Proceedings of the 7th Conference on Message Understanding, page 29, 1997.

[DBG+00] Iason Demiros, Sotiris Boutsis, Voula Giouli, Maria Liakata, Harris Papageorgiou, and Stelios Piperidis. Named entity recognition in Greek texts. In LREC, 2000.
[FGM05] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363-370. Association for Computational Linguistics, 2005.

[FPS10] Manaal Faruqui and Sebastian Padó. Training and evaluating a German named entity recognizer with semantic generalization. In KONVENS, pages 129-133, 2010.

[HBN16] Klesti Hoxha, Artur Baxhaku, and Ilia Ninka. Bootstrapping an online news knowledge base. In International Conference on Web Engineering, pages 501-506. Springer, 2016.

[LMP01] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), volume 1, pages 282-289, 2001.

[PGJ+13] Parth Pathak, Raxit Goswami, Gautam Joshi, Pinal Patel, and Amrish Patel. CRF-based clinical named entity recognition using clinical NLP. In Proceedings of the 10th International Conference on Natural Language Processing, 2013.

[Rau91] L. F. Rau. Extracting company names from text. In Proc. Seventh IEEE Conference on Artificial Intelligence Applications, volume i, pages 29-32, February 1991.

[SB13] Marjana Prifti Skënduli and Marenglen Biba. A named entity recognition approach for Albanian. In Advances in Computing, Communications and Informatics (ICACCI), 2013 International Conference on, pages 1532-1537. IEEE, 2013.

[ZPZ04] Li Zhang, Yue Pan, and Tong Zhang. Focused named entity recognition using machine learning. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '04, pages 281-288, New York, NY, USA, 2004. ACM.

[ZS02] GuoDong Zhou and Jian Su. Named entity recognition using an HMM-based chunk tagger. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 473-480. Association for Computational Linguistics, 2002.