Labelling companies referred to in newspaper articles

Amine Nahid, Előd Egyed-Zsigmond, Sylvie Calabretto
amine.nahid@insa-lyon.fr, elod.egyed-zsigmond@insa-lyon.fr, sylvie.calabretto@insa-lyon.fr
Université de Lyon; LIRIS UMR 5205, Lyon, France

ABSTRACT

There are several domains where establishing links between newspaper articles and companies is useful. In this paper, we present the first elements of our solution to predict links between a newspaper article written in French and a list of companies identified by their name and activity domain. We base our study on a semi-automatically annotated article corpus and the almost complete list of official French company names. We combine statistical linguistic methods with acronym generation and filtering techniques to propose a global score that predicts a distance between a text and a company. The main objective of the study presented in this paper is the creation of a usual name list for each company in order to improve the labelling of newspaper articles. This usual name list will then be combined with other methods into a global score that predicts a distance between a text and a company, so as to end up with a model that labels a newspaper article with its corresponding companies from our list.

CCS CONCEPTS

• Information systems → Document topic models; Relevance assessment.

KEYWORDS

natural language processing, named entity recognition, information retrieval, text mining, text tagging

"Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

1 INTRODUCTION

Businesses have always had an interest in assessing their performance and evaluating their financial and public relations situation. Hence, information contained in press articles, client feedback, etc. may be of strategic importance.

Our main problem is to link articles with companies, for a very large number of companies registered in France, each identified by its unique national identifier (SIREN code) and its legal name. However, companies are seldom referenced in the press by their legal names, which are often long. Our project is to design a solution to link economic press articles written in French with a set of companies. We have a semi-automatically annotated article ground-truth corpus and the list of the official denominations of around 30,000 companies registered in France. Our main contribution in this paper is a protocol to construct the common names of companies given their legal names and the set of annotated articles.

We carry out our experiments and develop our tools on French-language texts, but most of the methods used can easily be adapted to other languages.

2 STATE OF THE ART

Matching press articles with the companies they mention is part of the Named Entity Recognition (NER) domain. The term NER appeared for the first time at the MUC-6 conference [5]. The task of recognising company mentions in texts is hence a sub-problem of NER, in which we are interested only in entities representing companies. The issue can be addressed with different approaches. A baseline approach would be to search for the official name of the company in the text. Nonetheless, searching for the official name of a company within a newspaper article may prove inefficient, given that most companies have usual or common names that differ slightly from their legal ones. Working on a German corpus, [3] proposed the use of dictionaries of colloquial names from various sources, as well as an alias generator that derives an alias from an official denomination (it goes through some classic NLP data cleaning: removal of legal designations, special characters and geographic indications, and token normalisation). Other works have elaborated rule-based systems, built on heuristics and/or hand-crafted rules at the morphological level [4, 6, 7]. Unfortunately, rule-based methods are domain- and language-specific, and are therefore not portable. There have recently been attempts to perform generic NER tasks using deep learning [2], but these usually need many more training examples than we have, annotated more precisely. We are also experimenting with CRF (Conditional Random Fields) based techniques, with promising results; these experiments will be reported in a future paper.

In the following section we propose a statistics-based protocol to tackle the company recognition problem through common name dictionary generation.

3 PROPOSED APPROACH

In this section we present our company usual name creation method, first based on the official names and then on generated acronyms.

3.1 Hypothesis

Since companies are rarely referred to by their legal names and are rather known by one or more common names, we need to provide an accurate automatic protocol to generate these common names. Through observation of the legal names of a set of French companies, we made the following hypotheses:
• The common name of a company might simply be its legal name.
• The common name of a company might be a contiguous sequence of terms from the legal name (a sub-word ngram of the legal name).
• The common name might be an acronym of the legal name or of some part of it.
With these hypotheses, we aim to implement a common name generator that operates in two steps: as a first step it generates the sub-sequences and then the acronyms; the second step is the search for the best subset of ngrams and acronyms to compose the common name set.

3.2 Pre-processing

For our study, we have two data sets.
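The two-step generator hypothesised in Section 3.1 can be sketched as follows. This is a simplified illustration of our own, not the authors' implementation: the French stop-word list is a tiny sample, and acronyms built over partial names are omitted for brevity.

```python
def ngrams(legal_name):
    """All contiguous term sequences of the legal name (hypotheses 1 and 2)."""
    terms = legal_name.split()
    return {" ".join(terms[i:j]) for i in range(len(terms))
            for j in range(i + 1, len(terms) + 1)}

def acronyms(legal_name, stop_words=frozenset({"DE", "DU", "LA", "LE", "DES", "ET"})):
    """Acronyms built from the first one or two letters of each remaining term
    (hypothesis 3), over the whole name only. The stop-word set is illustrative."""
    terms = [t for t in legal_name.split() if t not in stop_words]
    if len(terms) < 2:
        return set()
    out = set()
    # every combination of taking 1 or 2 leading letters from each term
    for picks in range(2 ** len(terms)):
        out.add("".join(t[: 1 + ((picks >> k) & 1)] for k, t in enumerate(terms)))
    return out
```

For example, `ngrams("COMPAGNIE DU RHONE")` yields six candidates, from "COMPAGNIE" to the full name, and `acronyms("SOCIETE FRANCAISE DU RADIOTELEPHONE")` contains "SFR" once the stop word "DU" is dropped.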
The first one catalogues around 30k French companies identified by their SIREN code (the unique French identifier for businesses and not-for-profit organisations) and their legal names, written in capital letters with no accents. The second data set contains around 120 thousand annotated French newspaper article URLs, manually labelled with the SIREN codes of the companies they are talking about. Its elements are listed in accordance with the following scheme: id, SIREN code, legal name of the company, URL address of the article. We developed a scraper that collected the title and body of the articles when available, and finally included only the articles whose content (title and text) we managed to scrape. That gave us a dataset of around 58k articles.

We cleaned the official names from the first data set by removing punctuation marks, especially dots, commas and parentheses. However, we chose to keep hyphens, as their use is very common in French for compound names, which are considered single terms in our model. Examples:
• For companies without the special characters mentioned above, e.g. ELECTRICITE DE FRANCE, nothing is removed.
• For CA INDOSUEZ WEALTH (FRANCE), the parentheses are irrelevant and would be problematic for the ngram and acronym generation; the name therefore has to be transformed into CA INDOSUEZ WEALTH FRANCE before any further processing.
• There is a company registered in France with the official name CASINO, GUICHARD-PERRACHON. The comma is removed (hence CASINO GUICHARD-PERRACHON) as it is useless for any further processing. However, as written before, the hyphen is kept because GUICHARD-PERRACHON is actually one name and should not be considered as two separate terms.
• Dots are also removed so as to normalise the acronyms in use within the legal names, e.g. SARL and S.A.R.L.
For the second data set, we concatenate the titles and bodies under a unique attribute we call corpus. We also normalise the corpus by removing non-printable Unicode characters. The articles are then put into an Elasticsearch index.

[Figure 1: Number of companies per number of referencing articles]

Since not all companies have the privilege of being talked about very often in the press, our ground truth is similarly unbalanced. For our dataset, the graph (cf. Figure 1) shows the number of companies as a function of the number of articles labelled as talking about them. 2357 French companies have more than 6 articles labelled as talking about them. We shall call these well-documented companies and focus our study on them. We consider that for the other, less-documented companies, it is difficult to generate usual names based on annotated articles.

3.3 ngram generator

We call ngrams all the contiguous sequences of terms contained in an expression. Our approach here is, for each company, to generate all possible ngrams of its legal denomination. For instance, for the company "COMPAGNIE DU RHONE", we should generate the following ngrams: "COMPAGNIE", "DU", "RHONE", "COMPAGNIE DU", "DU RHONE", "COMPAGNIE DU RHONE".

In order to filter out potentially irrelevant ngrams we introduced two rules: filter out one-character ngrams, and filter ngrams based on their frequency in the official name list.

3.3.1 Occurrence frequency. For a given ngram we compute an inverse occurrence frequency score of_score(ngram) depending on the number of times it occurs in the set of company legal names. The higher of_score(ngram) is, the more unique the ngram is.

    of_score(ngram) = 1 − count({f ∈ C | contains(ln_f, ngram)}) / count(C)    (1)

Where:
• ngram is a subsequence of the legal name of the company c, containing n words, 0 < n ≤ word_count(ln_c)
• ln_c and ln_f are the legal names of the companies c and f
• C is the set of all the companies we have
• count({f ∈ C | contains(ln_f, ngram_c)}) is the size of the subset of companies from C containing ngram_c in their official name

3.3.2 Threshold. Once we have the of_score for an ngram, we define a threshold value threshold_ngram that determines the ngrams to keep and those to discard according to their of_score. This is our second ngram filter:

    if of_score(ngram) ≥ threshold_ngram then keep ngram, else discard it.

After implementation and study we empirically set the threshold_ngram value to 0.999. At the end of this step, we end up with a key-value dictionary, called dict_ngram, where the keys are the SIREN codes of the companies and the values are the lists of potentially relevant ngrams for the given company.

3.4 Acronym generator

Many companies are referred to by an acronym in the press. We therefore complete our list of ngrams derived from the legal names with acronyms. For each legal name, we want to generate every potential acronym that might be in use. For instance:
• SOCIETE FRANCAISE DU RADIOTELEPHONE is often referred to as SFR
• SOCIETE DE DISTRIBUTION DE PAPIER would be better known as SODIPA
The acronym generation is based on the first letter or the first two letters of each term of the whole legal name or of a part of it. Let us consider for example the company registered as TONNELLERIE FRANCOIS FRERES. Our acronym generator would provide the following possibilities:
• Based on the whole name: TFF, TOFRFR, TOFRF, TOFFR, TFFR, TFRF, TFRFR, TOFF
• Based on a part of the name: TF, FF, FFR, FRF, TFR, TOF, TOFR
Given our acronym generation protocol, we end up with a considerable number of acronyms for every company: the more words a legal name has, the longer the list of acronyms. We have therefore defined some rules to filter the potentially relevant ones:
• Any non-alphabetic symbol is removed, as part of the data cleaning for this process.
• French stop words are removed from legal names before generating any acronym.
• No acronym is generated for one-word legal names.
• For each generated acronym, we verify in our ground truth, among the articles tagged with the concerned company, whether there is at least one occurrence in the corpora. If an acronym occurs at least once, we keep it in our acronym dictionary; otherwise we discard it.
• For performance reasons, no acronym is generated if the legal name is strictly more than 5 words long. This concerns 29 companies out of the list of 2357 well-documented companies.
Similarly to the previous step, we end up with a key-value dictionary, called dict_acr, where the keys are the SIREN codes of the companies and the values are the lists of retained acronyms for the given companies.

3.5 F-measure

Having these two lists dict_ngram and dict_acr of potential usual names, we have to create a method to keep those that are actually useful for linking press articles to the companies. The first thing to do is to merge dict_ngram and dict_acr into a unique dictionary of "potential common names": dict_pcn. In this dictionary, every company, referred to by its SIREN code, has a list of unique ngrams and acronyms retained after applying the filters mentioned above.

The following step is to generate, for every "potential common name" list, all its sub-sets. The aim is to select the sub-set that contains the most relevant common names for every company. To do so, we compute the F-measure of every sub-list against our indexed ground truth. The latter contains the articles from the second data set described in the pre-processing part; for each article we know to which companies it refers. For each company c, we try to find one ngram set, which we call UsualNames_c, that maximises the F-measure when retrieving articles that contain at least one ngram ∈ UsualNames_c while looking for company c.

The F-measure, introduced at the MUC-4 conference [1], is the harmonic mean of precision and recall. We define as relevant_articles_c the articles annotated as talking about company c in our dataset. We define as retrieved_articles_i the set of articles containing at least one of the common names in the i-th sub-set of the set of potential common names of company c, as found in dict_pcn(c). The precision is the proportion of relevant_articles_c amongst retrieved_articles_i, whereas the recall is the proportion of relevant_articles_c that were actually retrieved.

In practice, we query the press article Elasticsearch index for articles that contain at least one of the ngrams of the input ngram sub-set (subset_ngrams_c). The documents returned by the query are called retrieved_articles(subset_ngrams_c). The ideal case would be that the query returns exactly the articles about the company c the input sub-set belongs to; these are what we define as relevant_articles_c. E.g., for BANQUE PALATINE, let us suppose that we have, among others, the potential common name sub-set [PALATINE, BP]; we would then retrieve any article containing PALATINE or BP at least once. Meanwhile, the relevant articles are obtained by querying all the articles tagged with the SIREN code of BANQUE PALATINE in our ground-truth dataset.

The formulae for the precision, the recall and the F-measure are:

    F(subset_c,i) = 2 · (precision_c,i · recall_c,i) / (precision_c,i + recall_c,i)    (2)

if precision_c,i + recall_c,i ≠ 0, and F(subset_c,i) = 0 if precision_c,i + recall_c,i = 0. Here subset_c,i is the i-th sub-set of the set of potential common names of company c, as found in dict_pcn(c).
The index i ranges over 0 < i ≤ L_c, where L_c is the number of sub-sets of ngrams for the company c. The precision and the recall are computed as:

    precision_c,i = |{relevant_articles_c} ∩ {retrieved_articles_i}| / |{retrieved_articles_i}|    (3)

    recall_c,i = |{relevant_articles_c} ∩ {retrieved_articles_i}| / |{relevant_articles_c}|    (4)

Finally, once the computation is finished, we keep, for every company, the subset that maximises the F-measure as the definitive list of relevant common names of the concerned company. Recall that for the 29 companies excluded from acronym generation in Section 3.4, we set their usual names manually.

The computation of the F-measures is important, as the most relevant list of common names is not necessarily the largest one. In fact, adding an ngram or an acronym does not improve the F-measure in all cases. Sometimes, adding a new keyword (ngram or acronym) to a query adds noise to the results: even if the recall improves or stays constant, the precision might worsen and hence decrease the F-measure.

4 EXPERIMENTATION RESULTS

F-measures on dict_subpcn and comparison with legal names

We ran the protocol described in the previous section on the 2357 well-documented companies, i.e. those having at least 6 articles within the corpus labelled as talking about them. We calculated dict_subpcn as described in the previous section. We defined dict_first, which only keeps, for each company, the subset that maximises the F-measure, together with its F-measure, precision and recall values. We joined dict_first with dict_legal to compare the F-measures obtained with either the legal name alone or the best common name subset. Table 1 summarises the mean values of the results obtained for this experiment.

Table 1: Mean values of the F-measure computation on the well-documented companies using their common names (first row) and legal names only (second row)

                                 F-measure   Precision   Recall
    best subset of common names  0.562       0.537       0.765
    legal names only             0.446       0.476       0.560
    difference                   0.116       0.061       0.205

• We obtained an average F-measure of 0.56 with the best common name subset for every company, whereas the average when tagging only through legal names for the same sample did not exceed 0.45. Our protocol thus achieved an average F-measure improvement of 11.6 points. It also improves the recall by 20.5 points, which means that we retrieve more relevant articles using the generated potential common names.
• The minimum difference in F-measure is negligible and can be considered nought. We can therefore affirm that our protocol either improves article tagging or leaves it at a stable level.
• The precision improvement is quite low compared to the recall's. This is mainly due to complex scraping issues (such as URL redirections) that brought noise into the data, or to omitted annotations on some articles. The latter issue is a matter of interesting debate: should an article be labelled whenever it mentions a company, or only when the main topic is about it?

5 CONCLUSION AND PERSPECTIVES

In this study we proposed a method to build common, usual names for companies, based on an official or legal name list and a large set of annotated news articles. Our experiments show that the method significantly improves the F-score of the retrieval of relevant articles for a company when using our improved usual name set instead of only the official name.

The filtering of the best common name sets through a greedy F-measure based approach is complete, but we could eventually give an individual score to each potential usual name and filter names according to that score, instead of studying all possible potential usual name sets.

As our final goal is to label articles with the companies they are talking about, we propose to work on other company-article distance measures. One is based on Part-Of-Speech-tagging-based NER (Named Entity Recognition) analysis combined with machine learning methods (for instance Conditional Random Fields) to predict the words in a phrase that have a high probability of being company names. This can eliminate some false positives when company names are also common words, like Orange or But. Another idea is to train a classifier to guess the activity sector an article is talking about and to check whether it is close to the activity sector of the proposed companies. This method would help increase the precision of our retrieval.

ACKNOWLEDGMENTS

To Infolégale, for providing the ground truth dataset. To G. Benturquia, Y. Latreche, G. Meddour, T. E. Mekhalfa and A. E. Pereyra, for the most helpful work they did during their fifth-year research module.

REFERENCES
[1] Nancy Chinchor. 1992. MUC-4 evaluation metrics. In Proceedings of the 4th Conference on Message Understanding (MUC4 '92). Association for Computational Linguistics, McLean, Virginia, 22–29. https://doi.org/10.3115/1072064.1072067
[2] Jing Li, Aixin Sun, Jianglei Han, and Chenliang Li. 2018. A Survey on Deep Learning for Named Entity Recognition. CoRR abs/1812.09449 (2018). arXiv:1812.09449 http://arxiv.org/abs/1812.09449
[3] Michael Loster, Zhe Zuo, Felix Naumann, Oliver Maspfuhl, and Dirk Thomas. 2017. Improving Company Recognition from Unstructured Text by using Dictionaries. (2017), 10 pages.
[4] Andrei Mikheev, Claire Grover, and Marc Moens. 1998. Description of the LTG System Used for MUC-7. In Seventh Message Understanding Conference (MUC-7): Proceedings of a Conference Held in Fairfax, Virginia, April 29 – May 1, 1998. Fairfax, Virginia. https://www.aclweb.org/anthology/M98-1021
[5] David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticæ Investigationes 30, 1 (Jan. 2007), 3–26. https://doi.org/10.1075/li.30.1.03nad
[6] L. F. Rau. 1991. Extracting company names from text. In Proceedings of the Seventh IEEE Conference on Artificial Intelligence Applications (Miami Beach, FL, USA, Feb. 1991), Vol. i. IEEE, 29–32. https://doi.org/10.1109/CAIA.1991.120841
[7] GuoDong Zhou and Jian Su. 2002. Named Entity Recognition using an HMM-based Chunk Tagger. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 473–480. https://doi.org/10.3115/1073083.1073163