NLP4NLP: Applying NLP to scientific corpora about written and spoken language processing

Gil Francopoulo¹, Joseph Mariani² and Patrick Paroubek³

¹ gil.francopoulo@wanadoo.fr
IMMI-CNRS + TAGMATICA, rue John von Neumann, 91405 Orsay Cedex, France

² joseph.mariani@limsi.fr
IMMI-CNRS + LIMSI-CNRS, rue John von Neumann, 91405 Orsay Cedex, France

³ pap@limsi.fr
LIMSI-CNRS, rue John von Neumann, 91405 Orsay Cedex, France

Abstract

This paper addresses the problem of analyzing how the trends of a scientific domain evolve, in order to provide insights into its state and to establish reliable hypotheses about its future. We have approached the problem by processing both the metadata and the text contents of the domain publications. Ideally, one would like to be able to automatically synthesize all the information present in the documents and their metadata. As members of the NLP community, we have applied the tools developed by our community to publications from our own domain, in what could be termed a “recursive” approach. In a first step, we assembled a corpus of papers from NLP conferences and journals for both text and speech, covering documents produced from the 1960s up to 2015. Then we mined our scientific publication database to draw a picture of our field from quantitative and qualitative results according to a wide range of perspectives: sub-domains, specific communities, chronology, terminology, conceptual evolution, re-use and plagiarism, trend prediction, novelty detection and many more. We provide here an account of the corpus collection and of its processing with NLP technology, indicating for each aspect which technology was used. We conclude on the benefits brought by such a corpus to the actors of the domain and on the conditions for generalizing this approach to other scientific domains.
Conference Topics
Methods and techniques, Citation and co-citation analysis, Scientific fraud and dishonesty, Natural Language Processing

1 Introduction

The NLP4NLP corpus, the object of this paper, covers both the written and speech sub-domains of NLP and also encompasses a small sub-corpus in which Information Retrieval and NLP activities intersect. The corpus was built at LIMSI-CNRS (France) and contains to this day 57,235 documents from various conferences and journals with different access policies (from public to restricted). Our approach was to apply NLP tools to articles about NLP itself. We chose NLP as our first application domain because we are knowledgeable about the domain ourselves, and are thus better placed to appreciate the amount of in-domain knowledge required to judge the pertinence of the results returned by automatic analysis, in particular concerning author names, institution labels and acronyms, the domain terminology and the scientific concepts mentioned.

2 Existing Corpora

Among all the NLP corpora available on the Internet, the ACL Anthology¹ is one of the best known because of its wide coverage in terms of time span and number of papers (more than 20,000 ACL related papers²) and also because it provides full access to both the metadata and the contents of the papers. Most of the papers on the site are in English and come from ACL events or journals, with a few additions from other sources like the 4,550 papers from the LREC conference series³ or the 976 articles in French or English from the TALN conference series⁴. Other sites exist, like SAFFRON⁵, which displays results obtained by processing the content of the ACL Anthology and of the LREC or CLE conference sites. The site of the CLAIR group at the University of Michigan⁶ is more focused on ACL and provides search functionalities supported by apparently more elaborate numerical computations.

¹ http://aclweb.org/anthology
Although these sites are very valuable resources for the community, they mainly offer publications on the processing of written material, since the conferences on speech processing (the other “side” of the NLP domain) are mostly managed by two large associations: ISCA⁷ (for the conferences Interspeech, ICSLP and Eurospeech) and the IEEE Signal Processing Society⁸ (for the ICASSP conferences). To our knowledge, the sites mentioned above constitute the main repositories of scientific articles for the NLP domain. Papers coming from the ACL Anthology account for only about 35% of our corpus, while the remaining part is made of publications originating essentially from ISCA and the IEEE Signal Processing Society. Note that although we could have limited this study to one of the two spheres, text processing or speech processing, it was important for us to cover both, since our lab has teams working in the two spheres and we are particularly interested in comparing their evolutions and studying the links between the two sub-domains of NLP.

3 Related works

To the best of our knowledge, this is the first time that such an extensive study covering both the text and speech processing domains has been undertaken. Among the different works done in the past on scientific publications, the most notable one is probably the 2012 workshop organized by ACL in Jeju (South Korea): “Rediscovering 50 years of Discoveries in Natural Language Processing”. On a smaller scale, and including articles in both English and French, there is the work of Florian Boudin (Boudin 2013) on the TALN conference series. Concerning speech processing only, there are the two recent studies presented on the occasion of the 25 years of ISCA during Interspeech 2013 (Mariani et al 2013) and, more specifically on resources and evaluation for text and speech processing, there is the study presented for the 15 years of LREC at LREC-2014 (Mariani et al 2014).
4 Data collection

In our study we distinguish the notion of sub-corpora specific to a journal or a conference series (for instance COLING), which can also be divided according to time, using a year as unit. The combination of both filtering criteria (sub-corpus and year) identifies what we call an “event”. For each document, we process two kinds of information: the metadata and the textual content. Often, different versions of the metadata were available, which enabled us to perform consistency checks. In our database, the metadata is made of the corpus name, the year, the authors (with the given name(s) clearly distinguished from the family name(s)), and the document title. The metadata have been cleaned by automatic processing and manually checked by experts of the domain, limiting the checks to the most frequent phenomena in the cases where the task was too daunting. The metadata can be considered “cleaner” than the ones generally available, in the sense that we fixed various typos and inconsistencies present in the publicly available version. As an extra resource, we also have the ISCA member registry for speech processing papers. This registry is very useful for author gender statistics, as it contains explicit information on whether the author is male or female, thus giving us the means to disambiguate epicene given names. Note that in the case of LREC, a manual identification of gender has been done for authors with an epicene given name.

² The figures given here were valid as of March 2015.
³ http://www.lrec-conf.org
⁴ http://www.atala.org/-Conference-TALN-RECITAL
⁵ http://saffron.insight-centre.org
⁶ http://clair.eecs.umich.edu/aan/index.php
⁷ http://www.isca-speech.org/iscaweb
⁸ http://www.signalprocessingsociety.org
Originally, the textual content of the publications is in PDF format, of two kinds: first, PDF files holding only a scan of the original document, without any direct access to the content in raw text format; second, PDF files from which the text of the original document is directly retrievable. For the former we had to use OCR to recover the text content (see the preprocessing section below).

5 Data collection

Up to now we have collected 32 sub-corpora in our NLP4NLP corpus. Their list is given in Table 1. In the corpus, the vast majority (90%) of the documents come from conferences and the remaining part from journals. As a convention, we call “document” an article which has been published in a given conference or journal, and we call “paper” the physical object which holds a unique identifier. The difference is subtle, as we will see. Indeed, the cells of the table do not add up to the 57,235 figure given above but to slightly more (59,766), because a small number of conferences are joint conferences for some years, which means that a single paper matches two different documents which respectively belong to two different corpora. Quantitatively, this is not an important phenomenon, because joint events happen relatively rarely, but these situations make comparing two sub-corpora more complex.

Initially, texts are in four languages: English, French, German and Russian. The texts in German and Russian represent less than 0.5% of the total. They are detected by the automatic language detector of the industrial pipeline TagParser⁹ and discarded. The texts in French are a little more numerous (3%, 1,871 exactly). They are kept with the same status as the English ones. This is not a problem because the NLP pipeline we use is bilingual.

6 Preprocessing and normalization

Textual contents and metadata are built independently in parallel. For PDF documents, we use PDFBox¹⁰ in order to extract the text content from the articles.
When the PDF document holds only a scan of the original document, we apply OCR through the Tesseract¹¹ application. The texts resulting from both types of conversion are encoded in Unicode UTF-8. A filtering program is applied to correct the most frequent OCR problems identified. End-of-line processing is then run with the TagParser dictionary in order to distinguish a hyphen that marks a line-break caesura from one that belongs to a compound word. Then, a set of “pattern matching” rules is applied to separate the abstract, the body and the reference section. For the metadata, the author names and the title are extracted from the conference program or the BibTeX material, depending on the source. Each author name is split into a given name and a family name, with an automatic check against a large dictionary of given names in ISO-LMF (ISO-24613) format comprising 74,000 entries.

⁹ www.tagmatica.com
¹⁰ https://pdfbox.apache.org
¹¹ https://code.google.com/p/tesseract-ocr

Table 1. List of subcorpora contained in the NLP4NLP corpus.

short name | # docs | format | long name | language | access to content | period | # venues¹²
acl | 4262 | conference | Association for Computational Linguistics conference | English | open access* | 1979-2014 | 36
alta | 262 | conference | Australasian Language Technology Association | English | open access* | 2003-2014 | 12
anlp | 329 | conference | Applied Natural Language Processing | English | open access* | 1983-2000 | 6
cath | 932 | journal | Computers and the Humanities | English | private access | 1966-2004 | 39
cl | 777 | journal | American Journal of Computational Linguistics | English | open access* | 1980-2014 | 35
coling | 3833 | conference | Conference on Computational Linguistics | English | open access* | 1965-2014 | 21
conll | 789 | conference | Computational Natural Language Learning | English | open access* | 1997-2014 | 17
csal | 718 | journal | Computer Speech and Language | English | private access | 1986-2015 | 29
eacl | 900 | conference | European Chapter of the ACL | English | open access* | 1983-2014 | 14
emnlp | 1708 | conference | Empirical methods in natural language processing | English | open access* | 1996-2014 | 19
hlt | 2080 | conference | Human Language Technology | English | open access* | 1986-2013 | 18
icassps | 9023 | conference | IEEE International Conference on Acoustics, Speech and Signal Processing - Speech Track | English | private access | 1990-2014 | 25
ijcnlp | 899 | conference | International Joint Conference on NLP | English | open access* | 2005-2013 | 5
inlg | 199 | conference | International Conference on Natural Language Generation | English | open access* | 1996-2012 | 6
isca | 17592 | conference | International Speech Communication Association conferences (Eurospeech, ICSLP, Interspeech) | English | open access | 1987-2014 | 27
jep | 507 | conference | Journées d'Etudes sur la Parole | French | open access* | 2002-2014 | 5
lre | 276 | journal | Language Resources and Evaluation | English | private access | 2005-2014 | 10
lrec | 4550 | conference | Language Resources and Evaluation Conference | English | open access* | 1998-2014 | 9
ltc | 299 | conference | Language and Technology Conference | English | private access | 2009-2013 | 3
modulad | 232 | journal | Le Monde des Utilisateurs de L'Analyse des Données | French | open access | 1988-2010 | 23
muc | 149 | conference | Message Understanding Conference | English | open access* | 1991-1998 | 5
naacl | 1000 | conference | North American Chapter of ACL | English | open access* | 2001-2001 | 10
paclic | 1040 | conference | Pacific Asia Conference on Language, Information and Computation | English | open access* | 1995-2014 | 19
ranlp | 363 | conference | Recent Advances in Natural Language Processing | English | open access* | 2009-2013 | 3
sem | 752 | conference | Lexical and Computational Semantics / Semantic Evaluation | English | open access* | 2001-2014 | 7
speechc | 549 | journal | Speech Communication | English | private access | 1982-2015 | 34
tacl | 92 | journal | Transactions of the Association of Computational Linguistics | English | open access* | 2013-2015 | 3
tal | 156 | journal | Revue Traitement Automatique du Langage | French | open access | 2006-2013 | 8
taln | 976 | conference | Traitement Automatique du Langage Naturel | French | open access* | 1997-2014 | 18
taslp | 2659 | journal | IEEE Transactions on Audio, Speech and Language Processing | English | content not yet included | 1993-2015 | 23
tipster | 105 | conference | Tipster DARPA text program | English | open access* | 1993-1998 | 3
trec | 1756 | conference | Text Retrieval Conference | English | open access | 1992-2014 | 23
Total | 59766 | | | | | 1965-2015 | 515

¹² This is the number of venues where data was obtainable. There may have been more venues.

Then a matching process is applied between the different metadata records in order to normalize the textual realization of author names (e.g. matching an initial with a first name, or normalizing the typography of compound first names). The result is then manually checked by members of the team who are familiar with the domain, limiting this manual check to the most frequent items if the number of items to validate becomes too large.

Then comes an important step: the calibration of the parsing pipeline. For each corpus, an automatic parsing is performed with TagParser to identify the presence of unknown words in the documents. We make the hypothesis that the number of unknown words relative to the number of words in the texts is a good inverse indicator of the average quality of the initial data and of the processing the material has undergone so far. Discrepancies in the statistical profile are used to identify subcorpora which differ too widely from the average profile. We assume that the lower the percentage of unknown words, the better the quality of the produced text. The calibration also makes it possible to adjust the preprocessing steps and to compare the various processing steps quantitatively, to ensure the homogeneity of the data produced. We tried different tools, like ParsCit¹³ or hand-written rules, and the calibration showed that computing names, titles and content globally and directly from the PDF is a bad choice with regard to the resulting quality. This is why we no longer build the metadata from the PDF file but from other sources.
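The calibration idea described above can be illustrated with a minimal sketch. It assumes that each subcorpus is available as a list of tokens and that the lexicon is a plain set of known word forms; the function names, the toy data and the one-standard-deviation threshold `k` are hypothetical and are not taken from TagParser or from the actual pipeline.

```python
from statistics import mean, pstdev

def unknown_word_ratio(tokens, lexicon):
    """Share of alphabetic tokens that are absent from the lexicon."""
    words = [t.lower() for t in tokens if t.isalpha()]
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in lexicon)
    return unknown / len(words)

def flag_outliers(ratios, k=1.0):
    """Subcorpora whose ratio exceeds the mean by more than k standard deviations.

    The threshold k is an illustrative assumption, not a value from the paper.
    """
    values = list(ratios.values())
    mu, sigma = mean(values), pstdev(values)
    return sorted(name for name, r in ratios.items() if r > mu + k * sigma)

# Toy lexicon and toy subcorpora (invented for the example).
lexicon = {"the", "parser", "reads", "a", "sentence", "speech", "signal", "is", "noisy"}
subcorpora = {
    "clean": "the parser reads a sentence".split(),
    "speechy": "the speech signal is noisy".split(),
    "ocr_damaged": "tlie parscr rcads a sentcnce".split(),
}

ratios = {name: unknown_word_ratio(toks, lexicon) for name, toks in subcorpora.items()}
outliers = flag_outliers(ratios)
print(ratios)
print(outliers)
```

On this toy input only the subcorpus with simulated OCR damage is flagged, which mirrors the use of the unknown-word percentage as an inverse quality indicator.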
7 Computing analysis indicators

The various analysis indicators that we produce are the following.

Basic counting: the number of authors is one of the most basic indicators for following the chronological evolution of each subcorpus. The number of different authors is 43,365 for 515 events.

Co-authoring counting: the aim is to follow the number of co-authors along the time line. The results show that this number is constantly increasing regardless of the corpus. Over the whole archive, the average number of co-authors varies from 1.5 for the Computers and the Humanities journal to 3.6 for LREC. Some additional counts concern the signature order: is an author always or never mentioned as first author?

Renewal rate: this indicator shows the author turnover. It answers the question of whether the community associated with a subcorpus is stable or not.

Gender counting: the author's gender is determined from the given name, together with a member registry (ISCA) or a manual identification (LREC) for authors with an epicene given name. The goal is to study the proportion of men and women with respect to time and subcorpus.

Geographical origin: for a certain number of corpora, we have access to affiliations and we are able to compute and compare the distribution of the organizations, countries and continents.

Collaboration studies: a collaboration graph is built in order to determine the cliques and connected components and thus to understand how the set of authors is structured, i.e. who works with whom (who co-signs an article with whom). For each author, various scores are computed, like harmonic centrality, betweenness centrality and degree centrality. We determine whether an author collaborates a lot or not, and whether an author sometimes signs alone or always signs with other authors.
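As an illustration of the per-author scores, the following sketch builds a co-authorship graph from author lists and computes the degree and the harmonic centrality of an author with a breadth-first search. It is a simplified stand-in written with the Python standard library only; the helper names and the toy author lists are hypothetical, and the actual implementation used for the corpus is not described in this paper.

```python
from collections import deque
from itertools import combinations

def build_coauthor_graph(papers):
    """Undirected graph: an edge links two authors who co-signed at least one paper."""
    graph = {}
    for authors in papers:
        unique = sorted(set(authors))
        for a in unique:
            graph.setdefault(a, set())  # single authors appear as isolated nodes
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def harmonic_centrality(graph, source):
    """Sum of 1/d(source, v) over every author v reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sum(1.0 / d for d in dist.values() if d > 0)

# Toy author lists (invented): a chain A-B-C-D and an author E who always signs alone.
papers = [["A", "B"], ["B", "C"], ["C", "D"], ["E"]]
graph = build_coauthor_graph(papers)

degree = {author: len(neighbours) for author, neighbours in graph.items()}
always_alone = sorted(a for a, d in degree.items() if d == 0)
print(degree["B"], harmonic_centrality(graph, "B"), always_alone)
```

Unlike closeness centrality, the harmonic variant stays well defined on a disconnected graph, because unreachable authors simply contribute nothing to the sum.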
We compute a series of global graph scores like diameter, density, max degree, mean degree, average clustering coefficient and average path length in order to compare the structure of the communities around the different conferences and to understand whether and how the authors collaborate.

Citations: the reference sections of the documents are automatically indexed and the citation links are studied within the perimeter of the 32 corpora. The H-index is computed for each author and conference. The differences are large, from 5 for JEP and 11 for TALN (French conferences) to 71 for ACL, which highlights the citation problem tied to the language of publication. As for the collaboration study, the citation graphs are built both for papers and authors. We are then able to determine which are the most cited documents compared to the most citing ones. It is easy to compare the publication rate with the citation rate: for instance, Kishore Papineni did not publish a lot, but his document proposing the BLEU score (Bilingual Evaluation Understudy) is cited 1,225 times within our corpus. The most cited author is Hermann Ney with 3,927 citations and a self-citation rate of 16%.

¹³ https://github.com/knmnyn/ParsCit

Terminological extraction: the aim is to extract the domain terms from the abstracts and bodies of the texts. Our approach is called “contrastive strategy” and contrasts a specialized corpus with a non-specialized corpus, along the same lines as TermoStat (Drouin 2004). Two large non-specialized corpora, one for English and one for French, were parsed with TagParser, the results were filtered with syntactic patterns (like N of N), and finally two statistical matrices were recorded. Our NLP4NLP texts were then parsed and contrasted with these matrices according to the same syntactic patterns.
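The contrastive principle can be sketched as follows. The paper does not state which statistic is stored in the matrices, so the example assumes a smoothed log-ratio of relative frequencies, which is only one of several possible contrast measures and is not claimed to be the one used by TagParser or TermoStat. The counts are invented and stand for candidate terms that have already passed the syntactic pattern filter.

```python
import math
from collections import Counter

def contrastive_scores(specialized_counts, general_counts, smoothing=1.0):
    """Log-ratio of smoothed relative frequencies: specialized vs. general corpus.

    Assumption: add-one style smoothing over the joint vocabulary; a positive
    score means the candidate is over-represented in the specialized corpus.
    """
    spec_total = sum(specialized_counts.values())
    gen_total = sum(general_counts.values())
    vocab_size = len(set(specialized_counts) | set(general_counts))
    scores = {}
    for term, count in specialized_counts.items():
        p_spec = (count + smoothing) / (spec_total + smoothing * vocab_size)
        p_gen = (general_counts.get(term, 0) + smoothing) / (gen_total + smoothing * vocab_size)
        scores[term] = math.log(p_spec / p_gen)
    return scores

# Invented candidate counts for a specialized and a non-specialized corpus.
specialized = Counter({"synset": 8, "annotation": 10, "time": 2})
general = Counter({"time": 15, "annotation": 1, "house": 4})

scores = contrastive_scores(specialized, general)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A candidate such as "time", which is frequent in general language, receives a negative score and would be rejected as a domain term under this assumption.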
Afterwards, we proceeded in two steps: first, we extracted the terms and studied the most frequent ones in order to manually merge a small number of synonyms which were not in the parser's dictionary; then, we reran the system. Most of the extracted terms are single-word terms (95% for LREC). In general, they are common nouns, as opposed to rare proper names or adjectives.

Bibliographic searches: we transform the result of the parser (which natively produces a PASSAGE format¹⁴, based on ISO-MAF (ISO-24611) with additional annotations for named entities) into RDF in order to inject these triples into the persistent storage Apache Jena and thus to allow the evaluation of SPARQL queries. It should be noted that instead of indexing and querying raw data, we index the content after preprocessing. The reason is threefold: 1) we avoid low-level noise like caesura problems, which are fixed by the preprocessing step, 2) the query may contain morphosyntactic filters like lemmatized forms or part-of-speech marks, 3) the query may contain semantic filters based on the semantic categories of named entities, like company, city or system names. Of course, all these filters may be freely combined with metadata.

Term evolution: with respect to the time line, the objective is to determine the terms which are popular. For LREC, it is “annotation”. We determine the terms which were not popular and which became popular, like “synset”, “XML” and “Wikipedia”. Some terms were popular and are not popular anymore, like “encoding” or “SGML”. We also study a group of manually selected terms and compute, for instance, the usage of “trigram” compared to “ngram”. Let us add that there are some fluctuating terms (depending on a specific time period) like “Neural Network”, “Tagset” or “FrameNet”.
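A minimal sketch of such a trend classification is given below. It assumes that each document is reduced to the set of terms it contains and that a term is "rising" or "falling" when its average yearly share of documents in the second half of the period differs from that of the first half by more than a factor `ratio`. Both this criterion and the toy data are illustrative assumptions, not the method or the figures of the study.

```python
def yearly_share(docs_by_year, term):
    """For each year, the share of documents whose term set contains `term`."""
    return {
        year: sum(1 for terms in docs if term in terms) / len(docs)
        for year, docs in sorted(docs_by_year.items())
        if docs
    }

def trend(shares, ratio=2.0):
    """Compare the mean share of the first and second halves of the period."""
    years = sorted(shares)
    if len(years) < 2:
        raise ValueError("at least two years are needed to estimate a trend")
    half = len(years) // 2
    early = sum(shares[y] for y in years[:half]) / half
    late = sum(shares[y] for y in years[half:]) / (len(years) - half)
    if late > ratio * early:
        return "rising"
    if early > ratio * late:
        return "falling"
    return "stable"

# Invented toy data: two documents per year, each reduced to its set of terms.
docs_by_year = {
    1998: [{"sgml", "encoding"}, {"sgml", "annotation"}],
    2002: [{"sgml", "annotation"}, {"xml", "annotation"}],
    2010: [{"xml", "annotation"}, {"wikipedia", "annotation"}],
    2014: [{"xml", "wikipedia", "annotation"}, {"annotation"}],
}

labels = {t: trend(yearly_share(docs_by_year, t)) for t in ("sgml", "wikipedia", "annotation")}
print(labels)
```

Fluctuating terms would need a finer criterion than a two-half comparison, for instance one computed over sliding periods.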
Weak signals: the aim is to study the terms which have too few occurrences to be taken into consideration statistically but which are considered as “friends” of terms whose evolution is statistically interpretable. The notion of friend is defined by the joint presence of the terms within the same abstract. Thus, we find that “synset” has friends like “disambiguation” or “Princeton”.

Innovative feature: based on the most popular terms during the last years, the aim is to compute the author, the document and the conference mentioning each term for the first time. Thus, for instance, “SVM” appears in the LREC corpus for the first time in a document by Alex Waibel published in 2000. It is then possible to detect the conferences producing the most innovative documents.

Hybrid individual scoring: the aim is to compute a hybrid score combining collaboration, innovation, production and impact. The collaboration score is the harmonic centrality. The innovation score is computed from a time-based formula applied to term creation, combined with the success rate of the term over the years. The production is simply the number of signed documents. The impact is the number of citations. We then compute the arithmetic mean of these four scores. The objective is not to publish an individual hit parade but to form a short list of authors who seem to be important within a given conference.

Classification: from the extracted terms, it is possible to compute the most salient terms of a document from TF-IDF¹⁵ and to compute a classification in order to gather similar documents within the same cluster. We use a UPGMA algorithm on a specific corpus. This tool is very helpful, because when we pick an interesting document, the program suggests a cluster of documents which are semantically similar, in the same vein as Amazon proposing a similar object or YouTube proposing the next video.
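The selection of salient terms can be sketched as follows, assuming that each document is given as a list of already extracted terms and using the common formulation TF-IDF = (term frequency in the document) × log(N / document frequency). The exact weighting variant used in the study is not specified, so this formulation, the function name and the toy corpus are assumptions; the clustering step (UPGMA) is not shown.

```python
import math
from collections import Counter

def salient_terms(corpus, index, top_n=5):
    """Return the top_n terms of corpus[index] ranked by TF-IDF.

    The document is taken from the corpus itself, so every one of its terms
    has a document frequency of at least 1.
    """
    n_docs = len(corpus)
    doc_freq = Counter()
    for terms in corpus:
        doc_freq.update(set(terms))
    doc = corpus[index]
    term_freq = Counter(doc)
    scores = {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Sort by decreasing score, then alphabetically to make ties deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]

# Invented toy corpus: each document is a list of extracted terms.
corpus = [
    ["parser", "parser", "treebank", "corpus"],
    ["speech", "recognition", "corpus"],
    ["speech", "synthesis", "corpus", "prosody"],
]

print(salient_terms(corpus, 0, top_n=2))
```

A term such as "corpus", present in every document, gets a zero weight and is ranked last, which is the behaviour one wants before comparing documents for clustering.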
Plagiarism and reuse studies: we define “plagiarism” as the recall of a text written by a group of authors X by an author Y who does not belong to group X. We define “reuse” by an author as the recall of a text by the same author in a later publication, regardless of the co-authors. In a first implementation, we compared raw character strings, but the system was rather silent, i.e. it missed many overlaps. Now we make a full linguistic parsing to compare lemmas, and we filter out secondary punctuation marks. The objective is to compare at a higher level than case marking, hyphen variation, plural, orthographic variation (“normalise” vs “normalize”), synonymy and abbreviation. A large set of sliding windows of 7 lemmas is compared and a similarity score is computed. We flag plagiarism or reuse when the score exceeds a given threshold (3% for plagiarism and 10% for reuse). Concerning the plagiarism results, we did not notice any real plagiarism (one author reusing verbatim the material of another author without any citation). However, we sometimes observe groups of authors (with an empty intersection) who apparently copy-paste large fragments of text while engaged in common collaborations. Reuse, in contrast, is more frequent.

¹⁴ http://atoll.inria.fr/passage
¹⁵ We define the salient terms as the five terms with the highest TF-IDF.

8 Conclusion and perspectives

Up to now the NLP4NLP corpus has only been used by our team, but we plan to make the part of the corpus which has no copyright restrictions publicly available shortly. The early feedback from our legal department seems to indicate that there should be no problem for the majority of the texts (80%), because the texts and the metadata are already publicly available. In contrast, a certain number of metadata and textual contents belong to Springer or to associations like IEEE; these, of course, we will not be able to distribute.
We plan to use RDF as the distribution format, so as to respect the W3C recommendations concerning Linked Open Data and to be compatible with the current regional project called “Centre for Data Science”¹⁶ (CDS). The preliminary results that the NLP4NLP corpus enabled us to extract from the indicators computed with core NLP technology (up to the level of full automatic parsing) provided quantitative assessments of facts that we knew from our knowledge of the field, e.g. the rise and fall of some terms through time. The most important lesson to draw from this first experience is that the point of view and knowledge of experts from the community under study are essential to provide information that cannot be recovered automatically (e.g. gender or given names of authors) and to ensure that the statistics produced do not contain too large discrepancies with respect to the actual state of the domain.

9 References

Boudin, F 2013. TALN archives: une archive numérique francophone des articles de recherche en traitement automatique de la langue. TALN-RÉCITAL 2013, 17-21 June 2013, Les Sables d’Olonne, France.

Drouin, P 2004. Detection of Domain Specific Terminology Using Corpora Comparison, in Proceedings of LREC 2004, 26-28 May 2004, Lisbon, Portugal.

Mariani, J, Paroubek, P, Francopoulo, G and Delaborde, M 2013. Rediscovering 25 Years of Discoveries in Spoken Language Processing: a Preliminary ISCA Archive Analysis, in Proceedings of Interspeech 2013, 26-29 August 2013, Lyon, France.

Mariani, J, Paroubek, P, Francopoulo, G and Hamon, O 2014. Rediscovering 15 years of Discoveries in Language Resources and Evaluation: The LREC Anthology Analysis, in Proceedings of LREC 2014, 26-31 May 2014, Reykjavik, Iceland.

Acknowledgements. This work was partially supported by the project REQUEST in “Programme d'Investissement d'Avenir, appel Cloud computing & Big Data”, convention 018062-25005.
¹⁶ www.campus-paris-saclay.fr/site/Idex-Paris-Saclay/Les-Lidex/Center-for-Data-Science-Paris-Saclay