=Paper=
{{Paper
|id=Vol-1399/paper6
|storemode=property
|title=Extraction of Career Profiles from Wikipedia
|pdfUrl=https://ceur-ws.org/Vol-1399/paper6.pdf
|volume=Vol-1399
|dblpUrl=https://dblp.org/rec/conf/bd/DibLN15
}}
==Extraction of Career Profiles from Wikipedia==
Firas Dib, Simon Lindberg, Pierre Nugues

Lund University LTH, Department of Computer Science, S-221 00 Lund, Sweden

ada10fdi@student.lu.se, ada10sli@student.lu.se, Pierre.Nugues@cs.lth.se

===Abstract===
In this paper, we describe a system that gathers the work experience of a person from her or his Wikipedia page. We first extract an ontology of profession names from the Wikidata graph. We then parse the Wikipedia pages using a dependency parser, and we connect persons to professions through the analysis of the parts of speech and dependency relations we extract from the text. Setting aside the dates, we computed recall and precision scores on a very limited and preliminary test set, on which we reached a recall of 74% and a precision of 95%, showing that our approach is promising.

Keywords: Knowledge extraction, Wikidata ontology, Dependency parsing

===1. Introduction===
Biographies form a category of their own in literature, as they typically mix free-form text – a life narrative – with a set of well-defined numerical and nominal properties, such as dates of birth and death, country of origin, titles and decorations, etc., that naturally belong in databases. Textual biographies can therefore be associated with structured databases that describe such properties in the form of tables or graphs. Texts and databases are both useful and complementary for humanities research. While text often contains more details on people's lives, databases enable researchers to formulate questions like:

:Are there welders who became prime ministers?

and immediately obtain answers.

Although there are now scores of digital biographies, Wikipedia has become the major reference of the internet. It is free and easy to download; it covers more people than other online resources; it is open to popular culture; and it is multilingual. This makes it unique, even if Wikipedia often reuses text and data from older printed biographies and contains mistakes. In addition to its scope and size, many open computer tools have been designed for Wikipedia, which makes the development of new programs dedicated to this resource faster.

We created a system that takes a corpus of Wikipedia pages describing people as input and outputs a career profile for each individual. To carry this out, we used available tools to parse and extract information from the text, and we then analyzed the data.

A practical use of our system could be to expand Wikidata, the data repository companion to Wikipedia. As of today, Wikidata often associates people with only the most notable occupation of their life. The system we describe makes it possible to build more comprehensive semantic knowledge bases of career timelines, as it extracts all the occupations, possibly secondary, mentioned in the text.

In our experiments, we used the Swedish version of Wikipedia, and we are strictly dependent on the Wikipedia format. Nonetheless, we only used the text itself, so this source could easily be replaced with another one in another language and another format. In addition, beyond Wikidata, the techniques we have developed could be applied to expand any database.

===2. Previous Work===
The analysis of career profiles from biographies is a specific case of information extraction, which produces tabular data from raw text. Information extraction has a long history in natural language processing, starting with the message understanding conferences (MUC) (Grishman and Sundheim, 1996), and has been carried out with a variety of techniques over time: rule-based, statistical, or hybrid, with a current focus on machine learning (Mausam et al., 2012). See Hobbs et al. (1997) for the description of an early and oft-cited system and Roche and Schabes (1997) for a review.

There are a few papers describing the extraction of timelines from Wikipedia. Timely YAGO (Wang et al., 2010) is an example that is limited to the analysis of infoboxes – summaries of facts in the form of tabular data inside the articles – and of lists in articles. Exner and Nugues (2011) is another example that uses semantic role labeling and the LODE model (Shaw, 2010) to extract events. Wu and Weld (2010) is a third example that combines Wikipedia infoboxes and document text to collect data to train relation classifiers.

Contrary to these works, the system we describe is dedicated to the extraction of careers through the analysis of the dependency graphs of the sentences. To collect the vocabulary associated with occupations, the system creates a career ontology that it automatically retrieves from the Wikidata repository. In addition to being automatic, this process can easily be extended to create multilingual vocabularies.

===3. Term Extraction===

====3.1. Wikidata: A Semantic Repository====
We used Wikidata as our main source of structured knowledge on human beings and their occupations. Wikidata is a free data repository from the Wikimedia Foundation. Wikidata started as a means to identify named entities across all their Wikipedia language versions with a unique number.

Figure 1: Links from Wikidata to articles on Göran Persson and Jacques Delors in ten different languages.

Figure 2: The first two positions of Göran Persson out of five, where Minister for Finance has a start date, 7 October 1994, and an end date, 22 March 1996.
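Wikidata serves these entity records as JSON through its public web API (action=wbgetentities on www.wikidata.org/w/api.php). As a minimal sketch of how the cross-language links of a Q-number can be read out of such a record, the fragment below is hand-built to mimic the API's layout with three sample sitelinks; it is an illustration, not the output of a live query or the authors' actual code.

```python
import json

# Hand-built fragment mimicking the JSON layout returned by Wikidata's
# wbgetentities API for an entity such as Q53747 (Göran Persson).
# The three sitelinks below are a small illustrative sample.
record = json.loads("""
{
  "entities": {
    "Q53747": {
      "id": "Q53747",
      "sitelinks": {
        "enwiki": {"site": "enwiki", "title": "G\\u00f6ran Persson"},
        "dewiki": {"site": "dewiki", "title": "G\\u00f6ran Persson"},
        "svwiki": {"site": "svwiki", "title": "G\\u00f6ran Persson"}
      }
    }
  }
}
""")

def sitelinks(entity_record, qid):
    """Return {language wiki: article title} for one Q-number."""
    entity = entity_record["entities"][qid]
    return {site: link["title"] for site, link in entity["sitelinks"].items()}

links = sitelinks(record, "Q53747")
print(len(links), links["svwiki"])
```

A real call would pass ids=Q53747 and iterate over all 44 sitelinks rather than this three-entry sample.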
Göran Persson, for instance, a former Prime Minister of Sweden, has the identifier Q53747 in Wikidata, which links this entity to the 44 different language versions of his biography in Wikipedia, while Jacques Delors, a former president of the European Commission, has the identifier Q153425, which provides links to the 35 language versions of Delors' biography. Figure 1 shows the first ten links for these two persons with their language codes, for instance en for English, de for German, or el for Greek, and their name's transcription in the corresponding script, as in Greek: Ζακ Ντελόρ for Jacques Delors.

The entities identified by Q-numbers are linked to concepts or other entities by a set of properties, Px, where x is a number, that describe the entity. Property P31, corresponding to instance of, applies to Göran Persson with the value human; P569, date of birth, with the value 20 January 1949; P26, spouse, with Anitra Steen; P106, occupation, with politician; etc.:

:P31(Q53747) = human
:P569(Q53747) = 20 January 1949
:P26(Q53747) = Anitra Steen
:P106(Q53747) = politician

The values human, Anitra Steen, and politician have unique Q-numbers themselves, respectively Q5, Q444325, and Q82955.

The P39 property, position held, tracks the career of a person and consists of multiple values. Wikidata lists five positions held by Göran Persson: Leader of the Opposition, Minister for Finance, Skolminister (Minister for Schools), Prime Minister of Sweden, and Member of the Riksdag, possibly with time values or boundaries (Fig. 2).

Wikidata stores all this information as a graph in the RDF format. It is similar to earlier projects such as DBpedia (Auer et al., 2007), Yago (Suchanek et al., 2007), or Freebase (Bollacker et al., 2008). A key difference between these earlier works and Wikidata is that Wikidata is language-agnostic and an integral part of the Wikipedia structure.

====3.2. Extracting Occupations====
Properties such as P106, occupation, are organized as hierarchies of more specific properties. In the case of occupation, Figure 3 shows an excerpt of such a hierarchy, where Wikidata gathers all types of jobs, professions, and careers.

Figure 3: An excerpt of the Wikidata ontology starting from the occupation node.

We processed the Wikidata graph and the concept hierarchies to create a baseline list of professions. We considered the instance of (P31) and subclass of (P279) properties, which we took as guiding relations to extract people's careers. We created a list of terms using all the descendants of the Occupation node, which we chose as the root node since Profession, Job, and Labour are all an instance of Occupation. We created this list in a preprocessing stage, independently of and prior to the actual article parsing.

===4. NLP Pipeline===
Although Wikidata covers lots of biographical details, it is far from exhaustive, and much of the information on career timelines still lies in the text. Stefan Löfven, another Prime Minister of Sweden, provides an example of this: his Wikipedia page in English states that

:Löfvén began his career in 1979 as a welder at Hägglunds in Örnsköldsvik.

while Wikidata only lists him as a politician (both the Wikidata item and the Wikipedia page were retrieved on May 28, 2015). We assembled a pipeline of natural language processing components to analyze the text and extract such information.

We downloaded the Swedish version of Wikipedia and we first processed the articles to remove the wiki markup. This markup code enriches the text of Wikipedia articles, for instance to create links or to identify section titles. We then applied a part-of-speech tagger and a dependency parser to the text.
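The vocabulary extraction of Sect. 3.2 amounts to a graph traversal: collect every node reachable from the Occupation root by following instance of (P31) / subclass of (P279) edges downwards. A minimal sketch over a toy edge list follows; the node names are illustrative stand-ins, not real Wikidata identifiers, and the real system reads these edges from the Wikidata dump.

```python
from collections import deque

# Toy subset of the hierarchy: each pair (child, parent) stands for a claim
# "child instance-of/subclass-of parent" (properties P31/P279).
# All node names are illustrative, not real Wikidata identifiers.
edges = [
    ("profession", "occupation"),
    ("job", "occupation"),
    ("politician", "profession"),
    ("minister", "politician"),
    ("welder", "profession"),
    ("company", "organization"),  # unrelated branch, must not be collected
]

def descendants(root, pairs):
    """Breadth-first collection of every term below root via P31/P279 links."""
    children = {}
    for child, parent in pairs:
        children.setdefault(parent, set()).add(child)
    found, queue = set(), deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, ()):
            if child not in found:
                found.add(child)
                queue.append(child)
    return found

professions = descendants("occupation", edges)
print(sorted(professions))
# ['job', 'minister', 'politician', 'profession', 'welder']
```

Running this once in a preprocessing stage yields the baseline profession list that the later parsing steps consult.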
We split the Wikipedia archive into chunks, allowing for a multithreaded execution in order to speed up the process:

# The first step of the pipeline was to parse and remove the Wikipedia markup. This markup is functionally similar to HTML or XML, but has a different format that requires a different parser. We used the Sweble tool (Dohrn and Riehle, 2013) to carry this out.
# We then applied a tagger to the Swedish text to annotate the words with their parts of speech. We used Stagger (Östling, 2013), which also includes a named entity recognizer (NER). We used these named entities further down in the pipeline to extract the persons from each sentence.
# Finally, we ran the Maltparser dependency parser (Nivre et al., 2006) on the POS-tagged sentences to obtain a syntactic representation of them.

Table 1 shows the pipeline, its components, and for each component its input and output.
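The chunked, parallel execution described above can be sketched as follows. Since Sweble, Stagger, and Maltparser are external tools, the three stage functions here are placeholder stand-ins that only mimic each stage's input and output; the chunking and thread pool are the part being illustrated.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder stand-ins for the real stages: Sweble (markup removal),
# Stagger (POS tagging and NER), and Maltparser (dependency parsing).
def strip_markup(article):
    return article.replace("'''", "")

def pos_tag(text):
    return [(word, "TAG") for word in text.split()]

def parse(tagged):
    return {"tokens": tagged, "deps": []}

def process_chunk(articles):
    """Run the three pipeline stages on one chunk of articles."""
    return [parse(pos_tag(strip_markup(a))) for a in articles]

def run(articles, n_chunks=4):
    """Split the archive into chunks and process them in parallel threads."""
    chunks = [articles[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        results = []
        for parsed in pool.map(process_chunk, chunks):
            results.extend(parsed)
    return results

corpus = ["'''Göran Persson''' är en svensk politiker."] * 8
print(len(run(corpus)))  # prints 8
```

With CPU-bound stages, a process pool would be the natural substitute for the thread pool; the chunking logic is unchanged.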
{| class="wikitable"
! Input !! Tool !! Output
|-
| Wikipedia article || Sweble || Plain text
|-
| Plain text || Stagger || POS-tagged text
|-
| POS-tagged text || Maltparser || Dependency-parsed text
|-
| Dependency-parsed text || Career profiler || Timeline
|}

Table 1: NLP processing pipeline.

===5. Career Parsing===
The career parsing module analyzes the text, sentence by sentence, to find the persons, what they work at, and during what time frame, and it tries to connect these elements through the dependency graph of the sentence.

====5.1. Finding Persons====
The first step of the career parser identifies the mentions of human beings in each sentence. We applied the following rules to decide if a word referred to a person:

# The word matches a regular expression based on the Wikipedia page title: the person the page is about;
# The word is a singular pronoun in Swedish: han "he", hon "she", hans "his", or hennes "her";
# The word is tagged as a person by Stagger's named entity recognizer.

We stored all the persons we found, as well as the sentences they occurred in.

====5.2. Finding Jobs====
The second step finds the job names mentioned in the sentences. We used the list of professions we collected from Wikidata in Sect. 3.2 to check for the presence of corresponding words and extract them. However, this initial profession list is far from exhaustive, and we applied additional rules to complete it. To decide if a given word in a sentence was a profession, we checked if it was:

# A job name in the list, without any modification;
# The compounding of two stems, where the last one is a profession in the list. We split the word into a prefix and a suffix and applied a greedy search on the suffix, where both the prefix and the suffix had to be in a dictionary of Swedish words. The prefix check was done to eliminate false positives such as kretsar "circuits", which could otherwise be interpreted as tsar "Czar" preceded by a meaningless prefix kre;
# The compounding of two stems separated by a linking morpheme (fogemorpheme). In Swedish, as in other Germanic languages, it is common to either add an s between the two stems or to change the last vowel of the first stem. We used two simple morphology rules to extract these compounds:
## If the prefix ends with an s, we remove it and check whether the resulting prefix is a valid word in the dictionary, as with utbildningsminister (utbildning + minister), "Minister for Education";
## If the prefix ends in a vowel, we replace it by another vowel and check whether the resulting prefix makes up a word, as with förskolelärare (förskola + lärare), "preschool teacher".

====5.3. Finding Verbs====
As noted by Tesnière (1966), verbs in European languages are the central elements describing processes between actors and circumstances. Starting from this observation, we extracted the verbs hinting at a professional activity from the sentences. As vocabulary, we used the following set of Swedish verbs: vara "be", bli "become", arbeta "work", jobba "work", and praktisera "practice". We then considered these verbs as potential linking nodes relating a person to a job in a sentence.

====5.4. Finding a Path====
The path-finding step links people to jobs. From the previous steps, the career parser has gathered, for each sentence, respective lists of persons, jobs, and verbs. We create a path between these words by traversing the dependency graph of the sentence until we find a common ancestor.

Figure 4 shows the dependency graph of the sentence:

:Hon var tidigare kommun- och regionminister 2001-2005.
:"Previously, she served as minister for municipalities and regions (2001-2005)",

where we link a person, mentioned by the feminine singular pronoun hon (highlighted in green in the figure), to a profession, regionminister (in turquoise), through the verb var (in purple), and where we extract the path:

:hon → var ← och ← regionminister

Figure 4: Dependency tree of the sentence Hon var tidigare kommun- och regionminister 2001-2005 and the path between a person and a job.

Figure 5 shows another example with the sentence:

:Hans Göran Persson är en svensk politiker som var statsminister 1996-2006.
:"Hans Göran Persson is a Swedish politician who was Prime Minister between 1996 and 2006,"

where the career parser connects a person, Göran Persson, to two occupations, politiker "politician" and statsminister "Prime Minister", through the verbs är "is" and var "was".

Figure 5: Dependency tree and the path between a person and a job.
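The path search above can be sketched with a head-array representation of the dependency tree: each token stores the index of its head, and the path between a person and a job runs through their lowest common ancestor. The tree below is a hand-built approximation of the parse of the first example sentence, chosen only to be consistent with the path hon → var ← och ← regionminister; the real system reads the tree from Maltparser's output.

```python
# Tokens and hand-built head indices for
# "Hon var tidigare kommun- och regionminister" (-1 marks the root).
tokens = ["hon", "var", "tidigare", "kommun-", "och", "regionminister"]
heads = [1, -1, 1, 4, 1, 4]

def ancestors(i, heads):
    """Chain of nodes from token i up to the root, i included."""
    chain = [i]
    while heads[i] != -1:
        i = heads[i]
        chain.append(i)
    return chain

def path(a, b, tokens, heads):
    """Tokens on the path from a up to the lowest common ancestor, down to b."""
    up_a, up_b = ancestors(a, heads), ancestors(b, heads)
    lca = next(n for n in up_a if n in up_b)
    left = up_a[:up_a.index(lca) + 1]     # a up to the LCA
    right = up_b[:up_b.index(lca)][::-1]  # LCA down to b
    return [tokens[n] for n in left + right]

print(path(0, 5, tokens, heads))
# ['hon', 'var', 'och', 'regionminister']
```

Checking that such a path contains one of the activity verbs, and comparing path lengths over all (person, job) pairs, then implements the two constraints used when several persons appear in a sentence.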
To deal with cases where multiple persons are referenced in a sentence alongside a job, we introduced two additional constraints:

# The path from job to person must include one of the professional activity verbs;
# This path must be the shortest one. We search all the paths between all the persons and all the jobs, and we keep the shortest path for each respective profession. Figure 6 shows an example of this in a sentence with two persons and one activity.

Figure 6: Two competing paths, between Firas and programmerare and between Simon and programmerare, where the selected one is the shortest.

====5.5. Finding Dates====
Once we have linked an occupation to a person, we extract the dates from the sentence. We implemented a simple procedure, where we look at the words preceding and following the word representing the job. We first try to match the adjacent words to a date expression. If these words correspond to dates, we use them to annotate the occupation with time stamps; if the adjacent words are prepositions or conjunctions, we skip them and repeat the matching attempt.

===6. Results===
We processed the complete collection of Swedish Wikipedia articles referring to a person in Wikidata. We extracted a total of 267,786 jobs from 170,300 articles. Figure 7 shows the seven professions we obtained for Barack Obama:

* President,
* Ämbete "officer",
* Senator,
* Handledare "instructor, supervisor",
* Konsult "consultant",
* Sommaranställd "summer employee", and
* Journalist,

while Figure 8 shows the three we obtained for Filippa Reinfeldt:

* Politiker "politician",
* Talesperson "spokesperson", and
* Sjöofficer "naval officer".

The third profession of Filippa Reinfeldt is wrong and corresponds to that of her father.

We assessed the accuracy of the system using a small and preliminary test set of 10 random Wikipedia articles about people, each about one or two paragraphs long (Table 2). Since the articles were short, they were often to the point and did not contain any complicated language. This made the recall higher than if we had tested against larger and more complex articles. Although more thorough testing would be necessary to validate the system, it shows the promising nature of our approach.

{| class="wikitable"
! Recall !! Precision !! F-score
|-
| 74.1% || 95.2% || 83.3%
|}

Table 2: Recall and precision.

===7. Further Work===
The techniques we described in this paper could be improved in many ways. Here is a list of possible further work:

* Negations. We did not consider negations, such as inte "not" or aldrig "never", in the sentences. This is an aspect that could be improved;
* Activities. We only extracted actual occupations, and we did not associate work-related activities or references to a workplace with a profession, meaning that neither the phrase "writes articles" nor "works at the New York Times" would relate the person to the journalist or writer occupation;
* Wikidata limitations. The search performed to find all the occupations collects anything remotely related to the Occupation node. This results in overgeneration. A more robust analysis would filter out erroneous professions, for example by checking that they do not have a path to business or superheroes;
* Naive name matching. The procedure we used is naive, and a named entity linker would certainly improve the results;
* Coreference. While looking for persons in a sentence, we also check for pronouns, and we then assume that the pronouns refer to the person of interest. A coreference solver would make this step more accurate;
* Swedish only. Our system only supports the Swedish language. It would, however, be relatively simple to adapt it to English as well.

===8. Acknowledgements===
This research was supported by Vetenskapsrådet and the Det digitaliserade samhället program.

===9. References===
* Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, Proceedings of the 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, pages 722–735, Busan, Korea, November 11–15. Springer.
* Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1247–1250.
* Hannes Dohrn and Dirk Riehle. 2013. Design and implementation of wiki content transformations and refactorings. In Proceedings of the 9th International Symposium on Open Collaboration, WikiSym '13, pages 2:1–2:10.
* Peter Exner and Pierre Nugues. 2011. Using semantic role labeling to extract events from Wikipedia. In Proceedings of the Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011), workshop in conjunction with the 10th International Semantic Web Conference (ISWC 2011), Bonn, October 23–24.
* Ralph Grishman and Beth Sundheim. 1996. Message understanding conference – 6: A brief history. In Proceedings of the 16th International Conference on Computational Linguistics (COLING), volume 1, pages 466–471, Copenhagen.
* Jerry R. Hobbs, Douglas E. Appelt, John Bear, David Israel, Megumi Kameyama, Mark Stickel, and Mabry Tyson. 1997. FASTUS: A cascaded finite-state transducer for extracting information from natural-language text. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing, chapter 13, pages 383–406. MIT Press, Cambridge, Massachusetts.
* Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534.
* Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. Maltparser: A data-driven parser-generator for dependency parsing. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006).
* Robert Östling. 2013. Stagger: An open-source part of speech tagger for Swedish. Northern European Journal of Language Technology, 3.
* Emmanuel Roche and Yves Schabes, editors. 1997. Finite-State Language Processing. MIT Press, Cambridge, Massachusetts.
* Ryan Benjamin Shaw. 2010. Events and Periods as Concepts for Organizing Historical Knowledge. Ph.D. thesis, University of California, Berkeley.
* Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706, Banff. ACM.
* Lucien Tesnière. 1966. Éléments de syntaxe structurale. Klincksieck, Paris, 2nd edition.
* Yafang Wang, Mingjie Zhu, Lizhen Qu, Marc Spaniol, and Gerhard Weikum. 2010. Timely YAGO: Harvesting, querying, and visualizing temporal knowledge from Wikipedia. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT '10, pages 697–700.
* Fei Wu and Daniel S. Weld. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 118–127.

Figure 7: Timeline extracted from the article on Barack Obama.

Figure 8: Timeline extracted from the article on Filippa Reinfeldt.