                             Extraction of Career Profiles from Wikipedia
                                      Firas Dib, Simon Lindberg, Pierre Nugues
                                            Lund University
                       LTH, Department of Computer Science, S-221 00 Lund, Sweden
          ada10fdi@student.lu.se, ada10sli@student.lu.se, Pierre.Nugues@cs.lth.se

                                                               Abstract
In this paper, we describe a system that gathers the work experience of a person from her or his Wikipedia page. We first extract an
ontology of profession names from the Wikidata graph. We then parse the Wikipedia pages with a dependency parser and connect
persons to professions through an analysis of the parts of speech and dependency relations we extract from the text. Setting aside the
dates, we computed recall and precision scores on a very limited and preliminary test set, for which we reached a recall of 74% and a
precision of 95%, showing that our approach is promising.

Keywords: Knowledge extraction, Wikidata ontology, Dependency parsing


1. Introduction

Biographies form a category of their own in literature as they typically mix free-form text – a life narrative – with a set of well-defined numerical and nominal properties, such as dates of birth and death, country of origin, titles and decorations, etc., that naturally belong in databases. Textual biographies can therefore be associated with structured databases that describe such properties in the form of tables or graphs. Texts and databases are both useful and complementary for humanities research. While text often contains more details on people's lives, databases enable researchers to formulate questions like:

    Are there welders who became prime ministers?

and immediately get answers.
Although there are now scores of digital biographies, Wikipedia has become the major reference of the internet. It is free and easy to download; it covers more people than other online resources; it is open to popular culture; and it is multilingual. This makes it unique, even if Wikipedia often reuses text and data from older printed biographies and contains mistakes. In addition to its scope and size, many open computer tools have been designed for Wikipedia that make the development of new programs dedicated to this resource faster.
We created a system that takes a corpus of Wikipedia pages describing people as input and outputs a career profile for each individual. To carry this out, we used available tools to parse and extract information from the text and then analyzed the data.
A practical use of our system could be to expand Wikidata, the data repository companion to Wikipedia. As of today, Wikidata often associates people only with the most notable occupation of their life. The system we describe makes it possible to build more comprehensive semantic knowledge bases of career timelines as it extracts all the occupations, possibly secondary, mentioned in the text.
In our experiments, we used the Swedish version of Wikipedia and we are strictly dependent on the Wikipedia format. Nonetheless, we only used the text itself, so this source could easily be replaced with another one in another language and another format. In addition, beyond Wikidata, the techniques we have developed could be applied to expand any database.

2. Previous Work

The analysis of career profiles from biographies is a specific case of information extraction that produces tabular data from raw text. Information extraction has a long history in natural language processing, starting from the message understanding conferences (MUC) (Grishman and Sundheim, 1996), and has been carried out with a variety of techniques over time: rule-based, statistical, or hybrid, with a current focus on machine learning (Mausam et al., 2012). See Hobbs et al. (1997) for the description of an early and oft-cited system and Roche and Schabes (1997) for a review.
There are a few papers describing the extraction of timelines from Wikipedia. Timely YAGO (Wang et al., 2010) is an example that is limited to the analysis of infoboxes – summaries of facts in the form of tabular data inside the articles – and lists in articles. Exner and Nugues (2011) is another example that uses semantic role labeling and the LODE model (Shaw, 2010) to extract events. Wu and Weld (2010) is a third example that combines Wikipedia infoboxes and document text to collect data to train relation classifiers.
Contrary to these works, the system we describe is dedicated to the extraction of careers through the analysis of the dependency graphs of the sentences. To collect the vocabulary associated with occupations, the system creates a career ontology that it automatically retrieves from the Wikidata repository. In addition to being automatic, this process can easily be extended to create multilingual vocabularies.

3. Term Extraction

3.1. Wikidata: A Semantic Repository

We used Wikidata as the main source of structured knowledge on human beings and their occupations. Wikidata is a free data repository from the Wikimedia Foundation. Wikidata started as a means to identify named entities across all their Wikipedia language versions with a unique number.

Figure 1: Links from Wikidata to articles on Göran Persson and Jacques Delors in ten different languages

Figure 2: The first two positions of Göran Persson out of five, where Minister for Finance has a start date, 7 October 1994, and an end date, 22 March 1996
Göran Persson, for instance, a former Prime Minister of Sweden, has the identifier Q53747 in Wikidata, which links this entity to the 44 different language versions of his biography in Wikipedia, while Jacques Delors, a former president of the European Commission, has the identifier Q153425, which provides links to the 35 language versions of Delors' biography. Figure 1 shows the first 10 links for these two persons with their language codes, for instance en for English, de for German, or el for Greek, and the transcription of their names in the corresponding script, as in Greek: Ζακ Ντελόρ for Jacques Delors.
The entities identified by Q-numbers are linked to concepts or other entities by a set of properties, Px, where x is a number, that describe the entity. Property P31, corresponding to instance of, applies to Göran Persson with the value human; P569, date of birth, with the value 20 January 1949; P26, spouse, Anitra Steen; P106, occupation, politician; etc.:

P31(Q53747) = human
P569(Q53747) = 20 January 1949
P26(Q53747) = Anitra Steen
P106(Q53747) = politician

The values human, Anitra Steen, and politician themselves have unique Q-numbers: Q5, Q444325, and Q82955, respectively.
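As an illustration, such declarations can be read programmatically from Wikidata's public entity-data endpoint. The minimal sketch below, whose function name is our own, retrieves the item values of a property for an entity:

import json
from urllib.request import urlopen

def claim_values(qid, pid):
    # Fetch the entity document, e.g. Q53747 for Göran Persson, from
    # Wikidata's public JSON endpoint.
    url = "https://www.wikidata.org/wiki/Special:EntityData/%s.json" % qid
    with urlopen(url) as response:
        entity = json.load(response)["entities"][qid]
    # Collect the item identifiers stored under property pid.
    values = []
    for claim in entity["claims"].get(pid, []):
        datavalue = claim["mainsnak"].get("datavalue")
        if datavalue and datavalue["type"] == "wikibase-entityid":
            values.append(datavalue["value"]["id"])
    return values

print(claim_values("Q53747", "P106"))   # e.g. ['Q82955'], politician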
The P39 property, position held, tracks the career of a person and consists of multiple values. Wikidata lists five positions held by Göran Persson: Leader of the Opposition, Minister for Finance, Skolminister (Minister for Schools), Prime Minister of Sweden, and Member of the Riksdag, possibly with time values or boundaries (Fig. 2).
Wikidata stores all this information as a graph in the RDF format. It is similar to earlier projects such as DBpedia (Auer et al., 2007), Yago (Suchanek et al., 2007), or Freebase (Bollacker et al., 2008). A key difference between these earlier works and Wikidata is that Wikidata is language-agnostic and an integral part of the Wikipedia structure.

3.2. Extracting Occupations

Properties such as P106, occupation, are organized as hierarchies of more specific properties. In the case of occupation, Figure 3 shows an excerpt of such a hierarchy, where Wikidata gathers all types of jobs, professions, and careers.

Figure 3: An excerpt of the Wikidata ontology starting from the occupation node

We processed the Wikidata graph and the concept hierarchies to create a baseline list of professions. We considered the Instance of (P31) and Subclass of (P279) properties, which we took as guiding relations to extract people's careers. We created a list of terms using all the descendants of the Occupation node, which we chose as the root node since Profession, Job, and Labour are all an Instance of Occupation. We created this list in a preprocessing stage, independently of and prior to the actual article parsing; a sketch of this traversal is shown below.
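The following is a minimal sketch of this preprocessing traversal, under the assumption that the P31 and P279 edges of a Wikidata dump have already been loaded into memory; the helper names are our own.

from collections import defaultdict, deque

def occupation_terms(parents, root):
    # parents maps each item to the items it is an Instance of (P31)
    # or Subclass of (P279); root is the Q-number of the Occupation
    # node. Invert the edges to walk the hierarchy downwards.
    children = defaultdict(set)
    for item, supers in parents.items():
        for sup in supers:
            children[sup].add(item)
    # Breadth-first traversal collecting all descendants of the root.
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node] - seen:
            seen.add(child)
            queue.append(child)
    return seen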
4. NLP Pipeline

Although Wikidata covers many biographical details, it is far from exhaustive and much of the information on career timelines still lies in the text. Stefan Löfven, another Prime Minister of Sweden, provides an example of this: his Wikipedia page in English states that

    Löfvén began his career in 1979 as a welder at Hägglunds in Örnsköldsvik.

while Wikidata only lists him as a politician¹. We assembled a pipeline of natural language processing components to analyze the text and extract such information.

¹ Both the Wikidata item and the Wikipedia page were retrieved on May 28, 2015.

We downloaded the Swedish version of Wikipedia and first processed the articles to remove the wiki markup. This markup code enriches the text of Wikipedia articles, for instance to create links or to identify section titles. We then applied a part-of-speech tagger and a dependency parser to the text. We split the Wikipedia archive into chunks, allowing for a multithreaded execution in order to speed up the process:

1. The first step of the pipeline was to parse and remove the Wikipedia markup. This markup is functionally similar to HTML or XML, but has a different format that requires a different parser. We used the Sweble tool (Dohrn and Riehle, 2013) to carry this out.

2. We then applied a tagger to the Swedish text and annotated the words with their parts of speech. We used Stagger (Östling, 2013), which also includes a named entity recognizer (NER). We used these named entities further down in the pipeline to extract the persons from the sentences.

3. Finally, we ran the MaltParser dependency parser (Nivre et al., 2006) on the POS-tagged sentences to obtain a syntactic representation of them.

Table 1 shows the pipeline, its components, and for each component its input and output.

Input               Tool              Output
Wikipedia article   Sweble            Plain text
Plain text          Stagger           POS tagged
POS tagged          MaltParser        Dependency parsed
Dep. parsed         Career profiler   Timeline

Table 1: NLP processing pipeline
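As an illustration of the chunked, multithreaded execution, the sketch below shows one possible orchestration; the pipeline argument stands for the composed Sweble, Stagger, and MaltParser stages, which are in reality separate tools with their own interfaces.

from concurrent.futures import ThreadPoolExecutor

def process_archive(chunks, pipeline, workers=8):
    # Each chunk is a list of articles; pipeline is a callable that
    # runs one article through markup removal, tagging, and parsing.
    def process_chunk(articles):
        return [pipeline(article) for article in articles]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Chunks run concurrently; results come back in input order.
        return [doc for chunk in pool.map(process_chunk, chunks)
                for doc in chunk]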

5. Career Parsing

The career parsing module analyzes the text, sentence by sentence, to find the persons, what they work at, and during what time frame, and it tries to connect these elements together through the dependency graph of the sentence.

5.1. Finding Persons

The first step of the career parser identifies the mentions of human beings in each sentence. We applied the following rules to decide if a word referred to a person (a sketch of these rules follows the list):

1. The word matches a regular expression based on the Wikipedia page title: the person the page is about;

2. The word is a singular pronoun in Swedish: han “he”, hon “she”, hans “his”, or hennes “her”;

3. The word is tagged as a person by Stagger’s named entity recognizer.

We stored all the persons we found as well as the sentences they occurred in.
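The following minimal sketch illustrates the three rules, simplified to a word-set match against the page title rather than a full regular expression; the token layout and names are our own assumptions.

PRONOUNS = {"han", "hon", "hans", "hennes"}

def person_mentions(tokens, page_title):
    # tokens is a list of (form, tag, ner) triples produced upstream.
    title_words = set(page_title.split())
    mentions = []
    for i, (form, tag, ner) in enumerate(tokens):
        if form in title_words:            # Rule 1: matches the page title
            mentions.append(i)
        elif form.lower() in PRONOUNS:     # Rule 2: singular pronoun
            mentions.append(i)
        elif ner == "person":              # Rule 3: Stagger's NER label
            mentions.append(i)
    return mentions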
5.2. Finding Jobs

The second step finds the job names mentioned in the sentences. We used the list of professions we collected from Wikidata in Sect. 3.2. to check the presence of corresponding words and extract them. However, this initial profession list is far from exhaustive and we applied additional rules to complete it. To decide if a given word in a sentence was a profession, we checked if it was one of the following (a sketch of these checks is shown after the list):

1. A job name in the list, without any modification;

2. The compounding of two stems, where the last one is a profession in the list. We split the word into a prefix and a suffix and applied a greedy search on the suffix, where both the prefix and the suffix had to be in a dictionary of Swedish words. The prefix check was done to eliminate false positives such as kretsar “circuits” that could be interpreted as tsar “Czar” preceded by a meaningless prefix kre;

3. The compounding of two stems separated by a linking morpheme (fogemorpheme). In Swedish, as in other Germanic languages, it is common to either add an s between the two stems or change the last vowel of the first stem. We used two simple morphology rules to extract them:

   (a) If the prefix ends with an s, we remove it and check if the resulting prefix is a valid word in the dictionary, as with utbildningsminister (utbildning + minister), “Minister for Education”;

   (b) If the prefix ends in a vowel, we replace it by another vowel and see if this prefix makes up a word, as with förskolelärare (förskola + lärare), “preschool teacher”.
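The listing below is a minimal sketch of these three checks, assuming professions (the Wikidata-derived job list) and lexicon (a dictionary of Swedish words) are available as sets of lowercase strings; the names are illustrative, not the system's.

VOWELS = "aeiouyåäö"

def is_profession(word, professions, lexicon):
    word = word.lower()
    # Rule 1: the word itself is a job name in the list.
    if word in professions:
        return True
    # Rules 2 and 3: a compound whose last stem is a profession;
    # greedy search, trying the longest suffix first.
    for i in range(1, len(word) - 1):
        prefix, suffix = word[:i], word[i:]
        if suffix not in professions:
            continue
        # Rule 2: plain compounding; the prefix must be a real word
        # to rule out cases such as kretsar = kre + tsar.
        if prefix in lexicon:
            return True
        # Rule 3a: linking s, as in utbildning + s + minister.
        if prefix.endswith("s") and prefix[:-1] in lexicon:
            return True
        # Rule 3b: changed final vowel, as in förskole + lärare,
        # where förskole- derives from förskola.
        if prefix[-1] in VOWELS and any(prefix[:-1] + v in lexicon
                                        for v in VOWELS if v != prefix[-1]):
            return True
    return False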
5.3. Finding Verbs

As noted by Tesnière (1966), verbs in European languages are the central elements that describe processes between actors and circumstances. We started from this observation and extracted the verbs hinting at a professional activity from the sentences. As vocabulary, we used the following set of Swedish verbs: vara “be”, bli “become”, arbeta “work”, jobba “work”, and praktisera “practice”.
We then considered these verbs as potential linking nodes to relate a person to a job in a sentence.

5.4. Finding a Path

The path-finding step links people to jobs. From the previous steps, the career parser has gathered, for each sentence, respective lists of persons, jobs, and verbs. We create a path between these words by traversing the dependency graph of a sentence until we find a common ancestor.
Figure 4 shows the dependency graph of the sentence:

    Hon var tidigare kommun- och regionminister 2001-2005.
    “Previously, she served as minister for municipalities and regions (2001-2005)”,

where we link a person mentioned by the feminine singular pronoun hon, highlighted in green in the figure, to a profession, regionminister, in turquoise, through the verb var, in purple, and where we extract the path:

    hon → var ← och ← regionminister

Figure 4: Dependency tree of the sentence Hon var tidigare kommun- och regionminister 2001-2005 and the path between a person and a job

Figure 5 shows another example with the sentence:

    Hans Göran Persson är en svensk politiker som var statsminister 1996-2006.
    “Hans Göran Persson is a Swedish politician who was Prime Minister between 1996 and 2006.”

where the career parser connects a person, Göran Persson, to two occupations, politiker “politician” and statsminister “Prime Minister”, through the verbs är “is” and var “was”.

Figure 5: Dependency tree and the path between a person and a job

To deal with the case where multiple persons are referenced in a sentence alongside a job, we introduced two additional constraints:

1. The path from the job to the person must include one of the professional activity verbs;

2. This path must be the shortest one. We search all the paths between all the persons and all the jobs and keep the shortest path for each respective profession; a sketch of this search is given after the list. Figure 6 shows an example of this in a sentence with two persons and one activity.

Figure 6: Two competing paths, between Firas and programmerare and between Simon and programmerare, where the selected one is the shortest
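The following is a minimal sketch of this search, assuming each parsed sentence is available as a list of (dependent, head) token-index pairs and that the indices of persons, jobs, and activity verbs have been collected in the earlier steps; it is simplified to return a single shortest path.

from collections import defaultdict, deque

def shortest_person_job_path(edges, persons, jobs, verbs):
    # Build an undirected view of the dependency tree so the search
    # can move both up and down the graph.
    graph = defaultdict(set)
    for dependent, head in edges:
        graph[dependent].add(head)
        graph[head].add(dependent)

    def bfs(start, goal):
        queue, seen = deque([[start]]), {start}
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                return path
            for nxt in graph[path[-1]] - seen:
                seen.add(nxt)
                queue.append(path + [nxt])
        return None

    candidates = []
    for person in persons:
        for job in jobs:
            path = bfs(person, job)
            # Constraint 1: the path must contain an activity verb.
            if path and any(token in verbs for token in path):
                candidates.append(path)
    # Constraint 2: keep the shortest path.
    return min(candidates, key=len) if candidates else None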
5.5. Finding Dates

Once we have linked an occupation to a person, we extract the dates from the sentence. We implemented a simple procedure, where we look at the words preceding and following the word representing the job.
We first try to match the adjacent words to a date expression. If these words correspond to dates, we use them to annotate the occupation with time stamps; if the adjacent words are prepositions or conjunctions, we skip them and repeat the matching attempt.
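A minimal sketch of this adjacency scan follows; the date pattern is simplified to years and year ranges, and we assume part-of-speech tags from the SUC tagset that Stagger outputs (PP for prepositions, KN for conjunctions).

import re

DATE = re.compile(r"^\d{4}(-\d{4})?$")   # e.g. 1996 or 1996-2006
SKIP_TAGS = {"PP", "KN"}                 # prepositions, conjunctions

def find_dates(tokens, job_index):
    # tokens is a list of (form, tag) pairs; scan outwards from the
    # word representing the job, first left, then right.
    dates = []
    for step in (-1, 1):
        i = job_index + step
        while 0 <= i < len(tokens):
            form, tag = tokens[i]
            if DATE.match(form):
                dates.append(form)
                break
            if tag not in SKIP_TAGS:     # stop at any other word
                break
            i += step                    # skip and repeat the attempt
    return dates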
6. Results

We processed the complete collection of Swedish Wikipedia articles referring to a person in Wikidata. We extracted a total of 267,786 jobs from 170,300 articles. Figure 7 shows the seven professions we obtained for Barack Obama:

• President,
• Ämbete “office”,
• Senator,
• Handledare “instructor, supervisor”,
• Konsult “consultant”,
• Sommaranställd “summer employee”, and
• Journalist,

while Figure 8 shows the three we obtained for Filippa Reinfeldt:

• Politiker “politician”,
• Talesperson “spokesperson”, and
• Sjöofficer “naval officer”.

The third profession of Filippa Reinfeldt is wrong and corresponds to that of her father.

Figure 7: Timeline extracted from the article on Barack Obama

Figure 8: Timeline extracted from the article on Filippa Reinfeldt

We assessed the accuracy of the system using a small and preliminary test set of 10 random Wikipedia articles about people that were about one or two paragraphs long (Table 2). Since the articles were short, they were often to the point and did not contain any complicated language. This made recall easier to achieve than if we had tested against larger and more complex articles.

    Recall    Precision    F-score
    74.1%     95.2%        83.3%

Table 2: Recall and precision

Although more thorough testing would be necessary to validate the system, this preliminary evaluation shows the promising nature of our approach.



7. Further Work

The techniques we described in this paper could be improved in many ways. Here is a list of possible further work:

Negations. We did not consider negations, such as inte “not” or aldrig “never”, in the sentences. This is an aspect that could be improved;

Activities. We only extracted actual occupations and did not associate work-related activities or references to a workplace with a profession, meaning that neither the phrase writes articles nor works at the New York Times would relate the person to a journalist or writer occupation;

Wikidata limitations. The search performed to find all the occupations collects anything remotely related to the Occupation node. This results in overgeneration. A more robust analysis would filter out erroneous professions, for example, by controlling that they do not have a path to business or superheroes;

Naive name matching. The name-matching procedure we used is naive and a named entity linker would certainly improve the results;

Coreference. While looking for persons in a sentence, we also check for pronouns. We then assume that the pronouns refer to the person of interest. A coreference solver would make this step more accurate;

Swedish only. Our system only supports the Swedish language. It would, however, be relatively simple to adapt it to English as well.

8. Acknowledgements

This research was supported by Vetenskapsrådet and the Det digitaliserade samhället program.
9. References

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web, Proceedings of the 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, pages 722–735, Busan, Korea, November 11–15. Springer.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1247–1250.

Hannes Dohrn and Dirk Riehle. 2013. Design and implementation of wiki content transformations and refactorings. In Proceedings of the 9th International Symposium on Open Collaboration, WikiSym '13, pages 2:1–2:10.

Peter Exner and Pierre Nugues. 2011. Using semantic role labeling to extract events from Wikipedia. In Proceedings of the Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2011), workshop in conjunction with the 10th International Semantic Web Conference (ISWC 2011), Bonn, October 23–24.

Ralph Grishman and Beth Sundheim. 1996. Message understanding conference – 6: A brief history. In Proceedings of the 16th International Conference on Computational Linguistics (COLING), volume 1, pages 466–471, Copenhagen.

Jerry R. Hobbs, Douglas E. Appelt, John Bear, David Israel, Megumi Kameyama, Mark Stickel, and Mabry Tyson. 1997. FASTUS: A cascaded finite-state transducer for extracting information from natural-language text. In Emmanuel Roche and Yves Schabes, editors, Finite-State Language Processing, chapter 13, pages 383–406. MIT Press, Cambridge, Massachusetts.

Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open language learning for information extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 523–534.

Joakim Nivre, Johan Hall, and Jens Nilsson. 2006. Maltparser: A data-driven parser-generator for dependency parsing. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006).

Robert Östling. 2013. Stagger: An open-source part of speech tagger for Swedish. Northern European Journal of Language Technology, 3.

Emmanuel Roche and Yves Schabes, editors. 1997. Finite-State Language Processing. MIT Press, Cambridge, Massachusetts.

Ryan Benjamin Shaw. 2010. Events and Periods as Concepts for Organizing Historical Knowledge. Ph.D. thesis, University of California, Berkeley.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706, Banff. ACM.

Lucien Tesnière. 1966. Éléments de syntaxe structurale. Klincksieck, Paris, 2nd edition.

Yafang Wang, Mingjie Zhu, Lizhen Qu, Marc Spaniol, and Gerhard Weikum. 2010. Timely YAGO: Harvesting, querying, and visualizing temporal knowledge from Wikipedia. In Proceedings of the 13th International Conference on Extending Database Technology, EDBT '10, pages 697–700.

Fei Wu and Daniel S. Weld. 2010. Open information extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 118–127.