<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Job Recommendation based on Job Seeker Skills: An Empirical Study</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jorge Valverde-Rebaza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ricardo Puma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Bustios</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nathalia C. Silva</string-name>
          <email>ncsilvag@visibilia.net.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Scientific Research</institution>
          ,
          <addr-line>Visibilia, CEP 13560-647, São Carlos, SP</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>In: A. Jorge, R. Campos, A. Jatowt, S. Nunes (eds.): Proceedings of the Text2StoryIR'18 Workshop</institution>
          ,
          <addr-line>Grenoble, France, 26-March-2018, published at</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years, job recommender systems have become popular because they reduce information overload by generating personalized job suggestions. Although the literature describes a variety of techniques and strategies for job recommender systems, most of them fail to recommend job vacancies that properly fit job seekers' profiles. The contributions of this work are threefold: i) we make publicly available a new dataset formed by a set of job seeker profiles and a set of job vacancies collected from different job search engine sites; ii) we propose a framework for job recommendation based on the professional skills of job seekers; and iii) we carry out an evaluation to quantify empirically the recommendation abilities of two state-of-the-art methods, considering different configurations, within the proposed framework. We thus present a general panorama of the job recommendation task, aiming to facilitate research and real-world application design regarding this important issue.</p>
      </abstract>
      <kwd-group>
        <kwd>Job matching</kwd>
        <kwd>job seeking</kwd>
        <kwd>job search</kwd>
        <kwd>job recommender systems</kwd>
        <kwd>person-job fit</kwd>
        <kwd>LinkedIn</kwd>
        <kwd>word embedding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Nowadays, job search is commonly done on the Internet using job search engine sites such as LinkedIn, Indeed, and others. A job seeker typically has two ways to search for a job on these sites: 1) running a query based on keywords related to the job vacancy that he/she is looking for, or 2) creating and/or updating a professional profile containing data on his/her education, professional experience, professional skills and other information, and receiving personalized job recommendations based on this data. Sites supporting the former case are more popular and have a simpler structure; however, their recommendations are less accurate than those of sites using profile data.</p>
      <p>Personalized job recommendation sites have implemented a variety of recommender system types, such as content-based filtering, collaborative filtering, knowledge-based and hybrid approaches [AlO12]. Moreover, most of these job recommender systems base their suggestions on the full profile of job seekers as well as on other data sources such as social networking activities, web search history, etc. Although many data sources can be useful to improve job recommendation, previous studies showed that the best person-job fit is possible when the personal skills of a job seeker match the requirements of a job offer [Den15].</p>
      <p>Copyright © 2018 for the individual papers by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.</p>
      <p>Based on the person-job fit premise, we propose a framework for job recommendation based on the professional skills of job seekers. We automatically extract the skills from the job seeker profiles using a variety of text processing techniques. We then perform job recommendation using TF-IDF and four different configurations of Word2vec over a dataset of job seeker profiles and job vacancies that we collected. Our experimental results show the performance of the evaluated methods and configurations and can be used as a guide for choosing the most suitable method and configuration for job recommendation.</p>
      <p>The remainder of this paper is organized as follows. In Section 2, we briefly describe the natural language processing methods used in our experimental setup. In Section 3, we present our proposal, including a new dataset collected by us and the framework for job recommendation. In Section 4, we show our experimental results. Finally, in Section 5, we offer conclusions and directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>In this section, we briefly describe two methods used in our experiments: Term Frequency-Inverse Document Frequency (TF-IDF) and Word2vec. For Word2vec, we also present the two models commonly used with it: Continuous Bag-of-Words (CBOW) and Skip-gram.</p>
      <sec id="sec-2-1">
        <title>Term Frequency-Inverse Document Frequency (TF-IDF)</title>
        <p>TF-IDF is a statistical measure that assigns each word a weight reflecting its relevance to a document within a corpus [Sal88]. This relevance is proportional to the number of times the word appears in the document and inversely proportional to the frequency of the word in the corpus.</p>
        <p>This method has been successful in topic identification over large text datasets, but its performance decreases when applied to small ones, such as those commonly found in job descriptions. Nevertheless, TF-IDF has been applied to recommendation with interesting results [Dia13] [Dia14].</p>
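        <p>The weighting described above can be sketched in a few lines. This is a minimal illustration, assuming relative term frequency and a logarithmic inverse document frequency; the function name tf_idf and the toy corpus are hypothetical, and the exact TF-IDF variant used in the experiments is not specified here.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document in `corpus`."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            # Relative term frequency times log-scaled inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy corpus (illustrative only, not taken from the paper's dataset).
corpus = [
    ["python", "developer", "python", "sql"],
    ["java", "developer", "sql"],
    ["project", "manager"],
]
weights = tf_idf(corpus)
```
]]></preformat>
        <p>In this toy corpus, "python" receives a higher weight than "developer" in the first document because it occurs twice there and in no other document.</p>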
      </sec>
      <sec id="sec-2-2">
        <title>Word2vec</title>
        <p>Word2vec is a general predictive model for learning vector representations of words [Mik13b]. These vector representations, also called word embeddings, capture distributional semantics and co-occurrence statistics [Mik13a]. There are two Word2vec models we can use to obtain word embeddings: CBOW and Skip-gram.</p>
        <p>Continuous Bag-of-Words model (CBOW). This model predicts a target word based on the n words before and the n words after the target word [Mik13b]. For example, in the following sentence:</p>
        <p>Lorem ipsum dolor sit amet</p>
        <p>CBOW will predict the word dolor taking as inputs the n = 2 words before and after it, i.e. Lorem, ipsum, sit and amet. These words are called the context of the target word, and their quantity is a parameter of the model.</p>
        <p>Skip-gram. Rather than predicting a word based on its context, Skip-gram aims to predict the context based on one word [Mik13a]. For instance, in our previous example, Skip-gram will try to predict the words Lorem, ipsum, sit and amet having only the word dolor as input.</p>
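        <p>The difference between the two models is easiest to see in the training examples each one derives from a sentence. The sketch below only builds those (input, target) examples for the Lorem ipsum sentence; it does not train a network, and the helper names are hypothetical.</p>
        <preformat><![CDATA[
```python
def context_window(tokens, index, n):
    """Return up to n tokens before and n tokens after position `index`."""
    return tokens[max(0, index - n):index] + tokens[index + 1:index + 1 + n]


def cbow_examples(tokens, n):
    """CBOW: the whole context is the input, the center word is the target."""
    return [(context_window(tokens, i, n), tokens[i]) for i in range(len(tokens))]


def skipgram_examples(tokens, n):
    """Skip-gram: the center word is the input, each context word is a target."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, n)]


sentence = "Lorem ipsum dolor sit amet".split()
cbow = cbow_examples(sentence, n=2)
skipgram = skipgram_examples(sentence, n=2)
```
]]></preformat>
        <p>For the word dolor, CBOW yields a single example whose input is the four context words, whereas Skip-gram yields four examples that each pair dolor with one context word.</p>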
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Proposal</title>
      <p>In this section, we describe our framework for job recommendation. We narrow the scope to the recommendation of job vacancies for Information Technology (IT) professionals working in the Brazilian market. The proposed framework, depicted in Fig. 1, is composed of three stages: data collection, data preparation and recommendation.</p>
      <p>We automatically collected a set of job vacancies/offers from the Brazilian recruitment site Catho<sup>3</sup> and a set of Brazilian professional profiles from the well-known LinkedIn. We make these datasets available in a public repository<sup>4</sup> with personal data anonymised. It is important to note that we collected more data from similar sites but, due to validation issues, we only managed to work with these two sources in our framework.</p>
      <p>To scrape job offers, we created a list of keywords from the IT industry and used them as search terms. For each keyword, we searched all the related job offers using Catho's search engine and saved the retrieved results in our database; thus, the quality of the content depends heavily on the quality of Catho's search engine. Additionally, the scraper is set up to avoid duplicate job offers, so all the job offers are unique. To scrape professional profiles, on the other hand, we created a list of areas of professional practice from the IT industry and, from that list, searched among the first-, second- and third-degree professional contacts of our research group using LinkedIn's search engine and saved the retrieved results in our database; all the professional profiles are also unique.</p>
      <p>We use text mining approaches to process both profile and job offer data. Specifically, we selected the work experience, education and competencies/skills fields from the profiles, and the description and title fields from the job offers. Finally, we concatenate these fields into a new one and discard the original fields, ending up with a document-like representation for each job offer and professional profile.</p>
      <sec id="sec-3-1">
        <title>Data preparation</title>
        <p>Although we retrieved data from job search sites using only IT keywords, some job offers still did not correspond to this field. The first step in this phase is therefore filtering out job offers that do not belong to the IT field. To achieve this, we use a dictionary of weighted IT terms to match each job offer in its document-like format. We calculate the weighted sum of the occurrences of each job offer word that appears in the dictionary and divide it by the occurrences of the remaining words in the document (job offer). This yields a score from 0 to 1, where a higher value indicates that the offer contains many IT-relevant words and very likely belongs to this field. We then select only those job offers with a score greater than 0.5. This issue affects only the job offers, since the profiles were collected exclusively within a network of IT professionals.</p>
        <p>Once job offers and profiles are filtered, the second step is text preprocessing. In this task, we perform stop word removal, tokenization and lemmatization for the Portuguese language.</p>
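        <p>A minimal sketch of such a preprocessing pipeline follows. The stop word set and the lemma lookup table are tiny hypothetical stand-ins; a real system would rely on a complete Portuguese stop word list and a proper lemmatizer, and the tools actually used are not named here.</p>
        <preformat><![CDATA[
```python
import re

# Tiny hypothetical resources; a real system would use a full Portuguese
# stop word list and a proper lemmatizer.
STOP_WORDS = {"de", "em", "e", "com", "para", "o", "a", "os", "as"}
LEMMAS = {
    "desenvolvedores": "desenvolvedor",
    "sistemas": "sistema",
    "conhecimentos": "conhecimento",
}


def preprocess(text):
    """Tokenize, drop stop words, and map each remaining token to its lemma."""
    tokens = re.findall(r"\w+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens
            if token not in STOP_WORDS]


tokens = preprocess("Desenvolvedores de sistemas com conhecimentos em Python")
```
]]></preformat>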
        <p>The third step, feature representation, aims to represent these documents (job offers and profiles) as vector space models. For this purpose, we adopted two approaches: word embeddings and TF-IDF. The latter requires less implementation effort than the former. From the variety of word embedding representations we selected Word2vec, which has different variants. We explore the two model architectures, CBOW and Skip-gram, and also the use of n-grams (bigrams and trigrams) in order to find the variation that best fits our problem. In total, we tested 5 different representations: TF-IDF, Word2vec using CBOW, Word2vec using Skip-gram, Word2vec using CBOW with n-grams and Word2vec using Skip-gram with n-grams. For the Word2vec models, a vector space size of 200 was selected after some initial experimentation.</p>
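        <p>To illustrate what the n-gram variants change, the sketch below merges frequent adjacent token pairs into single tokens before any embedding is trained. The simple frequency threshold is an assumption made for illustration, since the n-gram detection procedure is not detailed here, and the function name merge_bigrams and the toy documents are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def merge_bigrams(documents, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times into one token."""
    pair_counts = Counter(
        pair for doc in documents for pair in zip(doc, doc[1:])
    )
    merged_docs = []
    for doc in documents:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and pair_counts[(doc[i], doc[i + 1])] >= min_count:
                merged.append(doc[i] + "_" + doc[i + 1])
                i += 2  # both tokens were consumed by the bigram
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Toy documents, for illustration only.
docs = [
    ["machine", "learning", "engineer"],
    ["senior", "machine", "learning", "analyst"],
]
merged = merge_bigrams(docs)
```
]]></preformat>
        <p>The merged token machine_learning would then receive its own embedding. Running the same pass again on its output can produce longer units such as trigrams.</p>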
        <p>For both the word embedding and TF-IDF representations, we only used the corpus composed of the job offers. Although we lose some data, this was necessary because we found that job seeker profiles added noise: professionals with very different backgrounds and skill sets looking for a job in IT could foster spurious relations among skills. Finally, we transform both job offers and profiles into these 5 new representations and then use them in the recommendation phase. Table 1 describes the corpora used for our word embeddings.</p>
        <p><sup>3</sup> https://www.catho.com.br</p>
        <p><sup>4</sup> http://visibilia.net.br/text2story-job-recomendation/</p>
        <p>In this last phase, given a profile with a proper representation, we select a group of the nearest job offers based on their distance to that profile (job matching). For the TF-IDF representation we use the cosine distance, while for word embeddings we use the relatively new Word Mover's Distance (WMD) [Kus15]. Once the top k job offers for the profile are retrieved, we sort them in descending order of the inverse of this distance (ranking).</p>
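        <p>The matching and ranking step can be sketched as a nearest-neighbour search. The example below uses the cosine distance over small dense vectors; the vectors, offer identifiers and function names are hypothetical. For word embeddings, the distance argument would be replaced by a WMD implementation, which is not shown here.</p>
        <preformat><![CDATA[
```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def recommend(profile_vec, offer_vecs, k, distance=cosine_distance):
    """Return the ids of the k offers nearest to the profile, closest first."""
    ranked = sorted(offer_vecs,
                    key=lambda oid: distance(profile_vec, offer_vecs[oid]))
    return ranked[:k]


# Hypothetical 3-dimensional vectors, for illustration only.
profile = [1.0, 1.0, 0.0]
offers = {
    "offer_a": [1.0, 0.9, 0.0],
    "offer_b": [0.0, 0.0, 1.0],
    "offer_c": [1.0, 0.0, 0.0],
}
top2 = recommend(profile, offers, k=2)
```
]]></preformat>
        <p>Sorting by ascending distance is equivalent to sorting by descending inverse distance, so the closest offer comes first.</p>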
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <p>In this section, we present extensive empirical experiments focused on evaluating the quality of job recommendations. For these experiments, we consider the case of recommending a set of job offers given a specific professional profile.</p>
      <p>Our dataset is composed of 50 professional profiles from LinkedIn and 3877 job offers from Catho. Both profiles and job offers correspond to Brazilian professionals and companies from the IT field. Because the IT field is broad, the professional profiles also differ somewhat from one another. Table 2 shows the distribution of subfields within our sample of 50 professional profiles, which reflects the greater number of developers and BI consultants.</p>
      <p>First, we use our framework to generate 10 job offer recommendations for each of the 50 profiles. Thus, for each evaluated technique, we obtained a total of 500 recommendations. Second, a group of 5 Human Resources professionals manually evaluated these recommendations and assigned each a score ranging from 1 to 10. The more accurate or suitable the recommendation, the greater the score. To make the results easier to interpret, we normalize these scores by dividing them by the maximum score. Third, once these scores are obtained, we average them and also calculate Precision and Minimum Effectiveness (ME).</p>
      <p>Precision for a single profile is calculated by dividing the number of relevant documents (recommendations with a score greater than 0.5) by the number of retrieved documents (total number of recommendations); we then average this precision over all the profiles. The Minimum Effectiveness (ME), on the other hand, assigns a value of 1 if at least one of the 10 recommendations for a profile has a score greater than or equal to 0.5, and 0 otherwise. We average this value over all the profiles to obtain an estimator of the global effectiveness of a system that issues 10 job recommendations per profile. Table 3 shows the results of applying these metrics to our dataset for the 5 evaluated techniques.</p>
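      <p>The two metrics can be sketched as follows, assuming one normalized score per recommendation. The raw scores below are invented for illustration and are not taken from our evaluation; the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
def normalize(raw_scores, max_score=10):
    """Divide raw 1-10 evaluator scores by the maximum score."""
    return [score / max_score for score in raw_scores]


def precision(scores, threshold=0.5):
    """Fraction of recommendations scored strictly above the threshold."""
    return sum(score > threshold for score in scores) / len(scores)


def minimum_effectiveness(scores, threshold=0.5):
    """1 if at least one recommendation scores at or above the threshold, else 0."""
    return int(any(score >= threshold for score in scores))


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical evaluator scores for two profiles with 10 recommendations each.
raw = {
    "profile_1": [9, 8, 6, 5, 4, 7, 3, 2, 10, 1],
    "profile_2": [4, 3, 2, 1, 4, 3, 2, 1, 4, 3],
}
scores = {pid: normalize(values) for pid, values in raw.items()}
avg_precision = mean([precision(s) for s in scores.values()])
avg_me = mean([minimum_effectiveness(s) for s in scores.values()])
```
]]></preformat>
      <p>In this toy case the first profile has a precision of 0.5 and an ME of 1, the second has 0 for both, so the averages are 0.25 and 0.5 respectively.</p>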
      <p>Here, we can observe that Word2vec with Skip-gram obtains a slightly better average score than TF-IDF, which has the second best average. Word2vec with Skip-gram also clearly achieves a better average precision than all the other techniques by a good margin, and it is the best option according to the three metrics. The Word2vec variant using Skip-gram with n-grams ranked second. Furthermore, not all the profiles were given a good recommendation, as the maximum value of the average minimum effectiveness is 0.96 (48 out of 50 profiles). This last metric is highly dependent on the quality of the filtering process and the variety of job offers, since there can be a shortage of offers for some specific profiles. Finally, the two versions of Word2vec using n-grams perform better than Word2vec with CBOW according to all the metrics used. In addition, these n-gram versions have a slightly better average precision than TF-IDF, but a lower average score.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we proposed a framework for the job recommendation task. This framework facilitates the understanding of the job recommendation process and allows the use of a variety of text processing and recommendation methods according to the preferences of the job recommender system designer. Moreover, we contribute by making publicly available a new dataset containing job seeker profiles and job vacancies.</p>
      <p>Future directions of our work will focus on performing a more exhaustive evaluation considering a larger number of methods and more data, as well as a comprehensive evaluation of the impact of each professional skill of a job seeker on the job recommendations received.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the São Paulo Research Foundation (FAPESP) grants: 2016/08183-5, 2017/14995-5, 2017/15070-5, 2017/15247-2 and 2017/17312-6.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [AlO12]
          <string-name>
            <given-names>Shaha T</given-names>
            <surname>Al-Otaibi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mourad</given-names>
            <surname>Ykhlef</surname>
          </string-name>
          .
          <article-title>"A survey of job recommender systems"</article-title>
          .
          <source>In: International Journal of the Physical Sciences</source>
          <volume>7</volume>.<issue>29</issue>
          (
          <year>2012</year>
          ), pp.
          <fpage>5127</fpage>
          –
          <lpage>5142</lpage>
          . issn:
          <issn>19921950</issn>
          . doi:
          <pub-id pub-id-type="doi">10.5897/IJPS12.482</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Den15]
          <string-name>
            <given-names>N</given-names>
            <surname>Deniz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A</given-names>
            <surname>Noyan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O G</given-names>
            <surname>Ertosun</surname>
          </string-name>
          .
          <article-title>"Linking Person-job Fit to Job Stress: The Mediating Effect of Perceived Person-organization Fit"</article-title>
          .
          <source>In: Procedia - Social and Behavioral Sciences</source>
          <volume>207</volume>
          (
          <year>2015</year>
          ), pp.
          <fpage>369</fpage>
          –
          <lpage>376</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Dia13]
          <string-name>
            <given-names>M</given-names>
            <surname>Diaby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E</given-names>
            <surname>Viennet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T</given-names>
            <surname>Launay</surname>
          </string-name>
          .
          <article-title>"Toward the next generation of recruitment tools: An online social network-based job recommender system"</article-title>
          .
          <source>In: Proc. of the 2013 IEEE/ACM Int. Conf. on Advances in Social Networks Analysis and Mining, ASONAM 2013</source>
          (
          <year>2013</year>
          ), pp.
          <fpage>821</fpage>
          –
          <lpage>828</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/2492517.2500266</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Dia14]
          <string-name>
            <given-names>M</given-names>
            <surname>Diaby</surname>
          </string-name>
          and
          <string-name>
            <given-names>E</given-names>
            <surname>Viennet</surname>
          </string-name>
          .
          <article-title>"Taxonomy-based job recommender systems on Facebook and LinkedIn profiles"</article-title>
          .
          <source>In: Proc. of Int. Conf. on Research Challenges in Information Science</source>
          (
          <year>2014</year>
          ), pp.
          <fpage>1</fpage>
          –
          <lpage>6</lpage>
          . issn:
          <issn>21511357</issn>
          . doi:
          <pub-id pub-id-type="doi">10.1109/RCIS.2014.6861048</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Kus15]
          <string-name>
            <given-names>M</given-names>
            <surname>Kusner</surname>
          </string-name>
          et al.
          <article-title>"From word embeddings to document distances"</article-title>
          .
          <source>In: Proc. of the 32nd Int. Conf. on Machine Learning, ICML'15</source>
          .
          <year>2015</year>
          , pp.
          <fpage>957</fpage>
          –
          <lpage>966</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Mik13a]
          <string-name>
            <given-names>T</given-names>
            <surname>Mikolov</surname>
          </string-name>
          et al.
          <article-title>"Distributed Representations of Words and Phrases and Their Compositionality"</article-title>
          .
          <source>In: Proc. of the 26th Int. Conf. on Neural Information Processing Systems - Volume 2. NIPS'13</source>
          .
          <conf-loc>Lake Tahoe, Nevada</conf-loc>
          ,
          <year>2013</year>
          , pp.
          <fpage>3111</fpage>
          –
          <lpage>3119</lpage>
          . url:
          <uri>http://dl.acm.org/citation.cfm?id=2999792.2999959</uri>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Mik13b]
          <string-name>
            <given-names>T</given-names>
            <surname>Mikolov</surname>
          </string-name>
          et al.
          <article-title>"Efficient estimation of word representations in vector space"</article-title>
          .
          <source>In: arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Sal88]
          <string-name>
            <given-names>G</given-names>
            <surname>Salton</surname>
          </string-name>
          and
          <string-name>
            <given-names>C</given-names>
            <surname>Buckley</surname>
          </string-name>
          .
          <article-title>"Term-weighting approaches in automatic text retrieval"</article-title>
          .
          <source>In: Information Processing and Management</source>
          <volume>24</volume>.<issue>5</issue>
          (
          <year>1988</year>
          ), pp.
          <fpage>513</fpage>
          –
          <lpage>523</lpage>
          . issn:
          <issn>0306-4573</issn>
          . doi:
          <pub-id pub-id-type="doi">https://doi.org/10.1016/0306-4573(88)90021-0</pub-id>
          . url:
          <uri>http://www.sciencedirect.com/science/article/pii/0306457388900210</uri>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>