Studo Jobs: Enriching Data With Predicted Job Labels

Markus Reiter-Haas (Moshbit GmbH, Graz, Austria), markus.reiter-haas@studo.co
Valentin Slawicek (Moshbit GmbH, Graz, Austria), valentin.slawicek@studo.co
Emanuel Lacic (Know-Center, Graz, Austria), elacic@know-center.at

ABSTRACT
In this paper, we present the Studo Jobs platform, in which we tackle the problem of automatically assigning labels to new job advertisements. For that purpose we perform an exhaustive comparison study of state-of-the-art classifiers to be used for label prediction in the job domain. Our findings suggest that in most cases an SVM-based approach using stochastic gradient descent performs best on the textual content of job advertisements in terms of Accuracy, F1-measure and AUC. Consequently, we plan to use the best performing classifier for each label which is relevant to the Studo Jobs platform in order to automatically enrich the job advertisement data. We believe that our work is of interest for both researchers and practitioners in the area of automatic labeling and enriching text-based data.

KEYWORDS
job platform; data enrichment; label prediction; comparative study

1 INTRODUCTION
The job market is highly competitive, and finding a new job is not an easy decision as it usually depends on many factors such as salary, job description or geographical location. This has led to the recent rise of business-oriented social networks like LinkedIn (http://linkedin.com) or XING (http://xing.com). Users of such networks organize and maintain their profile by describing their skills, interests and previous work experiences. But finding relevant jobs for users based on such carefully structured content is actually a non-trivial task [1]. Tackling the same problem gets even more difficult for university students, as they normally have only some or no relevant work experience at all. This has become a real issue for students, who are increasingly aware that having a degree does not automatically guarantee them their desired job after graduation. For instance, the recent study of [10] reports that one third of graduates in the U.S. were employed in positions that do not require a university degree. Moreover, the authors report that 23.5% of employed graduates in 2013 were not only underemployed but also worked in positions with a below-average salary.

In 2016 we launched the Studo (https://studo.co/) mobile application with the initial aim to provide constant guidance and support to Austrian students in their everyday life. As seen in Figure 1a, Studo integrates several university-relevant services (e.g., course management, mail, calendar) but also enriches the student's daily life by providing relevant news articles. With such a feature set, a student is not only better informed but also encouraged to connect and collaborate with other peers from the same community. Moreover, the ever increasing popularity of the application at Austrian universities (as of June 2017, its 30,000 monthly active users generate on average 100 application starts per month) has shown that students clearly need additional guidance throughout their studies. Thus, one of the main goals of Studo is to better prepare students for the job market they need to face after graduation.

Figure 1: Screenshot of (a) the course overview in the Studo mobile application and (b) the Studo Jobs platform.

Current work. In this paper, we present our work in progress on the newest extension to Studo - the Studo Jobs platform. As seen in Figure 1b, students can browse related job advertisements in order to gather relevant working experience even before they graduate. In our case, these job advertisements typically describe the candidate's job role, the required skills, the expected educational background as well as the company description, but only in an unstructured free-text format.
In the context of job recommendations, having such unstructured data can be problematic, as students already struggle with having little job experience [13]. To overcome this limitation, in this work we focus on enriching the job advertisement data by automatically generating and assigning labels (i.e., categories) to which an advertised job belongs. The benefits of such data enrichment are twofold. First, students can more easily navigate through the available job offers (e.g., by filtering out irrelevant categories). Second, by correctly enriching the job advertisement data we hope to increase the performance of future job recommendations (e.g., by performing clustering as in [9]). As such, we perform an extensive algorithmic comparison study on how to predict suitable labels for a particular job advertisement. We believe that our findings can support both developers and researchers in enriching their data and potentially improving recommendation performance.

  {
    "id" : "123456",
    "jobTitle" : "Junior Java Developer",
    "text" : "Ihre Aufgaben: Erhebung der Anforderungen fuer die...",
    "labels" : ["Software"]
  },
  {
    "id" : "456789",
    "jobTitle" : "Marketing, Sales & Event Manager",
    "text" : "Rocken-motivieren-begeistern! Schoen, dass Du...",
    "labels" : ["Marketing", "Catering", "Graphics, Design"]
  }

Listing 1: Example of two crawled job advertisements in JSON format (text was shortened for readability). In our experiments we only used the text of a given job in order to predict the best suited labels.

2 RELATED WORK
Most work related to assigning labels to job advertisements comes from research on multi-label classification, an emerging machine learning paradigm which tries to assign a set of labels to a document [17]. In an extensive literature review [20], previous multi-label learning has been divided into two main groups, i.e., algorithm adaptation and problem transformation methods. Algorithm adaptation methods adapt popular learning techniques to deal with multi-label data directly, while problem transformation methods transform the multi-label classification problem into one or more single-label classification problems. In our work we build upon the latter, i.e., we explore how to construct several single-label classifiers in order to assign relevant labels to job advertisements. We base our decision on other work which has shown that binary relevance is a suitable method to tackle the problem of multi-label classification (e.g., [16]). Moreover, the authors of [18] have shown that the task of predicting the job sector (i.e., the category or label, as in our case) can be done more accurately than predicting its title or education requirement.
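To make the problem transformation route concrete, the following minimal Python sketch illustrates the binary relevance idea on entries shaped like those in Listing 1: every target label gets its own yes/no dataset on which a single-label classifier can then be trained. The field names mirror the crawled JSON format, while the function and variable names are illustrative only and not taken from the paper.

```python
from collections import defaultdict

# Two toy entries in the crawled JSON format (cf. Listing 1).
jobs = [
    {"text": "Ihre Aufgaben: Erhebung der Anforderungen fuer die...",
     "labels": ["Software"]},
    {"text": "Rocken-motivieren-begeistern! Schoen, dass Du...",
     "labels": ["Marketing", "Catering", "Graphics, Design"]},
]

def to_binary_datasets(jobs, target_labels):
    """Binary relevance: build one (text, 0/1) dataset per target label."""
    datasets = defaultdict(list)
    for job in jobs:
        for label in target_labels:
            # 1 if the advertisement carries the label, 0 otherwise
            datasets[label].append((job["text"], int(label in job["labels"])))
    return datasets

binary_sets = to_binary_datasets(jobs, ["Software", "Catering", "Technology", "Business"])
```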
3 METHODOLOGY
Similar to the work of [18], we train multiple binary classifiers on the basis of features contained in the text (i.e., terms). Thus, for each label l we define a parameter vector for class c: \theta_c = \{\theta_{c1}, \theta_{c2}, \ldots, \theta_{cn}\}, where n is the size of the vocabulary in the corresponding training set. The values of the vector \theta_c are the TF-IDF values denoted by Equations 1 and 2, where TF(t, j) is the term count within the job advertisement and DF(t) is the number of job advertisements in which that particular term occurs.

  TF-IDF(t, j) = TF(t, j) \times IDF(t)    (1)

  IDF(t) = \log \frac{1 + n}{1 + DF(t)} + 1    (2)

For our comparison study we performed experiments on different job labels using several classification algorithms. As a baseline, we first explored three well-known algorithms from the literature. Specifically, we looked into: (1) the Naive Bayes algorithm, which assumes pairwise independence of the input features [19], (2) a Classification And Regression Tree (CART), where at each node one input is tested and, depending on the result, the left or right sub-branch is traversed [2] and, (3) a Random Forest ensemble approach, where each tree votes for a particular class [4].

Next we experimented with AdaBoost, a boosting algorithm which adaptively adjusts the weights of incorrectly classified instances [7]. Another approach was a linear model using Logistic Regression, which assumes that the posterior probability of a class is equal to a logistic sigmoid function acting on a linear function [3]. The Support Vector Machine (SVM) is another algorithm that has been shown to perform well for text classification. As such, we used two SVM methods: (1) a Linear SVM, which tries to fit a hyperplane with the maximum soft margin, thus allowing for a minimal number of errors [6] and, (2) an SVM-SGD approach, where the stochastic gradient descent optimization method is applied to the SVM [21].

Finally, we experimented with three different neural network approaches. We first trained a Multilayer Perceptron (MLP), which consists of an input layer that is provided with the input vector \theta_c, followed by one hidden layer with a size of 1024 units and two smaller hidden layers with 128 units each. Each of the hidden layers is followed by a batch normalization layer. The next two models are based on the work of [11]. As such, we used a convolutional neural network (CNN) with an embedding layer of 200 units. The embedding layer is followed by a 1-dimensional convolutional layer with 128 filters and a kernel size of 3. This was followed by a global max pooling layer and a dense layer with 128 units. The third model was a multichannel CNN (M-CNN) that connects the embedding layer with three convolutional layers in parallel, each one having 128 filters and a kernel size of 3, 4 and 5 respectively. Every convolutional layer is followed by a max pooling layer, and their outputs are then merged together. In all three networks we used rectified linear units as the activation function (i.e., Equation 3). The output layer always has two units and uses a standard softmax activation function (i.e., Equation 4). Each of the hidden layers uses dropout with a rate of 0.2 for regularization. The models also use the Adam optimizer [12] with a learning rate of 0.001. It also needs to be noted that the last two CNN approaches do not utilize a TF-IDF based input vector like the rest. Their input is generated by transforming the textual content of a job advertisement into a sequence of word indices. With a maximum sequence length of 1,000, shorter texts were padded with a default zero index.

  f(x) = x^{+} = \max(0, x)    (3)

  \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}    (4)
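For reference, a rough Keras sketch of the single-channel CNN described above follows. It uses the stated layer sizes (200-dimensional embedding, 128 filters with kernel size 3, global max pooling, a 128-unit dense layer, dropout of 0.2, a two-unit softmax output and Adam with a learning rate of 0.001); the vocabulary size and the categorical cross-entropy loss are our own assumptions, as the paper does not report them.

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
from keras.optimizers import Adam

MAX_SEQUENCE_LENGTH = 1000  # shorter texts are zero-padded to this length
VOCABULARY_SIZE = 20000     # assumption; not stated in the paper

model = Sequential([
    # word indices -> 200-dimensional embeddings
    Embedding(input_dim=VOCABULARY_SIZE, output_dim=200,
              input_length=MAX_SEQUENCE_LENGTH),
    # 1-dimensional convolution with 128 filters and kernel size 3, ReLU activation
    Conv1D(filters=128, kernel_size=3, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(128, activation="relu"),
    Dropout(0.2),
    # two-unit softmax output: label present vs. not present
    Dense(2, activation="softmax"),
])
model.compile(optimizer=Adam(lr=0.001),
              loss="categorical_crossentropy",  # assumed loss for the softmax output
              metrics=["accuracy"])
```

The multichannel variant (M-CNN) would instead use the functional API, feeding the same embedding into three parallel convolutional branches with kernel sizes 3, 4 and 5 and merging their pooled outputs.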
4 EXPERIMENTAL SETUP
In order to perform a comparative study we first constructed a training and test set by crawling job advertisement data from leading Austrian job platforms, i.e., stepstone.at, karriere.at and monster.at. We utilized an incremental crawler which was given a manually constructed list of URLs for each job type. If a particular job advertisement was not yet in the database, a new entry was added; otherwise it was enriched with a new label. The crawler then iterated over a finite number of pages. As seen in Listing 1, an extracted job entry consists of an id, a title, a description in plain text and a list of labels which denote the type of the job.

Dataset. We crawled 5,602 job advertisements in total. On average, a posted job had 1.05 labels assigned to it. In our experiments we focused on four different labels which are mainly used in the Studo Jobs platform, namely: Software, Catering, Technology and Business. As the Studo-specific label Business could not be directly crawled, we derived it by combining job advertisements of the types Management and Marketing.

Figure 2: Occurrences of the four Studo Jobs labels in the crawled dataset. Management and Marketing are merged together into the Business label.

Evaluation. As the crawled dataset is clearly imbalanced (e.g., as seen in Figure 2 the Software label dominates), we further constructed subdatasets for each label to experiment on. Thus, for each label a subset was used containing all the jobs carrying that particular label as well as a random sample of the same size containing other labels. Therefore, the resulting subdataset had 50% of job advertisements containing the evaluated label and 50% without it. The evaluation was performed using Scikit-learn [14] for the Naive Bayes, CART, Random Forest, AdaBoost, Logistic Regression, Linear SVM and SVM-SGD approaches. For the neural networks we utilized Keras [5]. These models were trained and evaluated using a 10-fold stratified cross-validation on each subdataset respectively. In order to finally quantify the prediction performance, we used a set of well-known information retrieval metrics. In particular, we report the prediction performance by means of Accuracy, the F1-measure and the Area Under the ROC curve (AUC) [15].

5 RESULTS
The overall results of the algorithmic comparison can be seen in Table 1. As each subdataset had a different size of the test set (i.e., due to the imbalanced nature of the original dataset), for each approach we report a weighted average in terms of Accuracy, F1-measure and AUC. For example, the F1-measure is calculated as: F1 = \alpha F1(Business) + \beta F1(Technology) + \gamma F1(Catering) + \delta F1(Software), where the weight values correspond to the percentage of the respective label in the dataset (i.e., \alpha = 0.237, \beta = 0.197, \gamma = 0.068 and \delta = 0.497).

Table 1: Average performance across all labels. Each label is weighted with the number of samples in its subdataset.

Algorithm            | Accuracy | F1-measure | AUC
Random Forest        | 0.8017   | 0.7932     | 0.8739
CART                 | 0.8046   | 0.8053     | 0.8054
MLP                  | 0.8224   | 0.8253     | 0.8923
AdaBoost             | 0.8331   | 0.8300     | 0.8952
Naive Bayes          | 0.8470   | 0.8555     | 0.9069
M-CNN                | 0.8571   | 0.8549     | 0.9177
CNN                  | 0.8597   | 0.8604     | 0.9236
Logistic Regression  | 0.8661   | 0.8653     | 0.9335
Linear SVM           | 0.8707   | 0.8709     | 0.9344
SVM-SGD              | 0.8748   | 0.8751     | 0.9329

In general, we found strong accuracy performance for all models (e.g., even the worst performing Random Forest still reached an Accuracy of 0.8017 and an F1 of 0.7932). The best performing approaches were the SVM-based ones, where the linear approach had the best AUC and the one using stochastic gradient descent had the best Accuracy and F1. Interestingly, due to the recent popularization of deep learning approaches, our first assumption was that the models based on CNNs would perform much better than the SVM-based ones. Although still competitive, we assume that their lower accuracy performance stems from the hyperparameters not being tuned enough and from early stopping not being used to cope with overfitting. As such, we hypothesize that there is still much to gain from such approaches by learning these parameters beforehand (e.g., using a nested cross-validation as in [8]) and incorporating a validation set to stop the model training at the most suitable time.
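As an illustration of the evaluation protocol, the sketch below reproduces the per-label setup for the best performing approach, assuming that SVM-SGD corresponds to scikit-learn's SGDClassifier with a hinge loss on top of TF-IDF features (whose default smoothed IDF matches Equation 2). The hyperparameters are library defaults, and the texts and labels stand in for one balanced subdataset; both are placeholders rather than the actual crawled data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Placeholder for one balanced subdataset: 50% with the evaluated label (1), 50% without (0).
texts = ["Erhebung der Anforderungen fuer die Software ...",
         "Event Catering fuer Messen und Kongresse ..."] * 50
labels = np.array([1, 0] * 50)

# TF-IDF features (Equations 1 and 2) followed by a linear SVM trained with SGD (hinge loss).
pipeline = make_pipeline(TfidfVectorizer(), SGDClassifier(loss="hinge"))

# 10-fold stratified cross-validation, reporting Accuracy, F1 and AUC for this label.
scores = cross_validate(pipeline, texts, labels, cv=10,
                        scoring=("accuracy", "f1", "roc_auc"))
print({metric: scores["test_" + metric].mean()
       for metric in ("accuracy", "f1", "roc_auc")})
```

The label-weighted averages reported in Table 1 would then be obtained by combining such per-label scores with the weights \alpha, \beta, \gamma and \delta given above.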
The individual label results in terms of the F1-measure can be seen in Figure 3. We show only the F1-measure due to space restrictions and because the Accuracy values are almost identical in comparison. The best performance was achieved on the Software label using the SVM-SGD approach, with an Accuracy of 0.9193 and an F1 of 0.9196. A contributing factor to this performance is possibly the size of the training set, which was by far the largest for the Software label. An interesting finding is that all of the approaches applied to the Catering label, which had the least training data, performed much better than on the Business and Technology labels. Looking at the data, we think that the reason for this performance difference lies in the broader definition of these label terms. Moreover, the reason that the Business label had the worst performance could lie in the fact that its data comes from a combination of the crawled Management and Marketing labels. It should also be noted that the MLP approach outperformed all others when applied to the Catering label. This suggests that when the right hyperparameters are picked, an increase in performance could still be gained.

Figure 3: Performance results of the utilized binary classification algorithms in terms of the F1-measure for all labels relevant to the Studo Jobs platform.

Overall, the SVM-SGD approach performed best. However, the MLP approach outperformed the others for the Catering label, and the much simpler Naive Bayes had almost the same performance as SVM-SGD for the Technology label. This suggests that a diversified model combination could lead to even better performance.
6 CONCLUSION
In this work we presented the Studo Jobs platform and showed how we plan to tackle the problem of automatically assigning labels to new job advertisements. For that purpose we performed an extensive comparative study between several state-of-the-art text-classification algorithms. Our findings suggest that by utilizing an SVM approach using stochastic gradient descent we can achieve the best performance in terms of Accuracy, F1-measure and AUC. However, our results revealed that deep learning approaches can also improve the prediction performance, especially with the right hyperparameter setup. As such, for our Studo Jobs platform we will use a combination of those binary classifiers which showed the best performance results.

Limitation and Future Work. As already mentioned, one limitation of our work is that we did not extensively explore the impact of choosing the right hyperparameters for the deep learning approaches. Therefore, we plan to extend the study by finding the optimal hyperparameters for each label that is relevant to the Studo Jobs platform (e.g., by setting up a nested cross-validation). In addition, we also plan to extend our comparison study by including other features besides the textual terms and by incorporating methods that adapt algorithms to directly perform multi-label classification. Building on the data enrichment of job advertisements, we further plan to integrate the generated labels and assess their impact on perceived usefulness and navigability for users in a live setting (e.g., by letting users define and store filters to narrow down the search for relevant jobs). Finally, we plan to extend the Studo Jobs platform with personalized recommendations which leverage the automatically generated job labels. We not only want to investigate which approaches (e.g., content-based, collaborative filtering, etc.) benefit the most from such data, but also how to incorporate recent label filters as additional time-dependent contextual cues in order to predict the current job interest. For this we also plan to investigate the recently popularized deep learning approaches (e.g., recurrent neural networks) to see if we can predict the future shift in interest in a job type.

Acknowledgments. This work is supported by the Know-Center and the ISDS Institute of Graz University of Technology. The authors would also like to thank the AVL company, especially Dr. Markus Tomaschitz, for the support in setting up this research project and for giving insights into the job market.
REFERENCES
[1] F. Abel. We know where you should work next summer: job recommendations. In Proceedings of the 9th ACM Conference on Recommender Systems, pages 230-230. ACM, 2015.
[2] Berk. Classification and Regression Trees. Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery, Use R, pages 36-350, 2009.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. 2013.
[4] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[5] F. Chollet and others. Keras. https://github.com/fchollet/keras.
[6] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[7] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. pages 23-37, 1995.
[8] I. Guyon, A. R. S. A. Alamdari, G. Dror, and J. M. Buhmann. Performance prediction challenge. In Neural Networks, 2006. IJCNN'06. International Joint Conference on, pages 1649-1656. IEEE, 2006.
[9] W. Hong, S. Zheng, H. Wang, and J. Shi. A job recommender system based on user clustering. Journal of Computers, 8(8):1960-1967, 2013.
[10] J. Jones, J. Schmitt, et al. A college degree is no guarantee. Technical report, Center for Economic and Policy Research (CEPR), 2014.
[11] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. pages 1-15, 2014.
[13] R. Liu, Y. Ouyang, W. Rong, X. Song, C. Tang, and Z. Xiong. Computational Science and Its Applications - ICCSA 2016. 9788:453-467, 2016.
[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[15] D. M. Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
[16] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333-359, 2011.
[17] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, pages 1-13, 2007.
[18] J. Zavrel, P. Berck, and W. Lavrijssen. Information extraction by text classification: Corpus mining for features. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), 2000.
[19] H. Zhang. The optimality of Naive Bayes. In Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2004), pages 1-6, 2004.
[20] M. L. Zhang and Z. H. Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819-1837, 2014.
[21] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-first International Conference on Machine Learning, page 116, 2004.