Studo Jobs: Enriching Data With Predicted Job Labels

Markus Reiter-Haas (Moshbit GmbH, Graz, Austria), markus.reiter-haas@studo.co
Valentin Slawicek (Moshbit GmbH, Graz, Austria), valentin.slawicek@studo.co
Emanuel Lacic (Know-Center, Graz, Austria), elacic@know-center.at

ABSTRACT
In this paper, we present the Studo Jobs platform, in which we tackle the problem of automatically assigning labels to new job advertisements. For that purpose we perform an exhaustive comparison study of state-of-the-art classifiers to be used for label prediction in the job domain. Our findings suggest that in most cases an SVM-based approach using stochastic gradient descent performs best on the textual content of job advertisements in terms of Accuracy, F1-measure and AUC. Consequently, we plan to use the best performing classifier for each label which is relevant to the Studo Jobs platform in order to automatically enrich the job advertisement data. We believe that our work is of interest for both researchers and practitioners in the area of automatic labeling and enriching text-based data.

KEYWORDS
job platform; data enrichment; label prediction; comparative study

1 INTRODUCTION
The job market is highly competitive, and finding a new job is not an easy decision as it usually depends on many factors such as salary, job description or geographical location. This has led to the recent rise of business-oriented social networks like LinkedIn (http://linkedin.com) or XING (http://xing.com). Users of such networks organize and maintain their profile by describing their skills, interests and previous work experiences. But finding relevant jobs for users based on such carefully structured content is actually a non-trivial task [1]. Tackling the same problem gets even more difficult for university students, as they normally have only some or no relevant work experience at all. This has become a real issue for students, who are increasingly aware that having a degree does not automatically guarantee them their desired job after graduation. For instance, the recent study of [10] reports that one third of graduates in the U.S. were employed in positions that do not require a university degree. Moreover, the authors report that 23.5% of employed graduates in 2013 were not only underemployed but also worked in positions with a below-average salary.

In 2016 we launched the Studo (https://studo.co/) mobile application with the initial aim to provide constant guidance and support to Austrian students in their everyday life. As seen in Figure 1a, Studo integrates several university-relevant services (e.g., course management, mail, calendar) but also enriches the student's daily life by providing relevant news articles. With such a feature set, a student is not only better informed but also encouraged to connect and collaborate with other peers from the same community. Moreover, the ever increasing popularity of the application at Austrian universities (as of June 2017, its 30,000 monthly active users generate on average 100 application starts per month) has shown that students clearly need additional guidance throughout their studies. Thus, one of the main goals of Studo is to better prepare students for the job market they need to face after graduation.

Figure 1: Screenshot of (a) the course overview in the Studo mobile application and (b) the Studo Jobs platform.

Current work. In this paper, we present our work in progress on the newest extension to Studo - the Studo Jobs platform. As seen in Figure 1b, students can browse related job advertisements in order to gather relevant working experience even before they graduate. In our case, these job advertisements typically describe the candidate's job role, the required skills, the expected educational background as well as the company description, but only in an unstructured free-text format.
In the context of job recommendations, having such unstructured data can be problematic, as students already struggle with having little job experience [13]. To overcome this limitation, in this work we focus on enriching the job advertisement data by automatically generating and assigning labels (i.e., categories) to which an advertised job belongs. The benefits of such data enrichment are twofold. First, students can more easily navigate through the available job offers (e.g., by filtering out irrelevant categories). Second, by correctly enriching the job advertisement data we hope to increase the performance of future job recommendations (e.g., by performing clustering as in [9]). As such, we perform an extensive algorithmic comparison study on how to predict suitable labels for a particular job advertisement. We believe that our findings can support both developers and researchers in enriching their data and potentially improving recommendation performance.

  {
    "id" : "123456",
    "jobTitle" : "Junior Java Developer",
    "text" : "Ihre Aufgaben: Erhebung der Anforderungen fuer die...",
    "labels" : ["Software"]
  },
  {
    "id" : "456789",
    "jobTitle" : "Marketing, Sales & Event Manager",
    "text" : "Rocken-motivieren-begeistern! Schoen, dass Du...",
    "labels" : ["Marketing", "Catering", "Graphics, Design"]
  }

Listing 1: Example of two crawled job advertisements in JSON format (text was shortened for readability). In our experiments we only used the text of a given job in order to predict the best suited labels.

2 RELATED WORK
Most work related to assigning labels to job advertisements comes from research on multi-label classification, an emerging machine learning paradigm which tries to assign a set of labels to a document [17]. In an extensive literature review [20], previous multi-label learning has been divided into two main groups, i.e., algorithm adaptation and problem transformation methods. Algorithm adaptation methods adapt popular learning techniques to deal with multi-label data directly, while problem transformation methods transform the multi-label classification problem into one or more single-label classification problems. In our work we build upon the latter, i.e., we explore how to construct several single-label classifiers in order to assign relevant labels to job advertisements. We base our decision on other work which has shown that binary relevance is a suitable method to tackle the problem of multi-label classification (e.g., [16]). Moreover, the authors of [18] have shown that the task of predicting the job sector (i.e., the category or label, as in our case) can be done more accurately than predicting its title or education requirement.
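To make the problem transformation route concrete, the following minimal Python sketch illustrates the binary relevance idea on entries shaped like those in Listing 1: every target label gets its own yes/no dataset on which a single-label classifier can then be trained. The field names mirror the crawled JSON format, while the function and variable names are illustrative only and not taken from the paper.

```python
from collections import defaultdict

# Two toy entries in the crawled JSON format (cf. Listing 1).
jobs = [
    {"text": "Ihre Aufgaben: Erhebung der Anforderungen fuer die...",
     "labels": ["Software"]},
    {"text": "Rocken-motivieren-begeistern! Schoen, dass Du...",
     "labels": ["Marketing", "Catering", "Graphics, Design"]},
]

def to_binary_datasets(jobs, target_labels):
    """Binary relevance: build one (text, 0/1) dataset per target label."""
    datasets = defaultdict(list)
    for job in jobs:
        for label in target_labels:
            # 1 if the advertisement carries the label, 0 otherwise
            datasets[label].append((job["text"], int(label in job["labels"])))
    return datasets

binary_sets = to_binary_datasets(jobs, ["Software", "Catering", "Technology", "Business"])
```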
3 METHODOLOGY
Similar to the work of [18], we train multiple binary classifiers on the basis of features contained in the text (i.e., terms). Thus, for each label l we define a parameter vector for class c: \theta_c = \{\theta_{c1}, \theta_{c2}, \ldots, \theta_{cn}\}, where n is the size of the vocabulary in the corresponding training set. The values of the vector \theta_c are the TF-IDF values denoted by Equations 1 and 2, where TF(t, j) is the term count within the job advertisement and DF(t) is the number of job advertisements in which that particular term occurs.

  TF-IDF(t, j) = TF(t, j) \times IDF(t)    (1)

  IDF(t) = \log \frac{1 + n}{1 + DF(t)} + 1    (2)

For our comparison study we performed experiments on different job labels using several classification algorithms. As a baseline, we first explored three well-known algorithms from the literature. Specifically, we looked into: (1) the Naive Bayes algorithm, which assumes pairwise independence of the input features [19], (2) a Classification And Regression Tree (CART), where at each node one input is tested and, depending on the result, the left or right sub-branch is traversed [2] and, (3) a Random Forest ensemble approach, where each tree votes for a particular class [4].

Next we experimented with AdaBoost, a boosting algorithm which adaptively adjusts the weights of incorrectly classified instances [7]. Another approach was a linear model using Logistic Regression, which assumes that the posterior probability of a class is equal to a logistic sigmoid function acting on a linear function [3]. The Support Vector Machine (SVM) is another algorithm that has been shown to perform well for text classification. As such, we used two SVM methods: (1) a Linear SVM, which tries to fit a hyperplane with the maximum soft margin, thus allowing for a minimal number of errors [6] and, (2) an SVM-SGD approach, where the stochastic gradient descent optimization method is applied to the SVM [21].

Finally, we experimented with three different neural network approaches. We first trained a Multilayer Perceptron (MLP), which consists of an input layer that is provided with the input vector \theta_c, followed by one hidden layer with a size of 1024 units and two smaller hidden layers with 128 units each. Each of the hidden layers is followed by a batch normalization layer. The next two models are based on the work of [11]. As such, we used a convolutional neural network (CNN) with an embedding layer of 200 units. The embedding layer is followed by a 1-dimensional convolutional layer with 128 filters and a kernel size of 3. This was followed by a global max pooling layer and a dense layer with 128 units. The third model was a multichannel CNN (M-CNN) that connects the embedding layer with three convolutional layers in parallel, each one having 128 filters and a kernel size of 3, 4 and 5 respectively. Every convolutional layer is followed by a max pooling layer, and their outputs are then merged together. In all three networks we used rectified linear units as the activation function (i.e., Equation 3). The output layer always has two units and uses a standard softmax activation function (i.e., Equation 4). Each of the hidden layers uses dropout with a rate of 0.2 for regularization. The models also use the Adam optimizer [12] with a learning rate of 0.001. It also needs to be noted that the last two CNN approaches do not utilize a TF-IDF based input vector like the rest. Their input is generated by transforming the textual content of a job advertisement into a sequence of word indices. With a maximum sequence length of 1,000, shorter texts were padded with a default zero index.

  f(x) = x^{+} = \max(0, x)    (3)

  \sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}    (4)
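For reference, a rough Keras sketch of the single-channel CNN described above follows. It uses the stated layer sizes (200-dimensional embedding, 128 filters with kernel size 3, global max pooling, a 128-unit dense layer, dropout of 0.2, a two-unit softmax output and Adam with a learning rate of 0.001); the vocabulary size and the categorical cross-entropy loss are our own assumptions, as the paper does not report them.

```python
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout
from keras.optimizers import Adam

MAX_SEQUENCE_LENGTH = 1000  # shorter texts are zero-padded to this length
VOCABULARY_SIZE = 20000     # assumption; not stated in the paper

model = Sequential([
    # word indices -> 200-dimensional embeddings
    Embedding(input_dim=VOCABULARY_SIZE, output_dim=200,
              input_length=MAX_SEQUENCE_LENGTH),
    # 1-dimensional convolution with 128 filters and kernel size 3, ReLU activation
    Conv1D(filters=128, kernel_size=3, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(128, activation="relu"),
    Dropout(0.2),
    # two-unit softmax output: label present vs. not present
    Dense(2, activation="softmax"),
])
model.compile(optimizer=Adam(lr=0.001),
              loss="categorical_crossentropy",  # assumed loss for the softmax output
              metrics=["accuracy"])
```

The multichannel variant (M-CNN) would instead use the functional API, feeding the same embedding into three parallel convolutional branches with kernel sizes 3, 4 and 5 and merging their pooled outputs.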
4 EXPERIMENTAL SETUP
In order to perform a comparative study we first constructed a training and test set by crawling job advertisement data from leading Austrian job platforms, i.e., stepstone.at, karriere.at and monster.at. We utilized an incremental crawler which was given a manually constructed list of URLs for each job type. If a particular job advertisement was not yet in the database, a new entry was added; otherwise it was enriched with a new label. The crawler then iterated over a finite number of pages. As seen in Listing 1, an extracted job entry consists of an id, a title, a description in plain text and a list of labels which denote the type of the job.

Dataset. We crawled 5,602 job advertisements in total. On average, a posted job had 1.05 labels assigned to it. In our experiments we focused on four different labels which are mainly used in the Studo Jobs platform, namely: Software, Catering, Technology and Business. As the Studo-specific label Business could not be directly crawled, we derived it by combining job advertisements of the types Management and Marketing.

Figure 2: Occurrences of the four Studo Jobs labels in the crawled dataset. Management and Marketing are merged together into the Business label.

Evaluation. As the crawled dataset is clearly imbalanced (e.g., as seen in Figure 2 the Software label dominates), we further constructed subdatasets for each label to experiment on. Thus, for each label a subset was used containing all the jobs carrying that particular label as well as a random sample of the same size containing other labels. Therefore, the resulting subdataset had 50% of job advertisements containing the evaluated label and 50% without it. The evaluation was performed using Scikit-learn [14] for the Naive Bayes, CART, Random Forest, AdaBoost, Logistic Regression, Linear SVM and SVM-SGD approaches. For the neural networks we utilized Keras [5]. These models were trained and evaluated using a 10-fold stratified cross-validation on each subdataset respectively. In order to finally quantify the prediction performance, we used a set of well-known information retrieval metrics. In particular, we report the prediction performance by means of Accuracy, the F1-measure and the Area Under the ROC curve (AUC) [15].

5 RESULTS
The overall results of the algorithmic comparison can be seen in Table 1. As each subdataset had a different size of the test set (i.e., due to the imbalanced nature of the original dataset), for each approach we report a weighted average in terms of Accuracy, F1-measure and AUC. For example, the F1-measure is calculated as: F1 = \alpha F1(Business) + \beta F1(Technology) + \gamma F1(Catering) + \delta F1(Software), where the weight values correspond to the percentage of the respective label in the dataset (i.e., \alpha = 0.237, \beta = 0.197, \gamma = 0.068 and \delta = 0.497).

Table 1: Average performance across all labels. Each label is weighted with the number of samples in its subdataset.

Algorithm            | Accuracy | F1-measure | AUC
Random Forest        | 0.8017   | 0.7932     | 0.8739
CART                 | 0.8046   | 0.8053     | 0.8054
MLP                  | 0.8224   | 0.8253     | 0.8923
AdaBoost             | 0.8331   | 0.8300     | 0.8952
Naive Bayes          | 0.8470   | 0.8555     | 0.9069
M-CNN                | 0.8571   | 0.8549     | 0.9177
CNN                  | 0.8597   | 0.8604     | 0.9236
Logistic Regression  | 0.8661   | 0.8653     | 0.9335
Linear SVM           | 0.8707   | 0.8709     | 0.9344
SVM-SGD              | 0.8748   | 0.8751     | 0.9329

In general, we found strong accuracy performance for all models (e.g., even the worst performing Random Forest still reached an Accuracy of 0.8017 and an F1 of 0.7932). The best performing approaches were the SVM-based ones, where the linear approach had the best AUC and the one using stochastic gradient descent had the best Accuracy and F1. Interestingly, due to the recent popularization of deep learning approaches, our first assumption was that the models based on CNNs would perform much better than the SVM-based ones. Although still competitive, we assume that their lower accuracy performance stems from the hyperparameters not being tuned enough and from early stopping not being used to cope with overfitting. As such, we hypothesize that there is still much to gain from such approaches by learning these parameters beforehand (e.g., using a nested cross-validation as in [8]) and incorporating a validation set to stop the model training at the most suitable time.
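As an illustration of the evaluation protocol, the sketch below reproduces the per-label setup for the best performing approach, assuming that SVM-SGD corresponds to scikit-learn's SGDClassifier with a hinge loss on top of TF-IDF features (whose default smoothed IDF matches Equation 2). The hyperparameters are library defaults, and the texts and labels stand in for one balanced subdataset; both are placeholders rather than the actual crawled data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Placeholder for one balanced subdataset: 50% with the evaluated label (1), 50% without (0).
texts = ["Erhebung der Anforderungen fuer die Software ...",
         "Event Catering fuer Messen und Kongresse ..."] * 50
labels = np.array([1, 0] * 50)

# TF-IDF features (Equations 1 and 2) followed by a linear SVM trained with SGD (hinge loss).
pipeline = make_pipeline(TfidfVectorizer(), SGDClassifier(loss="hinge"))

# 10-fold stratified cross-validation, reporting Accuracy, F1 and AUC for this label.
scores = cross_validate(pipeline, texts, labels, cv=10,
                        scoring=("accuracy", "f1", "roc_auc"))
print({metric: scores["test_" + metric].mean()
       for metric in ("accuracy", "f1", "roc_auc")})
```

The label-weighted averages reported in Table 1 would then be obtained by combining such per-label scores with the weights \alpha, \beta, \gamma and \delta given above.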
The individual label results in terms of the F1-measure can be seen in Figure 3. We show only the F1-measure due to space restrictions and because the Accuracy values are almost identical in comparison. The best performance was achieved on the Software label using the SVM-SGD approach, with an Accuracy of 0.9193 and an F1 of 0.9196. A contributing factor to this performance is possibly the size of the training set, which was by far the largest for the Software label. An interesting finding is that all of the approaches applied to the Catering label, which had the least training data, performed much better than on the Business and Technology labels. Looking at the data, we think that the reason for this performance difference lies in the broader definition of these label terms. Moreover, the reason that the Business label had the worst performance could lie in the fact that its data comes from a combination of the crawled Management and Marketing labels. It should also be noted that the MLP approach outperformed all others when applied to the Catering label. This suggests that when the right hyperparameters are picked, an increase in performance could still be gained.

Figure 3: Performance results of the utilized binary classification algorithms in terms of the F1-measure for all labels relevant to the Studo Jobs platform.

Overall, the SVM-SGD approach performed best. However, the MLP approach outperformed the others for the Catering label, and the much simpler Naive Bayes had almost the same performance as SVM-SGD for the Technology label. This suggests that a diversified model combination could lead to even better performance.
6 CONCLUSION
In this work we presented the Studo Jobs platform and showed how we plan to tackle the problem of automatically assigning labels to new job advertisements. For that purpose we performed an extensive comparative study between several state-of-the-art text-classification algorithms. Our findings suggest that by utilizing an SVM approach using stochastic gradient descent we can achieve the best performance in terms of Accuracy, F1-measure and AUC. However, our results revealed that deep learning approaches can also improve the prediction performance, especially with the right hyperparameter setup. As such, for our Studo Jobs platform we will use a combination of those binary classifiers which showed the best performance results.

Limitation and Future Work. As already mentioned, one limitation of our work is that we did not extensively explore the impact of choosing the right hyperparameters for the deep learning approaches. Therefore, we plan to extend the study by finding the optimal hyperparameters for each label that is relevant to the Studo Jobs platform (e.g., by setting up a nested cross-validation). In addition, we also plan to extend our comparison study by including other features besides the textual terms and by incorporating methods that adapt algorithms to directly perform multi-label classification. Building on the data enrichment of job advertisements, we further plan to integrate the generated labels and assess their impact on perceived usefulness and navigability for users in a live setting (e.g., by letting users define and store filters to narrow down the search for relevant jobs). Finally, we plan to extend the Studo Jobs platform with personalized recommendations which leverage the automatically generated job labels. We not only want to investigate which approaches (e.g., content-based, collaborative filtering, etc.) benefit the most from such data, but also how to incorporate recent label filters as additional time-dependent contextual cues in order to predict the current job interest. For this we also plan to investigate the recently popularized deep learning approaches (e.g., recurrent neural networks) to see if we can predict the future shift in interest in a job type.

Acknowledgments. This work is supported by the Know-Center and the ISDS Institute of Graz University of Technology. The authors would also like to thank the AVL company, especially Dr. Markus Tomaschitz, for the support in setting up this research project and for giving insights into the job market.
REFERENCES
[1] F. Abel. We know where you should work next summer: job recommendations. In Proceedings of the 9th ACM Conference on Recommender Systems, pages 230-230. ACM, 2015.
[2] Berk. Classification and Regression Trees. Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery, Use R, pages 36-350, 2009.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. 2013.
[4] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[5] F. Chollet and others. Keras. https://github.com/fchollet/keras.
[6] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273-297, 1995.
[7] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. pages 23-37, 1995.
[8] I. Guyon, A. R. S. A. Alamdari, G. Dror, and J. M. Buhmann. Performance prediction challenge. In Neural Networks, 2006. IJCNN'06. International Joint Conference on, pages 1649-1656. IEEE, 2006.
[9] W. Hong, S. Zheng, H. Wang, and J. Shi. A job recommender system based on user clustering. Journal of Computers, 8(8):1960-1967, 2013.
[10] J. Jones, J. Schmitt, et al. A college degree is no guarantee. Technical report, Center for Economic and Policy Research (CEPR), 2014.
[11] Y. Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. pages 1-15, 2014.
[13] R. Liu, Y. Ouyang, W. Rong, X. Song, C. Tang, and Z. Xiong. Computational Science and Its Applications - ICCSA 2016. 9788:453-467, 2016.
[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825-2830, 2011.
[15] D. M. Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
[16] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine Learning, 85(3):333-359, 2011.
[17] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, pages 1-13, 2007.
[18] J. Zavrel, P. Berck, and W. Lavrijssen. Information extraction by text classification: Corpus mining for features. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), 2000.
[19] H. Zhang. The optimality of Naive Bayes. In Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2004), pages 1-6, 2004.
[20] M. L. Zhang and Z. H. Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819-1837, 2014.
[21] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-first International Conference on Machine Learning, page 116, 2004.