<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text Zoning for Job Advertisements with Bidirectional LSTMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ann-Sophie Gnehm</string-name>
          <email>gnehm@soziologie.uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computational Linguistics &amp; Swiss Job Market Monitor University of Zurich</institution>
        </aff>
      </contrib-group>
      <fpage>66</fpage>
      <lpage>74</lpage>
      <abstract>
        <p>We present an approach to text zoning for job advertisements with neural networks. Text zoning refers to segmenting texts into eight classes differing from each other regarding content. It aims at capturing text parts dedicated to particular subjects, e.g. the publishing company or qualifications wanted, and hence facilitates subsequent information extraction. We use BiLSTMs, a class of neural networks particularly suited for sequence labeling. Our best approach, with task-specific word embeddings and ensemble technique, reaches token-level accuracy of 89.8% and outperforms previous approaches with CRFs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Job openings are a highly interesting text type in
several respects: While they show certain
standardization in structure, form and style, their
content is rather diverse and varied. They do not
only depict labor market demand in terms of
occupations and qualifications needed, but also
provide information on hiring strategies,
selfrepresentation of companies, descriptions of
actual job tasks and much more. In this way, job
openings are an excellent source for research on
labor and labor market
        <xref ref-type="bibr" rid="ref2">(Buchmann et al., 2016)</xref>
        .
      </p>
      <p>The Swiss Job Market Monitor (SJMM)1 is
devoted to a systematic monitoring and analysis of
the Swiss labor market. For this purpose, SJMM
has set up a monitor corpus of job openings,
reaching back to 1950. The corpus is based on
representative samples of job ads published in the three
most important media channels – corporate
websites, online job portals and press. Based on the
information in the job opening texts, a whole range
of annotations on characteristics of the job, the
person wanted, and the company offering the job,
is manually added. The SJMM corpus hence
provides a unique and rich database for labor market
research. The aim of this study is to partially
supersede manual annotation by automatic data
processing with supervised machine learning, in
order to lower costs of data collection and enlarge
research opportunities.</p>
      <p>Text zoning here is defined as segmenting the
job advertisement text into zones (or classes), that
differ from each other regarding their content. It
intends at capturing text parts dedicated to
particular entities or subjects such as the job task or the
publishing company (see Table 1 for definitions
and examples). In doing so, it simplifies
information extraction. Firstly, text parts of interest can be
localized more precisely. Secondly, it allows
disambiguating words with more than one sense, or
to decide if text passages actually refer to the
subject of interest. Imagine for instance, that we want
to extract required soft skills for a vacant position:
To know if the keyword “dynamic” refers to the
personality wanted indeed, or to a “dynamic CRM
system” as a working tool instead, simplifies the
information extraction task considerably. In
conclusion, text zoning provides structure to job ad
texts and therefore, represents a substantial
information gain.</p>
      <p>The goal of this work is to develop an
automatic solution to text zoning for job ads. Firstly,
we will test how well manual segmentation can be
reproduced with supervised machine learning for
the corpus as a whole. Since the main goal is to
develop a classifier for future application, the
automatic classification will then, secondly, be
optimized regarding job ads from most recent years
(2010-2014). The task will be tackled with
Recurrent Neural Networks, which recently have proven
to be very effective for a broad range of NLP tasks.
66</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        A previous approach to automatize text zoning of
the SJMM
        <xref ref-type="bibr" rid="ref2 ref4">(Gnehm, 2016)</xref>
        with Conditional
Random Fields (CRFs) and simple features, such as
token uni- and bigrams, part-of-speech (PoS)
bigrams, and the relative position tokens in text,
achieved an accuracy on token level of 87.7%.
      </p>
      <p>Though dealing with the same task in general,
results of Hermes and Schandock (2017) are not
directly comparable to results for text zoning on
the SJMM corpus. Firstly, they classified whole
paragraphs of job ads, not single tokens, and –
as a consequence, to reach adequate description
quality – they allowed multi-label classification.
Furthermore, their classification taxonomy is only
half of the size of the SJMM classification
taxonomy (4 classes vs. 8 classes). Finally, they
classified paragraphs independently from each other and
did not model job ads as sequences of paragraphs.
Their best approach with K-Nearest Neighbors
Algorithms reached an accuracy of 97%.</p>
      <p>The largest part of other research on text zoning
deals with scientific papers or abstracts (also
referred to as argumentative zoning), a smaller part
focuses on more general texts, such as newspaper
articles. Two conclusions for the task at hand may
be drawn from these approaches: Firstly,
modeling the task as sequence tagging, that is to
consider context for the current classification
decision, can improve performance considerably, as
Merity et al. (2009) as well as Hirohata et al.
(2008) showed. Secondly, if we make use of
distributional semantics, it is important to find an
appropriate modeling: Sun et al. (2008) and Brants
et al. (2002) both obtained rather different results
depending on the similarity measure chosen. In
contrast to most argumentative zoning approaches,
sentences are considered too coarse-grained for
a reliable segmentation for our application, as in
many cases a single sentence refers to several
different text zones.
3
3.1</p>
    </sec>
    <sec id="sec-3">
      <title>Data &amp; Methods</title>
      <sec id="sec-3-1">
        <title>Data</title>
        <p>The SJMM corpus comprises over 38,000
manually segmented job ads in German, covering the
time period from 1950 to 2014. Ads are collected
from corporate websites, online job portals and the
press.</p>
        <p>In text zoning, job ads are segmented in eight
different classes, distinguishable from each other
with regard to their contents (see Table 1). The
text is split up on token level, where each token is
assigned exactly one class. Zone 3, which includes
primarily information about the application
procedure, serves also as residual class in case no other
zone seems appropriate. Actually, not every job
ad contains information on every single text zone.
67
For example, most often the reason of the vacancy
remains unknown. On the other hand, zones can
show up several times in one ad, e.g. you will
typically find information on the job description in
several places (see Figure 1). In conclusion, text
zoning for SJMM job ads is modeled as a
onelabel multiclass classification on the level of token
sequences.</p>
        <p>The text type of job ads has undergone
noticeable change over the last decades. For instance,
as Figure 2 illustrates, job ad texts have grown
substantially, from around 30 tokens on average
in 1950 up to over 150 tokens in 2014. In
particular, the job description and the description of
the company increase. Likewise, the amounts of
text referring to required hard and soft skills rise
strongly. Proportionally, all other zones become
less important.</p>
        <p>Figure 2 also reveals a strongly skewed class
distribution: The two largest zones, Z6 and Z3,
comprise around 30% of the tokens each, whereas
Z2, Z4 and Z5 on the other hand have very little
evidence with frequencies lower than 5%. A
baseline classifier that assigns every token the majority
class Z6 reaches an accuracy of 30.5%.</p>
        <p>Furthermore, tokens show a high class
ambiguity: More than 50% of the tokens appear in all
eight classes. For the most part, these tokens are
function words which by their nature usually adopt
class affiliation from neighbored tokens. However,
another 40% of the tokens have been seen in more
than one class as well. This implies that the
classifier cannot rely on token information only.
Instead, it is necessary to consider context
information to assign zones to tokens.
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Methods</title>
        <p>We model text zoning as a sequence labeling task
(see Graves (2012) for further discussion): we tag
sequences of input data (tokens) with sequences of
labels (zone tags), where neither single tokens nor
single tags represent individual data points, but
instead the current token and the current zone tag
strongly depend on preceding and subsequent
tokens or tags, respectively.</p>
        <p>
          As we have a large amount of labeled data at
hand, we train text zoning models with supervised
machine learning. We use BiLSTMs, a class of
recurrent neural networks with two improvements
that make them especially suited for sequence
labeling: Firstly, so-called Long Short-Term
Memories (LSTMs) are better in integrating information
over long sequences, by using so-called gates
(input, output, and forget gates) and “memory cell”
units. Secondly, bidirectional LSTMs consider
context on both sides for the classification of the
actual token
          <xref ref-type="bibr" rid="ref6">(Graves, 2012)</xref>
          .
        </p>
        <p>
          Feature engineering is rather simple, since the
only input to the model are word embeddings.
To use pretrained word embeddings can improve
model performance, as initial values of
parameters (i.e. values of pretrained vectors vs. values
of randomly initialized vectors) influence model
regularization. On the other hand, to train
taskspecific embeddings, where dimensionality
reduction of input vectors is done within the model
optimization process itself, seems to be promising in
light of the relatively large amount of labeled data
available
          <xref ref-type="bibr" rid="ref13 ref5">(Goldberg, 2017)</xref>
          . In our experiments we
compare the performance of these two approaches.
        </p>
        <p>
          As pretrained word embeddings, we feed in
Conceptnet Numberbatch embeddings
          <xref ref-type="bibr" rid="ref15">(Speer and
Chin, 2016)</xref>
          , which achieve measurably higher
performance in word-similarity evaluations than
any previous known system. Conceptnet
comprises almost 130,000 German word embeddings,
each represented as a vector with 300 dimensions.
        </p>
        <p>
          A reasonable coverage of the vocabulary of the
SJMM corpus by the pretrained embeddings is
crucial for using them as input. We conducted
several processing steps to reach this objective,
most important a semantic reduction of compound
words: If there is no embedding for the
(lemmatized) compound word, we search the
embedding for the head, respectively the embedding for
the largest part if there is more than one modifier
(“vermittler” instead of “personalvermittler”,
“dienstleistungsunternehmen” instead of
“mediendienstleistungsunternehmen”). Decompounding has
been done with the GERTWOL tool for
morphological analysis
          <xref ref-type="bibr" rid="ref11">(Koskeniemmi and Haapalainen,
1996)</xref>
          . After all, for three out of four tokens
(respectively for 92% of tokens without numeral,
punctuation and special characters) a pretrained
embedding is available. This coverage seems to
be sufficient for using the embeddings as input.
        </p>
        <p>
          For rare words, that is, words that occur no
more than vfie times in the training data or words
not covered by a pretrained embedding, we use
character-level representations, as in the approach
by Neubig et al. (2017). In doing so, as there is
much more data on character level, the model can
generalize better. Furthermore, this approach also
solves the unknown word problem: We are able to
provide an embedding for every word in the test
set, regardless of whether the word was part of the
training data or not
          <xref ref-type="bibr" rid="ref13 ref5">(Goldberg, 2017)</xref>
          .
        </p>
        <p>
          We use the network architecture of the BiLSTM
PoS tagger by Huang et al. (2015) as starting
point for our experiments2: Bidirectional
tokenlevel LSTMs encode the input, and on top of the
states of the token-level BiLSTMs comes a
Multilayer Perceptron (MLP) with one hidden layer. For
rare tokens, the training process optimizes an
embedding over the characters: BiLSTMs encode the
input, and the concatenated output vectors serve
as input for the token-level BiLSTMs. Network
parameters are randomly initialized and optimized
with Adam algorithm
          <xref ref-type="bibr" rid="ref10">(Kingma and Ba, 2014)</xref>
          .
        </p>
        <p>2We based our experiments on a Python implementation
for this PoS tagger, which can be found here: https://
github.com/clab/dynet_tutorial_examples/
blob/master/tutorial_bilstm_tagger.py
4.1</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments &amp; Results</title>
      <sec id="sec-4-1">
        <title>Experiments</title>
        <p>For all experiments, we split annotated data into a
training set (80% of ads), a development set (10%)
and a final test set (10%). We train models in 50
(or 30) iterations and keep the model that reaches
highest accuracy on the development set.
Accuracy, as well as precision and recall, are assessed
on token level. To validate our most important
results, we run specific model settings vfie times
and report mean accuracy and standard deviation
of model performance over the vfie runs.</p>
        <p>A first series of experiments analyzes whether
word embeddings that are learned with the
training material perform better than pretrained word
embeddings. This relates to the question of how
domain-specific the vocabulary of the SJMM
corpus is. Furthermore, a task specific tuning of
embeddings might be promising considering the large
amount of training data.</p>
        <p>A second series of experiments deals with the
question, whether changing the hyperparameters
can increase model performance, as performance
usually rises with model capacity (see e.g.
Collobert et al. (2011)).</p>
        <p>Thirdly, the main objective is to build a model
that is optimized to segment present-day or future
job ads. As mentioned in Section 3.2, job ads as
a text type have undergone major changes in the
past decades. Thus, it is probable that older
training data lowers model performance. On the other
hand, excluding older job ads results in a smaller
training set. The trade-off between these two
factors, size and up-to-datedness of the training data,
will be tested on a subset of current job ads.</p>
        <p>
          At the very end, automatic text zoning is
optimized by an ensemble technique: Five models
with the same setting, but different (random)
initialization will be trained. As they start from
different points in the hypothesis space, the
classifiers will differ slightly from each other
          <xref ref-type="bibr" rid="ref14 ref3">(Collobert
et al., 2011; Rokach, 2010)</xref>
          . The test set will be
tagged by the vfie classifiers, keeping for every
token the majority vote of the vfie models as final
classification. Especially if agreement between
the models is not very high, we can expect an
increase in model performance.
4.2
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Results</title>
        <p>
          The initial model with the same neural
network architecture previously used by Huang et al.
69
(2015), as described in Section 3.2, reaches an
accuracy of 89.0% on token level. The setting of
hyperparameters in this first experiment is as
following: We use one hidden layer for word-level
and character-level BiLSTMs each, where tokens
are represented with an input vector of 128
dimensions and output vector of 100 dimension
(output vector of forward and backwards LSTMs with
50 dimensions each get concatenated).
Characterlevel BiLSTMs are applied when a token appears
less than six times in the training material.
Characters are represented with 20-dimensional input
vectors. Output vectors of forwards and
backwards LSTMs comprise 64 dimensions each, the
concatenation of 128 dimensions then is handed
over as input to token-level BiLSTMs. MLPs on
top of the BiLSTMs reduce 100-dimensional
output vectors of BiLSTMs to 32 dimensions and
subsequently to 8 final dimensions representing
the classification tags. With this first model, we
achieve already higher accuracy than in a previous
approach with CRFs
          <xref ref-type="bibr" rid="ref2 ref4">(Gnehm, 2016)</xref>
          .
        </p>
        <p>embeddings
task-specific
pretrained, not fine-tuned
pretrained , fine-tuned
accuracy
89.0%
86.5%
88.9%</p>
        <p>s.d.
0.03%
0.05%
0.04%</p>
        <p>Task-specific embeddings help slightly more
than pretrained embeddings, even though the latter
are built from much more data (see Table 2).
Accuracy is clearly lower when using pretrained
embeddings (86.5%) than when training task-specific
embeddings (89.0%). Allowing the pretrained
embeddings to be adapted during training brings a
substantial gain in performance, but still accuracy
is still slightly lower than with task-specific
embeddings (88.9%). These results indicate that a
task-specific training or at least adaption of
embeddings is fruitful for job ad segmentation.</p>
        <p>Character-level embeddings are helpful, as for
all the three models performance is lower, if we
use one and the same embedding for all unknown
or rare words instead. With task-specific
embeddings, we reach a slightly lower accuracy of 88.7%
(vs. 89.0%), if we go without character-level
representations. With pretrained (fine-tuned during
training) embeddings, we reach at most an
accuracy of 84.8% without the character-based
BiLSTMs (vs. 88.9%). The effect is much stronger
for the model with pretrained embeddings, as the
character-level BiLSTM in this case is applied for
the quarter of tokens without a pretrained
embeddings, whereas in the model with task-specific
embeddings, only around 6% of the tokens appear no
more than vfie times in the training material and
hence are represented at character-level. In
conclusion, to use representations on character-level
for rare or unknown words is beneficial. 3
Experiments with the size of the training set
3The most positive effect of character-based embeddings
(accuracy of 86.5% vs. 79.6%) is obviously observed in the
model with pretrained word embeddings that are not adapted
during training process: here, character-based embeddings
are the only way to fine-tune the representation of input data
to the task.
70
show that the amount of labeled data is large
enough to learn a robust model. As can be
seen from Figure 3, the learning curve levels off
strongly as the training set size increases, for
all three models. The effect of the training set
size is somewhat smaller for pretrained
embeddings than for task-specific representations.
Precisely because vectors of Conceptnet embeddings
are pretrained – instead of randomly initialized –
they perform better when using comparatively
little amounts of training data. But subsequently,
task-specific embeddings are able to profit more
from additional training material. For pretrained
embeddings without adaption, the learning curve
shows a similar shape, but on a remarkably lower
level. Since the representations themselves are
fixed, training material can only be used to adjust
the weighting of the representations in the model
which results in poorer performance. In
conclusion, these results suggest that learning
taskspecific representations here is successful given
the large amount of labeled training data. Based
on these results, continuing experiments without
pretrained word embeddings (and time-consuming
preprocessing) and instead training word
embeddings as part of the model parametrization is
recommended.</p>
        <p>Experiments with hyperparameter settings
show that adding a second hidden layer to
the token-level BiLSTMs raises accuracy up to
89.2%. However, adding more layers does not
improve classification, while increasing the training
time considerably. Also, other configurations with
increased hyperparameters – bringing more
capacity to the model (additional layers on
characterlevel BiLSTMs, or higher dimensional input
vectors) – do not boost performance. Therefore, we
continued experiments with only a second layer
added on the token-level BiLSTMs.</p>
        <p>Are older job ads useful to train a model for
present-day and future application? To answer this
question, several models using training data from
different time periods are evaluated on a
development set consisting of 10% of the data from 2010
to 2014.4 As can be seen from Table 3, adding
more and older training data slightly improves
performance, but this holds only for jobs ads
pub4Previous experiments have shown that highest accuracy
is typically reached around iteration 20, and model
performance does barely change after iteration 30. Therefore, for
all experiments described below, the number of training
iterations is set to 30.
lished from 1970 onward. Going back further in
time results in decreasing model performance,
indicating that the effect of more data is partly
weakened by some sort of out-of-domain effect. This
out-of-domain effect can be observed although the
amount of training data from 1950 to 1970 is
considerably smaller than the amount of training data
from 1970 onward.</p>
        <p>time period
accuracy
s.d.</p>
        <p>train set size
ads tokens
5K 1,1M
10K 1,7M
15K 2,1M
20K 2,5M
25K 2,8M
30K 3,1M
32K 3,2M
as of 2010
as of 2000
as of 1990
as of 1980
as of 1970
as of 1960
as of 1950</p>
        <p>However, using older training data does not
lower model performance substantially. Thus,
long term change of job ads as a text type over
time is not a major issue when it comes to text
zoning of job ad texts. Moreover, change seems to be
slow, suggesting that for a future application, we
might not need to manually annotate vast amounts
of training material often. This is clearly an
encouraging observation. The model using training
data from 1970 onward seems to find the optimal
trade-off between a training set that is up to date
and large enough. Hence, this model is selected
for present-day and future application.</p>
        <p>The findings presented above lead to several
important insights for building an optimal model:
First, task-specific embeddings perform better
than the pretrained Conceptnet word embeddings,
given the amount of training data available.
Second, hyperparameters are best set to two layers
for word-level LSTMs, keeping all other
hyperparameters as in the initial setting. Third, for a
future real-world application on current job ads,
we don’t need training data older than 1970. And
fourth, additional detailed analyzes showed that it
is advisable to use one single model for all job
ads, regardless of text length or publication
channel of job ads. Considering all this, a single model
reaches 89.3% accuracy on the development set
containing job ads from 2000 to 2014.
71</p>
        <p>By using an ensemble of classifiers , model
performance further increases. Given the best model
specification according to the experiments
presented above, we trained vfie classifiers with
different (random) initial parameters. As can be seen
from Table 4, the classifiers vary only slightly in
accuracy. Also, the tiny differences in results on
development and test set indicate that models do
not suffer from a development set overfit.
Furthermore, Inter-Annotator Agreement between the vfie
models is very high, as Fleiss’ kappa values over
0.92 indicate. Nonetheless, to combine the
classifiers improves model performance: Using the
majority vote of the vfie models as the final
classification increases accuracy by nearly 0.5 percentage
points compared to the best single model, for the
development set and for the test set.</p>
        <p>accuracy model 1
accuracy model 2
accuracy model 3
accuracy model 4
accuracy model 5
mean acc. model 1-5
s.d.
accuracy with ensembling
Fleiss’ kappa</p>
        <p>Besides, ensembling brings an additional
benefit for real-world applications, as disagreement
between the models can support human quality
control: We can identify tokens without a
majority vote (this applies to 0.5% of the tokens in
the test set) and correct them manually. For the
present case, this post-processing would raise
accuracy above 90%. Manual inspection further
reveals that not all deviations from gold standard are
problematic: Table 5 shows an example for a text
passage where model predictions differ from gold
standard tags only regarding the question of how
detailed to split text into residual text and other
zones. The predicted tags make sense as well
and will not negatively affect subsequent
information extraction. Results of manual inspection
suggest that a considerable part of predictions
differing from gold standard tags are “soft errors” (this
holds for the majority of the 20 cases revised).</p>
      </sec>
      <sec id="sec-4-3">
        <title>Classification results for single text zones</title>
        <p>token gold standard
einer soft skills
exakten soft skills
Person soft skills
bietet job descr.
sich job descr.
hier job descr.
eine job descr.
spannende job descr.</p>
        <p>Aufgabe job descr.
model prediction
soft skills
soft skills
soft skills
admin./resid. text
admin./resid. text
admin./resid. text
job descr.
job descr.
job descr.
show that regarding the most interesting text zones
for labor market research – description of the
company and the job, as well as required hard skills
and soft skills – evaluation measures vary roughly
around 90% (see Table 6). Just for required soft
skills, recall is somewhat lower (84%). The
reason for this latter finding is a question for further
investigation.</p>
        <p>Clearly, the skewed class distribution is a
limiting factor to model performance. Evaluation
measures, notably recall, are relatively low for text
zones that are not so frequent, i.e. reason of the
vacancy, description of job agencies and
incentives. In terms of content, it can be justified to
merge reason of the vacancy (Z2) and incentives
(Z5) after labeling to the description of the job
(Z6). By merging Z2 and Z6, recall for the
resulting new class would rise up to 89.7%, precision
up to 92.1%, and accuracy over all classes would
increase up to 90.1%. If we merge additionally
Z5, we even reach recall of 90.6%, precision of
92.4% and a global accuracy of 90.5%. This
postlabeling merging would imply a rather minor
information loss, also acceptable considering most
important labor market research interests.
zone
1 company description
2 reason of vacancy
3 admin. and residual text
4 job agency description
5 incentives
6 job description
7 required hard skills
8 required soft skills
precision
89.1%
87.0%
93.5%
80.1%
85.4%
89.5%
90.7%
90.2%
recall
89.4%
72.2%
91.3%
65.4%
75.1%
92.1%
91.8%
83.9%</p>
        <p>As an additional experiment, we tried out</p>
      </sec>
      <sec id="sec-4-4">
        <title>Viterbi algorithm for global decoding in order to</title>
        <p>find the optimal sequence of classification labels.
Global decoding is supposed to smooth out label
sequences and adjust tags of ambiguous tokens to
tags of their neighbors. However, Viterbi
decoding does not bring any improvement (accuracy =
89.1%, s.d. = 0.12%). This implies that BiLSTMs
by themselves succeed in using context
information in order to find appropriate label sequences.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <sec id="sec-5-1">
        <title>5.1 Summary</title>
        <p>
          Overall, the results of supervised machine
learning with BiLSTMs for text zoning of job ads are
convincing. For the SJMM corpus as a whole,
accuracy reaches 89.0% with the original
hyperparameter setting and 89.2% with a model with an
additional hidden layer. Optimizing the model in
several steps regarding segmenting job ads from
more recent years (2010-2014), we raise accuracy
up to 89.8% on the final test set. Compared to
results of an earlier approach with CRFs
          <xref ref-type="bibr" rid="ref2 ref4">(Gnehm,
2016)</xref>
          this is an improvement of more than 2
percentage points, or an error rate reduction on the
token level by 16%. These results suggest that we
achieve satisfying results given the task at hand.
        </p>
        <p>The large amount of labeled training material
proves to be highly beneficial. Experiments with
the size and the time period of the training set
show that the total amount of labeled data is
sufficient. When applying the model to most recently
published job ads, adding job ads published before
1970 results in decreasing performance, though
only slightly. Here, the positive effect of more data
might be partially surpassed by some kind of
outof-domain effect.</p>
        <p>Particularly in combination with this large
amount of training data, task-specific embeddings
prove to be at least as effective as pretrained word
embeddings. For a future application, one main
advantage of task-specific embeddings is that we
can do without the (time consuming)
preprocessing needed when using pretrained embeddings. In
conclusion, due to the large amount of training
data, we are able to train a model with simple
feature engineering that generalizes well.</p>
        <p>Results of more in-depth error analysis are
encouraging as well. For the most interesting text
zones regarding labor market research –
description of the company and the job, and required
hard and soft skills – recall and precision are both
around 90%. One limiting factor for model
performance seems to be the skewed class distribution:
The classes with the lowest frequencies (reason of
the vacancy, description of a job agency, and
incentives) show also poorest classification results.
By merging classes after tagging, i.e. adding
reason of the vacancy and incentives to the
description of the job, we can further increase
performance and reach an overall accuracy over 90%,
while incurring only a minor information loss.</p>
        <p>Applying an ensemble technique proves to be
beneficial. Firstly, it brings the largest
performance improvement of all optimization steps per
se, raising accuracy by 0.5 percentage points.
Secondly, ensembling allows us to identify text zones
with low Inter-Model Agreement and to pass on
these text zones to manual revision. Applying this
kind of post-processing to the 0.5% in the test set
without a majority vote is a second option for
raising accuracy over 90%.
5.2</p>
      </sec>
      <sec id="sec-5-2">
        <title>Future Work</title>
        <p>While results so far do not suggest that more
capacity for the current network architecture will
boost performance, the fact that we have not
observed overfitting might indicate that we have not
yet reached the upper limit of model capacity.
Future experiments could test the effect of altering
several model parameters at the same time.</p>
        <p>Options for model improvements include to
pretrain domain-specific word embeddings on a very
large unlabeled dataset of job ads, or to add local
features by using gazetteers on job titles, locations,
company and job agency names.</p>
        <p>Instead of merging certain text zones after
classifying, we could train a model with fewer classes.
This would reduce the skewed class distribution
problem considerably and imply only a minor
information loss for downstream applications.</p>
        <p>Finally, we could also compare BiLSTMs with
other sequence tagging algorithms and use more
sophisticated ensemble methods to improve our
text zoning solution.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>I would like to thank my supervisor Dr. Simon
Clematide for his guidance, helpful advice and for
his support and encouragement. I also thank the
anonymous reviewers for their valuable comments
and suggestions.
73</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Thorsten</given-names>
            <surname>Brants</surname>
          </string-name>
          , Francince Chen, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Tsochantaridis</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Topic-based document segmentation with probabilistic latent semantic analysis</article-title>
          .
          <source>In Proceedings of the 11th international conference on information and knowledge management. ACM</source>
          , pages
          <fpage>211</fpage>
          -
          <lpage>218</lpage>
          . https://doi.org/10.1145/584792.584829.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Marlis</given-names>
            <surname>Buchmann</surname>
          </string-name>
          , Helen Buchs,
          <string-name>
            <surname>Ann-Sophie</surname>
            <given-names>Gnehm</given-names>
          </string-name>
          , Debra Hevenstone, Urs Klarer, Marianne Mueller, Stefan Sacchi, and
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Salvisberg</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Swiss Job Market Monitor: Scientific Use File Documentation (Release</article-title>
          <year>2016</year>
          ). University of Zurich: Swiss Job Market Monitor. Distributed
          <string-name>
            <surname>by</surname>
            <given-names>FORS</given-names>
          </string-name>
          , Lausanne (http://forsbase.unil.ch).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Ronan</given-names>
            <surname>Collobert</surname>
          </string-name>
          , Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, and
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Kuksa</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Natural language processing (almost) from scratch</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          :
          <fpage>2493</fpage>
          -
          <lpage>2537</lpage>
          . https://dl.acm.org/citation.cfm?id=
          <fpage>2078186</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Ann-Sophie Gnehm</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Erschliessung des Korpus Stellenmarkt-Monitor Schweiz und Automatisierte Textsegmentation mit Supervised Machine Learning. (Unveroeffentlichte Dokumentation Programmierprojekt)</article-title>
          .
          <source>Universitaet Zuerich: Institut fuer Computerlinguistik.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Yoav</given-names>
            <surname>Goldberg</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Neural Network Methods in Natural Language Processing</article-title>
          . Morgan &amp; Claypool Publishers, San Rafael.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Alex</given-names>
            <surname>Graves</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Supervised sequence labelling with recurrent neural networks</article-title>
          . Springer, Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Juergen</given-names>
            <surname>Hermes</surname>
          </string-name>
          and
          <string-name>
            <given-names>Manuel</given-names>
            <surname>Schandock</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Stellenanzeigenanalyse in der Qualifikationsentwicklungsforschung: Die Nutzung maschineller Lernverfahren zur Klassifikation von Textabschnitten</article-title>
          .
          <source>Bundesinstitut fuer Berufsbildung</source>
          , Bonn.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Kenjii</given-names>
            <surname>Hirohata</surname>
          </string-name>
          , Naoaki Okazaki, Sophia Ananiadou, and
          <string-name>
            <given-names>Mitsuru</given-names>
            <surname>Ishizuka</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Identifying sections in scientific abstracts using conditional random fields</article-title>
          .
          <source>In Proceedings of the Third International Joint Conference on Natural Language Processing. Asian Federation of Natural Language Processing (AFNLP)</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>381</fpage>
          -
          <lpage>388</lpage>
          . http://www.aclweb.org/anthology/I08-1050.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Zhiheng</given-names>
            <surname>Huang</surname>
          </string-name>
          , Wei Xu,
          <string-name>
            <given-names>and Kai</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Bidirectional LSTM-CRF models for sequence tagging</article-title>
          . arXiv. https://arxiv.org/abs/1508.
          <year>01991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>CoRR abs/1412</source>
          .6980. http://arxiv.org/abs/1412.6980.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Kimmo</given-names>
            <surname>Koskeniemmi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mariikka</given-names>
            <surname>Haapalainen</surname>
          </string-name>
          .
          <year>1996</year>
          . GERTWOL - Lingsoft
          <string-name>
            <surname>Oy</surname>
          </string-name>
          . In Roland Hausser, editor,
          <source>Linguistische Verifikation: Dokumentation zur Ersten Morpholympics</source>
          <year>1994</year>
          , Niemeyer, Tbingen,
          <source>number 34 in Sprache und Information</source>
          , pages
          <fpage>121</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Merity</surname>
          </string-name>
          , Tara Murphy, and
          <string-name>
            <surname>James</surname>
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Curran</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Accurate argumentative zoning with maximum entropy models</article-title>
          .
          <source>In Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries. Association for Computational Linguistics</source>
          , pages
          <fpage>19</fpage>
          -
          <lpage>26</lpage>
          . http://aclweb.org/anthology/W09-3603.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Graham</given-names>
            <surname>Neubig</surname>
          </string-name>
          , Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Callesteros, David Chiang,
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Clothiaux</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Cohn</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>DyNet: The Dynamic Neural Network Toolkit</article-title>
          . arXiv. https://arxiv.org/abs/1701.03980.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Lior</given-names>
            <surname>Rokach</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Ensemble-based classifiers</article-title>
          .
          <source>Artificial Intelligence Review</source>
          <volume>33</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>39</lpage>
          . https://doi.org/10.1007/s10462-009-9124-7.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Robert</given-names>
            <surname>Speer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Joshua</given-names>
            <surname>Chin</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>An ensemble method to produce high-quality word embeddings</article-title>
          . arXiv. https://arxiv.org/abs/1604.01692.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Qi</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Runxin</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dingsheng</given-names>
            <surname>Luo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xihong</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Text segmentation with lda-based fisher kernel</article-title>
          .
          <source>In Proceedings of the 46th Annual</source>
          <article-title>Meeting of the Association for Computational Linguistics on Human Language Technologies (Short Papers)</article-title>
          .
          <source>Association for Computational Linguistics</source>
          , pages
          <fpage>269</fpage>
          -
          <lpage>272</lpage>
          . https://doi.org/10.3115/1557690.1557768.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>