<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Information Extraction for Semi-structured Email Corpora</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Science Media Center</string-name>
          <email>firstname.lastname@sciencemediacenter.de</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cologne</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Germany</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TH Koln - University of Applied Sciences</institution>
          ,
          <addr-line>Cologne</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Information extraction is a requirement for enhanced IR techniques. To surpass rigid extraction rules in wrappers based on XPath or CSS selectors, we present a new extraction method extending the FleXPath method that was used for structured XML retrieval in INEX. We expand this method to work with semi-structured HTML and present a case study and a short evaluation based on a corpus of emails from scientific publishers.</p>
      </abstract>
      <kwd-group>
        <kwd>Information extraction</kwd>
        <kwd>Semi-structured documents</kwd>
        <kwd>Emails</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Information extraction is a requirement and a necessity for enhanced information retrieval techniques like entity-based search, semantic search, or other approaches that make use of properly structured and annotated information [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Searching for or within structured information like XML documents (INEX) allows semantic annotations to be used directly, but usually, source documents are not semantically structured at all and only allow full-text search. In this work, we use a newsletter corpus that is not formally structured but semi-structured, with some recurring parts of the mails (like titles, dates, or authors). We want to extract this information to allow rich IR techniques like filtering, browsing, or ranking based on these semantic annotations.
      </p>
      <p>
        Usually, information extraction [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] relies on structural layouts and syntactical patterns [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] within the source documents. Wrappers are used to exploit these structures. Wrappers use pattern matching procedures and heavily rely on predefined or learned extraction rules [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. A more general approach for wrapper construction is XPath [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or derivations like OXPath [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Both systems are used to address and locate parts and nodes in XML/HTML documents and to extract their content. Especially for HTML documents, CSS selectors are an alternative to XPath [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        A common issue in real-life web information extraction is that XPath or CSS-based extraction rules are not flexible and are vulnerable to even slight changes in the source documents' structures. Due to this, the process of adjusting an existing wrapper to new requirements is costly. Amer-Yahia et al. introduced a technique called FleXPath [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to overcome this issue. They developed a mixture of structured XPath and full-text XQuery-based search techniques to extract information from structured XML documents. Thus, FleXPath allows both database-style querying and full-text search. The results from both query approaches are scored and ranked on structural and full-text search features. This approach was successfully evaluated in INEX [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>We argue that automated web information extraction could benefit from this idea and that the FleXPath approach could be expanded to semi-structured HTML documents. Our main goal is to yield better coverage for structured information extraction. Due to the nature of XPath or CSS-based selectors, any structural change to the source of extraction will affect the quality of the returned data. This paper aims to create and to test an information extraction concept for semi-structured documents.</p>
      <p>We will demonstrate the feasibility of our approach by implementing our algorithm in a demonstrator called FleXipy and testing it against a sample document collection containing different types of semi-structured HTML data. We will test whether our approach results in a more robust and more complete outcome than "normal" XPath/CSS-based extraction frameworks.</p>
    </sec>
    <sec id="sec-2">
      <title>FleXipy - Information Extraction for HTML</title>
      <p>In this section, we present FleXipy, an extraction mechanism that incorporates both path-based and text-based features. FleXipy always starts with a strict path-based extraction rule that was configured by the wrapper generator. Only if this strict extraction rule fails to verify and returns an empty node are the additional extraction features activated. First, we generate a list of possible node candidates by looking in the neighborhood of the originally configured path. In a second step, we use different text-based features to rank the node candidates to find the best-matching node and path to correct the broken extraction rule.</p>
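      <p>The control flow described above can be sketched as follows. This is a minimal sketch with hypothetical callables standing in for the path-based and text-based modules; the actual FleXipy implementation is not reproduced in this paper.</p>
      <preformat><![CDATA[
```python
# Minimal sketch of the FleXipy control flow (hypothetical helper callables).
def flexipy_extract(extract_strict, find_candidates, rank_candidates):
    """Strict path rule first; fall back to candidate search only on failure."""
    text = extract_strict()               # rigid XPath/CSS extraction rule
    if text:                              # rule verified -> no fallback needed
        return text
    candidates = find_candidates()        # path-based modules (template/subtree search)
    ranked = rank_candidates(candidates)  # text-based scoring and ranking
    return ranked[0] if ranked else None
```
]]></preformat>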
      <sec id="sec-2-1">
        <title>Main Components of FleXipy</title>
        <p>The core of FleXipy is divided into three major components: (1) modules for path-based tree interactions that use rigid patterns like XPath, (2) modules for textual, content-based interactions that use full-text search patterns, and finally (3) a module to rank the node candidates.</p>
        <p>Path-based Modules There are two different methods for finding node candidates with FleXipy. One requires a full XPath to the node where the desired text should be found. The other requires an XPath that selects a structural description to look for anywhere in the DOM tree. When given a full XPath, we use it as a template. This method is useful in cases where the desired text should be in this node but moved to a nearby sibling due to minimal structural changes. Using a limited breadth-first search on the DOM tree followed by a depth-limited search, we check all siblings for existing text, and their full XPaths are added to the list of possible candidates. In FleXipy, this is called template search.</p>
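        <p>The template search can be sketched as follows. We use a hypothetical minimal Node class standing in for a DOM element (real code would operate on lxml elements); the breadth and depth limits are illustrative assumptions.</p>
        <preformat><![CDATA[
```python
# Sketch of the template search: collect text-bearing siblings (and their
# shallow descendants) of the node the strict XPath pointed at.
from collections import deque

class Node:
    """Hypothetical stand-in for a DOM element with parent/child links."""
    def __init__(self, tag, text=None, children=()):
        self.tag, self.text, self.children = tag, text, list(children)
        self.parent = None
        for c in self.children:
            c.parent = self

def template_search(target, max_breadth=3, max_depth=2):
    candidates = []
    if target.parent is None:
        return candidates
    siblings = [s for s in target.parent.children if s is not target]
    queue = deque((s, 0) for s in siblings[:max_breadth])   # breadth limit
    while queue:
        node, depth = queue.popleft()
        if node.text and node.text.strip():                 # text-bearing node
            candidates.append(node)
        if depth < max_depth:                               # depth limit
            queue.extend((c, depth + 1) for c in node.children)
    return candidates
```
]]></preformat>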
        <p>The second method requires a structural description of a subtree to search for in the DOM tree, e.g., //tbody/tr/td/h1. Any found expression containing text will be considered a candidate for further analysis. In cases where no candidate can be found, it is recommended and supported to reduce the size of the subtree to enlarge the search space. This is called subtree search.</p>
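        <p>The subtree search idea can be sketched with the standard library's ElementTree path syntax; the function name is ours, not FleXipy's public API.</p>
        <preformat><![CDATA[
```python
# Sketch of the subtree search: match a structural pattern anywhere in the
# tree; every text-bearing match becomes a node candidate.
import xml.etree.ElementTree as ET

def subtree_search(root, steps):
    """steps is a structural description such as ["tbody", "tr", "td", "h1"];
    './/' anchors it at any depth, like //tbody/tr/td/h1 in XPath."""
    matches = root.findall(".//" + "/".join(steps))
    return [m for m in matches if m.text and m.text.strip()]
```
]]></preformat>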
        <p>The two path-based searches return a list of candidates. We calculate the distances between the candidates in the DOM tree and the originally configured XPath expression. This distance is normalized for the template search. For the subtree search, no distance can be calculated; therefore the distance is set to d = 0. All other distances contribute to the path score as follows: p = 1 − d/d_max. A smaller distance leads to a higher path score, with 0 ≤ p ≤ 1.</p>
        <p>Text-based Modules Text- and content-based modules are components that test the content of all candidates found by the path-based modules. These modules represent a verification process. Depending on the configured rules, the interaction can be expressed using the Ratcliff/Obershelp fuzzy string matching algorithm or a combination of rules that verify the structural features of the extracted text. These rules can match text features like length and formatting, check for prefixes, or use text similarity scores for text that may only change slightly across sources. Every configured rule symbolizes a verification task for the extracted text and should "penalize" candidates for not passing a pattern rule. The text score is t = 1 − f/n, where f is the number of failed verifications and n is the number of configured rules. The result of this process is a score of 0 ≤ t ≤ 1 for all text-based features.</p>
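        <p>The two scores can be transcribed directly as plain functions (the names are ours; the edge-case handling for zero denominators is an assumption):</p>
        <preformat><![CDATA[
```python
# The path score p = 1 - d/d_max and text score t = 1 - f/n as defined above.
def path_score(d, d_max):
    """Candidates closer to the configured XPath score higher, 0 <= p <= 1."""
    return 1.0 if d_max == 0 else 1.0 - d / d_max

def text_score(f, n):
    """f failed verifications out of n configured rules penalize a candidate."""
    return 1.0 if n == 0 else 1.0 - f / n
```
]]></preformat>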
        <p>Candidate Ranking The two module types return two scores for every node candidate: p for all path-based features and t for all textual features. A simple linear combination of both normalized scores p and t then leads to a final ranking R (Eq. 1):</p>
        <p>R = p · (1 − α) + t · α
(1)
where 0 ≤ α ≤ 1 is usually set to 0.7. The value for α is a purely heuristically determined best-practice value that works for the test data sets in our case study (see Section 3).</p>
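        <p>Eq. 1, as we read it from the garbled source (with α weighting the text score and 1 − α the path score), transcribes to:</p>
        <preformat><![CDATA[
```python
# Eq. 1 as a function; alpha = 0.7 is the paper's heuristic default.
# The placement of alpha on the text score reflects our reading of the
# (typographically damaged) original equation.
def rank_score(p, t, alpha=0.7):
    """Linear combination R = p * (1 - alpha) + t * alpha, scores in [0, 1]."""
    return p * (1.0 - alpha) + t * alpha
```
]]></preformat>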
      </sec>
      <sec id="sec-2-2">
        <title>Example Extraction with FleXipy</title>
        <p>To further clarify how the FleXipy algorithm works, we will showcase a step-by-step extraction in comparison to a conventional XPath-based extraction approach. The headlines, in this case, are formatted using CSS, and we also assume that headlines have a minimum length of 30 characters. Therefore, we can make use of functions provided by the FleXipy framework addressing these characteristics. These directives will be added to the FleXipy configuration file.</p>
        <p>
          During a simulated extraction process, we got an email where the title can be found not in //div/p[7]/span[1] but in //div/p[8]/span[1]. Using a normal XPath-based wrapper, the extraction will fail since the path has changed. The FleXipy framework will first check the given XPath expression against the configured directives. If they match, FleXipy is done; if not, FleXipy will use its path-based modules to look in the surroundings of the configured XPath for nodes that contain any form of text, assuming that the wanted text only moved slightly. In this case, the mentioned template search will look for siblings of the p and span nodes, ultimately also covering the wanted text. When reaching the configured limits, all found nodes containing text will then be checked against the text-based directives. Combining the distance and the probability that a found text is the wanted text, the framework will then calculate a ranking of candidates and provide it for further processing.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>A Case Study on FleXipy for Emails</title>
      <p>To test the feasibility of the FleXipy approach, we present the results of a case study with semi-structured email texts. The source of our test collection is a set of extracted email bodies in HTML from the Google Digital News Initiative-funded PRepublicatIOn Radar (PRIOR) project (https://ir.web.th-koeln.de/projects/prior/). The background of PRIOR is to enable science journalists to keep up with the latest scientific research in relevant domains of knowledge. Scientific publishers offer science journalists exclusive access to prepublications of upcoming research articles under a very strict embargo. These prepublications are distributed exclusively via email and are very heterogeneous by nature, as each publisher has its own format. The PRIOR project uses a fully automated extraction pipeline written in Python that utilizes the web crawling framework Scrapy and XPath-based wrappers for information extraction. The daily work with these emails shows that these wrappers are highly vulnerable to deviations in the source data, which is the case for most of the emails. Most of the time, the emails are highly unstructured or do not share a common set of formatting rules, making them nearly impossible to process by standard wrappers. These issues make the unstructured and heterogeneous email bodies from the PRIOR project a perfect test bed for evaluating the FleXipy approach.</p>
      <sec id="sec-3-1">
        <title>Evaluation Setup</title>
        <p>We built a small test collection out of 75 PRIOR emails from the four scientific journals Science, Jama, Lancet, and Cell. Each email may contain many embargo announcements that consist of different metadata fields. We manually identified and annotated 202 of these metadata fields (also called slots) within the source emails. The fields were of the four following types: embargo dates, contact details, titles of the embargoed articles, and short abstracts. For each field, we manually defined the correct XPath to extract the information to test the different system configurations.</p>
        <p>Two different configurations were part of our evaluation: a simple XPath-based extraction without any modifications (called baseline) and a full-featured FleXipy configuration (called Full-FleXipy). The baseline represents the standard extraction process done by frameworks like Scrapy. From each journal, we randomly took one email as training data to manually extract the XPath expression and used it to create a simple wrapper for the information extraction. The same XPath expression was later configured in FleXipy. The training emails were removed from the evaluation corpus. These extracted XPath expressions were then used as a rigid set of extraction rules for the remaining emails. Since emails are heterogeneous by nature, the results of this simulated extraction process are expected to be insufficient. After generating our baseline, we created additional text-based rule sets (like prefix patterns) for the training data and activated the path-based modules.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation Metrics</title>
        <p>
          We compared the results of FleXipy against manually annotated data for all slots. The comparison relies on two different measurements: F1 and the slot error rate [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] (SER). When evaluating, we have four different kinds of results for a slot: (1) we correctly extracted a slot (hit), (2) we have slots which contain incorrect data (substitution), (3) we have slots without any data (deletion), and (4) slots which were configured and extracted but cannot be found in the source data and should not have been extracted (insertion).
        </p>
        <p>Returned slots are either a hit, a substitution, or an insertion, but only slots which count as a hit are relevant in terms of precision and recall and therefore F1. We decided to use the slot error rate as an additional measurement, as it introduces a performance measure for information extraction that gives adequate weight to the different error types. It is defined as follows (Eq. 2):
SER = (substitutions + deletions + insertions) / (hits + substitutions + deletions)
(2)
where a lower SER value indicates a better extraction performance.</p>
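        <p>The two measurements transcribe directly into code (function names are ours); note that returned slots are hits, substitutions, and insertions, while the reference contains hits, substitutions, and deletions:</p>
        <preformat><![CDATA[
```python
# SER (Eq. 2) and F1 computed from the four slot outcome counts.
def slot_error_rate(hits, substitutions, deletions, insertions):
    """Errors divided by the number of slots in the reference data."""
    return (substitutions + deletions + insertions) / (hits + substitutions + deletions)

def f1(hits, substitutions, deletions, insertions):
    """F1 from slot counts: only hits count as correct returns."""
    returned = hits + substitutions + insertions   # everything the system emitted
    in_reference = hits + substitutions + deletions
    precision = hits / returned if returned else 0.0
    recall = hits / in_reference if in_reference else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```
]]></preformat>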
      </sec>
      <sec id="sec-3-3">
        <title>Findings</title>
        <p>We evaluated four different entry types (embargo dates, contact information, titles, and abstracts) for four different scientific journals (Science, Jama, Lancet, and Cell). We report SER and F1-scores (see Table 1). The statistical significance of differences in the average performance is determined using a two-sided Student's t-test. Significant improvements over the baselines (p &lt; 0.05) are indicated with an asterisk.</p>
        <p>[Table 1: SER and F1 scores for the baseline and Full-FleXipy per journal (Science, Jama, Lancet, Cell, avg) and entry type.]</p>
        <p>We see clear improvements in the average extraction performance for all four extracted entry types on both F1 and SER. Especially the contact information shows a significant improvement, from an average SER of 1.06 to 0.17 and an improvement of F1 from 0.1 to 0.67. The other improvements are still evident but not statistically significant. A small increase can also be seen for the most heterogeneous slot defined, increasing the average F1 for abstracts from 0.37 to 0.43. Only one single slot entry type for one single journal had a loss compared to the baseline: the embargo dates from the Science journal. Here, the SER value rose from 0.38 to 0.44 while the F1 still increased from 0.69 to 0.72.</p>
        <p>We can also see that the information saved in the embargo slot was extracted perfectly for three of the four providers. For Cell, FleXipy was capable of finding all missing titles and contacts and fixing them accordingly. For Jama, there is no change in the result at all, since the baseline result was already near perfect from the beginning.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion and Outlook</title>
      <p>We presented and evaluated an information extraction method to complement simple XPath or CSS-based wrappers. We implemented our approach, which incorporates path-based and text-based extraction rules and a candidate ranking module, in a demonstrator called FleXipy. The evaluation on email newsletters from different scientific publishers and journals showed clear improvements compared to a simple XPath baseline. The improvements are possible because FleXipy can search for alternative candidates when the original XPath selector would only return an empty node (deletions). Additionally, FleXipy could be used to correct the two other extraction error types, substitutions and insertions, as the text-based modules allow checking on additional features (like text patterns) rather than only on path features.</p>
      <p>Although the evaluation showed a clear improvement and returned nearly perfect results for some entry types, we have to emphasize the preliminary character of this evaluation. It is only a case study with 202 manually annotated slots, resulting in a rather small test collection. To get a more reliable result, the size of the collection should be increased. Also, the configured rules used to verify the data can be improved by increasing the size of the training data.</p>
      <p>An interesting result is the opposing performance of SER and F1 for the embargo entries in Science. This is the case when the extraction results shift from cases of substitutions to cases of deletions. This leads to a decreasing number of returned documents and therefore influences precision and recall. The increase in F1 does not clearly describe an overall increase in the extraction result, since there could still be fewer correctly extracted slots. These contradicting results for the embargo dates in Science are a compelling case that demonstrates the justification for using SER to complement the precision/recall-fixed perspective of F1. While we increase the number of relevant slots in the result, we introduce more and different error types.</p>
      <p>The overall results of this case study are promising for the further development of this approach to improve web information extraction. FleXipy is not a complete extraction framework (like, e.g., Scrapy) but can be part of such frameworks to complement XPath or CSS-based wrappers. For productive use, this framework should be extended with more possibilities to write extraction rules for the source data.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work was part of the PRIOR project, which was funded by the Google Digital News Initiative (https://newsinitiative.withgoogle.com/dnifund/dni-projects/priorprepublication-radar-round-4/).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Amer-Yahia, S., Lakshmanan, L.V.S., Pandit, S.: FleXPath: flexible structure and full-text querying for XML. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD '04, pp. 83-94. ACM, New York, NY, USA (2004). https://doi.org/10.1145/1007568.1007581</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Amer-Yahia, S., Lalmas, M.: XML search: languages, INEX and scoring. ACM SIGMOD Record 35(4), 16-23 (Dec 2006). https://doi.org/10.1145/1228268.1228271</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Balog, K.: Meet the Data. In: Balog, K. (ed.) Entity-Oriented Search, pp. 25-53. The Information Retrieval Series, Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-93935-3_2</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Chamberlain, S., Ram, K., Grolemund, G.: Extracting data from the web: APIs and beyond. In: The R User Conference 2016. Stanford University, Stanford, California (2016), https://github.com/ropensci-training/user2016-tutorial/blob/master/slides.pdf</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Transactions on Knowledge and Data Engineering 18(10), 1411-1428 (2006)</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Jurafsky, D., Martin, J.H.: Speech and Language Processing, vol. 3. Pearson, London (2014)</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Makhoul, J., Kubala, F., Schwartz, R., Weischedel, R., et al.: Performance measures for information extraction. In: Proceedings of DARPA Broadcast News Workshop, pp. 249-252. Herndon, VA (1999)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Melton, J., Buxton, S.: Querying XML: XQuery, XPath, and SQL/XML in Context. The Morgan Kaufmann Series in Data Management Systems, Elsevier Science (2011), https://books.google.de/books?id=EuYRXgDqVp0C</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Michels, C., Fayzrakhmanov, R.R., Ley, M., Sallinger, E., Schenkel, R.: OXPath-based data acquisition for DBLP. In: 2017 ACM/IEEE Joint Conference on Digital Libraries, JCDL 2017, Toronto, ON, Canada, June 19-23, 2017, pp. 319-320. IEEE Computer Society (2017). https://doi.org/10.1109/JCDL.2017.7991609</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Neumann, M., Steinberg, J., Schaer, P.: Web-Scraping for Non-Programmers: Introducing OXPath for Digital Library Metadata Harvesting. Code4Lib Journal 2017(38) (Oct 2017), https://journal.code4lib.org/articles/13007</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>