<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visualizing the addition of missing words to regular expressions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas Rebele</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katerina Tzompanaki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabian Suchanek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ETIS / ENSEA, University of Cergy-Pontoise</institution>
          ,
          <addr-line>CNRS, 95000, Cergy-Pontoise</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Télécom ParisTech</institution>
          ,
          <addr-line>75013 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Regular expressions are used in many information extraction systems like YAGO, DBpedia, Gate and SystemT. However, they sometimes do not match what their creator wanted to find. We investigate how missing words can be added automatically to a regular expression by creating disjunctions at the appropriate positions. Our demo visualizes the steps that our algorithm employs to repair the regular expression.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Regular expressions (regexes) are a popular tool in information extraction. While
it is possible to learn regexes automatically from positive and negative examples,
many projects craft their regexes manually. This applies, e.g., to Semantic Web
projects such as YAGO and DBpedia, but also to systems such as Gate and
SystemT. One of the reasons may be that the regexes contain human expertise
that goes beyond the information contained in training data.</p>
      <p>
        One of the challenges in such contexts is to adapt a regex so that it matches
a word that was previously missed. Consider, e.g., the simple regex [A-Za-z]+
\d\d, \d\d\d\d|\d\d/\d\d/\d\d\d\d. It covers dates that spell out the month, and
dates that consist of digits separated by slashes. However, it will not match “1/8/1935”.
One way to solve this is obviously to just add this word as a disjunction – but that
would not generalize to other, similar words. To repair the regex properly, we
would first have to identify where the word finds its largest approximate match in
the regex (on the right hand side of the disjunction). Then, we have to perform a
minimal surgical operation to make two digits optional. Finally, we could factor
out the 4 digits of the year. In this paper, we present our work on an algorithm
that does all of this automatically. In the example, we obtain ([A-Za-z]+ \d\d,
|(\d{1,2}/){2})\d{4}. Our demo allows the user to visualize these changes.
Related work. Several approaches [
        <xref ref-type="bibr" rid="ref2 ref3 ref8">3, 2, 8</xref>
        ] learn a regex from scratch. Our
approach, in contrast, modifies a given regex. Other approaches remove false
positives [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], while we aim to add false negatives. Again others [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] rely on user
feedback, while we aim at a fully automated solution. Other work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] generalizes
individual characters, while we aim to change the structure of the regex.
      </p>
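<p>The repair can be checked concretely. The following is a minimal sketch in Python (an illustration; the paper does not prescribe an implementation language) that tests the original and the repaired regex against the missing word:</p>

```python
import re

# The regexes from the example above (illustrative transcription).
ORIGINAL = r"[A-Za-z]+ \d\d, \d\d\d\d|\d\d/\d\d/\d\d\d\d"
# Repaired regex: two digits became optional, and the 4-digit year
# was factored out of the disjunction.
REPAIRED = r"([A-Za-z]+ \d\d, |(\d{1,2}/){2})\d{4}"

for word in ["May 08, 1935", "12/08/1935", "1/8/1935"]:
    print(word,
          "original:", bool(re.fullmatch(ORIGINAL, word)),
          "repaired:", bool(re.fullmatch(REPAIRED, word)))
# "1/8/1935" is matched only by the repaired regex; the other two words
# are matched by both, so no previously covered word is lost.
```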
    </sec>
    <sec id="sec-2">
      <title>Algorithm</title>
      <p>Each regex r is associated to a syntax tree, whose leaf nodes are characters or
character classes. We say that a leaf node is to the left of another leaf node, if
the latter can be reached from the former in the Glushkov automaton of r. We
write L(r) for the language of r. Given a regex r and a word w, our goal is to
construct a regex r′ such that w ∈ L(r′) ∧ L(r) ⊆ L(r′).</p>
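<p>This goal can be sketched as a testable predicate. The helper below is hypothetical (it is not part of the paper's algorithm): since language inclusion cannot be decided with Python's re module alone, it approximates the inclusion with a finite sample of words. The date regexes are illustrative:</p>

```python
import re

def is_valid_repair(r, r_prime, w, samples):
    """Approximate check of the repair goal: w must match r', and every
    sample word accepted by r must still be accepted by r'. This only
    approximates the language inclusion on the given finite sample."""
    if not re.fullmatch(r_prime, w):
        return False
    return all(re.fullmatch(r_prime, s)
               for s in samples
               if re.fullmatch(r, s))

ok = is_valid_repair(
    r"[A-Za-z]+ \d\d, \d\d\d\d|\d\d/\d\d/\d\d\d\d",  # r
    r"([A-Za-z]+ \d\d, |(\d{1,2}/){2})\d{4}",        # r', the repaired regex
    "1/8/1935",                                      # w, the missing word
    ["May 08, 1935", "12/08/1935"],                  # sample words from L(r)
)
print(ok)  # True
```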
      <p>
        We first find a maximal matching of w to r, i.e., a partial function from
the positions of w to the leaf nodes of r. For this purpose, we first unfold the
syntactic sugar of the regex, so that it contains only Kleene stars, disjunctions,
and concatenations. Then, we use Myers’ approximate matching algorithm [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
Like the well-known Wagner-Fischer algorithm for the edit distance, Myers’
algorithm uses a matrix. The rows correspond to the characters of the word, and
the columns correspond to leaf nodes of the regex (left to right). This gives us a
mapping from positions in the string to leaf nodes of the regex.
The next step is to repair the regex. Algorithm 1 takes as input a regex r, a word
w, and a matching m. It runs left to right through the word (Line 1). For each
mapped character position, it finds the next mapped character position (Line 2),
and the lowest common ancestor (lca) of both leaf nodes in the syntax tree of r
(Line 3). In a Kleene star, it may happen that the next character is mapped to a
leaf node that lies to the left of the leaf node of the current character. Consider,
e.g., the regex (abc)* and the word “abfabc”. Here, the second “a” is mapped to a
leaf node that lies to the left of the node of “b”: the links cross. In such a case, we
find the next character that is mapped to the right of the current one (Line 5),
and we add the obstructing substring as an optional concatenation (Line 7). In
the example, we obtain (ab(fab)?c)*. If no such fix can be found, we remove
the mapping (Line 10) and proceed. If the next character is mapped to a regex
leaf node that succeeds the current leaf node, we trace the path from the left
leaf node to the lca, and make all children to the right of that path optional
(Lines 14-16). We proceed analogously with the right leaf node (Lines 17-19).
Finally, we insert the unmatched substring below the lca (Lines 20-21).
      </p>
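<p>The matrix computation can be illustrated in the plain-string special case. The sketch below is a standard Wagner-Fischer edit-distance matrix (shown only for illustration; Myers’ algorithm generalizes the columns from pattern characters to regex leaf nodes), followed by a check of the crossing-links repair from the example:</p>

```python
import re

def edit_matrix(word, pattern):
    """Wagner-Fischer dynamic-programming matrix. Rows correspond to the
    characters of the word, columns to the characters of the pattern; in
    Myers' algorithm, the columns are the leaf nodes of the regex instead."""
    rows, cols = len(word) + 1, len(pattern) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # deletions from the word
    for j in range(cols):
        d[0][j] = j                      # insertions into the word
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if word[i - 1] == pattern[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete a word character
                          d[i][j - 1] + 1,         # insert a pattern character
                          d[i - 1][j - 1] + cost)  # match or substitute
    return d

print(edit_matrix("kitten", "sitting")[-1][-1])  # 3

# The crossing-links repair from the text: (abc)* does not match "abfabc",
# but the repaired regex (ab(fab)?c)* matches it, as well as every word
# that (abc)* already matched.
print(bool(re.fullmatch(r"(abc)*", "abfabc")))        # False
print(bool(re.fullmatch(r"(ab(fab)?c)*", "abfabc")))  # True
print(bool(re.fullmatch(r"(ab(fab)?c)*", "abcabc")))  # True
```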
    </sec>
    <sec id="sec-3">
      <title>Demo</title>
      <p>
        Our demo allows the user to visualize the regex repairs that we propose. In the
beginning, the user can enter a regex and a missing word. For example, the user
can choose the regex \d{2}/\d{2}/\d{4}, and the missing word “1/8/1935”. The
demo then proceeds to show the user the maximal matching between the regex
and the missing word, as computed by Myers’ algorithm (Figure 1 (left)). In
the following, the user can walk step by step through the modifications that our
algorithm proposes. In the example, the algorithm will first propose to make the
first digit optional, and then the third (Figure 1 (right)). Finally, the algorithm
will simplify the regex to (\d?\d/){2}\d{4} (not shown in the figure).
Alternatively, the user can choose one of our pre-defined sets of words. We offer
the ReLie dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a dataset of Wikipedia infoboxes (where 100 000 values of
infobox attributes have been annotated by YAGO [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] with dates and numbers),
and a sample of the Enron email dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (where we have annotated phone
numbers manually). The user can then enter a regex that they think captures
the words. We show the user which words are not captured, and the user can
choose to add one of them to the regex. The user can then run the modified
regex again to see how many more words are captured, and which ones are still
missing. This way, the user can see how the algorithm performs on real data.
      </p>
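<p>The workflow on a word set can be sketched as follows (the word list and regexes are illustrative, not the actual datasets shipped with the demo):</p>

```python
import re

# Report which words a user-supplied regex fails to capture, then re-run
# with the repaired regex to see which words are still missing.
words = ["12/08/1935", "1/8/1935", "07/04/1776"]

def missing_words(regex, words):
    return [w for w in words if not re.fullmatch(regex, w)]

print(missing_words(r"\d{2}/\d{2}/\d{4}", words))  # ['1/8/1935']
print(missing_words(r"(\d?\d/){2}\d{4}", words))   # []
```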
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>Since information extraction for the Semantic Web is our intended use case,
we ran our experiments on the aforementioned Wikipedia infobox dataset. We
asked 15 colleagues to manually produce regexes. We improve these regexes with
our algorithm by randomly adding 10 words that the regexes did not match. We
allowed the algorithm to reject the repair (and add the word in a disjunction
instead) if the precision drops on a training dataset. Table 1 compares the
F1-value to the dis-baseline (which simply adds the word as a disjunction) and the
star-baseline (.*). As expected, the dis-baseline raises recall only marginally.
The star-baseline covers only those attributes that consist of a single match.
Our approach, in contrast, can increase the recall of the regex. While these
experiments cannot prove the viability of the approach in all circumstances, they do
show that automated regex repair is possible.</p>
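<p>The dis-baseline can be sketched in a few lines (a hypothetical reconstruction; the paper only describes it as adding the word as a disjunction). It trivially makes the missing word match, but, as noted above, it does not generalize to similar words:</p>

```python
import re

def dis_baseline(regex, word):
    # Append the missing word verbatim as an extra disjunct.
    return regex + "|" + re.escape(word)

repaired = dis_baseline(r"\d{2}/\d{2}/\d{4}", "1/8/1935")
print(bool(re.fullmatch(repaired, "1/8/1935")))  # True: the added word itself
print(bool(re.fullmatch(repaired, "3/7/1936")))  # False: similar words stay missed
```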
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We have presented an algorithm for automatically adding missing words to
regular expressions. Our demo allows users to visualize and understand the changes
we propose. For future work, we plan to investigate other feedback functions,
prevent overgeneralizing repairs, and minimize the length of the repaired
regex. Our demo is available at https://thomasrebele.org/projects/regex-repair.</p>
      <p>Acknowledgments. This research was supported by the grants
ANR-11-LABEX0045-DIGICOSME and ANR-16-CE23-0007-01 (“DICOS”).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Rohit</given-names>
            <surname>Babbar</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nidhi</given-names>
            <surname>Singh</surname>
          </string-name>
          . “
          <article-title>Clustering based approach to learning regular expressions over large alphabet for noisy unstructured text”</article-title>
          .
          <source>In: Workshop on Analytics for Noisy Unstructured Text Data</source>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Alberto</given-names>
            <surname>Bartoli</surname>
          </string-name>
          et al. “
          <article-title>Automatic Synthesis of Regular Expressions from Examples”</article-title>
          .
          <source>In: IEEE Computer 47.12</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Falk</given-names>
            <surname>Brauer</surname>
          </string-name>
          et al. “
          <article-title>Enabling information extraction by inference of regular expressions from sample entities”</article-title>
          .
          <source>In: CIKM</source>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yunyao</given-names>
            <surname>Li</surname>
          </string-name>
          et al. “
          <article-title>Regex learning for information extraction”</article-title>
          .
          <source>In: EMNLP</source>
          .
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Einat</given-names>
            <surname>Minkov</surname>
          </string-name>
          , Richard C Wang, and William W Cohen.
          <article-title>“Extracting personal names from email”</article-title>
          .
          <source>In: EMNLP</source>
          .
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Karin</given-names>
            <surname>Murthy</surname>
          </string-name>
          , Deepak Padmanabhan, and Prasad Deshpande. “
          <article-title>Improving Recall of Regular Expressions for Information Extraction”</article-title>
          .
          <source>In: WISE</source>
          .
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Eugene W.</given-names>
            <surname>Myers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Webb</given-names>
            <surname>Miller</surname>
          </string-name>
          . “
          <article-title>Approximate matching of regular expressions”</article-title>
          .
          <source>In: Bulletin of mathematical biology 51.1</source>
          (
          <year>1989</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Paul</given-names>
            <surname>Prasse</surname>
          </string-name>
          et al. “
          <article-title>Learning to Identify Concise Regular Expressions That Describe Email Campaigns”</article-title>
          .
          <source>In: J. Mach. Learn. Res</source>
          .
          <volume>16</volume>
          .1 (
          <issue>Jan</issue>
          .
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Fabian M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          , Gjergji Kasneci, and Gerhard Weikum. “
          <article-title>Yago: a core of semantic knowledge”</article-title>
          .
          <source>In: WWW</source>
          .
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>