<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Workshop on OCL and Textual Modeling, June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Adding regular expression operators to OCL</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>K. Lano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Informatics, King's College London</institution>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>25</volume>
      <issue>2021</issue>
      <abstract>
        <p>In this short paper we provide a systematic proposal to add regular expression operators to the OCL standard library for the String type, and we discuss other String operators which could be included in the library. The String type is a central data type in most forms of software, however the String type facilities in the OCL 2.4 standard are quite restricted (only 17 specific operators are defined) [ 9]. One category of omitted operations are those concerning substitution within strings and string matching against regular expressions. Furthermore, since strings can be regarded as sequences of characters (strings of size 1), it would seem natural to adopt sequence operators such as first , last, front, tail and insertAt for strings. We consider the addition of regular expression operators in Section 2, and the addition of other operators in Section 3. Implementation of the operators is discussed in Section 4, and performance evaluation in Section 6.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Object Constraint Language</kwd>
        <kwd>OCL</kwd>
        <kwd>Regular expressions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Regular expressions</title>
      <p>
        Regular expressions are an important facility in many software systems, especially for validation
of textual data, data filtering and transformation. We found that these facilities were of key
importance in the development of software engineering tools for MDE, in particular our
codegenerator DSL,  ℒ [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] depends heavily on string pattern matching and replacement. It
would be desirable to be able to specify such tools using OCL.
      </p>
      <p>
        Most mainstream programming languages provide regular expression facilities, but there are
considerable diferences in the level of support and operations provided (eg., Java str.matches(patt)
requires that the entire str is a single match for patt, whilst Python patt.match(str) only requires
matching of patt to a substring of str, and C# patt.Matches(str) returns all matches of patt within
str). Specification languages such as EOL and ATL’s OCL version also provide some regular
expression facilities [
        <xref ref-type="bibr" rid="ref2 ref5">5, 2</xref>
        ], but again these languages do not use a consistent terminology.
      </p>
      <p>We propose a cohesive set of String operators to enable the use of regular expressions in
OCLbased specifications. These operators clearly distinguish between diferent forms of matching
and replacement:
• isMatch(patt : String) : Boolean – returns true only if the entire text of self is a single
match for the regular expression patt.
• hasMatch(patt : String) : Boolean – returns true if some non-empty substring of self
satisfies the regular expression patt.
• allMatches(patt : String) : Sequence(String) – returns the sequence of non-overlapping
and non-empty substrings of self which match patt, in the order of their occurrence in
self .
• replaceAllMatches(patt : String, rep : String) : String – A copy of self with each substring
in allMatches(patt) replaced by rep.
• split(patt : String) : Sequence(String) – the sequence of substrings of self formed by
removing the substrings in allMatches(patt).</p>
      <p>
        For additional clarity, split could alternatively be named splitOnMatches. Useful auxiliary
operations are
• firstMatch (patt : String) : String
• replaceFirstMatch(patt : String, rep : String) : String
It is also useful to support replacement substituteFirst and substituteAll of substrings based on
literal string equality, as in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Because of the general utility of such operations, we consider that they should be included
into the OCL standard library, rather than in a specialised library.
2.1. Regular expression semantics
We assume a core regular expression language, such as POSIX ERE, consisting of the standard
metacharacters ., * , +, ?, [, ], |, literal characters, character ranges c1− c2, and escaped special
characters ∖∖s. This language is supported by all mainstream programming languages. It has a
semantics given in terms of sets of sequences of characters [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]: [[r]] is the set of sequences of
characters which conform to regular expression r. This semantics can be defined by recursion
on the structure of r. Eg., [[r1 | r2]] = [[r1]] ∪ [[r2]] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>We assume that some matching relation str |= r exists between literal strings and regular
expressions, this means that str.characters() ∈ [[r]]. For example, “Augustus” |= “[A− Z ][a− z]+”
and not(“+ + * ” |= “[A− Z ][a− z]+”). str |= patt means that the entire string str satisfies
patt. Thus
but not(“C + +” |= “[A− Z ]+”).</p>
      <p>
        isMatch can then be defined directly in terms of |=. str.isMatch(patt) is invalid if str or patt
is invalid, otherwise it is null if either str or patt is null. For valid arguments we define:
isMatch(patt : String) : Boolean
pre: true
post:
if size() = 0 then result = false
else if patt.size() = 0 then result = false
else if str |= patt then result = true
else result = false
endif endif endif
This definition provides a formal definition of isMatch, suitable for inclusion in Section A.2.1.3
of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>The other regular expression operators can then be defined in terms of isMatch. hasMatch is:
hasMatch(patt : String) : Boolean
pre: true
post:
result =</p>
      <p>Sequence{1..size()}-&gt;exists( x, y |</p>
      <p>x &lt;= y and str.substring(x,y).isMatch(patt) )
We assume that str.substring(i, j) is “” if j &lt; i1.</p>
      <p>ifrstMatch (patt : String) is defined by:
firstMatch(patt : String) : String
pre: true
post:
if not(hasMatch(patt)) then result = null
else if</p>
      <p>Sequence{1..size()}-&gt;exists( y | substring(1,y).isMatch(patt) )
then
result = Sequence{1..size()}-&gt;collect( y | substring(1,y) )
-&gt;select( str | str.isMatch(patt) )-&gt;max()
else</p>
      <p>result = substring(2,size()).firstMatch(patt)
endif endif
The use of max here means that we take the largest match of patt (the longest matching substring
of self ) starting from a given index. This is the most common matching strategy for regular
expressions (‘greedy’ matching), but alternative strategies can be specified analogously.</p>
      <p>
        str.allMatches(patt) is also defined recursively:
allMatches(patt : String) : Sequence(String)
pre: true
post:
if size() = 0 then result = Sequence{}
1This case is not defined in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
else if not(self.hasMatch(patt)) then result = Sequence{}
else
let m : String = self.firstMatch(patt) in
result = Sequence{ m }-&gt;union(
self.substring(
      </p>
      <p>self.indexOf(m)+m.size(),self.size()).allMatches(patt))
endif endif
Matches cannot be the empty string, by definition of isMatch, therefore this recursion is
welldefined. An example is “abracadabra”.allMatches(“ab”) = Sequence{“ab”, “ab”}.</p>
      <p>str.replaceAllMatches(patt, rep) is defined by a similar recursion. It is invalid if any of str,
patt or rep are invalid. If patt or rep or str are null, the result is str.
replaceAllMatches(patt : String, rep : String) : String
pre: true
post:
if patt.size() = 0 or self.size() = 0
then result = self
else if not(str.hasMatch(patt)) then result = self
else
let m : String = self.firstMatch(patt)
in
result =
substring(1,indexOf(m)-1) + rep +
substring(indexOf(m)+m.size(),size()).</p>
      <p>replaceAllMatches(patt,rep)
endif endif
An example is “abracadabra”.replaceAllMatches(“ab”, “”) = “racadra”.</p>
      <p>replaceFirstMatch is:
replaceFirstMatch(patt : String, rep : String) : String
pre: true
post:
if patt.size() = 0 or self.size() = 0
then result = self
else if not(str.hasMatch(patt)) then result = self
else
let m : String = self.firstMatch(patt)
in
result =
substring(1,indexOf(m)-1) + rep +</p>
      <p>substring(indexOf(m)+m.size(),size())
endif endif
str.split(patt) is defined similarly. It is invalid if either str or patt are invalid, and Sequence{str}
if either are null.
split(patt : String) : Sequence(String)
pre: true
post:
if self.size() = 0 or patt.size() = 0
then result = Sequence{self}
else if not(hasMatch(patt)) then result = Sequence{self}
else
let m : String = self.firstMatch(patt)
in let ind : Integer = indexOf(m) in
if ind &gt; 1 then
result = Sequence{substring(1,ind-1)}-&gt;union(</p>
      <p>substring(ind+m.size(),size()).split(patt))
else</p>
      <p>result = substring(ind+m.size(),size()).split(patt)
endif endif endif
An example is “abracadabra”.split(“ab”) = Sequence{“racad”, “ra”}.</p>
      <p>An alternative to using isMatch as the core regex operator would be to use firstMatch : both
isMatch and hasMatch can be derived from firstMatch .</p>
    </sec>
    <sec id="sec-3">
      <title>3. Other String operators</title>
      <p>
        Since strings can also be regarded as sequences (of characters), it is often convenient to apply
some sequence operators to strings. The size() and at(i) operators are already common to both
types in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], as is indexOf (str), but with a more general meaning for String than for Sequence
(the corresponding Sequence operator would be indexOfSubSequence(sq)). In addition, the
subrange operators substring(i, j) and subSequence(i, j) correspond. In several programming
languages strings are treated as pseudo-collections, eg., iterators can be defined over strings in
C++ and in Swift, and Python uses the same range selection operators for strings and sequences.
      </p>
      <p>
        We suggest that first (), last(), front() and tail() are natural convenience operators to adopt
for String. These can be directly defined in terms of substring, so do not require extension of
Annex A of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. reverse() and insertAt(i, str) would also be relevant operators to adopt, where
insertAt allows a general string to be inserted into another:
s.insertAt(i, str) =̂︀
      </p>
      <p>s.substring(1, i − 1) + str + s.substring(i, s.size())</p>
      <p>Other widely-used string operators are trim() to remove leading and trailing whitespace,
lastIndexOf (str), hasPrefix (str)/startsWith(str), hasSufix (str)/endsWith(str) and after(str) and
before(str). Thus we also recommend these for inclusion in the OCL String library.</p>
      <p>
        However there are many specialised string operators which are more appropriately defined
in external extension libraries. For example, computation of the edit distance of two strings [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
or the generation of random strings.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Operator implementations</title>
      <p>
        The operators isMatch, hasMatch, firstMatch , allMatches, replaceFirstMatch, replaceAllMatches
and split have been incorporated into the AgileUML toolset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] OCL-based specification language
and included in the specification analysis facilities of the toolset. Thus they can be used in any
specification of business, mobile or web-service systems defined using the tools.
      </p>
      <p>
        Code-generation for specifications using the regular expression operators is supported by
extended OCL libraries for Java 4, 6, 7 and 8, C#, C++, C, Python and Swift, available from
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Eg., Ocl.swift supports Swift versions 4 and 5. In each library we have defined specific
implementations of the operators. For Java we use the java.util.regex library facilities. For C#
we use the System.Text.RegularExpressions library. For C++ we use the C++ 2011 &lt;regex.h&gt;
library. For ANSI C we use the lcc regex.h library. For Swift we use the NSRegularExpression
facilities from Objective-C. For Python we use the re library facilities.
      </p>
      <p>In some cases, such as Java, the programming languages directly provide the required
operators. In other cases (eg., C, Swift), the operators need to be implemented in terms of more basic
matching facilities. We have primarily used iterative algorithms for eficiency.</p>
      <p>We provide implementations of first (), last(), front(), tail(), reverse(), insertAt(i, str), trim(),
lastIndexOf (str), after(str) and before(str) String operators in the AgileUML OCL
implementations for Java, C#, C++, C, Python and Swift.</p>
      <p>Extension operators such as edit distance are instead defined as static operations of a StringLib
component. In specifications these are invoked as StringLib.editDist(s1, s2). Implementations
of StringLib need to be provided in each target programming language. More advanced regular
expression facilities, such as first-class patterns and match groups, could also be provided in an
extension component.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Related work</title>
      <p>
        The OCL 2.4 library [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] provides 17 operations on String, but omits regular expressions and
convenience operators such as hasPrefix , hasSufix , lastIndexOf , reverse, trim.
      </p>
      <p>
        The Eclipse OCL [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] library for String additionally provides a matches operation (for isMatch),
replaceAll, replaceFirst, but not firstMatch , allMatches, hasMatch or split.
      </p>
      <p>
        EOL [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] provides 20 String operators based on Java, including the regular expression operators
matches (representing hasMatch) and split, but omits allMatches, replaceAllMatches and isMatch.
      </p>
      <p>
        ATL OCL [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] provides the OCL 2.4 String operators, additional String operators trim, lastIndexOf ,
startsWith, endsWith, a limited form replaceAll of literal string replacement, and regular
expression operators split, regexReplaceAll, but not isMatch, hasMatch or allMatches.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Evaluation</title>
      <p>
        In order to evaluate the performance of the regular expression implementations in diferent
languages we defined examples for (i) allMatches and (ii) replaceAllMatches with the regular
expression [a− zA− Z ]+ applied to a string of length 1000 containing 100 words in case (i) and
expression ∖∖([⌢∖∖(∖∖)] * ∖∖ ) applied to a string of length 500 in (ii). The latter expression
matches against ()-bracketed texts which contain no ( or ) brackets within them. The examples
are in oclregexExamples.zip at [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Table 1 shows the performance for Java, Python and C#
implementations of these examples, iterated for N=100, 1000, 10000 and 100000 times. The
results are the averages of three independent executions, with all times in ms. All tests were
carried out on a Windows 10 i5-6300 dual core laptop with 2.4GHz clock frequency, 8GB RAM
and 3MB cache.
      </p>
      <p>N</p>
      <p>This means that for Java each match takes about 10− 4ms and each replacement about 4 *
10− 4ms in the largest case. Performance figures are similar for other languages.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Summary</title>
      <p>In this paper we have provided a rationale for adding a core set of regular expression facilities
to OCL, we also described their semantics and how the operators can be translated into diferent
programming language implementations. We also discussed the possible inclusion of other
widely-used string operators.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] AgileUML repository, https://github.com/eclipse/agileuml/,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>[2] Eclipse, ATL user guide, eclipse</article-title>
          .org,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Eclipse</surname>
            <given-names>AgileUML project</given-names>
          </string-name>
          , https://projects.eclipse.org/projects/modeling.agileuml,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Eclipse</surname>
            <given-names>OCL</given-names>
          </string-name>
          <source>Version 6.4</source>
          .0, https://projects.eclipse.org/projects/modeling.mdt.ocl,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>The</given-names>
            <surname>Epsilon Object Language</surname>
          </string-name>
          , https://www.eclipse.org/epsilon/doc/eol,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gries</surname>
          </string-name>
          ,
          <article-title>Compiler construction for digital computers</article-title>
          , Wiley,
          <year>1971</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kolahdouz-Rahimi</surname>
          </string-name>
          ,
          <article-title>Agile specification of code generators for model-driven engineering</article-title>
          ,
          <source>ICSEA</source>
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>I. Levenshtein</surname>
          </string-name>
          ,
          <article-title>Binary codes capable of correcting deletions, insertions and reversals, Cybernetics and control theory</article-title>
          , vol.
          <volume>10</volume>
          (
          <issue>8</issue>
          ),
          <year>1966</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>710</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>[9] OMG, Object Constraint Language 2</article-title>
          .4 Specification , OMG document formal/2014-02-03,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>