Branching Alignment based Synthesis of Regular Expressions Andreas Scherbakov andreas@softwareengineer.pro The University of Melbourne Abstract. We propose a novel Multiple Sequence Alignment algorithm which is able to build an optimized branching graph given a set of positive matching sample strings. The algorithm is principally based on Minimum Edit Distance approach being applied incrementally. However, we essen- tially extended the set of edit operations. The newly added operations allow implementing an acyclic graph drawing feature. The feature yields a better visual and appropriate representation of partial alignment gaps between sample strings. For instance, it may produce an /abc|de[fg]/ regular expression from a set of "abc", "def", "deg" strings. For fur- ther optimization, we adjust edit penalties based on character class hi- erarchy. The algorithm may be used in Bioinformatics as well as for extracting Regular Expressions (RE, patterns) from sets of observed or prospective documents (that may be useful in Web data mining, Spam filtering, RE design advisory tools and so on). This paper also considers integration of the proposed algorithm into a full stack Regular Expression extraction flow, particularly, in conjunction to Repetition detection. Keywords: Regular expression, Sequence, Alignment, Multiple Sequence Alignment, Alignment Gap, Regular expression synthesis, Regular ex- pression extraction, Pattern extraction, Hammock graph, Branching Reg- ular Expression, Edit Distance, Character class 1 Introduction Regular expressions (RE) are used widely in various applications for the text data processing. While in many cases the pattern a developer or analyst intends to establish is clear and formally well defined, there are many areas where reg- ular expressions used to recognize and mine a piece of text that is similar to some sample [6]. Examples are Web data mining, Spam filtering, Bioinformat- ics, State machine analysis and plenty of others. In these cases building and testing a proper RE becomes technically challenging. Algorithms that generate a RE that matches large number of samples are of high demand in these areas. Also, such algorithms may provide priceless help to those application developers who want to obtain necessary RE patterns just by supplying several matching samples rather than by deepening into RE building details [7]. There exist a lot of software tools aiding the regular expression building of blocks and testing but few applications that generate a RE from samples. Examples of the latter 2 presented in [11],[2],[4],[3] employ genetic algorithms or Machine Learning ap- proaches directly to the block composition. In order to take an advantage of such an universal approach, (1) one needs a big bunch of samples, as every possible usage of building block adds a feature while the number of test examples should be sufficient to learn efficiently against that number of features [1]; (2) a set of negative examples should be provided as well as a set of positive ones which of- ten does mean extra human work in designing negative samples. Also, character abstraction rules are mostly beyond user’s control. There are approaches that attempt to build an optimal automaton to sat- isfy a given input sequence [5]. They do produce regular expressions solely of positive samples. However, they are focused on repeated subsequence extraction while don’t mind possible alignments and splitting branches for more neat rep- resentation. In this paper we propose an algorithm that performs alignment and partial join of input samples. Please note that the repetition extraction itself is generally beyond our discussion scope here, although being a part of the pro- posed technique. In a short word, the new algorithm performs fuzzy alignment of samples and produces branches while leaving the repetition extraction to an external method. Our major motivation was producing regular expressions for advising pro- grammers and users and for assisting human’s work in areas where a content schema isn’t much strict and clear. Following this, we pay attention to make a RE concise while adjustable both in terms of look and feel and of degrees of freedom (perplexity). 2 Representation of Regular Expressions 2.1 Hammock Graph /a(b([ij]\d)?|k0)/ transition [ij] + bypass b 4 5 \d 1 a 2 6 bypass k 0 zero fork 3 join transition Non-exhaustive examples of possible/impossible new branches which might be added next time Fig. 1. An example of regular expression graph 3 For the purposes of our algorithm, a regular expression is represented as a directed acyclic graph (Fig. 1). More precisely, as a ”hammock graph” [14], allowing exactly one entrance and one exit to each area. A node of such a graph is referred to as a RE position, while an edge as a RE transition (there may be normal transitions that correspond to a character place as well as zero transitions shunting constructs like /(abc)?/). For the simplicity, in the current paper a graph is assumed to be sorted and having all positions enumerated with a continuous [0..N ] integer range. We refer to a sub-graph between some entrance position and some exit posi- tion as a hammock . Thus, for example, all edges in a plain chain are hammocks. Hammocks of course may be nested. The way from a hammock’s entrance posi- tion to it’s exit is referred as a bypass. An entrance or exit position shared by multiple hammocks is referred as a fork or join position, respectively. 2.2 Character Selection characters characters characters a ab ab2 character classes character classes character classes Single Word Any Single Word Any Single Word Any . \w . . \w . . \w . var = 1 var = ∞ var = ∞ var = 1 var = ∞ var = ∞ var = 1 var = ∞ var = ∞ cost = 0 cost = 2 cost = 4 cost = 0 cost = 2 cost = 4 cost = 0 cost = 2 cost = 4 cost = 0 cost = 2 cost = 4 Fig. 2. A character selection cost evolution while new characters are merging. We represent every character place in a RE by a character selection. A character selection is both considered as a concrete character set and as a set of character classes that fit it. A character set in our scope means all characters really falling to this place in some sample string(s). A character class is en entity that corresponds to some predefined group of characters while also (op- tionally) restricts the maximum variety of participating characters. For example, a character class with [a − z] character group and maximum allowed variety of 2 matches possible character selection of [ax] but doesn’t match the [axz] due to the variety exceeding. The choice of character classes and their parameters es- sentially affects the resulting regular expression look and actually represents the user preferences that cannot be deducted from a limited number of samples (like whether should we display [123] or \d for a concrete selection of digits at some RE element). Thus, character classes are considered to be user-defined parame- ters in our algorithm. The only mandatory requirement is that every character should belong to at least one class. Every character class is attributed with a class cost. The class cost may be treated as an inverse metric to the prettiness of 4 a class occurrence in a resulting expression record. The cost of a character selec- tion is defined as the minimum cost of a character class containing all characters selected. Θsel (chars) = min Θclass (cls) (1) cls∈charClasses ∀char∈chars→char∈cls.chars cls.maxV arety≥||chars|| When we consider an option to add (merge) a new character to a selection (for example, /ab/ + /ac/ → /a[bc]/ – adding the second character ‘c’ into a position previously occupied by just one character ‘b’), we must calculate the selection cost both before and after the planned merge. That (non-negative) difference will be the merge cost used in a dynamic programming procedure described below. Ψmerge (chars, c) = Θsel (chars ∪ {c}) − Θsel (chars) (2) As a variant, a character class match may be additionally rewarded by substi- tuting a zero resulting value with some negative one: If Ψmerge = 0 then Ψmerge := Ψmatch < 0 (3) 3 Branching Alignment algorithm In order to produce a regular expression from a set of samples, we first merge the samples into a hammock graph. The core part of our alignment algorithm generally looks like an extension of well known Minimum Edit Distance method [13]. The procedure incrementally transforms the current RE r, attempting to align it to the next sample s with the minimum penalty score. We use a dynamic programming algorithm that optimizes the total cost of regular expression mod- ification choosing the best alignment between r and s positions. In order to get a final RE, we start with some input sample as an initial RE and then merge all other samples one by one into it. Note. The resulting expression look may depend on the order of sample string merging. Although the effect of ordering yet needs exploration, our experiments showed that following a longer first, shorter next order ensures an optimal align- ment or a good approximation to it in vast majority of cases. Our algorithm adds a new operation to the original operation set consisting of deletion, insertion and substitution. This new operation enables drawing a new graph branch instead of mere aligning characters between two given positions whenever it seems to yield better score. Table 1 displays the list of possible atomic editing events available at a given (i, j) combination where i is the target position in r while j is the number of processed characters of s. We associate two cumulative costs, JL (i, j) and JS (i, j), to every possible (i, j) pair. Initially, JL (0, 0) = 0 while all other JL (i, j) and JL (i, j) are set to +∞. (Subscripts L ans S near J stand for the ‘light’ and ‘shadow’ words, respectively). The difference between JL and JS lays in the fact that the former immediately correspond to some position in r while the latter points to a bypass 5 Table 1. Editing events Operation Description Sample RE posi- Penalty string tion ad- position vance advance Merge Merging a character from the sam- yes yes Ψmerge ple string into a RE character selec- tion Insert Adding a new single character ele- yes no Ψinsert ment to the RE. As the RE should keep matching all previous samples, the element will be optional Delete Letting a RE element to be skipped. no yes 0, if the element It means the element becomes op- is already marked tional optional; Ψinsert , otherwise Insert into Adding a character to a newly cre- yes no Ψdisp a branch ated branch Bypass Drawing a newly created branch no yes Ψbranch , once per over a RE hammock bypass. branch connection position inside a potential new branch coming ”over” ith position in current RE. For example, if r was /ab/, s is ”cd” and r will be /(ab)|(cd)/ then JS (1, 1) may refer to a connection point between a bypass ”over” ‘a’ and a bypass ”over” ‘b’. We fill matrices of JL and JS for all possible positions [0..M ]×[0..N ] iterating positions in any convenient non-decreasing order. We use the following equations in order to take all possible edit events at (i, j) point into account.    JL (t.src, j − 1) + Ψinsert , if j > 1 ∀t ∈ trans(i) JL (t.src, j) + Ψdelete     ∀t ∈ trans(i) JL (t.src, j − 1) + Ψmerge (t.sel, s[j]), if j > 1  JL (i, j) = min   ∀b ∈ bypass(r, i) : ¬b.f ork JS (b.src, j) ∀b ∈ bypass(r, i) JL (b.src, j) + Ψbranch     ∀t ∈ zerotrans(i) JL (t.src, j)   (4)   ∀b ∈ bypass(r, i) : ¬b.join ∧ ¬b.f ork, JS (b.src, j) ∀b ∈ bypass(r, i) : ¬b.join JL (b.src, j) + Ψbranch  JS (i, j) = min (5)   JS (i, j − 1) + Ψdisp , if j > 1 JL (i, j − 1) + Ψdisp + ψbranch if j > 1  As seen from the equation, JS (i, j) cannot be used or assigned ”through” a bypass that shares a fork or join, respectively. This restriction prevents project- ing a hammock breaking branch, as discussed in the previous section. Once we consider a new branch creation, we apply a Ψbranch penalty. This penalty (acting 6 like a simplified Graph Edit Distance [8]) enforces the willingness to compose a laconic and uniform regular expression rather than to proliferate branches. Having the matrix fully constructed, we traverse the lowest total penalty path from JL (N, length(s)) back to JL (0, 0) in order to extract the least costly sequence of editing events. To do this, we reuse the same equations (4), (5) but in place of calculating minimums (that are already known at this stage) we notice the choice that yields minimum value to the left hand side. Then we apply these editing events over elements of r producing a modified RE version r0 (See Fig. 3 for an example). The new branch insertion is a special case. To add each new projected branch, if any, into the RE, we do the following: (1) outline a branch-related subsequence of editing events (consisting of ”Insert into branch” and ”Bypass” events); (2) compose a new branch and (3) insert the new branch into r at proper start and end positions. The idea of algorithm described above may remind the Affine Gap Penalty concept known in Bioinformatics [9] ; the major difference is the fact that we consider the gap not merely as a modifier to an element-by-element edit distance metric, but also as a reason to draw a new graph branch. See Algorithm 1 illustrating the approach discussed. JL(0,5) JS(5,5) 7 7 7 7 7 6 ax(be|cf) Edit inline JS(0,5) JL(5,5) 7 7 7 7 7 7 e New branch 6 6 6 6 6 6 ax(be|cf) 6 6 6 6 6 6 2 5 5 5 4 4 5 ax((2|b)e|cf) 5 5 5 5 5 5 1 4 4 2 5 5 5 ax((12|b)e|cf) 4 4 4 7 7 5 x 3 2 4 6 6 5 ax((12|b)e|cf) 3 3 4 7 7 5 r 0 3 4 7 7 5 JS(0,0) JS(5,0) [ar]x((12|b)e|cf) ∞ 3 4 9 9 5 JL(0,0) JL(5,0) a x (b e |c f) Fig. 3. An example: merging ”rx12e” string to /ax(be|cf)/ expression 7 A note on the complexity. The algorithm shows the same complexity as Minimum Edit Distance (per one sample), of O(M ∗N ), which is meanly propor- tional to N 2 , where M is the regular expression length, N is the sample string length. Still there are many known methods aimed to bound the complexity of Edit Distance minimization, and most of them are directly applicable to the pro- posed algorithm. Particularly, it might be bounding of maximum edit distance to some constant number [12] 4 Repetition Capture A regular expression produced with the algorithm described above yet needs a repetition capture in order to get its final form. We may employ any known method capable of finding and removing repetitions of adjacent substring in a plain symbol string . Let M (s) → s0 denote such a method, where s and s0 are input and output strings, respectively. Now, we may process the RE graph recursively applying M to all edge sequences of the graph, recursively down up, representing each nested hammock by a single canonical symbol. In more detail, we use a T : (α, Rmin , Rmax ) tuple in order to represent a hammock, where α is a symbol of some alphabet, Rmin and Rmax are the minimum and maximum number of α’s repetition, respectively. So, we are to slightly modify the M procedure in order to fit the following form. MT (q) → q 0 , q, q 0 : S, (6) where S denotes the sequence of T tuples. We expect that MT compares two tuples in a sequence solely relating to the α value (ignoring Rmin , Rmax ). Once has found a repetition, the procedure merges it into single tuple with proper sums for the total Rmin and Rmax : (α, Rmin1 , Rmax1 ), (α, Rmin2 , Rmax2 ) → (α, Rmin1 + Rmin2 , Rmax1 + Rmax2 ) (7) Then, we need a hammock canonical representation function C(si ) → sj . The aim of such a function is to represent a hammock using a minimal variety of α. It’s expected to satisfy, for instance, the following properties. – C({x, []}) = (C(x).α, 0, C(x).Rmax ) – C({[x]}) = C(x) C function may expect that its arguments already represented in a canonical form. We initially annotate each character selection cj found in our graph with a tuple   0, if cj marked as optional (cHash(cj ), , 1) : T, (8) 1, otherwise where cHash(cj ) is a canonical symbol for the best character class that fits cj . Now we may use a depth-first search procedure like the following one to capture repetitions in our RE graph. 8 procedure captureRepeat(q : S) for e ∈ q do if e ∈ / single char selections then for b ∈ e.branches do captureRepeat(b); b ← C(b); q ← MT (q) Now we have all elements annotated with their repetition bounds [Rmin → Rmax ] and thus we may deliver a RE. However, an user may want to see un- bounded repetitions in it (like in /a+/ or /ab*/ ). We cannot rigorously extract an unbounded repetition from a finite number of samples, so we have to use em- piric generalization rules. The simplest of them (while working well in practice) may be assigning Rmax to +∞ if Rmax − Rmin >= Rbound , where Rbound is some constant threshold (usually equal to 3). 5 Experiments We applied the proposed algorithm to the Spam classification task [10]. We used 195 series of Spam messages (i.e. 195 mimes). We extracted six sample strings from each message (four headers, a plain text and an HTML). Every series was split into the train and test sets (60% and 40% of messages, respectively). We randomly chose the sets three times per mime. As we don’t know in advance which combination of sample strings (like ”Subject+body” or ”From+Html”) better classifies a Spam mime, we have tried all 26 − 1 = 32 combinations and selected one per mime that yields the best F1 metric value once applied for classifying the union of the test set for this mime (Spam) and a standalone collection of non-Spam messages. We have repeated these experiments at various values of the branch cost ratio (that tells one how many avoided highest-cost character merges ”worth” a new branch creation). Fig. 4 displays F1 metric values distribution over mimes. It seems that branch cost should be at least about 3 for the best result. Further increase does not affects the quality significantly. Fig. 5 shows how the precision and the perplexity correlate to the recall. The precision in most cases equals one and behaves quite irregularly in the remaining cases that demonstrates a quite low probability of matching against some random document. A surprisingly inverse overall correlation in the (perplexity, recall) pair is due to the fact that an increase in perplexity of a generated RE is a reaction to a high content variety of training samples. A congestion near (50, 0.7) coordinate seems to be caused by the generator’s attempts to satisfy the sets of altering string constants frequently occurring in Spam messages with the aim to challenge filters. The perplexity of generated regular expression still seems rather low to cover the expected variety of training set samples. Of course being a natural consequence of a synthetic approach to RE building, this fact 9 F1 distribution over Spam mimes 1.0 0.8 branch cost ratio = 1, no \d and \w classes 0.6 branch cost ratio = 1 F1 branch_cost ratio = 1.75 branch_cost ratio = 2.5 0.4 branch cost ratio = 8 0.2 0 20 40 60 80 100 Samples fraction, % Fig. 4. A distribution of F1 metrics observed in a Spam classification test based on RE extraction indicates that further adjustments toward the perplexity boosting at ”volatile” sample text regions might improve the recall values. Spam messages provide an extremely challenging test bench because they are specially designed to foolish automated filters as possible. Thus, we didn’t expect the recall to demonstrate as high values as it may be seen in ”normal” content processing. A bunch of expression generated were analyzed and it was found that it’s usually easy to do a manual correction (mostly meaning the removal of some parts) which results in a recall of 95..100%. Thus, the algorithm at it’s today’s stage may be considered as an effective advisory solution for draft RE extraction. 6 Manual Assessment We have found that each piece of a generated RE looks reasonable at a proper selections of major cost and limit parameters. However, the best choice of these parameters may significantly vary through the entire expression. At the same time, the overall alignment of samples works fine at a wide range of parameters. Thus we conclude that for the best advisory value the technique should allow re- compilation of any selected part of generated RE at a different choice of options. Meanly, an optimal cost of branch containing 3 characters is expected to be of 10 Precision against Recall in Spam tests 1.0 ● ● ● ● ● ● ●● ●● ● ●● ●● ●● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.9 ● branch cost ratio = 1, no \d and \w classes 0.8 Precision branch cost ratio = 1 branch_cost ratio = 1.75 0.7 ● ● branch_cost ratio = 2.5 0.6 branch cost ratio = 8 0.5 0.2 0.4 0.6 0.8 1.0 Recall RE Perplexity against Recall in Spam tests 70 ● 60 50 ● Perplexity 40 ● 30 ● ● 20 ● ● ● 10 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● ● ● ● ● ● ● 0.2 0.4 0.6 0.8 1.0 Recall Fig. 5. Precision and Perplexity against Recall for Spam tests at various parameters. about 1.5..2.5 times the highest cost of single character merge (also referred to as the Wild card cost). Table 2 displays examples of regular expressions generated with a set of E-Mail addresses taken from a Spam distribution ”From:” field. The table also includes manual assessment of expression quality in terms of readability, recall, and precision. Expressions were taken under various settings of three major parameters such as (1) Wild card character class cost (max Ψmerge ), (2) Branch cost (Ψbranch ) and (3) Branch Character cost (Ψdisp ). Also, Match bonus (−Ψmatch ) was varied in some cases. 7 Discussion The algorithm has demonstrated its potential as an advisory solution for an an- alyst or an engineer willing to compose regular expressions in a semi-automated manner. Tuning of parameters may be really needed to get the desired look of 11 Table 2. Examples of RE extracted at different parameter settings from a set of header field values found in a Spam distribution cost Branch Wild Match Regular Expression produced Assessment Start Char 4 1..6 1 0 (?:Ponto Frio|(?:Sepho|Ext)ra| Ok "?InfoJobs"?)<κ@mail2m\d{1,2}\.info> 4 7 1 0 (?:Ponto Frio|InfoJobs|"InfoJobs" | Ok but less (?:Sepho|Ext)ra) <κ@mail2m\d{1,2}\.info> readable 4 2 2 0 .?\w{3}o?(?:ra|Jobs"?| Frio) too general, <κ@mail2m\d?\d\.info> less readable 2 2 2 0 .?(?:[Io]n)?(?:.{2})?.{5} too general, <κ@mail2m\d?\d\.info> less readable 4 1 3 -2..0 ["P]?\wn?\wo?(?:.\w{3})?. irrelevant <κ@mail2m\d?\d\.info> 7 1 4 -2..0 (?:P|")?\wn?(?:\w{2})?(?: |\w)\w{3} too general (?:o|")? <κ@mail2m\d?\d\.info> 2 10 1 0 (?:.{2})?(?:n[ft]o|f)?.{5} low rele- <κ@mail2m\d?\d\.info> vancy 2 1 4 0 ["P]?.n?.o?(?:.{2})?.{3} too general <κ@mail2m\d?\d\.info> 3 1 2 -2..0 .?\w{3}o?(?:ra|Jobs"?| Frio) acceptable <κ@mail2m\d?\d\.info> still unread- able 4 1 1 -2 (?:Sep|E|(?:Po|"?I)n)\w\w(?:ra|Jobs"?| acceptable Frio) <κ@mail2m\d{1,2}\.info> still unread- able Samples "InfoJobs" <κ@mail2m37.info> Sephora <κ@mail2m38.info> Ponto Frio <κ@mail2m10.info> Sephora <κ@mail2m16.info> "InfoJobs" <κ@mail2m37.info> Sephora <κ@mail2m39.info> Extra <κ@mail2m18.info> "InfoJobs" <κ@mail2m37.info> "InfoJobs" <κ@mail2m32.info> Ponto Frio <κ@mail2m10.info> Extra <κ@mail2m18.info> InfoJobs <κ@mail2m30.info> Sephora <κ@mail2m35.info> Sephora <κ@mail2m1.info> κ stands for a constant name. an expression produced. (And some extra logic/arithmetic in order to have an easy-to-use parameter control may be desirable in implementations). Due to a quadratic complexity of the proposed algorithm, a preliminary align- ment step should be done first (using less consuming algorithms, for example, 12 suffix tree based ones) in case of long samples. This fact limits the maximum size of a branched RE region rather than the overall RE size and in such a way it doesn’t principally prevent the usage of the proposed technique with text samples of arbitrary length. Indeed, the proposed algorithm is expected to fill the gap between concrete substring alignment and the regularity (repetition) extraction. We mean a cas- caded pipeline as follows. 1. A plain string based alignment of input samples splits a RE extraction task into a set of tasks with shorter text samples (complexity = O(K ×N ×log N ) or lower, where K is the sample count, N is a representative sample length). 2. The proposed extraction algorithm aligns characters in a fuzzy (class based) way and creates branches (complexity = O((N/N1 ) × K × N12 ) = O(K × N × N1 )), where N1 is a representative length of sample substring produced by the previous phase. 3. A repetition extraction step over an aligned RE (complexity = O(N22 ) or lower, where N2 is a representative length of hammock chain in a graph resulting from phase 2, N2 <= N1 ). Phase 2 complexity is critical in such a schema. The overall complexity de- pends on N1 behavior w.r.t. N . Fortunately, in real documents N1 usually seems to be bounded as thy contain some non-volatile structure that may be captured at phase 1. At such an assumption, overall complexity of the pipeline may be kept ’almost’ equal to the complexity of phase 1. An internet service demo based on the proposed algorithm is available at http://regexus.com. All interested people are welcome to test and use it. 8 Next steps We expect the following major directions of further related research. – to make parameters adjustable with Machine Learning techniques; – to explore an automated ordering of sample string as well as non-sequential merging methods; – to apply a preliminary alignment and partitioning of samples in order to reduce the overall complexity; – to explore an option to extract more ’popular’ constructs of RE (say, refer- ences to previously matched substrings). 13 Appendix Algorithm 1 Merging a string into a RE r ← empty regex for s ∈ samples do N ← max position number(r) JL (∗, ∗) ← +∞ JS (∗, ∗) ← +∞ JL (0, 0) ← 0 for j = 0 → length(s) do for i = 0 → N do JL (i, j) ← use Equation (4) JS (i, j) ← use Equation (5) current ←pointer toJL (N, length(s)) newBranch ← empty list while current 6= 0 or j 6= 0 do bestChoice ← f indBestChoice(current) . finds a case in Equation(4) or (5) yielded the minimum value to the matrix element pointed by current if not empty(newBranch) and current ∈ JL (∗, ∗) then addBranch(r, current, branchEnd) newBranch ← empty branch if isBypass(bestEvent) or E ∈ LS (∗, ∗) then . dealing with a new RE branch if empty(newBranch) then then newBranchStart ← currentElement if bestChoice.ancestor.j < current.j then newBranch ←concat(single char selection(s[j]), newBranch else . dealing with an existing RE branch if bestChoice.ancestor.i < current.i then if bestChoice.ancestor.j < current.j then character merge (s[current.j], charSelection(bestChoice.transition))) else newSingle ← single char selection (s[current.j]) markAsOptional(newSingle) insert(r, current.i, newSingle) else if bestChoice.ancestor.j < current.j then markAsOptional(charSelection(bestChoice.transition))) current ← bestChoice.ancestor if not empty(newBranch) then addBranch(r, current, newBranch) 14 References 1. Hussein Almuallim and Thomas G Dietterich. Learning boolean concepts in the presence of many irrelevant features. Artificial Intelligence, 69(1-2):279–305, 1994. 2. Alberto Bartoli, Giorgio Davanzo, Andrea De Lorenzo, Marco Mauri, Eric Medvet, and Enrico Sorio. Automatic generation of regular expressions from examples with genetic programming. In Proceedings of the 14th annual conference companion on Genetic and evolutionary computation, pages 1477–1478. ACM, 2012. 3. Alberto Bartoli, Giorgio Davanzo, Andrea De Lorenzo, Eric Medvet, and Enrico Sorio. Automatic synthesis of regular expressions from examples. 2013. 4. Alberto Bartoli, Andrea De Lorenzo, Eric Medvet, and Fabiano Tarlao. Playing regex golf with genetic programming. In Proceedings of the 2014 conference on Genetic and evolutionary computation, pages 1063–1070. ACM, 2014. 5. Geert Jan Bex, Wouter Gelade, Frank Neven, and Stijn Vansummeren. Learning deterministic regular expressions for the inference of schemas from xml data. ACM Transactions on the Web (TWEB), 4(4):14, 2010. 6. Falk Brauer, Robert Rieger, Adrian Mocan, and Wojciech M Barczynski. Enabling information extraction by inference of regular expressions from sample entities. In Proceedings of the 20th ACM international conference on Information and knowl- edge management, pages 1285–1294. ACM, 2011. 7. Jeffrey Friedl. Mastering regular expressions. ” O’Reilly Media, Inc.”, 2006. 8. Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. A survey of graph edit distance. Pattern Analysis and applications, 13(1):113–129, 2010. 9. Osamu Gotoh. An improved algorithm for matching biological sequences. Journal of molecular biology, 162(3):705–708, 1982. 10. Jean-Michel Jolion. Some experiments on clustering a set of strings. In Graph Based Representations in Pattern Recognition, pages 214–224. Springer, 2003. 11. Efim Kinber. Learning regular expressions from representative examples and mem- bership queries. In Grammatical Inference: Theoretical Results and Applications, pages 94–108. Springer, 2010. 12. Esko Ukkonen. Finding approximate patterns in strings. Journal of algorithms, 6(1):132–137, 1985. 13. Robert A Wagner and Michael J Fischer. The string-to-string correction problem. Journal of the ACM (JACM), 21(1):168–173, 1974. 14. Fubo Zhang and Erik H. DHollander. Using hammock graphs to structure pro- grams. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 30(4):231– 245, 2004.