Case-Based System of User Interface Reverse Engineering

Pavel Myznikov
Novosibirsk State University
1 Pirogova str., Novosibirsk 630090, Russia
miznikov72@gmail.com

Abstract

The paper is devoted to the implementation of case-based reasoning in reverse engineering of graphical user interfaces; in particular, web interfaces are considered. We suggest automating HTML/CSS markup building by aggregating code samples from previous cases. The proposal is based on the hypothesis that analogy is employed to conceptualize HTML markup construction. The article presents the original theory of building an image structure, the modification of the case-based reasoning approach, and the results of practical experiments.

Introduction

Web technology is one of the most developed areas of modern computer science. It is used not only in website development but also as an important node of any IT infrastructure. Consequently, the whole information technology industry uses web development in one form or another.

Nevertheless, automation of web development itself remains incomplete. One of the key tasks, the creation of HTML/CSS markup, does not have an extensive solution yet. Several properties of this process prevent the use of classical methods of automation: non-formalized requirements for a layout, variability of industrial standards for code writing, and cross-browser and cross-platform compatibility.

The impact of automating HTML/CSS markup building is not limited to speeding up web application development. It also makes development more flexible, allowing developers to scale results, verify hypotheses, and support the testing of applications.

The results of this work can be extended to a more general domain: any technology where markup languages are used for building GUIs.

The methodological basis of the work is the Case-Based Reasoning (CBR) approach (Kolodner 1992). Briefly, it is a way of solving problems by adapting solutions of similar problems from the past to the current situation. This approach was chosen due to the hypothesis that formalizing images (which is, in fact, the creation of HTML markup) using transductive reasoning, or reasoning by analogy, produces a result most similar to what a human does. Such a scheme can also solve the automation problems described above.

State-of-the-art solutions generate code without human involvement. They work as black boxes, where the final result depends entirely on the collected dataset. The CBR approach fills this gap by acquiring experts' knowledge in a case storage. The ultimate role of CBR, therefore, is to help specialists formalize their knowledge in a relatively small dataset and then automatically compose complex interfaces as if the specialists had done it manually.

Copyright (c) 2020 held by the author(s). In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2020 Spring Symposium on Combining Machine Learning and Knowledge Engineering in Practice (AAAI-MAKE 2020). Stanford University, Palo Alto, California, USA, March 23-25, 2020. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Related work

To fill a gap in knowledge for readers who are not familiar with the field, we overview two types of related work. The first is papers on interface reverse engineering: these are the benchmarks against which we compare our approach. The second is an outline of CBR. Since CBR is the core of the given approach, this information is crucial for understanding the paper.

Interfaces reverse engineering

We consider two types of interfaces that are typical objects of the reverse engineering task: mobile interfaces and web interfaces.

Mobile interfaces. Mobile development is a more common area for interface reengineering than the web. This can be explained by the fact that mobile interfaces are usually simpler and more unified, while web pages have a great variance of layouts. Here we point to a few mobile interface reverse engineering studies that we consider robust baselines for comparing the OCR and code generation parts.

The framework of (Nguyen and Csallner 2015), REMAUI, is conceptually similar to our approach, especially in the OCR part. REMAUI also faces the problem of false-positive candidates and contains a merging step while parsing an input. In both questions, its authors use a set of heuristics specific to the mobile domain. There are differences, however. Firstly, our OCR algorithm mainly focuses on image detection rather than text detection. Secondly, we suggest another model of hierarchy building. Still, the idea of filtering candidates in the OCR stage was added to our framework after familiarization with their paper.

Based on REMAUI, (Natarajan and Csallner 2018) developed a tool, P2A, for generating code for animated mobile applications. This is the next step of the problem: not only to recognize a layout but also to find a relation between two similar interface screenshots. Importantly, this tool implies interaction with a user while performing a task, as does our approach. The difference is that we suggest a long-term setup of the system at the beginning and no involvement afterwards, while P2A requires the user's involvement in each session of image processing.

The work of (Chen and Burrell 2001) solves a very specific subtask of mobile interface reverse engineering: automatic conversion of an Android application into an iOS application, and vice versa, based on interface screenshots. With such restricted input and output, the authors were able to implement a more powerful model. Namely, they reduced the task to component detection and component classification, using convolutional neural networks for the second part.

Web interfaces. Further, we focus on web reverse engineering, because solving this task can have a more considerable outcome, as web technologies are applied in an extremely wide range of domains. Research in this field is not numerous, however.

Asiroglu et al. (Asiroglu et al. 2019) presented a quite unusual reverse engineering case. Their tool accepts a hand-drawn sketch and generates an HTML document using a deep convolutional neural network. Such an application seems very helpful in industry. Unfortunately, it cannot be used to solve the given research problem: in our case, we should detect exact features of the interface, e.g. color, font, and size, while that tool perceives only a general idea of an interface.

The most recent successful industrial attempt at the problem is the result of the UIzard Technologies company (Beltramelli 2017) titled "pix2code". The authors note the similarity between generating code from an interface screenshot and generating natural-language text from an image. Consequently, they use a similar method: the basis of the solution is a cascade of long short-term memory recurrent neural networks. At the moment, this project is only at a proof-of-concept stage, so it is too early to draw valid conclusions about the viability of the idea.

As seen above, there are many points of view on the interface reverse engineering problem, each suitable for specific conditions and goals. The contribution of the given paper is building a reverse engineering framework as an expert system rather than an automation tool. Compared with alternatives, it may eventually generate less accurate results in terms of pixel-to-pixel similarity, but, importantly, specialists can contribute their knowledge to a case storage, thereby managing the style or methodology that they expect to see in the final result.

Case-Based Reasoning

As mentioned above, case-based reasoning is the core of the approach. This section outlines the basics of CBR.

The origin of CBR is the work of Roger Schank (Schank 1982). His idea is to represent knowledge as memory organization packets (MOPs) that hold the results of a person's experience. When a person solves a problem, he or she uses MOPs to reapply a previously successful solution scheme in a new similar context. This contradicts the rule-based approach, where a person uses predefined scripts for all cases. More information can be found in (Watson and Marir 1994).

Many works devoted to this approach were written in the 1990s (Kolodner 1992; Watson and Marir 1994; Aamodt and Plaza 1994; Ram and Santamaría 1997). In the 2000s, case-based reasoning was implemented in different domains: medicine (Gómez-Vallejo et al. 2016; Delir Haghighi et al. 2013), business (Chou 2009; Chang, Liu, and Lai 2008; Carrascosa et al. 2008), finance (Sartori, Mazzucchelli, and Gregorio 2016; Sun et al. 2014), government (Lary et al. 2016), education (Zeyen, Müller, and Bergmann 2017), information technologies (De Renzis et al. 2016; Navarro-Cáceres et al. 2018; He 2013), systems design (Shen et al. 2017), and geosciences (Lary et al. 2016).

In short, CBR is a method of solving a problem by adapting solutions of similar past problems to the current situation.

The core concept in CBR is a case. A case is a triple of elements:
• Problem is a state of the world when the case occurs.
• Solution is a suitable answer for the problem in the given context.
• Outcome is a state of the world after the problem is solved.

The process of finding the solution is called a CBR cycle and consists of 4 stages (Kolodner 1992):
1. Retrieve. The best matching case is extracted from the case base.
2. Reuse. The solution of the retrieved case is adapted to solve the current problem.
3. Revise. The adapted solution is tested: if it is not satisfactory, then either additional cases are retrieved, or the retrieved solution is adapted again.
4. Retain. The case with the given solution is added to the case base.

Figure 1: CBR cycle according to Aamodt and Plaza (Aamodt and Plaza 1994).
Task definition

When defining the task, we aim to reproduce the way a human performs HTML markup building. All criteria, constraints, and conditions given below are based on how the industry operates. For more information, see (Ali 2017).

The following life cycle of interaction with the system is suggested. In the very beginning, an expert (or a group of experts) populates a case storage with code templates to set up the system; alternatively, they can use a ready case storage populated by someone else. This step is required only once. Then the system takes an interface design as input and generates code automatically. At this point, we do not expect an absolutely accurate result. A specialist, in the end, makes final improvements to the code or uses it as a basis for more high-level tasks.

Consequently, the main requirements for the image processing part are that the algorithm must recognize an image structure, find as many features (color, type, margins, etc.) of objects as possible, and generate code implementing the recognized structure. It is not necessary to
• recognize all features
• recreate the picture pixel-to-pixel

The introduction and related work sections describe conditions that should be considered in solving the task. It is necessary to formulate a set of requirements for the system such that positive results can be obtained under these conditions.
• Condition 1. Non-formalized requirements. An image itself does not contain complete information about what the final HTML markup should be. Therefore, it must be possible to supply the system with additional knowledge about the requirements.
• Condition 2. Variability of industrial standards for writing code. There are a few methodologies of HTML markup (adaptive, BEM (block-element-modifier), Bootstrap, and so on). Moreover, as a rule, each development team has its own standards. Thus, the system must generate code in several styles and standards.
• Condition 3. Cross-browser and cross-platform compatibility. HTML code must be not only valid but also identically processed in all modern browsers and operating systems.

In addition, we set input constraints at the current stage. An image must not include
• gradient backgrounds
• animation elements
• popup elements
In the future, we intend to make the algorithm work beyond these conditions.

Architecture

The basis of the architecture is an attempt to recreate the process of writing code by a human. The hypothesis is that a human uses transductive reasoning when formalizing visual information: one extracts structures of different levels from an image and describes these structures by analogy with one's own or others' previous experience.

In particular, such a situation is observed in writing HTML code. Meeting an unknown layout pattern, a specialist finds in the literature how to implement it in HTML. Over time, referrals to references become less frequent, but the principle remains the same: facing another pattern, a human extracts a suitable example from memory and adapts it to the current situation.

The process is implemented in the following way (Fig. 2):
1. The elements are extracted from the image (text, pictures, blocks, etc.).
2. The extracted elements are combined into a tree-like structure according to a special algorithm.
3. A prefix tree traversal is performed; on each iteration, the case base is queried for a suitable HTML description of the architectural pattern in the tree node.
4. Artifacts received at the previous step are saved to files.

Figure 2: Scheme of generating HTML code from a bitmap. 1. Extracting elements. 2. Building a hierarchy of elements. 3a and 3b. Transforming a hierarchy into an HTML document using a case storage. 4. Assembling final files.

The next sections are devoted to the description of these steps.

Image structure building

The goal of this step is to obtain a so-called structure of a bitmap, i.e. a mathematical object describing the mutual arrangement of image parts. In (Myznikov 2018), we described an approach to solving this problem. We detect borders of objects (embedded images, texts, background) and then use a greedy algorithm to build a hierarchy of these objects (see Algorithms 1 and 2). The hierarchy is then processed with the CBR cycle (see the next section).

Algorithm 1 Algorithm of hierarchy construction
Require: nodes – set of elements ordered by the horizontal and vertical axes of the top-left coordinate
Ensure: |nodes| == 1 and nodes_1 is the root node of the tree containing all elements from the original set
1:  while |nodes| > 1 do
2:    for orient ← [HOR, VERT] do
3:      i ← 0
4:      while i < |nodes| do
5:        suitNode ← findNode(nodes_i, orient)
6:        if suitNode is not NULL then
7:          if nodes_i is composite then
8:            nodes_i.addChild(suitNode)
9:          else
10:           newNode ← new Node
11:           newNode.addChild(nodes_i)
12:           newNode.addChild(suitNode)
13:           newNode.orientation ← orient
14:           nodes_i ← newNode
15:          nodes ← nodes \ suitNode
16:        i ← i + 1

Algorithm 2 Search for a suitable node (findNode)
Require: node_i – a node for which a suitable node is being sought; node^x1, node^y1 – the top-left coordinate; node^x2, node^y2 – the bottom-right coordinate
Require: orientation – a direction of joining nodes
Ensure: node_j is the node demanded, or NULL if there is no such node
1:  if orientation is HOR then
2:    node_j ← the next node to the right of node_i
3:  else if orientation is VERT then
4:    node_j ← the next node below node_i
5:  x1 ← min(node_i^x1, node_j^x1)
6:  y1 ← min(node_i^y1, node_j^y1)
7:  x2 ← max(node_i^x2, node_j^x2)
8:  y2 ← max(node_i^y2, node_j^y2)
9:  R ← rectangle (x1, y1)–(x2, y2)
10: for all node_k ∈ nodes do
11:   if node_k ∩ R ≠ ∅ then
12:     node_j ← NULL
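As a rough illustration, the greedy merge of Algorithms 1 and 2 can be sketched in Python. This is a simplified reading, not the paper's implementation: boxes are plain (x1, y1, x2, y2) tuples, the "composite node" shortcut of Algorithm 1 (line 7) is omitted, and all helper names are ours.

```python
class Node:
    def __init__(self, box, children=None, orientation=None):
        self.box = box                  # (x1, y1, x2, y2)
        self.children = children or []
        self.orientation = orientation  # "HOR" or "VERT"

def bounding_box(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def intersects(a, b):
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def find_node(node, nodes, orientation):
    """Next node to the right (HOR) or below (VERT) whose joint
    bounding rectangle overlaps no third element (cf. Algorithm 2)."""
    if orientation == "HOR":
        candidates = [n for n in nodes if n is not node and n.box[0] >= node.box[2]]
        candidates.sort(key=lambda n: n.box[0])
    else:
        candidates = [n for n in nodes if n is not node and n.box[1] >= node.box[3]]
        candidates.sort(key=lambda n: n.box[1])
    if not candidates:
        return None
    cand = candidates[0]
    r = bounding_box(node.box, cand.box)
    for other in nodes:
        if other is not node and other is not cand and intersects(other.box, r):
            return None  # some third element falls inside the joint rectangle
    return cand

def build_hierarchy(nodes):
    """Greedy merging until a single root remains (cf. Algorithm 1)."""
    nodes = list(nodes)
    while len(nodes) > 1:
        merged = False
        for orientation in ("HOR", "VERT"):
            i = 0
            while i < len(nodes):
                mate = find_node(nodes[i], nodes, orientation)
                if mate is not None:
                    nodes[i] = Node(bounding_box(nodes[i].box, mate.box),
                                    [nodes[i], mate], orientation)
                    nodes.remove(mate)
                    merged = True
                i += 1
        if not merged:  # guard against non-termination on odd layouts
            break
    return nodes[0]
```

On three boxes — two side by side above a third — the sketch first joins the top pair horizontally and then joins the result with the bottom box vertically, yielding a single root.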
Development of the CBR-system

Remember that generation of code is performed with the prefix tree traversal, where the tree is an image structure. On each iteration, the CBR cycle runs. We need to define the problem, solution, and outcome in the context of the task; in other words, to describe a case format.
• Problem is a feature vector of a node and its children.
• Solution is a set of HTML and CSS code templates that implement markup of the structure described.
• Outcome is HTML/CSS code generated from the template applied to a specific context.

Thus, when each node of a tree is processed, a "problem" is formed: a description of the current element and the elements one level below it in the image structure. The storage holds cases whose problems are typical cases of element layout and whose solutions are typical ways of markup. The task of a CBR cycle is to get HTML/CSS code that builds the markup of the current structure using the stored solutions. Let us describe the cycle in detail. As a notation, denote the case being processed as the new case. Further terms will be introduced as they appear.

Retrieval phase. A new case has a problem but no solution and outcome. The task of this stage is to retrieve the cases most similar to the problem of the new case. In general, it is a task of n-class classification, where n is the number of cases in storage. At the proof-of-concept stage, we choose the method of k-nearest neighbors with Euclidean distance. For this purpose, we vectorize and normalize problems. This means that, first, all categorical features are transformed to a numerical type; second, absolute values are replaced with relative ones by the formula x_i = (x_i − x_min) / (x_max − x_min). Then we calculate the distance between the given vector and the vectors in the case base, d(x, y) = sqrt(Σ_{i=1}^{n} (x_i − y_i)^2), and select the case where the distance is minimal. In terms of CBR, this case is called retrieved.

Adaptation phase. The adaptation stage is performed by an algorithm that receives the problem of a new case and the solution of a retrieved case and returns an outcome. A case containing the problem of the new case, the solution of the retrieved case, and the generated outcome is a solved case, and its solution is called suggested.

In the current work, a derivative type of adaptation is suggested, meaning an old solution is regenerated under new conditions. Since a solution is a code template, the result is built using a templating language. The difficulty is that the results of different cases may conflict with each other during processing. This is especially true for conflicts in CSS code: situations when different results describe the same class differently. That is why conflict resolution is an important part of the adaptation stage.

Conflict resolution is based on distinguishing three cases: 1) absence of conflict; 2) a complete conflict, i.e. the classes describe incompatible styles; 3) one class is a special case of another. The following strategies are applied accordingly: 1) the code is saved as is; 2) different classes with their own styles are created, and appropriate replacements in the HTML document are made; 3) a new class that is a special case is added to the style set, and appropriate inserts in the HTML document are made.

Evaluation phase. In classic CBR, an evaluation step serves as validation of the suggested result. A case that has passed the test is called tested, and its solution is approved. In this work, the task of evaluating the quality of HTML/CSS code deserves a separate study. In the current state of the research, this step is skipped, and validation of the result is performed for the whole document rather than on each iteration of the CBR cycle.

Updating phase. The goal of the last step of the cycle is to save an approved solution in case storage so that it can be reused in the future. In the current work, this step differs from the classic approach. Instead of using a centralized case storage, the developed CBR system maintains a "global" and a few "local" case storages (see Table 1). The global case storage is used for all images. The local ones are used only while processing a specific document. We can say that a "local" storage is a context of a document: it contains the results of problems solved during its processing.

Table 1: Differences of local and global storages

Local storage                                | Global storage
Stored in RAM                                | Stored in external memory
Destroyed after a document is processed      | Exists regardless of document processing
Filled during document processing            | Filled with a special procedure before documents are processed
Cases have a higher priority when extracting | Cases have a lower priority when extracting

Specifically, a tested case is always saved only into a local storage. On the retrieval step, a search is performed in both the global and local repositories, where the local ones have priority. This approach has several advantages:
1. The risk of generating different code for identical blocks of an image is decreased. The system may process identical blocks differently because noise in an image can affect the retrieval of a suitable case. At the same time, the size limitation of a local storage prevents generating redundant solutions.
2. System performance is increased:
(a) a global storage is located in external memory, while a local one is in RAM; as a result, the number of requests to a hard disk is reduced;
(b) graphical user interfaces tend to contain many similar blocks; when processing the next block, a search in a relatively small local storage that already contains a suitable case is performed much faster than a repeated search in a global storage; also, the adaptation stage then requires fewer resources than the initial adaptation of an "unprepared" case.
3. A global storage is not "clogged" with specific cases, which, first, positively influences the size of the global storage and, second, reduces the level of overfitting of the system.

In other words, a global storage serves as a source of cases that hold solutions for code generation of typical layouts. Local storage is then used for accurate and rapid adaptation of these solutions in the context of a specific document.

Case storage population. The case storage is populated by experts with a share of automation. Namely, a set of cases with their problem parts is built automatically, and the corresponding solution parts are filled in by specialists. The steps for generating cases are the following:
1. A list of real websites is collected.
2. DOM trees of random pages are saved.
3. Trees are split into sub-trees by prefix traversal of the whole tree, selecting each node with its children.
4. The resulting set of sub-trees is clustered with the DBSCAN algorithm, where tree edit distance is used to define the similarity between objects. (See the tree edit distance overview in the section below.)
5. From each cluster, a random object is selected. The selected objects form the case storage, where each object is the problem part of a single case.

By managing the parameters eps and min_samples of DBSCAN, one can control the size of a case storage. The bigger the case storage, the more accurate the system, but the more work for the experts. One must find the right balance between these two criteria to set up the system adequately for the existing conditions. Then, when the set of cases with their problems is built, experts populate it with solutions, i.e. HTML/CSS code.
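The retrieval phase described above — min-max normalization followed by a nearest-neighbour search with Euclidean distance — can be sketched as follows. It assumes the categorical features have already been converted to numbers; the function names are ours.

```python
from math import sqrt

def normalize(vectors):
    """Min-max normalise each feature column to [0, 1]:
    x_i <- (x_i - x_min) / (x_max - x_min)."""
    cols = list(zip(*vectors))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    span = [h - l or 1.0 for h, l in zip(hi, lo)]  # avoid division by zero
    return [tuple((x - l) / s for x, l, s in zip(v, lo, span)) for v in vectors]

def retrieve(problem, case_problems):
    """Return the index of the stored problem nearest to the new one
    (1-nearest-neighbour with Euclidean distance over normalised features)."""
    all_vecs = normalize(case_problems + [problem])
    stored, query = all_vecs[:-1], all_vecs[-1]
    dists = [sqrt(sum((a - b) ** 2 for a, b in zip(v, query))) for v in stored]
    return dists.index(min(dists))
```

The returned index identifies the retrieved case whose solution template is then passed to the adaptation phase.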
Results and further work

Before the experimental evaluation, let us illustrate how the system works in practice. As an example, consider a page of the Novosibirsk State University site.

As an output of the first stage of image processing, elements were extracted and grouped into nodes (Fig. 3). Then, the algorithm described earlier in the paper created a tree-like structure (Fig. 4). We estimate that the result is of good quality because, despite some mistakes (a wrong font was selected, some pictures were missing, tiny errors in size existed), the system managed to solve the main task: to recreate the structure of elements and generate human-like code (Fig. 5).

Figure 3: The first step of processing: element detection. The processed image (left) and selected elements and nodes (right). The nodes are bounded with colored boxes.

Figure 4: The second step of processing: structure building. The tree represents hierarchical relations between elements.

Figure 5: The final result. A web-page image generated by the system.

Figure 6: Top-left parts of the input image (left) and the generated result (right). Some features like margins, fonts and others have been incorrectly reproduced, but in general the layout is generated correctly.

Evaluation method

The question of evaluation is open: what should we compare to estimate the result? One option is to compare images: the initial screenshot and a screenshot of the interface generated by the system. It is probably the most obvious way, but, as stated in the task definition, we do not aim to recreate an interface pixel-to-pixel: it is much more important to recognize the idea of the layout. Another option is to compare source codes. We can collect a base of real applications with their source codes, make screenshots, generate code, and find the difference between the texts. This way is too sensitive to implementation specifics: there are numerous ways to write code for the same interface, and there is no argument for considering the result incorrect if the generated code differs from the real one.

We believe the best way is to compare tree models of the source codes, because a tree model represents the way the code is organized. We collect the data by crawling web pages, so that along with the screenshots we get HTML code, from which we extract the DOM model. As a rule (except for some tricky cases), the DOM model adequately describes the element structure. The idea is to map the structures produced by our algorithm to DOM trees and roughly estimate their similarity. Such a procedure allows us to evaluate our approach, although we understand that it is limited to a specific subset of images (only web interfaces), and some technical tricks of web technologies can affect the correctness of DOM trees and thereby decrease the estimates. Nevertheless, it is the best way of evaluating the method so far, which can prove the viability of the solution and outline the next steps.

Selection of a measure

The crucial thing is the selection of a quality measure. There are a few approaches to comparing trees: some count operations needed to transform one tree into another; others compare the longest common paths from the root to a tree node; there are also variable-length doesn't care (VLDC) based methods. We select two measures: edit distance and bottom-up distance.

Tree edit distance. A review of edit distance methods can be found in the survey of Philip Bille (Bille 2005). This is how he defines the problem. Given a labelled ordered tree T (ordered means that the order among siblings matters), define tree operations as follows:
• Relabel. Change the label of a node.
• Delete. Delete a non-root node v in T with parent v', keeping the order of the children of v.
• Insert. Insert a node v as a child of v' in T, making v the parent of a consecutive subsequence of the children of v'.
Assume there is an edit cost c : V × V → R, where c(v, w) denotes a relabel, c(v, ⊥) a delete, and c(⊥, w) an insert operation. An edit script S between T1 and T2 is a sequence of edit operations s_1, s_2, ..., s_n, s_i ∈ V × V, and the cost of S is d(S) = Σ_{i=1}^{n} c(s_i). The optimal edit script between T1 and T2 is S_opt(T1, T2) : d(S_opt) = min_S d(S), and d(S_opt) is the tree edit distance (Bille 2005).

Based on the survey (Bille 2005), we consider the best solution of the tree edit distance problem for our case to be Klein's algorithm (Klein 1998), which has a worst-case time bound of O(|T1|^2 |T2| log |T2|) and a space bound of O(|T1||T2|).

In our case, we suggest the following edit cost function. First, we denote a fixed finite alphabet Σ containing values for labels:

Σ = {t, p, i}

where t stands for an element with text, p stands for an element with an embedded picture, and i stands for an internal node.

We denote the edit cost as follows:

c(v, w) = 1,   if v ≠ ⊥ ∧ w = ⊥
          0.8, if v = ⊥ ∧ w ≠ ⊥
          0.1, if v ≠ ⊥ ∧ w ≠ ⊥ ∧ v, w ∈ {p, t}
          0,   otherwise    (1)

Also, denote that the tree received by our algorithm is the first parameter and the tree from the data set is the second one. This edit cost function penalizes the method the most if the resulting tree misses a node, a little less if it contains an extra node, and very little if it mixes up text with a picture. Note that it does not penalize mixing up an internal node with a non-internal one, because in this case the algorithm either misses a node or adds an extra one, which already implies a big penalty.

Bottom-up distance. This method is presented in the work of Valiente (Valiente). The complexity of the algorithm is O(|T1||T2| log(|T1| + |T2|)). For two trees T1 and T2, the bottom-up distance equals 1 − f / max(|T1|, |T2|), where f is the size of the largest common forest of T1 and T2. We slightly change this formula and make it asymmetric:

b_asym(T1, T2) = 1 − f / |T2|    (2)

Edit and bottom-up distances as defined above can be roughly interpreted as "precision" and "recall", respectively, in terms of machine learning evaluation. Technically, they are not the same, but the analogy gives a good idea of how to read the experimental results. Indeed, for a tree that correctly represents a part of another tree but completely misses the rest, edit distance would be relatively low, while bottom-up distance would be relatively high. Conversely, if a tree contains all nodes of another tree, but the structure is different and some extra nodes exist, edit distance would be relatively high and bottom-up distance relatively low.

To estimate the approach from different points of view, we use both measures as well as an F-score that generalizes a common penalty. The problem with aggregating the scores is that they have different scales: [0, +∞) for the edit distance and [0, 1] for the bottom-up one. We solve this problem by transforming the edit distance measure. First, we normalize its value by the size of the sample tree:

d_norm(T1, T2) = d(S_opt(T1, T2)) / |T2|    (3)

This allows us to compare the edit distance for trees of different sizes, but the measure is still not bounded from above. Therefore, we apply the logarithm:

d_log(T1, T2) = −log(1 / (1 + d_norm(T1, T2)))    (4)

As d is always positive, the range of d_log is [0, 1). Moreover, the logarithmic form of the measure perfectly suits our idea of estimating the method: we assume that the more the trees differ, the less important the exact value of the difference.

Finally, we can define the F-measure. In order to move from cost scores to quality measures, we subtract the distances from one:

F = 2 · (1 − b)(1 − d_log) / ((1 − b) + (1 − d_log)) = 2 · (1 − b)(1 − d_log) / (2 − b − d_log)    (5)

Data set collection

We used the Alexa service (https://www.alexa.com/topsites) to get a list of the 1000 most popular websites. Then we made screenshots of the main page and two random pages of each website. We also crawled the HTML/CSS code of each page and transformed it into trees using a DOM parser in Python. Due to technical restrictions in some cases, we were only able to collect a dataset of 2640 examples.
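For illustration, the aggregation of Eqs. (3)–(5) fits in a few lines of Python. The natural logarithm and the function name are our assumptions, and the score is only meaningful while d_log stays below 1, in line with the range discussed above.

```python
from math import log

def f_score(edit_distance, tree_size, bottom_up):
    """Fold a raw tree edit distance and a bottom-up distance b in [0, 1]
    into one F-style quality score, following Eqs. (3)-(5)."""
    d_norm = edit_distance / tree_size  # Eq. (3): normalise by |T2|
    d_log = -log(1.0 / (1.0 + d_norm))  # Eq. (4): dampen large distances
    p = 1.0 - bottom_up                 # move from costs to qualities
    r = 1.0 - d_log
    return 2 * p * r / (p + r)          # Eq. (5): harmonic mean
```

Identical trees give f_score(0, n, 0.0) = 1, and the score decays as either distance grows.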
The complexity of the algo- more insights we scored each web page with a measure rithm is O(|T1 ||T2 | log(T1 + T2 )). For two trees T1 and T2 “Text-to-Image ratio”: the bottom-up distance equals 1−f / max(|T1 |, |T1 |), where T f is the size of the largest common forest in T1 and T2 . We tti = (6) slightly change this formula and make it asymmetric: T +I where T is the number of nodes with text content and I is basym (T1 , T2 ) = 1 − f /|T2 | (2) the number of nodes with embedded pictures. Edit and bottom-up distances as defined above can be The reason why we use this score is to estimate how a roughly interpreted as “precision” and “recall” respectively share of text and images affects the result and to define the in terms of machine learning evaluation. Technically, they next steps. It is important to understand which part of the are not the same, but it gives us a good idea of how to read method is the most problematic, and where efforts should be an experiment results. Indeed, for a tree that correctly rep- focused to increase the quality as much as possible. resents a part of another tree but completely does not con- tain another part, edit distance would be relatively low, while Experimental results bottom-up distance would be relatively high. Otherwise, if a In general, results are as follows (see Table 2): tree contains all nodes of another tree, but the structure is 1 different and some extra nodes exist, edit distance would be https://www.alexa.com/topsites Table 2: The experiment results Mean Variance Edit distance 0.28 0.028 Bottom-up distance 0.15 0.043 F score 0.78 0.033 Figure 8: F-score mean and variance depending on “Text-to- Image ratio” Next steps In addition to making improvements from the previous sec- tion, the next steps include: Figure 7: Mean and variance of edit and bottom-up distances depending on “Text-to-Image ratio” 1. Development of a procedure of automatic filling of case storage. 
The current paper considers the development of a sys- The mean of F-score is 0.78, and the variance is 0.033, tem where case storage is populated manually by experts. which appears to be a good result. Note, that the bottom- As explained above, this approach has strong advantages. up distance is much better than the edit distance. It means However, in general, we would like to collect cases auto- that the recognition part, on average, works better than the matically, because it would be less time-consuming and structure building part. less prone to human errors. Our suggestion is to analyze Let us analyze the dependence of scores on “Text-to- existing web-sites elements with one of the methods of Image ratio” (figures 7 and 8). clustering and use centers of clusters as reference cases. We see that the biggest issues are with the cases when an image has both text and embedded pictures, approximately 2. Development of a better similarity measure. in a proportion of seven to three. Comparing extreme cases, One of the crucial elements of a case-based reasoning so- when an image consists of mainly embedded pictures or lution is the selection of an appropriate similarity mea- mainly text, in the first case the approach works far better sure. The current version uses a simple KNN principle. than in the second one. Herewith, edit distance is decreasing Consequently, there is room for optimizing the measure when moving from mixed cases to more text-based, while construction, because the nearest neighbors algorithm is for the bottom-up distance this effect is not so strong. insensitive to categorical features. As categorical features Also note the heteroskedasticity of the data: the variance are the majority of elements properties processed, we are is bigger in “mixed” cases and smaller in extreme cases. planning to use classifier models based on decision trees. That is, the method is more unpredictable when a picture has diverse content. 3. 
Development of a revise stage in the CBR cycle.
When building the CBR cycle, the revise stage has been skipped, because the evaluation of the correctness of an HTML document is an unsolved problem. This question should be investigated to make the cycle complete.

Based on these outputs, we make the following conclusions:
• The method demonstrated satisfactory results.
• The results can be advanced by focusing effort on the following areas:
  – enhancement of the text detection part;
  – improvement of the processing of the cases when an image has diverse content.

In conclusion, we developed a system that generates markup language source code for a given interface screenshot. The distinctive feature of the approach is the use of experts' knowledge kept in a specific case storage. The experiments demonstrated the satisfactory quality of the current solution and provided grounds for further development.

References

Aamodt, A., and Plaza, E. 1994. CBR: Foundational issues, methodological variations and system approaches. AI Communications 7(1):39–59.
Ali, K. 2017. A study of software development life cycle process models. International Journal of Advanced Research in Computer Science 8(1).
Asiroglu, B.; Mete, B. R.; Yildiz, E.; Nalcakan, Y.; Sezen, A.; Dagtekin, M.; and Ensari, T. 2019. Automatic HTML Code Generation from Mock-Up Images Using Machine Learning Techniques. In 2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT), 1–4. IEEE.
Beltramelli, T. 2017. pix2code: Generating Code from a Graphical User Interface Screenshot. 1–9.
Bille, P. 2005. A survey on tree edit distance and related problems. Theoretical Computer Science 337(1-3):217–239.
Carrascosa, C.; Bajo, J.; Julian, V.; Corchado, J.; and Botti, V. 2008. Hybrid multi-agent architecture as a real-time problem-solving model. Expert Systems with Applications 34(1):2–17.
Chang, P.-C.; Liu, C.-H.; and Lai, R. K. 2008. A fuzzy case-based reasoning model for sales forecasting in print circuit board industries. Expert Systems with Applications 34(3):2049–2058.
Chen, D., and Burrell, P. 2001. Case-Based Reasoning System and Artificial Neural Networks: A Review. Neural Computing & Applications 10:264–276.
Chou, J.-S. 2009. Web-based CBR system applied to early cost budgeting for pavement maintenance project. Expert Systems with Applications 36(2):2947–2960.
De Renzis, A.; Garriga, M.; Flores, A.; Cechich, A.; and Zunino, A. 2016. Case-based Reasoning for Web Service Discovery and Selection. Electronic Notes in Theoretical Computer Science 321:89–112.
Delir Haghighi, P.; Burstein, F.; Zaslavsky, A.; and Arbon, P. 2013. Development and evaluation of ontology for intelligent decision support in medical emergency management for mass gatherings. Decision Support Systems 54(2):1192–1204.
Gómez-Vallejo, H.; Uriel-Latorre, B.; Sande-Meijide, M.; Villamarín-Bello, B.; Pavón, R.; Fdez-Riverola, F.; and Glez-Peña, D. 2016. A case-based reasoning system for aiding detection and classification of nosocomial infections. Decision Support Systems 84:104–116.
He, W. 2013. Improving user experience with case-based reasoning systems using text mining and Web 2.0. Expert Systems with Applications 40(2):500–507.
Klein, P. N. 1998. Computing the edit-distance between unrooted ordered trees. In ESA.
Kolodner, J. L. 1992. An introduction to case-based reasoning. Artificial Intelligence Review 6(1):3–34.
Lary, D. J.; Alavi, A. H.; Gandomi, A. H.; and Walker, A. L. 2016. Machine learning in geosciences and remote sensing. Geoscience Frontiers 7(1):3–10.
Myznikov, P. 2018. Development of the Case-Based Approach of Web Interfaces Reverse Reengineering. Vestnik NSU. Series: Information Technologies 16(4):115–126.
Natarajan, S., and Csallner, C. 2018. P2A: A Tool for Converting Pixels to Animated Mobile Application User Interfaces. In 2018 IEEE/ACM 5th International Conference on Mobile Software Engineering and Systems (MOBILESoft), 224–235. Gothenburg, Sweden: IEEE.
Navarro-Cáceres, M.; Rodríguez, S.; Bajo, J.; and Corchado, J. M. 2018. Applying case-based reasoning in social computing to transform colors into music. Engineering Applications of Artificial Intelligence 72:1–9.
Nguyen, T. A., and Csallner, C. 2015. Reverse engineering mobile application user interfaces with REMAUI (T). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), 248–259.
Ram, A., and Santamaría, J. 1997. Continuous case-based reasoning. Artificial Intelligence 90(1-2):25–77.
Sartori, F.; Mazzucchelli, A.; and Gregorio, A. D. 2016. Bankruptcy forecasting using case-based reasoning: The CRePERIE approach. Expert Systems with Applications 64:400–411.
Schank, R. 1982. Dynamic Memory: A Theory of Reminding and Learning in Computers and People. Cambridge: Cambridge University Press.
Shen, L.; Yan, H.; Fan, H.; Wu, Y.; and Zhang, Y. 2017. An integrated system of text mining technique and case-based reasoning (TM-CBR) for supporting green building design. Building and Environment 124:388–401.
Sun, J.; Li, H.; Huang, Q. H.; and He, K. Y. 2014. Predicting financial distress and corporate failure: A review from the state-of-the-art definitions, modeling, sampling, and featuring approaches. Knowledge-Based Systems 57:41–56.
Valiente, G. An efficient bottom-up distance between trees. In Proceedings of the Eighth Symposium on String Processing and Information Retrieval, 212–219. IEEE.
Watson, I., and Marir, F. 1994. Case-based reasoning: A review. The Knowledge Engineering Review 9(4):327–354.
Zeyen, C.; Müller, G.; and Bergmann, R. 2017. Conversational Process-Oriented Case-Based Reasoning. Springer, Cham. 403–419.
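For concreteness, the evaluation measures of Equations (1), (2), (4), (5) and (6) can be sketched in Python, the language already used above for DOM parsing. This is an illustrative reconstruction, not the authors' code: the function names are ours, and the largest-common-forest size f and the normalized edit distance dnorm are assumed to be precomputed by a separate tree edit distance implementation.

```python
import math

def edit_cost(v, w):
    """Edit cost c(v, w) from Eq. (1).

    v is a node label of the tree built by the algorithm, w a label of the
    reference tree from the data set; None plays the role of the gap symbol
    (the bottom element in the original notation). Labels: 't' text element,
    'p' embedded picture, 'i' internal node.
    """
    if v is not None and w is None:
        return 1.0   # the built tree misses a node: the largest penalty
    if v is None and w is not None:
        return 0.8   # the built tree contains an extra node
    if v is not None and w is not None and v in ('p', 't') and w in ('p', 't'):
        return 0.1   # both labels are leaf content: text/picture mix-up
    return 0.0

def basym(f, t2_size):
    """Asymmetric bottom-up distance, Eq. (2): 1 - f/|T2|,
    where f is the size of the largest common forest."""
    return 1.0 - f / t2_size

def dlog(dnorm):
    """Logarithmic form of the normalized edit distance, Eq. (4)."""
    return -math.log(1.0 / (1.0 + dnorm))

def f_measure(b, d):
    """Harmonic mean of the two quality scores, Eq. (5);
    b is the bottom-up distance, d the logarithmic edit distance."""
    return 2.0 * (1.0 - b) * (1.0 - d) / (2.0 - b - d)

def tti(n_text, n_image):
    """Text-to-Image ratio, Eq. (6)."""
    return n_text / (n_text + n_image)

# Sanity check against Table 2: with the mean bottom-up distance 0.15 and
# the mean edit distance 0.28, Eq. (5) reproduces the reported mean F score.
print(round(f_measure(0.15, 0.28), 2))  # 0.78
```

The final line shows that the three mean values in Table 2 are mutually consistent: plugging the two reported mean distances into Eq. (5) yields the reported mean F score of 0.78.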