Visual Interfaces for the Development of Event-based Web Agents in the IRobot System

Liangyou Chen
ACM Member
chen_liangyou@yahoo.com

Abstract. Timely integration and analysis of information from the World-Wide Web is important for businesses and startups, especially to cope with increasing global competition. Manual Web data analysis is time-consuming and error-prone, while developing automated algorithms is nontrivial even for expert Web programmers. We argue that an event-based architecture that couples Web-agent technology with visual-programming methodology is ideal for Web-data analysis. We demonstrate this with the IRobot system, which uses visual interfaces to facilitate the design of event-based intelligent Web agents for data integration. The system is effective in both its ability to capture human-oriented Web procedures and its flexibility in modeling complex agents that meet human goals.

Keywords: Web computing, rule-based systems, Web agents, data integration, Web automation

1 Introduction

The World-Wide Web has grown into a central platform for publishing and distributing information. Small businesses and startups are increasingly dependent on information from the Web, e.g., for price comparison, market analysis, real-time trend discovery, literature research, and many other activities. A real problem for them is the timely integration and analysis of information from multiple Web resources. Because of time and financial constraints, they are unable to hire specialized Web programmers to develop complex systems. Instead, they depend largely on ad hoc, manual approaches to pull information from around the Web, which is both labor intensive and time consuming.

There are efforts from both the research and industrial communities to address this issue. The research community proposes intelligent agent systems to support autonomous data integration from the Web, e.g., Michalowski et al. [1] and Neiling et al. [2]. Industrial solutions typically use special libraries in a hosting programming language for Web automation; examples include cURL [3] in PHP and Scrapy [4] in Python. Utilizing these technologies still requires substantial knowledge of, and effort in, programming and system building. In addition, many Web resources publish standard Web services or application programming interfaces for third-party application integration, which allow automated integration and processing of information from their sites through simple scripts. However, many other websites publish only traditional Web pages intended mainly for human exploration, and integrating and analyzing information from such websites is especially difficult for non-experts.

Our goal is to offer an affordable, easy-to-use software platform that allows small businesses to integrate and process available Web resources without requiring much programming skill. We developed a solution based on event-based Web agents, or robots, in the IRobot (standing for Internet Robots) system, which is able to simulate a user’s navigation on the Web while performing data retrieval and manipulation. The system can be learned and mastered by non-experts because of its visual-programming interfaces. The system has attracted thousands of registered users worldwide and has received favorable feedback on a public Web forum.
2 A Motivating Example

To recommend the best doctors to her patients, Mary, a medical consultant, was considering using the Google Scholar citation index [5][6] to evaluate the expertise of medical doctors. Her idea was that doctors who have published more quality papers should be more knowledgeable about a medical condition than their peers and should be able to provide better treatments. The citation index provided by Google Scholar can serve as a reference for the quality and impact of a doctor’s publications. Mary had the following procedure in mind:

Procedure 1: Google Citation Index
• Get the disease name of a patient from a database;
• Find publications related to the disease from Google Scholar [5];
• Get the author names of each publication;
• Go to the Google Citation Gadget [6] for the citation index;
• Find the citation-index score for each author;
• Rank authors based on their citation-index scores.

Mary found the IRobot software on the Internet. After spending a few hours getting familiar with the system, she quickly created a robot to complete the above procedure. Unfortunately, she found out that the Google Scholar citation index searches only by an author’s last name and first initial, so popular names such as “J. Smith” receive much higher scores than rare names, because there are many J. Smiths in the world. She decided to use another Web service named “Scholarometer” [7], a crowdsourced scholar-rating system provided by Indiana University, to rate scholars, and to use Pubmed [8] to find disease-related publications. She changed the procedure as follows:

Procedure 2: Scholarometer Impact Score
• Get the disease name of a patient from a database;
• Find publications related to the disease from Pubmed [8];
• Get the author names of each publication;
• Use the Scholarometer Web service [7] at Indiana University to search for each author’s impact score;
• Find the impact score for each author;
• Rank authors based on their impact scores.

Mary could then easily make a robot for Procedure 2. She found the new impact scores more reliable than the citation-index scores, and based on the ranking she had more confidence when recommending doctors to her patients.

In this example, Mary could test her idea directly on the Web using the IRobot software, and she could continuously change and refine her idea at little cost. The resulting robot improved the quality of her work. (Note that although the two procedures look similar, retrieving and locating Web data at different websites requires significant programming effort with regular programming techniques.)

3 The IRobot System

IRobot is a system for the design and deployment of Web agents for Web data processing. The system includes convenient visual-programming interfaces for composing and combining high-level Web and database operations into actionable software agents. The creation and operation of agents can be easily followed and mastered by casual users. The system also provides a full range of lower-level data-operation functions, such as text transformation, date and time operations, and logic computation, for skilled users. An event-driven architecture is employed to integrate features and functions at different levels.

Internally, IRobot uses a specially designed XML-based rule language to represent Web and database operations, which are termed “actions” in the IRobot system. Each step in Procedure 1 and Procedure 2, for example, can be seen as an action in the IRobot system and can be represented as an “Action” tag in XML. The benefit of an XML-based rule language is that rules can be easily composed and manipulated by software. This, for example, allows the IRobot software to automatically learn action rules from user-Web interactions and encode them in XML.
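As a rough illustration of why an XML rule representation is convenient to manipulate programmatically, the following Python sketch composes a small, hypothetical task for Procedure 2. The tag and attribute names used here (Task, Action, GotoURL, and so on) are illustrative assumptions only and do not reproduce IRobot’s actual rule schema.

# Illustrative sketch only: the element and attribute names below are
# hypothetical, chosen to mirror the recorded steps of Procedure 2, and are
# not IRobot's actual rule schema.
import xml.etree.ElementTree as ET

def make_action(name, kind, **attrs):
    """Build one hypothetical <Action> element."""
    return ET.Element("Action", {"name": name, "type": kind, **attrs})

task = ET.Element("Task", {"name": "Scholarometer Impact Score"})
task.append(make_action("GotoPubmed", "GotoURL",
                        url="http://www.ncbi.nlm.nih.gov/pubmed/"))
task.append(make_action("SearchDisease", "SubmitForm", input="{DiseaseName}"))
task.append(make_action("AuthorList", "GetTable", like="Author Information"))

# Because the rules are plain XML, software (such as a recorder that watches
# the user's browsing) can generate, rewrite, or merge them programmatically.
print(ET.tostring(task, encoding="unicode"))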
In IRobot, a sequence of actions comprises a “task,” and a robot may include multiple tasks. Each robot is stored and maintained in a single robot file and represents a logically complete job. In our example, Procedures 1 and 2 can be designed as two tasks named “Google Citation Index” and “Scholarometer Impact Score,” and both can be included in a single robot named “scholar_index.” In addition, we allow users to flexibly divide a task into multiple smaller tasks and combine them through “task calls.” The concepts of robot, task, and action also allow users to visualize Web interactions as objects of different granularities, and they provide a means to solve complex problems by visually composing and combining simpler operations. A more complete discussion of the features of the IRobot system can be found in our online manual at http://irobotsoft.com/help/.

Fig. 1 shows the main interface of IRobot. The interface lists user-designed robots on the right in an embedded Internet Explorer (IE) Web browser and shows the actions of a selected robot in the left panel. Each action can be customized as an object. A user can open and run robots from the main interface with simple clicks. When a robot is running, it shows the real-time Web interactions in the embedded Web browser.

Fig. 1. The main interface of the IRobot system. On the right, it uses an embedded IE Web browser to list user-designed robots. On the left, it shows the actions of a selected robot.

3.1 Visual-Programming Interface

IRobot allows users to visually compose and combine actions representing Web and database operations. Actions can be created in IRobot by simple recording. Specifically, IRobot provides a recorder-like interface, which automatically generates a sequence of robot actions as the user navigates in the embedded Web browser. These actions can be used to repeat what the user has done in the browser, such as link following, input feeding, form submission, or data extraction. More importantly, these actions are resistant to Web-page changes and can work continuously on dynamically generated Web content. Internally, we use the robust wrapper techniques reported in [9] and [10] to locate Web data.

Once the actions are generated, the user can move them around in an object-oriented fashion or customize their properties via a Web-browser-based interface. Fig. 2 shows the customization of the properties of a “Get a table of data” action, which is given the name “AuthorList.” Customizable properties include the location of the data, the sequential order for retrieving each tuple of data, the text description of the action, and so on. Note that these properties are set initially by the recorder, and typically users simply need to choose another option to change the action’s behavior.

Fig. 2. Properties of a robot action that has been automatically generated by the recorder.

3.2 Event Customization

Fine-grained control and customization of robots is done through event-condition-action (ECA) rules. For this, we have defined special events corresponding to different stages of retrieving a Web page or retrieving a tuple of results from the Web page. For example, a “before each page” event is associated with the time before a page is retrieved by the robot, and an “after each tuple” event is associated with the time after a tuple is extracted from the page. Users can then use these events to fire action rules, for example, to compute new variables or to call robot tasks, as sketched below.
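As a rough illustration of the ECA idea, the following Python sketch registers a hypothetical rule on an “after each tuple” event that splits a raw author string into individual names. The rule registry and the split_authors helper are illustrative stand-ins, not IRobot’s built-in functions (such as htql or loadData).

# Hypothetical sketch of event-condition-action (ECA) dispatch; it does not
# reproduce IRobot's real engine or built-in functions (htql, loadData, ...).
from collections import defaultdict

rules = defaultdict(list)              # event name -> list of (condition, action)

def on(event, condition=lambda tup: True):
    """Register an action rule that fires on an event when its condition holds."""
    def register(action):
        rules[event].append((condition, action))
        return action
    return register

def fire(event, tup):
    """Fire every rule bound to an event for one extracted tuple."""
    for condition, action in rules[event]:
        if condition(tup):
            action(tup)

@on("after each tuple", condition=lambda tup: "AuthorList" in tup)
def split_authors(tup):
    # Divide the raw author string of a result tuple into individual names.
    tup["Authors"] = [name.strip() for name in tup["AuthorList"].split(",")]

# Example: one tuple extracted from a Pubmed result page.
record = {"Title": "A study of ...", "AuthorList": "Smith J, Doe A"}
fire("after each tuple", record)
print(record["Authors"])               # ['Smith J', 'Doe A']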
Fig. 3 shows the use of events in the “scholar_index” robot to transform the author names. Here, the user takes low-level functions such as “htql” and “loadData” (specifications are available in our manual) and associates them with an “after each tuple” event to divide the author names out of each tuple of results (which was defined in the variable “AuthorList”). The ECA rules thus serve to separate the relatively simple computations from the more complex Web-related operations (i.e., data extraction and Web navigation).

Fig. 3. Visual interfaces to customize events in the IRobot system.

3.3 Database Operations

IRobot is also a data-integration engine. It supports the integration of general data sources, including most commercial databases, text files, HTML files, and XML files. Each database or file is defined as a named data source in IRobot through simple wizard-like interfaces. Once defined, data sources can be used to locate or save data. Most database operations can be defined with visual interfaces. For example, Fig. 4 shows the interface to save and sort data in a text database in CSV (Comma-Separated Values) format. The sorting fields are simply listed in the “Sorting by fields” box, and duplicate values are removed by selecting an option from a drop-down list, in this case “Unique & Keep Old Data & Append File,” which ensures that new data with the same unique keys will not be added to the text database. Finally, this database operation is associated with an “after each tuple” event, so data are automatically fed and saved to the database as each tuple of results is extracted. The combination of visual interfaces and ECA rules provides a mechanism for users to define database operations without writing complex structured query language (SQL) statements.

Fig. 4. Visual interfaces in the IRobot system to save and sort data to databases.

4 How IRobot Works

As demonstrated in Fig. 1, Procedure 1 in our example is realized in IRobot as a task named “Google Citation Index,” which includes the following sequence of actions:

a. Go to URL: http://scholar.google.com/...;
b. Submit form with 'group2';
c. Get a table of data including 'Related articles';
d. Repeat;
e. Go to URL: http://code.google.com/p/citations-gadget/;
f. Submit form with 'group3';
g. Click a button like 'Submit';
h. Extract data like 'Citations';
i. Extract data like 'Cited Publications';
j. Extract data like 'H-index';
k. Save.

This sequence of actions closely mirrors the steps in Procedure 1, and a user can easily understand the workflow of the task by simply looking over the action list. However, such visual simplicity disguises much complexity in the actual execution of the task. The run-time complexity of the robot comes from two sources. First, the actions are not carried out strictly in sequence: they are performed in a recursive manner, i.e., each later action is carried out repeatedly on each tuple of data produced by its preceding action. This recursive behavior mainly affects actions that produce multiple tuples. For example, action c in the above list extracts multiple “Related articles,” and because of the recursion, each article is further processed by actions d to k.
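The following Python sketch models this recursive execution under simplifying assumptions: each action is a function that takes one tuple and yields zero or more tuples, and the remaining actions are applied to every tuple an action produces. It illustrates the behavior described above rather than IRobot’s actual engine.

# Illustrative model of recursive action execution; not IRobot's actual engine.
# Each "action" takes one input tuple (a dict) and yields zero or more tuples.

def run(actions, tuple_in):
    """Apply the first action to tuple_in, then recursively apply the
    remaining actions to every tuple the first action produces."""
    if not actions:
        return
    first, rest = actions[0], actions[1:]
    for tuple_out in first(tuple_in):
        run(rest, tuple_out)

# Toy stand-ins for actions c, e, and h of the "Google Citation Index" task.
def get_related_articles(tup):          # action c: yields one tuple per article
    for author in ("A. Smith", "B. Lee"):
        yield {**tup, "author": author}

def goto_citation_gadget(tup):          # action e: navigation, passes the tuple on
    yield tup

def extract_citations(tup):             # action h: terminal extraction step
    print(tup["disease"], tup["author"])
    return ()                           # produces no further tuples

run([get_related_articles, goto_citation_gadget, extract_citations],
    {"disease": "diabetes"})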
The second source of run-time complexity comes from various programming constructs, including action repeating, conditional branching, and task calling. These programming constructs provide a means to finely control the execution logic of a robot, and they can be defined visually. For example, action d above repeats over each author extracted by action c, and, again because of the recursion, actions e-k are applied recursively to each author. Conditional branching and task calling are mainly done with ECA rules, where, based on certain conditions, another task may be called for execution, just like a function call in regular programming languages. For example, in our software demonstration for Procedure 2, the Scholarometer Web service is designed as a separate task and is called from the main task “Scholarometer Impact Score” after each author is found from Pubmed.

5 More about the IRobot System

IRobot is free software available at http://irobotsoft.com/. Video demos and detailed manuals can be found online at http://irobotsoft.com/help/. An active discussion forum is at http://irobotsoft.org/bb/.

Our members love the software. For example, forum member “herbycanopy” said: “I must say this program really is great, you all seem to have thought of everything.” Another recent comment from member “linkme” said: “I love this software … But … When are you going to go xPlatform with it? It doesn’t play as well as it could in wine :p”

Through the use of ECA rules and visual-programming interfaces, IRobot offers great simplicity for the design of Web-data-integration agents. IRobot has decreased the cost of Web-data collection and analysis for small businesses. For example, one of our customers stated in an email: “Desperate to work with someone who is reasonably priced and that we can trust and you have never let us down.”

References

1. Michalowski, M., Ambite, J. L., Thakkar, S., Tuchinda, R., Knoblock, C. A., Minton, S.: Retrieving and Semantically Integrating Heterogeneous Data from the Web. IEEE Intelligent Systems, 19, 72--79 (2004)
2. Neiling, M., Schaal, M., Schumann, M.: WrapIt: Automated Integration of Web Databases with Extensional Overlaps. Web, Web-Services, and Database Systems, 2593/2009, 184--198 (2009)
3. cURL, http://curl.haxx.se/
4. Scrapy, http://scrapy.org/
5. Google Scholar, http://scholar.google.com/
6. Google Citation Gadget, http://code.google.com/p/citations-gadget/
7. Scholarometer, http://scholarometer.indiana.edu/
8. Pubmed, http://www.ncbi.nlm.nih.gov/pubmed/
9. Chen, L., Jamil, H. M., Wang, N.: Automatic Composite Wrapper Generation for Semi-structured Biological Data Based on Table Structure Identification. SIGMOD Record, 33, 58--64 (2004)
10. Chen, L.: Ad Hoc Integration and Querying of Heterogeneous Online Distributed Databases, Ph.D. Dissertation. Dept. of Comp. Sci. & Eng., Miss. State Univ., MS (2004)