<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Search Engine Optimization Recommender System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christian D. Hoyos</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Juan C. Duque</string-name>
          <email>juan.duqueg@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrés F. Barco</string-name>
          <email>bar@usc.edu.co</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Search Engine Optimization refers to the process of improving the position of a given website in a web search engine's results. This is typically done by adding a set of parameters and metadata to the hypertext files of the website. As the majority of today's web-content creators are non-experts, automating the search engine optimization process becomes a necessity. In this regard, this paper presents a recommender system to improve search engine optimization based on the site's content and the creator's preferences. It exploits text analysis of labels and tags, artificial intelligence for deducing content intention and topics, and case-based reasoning for generating recommendations of parameters and metadata. Recommendations are given in natural language using a predefined set of sentences.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Normally, web content creators require their websites to be easily
found by content consumers through search engines [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. They do so by setting parameters and adding metadata to the
hypertext source files of the websites. These parameters and metadata
allow the algorithms of the search engines to index and retrieve data
of millions of websites efficiently [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. For instance, parameters about the intention of the website make
it possible to classify content, and metadata stating the location is
useful to customize content or restrict access. Further, this
information makes it possible for the search engine to rank the
results of a query by priority. As reported by Chitika [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], configuring websites for correct indexing is a key element of
their success. This configuration of values is called Search Engine
Optimization (SEO).
      </p>
      <p>Now, although every website is implemented following a
standard, namely HTML, there is no standard for web page ranking:
each search engine (Google, Yahoo, Bing, etc.) implements its own
ranking system. This implies that improving the indexing position of
a website requires an expert on both the content and the search
engine's ranking system.</p>
      <p>In this regard, this paper proposes an expert recommendation
system in charge of performing SEO for a given web page (that is, it
analyzes each web page individually), targeting the Google search
engine. It uses artificial intelligence to deduce the intention and
content topic of the web page, text analysis of labels and tags for
classification and comparison, and case-based reasoning to provide
recommendations for improving SEO of the web page.</p>
      <p>The document is structured as follows. The overall behavior of
the system, and its architecture, are presented in Section 2. Each of
the modules of the system is described in Section 3. An experimental
test and its results are shown in Section 4. Conclusions are
presented in Section 5.</p>
    </sec>
    <sec id="sec-2">
      <title>Overview</title>
      <p>To provide recommendations for the indexation of a web page, aspects
such as content topic, keywords, intention of the (authors') web page,
metadata, related web pages and the specific ranking system of the
search engine should be taken into account. These aspects allow the
expert system to understand the website's communication goals and
to create recommendations that respect the search engine
implementation. The expert system proposed here tries to unveil these
aspects using three modules in charge of analysis and one module in
charge of recommendation generation (see Figure 1).</p>
      <p>The system receives three inputs, two of which are optional. The
first input is either an HTML source file or a hyperlink (URL to an
HTML file). If the HTML contains scripts or CSS definitions, they are
ignored, as they do not provide useful information for the
indexation. Hyperlinks must be accessible from the web.</p>
      <p>The second input is the topic of the web page, which is optional.
The last input is the intention of the web page, which is optional as
well. It is worth noticing that an explicitly defined topic and
intention help the system's accuracy and performance (no topic and
intention processing is needed). Given the inputs, the system
executes the following steps and produces as output a web page score
and its recommendations.</p>
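      <p>The flow just described can be sketched as follows; this is a minimal illustration in Python, and every function and field name in it is an assumption of this sketch rather than part of the actual implementation.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnalysisInput:
    html: Optional[str] = None       # raw HTML source of the page, or ...
    url: Optional[str] = None        # ... a hyperlink the system fetches
    topic: Optional[str] = None      # optional; skips topic deduction
    intention: Optional[str] = None  # optional; skips intention deduction

# Placeholder stages standing in for the four modules of the system.
def fetch(url):
    raise NotImplementedError("network access is stubbed out here")

def analyze_factors(source):
    return 0.0   # module 1: score over the 22 indexation factors

def deduce_topic(source):
    return ""    # module 2: Watson-based topic deduction

def deduce_intention(source):
    return ""    # module 2: Watson-based intention deduction

def search_top_pages(topic, intention):
    return []    # module 3: first 10 search results for topic + intention

def build_recommendations(score, top_pages):
    return []    # module 4: natural-language recommendations

def run_pipeline(inp):
    source = inp.html if inp.html is not None else fetch(inp.url)
    topic = inp.topic or deduce_topic(source)             # only if not given
    intention = inp.intention or deduce_intention(source)
    score = analyze_factors(source)
    pages = search_top_pages(topic, intention)
    return {"score": score,
            "recommendations": build_recommendations(score, pages)}
```
      </preformat>
      <p>When both topic and intention are supplied, the deduction stage is bypassed entirely, which is the accuracy/performance benefit mentioned above.</p>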
      <p>
        First, the web page is analyzed using text analysis over the HTML
source code. The analysis produces a score depending on the presence
or absence of 22 of the most important factors for indexation
according to Google [
        <xref ref-type="bibr" rid="ref2 ref5">2, 5</xref>
        ]. These factors add positive values to the score when present and
negative values when absent. This is the first source of knowledge
for building a recommendation for a web page.
      </p>
      <p>
        Once the text analysis is done, a topic and intention analysis is
performed using the IBM Watson system (a state-of-the-art artificial
intelligence API) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The topic and intention are useful in two ways. On the one hand,
they allow the system to classify the content of the web page. On the
other hand, they are the basis of the case-based reasoning
recommendation executed in the last step.
      </p>
      <p>
        Next, using the obtained topic and intention as keywords, the
system performs a search query in the Google search engine and
retrieves the first 10 pages from the result. It then proceeds by
analyzing each web page with the aim of extracting key values, such as
keywords and metadata, that made those pages the 10 first-ranked pages
of Google. This is an implementation of case-based reasoning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
and is the second source of knowledge for building a recommendation
for a web page.
      </p>
      <p>
        Finally, the system builds a recommendation using HTML code and
natural language [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] with predefined sentences. The sentences are based on the
identified negatively evaluated factors (e.g., missing tags) and the
data extracted from the first 10 pages (e.g., new keywords).
      </p>
      <p>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
    </sec>
    <sec id="sec-3">
      <title>System’s Core</title>
      <p>The recommendation system is divided into four modules.</p>
    </sec>
    <sec id="sec-4">
      <title>Module 1: HTML analysis</title>
      <p>This module focuses on the labels and metadata of the web page's
HTML source files. In particular, it looks for specific information
related to the Google ranking system: 22 key aspects in specific
labels such as &lt;meta name=...&gt;. These aspects include keyword
definitions, char-set codification, the description of the web page,
copyright, content duplication and broken links, among others. Each
factor has an associated positive value if included in the source
file and a negative value if not. Table 1 lists some of the key
aspects.</p>
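      <p>The presence/absence scoring can be illustrated with a small sketch. Only the 13.5 benefit and -16.8 penalty figures echo values that appear in the paper; the pairing of numbers with factors here, and the remaining numbers, are illustrative.</p>
      <preformat>
```python
# Hypothetical excerpt of the factor table: each factor maps to a
# description and a (benefit, penalty) pair.  The numeric values and
# their assignment to factors are illustrative stand-ins.
FACTORS = {
    "F1":  ("use of keywords in tag title",      13.5, -16.8),
    "F4":  ("meta description up to 200 words",  10.5, -10.5),
    "F10": ("use of heading tags h1, h2, h3",    10.5, -10.5),
}

def factor_score(present):
    """Add the benefit of every factor found in the page and the
    penalty of every factor that is absent."""
    total = 0.0
    for label, (_desc, benefit, penalty) in FACTORS.items():
        total += benefit if label in present else penalty
    return total
```
      </preformat>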
      <table-wrap id="tbl1">
        <label>Table 1</label>
        <caption>
          <p>Some of the key indexation factors checked in the HTML source.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Label</th><th>Description</th></tr>
          </thead>
          <tbody>
            <tr><td>F1</td><td>Use of keywords in tag title.</td></tr>
            <tr><td>F2</td><td>Connection among keywords (interrelated).</td></tr>
            <tr><td>F3</td><td>Low density of keywords (not too many).</td></tr>
            <tr><td>F4</td><td>Description in tag meta with a maximum of 200 words.</td></tr>
            <tr><td>F5</td><td>Excessive use of meta and alt tags.</td></tr>
            <tr><td>F6</td><td>Definition of codification in tag char-set.</td></tr>
            <tr><td>F7</td><td>Avoiding the use of tag refresh.</td></tr>
            <tr><td>F8</td><td>Use of tag alt in &lt;img&gt; and &lt;input&gt;.</td></tr>
            <tr><td>F9</td><td>No broken URLs in source file.</td></tr>
            <tr><td>F10</td><td>Use of tag H (h1, h2, h3).</td></tr>
            <tr><td>F11</td><td>Exceeding the maximum number of characters in tag title.</td></tr>
            <tr><td>F12</td><td>Use of tag keyword with a maximum of 200 characters.</td></tr>
            <tr><td>F13</td><td>Percentage (between 5 and 20) of keywords in text.</td></tr>
            <tr><td>F14</td><td>Hyperlinks to pages of the same website.</td></tr>
            <tr><td>F15</td><td>Content strongly connected to the web page topic and keywords.</td></tr>
            <tr><td>F16</td><td>Duplicated content.</td></tr>
            <tr><td>F17</td><td>Use of strong, bold and italic for fonts.</td></tr>
            <tr><td>F18</td><td>Use of cache-control tag.</td></tr>
            <tr><td>F19</td><td>Keywords in URL.</td></tr>
            <tr><td>F20</td><td>Use of keywords in numbered lists.</td></tr>
            <tr><td>F21</td><td>Use of tag author.</td></tr>
            <tr><td>F22</td><td>Definition of tag copyright.</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        The intention and topic are deduced from the content, meaning that
only the text within the labels &lt;body&gt; ... &lt;/body&gt; is
analyzed. Both intention and topic are deduced using the IBM Watson
system through its public API, and only if no user input is given.
Watson is, in essence, an on-line system that exploits several
techniques from artificial intelligence to provide services such as
speech-to-text, natural language understanding, question answering,
emotion and sentiment analysis, translation and visual recognition [
        <xref ref-type="bibr" rid="ref3 ref9">3, 9</xref>
        ].
      </p>
      <p>The topic and intention are deduced by Watson using Natural
Language Understanding/Classification for the analysis of text. In
the case of the topic, classification is done through a set of
categories, concepts and keywords. In the case of the intention, the
system classifies according to how positive or negative the web page
is. The analysis then assigns one of the following labels to the
page: Very Positive, Positive, Neutral, Negative and Very Negative.
Each of these labels is connected to numerical values produced by
Watson, as presented in Table 2.</p>
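      <p>Since the numerical ranges of Table 2 are specific to Watson's output, the mapping below is only a hypothetical sketch: it assumes a sentiment score in [-1, 1] and uses made-up cut-points between the five labels.</p>
      <preformat>
```python
def intention_label(score):
    """Map a sentiment score in [-1, 1] to one of the five labels.
    The cut-points are illustrative, not the actual values of Table 2."""
    if score >= 0.6:
        return "Very Positive"
    if score >= 0.2:
        return "Positive"
    if score >= -0.2:
        return "Neutral"
    if score >= -0.6:
        return "Negative"
    return "Very Negative"
```
      </preformat>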
    </sec>
    <sec id="sec-5">
      <title>Test</title>
      <p>Two types of tests have been performed: tests using public web
pages and tests using a web page created by the authors.</p>
    </sec>
    <sec id="sec-6">
      <title>Public websites tests</title>
      <p>In these tests, five topics have been chosen and five
corresponding queries have been designed.</p>
      <p>
        The set of categories, concepts and keywords, together with the
intention, is used for constructing a search query with the aim of
obtaining similar web pages. The main idea is to extract the
parameters used by top-ranked web pages, the first 10 pages in
Google's search engine, that address the same topic and have the same
intention. Potentially, those 10 pages include data in their HTML
files that made them the first ranked by the search engine. Arguably,
using the same or similar parameters (such as new keywords or tags)
will help to improve the indexation of other pages. For instance, by
adding keywords that were not previously included in the web page but
that are common to most of the 10 retrieved pages.
      </p>
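      <p>The extraction of shared parameters can be sketched as a frequency count over the keyword lists of the retrieved pages; the threshold used below (a keyword must appear in most of the pages) is an assumption of this sketch.</p>
      <preformat>
```python
from collections import Counter

def common_keywords(pages, min_share=0.7):
    """pages: one keyword list per retrieved page.  Returns the keywords
    present in at least min_share of the pages, i.e. values that the
    top-ranked pages have in common and that may be worth recommending."""
    counts = Counter()
    for kws in pages:
        counts.update(set(kws))   # count each page at most once per keyword
    needed = min_share * len(pages)
    return sorted(k for k, c in counts.items() if c >= needed)
```
      </preformat>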
      <p>
        Note: the system only retrieves the first 10 pages for two reasons.
On the one hand, according to the literature, the probability of a
user accessing a web page ranked after the 10th position is around 1% [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Thus, the system obtains only those web pages that are likely to
have a high user access rate. On the other hand, the analysis of more
pages may reduce efficiency: each of the 10 pages is analyzed using
the same techniques, so the process, plus the comparison, must be
executed 11 times, which is time consuming.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Module 4: Natural language recommendation</title>
      <p>Recommendations are built with structured predefined sentences of
the form: target factor + recommendation over the factor + explanation
of the recommendation + example in HTML. Each recommendation is
classified into one of four categories according to its importance:
Black: Critical recommendation that must be applied for basic
indexation in the Google search engine.</p>
      <p>Red: Not following the recommendation may significantly affect
the position of the web page in the results.</p>
      <p>Yellow: Not following the recommendation may moderately affect
the position of the web page in the results.</p>
      <p>Blue: Not following the recommendation may minimally affect
the position of the web page in the results.</p>
      <p>The five designed queries are:
1. Football soccer critic.
2. Mediterranean food.
3. Vaccines for cats.
4. Contamination of the oceans.
5. Renewable energies.</p>
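      <p>The four-part sentence structure described above can be sketched as simple string assembly; the wording, the argument names and the example text below are illustrative, not the system's actual templates.</p>
      <preformat>
```python
# The four importance categories, from critical to minimal impact.
SEVERITY = ("Black", "Red", "Yellow", "Blue")

def build_recommendation(factor, advice, why, html_example, severity):
    """Assemble 'target factor + recommendation over factor +
    explanation of recommendation + example in HTML'."""
    assert severity in SEVERITY, "unknown importance category"
    text = "{}: {}. {}. Example: {}".format(factor, advice, why, html_example)
    return {"severity": severity, "text": text}

# Illustrative use, loosely modeled on the heading-tag advice reported
# in the tests:
rec = build_recommendation(
    "Heading tags",
    "Use labels h1 to h6 more often",
    "They help define the importance of content within the page",
    "wrap the main page title in an h1 element",
    "Yellow",
)
```
      </preformat>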
      <p>The first three results of each query have been fed to the system
with automatic execution. Table 3 shows the number of
recommendations for each retrieved page.</p>
      <p>
        For these tests, a web page created by the authors is fed to the
system in three different rounds. Recommendations (from rounds one
and two) are implemented before the next round (rounds two and
three). The designed page is a basic HTML file, without styles or
scripts, used to show the improvement of a given web page through the
system's recommendations. The title of the web page is "The fall of
JQuery", and it addresses the decline in the number of developers
using JQuery. Figure 2 shows the recommendations of the first round
(in Spanish), with different colors for their importance. As an
example of the result, the first recommendation states "You should
use labels h1, h2, h3...h6 more often, as they help define the
importance of content within the page". Table 4 presents the results
of the three rounds of execution.
      </p>
      <p>The values produced by the system in the three rounds show an
evolution of the web page through the recommendations. As expected,
for a web page with no external links referencing it, the system
assigns a low score and several recommendations in the first run,
with the score increasing and the number of recommendations
decreasing in later rounds. Bear in mind that the number of
recommendations is lower than in the tests of the previous section,
given that the content of the designed web page is not as large and
does not have as many links as the other pages.</p>
    </sec>
    <sec id="sec-8">
      <title>Conclusions</title>
      <p>Although the internals of web search engines are very similar, each
of them implements a different ranking system for indexing web pages.
In consequence, the identification of the factors that are included
in the ranking systems, and their tuning by means of hypertext
(metadata), is critical for the success of a given web page. In this
context, tags, topic and intention are relevant for recommending
changes aimed at improving the position in the results.</p>
      <p>This paper proposed a recommender system for improving the search
engine optimization of a web page in Google's search engine. The
system evaluates 22 main factors used by the Google search engine to
classify (rank) web pages. The system represents a positive
contribution because:</p>
      <p>Basic and fundamental factors are handled so that the search
engine can identify the content and structure of the web page.
Each recommendation explains in detail, with examples and in natural
language, how a factor of the website can be improved.</p>
      <p>A user without much experience in SEO can make use of the
recommendation system, as it is intuitive.</p>
      <p>Recommendations are different for each factor and each web page
(customized recommendations).</p>
      <p>The analysis and recommendations are made based on the top 10
best-indexed sites in Google that deal with the same topic and
intention (an instance of case-based reasoning).</p>
      <p>The system interface is in Spanish, as it is being used in a
multimedia engineering program in Colombia.</p>
      <p>Figure 2. Recommendations of the first round for “The fall of JQuery” web page.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Monica</given-names>
            <surname>Bianchini</surname>
          </string-name>
          , Marco Gori, and Franco Scarselli, '
          <article-title>Inside pagerank'</article-title>
          ,
          <source>ACM Trans. Internet Technol.</source>
          ,
          <volume>5</volume>
          (
          <issue>1</issue>
          ),
          <fpage>92</fpage>
          -
          <lpage>128</lpage>
          , (
          <year>February 2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Pablo</given-names>
            <surname>Fernández</surname>
          </string-name>
          , '
          <article-title>Google's pagerank and beyond: The science of search engine rankings'</article-title>
          ,
          <source>The Mathematical Intelligencer</source>
          ,
          <volume>30</volume>
          (
          <issue>1</issue>
          ),
          <fpage>68</fpage>
          -
          <lpage>69</lpage>
          , (
          <year>Mar 2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Ferrucci</surname>
          </string-name>
          , 'Introduction to “This is Watson”',
          <source>IBM Journal of Research and Development</source>
          ,
          <volume>56</volume>
          (
          <issue>3.4</issue>
          ), 1:
          <fpage>1</fpage>
          -1:
          <lpage>15</lpage>
          , (May
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Chowdhury</surname>
            <given-names>G.</given-names>
          </string-name>
          ,
          <article-title>'Natural language processing'</article-title>
          ,
          <source>Annual Review of Information Science and Technology</source>
          ,
          <volume>37</volume>
          (
          <issue>1</issue>
          ),
          <fpage>51</fpage>
          -
          <lpage>89</lpage>
          , (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>Chitika Insights</string-name>
          .
          <article-title>The value of google result positioning</article-title>
          . http://info.chitika.com/uploads/4/9/2/1/49215843/chitikainsights-valueofgoogleresultspositioning.pdf, cited
          <year>June 2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Killoran</surname>
          </string-name>
          , '
          <article-title>How to use search engine optimization techniques to increase website visibility'</article-title>
          ,
          <source>IEEE Transactions on Professional Communication</source>
          ,
          <volume>56</volume>
          (
          <issue>1</issue>
          ),
          <fpage>50</fpage>
          -
          <lpage>66</lpage>
          , (
          <year>March 2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Atanas</given-names>
            <surname>Kiryakov</surname>
          </string-name>
          , Borislav Popov, Damyan Ognyanoff, Dimitar Manov, Angel Kirilov, and Miroslav Goranov, '
          <article-title>Semantic annotation, indexing, and retrieval'</article-title>
          ,
          <source>in The Semantic Web - ISWC</source>
          <year>2003</year>
          , eds.,
          <string-name>
            <given-names>Dieter</given-names>
            <surname>Fensel</surname>
          </string-name>
          , Katia Sycara, and John Mylopoulos, pp.
          <fpage>484</fpage>
          -
          <lpage>499</lpage>
          , Berlin, Heidelberg, (
          <year>2003</year>
          ). Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kolodner</surname>
          </string-name>
          ,
          <source>Case-Based Reasoning</source>
          , Elsevier Science,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Punkaj</given-names>
            <surname>Vohra</surname>
          </string-name>
          , '
          <article-title>The new era of Watson computing'</article-title>
          ,
          <source>IBM Developer Works</source>
          , (February
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>