<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A New Image Analysis Framework for Latin and Italian Language Discrimination</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Darko Brodic</string-name>
          <email>dbrodic@tf.bor.ac.rs</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessia Amelio</string-name>
          <email>aamelio@dimes.unical.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zoran N. Milivojevic</string-name>
          <email>zoran.milivojevic@vtsnis.edu.rs</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Applied Technical Sciences</institution>
          ,
          <addr-line>Aleksandra Medvedeva 20, 18000 Nis</addr-line>
          ,
          <country country="RS">Serbia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>DIMES University of Calabria</institution>
          ,
          <addr-line>Via P. Bucci Cube 44, 87036 Rende (CS)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Belgrade, Technical Faculty in Bor</institution>
          ,
          <addr-line>V.J. 12, 19210 Bor</addr-line>
          ,
          <country country="RS">Serbia</country>
        </aff>
      </contrib-group>
      <fpage>46</fpage>
      <lpage>55</lpage>
      <abstract>
<p>The paper presents a new framework for the discrimination of the Latin and Italian languages. The first phase maps the text in the given language into a uniformly coded text. It is based on the position of each letter of the script in the text line and its height, derived from its energy profile. The second phase extracts run-length texture measures from the coded text, given as a 1-D image, producing a feature vector of 11 values. The obtained feature vectors are used for language discrimination by a clustering algorithm. As a result, the distinction between the two languages is realized with a perfect accuracy of 100% on a complex database of documents in the Latin and Italian languages.</p>
      </abstract>
      <kwd-group>
        <kwd>Clustering</kwd>
        <kwd>Document analysis</kwd>
        <kwd>Image processing</kwd>
        <kwd>Information retrieval</kwd>
        <kwd>Italian language</kwd>
        <kwd>Statistical analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Information retrieval is one of the areas of natural language processing.
It finds objects, usually documents of an unstructured nature (usually text),
that satisfy an information need from within large collections
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Typically, the vector space model is used to assess similarity between
documents. However, cross-language information retrieval is still a
challenge, especially between very similar languages or languages that
evolved one from another.
      </p>
      <p>
        The Latin language was originally spoken in the region around Rome called
Latium. As a consequence of the Roman conquests, Latin quickly spread over
a large part of Italy and beyond. Accordingly, it became the formal language
of the Roman Empire. After its collapse, Latin evolved into the
various Romance languages. However, it was still used for writing. Furthermore,
Latin served as a lingua franca, used for scientific and
political affairs, for more than a thousand years. To this day, ecclesiastical Latin
has remained the formal language of the Roman Catholic Church. As
a consequence, it is the official language of the Vatican. Although Latin
is no longer a living language, it is not a dead one either: it is still partly in use.
Today, Latin is usually taught in order to translate Latin texts
into modern languages. Because of this long tradition and of its influence on
the modern languages, the study of Latin is extremely important for linguistic
research. Italian is the language of the Romance group
that is closest to Latin. It comprises many dialects
from the North to the South of Italy. However, standard Italian is
virtually the only written language. Today, standard Italian is
virtually the only dialect of culture in modern Italy, used as the language
of intercommunication between different parts of Italy. To the very best of the
authors' knowledge, some aspects of the evolution of Latin into modern Italian
have been researched. Still, those studies were purely linguistic in nature
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In contrast, we conducted our research in the direction of reliable automatic
differentiation of these languages in an unsupervised manner.
      </p>
      <p>In this paper, we propose a novel framework for the distinction between
languages that evolved one from another. As an example, we use the Latin and modern
Italian languages. The framework includes the following stages: script coding,
run-length texture analysis and clustering. The main novelty of the framework
is the extension of a state-of-the-art clustering method and its application to
document features for the discrimination of languages evolved one into another.
Because we deal with a discrimination problem, an unsupervised method is appropriate.
The distinction between the two related languages is perfectly realized with an
accuracy of 100%, which outperforms competitor methods.</p>
      <p>The paper is organized in the following manner. Section 2 describes the
proposed framework. Section 3 explains the experiment. Section 4 gives the results
of the experiment and discusses them. Section 5 concludes the paper.</p>
    </sec>
    <sec id="sec-2">
      <title>The Proposed Framework</title>
      <p>
        Our framework for Latin and modern Italian language discrimination is
composed of the following three steps: (i) script coding, (ii) texture analysis, (iii)
clustering. Script coding adopts the approach previously introduced by Brodic
et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In fact, it proved successful in solving the critical task of
closely related language discrimination [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In particular, given a text
document as input, it maps each letter of the document to one of only four codes based
on the corresponding position in the text line, representing the gray-level pixels
of a 1-D image. Then, texture analysis is performed on the produced image in
order to extract run-length texture features. In order to select the feature
representation, three well-known types of texture features, run-length, co-occurrence
and ALBP, have been evaluated on benchmark datasets of the same languages.
Results demonstrated that run-length features obtain the best performance for
language discrimination in this context. These features are discriminated by a
new clustering method in order to detect classes representing documents written
in the two different languages.
Text documents can be divided into text lines. Furthermore, each text line can be
segmented, by considering the energy of the script signs [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], into four virtual
lines [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]: top-line, upper-line, base-line and bottom-line. These lines delimit the
following vertical zones in the text line area [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]: upper zone, middle zone and
lower zone. The letters can be categorized based on their position in the vertical
zones of the line, which represents their energy profile. The short letters (S) are
located in the middle zone only. The ascender letters (A) occupy the middle
and upper zones. The descender letters (D) spread into the middle and
lower zones. The full letters (F) extend over all vertical zones. Consequently,
all letters can be classified as belonging to four different script types [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Fig. 1
depicts the script characteristics according to their position relative to the baseline.
      </p>
      <p>Each script type can be mapped to a different number code. Because there
are only four script types, the mapping uses four number codes {0, 1, 2,
3}. Then, these codes are associated with four different gray levels to create an
image. Fig. 2 illustrates the correspondence between script type number codes
and gray levels.</p>
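This coding step can be sketched as follows. The letter classification below is an illustrative assumption for a Latin-script alphabet, not the exact table used by the authors, who derive it from the energy profile of each script sign:

```python
# Illustrative sketch of the script-coding step: each letter is mapped to one
# of four number codes {0, 1, 2, 3} according to the vertical zones it occupies.
# The sets below are an assumption for illustration only.

SHORT = set("aceimnorsuvwxz")   # middle zone only         -> code 0
ASCENDER = set("bdfhklt")       # middle + upper zones     -> code 1
DESCENDER = set("gpqy")         # middle + lower zones     -> code 2
# full letters (spanning all three zones)                  -> code 3

def code_letter(ch):
    """Return the script-type code of a single letter."""
    ch = ch.lower()
    if ch in SHORT:
        return 0
    if ch in ASCENDER:
        return 1
    if ch in DESCENDER:
        return 2
    return 3  # treat everything else as a full letter in this sketch

def code_text(text):
    """Map the alphabetic characters of a document to a 1-D code sequence."""
    return [code_letter(c) for c in text if c.isalpha()]

print(code_text("lingua"))  # -> [1, 0, 0, 2, 0, 0]
```

Each code is then rendered as one of four gray levels, so the whole document becomes a single row of gray pixels.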
      <p>
        Consequently, each text document is translated into a set of number codes
{0, 1, 2, 3} corresponding to pixels of only four gray levels. This yields a textured
1-D image I, which can be analyzed by adopting texture analysis.
Texture quantifies the intensity variation in the image area [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Hence, it is a
powerful tool for the extraction of important properties like image smoothness,
coarseness and regularity. Accordingly, texture is useful for computing image
statistical measures. Run-length statistical analysis is adopted to retrieve texture
features and to evaluate texture coarseness [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. A run is a set of consecutive pixels
with the same gray-level value in a specific texture direction. Fine textures
are characterized by short runs, while coarse textures include long runs.
      </p>
      <p>Let I be an image of X rows, Y columns and L gray levels. The first step
consists in building the run-length matrix P. It is created by fixing a direction
and then counting how many runs are encountered for each gray level and length
in that direction. Accordingly, a set of consecutive pixels with identical intensity
values identifies a gray-level run. The row number of P is equal to L, i.e. the
number of gray levels, while the column number of P is equal to the maximum
run length R. A single element P(i, j) of the run-length matrix at
position (i, j) represents the number of times a run of gray level i and length
j occurs inside the image I (in our case, a 1-D image).</p>
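For a 1-D coded image the construction of P reduces to scanning the code sequence once. A minimal sketch (not the authors' implementation):

```python
# Build the run-length matrix P for a 1-D coded image with a given number of
# gray levels: P[i][j-1] counts the runs of gray level i and length j.

def run_length_matrix(seq, levels=4):
    # collect the (gray level, length) pairs of all maximal runs
    runs = []
    start = 0
    for k in range(1, len(seq) + 1):
        if k == len(seq) or seq[k] != seq[start]:
            runs.append((seq[start], k - start))
            start = k
    max_len = max(length for _, length in runs)  # maximum run length R
    P = [[0] * max_len for _ in range(levels)]
    for level, length in runs:
        P[level][length - 1] += 1
    return P

P = run_length_matrix([0, 0, 1, 3, 3, 3, 0])
# runs found: (0, 2), (1, 1), (3, 3), (0, 1)
print(P)  # -> [[1, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 1]]
```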
      <p>
        Different texture features can be extracted from the P matrix [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: (i) Short
run emphasis (SRE), (ii) Long run emphasis (LRE), (iii) Gray-level non-uniformity
(GLN), (iv) Run length non-uniformity (RLN), and (v) Run percentage (RP).
The extraction of texture features from P also includes the following two
measures [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]: (i) Low gray-level run emphasis (LGRE) and (ii) High gray-level run
emphasis (HGRE). In Dasarathy et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], four other texture features are
proposed, based on the joint statistical measure of gray level and run length. They
are: (i) Short run low gray-level emphasis (SRLGE), (ii) Short run high
gray-level emphasis (SRHGE), (iii) Long run low gray-level emphasis (LRLGE), and
(iv) Long run high gray-level emphasis (LRHGE).
      </p>
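As an illustration, three of these measures (SRE, LRE and RP) can be computed from P as follows, using the standard run-length definitions; the remaining features follow the same summation pattern:

```python
# Sketch of a few run-length features computed from the matrix P.
# N is the total number of runs, n_p the number of pixels in the image.

def run_length_features(P):
    rows, cols = len(P), len(P[0])
    N = sum(sum(row) for row in P)  # total number of runs
    n_p = sum(P[i][j] * (j + 1) for i in range(rows) for j in range(cols))
    # short run emphasis: weights short runs (length j) by 1 / j^2
    sre = sum(P[i][j] / (j + 1) ** 2 for i in range(rows) for j in range(cols)) / N
    # long run emphasis: weights long runs by j^2
    lre = sum(P[i][j] * (j + 1) ** 2 for i in range(rows) for j in range(cols)) / N
    rp = N / n_p  # run percentage
    return {"SRE": sre, "LRE": lre, "RP": rp}

feats = run_length_features([[1, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 1]])
print(feats["LRE"])  # -> 3.75
```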
      <p>In this way, run-length statistical analysis extracts a total of 11 feature
measures, defining an 11-dimensional feature vector for language representation.</p>
      <p>
        The aforementioned run-length feature vectors, each representing a document in
the Latin or modern Italian language, are subjected to unsupervised classification
by a clustering technique. It is adopted for discriminating between documents
written in the Latin language and documents written in the modern Italian language.
In order to find the classes in the data, we adopt the Genetic Algorithms Image
Clustering for Document Analysis algorithm (GA-ICDA), previously introduced
by Brodic et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], modified to be suitable for languages evolved one into
another. We call the modified version of this algorithm Genetic Algorithms Image
Clustering for Document Analysis-Plus (GA-ICDA+). Next, we recall the main
concepts underlying GA-ICDA and propose the modifications for GA-ICDA+.
      </p>
      <p>
        GA-ICDA is a bottom-up clustering method representing the set of
documents written in different languages or scripts as a weighted graph G = (V, E, W).
Each node v_i ∈ V is a document and each link e_ij ∈ E connects two nodes v_i
and v_j to each other. A weight w_ij ∈ W associated with the link e_ij represents the
similarity between the nodes v_i and v_j. For each node v_i, only a subset of the other
nodes V \ {v_i} in G is considered. This set is called the h-nearest neighborhood of v_i
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It represents the set of nodes whose corresponding documents are the most
similar to the document associated with v_i. The similarity between two nodes v_i and
v_j is calculated as:
w_ij = exp(-d(i, j)^2 / a^2),    (1)
where a is a scale parameter and d(i, j) is the distance between the document
feature vectors of v_i and v_j. The L1 norm is adopted as the distance, while h is a
parameter influencing the size of the neighborhood [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The h-nearest neighbor
nodes of v_i are denoted as nn_h(v_i) = {nn_h(v_i)(1), ..., nn_h(v_i)(k)}, where k is the number
of h-nearest neighbors. Then, a mapping f is defined between each node in V
and an integer label, f : V → {1, 2, ..., n}, n = |V|, realizing a node ordering.
Finally, the difference is calculated between the label f(v_i) corresponding to the node
and the labels corresponding to the nodes in nn_h(v_i), |f(v_i) - f(nn_h(v_i)(j))|,
j = 1...k. Each node v_i in G is connected only to the nodes in nn_h(v_i) whose label
difference is less than a given threshold value T. It implies that only similar and
"spatially" close nodes are connected to each other in G. The obtained node
connections, weighted by the similarity values, are represented in terms of the
adjacency matrix M of G. Then, G is subjected to a genetic method for finding
the connected components representing the clusters of documents. After that,
to correct for local optima, a merging procedure is applied to the found
clusters. In particular, pairs of clusters having minimum mutual distance are
selected and repeatedly merged, until a fixed cluster number is reached. The
distance is computed as the L1 norm between the two farthest document feature
vectors, one from each cluster.
      </p>
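A minimal sketch of this graph construction, using the node index order as the ordering f and illustrative parameter values (not those tuned in the experiments):

```python
# Sketch of the GA-ICDA adjacency-matrix construction: similarity
# w_ij = exp(-d(i, j)^2 / a^2) with an L1 distance, keeping for each node
# only its h nearest neighbors whose label difference is below threshold T.

import math

def l1(u, v):
    return sum(abs(x - y) for x, y in zip(u, v))

def build_adjacency(vectors, h, T, a):
    n = len(vectors)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # the h nearest neighbors of node i by L1 distance
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: l1(vectors[i], vectors[j]))[:h]
        for j in neighbors:
            if abs(i - j) < T:  # label difference under the index ordering
                M[i][j] = math.exp(-l1(vectors[i], vectors[j]) ** 2 / a ** 2)
    return M

M = build_adjacency([[0.0], [1.0], [10.0]], h=1, T=2, a=1.0)
```

The genetic component-detection and bottom-up merging stages then operate on M; they are omitted here.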
      <p>The first modification introduced in GA-ICDA+ concerns the similarity
computation among the graph nodes. The inner complex and variegated structure of an
evolved language, like modern Italian, naturally determines higher distance
values between the document feature vectors. Such a phenomenon may
cause an anomaly in the similarity computation in Eq. (1). Consider v_i as a node
in G with associated document feature vector d_i. If the distance d(i, j) between
the vectors d_i and d_j of the nodes v_i and v_j is particularly high, because of the
power of 2, the numerator d(i, j)^2 of the exponent d(i, j)^2 / a^2 is very high, determining a
similarity value of zero. If this occurs often for different pairs of
document feature vectors, the adjacency matrix M corresponding to the similarity
matrix will be unjustifiably sparse. In order to overcome this problem, the
exponent of d(i, j) in Eq. (1), which is currently 2, is substituted by a parameter α
to obtain a more flexible and smoothed characterization of the similarity.
Consequently, w_ij in Eq. (1) becomes:
w_ij = exp(-d(i, j)^α / a^2).    (2)
The second modification introduced concerns the graph construction. Specifically,
consider the second step of the procedure where, for each node v_i, only the h-nearest
neighbors are maintained which are "spatially" close to v_i, given a node
ordering f. Clearly, this determines a reduction in the number of neighbors, and
consequently in the number of outgoing links, for each node v_i. In most cases this yields
a better characterization of the graph connected components. When
the document graph is particularly complex, as in this task of capturing
differences between languages evolved one into another, a low value of the threshold
T is necessary for determining good components. However, it causes the
presence of isolated nodes, for which all the nearest neighbors are removed by the
threshold T. In GA-ICDA this situation is not considered, because good
components are obtained even when the T value is higher. Here we relax this constraint
by managing the presence of isolated nodes. They are "singleton" nodes for the
genetic procedure, which is not able to add them to any connected
component, because of the absence of node neighbors. At the end of the procedure,
they are considered as "singleton" clusters and automatically managed by
the final bottom-up strategy.</p>
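The smoothing effect of the first modification can be illustrated numerically. This sketch assumes, as discussed above, that only the exponent of d(i, j) changes from 2 to a parameter alpha; the values are illustrative:

```python
# Compare the similarity of Eq. (1) with the smoothed version of Eq. (2):
# with alpha < 2, a large distance no longer drives the similarity to
# (numerically) zero as fast.

import math

def similarity(d, a, alpha=2.0):
    """w_ij = exp(-d^alpha / a^2); alpha = 2 recovers Eq. (1)."""
    return math.exp(-d ** alpha / a ** 2)

d, a = 6.0, 2.0
print(similarity(d, a))             # Eq. (1): exp(-9), about 1.2e-4
print(similarity(d, a, alpha=1.5))  # Eq. (2): noticeably larger, about 2.5e-2
```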
      <p>
        Fig. 3 shows an example of GA-ICDA+ execution. From left to right, for
each node in the distance matrix (6 nodes), the algorithm finds the 2-nearest
neighbors (in grey). Then, for each node, the algorithm finds the neighbors whose
label difference with respect to the label of that node is smaller than T = 3 (in
dotted grey), making node 2 isolated. The adjacency matrix is obtained by
computing the similarity values from the distance values by adopting Eq. (2)
(α = 1.5). c1, c2 and c3 are the clusters detected by the genetic algorithm. c1'
and c2' are the final clusters detected by the bottom-up merging procedure,
with fixed cluster number nc = 2. They are obtained by computing the
distances of cluster pairs and merging the singleton cluster c2 with c3, which exhibits
the minimum distance value of 0.8.
      </p>
      <p>
        As an example of framework usage, an experiment is performed on a complex
custom-oriented database, publicly available at [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], composed of a set of 90
documents in the Latin and modern Italian languages. Specifically, 50 out of 90
documents are given in the Latin language and 40 out of 90 documents are given in
the modern Italian language. Documents count from 400 to 6000 characters each.
40 out of 50 Latin documents are extracted from Cicero's works (106 BC - 43
BC), in particular from De Inventione, De Oratore, De Optimo Genere
Oratorum, De Natura Deorum and De Officiis. 10 out of 50 Latin documents are
extracted from Virgil's Aeneid (70 BC - 19 BC). The documents from the two
different authors belong to different historical periods and the writing styles of
the two authors also differ. Consequently, recognition of the common language
is difficult. Modern Italian documents are extracted from two well-known Italian
newspapers, Il Sole 24 Ore and La Repubblica, and from websites. In particular,
20 out of 40 modern Italian documents are excerpts from newspapers and 20 out
of 40 modern Italian documents are excerpts from the web. The writing style of
the newspaper excerpts is different, because it is more "technical" than the writing
style of the excerpts from the web, which is more "linear".
      </p>
    </sec>
    <sec id="sec-3">
      <title>Results and Discussion</title>
      <p>
        Next, we demonstrate the efficacy of our framework, as a combination of feature
representation and clustering method, in correctly discriminating between Latin
and modern Italian documents. Specifically, we show in Table 1 the clustering
results obtained by our framework (named GA-ICDA+) on the custom-oriented
document database and compare them with the clustering results obtained
by five other algorithms on the same database. Three of them are clustering
methods, Hierarchical Clustering, K-Medians and Self-Organizing Map (SOM), which
are different well-known strategies for text document categorization [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
In particular, we chose to adopt K-Medians instead of K-Means because the former
uses the same L1 norm as our method GA-ICDA+ and because it is more
robust to outliers than K-Means. The other two algorithms are the GA-IC
framework for image database clustering [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and the GA-ICDA framework [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which
is the extension of GA-IC for document database clustering, without the
modifications introduced for GA-ICDA+. All the algorithms, K-Medians, Hierarchical
Clustering, SOM, GA-IC and GA-ICDA, adopt the same run-length feature vector
representation used by GA-ICDA+.
      </p>
      <p>
        Clustering results are shown in terms of five methods for performance
evaluation: the precision, recall and f-measure indexes [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], purity, entropy,
Normalized Mutual Information (NMI) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ],[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and the Adjusted Rand Index (ARI)
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Precision, recall and f-measure are reported separately for each language
class (Latin and modern Italian) for each algorithm. For the
other performance measures, purity, entropy, NMI and ARI, a single overall value
is reported for each algorithm. Purity, entropy, NMI and ARI are well-known
performance measures for clustering evaluation. By contrast, the
computation of precision, recall and f-measure requires that the correspondence between
each cluster detected by the algorithm and the true language class is known.
Consequently, we associate each cluster with the true language class whose
corresponding documents form the majority in that cluster. The number of
clusters nc found by the algorithms is also reported.
      </p>
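The majority-class association described above is exactly what the purity measure encodes. A minimal sketch of its computation:

```python
# Purity: associate each cluster with its majority true class, then measure
# the fraction of documents that fall in the majority class of their cluster.

from collections import Counter

def purity(clusters, labels):
    """clusters, labels: parallel lists of cluster ids and true classes."""
    total = len(labels)
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(y)
    majority = sum(Counter(ys).most_common(1)[0][1]
                   for ys in by_cluster.values())
    return majority / total

# 2 clusters over 6 documents, with one Italian document misplaced:
print(purity([0, 0, 0, 1, 1, 0],
             ["lat", "lat", "lat", "ita", "ita", "ita"]))  # -> 5/6 ~ 0.833
```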
      <p>A trial-and-error procedure has been adopted on benchmark documents,
different from the documents in the considered database, for tuning the
algorithms' parameters. The parameter values providing the best possible results on
the benchmark documents have been adopted for clustering the custom-oriented
document database. Consequently, in the K-Medians algorithm, the number of
clusters is fixed to 2. In the SOM algorithm, the dimension of the neuron layer is 1 × 2. The
number of training steps for initial covering of the input space is 100 and the
initial neighborhood size is 3. The distance between two neurons is computed
as the number of steps separating them. Hierarchical clustering adopts a
bottom-up agglomerative strategy using the L1 norm for distance computation.
Average linkage is used for cluster distance evaluation. The obtained dendrogram
is "horizontally" cut to obtain a number of clusters equal to 2. The h
value of the neighborhood is fixed to 33 for GA-IC and GA-ICDA and to 43
for GA-ICDA+, and the T threshold value to 9 for GA-ICDA and to 7 for
GA-ICDA+. The α parameter for the similarity computation in GA-ICDA+ is fixed
to 1.5.</p>
      <p>The algorithms have been implemented in MATLAB R2012a. Experiments
have been run on a desktop computer with a quad-core 2.3 GHz CPU, 4 GB of RAM and
Windows 7. Each algorithm has been executed 100 times, and the average values
of each performance measure together with the standard deviation values (in
parentheses) have been reported. Our framework takes 55 s for each execution
on the database of 90 documents.</p>
      <p>We observe that our framework, which is the combination of run-length
features and the GA-ICDA+ clustering method, performs successfully, outperforming all
the other clustering methods (see Table 1). In fact, GA-ICDA+ obtains a
perfect distinction between Latin and modern Italian documents, with a number
of clusters equal to 2, precision, recall and f-measure values of 1.00 for both
the Latin and modern Italian language classes, purity, NMI and ARI values of 1.00
and an entropy value of 0.00. Furthermore, the standard deviation values are always
zero, demonstrating the stability of the result. It is interesting to observe that the
GA-IC algorithm is not able to discriminate the languages well. Although the
number of found clusters is exactly 2, the f-measure values are 0.83 for Latin
and 0.78 for modern Italian, the purity value is 0.81, the NMI value is quite low
at 0.30, the ARI value is 0.38 and the entropy value is high
at 0.62. This means that the found clusters contain mixed
Latin and modern Italian documents. The GA-ICDA procedure performs
considerably better than GA-IC for this task. In fact, it exhibits f-measure values
of 0.95 for Latin and 0.94 for modern Italian, a purity value of 0.94, an entropy
value of 0.22 and NMI and ARI values of 0.74 and 0.79 respectively. This indicates
that GA-ICDA is more apt to deal with document data than GA-IC. However,
the best result is given by GA-ICDA+, demonstrating the efficacy of the
introduced modifications. As for the other algorithms, we can observe that a pure
bottom-up strategy like hierarchical clustering is not able to outperform the
GA-IC, GA-ICDA and GA-ICDA+ evolutionary strategies. In fact, it reaches
f-measure values of 0.72 and 0.60 for Latin and modern Italian respectively, a
purity value of 0.57, an entropy value of 0.44 and very low NMI and ARI values
of 0.02 and 0.006 respectively. It is also worth noting that the results of
GA-ICDA, adopting together an evolutionary method and a bottom-up refinement
procedure, are better than both the pure evolutionary procedure of GA-IC and
the pure bottom-up strategy of hierarchical clustering. This demonstrates the
efficacy of combining the evolutionary and bottom-up methods in
document clustering. The SOM results are very similar to the results obtained
by GA-IC. In fact, the f-measure values are equal to 0.83 for Latin and 0.78 for
modern Italian, the purity and entropy values are 0.81 and 0.62 respectively, and the
NMI and ARI values are quite low at 0.30 and 0.38 respectively. K-Medians
also obtains results similar to those of SOM and GA-IC, with
f-measure values of 0.83 for Latin and 0.78 for modern Italian, purity, NMI and
ARI values of 0.81, 0.30 and 0.38 respectively, and a very high entropy value of
0.62. This indicates that GA-IC, SOM and K-Medians are trapped in a recurrent
solution consisting of mixed clusters of documents in the Latin and modern Italian
languages.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>The paper introduced a new framework for the discrimination between
documents written in the Latin and modern Italian languages. It is characterized by the
position of each script letter relative to the baseline, derived from its energy profile, for
mapping into a uniformly coded text. The statistical analysis of the coded text,
represented as an image, is performed by the run-length matrix calculation for
texture feature extraction. The obtained feature vectors revealed a satisfactory
dissimilarity between the documents in the different languages. Such dissimilarity is the
basis for successful document clustering by GA-ICDA+, an extension of a
state-of-the-art classification tool. Experimental results demonstrated the
superiority of the new framework with respect to the other clustering methods. Future
work will extend the experiment to larger databases and multiple types of
language feature representations.</p>
      <p>Acknowledgments. This work was partially supported by the Grant of the
Ministry of Education, Science and Technological Development of the
Republic of Serbia, as part of the project TR33037. The authors are fully grateful to Ms.
Zagorka Brodic, professor of the French and Serbo-Croatian languages, for the
helpful discussions about the Latin and Italian languages.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amelio</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Pizzuti</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A new evolutionary-based clustering framework for image databases</article-title>
          .
          <source>In: Image and Signal Processing, June 30-July 2, Cherbourg, Normandy, France</source>
          , LNCS
          <volume>8509</volume>
          :
          <fpage>322</fpage>
          -
          <lpage>331</lpage>
          , Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Andrews</surname>
            ,
            <given-names>N. O.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Fox</surname>
            ,
            <given-names>E. A.</given-names>
          </string-name>
          :
          <article-title>Recent Developments in Document Clustering</article-title>
          .
          <source>Technical report</source>
          , Computer Science, Virginia Tech.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Brodic</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amelio</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milivojevic</surname>
            ,
            <given-names>Z. N.</given-names>
          </string-name>
          :
          <article-title>Characterization and Distinction Between Closely Related South Slavic Languages on the Example of Serbian and Croatian</article-title>
          .
          <source>In: Comp. Anal. of Images and Patterns, 2-4 September, Valletta, Malta</source>
          , LNCS
          <volume>9256</volume>
          :
          <fpage>654</fpage>
          -
          <lpage>666</lpage>
          , Part I, Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Brodic</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milivojevic</surname>
            ,
            <given-names>Z.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maluckov</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          :
          <article-title>Recognition of the script in serbian documents using frequency occurrence and co-occurrence analysis</article-title>
          .
          <source>The Scientific World Journal</source>
          ,
          <volume>896328</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Calabrese</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>On the Evolution of the short high vowel of Latin into Romance</article-title>
          . In:
          <string-name>
            <given-names>A.</given-names>
            <surname>Perez-Leroux</surname>
          </string-name>
          &amp; Y. Roberge (eds.)
          <article-title>Romance Linguistics</article-title>
          .
          <source>Theory and Acquisition</source>
          . Amsterdam, John Benjamins,
          <fpage>63</fpage>
          -
          <lpage>94</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sehgal</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greenleaf</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          :
          <article-title>Use of gray value distribution of run lengths for texture analysis</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <volume>11</volume>
          (
          <issue>6</issue>
          ):
          <fpage>415</fpage>
          -
          <lpage>419</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Dasarathy</surname>
            ,
            <given-names>B.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holder</surname>
            ,
            <given-names>E.B.</given-names>
          </string-name>
          :
          <article-title>Image characterizations based on joint gray-level run-length distributions</article-title>
          .
          <source>Pattern Recognition Letters</source>
          ,
          <volume>12</volume>
          (
          <issue>8</issue>
          ):
          <fpage>497</fpage>
          -
          <lpage>502</lpage>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Galloway</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          :
          <article-title>Texture analysis using gray level run lengths</article-title>
          .
          <source>Computer, Graphics and Image Processing</source>
          ,
          <volume>4</volume>
          (
          <issue>2</issue>
          ):
          <fpage>172</fpage>
          -
          <lpage>179</lpage>
          ,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>G.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sivaswamy</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A generalised framework for script identification</article-title>
          .
          <source>IJDAR</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ):
          <fpage>55</fpage>
          -
          <lpage>68</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. https://sites.google.com/site/documentanalysis2015/latin-italian-database.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schutze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <source>Introduction to Information Retrieval</source>
          . Cambridge University Press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Powers</surname>
            ,
            <given-names>D. M. W.</given-names>
          </string-name>
          :
          <article-title>Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness &amp; Correlation</article-title>
          .
          <source>Journal of Machine Learning Technologies</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>37</fpage>
          -
          <lpage>63</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Saarikoski</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laurikkala</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Jarvelin,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Juhola</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          :
          <article-title>Self-Organising Maps in Document Classification: A Comparison with Six Machine Learning Methods</article-title>
          .
          <source>In: 10th Int. Conf., ICANNGA</source>
          ,
          14-16
          April, Ljubljana, Slovenia,
          Part I, LNCS
          <volume>6593</volume>
          :
          <fpage>260</fpage>
          -
          <lpage>269</lpage>
          , Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Embrechts</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>On the Use of the Adjusted Rand Index as a Metric for Evaluating Supervised Classi cation</article-title>
          .
          <source>In: 19th International Conference on Artificial Neural Networks: Part II</source>
          ,
          14-17
          September, Limassol, Cyprus, Springer-Verlag, Berlin, Heidelberg,
          <fpage>175</fpage>
          -
          <lpage>184</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Steinbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karypis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>A comparison of document clustering techniques</article-title>
          .
          <source>In: KDD Workshop on Text Mining</source>
          ,
          20-23
          August, Boston, MA, USA,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Texture information in run-length matrices</article-title>
          .
          <source>IEEE Trans. Image Proc.</source>
          ,
          <volume>7</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1602</fpage>
          -
          <lpage>1609</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>De Vries</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geva</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Trotman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Document clustering evaluation: Divergence from a random baseline</article-title>
          .
          <source>CoRR, abs/1208.5654</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Yoo</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>A comprehensive comparison study of document clustering for a biomedical digital library medline</article-title>
          .
          <source>In: 6th ACM/IEEE-CS Joint Conference on Digital Libraries</source>
          ,
          11-15
          June, Chapel Hill, NC, USA,
          <fpage>220</fpage>
          -
          <lpage>229</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karypis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fayyad</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Hierarchical Clustering Algorithms for Document Datasets</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>10</volume>
          (
          <issue>2</issue>
          ):
          <fpage>141</fpage>
          -
          <lpage>168</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Zramdini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ingold</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Optical font recognition using typographical features</article-title>
          .
          <source>IEEE Trans. Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>20</volume>
          (
          <issue>8</issue>
          ):
          <fpage>877</fpage>
          -
          <lpage>882</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>