<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Representing a Computer Science Research Organization on the ACM Computing Classification System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Boris Mirkin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Susana Nascimento</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Moniz Pereira</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department and Centre for Artificial Intelligence (CENTRIA), FCT, Universidade Nova de Lisboa, Caparica</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science, Birkbeck University of London, London</institution>
          ,
          <addr-line>UK WC1E 7HX</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a method, Cluster-Lift, for parsimoniously mapping clusters of lower-level ontology classes onto a subset of higher-level classes in such a way that the latter can be considered a generalized description of the former. Specifically, we consider the problem of visualizing the activities of a Computer Science research organization on the ACM Computing Classification System (ACMC), a three-level taxonomy. It is possible to specify the set of ACMC subjects investigated by the organization's teams and individual members and to map them to the ACMC hierarchy. This visualization, however, usually appears overly detailed, confusing, and difficult to interpret. This is why we propose a two-stage Cluster-Lift procedure. In the first stage, the subjects are clustered according to their similarity, defined in such a way that the greater the number of researchers working on a pair of subjects, the greater the similarity between the pair. In the second stage, each subject cluster is mapped onto the ACMC and lifted within the taxonomy. The lifting involves a formalization of the concept of “head subject”, as well as its “gaps” and “offshoots”, and is done in a parsimonious way by minimizing a weighted sum of the numbers of head subjects, gaps and offshoots. The Cluster-Lift results are easy to see and interpret. A real-world example of the working of our approach is provided.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 ACM Computing Classification System Fits for Representing CS Research Activities</title>
      <p>
        The ACM Computing Classification System (ACMC) is a conceptual three-level
classification of the Computer Science subject area, built to reflect the vast and changing world
of computer-oriented writing. The classification was first published in 1982, thoroughly
revised in 1998, and has been revised since [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The ACMC is used mainly
as a device for annotating and searching for publications in collections such as that of the
ACM portal [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], that is, for library and bibliographic applications. Here we propose
its use for representing research organizations in such a way that the organization’s
research topics are generalized, after clustering, by parsimoniously lifting them along the
ACMC taxonomy.
      </p>
      <p>Potentially, this kind of ACMC representation can be used for the following
purposes:
(i) Overview of the scientific subjects being developed in an organization.
(ii) Positioning the organization over the ACMC.
(iii) Overview of the scientific disciplines being developed in organizations over a country
or other territorial unit, with a quantitative assessment of controversial subjects, for
example, those in which the level of activity is insufficient or far exceeds the level
of results.
(iv) Assessing the scientific issues in which the character of activities in organizations
does not fit well onto the classification; these can potentially be growth points
or other breakthrough developments.
(v) Planning research restructuring and investment.</p>
      <p>2 Cluster-Lift Method
We represent a research organization by clusters of ACMC topics emerging according
to members or teams simultaneously working on them. Each of the clusters is mapped
to the ACMC tree and then lifted in the tree to express its general tendencies. The
clusters are found by analyzing similarities between topics, derived either from automatic
analysis of documents posted on the web by the teams or from an explicit survey of the
members of the department. The latter option is especially convenient in situations in which
the web contents do not properly reflect the developments; a survey tool is then needed.</p>
      <p>Accordingly, this work involves developing the following tools: (1) an e-screen-based
ACMC topic surveying device; (2) a method for deriving similarity between ACMC
topics; (3) a robust method for finding possibly overlapping subject clusters; and (4) a method
for parsimoniously lifting topic clusters over the ACMC. In the following subsections we
describe them in turn.</p>
      <p>
        2.1 E-screen survey tool
An interactive survey tool has been developed to provide two types of functionalities
about the research activities in an organization: i) data collection about the research
results of individual members, described according to the ACMC topics; ii) statistical
analysis and visualization of the data and results of the survey. The period of research
activities comprises the survey year and the previous four years. This is supplied with
simultaneous “focus + context” navigation functionalities as well as quick interaction
with the taxonomy [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The respondent is asked to select up to six topics in the third
layer of the ACMC tree and assign each with a percentage expressing the proportion of
the topic in the total of the respondent’s research activity. Figure 1 shows a screenshot
of the interface for a respondent who chose six ACMC topics during his/her survey
session. Another form, “research results”, allows for a more detailed assessment of the
respondent’s research results in categories such as refereed publications,
funded projects, and theses supervised.
      </p>
      <p>The (third-layer) nodes of the ACMC tree are thus populated by respondents’ weights,
which can be interpreted as membership degrees of the respondent’s activity in the
ACMC topics.
We derive the similarity between ACMC topics i and j as the weighted sum of individual
similarities. The individual similarity is simply the product of the weights fi and fj assigned
by the respondent to the two topics. Clearly, topics that are left outside an individual’s
list have zero similarity with all other topics.</p>
      <p>The individual’s weight is inversely proportional to the number of subjects they
selected in the survey. This smooths out the differences between topic weights imposed
by the selection sizes.</p>
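      <p>For illustration, this similarity definition can be sketched in a few lines of Python (an illustrative reimplementation, not the authors’ code; the topic codes and percentage weights below are made up):</p>

```python
# Sketch of the survey-based topic-to-topic similarity matrix.
# Each response maps a respondent's chosen ACMC topics to the percentage
# weights f_i they assigned; the topic codes here are hypothetical.
import numpy as np

responses = [
    {"F.1": 50.0, "F.4": 30.0, "D.3": 20.0},
    {"F.1": 60.0, "F.3": 40.0},
    {"H.2": 70.0, "H.3": 30.0},
]

topics = sorted({t for r in responses for t in r})
index = {t: k for k, t in enumerate(topics)}
n = len(topics)

A = np.zeros((n, n))
for r in responses:
    w = 1.0 / len(r)          # respondent weight: inversely proportional
    for i, fi in r.items():   # to the number of subjects they selected
        for j, fj in r.items():
            A[index[i], index[j]] += w * fi * fj
```

      <p>By construction, A is a weighted sum of rank-one matrices of the form f fᵀ, which is why it comes out symmetric and positive semidefinite.</p>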
      <p>It is not difficult to see that the resulting topic-to-topic similarity matrix A = (aij)
is positive semidefinite.</p>
      <p>2.3 Finding overlapping clusters
The problem of determining the subject clusters can be cast as the well-known
problem of finding clusters, potentially overlapping, from a similarity matrix A = (aij).</p>
      <p>
        For this we employ the data recovery approach described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for the case of crisp
clustering and in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for the case of fuzzy clustering. Here we consider only the crisp
clustering case.
      </p>
      <p>
        Let us denote by s = (si) a binary membership vector defining a subset of ACMC
topics S = {i : si = 1}. The one-cluster criterion to be optimized by the cluster S to
be found is expressed as:
g(S) = sᵀAs / sᵀs = a(S)|S|,
(1)
where a(S) is the average similarity aij within S and |S| is the number of entities in S.
This criterion has a simple intuitive meaning as a compromise between two
conflicting criteria: (a) maximizing the within-cluster similarity and (b) maximizing the cluster
size. When squared, the criterion expresses the proportion of the data scatter
taken into account by cluster S according to the data recovery model described in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
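      <p>Criterion (1) is straightforward to evaluate directly; the following sketch (illustrative code over a made-up 3×3 similarity matrix) also checks that the two forms sᵀAs/sᵀs and a(S)|S| agree:</p>

```python
# Illustrative evaluation of criterion (1): g(S) = s'As / s's = a(S)|S|.
import numpy as np

def g(A, S):
    """One-cluster criterion for a subset S over similarity matrix A."""
    s = np.zeros(A.shape[0])
    s[list(S)] = 1.0
    return float(s @ A @ s) / float(s @ s)

# Made-up similarity matrix and a candidate cluster S = {0, 1}:
A = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
S = {0, 1}
# a(S): average similarity a_ij within S (over all |S|^2 ordered pairs)
a_S = sum(A[i, j] for i in S for j in S) / len(S) ** 2
assert np.isclose(g(A, S), a_S * len(S))
```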
      <p>
        It should be pointed out that this criterion not only emerges in the data recovery
framework but also fits into other frameworks, such as (i) maximum-density
subgraphs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and (ii) spectral clustering [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>The ADDI-S algorithm starts from S = {i}, where i is any topic in I, and in this
way produces a number of potentially overlapping or even coinciding locally optimal
clusters Si; these are then considered for selection according to their contribution
weights g(Si)² and the extent of their overlap with the other clusters. The intuition
behind this heuristic is that each of the locally optimal clusters is well separated from
the rest; therefore, a small number of them covering a major part of the data set is a
good representation of the similarities.</p>
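      <p>A single run of this local search, which repeatedly flips the membership of the entity j that most increases g(S ± j) until no flip helps, can be sketched as follows (an illustrative reimplementation of the ADDI-S step, not the authors’ code):</p>

```python
# Hedged sketch of one ADDI-S-style run from a seed topic: greedily flip
# the membership of the entity that most increases g(S ± j), and stop at
# a local maximum of the criterion.
import numpy as np

def g(A, S):
    """Criterion (1): total within-S similarity divided by |S|."""
    idx = list(S)
    return A[np.ix_(idx, idx)].sum() / len(idx)

def addi_s(A, seed):
    n = A.shape[0]
    S = {seed}
    while True:
        best_j, best_val = None, g(A, S)
        for j in range(n):
            T = S - {j} if j in S else S | {j}
            if T and g(A, T) > best_val:   # strict improvement only
                best_j, best_val = j, g(A, T)
        if best_j is None:
            return S                        # local maximum reached
        S = S - {best_j} if best_j in S else S | {best_j}

# Toy similarity matrix with one tight cluster {0, 1, 2}:
A = np.array([[0.0, 3.0, 3.0, 0.0],
              [3.0, 0.0, 3.0, 0.0],
              [3.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
cluster = addi_s(A, 0)
```

      <p>Since each flip strictly increases g and there are finitely many subsets, the loop terminates; runs from different seeds give the family of locally optimal clusters mentioned above.</p>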
      <p>The algorithm iteratively finds an entity j by maximizing g(S ± j), where S ± j
stands for S + j if j ∉ S and S − j if j ∈ S. It turns out that, to do this, one only needs to
compare the average similarity between j and S with the threshold π = a(S)/2.
The produced S is rather tight, because each i ∈ S has a high degree of similarity
with S, greater than half of the average similarity within S, and is simultaneously well
separated from the rest, because for each entity j ∉ S its average similarity with S is
less than that threshold.</p>
      <p>2.4 Parsimonious lifting method
To generalize the main content of a subject cluster, we translate it to higher layers of
the taxonomy by lifting it according to the principle: if all or almost all children of a
node belong to the cluster, then the node represents the cluster on a higher level of the
ACMC taxonomy. Such a lift can be done in different ways, leading to different portrayals of
the cluster on the ACMC tree. A cluster can fit quite well into the classification or not (see
Figure 2), depending on how much its topics are dispersed among the tree nodes.</p>
      <p>The best possible fit would be when all topics in the subject cluster fall within a
parental node in such a way that all the siblings are covered and no gap occurs. The
parental tree node, in this case, can be considered as the head subject of the cluster.
A few gaps, that is, children of the head subject that are not included in the cluster,
diminish the fit but still leave the head subject unchanged. A larger misfit occurs
when a cluster is dispersed among two or more head subjects. One more type of misfit
may emerge when almost all cluster topics fall within the same head subject node but
one or two of the topics offshoot to other parts of the classification tree (see Figure 3).</p>
      <p>[Figure 2: a good subject cluster versus a bad subject cluster.]</p>
      <p>Such offshoots, when they persist across subject clusters in different organizations, may point to
tendencies in the development of the science that the classification tree has not yet
taken into account. The total count of head subjects, gaps and offshoots, each type
weighted accordingly, can be used for scoring the extent of the effort needed for lifting
a research grouping over the classification tree, as illustrated in Figure 4. The smaller the
score, the better the fit. When the topics under consideration belong to deeper levels of the
classification, such as the third layer of the ACMC, the scoring may allow some tradeoff
between different possibilities for lifting clusters to head subjects. As illustrated
in Figure 4, the subject cluster of third-layer topics shown by checked boxes can be lifted
either to two head subjects, as in (A), or to just one, the upper category, as in (B), at
the “cost” of three more gap nodes added and one offshoot subtracted. Depending on
the relative weighting of gaps, offshoots and multiple head subjects, either lifting can
minimize the total misfit. In fact, the gaps and offshoots are fully determined by the head
subjects specified in a lift.</p>
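      <p>As a sketch, the misfit score of a candidate lift can be computed by counting the chosen head subjects, the gaps they introduce, and the offshoots they leave uncovered (the toy taxonomy, cluster and weight values below are illustrative assumptions, not values from the paper):</p>

```python
# Hedged sketch of the weighted misfit score of a candidate lift:
# score = wh * #head_subjects + wg * #gaps + wo * #offshoots.
def leaves_under(children, node):
    """All leaf topics in the subtree rooted at `node`."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    out = set()
    for k in kids:
        out |= leaves_under(children, k)
    return out

def lift_score(children, cluster, heads, wh=1.0, wg=0.3, wo=0.8):
    covered = set()
    for h in heads:
        covered |= leaves_under(children, h)
    gaps = covered - cluster        # covered topics missing from the cluster
    offshoots = cluster - covered   # cluster topics outside every head subject
    return wh * len(heads) + wg * len(gaps) + wo * len(offshoots)

# Toy taxonomy: a root with two divisions of three leaves each.
children = {"root": ["D", "H"],
            "D": ["D1", "D2", "D3"], "H": ["H1", "H2", "H3"]}
cluster = {"D1", "D2", "D3", "H2"}
one_head = lift_score(children, cluster, {"D"})      # 1 head + 1 offshoot
root_head = lift_score(children, cluster, {"root"})  # 1 head + 2 gaps
```

      <p>With these particular weights, lifting to the single upper category comes out cheaper than keeping one head subject plus an offshoot, illustrating the tradeoff described above.</p>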
      <p>Altogether, the set of subject clusters with their head subjects, offshoots and gaps
constitutes what can be referred to as a profile of the organization under consideration. Such a
representation can be easily accessed and expressed in aggregate form. It can be further
elaborated by highlighting the represented subjects in which the organization’s members
have been especially successful (e.g., publications in the best journals, awards or other
recognition) or distinguished by another feature (e.g., an industrial product or inclusion in a
teaching program).</p>
    </sec>
    <sec id="sec-2">
      <title>Head subject 2 Gap 3 Offshoot 1 Total 2H+3G+1O</title>
    </sec>
    <sec id="sec-3">
      <title>Head subject 1 Gap 6 Offshoot 0 Total 1H+6G</title>
      <p>Building a parsimonious lifting of a subject cluster can be achieved by recursively
building a parsimonious scenario for each node of the ACMC tree based on
parsimonious scenarios for its children. At each node of the tree, sets of head gain, gap and
offshoot events are to be determined and iteratively raised to the parents under each of
two different assumptions that specify the situation “above the parent” starting, in fact,
from the root.</p>
      <p>One assumption is that the head subject is not gained at the parental node but is
inherited from somewhere higher up, and the second assumption is that it is gained at the node
itself. It is necessary to distinguish these two cases since, clearly, it is only meaningful
to consider the loss of a head subject at a node if it was inherited at that node; similarly,
it is only meaningful to consider the gain of a head if it was not inherited from above.
Consider the parent–children system shown in Figure 5, with each node assigned
sets of offshoot, gap and head gain events under the above two head-subject
inheritance assumptions.</p>
      <p>
        Let us denote the total numbers of events under the inheritance and non-inheritance
assumptions by ei and en, respectively. A lifting result at a given node is defined by
a triplet of sets (H, G, O) representing the tree nodes at which head gain, gap and
offshoot events, respectively, have occurred in the subtree rooted at the node. We use (Hn, Gn,
On) and (Hh, Gh, Oh) to denote lifting results under the non-inheritance and inheritance
assumptions, respectively. The algorithm computes parsimonious scenarios for parental
nodes according to the topology of the tree, proceeding from the leaves to the root in
a manner that is, to some extent, similar to that described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
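      <p>A much-simplified version of this leaves-to-root recursion can be sketched as follows; it keeps, for every node, only the minimal cost under each of the two assumptions, and it ignores head-subject losses (the toy taxonomy, cluster and weights are illustrative assumptions, not the authors’ exact algorithm):</p>

```python
# Hedged sketch of the two-state parsimony recursion: at each node we
# compute the minimal lift cost assuming a head subject is inherited
# from above ("h") or not ("n"). Head-subject losses are ignored here.
def parsimony(children, cluster, node, wh=1.0, wg=0.3, wo=0.8):
    """Return (cost if a head subject is inherited at `node`,
               cost if no head subject is inherited)."""
    kids = children.get(node, [])
    if not kids:  # leaf topic
        if node in cluster:
            # covered for free / must gain a head here or call it an offshoot
            return 0.0, min(wh, wo)
        # a gap under an inherited head subject / nothing to explain
        return wg, 0.0
    hc = [parsimony(children, cluster, k, wh, wg, wo) for k in kids]
    cost_h = sum(c[0] for c in hc)          # head above: children inherit it
    cost_n = min(wh + cost_h,               # gain the head at this node, or
                 sum(c[1] for c in hc))     # leave the children on their own
    return cost_h, cost_n

# Toy taxonomy and cluster; the root never inherits a head from above,
# so the parsimonious score is the non-inheritance cost at the root.
children = {"root": ["D", "H"],
            "D": ["D1", "D2", "D3"], "H": ["H1", "H2", "H3"]}
cluster = {"D1", "D2", "D3", "H2"}
best = parsimony(children, cluster, "root")[1]
```

      <p>Keeping the event sets (H, G, O) alongside the costs, rather than only the totals, recovers the actual head subjects, gaps and offshoots of the parsimonious lift.</p>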
      <p>[Figure 5: a parent node with children Child1, Child2 and Child3; each node carries sets of
head gain, gap and offshoot events, (Hn, Gn, On) under the non-inheritance assumption and
(Hh, Gh, Oh) under the inheritance assumption.]</p>
      <p>
Let us describe how this approach can be implemented by using the data from a survey
conducted at the Department of Computer Science, Faculty of Science &amp; Technology,
New University of Lisboa (DI-FCT-UNL). The survey involved 49 members of the
academic staff of the department.</p>
      <p>For simplicity, we use only data of the second level of the ACMC, with each topic having a code
V.v where V = A, B, ..., K and v = 1, ..., mV, with mV being the number of second-level
topics in division V. Each member of the department supplied the three ACMC topics most relevant to their
current research. These comprise altogether 26 of the 59 topics at the second level of the
ACMC (we omit two second-level subjects, General and Miscellaneous, which occur
in every first-level division, as they do not contribute to the representation). The
similarity between two ACMC subjects, V.v and W.w, was defined as the number of members
of the department that work on both of them.</p>
      <p>With the algorithm ADDI-S applied to the 26×26 similarity matrix, we obtain the
following six clusters (each contributing more than 4% to the data scatter): Cl1
(contribution 27.08%, intensity 2.17), 4 items: D3, F1, F3, F4; Cl2 (contribution 17.34%,
intensity 0.52), 12 items: C2, D1, D2, D3, D4, F3, F4, H2, H3, H5, I2, I6; Cl3
(contribution 5.13%, intensity 1.33), 3 items: C1, C2, C3; Cl4 (contribution 4.42%, intensity
0.36), 9 items: F4, G1, H2, I2, I3, I4, I5, I6, I7; Cl5 (contribution 4.03%, intensity
0.65), 5 items: E1, F2, H2, H3, H4; Cl6 (contribution 4.00%, intensity 0.64), 5 items:
C4, D1, D2, D4, K6. These clusters, lifted in the ACMC, are presented in Figure 6, in
which only those first-level categories that overlap them are shown.</p>
      <p>One can see the following:
– The department covers, with a few gaps and offshoots, six head subjects, shown in
the figure by pentagons filled with different patterns;
– The most contributing cluster, with the head subject F. Theory of Computation,
comprises a very tight group of a few second-level topics;
– The next most contributing cluster has not one but two head subjects, D and H, and
offshoots to every other head subject in the department, which shows that this cluster
is currently the structure underlying the unity of the department;
– Moreover, the two head subjects of this cluster come on top of two other subject
clusters, each pertaining to just one of the head subjects, D. Software or H. Information
Systems. This means that the two-headed cluster signifies a new direction in
Computer Science, combining D and H into a single new direction, which indeed seems
a feature of current developments; this should eventually be reflected in
an update of the ACM classification (probably by raising D.2 Software Engineering
to level 1?);
– There are only three offshoots outside the department’s head subjects: E1. Data
Structures from H. Information Systems, G1. Numerical Analysis from I. Computing
Methodologies, and K6. Management of Computing and Information Systems
from D. Software. All three seem natural and should be reflected in the list of
collateral links between different parts of the classification tree.</p>
      <p>[Figure 6: the six lifted clusters shown over the relevant parts of the ACMC, with first-level
categories C, D, E, F, G, H, I and K and their second-level topics; pattern fills mark the head
subjects of the clusters.]</p>
      <p>
We have shown that ACMC can be used as an ontology structure for representing CS
research activities. In principle, the approach can be extended to other areas of science
or engineering, provided that these areas have been systematized into comprehensive
ontologies or taxonomies. Potentially, this approach could lead to a useful instrument for
the visually convenient, comprehensive representation of developments in any field of
human activity prearranged as a hierarchy of relevant topics.</p>
      <p>Acknowledgments The authors acknowledge all DI-FCT-UNL members that agreed to
participate in the survey. Igor Guerreiro is acknowledged for the development of the e-survey tool
illustrated in Figure 1. This work is supported by the grant PTDC/EIA/69988/2006 from the
Portuguese Foundation for Science &amp; Technology.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. The ACM Computing Classification System, http://www.acm.org/class/1998/ccs98.html,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Spence</surname>
          </string-name>
          , Information Visualization, Addison-Wesley (ACM Press),
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>B.</given-names>
            <surname>Mirkin</surname>
          </string-name>
          ,
          <article-title>Clustering for Data Mining: A Data Recovery Approach</article-title>
          , Chapman &amp; Hall /CRC Press,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>S.</given-names>
            <surname>Nascimento</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mirkin</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Moura-Pires</surname>
          </string-name>
          ,
          <article-title>Modeling Proportional Membership in Fuzzy Clustering</article-title>
          ,
          <source>IEEE Transactions on Fuzzy Systems</source>
          ,
          <volume>11</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>173</fpage>
          -
          <lpage>186</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>B.</given-names>
            <surname>Mirkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Galperin</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Koonin</surname>
          </string-name>
          ,
          <article-title>Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes</article-title>
          ,
          <source>BMC Evolutionary Biology</source>
          <volume>3</volume>
          :
          <fpage>2</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>G.</given-names>
            <surname>Gallo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.D.</given-names>
            <surname>Grigoriadis</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.E.</given-names>
            <surname>Tarjan</surname>
          </string-name>
          ,
          <article-title>A fast parametric maximum flow algorithm and applications</article-title>
          ,
          <source>SIAM Journal on Computing</source>
          ,
          <volume>18</volume>
          , pp.
          <fpage>30</fpage>
          -
          <lpage>55</lpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>J.</given-names>
            <surname>Shi</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Malik</surname>
          </string-name>
          ,
          <article-title>Normalized cuts and image segmentation</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>22</volume>
          (
          <issue>8</issue>
          ), pp.
          <fpage>888</fpage>
          -
          <lpage>905</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>