BibSLEIGH: Bibliography of Software (Language) Engineering
in Generated Hypertext

Vadim Zaytsev
vadim@grammarware.net
Universiteit van Amsterdam, The Netherlands
Raincode, Belgium


Abstract

The body of research contributions is vast and full of papers. Existing
projects help us navigate through it and relate authors to papers and papers
to venues. In this paper we list features missing from those projects and
propose a solution in the form of BibSLEIGH, a work in progress on
facilitated browsing of scientific knowledge objects. Through leveraging
domain focus, by actively employing automated data collection and scraping
tools, and with automated annotating of the corpus, we are able to gain and
provide insights into scientific communities and topics, as well as surface
potential interdisciplinary opportunities.

Copyright © 2015 by the paper's authors. Copying permitted for private and
academic purposes. This volume is published and copyrighted by its editors.
In: A.H. Bagge, T. Mens (eds.): Postproceedings of SATToSE 2015 Seminar on
Advanced Techniques and Tools for Software Evolution, University of Mons,
Belgium, 6-8 July 2015, published at http://ceur-ws.org

1 Motivation

BibSLEIGH started in 2014 as a project to scratch some personal itches and
solve problems that were eating away at the authors' time as well as anyone
else's. These issues can be broadly divided into four categories. In §1.1,
we will discuss in some detail problems with the BibTeX format and the
unnecessary diversity of conventions for equivalent items, which has a
chance of making academic publications look unprofessional and can also lead
to confusion and mistakes. However, enforcing consistency is very time
consuming. In §1.2, the focus will be on domain specificity, which is a
specialisation, and just as any specialisation, can lead to significant
optimisation. We have collected some features in §1.3 that are missing from
the current widespread remedies (we refuse to call them solutions). Each of
the features is missing for a good reason: each requires research,
development and domain focus. This makes them both attractive to invest
effort in and dangerous because most are non-trivial. Finally, in §1.4 the
most obvious point will be raised about information that is interesting in
bibliographical context, being distributed over various unconnected sources
of not that structured data.

1.1 BibTeX non-uniformity across sources

If we attempt to download .bib files for the same publication from various
sources, they will all look different, sometimes drastically so. Many
publishers do not curate their data, rely on automatic text recognition and
only occasionally and serendipitously fix misspellings. BibTeX providers are
often volatile when it comes to conference naming. IEEE and ACM are
obviously inclined to include their affiliation ("the IEEE/ACM international
conference on..."), sometimes in favour of more useful information like the
number of the conference in the series. DBLP has changed their policy on
abbreviating venue names during the period of writing this paper (between
SATToSE in July 2015 and post-proceedings in November).

When information is available, BibTeX providers usually decide to include
it, yet when was the last time someone cared about whether ESOP 1986 took
place in Saarbrücken or in Passau? This information can be leveraged for
other purposes, like tracking country and continent preferences and their
shifting over the years, or investigating the impact of location on the
number, quality and affiliation of papers. However, it is not used for any
of those purposes, yet included in
the bibliographical entry. Nevertheless, many details about in which hotel
near which city on which exact days the conference has taken place, find
their way into BibTeX, even though they were important only for the briefest
of times, and only to immediate attendees of the event.

So, on one hand, there is too much information in the BibTeX entries
supplied by publishers and accumulators like DBLP and Google Scholar:
addresses, dates, timestamps, keywords, sometimes entire abstracts. On the
other hand, however, some of the more useful information is routinely
missed. Frequent omissions concern editor names and hyperlinks that can be
used to access the actual content of the publication. Editor names play
exactly the same role in events and journal special issues as author names
play in individual publications: they help to identify the item but also
establish community links across differently named and formally unrelated
events. Hyperlinks are not always entirely missing, but oftentimes hidden
behind non-standard fields like ee or acmid; not curated, so that a doi
field sometimes starts with http://; and even outdated: most if not all
links like http://www.computer.org/proceedings/csmr/0546/05460161abs.htm
being provided by DBLP have been dead (HTTP Status 404) for several years
since the redesign of the IEEE Computer Society website made them obsolete.

Time lost in reformatting is only a part of this side of the problem.
Inconsistencies lead to an unprofessional look of those papers whose authors
have decided against wasting time on bibliography beautification; and worse
yet, to duplicate entries appearing within the same paper with slight
variations in spelling and data details provided, which makes searching for
the right entry harder and clone detection impossible within a typical
textual editor.

1.2 Lack of domain focus

Academic researchers tend to specialise but never limit themselves overly to
one particular series of events. Yet, when we look at sources of information
we have at our disposal, they come in two sizes only. On one extreme we have
websites devoted to individual conferences. They usually contain a lot of
information that is not immediately required for a decent BibTeX entry, but
can be quite useful in the long run for community recognition: after all,
one is much more likely to submit to a conference chaired by someone whose
name they recognise and whose work they can relate to that of their own.
Organisation committee details and programme committee members provide a
refreshingly large foundation for automation of this process, as
demonstrated by the recent work of Vasilescu et al. [VSM13, VSM+14] that
harvested PC members of several top conferences and cross-checked them with
authors publishing there to measure academic inbreeding. However, the focus
of such a website is limited to one event, or in some lucky cases to a
series of events, and such websites are very prone to disappearing forever
once their organisers retire or change employers.

As the other extreme we have services that make an endeavour to collect
information over a broad choice of conferences on all kinds of topics, and
put them in one place for display and consumption. The most famous ones are
DBLP with its 6500+ venues, Google Scholar which is based on web crawling
and Microsoft Academic Search that contains ranking tables sorting
conferences of one field by the number of citations their articles enjoyed
over the years. Such services try to be as general and comprehensive as
possible, and this is exactly where they fall short. Broad generalisations
are impossible without compromises on metadata models, on information
representation, on clone detection. A website of one particular conference
typically shows very clearly which volume of which journal contains its
post-proceedings special issue, while DBLP habitually gives you all issues
of the conference and all issues of all journals and leaves the search for a
match in your own hands. University libraries fall into the same category:
while limiting their databases to material available physically or through
subscriptions, they do not differentiate among domains, so searching for
"mutation" will likely result in many items unrelated to mutation testing;
and searching for "graph", while more productive, will still yield results
from graph transformation research as well as from general graph theory.

The quest for broad coverage makes the project vulnerable. For instance,
DBLP covers millions of authors and thus has to be extremely careful about
not confusing authors with similar names; however, many researchers,
especially in the pre-Google era, did not always write their names in the
same fashion. This would have been known to domain experts who are familiar
with key authors in their field, but domain knowledge does not scale up.
Similarly, Google Scholar relies on its web crawler, and so it is not
uncommon for it to point you to papers that are no longer available or are
in fact no papers at all, no matter what their authors claim. Microsoft
Academic Search is based on citation information, and as a result of
different people citing the same venue in different ways (e.g., with
"International Conference" or without it), the same venue appears several
times in the ranking, each occurrence positioned much lower than the venue
deserves.


1.3 Missing features

When we like a paper, we often begin investigating its authors to see if
they have contributed to similar lines of research before or after. DBLP
lookup has become a part of a routine check in many cases from research
exploration to job candidate evaluation. However, a graph transformation
researcher that occasionally published a model transformation paper, or a
grammarware engineer masquerading as a metamodel evolution contributor, will
have different styles across other of their papers, and might not be as
fruitful to investigate if your interest is particular and your time budget
is limited. What could have helped here is visualisation beyond textual:
instead of browsing through a multi-page wall of text profile on DBLP, some
of us would have wanted to take a quick look at a diagram depicting
community contribution in a concise and illustrative manner.

Natural language processing techniques have a powerful arsenal: even the
simplest analyses like stemming and lemmatisation can provide great aid in
surfing through the ocean of papers to pick the right ones to read and cite.
It is common knowledge that the names of conferences do not always
completely represent their intentions: having "languages" in the name can
mean one or two of a dozen of entirely different research directions; venues
with "engineering" in their name can get quite science-y and theoretical,
just as a name starting with "trends" does not mean all papers are surveys,
overviews and vision statements. To the best of our knowledge, no existing
bibliographic website currently provides a lot of NLP-based features,
although ACM Digital Library has recently started collaborating with IBM
Watson to pursue that.

Scraping older sources from document scans to websites that fell apart
decades ago and have their ruins exposed through the Wayback Machine, is
usually beyond the goals and capabilities of bibliographic websites. Armed
with domain knowledge and the interest seriously linked to that domain, we
can gather enough effort to complete such endeavours and ask senior and
emeritus colleagues directly about "that one long-forgotten obscure
workshop" that a reputable conference has grown from.

Grouping and clustering of conferences is usually either manual work, or
done through event co-location, or not done at all. The first option is
labour-intensive, error-prone, vulnerable to biases and prejudice. The
second option delivers complications for roaming venues like BX
(deliberately co-locating each year with a different community: ETAPS, STAF,
VLDB, etc) and for diverging venues that stopped co-locating deliberately to
emphasize pursuing a divergent path. The third option is not an option at
all, since even fairly focused researchers will find themselves
contemplating submission to a dozen or two reasonable venues. There is quite
some space for automated clustering.

Topic-driven grouping is not the only kind of classification that would be
sensible for a bibliographic portal: some venues are linked by a
subcommunity of people who strongly contribute to both. For instance, there
are many people who publish regularly both at MoDELS and ICSME/SCAM, even
though they cannot attend both within the same year (they happen
simultaneously). Having linked data about people's contributions, we can
surface such relations, and some RDF frontends to DBLP let you do that with
a couple of medium-size SPARQL queries.

All that being said in §1.1 about the state of BibTeX entries obtainable
from available sources, we still want to have some freedom in formatting:
everyone in computer science research knows what LNCS is; in a paper
submitted to SLE one does not need to explain this abbreviation; editor
names are nice to have but sacrificeable under pressing space constraints,
etc. We want flexible BibTeX formatting: DBLP provides you with some very
limited options (crossref or no crossref); IEEE Xplore and Elsevier as well
(abstract or no abstract); but BibSLEIGH even in its very beginning stage
provides its users with more freedom.

Desktop software for managing bibliographies like Mendeley has tagging
functionality that can help its users to annotate the papers they read into
different categories or add brief descriptions to them. However, there is a
huge gap between doing that and providing a comprehensive annotated
bibliography on the subject: in fact, such contributions are rare and
properly treasured, for it takes a lot of expertise and work to craft them.
Unfortunately, there are many more topics and subtopics than there will ever
be annotated bibliographies. We need some semi-automatic way of providing us
with at least bundles of related papers if we indicate the selection
criteria.

1.4 Distributed information

It was already pointed out above that participating in event organisation
and serving in programme committees can be seen as community binding and is
therefore metadata of interest. Yet, to the best of our knowledge, there is
no project currently dedicated to collecting this kind of information, and
it remains scattered half over the internet and half in the Wayback Machine.

Mathematics Genealogy Project [C+] is a totally disconnected project
dedicated to documenting topics of doctoral dissertations (and occasionally
habilitations) and supervisorship information. It certainly
has a merit of its own, but we believe it can also be coupled with other
kinds of metadata in a sensible way.

Affiliation information very occasionally finds its way into DBLP as well as
into Google Scholar where academics can log in and update it (unfortunately,
some choose to log in and prohibit Google from ever showing information
about them), but there is no easy way of tracking and leveraging it.
However, it is not outrageous to think of research dedicated to tracking
research centres of activities on particular topics over the years.

Finally, citation information: it is available on publishers' websites in
limited form (because they are not big fans of sharing it among themselves)
and on Google Scholar (where it is heavily guarded against any form of
automated scraping). While acknowledging some interest in it, we choose to
avoid this aspect for now, because it is not static by nature: citation
information available today can be totally out of date by tomorrow. However,
there is a lot of potential research here that goes way beyond traditional
bibliometrics: for instance, we can identify canonical sources (which often
will be books, like the Dragon Book [ASU85]) that are used throughout a
large fraction of papers in a specific conference, and find other venues in
a different language that have the tendency to cite translations of this
book.

Additionally, academic articles also contain links to web resources such as
additional documentation, wikis and tool repositories, and such links have a
half life of 4 years on average [Spi03]. The Software Heritage Project was
recently proposed by Roberto Di Cosmo as a project to organise, preserve and
share all academically produced software to provide much desired
availability, traceability and uniformity. Unfortunately the project seems
to be in early stages: its call to action is available on SlideShare [Cos15]
but the project itself is yet unknown to public search engines. It will be
interesting to see if the corpus of BibSLEIGH can be automatically mined for
references to tools and clustered by technological space.

2 BibSLEIGH to the rescue!

BibSLEIGH is a work in progress. Keeping that in mind, we would like to
sketch preliminary requirements and architecture decisions in §2.1, point
out some related work in §2.2 and describe the state of the project as it is
by the time of submission in §2.3. Next, §3 will draft some possible future
directions we might decide to explore.

2.1 Proposed solution

In the centre of BibSLEIGH there is one centralised repository containing
all its data in JSON format: we call it LRJ, short for Lexically Reliable
JSON, because we store all key-value pairs one per line sorted by keys. This
was chosen over a more classic database setup in order to allow individual
traceable edits of each piece of data and at the same time to guarantee user
responsiveness. Data is imported to this central place through any of the
existing importers, which are usually implemented as iterative parsers (to
process the DBLP dump which is around 2 GB) or webscrapers (at this moment
we have those for individual DBLP pages, CEUR and EasyChair). JSON files can
also obviously be added manually. There is also an ad-hoc importer that
creates appropriate JSON entities from a list it reads from a textual file;
this helps to properly add ancient entries.

Once the data is in the repository, it can be further curated, normalised,
improved, enhanced and crosschecked with other sources. Typical maintenance
activities include adding a fresh issue of an already known conference or a
journal issue known to be related to one of the known conferences
(automated: one just needs to run an incremental updater), improving the
name of the proceedings booktitle (semi-automated: changed manually at the
top and automatically propagated downwards), removing non-academic clutter
such as forewords and panel summaries (manually or heuristic-based). As an
example of crosschecking we can talk about adding PC members and organisers:
this information is never found on DBLP, but can be harvested elsewhere and
integrated into the same system.

Once normalisation reaches a point of being a valid input for analysis, we
enrich the data by stemming all titles and tagging them by predefined tags;
following the spirit of the rest of the project, each tag has its own
definition stored in a separate JSON file which can be accessed, inspected
and changed right on GitHub. Stemming provides a fully automated foundation
to naturally link papers to their conceptual neighbours; tags play the same
role for previously known manually defined concepts (so that λ-lifting falls
under the same tag as λ-calculus, but µ-kernel is kept away from μ-calculus,
even though the characters look similar¹). Each tag definition can contain
links to Wikipedia, Wikidata and other places that are displayed on the
tag's webpage. Stems can only rely on automatically derivable information,
so their webpages display neighbours: stems that are commonly used together
with them.

¹ As a side remark, in Unicode these are different symbols: µ-kernel is read
as "microkernel" and therefore uses the micro sign character (U+00B5), while
μ-calculus is read as "mu-calculus" and is thus represented by the Greek
small letter mu (U+03BC). BibSLEIGH is the only website that gets it right
in all places; the readers are welcome to check.


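The text of §2.1 only states that LRJ stores key-value pairs one per line,
sorted by key; the exact layout is not given. As a sketch under that
assumption, the Python function below (dump_lrj is a hypothetical name, and
the tab indentation is a guess) produces such a file. Because every field
occupies exactly one line in a stable order, a line-based diff in a version
control system shows an edit to one field as a change to one line, which is
one plausible reading of "individual traceable edits".

```python
import json


def dump_lrj(entry: dict) -> str:
    """Serialise a flat entry with one key-value pair per line, sorted by key."""
    lines = []
    for key in sorted(entry):
        # Each value is rendered compactly so that it never spans lines.
        value = json.dumps(entry[key], ensure_ascii=False, sort_keys=True)
        lines.append(f"\t{json.dumps(key, ensure_ascii=False)}: {value}")
    return "{\n" + ",\n".join(lines) + "\n}\n"


entry = {
    "year": 2015,
    "title": "BibSLEIGH: Bibliography of Software (Language) Engineering "
             "in Generated Hypertext",
    "author": ["Vadim Zaytsev"],
}
text = dump_lrj(entry)
print(text)

# The result is still ordinary JSON and round-trips unchanged.
assert json.loads(text) == entry
```

The output remains valid JSON, so any standard parser can read it back; only
the writer has to respect the one-pair-per-line convention.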
Whenever the central dataset of BibSLEIGH is needed for inspection, it is
formatted as a collection of almost-static XHTML pages: the only dynamic
part of them is the pretty-printing of BibTeX itself. The look of BibSLEIGH
is less austere than that of DBLP: it makes full use of a palette of colours
and a collection of icons for each covered brand of conferences.

2.2 Related work

In the field of High-Energy Physics there has been a movement concerning
long time preservation of publications, datasets, repositories and relations
between them [GMH+09, GMB10, AAA+12, Sou13], and there is a prospering
project called INSPIRE-HEP at http://inspirehep.net. It covers a different
domain than software (language) engineering, but otherwise partly addresses
the same problems we have pointed out. It does offer additional
functionality such as job listings and does not intend to cover some of our
goals such as visualisations.

ACM Digital Library in recent collaboration with IBM Watson has started to
provide a feature called Concept Insights. For each paper, two things can be
explored: "concepts in this article", which links glossary terms mined from
the full text of the paper to their definitions on Wikipedia, and "recent
authors with related interests", which visualises people who recently
published something that shares these concepts. This functionality is
certainly welcome, even though it remains to be seen how such automated
concept matching can compete with and complement manual research efforts in
taxonomies that try to identify key publications and tie them with key
concepts and relations between them: examples exist for taxonomies of domain
specific aspect languages [FDNT15], reverse engineering [CC90], reverse
architecting [PDP+07], (un)parsing [ZB14], algorithm animated visualisation
[KKM06], security topics [KLS09]. Information retrieval research has also
demonstrated promising results in helping to select features for automated
induction [YC09, LWT08] and refinement [HZL06, Nov07] of taxonomies, which
we have not yet explored.

One step farther from bibliographical repositories there are model
repositories such as FMI (Free Model Initiative) [SHK14], ReMoDD (Repository
for Model Driven Development) [FBM+12], CDO (Connected Data Objects)
[Ecl09], Atlantic Metamodel Zoo [Atl05], Grammar Zoo [Zay15], GenMyModel
[Gen14], that are on a quest of collecting models for various purposes.
There are quite a number of initiatives related specifically to community
2.0 [ABFM09], SL(E)BOK, etc. They usually combine requirements elicitation
with experience reports with calls to arms. One of those very similar to
ours is MetaScience [CCCB14]: unlike BibSLEIGH that mainly aims at
cross-referencing various information sources and using domain knowledge,
MetaScience is focused exclusively on automatically deriving metadata such
as coauthor graphs and pages published per year, and contains impressive
interactive visualisations of it.

Linked data is an initiative that started in the semantic web community and
has gained a lot of attention over the decade of its existence. The idea
revolves around uniform identification of entities by URIs and uniform
encoding of a graph of their relations as a collection of
subject-predicate-object triples. They have standard formats for specifying
the triples (mostly RDF or Turtle), languages for querying them (nowadays
mostly SPARQL) and over half a thousand open datasets containing up to
several billion of such triples [CJ14]. There is research evidence, backed
up by operational prototypes, that points to usefulness of linked data for
many related tasks from connecting community heritage [WNB+15] to mining
software repositories [KFH+12].

2.3 Terminology and current state of BibSLEIGH

By domain we mean a top group of conferences: the front page of BibSLEIGH
displays logos of its domains. Right now they are defined ad-hoc with the
help of some domain knowledge; in the future we will use automated
clustering techniques to form such domains. A brand is a series of events
with continuing numbering and, more often than not, the same name. One event
can belong to several brands: a brand of MoDELS covers the UML series
because they kept the numbering, but events of the brand LDTA and ATEM
belong only to the domain of SLE, but not to the brand SLE. Each proceedings
entity is called an issue: usually it is a regular conference proceedings
issue, but it can also be a journal special issue. Multi-volume proceedings
have one issue per volume because BibTeX entries for such volumes are
different. A tag is a predefined term such as "context-free grammar" or
"visual notation" specified as a set of matching rules covering spelling
variants and synonyms (so a paper with "graphical notation" in the title
will be tagged with "visual notation"). There are several style-defining
tags like "question" (the title ends in a question, like "Can Programming Be
Liberated from the Von Neumann Style?"), "towards" (like "Towards
Incremental Execution of ATL Transformations"), "considered harmful", past,
present and
management and facilitation: DBLP [Ley02], Reengi-                     future, etc.   Interestingly, one of the most popular
neering wiki [vDV02], Researchr [VVvC09], Research                     tags (covering around 7.2% of all papers) is named,


 Domain                        Brands
 Applied computing             SAC
 Components / architecture     WICSA, ECSA, CBSE, QoSA
 Design / automation           ASE, CASE, DAC, DATE
 Documentation / databases     DocEng, DRR, HT, ICDAR, PODS, SIGMoD, TPDL, JCDL, VLDB
 Education                     CSEET, ITiCSE, TFPiE, LAK, SIGITE
 Federated computing           PEPM, PLDI, SAS, STOC
 Formal language theory        AFL, CIAA, DLT, ICALP, LATA
 Formal methods                FM, iFM, SEFM, SFM, VDM
 Functional                    AFP, CEFP, FPCA, ICFP, IFL, ILC, LFP
 Graphs                        ICGT, AGTIVE, GaM, GCM, GG, GRAPHITE, GT-VMT
 High level / logics           ALP, FLOPS, GPCE, LOPSTR, PLILP, PPDP, QAPL
 Human factors                 CHI, CSCW, DHM, DUXU, HCD, HCI, HIMI, IDGD, LCT, OCSC, SCSM, SOFTVIS, VISSOFT
 Information systems           CAiSE, EDOC, ICEIS
 Knowledge engineering         CIKM, ECIR, ICML, ICPR, KDD, KDIR, KEOD, KMIS, KR, LSO, MLDM, RecSys,
                               SEKE, SIGIR, SKY
 Language engineering          SLE, ATEM, LDTA, ASF+SDF, WAGA
 Modelware                     MoDELS, UML, ECMFA, ICMT, AMT, BX
 Object orientation            ECOOP, Onward!, OOPSLA, PLATEAU, SPLASH, TOOLS
 Product lines                 SPLC, PLEASE
 Programming languages         POPL, PADL
 Reliability                   AdaEurope, HILT, SIGAda, TRIAda
 Requirements                  ICRE, RE, REFSQ
 Software engineering          ESEC, FSE, ICSE, GTTSE
 Software evolution            SANER, SCAM, CSMR, WCRE, ICPC, ICSME, PASTE, MSR
 System software               ASPLOS, CC, COCV, CGO, HPCA, HPDC, ISMM, LCTES, OSDI, PLOS, PPoPP, SOSP
 Testing                       CADE, CAV, CSL, FATES, FLoC, ICLP, ICST, ICTSS, IJCAR, ISSTA, LICS, MBT, RTA,
                               SAT, SMT, TAP, TLCA, VMCAI
 Theory of software            ESOP, FASE, FoSSaCS, TACAS, WRLA


                        Table 1: Snapshot of the brands and domains currently in BibSLEIGH.
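The idea of a tag as a set of matching rules over paper titles can be illustrated with a minimal Python sketch. It is not the BibSLEIGH implementation: the rule table `TAG_RULES`, the function `tags_for`, the regular expressions and the tag name "name-prefix" (standing for the pattern of a title that starts with a word followed by a colon or a dash) are all assumptions made for illustration.

```python
import re

# Hypothetical rule table: tag name -> regular expressions tried against the title.
TAG_RULES = {
    "visual notation": [r"\bvisual notations?\b", r"\bgraphical notations?\b"],
    "context-free grammar": [r"\bcontext[- ]free grammars?\b"],
    "question": [r"\?\s*$"],          # the title ends in a question mark
    "towards": [r"^towards?\b"],      # the title starts with "Toward(s)"
    # One word, then a colon or a spaced hyphen / en dash / em dash.
    "name-prefix": [r"^\S+(:| [-\u2013\u2014] )"],
}


def tags_for(title):
    """Return the sorted names of all tags with at least one matching rule."""
    return sorted(
        tag
        for tag, patterns in TAG_RULES.items()
        if any(re.search(p, title, re.IGNORECASE) for p in patterns)
    )


print(tags_for("Can Programming Be Liberated from the Von Neumann Style?"))
print(tags_for("Towards Incremental Execution of ATL Transformations"))
print(tags_for("Lilith: A Personal Computer for the Software Engineer"))
```

Keeping synonyms and spelling variants as separate patterns under one tag name means that a title containing "graphical notation" ends up under "visual notation" without any special casing in the matching code.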


which corresponds to the pattern of starting the title with a word followed by a colon or an em-dash, like "Lilith: A Personal Computer for the Software Engineer", or "Miranda: A Non-Strict Functional language with Polymorphic Types", or "GHC: Operational Semantics, Problems, and Relationships with CP (↓, |)". Currently tags are created based on titles only, because that information is indisputably in the public domain and can be used fairly; there is an ongoing discussion about fair use of abstracts and keywords, but technically they can be harvested as well, so we plan to do so (perhaps not committing the results of such harvest to public repositories to avoid copyright claims). A word is what we call a stem obtained from a classic Snowball stemmer for English. We use our own lexer that tries to split camelcased words properly: not just "CamelCase" to "Camel" and "Case", but also "APIExplorer" to "API" and "Explorer" and "XSDtoMOF" to "XSD", "to" and "MOF" (it also leaves "JavaScript" intact!). Figure 1 shows a typical use of a word link. A role is some facilitating role a person has played in an issue: being an editor, a keynote speaker, a PC member, etc., are roles.

   By the time of submission of this paper, BibSLEIGH covered 166 brands in 26 domains, summarised in Table 1. There are 2726 issues of these brands with 144589 papers in total. There are currently 684 tags with 354720 markings. The total vocabulary is 24359 stems derived from 1183492 words.

   The oldest entry so far is the First International LISP Conference held in 1963 in México, with attendees like John McCarthy and Marvin Minsky. It has mostly historical value, but a nice part was that it was possible to surface most of the papers and reconstruct metadata by googling and scraping. This issue is not present on DBLP.

   Many mistakes in DBLP data (and sometimes in publishers' data) were corrected because they became quite apparent once automated processing began: the longest stems were words erroneously glued together; matching heuristics work reasonably well to equate different spellings of diacritical names, etc. An example of DBLP mismatch can be seen by comparing http://dblp.uni-trier.de/db/conf/edoc/edoc2007.html to http://bibtex.github.io/EDOC-2007.html: except for 10.1109/EDOC.2007.42 and 10.1109/EDOC.2007.44, all DOIs at DBLP are incorrect but fixed at BibSLEIGH. This was spotted automatically by reporting that some entries in this issue had no page information; an attempt to fix it revealed a mismatch between DBLP and IEEE Xplore. DOI information is usually reliable; we know of only one counterexample: http://doi.ieeecomputersociety.org/10.1109/ICSM.1997.624246 resolves successfully, but http://dx.doi.org/10.1109/ICSM.1997.624246 does not.

   BibSLEIGH contains profiles on 150454 people,


Figure 1: A screenshot demonstrating the usefulness of stemming: "abstract domain" is a proper tag whereas "functor" is not, yet we can still jump from this paper to all 17 papers that use that word and then to any of them with just another click.
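The camel-case splitting that precedes stemming can be sketched as follows. This is a minimal illustration under stated assumptions, not the lexer used in BibSLEIGH: the exception list `KEEP_INTACT`, the connector list `CONNECTORS` and the function `split_camel` are hypothetical, and they are chosen only so that the four examples mentioned in the text ("CamelCase", "APIExplorer", "XSDtoMOF", "JavaScript") come out as described. Stemming itself (Snowball for English) is left out.

```python
import re

# Hypothetical list of names that should never be split.
KEEP_INTACT = {"JavaScript"}
# Hypothetical list of lowercase connectors that stand on their own
# between two acronyms, as in "XSDtoMOF".
CONNECTORS = {"to", "of", "for", "in"}


def split_camel(word):
    """Split a camel-cased word into its parts, keeping acronyms together."""
    if word in KEEP_INTACT:
        return [word]
    # Break the word into runs of capitals, lowercase letters and digits;
    # any other character is dropped.
    runs = re.findall(r"[A-Z]+|[a-z]+|[0-9]+", word)
    parts = []
    i = 0
    while i < len(runs):
        run = runs[i]
        nxt = runs[i + 1] if i + 1 < len(runs) else ""
        if run.isupper() and nxt.islower() and nxt not in CONNECTORS:
            # The last capital starts a capitalised word ("E" + "xplorer");
            # any capitals before it form an acronym ("API").
            if len(run) > 1:
                parts.append(run[:-1])
            parts.append(run[-1] + nxt)
            i += 2
        else:
            parts.append(run)
            i += 1
    return parts


for w in ["CamelCase", "APIExplorer", "XSDtoMOF", "JavaScript"]:
    print(w, split_camel(w))
```

Without the connector list, a purely pattern-based splitter cannot tell "APIExplorer" (acronym plus capitalised word) from "XSDtoMOF" (acronym, connector, acronym), which is why some dictionary knowledge is assumed here.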


                          Figure 2: The front page of BibSLEIGH with 26 domains
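The estimate that the 4154 scraped roles amount to around 5% of the total work can be reproduced from the figures quoted in this section. The sketch below only restates that arithmetic; the constant and function names are ours, and the 10 organisers plus 20 PC members per issue are the optimistic assumption stated in the text.

```python
# Figures quoted in the text.
ISSUES = 2726
PAPERS = 144589
ROLES_SCRAPED = 4154
# Optimistic assumption from the text: 10 organisers and 20 PC members per issue.
ROLES_PER_ISSUE = 10 + 20


def role_coverage(scraped, issues, per_issue):
    """Fraction of the estimated total number of roles already scraped."""
    return scraped / (issues * per_issue)


estimated_total = ISSUES * ROLES_PER_ISSUE
coverage = role_coverage(ROLES_SCRAPED, ISSUES, ROLES_PER_ISSUE)

print(f"estimated roles: {estimated_total}")         # 81780
print(f"coverage: {coverage:.1%}")                   # 5.1%
print(f"papers per issue: {PAPERS / ISSUES:.1f}")    # 53.0
```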


Figure 3: Profile example: a grammarware researcher that started at CC and even RTA, to move on to the likes of SCAM and CSMR. Strong community involvement in LDTA, SLE and SANER, even though he has not published at SANER for a while, preferring ICSM(E). Recently started to broaden his interests to contribute to issues in the domains of testing, architecture and automation. Strongly collaborates with one of his colleagues (not inferable from the raw data: ex-supervisor). The profile is incomplete because we do not have complete information on all involved venues yet!

Figure 4: Profile example: a modelware researcher with a strong focus on one domain: started in OOP, moved to enterprise and settled in model-driven domain, which is reflected not only by contributions, but also in his vocabulary. Strong community involvement in modelware venues. Prefers writing solo papers, but also collaborates broadly, with a bias towards one of his colleagues (not inferable that it is an ex-student). The profile is incomplete because we do not have complete information on all involved venues yet!
some of them might erroneously view several namesakes as one person; no noticeable attention was devoted to this issue so far. Some scraping for roles has begun; so far we have 4154 roles, which is almost 10 times the size of the dataset of Vasilescu et al. [VSM13], but still around 5% of total work if we optimistically estimate 10 organisers and 20 PC members on average per issue. Figure 3 and Figure 4 show two examples of person profiles, with corresponding narrations in the captions. Notice how the profile is interpreted without the usual bibliometric remarks about the number of papers!

  Exploring the rest is left as an exercise to the reader:

  • http://bibtex.github.io - web front end
  • http://github.com/slebok/bibsleigh - partially curated JSON data
  • http://github.com/bibtex/bibsleigh - JSON refactorings and visualisations

3    Future directions

What makes BibSLEIGH more than a glorified wrapper for DBLP is harvesting its domain specificity and community specificity. While keeping the automated, semi-automated and heuristic-based transformations as maintenance activities, we can continue ingraining the bibliographic entities and their groups with information relating them to one another, as well as to concepts, methods, frameworks, approaches, toolkits, datasets. Implementing various distance metrics, as well as annotating them manually or automatically with topic information, can aid clustering and linking beyond traditional methods depending on the citation information. We see this as another step towards the construction of a body of knowledge for the domain of software language engineering (SLEBoK).

     Expansion of the BibSLEIGH data set will continue, but not far: most interesting next steps involve strategically adding special issues and role annotations to already imported conferences. We are afraid that overly eager expansion will deprive us of the main advantage of being domain-specific. However, if we could find a way to eventually hide irrelevant parts from sight so that a user can productively focus on a reasonable subset, that could solve the problem and open the door wider for interdisciplinary growth of this project.

     Navigational support at the current stage of development is already quite strong: domains, brands, tags and words let you browse through thousands of papers quite easily to find that dozen that you are interested in. However, we believe this can be improved further through adding annotations, leveraging metadata, proper visualisations, ground-based ranking and clustering, etc.

     At BibSLEIGH's webpage the project is called "facilitated browsing of scientific knowledge". Indeed, providing interactive access to the curated annotated corpus of academic papers on programming language theory, compiler construction, metaprogramming, software evolution and analytics, refactoring and other related topics can serve as an entrance point into the research domain as well as the foundation for some metaresearch activities. Software engineering Master students at the University of Amsterdam have already started using BibSLEIGH actively in their studies.

     It remains to be seen which open problems of software language engineering this project can contribute to solving [BZ15]. SLE, besides being a subdomain of software engineering, is known to be a bridging area of research, where a fair share of activities is devoted to seeking similarities between technologies and technical spaces, and to developing techniques with wide and cross-space applicability. However, even within one space reaching a point of soundly relating concepts can take substantial time and effort: consider laying relations between attribute grammars and affix grammars [Kos91] or between object algebras and attribute grammars [RBO14]. We will try to push BibSLEIGH towards facilitating this, and any help is welcome.

References

[AAA+12]   Z. Akopov, Silvia Amerio, David Asner, Eduard Avetisyan, Olof Bärring, James Beacham, Matthew Bellis, Gregorio Bernardi, Siegfried Bethke, Amber Boehnlein, Travis Brooks, Thomas Browder, Rene Brun, Concetta Cartaro, Marco Cattaneo, Gang Chen, David Corney, Kyle Cranmer, Ray Culbertson, Suenje Dallmeier-Tiessen, Dmitri Denisov, Cristinel Diaconu, Vitaliy Dodonov, Tony Doyle, Gregory P. Dubois-Felsmann, Michael Ernst, Martin Gasthuber, Achim Geiser, Fabiola Gianotti, Paolo Giubellino, Andrey Golutvin, John Gordon, Volker Guelzow, Takanori Hara, Hisaki Hayashii, Andreas Heiss, Frederic Hemmer, Fabio Hernandez, Graham Heyes, André G. Holzner, Peter Igo-Kemenes, Toru Iijima, Joe Incandela, Roger Jones, Yves Kemp, Kerstin Kleese van Dam, Juergen Knobloch, David Kreincik, Kati Lassila-Perini, and Francois Le Diberder. Status Report of the DPHEP Study Group: Towards a Global Effort for Sustainable Data Preservation in High Energy Physics. CoRR, abs/1205.4667, 2012.

[ABFM09]   Denis Avrilionis, Grady Booch, Jean-Marie Favre, and Hausi A. Müller. Software Engineering 2.0 & Research 2.0. In Patrick Martin, Anatol W. Kark, and Darlene A. Stewart, editors, Proceedings of the conference of the Centre for Advanced Studies on Collaborative Research (CASCON), pages 353-355. ACM, 2009.

[ASU85]    A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, 1985.

[Atl05]    AtlanMod. Atlantic Metamodel Zoo, 2005. http://www.emn.fr/z-info/atlanmod/index.php/Zoos.

[BZ15]     Anya Helene Bagge and Vadim Zaytsev. Open and Original Problems in Software Language Engineering 2015 Workshop Report. SIGSOFT Software Engineering Notes, 40:32-37, May 2015.

[C+]       Harry Coonce et al. Mathematics Genealogy Project. http://www.genealogy.ams.org.

[CC90]     Elliot J. Chikofsky and James H. Cross II. Reverse Engineering and Design Recovery:


           A Taxonomy. IEEE Software, 7(1):13-17, 1990.

[CCCB14]   Javier Canovas, Valerio Cosentino, Jordi Cabot, and Robin Boncorps. MetaScience: Analyzing the Research Profile of Authors, Conferences and Journals, 2014. http://som-research.uoc.edu/tools/metaScience.

[CJ14]     Richard Cyganiak and Anja Jentzsch. The Linking Open Data Cloud Diagram, 2014. http://lod-cloud.net.

[Cos15]    Roberto Di Cosmo. Ten Years Analysing Large Code Bases: A Perspective. http://tinyurl.com/z44ydlw, 2015. EvoLille 2015.

[Ecl09]    Eclipse. CDO (Connected Data Objects) Model Repository, 2009. https://eclipse.org/cdo/.

[FBM+12]   Robert B. France, James M. Bieman, Sai Pradeep Mandalaparty, Betty H. C. Cheng, and Adam C. Jensen. Repository for Model Driven Development (ReMoDD). In Martin Glinz, Gail C. Murphy, and Mauro Pezzè, editors, Proceedings of the 34th International Conference on Software Engineering, pages 1471-1472. IEEE, 2012.

[FDNT15]   Johan Fabry, Tom Dinkelaker, Jacques Noyé, and Éric Tanter. A Taxonomy of Domain-Specific Aspect Languages. ACM Computing Surveys, 47(3):40:1-40:44, February 2015.

[Gen14]    GenMyModel, 2014. https://repository.genmymodel.com.

[GMB10]    Anne Gentil-Beccot, Salvatore Mele, and Travis C. Brooks. Citing and Reading Behaviours in High-energy Physics. Scientometrics, 84(2):345-355, 2010.

[GMH+09]   Anne Gentil-Beccot, Salvatore Mele, Annette Holtkamp, Heath B. O'Connell, and Travis C. Brooks. Information resources in high-energy physics: Surveying the present landscape and charting the future course. JASIST, 60(1):150-160, 2009.

[HZL06]    Ruizhang Huang, Zhigang Zhang, and Wai Lam. Refining hierarchical taxonomy structure via semi-supervised learning. In Proceedings of the 29th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 653-654. ACM, 2006.

[KFH+12]   Iman Keivanloo, Christopher Forbes, Aseel Hmood, Mostafa Erfani, Christopher Neal, George Peristerakis, and Juergen Rilling. A Linked Data Platform for Mining Software Repositories. In Proceedings of the Ninth IEEE Working Conference on Mining Software Repositories, pages 32-35. IEEE Computer Society, 2012.

[KKM06]    Ville Karavirta, Ari Korhonen, and Lauri Malmi. Taxonomy of Algorithm Animation Languages. In Proceedings of the ACM Symposium on Software Visualization, pages 77-85. ACM, 2006.

[KLS09]    Justin King, Kiran Lakkaraju, and Adam J. Slagell. A Taxonomy and Adversarial Model for Attacks Against Network Log Anonymization. In Sung Y. Shin and Sascha Ossowski, editors, Proceedings of the 24th Symposium on Applied Computing, pages 1286-1293. ACM, 2009.

[Kos91]    C. H. A. Koster. Affix Grammars for Programming Languages. In H. Alblas and B. Melichar, editors, Attribute Grammars, Applications and Systems, volume 545 of LNCS, pages 358-373. Springer, 1991.

[Ley02]    Michael Ley. The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives. In Alberto H. F. Laender and Arlindo L. Oliveira, editors, Proceedings of the 9th International Symposium on String Processing and Information Retrieval, volume 2476 of LNCS, pages 1-10. Springer, 2002.

[LWT08]    Yuefeng Li, Sheng-Tang Wu, and Xiaohui Tao. Effective Pattern Taxonomy Mining in Text Documents. In Proceedings of the 17th ACM International Conference on Information and Knowledge Management, pages 1509-1510. ACM, 2008.

[Nov07]    Vít Novácek. Imprecise Empirical Ontology Refinement - Application to Taxonomy Acquisition. In Jorge Cardoso, José Cordeiro, and Joaquim Filipe, editors, Proceedings of the Ninth International Conference on Enterprise Information Systems, Volume 2: AIDSS, pages 31-38, 2007.

[PDP+07]   Damien Pollet, Stéphane Ducasse, Loïc Poyet, Ilham Alloui, Sorana Cîmpan, and Hervé Verjus. Towards A Process-Oriented Software Architecture Reconstruction Taxonomy. In René L. Krikhaar, Chris Verhoef, and Giuseppe Antonio Di Lucca, editors, Proceedings of the 11th European Conference on Software Maintenance and Reengineering, pages 137-148. IEEE Computer Society, 2007.

[RBO14]    Tillmann Rendel, Jonathan Immanuel Brachthäuser, and Klaus Ostermann. From Object Algebras to Attribute Grammars. In Proceedings of the 29th International Conference on Object Oriented Programming Systems Languages and Applications, pages 377-395. ACM, 2014.

[SHK14]    Harald Störrle, Regina Hebig, and Alexander Knapp. An Index for Software Engineering Models. In Stefan Sauer and Manuel Wimmer, editors, Poster Session of MoDELS 2014, volume 1258 of CEUR Workshop Proceedings, pages 36-40. CEUR-WS.org, 2014.

[Sou13]    David M. South. The DPHEP Study Group: Data Preservation in High Energy Physics. CoRR, abs/1302.3379, 2013.

[Spi03]    Diomidis Spinellis. The decay and failures of web references. Communications of the ACM, 46(1):71-77, 2003.

[vDV02]    Arie van Deursen and Eelco Visser. The Reengineering Wiki. In Proceedings of the Sixth European Conference on Software Maintenance and Reengineering, pages 217-220. IEEE Computer Society, 2002.

[VSM13]    Bogdan Vasilescu, Alexander Serebrenik, and Tom Mens. A Historical Dataset of Software Engineering Conferences. In Proceedings of the 10th Working Conference on Mining Software Repositories, pages 373-376. IEEE Computer Society, 2013.

[VSM+14]   Bogdan Vasilescu, Alexander Serebrenik, Tom Mens, Mark G. J. van den Brand, and Ekaterina Pek. How Healthy are Software Engineering Conferences? Science of Computer Programming, 89:251-272, 2014.

[VVvC09]   Eelco Visser, Sander Vermolen, and Elmer van Chastelet. Researchr, 2009. http://researchr.org.

[WNB+15]   Gemma Webster, Hai H. Nguyen, David E. Beel, Chris Mellish, Claire D. Wallace, and Jeff Z. Pan. CURIOS: Connecting Community Heritage through Linked Data. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pages 639-648. ACM, 2015.

[YC09]     Hui Yang and Jamie Callan. Feature Selection for Automatic Taxonomy Induction. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 684-685. ACM, 2009.

[Zay15]    Vadim Zaytsev. Grammar Zoo: A Corpus of Experimental Grammarware. Fifth Special issue on Experimental Software and Toolkits of Science of Computer Programming (SCP EST5), 98:28-51, February 2015.

[ZB14]     Vadim Zaytsev and Anya Helene Bagge. Parsing in a Broad Sense. In Jürgen Dingel, Wolfram Schulte, Isidro Ramos, Silvia Abrahão, and Emilio Insfrán, editors, Proceedings of the 17th International Conference on Model Driven Engineering Languages and Systems, volume 8767 of LNCS, pages 50-67. Springer, 2014.

