BibSLEIGH: Bibliography of Software (Language) Engineering in Generated Hypertext

Vadim Zaytsev
vadim@grammarware.net
Universiteit van Amsterdam, The Netherlands
Raincode, Belgium

Copyright © 2015 by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
In: A.H. Bagge, T. Mens (eds.): Postproceedings of SATToSE 2015 Seminar on Advanced Techniques and Tools for Software Evolution, University of Mons, Belgium, 6-8 July 2015, published at http://ceur-ws.org

Abstract

The body of research contributions is vast and full of papers. Existing projects help us navigate through it and relate authors to papers and papers to venues. In this paper we list features missing from those projects and propose a solution in the form of BibSLEIGH, a work in progress on facilitated browsing of scientific knowledge objects. Through leveraging domain focus, by actively employing automated data collection and scraping tools, and with automated annotating of the corpus, we are able to gain and provide insights into scientific communities and topics, as well as surface potential interdisciplinary opportunities.

1 Motivation

BibSLEIGH started in 2014 as a project to scratch some personal itches and solve problems that were eating away at the authors' time as well as anyone else's. These issues can be broadly divided into four categories. In § 1.1, we will discuss in some detail problems with the BibTeX format and the unnecessary diversity of conventions for equivalent items, which has a chance of making academic publications look unprofessional and can also lead to confusion and mistakes. However, enforcing consistency is very time consuming. In § 1.2, the focus will be on domain specificity, which is a specialisation, and just as any specialisation, can lead to significant optimisation. We have collected some features in § 1.3 that are missing from the current widespread remedies (we refuse to call them solutions). Each of the features is missing for a good reason: each requires research, development and domain focus. This makes them both attractive to invest effort in and dangerous because most are non-trivial. Finally, in § 1.4 the most obvious point will be raised: information that is interesting in a bibliographical context is distributed over various unconnected sources of loosely structured data.

1.1 BibTeX non-uniformity across sources

If we attempt to download .bib files for the same publication from various sources, they will all look different, sometimes drastically so. Many publishers do not curate their data, rely on automatic text recognition and only occasionally and serendipitously fix misspellings. BibTeX providers are often volatile when it comes to conference naming. IEEE and ACM are obviously inclined to include their affiliation ("the IEEE/ACM international conference on..."), sometimes in favour of more useful information like the number of the conference in the series. DBLP has changed their policy on abbreviating venue names during the period of writing this paper (between SATToSE in July 2015 and post-proceedings in November).

When information is available, BibTeX providers usually decide to include it, yet when was the last time someone cared about whether ESOP 1986 took place in Saarbrücken or in Passau? This information can be leveraged for other purposes, like tracking country and continent preferences and their shifting over the years, or investigating the impact of location on the number, quality and affiliation of papers. However, it is not used for any of those purposes, yet it is included in the bibliographical entry. Nevertheless, many details about in which hotel near which city on which exact days the conference has taken place find their way into BibTeX, even though they were important only for the briefest of times, and only to immediate attendees of the event.

So, on one hand, there is too much information in the BibTeX entries supplied by publishers and accumulators like DBLP and Google Scholar: addresses, dates, timestamps, keywords, sometimes entire abstracts. On the other hand, however, some more useful information is routinely missed. Frequent omissions concern editor names and hyperlinks that can be used to access the actual content of the publication. Editor names play exactly the same role in events and journal special issues as author names play in individual publications: they help to identify the item but also establish community links across differently named and formally unrelated events. Hyperlinks are not always entirely missing, but are often hidden behind non-standard fields like ee or acmid; uncurated, so that a doi field sometimes starts with http://; or even outdated: most if not all links like http://www.computer.org/proceedings/csmr/0546/05460161abs.htm provided by DBLP have been dead (HTTP Status 404) for several years, since the redesign of the IEEE Computer Society website made them obsolete.

Time lost in reformatting is only a part of this side of the problem. Inconsistencies lead to an unprofessional look of those papers whose authors have decided against wasting time on bibliography beautification; and worse yet, to duplicate entries appearing within the same paper with slight variations in spelling and in the data details provided, which makes searching for the right entry harder and clone detection impossible within a typical textual editor.

1.2 Lack of domain focus

Academic researchers tend to specialise but never limit themselves overly to one particular series of events. Yet, when we look at sources of information we have at our disposal, they come in two sizes only. On one extreme we have websites devoted to individual conferences. They usually contain a lot of information that is not immediately required for a decent BibTeX entry, but can be quite useful in the long run for community recognition: after all, one is much more likely to submit to a conference chaired by someone whose name they recognise and whose work they can relate to that of their own. Organisation committee details and programme committee members provide a refreshingly large foundation for automation of this process, as demonstrated by the recent work of Vasilescu et al. [VSM13, VSM+14] that harvested PC members of several top conferences and cross-checked them with authors publishing there to measure academic inbreeding. However, the focus of such a website is limited to one event, or in some lucky cases to a series of events, and such websites are very prone to disappearing forever once their organisers retire or change employers.

On the other extreme we have services that make an endeavour to collect information over a broad choice of conferences on all kinds of topics, and put it in one place for display and consumption. The most famous ones are DBLP with its 6500+ venues, Google Scholar which is based on web crawling, and Microsoft Academic Search that contains ranking tables sorting conferences of one field by the number of citations their articles enjoyed over the years. Such services try to be as general and comprehensive as possible, and this is exactly where they fall short. Broad generalisations are impossible without compromises on metadata models, on information representation, on clone detection. A website of one particular conference typically shows very clearly which volume of which journal contains its post-proceedings special issue, while DBLP habitually gives you all issues of the conference and all issues of all journals and leaves the search for a match in your own hands. University libraries fall into the same category: while limiting their databases to material available physically or through subscriptions, they do not differentiate among domains, so searching for "mutation" will likely result in many items unrelated to mutation testing; and searching for "graph", while more productive, will still yield results from graph transformation research as well as from general graph theory.

The quest for broad coverage makes the project vulnerable. For instance, DBLP covers millions of authors and thus has to be extremely careful about not confusing authors with similar names; however, many researchers, especially in the pre-Google era, did not always write their names in the same fashion. This would have been known to domain experts who are familiar with key authors in their field, but domain knowledge does not scale up. Similarly, Google Scholar relies on its web crawler, and so it is not uncommon for it to point you to papers that are no longer available or are in fact no papers at all, no matter what their authors claim. Microsoft Academic Search is based on citation information, and as a result of different people citing the same venue in different ways (e.g., with "International Conference" or without it), the same venue appears several times in the ranking, each time positioned much lower than it deserves.

1.3 Missing features

When we like a paper, we often begin investigating its authors to see if they have contributed to similar lines of research before or after. DBLP lookup has become a part of a routine check in many cases from research exploration to job candidate evaluation. However, a graph transformation researcher that occasionally published a model transformation paper, or a grammarware engineer masquerading as a metamodel evolution contributor, will have different styles across their other papers, and might not be as fruitful to investigate if your interest is particular and your time budget is limited. What could have helped here is some visualisation beyond the textual: instead of browsing through a multi-page wall-of-text profile on DBLP, some of us would have wanted to take a quick look at a diagram depicting community contribution in a concise and illustrative manner.

Natural language processing techniques have a powerful arsenal: even the simplest analyses like stemming and lemmatisation can provide great aid in surfing through the ocean of papers to pick the right ones to read and cite. It is common knowledge that the names of conferences do not always completely represent their intentions: having "languages" in the name can mean one or two of a dozen entirely different research directions; venues with "engineering" in their name can get quite science-y and theoretical, just as a name starting with "trends" does not mean all papers are surveys, overviews and vision statements. To the best of our knowledge, no existing bibliographic website currently provides a lot of NLP-based features, although ACM Digital Library has recently started collaborating with IBM Watson to pursue that.

Scraping older sources, from document scans to websites that fell apart decades ago and have their ruins exposed through the Wayback Machine, is usually beyond the goals and capabilities of bibliographic websites. Armed with domain knowledge and an interest seriously linked to that domain, we can gather enough effort to complete such endeavours and ask senior and emeritus colleagues directly about that one long-forgotten obscure workshop that a reputable conference has grown from.

Grouping and clustering of conferences is usually either manual work, or done through event co-location, or not done at all. The first option is labour-intensive, error-prone, vulnerable to biases and prejudice. The second option delivers complications for roaming venues like BX (deliberately co-locating each year with a different community: ETAPS, STAF, VLDB, etc.) and for diverging venues that stopped co-locating deliberately to emphasize pursuing a divergent path. The third option is not an option at all, since even fairly focused researchers will find themselves contemplating submission to a dozen or two reasonable venues. There is quite some space for automated clustering.

Topic-driven grouping is not the only kind of classification that would be sensible for a bibliographic portal: some venues are linked by a subcommunity of people who strongly contribute to both. For instance, there are many people who publish regularly both at MoDELS and ICSME/SCAM, even though they cannot attend both within the same year (they happen simultaneously). Having linked data about people's contributions, we can surface such relations, and some RDF frontends to DBLP let you do that with a couple of medium-size SPARQL queries.

Despite all that was said in § 1.1 about the state of BibTeX entries obtainable from available sources, we still want to have some freedom in formatting: everyone in computer science research knows what LNCS is; in a paper submitted to SLE one does not need to explain this abbreviation; editor names are nice to have but sacrificeable under pressing space constraints, etc. We want flexible BibTeX formatting: DBLP provides you with some very limited options (crossref or no crossref); IEEE Xplore and Elsevier as well (abstract or no abstract); but BibSLEIGH even in its very beginning stage provides its users with more freedom.

Desktop software for managing bibliographies like Mendeley has tagging functionality that can help its users to sort the papers they read into different categories or add brief descriptions to them. However, there is a huge gap between doing that and providing a comprehensive annotated bibliography on the subject: in fact, such contributions are rare and properly treasured, for it takes a lot of expertise and work to craft them. Unfortunately, there are many more topics and subtopics than there will ever be annotated bibliographies. We need some semi-automatic way of providing us with at least bundles of related papers if we indicate the selection criteria.

1.4 Distributed information

It was already pointed out above that participating in event organisation and serving in programme committees can be seen as community binding and is therefore metadata of interest. Yet, to the best of our knowledge, there is no project currently dedicated to collecting this kind of information, and it remains scattered half over the internet and half in the Wayback Machine.

Mathematics Genealogy Project [C+] is a totally disconnected project dedicated to documenting topics of doctoral dissertations (and occasionally habilitations) and supervisorship information. It certainly has a merit of its own, but we believe it can also be coupled with other kinds of metadata in a sensible way.

Affiliation information very occasionally finds its way into DBLP as well as into Google Scholar where academics can log in and update it (unfortunately, some choose to log in and prohibit Google from ever showing information about them), but there is no easy way of tracking and leveraging it. However, it is not outrageous to think of research dedicated to tracking research centres of activities on particular topics over the years.

Finally, citation information is available on publishers' websites in limited form (because they are not big fans of sharing it among themselves) and on Google Scholar (where it is heavily guarded against any form of automated scraping). While acknowledging some interest in it, we choose to avoid this aspect for now, because it is not static by nature: citation information available today can be totally out of date by tomorrow. However, there is a lot of potential research here that goes way beyond traditional bibliometrics: for instance, we can identify canonical sources (which often will be books, like the Dragon Book [ASU85]) that are used throughout a large fraction of papers in a specific conference, and find other venues in a different language that have the tendency to cite translations of this book.

Additionally, academic articles also contain links to web resources such as additional documentation, wikis and tool repositories, and such links have a half life of 4 years on average [Spi03]. The Software Heritage Project was recently proposed by Roberto Di Cosmo as a project to organise, preserve and share all academically produced software to provide much desired availability, traceability and uniformity. Unfortunately the project seems to be in its early stages: its call to action is available on SlideShare [Cos15] but the project itself is yet unknown to public search engines. It will be interesting to see if the corpus of BibSLEIGH can be automatically mined for references to tools and clustered by technological space.

2 BibSLEIGH to the rescue!

BibSLEIGH is a work in progress. Keeping that in mind, we would like to sketch preliminary requirements and architecture decisions in § 2.1, point out some related work in § 2.2 and describe the state of the project as it is by the time of submission in § 2.3. Next, § 3 will draft some possible future directions we might decide to explore.

2.1 Proposed solution

In the centre of BibSLEIGH there is one centralised repository containing all its data in JSON format. We call it LRJ, short for Lexically Reliable JSON, because we store all key-value pairs one per line sorted by keys. This was chosen over a more classic database setup in order to allow individual traceable edits of each piece of data and at the same time to guarantee user responsiveness. Data is imported to this central place through any of the existing importers, which are usually implemented as iterative parsers (to process the DBLP dump which is around 2 GB) or webscrapers (at this moment we have those for individual DBLP pages, CEUR and EasyChair). JSON files can also obviously be added manually. There is also an ad-hoc importer that creates appropriate JSON entities from a list it reads from a textual file, which helps to properly add ancient entries.

Once the data is in the repository, it can be further curated, normalised, improved, enhanced and crosschecked with other sources. Typical maintenance activities include adding a fresh issue of an already known conference or a journal issue known to be related to one of the known conferences (automated: one just needs to run an incremental updater), improving the name of the proceedings booktitle (semi-automated: changed manually at the top and automatically propagated downwards), removing non-academic clutter such as forewords and panel summaries (manually or heuristic-based). As an example of crosschecking we can talk about adding PC members and organisers: this information is never found on DBLP, but can be harvested elsewhere and integrated into the same system.

Once normalisation reaches a point of being a valid input for analysis, we enrich the data by stemming all titles and tagging them with predefined tags. Following the spirit of the rest of the project, each tag has its own definition stored in a separate JSON file which can be accessed, inspected and changed right on GitHub. Stemming provides a fully automated foundation to naturally link papers to their conceptual neighbours; tags play the same role for previously known manually defined concepts (so that λ-lifting falls under the same tag as λ-calculus, but µ-kernel is kept away from μ-calculus, even though the characters look similar [footnote 1]). Each tag definition can contain links to Wikipedia, Wikidata and other places that are displayed on the tag's webpage. Stems can only rely on automatically derivable information, so their webpages display neighbours: stems that are commonly used together with them.

[Footnote 1] As a side remark, in Unicode these are different symbols: µ-kernel is read as "microkernel" and therefore uses the micro sign character (U+00B5), while μ-calculus is read as "mu-calculus" and is thus represented by the Greek small letter mu (U+03BC). BibSLEIGH is the only website that gets it right in all places; the readers are welcome to check.

Whenever the central dataset of BibSLEIGH is needed for inspection, it is formatted as a collection of almost-static XHTML pages: the only dynamic part of them is the pretty-printing of BibTeX itself. The look of BibSLEIGH is less austere than that of DBLP: it makes full use of a palette of colours and a collection of icons for each covered brand of conferences.

2.2 Related work

In the field of High-Energy Physics there has been a movement concerning long-term preservation of publications, datasets, repositories and relations between them [GMH+09, GMB10, AAA+12, Sou13], and there is a prospering project called INSPIRE-HEP at http://inspirehep.net. It covers a different domain than software (language) engineering, but otherwise partly addresses the same problems we have pointed out. It does offer additional functionality such as job listings and does not intend to cover some of our goals such as visualisations.

ACM Digital Library in recent collaboration with IBM Watson has started to provide a feature called Concept Insights. For each paper, two things can be explored: "concepts in this article", which links glossary terms mined from the full text of the paper to their definitions on Wikipedia, and "recent authors with related interests", which visualises people who recently published something that shares these concepts. This functionality is certainly welcome, even though it remains to be seen how such automated concept matching can compete with and complement manual research efforts in taxonomies that try to identify key publications and tie them with key concepts and relations between them: examples exist for taxonomies of domain specific aspect languages [FDNT15], reverse engineering [CC90], reverse architecting [PDP+07], (un)parsing [ZB14], algorithm animated visualisation [KKM06], security topics [KLS09]. Information retrieval research has also demonstrated promising results in helping to select features for automated induction [YC09, LWT08] and refinement [HZL06, Nov07] of taxonomies, which we have not yet explored.

One step farther from bibliographical repositories there are model repositories such as FMI (Free Model Initiative) [SHK14], ReMoDD (Repository for Model Driven Development) [FBM+12], CDO (Connected Data Objects) [Ecl09], Atlantic Metamodel Zoo [Atl05], Grammar Zoo [Zay15], GenMyModel [Gen14], that are on a quest of collecting models for various purposes. There are quite a number of initiatives related specifically to community management and facilitation: DBLP [Ley02], Reengineering wiki [vDV02], Researchr [VVvC09], Research 2.0 [ABFM09], SL(E)BOK, etc. They usually combine requirements elicitation with experience reports and with calls to arms. One of those very similar to ours is MetaScience [CCCB14]: unlike BibSLEIGH, which mainly aims at cross-referencing various information sources and using domain knowledge, MetaScience is focused exclusively on automatically deriving metadata such as coauthor graphs and pages published per year, and contains impressive interactive visualisations of it.

Linked data is an initiative that started in the semantic web community and has gained a lot of attention over the decade of its existence. The idea revolves around uniform identification of entities by URIs and uniform encoding of a graph of their relations as a collection of subject-predicate-object triples. There are standard formats for specifying the triples (mostly RDF or Turtle), languages for querying them (nowadays mostly SPARQL) and over half a thousand open datasets containing up to several billion such triples [CJ14]. There is research evidence, backed up by operational prototypes, that points to the usefulness of linked data for many related tasks from connecting community heritage [WNB+15] to mining software repositories [KFH+12].

2.3 Terminology and current state of BibSLEIGH

By domain we mean a top group of conferences: the front page of BibSLEIGH displays logos of its domains. Right now they are defined ad-hoc with the help of some domain knowledge; in the future we will use automated clustering techniques to form such domains.

A brand is a series of events with continuing numbering and, more often than not, the same name. One event can belong to several brands: a brand of MoDELS covers the UML series because they kept the numbering, but events of the brands LDTA and ATEM belong only to the domain of SLE, not to the brand SLE.

Each proceedings entity is called an issue: usually it is a regular conference proceedings issue, but it can also be a journal special issue. Multi-volume proceedings have one issue per volume because BibTeX entries for such volumes are different.

A tag is a predefined term such as "context-free grammar" or "visual notation" specified as a set of matching rules covering spelling variants and synonyms (so a paper with "graphical notation" in the title will be tagged with "visual notation"). There are several style-defining tags like "question" (the title ends in a question, like "Can Programming Be Liberated from the Von Neumann Style?"), "towards" (like "Towards Incremental Execution of ATL Transformations"), "considered harmful", "past, present and future", etc. Interestingly, one of the most popular tags (covering around 7.2% of all papers) is "named", which corresponds to the pattern of starting the title with a word followed by a colon or an em-dash, like "Lilith: A Personal Computer for the Software Engineer", or "Miranda: A Non-Strict Functional language with Polymorphic Types", or "GHC: Operational Semantics, Problems, and Relationships with CP (↓, |)". Currently tags are created based on titles only, because that information is indisputably in the public domain and can be used fairly; there is an ongoing discussion about fair use of abstracts and keywords, but technically they can be harvested as well, so we plan to do so (perhaps not committing the results of such harvest to public repositories to avoid copyright claims).

A word is what we call a stem obtained from a classic Snowball stemmer for English. We use our own lexer that tries to split camelcased words properly: not just "CamelCase" to "Camel" and "Case", but also "APIExplorer" to "API" and "Explorer" and "XSDtoMOF" to "XSD", "to" and "MOF" (it also leaves "JavaScript" intact!). Figure 1 shows a typical use of a word link.

A role is some facilitating role a person has played in an issue: being an editor, a keynote speaker, a PC member, etc., are roles.

By the time of submission of this paper, BibSLEIGH covered 166 brands in 26 domains, summarised in Table 1. There are 2726 issues of these brands with 144589 papers in total. There are currently 684 tags with 354720 markings. The total vocabulary is 24359 stems derived from 1183492 words. BibSLEIGH contains profiles on 150454 people.

Table 1: Snapshot of the brands and domains currently in BibSLEIGH.

Domain | Brands
Applied computing | SAC
Components / architecture | WICSA, ECSA, CBSE, QoSA
Design / automation | ASE, CASE, DAC, DATE
Documentation / databases | DocEng, DRR, HT, ICDAR, PODS, SIGMoD, TPDL, JCDL, VLDB
Education | CSEET, ITiCSE, TFPiE, LAK, SIGITE
Federated computing | PEPM, PLDI, SAS, STOC
Formal language theory | AFL, CIAA, DLT, ICALP, LATA
Formal methods | FM, iFM, SEFM, SFM, VDM
Functional | AFP, CEFP, FPCA, ICFP, IFL, ILC, LFP
Graphs | ICGT, AGTIVE, GaM, GCM, GG, GRAPHITE, GT-VMT
High level / logics | ALP, FLOPS, GPCE, LOPSTR, PLILP, PPDP, QAPL
Human factors | CHI, CSCW, DHM, DUXU, HCD, HCI, HIMI, IDGD, LCT, OCSC, SCSM, SOFTVIS, VISSOFT
Information systems | CAiSE, EDOC, ICEIS
Knowledge engineering | CIKM, ECIR, ICML, ICPR, KDD, KDIR, KEOD, KMIS, KR, LSO, MLDM, RecSys, SEKE, SIGIR, SKY
Language engineering | SLE, ATEM, LDTA, ASF+SDF, WAGA
Modelware | MoDELS, UML, ECMFA, ICMT, AMT, BX
Object orientation | ECOOP, Onward!, OOPSLA, PLATEAU, SPLASH, TOOLS
Product lines | SPLC, PLEASE
Programming languages | POPL, PADL
Reliability | AdaEurope, HILT, SIGAda, TRIAda
Requirements | ICRE, RE, REFSQ
Software engineering | ESEC, FSE, ICSE, GTTSE
Software evolution | SANER, SCAM, CSMR, WCRE, ICPC, ICSME, PASTE, MSR
System software | ASPLOS, CC, COCV, CGO, HPCA, HPDC, ISMM, LCTES, OSDI, PLOS, PPoPP, SOSP
Testing | CADE, CAV, CSL, FATES, FLoC, ICLP, ICST, ICTSS, IJCAR, ISSTA, LICS, MBT, RTA, SAT, SMT, TAP, TLCA, VMCAI
Theory of software | ESOP, FASE, FoSSaCS, TACAS, WRLA

The oldest entry so far is the First International LISP Conference held in 1963 in México, with attendees like John McCarthy and Marvin Minsky. It has mostly historical value, but a nice part was that it was possible to surface most of the papers and reconstruct metadata by googling and scraping. This issue is not present on DBLP.

Many mistakes in DBLP data (and sometimes in publishers' data) were corrected because they became quite apparent once automated processing began: the longest stems were words erroneously glued together; matching heuristics work reasonably well to equate different spellings of names with diacritics, etc. An example of a DBLP mismatch can be seen by comparing http://dblp.uni-trier.de/db/conf/edoc/edoc2007.html to http://bibtex.github.io/EDOC-2007.html: except for 10.1109/EDOC.2007.42 and 10.1109/EDOC.2007.44, all DOIs at DBLP are incorrect but fixed at BibSLEIGH. This was spotted automatically by reporting that some entries in this issue had no page information; an attempt to fix it revealed a mismatch between DBLP and IEEE Xplore. DOI information is usually reliable; we know of only one counterexample: http://doi.ieeecomputersociety.org/10.1109/ICSM.1997.624246 resolves successfully, but http://dx.doi.org/10.1109/ICSM.1997.624246 does not.

Figure 1: A screenshot demonstrating the usefulness of stemming: an abstract domain is a proper tag, but functor is not, but we can still jump from this paper to all 17 papers that use that word and then to any of them with just another click.
Figure 2: The front page of BibSLEIGH with 26 domains

Some of the person profiles in BibSLEIGH might erroneously view several namesakes as one person: no noticeable attention has been devoted to this issue so far. Some scraping for roles has begun: so far we have 4154 roles, which is almost 10 times the size of the dataset of Vasilescu et al. [VSM13], but still around 5% of the total work if we optimistically estimate 10 organisers and 20 PC members on average per issue. Figure 3 and Figure 4 show two examples of person profiles, with corresponding narrations in the captions. Notice how the profile is interpreted without the usual bibliometric remarks about the number of papers!

Figure 3: Profile example: a grammarware researcher that started at CC and even RTA, to move on to the likes of SCAM and CSMR. Strong community involvement in LDTA, SLE and SANER, even though he has not published at SANER for a while, preferring ICSM(E). Recently started to broaden his interests to contribute to issues in the domains of testing, architecture and automation. Strongly collaborates with one of his colleagues (not inferable from the raw data: ex-supervisor). The profile is incomplete because we do not have complete information on all involved venues yet!

Figure 4: Profile example: a modelware researcher with a strong focus on one domain: started in OOP, moved to enterprise and settled in the model-driven domain, which is reflected not only by contributions, but also in his vocabulary. Strong community involvement in modelware venues. Prefers writing solo papers, but also collaborates broadly, with a bias towards one of his colleagues (not inferable that it is an ex-student). The profile is incomplete because we do not have complete information on all involved venues yet!

Exploring the rest is left as an exercise to the reader:

• http://bibtex.github.io (web front end)
• http://github.com/slebok/bibsleigh (partially curated JSON data)
• http://github.com/bibtex/bibsleigh (JSON refactorings and visualisations)

3 Future directions

What makes BibSLEIGH more than a glorified wrapper for DBLP is harvesting its domain specificity and community specificity. While keeping the automated, semi-automated and heuristic-based transformations as maintenance activities, we can continue ingraining the bibliographic entities and their groups with information relating them to one another, as well as to concepts, methods, frameworks, approaches, toolkits, datasets. Implementing various distance metrics, as well as annotating them manually or automatically with topic information, can aid clustering and linking beyond traditional methods that depend on citation information. We see this as another step towards the construction of a body of knowledge for the domain of software language engineering (SLEBoK).

Expansion of the BibSLEIGH data set will continue, but not far: the most interesting next steps involve strategically adding special issues and role annotations to already imported conferences. We are afraid that overly eager expansion will deprive us of the main advantage of being domain-specific. However, if we could find a way to eventually hide irrelevant parts from sight so that a user can productively focus on a reasonable subset, that could solve the problem and open the door wider for interdisciplinary growth of this project.

Navigational support at the current stage of development is already quite strong: domains, brands, tags and words let you browse through thousands of papers quite easily to find that dozen that you are interested in. However, we believe this can be improved further through adding annotations, leveraging metadata, proper visualisations, ground-based ranking and clustering, etc.

At BibSLEIGH's webpage the project is called "facilitated browsing of scientific knowledge". Indeed, providing interactive access to the curated annotated corpus of academic papers on programming language theory, compiler construction, metaprogramming, software evolution and analytics, refactoring and other related topics can serve as an entrance point into the research domain as well as the foundation for some metaresearch activities. Software engineering Master students at the University of Amsterdam have already started using BibSLEIGH actively in their studies.

It remains to be seen which open problems of software language engineering this project can contribute to solving [BZ15]. SLE, besides being a subdomain of software engineering, is known to be a bridging area of research, where a fair share of activities is devoted to seeking similarities between technologies and technical spaces, and to developing techniques with wide and cross-space applicability. However, even within one space reaching a point of soundly relating concepts can take substantial time and effort: consider laying relations between attribute grammars and affix grammars [Kos91] or between object algebras and attribute grammars [RBO14]. We will try to push BibSLEIGH towards facilitating this, and any help is welcome.

References

[AAA+12] Z. Akopov, Silvia Amerio, David Asner, Eduard Avetisyan, Olof Bärring, James Beacham, Matthew Bellis, Gregorio Bernardi, Siegfried Bethke, Amber Boehnlein, Travis Brooks, Thomas Browder, Rene Brun, Concetta Cartaro, Marco Cattaneo, Gang Chen, David Corney, Kyle Cranmer, Ray Culbertson, Suenje Dallmeier-Tiessen, Dmitri Denisov, Cristinel Diaconu, Vitaliy Dodonov, Tony Doyle, Gregory P. Dubois-Felsmann, Michael Ernst, Martin Gasthuber, Achim Geiser, Fabiola Gianotti, Paolo Giubellino, Andrey Golutvin, John Gordon, Volker Guelzow, Takanori Hara, Hisaki Hayashii, Andreas Heiss, Frederic Hemmer, Fabio Hernandez, Graham Heyes, André G. Holzner, Peter Igo-Kemenes, Toru Iijima, Joe Incandela, Roger Jones, Yves Kemp, Kerstin Kleese van Dam, Juergen Knobloch, David Kreincik, Kati Lassila-Perini, and Francois Le Diberder. Status Report of the DPHEP Study Group: Towards a Global Effort for Sustainable Data Preservation in High Energy Physics. CoRR, abs/1205.4667, 2012.

[ABFM09] Denis Avrilionis, Grady Booch, Jean-Marie Favre, and Hausi A. Müller. Software Engineering 2.0 & Research 2.0. In Patrick Martin, Anatol W. Kark, and Darlene A. Stewart, editors, Proceedings of the conference of the Centre for Advanced Studies on Collaborative Research (CASCON), pages 353–355. ACM, 2009.

[ASU85] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, 1985.

[Atl05] AtlanMod. Atlantic Metamodel Zoo, 2005. http://www.emn.fr/z-info/atlanmod/index.php/Zoos.

[BZ15] Anya Helene Bagge and Vadim Zaytsev. Open and Original Problems in Software Language Engineering 2015 Workshop Report. SIGSOFT Software Engineering Notes, 40:32–37, May 2015.

[C+] Harry Coonce et al. Mathematics Genealogy Project. http://www.genealogy.ams.org.

[CC90] Elliot J. Chikofsky and James H. Cross II. Reverse Engineering and Design Recovery: A Taxonomy. IEEE Software, 7(1):13–17, 1990.

[CCCB14] Javier Canovas, Valerio Cosentino, Jordi Cabot, and Robin Boncorps. MetaScience: Analyzing the Research Profile of Authors, Conferences and Journals, 2014.

national ACM SIGIR Conference on Research and Development in Information Retrieval, pages 653–654. ACM, 2006.

[KFH+12] Iman Keivanloo, Christopher Forbes, Aseel Hmood, Mostafa Erfani, Christopher Neal, George Peristerakis, and Juer-
http://som-research.uoc.edu/ gen Rilling. A Linked Data Platform tools/metaScience. for Mining Software Repositories. In Proceedings of the Ninth IEEE Working [CJ14] Richard Cyganiak and Anja Jentzsch. The Conference on Mining Software Reposito- Linking Open Data Cloud Diagram, 2014. ries, pages 3235. IEEE Computer Soci- http://lod-cloud.net. ety, 2012. [Cos15] Roberto Di Cosmo. Ten Years Analysing [KKM06] Ville Karavirta, Ari Korhonen, and Lauri Large Code Bases: A Perspective. http: Malmi. Taxonomy of Algorithm Anima- //tinyurl.com/z44ydlw, 2015. EvoLille tion Languages. In Proceedings of the 2015. ACM Symposium on Software Visualiza- [Ecl09] Eclipse. CDO (Connected Data Ob- tion, pages 7785. ACM, 2006. jects) Model Repository, 2009. https: [KLS09] Justin King, Kiran Lakkaraju, and //eclipse.org/cdo/. Adam J. Slagell. A Taxonomy and Ad- + versarial Model for Attacks Against Net- [FBM 12] Robert B. France, James M. Bieman, work Log Anonymization. In Sung Y. Shin Sai Pradeep Mandalaparty, Betty H. C. Cheng, and Adam C. Jensen. Reposi- Proceedings and Sascha Ossowski, editors, tory for Model Driven Development (Re- of the 24th Symposium on Applied Com- MoDD). In Martin Glinz, Gail C. Murphy, puting, pages 12861293. ACM, 2009. and Mauro Pezzè, editors, Proceedings of [Kos91] C. H. A. Koster. Ax Grammars for Pro- the 34th International Conference on Soft- gramming Languages. In H. Alblas and ware Engineering, pages 14711472. IEEE, Attribute Grammars, B. Melichar, editors, 2012. Applications and Systems, volume 545 of [FDNT15] Johan Fabry, Tom Dinkelaker, Jacques LNCS, pages 358373. Springer, 1991. Noyé, and Éric Tanter. A Taxonomy [Ley02] Michael Ley. The DBLP Computer Sci- of Domain-Specic Aspect Languages. ence Bibliography: Evolution, Research ACM Computing Surveys, 47(3):40:1 Issues, Perspectives. In Alberto H. F. 40:44, February 2015. Laender and Arlindo L. Oliveira, editors, [Gen14] GenMyModel, 2014. 
https: Proceedings of the 9th International Sym- //repository.genmymodel.com. posium on String Processing and Infor- mation Retrieval, volume 2476 of LNCS, [GMB10] Anne Gentil-Beccot, Salvatore Mele, and pages 110. Springer, 2002. Travis C. Brooks. Citing and Reading Be- haviours in High-energy Physics. Sciento- [LWT08] Yuefeng Li, Sheng-Tang Wu, and Xiao- metrics, 84(2):345355, 2010. hui Tao. Eective Pattern Taxonomy Mining in Text Documents. In Pro- + [GMH 09] Anne Gentil-Beccot, Salvatore Mele, An- ceedings of the 17th ACM International nette Holtkamp, Heath B. O'Connell, and Conference on Conference on Information Travis C. Brooks. Information resources and Knowledge Management, pages 1509 in high-energy physics: Surveying the 1510. ACM, 2008. present landscape and charting the future course. JASIST, 60(1):150160, 2009. [Nov07] Vít Novácek. Imprecise Empirical Ontol- ogy Renement  Application to Tax- [HZL06] Ruizhang Huang, Zhigang Zhang, and onomy Acquisition. In Jorge Cardoso, Wai Lam. Rening hierarchical taxon- José Cordeiro, and Joaquim Filipe, ed- omy structure via semi-supervised learn- itors, Proceedings of the Ninth Interna- ing. In Proceedings of the 29th Inter- tional Conference on Enterprise Informa- 10 tion Systems, Volume 2: AIDSS, pages [VVvC09] Eelco Visser, Sander Vermolen, and Elmer 3138, 2007. van Chastelet. Researchr, 2009. http: //researchr.org. + [PDP 07] Damien Pollet, Stéphane Ducasse, Loïc + Poyet, Ilham Alloui, Sorana Cîmpan, [WNB 15] Gemma Webster, Hai H. Nguyen, and Hervé Verjus. Towards A Process- David E. Beel, Chris Mellish, Claire D. Oriented Software Architecture Recon- Wallace, and Je Z. Pan. CURIOS: struction Taxonomy. In René L. Krikhaar, Connecting Community Heritage through Chris Verhoef, and Giuseppe Antonio Di Linked Data. 
In Proceedings of the 18th Proceedings of the 11th Eu- Lucca, editors, ACM Conference on Computer Supported ropean Conference on Software Mainte- Cooperative Work & Social Computing, nance and Reengineering, pages 137148. pages 639648. ACM, 2015. IEEE Computer Society, 2007. [YC09] Hui Yang and Jamie Callan. Feature Se- [RBO14] Tillmann Rendel, Jonathan Immanuel lection for Automatic Taxonomy Induc- Brachthäuser, and Klaus Ostermann. tion. In Proceedings of the 32nd Inter- From Object Algebras to Attribute Gram- national ACM SIGIR Conference on Re- mars. In Proceedings of the 29th Interna- search and Development in Information tional Conference on Object Oriented Pro- Retrieval, pages 684685. ACM, 2009. gramming Systems Languages and Appli- [Zay15] Vadim Zaytsev. Grammar Zoo: A Corpus cations, pages 377395. ACM, 2014. of Experimental Grammarware. Fifth Spe- [SHK14] Harald Störrle, Regina Hebig, and Alexan- cial issue on Experimental Software and der Knapp. An Index for Software En- Toolkits of Science of Computer Program- gineering Models. In Stefan Sauer and ming (SCP EST5), 98:2851, February Manuel Wimmer, editors, Poster Ses- 2015. sion of MoDELS 2014, volume 1258 of [ZB14] Vadim Zaytsev and Anya Helene Bagge. CEUR Workshop Proceedings, pages 36 Parsing in a Broad Sense. In Jürgen 40. CEUR-WS.org, 2014. Dingel, Wolfram Schulte, Isidro Ramos, Silvia Abrahão, and Emilio Insfrán, edi- [Sou13] David M. South. The DPHEP Study Group: Data Preservation in High Energy Proceedings of the 17th International tors, Physics. CoRR, abs/1302.3379, 2013. Conference on Model Driven Engineering Languages and Systems, volume 8767 of [Spi03] Diomidis Spinellis. The decay and failures LNCS, pages 5067. Springer, 2014. Communications of the of web references. ACM, 46(1):7177, 2003. [vDV02] Arie van Deursen and Eelco Visser. The Proceedings of the Reengineering Wiki. In Sixth European Conference on Software Maintenance and Reengineering, pages 217220. IEEE Computer Society, 2002. 
[VSM13] Bogdan Vasilescu, Alexander Serebrenik, and Tom Mens. A Historical Dataset of Pro- Software Engineering Conferences. In ceedings of the 10th Working Conference on Mining Software Repositories, pages 373376. IEEE Computer Society, 2013. + [VSM 14] Bogdan Vasilescu, Alexander Serebrenik, Tom Mens, Mark G. J. van den Brand, and Ekaterina Pek. How Healthy are Software Engineering Conferences? Sci- ence of Computer Programming, 89:251 272, 2014. 11