
CSS Corpus for Reproducible Analysis

Nico de Groot (1)
nico@nicasso.nl

Vadim Zaytsev (0)
vadim@grammarware.net

(0) Raincode Labs
(1) Universiteit van Amsterdam

Reproducibility of research heavily depends on the availability of the datasets from the experiments: in the context of metaprogramming, this means the corpus of the code that was used to run the analyses and transformations. In the case of CSS, the problem is even more acute since the web is a constantly changing environment where the same address can refer to a frequently changing artefact. In this report, we explain how we created a corpus of CSS files as a part of our project of building a framework for analysing style sheets. We also include two case studies of explanatory nature showing how style sheets from various websites go about coding conventions and about code duplication. We believe this work will be useful for other CSS researchers to compare techniques they develop on a uniform yet realistic dataset.

Introduction

CSS, or Cascading Style Sheets, is the de facto standard for specifying the appearance of web pages. It is a language supported to some extent by all existing internet browsers and standardised by the World Wide Web Consortium, the leading authority in web technologies and standards [BÇHL11, eEG+11]. Even in the presence of other better, modern, efficient, well-designed alternatives, it remains the only industrially viable option for deployment of styles, leaving languages like SASS [CWE06] or LESS [SSP+09] to be used strictly on the developers' side, if at all.

A typical style sheet in CSS could look like this: }

This sheet contains one rule with two selectors and three declarations. Each selector specifies one particular element to be matched with this style: in our simple example these are element selectors; other kinds of selectors include class selectors and ID selectors, as well as more complex pseudo-selectors for specifying the first child or the first line of the matched area. Each declaration assigns a value to a property, with the type of the value being determined by the particular property: a padding's value is expected to be a length with a unit, but a font-family property expects a comma-separated list of names of individual fonts and font families.
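The anatomy described above (selectors, declarations, properties and values) can be made concrete with a small sketch. The rule in `EXAMPLE_RULE` is invented for illustration and is not the listing from the report; `parse_rule` is a hypothetical helper that only handles a single flat rule without comments or nested blocks.

```python
import re

def parse_rule(rule_text):
    """Split one simple CSS rule into its selectors and declarations.

    Only handles a single flat rule without comments, nested blocks,
    or braces/semicolons inside strings.
    """
    match = re.fullmatch(r"\s*([^{}]+)\{([^{}]*)\}\s*", rule_text)
    if match is None:
        raise ValueError("not a single flat CSS rule")
    # Selectors are separated by commas in front of the opening brace.
    selectors = [s.strip() for s in match.group(1).split(",") if s.strip()]
    # Declarations are separated by semicolons; each is "property: value".
    declarations = {}
    for chunk in match.group(2).split(";"):
        if ":" in chunk:
            prop, value = chunk.split(":", 1)
            declarations[prop.strip()] = value.strip()
    return selectors, declarations

# Hypothetical rule, invented for illustration (not the report's listing).
EXAMPLE_RULE = """
h1, h2 {
    padding: 10px;
    font-family: Helvetica, Arial, sans-serif;
    color: #333;
}
"""

selectors, declarations = parse_rule(EXAMPLE_RULE)
print(selectors)
print(declarations)
```

Note that the value of `font-family` itself contains commas, which is why the sketch splits selectors and declarations on different separators.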

CSS is an important element of the web development landscape, yet it is largely underrepresented in academic research. In a recent study we managed to cite all peer-reviewed papers ever published with CSS or Cascading Style Sheets in the title [GZ16a]: 4 with general discussions, 2 case studies, 3 on pinpointing language shortcomings and improving on them, 5 on preprocessors, 2 on classification of syntactic errors, 7 on refactoring, 7 on analysis, 5 on security issues, 6 on IDE support. With the rising interest in spreadsheets and their growing acceptance among researchers, CSS is probably the most scarcely investigated industrially successful mainstream software language.

Copyright © by the paper's authors. Copying permitted for private and academic purposes. Proceedings of the Seminar Series on Advanced Techniques and Tools for Software Evolution SATToSE2016 (sattose.org), Bergen, Norway, 11-13 July 2016, published at http://ceur-ws.org

As part of a bigger project of building a framework to analyse CSS specifications [dG16b, dG16a], we have faced the challenge of empirical validation. We wanted to rely on a comprehensive corpus of CSS files, with reasonably high feature coverage numbers and potential for regression testing integration. This report collects many issues related to that particular part of our work, and exposes preliminary results. The work on the actual analysis tool and infrastructure is still ongoing and can be observed from the GitHub accounts of the authors.

Replications of website analysis papers are usually next to impossible, since most modern active vendors change their applications continuously and deploy new versions up to 50 times a day [Sch14]. This means that providing the extensive list of websites used in the experiment is not sustainable, since the actual CSS files behind those names would have changed hundreds or thousands of times between the original experiment and its replication. One of the approaches is to provide a timed snapshot of a collection of web applications, and this is what this report focuses on.
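As a minimal sketch of what a timed snapshot needs to record, the following assumes that each downloaded style sheet is stored together with its address, the time of download and a content hash. The function names, the manifest layout and the example URL are assumptions made for illustration; the report does not describe such a manifest.

```python
import hashlib
from datetime import datetime, timezone

def snapshot_entry(url, css_bytes, fetched_at=None):
    """Describe one downloaded style sheet so a replication can verify it."""
    if fetched_at is None:
        fetched_at = datetime.now(timezone.utc)
    return {
        "url": url,
        "fetched_at": fetched_at.isoformat(),
        "sha256": hashlib.sha256(css_bytes).hexdigest(),
        "bytes": len(css_bytes),
    }

def same_artefact(entry, css_bytes):
    """True if css_bytes is byte-for-byte the content recorded in entry."""
    return hashlib.sha256(css_bytes).hexdigest() == entry["sha256"]

# Hypothetical address, content and date, used only to exercise the sketch.
entry = snapshot_entry(
    "http://example.com/style.css",
    b"h1 { color: red }",
    fetched_at=datetime(2016, 7, 12, tzinfo=timezone.utc),
)
print(entry)
```

A replication that downloads the same address later can call `same_artefact` to find out whether it is still analysing the artefact of the original experiment.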

As related work we can point the readers to Qualitas Corpus [TAD+10], a large collection of compilable software projects in Java; Atlantic Zoo [CTB+03], a versatile gathering of metamodels obtained by many means from mining papers to converting ontologies; or Grammar Zoo [Zay15], a grammarware-centred repository of artefacts of various nature containing knowledge about language structure. On an even more closely related note, both scale-wise and topic-wise, let us point to a recent paper by Mazinanian et al. [MTM14] on discovering refactoring opportunities in CSS. The dataset for the project was made publicly available [Maz14], which was greatly appreciated by replicators [PVZ16b]. However, as the effort by Punt et al. [PVZ16a] showed, the dataset had some issues related to crawling time glitches, crawling location specificity, file access misconfiguration, unavailability of cookies, and files being renamed. In our work we tried to combine the best we could learn from all these projects and approaches, which led to automated crawling of popular and acknowledged sources, with subsequent manual and tool-supported filtering, curation and standardisation. The result has over 0.5 MLOC of correct pretty-printed CSS, and can be used to test parsers and to try out and compare techniques developed by CSS researchers.

The remainder of the report is organised as follows: section 2 explains the inclusion criteria, exposes details on how the corpus was composed and shows simple and advanced metrics calculated on it; section 3 sketches a case study of using the corpus with our work-in-progress framework to detect coding conventions; section 4 shows another case study about clone detection; section 5 draws preliminary conclusions and contemplates future endeavours.

The Corpus: Contents, Selection, Metrics

The 50 style sheets for the corpus were picked from the most popular websites in the Alexa top 500. Duplicate websites within the list, such as http://google.es and http://google.fr, have been ignored, since those would just result in the same style sheets multiple times. Furthermore, http://t.co has been left out since it is the link/redirect service of Twitter, and not a real website. Finally, http://blogspot.com is ignored as well since it refers to http://blogger.com, which is already in the top 50. The final list covered the following websites:

360.cn, aliexpress.com, amazon.com, apple.com, baidu.com, bing.com, blogger.com, chinadaily.com, diply.com, ebay.com, facebook.com, fc2.com, gmw.cn, google.com, hao123.com, imdb.com, imgur.com, instagram.com, kat.cr, linkedin.com, live.com, mail.ru, microsoft.com, msn.com, naver.com, netflix.com, office.com, ok.ru, onclickads.net, paypal.com, pinterest.com, pornhub.com, qq.com, reddit.com, sina.com.cn, sohu.com, stackoverflow.com, taobao.com, tmall.com, tumblr.com, twitter.com, vk.com, weibo.com, whatsapp.com, wikipedia.org, wordpress.com, xvideos.com, yahoo.com, yandex.ru, youtube.com
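The selection rules above can be sketched as a small filter over a ranked list of domains. The ranking used here is a made-up fragment, not the Alexa data, and treating two domains as duplicates when their first label matches (google.es and google.fr) is a simplifying assumption; the report does not state how duplicates were identified.

```python
def select_sites(ranked_domains, excluded, limit):
    """Pick the first `limit` domains, skipping exclusions and regional duplicates.

    Two domains count as duplicates when their first label matches
    (google.es vs google.fr); this is a simplifying assumption.
    """
    chosen, seen_brands = [], set()
    for domain in ranked_domains:
        brand = domain.split(".")[0]
        if domain in excluded or brand in seen_brands:
            continue
        seen_brands.add(brand)
        chosen.append(domain)
        if len(chosen) == limit:
            break
    return chosen

# Hypothetical ranking fragment, in descending popularity.
ranking = [
    "google.com", "facebook.com", "google.es", "t.co",
    "blogger.com", "blogspot.com", "google.fr", "qq.com",
]
# Exclusions named in the text: a redirect service and an alias of blogger.com.
excluded = {"t.co", "blogspot.com"}

corpus_sites = select_sites(ranking, excluded, limit=50)
print(corpus_sites)
```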

The actual style sheets from these websites have been downloaded using the CSS Stats tool [MJO14], which automatically extracts external style sheets as well as embedded CSS from web pages. The CSS provided by CSS [...]

[Figure 2, panels recoverable from the source: (a) Amount of important statements vs the average selector specificity; (c) Percentage of declarations with !important modifiers; (d) Percentage of cloned lines per type.]

[...] upwards trend.

Having this upward trend in the specificity of selectors in source file order does not change the effect of the style sheet: source order is only considered for conflicting selectors with equal specificity, in which case the later one wins [eEG+11]. However, ordering selectors by increasing specificity does make it easier to reason about the CSS. For example, if selectors with high specificity are placed at the beginning of the style sheet and you later have to change the presentation of those elements, you either have to overrule the specificity of the earlier defined selectors, or ensure they all have the same specificity.
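The interplay of specificity and source order can be illustrated with a short sketch. The `specificity` function below is a deliberately simplified approximation of the counting rules of Selectors Level 3 (it ignores `:not()`, namespaces and escapes), and `winning_declaration` assumes that all candidates match the same element, set the same property and do not use !important. Both names are hypothetical and not part of the framework described in this report.

```python
import re

def specificity(selector):
    """Approximate (ids, classes, elements) specificity of one selector.

    Simplified: ignores :not(), namespaces and escapes, and treats every
    single-colon pseudo as a pseudo-class.
    """
    # Pseudo-elements (::before) count like element names.
    pseudo_elements = len(re.findall(r"::[\w-]+", selector))
    rest = re.sub(r"::[\w-]+", "", selector)
    # Attribute selectors count like classes.
    attributes = len(re.findall(r"\[[^\]]*\]", rest))
    rest = re.sub(r"\[[^\]]*\]", "", rest)
    ids = len(re.findall(r"#[\w-]+", rest))
    classes = len(re.findall(r"\.[\w-]+", rest))
    pseudo_classes = len(re.findall(r":[\w-]+", rest))
    # Remove everything counted so far; what remains are element names.
    rest = re.sub(r"[#.:][\w-]+(\([^)]*\))?", " ", rest)
    elements = len(re.findall(r"[A-Za-z][\w-]*", rest))
    return (ids, classes + attributes + pseudo_classes, elements + pseudo_elements)

def winning_declaration(candidates):
    """Pick the declaration that applies among conflicting ones.

    candidates: list of (selector, value) pairs in source order.
    Higher specificity wins; equal specificity is decided by source order.
    """
    best_index = max(
        range(len(candidates)),
        key=lambda i: (specificity(candidates[i][0]), i),
    )
    return candidates[best_index]

# A very specific selector early in the sheet overrules later, weaker ones.
print(winning_declaration([("#menu a", "red"), ("a.link", "blue"), ("a.button", "green")]))
# With equal specificity, the later rule wins.
print(winning_declaration([("p.note", "blue"), ("p.warn", "green")]))
```

The first call shows why highly specific selectors at the top of a style sheet are hard to overrule later: only an equally or more specific selector can win against them.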

An example of a specificity graph is shown in Figure 3, which was created using the style sheet of Whatsapp.com. Like the specificity graphs of most other websites in the sample set, the graph does not display a slowly increasing line. This can have multiple reasons. For example, the CSS of every website in the sample set consists of multiple style sheets which are all combined in one graph, because the CSS Stats tool merged all style sheets into a single style sheet during downloading. Furthermore, CSS preprocessors such as SASS or LESS could have been used to generate the CSS, placing rules in non-optimal positions. Our last hypothesis is that developers are not always completely familiar with the cascading characteristics of CSS.

What is interesting about the specificity graph of Whatsapp.com is that the style sheet immediately starts with very specific selectors. Roughly the first 100 selectors seem to consist mostly of ID selectors, with just some class and element selectors. This could affect the maintenance of the style sheet, as possibly even more specific selectors have to be used later on to overrule the already very specific base styles. This could also explain the high usage of the !important modifier for Whatsapp.com, where 9.09% of all declarations apply it. The !important modifier offers the developers at Whatsapp.com a quick and easy solution to cascading problems, even though it is considered a code smell [Zak11, GZ16a, Gha14].

The high amount of !important modifiers in Whatsapp.com, and its high average specificity, may give the impression that the two are positively correlated. This would be interesting since a higher average specificity would badly affect the maintainability of the style sheet, creating complex cascading-related problems. !important modifiers would be a tempting solution for developers when the average specificity is high, as they are a quick and dirty way to solve these kinds of problems. However, this hypothesis was refuted after a little probe, whose results are shown in Figure 2a. An explanation for this outcome could be that websites which mostly use higher specificity selectors will simply keep creating selectors with even higher specificity values, increasing the average specificity. As long as no low specificity selectors are used, no major cascading issues are likely to occur, and the temptation for developers to use the !important modifier does not increase.

One of the analyses that are possible to implement within our framework is checking whether developers have applied coding conventions correctly. Checking whether a semicolon is present after each declaration, whether short hexadecimal values are used, or whether a vendor-prefixed property is followed by a standard property [GZ16a] is all possible. Since there are also other tools that check coding conventions for CSS, we will compare our implementations of some coding conventions to theirs, and analyse how much more efficient our model is in conducting such analysis. Finally, a selection of the following ten coding conventions will be validated on the sample set, providing additional insights into the quality of the CSS [GZ16b]:

Use short hexadecimal values (Performance)

Use the shorthand margin and padding property (Performance)

Disallow empty rules (Possible error)

Do not use id selectors (Maintainability)

Require standard property with vendor prefix (Compatibility)

When possible, use em instead of pix (Accessibility)

Disallow duplicate properties (Possible error)

Avoid using !important (Maintainability)

Avoid qualifying ID and class names with type selectors (Performance)

The conventions were taken from open-source communities, companies and CSS professionals. They concern possible errors, compatibility, accessibility, maintainability, and performance [GZ16b]. Coding conventions related to lexical details, such as required locations of spaces, are not taken into account, since the CSS Stats tool [MJO14] used to download the style sheets, as mentioned above, has pretty-printed them all uniformly.
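Several of the conventions listed above can be checked with very little code once a rule has been split into selectors and declarations. The sketch below is an illustration only: `check_rule` and the violation names are hypothetical, the checks are regex-based approximations, and this is not the implementation used in our framework nor that of any existing linter.

```python
import re

def check_rule(selectors, declarations):
    """Return names of violated conventions for one rule.

    selectors: list of selector strings.
    declarations: list of (property, value) pairs in source order.
    """
    violations = []
    if not declarations:
        violations.append("empty-rule")
    # Approximation: any '#name' in a selector is taken to be an ID selector.
    if any(re.search(r"#[\w-]+", s) for s in selectors):
        violations.append("id-selector")
    props = [p for p, _ in declarations]
    if len(props) != len(set(props)):
        violations.append("duplicate-property")
    for index, (prop, value) in enumerate(declarations):
        if "!important" in value:
            violations.append("important")
        # #aabbcc can be shortened to #abc when each pair repeats a digit.
        if re.search(r"#([0-9a-fA-F])\1([0-9a-fA-F])\2([0-9a-fA-F])\3\b", value):
            violations.append("long-hex")
        # A vendor-prefixed property should be followed by the standard one.
        vendor = re.match(r"-(webkit|moz|ms|o)-(.+)", prop)
        if vendor and vendor.group(2) not in props[index + 1:]:
            violations.append("vendor-without-standard")
    return violations

# Hypothetical rules used to exercise the checks.
print(check_rule(["#header"], [
    ("color", "#ffffff"),
    ("-webkit-border-radius", "4px"),
    ("margin", "0 !important"),
]))
print(check_rule([".box"], [
    ("-webkit-border-radius", "4px"),
    ("border-radius", "4px"),
]))
```

The duplicate-property check in particular is coarser than a real linter would be, since repeating a property with a fallback value is a common and legitimate idiom.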

Figure 4 shows the percentage of violations per coding convention for the complete sample set. The most violated coding convention is the one disallowing ID selectors. ID selectors are disallowed since IDs should be unique, pointing to only a single element. By using ID selectors, developers limit themselves to styling only a single element, losing the benefit CSS provides regarding the reuse of styles. However, as can be seen in the graph, not all style sheets adhere to this coding convention. For 15 of the 50 websites, at least 10% of the selectors are ID selectors, ranging up to 36.05% (bing.com). Furthermore, we found that the !important modifier is used on average 16 times per 1000 declarations. Some websites even have more than 5% of all their declarations use the !important modifier, with Whatsapp.com on top with 9.09%. Such a high usage of !important modifiers demonstrates bad use or understanding of the cascading characteristic of CSS. Figure 2c shows more information on the usage of the !important modifier.
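The two percentages discussed above are simple ratios. A minimal sketch, assuming the declaration values and selectors of a style sheet have already been extracted into plain lists (the function names and the sample data are hypothetical):

```python
import re

def important_share(declaration_values):
    """Percentage of declarations whose value carries !important."""
    if not declaration_values:
        return 0.0
    flagged = sum(
        1 for value in declaration_values
        if re.search(r"!\s*important\b", value, re.IGNORECASE)
    )
    return 100.0 * flagged / len(declaration_values)

def id_selector_share(selectors):
    """Percentage of selectors that contain at least one ID selector."""
    if not selectors:
        return 0.0
    with_id = sum(1 for s in selectors if re.search(r"#[A-Za-z_][\w-]*", s))
    return 100.0 * with_id / len(selectors)

# Made-up data: 1 of 11 declarations uses !important, 2 of 4 selectors use an ID.
values = ["0 !important"] + ["1px"] * 10
selectors = ["#header", ".nav li", "p", "#footer a"]

print(round(important_share(values), 2))
print(id_selector_share(selectors))
```

An average of 16 uses per 1000 declarations corresponds to a share of 1.6% on this scale.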

Relating the number of violations per category of coding conventions to the number of lines of code in the style sheet gave some insight into the occurrence of violations. Both the maintainability and compatibility related coding smells showed a strong positive correlation with the number of lines of code in a style sheet, with correlations of 0.9938 and 0.9960 respectively. The possible error category also had a positive correlation of 0.8319. For the performance category there was no significant correlation, as its value was 0.0532. These values are based on only 2 to 3 coding conventions per category and are therefore not a complete representation of each category, as only a small selection of the available coding conventions per category has been used. However, they do indicate that there is a need for better CSS standards to prevent, better CSS analysis tools to detect, and better CSS refactoring tools to fix, the decline of quality in style sheets.
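The report does not state which correlation coefficient was used; the sketch below assumes the Pearson coefficient, which is a common choice for relating two numeric series such as lines of code and violation counts. The data points are invented for illustration and are not taken from the corpus.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(var_x * var_y)

# Invented example: violations grow proportionally with style sheet size.
lines_of_code = [1000, 2000, 3000, 4000]
violations = [10, 20, 30, 40]

print(pearson(lines_of_code, violations))
```

A value close to 1, as reported for the maintainability and compatibility categories, means that larger style sheets contain proportionally more violations; a value close to 0, as for the performance category, means size tells little about the number of violations.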

Case Study: Detecting Code Clones

Clone analysis, detection, management and tool evaluation have been very active topics in software engineering research at least since 1994, with numbers of papers dedicated to them climbing each year [RZK14]. Clones are usually considered harmful, since they bloat the codebase and hamper proper maintenance: each bug fixed in cloned code needs its fix propagated to all code incarnations, including those that significantly evolved since the cloning time. Without joining the ongoing discussion on the usefulness of clones for solving some tasks (in particular in product line implementation), we can point out that in general having or lacking duplicates is a remarkable property of the source code, and quite characteristic of the programming style, partly dictated by the chosen software language. Hence, we are interested in investigating clones in CSS. The expectation is to have results similar to other small DSLs [TC11].

Results of running the clone detector on the sample set can be seen in Figure 5. These are the results using the current configuration as shown in the list below. Running the clone detector with different configurations would produce different results; however, for the current sample set these specific settings result in accurate clones.

Clones should have a minimum mass of 6.

Clones should occupy a minimum of 3 LOC.

The following is normalised for type 2 and type 3 clones: file names of style sheets, selectors of rule sets, and media queries in @media rules.

Type 3 clones should be at least 80% equal.
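To make the configuration above more tangible, here is a minimal sketch of how a pair of pretty-printed rule sets could be classified. It is not the clone detector used in this study: it works on lines rather than on syntax trees, it does not model the mass threshold, it only normalises selectors, and it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure. All function names are hypothetical.

```python
from difflib import SequenceMatcher

MIN_LINES = 3          # minimum clone size in lines, from the configuration
MIN_SIMILARITY = 0.80  # type 3 threshold, from the configuration

def normalise(rule_lines):
    """Replace the selector line by a placeholder so only the body is compared."""
    return ["<selector> {"] + [line.strip() for line in rule_lines[1:]]

def similarity(rule_a, rule_b):
    """Line-based similarity ratio between two normalised rule sets."""
    return SequenceMatcher(None, normalise(rule_a), normalise(rule_b)).ratio()

def classify_pair(rule_a, rule_b):
    """Return 'type 1', 'type 2', 'type 3' or None for a pair of rule sets.

    Each rule set is a list of lines: 'selector {', declarations, '}'.
    """
    if min(len(rule_a), len(rule_b)) < MIN_LINES:
        return None
    if [l.strip() for l in rule_a] == [l.strip() for l in rule_b]:
        return "type 1"  # exact copy
    if normalise(rule_a) == normalise(rule_b):
        return "type 2"  # identical once selectors are normalised
    if similarity(rule_a, rule_b) >= MIN_SIMILARITY:
        return "type 3"  # near copy with added, removed or changed lines
    return None

# Hypothetical rule sets used to exercise the classifier.
base = ["a {", "color: red;", "margin: 0;", "padding: 0;", "border: 0;", "}"]
renamed = ["b {"] + base[1:]
extended = ["c {"] + base[1:5] + ["display: block;", "}"]
unrelated = ["d {", "float: left;", "width: 50%;", "}"]

print(classify_pair(base, list(base)))
print(classify_pair(base, renamed))
print(classify_pair(base, extended))
print(classify_pair(base, unrelated))
```

In this sketch `extended` differs from `base` by one added declaration, which keeps the similarity above the 80% threshold and therefore counts as a type 3 clone.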

In Figure 2d, three box plots are shown, one for each clone type. It shows that type 1 clones account on average for 4.76% of the lines, with a median of 0.75%; more interesting, however, are the outliers with more than 20% and, for 2 websites, even around 40% cloned lines. These two are chinadaily.com and ok.ru. A reason for chinadaily.com to have such a high percentage of type 1 clones is that the style sheet contains an @media rule which targets screens with a max-width of 1154 pixels. However, instead of only adding rule sets to the @media rule that override the previously defined properties for specific resolutions (e.g., for smartphones), it also contains direct duplicates of rule sets from outside the @media rule, which add no benefit whatsoever. Refactoring the style sheet to remove this @media rule, and pretty-printing the CSS to run it through the clone detector again, would be a simple way to verify this; however, the @media rule is never closed with a right brace, making this impossible, as we can only guess where the @media rule should have been closed.

For ok.ru, the high amount of type 1 clones does not seem to be related to any @media rule, as was the case for chinadaily.com. For some reason a lot of rule sets that are defined relatively early in the style sheet (first 10,000 lines) are defined again later on (from about line 25,000). It seems that the ok.ru website loads 3 style sheets, with 2 of them being fairly equal, containing a lot of the same rule sets. It seems like one of the style sheets was simply duplicated and then partially modified, without keeping in mind that most rule sets are already defined elsewhere.

Type 2 clones are found most often, with an average of 17.30% and a median of 16.76%, and one outlier of 38.96%, which is microsoft.com. Looking at the style sheet of microsoft.com, it seems that a tool was used to generate the CSS, maybe SASS or LESS, because the first 2000 lines mostly contain rule sets as shown in Listing 1: rule sets with one or multiple selectors and only a single width declaration with a percentage value. Most of these width declarations with equal values occurred multiple times in different rule sets, which could explain the high amount of type 2 clones.

Listing 1: Part of the microsoft.com style sheet

    .CSPvNext .margin-row-fluid>.bp2-col-10-3 {
        width: 27.4%
    }
    .CSPvNext .margin-row-fluid>.bp2-col-10-4 {
        width: 37.2%
    }
    .CSPvNext .margin-row-fluid>.bp2-col-10-6 {
        width: 56.8%
    }

Then there are the type 3 clones, which with the current configuration result in an average of 4.79% and a median of 2.59%. One outlier stands out the most, as it has 73.91% type 3 clones. This is the qq.com website, and what is surprising about qq.com is that, when combining its type 1, type 2, and type 3 clones, 100% of the lines are considered cloned lines. This means that every line in the style sheet is part of one or more clones. After analysing the qq.com style sheet, the high amount of type 3 clones seems to be a result of copying and pasting. To give an example, Listing 2 shows two rule sets taken from the qq.com style sheet. The only things that set these rule sets apart are their selectors and the font-size and display declarations in the second rule set, with the remaining 8 declarations being identical. The two rule sets were not even 100 lines apart in the style sheet, giving the impression that the developers do not fully understand the inheritance and cascading characteristics of CSS, or that maybe they did, but just wanted to develop the style sheet in a short amount of time without being bothered by them. Nevertheless, it shows that the 5,449 lines of code that the qq.com style sheet now uses to style its website can be reduced significantly.

Listing 2: Part of the qq.com style sheet

Preliminary Conclusions and Future Work

In this report, we have explained how we composed a corpus of realistic CSS code from popular websites, as a part of the effort to build a framework for CSS analysis. We have briefly gone through two case studies that showed how the corpus can be (re)used. The project is still a work in progress, but the corpus is ready and is already serving us well.

Ultimately we will use this corpus of CSS files to compare our framework with existing alternatives, by implementing the same algorithms within various frameworks. Smell detection, clone management, metrics calculation and detecting refactoring opportunities will remain the main themes. Analysing the corpus statistically to see which language features of CSS are more widely used and therefore more crucial to support is also an interesting option.

Acknowledgements

This report is based on the extended abstract of the presentation given at the SATToSE symposium in Bergen, Norway, on 12 July 2016 [dG16b], as well as on the graduate thesis defended at the University of Amsterdam, The Netherlands, on 21 July 2016 [dG16a].

References

[AMIO12] Adewole Adewumi, Sanjay Misra, and Nicholas Ikhu-Omoregbe. Complexity Metrics for Cascading Style Sheets. In Beniamino Murgante, Osvaldo Gervasi, Sanjay Misra, Nadia Nedjah, Ana Maria A. C. Rocha, David Taniar, and Bernady O. Apduhan, editors, Proceedings of the 12th International Conference on Computational Science and Its Applications (ICCSA), pages 248–257. Springer, 2012. doi:10.1007/978-3-642-31128-4_18.

[BÇHL11] Bert Bos, Tantek Çelik, Ian Hickson, and Håkon Wium Lie. Cascading Style Sheets Level 2 Revision 1 (CSS 2.1) Specification. W3C Recommendation, June 2011. http://www.w3.org/TR/2011/REC-CSS2-20110607.

[BD16] M. Serdar Biçer and Banu Diri. Defect Prediction for Cascading Style Sheets. Applied Soft Computing, 2016. doi:http://dx.doi.org/10.1016/j.asoc.2016.05.038.

[CTB+03] Jordi Cabot, Massimo Tisi, Hugo Brunelière, et al. AtlantEcore Metamodel Zoo. http://www.emn.fr/z-info/atlanmod/index.php/Ecore, 2003.

[CWE06] Hampton Catlin, Natalie Weizenbaum, and Chris Eppstein. SASS: Syntactically Awesome Style Sheets, 2006. http://sass-lang.com.

[dG16a] Nico de Groot. Analysing and Manipulating CSS using the M3 Model. Master's thesis, Universiteit van Amsterdam, The Netherlands, July 2016. URL: http://www.scriptiesonline.uba.uva.nl/en/scriptie/613750.

[dG16b] Nico de Groot. Analysing CSS using the M3 Model. In Pre-proceedings of the Ninth Seminar on Advanced Techniques and Tools for Software Evolution (SATToSE), 2016. URL: http://sattose.wdfiles.com/local--files/2016:alltalks/SATTOSE2016_paper_10.pdf.

[eEG+11] Tantek Çelik, Elika J. Etemad, Daniel Glazman, Ian Hickson, Peter Linss, and John Williams. Cascading Style Sheets (CSS) Selectors Level 3. W3C Recommendation, September 2011. http://www.w3.org/TR/2011/REC-css3-selectors-20110929/.

[Gha14] Golnaz Gharachorlu. Code Smells in Cascading Style Sheets: An Empirical Study and a Predictive Model. Master's thesis, University of British Columbia, Canada, 2014. URL: http://hdl.handle.net/2429/51364.

[GZ16a] Boryana Goncharenko and Vadim Zaytsev. Language Design and Implementation for the Domain of Coding Conventions. In Tijs van der Storm, Emilie Balland, and Dániel Varró, editors, Proceedings of the Ninth International Conference on Software Language Engineering (SLE), pages 90–104, 2016. doi:10.1145/2997364.2997386.

[GZ16b] Boryana Goncharenko and Vadim Zaytsev. Reverse Engineering a CSS Coding Conventions Catalogue. Draft, https://github.com/boryanagoncharenko/CssCoco/blob/master/analysis.md, 2016.

[Maz14] Davood Mazinanian. Dataset for FSE'14 submission, 2014. URL: http://users.encs.concordia.ca/~d_mazina/papers/FSE'14/.

[MJO14] Adam Morse, Brent Jackson, and John Otander. CSS Stats, 2014. http://cssstats.com.

[MTM14] Davood Mazinanian, Nikolaos Tsantalis, and Ali Mesbah. Discovering Refactoring Opportunities in Cascading Style Sheets. In Proceedings of the 22nd Symposium on the Foundations of Software Engineering (FSE), pages 496–506. ACM, 2014. doi:10.1145/2635868.2635879.

[PVZ16a] Leonard Punt, Sjoerd Visscher, and Vadim Zaytsev. Experimental Data for the A?B*A Pattern in CSS: Inputs and Outputs. In Proceedings of the 32nd International Conference on Software Maintenance and Evolution (ICSME), page 616, 2016. Best Artefact Award. doi:10.1109/ICSME.2016.91.

[PVZ16b] Leonard Punt, Sjoerd Visscher, and Vadim Zaytsev. The A?B*A Pattern: Undoing Style in CSS and Refactoring Opportunities it Presents. In Proceedings of the 32nd International Conference on Software Maintenance and Evolution (ICSME), pages 67–77, 2016. doi:10.1109/ICSME.2016.73.

[RZK14] Chanchal Roy, Minhaz F. Zibran, and Rainer Koschke. The vision of software clone management: Past, present, and future (Keynote paper). In Serge Demeyer, David Binkley, and Filippo Ricca, editors, Proceedings of the Software Evolution Week: Conference on Software Maintenance, Reengineering, and Reverse Engineering, pages 18–33. IEEE Computer Society, 2014. doi:10.1109/CSMR-WCRE.2014.6747168.

[Sch14] Daniel Schauenberg. Development, Deployment & Collaboration at Etsy. In QCon London, 2014. https://qconlondon.com/london-2014/london-2014/presentation/Development,%20Deployment%20&%20Collaboration%20at%20Etsy.html.

[SSP+09] Alexis Sellier, Jon Schlinkert, Luke Page, Marcus Bointon, Mária Jurčovičová, Matthew Dean, and Max Mikhailov. Less, 2009. http://lesscss.org.

[TAD+10] Ewan Tempero, Craig Anslow, Jens Dietrich, Ted Han, Jing Li, Markus Lumpe, Hayden Melton, and James Noble. Qualitas Corpus: A Curated Collection of Java Code for Empirical Studies. In Asia Pacific Software Engineering Conference (APSEC 2010), pages 336–345, December 2010.

[TC11] Robert Tairas and Jordi Cabot. Cloning in DSLs: Experiments with OCL. In Anthony M. Sloane and Uwe Aßmann, editors, Revised Selected Papers of the Fourth International Conference on Software Language Engineering, volume 6940 of LNCS, pages 60–76. Springer, 2011. doi:10.1007/978-3-642-28830-2_4.

[Zak11] Nicholas C. Zakas. Disallow !important, 2011. https://github.com/CSSLint/csslint/wiki/Disallow-!important.

[Zay15] Vadim Zaytsev. Grammar Zoo: A Corpus of Experimental Grammarware. Fifth Special issue on Experimental Software and Toolkits of Science of Computer Programming (SCP EST5), 98:28–51, February 2015. doi:10.1016/j.scico.2014.07.010.