<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Securing Software Ecosystems through Repository Mining</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Aminul</forename><forename type="middle">Didar</forename><surname>Islam</surname></persName>
							<email>aminul.islams@lut.fi</email>
							<affiliation key="aff0">
								<orgName type="institution">LUT University</orgName>
								<address>
									<addrLine>Yliopistonkatu 34</addrLine>
									<postCode>53850</postCode>
									<settlement>Lappeenranta</settlement>
									<country key="FI">Finland</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Utrecht University</orgName>
								<address>
									<settlement>Utrecht</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Slinger</forename><surname>Jansen</surname></persName>
							<email>slinger.jansen@uu.nl</email>
							<affiliation key="aff1">
								<orgName type="institution">Utrecht University</orgName>
								<address>
									<settlement>Utrecht</settlement>
									<country key="NL">The Netherlands</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Securing Software Ecosystems through Repository Mining</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">A78A141EF91FF3A12488719B4AC4BEFC</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:08+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Code clones</term>
					<term>Repository mining</term>
					<term>Code identification</term>
					<term>License violations</term>
					<term>Software engineering</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Through the incessant reuse of code fragments, the worldwide software ecosystem has become highly connected. This provides advantages, such as faster software engineering, but it also introduces new challenges, such as the easier spread of vulnerabilities. The world depends on software, and the proliferation of code also causes the proliferation of vulnerabilities. In this PhD project, we explore the use of a code clone hashing and storing technique, called SearchSECO, to enable fast searches of abstract code clones in the worldwide software ecosystem. With SearchSECO, we can rapidly identify code, code clones, vulnerabilities, license conflicts, and other aspects of code cloning. With SearchSECO as a platform, we hope to move forward the art and science of repository mining.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The worldwide software ecosystem <ref type="bibr" target="#b0">[1]</ref> (SECO) concerns all software-producing organizations and individuals that collaboratively serve a market for software and services. A SECO comprises a network of developers, vendors, consumers, and other stakeholders who interact through various platforms. It includes open-source repository software as well as proprietary software, ranging from small to large-scale enterprise software systems <ref type="bibr" target="#b1">[2]</ref>.</p><p>A SECO represents a set of actors that function together as a unit that interacts and communicates with a shared software market according to the services and their relationships <ref type="bibr" target="#b2">[3]</ref>. Building on this concept, this research proposes a software provenance theory, ensuring that the origin and history of every software engineering artifact are traceable across the entire software supply network.</p><p>Mining Software Repositories (MSR) is a research area within software engineering that focuses on analyzing the vast amount of data generated during software development, maintenance, and usage <ref type="bibr" target="#b3">[4]</ref>. This data is stored in various repositories such as version control systems (e.g., GitHub), issue trackers (e.g., Jira), and code review systems (e.g., Gerrit), among others. The main goal of MSR is to extract actionable insights and patterns that can improve software quality, guide development processes, and inform decision-making within software teams.</p><p>In the broader context of software engineering, MSR plays a crucial role in supporting evidence-based decision-making.
By applying techniques from data mining, machine learning, and information retrieval to software repositories, MSR helps researchers and practitioners understand trends, predict future issues, improve software processes, and evaluate the effectiveness of different practices.</p><p>Traditionally, repository mining has focused on file-level or project-level data, providing a broad view of software systems and their evolution <ref type="bibr" target="#b4">[5]</ref>. However, as software ecosystems grow in complexity, a finer-grained approach is increasingly necessary. By analyzing source code at the method level, researchers can obtain more detailed insights into code reuse, identify security vulnerabilities with greater precision, and understand the intricacies of software dependencies and maintenance challenges <ref type="bibr" target="#b2">[3]</ref>.</p><p>We propose SearchSECO, a hash-based index for code fragments that enables searching source code at the method level in the worldwide software ecosystem <ref type="bibr" target="#b2">[3]</ref>. Currently, it is possible to identify files by their hashes in the Software Heritage Graph. We want to create a set of parsers that extract fragments (methods) from code files and make them findable. By making methods from the worldwide software ecosystem findable, we can perform more reliable license checks, search for vulnerabilities, and extract call graphs from those methods <ref type="bibr" target="#b5">[6]</ref>.</p><p>We unearth the relationships between code fragments, code files, and their projects on a worldwide scale. This fine-grained data enables much richer analyses, significantly moving forward the field of empirical software engineering and its sub-field of repository mining.
Our first projects will deal with license violation detection in open source, vulnerability finding, and software package identification for SBOM creation <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b6">7]</ref>.</p></div>
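<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the idea of a hash-based, method-level index concrete, the following minimal Python sketch extracts the functions of Python source files, abstracts identifiers, and indexes the resulting hashes. It is an illustration under stated assumptions, not the SearchSECO implementation: the abstraction rules, the choice of SHA-256, and the names method_hashes, build_index, and FILES are hypothetical.</p><p>
```python
import ast
import copy
import hashlib
from collections import defaultdict


class _Abstractor(ast.NodeTransformer):
    """Replace identifiers with placeholders (an assumed abstraction level)."""

    def visit_Name(self, node):
        return ast.copy_location(ast.Name(id="VAR", ctx=node.ctx), node)

    def visit_arg(self, node):
        node.arg = "ARG"
        node.annotation = None
        return node


def method_hashes(source):
    """Yield (method_name, digest) for each function definition in a file."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            abstract = _Abstractor().visit(copy.deepcopy(node))
            abstract.name = "FUNC"  # the method name itself is abstracted too
            text = ast.dump(abstract)  # line numbers are not part of the dump
            yield node.name, hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_index(files):
    """Map each digest to the set of (project, method) pairs that contain it."""
    index = defaultdict(set)
    for project, source in files.items():
        for name, digest in method_hashes(source):
            index[digest].add((project, name))
    return index


# Hypothetical input: two projects containing the same method, renamed.
FILES = {
    "project_a": "def add(a, b):\n    total = a + b\n    return total\n",
    "project_b": "def plus(x, y):\n    s = x + y\n    return s\n",
    "project_c": "def neg(x):\n    return -x\n",
}

index = build_index(FILES)
for digest, locations in index.items():
    print(digest[:12], sorted(locations))
```
</p><p>Because identifiers are abstracted before hashing, the renamed copies in project_a and project_b end up under the same key, while the unrelated method in project_c gets its own entry. A higher level of abstraction finds more clones at the cost of more false positives.</p></div>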
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Research Aim and Impact</head><p>The primary aim of this research is to enhance the capabilities of repository mining through the development and deployment of SearchSECO, a hash-based index for code fragments <ref type="bibr" target="#b6">[7]</ref>. By enabling method-level source code searching, this research seeks to address critical issues in license conflict detection and vulnerability benchmarking within the worldwide software ecosystem. The goal is to provide a robust, scalable tool that can significantly improve the accuracy and efficiency of software license compliance and security vulnerability detection.</p><p>Impact of the Project: The project's main predicted impacts are: 1) enabling data-driven analysis of software ecosystems, and potentially faster submissions to the repository mining community; 2) easier and faster code license checking and conflict identification; 3) better code quality and better findability of code; and 4) promotion of the SearchSECO tool among researchers, with attention to its ease of use.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Related Work</head><p>Kim et al. introduced VUDDY (VUlnerable coDe clone DiscoverY), a state-of-the-art, scalable approach for vulnerable code clone detection <ref type="bibr" target="#b4">[5]</ref>. It identifies code clones and vulnerabilities by leveraging the syntactic and symbolic information of the code. The work considers the four types of code clones recognized in the literature, at granularity units such as token level, line level, function level, and file level. However, it has limitations in terms of accuracy and consistency because of its granularity abstraction. This leads to more false negatives, and the paper acknowledges that these false negatives raise a concern about trustworthiness. We use techniques similar to those of VUDDY; however, we store all the methods that we encounter instead of only the methods with potential vulnerabilities. This enables us, for instance, to study license violations, something that VUDDY was not built for.</p><p>License conflicts and violations are universal issues, and software licenses generally fall into two categories. The first comprises declared licenses, which are specified for the whole project, and the second comprises in-code licenses, which are directly attached to files throughout the entire directory tree <ref type="bibr" target="#b7">[8]</ref>. Most violations originate from declared-to-in-code mismatches; declared-to-declared inconsistencies also occur, but less often.</p><p>Research shows <ref type="bibr" target="#b8">[9]</ref> that there are multiple reasons behind code license violations in open-source software (OSS). The first one originates from the resource and time constraints of software developers, who do not want to focus on tasks they consider trivial.
The second one is related to misconceptions about the nature and characteristics of open-source licenses, which have grown over time because of the large number of repositories produced. Another paper discusses the incompatibility of licenses among components: for example, the GNU General Public License (GPL) has multiple versions but is not backward compatible, so components released under GPL version 3 are not compatible with components released under GPL version 2 <ref type="bibr" target="#b9">[10]</ref>. However, when the same project carries two different licenses, this is an inconsistency that does not necessarily imply a license conflict. A license conflict means contradictory, incompatible obligations or contradictory rights <ref type="bibr" target="#b7">[8]</ref>.</p><p>Wolter et al. also published a study based on 1,000 GitHub repositories and found that fifty percent of the repositories did not include a complete and accurate list of all the licenses associated with the code. Of these, 10% have a mismatch between permissive and copyleft licenses. This work relied heavily on existing open-source tools such as Nomos and ScanCode, and it also mentioned the need for tools that scan licenses directly from the code.</p><p>The FOSSology project is one of the most widely used projects for license detection <ref type="bibr" target="#b10">[11]</ref>. Its main tool is a license scanner based on regular expressions. For the purpose of license scanning, FOSSology introduced open-source license scanners, for example, Ninka. Ninka analyzes sentences within source code comments and can recognize over 120 different licenses <ref type="bibr" target="#b11">[12]</ref>.
Some research has been done on binary code clone detection for identifying software code released in binary form <ref type="bibr" target="#b12">[13]</ref>. It notes that upstream suppliers often provide solutions in binary form, which makes it difficult to assess the existence of unlicensed third-party code. It also notes that license violations are not accidental but rather systematic, and that this is a large-scale problem for most software and hardware products.</p><p>This research builds on SearchSECO, a solution previously proposed by Jansen et al.: a hash-based index that aims to collect billions of open-source files from open-source repositories to provide full software provenance, which also addresses the problem of license conflicts and violations <ref type="bibr" target="#b2">[3]</ref>.</p></div>
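<div xmlns="http://www.tei-c.org/ns/1.0"><p>The distinction between declared and in-code licenses, and between an inconsistency and a permissive/copyleft mismatch, can be sketched in a few lines of Python. This is a simplified illustration and not the method of any of the cited tools: the two license sets are a small hand-picked assumption, the function names are hypothetical, and a real compatibility decision requires a far richer model of license obligations.</p><p>
```python
# Assumed, deliberately incomplete classification of SPDX identifiers.
PERMISSIVE = {"MIT", "Apache-2.0", "BSD-3-Clause"}
COPYLEFT = {"GPL-2.0-only", "GPL-3.0-only", "AGPL-3.0-only"}


def license_family(spdx_id):
    """Return the coarse family of a license identifier."""
    if spdx_id in PERMISSIVE:
        return "permissive"
    if spdx_id in COPYLEFT:
        return "copyleft"
    return "unknown"


def find_inconsistencies(declared, in_code):
    """Compare the declared project license with per-file in-code licenses.

    `in_code` maps a file path to the license identifier found in that file.
    Every difference is reported as an inconsistency; it is only labelled a
    permissive/copyleft mismatch when the two families actually differ.
    """
    findings = []
    for path, found in sorted(in_code.items()):
        if found == declared:
            continue
        families = {license_family(declared), license_family(found)}
        if families == {"permissive", "copyleft"}:
            kind = "permissive/copyleft mismatch"
        else:
            kind = "inconsistency"
        findings.append((path, found, kind))
    return findings


# Hypothetical repository: declared MIT, with differing file headers.
findings = find_inconsistencies(
    "MIT",
    {"src/a.py": "MIT", "src/b.py": "GPL-3.0-only", "src/c.py": "Apache-2.0"},
)
for finding in findings:
    print(finding)
```
</p><p>The sketch keeps the two notions separate on purpose: a file under Apache-2.0 in an MIT project is reported as an inconsistency only, in line with the observation that an inconsistency does not necessarily mean a conflict.</p></div>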
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Research Method</head><p>Software engineering is a maturing field, and so is repository mining as a branch of it. This can be observed especially when looking at the way in which empirical software engineering research is being conducted and evaluated. In this thesis project, we follow the empirical software engineering research standards for our sub-projects.<ref type="foot" target="#foot_0">1</ref> To address the problems of vulnerability detection, license conflicts, software maintenance, and scalability in large-scale software ecosystems, this PhD research will enhance the capabilities and explore the possibilities of an already proposed solution named SearchSECO <ref type="bibr" target="#b2">[3]</ref>. This PhD research follows the Design Science Research (DSR) paradigm because it is based on software artifact development, which involves an evaluation process for code license violation detection. The evaluation process will have two steps. The first step is to establish a vulnerability benchmarking framework to evaluate the capabilities of the SearchSECO software artifact. The second step is a case study in an industry organization; for this purpose, we plan to incorporate other research methods under DSR to answer a particular research question. For example, to deploy SearchSECO in an industry organization and to identify code license violations in large unidentified code bases, we will follow Action Design Research (ADR) as a part of this PhD research (section 6, WP2 and WP3) <ref type="bibr">[14,</ref><ref type="bibr" target="#b13">15]</ref>. ADR is specifically designed for developing and evaluating artifacts within organizational settings where researcher intervention is expected <ref type="bibr" target="#b14">[16]</ref>.
This method centers on creating, intervening, and evaluating an artifact that embodies the researchers' theoretical foundations and intentions while incorporating user influence and the impact of real-world use.</p><p>A validity threat for our project is that we can hardly generalize our findings to the whole software ecosystem. Even though we have currently analyzed the top 100,000 projects on GitHub, that is still only a fraction of the total GitHub source code, let alone the worldwide software ecosystem's source code. For now, we will refrain from making inductive statements outside of the scope of our own data set.</p><p>A construct validity threat concerns the definition of a clone. Currently, we are storing any clone, including, for instance, getters and setters, in an abstract manner. However, as these are typically uniform and we use a high level of abstraction, we find many false positives in our clone set. In the near future we hope to counter this validity threat by setting a minimum length for a 'valid' clone, e.g., five lines of code.</p><p>A proposed and partially implemented system diagram is shown in Figure <ref type="figure" target="#fig_0">1</ref>. It illustrates how SearchSECO is intended to collect source code from the worldwide software ecosystem and store method-level code, together with the call graph of the code, in a Software Method Knowledge Base (SMKB), which allows for structural analysis of the source code. For example, license violation and vulnerability patterns will be identifiable using the call graph of the source code. Thus, in the software engineering domain, and more specifically in repository mining, SearchSECO is a radical innovation <ref type="bibr" target="#b2">[3]</ref>. To construct SearchSECO we follow four lines of inquiry: first, we develop parsing techniques and design work distribution mechanisms to explore the global SECO. Next, we store the collected methods in the Software Method Knowledge Base (SMKB).
Third, we apply basic data analysis techniques to the stored data within the SMKB. Finally, we leverage artificial intelligence to perform graph mining on the worldwide SECO graph for deeper insights <ref type="bibr" target="#b2">[3]</ref>.</p></div>
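<div xmlns="http://www.tei-c.org/ns/1.0"><p>The minimum-length countermeasure for the construct validity threat, combined with a basic analysis over stored method records, could look like the Python sketch below. The record layout (MethodRecord), the function cross_project_clones, and the sample data are hypothetical and do not describe the actual SMKB schema; only the five-line threshold is taken from the text above.</p><p>
```python
from collections import defaultdict
from typing import NamedTuple


class MethodRecord(NamedTuple):
    """One stored method: its abstract hash, origin, and size (assumed layout)."""
    digest: str
    project: str
    name: str
    lines: int


def cross_project_clones(records, min_lines=5):
    """Return {digest: [projects]} for hashes that occur in several projects.

    Methods shorter than `min_lines` (for example getters and setters) are
    skipped, because after abstraction they collide across unrelated code.
    """
    by_hash = defaultdict(set)
    for rec in records:
        if rec.lines >= min_lines:
            by_hash[rec.digest].add(rec.project)
    return {
        digest: sorted(projects)
        for digest, projects in by_hash.items()
        if len(projects) > 1
    }


# Hypothetical records: h1 is a two-line getter, h2 a genuine shared method.
RECORDS = [
    MethodRecord("h1", "p1", "getX", 2),
    MethodRecord("h1", "p2", "getY", 2),
    MethodRecord("h2", "p1", "parse", 8),
    MethodRecord("h2", "p3", "parse_input", 8),
    MethodRecord("h3", "p2", "render", 12),
]

print(cross_project_clones(RECORDS))
print(cross_project_clones(RECORDS, min_lines=1))
```
</p><p>With the default threshold the getter pair is filtered out, whereas lowering min_lines to 1 reports it as a clone. This is the kind of false positive the minimum length is meant to suppress.</p></div>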
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Implementation Plan and Timeline</head><p>The four-year PhD thesis plan is shown in Table 1. The first paper with the initial results has been accepted for publication at the BENEVOL 2024 conference. The title is "Work in Progress Paper: Detecting Method Level License Conflicts in the Worldwide Software Ecosystem". This paper demonstrated code-level license extraction and violation detection as a definitive method for ensuring license compliance in borrowed code, independent of any declarations. Using SearchSECO, we examined 3,500 repositories from leading software companies to assess the prevalence of violations. Our analysis uncovered approximately 32,000 violations in total. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Future work</head><p>The FOSSology project first introduced an ML-based solution for license identification problems utilizing license classification <ref type="bibr" target="#b15">[17]</ref>. Other research introduced machine-learning-based license exception detection <ref type="bibr" target="#b16">[18]</ref>. However, only a few ML-based solutions have been introduced, and some topics, such as license violation, remain unexplored. Moming et al. introduced an ML-based solution named ModelGo for license conflict detection <ref type="bibr" target="#b17">[19]</ref>. More research is needed on this problem. Introducing an ML-based license violation solution in our SearchSECO project could be an interesting experiment for the scientific and research community. Another idea is to create a hybrid solution that combines the currently proposed SearchSECO license violation detection with an ML-based solution.</p><p>A recent development is the use of LLMs for generating code. Frequently, the code that is generated follows exact patterns from licensed code, meaning that the LLM could potentially be offering licensed code, thereby stimulating license violations by the software engineer <ref type="bibr" target="#b18">[20]</ref>. While we cannot guarantee it currently, we look forward to exploring whether we can identify such licensed code suggestions, with the goal of improving these LLMs to avoid licensed code.</p><p>Existing tools such as the Binary Analysis Tool (BAT) <ref type="bibr" target="#b12">[13]</ref> for binary code clone detection can be incorporated into the SearchSECO project too. Bytecode scanning tools like JEB Decompiler or ProGuard <ref type="bibr" target="#b19">[21]</ref> could be used to extract relevant license details directly from compiled code (e.g., Java .class files or .NET assemblies).
Bytecode-level analysis allows for the detection of licensing information even when source code is unavailable <ref type="bibr" target="#b20">[22]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Contribution of the Thesis</head><p>SearchSECO has the aim of creating a new sense of provenance in software engineering, where we try to find the earliest version of a code clone and its authors. Provenance, which concerns the origins of an artifact, has been neglected in software engineering for far too long. For instance, if we look at the Dieselgate scandal, the code which violated the tests was never found. With SearchSECO, it would perhaps have been possible to identify the intended code <ref type="bibr" target="#b21">[23]</ref>.</p><p>As we are performing design science, with potentially highly useful artifacts, both for research and industry, we foresee several routes towards research impact. For one, we hope to apply and evaluate our artifacts in case studies. Secondly, if the technology proves valuable for the industry, we could consider spinning out the SearchSECO features into a startup. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: SearchSECO System Architecture [3]</figDesc><graphic coords="5,72.00,65.61,451.28,218.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>SearchSECO Development and Deployment Plan Over Four Years</figDesc><table><row><cell>Year</cell><cell>WP(s)</cell><cell>Objective</cell><cell>Milestones</cell></row><row><cell>Year 1</cell><cell>WP0 &amp; WP1</cell><cell>Understanding SearchSECO and Code Borrowing, Using SearchSECO for determining license conflicts</cell><cell>Q1: Finalize the enhancements to SearchSECO for license conflict detection. Q2: Apply prototypes to real-world datasets and gather preliminary results. Q3: Conduct initial tests and validations of the developed artifacts and publish the code license violation detection paper. Q4: Enhance the SearchSECO license-checking capability for benchmarking.</cell></row><row><cell>Year 2</cell><cell>WP2</cell><cell>Benchmarking SearchSECO's capabilities for identifying vulnerabilities</cell><cell>Q1: Develop the vulnerability benchmarking framework and tools and conduct detailed case studies on license conflict detection and vulnerability benchmarking. Q2: Refine the artifacts based on feedback and testing results. Q3: Publish initial findings in peer-reviewed venues. Q4: Expand the dataset and improve the scalability and robustness of the tools.</cell></row><row><cell>Year 3</cell><cell>WP3</cell><cell>Deploying SearchSECO in an industry organization: identifying code in large unidentified code bases</cell><cell>Q1: Analyze scalability and performance implications of implementing SearchSECO on a global scale. Q2: Deploy code in an industry organization and collect data. Q3: Write articles for the scalability and performance implications of implementing SearchSECO on a global scale and publish papers in journals and conferences. Q4: Conduct extensive evaluations and final validation of the developed artifacts.</cell></row><row><cell>Year 4</cell><cell>WP4</cell><cell>StackOverflow Data Analysis with SearchSECO</cell><cell>Q1: Collect StackOverflow data and check against SearchSECO database. Q2: Publish comprehensive results and methodology in high-impact journals and conferences. Q3: Disseminate the tools and methodologies to the broader community and present the technical artifact and share knowledge with the community. Q4: Complete the dissertation writing and defense.</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">For more details on empirical software engineering standards, see Empirical Standards Repository. This diagram provides an overview of the SearchSECO system components and their interactions for efficient code fragment search at the method level.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We wish to thank Geert-Jan Giezeman, Wouter Beffers, and Deekshitha for their important contributions to the SearchSECO project on GitHub. This research was funded by the Business Finland project 6G Bridge - 6G software for extremely distributed and heterogeneous massive networks of connected devices (8516/31/2022).</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>(S. Jansen) https://yamadharma.github.io/ (A. Didar Islam); https://kmitd.github.io/ilaria/ (S. Jansen) 0000-0002-0877-7063 (A. Didar Islam); 0000-0003-3752-2868 (S. Jansen)    </p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Declaration on Generative AI</head><p>ChatGPT 4o was used for polishing the grammar and spelling of the text in this document.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Jansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Buxmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kude</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Popp</surname></persName>
		</author>
		<ptr target="https://books.google.fi/books?id=BbivKcE6vWMC" />
		<title level="m">Proceedings of European Workshop on Software Ecosystems: 2012 -Walldorf, Synomic Academy, Books on Demand</title>
				<meeting>European Workshop on Software Ecosystems: 2012 -Walldorf, Synomic Academy, Books on Demand</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Jansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Cusumano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brinkkemper</surname></persName>
		</author>
		<title level="m">Software ecosystems: analyzing and managing business networks in the software industry</title>
				<imprint>
			<publisher>Edward Elgar Publishing</publisher>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Searchseco: A worldwide index of the open source software ecosystem</title>
		<author>
			<persName><forename type="first">S</forename><surname>Jansen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Farshidi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Gousios</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Visser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Van Der Storm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bruntink</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The 19th Belgium-Netherlands Software Evolution Workshop</title>
				<meeting><address><addrLine>BENEVOL</addrLine></address></meeting>
		<imprint>
			<publisher>CEUR-WS</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">202</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Tools in mining software repositories</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">K</forename><surname>Chaturvedi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sing</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Singh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">13th International Conference on Computational Science and Its Applications</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="89" to="98" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Vuddy: A scalable approach for vulnerable code clone discovery</title>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Woo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Oh</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE symposium on security and privacy (SP), IEEE</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="595" to="614" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title/>
		<author>
			<persName><surname>Secureseco</surname></persName>
		</author>
		<ptr target="https://github.com/SecureSECODAO/searchSECO-miner" />
		<imprint>
			<date type="published" when="2024-09-12">2024. Accessed on 12.09.2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title/>
		<author>
			<persName><surname>Searchseco</surname></persName>
		</author>
		<ptr target="https://secureseco.org/secureseco-introduction/searchseco/" />
		<imprint>
			<date type="published" when="2024-09-12">2024. Accessed on 12.09.2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Open source license inconsistencies on github</title>
		<author>
			<persName><forename type="first">T</forename><surname>Wolter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barcomb</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Riehle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Harutyunyan</surname></persName>
		</author>
		<idno type="DOI">10.1145/3571852</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Softw. Eng. Methodol</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A study of potential code borrowing and license violations in java projects on github</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Golubev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Eliseeva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Povarov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bryksin</surname></persName>
		</author>
		<idno type="DOI">10.1145/3379597.3387455</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th International Conference on Mining Software Repositories, MSR &apos;20</title>
				<meeting>the 17th International Conference on Mining Software Repositories, MSR &apos;20<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="54" to="64" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">An empirical study of license violations in open source projects</title>
		<author>
			<persName><forename type="first">A</forename><surname>Mathur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Choudhary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Vashist</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Thies</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Thilagam</surname></persName>
		</author>
		<idno type="DOI">10.1109/SEW.2012.24</idno>
	</analytic>
	<monogr>
		<title level="m">2012 35th Annual IEEE Software Engineering Workshop</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="168" to="176" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">The FOSSology project</title>
		<author>
			<persName><forename type="first">R</forename><surname>Gobeille</surname></persName>
		</author>
		<idno type="DOI">10.1145/1370750.1370763</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2008 International Working Conference on Mining Software Repositories, MSR &apos;08</title>
				<meeting>the 2008 International Working Conference on Mining Software Repositories, MSR &apos;08<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="47" to="50" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The FOSSology project: 10 years of license scanning</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">C</forename><surname>Jaeger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Fendt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Gobeille</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Huber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Najjar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stewart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Weber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wurl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IFOSS L. Rev</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page">9</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Finding software license violations through binary code clone detection</title>
		<author>
			<persName><forename type="first">A</forename><surname>Hemel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">T</forename><surname>Kalleberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Vermaas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Dolstra</surname></persName>
		</author>
		<idno type="DOI">10.1145/1985441.1985453</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 8th Working Conference on Mining Software Repositories, MSR &apos;11</title>
				<meeting>the 8th Working Conference on Mining Software Repositories, MSR &apos;11<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="63" to="72" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Soft design science methodology</title>
		<author>
			<persName><forename type="first">R</forename><surname>Baskerville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pries-Heje</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Venable</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4th International Conference on Design Science Research in Information Systems and Technology</title>
				<meeting>the 4th International Conference on Design Science Research in Information Systems and Technology</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="1" to="11" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Action design research</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Sein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Henfridsson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Purao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rossi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lindgren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">MIS Quarterly</title>
		<imprint>
			<biblScope unit="page" from="37" to="56" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">The FOSSology project</title>
		<author>
			<persName><forename type="first">R</forename><surname>Gobeille</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2008 International Working Conference on Mining Software Repositories</title>
				<meeting>the 2008 International Working Conference on Mining Software Repositories</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="47" to="50" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Machine learning-based detection of open source license exceptions</title>
		<author>
			<persName><forename type="first">C</forename><surname>Vendome</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Linares-Vásquez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Bavota</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Di Penta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>German</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Poshyvanyk</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICSE.2017.19</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE/ACM 39th International Conference on Software Engineering (ICSE)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="118" to="129" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">ModelGo: A practical tool for machine learning license analysis</title>
		<author>
			<persName><forename type="first">M</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>He</surname></persName>
		</author>
		<idno type="DOI">10.1145/3589334.3645520</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACM Web Conference 2024</title>
				<meeting>the ACM Web Conference 2024<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="1158" to="1169" />
		</imprint>
	</monogr>
	<note>WWW &apos;24</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Trained without my consent: Detecting code inclusion in language models trained on code</title>
		<author>
			<persName><forename type="first">V</forename><surname>Majdinasab</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Nikanjam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Khomh</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2402.09299" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Shrink your Java and Android code, ProGuard</title>
		<author>
			<persName><forename type="first">E</forename><surname>Lafortune</surname></persName>
		</author>
		<ptr target="https://www.guardsquare.com/proguard" />
		<imprint>
			<date type="published" when="2016">2016. Accessed on 21.10.2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Advanced obfuscation techniques for Java bytecode</title>
		<author>
			<persName><forename type="first">J.-T</forename><surname>Chan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yang</surname></persName>
		</author>
		<idno type="DOI">10.1016/S0164-1212(02)00066-3</idno>
		<ptr target="https://doi.org/10.1016/S0164-1212(02)00066-3" />
	</analytic>
	<monogr>
		<title level="j">Journal of Systems and Software</title>
		<imprint>
			<biblScope unit="volume">71</biblScope>
			<biblScope unit="page" from="1" to="10" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Through the smokescreen of the Dieselgate disclosure: Neutralizing the impacts of a major sustainability scandal</title>
		<author>
			<persName><forename type="first">O</forename><surname>Boiral</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-C</forename><surname>Brotherton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Yuriev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Talbot</surname></persName>
		</author>
		<idno type="DOI">10.1177/10860266211043561</idno>
	</analytic>
	<monogr>
		<title level="j">Organization &amp; Environment</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="175" to="201" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
