Data processing method for Gini coefficient application in assessing the centralization within the BTC lightning network*

Laura Atmanavičiūtė1,∗,†, Tomas Vanagas1,†, Justinas Grigaras1,† and Saulius Masteika1,∗,†

1 Vilnius University, Kaunas Faculty, Kaunas, Lithuania

Abstract
The Bitcoin Lightning Network (BLN) is a second-layer blockchain solution that emerged to address scalability issues. However, centralization concerns have surfaced, as the current distribution of resources may indicate a trend toward centralization. The Gini coefficient, a measure of inequality, can be applied to the BLN to assess its centralization by analyzing the distribution of channel capacity among nodes. This research proposes a data processing method specifically designed to utilize the Gini coefficient for evaluating centralization within the BLN. The main challenge in applying the Gini coefficient to assess BLN centralization lies in the limitations of existing research: the lack of description of how data is processed makes it difficult to replicate these studies and verify the conclusions made by other researchers. The proposed data processing method addresses the challenges associated with collecting data from both the Bitcoin blockchain and the Lightning Network, including data linking, storage, and variable selection. Results of the experimental research of the proposed method show that the Gini coefficient of BLN entities increased from 0.829 in March 2018 to 0.930 in March 2023. The results are consistent with existing research and can be used in future research exploring BLN centralization.

Keywords
Bitcoin, lightning network, blockchain, data processing, Gini coefficient

1. Introduction
Since its beginning, Bitcoin (BTC) has undergone a remarkable evolution: a growing demand for faster transactions has led to the emergence of the Lightning Network (LN), a second-layer blockchain solution [1]. LN acts as a separate layer (Layer 2) built on top of the BTC blockchain (Layer 1). It functions like a network of channels designed for micro-payments.
Instead of adding every individual payment to the blockchain, two counterparties open a secure channel with each other on the BTC blockchain. This channel, established through a multi-signature transaction, allows them to send and receive a predetermined amount of BTC back and forth quickly and efficiently [2].

Originally, LN was designed to address scalability issues in BTC by enabling faster and cheaper transactions while maintaining decentralization. But as the Bitcoin Lightning Network (BLN) grows, it appears to be shifting towards a more centralized architecture [3]. While BTC was designed to be decentralized, the LN has witnessed a trend towards centralization, particularly evident in the concentration of power among specific nodes. These nodes, often referred to as "hubs", possess a disproportionately large share of the network's total channel capacity [2]. The hubs with the largest capacity in the network earn superlinearly more than nodes with lower capacity. This occurs when the routing algorithm prioritizes routes based on capacity rather than on minimizing fees [4]. This concentration of resources and influence raises questions about the integrity of the decentralized architecture of the LN.

* IVUS2024: Information Society and University Studies 2024, May 17, Kaunas, Lithuania
∗ Corresponding author.
† These authors contributed equally.
laura.atmanaviciute@knf.vu.lt (L. Atmanavičiūtė); tomas.vanagas@knf.vu.lt (T. Vanagas); justinas.grigaras@knf.stud.vu.lt (J. Grigaras); saulius.masteika@knf.vu.lt (S. Masteika)
0000-0002-1770-670X (S. Masteika)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

One of the most common methods for determining the level of centralization in the LN is the Gini coefficient. It is used to measure inequalities in the distribution of specific resources or characteristics within a certain population [5].
The Gini coefficient is an important measure used in many different fields, including healthcare economics, sociology, tourism management, the social determinants of disease and BLN centralization studies. It can be computed over various parameters, such as income, population demographics, channel capacity or resource allocation, and offers a quantitative method to evaluate inequalities within a system [6, 7, 8, 9]. In healthcare, the Gini coefficient was used to analyze inequalities in tuberculosis incidence, demonstrating how they are associated with various social determinants of health, including income inequality, education and demographics [6]. Similarly, the Gini index was utilized in tourism management research to evaluate the seasonal concentration of tourism demand, giving insight into the distribution of visitors across different periods [8]. Furthermore, the Gini index was used to assess the distribution of healthcare resources, revealing disparities in the availability of physicians, paramedics and hospital beds [9]. In BLN centralization studies, the Gini coefficient is used to measure the unequal distribution between nodes, associating it with node capacity and with channels being controlled by a few nodes [10, 11]. Overall, these applications show the versatility of the Gini coefficient in quantifying inequality across different fields, which allows comparisons between them.

Existing studies that assessed the centralization of the BLN by applying the Gini coefficient indicate possible centralization. Research [10] reported a high average coefficient of 0.95 for node capacity and an average coefficient of 0.76 for the number of channels per node. Another study [3] reports an average Gini coefficient of 0.88 for channel count per node.
Furthermore, the authors of [11] reported a Gini value of 0.77 for the BLN, while research [12] highlighted the gradual growth of the Gini coefficient from 0.82 to 0.92 between April 2019 and January 2021.

When the Gini coefficient is used to assess the centralization of the LN, it can be calculated with the following formula:

$$G = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} \lvert x_i - x_j \rvert}{2 N^{2} \bar{x}}, \tag{1}$$

where $N$ is the total number of nodes, $x_i$ and $x_j$ are the capacities of nodes $i$ and $j$, and $\bar{x}$ is the average capacity across all nodes. The Gini coefficient ranges between 0 (perfect equality) and 1 (maximum inequality), where 0 signifies everyone having an equal share of the resource and 1 represents a scenario where one individual has everything [3], [11].

The problem in applying the Gini coefficient when assessing the centralization within the BLN is data collection across Layer 1 and Layer 2, including data linking, storing, and choosing variables for calculations. Existing research often overlooks how data is extracted from the blockchain and how data from both layers can be linked consistently.

The aim of this study is to establish a data processing method for applying the Gini coefficient to assess the centralization of the BLN. To achieve this aim, the research focuses on the following objectives:
1. Extract data from both Layer 1 (L1) and Layer 2 (L2) to gather relevant datasets on node activity and capacity.
2. Categorize data and link datasets from different sources, ensuring consistent integration of L1 and L2 data without distortions.
3. Propose a data processing method and scheme for applying the Gini coefficient to assess the centralization within the BLN.
4. Implement and conduct experimental calculations using the Gini coefficient and provide a visual representation of the results.

This research focuses on developing a data processing method specifically designed for applying the Gini coefficient to assess centralization within the BLN.
First, data will be collected from both L1 (BTC blockchain) and L2 (LN). This includes aspects like data linking, storage and selecting appropriate variables. Subsequently, a data processing method will be proposed. Finally, the paper will apply the method to calculate the Gini coefficient for the BLN. The results will be presented visually.

2. Data processing method
Within the framework of the LN, the Gini coefficient serves as a metric for assessing centralization. The method involves aggregating nodes under common or similar aliases to evaluate the distribution of channel capacities and the consolidation of authority within the network. While other studies measure the Gini coefficient in the BLN by considering the capacity of individual nodes, this paper proposes a method that goes beyond individual nodes by grouping them into entities based on aliases. Through the systematic analysis of the Gini coefficient across these entities, the objective is to determine the degree of centralization within the BLN ecosystem. This allows a more complete exploration of the network's structural composition and its implications for decentralization.

To research the centralization tendencies within the BLN, data is gathered from two primary sources: LN Research [13] and the BTC blockchain. LN Research studies the LN, a second-layer solution built on the BTC blockchain to tackle scalability and fees [13]. To access the BTC blockchain, two components were used: Bitcoin Core, the reference BTC protocol implementation, which validates transactions and blocks, and an Electrum node, which indexes the BTC blockchain for fast information retrieval. LN gossip messages do not contain information about when channels were opened or closed. To address this limitation, the ‘MyNodeBTC’ environment, an operating system designed to manage different BTC node types, was used to synchronize a BTC full node and an Electrum node for transaction indexing [10].
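The grouping of nodes into entities by alias, described at the start of this section, can be illustrated with a small sketch. The paper does not specify how "common or similar" aliases are matched, so the normalization rule below (lowercasing and stripping a trailing counter such as "-01") is purely an assumption made for illustration, and the function names `normalize_alias` and `group_nodes_by_alias` are hypothetical.

```python
import re
from collections import defaultdict


def normalize_alias(alias: str) -> str:
    """Map an alias to an entity key (hypothetical rule, not the paper's)."""
    key = alias.strip().lower()
    # Assumed convention: operators number their nodes with a trailing
    # counter such as "-01", "_2" or " #3"; drop that counter.
    stripped = re.sub(r"[\s\-_#.\[\(]*\d+[\]\)]*$", "", key)
    # Keep purely numeric aliases as they are instead of an empty key.
    return stripped or key


def group_nodes_by_alias(node_aliases):
    """Group (NodeID, Alias) pairs into entities keyed by normalized alias."""
    entities = defaultdict(set)
    for node_id, alias in node_aliases:
        entities[normalize_alias(alias)].add(node_id)
    return dict(entities)


if __name__ == "__main__":
    # Made-up NodeIDs and aliases, for illustration only.
    sample = [("02aa", "ExampleHub-01"), ("03bb", "examplehub-02"), ("02cc", "SoloNode")]
    for entity, node_ids in group_nodes_by_alias(sample).items():
        print(entity, sorted(node_ids))
```

Any real matching rule would need to be validated against the alias data, since aliases are free text chosen by node operators and similar names do not prove common ownership.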
The BTC blockchain transactions dataset includes all transactions that have occurred on the BTC network. This includes the date and time when BTC was locked in a transaction, the specific amount, and the status of the channel (closed or still open). An example of the BTC blockchain blocks database table can be found in Figure 4 and of the BTC transactions database table in Figure 5.

The BLN works by exchanging messages that enable payment routes to be found within the LN. These messages are broadcast to all network participants and have been collected by the LN Research team. This information provides the foundation for the research. The data has been imported from both the BTC blockchain and the LN Research repository, as shown in Figure 1.

Figure 1. Initial database tables

This information exchange is specified in the gossip protocol, where nodes broadcast three types of messages to the network: ‘Channel Announcement’, ‘Channel Update’ and ‘Node Announcement’. For the purposes of this study, two message types are used: channel announcements and node announcements [13].

The ‘Channel Announcement’ message announces the creation of a new payment channel on the LN, including the unique identifier (ID) of the channel and the public keys of the two nodes participating in the channel. An example of the channel announcement database table is presented in Figure 2. The ‘ShortChannelID’ provides concise information about the channel. ‘NodeID1’ and ‘NodeID2’ are the IDs of the two LN nodes that form the channel specified in the preceding ‘ShortChannelID’ field.

Figure 2. Example of channel announcement database table

The ‘Node Announcement’ message informs other nodes about a new node joining the LN, typically containing the unique identifier (ID) of the node and optional information such as the node's operator or its public key (depending on the implementation). As shown in Figure 3, in ‘NodeAnnouncement’ messages, there are two fields ‘NodeID’ and ‘Alias’.
‘NodeID’ field contains a unique identifier assigned to the node within the LN, enabling identification and communication. The ‘Alias’ field provides a user-friendly name associated with the node, allowing for easy recognition and interaction without the need to refer to the ‘NodeID’.

Figure 3. Example of node announcement database table

As noted above, LN gossip messages do not record when a channel was opened or closed; this information was taken from the BTC blockchain through the Bitcoin full node and Electrum node synchronized in the ‘MyNodeBTC’ environment [10]. For each channel, the blockchain provides the date and time when BTC was locked in the payment channel, the specific amount and, if the channel was closed, the date and time when it was closed. An example data fragment is shown in Figure 4. The ‘BlockIndex’ column denotes the height of each block in the blockchain, starting from 0 for the Genesis Block. The ‘BlockHash’ serves as a unique identifier for each block in the blockchain. The ‘Timestamp’ is the UNIX timestamp of each block. The ‘Time’ and ‘Date’ fields are derived from the ‘Timestamp’ field and are used for easier data selection in subsequent calculations.

Figure 4. Example of blockchain blocks database table

For transactions identified as spent, the block height of the spending transaction was additionally recorded.

Figure 5. Example of transaction database table

The database table ‘Blockchain_Transactions’ contains the list of BTC transactions relevant to this research, that is, all imported transactions involved in the opening and closing of LN channels.
Among the fields present, ‘ShortChannelID’ encapsulates crucial channel details: the block height, the transaction index within the block, and the output index within the transaction. ‘FundingBlockIndex’, ‘FundingTxIndex’ and ‘FundingOutputIndex’ are derived from the ‘ShortChannelID’ field and hold the block height, transaction index and output index of the channel funding transaction. ‘FundingTxID’ is the hash of the transaction that funded and initiated the channel, while ‘Value’ is the amount of BTC locked within the lightning channel. The ‘SpendingBlockIndex’ column denotes the block height of the transaction that closes the channel; channels still open at the time of the research are marked with an arbitrarily large number, ‘9999999999’. Lastly, ‘SpendingTxID’ is the hash of the transaction that closed the channel and remains empty if the channel is still open.

The data collected by LN Research was linked to the relevant blockchain transactions that opened the channels. The link is the ‘ShortChannelID’, whose three components (block height, transaction index within the block, and transaction output index) identify the funding output on the BTC blockchain.

A method utilizing the Gini coefficient and the Lorenz curve was developed to assess centralization in the BLN. Data was imported from the BTC blockchain and LN Research, grouped by node IDs and node aliases, and filtered at six timestamps to capture snapshots of the distribution of channel capacity across the network. The Gini coefficient condenses centralization into a single number for a specific moment in time, while the Lorenz curve depicts how channel capacity is distributed across the nodes of the network at that moment. This approach enables trend identification across the snapshots.
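The linking and snapshot steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the field names are taken from the tables described in this section, but the shape of the joined channel record, the choice to take snapshots by block height, the treatment of a channel as closed from its spending block onwards, and the function names are all hypothetical. The decoder accepts either the textual "height x tx x output" form or the 8-byte integer layout (3 bytes block height, 3 bytes transaction index, 2 bytes output index); which form the dataset stores is not stated in the paper.

```python
from collections import defaultdict

# 'SpendingBlockIndex' value the paper uses for channels that are still open.
OPEN_SENTINEL = 9999999999


def decode_short_channel_id(scid):
    """Split a ShortChannelID into (block height, tx index, output index)."""
    if isinstance(scid, str):
        # Textual form, e.g. "539268x845x1".
        block, tx, out = (int(part) for part in scid.split("x"))
        return block, tx, out
    # Integer form: 3 bytes height, 3 bytes tx index, 2 bytes output index.
    return (scid >> 40) & 0xFFFFFF, (scid >> 16) & 0xFFFFFF, scid & 0xFFFF


def weighted_degree_at(channels, snapshot_height):
    """Sum the capacity of channels open at snapshot_height for each node.

    Assumption: a channel counts as open from its funding block up to,
    but not including, its spending block.
    """
    totals = defaultdict(float)
    for ch in channels:
        if ch["FundingBlockIndex"] <= snapshot_height < ch["SpendingBlockIndex"]:
            # The full channel capacity counts towards both endpoints.
            totals[ch["NodeID1"]] += ch["Value"]
            totals[ch["NodeID2"]] += ch["Value"]
    return dict(totals)


if __name__ == "__main__":
    # Made-up channel records, for illustration only.
    channels = [
        {"NodeID1": "A", "NodeID2": "B", "Value": 0.5,
         "FundingBlockIndex": 100, "SpendingBlockIndex": 200},
        {"NodeID1": "A", "NodeID2": "C", "Value": 0.25,
         "FundingBlockIndex": 150, "SpendingBlockIndex": OPEN_SENTINEL},
    ]
    print(decode_short_channel_id("539268x845x1"))
    print(weighted_degree_at(channels, snapshot_height=180))
```

Using a large sentinel for open channels keeps the snapshot filter a single range comparison, which is presumably why the ‘SpendingBlockIndex’ column is populated that way.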
The data collection and processing workflow is presented in Figure 6. The scheme illustrates the data retrieval, storage and calculation workflow used to calculate the Gini coefficient of weighted degree centrality over time, grouped by ‘NodeID’. First, a BTC full node needs to be synchronized; it is used to retrieve information from the blockchain, such as the timestamps of transactions, their values and when those transactions were spent. Not all transactions need to be imported from the BTC blockchain: the BLN gossip data collected by LN Research is used to identify which transactions to import, using the ‘ShortChannelID’ in the ‘ChannelAnnouncement’ message. At this stage the necessary data is imported into the initial database tables.

The research then proceeds with the analysis of the imported data. In this example the Gini coefficient is calculated on weighted degree centrality, grouped by nodes. To achieve this, the channels open at specific moments in time are selected, their capacities are summed per node, and the result is stored in the database table ‘_CACHED_WeightedDegreeCentralityByNode’ for further calculations. The date variable is iterated with a granularity of one month, for example 2018-03-01, 2018-04-01, etc. The table ‘_CACHED_WeightedDegreeCentralityByNode’ therefore holds the amount of BTC locked in BLN channels for each public node at each iteration date; in this study, every month since BLN inception.

The grouped data from the previous step is then used as input, and the Gini coefficient formula is applied to the data at each moment in time. The results are stored in a new database table named ‘_CACHED_GiniByWeightedDegreeCentrality’. This table describes the whole network in the form of the Gini coefficient, so the values can be queried whenever they are needed by the frontend or a chart creation tool to visualize the data.

Figure 6.
Gini coefficient calculation workflow

After the data has been processed according to Figure 6, it is ready for Gini coefficient calculations that assess the level of centralization of the BLN.

3. Experimental research of proposed method
The research employs a static analysis approach, examining snapshots of the asset distribution at specific points in time (timestamps). Six timestamps were used, starting in March 2018, when Lightning Labs' lnd was released as the first LN implementation, and ending in March 2023, the most recent available data. Each timestamp and the corresponding number of nodes are presented in Table 1. Comparing static snapshots makes it possible to track changes, analyse trends and understand the dynamics of a phenomenon over time: the research is interested not only in the distribution at a given timestamp but also in how it changes between time periods.

Table 1
Lightning Network nodes at specific timestamps

Abbr.  Timestamp (UNIX)  Date       Number of nodes
T1     1519855474        Mar. 2018  467
T2     1551391683        Mar. 2019  4347
T3     1583014153        Mar. 2020  4978
T4     1614550557        Mar. 2021  6893
T5     1646088233        Mar. 2022  15933
T6     1677621623        Mar. 2023  11889

The Gini coefficient is a widely employed metric for evaluating inequality; here it helps in understanding how channel capacity is distributed within the LN and in gauging its concentration among nodes. ‘Weighted degree centrality’ for a node in the BLN is calculated by summing the capacities of all its channels. This helps to understand how important or central a node is within the network based on the capacity of its channels. Unlike ‘degree centrality’, which counts the number of channels a node has, ‘weighted degree centrality’ considers the capacity of these connections [14].

The experimental results for the Gini coefficient of BLN nodes are presented in Table 2. Results show that the Gini coefficient of the BLN has been increasing over time.
At timestamp 1 the Gini coefficient is 0.832; it grows at each subsequent timestamp and reaches 0.954 at timestamp 6, which indicates greater inequality. The calculations give an average coefficient of 0.918 and indicate that a few nodes have a much higher weighted degree centrality than the others.

Table 2
Gini coefficient of Bitcoin Lightning Network nodes on weighted degree centrality aspect

Abbr.  Timestamp (UNIX)  Date       Gini coefficient
T1     1519855474        Mar. 2018  0.832
T2     1551391683        Mar. 2019  0.892
T3     1583014153        Mar. 2020  0.930
T4     1614550557        Mar. 2021  0.950
T5     1646088233        Mar. 2022  0.951
T6     1677621623        Mar. 2023  0.954

The experimental results of the proposed method are shown in Table 3. These results present the Gini coefficient of BLN entities instead of nodes. The coefficient values are lower than those in Table 2, but they nevertheless show an apparent centralization of BLN entities. The coefficient was 0.829 in March 2018 and grew to 0.930 in March 2023, with a slight dip from 0.921 to 0.912 in March 2022.

Table 3
Gini coefficient of Bitcoin Lightning Network entities on weighted degree centrality aspect

Abbr.  Timestamp (UNIX)  Date       Gini coefficient
T1     1519855474        Mar. 2018  0.829
T2     1551391683        Mar. 2019  0.855
T3     1583014153        Mar. 2020  0.899
T4     1614550557        Mar. 2021  0.921
T5     1646088233        Mar. 2022  0.912
T6     1677621623        Mar. 2023  0.930

The results in Table 3 are visually represented with Lorenz curves. Figure 7 presents Lorenz curves for the BLN entities on the weighted degree centrality aspect, captured at the six timestamps. The Gini coefficient is the area below the line of perfect equality (45 degrees) minus the area beneath the Lorenz curve, divided by the total area under the line of perfect equality [12]. Figure 7 shows how the weighted degree centrality of BLN entities moves further away from perfect equality, so that the area corresponding to the Gini coefficient grows.
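The two descriptions of the Gini coefficient used in this paper, formula (1) and the area between the line of perfect equality and the Lorenz curve, can be checked against each other with a short sketch. The function names are hypothetical and the input values are made up; the code is an illustration of the definitions, not the implementation used to produce Tables 2 and 3.

```python
def gini_pairwise(values):
    """Gini coefficient from formula (1): mean absolute difference form."""
    n = len(values)
    mean = sum(values) / n
    total_diff = sum(abs(a - b) for a in values for b in values)
    return total_diff / (2 * n * n * mean)


def lorenz_points(values):
    """Return Lorenz curve points (population share, cumulative capacity share)."""
    ordered = sorted(values)  # ascending, poorest first
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, value in enumerate(ordered, start=1):
        running += value
        points.append((i / len(ordered), running / total))
    return points


def gini_from_lorenz(values):
    """Gini coefficient as (area under equality line - area under Lorenz) / 0.5."""
    pts = lorenz_points(values)
    # Trapezoidal area under the piecewise-linear Lorenz curve.
    area = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return (0.5 - area) / 0.5


if __name__ == "__main__":
    capacities = [0.0, 0.0, 0.0, 10.0]  # one node holds all capacity
    print(gini_pairwise(capacities), gini_from_lorenz(capacities))
```

For a finite population the largest value formula (1) can produce is (N - 1)/N rather than exactly 1, which is why four nodes with all capacity in one of them give 0.75; with thousands of nodes the difference from 1 becomes negligible. The pairwise form takes a number of operations proportional to N squared, so the sorted Lorenz form is the cheaper choice for monthly snapshots of the whole network.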
The graph in Figure 7 was created by retrieving data from the intermediate database table ‘_CACHED_WeightedDegreeCentralityByNode’ at specific moments in time, joining it with the ‘Lightning_Entities’ and ‘Lightning_NodeAliases’ tables to retrieve the entity name, and then grouping by entity and summing the BTC amounts. In the last step, all entities are sorted in ascending order by amount and the cumulative percentages of the whole network are calculated at 1% granularity to obtain the Lorenz curve.

Figure 7. Lorenz curves of weighted degree centrality of BLN entities

The experimental results of the proposed method agree with the results of existing research: the Gini coefficient values indicate high inequality in the BLN. As discussed in the introduction, other research also measured high values of the Gini coefficient, ranging between 0.76 and 0.95 depending on the specific timestamps and the method used for calculations. This agreement supports the reliability of the data processing method proposed in this paper and its use in future studies that assess centralization within the BLN with the Gini coefficient.

4. Results and conclusions
In this paper, to assess centralization within the BLN, data was successfully extracted from separate sources for L1 and L2. L1 transaction data was taken from Bitcoin Core blocks, with an Electrum node facilitating transaction indexing. L2 data was obtained from LN Research, where specific gossip messages were used to gather the relevant data.

This study ensured consistent integration of data from both layers by focusing on specific details within each data source. Channel-related gossip messages provided the node IDs of each channel in L2, which were then linked to blockchain transactions in L1 using the unique identifier ‘ShortChannelID’. This linking process connected channel information directly to the actual BTC locked within the channel and ensured consistent data without distortions.
The data processing method for applying the Gini coefficient to assess the centralization within the BLN was proposed and explained in detail. This paper contributes to research on applying the Gini coefficient in the BLN by grouping nodes into entities based on aliases, thereby providing a broader understanding of network distribution. The approach utilizes data from both the BTC blockchain and LN Research, so that the capacities used for calculating the Gini coefficient are taken from the actual funding transactions.

To evaluate the quality of the proposed data processing method, experimental calculations using the Gini coefficient were implemented as a static analysis at six timestamps. The results obtained are consistent with existing research, which supports the reliability of the data processing method. In this paper, the Gini coefficient for entities reached 0.930 in March 2023 and, like the results of other authors, demonstrated a clear trend of increasing inequality in the BLN over time.

5. Discussions
This research proposes a data processing method for applying the Gini coefficient to assess centralization within the BLN. While the Gini coefficient is a valuable measure, the proposed method also opens doors for future research exploring BLN centralization. The method could potentially be adapted to incorporate alternative network centrality measures, such as degree, betweenness, eigenvector or closeness centrality, providing a more comprehensive picture. This study might also pave the way for extending the proposed method to analyze dynamic centralization trends: future studies could incorporate real-time data collection, tracking trends and identifying a potential transition towards centralization within the BLN.

This research also contributes to a standardized approach to centralization assessment. The proposed method could serve as a foundation for future work towards standardizing data collection and processing methodologies.

6. References
[1] Divakaruni, A., Zimmerman, P. (2022).
The Lightning Network: Turning Bitcoin into money. Finance Research Letters 52.
[2] Martinazzi, S., Flori, A. (2020). The evolving topology of the Lightning Network: Centralization, efficiency, robustness, synchronization, and anonymity. PloS One, 15(1), e0225966–e0225966. doi:10.1371/journal.pone.0225966
[3] Lin, J.-H., Primicerio, K., Squartini, T., Decker, C., & Tessone, C. J. (2020). Lightning network: a second path towards centralisation of the Bitcoin economy. New Journal of Physics, 22(8), 83022. doi:10.1088/1367-2630/aba062
[4] Carotti, A., Sguanci, C., & Sidiropoulos, A. (2023). Rational Economic Behaviours in the Bitcoin Lightning Network. doi:10.48550/arxiv.2312.16496
[5] Crucitti, P., Latora, V., & Porta, S. (2006). Centrality Measures in Spatial Networks of Urban Streets. doi:10.48550/arxiv.physics/0504163
[6] De Castro, D. B., de Seixas Maciel, E. M. G., Sadahiro, M., Pinto, R. C., de Albuquerque, B. C., & Braga, J. U. (2018). Tuberculosis incidence inequalities and its social determinants in Manaus from 2007 to 2016. International Journal for Equity in Health, 17(1), 187–187. doi:10.1186/s12939-018-0900-3
[7] Wong, S. K. (2010). Crime clearance rates in Canadian municipalities: A test of Donald Black's theory of law. International Journal of Law, Crime and Justice, 38(1), 17–36. doi:10.1016/j.ijlcj.2009.11.002
[8] Fernández-Morales, A., Cisneros-Martínez, J. D., & McCabe, S. (2016). Seasonal concentration of tourism demand: Decomposition analysis and marketing implications. Tourism Management (1982), 56, 172–190. doi:10.1016/j.tourman.2016.04.004
[9] Darzi Ramandi, S., Niakan, L., Aboutorabi, M., Javan Noghabi, J., Khammarnia, M., & Sadeghi, A. (2016). Trend of Inequality in the Distribution of Health Care Resources in Iran. Galen, 5(3). doi:10.31661/gmj.v5i3.618
[10] Masteika, S., Rebždys, E., Driaunys, K., Šapkauskienė, A., Mačerinskienė, A., & Krampas, E. (2023).
Bitcoin double-spending risk and countermeasures at physical retail locations. International Journal of Information Management, 102727. doi:10.1016/j.ijinfomgt.2023.102727
[11] Mahdizadeh, M. S., Bahrak, B., & Sayad Haghighi, M. (2023). Decentralizing the lightning network: a score-based recommendation strategy for the autopilot system. Applied Network Science, 8(1), 73–33. doi:10.1007/s41109-023-00602-2
[12] Zabka, P., Foerster, K.-T., Decker, C., & Schmid, S. (2022). Short Paper: A Centrality Analysis of the Lightning Network. In Financial Cryptography and Data Security (pp. 374–385). Cham: Springer International Publishing. doi:10.1007/978-3-031-18283-9_18
[13] Decker, C. (2023). Lightning Network Gossip. https://github.com/lnresearch/topology
[14] Opsahl, T., Agneessens, F., & Skvoretz, J. (2010). Node centrality in weighted networks: Generalizing degree and shortest paths. Social Networks, 32(3), 245–251. doi:10.1016/j.socnet.2010.03.006