Exploring the Auto Model Competition Patterns in China’s Auto Market based on Complex Networks Theory

Sheng Zhang
College of Artificial Intelligence, Beijing Normal University, Beijing, China
zsheng_2018@163.com

Haoyang Che
College of Artificial Intelligence, Beijing Normal University, Beijing, China
chehy@hotmail.com

Jiacai Zhang∗
College of Artificial Intelligence, Beijing Normal University, Beijing, China
jiacai.zhang@bnu.edu.cn

Yucong Duan
College of Information Science and Technology, Hainan University, Haikou, China
duanyucong@hotmail.com

∗ corresponding author

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

ABSTRACT

Understanding the competition pattern of auto models is critical for stakeholders including automakers and dealers. However, the traditional methods mainly rely on the experience and analytical dimensions of the analyst, so they lack a reliable methodology and ignore the value of user behavior. In this paper, we propose a novel method based on complex network theory, construct an auto model competition network from users’ sales leads, and analyze the static characteristics of the network. Besides, by using different community detection algorithms and constructing predictive models, we discovered that there are six major communities in the network, and that price, popularity, model level, as well as model asset ownership, are the main factors affecting community division.

1 INTRODUCTION

China’s auto sales declined for the first time in 2018 [17]. This is undoubtedly putting tremendous pressure on stakeholders, including automakers and dealers. It is extremely important to understand the competition pattern of auto models, which can help them to recognize market needs, identify emerging competitors, and develop targeted auto production and sales strategies.

For competition pattern analysis, traditional methods are often limited to strategic management and market analysis, such as SWOT analysis [8] and the Porter Five Forces model [13]. However, these methods mainly rely on the experience and intuition of analysts, and lack a reliable methodology. In addition, the analysis dimension is often confined to car sales and user feedback, ignoring the value of other user behaviors. This may lead to unstable performance in pattern interpretation.

At the same time, with the advent of mobile Internet, vertical auto websites (VAWs) have become an important channel for people to obtain car information and buy cars. More and more users browse car information on VAWs and leave their sales leads (the customer’s personal information, including name and phone number, for sales purposes) before purchasing a car, so that dealers can contact them to make an appointment for a test drive. After more than a decade of accumulation, leading websites have accumulated more than 500 million users, 100 million sales leads, and billions of user behavior records.

In order to solve the problems of traditional methods, we propose a novel method from the perspective of complex networks, using the sales lead data of auto models from VAWs to build an auto model competition network, and explore and analyze the auto model competition pattern of China’s auto market. Figure 1 outlines our framework, which consists of three parts: data preprocessing, network construction, and competition pattern analysis. Among them, competition pattern analysis includes network visualization, characteristic analysis, and community structure analysis.

Figure 1: Framework Overview

Compared with the traditional method, our method has the following advantages. First, our model is based on complex networks and has a solid theoretical foundation. Second, we use the sales lead data of auto models, which is more valuable than data such as car sales: it comprehensively reflects the preferences of users and their comparison of different models. Lastly, we have established a complete analysis framework, which can improve the efficiency and reliability of the analysis.

By applying our model to 6,152,335 sales leads of 1069 auto models in January 2019, we make two main contributions:

• We constructed auto model competition networks and performed visualization and network characteristic analysis, revealing characteristics such as intensified competition and the small-world phenomenon.
• We found six major communities using community detection algorithms, and built prediction models based on them. We found that price, model level, popularity, and model asset ownership were the main factors affecting community division.

The rest of this article is organized as follows. Section 2 introduces the related work on strategic management, marketing and complex networks in auto competition pattern analysis. In Section 3, we describe the dataset and data preprocessing steps. In Section 4, we construct the auto model competition network of January 2019 and perform the network visualization. Section 5 analyzes the static characteristics of the network. In Section 6, we divide the community structure of the network, and find the main factors affecting community division by constructing predictive models. Section 7 concludes the paper.

2 RELATED WORK

Many investigations have studied the auto market competition pattern from different aspects. In this section, we classify the related work into strategic management and marketing analysis, and complex networks.

Strategic management and marketing analysis are the most common methods to study the competition pattern in the auto market. Study [4] applied SWOT analysis and the five-force model to the competition pattern of Chery Automobile, and pointed out the huge threat posed by other brands entering the low-end model market. Study [14] analyzed the competitive environment, opportunities and challenges faced by FAW-Volkswagen’s new energy models based on the PEST model and the SWOT model. However, these methods depend heavily on analysts’ experience and intuition instead of a solid methodological foundation, so their results can be less stable in bidding presentation and pattern interpretation.

Another method is complex networks based on graph theory. In recent years, the research of complex networks has expanded from the fields of physics and computers to society and technology. Numerous theoretical studies and empirical analyses have also emerged [1, 5, 15]. In the auto field, Lijuan Zhang et al. studied the cooperation network between automakers and parts suppliers, and found the small-world phenomenon in the network [16]. Jianmei Yang et al. used the Newman fast community algorithm to divide the network of auto companies into different communities based on their product categories, and established a multi-layer network to analyze the confrontation behavior between automotive companies [10]. However, these studies only take the perspective of automakers and suppliers, without taking user behavior data into consideration.

In summary, different from existing studies, we build an effective framework from the complex network perspective, and use massive sales leads data from VAWs to analyze the auto model competition pattern.

3 DATA AND PREPROCESSING

The original dataset is from one of China’s largest VAWs and contains 1 PB of anonymous log data from January 2017 to January 2019. Each entry includes anonymous user ID, province, city, sales lead source and time, as well as the corresponding brand, model and style information, as shown in Table 1.

Table 1: The Original Dataset Schema

Field | Description | Example
ID | Row ID | RID_00000001
User | Anonymous user ID | UID_111111111
Province | User province | Guangdong
City | User city | Shenzhen
Source | Sales lead source type | Mobile, Web, other
Time | Sales lead time | 2019-01-01 00:00:00
Brand | Brand of sales lead | Volkswagen
Model | Model of sales lead | Volkswagen Lavida
Style | Style of sales lead | Volkswagen Lavida 1.4L

As we mentioned before, sales leads refer to users’ information for sales use, including names, regions and contact information of potential customers. If a user leaves his/her information on a model on a VAW, this indicates that he/she is interested in this car and could be a potential buyer. Because sales leads require the user’s personal information, users will be more cautious when leaving sales leads, which improves the authenticity of the data. An overlapping sales lead refers to the situation in which different models have the same sales lead, which means that a user may be interested in multiple models at the same time, so these models could be potential competitors. Overlapping sales leads reflect users’ comparison of different models. Compared with other data, sales leads and overlapping sales leads have higher authenticity, and timely and accurately reflect the user’s preferences for models (sales leads) and comparisons between different models (overlapping sales leads).

In order to construct the auto model competition network, we need to process the data into the required form. First, since some cars have different IDs and/or names of brand, model or style, we need to identify and unify them. After that, duplicate, missing and erroneous entries are eliminated. Because automakers and dealers usually analyze auto model data monthly, we aggregate the sales leads data by month; shorter (such as daily) or longer (such as annual) periods may not reflect the model competition pattern accurately or in a timely way. For instance, if 100 users left their sales leads on Camry in October 2018, the model sales leads of Camry in October 2018 are 100. Finally, we extract and aggregate the sales leads shared between different models as overlapping leads. For example, if ten users left their sales leads on both Camry and Jetta in October 2018, the overlapping sales leads between Camry and Jetta in October 2018 are 10. To study the recent competition pattern, we selected the data for January 2019, including 1069 models with 6,152,335 sales leads and 1,129,919 overlapping sales leads.

Figure 2 illustrates the distribution of model sales leads and overlapping sales leads in January 2019. In Figure 2 (a), most models have relatively few sales leads, but a few models such as Jetta, Lavida and Sylphy have very many; the counts range from 1 to 165,199. Figure 2 (b) shows the distribution of overlapping sales leads between different models. Similarly, most overlapping sales lead counts are low, while a few are very high, such as those between Jetta and Santana (4,393 overlapping sales leads). Figure 2 indicates that the difference in the number of sales leads between different models is huge, suggesting that there are different model divisions.

Figure 2: Distribution of Sales Leads and Overlapping Sales Leads. (a) Sales Leads Distribution. (b) Overlapping Sales Leads Distribution.

4 NETWORK CONSTRUCTION & VISUALIZATION

The auto model competition network is essentially a graph. By regarding the auto models as nodes (with sales leads as size) and the competition relationships as edges (two nodes are linked if they have overlapping sales leads), we can abstract the auto model competition network. In the network, models with overlapping sales leads are considered to be competitors. The network is built with the networkx Python library [7].

There are 1069 nodes and 126,650 edges in the network of January 2019. Among them, there are 28 isolated nodes (i.e., nodes with no edges). Figure 3 shows the network of January 2019 without isolated nodes. The size and color of nodes reflect the number of sales leads for the model, and the thickness of edges represents the amount of overlapping sales leads. To be specific, the larger and redder a node is, the more sales leads it has. The thicker and redder an edge is, the more overlapping sales leads there are between the two nodes, and the fiercer their competition. The figure is drawn using Gephi and its built-in ForceAtlas2 layout algorithm [2, 9]. The non-overlap option was chosen to ensure that the nodes do not overlap. All the node sizes (the number of sales leads of the nodes) and edge weights (the number of overlapping sales leads of the edges) have been rescaled for clarity.

Figure 3: Auto Model Competition Network

Figure 3 gives an overview of the auto model competition network. Basically, the nodes in the middle have more sales leads and overlapping sales leads. However, there is a certain distance between the nodes with the most sales leads (such as Lavida and Jetta), which suggests a potential community structure in which they belong to different communities.

5 NETWORK CHARACTERISTICS ANALYSIS

Degree distribution, average shortest path length and clustering coefficient are the most common characteristics of a network. In this section, we analyze these characteristics for the auto model competition network of January 2019, and discuss their interpretation.

Figure 4: Distribution of Degree, Shortest Path Length & Clustering Coefficient. (a) Degree Distribution. (b) Shortest Path Length Distribution. (c) Clustering Coefficient Distribution.

• Degree distribution: The degree of a node refers to the number of edges connected to the node. The degree distribution is shown in Figure 4 (a). As we can see, the number of nodes decreases as the degree increases, at an almost constant rate except at the beginning. Since the degree is the number of edges connected to a node, that is, the number of directly adjacent nodes, the degree of a node represents the number of direct competitors of the model it represents. Therefore, Figure 4 (a) illustrates that as the number of competitors increases, the number of models decreases. The node with the highest degree is Lavida (with 809 competitors), rather than the node with the most sales leads, Jetta. In contrast, there are also 28 nodes with a degree of 0, that is, isolated nodes without competitors. These models are excluded from the following discussion. Besides, the average degree is 236.95, which shows that there are nearly 240 competitors per model, reflecting the fierce market competition.

• Average shortest path length and diameter: The average shortest path length is the average distance between all pairs of nodes (if the graph is connected), and the diameter is the maximum shortest path length in a network. Because edge weights differ greatly and the weighted shortest path length cannot be used to describe the small-world phenomenon of the network, we ignore the edge weights (i.e., the network is regarded as a binary network). As we mentioned before, since the original network is not connected, we choose the largest giant component (LGC network), which is exactly the original network after removing all isolated nodes. The average shortest path length of the LGC network is 1.82, and the diameter is 4, which are really small compared to the number of nodes (1041 nodes). Figure 4 (b) shows the distribution of the shortest path length between all node pairs in the network. Most node pairs are in direct competition (shortest path length 1, 23.4%) or have common competitors (shortest path length 2, 71.1%). Less than 0.5% of the shortest path lengths equal the diameter of the network (length = 4).

• Clustering coefficient: The clustering coefficient measures how interconnected the neighbors of a node are. Figure 4 (c) depicts the distribution of the clustering coefficient in the network. The clustering coefficient of most nodes is between 0.4 and 0.8, which indicates that most models with common competitors are also competitors, and there is a relatively obvious clique effect. The average clustering coefficient is 0.64, significantly higher than that of the corresponding random network.

In summary, the low average shortest path length and high clustering coefficient imply that the network possesses the small-world phenomenon. It means that although most nodes are not directly connected to each other, most nodes can be reached in a few steps. The network is also likely to contain cliques or sub-networks, which implies that it may contain multiple communities; this will be discussed in Section 6.

In conclusion, the auto model competition network presents differences in degree distribution and the small-world phenomenon. Corresponding to the real world, they illustrate the fierce market competition and potential multiple communities.

6 COMMUNITY STRUCTURE AND PREDICTION

In fact, auto models already have different classifications according to auto brand, usage, nationality, price range and so on. However, these classifications can only represent the model’s own attributes, and cannot comprehensively reflect the users’ evaluation and the actual division in the auto market. For automakers and dealers, it is important to understand the actual division of auto models in the auto market and to identify current and even potential competitors, which assists them in formulating future production and marketing strategies. Besides, we have initially determined that there is a certain community structure in the auto model competition network. Therefore, in this section, we first detect the community structure of the network, and then build prediction models based on the communities to find key features that affect community division and users’ choice.

6.1 Community Structure Detection

The community structure was proposed by Girvan and Newman in 2002 [6]. Generally, a community represents a group of nodes with similar characteristics, and there may be multiple communities in a network. According to the definition, the nodes within a community are more closely connected, while the nodes of different communities are loosely connected. At present, many community detection algorithms have been proposed, such as the GN algorithm [6], the fast Newman algorithm [11], and the Louvain algorithm [3]. Newman et al. also proposed a modularity function to evaluate the quality of community structure division in the network [12]. Its value lies in [-1/2, 1], and the closer it is to 1, the better the community division. In practical applications the value is generally between 0.3 and 0.7 [12].

In this section, we use the Fast Newman and Louvain algorithms for community detection, both of which are greedy algorithms based on modularity maximization. The algorithms’ results are shown in Table 2 (edge weights are considered here). The Louvain algorithm performs better: it not only has a higher modularity score, but also a shorter computation time. Besides, 6 clusters of 1041 nodes are significantly easier to interpret than 282 clusters. Figure 5 shows the community detection results of the Louvain algorithm, where different colors represent different communities. Although the number of nodes in each community is different, the nodes within the same community are all in proximity. A detailed interpretation of the community division is given in the next part.

Table 2: Comparison of Community Detection Algorithms

Algorithm | Modularity Score | Number of clusters | Computation Time (s)
Fast Newman | 0.031 | 282 | 7.652
Louvain | 0.329 | 6 | 4.412

Figure 5: Community Detection in the Auto Model Competition Network

6.2 Community Prediction

Based on the community structure detected in the previous section, we constructed several predictive models to find the key features that affect community division.

First, we propose several features that may influence the community division of the auto model competition network, including the number of sales leads, the highest price of the model, the lowest price of the model, the model classification, the country of the model, the model asset ownership and the brand of the model, as shown in Table 3. Among these features, the first three are numerical variables, and one-hot encoding is used on the remaining four.

Table 3: Features to Predict Community Division

Fields | Description | Example
Num_leads | Number of sales leads of the model | 1000
Price_high | The highest price of the model | 16.28 (in 10,000 CNY)
Price_low | The lowest price of the model | 11.08 (in 10,000 CNY)
Model_level | The model classification | Minicar (14 kinds in total)
Country_name | The country of the model | Germany (10 countries in total)
Country_class | Model asset ownership | Domestic (or imported/joint venture)
Brand_name | The brand of the model | Volkswagen, . . . (130 brands in total)

Then Random Forest and XGBoost with 5-fold cross-validation are applied to these features and community labels. The metrics and performances are shown in Table 4; all values are means over the 5 folds. XGBoost performs better on all metrics. We find that the most important features are price (‘price_low’ and ‘price_high’), popularity (‘num_leads’), model level (‘model_level’), and model asset ownership (‘country_class’).

Table 4: Model Prediction

Model | Accuracy | Precision | Recall | F1 Score
Random Forest | 0.8119 | 0.8290 | 0.8120 | 0.8036
XGBoost | 0.8220 | 0.8420 | 0.8220 | 0.8203

Therefore, according to the community division and key features, we can summarize the characteristics of all 6 communities, as shown in Table 5. To be specific, community 1 is mainly imported or joint-venture SUVs with prices ranging from 170K to 240K CNY. Community 2 is mainly popular compact cars between 120K and 170K CNY. Community 3 is basically cheap cars, including mini cars, compact cars, small cars and SUVs. Community 4 has the most models, which are all expensive cars, such as SUVs, medium cars, medium and large cars, large cars, luxury cars and sports cars. Community 5 does not include sedans, but MPVs, trucks, pickups, vans, buses and so on. Finally, community 6 is mainly domestic SUVs.

Table 5: Community Characteristics

Community Number | Color | Number of Models | Average Number of Sales Leads | Characteristics | Example
1 | Dark Green | 104 | 10646.30 | Mainly SUVs (imported or joint-venture) | CR-V
2 | Pink | 48 | 20834.48 | Mainly popular compact cars | Lavida
3 | Light Green | 204 | 5598.01 | Low price (mini/compact/small cars and SUVs) | Jetta
4 | Violet | 312 | 4861.70 | High price (mainly SUV/medium/medium and large/large cars, Luxury cars, and Sports cars) | Accord
5 | Blue | 203 | 2271.48 | Not sedan (MPV/truck/pickup/van/bus. . . ) | WulingHongguangS
6 | Orange | 149 | 6187.33 | Mainly domestic SUV | Haval H6

Combining Figure 5 and the community characteristics above, we have several findings. First of all, the compact cars within 120K to 170K CNY are the most popular ones in China (i.e., community 2), with the highest average sales leads. Second, SUV is the most popular model classification, appearing in almost every community, and domestic SUVs are in a different community from imported and joint-venture SUVs. Finally, the high price community (community 4) has the largest number of models, but its average number of sales leads is the lowest of all communities except community 5.

In summary, we use the Louvain algorithm to find 6 communities in the auto model competition network, construct the XGBoost predictive model to find key features that affect community division and users’ choice, and summarize the characteristics of the 6 communities.

7 CONCLUSION

In this paper, we studied the competition pattern of auto models in China’s auto market based on sales leads data and complex network theory. Our investigation involved 6,152,335 sales leads of 1069 models from a vertical auto website, and China’s auto model competition network was established with the models as nodes and the competition relationships as edges. There are two important contributions. First, we constructed the auto model competition network of January 2019 and performed visualization and network characteristic analysis, revealing characteristics such as intensified competition and the small-world phenomenon. Second, we discovered that there are 6 communities in the network, and built predictive models to find that price, popularity, model level and model asset ownership are the key features that determine the model community structure. In conclusion, with the decline in car sales, the competition between models has become increasingly fierce. Among the 6 communities in the auto model network, the compact models within 120K to 170K CNY are the most popular. SUVs occupy a pivotal position in the entire auto model market. Our research addresses the problems in previous auto competition pattern analysis: the lack of a solid theoretical foundation, the lack of comprehensive data, and the lack of a complete analysis framework. However, this paper only studies the characteristics and community structure of the auto model competition network of January 2019 in detail; subsequent work will further study the dynamic characteristics and community structures.

ACKNOWLEDGMENTS

This research was funded by the National Key Technologies R&D Program (2017YFB1002502), and the General Program (61977010) of Natural Science Foundation of China. This work was also supported by the project of Beijing Advanced Education Center for Future Education (BJAICFE2016IR-003). We would also like to thank Mr. Ran Feng for his suggestion on data preprocessing.

REFERENCES

[1] Albert-László Barabási and Réka Albert. 1999. Emergence of scaling in random networks. Science 286, 5439 (1999), 509–512.
[2] Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. 2009. Gephi: an open source software for exploring and manipulating networks. In Third International AAAI Conference on Weblogs and Social Media.
[3] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. 2008. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008, 10 (2008), P10008.
[4] Faen Chen and Yukio Kodono. 2012. SWOT analysis and five competitive forces of Chery automobile company. In The 6th International Conference on Soft Computing and Intelligent Systems, and The 13th International Symposium on Advanced Intelligence Systems. IEEE, 1959–1962.
[5] Paul Erdős and Alfréd Rényi. 1960. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5, 1 (1960), 17–60.
[6] Michelle Girvan and Mark EJ Newman. 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99, 12 (2002), 7821–7826.
[7] Aric Hagberg, Pieter Swart, and Daniel S Chult. 2008. Exploring network structure, dynamics, and function using NetworkX. Technical Report. Los Alamos National Lab. (LANL), Los Alamos, NM (United States).
[8] Terry Hill and Roy Westbrook. 1997. SWOT analysis: it’s time for a product recall. Long Range Planning 30, 1 (1997), 46–52.
[9] Mathieu Jacomy, Tommaso Venturini, Sebastien Heymann, and Mathieu Bastian. 2014. ForceAtlas2, a continuous graph layout algorithm for handy network visualization designed for the Gephi software. PloS one 9, 6 (2014).
[10] YANG Jianmei ZHOU Lian ZHOU Lianqiang. 2013. Competitive Relationships of Auto Industry and Rivalry Actions of Car Community Enterprises in China. Chinese Journal of Management 1 (2013).
[11] Mark EJ Newman. 2004. Fast algorithm for detecting community structure in networks. Physical Review E 69, 6 (2004), 066133.
[12] Mark EJ Newman and Michelle Girvan. 2004. Finding and evaluating community structure in networks. Physical Review E 69, 2 (2004), 026113.
[13] Michael E Porter. 1980. Competitive Strategy: Techniques for Analyzing Industries and Competitors. Free Press, New York.
[14] Long Sun. 2019. Research on the development strategy of new energy vehicles for FAW-Volkswagen. Master’s thesis. Jilin University, Changchun, China.
[15] Duncan J Watts and Steven H Strogatz. 1998. Collective dynamics of ‘small-world’ networks. Nature 393, 6684 (1998), 440.
[16] Li-juan ZHANG and Chang-hong LI. 2007. Study on cooperative networks in enterprises: An analysis of automobile manufacturing. Science-Technology and Management 4 (2007).
[17] Jie Zheng. 2019. Negative growth dust of the auto market in 2018 is settled. Automobile Watch 01 (2019), 18–19.
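As an illustration of the preprocessing described in Section 3, the following minimal Python sketch counts monthly sales leads per model and overlapping sales leads per model pair. The record layout `(user, month, model)` and the sample values are hypothetical, and the sketch assumes that a lead is counted once per user, model and month; the paper does not spell out its exact deduplication rule.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical records: (user_id, month, model). The fields loosely follow
# Table 1; the values are invented for illustration only.
records = [
    ("UID_1", "2018-10", "Camry"),
    ("UID_1", "2018-10", "Jetta"),
    ("UID_2", "2018-10", "Camry"),
    ("UID_2", "2018-10", "Camry"),  # exact duplicate, dropped below
    ("UID_3", "2018-10", "Jetta"),
    ("UID_3", "2018-11", "Camry"),  # other month, so no October overlap
]

def monthly_counts(records, month):
    """Return (leads per model, overlapping leads per model pair) for a month."""
    models_by_user = defaultdict(set)
    for user, rec_month, model in set(records):  # set() removes duplicates
        if rec_month == month:
            models_by_user[user].add(model)

    leads = Counter()
    overlaps = Counter()
    for models in models_by_user.values():
        leads.update(models)
        # Each unordered pair of models chosen by one user is one overlap.
        overlaps.update(combinations(sorted(models), 2))
    return leads, overlaps

leads, overlaps = monthly_counts(records, "2018-10")
print(dict(leads))
print(dict(overlaps))
```

Sorting the models before pairing keeps each pair in one canonical order, so (Camry, Jetta) and (Jetta, Camry) are never counted separately.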
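To make the degree and shortest-path statistics of Section 5 concrete, here is a small standard-library Python sketch on a toy graph; the node names are placeholders and the paper itself used networkx rather than this code. The last lines check two reported figures against each other: the average degree should equal 2E/N, and the share of node pairs at distance 1 should equal the edge density of the 1041-node component.

```python
from collections import deque

# Toy undirected competition network; the model names are placeholders.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def bfs_lengths(adj, source):
    """Unweighted shortest path lengths from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def path_stats(adj):
    """Average shortest path length and diameter of a connected graph."""
    lengths = [d for s in adj for t, d in bfs_lengths(adj, s).items() if t != s]
    return sum(lengths) / len(lengths), max(lengths)

avg_degree = sum(len(nbs) for nbs in adj.values()) / len(adj)
avg_path, diameter = path_stats(adj)
print(avg_degree, avg_path, diameter)

# Consistency check on the figures reported in the text.
n_nodes, n_edges, n_isolated = 1069, 126_650, 28
reported_avg_degree = round(2 * n_edges / n_nodes, 2)    # 236.95
n_lgc = n_nodes - n_isolated                             # 1041 nodes
density = round(n_edges / (n_lgc * (n_lgc - 1) / 2), 3)  # 0.234
print(reported_avg_degree, density)
```

Both checks agree with the text: 236.95 competitors per model on average, and 23.4% of node pairs in direct competition.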
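The clustering coefficient used in Section 5 can be sketched in a few lines of Python. The toy adjacency dictionary is hypothetical. The comparison value at the end assumes that the "corresponding random network" is an Erdős–Rényi graph with the same number of nodes and edges, whose expected clustering coefficient equals its edge density; the paper does not state which random model it used.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of pairs of neighbours of node that are themselves linked."""
    nbs = adj[node]
    k = len(nbs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbs, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))

# Toy undirected network given as an adjacency dictionary (placeholder names).
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
coeffs = {node: local_clustering(adj, node) for node in adj}
average = sum(coeffs.values()) / len(coeffs)
print(coeffs, average)

# Assumed Erdős–Rényi baseline for the 1041-node component with 126,650 edges.
n_lgc, n_edges = 1041, 126_650
random_baseline = n_edges / (n_lgc * (n_lgc - 1) / 2)
print(round(random_baseline, 3))  # 0.234, well below the reported 0.64
```

Under that assumption the reported average of 0.64 is roughly 2.7 times the random baseline, which is consistent with the clique effect described in the text.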
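The feature preparation described in Section 6.2 (one-hot encoding of categorical fields and 5-fold cross-validation) can be sketched with the standard library as follows. The helper names and sample rows are hypothetical, the classifiers themselves (Random Forest, XGBoost) are not reproduced here, and the folds are contiguous for simplicity; in practice the rows would normally be shuffled first.

```python
def one_hot(rows, column):
    """Replace a categorical column by one 0/1 column per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}={cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded

def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

# Hypothetical rows using three of the Table 3 fields; values are invented.
rows = [
    {"num_leads": 1000, "price_low": 11.08, "country_class": "Joint venture"},
    {"num_leads": 250, "price_low": 6.50, "country_class": "Domestic"},
    {"num_leads": 90, "price_low": 35.00, "country_class": "Imported"},
]
encoded = one_hot(rows, "country_class")
folds = list(k_fold_indices(10, k=5))
print(encoded[0])
print(folds[0])
```

Each test fold is held out once while the model is trained on the other four, and the per-fold metrics are then averaged, which is how the mean values in Table 4 are described.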