Towards Framework for Discovery of Export Growth Points

Dmitry Devyatkin (1), Roman Suvorov (1), Ilya Tikhomirov (1), Yulia Otmakhova (2)

(1) Federal Research Center Computer Science and Control of the Russian Academy of Sciences, Moscow, Russia
(2) Novosibirsk State University, Novosibirsk, Russia

devyatkin@isa.ru, rsuvorov@isa.ru, tih@isa.ru, otmakhovajs@yandex.ru

Proceedings of the XIX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL’2017), Moscow, Russia, October 10–13, 2017

Abstract. The export value of the Russian Federation has been decreasing in recent years, as has the corresponding relative yield. Most probably, this trend is caused by the decline of Russia's total export together with the growth of food export. Thus, it is very important not only to increase export volumes, but also to adjust the export structure so that it better fits current conditions. The paper presents a computer-aided framework for the discovery of export growth points. While the full framework is described briefly, more attention is paid to the first sub-task: ranking growth point candidates. The objective of this sub-task is to reveal combinations of commodities and partner countries with a high probability of successful export. The method uses open data about international trade flows and production from United Nations databases and modern machine learning methods. The experimental evaluation shows that taking retrospective data into account allows growth point candidates to be ranked significantly better. Finally, the limitations and possible directions of future research are discussed.

Keywords: export growth potential, data mining, international trade, customs statistics, open data, machine learning.

1 Introduction

Sanctions pose both difficulties and opportunities for the Russian economy. On the one hand, traditional foreign markets may be restricted or their growth potential may be exhausted. On the other hand, exploring new markets may become a fruitful workaround. We believe that modern big data and machine learning technologies should be useful for discovering new foreign markets with a high probability of growth in the near future. We will refer to pairs of countries and commodities as potential growth points. This paper aims to take a step towards finding new growth points using machine learning and open data analysis.

The authors of [1] consider export growth potential as an opportunity to meet the primary demand for a certain product or service. At the same time, the possibility to satisfy the demand arises locally and has a specific territorial, and therefore national, binding.

There are two possible ways to satisfy growing demand: extensive and intensive. The intensive way implies improving technologies and scientific and engineering solutions, and increasing the resource potential and the efficiency of management. Therefore, a product may have high export growth potential if it has high added value, robust interbranch relations and stable external demand.

In this paper, we propose a framework for the discovery of “export growth points”. The high-level procedure of this framework consists of two main steps: (1) finding candidates for “growth points”; (2) assessing each candidate and discovering difficulties with its implementation. The first step consists in ranking $\langle Product, Country \rangle$ pairs so that the pairs most likely to grow appear at the beginning of the list. We propose a machine-learning-based method that ranks the “growth point” candidates using features extracted from historical data in the FAOSTAT and UN Comtrade databases [2, 3]. The presented evaluation is preliminary, because it is based on retrospective data only. We are aware of this weakness and are going to address it in future work.

The rest of the paper is organized as follows: in Section 2 we review the most closely related published work; in Sections 3 and 4 we briefly describe our framework and the task of ranking export growth point candidates, including the dataset and the models; in Section 5 we present the results of the experimental evaluation; in Section 6 we conclude and discuss future work.

2 Related work

The most commonly used approaches to foreign trade modeling include gravity models, computable general equilibrium models, heuristic ranking models, Markovian models, and common statistical approaches (regressions, histograms) for manual analysis of a situation.

The paper [4] presents an empirical evaluation of a spatial gravity model of Russian trade. The authors concluded that spatial variables such as the location of state border checkpoints have a significant effect on the volume and routes of Russian imports. In [5] the authors study factors of export and import value-added trade and suggest some recommendations for the management of industrial and trade policy. The techniques proposed in that paper make it possible to determine the main directions of economic policy for expanding exports and improving the Russian production structure. Duenas and Fagiolo [6] concluded that gravity models are poorly suited to predicting the presence of trade relations between two given countries. However, such models allow the volume to be estimated and forecast accurately, given the knowledge that such a trade relation exists. In [7] the researchers use gravity models to investigate the export destinations that could be effectively developed with internal financial support. The experimental work was carried out on firm-level food export data.

In [8] the authors consider Markov models for forecasting the variability of the network of foreign trade financial flows. In [9] an approach for detecting promising areas of export in both the service and the goods sectors is proposed. The approach is based on sequential filtering of potential markets via a number of heuristics, including estimation of the market volume, the level of demand, market openness, etc. In [10] the authors studied the relationships between migration flows and foreign trade. They concluded that the trade flows for some products are positively and significantly correlated with migration flows. This feature can be taken into account when analyzing and evaluating the prospects of an export.

In [11] Lall et al. investigated the relationship between export volume and the “complexity” of goods, and introduced a metric of “complexity” or “manufacturability” of goods. They noted the dependence between the rate of growth of a product's price and the degree of its manufacturability. This dependence can be used as one of the features for detecting and assessing export growth potential. Bernard and Jensen [12] proposed a method for estimating the feasibility of entering the international market for a particular company. They used indicators of the company's past activity, including participation in exports, the competitive environment, etc. It is worth noting the weak influence of sectoral state support for exports on the actual volume of exports. In [13] the authors considered the relationship between the topology of the overall international trade network between countries and the network topologies within each product group. They proposed a methodology for studying the dynamics of the structure of several heterogeneous networks that represent trade flows between countries for individual commodity groups. As a result, the most active exporters and importers were detected for separate groups.

In [14] the authors model the structure and dynamics of the international trade network using the classical methods for the problem of selecting balls from urns. The analysis is carried out at the level of countries, and the principle of preferential attachment is implemented (“the rich get richer, the poor get poorer”). In [15] the author proposes to model the structure and dynamics of the International Trade Network as a Hamiltonian system. The dynamics of the network are described in terms of a Hamiltonian, under the assumption that the main provisions of statistical physics are also applicable to modeling the International Trade Network.

Shen et al. [16] considered the international trade network at the level of countries and goods. They used flow analysis in graphs and top-list statistics to study the network. The authors draw a number of conclusions related to the specialization of countries, as well as to the dominance of developed countries in terms of the diversity of exported products (the principle of preferential attachment). They empirically confirm that food products are mostly traded between the most closely located countries, while high-tech goods are distributed virtually all over the world. The authors also detect countries with an anomalous import profile, which may indicate a number of economic problems. In [17] the authors presented an analysis of export in the service sector using German companies as an example. The main goal of the analysis is to determine how the directions and the mode of export depend on various features of the exported services. They used a non-open dataset from Deutsche Bank. Among other things, the authors detected such heuristics as “exports are more preferable to countries with higher incomes (for countries with lower incomes, an international partnership is more preferable)” and “when selling in more remote countries, international partnership is more profitable”.

In [18; 19] the researchers developed machine learning models to forecast the export dynamics of agricultural products. They compare Support Vector Machines (SVM) and the Autoregressive Integrated Moving Average (ARIMA) model. The experiments showed that SVM achieves significantly smaller error rates.

To sum up the review, quite extensive efforts have been devoted to analyzing and predicting international trade flows. However, most papers describe fragmentary studies that focus on a limited set of factors. Thus, a goal-oriented and comprehensive approach is in high demand.

3 Framework for discovering export growth points

In this section we try to formalize the problem of export growth point discovery. The objective is to find the combinations $\langle Product_i, Country_j \rangle$ that have the highest unrealized potential for export growth. In addition, production and export management of these combinations has to be feasible in the Russian Federation. $Product_i$ is a product or product category to export, and $Country_j$ is a country or a group of countries to export to.

We propose to use open data analysis and modern machine learning techniques to find such growth points. The high-level algorithm of our framework consists of the following steps:

1. Construct a list of growth point candidates $\langle Product_i, Country_j \rangle$. Reorder this list so that the candidates with a higher likelihood of becoming a successful export direction appear earlier.

2. Analyze the supply chains that contain commodities from our candidate list. Products with higher added value should be reviewed first. Consider the product lifecycle (including production, storage, transportation and processing for the selected products) in order to detect the most probable difficulties at each stage of the lifecycle in the context of the Russian Federation. Propose intensive or extensive ways of overcoming them. Products with too many difficulties are removed from the list.

The novelty of our approach consists in the maximum possible automation. We can automate step 1 (candidate ranking) and aid step 2. Ranking in step 1 can be carried out with a predictive machine-learning-based model. Step 2 can be greatly facilitated by developing a specialized information retrieval system that uses large collections of scientific and engineering documents, such as patents, scientific papers and grant reports. Step 1 is discussed in detail later in this paper. We are going to consider step 2 in the future.

4 Data Driven Candidates Ranking

Formally, the problem of candidate ranking is a Learning-To-Rank (LTR) problem. Traditionally, each LTR problem is specified by three components: a set of possible queries, a set of objects and a target metric to optimize. In this work each query is formulated as “Which products to which countries should we try to export to increase budget income, in the context of the current macroeconomic situation and our state of industry?”. In other words, a query is specified by the current economic context (wide or narrow, depending on the implementation). The objects that are ranked relative to that query are export growth point candidates, i.e. $\langle Product, Country \rangle$ pairs (what to export and where).

The main difficulty with the LTR problem statement is the construction of the target metric. This metric must reflect the likelihood of success if export of $Product_i$ to $Country_j$ from the Russian Federation is established. Such a metric cannot be constructed in a purely data-driven way, because no database of such cases exists. To overcome this issue, we propose to rely on two sources of knowledge: (1) the opinion of experts in the field of the food market and international trade; (2) retrospective data about the dynamics of international trade. On the one hand, retrospective data alone cannot be used to predict the future, because the world context is changing and will almost never be the same again. On the other hand, experts rely on a limited number of factors and on limited knowledge (it may be very deep, but it is still limited). Thus, we propose to use experts to take into account factors that are hard to formalize, and retrospective data to measure the prior likelihood that the trade flow of $Product_i$ to $Country_j$ will grow.

Taking expert opinion into account requires labeling a training dataset. In this paper we conduct preliminary studies using only retrospective data, due to limitations of time and resources. Experiments with manually annotated datasets will be considered in the future.

In other words, in this paper we study only the prediction of export dynamics. One can dispute that LTR is a reasonable approach to this problem and claim that traditional regression is a better fit. We chose LTR for three main reasons. The first is that information about order is more abstract than information about the exact increase of trade value or volume, and thus the corresponding predictive model should generalize better. The second is that we plan to use LTR in the more general case and thus want to conduct experiments as close to the proposed framework as possible. The third is that we can generate more data to train an LTR model and thus try to reduce overfitting.

To facilitate the solution of the described LTR problem, we treat it as a pairwise ranking problem: we build a regression model which is given a pair of two export growth point candidates, $\langle Product_1, Country_1 \rangle$ and $\langle Product_2, Country_2 \rangle$, and returns the difference between the export flows for the first and the second candidate. Generally, such a model operates on a feature set consisting of three major parts: a description of the global macroeconomic situation; a description of trade flows for the first candidate; a description of trade flows for the second candidate. Ideally, the information about both candidates should also somehow describe prices, competitiveness, quality, etc.

The objective of the experimental evaluation in this paper is to verify that retrospective data is useful for comparing trade flow dynamics for different commodities and foreign markets. To achieve this goal, we applied an ARIMA model as a reference and also built two machine learning models: “baseline” and “advanced”.

4.1 Dataset

We used excerpts from the FAOSTAT [2] and UN Comtrade (Comstat) [3] databases for the years 2011 to 2015. The main source of data is Comstat (import, export, re-import, re-export). From FAOSTAT we took information about production volumes. The last year for which FAOSTAT contains data is 2014, so 2015 is the last year we could predict for. The full dataset contained 307 million data points.

Due to limited time and computational resources, we conducted experiments only on the 10 commodities most exported from the Russian Federation. We also selected 20 countries in the same way. Thus, we got 200 growth points. In future experiments we should certainly consider a much larger set of commodities and countries, not only those that are already well developed.

The testbed was set up as follows. All available data were split into two parts: train and test. The train subset contained information about trade in 2013 and 2014. The test subset contained information about 2015 only. Each subset consisted of data points, each representing a pair of export growth point candidates to compare. Features were constructed using the “current” and “previous” years. Outcomes were constructed on the basis of the “next” year. Thus, in the train subset, features were constructed on the basis of 2011-2012 (with 2013 as “next”) and 2012-2013 (with 2014 as “next”), and outcomes were constructed on the basis of 2013 and 2014 respectively. In the test subset, features were constructed using 2013-2014 and outcomes represented the difference in dynamics in 2015. Each subset was symmetric: for each pair $\langle A, B \rangle$ there was also the pair $\langle B, A \rangle$. Samples with an outcome of 0 were excluded from both subsets.

4.2 Baseline model

The objective of the baseline model is to estimate how accurately candidates can be compared using only knowledge about the titles of these candidates. The baseline is implemented as a Bernoulli Naive Bayes classifier with a feature set consisting only of $\langle Product_1, Country_1 \rangle$ (that is, only the elements of the left-hand part of the comparison). Reference outcomes for training the baseline model were constructed as $\operatorname{sign}(dEV_1 - dEV_2)$, where $dEV_i$ is the first difference of the export value of $Product_i$ from the Russian Federation to $Country_i$.

Thus, this classifier estimates the prior marginal probability of each candidate growing faster than each other candidate. This model is very naive and measures the skewness of our dataset and the most frequent patterns of the Russian Federation's international trade.

4.3 “Advanced” model

The objective of this model is to estimate how much simple context information can improve comparison accuracy. There are several differences from the baseline: the feature set, the machine learning method used and the loss function.

The feature set consists of two parts: historical information about trade of the Russian Federation in $Product_1$ with $Country_1$, and the same information about the second candidate. “Historical information about trade” includes the following basic values from the UN Comtrade database: export amount (in tonnes), export value (in USD), export price (as the ratio of value to amount) and export monopolization, together with the corresponding parameters for re-export, import and re-import. The feature set also contains information about production (from the FAOSTAT database). Prior dynamics is taken into account using first-order differences and ratios. The first-order difference (or ratio) is the difference (or ratio) between the value for the current year and that for the previous one. Monopolization (or competitiveness, or concentration) is estimated using the Herfindahl index (the sum of squares of country portions in a flow).

Reference outcomes for the “advanced” model were constructed as

$\operatorname{sign}(dEV_1)\log(|dEV_1| + 1) - \operatorname{sign}(dEV_2)\log(|dEV_2| + 1)$,

where $dEV_i$ is the first difference of the export value of $Product_i$ from the Russian Federation to $Country_i$. The training dataset for the “advanced” model consisted of 68370 samples (pairs of growth points) and 1398 features. The test dataset consisted of 35700 samples.

We tried Support Vector Machines with linear and polynomial kernels, a random forest regressor (bagging) and gradient tree boosting (as implemented in LightGBM [20]). Hyperparameters were optimized using grid search. To prevent overfitting during hyperparameter optimization, the training data was split so that the data for each year was used solely either for training or for evaluation. After the best hyperparameters were chosen, the model was refitted using all training data. Finally, we decided to use LightGBM to train the model, because it showed the most promising results. All the results presented for the “advanced” model were obtained using LightGBM.

One can notice that we do not explicitly use information about the global economic situation. We omitted it from the feature set for two main reasons: (1) it is very difficult to represent it in such a way that a machine-learning-based model can take full advantage of it (it is unclear how to prepare the features); (2) some global information is implicitly encoded in the difference between production, import and export, and also in the monopolization estimates. Explicitly taking the global economic situation into account is certainly very important. We will consider it in future papers.

5 Experimental evaluation

As stated earlier, the main objective of the experimental evaluation is to estimate how useful detailed retrospective data about international trade is for the problem of ranking growth point candidates. Because of the nature of the problem, the standard classification or regression scores are not well suited to measuring prediction quality: miscomparisons of different pairs may have very different significance. Therefore, we used the proportion of the total export gain that falls on the predicted export growth points as the score. In other words, the bigger the part of export growth the model detects (the last “%” row in the tables), the better the model works. These percent values may be treated as quantitative measures of prediction quality.

Table 1 contains the scores for the top 5 actual growth points and for the predicted alternatives. The “$” row gives the total absolute export value growth for the pairs in each column. The last row (%) contains the portion of the total growth of export from Russia in 2015, calculated over all growth point candidates (as specified above). From this table one can see that it is nearly impossible to predict short one-year trade flow dynamics without additional information about the global economic situation.

Table 1 Top 5 predicted export growth points and their combined proportion in the total export gain (each cell: partner country, commodity)

No | Actual | ARIMA | Baseline model | Advanced model
1 | Saudi Arabia, Barley | Libya, Barley | Azerbaijan, Potatoes | Italy, Maize
2 | China, Soybeans | Spain, Soybeans | Georgia, Maize | Spain, Maize
3 | Turkey, Maize | Ukraine, Wheat | Uzbekistan, Wheat | Libya, Maize
4 | Azerbaijan, Wheat | Ukraine, Molasses | Ukraine, Potatoes | Spain, Rye
5 | Italy, Maize | Kazakhstan, Soybeans | China, Wheat | Ukraine, Molasses
Export gain, $ | 360059k | 11710k | 13830k | 19197k
Export gain, % | 76.2 | 2.4 | 2.9 | 4.0

A notable difficulty here is the high volatility of the product market, while the creation or development of a food manufacture is a long-term process. Therefore, we think that prediction of averaged, long-term trends would yield a more meaningful ranking.

The advanced model achieved slightly better results than the baseline and ARIMA models. From that we conclude that retrospective data is useful for predicting flow dynamics. This in turn means that combining open retrospective data about international trade with expert opinions makes good sense in order to maximize both likelihood and novelty.

Table 2 Top 5 predicted commodities and their proportion in the total export gain

No | Actual | ARIMA | Baseline model | Advanced model
1 | Barley | Barley | Potatoes | Maize
2 | Soybeans | Soybeans | Maize | Rye
3 | Maize | Wheat | Wheat | Molasses
4 | Wheat | Molasses | Linseed | Soybeans
5 | Potatoes | Maize | Rye | Wheat
$ | 446903k | 440272k | 137694k | 225233k
% | 94.6 | 93.2 | 29.1 | 47.6

Table 2 presents the five commodities with the highest expected growth. The last row (%) contains the portion of total growth. One can see how non-diversified Russian food export is: 5 commodities account for more than 90% of the total export value growth. We can also see that ARIMA predicts commodity dynamics much better than both the baseline and the advanced model. We think that this is mostly due to the inertia of flows: if something grows today, it will most probably grow tomorrow. Again, the “advanced” model performed better than the baseline. This means that prior information is not very useful for predicting commodity dynamics.

Table 3 Top 5 predicted directions and their proportion in the total export gain

No | Actual | ARIMA | Baseline model | Advanced model
1 | Saudi Arabia | Libya | Azerbaijan | Italy
2 | China | Spain | Georgia | Spain
3 | Turkey | Ukraine | Uzbekistan | Libya
4 | Azerbaijan | Kazakhstan | Ukraine | Ukraine
5 | Italy | Georgia | China | Armenia
$ | 374755k | 49666k | 145263k | 47982k
% | 79.3 | 13.6 | 31.8 | 13.1

Table 3 presents the five countries with the highest expected import growth from the Russian Federation. From this table we conclude that Russian export is not only commodity-non-diversified, but also partner-non-diversified. We can also see that the purely prior-based “baseline” model performed best: it predicted more than 30% of the actual export growth. ARIMA and the “advanced” model performed approximately equally. So, we conclude that almost no new markets are explored: we will trade tomorrow with those with whom we trade today. Additional unaccounted factors may include politics, wars, sanctions, etc.

6 Conclusion and future work

In this paper we have reviewed and discussed the problem of export growth point discovery. The main contribution of this paper is an automated data-driven framework that addresses the problem. The framework uses open data from many data sources and modern machine learning techniques. We also conducted preliminary experiments to evaluate the possibility of using retrospective data to rank growth point candidates. The experiments were based on open data from FAOSTAT and UN Comtrade.

Currently, it is very difficult to say for sure which method is more useful for the final task, growth point discovery. The methods compare to each other differently depending on what is compared (top 5 growth points, top 5 commodities or top 5 directions). This fact gives some clues about what a better model should look like. Another thing that has to be changed is the objective function: predicting short-term export value changes is very difficult and of little practical use, because developing a new manufacture takes much more than one year. Thus, it makes much more sense to predict long-term trends.

The main directions of future work include (a) repeating the experiments with an adjusted methodology; (b) creating a manually annotated dataset of growth points; (c) incorporating information about the global economic situation and substitutes.

Acknowledgment

The research is supported by the Russian Foundation for Basic Research, project 16-29-12877.

References

[1] Rodrik D. Institutions, integration, and geography: In search of the deep determinants of economic growth // In Search for Prosperity: Analytic Narratives on Economic Growth. Princeton University Press, Princeton. – 2003
[2] Food and Agriculture Organization of the United Nations. URL: http://www.fao.org/faostat/en/
[3] UN Comtrade: International Trade Statistics. URL: https://comtrade.un.org/data/
[4] Kaukin A., Idrisov G. The gravity model of Russia’s international trade: the case of a large country with a long border. Working paper. – 2014
[5] Gordeev D. et al. Analysis of Global Supply Chains in International Trade Patterns. – 2016. – №. 765
[6] Duenas M., Fagiolo G. Modeling the International-Trade Network: a gravity approach // Journal of Economic Interaction and Coordination. – 2013. – Vol. 8(1). – pp. 155-178
[7] Jaud M., Kukenova M., Strieborny M. Financial Development and Sustainable Exports: Evidence from Firm product Data // The World Economy. – 2015. – Vol. 38(7). – pp. 1090-1114
[8] Snijders T. A. B. Models for longitudinal network data // Models and methods in social network analysis. – 2005. – Vol. 1. – pp. 215-247
[9] Grater S. et al. Linking export opportunities of products and services: the case of South Africa.
[10] Sgrignoli P. The World Trade Web: A Multiple-Network Perspective // arXiv preprint arXiv:1409.3799. – 2014
[11] Lall S., Weiss J., Zhang J. The “sophistication” of exports: a new trade measure // World Development. – 2006. – Vol. 34(2). – pp. 222-237
[12] Bernard A. B., Jensen J. B. Why some firms export // Review of Economics and Statistics. – 2004. – Vol. 86(2). – p. 561-569
[13] Barigozzi M., Fagiolo G., Garlaschelli D. Multinetwork of international trade: A commodity-specific analysis // Physical Review E. – 2010. – Vol. 81(4). – p. 46-104
[14] Peluso S. et al. International Trade: a Reinforced Urn Network Model. – 2016. – №. 1601.03067
[15] Fronczak A. Structural Hamiltonian of the international trade network // No. – 2012. – Vol. 1. – No. arXiv: 1205.4589. – pp. 31-46
[16] Shen B., Zhang J., Zheng Q. Exploring multi-layer flow network of international trade based on flow distances // arXiv preprint arXiv:1504.02361. – 2015
[17] Kelle M. et al. Cross border and Foreign Affiliate Sales of Services: Evidence from German Microdata // The World Economy. – 2013. – Vol. 36(11). – pp. 1373-1392
[18] Sujjaviriyasup T., Pitiruek K. Agricultural Product Forecasting Using Machine Learning Approach // Int. Journal of Math. Analysis. – 2013. – Vol. 7. – №. 38. – p. 1869-1875
[19] Sujjaviriyasup T., Pitiruek K. Hybrid ARIMA-support vector machine model for agricultural production planning // Applied Mathematical Sciences. – 2013. – Vol. 7. – №. 57. – p. 2833-2840
[20] Microsoft. https://github.com/microsoft/lightgbm
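
As an illustration of the quantities described in Sections 4.1, 4.3 and 5, the following minimal Python sketch shows one possible way to compute the Herfindahl index, the signed-log pairwise outcome of the “advanced” model, a symmetric set of pairwise samples with zero outcomes removed, and the share of total export gain captured by a predicted top list. It is not the code used in the experiments. All function names and the toy numbers are hypothetical, and the definition of “total export gain” as the sum of positive first differences over all candidates is an assumption, since the paper does not spell it out.

```python
import math
from itertools import permutations


def herfindahl_index(partner_values):
    """Sum of squared partner shares in a flow (1.0 means one partner takes all)."""
    total = sum(partner_values)
    if total == 0:
        return 0.0
    return sum((v / total) ** 2 for v in partner_values)


def signed_log(x):
    """sign(x) * log(|x| + 1): compresses the magnitude, keeps the direction."""
    return math.copysign(math.log(abs(x) + 1.0), x)


def build_pairwise_samples(d_ev):
    """Build ((candidate_1, candidate_2), outcome) samples.

    d_ev maps a (product, country) candidate to the first difference of its
    export value. Every ordered pair is generated, so the set is symmetric;
    pairs whose outcome is exactly 0 are dropped, as in Section 4.1.
    """
    samples = []
    for c1, c2 in permutations(d_ev, 2):
        outcome = signed_log(d_ev[c1]) - signed_log(d_ev[c2])
        if outcome != 0:
            samples.append(((c1, c2), outcome))
    return samples


def top_k_gain_share(predicted_top, d_ev):
    """Percent of the total export gain that falls on the predicted candidates.

    Assumption: the total gain is the sum of positive first differences.
    """
    total_gain = sum(v for v in d_ev.values() if v > 0)
    captured = sum(d_ev[c] for c in predicted_top)
    return 100.0 * captured / total_gain


# Hypothetical first differences of export value (arbitrary units).
toy_d_ev = {
    ("Barley", "Saudi Arabia"): 300.0,
    ("Maize", "Turkey"): 100.0,
    ("Wheat", "Ukraine"): -50.0,
    ("Rye", "Spain"): 100.0,
}
toy_samples = build_pairwise_samples(toy_d_ev)
toy_share = top_k_gain_share(
    [("Barley", "Saudi Arabia"), ("Wheat", "Ukraine")], toy_d_ev
)
```

In this toy setting the two candidates with equal first differences produce zero outcomes in both orders and are dropped, which mirrors the exclusion rule of Section 4.1. A regression model such as the gradient boosting model of Section 4.3 would then be fitted on the features of both candidates in each remaining sample.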