A Novel Approach of Cognitive Base Station with Dynamic Spectrum Management for High-speed Rail*

Qingting Wu, Yiming Wang, Zhijie Yin, Hongyu Deng, Cheng Wu†
School of Urban Rail Transportation, Soochow University, Suzhou, China

* Project supported by the National Natural Science Foundation of China (No. 61471252) and the Natural Science Foundation of Jiangsu Province (No. BK20130303).
† Corresponding Author: cwu@suda.edu.cn

Abstract

The fast movement of high-speed trains seriously affects the stability of vehicular wireless communication. Applying cognitive technology to individual users often leads to frequent channel switching and inefficient blind learning. To address these issues, this paper proposes the concept of a Cognitive Base Station (CBS), which can forecast spectrum holes and assign spectrum to individual users. We then give the model of the cognitive base station and evaluate its performance on our simulation platform in a high-speed rail environment. The experimental results show that the model can significantly improve the performance of vehicular communication.

1 Introduction

With the development of the times, the demand for rail transit is increasing rapidly. When travelling by train, passengers expect better communication quality and faster data access. The European Rail Traffic Management System (ERTMS) was introduced to guarantee railway communication. It consists of the European Train Control System (ETCS) and GSM-R, a mobile communications network optimized for railways.

GSM-R (Global System for Mobile Communications-Railway) is dedicated to providing the bidirectional radio bearer for train signaling systems. It operates in a 4 MHz band (876-880 MHz for uplink and 921-925 MHz for downlink) [Sniady and Soler, 2012]. The authorized band can be divided into 19 channels of 200 kHz width in each GSM-R group. The rail line is covered by GSM-R groups, each of which consists of many GSM-R cells. A single GSM-R cell can use only a few of the channels, in a round-robin manner, because the same channel cannot be reused by neighboring cells due to interference. Each cell is equipped with a base station, which is made up of a building baseband unit (BBU) and a radio remote unit (RRU). The RRU is always deployed outdoors along the railway and the BBU indoors. One BBU is connected to multiple RRUs. The BBU and the RRU process the baseband signal and the radio frequency signal, respectively. To ensure communication between the RRUs and passengers, two vehicular stations (VS) are installed on the first and last carriages of the train. The network architecture is illustrated in Fig. 1 [isheng Zhao et al., 2013], [Tian et al., 2012]. The GSM-R system consists of base transceiver stations (BTS) along the railway lines and embedded GSM-R mobiles connected to antennas on the roofs of the trains. The train has to be permanently connected to the train control center. This connection has a high priority level, and if the modem connection is lost, the train stops automatically [Dudoyer et al., 2012].

However, in the high-speed railway setting [Zhang et al., 2012], vehicular communication is often unstable and sometimes very poor [Ai et al., 2014]. Typically, when the speed reaches 350 kilometers per hour, issues such as Doppler shifts, fast cell switching and penetration loss unavoidably arise [Zhou and Ai, 2014]. The Doppler shift results from the relative motion between a vehicle and a base station. The Doppler effect is a pivotal factor degrading system performance, because it increases the randomness of the received signal [Liu et al., 2011], [Li and Zhao, 2012], [Dybala and Radkowski, 2013]. The high-speed operation of the train leads to fast cell switching. As a train moves across the footprint of the satellite beam, the received signal level may vary, especially towards the edge of the beam, which significantly impacts service rates and can even cause service drops [Li et al., 2013], [Alkayal and Saada, 2013]. The fully enclosed, well-sealed body of the high-speed train results in penetration loss. Typically, the terminals inside the train connect to the base stations along the railway tracks via wireless links, on which the large penetration loss directly degrades the communication link quality and decreases the cell coverage [Zhu et al., 2013], [Liu et al., 2012]. Furthermore, the Federal Communications Commission (FCC) released an investigation of spectrum usage in 2003. It suggested that the authorized band in the 3-6 GHz range is less than 0.5% utilized on average, and that the band below 3 GHz is less than 35% utilized [Commission and others, 2003]. Based on these observations, it is necessary to introduce a novel architecture for high-speed vehicular communication that addresses the issues arising from individual users' high-speed movement along the rails and the inefficiency in spectrum usage.

Figure 1: Network architecture for the high-speed rail communication system.

In recent years, many researchers have used cognitive radio (CR) to improve the performance of wireless communication. The basic idea of CR networks is that unlicensed devices (also called cognitive radio users or secondary users) must vacate the spectrum band once they detect licensed devices (also known as primary users, PUs). Simon Haykin defined CR as an intelligent wireless communication system that is aware of its environment and uses the methodology of understanding-by-building to learn from the environment and adapt to statistical variations in the input stimuli [Haykin, 2005]. Letaief presented a cognitive space-time-frequency coding technique that can opportunistically adjust its coding structure by adapting itself to the dynamic spectrum environment [Letaief and Zhang, 2009]. Soyeon Kim proposed a CR operational algorithm for mobile cellular systems that is applicable to the multiple-secondary-user environment [Kim and Sung, 2014]. These results showed that CR technology can significantly reduce interference to licensed users while maintaining a high probability of successful transmission in a CR ad hoc network.

There are few publications on applying CR to rail transit. Wu proposed a wireless cognitive model for the spectrum management of high-speed individual users and showed a small performance improvement in wireless communication [Wu et al., 2015]. Although using cognitive radio in high-speed railway has improved performance, many issues remain open:

(1) Most cognitive radio users sense the same environment, and each user is independent. They therefore compete with each other for spectrum resources, which leads to blind learning and frequent conflicts.
(2) Rail transit carries a large number of CR users. When every user senses the environment, the system bears a heavy workload and high computational complexity.
(3) Mutual competition and cooperation between CR users interfere not only with primary users, but also with the CR users themselves and their neighbors.
(4) Spectrum holes differ from one base station to another, so spectrum handoff inevitably occurs.

To address the above issues, we propose a novel model of a cognitive base station in this paper. Our proposed CBS attempts to use the bands authorized for railway without interrupting PUs. The CBS model should satisfy the following conditions:

(1) The CBS can forecast spectrum holes from its experience and assign spectrum to individual users within its coverage range. In this way, the computational complexity of the entire network can be reduced.
(2) Rail transit runs daily over a fixed route according to its timetable. CBSs can take advantage of these characteristics and cooperate with each other to forecast spectrum holes along the whole route.

This paper is organized as follows. We first introduce the concept of the cognitive base station and its mathematical model in Section 2. Section 3 then applies the CBS model with RL to the high-speed rail scenario and proposes the cooperation mechanism of multiple CBS agents. The experimental simulation results are given in Section 4. We conclude the paper in Section 5.

2 Cognitive Base Station Model

Our proposed CBS is deployed along the railway and works as a spectrum assigner. It learns from feedback received through interactions with an external environment and assigns spectrum to the passengers within its coverage range. We consider each CBS to be an agent with four spectrum management functions: spectrum sensing, spectrum mobility, spectrum decision and spectrum sharing [Chkirbene and Hamdi, 2015], [Lee and Akyildiz, 2012]. Fig. 2 gives the steps of the cognitive cycle within the CBS framework, which is formed by the spectrum-aware operations. Each CBS agent uses reinforcement learning to perform spectrum management. All of the agents can sense the environment, obtain their own current state of spectrum usage, and communicate with each other for the purpose of cooperation. They then make decisions according to their own state and the situation of the whole network, and use spectrum mobility to choose actions. Finally, each CBS agent sends its new state to its neighboring CBS agents.

Figure 2: The cognitive cycle of a cognitive base station (radio environment, spectrum sensing, spectrum decision, spectrum sharing, spectrum mobility).

We assume that our cognitive radio network along high-speed rail consists of a collection of CBS agents and CR user agents. Each CBS agent has its own PUs and available spectrums. The CBS agents decide on the choice of spectrum independently of the CR user agents in their range. A choice of spectrum by CBS agent $i$ is essentially the choice of a frequency $f^i \in F$. The CR user agents continuously monitor the spectrum that the CBS agent chooses in each time slot. We assume perfect sensing, in which a CBS agent correctly infers the presence of a PU if it lies within the PU's transmission range.

• Long-term Awareness of Spectrum Usage
  Characterizing the spectrum bands based on their activity, and in particular learning about the utilization of the channel, is a key function of the CR users. Online learning algorithms must be developed that allow the CBS agents to continuously gather information about their radio environment and construct a utilization function. Apart from simply classifying the spectrum as busy or available, it is beneficial if a probability distribution of the anticipated transmission/silent durations of the PUs can be derived. We propose a tightly integrated, reinforcement-learning-equipped link layer protocol to schedule the transmissions between CBS agents and CR user agents over time.
• End-to-End Learning
  Distributed networks rely on multihop forwarding of packets between a source-destination pair. Each CBS agent on this path learns of its own spectrum environment over time, and this information can be leveraged at the start and end points of the path to make optimal decisions regarding spectrum choices and routing options. As an example, spectrum switching costs incurred locally at a node affect end-to-end delays. While spectrum characteristics can be locally inferred, the specific choice of the spectrum at each link to minimize intra-path switching must be undertaken at the end points of the path. We explore ways to share the learning and spectrum awareness obtained by a node with its local neighbors, and subsequently over multiple hops to the destination. The cost of this learning and the benefits are investigated as part of this project.

3 SPECTRUM MANAGEMENT BASED COGNITIVE BASE STATION

3.1 The Q-Learning

Reinforcement learning, which is inspired by psychological learning theory from biology [Waltz and Fu, 1965], enables the agent to learn behavior through trial-and-error interactions with a dynamic environment [Sutton and Barto, 1998]. The classical reinforcement learning algorithm is Q-Learning, which proceeds as follows [Puterman, 1994]. On each step of interaction, the agent chooses an action according to the external environment and its current state. As a result, the action changes the environment and the agent receives a reward. The agent needs to develop a policy that maximizes the long-run measure of reinforcement.

The classic reinforcement learning algorithm is formulated as follows. At each time $t$, the agent perceives its current state $s_t \in S$ and the set of possible actions $A_{s_t}$. The agent chooses an action $a \in A_{s_t}$ and receives from the environment a new state $s_{t+1}$ and a reward $r_t$. Based on these interactions, the reinforcement learning agent must develop a policy $\pi : S \to A$ which maximizes the long-term reward $R = \sum_t \gamma^t r_t$ for MDPs, where $0 \le \gamma \le 1$ is a discounting factor for subsequent rewards. The long-term reward is the expected accumulated reward that the agent expects to receive in the future under the policy, which can be specified by a value function. In this way, Q-learning can calculate an update to its expected discounted reward, $Q(s_t, a_t)$, as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor such that $0 \le \gamma < 1$. The agent stores the state-action values in a table Q [Wu et al., 2010], [Jiang et al., 2011], [Bkassiny et al., 2013].

Recently, reinforcement learning has attracted increasing interest in the machine learning and artificial intelligence communities. Kadam and Srivastava applied Q-Learning to routing in a wireless sensor network scenario, to route data efficiently from one source to multiple mobile sinks [Kadam and Srivastava, 2012]. The algorithm was shown to extend the network lifetime.

3.2 Application to Cognitive Base Station

We illustrate the high-speed railway environment with CBS agents along the way in Fig. 3.

Figure 3: The cognitive base station within the high-speed-rail transportation.

We further model a cognitive radio network as consisting of a set of Cognitive Base Stations, denoted $CBS$, a set of primary users, denoted $PU$, and a set of available frequencies, denoted $SP$. We assume that the topological structure of a given network is fixed.

Spectrum holes vary due to the behavior of PUs, which changes the environment. CBS agents can perceive the states within the environment. The state of a CBS agent is the current spectrum of its transmission. The state of the multi-agent system includes the state of every CBS agent. We therefore define the state of the system at time $t$, denoted $s_t$, as

$$s_t = (\vec{sp})_t,$$

where $\vec{sp}$ is a vector of spectrums across all agents. Here $sp_i$ is the spectrum on the $i$-th agent and $sp_i \in \vec{SP}$. Normally, if there are $m$ spectrums, we can use an index to specify them. In this way, we have $\vec{SP} = \{SP_1, SP_2, \ldots, SP_m\}$.

At a particular time and in a particular state, the CBS takes an action according to its learning results, either switching channel or transmitting. At time $t$ we define $a_t = k$, where $k$ is the action that the CBS chooses at time $t$ and

$$k \in \{\text{switch\_to\_channel}_1, \text{switch\_to\_channel}_2, \ldots, \text{switch\_to\_channel}_m, \text{transmit\_data}\}.$$

Once the CBS agent has detected any active PU, it takes the channel-switching action. We use the Q table to store state-action values. At time $t$, if the state is $sp_t$ and the action is $k$, we can calculate the value $Q(sp_t, k)$ by the Q-learning update above. If a PU is detected, the CBS agent switches to the other available spectrum with the largest Q-value.

The reward is the estimate of spectrum availability on a CBS agent. Different network situations result in different rewards, as follows.

• CR-PU interference: If a PU's activity occurs in the spectrum shared by any CR user, and in the same slot selected for transmission, then a high penalty of −15 is assigned. The intuitive meaning of this is as follows: we can avoid collisions among the CR users through mediation by the CBS agents. However, concurrent use of the spectrum with a PU goes against the principle of protecting licensed devices, and hence must be strictly avoided.
• Successful Transmission: If none of the above conditions is observed to be true in the given transmission slot, then the packet is successfully transmitted from the sender to the receiver, and a reward of +5 is assigned, which is found experimentally to give the best results.

Algorithm 1 Pseudo code of Q-learning on CBS
Main()
  Initialize state s_t and action a_t and their Q value;
  repeat
    Q-learning(s_t, a_t, Q)
  until all episodes are traversed
Q-learning(s_t, a_t, Q)
  repeat
    Take action a_t, observe reward r_t, get next state s_{t+1}
    Get Q(s_t, a_t) from the Q-table;
    for all actions a* under new state s_{t+1} do
      Generate the state-action pair (s_{t+1}, a_{t+1}) from state s_{t+1} and action a*
      Get Q(s_{t+1}, a_{t+1}) from the Q-table;
    end for
    δ = r_t + γ * max Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)
    ΔQ = α * δ
    Q(s_t, a_t) = Q(s_t, a_t) + ΔQ
    s_t = s_{t+1}
    if random probability ≤ ε then
      a_t = random action
    else
      for all actions a* under current state s_t do
        a_t = argmax_a Q(s_t, a)
      end for
    end if
  until s_t is terminal

Figure 4: The Q-learning process on CBS model.

Once a primary user is detected, a harsh penalty is given. Otherwise, a positive reward is assigned. Fig. 4 illustrates the proposed process, and Algorithm 1 describes our algorithm for implementing Q-learning on a CBS agent.

4 EXPERIMENTAL SIMULATION

4.1 Experimental Design

In this section, we describe preliminary results from applying our reinforcement learning based approach to the cognitive radio model. Correct detection of the PUs is a necessary prerequisite. The overall aim of our proposed learning based approach is to allow the CBS agents to decide on an optimal choice of spectrum so that (i) PUs are not affected, and (ii) CR users share the spectrum in a fair manner. These two rules simulate the behavior of the public in the urban rail transit environment. That is, bands that are allocated to licensed users are rarely utilized, because of open areas or relatively closed environments, and the public can opportunistically use band resources with equal probability.

Our CBS network simulator within the framework of high-speed rail has been designed to investigate the effect of the proposed reinforcement learning technique on the network operation. The implemented ns-2 model is composed of several modifications to the physical, link and network layers in the form of stand-alone C++ modules. The PU Activity Block describes the activity of PUs based on the on-off model, including their transmission range, location, and spectrum band of use. The Channel Block contains a channel table with the background noise, capacity, and occupancy status. The Spectrum Sensing Block implements the energy-based sensing functionalities, and if a PU is detected, the Spectrum Management Block is notified. This, in turn, causes the device to switch to the next available channel and alerts the upper layers to the change of frequency. The Spectrum Sharing Block coordinates the distributed channel access and calculates the interference at any given node due to the ongoing transmissions in the network. The Cross Layer Repository facilitates information sharing between the different protocol stack layers.

We conduct our experiment in the following scenario: there are 2 trains, each carrying 21 passengers, and 5 CBS agents beside the railway. The average speed of the trains is 10 m/s. There are 10 primary users in the range of each CBS. The activity of the primary users follows the ON-OFF model, and each primary user is assigned a spectrum randomly from 5 spectrums (small network) or 10 spectrums (large network). The CBS agent senses the spectrum holes every 0.1 second and assigns available spectrum to the CR user agents. The simulation parameters are summarized in Table 1.

4.2 Experimental results

We compare the performance of our CBS with reinforcement learning scheme (CBS-RL) against the CBS with Round-Robin scheme (CBS-RR), which is the typical approach in the GSM-R system. The Round-Robin (RR) scheme employs the principle that once a spectrum is not available, the agent switches to the next channel in equal portions and in circular order, handling all switches without priority (also known as cyclic executive). This method is simple, easy to implement, and starvation-free.
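As a rough illustration of the two switching rules compared here, the following Python sketch contrasts a round-robin step with a Q-value-based choice, and adds the tabular Q-learning update from Section 3.1. The function names, the list-based Q-table layout, the example Q-values and the value of gamma are assumptions made purely for illustration; the simulator described in this paper is an ns-2/C++ model, not this code.

```python
# Illustrative sketch only: names and data layout are hypothetical.

REWARD_SUCCESS = 5.0   # reward for a successful transmission (Section 3.2)
PENALTY_PU = -15.0     # penalty for CR-PU interference (Section 3.2)


def rr_next_channel(current, busy, m):
    """Round-robin: try the following channels in circular order."""
    for step in range(1, m):
        candidate = (current + step) % m
        if candidate not in busy:
            return candidate
    return current  # no other free channel; stay on the current one


def rl_next_channel(current, busy, q_row):
    """Pick the other free channel with the largest Q-value."""
    free = [c for c in range(len(q_row)) if c != current and c not in busy]
    if not free:
        return current
    return max(free, key=lambda c: q_row[c])


def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_error = reward + gamma * max(q[next_state]) - q[state][action]
    q[state][action] += alpha * td_error
    return q[state][action]


if __name__ == "__main__":
    busy = {0, 1}                        # channels currently held by PUs
    q_row = [0.0, -2.0, 1.5, -0.5, 3.0]  # made-up Q-values for 5 channels
    print(rr_next_channel(0, busy, 5))   # next free channel in circular order
    print(rl_next_channel(0, busy, q_row))
```

The round-robin rule ignores history and simply takes the next free channel, whereas the Q-value rule prefers the channel that has accumulated the best reward estimate. This difference is what the channel-switch counts in the comparison are meant to capture.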
In our RL-based scheme, the exploration rate ε is set to 0.2, which we found experimentally to give the best results. The initial learning rate α is set to 0.8, and it is decreased by a scaling factor of 0.995 after each time slot.

Table 1: Simulation Parameters

Parameters                          Values
Topology size                       X: 7000 m, Y: 500 m
Number of passengers                42
Number of primary users             50
Number of cognitive base stations   5
Speed                               10 m/s
Number of spectrums                 6
Bandwidth                           2000000 Hz
Simulation time                     1000 s

Figure 5: CBS simulations with RL and RR schemes. (a) An example about the distribution of spectrums occupancy on CBS with 5 spectrums (Channels 1 to 5 versus the number of epochs). (b) Average rewards for 5 spectrum bands. (c) Average rewards for 10 spectrum bands. (d) Cumulative number of channel switching for 5 spectrum bands (channel switches versus epoch, CBS-RR and CBS-RL). (e) Cumulative number of channel switching for 10 spectrum bands (channel switches versus epoch, CBS-RR and CBS-RL).

Figure 5(a) shows an example of the distribution of spectrum occupancy on the CBS with 5 spectrums. Spectrum occupancy on the CBS follows the ON-OFF model: the ON mode follows a normal distribution with parameter µ = 25, and the OFF mode follows an exponential distribution with parameter β, the value of which is randomly generated.

Figures 5(b) and 5(c) show the average rewards received by the CBS agent across all spectrums using the CBS-RL scheme. The result in Figure 5(b) shows that after learning over 1000 epochs, Channel 5 receives the largest positive reward of approximately +5.5, while Channels 1, 2, 3 and 4 get rewards of approximately −11.8, +0.7, −5.1 and +3.3, respectively. The results indicate that our approach pushes the CBS agents to gradually achieve higher positive rewards and choose more suitable spectrum for their transmission. The results also indicate that the reward tends to follow the distribution of spectrum occupancy. A similar trend is observed in Figure 5(c), with Channel 10 receiving the highest average reward of approximately +5.2.

Figures 5(d) and 5(e) show the cumulative number of channel switches using the CBS-RL and CBS-RR schemes. Figure 5(d) shows the average number of channel switches for the small topology. We observe that after learning, the CBS-RL scheme reduces the number of channel switches to 5, while CBS-RR stays at approximately 12. For the large topology in Figure 5(e), the CBS-RL scheme reduces the number of channel switches to 6, while CBS-RR stays at approximately 23. The results indicate that our proposed CBS-RL approach keeps the number of channel switches lower than the CBS-RR approach and converges to an optimal solution.

5 CONCLUSIONS

To address the issues of frequent channel switching and inefficient blind learning in high-speed rail, we propose the concept of a Cognitive Base Station, which is capable of forecasting spectrum holes and assigning spectrum to individual users. Our simulation results show that after autonomous learning, the CBS-RL scheme can forecast spectrum holes. In this way, our proposed model can significantly improve the performance of vehicular communication by decreasing cell switching and unsuccessful transmissions.

References

[Ai et al., 2014] Bo Ai, Xiang Cheng, Thomas Kurner, Zhangdui Zhong, Ke Guan, Ruisi He, Lei Xiong, David W Matolak, David G Michelson, and Cesar Briso-Rodriguez. Challenges toward wireless communications for high-speed railway. Intelligent Transportation Systems, IEEE Transactions on, 15(5):2143–2158, 2014.

[Alkayal and Saada, 2013] Fisal Alkayal and Johnny Bou Saada. Compact three phase inverter in silicon carbide technology for auxiliary converter used in railway applications. In Power Electronics and Applications (EPE), 2013 15th European Conference on, pages 1–10. IEEE, 2013.

[Bkassiny et al., 2013] Mario Bkassiny, Yang Li, and Sudharman K Jayaweera. A survey on machine-learning techniques in cognitive radios. Communications Surveys & Tutorials, IEEE, 15(3):1136–1159, 2013.

[Chkirbene and Hamdi, 2015] Zina Chkirbene and Noureddine Hamdi. A survey on spectrum management in cognitive radio networks. International Journal of Wireless and Mobile Computing, 8(2):153–165, 2015.

[Commission and others, 2003] Federal Communications Commission et al. Facilitating opportunities for flexible, efficient, and reliable spectrum use employing cognitive radio technologies. Et docket, (03-108):05–57, 2003.

[Dudoyer et al., 2012] Stephen Dudoyer, Virginie Deniau, Ricardo Adriano, MN Ben Slimen, Jean Rioult, Benoît Meyniel, and Marion Berbineau. Study of the susceptibility of the gsm-r communications face to the electromagnetic interferences of the rail environment. Electromagnetic Compatibility, IEEE Transactions on, 54(3):667–676, 2012.

[Dybala and Radkowski, 2013] Jacek Dybala and Stanislaw Radkowski. Reduction of doppler effect for the needs of wayside condition monitoring system of railway vehicles. Mechanical Systems and Signal Processing, 38(1):125–136, 2013.

[Haykin, 2005] Simon Haykin. Cognitive radio: brain-empowered wireless communications. Selected Areas in Communications, IEEE Journal on, 23(2):201–220, 2005.

[isheng Zhao et al., 2013] isheng Zhao, Xi Li, Yi Li, and Hong Ji. Resource allocation for high-speed railway downlink mimo-ofdm system using quantum-behaved particle swarm optimization. In Communications (ICC), 2013 IEEE International Conference on, pages 2343–2347. IEEE, 2013.

[Jiang et al., 2011] Tianzi Jiang, David Grace, and Paul D Mitchell. Efficient exploration in reinforcement learning-based cognitive radio spectrum sharing. Communications, IET, 5(10):1309–1317, 2011.

[Kadam and Srivastava, 2012] Kaveri Kadam and Navin Srivastava. Application of machine learning (reinforcement learning) for routing in wireless sensor networks (wsns). In Physics and Technology of Sensors (ISPTS), 2012 1st International Symposium on, pages 349–352. IEEE, 2012.

[Kim and Sung, 2014] Soyeon Kim and Wonjin Sung. Operational algorithm for wireless communication systems using cognitive radio. In Communication, Networks and Satellite (COMNETSAT), 2014 IEEE International Conference on, pages 29–33. IEEE, 2014.

[Lee and Akyildiz, 2012] Won-Yeol Lee and Ian F Akyildiz. Spectrum-aware mobility management in cognitive radio cellular networks. Mobile Computing, IEEE Transactions on, 11(4):529–542, 2012.

[Letaief and Zhang, 2009] Khaled Ben Letaief and Wei Zhang. Cooperative communications for cognitive radio networks. Proceedings of the IEEE, 97(5):878–893, 2009.

[Li and Zhao, 2012] Jinxing Li and Youping Zhao. Radio environment map-based cognitive doppler spread compensation algorithms for high-speed rail broadband mobile communications. EURASIP Journal on Wireless Communications and Networking, 2012(1):1–18, 2012.

[Li et al., 2013] Ying Li, Lei Lei, Zhangdui Zhong, and Siyu Lin. Performance analysis for high-speed railway communication network using stochastic network calculus. In Wireless, Mobile and Multimedia Networks (ICWMMN 2013), 5th IET International Conference on, pages 100–105. IET, 2013.

[Liu et al., 2011] Qiuyan Liu, Miao Wang, and Zhangdui Zhong. Statistics of capacity analysis in high speed railway communication systems. Tamkang Journal of Science and Engineering, 14(3):209–215, 2011.

[Liu et al., 2012] Liu Liu, Cheng Tao, Jiahui Qiu, Houjin Chen, Li Yu, Weihui Dong, and Yao Yuan. Position-based modeling for wireless channel on high-speed railway under a viaduct at 2.35 ghz. Selected Areas in Communications, IEEE Journal on, 30(4):834–845, 2012.

[Puterman, 1994] M. L. Puterman. Markov decision processes. In Wiley, 1994.

[Sniady and Soler, 2012] Aleksander Sniady and José Soler. An overview of gsm-r technology and its shortcomings. In ITS Telecommunications (ITST), 2012 12th International Conference on, pages 626–629. IEEE, 2012.

[Sutton and Barto, 1998] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. Bradford Books, 1998.

[Tian et al., 2012] Lin Tian, Juan Li, Yi Huang, Jinglin Shi, and Jihua Zhou. Seamless dual-link handover scheme in broadband wireless communication systems for high-speed rail. Selected Areas in Communications, IEEE Journal on, 30(4):708–718, 2012.

[Waltz and Fu, 1965] M. D. Waltz and K. S. Fu. A heuristic approach to reinforcement learning control systems. In IEEE Transactions on Automatic Control, 10:390–398, 1965.

[Wu et al., 2010] Cheng Wu, Kaushik Chowdhury, Marco Di Felice, and Waleed Meleis. Spectrum management of cognitive radio using multi-agent reinforcement learning. In Proceedings of the 9th International Conference on Autonomous Agents and Multiagent Systems: Industry track, pages 1705–1712. International Foundation for Autonomous Agents and Multiagent Systems, 2010.

[Wu et al., 2015] Cheng Wu, Yiming Wang, Xiang Qiang, and Zhaoyang Zhang. Adaptive spectrum management of cognitive radio in intelligent transportation system. In Applied Mechanics and Materials, volume 743, pages 765–773. Trans Tech Publ, 2015.

[Zhang et al., 2012] Jiayi Zhang, Zhenhui Tan, Xiaoxi Yua, Haibo Wang, and Linwen Zhang. Review of public broadband access systems for high-speed railways and key technologies. Journal of the China Railway Society, 34(1):46–53, 2012.

[Zhou and Ai, 2014] Yuzhe Zhou and Bo Ai. Quality of service improvement for high-speed railway communications. Communications, China, 11(11):156–167, 2014.

[Zhu et al., 2013] Xiangqian Zhu, Shanzhi Chen, Haijing Hu, Xin Su, and Yan Shi. Tdd-based mobile communication solutions for high-speed railway scenarios. Wireless Communications, IEEE, 20(6):22–29, 2013.