Convolutional Expectation Maximization for Population Estimation Andrea Pomente (1) and David Aleandri (1) (1) University of Rome Tor Vergata, Rome, Italy {pomente.andrea,davidaleandri}@gmail.com Abstract. There is a fundamental spatial mismatch in the data avail- able for estimate population from satellite imagery. Spectral reflectances are available for each pixel of an image, but ground reference population data are available only for larger zones, therefore satellite imagery has bigger resolution than ground reference images. The general response has been to build models for the average population density of the zones, utilizing spatially aggregated spectral data. This article reports a new approach to solve this problem where per pixel spectral data are used. The already used expectation maximization algorithm (EM) [1] is paired with a convolutional neural network to improve the resolution of a pre- existent population ground truth provided by the GHS POPULATION GRID (LDS) [5]. We start with the satellite imagery by Sentinel-2 mis- sion and, the regression model we have built, upscales the LDS dataset to 10 meters resolution, the same as Sentinel-2 images. As you can see in Table 1, we obtained an AvgRelDelta of 50.05 for Uganda’s rural area and 96.31 for the city of Lusaka in Zambia, while our method scored 87.57 in Overall. This results are computed from a test dataset provided for ImageCLEF 2017 Population Estimation Task [8]. Keywords: Convolutional Neural Networks, Information Retrieval, Im- age Retrieval, Copernicus data, Automatic Population Estimation, Fab- Space 2.0, CLEF 1 Introduction There is a substantial literature on the application of orbital remote sensing to the estimation of human population and population growth. The levels of accu- racy achieved in these studies has been limited by several factors [1–3]. First, the relationship between human habitation and the spectral characteristics of land surface and land cover is inherently indirect and inexact. Second, the problem of estimating a quantitative variable like population density across the spatial dimensions of an image is intrinsically more difficult than the more usual qualita- tive objectives of remote sensing analysis, such as segmentation or classification. And third, while the resolution of remotely sensed imagery is quite adequate for most demographic purposes, ground reference population data for model devel- opment and training is generally only available at a much lower resolution, either for census-related areal units. To date, the response to this problem of spatial incompatibility of data has been to aggregate the finer resolution spectral data up to the geographical level of the available ground reference populations, and then to build regression models for estimating the population density of these larger spatial aggregate units [4]. Aggregated remote sensing predictor variables have included mean reflectances of individual spectral bands, numbers of pix- els in various land-use categories, measures of variability and image texture and various band-to-band ratios and other mathematical transformations of the mul- tispectral data. The availability of demographic data defined the spatial level at which modeling took place (either the elements of a regular grid or some form of census district), and the remote sensing data were spatially aggregated. The aggregation took various forms: averages of individual spectral bands or band ratios, measures of variability and texture, counts or proportions of pixels in different land use classes, and so on. By contrast, this paper examines several potential advantages to be gained by modeling at the level of single pixels rather than larger spatial aggregates. 2 Studied areas and data This work was based on two areas centered on the Lusaka city in Zambia and on a rural area in Uganda. The satellite imagery is provided by Sentinel-2 mission, in particular the 10 meters resolution RGB and Near-infrared bands are used. This choice was made because of the highest resolution respect to other bands. The LDS was recently produced from Landsat imagery (30 meters resolution) collections through automatic analysis of satellite imagery to produce unprece- dented fine scale maps quantifying built-up structures in terms of their location and density. The image processing technology exploits structure (texture, mor- phology, and pattern) as key information. Population estimates were produced and made available for processing by the Center for International Earth Science Information Network (CIESIN)1 . These estimates consist of country based layers (one for each of 241 countries) of census and administrative polygons containing estimated residential population. The LDS resolution is 250 meters and in order to use in this work the following steps are done: – The LDS images are re-projected to the same geo-referentiation (UTM 35- South) of Sentinel-2 imagery – The image is stretched to fit the same dimensions of Sentinel-2 imagery through nearest neighbor interpolation algorithm – Each pixel has 10x10 meters of resolution, but the value is still related to an area of 250x250 meters. To correct this problem, an initial redistribution operation is applied following this equation: p250x250 p10x10 i,j = ; i = 1, . . . 25; j = 1, . . . 25; n = 25 · 25 (1) n 1 http://www.ciesin.org/ where p10x10 i,j is the pixel value in the new resampled image, obtained in the second step, n is the number of 10x10 resolution pixels inside each 250x250 resolution pixel and p250x250 is the pixel value in original LDS imagery, with: 25 X X 25 10x10 pi,j = p250x250 (2) i=1 j=1 The equation (1) is performed for each pixel in original LDS imagery. 3 Convolutional EM The method described in this paper assumes that the relation between pixel reflectance and the number of population is non-linear. To make this relation a convolutional neural network (CNN) [6] is defined, it takes as input the pixel values (RGB and near infrared) and returns the population estimate. The CNN is composed of two convolutional layers with kernel size of 2 and 64 feature maps with strides equal to 1 and a ReLU activation function. Each convolutional layer is followed by a batch normalization layer. The final layers are fully connected; the first one has 128 ReLU neurons and the last one has only a neuron with linear activation function. We can now refine our initial assigned pixel population estimates, by redistribut- ing population within each pixel away from underestimated pixels and towards overestimated pixels while maintaining the known image total. Intuition sug- gests [1], that the optimal redistribution, which minimizes the sum of squared residues on the regression line and simultaneously holds the sum of the constant p values, is obtained by making all residuals equal, by adjusting the estimated population as follows: pi,j,adj = pi,j,pred + r (3) where pi,j,pred is the previous iteration population prediction make by CNN described above and r is defined by: Pm Pm i=1 j=1 (pi,j − pi,j,pred ) r= (4) m·m where m · m is the total number of pixels in the input image. This procedure is iterated many times to obtain a redistribution of pixel’s pop- ulation to minimize a RMS loss function. To train the CNN at each iteration, Adam [9] is used as optimizer algorithm with default parameters. The iteration process is stopped analyzing the values of R2 coefficient. The whole process is written in Python using the Keras library [10] with Ten- sorflow backend. The output of this process is a new image, with the same size of input, representing the new population for each pixel, also the geodata from the geotiff input is reused for the output. Each area of interest is represented by a shape file. To extract the population of each area we used QGIS importing the shape file and the output image from the CNN. 4 Results The model was tested using the ImageCLEF 2017 Remote Task [8] evaluation system. This pilot task is part of the imageCLEF 2017 Labs [7] and was intro- duced this year. It aims at investigating the use of non commercial satellite data as a free and quicker process to estimate the population of an area of interest. The following result were obtained: Team Name Team Country Geographic Zone Sum Delta RMSE Pearson AvgRelDelta AndreaDavid Italy UGD 18,485 1,816 0.76 50.05 AndreaDavid Italy ZMB 1,465,603 30,480 0.08 96.31 AndreaDavid Italy Overall 1,484,088 2,7462 0.21 87.57 Table 1. Population estimation. UGD stands for Uganda while ZMB is for Zambia region. Overall is when considering both regions all together Looking at the AvgRelDelta metric in the table above, our model performs better for rural area (UGD) than for Lusaka city(ZMB). A possible reason why we have obtained this different can be found in the better initial estimation in LDS ground truth for UGD respect the ZMB area. Considering the short period of development we forced due dead line, we trust the method proposed in this article is promising. It can be surely improved by using better resolution imagery and making more tests on CNN having a test dataset. Acknowledgments This work is part of FabSpace project that received funding from the European Unions Horizon 2020 Research and Innovation programme under the Grant Agreement n693210. https://www.fabspace.eu/ References 1. Jack T. Harvey: Population Estimation Models Based on Individual TM Pixels. Photogrammetric Engineering & Remote Sensing. Vol. 68, No. 11, November 2002, pp. 1181-1192 (2002) 2. Iisaka, J. and E. Hegedus: Population estimation from Landsat imagery. Remote Sensing of Environment, 12:259-272 (1982) 3. Langford, M., D.J. Maguire, and D.J. Unwin: The areal interpolation problem: Estimating population using remote sensing within a GIs framework. Handling Ge- ographical Information: Methodology and Potential Applications (I. Masser and M. Blakemore, editors), Longman, London, U.K., pp. 55-77 (1991) 4. Webster, C.J.: Population and dwelling unit estimates from space. Third World Planning Review, 18(2):155-176 (1996) 5. European Commission, Joint Research Centre (JRC); Columbia University, Center for International Earth Science Information Network - CIESIN (2015): GHS popu- lation grid, derived from GPW4, multitemporal (1975, 1990, 2000, 2015). European Commission, Joint Research Centre (JRC) [Dataset] PID: http://data.europa. eu/89h/jrc-ghsl-ghs_pop_gpw4_globe_r2015a 6. Matan, O. and Kiang, R. K. and Stenard, C. E. and Boser, B. and Denker, J. S. and Henderson, D. and Howard, R. E. and Hubbard, W. and Jackel, L. D. and LeCun, Y.: Handwritten character recognition using neural network architectures. Proc. of the 4th US Postal Service Advanced Technology Conference (1990) 7. B. Ionescu, H. Muller, M. Villegas, H. Arenas, G. Boato, D.-T. Dang-Nguyen, Y. Dicente Cid, C. Eickhoff, A. Garcia Seco de Herrera, C. Gurrin, B. Islam, V. Ko- valev, V. Liauchuk, J. Mothe, L. Piras, M. Riegler, and I. Schwall. Overview of ImageCLEF 2017: Information extraction from images. In CLEF 2017 Proceedings, volume 10456 of Lecture Notes in Computer Science, Dublin, Ireland, September 11-14 2017. Springer. 8. Arenas, Helbert and Islam, Bayzidul and Mothe, Josiane: Overview of the Image- CLEF 2017 Population Estimation Task. CLEF 2017 Labs Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, September 11-14 2017, Dublin, Ireland. 9. Kingma, D. P., and Ba, J. L. (2015). Adam: a Method for Stochastic Optimization. International Conference on Learning Representations, 113. 10. Chollet François and others: Keras, https://github.com/fchollet/keras (2015)