=Paper=
{{Paper
|id=Vol-3207/paper6
|storemode=property
|title=METER-ML: A Multi-Sensor Earth Observation Benchmark for Automated Methane Source Mapping
|pdfUrl=https://ceur-ws.org/Vol-3207/paper6.pdf
|volume=Vol-3207
|authors=Bryan Zhu,Nicholas Lui,Jeremy Irvin,Jimmy Le,Sahil Tadwalkar,Chenghao Wang,Zutao Ouyang,Frankie Y. Liu,Andrew Y. Ng,Robert B. Jackson
|dblpUrl=https://dblp.org/rec/conf/cdceo/ZhuLILT0OLNJ22
}}
==METER-ML: A Multi-Sensor Earth Observation Benchmark for Automated Methane Source Mapping==
Bryan Zhu1,*,†, Nicholas Lui2,†, Jeremy Irvin1,†, Jimmy Le1, Sahil Tadwalkar3, Chenghao Wang4, Zutao Ouyang4, Frankie Y. Liu4, Andrew Y. Ng1 and Robert B. Jackson4,5

1 Department of Computer Science, Stanford University
2 Department of Statistics, Stanford University
3 Department of Civil and Environmental Engineering, Stanford University
4 Department of Earth System Science, Stanford University
5 Woods Institute for the Environment and Precourt Institute for Energy, Stanford University

CDCEO 2022: 2nd Workshop on Complex Data Challenges in Earth Observation, July 25, 2022, Vienna, Austria
* Corresponding author.
† These authors contributed equally.
Contact: bwzhu@cs.stanford.edu (B. Zhu); niclui@stanford.edu (N. Lui); jirvin16@cs.stanford.edu (J. Irvin); jimmyle@cs.stanford.edu (J. Le); stadwalk@stanford.edu (S. Tadwalkar); chenghao.wang@stanford.edu (C. Wang); ouyangzt@stanford.edu (Z. Ouyang); frankliu@stanford.edu (F. Y. Liu); ang@cs.stanford.edu (A. Y. Ng); Rob.Jackson@stanford.edu (R. B. Jackson)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

===Abstract===

Reducing methane emissions is essential for mitigating global warming. To attribute methane emissions to their sources, a comprehensive dataset of methane source infrastructure is necessary. Recent advancements with deep learning on remotely sensed imagery have the potential to identify the locations and characteristics of methane sources, but there is a substantial lack of publicly available data to enable machine learning researchers and practitioners to build automated mapping approaches. To help fill this gap, we construct a multi-sensor dataset called METER-ML containing 86,599 georeferenced NAIP, Sentinel-1, and Sentinel-2 images in the U.S. labeled for the presence or absence of methane source facilities including concentrated animal feeding operations, coal mines, landfills, natural gas processing plants, oil refineries and petroleum terminals, and wastewater treatment plants. We experiment with a variety of models that leverage different spatial resolutions, spatial footprints, image products, and spectral bands. We find that our best model achieves an area under the precision-recall curve of 0.915 for identifying concentrated animal feeding operations and 0.821 for oil refineries and petroleum terminals on an expert-labeled test set, suggesting the potential for large-scale mapping. We make METER-ML freely available at https://stanfordmlgroup.github.io/projects/meter-ml to support future work on automated methane source mapping.

Keywords: Earth observation, remote sensing, machine learning, deep learning, dataset, climate change, methane

===1. Introduction===

Figure 1: METER-ML is a multi-sensor dataset containing 86,599 examples of NAIP aerial imagery, Sentinel-2 satellite imagery, and Sentinel-1 satellite imagery. We include 19 spectral bands across these three products, with the RGB and VH&VV bands shown here. Each example is georeferenced and labeled with the presence or absence of six different methane source facility types. A small number of examples are labeled to contain facilities from more than one category, and 34,870 examples contain no facilities from the six categories.

Anthropogenic methane emissions are the main contributor to the rise of atmospheric methane [1], and mitigating methane emissions is widely recognized as crucial for slowing global warming and achieving the goals of the Paris Agreement [2]. Multiple satellites are in orbit or launching soon which will measure methane emissions from the surface using top-down approaches, but in order to attribute these emissions to specific sources on the ground, a comprehensive database of methane-emitting infrastructure is necessary [3]. Although several public databases of this infrastructure exist, the data available globally is incomplete, erroneous, and unaggregated.

AI approaches on Earth observation data have the potential to fill this gap. Several recent works have developed deep learning models to automatically interpret remotely sensed imagery and deploy them at scale to map infrastructure [4, 5, 6, 7]. Methods for mapping methane source infrastructure have been emerging as well, including well pads in the Denver basin [8], oil refineries and concentrated animal feeding operations in the U.S. [9, 10], and wastewater treatment plants in Germany [11]. Each of these works depended on the curation of large, labeled datasets to develop the machine learning models, but there is a lack of publicly available, labeled Earth observation data, specifically on methane-emitting infrastructure, which prohibits researchers and practitioners from building automated mapping approaches.
In this work, we construct a multi-sensor Earth observation dataset for methane source infrastructure identification called METER-ML. In support of a new initiative to build a global database of methane-emitting infrastructure called the MEthane Tracking Emissions Reference (METER) [12], we develop METER-ML to allow the machine learning community to experiment with multi-view/multi-modal modeling approaches to automatically identify this infrastructure in remotely sensed imagery. METER-ML includes georeferenced imagery from three remotely sensed image products, specifically 19 spectral bands in total from NAIP, Sentinel-1, and Sentinel-2, capturing 51,729 sources of methane from six different classes as well as 34,870 negative examples (Figure 1). The dataset includes expert-reviewed validation and test sets for robustly evaluating the performance of derived models. Using the dataset, we experiment with a variety of convolutional neural network models which leverage different spatial resolutions, spatial footprints, image products, and spectral bands. The dataset is freely available at https://stanfordmlgroup.github.io/projects/meter-ml in order to encourage further work on developing and validating methane source mapping approaches.

Table 1: Counts and proportions of each category in METER-ML. The labels on the training set are obtained from public data, whereas the labels on the validation and test sets are obtained from a consensus of two methane source identification experts. The individual category counts do not add up to the overall train/valid/test counts because some (0.8%) of the positive examples are labeled with more than one methane source category.

Category      Train (%)       Valid (%)     Test (%)      Total
CAFOs         24957 (29.3%)   47 (9.1%)     92 (9.0%)     25096
Landfills     4085 (4.8%)     46 (8.9%)     111 (10.9%)   4242
Coal Mines    1776 (2.1%)     40 (7.8%)     72 (7.1%)     1888
Proc Plants   1900 (2.2%)     38 (7.4%)     107 (10.5%)   2045
R&Ts          4012 (4.7%)     59 (11.5%)    108 (10.6%)   4179
WWTPs         14519 (17.1%)   46 (8.9%)     129 (12.7%)   14694
Negatives     34195 (40.2%)   249 (48.3%)   426 (41.8%)   34870
Total         85066           515           1018          86599

===2. Methods===

====2.1. Methane source locations====

We collect locations of methane-emitting infrastructure in the U.S. from a variety of public datasets. We focus on the U.S. in this study due to the high availability of publicly accessible infrastructure data and remotely sensed imagery. The infrastructure categories we include are concentrated animal feeding operations (CAFOs), coal mines (Mines), landfills (Landfills), natural gas processing plants (Proc Plants), oil refineries and petroleum terminals (including crude oil and liquified natural gas terminals), and wastewater treatment plants (WWTPs). We group oil refineries and petroleum terminals together due to their high similarity in appearance, and refer to that category as "Refineries & Terminals" (R&Ts). These infrastructure categories were chosen based on their potential for emitting methane along with their consistent, visible differentiating features, which make them feasible to identify in high-resolution remotely sensed imagery. The locations are obtained from 18 different publicly available datasets, all of which have licenses that allow redistribution (see Table 6 in the Appendix). As various datasets may contain the same locations of infrastructure, we deduplicate by considering locations within 500m of each other identical (see the sketch below). In total we include 51,729 unique locations of methane source infrastructure in the dataset, which we refer to as positive examples.
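The paper does not specify how the 500m deduplication was implemented, so the following is only a minimal sketch of one straightforward reading: a greedy pass that keeps a location only if it is farther than the threshold from every location kept so far, using haversine distances.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def deduplicate(points, threshold_m=500.0):
    """Greedily keep a point only if it is farther than threshold_m from
    every point already kept. Quadratic in the number of points; at the
    scale of METER-ML a spatial index (e.g. a k-d tree) would be needed."""
    kept = []
    for lat, lon in points:
        if all(haversine_m(lat, lon, klat, klon) > threshold_m for klat, klon in kept):
            kept.append((lat, lon))
    return kept

# Hypothetical usage: two records of the same facility ~100 m apart collapse to one.
print(deduplicate([(36.0010, -119.0010), (36.0019, -119.0010), (36.1000, -119.3000)]))
```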
====2.2. Negative locations====

We additionally include a variety of images in the dataset which capture none of the six methane-emitting facility types. To do this, we define around 50 classes (see Appendix) of different facilities and landscapes and select characteristic examples of each class. We then collect locations containing similar facilities and landscapes using the Descartes Labs GeoVisual Search [13], providing up to 1,000 similar locations per example. A sample of the similar locations was manually vetted in each case to ensure that no locations obtained actually corresponded to the six methane source categories. In total we include 34,870 locations of facilities and landscapes which are not any of the six infrastructure categories, and refer to these as negative examples. The counts and proportions of the positive and negative classes in the dataset are shown in Table 1.

====2.3. Remotely sensed imagery====

We pair all of the locations in the dataset with three publicly available remotely sensed image sources. Specifically, we include aerial imagery from the USDA National Agriculture Imagery Program (NAIP) as well as satellite imagery captured by Sentinel-1 (S1) and Sentinel-2 (S2). NAIP imagery covers the contiguous U.S., and S1 and S2 imagery both have global coverage. For NAIP we use 1m resolution imagery, for Sentinel-2 we use the L1C product at 10m resolution, and for Sentinel-1 we use the Sigma Nought Backscatter product at 10m resolution. We use all spectral bands from each product. Specifically, we use the three visible (RGB) and single near-infrared (NIR) bands from NAIP and S2; the single coastal aerosol (CA) band, four red-edge (RE1-4) bands, single water vapor (WV) band, single cirrus (C) band, and two shortwave infrared (SWIR1-2) bands from S2; and the V-transmit (VH and VV) bands from S1. We include S1 and S2 in the dataset in order to enable experimenting with coarser-resolution satellite imagery which is globally available, unlike NAIP. The details of each imagery product and band are shown in Table 2.

Table 2: Summary of the remotely sensed image products and bands included in METER-ML. RGB are the three visible bands, NIR is a single near-infrared band, RE1-4 are the four red-edge bands, SWIR1-2 are the two shortwave infrared bands, CA is the single coastal aerosol band, WV is the single water-vapor band, C is the single cirrus band, and VH & VV are the two V-transmit bands.

Product     Bands            Image Size  Resolution
NAIP        RGB & NIR        720x720     1m
Sentinel-2  RGB & NIR        72x72       10m
Sentinel-2  RE1-4 & SWIR1-2  36x36       20m
Sentinel-2  CA & WV & C      12x12       60m
Sentinel-1  VH & VV          72x72       10m

In order to construct images containing each location in the dataset, we consider a 720m x 720m footprint centered around the location (illustrated in the sketch after this subsection). This footprint was chosen to balance the size of the images with the contextual information, and we investigate this choice in the experiments. Due to the geographic coordinate noise in the publicly available datasets, we chose to center the imagery at the locations, which increases the likelihood that the facilities are captured in the imagery while still retaining natural variation in where facilities fall within the imagery. We construct a mosaic of the most recently captured pixels in a time range for each image product, where we consider NAIP images captured between 2017 and 2021 and Sentinel-1 and Sentinel-2 images captured between May and September 2021, with Sentinel-2 images selected based on lowest cloud cover. We use the Descartes Labs platform to download all of the imagery [14].

The total dataset contains 86,599 images capturing 19 spectral bands across the three imagery products. Information about the remotely sensed image products and bands included in the dataset is provided in Table 2, and characteristic examples for each methane source category are shown in Figure 2 in the Appendix.
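The imagery itself was tiled and downloaded with the Descartes Labs platform, so the footprint construction below is only an illustrative sketch: it computes an approximate 720m x 720m bounding box around a (lat, lon) center using a local equirectangular (meters-per-degree) approximation, which is adequate at this scale away from the poles.

```python
import math

def footprint_bbox(lat, lon, size_m=720.0):
    """Approximate (min_lon, min_lat, max_lon, max_lat) for a square
    footprint of side size_m centered on the given point."""
    half = size_m / 2.0
    m_per_deg_lat = 111320.0                                # ~ meters per degree of latitude
    m_per_deg_lon = 111320.0 * math.cos(math.radians(lat))  # shrinks with latitude
    dlat = half / m_per_deg_lat
    dlon = half / m_per_deg_lon
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)

# A 720 m x 720 m box at 1 m resolution corresponds to a 720 x 720 NAIP image.
print(footprint_bbox(37.4275, -122.1697))
```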
====2.4. Validation and test sets====

Two Stanford University postdoctoral researchers with expertise in methane emissions and related infrastructure individually reviewed 1,533 examples to compose the held-out validation and test sets. To determine which examples to include in these held-out sets, we randomly sampled 150 images from each of the six positive classes as well as a random sample of 33 images which have multiple labels, constituting 933 positive examples according to the original public dataset labels. We additionally sampled 12 images from each of the 50 negative categories, resulting in 600 negative examples. Both experts manually reviewed these examples and identified the presence or absence of the six methane source categories using a combination of NAIP imagery and Google Maps imagery, which often had finer spatial resolution as well as place names. The facility had to be captured by the NAIP image for the corresponding label to be assigned. If the expert identified no clearly visible methane source categories in the image, the example was labeled "negative", and if the expert was uncertain about any label, the example was labeled "uncertain". The two labels per example were then resolved as follows (see the sketch after this list):

1. If the experts agreed and neither was uncertain, the agreed-upon label was taken as the final label.
2. If the experts disagreed, and one was uncertain but the other was not, the certain expert's label was taken as the final label.
3. If the experts disagreed, but one agreed with the original label, the original label was taken as the final label.
4. In all other scenarios, the example was reviewed jointly by the experts and a final label was assigned.

Only 76 examples out of the 1,533 went to another round of review. The resulting datasets have 858 positive examples and 675 negative examples. We split the 1,533 examples into 515 for the validation set and 1,018 for the test set. The label counts on the validation and test sets are shown in Table 1.
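A minimal sketch of this resolution logic, assuming each expert review is represented as a set of category labels (empty for "negative") or None for "uncertain"; the adjudication code itself is not published, so these names are hypothetical.

```python
def resolve(label_a, label_b, original):
    """Resolve two expert reviews into a final label following the four
    rules above. label_a/label_b are frozensets of category names, or
    None when the expert marked the example 'uncertain'. Returns the
    final label set, or 'JOINT_REVIEW' when rule 4 applies."""
    if label_a is not None and label_b is not None and label_a == label_b:
        return label_a                                      # rule 1: certain agreement
    if (label_a is None) != (label_b is None):
        return label_a if label_a is not None else label_b  # rule 2: one certain expert
    if label_a is not None and original in (label_a, label_b):
        return original                                     # rule 3: one matches original
    return "JOINT_REVIEW"                                   # rule 4: joint adjudication

# Hypothetical example: the experts disagree but one matches the public label.
print(resolve(frozenset({"WWTPs"}), frozenset({"Landfills"}), frozenset({"WWTPs"})))
```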
===3. Experiments===

We run a variety of multi-label classification experiments on the curated dataset. In all of our experiments, we use a DenseNet-121 convolutional neural network architecture [15]; preliminary experiments on the dataset explored various ResNet and DenseNet architectures and found that DenseNets outperformed all ResNet variants [16]. We use a linear layer which outputs six values indicating the likelihood that each of the six methane source categories is present in the input image; this single multi-label model outperformed individual per-class models across all classes in our preliminary experiments. Although the model does not explicitly produce a value indicating the likelihood that the image is negative, a low value assigned to all classes indicates a negative prediction. The loss function is the mean of six unweighted binary cross-entropy losses, where the label is 1 if the class is present in the image and 0 otherwise. All six labels in the negative examples are 0. The network weights are initialized with weights from a network pre-trained on ImageNet [17]. Before inputting the images into the networks, we upscale the Sentinel-1 and Sentinel-2 images to match the size of the NAIP images using bilinear resampling and normalize the values by the display range of the bands (see Table 7 in the Appendix). When using inputs with fewer or more than 3 channels, we replace the first convolutional layer with one which accepts the corresponding number of channels. Each model is trained for 5 epochs with a batch size of 4. For each model we use the checkpoint saved after the epoch which led to the lowest validation loss. We use an Adam optimizer with standard parameters [18] and a learning rate of 0.02. All models are trained using a GeForce GTX 1070 GPU.

The baseline setting for all experiments uses images capturing a footprint of 720m x 720m with 1m spatial resolution (720 x 720 image dimensions). After the models are trained, each of the six values output by the model is fed through an element-wise sigmoid function to produce a probability for each of the six categories. To evaluate the performance of the models, we compute the per-class area under the precision-recall curve (AUPRC) and summarize the performance over all classes by taking the macro-average of the per-class AUPRCs.
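A minimal PyTorch sketch of this setup, assuming torchvision's DenseNet-121; the training code is not published, so the shapes and the single optimization step below are illustrative only.

```python
import torch
import torch.nn as nn
from torchvision.models import densenet121

NUM_CLASSES = 6  # CAFOs, Landfills, Mines, Proc Plants, R&Ts, WWTPs

def build_model(in_channels: int) -> nn.Module:
    """DenseNet-121 initialized from ImageNet weights, with the first
    conv replaced to accept the given channel count and the classifier
    replaced by a six-logit multi-label head."""
    model = densenet121(weights="IMAGENET1K_V1")
    if in_channels != 3:
        model.features.conv0 = nn.Conv2d(
            in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False
        )
    model.classifier = nn.Linear(model.classifier.in_features, NUM_CLASSES)
    return model

model = build_model(in_channels=4)            # e.g. NAIP RGB + NIR
criterion = nn.BCEWithLogitsLoss()            # mean of the six unweighted BCE losses
optimizer = torch.optim.Adam(model.parameters(), lr=0.02)

x = torch.randn(4, 4, 720, 720)               # batch size 4, 720x720 at 1 m per pixel
y = torch.zeros(4, NUM_CLASSES)               # all-zero targets encode negatives
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
probs = torch.sigmoid(model(x))               # element-wise sigmoid -> per-class probabilities
```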
Table 3: Per-class and overall (macro-average) validation AUPRC for different remotely sensed image products and bands. All of these experiments use images of size 720x720 at a spatial resolution of 1m per pixel, with S1 and S2 upsampled to that resolution.

Image Product   Bands  CAFOs  Landfills  Mines  Proc Plants  R&Ts   WWTPs  Overall
S1              VH&VV  0.519  0.107      0.152  0.218        0.487  0.119  0.267
S2              RGB    0.889  0.268      0.305  0.374        0.694  0.204  0.456
S2              All    0.889  0.189      0.382  0.368        0.690  0.183  0.450
S2 & S1         All    0.923  0.152      0.379  0.391        0.612  0.231  0.448
NAIP            RGB    0.903  0.270      0.348  0.327        0.849  0.182  0.480
NAIP            All    0.945  0.276      0.401  0.508        0.857  0.303  0.548
NAIP & S2 & S1  All    0.889  0.214      0.473  0.457        0.796  0.272  0.517

====3.1. Impact of using different imaging products and bands====

We investigate the impact of using different combinations of image products and bands in the dataset (Table 3). Specifically, we experiment with NAIP, S2, and S1 alone; only visible bands versus all spectral bands for S2 and NAIP; all spectral bands from S1 and S2 together (representing the model closest to public global transferability due to the global coverage of S1 and S2); and all spectral bands from the three products together.

The best model according to macro-average AUPRC is the one which uses NAIP with all bands (the three visible bands and the NIR band), achieving an overall AUPRC of 0.548 and the highest performance on CAFOs, Landfills, Proc Plants, R&Ts, and WWTPs compared to all other tested product and band combinations. Notably, it achieves very high performance on CAFOs (AUPRC=0.945) and high performance on R&Ts (AUPRC=0.857). The second-best model is the joint NAIP+S2+S1 model, achieving an overall AUPRC of 0.517 and the highest performance on Mines (AUPRC=0.473) compared to all other tested product and band combinations.

S1 alone underperforms all other combinations of products and bands, followed by S2 and S1 jointly, which performed similarly overall to S2 with only the visible bands and with all spectral bands. Importantly, the S2 and S1 joint model still achieves high performance on CAFOs (AUPRC=0.923), although the performance is lower than on CAFOs using NAIP imagery (AUPRC=0.945, Table 3). There is a significant drop in performance on all classes when moving from NAIP to S2, highlighting the benefit of using high spatial resolution imagery.

The inclusion of the non-visible bands substantially improves overall AUPRC for NAIP (AUPRC=0.480 → 0.548) but not for Sentinel-2 (AUPRC=0.456 → 0.450, Table 3). For NAIP, the improvement is observed for all classes, with substantial gains on CAFOs, Mines, Proc Plants, and WWTPs. For Sentinel-2, the inclusion of non-visible bands substantially improves performance on Mines but substantially degrades performance on Landfills. For both products, minimal change in R&Ts performance is observed when including the non-visible bands.
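A minimal sketch of the evaluation described above, assuming scikit-learn's average_precision_score as the AUPRC estimator (the paper does not name its metric implementation).

```python
import numpy as np
from sklearn.metrics import average_precision_score

CLASSES = ["CAFOs", "Landfills", "Mines", "Proc Plants", "R&Ts", "WWTPs"]

def macro_auprc(y_true: np.ndarray, y_prob: np.ndarray) -> dict:
    """Per-class AUPRC and its macro-average for multi-label predictions.
    y_true: (n, 6) binary labels; y_prob: (n, 6) sigmoid probabilities."""
    scores = {
        name: average_precision_score(y_true[:, i], y_prob[:, i])
        for i, name in enumerate(CLASSES)
    }
    scores["Overall"] = float(np.mean(list(scores.values())))
    return scores

# Toy usage with random labels and scores over 100 hypothetical examples.
rng = np.random.default_rng(0)
print(macro_auprc(rng.integers(0, 2, (100, 6)), rng.random((100, 6))))
```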
Table 4: Per-class and overall (macro-average) validation AUPRC at varying image footprints and spatial resolutions.

Image Footprint  Resolution  CAFOs  Landfills  Mines  Proc Plants  R&Ts   WWTPs  Overall
240x240          1m          0.773  0.217      0.407  0.438        0.735  0.337  0.485
480x480          1m          0.772  0.226      0.260  0.371        0.855  0.506  0.498
720x720          3m          0.891  0.245      0.378  0.566        0.837  0.269  0.531
720x720          1.5m        0.927  0.244      0.426  0.366        0.831  0.449  0.541
720x720          1m          0.945  0.276      0.401  0.508        0.857  0.303  0.548

====3.2. Impact of image footprint and spatial resolution====

As image footprint (i.e., the amount of area on the ground captured by the image) and spatial resolution likely impact model performance due to the variation in the sizes of the methane-emitting facilities and equipment, we conduct experiments to test these effects (Table 4). To investigate the impact of footprint, we center-crop the 720 x 720 1m images to obtain 480 x 480 and 240 x 240 1m images, corresponding to 480m x 480m and 240m x 240m footprints respectively. Note that this reduces the area on the ground with spatial resolution held constant. To investigate the impact of spatial resolution, we use cubic spline interpolation [19] to downsample the 720 x 720 images to 480 x 480 (1.5m resolution, corresponding to Airbus SPOT imagery) and 240 x 240 (3m resolution, corresponding to PlanetScope imagery). Note that this reduces the spatial resolution without modifying the image footprint. In all experiments, we up-sample the images back to 720 x 720 to avoid any differences in performance due to varying image size (see the sketch after this subsection). We use NAIP with RGB + NIR bands for these experiments, as this setting produced the best overall performance compared to the other combinations of products and bands.

We find that the largest tested image footprint achieves the highest overall performance (0.548) and substantially outperforms both smaller spatial footprints across all classes except for WWTPs. This may be explained by the fact that a significant number of smaller wastewater treatment plants are surrounded by industrial buildings and other infrastructure, so cropping out this surrounding infrastructure improves the model's ability to identify the salient features of the wastewater treatment facilities.

We further find that the highest spatial resolution achieves the best overall performance (AUPRC=0.548), outperforming the coarser-resolution models on CAFOs, Landfills, and R&Ts. The 1.5m resolution model closely follows with an overall AUPRC of 0.541 and outperforms the 1m resolution model on Mines. The 3m resolution model in turn closely follows the 1.5m resolution model, achieving an overall performance of 0.531, and substantially outperforms both higher-resolution models on Proc Plants. This result suggests that models developed at 1.5m and even 3m resolution have the potential to perform almost as well as 1m resolution models, which has implications for global applicability, as Airbus SPOT and PlanetScope imagery are globally (privately) available at 1.5m and 3m resolution respectively.
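A minimal sketch of these two ablations on a single image array, assuming NumPy and SciPy; the paper cites cubic spline interpolation [19] but does not name the library it used.

```python
import numpy as np
from scipy.ndimage import zoom

def center_crop(img: np.ndarray, size: int) -> np.ndarray:
    """Crop a (H, W, C) image to (size, size, C) about its center,
    shrinking the ground footprint while keeping the 1 m resolution."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def coarsen_resolution(img: np.ndarray, out_size: int) -> np.ndarray:
    """Downsample with cubic spline interpolation (order=3), then
    upsample back so the image stays 720x720 but carries coarser detail."""
    h = img.shape[0]
    small = zoom(img, (out_size / h, out_size / h, 1), order=3)
    return zoom(small, (h / out_size, h / out_size, 1), order=3)

naip = np.random.rand(720, 720, 4)          # stand-in for a 4-band NAIP tile
crop_480 = center_crop(naip, 480)           # 480 m x 480 m footprint at 1 m
coarse_3m = coarsen_resolution(naip, 240)   # ~3 m effective resolution, still 720x720
```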
Table 5: Per-class and overall (macro-average) test metrics of the per-class expert model. The per-class expert model consists of one model per class, where the model used for each class is selected based on the highest-performing settings for that class across the product, bands, footprint, and resolution experiments.

Category     AUPRC  AUROCC  Precision  Recall  F1
CAFOs        0.915  0.989   0.822      0.902   0.860
Landfills    0.259  0.754   0.246      0.523   0.334
Mines        0.470  0.905   0.558      0.403   0.468
Proc Plants  0.350  0.787   0.336      0.477   0.394
R&Ts         0.821  0.956   0.752      0.787   0.769
WWTPs        0.534  0.836   0.633      0.477   0.544
Overall      0.558  0.871   0.558      0.595   0.562

====3.3. Per-class expert model test set results====

For each methane source category, we select the experimental configuration (product/bands/footprint/resolution) that achieved the highest validation AUPRC for that class to serve as the "class expert". We refer to the combination of the different class experts as the per-class expert model.

We evaluate the per-class expert model on the held-out test set using a variety of metrics, including AUPRC and area under the receiver operating characteristic curve (AUROCC), as well as precision, recall, and F1 at the threshold which achieves the highest F1 on the validation set. The results are shown in Table 5. The per-class expert model obtains a macro-average AUPRC of 0.558. The model does especially well on CAFOs (AUPRC=0.915) and R&Ts (AUPRC=0.821), possibly because these sources have very distinctive features (e.g., long barns at CAFOs and storage tank farms at R&Ts). It performs more poorly on the other sources, especially landfills, which do not have many clear distinctive features visible at 1m resolution. Notably, it achieves the lowest performance on the categories with the fewest examples in the dataset, excluding R&Ts, which may be simpler to identify due to their homogeneity and discernible features.

===4. Discussion===

The experiments suggest that the choice of imaging product, spectral band, image footprint, and spatial resolution can lead to substantial differences in model performance, with the effect often depending on the methane source category. In particular, this suggests that there is significant room to explore approaches which leverage the multi-sensor and multi-spectral aspects of METER-ML. For example, the NAIP & S2 & S1 model underperformed the model which used NAIP alone, and using all 13 spectral bands in the S2 model did not lead to substantial performance differences compared to the S2 model which only used the three visible bands. We also do not leverage the geographic information explicitly in the models, although this has been shown to improve performance on other Earth observation tasks [20, 21]. Furthermore, there is potential to augment the dataset with other sources of imagery and information available at the provided geographic locations. We hope to help create new versions of METER-ML which may include other sources of input data and methane-emitting infrastructure categories.

The best model from our experiments achieves high performance on identifying CAFOs and R&Ts, suggesting the potential to map these facilities with NAIP imagery in the U.S., which aligns with findings from prior studies [9, 10]. The performance for identifying CAFOs remains high when using S1 and S2, which are globally and publicly available. This suggests the potential to use these lower spatial resolution imagery sources to map CAFOs in countries besides the U.S., but future work should investigate whether these findings generalize to other regions. There is still a large gap to achieving high performance on each of the other methane source categories, and room to further improve performance on the high-performing categories, so METER-ML is a challenging benchmark for testing new infrastructure identification approaches.

There are many other publicly available remote sensing datasets for classification, with some of the most common being UC Merced [22], SAT-4 and SAT-6 [23], AID [24], NWPU-RESISC45 [25], EuroSAT [26], and BigEarthNet [27]. Few of these datasets have georeferenced multi-sensor images, which limits their utility for new modeling approaches and downstream use. The OGNet dataset [9] is the most similar publicly available dataset to METER-ML and is essentially a subset of it, containing NAIP imagery of refineries in the contiguous U.S.

We identify four limitations of this work. First, we limit the geographic scope of METER-ML to the U.S. due to the availability of disseminatable infrastructure data and publicly available, high-resolution imagery; future work should include data from other regions worldwide. Second, we do not include longitudinal imagery, in order to reduce the size and complexity of the dataset, as most infrastructure is static over time. However, longitudinal information has the potential to provide additional signal to help differentiate certain facilities, e.g., waste pile evolution at landfills. Third, we use a DenseNet-121 model that is pre-trained on ImageNet, but the shape and number of channels of remote sensing imagery can be significantly different from ImageNet. It would be worthwhile to train a network from scratch on METER-ML and compare its performance against a network that is pre-trained on ImageNet and fine-tuned on METER-ML. Fourth, our approach to combining the multi-sensor data may not be optimal, as the products and spectral bands have different spatial resolutions and sensor types (e.g., active vs. passive sensors). One alternative approach may be to dedicate different network branches to the inputs and combine the representations from each branch, as sketched below.
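A minimal PyTorch sketch of such a late-fusion architecture, with one branch per image product and a six-logit head over the concatenated pooled features. This illustrates the alternative proposed above; it is not an approach evaluated in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import densenet121

def make_branch(in_channels: int) -> nn.Module:
    """A DenseNet-121 feature extractor adapted to one sensor's channel count."""
    net = densenet121(weights="IMAGENET1K_V1")
    if in_channels != 3:
        net.features.conv0 = nn.Conv2d(
            in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False
        )
    return net.features  # yields (B, 1024, H/32, W/32) feature maps

class LateFusion(nn.Module):
    """One branch per product; global-average-pool each branch's features
    and classify the concatenation with a six-logit multi-label head."""
    def __init__(self, channels=(4, 13, 2), num_classes=6):  # NAIP, S2, S1
        super().__init__()
        self.branches = nn.ModuleList(make_branch(c) for c in channels)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(1024 * len(channels), num_classes)

    def forward(self, inputs):
        feats = [self.pool(b(x)).flatten(1) for b, x in zip(self.branches, inputs)]
        return self.head(torch.cat(feats, dim=1))

model = LateFusion()
logits = model([
    torch.randn(1, 4, 720, 720),    # NAIP RGB + NIR at native size
    torch.randn(1, 13, 720, 720),   # all 13 S2 bands, upsampled
    torch.randn(1, 2, 720, 720),    # S1 VH & VV, upsampled
])
```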
===5. Conclusion===

In this work, we curate a large georeferenced multi-sensor dataset called METER-ML to test automated methane source identification approaches. We conduct a variety of experiments investigating the impact of remotely sensed image product, spectral bands, image footprint, and spatial resolution on model performance measured against a consensus of expert labels. We find that a model which leverages NAIP with all four bands achieves the highest overall performance across the tested image product and spectral band combinations, followed closely by a joint NAIP, Sentinel-2, and Sentinel-1 model. We also find that the highest spatial resolution and largest footprint lead to the best overall performance, although performance can depend on the methane source category. Finally, we show that the best model achieves high performance in identifying concentrated animal feeding operations and oil refineries and petroleum terminals, suggesting the potential to map them at scale, but substantially lower performance on the other four categories, with notably lower performance identifying processing plants and landfills. We make METER-ML freely available in order to encourage and support future work on developing Earth observation models for mitigating climate change.

===Acknowledgments===

This work was supported by the High Tide Foundation to construct the METER database. We acknowledge Rose Rustowicz and Kyle Story for their support of this work, as well as the Descartes Labs Platform API and tools for downloading and processing the remotely sensed imagery. We also thank Ritesh Gautam and Mark Omara for their help working with the oil and gas infrastructure data, Evan Sherwin for his advice on the dataset and methane source categories, and Victor Maus for providing the coal mines data.

===References===

[1] Z. Zhang, B. Poulter, S. Knox, A. Stavert, G. McNicol, E. Fluet-Chouinard, A. Feinberg, Y. Zhao, P. Bousquet, J. G. Canadell, et al., Anthropogenic emission is the main contributor to the rise of atmospheric methane during 1993–2017, National Science Review 9 (2022) nwab200.

[2] P. Agreement, Paris agreement, in: Report of the Conference of the Parties to the United Nations Framework Convention on Climate Change (21st Session, 2015: Paris), retrieved December, volume 4, HeinOnline, 2015, p. 2017.

[3] D. J. Jacob, D. J. Varon, D. H. Cusworth, P. E. Dennison, C. Frankenberg, R. Gautam, L. Guanter, J. Kelley, J. McKeever, L. E. Ott, et al., Quantifying methane emissions from the global scale down to point sources using satellite observations of atmospheric methane, Atmospheric Chemistry and Physics Discussions (2022) 1–44.

[4] J. Yu, Z. Wang, A. Majumdar, R. Rajagopal, DeepSolar: A machine learning framework to efficiently construct a solar deployment database in the United States, Joule 2 (2018) 2605–2617.

[5] J. Lee, N. R. Brooks, F. Tajwar, M. Burke, S. Ermon, D. B. Lobell, D. Biswas, S. P. Luby, Scalable deep learning to identify brick kilns and aid regulatory capacity, Proceedings of the National Academy of Sciences 118 (2021).

[6] L. Kruitwagen, K. Story, J. Friedrich, L. Byers, S. Skillman, C. Hepburn, A global inventory of photovoltaic solar energy generating units, Nature 598 (2021) 604–610.

[7] W. Sirko, S. Kashubin, M. Ritter, A. Annkah, Y. S. E. Bouchareb, Y. Dauphin, D. Keysers, M. Neumann, M. Cisse, J. Quinn, Continental-scale building detection from high resolution satellite imagery, arXiv preprint arXiv:2107.12283 (2021).

[8] S. Dileep, D. Zimmerle, J. R. Beveridge, T. Vaughn, Automated identification of oil field features using CNNs, in: NeurIPS Workshop on Tackling Climate Change with Machine Learning, 2020.

[9] H. Sheng, J. Irvin, S. Munukutla, S. Zhang, C. Cross, K. Story, R. Rustowicz, C. Elsworth, Z. Yang, M. Omara, et al., OGNet: Towards a global oil and gas infrastructure database using deep learning on remotely sensed imagery, arXiv preprint arXiv:2011.07227 (2020).

[10] C. Handan-Nader, D. E. Ho, Deep learning to map concentrated animal feeding operations, Nature Sustainability 2 (2019) 298–306.

[11] H. Li, J. Zech, D. Hong, P. Ghamisi, M. Schultz, A. Zipf, Leveraging OpenStreetMap and multimodal remote sensing data with joint deep learning for wastewater treatment plants detection, International Journal of Applied Earth Observation and Geoinformation 110 (2022) 102804.

[12] Stanford University, Methane tracking emissions reference platform, 2022. URL: https://meterplatform.web.app/, accessed: 2022-06-30.

[13] R. Keisler, S. W. Skillman, S. Gonnabathula, J. Poehnelt, X. Rudelis, M. S. Warren, Visual search over billions of aerial and satellite images, Computer Vision and Image Understanding 187 (2019) 102790.

[14] Descartes Labs, Descartes Labs platform, 2022. URL: https://descarteslabs.com/, accessed: 2022-06-26.
[15] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.

[16] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.

[18] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).

[19] P. S. Parsania, P. V. Virparia, A comparative analysis of image interpolation algorithms, International Journal of Advanced Research in Computer and Communication Engineering 5 (2016) 29–34.

[20] O. Mac Aodha, E. Cole, P. Perona, Presence-only geographical priors for fine-grained image classification, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9596–9606.

[21] J. Irvin, H. Sheng, N. Ramachandran, S. Johnson-Yu, S. Zhou, K. Story, R. Rustowicz, C. Elsworth, K. Austin, A. Y. Ng, ForestNet: Classifying drivers of deforestation in Indonesia using deep learning on satellite imagery, arXiv preprint arXiv:2011.05479 (2020).

[22] Y. Yang, S. Newsam, Geographic image retrieval using local invariant features, IEEE Transactions on Geoscience and Remote Sensing 51 (2012) 818–832.

[23] S. Basu, S. Ganguly, S. Mukhopadhyay, R. DiBiano, M. Karki, R. Nemani, DeepSat: A learning framework for satellite imagery, in: Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2015, pp. 1–10.

[24] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, X. Lu, AID: A benchmark data set for performance evaluation of aerial scene classification, IEEE Transactions on Geoscience and Remote Sensing 55 (2017) 3965–3981.

[25] G. Cheng, J. Han, X. Lu, Remote sensing image scene classification: Benchmark and state of the art, Proceedings of the IEEE 105 (2017) 1865–1883.

[26] P. Helber, B. Bischke, A. Dengel, D. Borth, EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12 (2019) 2217–2226.

[27] G. Sumbul, M. Charfuelan, B. Demir, V. Markl, BigEarthNet: A large-scale benchmark archive for remote sensing image understanding, in: IGARSS 2019 – 2019 IEEE International Geoscience and Remote Sensing Symposium, IEEE, 2019, pp. 5901–5904.
[28] CA Governor's Office of Emergency Services, CA Energy Commission – oil refineries and terminals, 2021. URL: https://arcg.is/1rXn4G0, accessed: 2022-02-01.

[29] CA State Water Resources Control Board, 2022. URL: https://www.waterboards.ca.gov/resources/data_databases/site_map.html, accessed: 2022-02-01.

[30] D. McKernan, 2012 factory farms in Michigan, 2012. URL: https://data.world/dmckernan/2012-factory-farms-in-michigan, accessed: 2022-02-01.

[31] O. Tsubiks, Concentrated animal feeding operations (CAFO), US, 2017. URL: https://data.world/dataforacause/concentrated-animal-feeding-operations-cafo, accessed: 2022-02-01.

[32] U.S. Energy Information Administration, U.S. energy mapping system, 2021. URL: https://www.eia.gov/state/maps.php, accessed: 2022-02-01.

[33] U.S. Environmental Protection Agency, Greenhouse gas reporting program (GHGRP), 2021. URL: https://www.epa.gov/ghgreporting, accessed: 2022-02-01.

[34] K. Rose, J. Bauer, V. Baker, A. Barkhurst, A. Bean, J. DiGiulio, K. Jones, T. Jones, D. Justman, R. Miller, et al., Global Oil & Gas Features Database, Technical Report, 2018.

[35] U.S. Department of Homeland Security, Homeland infrastructure foundation-level data (HIFLD), 2021. URL: https://hifld-geoplatform.opendata.arcgis.com/, accessed: 2022-02-01.

[36] H. Ehalt Macedo, B. Lehner, J. Nicell, G. Grill, J. Li, A. Limtong, R. Shakya, Distribution and characteristics of wastewater treatment plants within the global river network, Earth System Science Data 14 (2022) 559–577.
[37] Indiana Department of Environmental Management, Confined feeding operation facilities: Indiana, 2020. URL: https://maps.indiana.edu/metadata/Environment/Agribusiness_Confined_Feeding_Operations.html#Identification_Information, accessed: 2022-02-01.

[38] U.S. Environmental Protection Agency, Landfill methane outreach program, 2022. URL: https://www.epa.gov/lmop/project-and-landfill-data-state, accessed: 2022-02-01.

[39] A. J. Marchese, T. L. Vaughn, D. J. Zimmerle, D. M. Martinez, L. L. Williams, A. L. Robinson, A. L. Mitchell, R. Subramanian, D. S. Tkacik, J. R. Roscioli, et al., Methane emissions from United States natural gas gathering and processing, Environmental Science & Technology 49 (2015) 10718–10727.

[40] V. Maus, S. Giljum, D. M. da Silva, J. Gutschlhofer, R. P. da Rosa, S. Luckeneder, S. L. Gass, M. Lieber, I. McCallum, An update on global mining land use, Scientific Data (2022).

[41] Metropolitan Council, Wastewater treatment plants, 2020. URL: https://gisdata.mn.gov/dataset/us-mn-state-metc-util-wastewater-treatment-plants, accessed: 2022-02-01.

[42] MN Pollution Control Agency, Feedlots in Minnesota, 2016. URL: https://gisdata.mn.gov/dataset/env-feedlots, accessed: 2022-02-01.

[43] ORNL DAAC, Sources of methane emissions (Vista-CA), state of California, USA, 2019. URL: https://daac.ornl.gov/NACP/guides/NACP_Vista_CA_CH4_Inventory.html, accessed: 2022-02-01.

[44] Sierra Club Michigan Chapter, Michigan CAFO mapping report, 2017. URL: https://www.sierraclub.org/michigan/michigan-cafo-mapping-report, accessed: 2022-02-01.

[45] Stanford Regulation, Evaluation, and Governance Lab, CAFO training dataset, 2019. URL: https://reglab.stanford.edu/data/cafo-training-dataset/, accessed: 2022-02-01.

===A. Appendix===

====A.1. Methane-Emitting Infrastructure Datasets====

Table 6: Summary of the datasets containing the locations of methane-emitting infrastructure that are included in METER-ML. All datasets are subsetted to the locations within the contiguous U.S. We use the centroids of polygons for any datasets provided in polygon format.

Dataset Source                            Scope           Methane Source Categories
CA Energy Commission [28]                 California      R&Ts
CSWRCB [29]                               California      WWTPs
data.world [30]                           Michigan        CAFOs
Data For Cause Challenge [31]             US              CAFOs
EIA [32]                                  US              Proc Plants, R&Ts, WWTPs
GHGRP [33]                                US              Landfills, Mines, R&Ts, WWTPs
GOGI [34]                                 Global          Proc Plants, R&Ts
HIFLD [35]                                US              R&Ts, WWTPs
HydroWASTE [36]                           Global          WWTPs
IndianaMap [37]                           Indiana         CAFOs
LMOP [38]                                 US              Landfills
Marchese et al. [39]                      US              Proc Plants
Maus et al. [40]                          Global          Mines
Minnesota Metropolitan Council [41]       Minnesota       WWTPs
Minnesota Pollution Control Agency [42]   Minnesota       CAFOs
ORNL DAAC [43]                            California      CAFOs
Sierra Club [44]                          Michigan        CAFOs
Stanford RegLab [45]                      North Carolina  CAFOs

====A.1.1. Coal Mines Data====

The mines data from [40] were subsetted to coal mines in order to capture the mines responsible for the vast majority of methane emissions. To do this, the polygons and coal mine coordinates obtained from S&P Global Commodity Insights were matched to determine which polygons were spatially related to coal mine coordinates. A visual check and hand cleaning were then performed on the polygons assigned a coal mining label to ensure correctness.

====A.2. Negative Classes====

We identify a variety of infrastructure and landscape types which are not any of the six infrastructure categories to use as negatives in the dataset. Specifically, we include football fields, marinas, solar panels, large bodies of water, parking lots, windmills, baseball fields, airport runways, clouds, neighborhoods, golf courses, roundabouts, mountainous terrain, trees, boats, islands, rocks, rivers, roads, bridges, ripples in water, snow, canyon formations, sparse forests, suburban neighborhoods, beaches, clear water, swimming pools, sand, corn farms, soy farms, trees on mountainsides, farm houses, grass, airplanes, turning roads, intersections, multifamily residential facilities, rapids, docks, highway loops, mowed grass, container yards, soccer fields, greenhouses, crops, personal watercraft, pivot irrigation systems, and concrete plants. Characteristic examples of each type were selected, and a variety of similar examples per type were obtained using the Descartes Labs GeoVisual Similarity tool [13].

====A.3. Remotely Sensed Image Statistics and Examples====

Table 7: Summary of the remotely sensed image products and bands included in METER-ML. The raw image data contains values from the data range, while the display range is used to normalize values before displaying imagery or inputting into models.
Product/Bands       Image Size  Resolution  Data Range  Display Range
NAIP RGB & NIR      720x720     1m          [0,255]     [0,255]
Sentinel-2 CA       12x12       60m         [0,10000]   [0,10000]
Sentinel-2 RGB      72x72       10m         [0,10000]   [0,4000]
Sentinel-2 RE1-4    36x36       20m         [0,10000]   [0,10000]
Sentinel-2 NIR      72x72       10m         [0,10000]   [0,10000]
Sentinel-2 WV       12x12       60m         [0,10000]   [0,10000]
Sentinel-2 C        12x12       60m         [0,10000]   [0,10000]
Sentinel-2 SWIR1-2  36x36       20m         [0,10000]   [0,10000]
Sentinel-1 VH       72x72       10m         [1,4095]    [585,2100]
Sentinel-1 VV       72x72       10m         [1,4095]    [585,2926]

Figure 2: Characteristic example images of each category in METER-ML. METER-ML contains 19 spectral bands across 3 image products, visualized in grayscale for single bands and as false-color composites for multiple bands (the first listed band is used in the red channel, the second in the green, and the third in the blue). The figure shows NAIP RGB, NAIP NIR, S1 VV&VH, S2 RGB, S2 NIR, S2 RE1&SWIR1-2, S2 RE2-4, and S2 CA&WV&C examples for each of the categories CAFOs, Coal Mines, Landfills, Proc Plants, R&Ts, WWTPs, and Negatives. (Images not reproduced here.)
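As a worked example of the display-range normalization described in Section 3, the sketch below clips each band to its display range from Table 7 and rescales it to [0, 1]. The exact normalization used by the authors is not published, so this is one assumed reading of the table.

```python
import numpy as np

# Display ranges (low, high) per band group, from Table 7.
DISPLAY_RANGE = {
    "naip": (0, 255),
    "s2_rgb": (0, 4000),
    "s2_other": (0, 10000),
    "s1_vh": (585, 2100),
    "s1_vv": (585, 2926),
}

def normalize(band: np.ndarray, key: str) -> np.ndarray:
    """Clip a raw band to its display range and rescale to [0, 1]."""
    lo, hi = DISPLAY_RANGE[key]
    return (np.clip(band, lo, hi) - lo) / (hi - lo)

# E.g. a raw Sentinel-2 red band with data values in [0, 10000]:
red = np.random.randint(0, 10000, size=(72, 72)).astype(np.float64)
assert normalize(red, "s2_rgb").max() <= 1.0
```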