=Paper= {{Paper |id=Vol-3207/paper6 |storemode=property |title=METER-ML: A Multi-Sensor Earth Observation Benchmark for Automated Methane Source Mapping |pdfUrl=https://ceur-ws.org/Vol-3207/paper6.pdf |volume=Vol-3207 |authors=Bryan Zhu,Nicholas Lui,Jeremy Irvin,Jimmy Le,Sahil Tadwalkar,Chenghao Wang,Zutao Ouyang,Frankie Y. Liu,Andrew Y. Ng,Robert B. Jackson |dblpUrl=https://dblp.org/rec/conf/cdceo/ZhuLILT0OLNJ22 }} ==METER-ML: A Multi-Sensor Earth Observation Benchmark for Automated Methane Source Mapping== https://ceur-ws.org/Vol-3207/paper6.pdf
METER-ML: A Multi-Sensor Earth Observation Benchmark
for Automated Methane Source Mapping
Bryan Zhu1,*,† , Nicholas Lui2,† , Jeremy Irvin1,† , Jimmy Le1 , Sahil Tadwalkar3 ,
Chenghao Wang4 , Zutao Ouyang4 , Frankie Y. Liu4 , Andrew Y. Ng1 and Robert B. Jackson4,5
1 Department of Computer Science, Stanford University
2 Department of Statistics, Stanford University
3 Department of Civil and Environmental Engineering, Stanford University
4 Department of Earth System Science, Stanford University
5 Woods Institute for the Environment and Precourt Institute for Energy, Stanford University


Abstract

Reducing methane emissions is essential for mitigating global warming. To attribute methane emissions to their sources, a comprehensive dataset of methane source infrastructure is necessary. Recent advancements with deep learning on remotely sensed imagery have the potential to identify the locations and characteristics of methane sources, but there is a substantial lack of publicly available data to enable machine learning researchers and practitioners to build automated mapping approaches. To help fill this gap, we construct a multi-sensor dataset called METER-ML containing 86,599 georeferenced NAIP, Sentinel-1, and Sentinel-2 images in the U.S. labeled for the presence or absence of methane source facilities including concentrated animal feeding operations, coal mines, landfills, natural gas processing plants, oil refineries and petroleum terminals, and wastewater treatment plants. We experiment with a variety of models that leverage different spatial resolutions, spatial footprints, image products, and spectral bands. We find that our best model achieves an area under the precision recall curve of 0.915 for identifying concentrated animal feeding operations and 0.821 for oil refineries and petroleum terminals on an expert-labeled test set, suggesting the potential for large-scale mapping. We make METER-ML freely available at this link to support future work on automated methane source mapping.

Keywords

Earth observation, remote sensing, machine learning, deep learning, dataset, climate change, methane



1. Introduction

Figure 1: METER-ML is a multi-sensor dataset containing 86,599 examples of NAIP aerial imagery, Sentinel-2 satellite imagery, and Sentinel-1 satellite imagery. We include 19 spectral bands across these three products, with the RGB and VH&VV bands shown here. Each example is labeled with the presence or absence of six different methane source facilities and is georeferenced. A small number of examples are labeled to contain facilities from more than one category and 34,870 examples contain no facilities from the six categories.

Anthropogenic methane emissions are the main contributor to the rise of atmospheric methane [1], and mitigating methane emissions is widely recognized as crucial for slowing global warming and achieving the goals of the Paris Agreement [2]. Multiple satellites are in orbit or launching soon which will measure methane emissions from the surface using top-down approaches, but in order to attribute these emissions to specific sources on the ground, a comprehensive database of methane emitting infrastructure is necessary [3]. Although several public databases of this infrastructure exist, the data available globally is incomplete, erroneous, and unaggregated.
CDCEO 2022: 2nd Workshop on Complex Data Challenges in Earth Observation, July 25, 2022, Vienna, Austria
* Corresponding author.
† These authors contributed equally.
bwzhu@cs.stanford.edu (B. Zhu); niclui@stanford.edu (N. Lui); jirvin16@cs.stanford.edu (J. Irvin); jimmyle@cs.stanford.edu (J. Le); stadwalk@stanford.edu (S. Tadwalkar); chenghao.wang@stanford.edu (C. Wang); ouyangzt@stanford.edu (Z. Ouyang); frankliu@stanford.edu (F. Y. Liu); ang@cs.stanford.edu (A. Y. Ng); Rob.Jackson@stanford.edu (R. B. Jackson)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

AI approaches on Earth observation data have the potential to fill in this gap. Several recent works have developed deep learning models to automatically interpret remotely sensed imagery and deploy them at scale to
map infrastructure [4, 5, 6, 7]. Methods for mapping methane source infrastructure have been emerging as well, including well pads in the Denver basin [8], oil refineries and concentrated animal feeding operations in the U.S. [9, 10], and wastewater treatment plants in Germany [11]. Each of these works depended on the curation of large, labeled datasets to develop the machine learning models, but there is a lack of publicly available, labeled Earth observation data, specifically on methane emitting infrastructure, which prohibits researchers and practitioners from building automated mapping approaches.

Table 1: Counts and proportions of each category in METER-ML. The labels on the training set are obtained from public data whereas the labels on the validation and test sets are obtained from a consensus of two methane source identification experts. The individual category counts do not add up to the overall train/valid/test counts as some (0.8%) of the positive examples are labeled with more than one methane source category.

  Category      Train (%)        Valid (%)     Test (%)       Total
  CAFOs         24957 (29.3%)    47 (9.1%)     92 (9.0%)      25096
  Landfills     4085 (4.8%)      46 (8.9%)     111 (10.9%)    4242
  Coal Mines    1776 (2.1%)      40 (7.8%)     72 (7.1%)      1888
  Proc Plants   1900 (2.2%)      38 (7.4%)     107 (10.5%)    2045
  R&Ts          4012 (4.7%)      59 (11.5%)    108 (10.6%)    4179
  WWTPs         14519 (17.1%)    46 (8.9%)     129 (12.7%)    14694
  Negatives     34195 (40.2%)    249 (48.3%)   426 (41.8%)    34870
  Total         85066            515           1018           86599

In this work, we construct a multi-sensor Earth observation dataset for methane source infrastructure identification called METER-ML. In support of a new initiative to build a global database of methane emitting infrastructure called the MEthane Tracking Emissions Reference (METER) [12], we develop METER-ML to allow the machine learning community to experiment with multi-view/multi-modal modeling approaches to automatically identify this infrastructure in remotely sensed imagery. METER-ML includes georeferenced imagery from three remotely sensed image products, specifically 19 spectral bands in total from NAIP, Sentinel-1, and Sentinel-2, capturing 51,729 sources of methane from six different classes as well as 34,870 negative examples (Figure 1). The dataset includes expert-reviewed validation and test sets for robustly evaluating the performance of derived models. Using the dataset, we experiment with a variety of convolutional neural network models which leverage different spatial resolutions, spatial footprints, image products, and spectral bands. The dataset is freely available¹ in order to encourage further work on developing and validating methane source mapping approaches.

¹ https://stanfordmlgroup.github.io/projects/meter-ml

2. Methods

2.1. Methane source locations

We collect locations of methane emitting infrastructure in the U.S. from a variety of public datasets. We focus on the U.S. in this study due to the high availability of publicly accessible infrastructure data and remotely sensed imagery. The infrastructure categories we include are concentrated animal feeding operations (CAFOs), coal mines (Mines), landfills (Landfills), natural gas processing plants (Proc Plants), oil refineries and petroleum terminals (including crude oil and liquefied natural gas terminals), and wastewater treatment plants (WWTPs). We group oil refineries and petroleum terminals together due to their high similarity in appearance, and refer to that category as "Refineries & Terminals" (R&Ts). These infrastructure categories were chosen based on their potential for emitting methane along with their consistent, visible differentiating features which make them feasible to identify in high resolution remotely sensed imagery.

The locations are obtained from 18 different publicly available datasets, all of which have licenses that allow for redistribution (see Table 6 in the Appendix). As various datasets may contain the same locations of infrastructure, we deduplicate by considering locations within 500m of each other identical. In total we include 51,729 unique locations of methane source infrastructure in the dataset, which we refer to as positive examples.

2.2. Negative locations

We additionally include a variety of images in the dataset which capture none of the six methane emitting facilities. To do this, we define around 50 classes (see Appendix) of different facilities and landscapes and select characteristic examples of each class. Then we collect locations containing similar facilities and landscapes using the Descartes Labs GeoVisual Search [13], providing up to 1000 similar locations per example. A sample of the similar locations was manually vetted in each case to ensure no locations obtained actually corresponded to the six methane source categories. In total we include 34,870 locations of facilities and landscapes which are not any of the six infrastructure categories, and refer to these as negative examples. The counts and proportions of the positive and negative classes in the dataset are shown in Table 1.

2.3. Remotely sensed imagery

We pair all of the locations in the dataset with three publicly available remotely sensed image sources. Specifically we include aerial imagery from the USDA National Agriculture Imagery Program (NAIP) as well as satellite imagery captured by Sentinel-1 (S1) and Sentinel-2 (S2).
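The 500m rule used to deduplicate locations pooled from the 18 source datasets (Section 2.1) can be sketched as a greedy pass over coordinates. The paper does not specify the exact procedure, so the greedy strategy and function names below are illustrative assumptions:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def deduplicate(locations, threshold_m=500.0):
    """Greedily keep a location only if it is at least threshold_m away
    from every location already kept; nearby points are treated as the
    same facility."""
    kept = []
    for lat, lon in locations:
        if all(haversine_m(lat, lon, klat, klon) >= threshold_m
               for klat, klon in kept):
            kept.append((lat, lon))
    return kept

# Two points ~111m apart collapse to one; a point ~1.1km away survives.
pts = [(0.0, 0.0), (0.0, 0.001), (0.01, 0.0)]
unique = deduplicate(pts)
```

The greedy pass is O(n²) in the number of locations; at larger scales a spatial index would be the natural replacement.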
Table 2: Summary of the remotely sensed image products and bands included in METER-ML. RGB are the three visible bands, NIR is a single near-infrared band, RE1-4 are the four red-edge bands, SWIR1-2 are the two shortwave infrared bands, CA is the single coastal aerosol band, WV is the single water-vapor band, C is the single cirrus band, and VH & VV are the two V-transmit bands.

  Product      Bands             Image Size   Resolution
  NAIP         RGB & NIR         720x720      1m
  Sentinel-2   RGB & NIR         72x72        10m
  Sentinel-2   RE1-4 & SWIR1-2   36x36        20m
  Sentinel-2   CA & WV & C       12x12        60m
  Sentinel-1   VH & VV           72x72        10m

NAIP imagery covers the contiguous U.S., and S1 and S2 imagery both have global coverage. For NAIP we use 1m resolution imagery, for Sentinel-2 we use the L1C product at 10m resolution, and for Sentinel-1 we use the Sigma Nought Backscatter product at 10m resolution. We use all spectral bands from each product. Specifically, we use the three visible (RGB) and single near-infrared (NIR) bands from NAIP and S2; the single coastal aerosol (CA) band, four red-edge (RE1-4) bands, single water vapor (WV) band, single cirrus (C) band, and two shortwave infrared (SWIR1-2) bands from S2; and the V-transmit (VH and VV) bands from S1. We include S1 and S2 in the dataset in order to enable experimenting with coarser resolution satellite imagery which is globally available, unlike NAIP. The details of each imagery product and band are shown in Table 2.

In order to construct images containing each location in the dataset, we consider a 720m x 720m footprint centered around the location. This footprint was chosen to balance the size of the images with the contextual information, but we investigate this choice in the experiments. Due to the geographic coordinate noise in the publicly available datasets, we chose to center the imagery at the locations, which increases the likelihood that the facilities are captured in the imagery while still allowing natural variation in the positions of the facilities within the imagery. We construct a mosaic of the most recently captured pixels in a time range for each image product, where we consider NAIP images captured between 2017 and 2021 and Sentinel-1 and Sentinel-2 images between May and September 2021, with Sentinel-2 images selected based on lowest cloud cover. We use the Descartes Labs platform to download all of the imagery [14].

The total dataset contains 86,599 images capturing 19 spectral bands across the three imagery products. Information about the remotely sensed image products and bands included in the dataset is provided in Table 2 and characteristic examples for each methane source category are shown in Figure 2 in the Appendix.

2.4. Validation and test sets

Two Stanford University postdoctoral researchers with expertise in methane emissions and related infrastructure individually reviewed 1,533 examples to compose the held-out validation and test sets. To determine which examples to include in these held-out sets, we randomly sampled 150 images from each of the six positive classes as well as a random sample of 33 images which have multiple labels, constituting 933 positive examples according to the original public dataset labels. We additionally sampled 12 images from each of the 50 negative categories, resulting in 600 negative examples. The experts both manually reviewed these examples and identified the presence or absence of the six methane source categories by using a combination of NAIP imagery as well as Google Maps imagery, which often had finer spatial resolution as well as place names. The facility had to be captured by the NAIP image for the corresponding label to be assigned. If the expert identified no clearly visible methane source categories in the image, the example was labeled "negative", and if the expert was uncertain about any label, the example was labeled "uncertain". The two labels per example were then resolved as follows:

    1. If the experts agreed and neither was uncertain, the agreed upon label was taken as the final label.
    2. If the experts disagreed, and one was uncertain but the other was not, the expert's certain label was taken as the final label.
    3. If the experts disagreed, but one agreed with the original label, the original label was taken as the final label.
    4. In all other scenarios, the example was reviewed jointly by the experts and a final label was assigned.

Only 76 examples out of the 1,533 went to another round of review. The resulting datasets have 858 positive examples and 675 negative examples. We split the 1,533 examples into 515 for the validation set and 1,018 for the test set. The label counts on the validation and test sets are shown in Table 1.

3. Experiments

We run a variety of multi-label classification experiments on the curated dataset. In all of our experiments, we use a DenseNet-121 convolutional neural network architecture [15]. Preliminary experiments on the dataset explored various ResNet and DenseNet architectures and found that DenseNets outperformed all ResNet variants [16]. We use a linear layer which outputs six values indicating the likelihood that each of the six methane source
categories are present in the input image, which outperformed individual models across all classes in our preliminary experiments. Although the model does not explicitly produce a value indicating the likelihood that the image is negative, a low value assigned to all classes indicates a negative prediction. The loss function is the mean of six unweighted binary cross entropy losses, where the label is 1 if the class is present in the image and 0 otherwise. All six labels in the negative examples are 0. The network weights are initialized with weights from a network pretrained on ImageNet [17]. Before inputting the images into the networks, we upscale the Sentinel-1 and Sentinel-2 images to match the size of NAIP images using bilinear resampling and normalize the values by the display range of the bands (see Table 7 in the Appendix). When using inputs with fewer or more than 3 channels, we replace the first convolutional neural network layer with one which accepts the corresponding number of channels. Each model is trained for 5 epochs with a batch size of 4. For each model we use the checkpoint saved after the epoch which led to the lowest validation loss. We use an Adam optimizer with standard parameters [18] and a learning rate of 0.02. All models are trained using a GeForce GTX 1070 GPU.

The baseline setting for all experiments uses images capturing a footprint of 720m x 720m with 1m spatial resolution (720 x 720 image dimensions). After the models are trained, each of the six values output by the model are fed through an element-wise sigmoid function to produce a probability for each of the six categories. To evaluate the performance of the models, we compute the per-class area under the precision recall curve (AUPRC) and summarize the performance over all classes by taking the macro-average of the per-class AUPRCs.

3.1. Impact of using different imaging products and bands

We investigate the impact of using different combinations of image products and bands in the dataset (Table 3). Specifically, we experiment with NAIP, S2, and S1 alone; only visible bands and all spectral bands for S2 and NAIP; all spectral bands from S1 and S2 together (representing the model closest to public global transferability due to the global coverage of S1 and S2); and all spectral bands from the three products together.

Table 3: Per-class and overall (macro-average) validation AUPRC for different remotely sensed image products and bands. All of these experiments use images of size 720x720 at a spatial resolution of 1m per pixel, with S1 and S2 upsampled to that resolution.

  Image Product    Bands   CAFOs   Landfills   Mines   Proc Plants   R&Ts    WWTPs   Overall
  S1               VH&VV   0.519   0.107       0.152   0.218         0.487   0.119   0.267
  S2               RGB     0.889   0.268       0.305   0.374         0.694   0.204   0.456
  S2               All     0.889   0.189       0.382   0.368         0.690   0.183   0.450
  S2 & S1          All     0.923   0.152       0.379   0.391         0.612   0.231   0.448
  NAIP             RGB     0.903   0.270       0.348   0.327         0.849   0.182   0.480
  NAIP             All     0.945   0.276       0.401   0.508         0.857   0.303   0.548
  NAIP & S2 & S1   All     0.889   0.214       0.473   0.457         0.796   0.272   0.517

The best model according to macro-average AUPRC is the one which uses NAIP with all bands (the three visible bands and NIR band), achieving an overall AUPRC of 0.548 and the highest performance on CAFOs, Landfills, Proc Plants, R&Ts, and WWTPs compared to all other tested product and band combinations. Notably, it achieves very high performance on CAFOs (AUPRC=0.945) and high performance on R&Ts (AUPRC=0.857). The second best model is the joint NAIP+S2+S1 model, achieving an overall AUPRC of 0.517 and the highest performance on Mines (AUPRC=0.473) compared to all other tested product and band combinations.

S1 alone underperforms all other combinations of products and bands, followed by S2 and S1 jointly, which performed similarly overall to S2 with only the visible bands and with all spectral bands. Importantly, the S2 and S1 joint model still achieves high performance on CAFOs (AUPRC=0.923), although the performance is lower than performance on CAFOs using NAIP imagery (AUPRC=0.945). There is a significant drop in performance on all classes when moving from NAIP to S2, highlighting the benefit of using high spatial resolution imagery.

The inclusion of the non-visible information substantially improves overall AUPRC for NAIP (AUPRC=0.480 → 0.548) but not for Sentinel-2 (AUPRC=0.456 → 0.450). For NAIP, the improvement is observed for all classes, with substantial gains on CAFOs, Mines, Proc Plants, and WWTPs. For Sentinel-2, the inclusion of non-visible bands substantially improves performance on Mines but substantially degrades performance on Landfills. For both products, minimal change on R&Ts performance is observed when including the non-visible bands.
Table 4
Per-class and overall (macro-average) validation AUPRC at varying image footprints and spatial resolutions.
              Image Footprint Resolution CAFOs Landfills Mines Proc Plants             R&Ts WWTPs Overall
                 240x240         1m       0.773 0.217    0.407    0.438                0.735 0.337 0.485
                 480x480         1m       0.772 0.226    0.260    0.371                0.855 0.506 0.498
                 720x720         3m       0.891 0.245    0.378    0.566                0.837 0.269 0.531
                 720x720        1.5m      0.927 0.244    0.426    0.366                0.831 0.449 0.541
                 720x720         1m      0.945  0.276    0.401    0.508                0.857 0.303 0.548
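The footprint and resolution variants compared in Table 4 can be produced with a center crop (shrinking the ground footprint at fixed 1m resolution) or cubic-spline resampling (coarsening resolution at fixed 720m footprint). A minimal sketch, assuming NumPy arrays and using scipy.ndimage.zoom as one possible cubic-spline implementation (the paper does not name a library):

```python
import numpy as np
from scipy.ndimage import zoom

def center_crop(img: np.ndarray, size: int) -> np.ndarray:
    """Crop an (H, W, C) image to (size, size, C) about its center,
    shrinking the ground footprint while keeping 1m/pixel resolution."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def change_resolution(img: np.ndarray, size: int) -> np.ndarray:
    """Resample an (H, W, C) image to (size, size, C) with cubic spline
    interpolation, coarsening resolution at a fixed 720m footprint."""
    h, w = img.shape[:2]
    return zoom(img, (size / h, size / w, 1), order=3)

img = np.random.rand(720, 720, 4)        # NAIP RGB + NIR at 1m
crop_480 = center_crop(img, 480)         # 480m x 480m footprint, still 1m
res_3m = change_resolution(img, 240)     # 720m footprint at ~3m resolution
up = zoom(res_3m, (3, 3, 1), order=3)    # back to 720 x 720 before the model
```

Upsampling every variant back to 720 x 720, as in the paper, keeps input size from confounding the footprint and resolution comparisons.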


3.2. Impact of image footprint and spatial resolution

As image footprint (i.e. the amount of area on the ground captured by the image) and spatial resolution likely impact model performance due to the variation in the sizes of the methane-emitting facilities and equipment, we conduct experiments to test these effects (Table 4). To investigate the impact of footprint, we center crop the 720 x 720 1m images to obtain 480 x 480 and 240 x 240 1m images, corresponding to 480m x 480m and 240m x 240m footprints respectively. Note that this reduces the area on the ground with spatial resolution held constant. To investigate the impact of spatial resolution, we use cubic spline interpolation [19] to downsample the 720 x 720 images to 480 x 480 (1.5m resolution, corresponding to Airbus SPOT imagery) and 240 x 240 (3m resolution, corresponding to PlanetScope imagery). Note that this reduces the spatial resolution without modifying the image footprint. In all experiments, we up-sample the images back to 720 x 720 to avoid any differences in performance due to varying image size. We use NAIP with RGB + NIR bands for these experiments as this setting produced the best overall performance compared to the other combinations of products and bands.

We find that the largest tested image footprint achieves the highest overall performance (0.548) and substantially outperforms both smaller spatial footprints across all classes except for WWTPs. This may be explained by the fact that a significant number of smaller wastewater treatment plants are surrounded by industrial buildings and other infrastructure, so cropping out this infrastructure improves the model's ability to identify the salient features of the wastewater treatment facilities.

We further find that the highest spatial resolution achieves the best overall performance (AUPRC=0.548), outperforming the coarser resolution models on CAFOs, Landfills, and R&Ts. The 1.5m resolution model closely follows with an overall AUPRC of 0.541 and outperforms the 1m resolution model on Mines. The 3m resolution model also closely follows the 1.5m resolution model, achieving an overall performance of 0.531, and substantially outperforms both higher resolution models on Proc Plants. This result suggests that models developed at 1.5m and even 3m resolution have the potential to perform almost as well as 1m resolution models, which has implications for global applicability as Airbus SPOT and PlanetScope are globally (privately) available at 1.5m and 3m resolution respectively.

3.3. Per-class expert model test set results

For each methane source category, we select the experimental configuration (product/band/footprint/resolution) that achieved the highest validation AUPRC for that class to serve as the "class expert". We refer to the combination of the different class experts as the per-class expert model.

We evaluate the per-class expert model on the held-out test set using a variety of metrics including AUPRC and area under the receiver operating characteristic curve (AUROCC), as well as precision, recall, and F1 at the threshold which achieves the highest F1 on the validation set. The results are shown in Table 5. The per-class expert model obtains a macro-average AUPRC of 0.558. The model does especially well on CAFOs (AUPRC=0.915) and R&Ts (AUPRC=0.821), possibly because these sources have very distinctive features (e.g., long barns in CAFOs and storage tank farms in R&Ts). It performs more poorly on the other sources, especially landfills, which do not have many clear distinctive features visible at 1m resolution. Notably, it achieves the lowest performance on the categories with the fewest examples in the dataset, excluding R&Ts which may be simpler to identify due to their homogeneity and discernible features.

Table 5: Per-class and overall (macro-average) test metrics of the per-class expert model. The per-class expert model consists of one model per class, where the model used for each class is selected based on the highest performing settings for that class across the product, bands, footprint, and resolution experiments.

  Category      AUPRC   AUROCC   Precision   Recall   F1
  CAFOs         0.915   0.989    0.822       0.902    0.860
  Landfills     0.259   0.754    0.246       0.523    0.334
  Mines         0.470   0.905    0.558       0.403    0.468
  Proc Plants   0.350   0.787    0.336       0.477    0.394
  R&Ts          0.821   0.956    0.752       0.787    0.769
  WWTPs         0.534   0.836    0.633       0.477    0.544
  Overall       0.558   0.871    0.558       0.595    0.562
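The evaluation behind Table 5 reports precision, recall, and F1 at the threshold maximizing validation F1, alongside AUPRC. A minimal per-class sketch, assuming scikit-learn (average_precision_score is one common AUPRC estimator; the paper does not state which estimator it uses, and the validation labels below are hypothetical):

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def best_f1_threshold(y_true, y_prob):
    """Probability threshold that maximizes F1 on held-out validation labels."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    # The final precision/recall pair (P=1, R=0) has no associated threshold.
    return thresholds[np.argmax(f1[:-1])]

# Hypothetical validation labels/probabilities for a single category:
y_val = np.array([0, 0, 1, 1, 1, 0])
p_val = np.array([0.10, 0.40, 0.35, 0.80, 0.90, 0.20])
threshold = best_f1_threshold(y_val, p_val)
auprc = average_precision_score(y_val, p_val)
# Per-class AUPRCs are then macro-averaged to give the "Overall" row.
```

The threshold is fixed on the validation set and reused unchanged on the test set, so the test precision/recall/F1 columns reflect a deployable operating point rather than a test-set-tuned one.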
4. Discussion

The experiments suggest that the choice of imaging product, spectral band, image footprint, and spatial resolution can lead to substantial differences in model performance, with the effect often depending on the methane source category. In particular, this suggests that there is significant room to explore approaches which leverage the multi-sensor and multi-spectral aspects of METER-ML. For example, the NAIP & S2 & S1 model underperformed the model which used NAIP alone, and using all 13 spectral bands in the S2 model did not lead to substantial performance differences compared to the S2 model which only used the three visible bands. We also do not leverage the geographic information explicitly in the models, but this has been shown to improve performance on other Earth observation tasks [20, 21]. Furthermore, there is potential to augment the dataset with other sources of imagery and information available at the provided geographic locations. We hope to help create new versions of METER-ML which may include other sources of input data and methane-emitting infrastructure categories.
   The best model from our experiments achieves high performance on identifying CAFOs and R&Ts, suggesting the potential to map these facilities with NAIP imagery in the U.S., which aligns with findings from prior studies [9, 10]. The performance for identifying CAFOs remains high when using S1 and S2, which are globally and publicly available. This suggests the potential to use these lower-spatial-resolution imagery sources to map CAFOs in countries other than the U.S., but future work should investigate whether these findings generalize to other regions. There is still a large gap to achieving high performance on each of the other methane source categories and to further improving performance on the high-performing categories, so METER-ML is a challenging benchmark for testing new infrastructure identification approaches.
   There are many other publicly available remote sensing datasets for classification, with some of the most common being UC Merced [22], SAT-4 and SAT-6 [23], AID [24], NWPU-RESISC45 [25], EuroSAT [26], and BigEarthNet [27]. Few of these datasets have georeferenced multi-sensor images, which limits their utility for new modeling approaches and downstream use. The OGNet dataset [9] is the most similar publicly available dataset to METER-ML and is essentially a subset of it, containing NAIP imagery of refineries in the contiguous U.S.
   We identify four limitations of this work. First, we limit the geographic scope of METER-ML to the U.S. due to the availability of disseminatable infrastructure data and publicly available, high-resolution imagery. Future work should include data in other regions worldwide. Second, we do not include longitudinal imagery in the dataset to reduce its size and complexity, as most infrastructure is static over time. However, longitudinal information has the potential to provide additional signal to help differentiate certain facilities, e.g., waste pile evolution at landfills. Third, we use a DenseNet121 model that is pre-trained on ImageNet, but the shape and number of channels of remote sensing imagery can be significantly different from ImageNet. It would be worthwhile to train a network from scratch on METER-ML and compare its performance against a network that is pre-trained on ImageNet and fine-tuned on METER-ML. Fourth, our approach to combining the multi-sensor data may not be optimal, as the products and spectral bands have different spatial resolutions and sensor types (e.g., active vs. passive sensors). One alternative approach may be to dedicate different network branches to the inputs and combine the representations from each branch.

5. Conclusion

In this work, we curate a large georeferenced multi-sensor dataset called METER-ML to test automated methane source identification approaches. We conduct a variety of experiments investigating the impact of remotely sensed image product, spectral bands, image footprint, and spatial resolution on model performance measured against a consensus of expert labels. We find that a model which leverages NAIP with all four bands achieves the highest overall performance across the tested image product and spectral band combinations, followed closely by a joint NAIP, Sentinel-2, and Sentinel-1 model. We also find that the highest spatial resolution and footprint leads to the best overall performance, although performance can depend on the methane source category. Finally, we show that the best model achieves high performance in identifying concentrated animal feeding operations and oil refineries and petroleum terminals, suggesting the potential to map them at scale, but substantially lower performance on the other four categories, with notably lower performance identifying processing plants and landfills. We make METER-ML freely available in order to encourage and support future work on developing Earth observation models for mitigating climate change.

Acknowledgments

This work was supported by the High Tide Foundation to construct the METER database. We acknowledge Rose Rustowicz and Kyle Story for their support of this work, as well as the Descartes Labs Platform API and tools for downloading and processing the remotely sensed imagery. We also thank Ritesh Gautam and Mark Omara for their help working with the oil and gas infrastructure data, Evan Sherwin for his advice on the dataset and methane source categories, and Victor Maus for providing the coal mines data.
References

 [1] Z. Zhang, B. Poulter, S. Knox, A. Stavert, G. McNicol, E. Fluet-Chouinard, A. Feinberg, Y. Zhao, P. Bousquet, J. G. Canadell, et al., Anthropogenic emission is the main contributor to the rise of atmospheric methane during 1993–2017, National Science Review 9 (2022) nwab200.
 [2] Paris Agreement, in: Report of the Conference of the Parties to the United Nations Framework Convention on Climate Change (21st Session, 2015: Paris), volume 4, HeinOnline, 2015, p. 2017.
 [3] D. J. Jacob, D. J. Varon, D. H. Cusworth, P. E. Dennison, C. Frankenberg, R. Gautam, L. Guanter, J. Kelley, J. McKeever, L. E. Ott, et al., Quantifying methane emissions from the global scale down to point sources using satellite observations of atmospheric methane, Atmospheric Chemistry and Physics Discussions (2022) 1–44.
 [4] J. Yu, Z. Wang, A. Majumdar, R. Rajagopal, DeepSolar: A machine learning framework to efficiently construct a solar deployment database in the United States, Joule 2 (2018) 2605–2617.
 [5] J. Lee, N. R. Brooks, F. Tajwar, M. Burke, S. Ermon, D. B. Lobell, D. Biswas, S. P. Luby, Scalable deep learning to identify brick kilns and aid regulatory capacity, Proceedings of the National Academy of Sciences 118 (2021).
 [6] L. Kruitwagen, K. Story, J. Friedrich, L. Byers, S. Skillman, C. Hepburn, A global inventory of photovoltaic solar energy generating units, Nature 598 (2021) 604–610.
 [7] W. Sirko, S. Kashubin, M. Ritter, A. Annkah, Y. S. E. Bouchareb, Y. Dauphin, D. Keysers, M. Neumann, M. Cisse, J. Quinn, Continental-scale building detection from high resolution satellite imagery, arXiv preprint arXiv:2107.12283 (2021).
 [8] S. Dileep, D. Zimmerle, J. R. Beveridge, T. Vaughn, Automated identification of oil field features using CNNs, in: NeurIPS Workshop on Tackling Climate Change with Machine Learning, 2020.
 [9] H. Sheng, J. Irvin, S. Munukutla, S. Zhang, C. Cross, K. Story, R. Rustowicz, C. Elsworth, Z. Yang, M. Omara, et al., OGNet: Towards a global oil and gas infrastructure database using deep learning on remotely sensed imagery, arXiv preprint arXiv:2011.07227 (2020).
[10] C. Handan-Nader, D. E. Ho, Deep learning to map concentrated animal feeding operations, Nature Sustainability 2 (2019) 298–306.
[11] H. Li, J. Zech, D. Hong, P. Ghamisi, M. Schultz, A. Zipf, Leveraging OpenStreetMap and multimodal remote sensing data with joint deep learning for wastewater treatment plants detection, International Journal of Applied Earth Observation and Geoinformation 110 (2022) 102804.
[12] Stanford University, Methane tracking emissions reference platform, 2022. URL: https://meterplatform.web.app/, accessed: 2022-06-30.
[13] R. Keisler, S. W. Skillman, S. Gonnabathula, J. Poehnelt, X. Rudelis, M. S. Warren, Visual search over billions of aerial and satellite images, Computer Vision and Image Understanding 187 (2019) 102790.
[14] Descartes Labs, Descartes Labs platform, 2022. URL: https://descarteslabs.com/, accessed: 2022-06-26.
[15] G. Huang, Z. Liu, L. Van Der Maaten, K. Q. Weinberger, Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[16] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255.
[18] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
[19] P. S. Parsania, P. V. Virparia, A comparative analysis of image interpolation algorithms, International Journal of Advanced Research in Computer and Communication Engineering 5 (2016) 29–34.
[20] O. Mac Aodha, E. Cole, P. Perona, Presence-only geographical priors for fine-grained image classification, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 9596–9606.
[21] J. Irvin, H. Sheng, N. Ramachandran, S. Johnson-Yu, S. Zhou, K. Story, R. Rustowicz, C. Elsworth, K. Austin, A. Y. Ng, ForestNet: Classifying drivers of deforestation in Indonesia using deep learning on satellite imagery, arXiv preprint arXiv:2011.05479 (2020).
[22] Y. Yang, S. Newsam, Geographic image retrieval using local invariant features, IEEE Transactions on Geoscience and Remote Sensing 51 (2012) 818–832.
[23] S. Basu, S. Ganguly, S. Mukhopadhyay, R. DiBiano, M. Karki, R. Nemani, DeepSat: A learning framework for satellite imagery, in: Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2015, pp. 1–10.
[24] G.-S. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, X. Lu, AID: A benchmark data set for performance evaluation of aerial scene classification, IEEE Transactions on Geoscience and Remote Sensing 55 (2017) 3965–3981.
[25] G. Cheng, J. Han, X. Lu, Remote sensing image scene classification: Benchmark and state of the art, Proceedings of the IEEE 105 (2017) 1865–1883.
[26] P. Helber, B. Bischke, A. Dengel, D. Borth, EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12 (2019) 2217–2226.
[27] G. Sumbul, M. Charfuelan, B. Demir, V. Markl, BigEarthNet: A large-scale benchmark archive for remote sensing image understanding, in: IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium, IEEE, 2019, pp. 5901–5904.
[28] CA Governor's Office of Emergency Services, CA Energy Commission - oil refineries and terminals, 2021. URL: https://arcg.is/1rXn4G0, accessed: 2022-02-01.
[29] CA State Water Resources Control Board, 2022. URL: https://www.waterboards.ca.gov/resources/data_databases/site_map.html, accessed: 2022-02-01.
[30] D. McKernan, 2012 factory farms in Michigan, 2012. URL: https://data.world/dmckernan/2012-factory-farms-in-michigan, accessed: 2022-02-01.
[31] O. Tsubiks, Concentrated animal feeding operations (CAFO), US, 2017. URL: https://data.world/dataforacause/concentrated-animal-feeding-operations-cafo, accessed: 2022-02-01.
[32] U.S. Energy Information Administration, U.S. energy mapping system, 2021. URL: https://www.eia.gov/state/maps.php, accessed: 2022-02-01.
[33] U.S. Environmental Protection Agency, Greenhouse gas reporting program (GHGRP), 2021. URL: https://www.epa.gov/ghgreporting, accessed: 2022-02-01.
[34] K. Rose, J. Bauer, V. Baker, A. Barkhurst, A. Bean, J. DiGiulio, K. Jones, T. Jones, D. Justman, R. Miller, et al., Global Oil & Gas Features Database, Technical Report, 2018.
[35] U.S. Department of Homeland Security, Homeland infrastructure foundation-level data (HIFLD), 2021. URL: https://hifld-geoplatform.opendata.arcgis.com/, accessed: 2022-02-01.
[36] H. Ehalt Macedo, B. Lehner, J. Nicell, G. Grill, J. Li, A. Limtong, R. Shakya, Distribution and characteristics of wastewater treatment plants within the global river network, Earth System Science Data 14 (2022) 559–577.
[37] Indiana Department of Environmental Management, Confined feeding operation facilities: Indiana, 2020. URL: https://maps.indiana.edu/metadata/Environment/Agribusiness_Confined_Feeding_Operations.html#Identification_Information, accessed: 2022-02-01.
[38] U.S. Environmental Protection Agency, Landfill methane outreach program, 2022. URL: https://www.epa.gov/lmop/project-and-landfill-data-state, accessed: 2022-02-01.
[39] A. J. Marchese, T. L. Vaughn, D. J. Zimmerle, D. M. Martinez, L. L. Williams, A. L. Robinson, A. L. Mitchell, R. Subramanian, D. S. Tkacik, J. R. Roscioli, et al., Methane emissions from United States natural gas gathering and processing, Environmental Science & Technology 49 (2015) 10718–10727.
[40] V. Maus, S. Giljum, D. M. da Silva, J. Gutschlhofer, R. P. da Rosa, S. Luckeneder, S. L. Gass, M. Lieber, I. McCallum, An update on global mining land use, Scientific Data (2022).
[41] Metropolitan Council, Wastewater treatment plants, 2020. URL: https://gisdata.mn.gov/dataset/us-mn-state-metc-util-wastewater-treatment-plants, accessed: 2022-02-01.
[42] MN Pollution Control Agency, Feedlots in Minnesota, 2016. URL: https://gisdata.mn.gov/dataset/env-feedlots, accessed: 2022-02-01.
[43] ORNL DAAC, Sources of methane emissions (Vista-CA), state of California, USA, 2019. URL: https://daac.ornl.gov/NACP/guides/NACP_Vista_CA_CH4_Inventory.html, accessed: 2022-02-01.
[44] Sierra Club Michigan Chapter, Michigan CAFO mapping report, 2017. URL: https://www.sierraclub.org/michigan/michigan-cafo-mapping-report, accessed: 2022-02-01.
[45] Stanford Regulation, Evaluation, and Governance Lab, CAFO training dataset, 2019. URL: https://reglab.stanford.edu/data/cafo-training-dataset/, accessed: 2022-02-01.
A. Appendix
A.1. Methane-Emitting Infrastructure Datasets
Table 6
Summary of the datasets containing the locations of methane-emitting infrastructure that are included in METER-ML. All
datasets are subsetted to the locations within the contiguous U.S. We use the centroids of polygons for any datasets provided
in polygon format.
                                Dataset Source                    Scope          Methane Source Categories
                         CA Energy Commission [28]              California                   R&Ts
                                 CSWRCB [29]                    California                 WWTPs
                                data.world [30]                 Michigan                    CAFOs
                        Data For Cause Challenge [31]              US                       CAFOs
                                    EIA [32]                       US            Proc Plants, R&Ts, WWTPs
                                  GHGRP [33]                       US          Landfills, Mines, R&Ts, WWTPs
                                   GOGI [34]                      Global              Proc Plants, R&Ts
                                  HIFLD [35]                       US                   R&Ts, WWTPs
                              HydroWASTE [36]                     Global                   WWTPs
                               IndianaMap [37]                   Indiana                    CAFOs
                                  LMOP [38]                        US                      Landfills
                             Marchese et al. [39]                  US                     Proc Plants
                                Maus et al. [40]                  Global                    Mines
                     Minnesota Metropolitan Council [41]       Minnesota                   WWTPs
                    Minnesota Pollution Control Agency [42]    Minnesota                    CAFOs
                              ORNL DAAC [43]                    California                  CAFOs
                                Sierra Club [44]                Michigan                    CAFOs
                            Stanford RegLab [45]              North Carolina                CAFOs
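The caption of Table 6 notes that polygon-format datasets are reduced to centroids. A minimal sketch of that reduction, using the standard shoelace centroid formula, is shown below; dataset-specific coordinate reference system handling is omitted, and the paper's actual pipeline likely relied on GIS tooling rather than this hand-rolled version.

```python
# Hypothetical sketch: reduce a facility polygon to its centroid via the
# shoelace (Gauss area) formula for a simple, non-self-intersecting polygon.

def polygon_centroid(vertices):
    """Centroid (cx, cy) of a simple polygon given as [(x, y), ...]."""
    a = cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0  # signed area contribution of this edge
        a += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    a *= 0.5
    return cx / (6 * a), cy / (6 * a)

# Unit square centers at (0.5, 0.5).
print(polygon_centroid([(0, 0), (1, 0), (1, 1), (0, 1)]))  # (0.5, 0.5)
```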
A.1.1. Coal Mines Data
The mines data from [40] were subsetted to coal mines in order to capture the mines responsible for the vast majority
of methane emissions. To do this, the polygons and coal mine coordinates obtained from S&P Global Commodity
Insights were matched to determine which polygons were spatially related to coal mine coordinates. A visual check and hand cleaning were then performed on the polygons assigned a coal-mining label to ensure correctness.
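The spatial matching step above amounts to testing whether a coal-mine coordinate falls inside a mining polygon. A self-contained sketch of that test via ray casting follows; the actual matching was presumably done with GIS libraries, so this is illustrative only.

```python
# Hypothetical sketch of matching a mine coordinate to a mining polygon
# with a ray-casting point-in-polygon test.

def point_in_polygon(pt, polygon):
    """Does point (x, y) fall inside a simple polygon [(x, y), ...]?"""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x0, y0 = polygon[i]
        x1, y1 = polygon[(i + 1) % n]
        # Toggle on each crossing of a horizontal ray extending right from pt.
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

mine_polygon = [(0, 0), (2, 0), (2, 2), (0, 2)]
print(point_in_polygon((1, 1), mine_polygon))  # True
print(point_in_polygon((3, 1), mine_polygon))  # False
```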

A.2. Negative Classes
We identify a variety of infrastructure that does not belong to any of the six infrastructure categories to use as negatives in the dataset. Specifically, we include football fields, marinas, solar panels, large bodies of water, parking lots,
windmills, baseball fields, airport runways, clouds, neighborhoods, golf courses, roundabouts, mountainous terrain,
trees, boats, islands, rocks, rivers, roads, bridges, ripples in water, snow, canyon formations, sparse forests, suburban
neighborhoods, beaches, clear water, swimming pools, sand, corn farms, soy farms, trees on mountainside, farm
houses, grass, airplanes, turning roads, intersections, multifamily residential facilities, rapids, docks, highway loops,
mowed grass, container yards, soccer fields, greenhouses, crops, personal watercrafts, pivot irrigation systems, and
concrete plants. Characteristic examples of each type were selected and a variety of similar examples per type were
obtained using the Descartes Labs GeoVisual Similarity tool [13].
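Similarity tools like the one cited above typically rank candidate images by the proximity of their embeddings to a query embedding. A hedged sketch of that retrieval pattern, over hypothetical precomputed embedding vectors with made-up ids, is given here; it does not reflect the internals of the Descartes Labs tool.

```python
# Illustrative embedding-similarity retrieval: rank a toy index of image
# embeddings by cosine similarity to a query vector.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_similar(query, embeddings, k=2):
    """Return the k embedding ids most similar to the query vector."""
    ranked = sorted(embeddings, key=lambda eid: -cosine(query, embeddings[eid]))
    return ranked[:k]

# Toy index: id -> embedding (hypothetical values).
index = {
    "golf_course_1": [0.9, 0.1, 0.0],
    "golf_course_2": [1.0, 0.0, 0.0],
    "parking_lot_1": [0.0, 1.0, 0.0],
}
print(top_k_similar([1.0, 0.05, 0.0], index, k=2))
```

A query near the golf-course embeddings retrieves both golf-course examples before the parking lot, mirroring how one characteristic negative example can seed many similar ones.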

A.3. Remotely Sensed Image Statistics and Examples
Table 7
Summary of the remotely sensed image products and bands included in METER-ML. The raw image data contains values from
the data range, while the display range is used to normalize values before displaying imagery or inputting into models.
                            Product/Bands       Image Size    Resolution   Data Range   Display Range
                           NAIP RGB & NIR        720x720          1m          [0,255]        [0,255]
                             Sentinel-2 CA        12x12          60m        [0,10000]      [0,10000]
                            Sentinel-2 RGB        72x72          10m        [0,10000]       [0,4000]
                           Sentinel-2 RE1-4       36x36          20m        [0,10000]      [0,10000]
                            Sentinel-2 NIR        72x72          10m        [0,10000]      [0,10000]
                            Sentinel-2 WV         12x12          60m        [0,10000]      [0,10000]
                              Sentinel-2 C        12x12          60m        [0,10000]      [0,10000]
                          Sentinel-2 SWIR1-2      36x36          20m        [0,10000]      [0,10000]
                             Sentinel-1 VH        72x72          10m         [1,4095]     [585,2100]
                             Sentinel-1 VV        72x72          10m         [1,4095]     [585,2926]
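The display ranges in Table 7 suggest a simple clip-and-rescale normalization before display or model input. A minimal sketch under that assumption follows; the exact normalization used in the paper's pipeline is not specified here, and only a few bands from the table are included.

```python
# Hypothetical clip-and-rescale normalization using display ranges from
# Table 7: values are clipped to [lo, hi], then scaled to [0, 1].

DISPLAY_RANGES = {
    "NAIP": (0, 255),
    "S2_RGB": (0, 4000),
    "S1_VH": (585, 2100),
}

def normalize(pixels, band):
    """Clip raw pixel values to the band's display range and scale to [0, 1]."""
    lo, hi = DISPLAY_RANGES[band]
    return [(min(max(p, lo), hi) - lo) / (hi - lo) for p in pixels]

# Raw Sentinel-2 values above the display range saturate at 1.0.
print(normalize([0, 2000, 4000, 9000], "S2_RGB"))  # [0.0, 0.5, 1.0, 1.0]
```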
Figure 2: Characteristic example images of each category in METER-ML. METER-ML contains 19 spectral bands across 3
image products, visualized here in grayscale for single bands and false color composites for multiple bands (the first listed
band is used in the red channel, second in the green, third in the blue).




[Image grids omitted. First grid: rows CAFOs, Coal Mines, Landfills, Proc Plants, R&Ts, WWTPs, and Negatives; columns NAIP RGB, NAIP NIR, and S1 VV&VH. Second grid: same rows; columns S2 RGB, S2 NIR, S2 RE1&SWIR1-2, S2 RE2-4, and S2 CA&WV&C.]