Background

The MODIS satellites have collected a huge volume of data that have applications in numerous ecological, geographical, and other scientific pursuits. These data are collected daily, but in many instances only reported once for a given 16-day time window. Dealing with the metadata for these products can be painstaking.

MODISTools is an R package that allows users to download time series data that were collected by the MODIS satellites. It is ideal for point data, and enables the download of data from just one pixel per location or data from a 2-dimensional pixel window centered on each location.

Here, I’ll explain first how to identify available MODIS data, and how to download MODIS data using MODISTools. Then, I’ll show how to convert the metadata into a meaningful format and clean NDVI time series data using metadata parameters. Finally, I’ll take a quick look at the NDVI time series at a few locations and make some guesses about what’s going on at each site. There will be lots of simple visualization steps along the way to help clarify what the code is actually doing.

An example

Import packages:

library(MODISTools)
library(ggplot2)
library(sf)
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
library(USAboundaries)
library(ggrepel)

Set up parameters

Here, I selected a few exemplar locations for exploring NDVI time series in California. They include an oak woodland (Henry Coe State Park), an agricultural field (Solano), a large lake (Lake Tahoe), the high Sierra (Mt. Lyell), an evergreen forest (Whisky Falls), and desert (Alabama Hills). To successfully download data using MODISTools, we need to create a data.frame with columns “site_name”, “lat”, and “lon” with the latitude and longitude coordinates expressed in epsg:4326. The ROIs declaration could be replaced with any data frame containing columns with these exact names. We can plot the locations over a California shapefile for reference.

# Define sites of interest
ROIs = data.frame("site_name" = c("Henry Coe",
                                  "Solano",
                                  "Lake Tahoe",
                                  "Mt. Lyell",
                                  "Whisky Falls",
                                  "Alabama Hills"),
                  "lat" = c(37.2021213, 
                            38.4412444, 
                            39.0898006, 
                            37.7393911, 
                            37.2869562, 
                            36.634138),
                  "lon" = c(-121.4346147, 
                            -121.7082489, 
                            -120.0329867, 
                            -119.2805897, 
                            -119.4471062,
                            -118.112799))

# Access CA boundary shape
CA = USAboundaries::us_states(resolution = "low",
                              states = "CA")
# Plot
ggplot() +
  geom_sf(data = CA) +
  geom_point(data = ROIs,
             aes(x = lon, y = lat)) +
  geom_text_repel(data = ROIs,
                  aes(x = lon,
                      y = lat,
                      label = site_name))

We also need to decide which MODIS product we’re going to use, and the timeframe over which we want to analyze. Here, we will use the MOD13Q1 product, which includes NDVI (and EVI) along with several relevant metadata layers.

# List available products, choose one
products = mt_products()
products$product
##  [1] "Daymet"   "MCD12Q1"  "MCD12Q2"  "MCD15A2H" "MCD15A3H" "MCD19A3" 
##  [7] "MCD64A1"  "MOD09A1"  "MOD11A2"  "MOD13Q1"  "MOD14A2"  "MOD15A2H"
## [13] "MOD16A2"  "MOD17A2H" "MOD17A3H" "MOD21A2"  "MOD44B"   "MYD09A1" 
## [19] "MYD11A2"  "MYD13Q1"  "MYD14A2"  "MYD15A2H" "MYD16A2"  "MYD17A2H"
## [25] "MYD17A3H" "MYD21A2"  "SPL4CMDL" "VNP09A1"  "VNP09H1"  "VNP13A1" 
## [31] "VNP15A2H" "VNP21A2"
MODproduct = "MOD13Q1"

# See available dates, choose a window
availableDates = mt_dates(MODproduct, ROIs$lat[1], ROIs$lon[1])
head(availableDates)
##   modis_date calendar_date
## 1   A2000049    2000-02-18
## 2   A2000065    2000-03-05
## 3   A2000081    2000-03-21
## 4   A2000097    2000-04-06
## 5   A2000113    2000-04-22
## 6   A2000129    2000-05-08
startDate = "2016-01-01"
endDate = "2019-12-31"

# List available bands
bands = mt_bands(MODproduct)
bands$band
##  [1] "250m_16_days_VI_Quality"               
##  [2] "250m_16_days_blue_reflectance"         
##  [3] "250m_16_days_pixel_reliability"        
##  [4] "250m_16_days_NIR_reflectance"          
##  [5] "250m_16_days_MIR_reflectance"          
##  [6] "250m_16_days_NDVI"                     
##  [7] "250m_16_days_red_reflectance"          
##  [8] "250m_16_days_composite_day_of_the_year"
##  [9] "250m_16_days_relative_azimuth_angle"   
## [10] "250m_16_days_sun_zenith_angle"         
## [11] "250m_16_days_EVI"                      
## [12] "250m_16_days_view_zenith_angle"
# Useful here are bands 1 (VI quality), 3 (Pixel reliability), 
# 6 (NDVI), 8 (Composite day of year) and possibly 11 (EVI), 
# depending on your needs. Here we'll ignore EVI.
bandsOfInterest = bands[c(1,3,6,8),]$band
bandsOfInterest
## [1] "250m_16_days_VI_Quality"               
## [2] "250m_16_days_pixel_reliability"        
## [3] "250m_16_days_NDVI"                     
## [4] "250m_16_days_composite_day_of_the_year"

Download MODIS data

Let’s download the time series for each band of interest (an element of bandsOfInterest), at each location (a row in ROIs). This takes a while, especially for larger datasets, so you may want to get out a good book or take a walk. On my computer, downloading MOD13Q1 data from the 6 locations in ROIs above took around 4 minutes when I downloaded data for 2016-2019. That speed can depend on number of sites, duration of data window, and traffic on the host website’s end.

# Benchmark the download time
dlStart = Sys.time()

# Batch download MODIS timeseries
VIQ = mt_batch_subset(df = ROIs,
                      product = MODproduct,
                      band = bandsOfInterest[1],
                      internal = T,
                      start = startDate,
                      end = endDate)
PR = mt_batch_subset(df = ROIs,
                     product = MODproduct,
                     band = bandsOfInterest[2],
                     internal = T,
                     start = startDate,
                     end = endDate)
NDVI = mt_batch_subset(df = ROIs,
                      product = MODproduct,
                      band = bandsOfInterest[3],
                      internal = T,
                      start = startDate,
                      end = endDate)
DOY = mt_batch_subset(df = ROIs,
                      product = MODproduct,
                      band = bandsOfInterest[4],
                      internal = T,
                      start = startDate,
                      end = endDate)
dlEnd = Sys.time()
dlEnd - dlStart
## Time difference of 3.447753 mins

Dealing with the VIQ beast

First, note that in the VI quality layer, values are expressed as numbers with a base 10 counting system. The VIQ layer needs to be converted into binary form in order to be interpreted.

# Make a new data.frame that will contain binarized VIQ values.
VIQbin = VIQ

# Solve for VI Quality
# Source: https://gis.stackexchange.com/questions/144441/how-can-i-parse-modis-mod13q1-quality-layers-in-r
first_k_bits <- function(int, k=16, reverse=T) {
  integer_vector <- as.integer(intToBits(int))[1:k]
  if(reverse) integer_vector <- rev(integer_vector)
  return(paste(as.character(integer_vector), collapse=""))
}

# We can check the output of the function using 7633, a point of reference provided on the MODIS QA tutorial (see link above)

# first_k_bits(7633)
# Binary of 7633 is 0001110111010001

# MODLAND QA = bits 0-1 = bitword 01
# QA bitword 01 for 7633 is "01"
# therefore substr(firstkbits(x), start = 15, stop = 16) is
# an indicator of overall VI Quality with 
# 00 = good quality
# 01 = VI produced but check other layers
# 10 = Probably cloudy
# 11 = Pixel not produced

# Binarize the VIQ values:
VIQbin_list = lapply(VIQ$value,
                FUN = first_k_bits)
VIQbin_vector = unlist(VIQbin_list)
VIQbin$value = as.character(VIQbin_vector)

Compile data

Since information from each band is stored in a different dataframe, let’s compile everything into a list so that we can perform filtering steps more efficiently. From now on, all of the data we work with will exist in list format, where the list elements are, in order: -Binarized VI Quality (VIQbin) -Pixel Reliability (PR) -NDVI -Composite Day of Year (DOY)

# Compile each band's time series dataframe into a list.
myBands = list(VIQbin,
               PR,
               NDVI,
               DOY)

Metadata parameters

Now that we have the VI Quality time series in a format that we can interpret, let’s look at some of the metadata parameters over the course of the time series. First, plot the Pixel Reliability data. Good data is indicated by a pixel reliability value of 0. When the pixel reliability value is 1, the data may be ok, but other quality control measurements should be investigated. Values less than 0 and greater than 1 are no good.

# Visualize Pixel Reliability time series
# Note that PR == 0 means good data
# PR == 1 is maybe good, but we should check other VIQ stuff
# Here, I added a slight jitter to aid with visibility of 
# overlapping points.
ggplot(myBands[[2]],
       aes(x = as.POSIXct(calendar_date),
           y = value,
           col = site)) + 
  geom_jitter(width = 0.1, height = 0.1) +
  geom_line() +
  ylab("QA") +
  xlab("Date") +
  ggtitle("Raw pixel quality") +
  theme_classic() +
  scale_color_brewer(palette = "Dark2")

We can also look at factors directly from the VI Quality layer, for example the land/water and snow/ice flags. We need to access the bitword for each of these VI Quality parameters by using the substr() function, which allows us to pull out elements from a particular position in a string. For the land/water mask, values of 001 indicate we’re looking at land and only land. Values between 011 and 111 indicate we’re looking at water. With the snow/ice flag, 0 is good data and 1 indicates snow or ice may be present.

# Land/water; this flag lives on bits 11-13, which comprise the 
# 3rd through 5th integers in the converted VIQ string.
ggplot(myBands[[1]],
       aes(x = as.POSIXct(calendar_date),
           y = substr(value, start = 3, stop = 5),
           col = site)) + 
  geom_jitter(width = 0.1, height = 0.1) +
  ylab("LandWater") +
  xlab("Date") +
  ggtitle("Land/Water Flag") +
  theme_classic() +
  scale_color_brewer(palette = "Dark2")

# Snow/ice; this flag is bit 14, which is the second integer 
# in the converted VIQ string
ggplot(myBands[[1]],
       aes(x = as.POSIXct(calendar_date),
           y = as.numeric(substr(value, start = 2, stop = 2)),
           col = site)) + 
  geom_jitter(width = 0.1, height = 0.1) +
  geom_line() +
  ylab("SnowIce") +
  xlab("Date") +
  ggtitle("Snow/Ice Flag") +
  theme_classic() +
  scale_color_brewer(palette = "Dark2")

It’s worth noting here that the point directly on Lake Tahoe is always flagged as water, and sometimes even flagged as deep ocean! The majority of our snow/ice observations come from the high-elevation sites: Lake Tahoe and Mt. Lyell. Let’s look at the raw NDVI time series for each site. Because of the way the MODIS satellites report NDVI, we need to scale the data by dividing by 10,000.

# Visualize raw NDVI
# Note that MODIS reports NDVI on a scale of 10,000, so divide the values accordingly for visualization.
Rawplot = ggplot(myBands[[3]],
       aes(x = as.POSIXct(calendar_date),
           y = value/10000,
           col = site)) + 
  geom_point() +
  geom_line() +
  ylab("NDVI") +
  xlab("Date") +
  ggtitle("Raw NDVI download") +
  theme_classic() +
  scale_color_brewer(palette = "Dark2"); Rawplot

Data cleaning

Several clear patterns emerge from the data, but let’s clean up the data before drawing any sweeping conclusions. Begin by filtering out poor-quality VI values based on the Pixel Reliability timeseries. Because each timeseries is saved in a common list, we can apply a filter that converts all band values to NA when a criterion is met. Let’s filter the data by saving only the points where the Pixel Reliability layer indicates at least marginally good quality data (see MOD13 user guide for more).

# Remove instances where pixel reliability is > 1 (bad data!)
# This makes a new list object with the same elements as myBands, 
# but values for all bands where Pixel Reliability > 1 are 
# converted into NA's.
PR_filtered = lapply(myBands,
                     function(x){
                       x$value[myBands[[2]]$value > 1] <- NA
                       return(x)
                     })
# Visualize
PRplot = ggplot(PR_filtered[[3]],
       aes(x = as.POSIXct(calendar_date),
           y = value/10000,
           col = site)) + 
  geom_point() +
  geom_line() +
  ylab("NDVI") +
  xlab("Date") +
  ggtitle("After pixel reliability filter") +
  theme_classic() +
  scale_color_brewer(palette = "Dark2"); PRplot
## Warning: Removed 85 rows containing missing values (geom_point).
## Warning: Removed 11 rows containing missing values (geom_path).

A few warnings should spring up because ggplot2 is concerned about missing rows - but missing data are to be expected since the filtering has begun. Now, filter out additional pixels using the VI Quality layer:

# Mask out additional data based on land/water and snow/ice masks
VIQ_filtered = lapply(PR_filtered,
                      FUN = function(x){
                        x$value[substr(PR_filtered[[1]]$value,
                          start = 3, stop = 5) != "001"] <- NA
                        x$value[substr(PR_filtered[[1]]$value,
                          start = 2, stop = 2) != "0"] <- NA
                        return(x)
                      })
# Visualize
VIQplot = ggplot(VIQ_filtered[[3]],
       aes(x = as.POSIXct(calendar_date),
           y = value/10000,
           col = site)) + 
  geom_point() +
  geom_line() +
  ylab("NDVI") +
  xlab("Date") +
  ggtitle("After pixel land/water mask") +
  theme_classic() +
  scale_color_brewer(palette = "Dark2"); VIQplot
## Warning: Removed 166 rows containing missing values (geom_point).
## Warning: Removed 107 rows containing missing values (geom_path).

Now we’re really cutting down on the dataset. You’ll notice that all of the data collected over Lake Tahoe is now gone (because we saved only data where pixels were over land and nothing but land). In the Cathedral Range (Mt. Lyell), almost all of the data outside the summer months has been removed due to the snow and ice that tops the Sierras each winter. This will be extremely impactful if the end goal is to measure timing of green-up.

Finally, we need to sort the dates of the observations because of the data collection and reporting style of MODIS: The satellite takes a picture every day, and then reports the highest VI composite for each 16-day window across the year. This is nice because it helps overcome some issues related to cloud cover, but means that we need to account for differences in the date of collection across pixels but within the same collection window.

To get the “Composite day of year” situation settled, we’ll need to add a few columns to each band’s data.frame. First, add a day of year measurement, which is the value reported on the DOY layer. Then extract the calendar year in another column, and finally create a YearDOY column which combines the year and DOY columns for an unambiguous date column.

# Add three columns to each data.frame based on the DOY layer. 
DOY_corrected = lapply(VIQ_filtered,
                       FUN = function(x){
                         x$DOY = VIQ_filtered[[4]]$value
                         x$Year = substr(x$calendar_date,
                                         start = 1,
                                         stop = 4)
                         x$YearDOY = paste(x$Year,
                                           x$DOY,
                                           sep = "-")
                         return(x)
                       })

# Visualize
DOYplot = ggplot(DOY_corrected[[3]],
       aes(x = as.POSIXct(YearDOY, 
                          format = "%Y-%j", 
                          origin = "2001-01-01"),
           y = value/10000,
           col = site)) + 
  geom_point() +
  geom_line() +
  ylab("NDVI") +
  xlab("Date") +
  ggtitle("Accounting for composite date") +
  theme_classic() +
  scale_color_brewer(palette = "Dark2"); DOYplot
## Warning: Removed 166 rows containing missing values (geom_point).
## Warning: Removed 166 rows containing missing values (geom_path).

You will notice when comparing the plot above to other plots we’ve made so far that the spacing between datapoints at a given plot is no longer regular; that is because rather than mapping NDVI against the start of the composite period, we’re now mapping NDVI against the composite date.

Now we have an NDVI time series that has been organized into proper timestamps. On the west side of the Sierra, NDVI followed typical annual cycles. NDVI at Coe, Solano, and Whisky Falls increased each spring, and fell back to a winter baseline in the fall. Lake Tahoe is completely removed because of the water flag. In the high Sierra (Mt. Lyell), high-quality NDVI is only produced during the summer months because of snow, and NDVI values are comparable to desert sites to the east. At Alabama Hills, high-quality pixels persist across the time series, but only minimal green-up cycling is detectable and it doesn’t occur every year.