# HLS VI Pipeline Guide [![Python](https://img.shields.io/badge/python-3.10--3.12-blue.svg)](https://www.python.org/) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE) [![Platform](https://img.shields.io/badge/platform-linux%20%7C%20macOS-lightgrey.svg)]() [![Data: HLS v2.0](https://img.shields.io/badge/data-HLS%20v2.0-brightgreen.svg)](https://hls.gsfc.nasa.gov/) A production-ready, 11-step processing pipeline for computing and analyzing vegetation indices from NASA's Harmonized Landsat and Sentinel-2 (HLS) surface reflectance data. Designed for researchers across ecology, agriculture, land management, and remote sensing who need reproducible, time-series vegetation analysis at scale. --- ## Table of Contents - [Overview](#overview) - [Key Features](#key-features) - [Pipeline Architecture](#pipeline-architecture) - [Prerequisites](#prerequisites) - [Installation](#installation) - [Configuration](#configuration) - [Quick Start](#quick-start) - [Running the Pipeline](#running-the-pipeline) - [Step Reference](#step-reference) - [Output Products](#output-products) - [Advanced Usage](#advanced-usage) - [Troubleshooting](#troubleshooting) - [Credits & Acknowledgments](#credits--acknowledgments) - [License](#license) --- ## Overview The **HLS Vegetation Index Pipeline** automates the full workflow from raw HLS satellite imagery to analysis-ready vegetation index products. **Input:** NASA HLS v2.0 granules (Landsat-8/9 L30 and Sentinel-2 S30) queried directly from the NASA Common Metadata Repository (CMR) API. **Output:** - Cloud-masked, quality-filtered vegetation index GeoTIFFs per granule - Per-tile CF-1.8 compliant NetCDF time-series stacks with sensor metadata - Temporal mean raster mosaics reprojected to a user-specified CRS - Outlier-flagged pixel summaries (raster mean + count) and point-vector GeoPackages - Multi-band seasonal composite stacks with user-defined, named time windows **Supported Vegetation Indices:** | Index | Formula | Typical Range | |-------|---------|---------------| | NDVI | `(NIR − Red) / (NIR + Red)` | −1.0 to 1.0 | | EVI2 | `2.5 × (NIR − Red) / (NIR + 2.4×Red + 1)` | −1.0 to 2.0 | | NIRv | `NDVI × NIR` | −0.5 to 1.0 | **Data Source:** [NASA Harmonized Landsat and Sentinel-2 (HLS)](https://hls.gsfc.nasa.gov/) — 30 m spatial resolution, ~2–3 day combined revisit time. --- ## Key Features - **End-to-end automation** — a single `bash hls_pipeline.sh` command runs all 11 steps sequentially - **Flexible step control** — run the full pipeline or any subset using named step identifiers or built-in aliases - **Quality masking** — bitwise Fmask decode with independently configurable cloud, cloud shadow, snow/ice, water, and aerosol modes - **Outlier detection** — identifies and exports pixels outside per-VI valid ranges as raster summaries and searchable point-vector GeoPackages - **Seasonal composites** — user-defined, named time windows produce multi-band stacks for phenological or climatological analysis, with window labels embedded in band metadata - **Parallel processing** — multiprocessing across configurable worker counts for all compute-intensive steps - **Memory-efficient** — dask-chunked xarray processing and streaming rasterio mosaic merges scale to large study extents without out-of-memory failures - **Consistent tile filtering** — `HLS_TILES` enforces a fixed MGRS tile set uniformly across all 11 steps - **Cloud-optimized output** — all GeoTIFF outputs use LZW compression, internal tiling, and predictor settings appropriate to their data type - **Pre-flight validation** — the pipeline validates that all bands required for the selected VIs are configured before any step executes - **Tile-by-tile processing** — steps 01–03 always run one tile at a time, reducing peak disk usage to roughly one tile's worth of raw data; optional space-saver flags automatically remove raw and/or VI intermediate files after each tile's NetCDF is built --- ## Pipeline Architecture ``` NASA CMR API │ ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ Step 01 · download │ │ src/01_hls_download_query.sh │ │ Query CMR API, estimate storage, download raw L30/S30 bands │ └──────────────────────────────┬──────────────────────────────────────┘ │ Raw GeoTIFF bands (B04, B05/B8A, Fmask) ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ Step 02 · vi_calc │ │ src/02_hls_vi_calc.py │ │ Apply Fmask quality masking; compute NDVI / EVI2 / NIRv │ └──────────────────────────────┬──────────────────────────────────────┘ │ Per-granule VI GeoTIFFs ▼ ┌─────────────────────────────────────────────────────────────────────┐ │ Step 03 · netcdf │ │ src/03_hls_netcdf_build.py │ │ Aggregate GeoTIFFs → CF-1.8 NetCDF time-series per tile │ └──────────────────────────────┬──────────────────────────────────────┘ │ Per-tile NetCDF time-series ┌──────────┴──────────┐ ▼ ▼ ┌───────────────────┐ ┌──────────────────────┐ │ Steps 04 + 06 │ │ Steps 05 + 07 + 08 │ │ Mean products │ │ Outlier products │ └────────┬──────────┘ └──────────┬───────────┘ │ │ ▼ ├──────────────────────┐ ┌────────────────┐ ▼ ▼ │ Mean mosaic │ ┌─────────────────┐ ┌──────────────────┐ └────────┬───────┘ │ Outlier rasters│ │ Step 11 · GPKG │ │ └─────────────────┘ └──────────────────┘ ▼ ┌────────────────┐ │ Step 10 │ │ Time-series │ │ stacks │ └────────────────┘ ``` ### Step Summary | Step | Script | Step Name | Description | |------|--------|-----------|-------------| | 01 | `src/01_hls_download_query.sh` | `download` | Query NASA CMR API; download raw HLS granules (L30/S30 bands + Fmask) | | 02 | `src/02_hls_vi_calc.py` | `vi_calc` | Compute VI GeoTIFFs; apply configurable bitwise Fmask quality masking | | 03 | `src/03_hls_netcdf_build.py` | `netcdf` | Aggregate per-granule GeoTIFFs into CF-1.8 compliant NetCDF time-series per tile | | 04 | `src/04_hls_mean_reproject.py` | `mean_flat` | Temporal mean per tile; reproject to `TARGET_CRS` | | 05 | `src/05_hls_outlier_reproject.py` | `outlier_flat` | Outlier-aware mean + valid count per tile; reproject | | 06 | `src/06_hls_mean_mosaic.py` | `mean_mosaic` | Mosaic per-tile means into a study-area-wide GeoTIFF | | 07 | `src/07_hls_outlier_mean_mosaic.py` | `outlier_mosaic` | Mosaic outlier-filtered means | | 08 | `src/08_hls_outlier_count_mosaic.py` | `outlier_counts` | Mosaic valid-observation counts | | 09 | `src/09_hls_count_valid_mosaic.py` | `count_valid_mosaic` | Count valid observations per pixel across all download cycles; mosaic into a study-area-wide GeoTIFF | | 10 | `src/10_hls_timeseries_mosaic.py` | `timeseries` | Multi-band seasonal composite stacks with user-defined time windows | | 11 | `src/11_hls_outlier_gpkg.py` | `outlier_gpkg` | Export per-pixel outlier observations (value, date, location) to GeoPackage | --- ## Prerequisites ### 1. NASA Earthdata Account HLS data is hosted on NASA Earthdata. You must: 1. Register for a free account at [https://urs.earthdata.nasa.gov/](https://urs.earthdata.nasa.gov/) 2. Accept the required End User License Agreements (EULAs) for HLS data products on the Earthdata website 3. Create a `~/.netrc` file with your credentials: ``` machine urs.earthdata.nasa.gov login password ``` 4. Secure the file: ```bash chmod 600 ~/.netrc ``` ### 2. System Requirements | Tool | Minimum Version | Notes | |------|----------------|-------| | conda or mamba | any | Required for environment management. conda-forge packages also supply the native geospatial libraries (GDAL, PROJ, HDF5, GEOS) that the pipeline depends on — these are not separate system installs | | bash | 3.2+ | Uses standard POSIX-compatible syntax; tested on macOS (ZSH) and Linux | | wget or curl | any | Used by the download script (step 01); the script auto-detects whichever is available | | gdalinfo | any | GDAL command-line tool used by step 01 to validate each downloaded GeoTIFF; provided automatically by the conda environment via rasterio's GDAL dependency — no separate installation needed, but the pipeline must be run with the conda environment active | ### 3. MGRS Tile Identification HLS data is organized by [MGRS (Military Grid Reference System)](https://hls.gsfc.nasa.gov/products-description/tiling-system/) tiles. You need the MGRS tile IDs covering your area of interest before configuring the pipeline. Useful tools: - [NASA Earthdata Search](https://search.earthdata.nasa.gov/) — interactive map with MGRS tile overlays - [MGRS Mapper](https://mgrs-mapper.com/) — web-based tile lookup by location - QGIS with the HLS tiling shapefile ### 4. Storage Estimate When `download` is active, the pipeline prints a comprehensive storage estimate before any data are downloaded, then prompts for confirmation. The estimate covers all active steps and shows both a Peak and Final figure. **Steps 01–03 per-granule sizes (empirically calibrated):** | Product | Approximate size per granule | |---------|------------------------------| | Raw bands (NIR, Red, Fmask) | ~45 MB | | VI GeoTIFF per VI | ~15 MB | | NetCDF contribution per VI | ~12 MB (zlib compression achieves ~7× on NaN-heavy time-series) | **Steps 04–11 per-tile sizes (LZW-compressed estimates):** | Product | Approximate size | |---------|-----------------| | Mean reprojected tile (step 04) | ~55 MB per tile per VI | | Outlier tiles — mean + count (step 05) | ~5 MB per tile per VI (highly variable — nearly 0 for NDVI with default valid range) | | Float32 mosaic contribution (steps 06, 07) | ~35 MB per tile per VI | | uint16 mosaic contribution (steps 08, 09) | ~3 MB per tile per VI | | Time-series stack per window (step 10) | ~50 MB per tile per VI per window | | Outlier GeoPackage (step 11) | variable — depends on outlier rate | **Peak vs. Final storage:** - **Est. Peak** — worst-case simultaneous disk usage during the tile loop: all NetCDF files accumulated so far plus one tile's raw and/or VI files still in progress - **Est. Final** — what remains on disk when all active steps complete (with space savers: raw and VI intermediates are deleted; without: they persist) A 10-tile study area with 5 years of data (bi-weekly acquisitions) running all 11 steps can require **100–300+ GB** depending on VIs and time windows. Enable the space-saver options to minimize peak disk usage — raw and VI intermediate files are deleted per tile as soon as each NetCDF is built. --- ## Installation ### 1. Clone the Repository ```bash git clone https://github.com/stephenconklin/HLS_VI_Pipeline.git cd HLS_VI_Pipeline ``` ### 2. Create the Conda Environment ```bash conda env create -f environment.yml conda activate hls_pipeline ``` > **Tip:** [Mamba](https://mamba.readthedocs.io/) resolves conda environments significantly faster than conda: > > ```bash > mamba env create -f environment.yml > ``` ### 3. Verify the Installation ```bash python -c "import numpy, pandas, rasterio, netCDF4, xarray, rioxarray, dask, fiona; print('Environment OK')" ``` --- ## Configuration All pipeline parameters are set in `config.env`. This file is sourced by `hls_pipeline.sh` before each step — no environment variables need to be exported manually. For the full parameter reference, see the [Configuration Reference](configuration.md). ### Paths ```bash BASE_DIR="/path/to/your/project" # Root output directory — edit this first LOG_DIR="${BASE_DIR}/0_Logs" RAW_HLS_DIR="${BASE_DIR}/1_Raw" # Downloaded HLS granules VI_OUTPUT_DIR="${BASE_DIR}/2_Interim/1_VI_Products" NETCDF_DIR="${BASE_DIR}/2_Interim/2_NetCDF" REPROJECTED_DIR="${BASE_DIR}/2_Interim/3_VI_Mean_Tiles" REPROJECTED_DIR_OUTLIERS="${BASE_DIR}/2_Interim/4_VI_Outlier_Tiles" MOSAIC_DIR="${BASE_DIR}/3_Out/1_Mosaic" TIMESLICE_OUTPUT_DIR="${BASE_DIR}/3_Out/2_TimeSeries" OUTLIER_GPKG_DIR="${BASE_DIR}/3_Out/3_Outlier_Points" ``` ### Vegetation Indices ```bash PROCESSED_VIS="NDVI EVI2 NIRv" # Space-separated; compute all three PROCESSED_VIS="NDVI" # Compute NDVI only ``` ### Step Control ```bash STEPS="all" # Run all 11 steps STEPS="products" # Steps 02–11 (skip download) STEPS="mosaics" # Steps 06–08 only STEPS="outliers" # Steps 05, 07, 08, 11 STEPS="vi_calc netcdf mean_flat" # Named steps, space-separated ``` **Step aliases:** | Alias | Expands To | |-------|-----------| | `all` | Steps 01–11 | | `products` | Steps 02–11 | | `build_nc` | Steps 01–03 | | `mosaics` | Steps 06–08 | | `outliers` | Steps 05, 07, 08, 11 | **Named step identifiers:** `download` · `vi_calc` · `netcdf` · `mean_flat` · `outlier_flat` · `mean_mosaic` · `outlier_mosaic` · `outlier_counts` · `count_valid_mosaic` · `timeseries` · `outlier_gpkg` ### Tile List ```bash HLS_TILES="17TNE 17TNF 17TNG 17TPE 17TPF" # Space-separated MGRS tile IDs ``` Tile filtering is applied at every step. If `HLS_TILES` is unset, all tiles found in each input directory are processed. ### Download Settings ```bash DOWNLOAD_CYCLES="2015-12-01|2016-03-31 2016-12-01|2017-03-31" # Space-separated date ranges CLOUD_COVERAGE_MAX=75 # Maximum cloud cover % accepted by CMR query SPATIAL_COVERAGE_MIN=0 # Minimum spatial coverage % accepted by CMR query ``` ### Band Selection Specifies which bands to download for each HLS sensor: ```bash L30_BANDS="B05 B04 Fmask" # Landsat: NIR (B05), Red (B04), Quality (Fmask) S30_BANDS="B8A B04 Fmask" # Sentinel-2: NIR narrow (B8A), Red (B04), Quality (Fmask) ``` > **Note:** The pipeline validates that the bands required for each VI in `PROCESSED_VIS` are present in the band lists before any step runs and will warn you if something is missing. ### Fmask Quality Masking Control which pixel categories are excluded. Masked pixels are set to NaN and excluded from all downstream computations. ```bash MASK_CIRRUS="TRUE" MASK_CLOUD="TRUE" MASK_ADJACENT_CLOUD="TRUE" MASK_CLOUD_SHADOW="TRUE" MASK_SNOW_ICE="TRUE" MASK_WATER="TRUE" MASK_AEROSOL_MODE="MODERATE" # NONE | HIGH | MODERATE | LOW HLS_SCALE_FACTOR=0.0001 # HLS surface reflectance scale factor ``` **Aerosol masking modes:** | Mode | Behaviour | |------|-----------| | `NONE` | No aerosol masking applied | | `HIGH` | Mask high-aerosol pixels only | | `MODERATE` | Mask moderate and high aerosol pixels | | `LOW` | Mask all non-zero aerosol levels | ### Valid Ranges (Outlier Detection) Pixels within the valid range contribute to the temporal mean. Pixels outside the valid range are treated as outliers and tracked separately through Steps 05, 07, 08, and 11. ```bash VALID_RANGE_NDVI="-1,1" VALID_RANGE_EVI2="-1,2" VALID_RANGE_NIRv="-0.5,1" ``` ### Space Saver Options The pipeline always processes steps 01–03 one tile at a time — downloading, computing VIs, and building the NetCDF before moving to the next tile. This limits peak disk usage to roughly one tile's worth of raw data at any given moment. Two optional space-saver flags delete intermediate files for each tile immediately after its NetCDF is built. Both default to `FALSE` and only take effect when `netcdf` is included in `STEPS`. ```bash SPACE_SAVER_REMOVE_RAW="FALSE" # Delete raw HLS band + Fmask files after NetCDF is built SPACE_SAVER_REMOVE_VI="FALSE" # Delete VI GeoTIFF files after NetCDF is built ``` > **Note:** Both space-saver options can be enabled together. Raw files and VI GeoTIFFs are no longer needed once the NetCDF time-series is complete — all downstream steps (04–11) read only from the NetCDF files. ### Download Approval Before any data is downloaded, the pipeline prints a storage estimate and prompts for confirmation. To bypass this prompt in automated or non-interactive contexts: ```bash SKIP_APPROVAL="FALSE" # Set TRUE to skip the download confirmation prompt ``` Failed tiles are skipped with a logged error and the pipeline continues to the next tile. A summary of succeeded and failed tiles is printed at the end of the tile loop. ### Time-Series Windows Define named time windows for seasonal composite stack generation (Step 10): ```bash TIMESLICE_ENABLED="TRUE" TIMESLICE_STAT="mean" TIMESLICE_WINDOWS="Winter_2015_2016:2015-12-01|2016-03-31 \ Summer_2016:2016-06-01|2016-08-31 \ Winter_2016_2017:2016-12-01|2017-03-31" ``` Format: `label:YYYY-MM-DD|YYYY-MM-DD`, space-separated. Labels must be alphanumeric with underscores only. Each label becomes the band description in the output stack. ### Processing Performance ```bash NUM_WORKERS=8 # Parallel worker processes — set to available CPU cores CHUNK_SIZE=10 # NetCDF time slices per dask chunk TARGET_CRS="EPSG:6350" # Output CRS — must be a projected CRS (metres) ``` ### Output Format ```bash NETCDF_COMPLEVEL=1 # zlib compression level for NetCDF files (0–9; 1 = fast, 9 = smallest) GEOTIFF_COMPRESS="LZW" # GeoTIFF codec: LZW (default), DEFLATE, ZSTD, NONE GEOTIFF_BLOCK_SIZE=512 # Internal tile block size in pixels (512 for GIS; 256 for COG/web) ``` --- ## Quick Start A minimal example to process a single summer season for two tiles: **1. Edit `config.env`:** ```bash BASE_DIR="/path/to/your/output" HLS_TILES="18TVL 18TVM" PROCESSED_VIS="NDVI" DOWNLOAD_CYCLES="2020-06-01|2020-08-31" STEPS="all" NUM_WORKERS=4 ``` **2. Activate the environment and run:** ```bash conda activate hls_pipeline bash hls_pipeline.sh ``` The pipeline will print a run summary, prompt you to confirm the storage estimate before downloading, then execute each step sequentially, logging all output to `${BASE_DIR}/0_Logs/`. --- ## Running the Pipeline ### Full Run ```bash bash hls_pipeline.sh ``` The pipeline prints a run summary showing active steps, VIs, tile count, worker count, and target CRS, then logs all step output to a timestamped file in `LOG_DIR`. ### Partial Runs Use the `STEPS` variable to reprocess specific stages without rerunning the entire pipeline: ```bash # In config.env STEPS="mean_mosaic outlier_mosaic outlier_counts" # Or inline as a one-off override STEPS="mean_mosaic" bash hls_pipeline.sh ``` ### Resuming After Failure Each step skips output files that already exist. If a run is interrupted mid-step, simply rerun the pipeline — completed output files will not be regenerated. --- ## Step Reference ### Step 01 — Download Downloads HLS granules via the NASA CMR API with an interactive storage estimate and user approval gate. - **Inputs:** NASA CMR API (date ranges, tile IDs, cloud/spatial coverage thresholds, band list) - **Outputs:** Raw GeoTIFFs in `RAW_HLS_DIR`, organized by sensor/year/tile hierarchy - **Key feature:** Prints a comprehensive storage estimate covering all active steps (01–11) with Peak and Final figures, then requires user confirmation before downloading (set `SKIP_APPROVAL=TRUE` to bypass for automated runs) - **Credentials required:** `~/.netrc` with NASA Earthdata login ### Step 02 — VI Calculation Applies Fmask quality masking and computes vegetation indices from raw surface reflectance bands. - **Inputs:** Raw L30/S30 band GeoTIFFs from `RAW_HLS_DIR` - **Outputs:** One GeoTIFF per granule per VI in `VI_OUTPUT_DIR` - **Masking:** Configurable bitwise Fmask decode (cloud, shadow, snow/ice, water, aerosol) - **Parallelism:** `multiprocessing.Pool` with `NUM_WORKERS` ### Step 03 — NetCDF Build Aggregates per-granule VI GeoTIFFs into per-tile time-series files. - **Inputs:** VI GeoTIFFs from `VI_OUTPUT_DIR` - **Outputs:** `T{TILE}_{VI}.nc` in `NETCDF_DIR` — CF-1.8 compliant with `days since 1970-01-01` time encoding and sensor (L30/S30) metadata per observation - **Parallelism:** Chunked writes with `multiprocessing.Pool` - **Southern hemisphere CRS correction:** HLS v2.0 GeoTIFFs for tiles south of the equator embed a UTM North zone (EPSG:326xx) with negative northings instead of the standard UTM South convention (EPSG:327xx, false_northing=10,000,000). Step 03 automatically detects this case (UTM North EPSG code + negative y centroid) and corrects it: the CRS is replaced with the UTM South equivalent (EPSG + 100) and y-coordinates are shifted by +10,000,000 m. The output NetCDF files carry the correct EPSG:327xx CRS with positive northings, as expected by GIS tools and CF-1.8 validators. This correction is applied transparently and logged with a `[CRS fix]` prefix in the pipeline output. ### Step 04 — Temporal Mean + Reproject Computes the pixel-wise temporal mean for each tile across the full date range and reprojects to `TARGET_CRS`. - **Inputs:** NetCDF time-series from `NETCDF_DIR` - **Outputs:** `T{TILE}_{VI}_average_{VI}_{CRS}.tif` in `REPROJECTED_DIR` (30 m, Cloud-Optimized GeoTIFF) - **Note:** Valid-range filtering is applied before computing the mean; outlier pixels do not contribute to the average ### Step 05 — Outlier Extraction + Reproject Identifies pixels with unmasked values outside the per-VI valid range and summarizes them per tile. - **Inputs:** NetCDF time-series from `NETCDF_DIR` - **Outputs:** Two GeoTIFFs per tile per VI in `REPROJECTED_DIR_OUTLIERS`: - Outlier mean: temporal mean of out-of-range values - Outlier count: number of time slices with an outlier at each pixel ### Steps 06–08 — Mosaics Merge all per-tile rasters into study-area-wide mosaics using streaming merge (memory-efficient, handles large tile counts). | Step | Output Filename | Notes | |------|----------------|-------| | 06 | `HLS_Mosaic_{VI}_{CRS}.tif` | Temporal mean mosaic | | 07 | `HLS_Mosaic_Outlier_Mean_{VI}_{CRS}.tif` | Outlier mean mosaic | | 08 | `HLS_Mosaic_Outlier_Count_{VI}_{CRS}.tif` | Outlier count mosaic (uint16, nodata=0) | ### Step 09 — CountValid Mosaic Counts valid observations per pixel across the full downloaded period and mosaics all tiles into a study-area-wide GeoTIFF. The temporal scope is implicitly defined by `DOWNLOAD_CYCLES` — only data within those cycles is present in the NetCDF files. This step reads directly from NetCDF and is fully independent of Step 10 (`timeseries`). - **Inputs:** Per-tile NetCDF time-series from `NETCDF_DIR` (produced by Step 03) - **Outputs:** `HLS_Mosaic_CountValid_{VI}_{CRS}.tif` in `MOSAIC_DIR` - **To run independently:** `STEPS="count_valid_mosaic" bash hls_pipeline.sh` - **Band description:** `CountValid_AllDownloadCycles` embedded in band metadata ### Step 10 — Time-Series Stacks Builds multi-band seasonal composite GeoTIFFs from user-defined time windows. - **Inputs:** NetCDF time-series + `TIMESLICE_WINDOWS` definitions from `config.env` - **Outputs:** Two multi-band stacks per VI in `TIMESLICE_OUTPUT_DIR`: - `HLS_TimeSeries_{VI}_Mean_{CRS}.tif` — temporal mean per window (one band per window) - `HLS_TimeSeries_{VI}_CountValid_{CRS}.tif` — valid-pixel count per window (one band per window) - **Band descriptions:** Window labels are embedded in band metadata and are visible in QGIS, GDAL, and rasterio ### Step 11 — Outlier GeoPackage Exports every individual outlier pixel-date observation as a point feature, enabling spatial and temporal exploration of anomalies. - **Inputs:** NetCDF time-series from `NETCDF_DIR` - **Outputs:** `HLS_outliers_{VI}.gpkg` in `OUTLIER_GPKG_DIR` (WGS84 / EPSG:4326 points) - **Feature attributes:** `tile_id`, `vi_type`, `sensor`, `date`, `vi_value`, `geometry` - **Use cases:** Sensor artifact investigation, data quality review, anomaly mapping --- ## Output Products ### Directory Structure ``` ${BASE_DIR}/ ├── 0_Logs/ │ └── hls_pipeline_YYYYMMDD_HHMMSS.log ├── 1_Raw/ │ ├── L30/YYYY/HH/T/T/T/ │ │ └── HLS.L30.T{TILE}.{DATE}.v2.0.{Band}.tif │ └── S30/YYYY/HH/T/T/T/ │ └── HLS.S30.T{TILE}.{DATE}.v2.0.{Band}.tif ├── 2_Interim/ │ ├── 1_VI_Products/ │ │ └── HLS.{L30|S30}.T{TILE}.{DATE}.v2.0.{VI}.tif │ ├── 2_NetCDF/ │ │ └── T{TILE}_{VI}.nc │ ├── 3_VI_Mean_Tiles/ │ │ └── T{TILE}_{VI}_average_{VI}_{CRS}.tif │ └── 4_VI_Outlier_Tiles/ │ ├── T{TILE}_{VI}_outlier_mean_{VI}_{CRS}.tif │ └── T{TILE}_{VI}_outlier_count_{VI}_{CRS}.tif └── 3_Out/ ├── 1_Mosaic/ │ ├── HLS_Mosaic_{VI}_{CRS}.tif │ ├── HLS_Mosaic_Outlier_Mean_{VI}_{CRS}.tif │ ├── HLS_Mosaic_Outlier_Count_{VI}_{CRS}.tif │ └── HLS_Mosaic_CountValid_{VI}_{CRS}.tif (step 09 — count across all download cycles) ├── 2_TimeSeries/ │ ├── HLS_TimeSeries_{VI}_Mean_{CRS}.tif (N bands — one per window) │ └── HLS_TimeSeries_{VI}_CountValid_{CRS}.tif (N bands — one per window) └── 3_Outlier_Points/ └── HLS_outliers_{VI}.gpkg ``` ### File Format Reference | Product | Format | Dtype | Nodata | Compression | |---------|--------|-------|--------|-------------| | VI GeoTIFF (step 02) | GeoTIFF | float32 | NaN | LZW | | NetCDF time-series (step 03) | NetCDF-4 | float32 | NaN | zlib | | Mean tile (step 04) | COG GeoTIFF | float32 | NaN | LZW + predictor 3 | | Outlier mean tile (step 05) | GeoTIFF | float32 | NaN | LZW + predictor 3 | | Outlier count tile (step 05) | GeoTIFF | uint16 | 0 | LZW + predictor 2 | | Mean / outlier mosaics (steps 06–07) | GeoTIFF | float32 | NaN | LZW | | Count mosaic (step 08) | GeoTIFF | uint16 | 0 | LZW | | CountValid mosaic (step 09) | GeoTIFF | uint16 | 0 | LZW + predictor 2 | | Time-series stacks (step 10) | BigTIFF | float32 / uint16 | NaN / 0 | LZW | | Outlier GeoPackage (step 11) | GeoPackage | — | — | — | > **Nodata note:** A value of `0` in count products means no data at that pixel, not missing data in the raster sense. For outlier count products (steps 05/08), `0` means no outlier observations were recorded. For the CountValid mosaic (step 09), `0` means no valid observations were found across all download cycles. --- ## Advanced Usage ### Running a Subset of Steps ```bash # In config.env — rerun mosaics after adding more tiles STEPS="mean_mosaic outlier_mosaic outlier_counts" # One-off override without editing config.env STEPS="outlier_gpkg" bash hls_pipeline.sh ``` ### Computing Multiple VIs ```bash PROCESSED_VIS="NDVI EVI2 NIRv" ``` All steps loop over each VI in `PROCESSED_VIS`. Ensure all required bands are listed in `L30_BANDS` and `S30_BANDS`. For NDVI, EVI2, and NIRv, the required bands are `B04` (Red), `B05`/`B8A` (NIR), and `Fmask`. ### Custom Time Windows Time windows can span any date range and do not need to follow calendar boundaries: ```bash TIMESLICE_WINDOWS="Peak_Green_2020:2020-06-15|2020-07-31 \ Late_Summer_2020:2020-08-01|2020-09-15 \ Dormant_2020_2021:2020-11-15|2021-03-15" ``` The output stack will have one band per window. Band descriptions (window labels) are stored in the GeoTIFF metadata and are visible in QGIS, ArcGIS Pro, and when reading with rasterio or GDAL. ### Minimizing Disk Usage For large study areas with limited storage, enable both space-saver options. This keeps only one tile's raw and VI files on disk at a time — all other storage is the accumulated NetCDF outputs. ```bash SPACE_SAVER_REMOVE_RAW="TRUE" SPACE_SAVER_REMOVE_VI="TRUE" STEPS="build_nc" ``` After all tiles complete steps 01–03, continue with downstream steps: ```bash STEPS="mean_flat outlier_flat mean_mosaic outlier_mosaic outlier_counts count_valid_mosaic timeseries outlier_gpkg" ``` > **Note:** The space-saver options only apply during steps 01–03. Steps 04–11 always operate across all tiles using the NetCDF files. ### Tuning Valid Ranges Default valid ranges are broad. Narrow them for domain-specific applications: ```bash VALID_RANGE_NDVI="0.1,0.9" # Forest canopy: exclude bare soil and sparse cover VALID_RANGE_EVI2="-0.5,1.5" # Wider range for agricultural areas ``` Any unmasked pixel outside these bounds is flagged as an outlier and routed to Steps 05, 07, 08, and 11. ### Optimising Worker Count Set `NUM_WORKERS` based on available physical CPU cores. A safe starting point is (total cores − 2) to leave headroom for the OS and dask overhead: ```bash NUM_WORKERS=14 # Example for a 16-core workstation ``` Reduce this value if you encounter out-of-memory errors during Steps 04, 05, or 09. ### Changing the Output CRS The default CRS is `EPSG:6350` (NAD83 Conus Albers Equal Area). Change `TARGET_CRS` to any projected CRS supported by PROJ: ```bash TARGET_CRS="EPSG:32618" # UTM Zone 18N (WGS84) TARGET_CRS="EPSG:9221" # South African Albers Equal Area TARGET_CRS="EPSG:3857" # Web Mercator ``` All reprojected outputs and mosaics (Steps 04–11) will use the new CRS. The CRS code (dots stripped) is embedded in output filenames (e.g., `EPSG6350`). > **Note:** `TARGET_CRS` must be a **projected CRS** (linear units such as metres). If a geographic CRS (degrees) is set, the pipeline will accept it but print a `[WARN]` and convert the 30 m output resolution to an approximate degree equivalent — which is not equal-area and not recommended for pixel-level VI analysis. --- ## Troubleshooting ### Download fails with a 401 or authentication error Verify your `~/.netrc` credentials, file permissions (`chmod 600 ~/.netrc`), and that you have accepted all required EULAs for HLS on the [NASA Earthdata](https://urs.earthdata.nasa.gov/) website. ### `syntax error` or unexpected token in `hls_pipeline.sh` Verify that `config.env` is present in the repository root and that all required variables are set. The pipeline has been tested on macOS (ZSH) and Linux. ### Out-of-memory errors during Steps 04, 05, 09, or 10 Reduce `NUM_WORKERS` and/or `CHUNK_SIZE` in `config.env`. Fewer simultaneous dask tasks significantly reduce peak RAM usage. ### A step finishes but output files are missing Check the timestamped log file in `LOG_DIR` for error messages. Most steps print tile-level warnings when inputs are missing or when a tile is skipped. ### The mosaic is missing one or more tiles Verify that the missing tiles are included in `HLS_TILES` and that their intermediate reprojected GeoTIFFs exist in `REPROJECTED_DIR` (for Steps 06–07) or `REPROJECTED_DIR_OUTLIERS` (for Steps 07–08). ### NetCDF time coordinate errors or wrong dates HLS granule filenames must not be renamed. Step 03 parses acquisition dates directly from the standard HLS filename convention: `HLS.{L30|S30}.T{TILE}.{YYYYDDD}T{HHMMSS}.v2.0`. ### Step 10 time-series stack has fewer bands than expected If `TIMESLICE_ENABLED` is not set to `"TRUE"`, Step 10 will skip processing. Also verify that at least one observation falls within each configured time window for the tiles in `HLS_TILES`. --- ## Credits & Acknowledgments ### Authors **Stephen Conklin**, Geospatial Analyst — Pipeline architecture, orchestration, and all original code · [github.com/stephenconklin](https://github.com/stephenconklin) **G. Burch Fisher, PhD**, Research Scientist — Conceptual guidance and original code adapted for: - `src/02_hls_vi_calc.py` (VI calculation and Fmask quality masking logic) - `src/03_hls_netcdf_build.py` (NetCDF time-series assembly) **AI Assistance** — This pipeline was developed with the assistance of [Google Gemini](https://gemini.google.com/) and [Anthropic Claude / Claude Code](https://claude.ai/code). These tools assisted with code generation and refinement under the direction and review of the authors. ### Adapted Code **NASA HLS Download Script** `src/01_hls_download_query.sh` is adapted in part from the NASA [`getHLS.sh`](https://github.com/nasa/HLS-Data-Resources/tree/main/bash/hls-bulk-download) script, published by the NASA HLS Data Resources Team under the Apache 2.0 License. ### HLS Data Citation Users of this pipeline who publish results should cite the HLS datasets: > Masek, J., Ju, J., Roger, J.-C., Skakun, S., Vermote, E., Claverie, M., Dungan, J., Yin, Z., Freitag, B., Justice, C. (2021). *HLS Operational Land Imager Surface Reflectance and TOA Brightness Daily Global 30m v2.0* [Data set]. NASA EOSDIS Land Processes DAAC. https://doi.org/10.5067/HLS/HLSL30.002 > Skakun, S., Ju, J., Roger, J.-C., Vermote, E., Masek, J., Justice, C. (2021). *HLS Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m v2.0* [Data set]. NASA EOSDIS Land Processes DAAC. https://doi.org/10.5067/HLS/HLSS30.002 ### Python Libraries This pipeline is built on the following open-source libraries: | Library | Purpose | License | |---------|---------|---------| | [rasterio](https://rasterio.readthedocs.io/) | GeoTIFF I/O, reprojection, mosaicing | BSD-3 | | [xarray](https://xarray.pydata.org/) | N-dimensional array and NetCDF processing | Apache 2.0 | | [rioxarray](https://corteva.github.io/rioxarray/) | Spatial extensions for xarray | Apache 2.0 | | [dask](https://dask.org/) | Parallel and chunked computation | BSD-3 | | [geopandas](https://geopandas.org/) | GeoPackage vector I/O | BSD-3 | | [numpy](https://numpy.org/) | Array mathematics | BSD-3 | | [pandas](https://pandas.pydata.org/) | Tabular data handling | BSD-3 | | [netCDF4](https://unidata.github.io/netcdf4-python/) | Low-level NetCDF I/O | MIT | | [shapely](https://shapely.readthedocs.io/) | Point geometry construction | BSD-3 | | [pyproj](https://pyproj4.github.io/pyproj/) | CRS transformations | MIT | --- ## License This project is licensed under the [MIT License](https://github.com/stephenconklin/HLS_VI_Pipeline/blob/main/LICENSE). `src/01_hls_download_query.sh` is adapted in part from NASA's [`getHLS.sh`](https://github.com/nasa/HLS-Data-Resources/tree/main/bash/hls-bulk-download), released by NASA under the [Apache 2.0 License](https://github.com/nasa/HLS-Data-Resources/blob/main/LICENSE).