Changelog

All notable changes to this project are documented here.


2026-03-18 (4)

Fixed

  • Root logger level DEBUG → INFO: setup_logging() in hls_utils.py set the root logger to DEBUG, causing rasterio and GDAL internal trace messages to flood the log (thousands of DEBUG [rasterio.env] / DEBUG [rasterio._io] lines per run). Changed to INFO so only pipeline-level messages and library WARNING+ output appear.

  • Steps 04/05: "Skipped" results logged at ERROR — the result-dispatch logic in the main loop of steps 04 and 05 routed any string not starting with "OK" or "WARNING" to logger.error. "Skipped (Exists)" (step 04) and "Skipped (No outliers)" (step 05) both fell into this branch. Added an explicit Skipped prefix check → logger.info before the error fallthrough.

  • Steps 09/10: BLOCKXSIZE without TILED=YES in temp tile writes. The to_raster() calls for intermediate temp GeoTIFFs in steps 09 and 10 passed blockxsize/blockysize without tiled=True, producing GDAL CPLE_IllegalArg warnings on every tile. Added tiled=True to all three affected to_raster() calls.

  • Step 09: dask chunk mismatch UserWarning. Opening the dataset with xr.open_dataset(nc_path, chunks={'time': 10}) produced a dask performance warning when the on-disk NetCDF chunk layout differed from the requested chunking. Changed to chunks='auto' to align with on-disk layout, consistent with how steps 04, 05, and 10 open datasets.

  • Step 03: CRS remap log message rewording. The southern hemisphere CRS adjustment log line used "[CRS fix]" and "corrected to", implying an error condition. Replaced with "[CRS]" and "southern hemisphere tile, remapped" to reflect that this is routine, expected processing for any southern hemisphere tile.
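The result dispatch described for steps 04/05 can be sketched roughly as follows. This is an illustration, not the pipeline's actual code: the function name log_result and the logger name are hypothetical, and the levels used for "OK" and "WARNING" strings are assumed. Only the prefixes and the Skipped → INFO routing come from the entry above.

```python
import logging

logger = logging.getLogger("04_mean_reproject")  # illustrative logger name


def log_result(result: str) -> int:
    """Log a worker status string at a level chosen by its prefix.

    Returns the level that was used so callers can inspect it.
    """
    if result.startswith("OK"):
        level = logging.INFO  # assumed level for successful results
    elif result.startswith("WARNING"):
        level = logging.WARNING  # assumed level for warning results
    elif result.startswith("Skipped"):
        # Explicit check placed before the fallthrough: skips are routine.
        level = logging.INFO
    else:
        # Any unrecognised status string is still reported as an error.
        level = logging.ERROR
    logger.log(level, result)
    return level
```

With this ordering, "Skipped (Exists)" and "Skipped (No outliers)" are logged at INFO, while an unexpected string still reaches the error branch.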


2026-03-18 (3)

Added

  • Structured logging across all Python steps and hls_pipeline.sh — all ten Python pipeline scripts (steps 02–11) now use Python’s logging module via a shared setup_logging(step_name) helper in src/hls_utils.py. Every log line carries a timestamp, level, and bracketed step label: 2026-03-18 20:55:49  INFO      [04_mean_reproject]  message. Previously all diagnostic output used bare print() calls with no timestamps or severity levels. Key design points:

    • Single implementation in hls_utils.py; no logging boilerplate duplicated across scripts.

    • Root logger handler guard (if not root.handlers) makes setup_logging idempotent — calling it in multiprocessing.Pool or ProcessPoolExecutor child processes does not produce duplicate output.

    • Worker functions (steps 02–05, 09–10) are unchanged; they return status strings/dicts to the main process, which performs all logger.*() calls.

    • StreamHandler targets sys.stdout so 2>&1 | tee -a "$LOGFILE" in the shell captures all output.

    • hls_pipeline.sh gains log_info, log_warn, and log_error helper functions that emit the same timestamp + level + [pipeline] format, making combined shell/Python log output visually consistent.
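A minimal sketch of a helper with the properties listed above (single implementation, handler guard, stdout target) might look like the following. The format string and the level parameter are assumptions chosen to reproduce the sample log line; the real setup_logging in src/hls_utils.py may differ.

```python
import logging
import sys


def setup_logging(step_name: str, level: int = logging.INFO) -> logging.Logger:
    """Configure the root logger once and return a logger named after the step."""
    root = logging.getLogger()
    if not root.handlers:  # guard: repeated calls never add a second handler
        # stdout, so that `2>&1 | tee -a "$LOGFILE"` in the shell captures everything
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter(
                fmt="%(asctime)s  %(levelname)-8s  [%(name)s]  %(message)s",
                datefmt="%Y-%m-%d %H:%M:%S",
            )
        )
        root.addHandler(handler)
        root.setLevel(level)
    return logging.getLogger(step_name)
```

A script would call logger = setup_logging("04_mean_reproject") once at startup; calling it again in the same process returns the same named logger without duplicating output.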


2026-03-18 (2)

Fixed

  • Step 03: southern hemisphere CRS stored as UTM North. HLS v2.0 GeoTIFFs for tiles south of the equator embed a UTM North zone (EPSG:326xx) with negative northings instead of the standard UTM South convention (EPSG:327xx, false_northing=10,000,000). All southern Africa (BioSCape) and other southern hemisphere tiles were affected. HLSNetCDFAggregator.run() in src/03_hls_netcdf_build.py now detects this case after reading the first GeoTIFF: if pyproj’s CRS.to_epsg(min_confidence=20) returns a UTM North code (32601–32660) and the mean pixel-center y is negative, the CRS WKT is replaced with the UTM South equivalent (EPSG + 100, e.g. 32634 → 32734) and y-coordinates are shifted by +10,000,000 m. This correction is applied before chunk dicts are built, so both single-chunk and merged tiles are written with the correct EPSG:327xx CRS and positive UTM South northings. Tiles built before this fix must be regenerated with step 03 to pick up the corrected CRS and coordinates; downstream steps 04–11 that reproject to TARGET_CRS are not affected because they perform a full reprojection from the source CRS.
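The detection and remap rule can be illustrated with plain integers and floats. The function name and signature below are hypothetical, and the sketch assumes a non-empty list of y-coordinates; it only reproduces the arithmetic described in the entry (EPSG + 100, y + 10,000,000 m), not the WKT replacement.

```python
from typing import List, Tuple

FALSE_NORTHING_M = 10_000_000.0  # UTM South false northing, in metres


def remap_utm_north_to_south(
    epsg: int, y_centers: List[float]
) -> Tuple[int, List[float]]:
    """Return (epsg, y) remapped to UTM South when a UTM North code has negative northings."""
    is_utm_north = 32601 <= epsg <= 32660
    mean_y = sum(y_centers) / len(y_centers)
    if is_utm_north and mean_y < 0:
        # Same zone number, southern variant: 326xx becomes 327xx.
        return epsg + 100, [y + FALSE_NORTHING_M for y in y_centers]
    # Northern hemisphere tiles and non-UTM-North codes pass through unchanged.
    return epsg, list(y_centers)
```

For example, EPSG 32634 with a northing of -3,000,000 m becomes EPSG 32734 with a northing of 7,000,000 m.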


2026-03-18

Fixed

  • Step 03: _FillValue lost in merge_chunks. The process_netcdf_chunk function correctly creates the VI variable with fill_value=np.nan, but merge_chunks recreated the same variable without a fill_value argument. netCDF4 therefore fell back to its built-in default sentinel (9.969209968386869e+36) for all missing cells in merged files, and the _FillValue attribute was absent from the output. Any tile requiring chunk merging (virtually all multi-year tiles with more acquisitions than CHUNK_SIZE) was affected. Fixed by adding fill_value=np.nan to the createVariable call in merge_chunks (src/03_hls_netcdf_build.py line 231). Newly rebuilt tiles will store missing data as NaN and carry a proper _FillValue = NaN attribute.


2026-03-12

Changed

  • Pipeline scripts moved to src/: all 11 step scripts (01_hls_download_query.sh through 11_hls_outlier_gpkg.py) and hls_utils.py were relocated from the repository root into src/. hls_pipeline.sh remains at the root. All invocation paths in hls_pipeline.sh, CLAUDE.md, README.md, and docs/ were updated accordingly. Python import hls_utils statements are unaffected (Python resolves the import from the script’s own directory).

  • Step 03 — improved CF-1.8 CRS metadata in NetCDF output — the spatial_ref grid-mapping variable now carries both crs_wkt (CF-1.8 standard) and spatial_ref (GDAL / rioxarray compatibility) attributes, plus grid_mapping_name (derived via pyproj) and long_name. The x/y coordinate variables now include standard_name, long_name, and axis attributes; the time variable now includes standard_name, calendar, and axis. A global Conventions = "CF-1.8" attribute is now written. The merge_chunks path mirrors all the same attributes. These changes make da.rio.crs (rioxarray path 1 in detect_crs()) reliably resolve without falling back to the global crs attribute. Existing NetCDF files built with the prior format remain readable via the detect_crs() fallback chain.

  • Step 03: CRS WKT stored as pyproj WKT2 instead of GDAL WKT1. HLSNetCDFAggregator.run() now generates the CRS WKT string via ProjCRS.from_user_input(crs).to_wkt() (pyproj WKT2) instead of rasterio’s crs.to_wkt() (GDAL WKT1). GDAL WKT1 for some HLS tiles lacks a top-level AUTHORITY["EPSG","XXXXX"] node, causing pyproj.CRS.from_wkt(wkt).to_epsg() to return None. Downstream consumers that group tiles by EPSG code (e.g. cross-CRS reprojection checks) would treat same-zone tiles as different CRS groups. The pyproj WKT2 output always includes a resolvable authority node. Existing NetCDF files retain their original WKT; rebuilding with step 03 is recommended for tiles where EPSG grouping matters downstream.


2026-02-28

Added

  • NETCDF_COMPLEVEL — configurable zlib compression level (0–9, default 1) for NetCDF time-series files written by step 03. Threaded through HLSNetCDFAggregator into chunk_info dicts (worker) and merge_chunks.

  • GEOTIFF_COMPRESS — configurable compression codec (default LZW) for all GeoTIFF outputs in steps 02 and 04–10. Accepts any codec supported by the local GDAL build (LZW, DEFLATE, ZSTD, NONE).

  • GEOTIFF_BLOCK_SIZE — configurable internal tile block dimension in pixels (default 512) for all tiled GeoTIFF outputs in steps 04–10. 512 is standard for desktop GIS; 256 is preferred for Cloud-Optimized GeoTIFFs.

  • reproject_resolution() in hls_utils.py — CRS-unit-aware resolution helper replacing all hardcoded resolution=30 calls in steps 04, 05, 09, 10. Returns metres unchanged for projected CRS; converts to approximate degrees for geographic CRS and logs a warning.

Fixed

  • Steps 04, 05, 09, and 10 produced a 1×1 pixel output with no valid data when TARGET_CRS was set to a geographic CRS (e.g. EPSG:4148) because resolution=30 was interpreted as 30 degrees per pixel instead of 30 metres.
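The unit handling behind reproject_resolution() can be sketched as below. The signature is an assumption: a boolean flag stands in for a CRS check (for instance pyproj's CRS.is_geographic property), and the metres-per-degree constant is an approximation chosen for illustration, not necessarily the value the pipeline uses.

```python
import logging

logger = logging.getLogger(__name__)

# Approximate length of one degree of latitude in metres (assumed constant).
METRES_PER_DEGREE = 111_320.0


def reproject_resolution(resolution_m: float, crs_is_geographic: bool) -> float:
    """Return the resolution in the target CRS's own units."""
    if not crs_is_geographic:
        # Projected CRS: units are already metres.
        return resolution_m
    resolution_deg = resolution_m / METRES_PER_DEGREE
    logger.warning(
        "Geographic target CRS: converting %.1f m to approximately %.7f degrees",
        resolution_m,
        resolution_deg,
    )
    return resolution_deg
```

Under these assumptions, 30 m maps to about 0.00027 degrees for a geographic CRS instead of being passed through as 30 degrees per pixel.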


2026-02-26

Added

  • Read the Docs configuration and Sphinx documentation scaffold (docs/)

  • docs/overview.md: comprehensive pipeline guide (full user documentation)

Changed

  • README.md restructured as a GitHub landing page (elevator pitch, outputs table, key features, quick start, and link to RTD); full documentation moved to docs/overview.md

  • docs/index.md updated to a hub toctree (overview, configuration, changelog); no longer uses {include} to pull README content

Fixed

  • System requirements table in README: added gdalinfo (called directly by step 01 for GeoTIFF validation; provided by the conda environment via rasterio’s GDAL dependency); clarified that conda is required not only for Python packages but also because it supplies the native geospatial libraries (GDAL, PROJ, HDF5, GEOS)

  • Per-file download validation with retry logic in step 01

Changed

  • NUM_WORKERS restored to 8 in config.env

Removed

  • Bulk download mode retired; tile-by-tile is now the only download mode, reducing peak disk usage to roughly one tile’s worth of raw data at a time


2026-02-25

Added

  • Step 09 — CountValid mosaic: counts valid (unmasked, in-range) observations per pixel across all download cycles and mosaics the result into a single study-area-wide GeoTIFF. Reads from NetCDF files (step 03); independent of TIMESLICE_WINDOWS and the time-series step.
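The per-pixel counting idea can be sketched with NumPy on an in-memory array. The function name, the (time, y, x) axis order, NaN as the mask value, and inclusive bounds are all assumptions for illustration; the actual step reads from the NetCDF files and mosaics the result.

```python
import numpy as np


def count_valid(stack: np.ndarray, valid_min: float, valid_max: float) -> np.ndarray:
    """Count, per pixel, the time steps whose value lies within [valid_min, valid_max].

    stack has shape (time, y, x); masked observations are assumed to be NaN.
    """
    with np.errstate(invalid="ignore"):
        # Comparisons against NaN evaluate to False, so masked cells are never counted.
        in_range = (stack >= valid_min) & (stack <= valid_max)
    return in_range.sum(axis=0).astype(np.uint16)


# Three acquisitions over a 1 x 2 pixel tile.
stack = np.array(
    [
        [[0.5, np.nan]],
        [[1.5, 0.2]],
        [[0.1, np.nan]],
    ]
)
counts = count_valid(stack, -1.0, 1.0)
```

Here the first pixel has two in-range observations (1.5 is out of range) and the second has one (two are masked).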

Changed

  • Steps renumbered to reflect execution order:

    • Former step 09 (time-series) → Step 10

    • Former step 10 (outlier GeoPackage) → Step 11


2026-02-22

Added

  • Initial release of the HLS Vegetation Index Pipeline

  • 11-step end-to-end workflow: download → VI calculation → NetCDF → reprojection → mosaics → time-series → outlier export

  • Support for NDVI, EVI2, and NIRv vegetation indices

  • Bitwise Fmask quality masking with independently configurable flags for cirrus, cloud, adjacent cloud, shadow, snow/ice, water, and aerosol mode

  • Tile-by-tile orchestration for steps 01–03, with optional space-saver flags to remove raw and/or VI intermediate files after each tile’s NetCDF is built

  • Configurable parallel processing via NUM_WORKERS

  • Per-VI valid range outlier detection with configurable bounds (VALID_RANGE_NDVI, VALID_RANGE_EVI2, VALID_RANGE_NIRv)

  • Multi-band seasonal composite stacks via TIMESLICE_WINDOWS (step 10)

  • GeoPackage export of per-pixel outlier observations with WGS84 coordinates (step 11)

  • Pre-flight band validation: the orchestrator checks that all bands required for the selected VIs are configured before any step executes

  • SKIP_APPROVAL flag for automated / non-interactive pipeline runs
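The bitwise Fmask masking listed above can be illustrated on a single Fmask byte. The bit positions follow the HLS v2.0 Fmask layout as commonly documented and should be checked against the HLS User Guide; the function name, flag names, and the max_aerosol parameter are hypothetical and do not reflect the pipeline's configuration keys.

```python
from typing import Set

# Assumed HLS v2.0 Fmask bit positions (verify against the HLS User Guide).
FMASK_BITS = {
    "cirrus": 0,
    "cloud": 1,
    "adjacent_cloud": 2,
    "shadow": 3,
    "snow_ice": 4,
    "water": 5,
}
AEROSOL_SHIFT = 6  # bits 6-7 hold the aerosol level (0-3)


def is_masked(fmask: int, flags: Set[str], max_aerosol: int = 3) -> bool:
    """Return True if any enabled flag bit is set or the aerosol level exceeds max_aerosol."""
    for name in flags:
        if fmask & (1 << FMASK_BITS[name]):
            return True
    aerosol_level = (fmask >> AEROSOL_SHIFT) & 0b11
    return aerosol_level > max_aerosol
```

Because each flag is tested independently, enabling or disabling one flag does not change how the others are evaluated; a pixel flagged only as water, for example, is kept unless "water" is in the enabled set.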