103 changes: 103 additions & 0 deletions scripts/README_update_storage_tier.md
@@ -0,0 +1,103 @@
# Update STAC Storage Tier Metadata

Updates existing STAC items with current S3 storage tier metadata.

## Modes

**Update (default)**: Refreshes `ovh:storage_tier` on assets that already have an `alternate.s3` entry.

**Add Missing (`--add-missing`)**: Creates the `alternate.s3` structure on legacy items that lack it.

## Storage Tier Detection

- **Single file**: Returns the tier directly (e.g., `"GLACIER"`)
- **Uniform Zarr**: All sampled files share one tier (e.g., `"GLACIER"`, plus distribution)
- **Mixed Zarr**: Sampled files span several tiers (tier: `"MIXED"`, plus distribution breakdown)

The distribution gives file counts per tier, based on a sample of up to 100 files.

### Example: Mixed Storage
```json
{
  "ovh:storage_tier": "MIXED",
  "ovh:storage_tier_distribution": {
    "STANDARD": 450,
    "GLACIER": 608
  }
}
```
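
The detection rules above can be sketched in a few lines of Python. This is a minimal illustration, not the script's actual implementation: the function name `summarize_storage_tiers` and its input (a list of storage-class strings, one per sampled object) are assumptions made for the example.

```python
from collections import Counter


def summarize_storage_tiers(storage_classes):
    """Collapse per-file storage classes into (tier, distribution).

    Hypothetical helper: `storage_classes` holds one storage-class string
    per sampled object, e.g. ["STANDARD", "GLACIER", ...].
    """
    if not storage_classes:
        # Nothing could be queried, so no tier can be reported.
        return None, None

    distribution = dict(Counter(storage_classes))
    if len(distribution) == 1:
        # Single file or uniform Zarr: report the one tier that was seen.
        tier = next(iter(distribution))
    else:
        # More than one tier among the sampled files.
        tier = "MIXED"
    return tier, distribution


print(summarize_storage_tiers(["GLACIER"] * 3))
print(summarize_storage_tiers(["STANDARD", "GLACIER", "GLACIER"]))
```

In this sketch the distribution is always returned; a caller that follows the rules above would drop it for single-file assets and keep it only for Zarr directories.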

## Notes

- Thumbnail assets are skipped automatically
- If an S3 query fails, the existing `ovh:storage_tier` field is removed from that asset
- Distribution metadata is written only for Zarr directories
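
As a rough sketch of how these rules might apply to one asset's `alternate.s3` dictionary: the helper name `apply_storage_tier` is hypothetical, and clearing the distribution together with the tier on failure is an assumption, since the notes above only mention `ovh:storage_tier`.

```python
def apply_storage_tier(s3_alternate, tier, distribution=None):
    """Return a copy of an `alternate.s3` dict with tier fields refreshed.

    Hypothetical helper: `tier` is None when the S3 query failed.
    """
    updated = dict(s3_alternate)
    # Clear old values first so stale metadata never survives an update.
    # (Dropping the distribution here as well is an assumption.)
    updated.pop("ovh:storage_tier", None)
    updated.pop("ovh:storage_tier_distribution", None)
    if tier is None:
        # Failed query: leave the tier fields out entirely.
        return updated
    updated["ovh:storage_tier"] = tier
    if distribution is not None:
        # Only Zarr directories carry a distribution.
        updated["ovh:storage_tier_distribution"] = distribution
    return updated
```

Returning a copy rather than mutating in place makes it easy to compare old and new values, which is the kind of comparison a `--dry-run` preview needs.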

## Setup

```bash
# Install dependencies
uv sync

# Set environment variables
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export S3_ENDPOINT="https://s3.de.io.cloud.ovh.net"
export STAC_API_URL="https://api.explorer.eopf.copernicus.eu/stac"
export ITEM_URL="${STAC_API_URL}/collections/sentinel-2-l2a/items/ITEM_ID"
```
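
If item URLs are assembled in Python rather than in the shell, a small helper can guard against a doubled slash when the base URL ends in `/`. The function name `build_item_url` is hypothetical and not part of the scripts in this change.

```python
def build_item_url(stac_api_url, collection, item_id):
    """Join the pieces of a STAC item URL without doubling slashes."""
    base = stac_api_url.rstrip("/")
    return f"{base}/collections/{collection}/items/{item_id}"


print(build_item_url(
    "https://api.explorer.eopf.copernicus.eu/stac/",
    "sentinel-2-l2a",
    "ITEM_ID",
))
```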

## Usage

```bash
# Dry run (preview changes)
uv run python scripts/update_stac_storage_tier.py \
    --stac-item-url "$ITEM_URL" \
    --stac-api-url "$STAC_API_URL" \
    --s3-endpoint "$S3_ENDPOINT" \
    --dry-run

# Update existing alternate.s3
uv run python scripts/update_stac_storage_tier.py \
    --stac-item-url "$ITEM_URL" \
    --stac-api-url "$STAC_API_URL" \
    --s3-endpoint "$S3_ENDPOINT"

# Add missing alternate.s3 (legacy items)
uv run python scripts/update_stac_storage_tier.py \
    --stac-item-url "$ITEM_URL" \
    --stac-api-url "$STAC_API_URL" \
    --s3-endpoint "$S3_ENDPOINT" \
    --add-missing
```

## Output Examples

**Success:**
```
Processing: S2A_MSIL2A_20251008T100041_N0511_R122_T32TQM_20251008T122613
Assets with alternate.s3: 15
Assets with queryable storage tier: 15
Assets updated: 15
✅ Updated item (HTTP 201)
```

**Mixed storage detected:**
```
Processing: S2A_MSIL2A_20251208T100431_N0511_R122_T32TQQ_20251208T121910
reflectance: Mixed storage detected - {'STANDARD': 450, 'GLACIER': 608}
Assets updated: 1
✅ Updated item (HTTP 201)
```

**S3 query failures:**
```
⚠️ Failed to query storage tier from S3 for 4 asset(s)
Check AWS credentials, S3 permissions, or if objects are Zarr directories
```

## Related Scripts

- `register_v1.py` - Initial STAC registration (includes storage tier)
- `change_storage_tier.py` - Change S3 storage classes
- `storage_tier_utils.py` - Shared utilities for storage tier operations
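
For orientation, here is a minimal sketch of what the shared region helper might look like. It assumes `extract_region_from_endpoint` mirrors the inline endpoint parsing that this change removes from `register_v1.py`; the real `storage_tier_utils.py` is not shown here and may differ.

```python
from urllib.parse import urlparse

# Region codes taken from the inline checks removed from register_v1.py.
KNOWN_REGIONS = ("de", "gra", "sbg", "uk", "ca")


def extract_region_from_endpoint(s3_endpoint):
    """Guess the OVHcloud region from an S3 endpoint URL or bare hostname."""
    parsed = urlparse(s3_endpoint)
    # Without a scheme, urlparse puts the host in `path` instead of `netloc`.
    host = parsed.netloc or parsed.path
    for region in KNOWN_REGIONS:
        if f".{region}." in host:
            return region
    return "unknown"


print(extract_region_from_endpoint("https://s3.de.io.cloud.ovh.net"))
```

Keeping the region codes in one tuple means a new region is a one-line change instead of another `elif` branch.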
60 changes: 36 additions & 24 deletions scripts/register_v1.py
@@ -15,6 +15,7 @@
 from pystac import Asset, Item, Link
 from pystac.extensions.projection import ProjectionExtension
 from pystac_client import Client
+from storage_tier_utils import extract_region_from_endpoint, get_s3_storage_class

 # Configure logging (set LOG_LEVEL=DEBUG for verbose output)
 logging.basicConfig(
@@ -353,20 +354,8 @@ def add_alternate_s3_assets(item: Item, s3_endpoint: str) -> None:
         if ext not in item.stac_extensions:
             item.stac_extensions.append(ext)

-    # Parse endpoint to extract region info
-    # For OVHcloud endpoints like "s3.de.io.cloud.ovh.net", region is "de"
-    endpoint_host = urlparse(s3_endpoint).netloc or urlparse(s3_endpoint).path
-    region = "unknown"
-    if ".de." in endpoint_host:
-        region = "de"
-    elif ".gra." in endpoint_host:
-        region = "gra"
-    elif ".sbg." in endpoint_host:
-        region = "sbg"
-    elif ".uk." in endpoint_host:
-        region = "uk"
-    elif ".ca." in endpoint_host:
-        region = "ca"
+    # Extract region from endpoint
+    region = extract_region_from_endpoint(s3_endpoint)

     # Add alternate to each asset with data role that has an HTTPS URL
     modified_count = 0
Expand All @@ -383,18 +372,40 @@ def add_alternate_s3_assets(item: Item, s3_endpoint: str) -> None:
if not s3_url:
continue

# Query storage class for this asset
storage_tier = get_s3_storage_class(s3_url, s3_endpoint)

# Add alternate with storage extension fields
if not hasattr(asset, "extra_fields"):
asset.extra_fields = {}

asset.extra_fields["alternate"] = {
"s3": {
"href": s3_url,
"storage:platform": "OVHcloud",
"storage:region": region,
"storage:requester_pays": False,
}
# Preserve existing alternate structure if present
existing_alternate = asset.extra_fields.get("alternate", {})
if not isinstance(existing_alternate, dict):
existing_alternate = {}

# Get existing s3 alternate or create new one
existing_s3 = existing_alternate.get("s3", {})
if not isinstance(existing_s3, dict):
existing_s3 = {}

# Update s3 alternate (preserving any existing fields)
s3_alternate = {
**existing_s3, # Preserve existing fields
"href": s3_url,
"storage:platform": "OVHcloud",
"storage:region": region,
"storage:requester_pays": False,
}

# Add storage tier as a custom field (not part of storage extension spec)
# Using ovh: prefix to indicate vendor-specific extension
if storage_tier:
s3_alternate["ovh:storage_tier"] = storage_tier

# Preserve other alternate formats (e.g., alternate.xarray if it exists)
existing_alternate["s3"] = s3_alternate
asset.extra_fields["alternate"] = existing_alternate
modified_count += 1

if modified_count > 0:
@@ -648,19 +659,20 @@ def run_registration(
     remove_xarray_integration(item)

     # 7. Add alternate S3 URLs to assets (alternate-assets + storage extensions)
+    # This also queries and adds ovh:storage_tier to each asset's alternate.s3
     add_alternate_s3_assets(item, s3_endpoint)

     # 8. Add visualization links (viewer, xyz, tilejson)
     add_visualization_links(item, raster_api_url, collection)
     logger.info(" 🎨 Added visualization links")

-    # 9. Add thumbnail asset for STAC browsers
+    # 10. Add thumbnail asset for STAC browsers
     add_thumbnail_asset(item, raster_api_url, collection)

-    # 10. Add derived_from link to source item
+    # 11. Add derived_from link to source item
     add_derived_from_link(item, source_url)

-    # 11. Register to STAC API
+    # 12. Register to STAC API
     client = Client.open(stac_api_url)
     upsert_item(client, collection, item)
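
The merge logic added to `add_alternate_s3_assets` can be tried in isolation on a plain dictionary. The sketch below follows the diff closely, but the standalone function `merge_s3_alternate` and its signature are hypothetical, introduced only so the behaviour can be exercised without a `pystac` item.

```python
def merge_s3_alternate(extra_fields, s3_url, region, storage_tier=None):
    """Set `alternate.s3` on an asset's extra fields, keeping what is already there."""
    # Fall back to an empty dict when `alternate` is missing or malformed.
    alternate = extra_fields.get("alternate", {})
    if not isinstance(alternate, dict):
        alternate = {}
    existing_s3 = alternate.get("s3", {})
    if not isinstance(existing_s3, dict):
        existing_s3 = {}

    s3_alternate = {
        **existing_s3,  # unknown fields on the old entry survive
        "href": s3_url,
        "storage:platform": "OVHcloud",
        "storage:region": region,
        "storage:requester_pays": False,
    }
    if storage_tier:
        s3_alternate["ovh:storage_tier"] = storage_tier

    # Other alternates (e.g. alternate.xarray) are left untouched.
    alternate["s3"] = s3_alternate
    extra_fields["alternate"] = alternate
    return extra_fields


fields = {"alternate": {"xarray": {"href": "x"}, "s3": {"custom": 1}}}
print(merge_s3_alternate(fields, "s3://bucket/key.zarr", "de", "GLACIER"))
```

One consequence of spreading `existing_s3` first: when the tier lookup returns nothing, a previously stored `ovh:storage_tier` stays in place rather than being cleared.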
