IEEE International Conference on Big Data (BigData'25) — Regular Paper (to appear)
MariNav is a reinforcement learning (RL) environment built to simulate and optimize tanker navigation across oceanic routes represented by H3 hexagonal grids. It integrates real-world wind data, historical route frequencies, and fuel consumption models, allowing agents to learn efficient and realistic maritime paths under environmental constraints.
Link to the research paper: https://arxiv.org/pdf/2509.01838
This environment is compatible with PPO and MaskablePPO from Stable-Baselines3 and sb3-contrib, making it suitable for training with or without action masking support. It also supports intrinsic exploration techniques such as RND (Random Network Distillation) from the rllte.xplore.reward module.
Below is an image showing the shortest path between a start and a goal H3 cell on the map, computed with Dijkstra's algorithm. MariNav can be used to learn a more efficient route with RL, given weather, fuel, and time constraints.
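For a baseline comparison, a Dijkstra shortest path over the visit graph can be computed directly with NetworkX. A minimal sketch, assuming the graph nodes are keyed by H3 index strings and edges carry a `weight` attribute (the start/goal cells below are illustrative picks from the H3 pool used later):

```python
import networkx as nx

# AIS-derived visit graph (same file as in the quickstart below)
graph = nx.read_gexf("GULF_VISITS_CARGO_TANKER_AUGUST_2018.gexf").to_undirected()

# Hypothetical start/goal H3 cells
start, goal = "862b160d7ffffff", "862b33237ffffff"

# Classical Dijkstra baseline; "weight" is an assumed edge attribute name
baseline_path = nx.dijkstra_path(graph, start, goal, weight="weight")
print(f"{len(baseline_path)} cells: {baseline_path[:5]} ...")
```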
git clone https://github.com/Vaishnav2804/MariNav
cd MariNav
pip install -r requirements.txt

Example usage:

from Env.MariNav import MariNav
from sb3_contrib import MaskablePPO
import networkx as nx
# Load input data
wind_map = load_full_wind_map("august_2018_60min_windmap_v2.csv")
graph = nx.read_gexf("GULF_VISITS_CARGO_TANKER_AUGUST_2018.gexf").to_undirected()
# Define H3 region pool
h3_pool = [
"862b160d7ffffff", "860e4da17ffffff",
"861b9101fffffff", "862b256dfffffff",
"862b33237ffffff"
]
# Create environment
env = MariNav(
h3_pool=h3_pool,
graph=graph,
wind_map=wind_map,
h3_resolution=6,
wind_threshold=22,
render_mode="human"
)
model = MaskablePPO("MlpPolicy", env, verbose=1) # Or use PPO
model.learn(total_timesteps=100_000)

To add intrinsic exploration rewards with RND:

from rllte.xplore.reward import RND
from rllte.xplore.wrapper import RLeXploreWithOnPolicyRL
from stable_baselines3 import PPO
irs = RND(env, device="cpu")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=240_000_000, callback=RLeXploreWithOnPolicyRL(irs))
In this environment, PPO with action masks (MaskablePPO) performs better than standard PPO with an MLP policy. Play around with both to explore further!
MariNav models real-world vessel navigation by:
- Using H3 grids to represent navigable ocean regions
- Integrating timestamped wind data
- Leveraging historical route usage from AIS-derived graphs
- Providing multi-objective rewards for training robust RL agents
- Supporting best-path computation for any pair of H3 cells in a continuous observation space
Key features:
- Hex-Based Ocean Model: Built using Uber H3 (resolution 6)
- Graph Navigation: Based on real ship visits stored in a weighted NetworkX graph
- Wind-Aware Dynamics: Wind speed and direction influence vessel speed and fuel
- Dynamic Action Space: Multi-discrete choices over neighbor cells + discrete speeds
The reward function combines:
- Progress Reward – distance reduction toward the goal
- Fuel Penalty – penalizes fuel-heavy maneuvers
- Wind Penalty – penalizes adverse wind alignment
- Alignment Penalty – penalizes angular misalignment with wind
- ETA Penalty – encourages timely arrival
- Frequency Reward – rewards travel along historically common routes
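A minimal sketch of how these terms could be combined into a single step reward, using the tunable constants listed further below (the function and term names here are illustrative, not the exact MariNav.py implementation):

```python
# Illustrative only: the actual term definitions and weights live in MariNav.py
PROGRESS_REWARD_FACTOR = 2
FUEL_PENALTY_SCALE = -0.001
WIND_PENALTY_VALUE = -1.0
ETA_PENALTY = -2.0

def step_reward(progress, fuel_used, adverse_wind, misalignment, late, freq_bonus):
    reward = PROGRESS_REWARD_FACTOR * progress            # distance reduction toward the goal
    reward += FUEL_PENALTY_SCALE * fuel_used              # penalize fuel-heavy maneuvers
    reward += WIND_PENALTY_VALUE if adverse_wind else 0   # penalize adverse wind alignment
    reward -= misalignment                                # penalize angular misalignment with wind
    reward += ETA_PENALTY if late else 0                  # encourage timely arrival
    reward += freq_bonus                                  # reward historically common routes
    return reward
```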
Monitoring and logging:
- TensorBoard metrics for each reward component
- Route frequency analysis and pair visitation tracking
- Matplotlib + GeoPandas-based rendering (headless safe)
- CSV logging of each episode and step-level transition
The main training script supports:
- Parallel training using SubprocVecEnv
- Reward normalization with VecNormalize
- Early stopping and evaluation checkpoints
- Logging with TensorBoard, CSVs, and SB3-compatible callbacks
Run:

python train_marinav.py

Inside the script, each worker environment is built through a factory:

from stable_baselines3.common.monitor import Monitor

def make_env():
    def _init():
        base_env = MariNav(
            h3_pool=H3_POOL,
            graph=G_visits,
            wind_map=full_wind_map,
            h3_resolution=6,
            wind_threshold=22,
            render_mode="human"
        )
        # Share route-visit counts across episodes for frequency tracking
        base_env.visited_path_counts = global_visited_path_counts
        return Monitor(base_env)
    return _init
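The factory above can then be wrapped for parallel, normalized training. A sketch under the assumption that train_marinav.py uses the standard Stable-Baselines3 utilities named in the feature list (the number of workers and normalization settings are illustrative):

```python
from stable_baselines3.common.vec_env import SubprocVecEnv, VecNormalize

NUM_ENVS = 4  # illustrative; choose based on available CPU cores

# Run several MariNav instances in parallel processes
vec_env = SubprocVecEnv([make_env() for _ in range(NUM_ENVS)])

# Normalize observations and rewards during training
vec_env = VecNormalize(vec_env, norm_obs=True, norm_reward=True, clip_obs=10.0)
```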
Observation space: an 8-dimensional vector containing
- [lat, lon] of the current position
- speed_over_ground
- wind_direction
- [lat, lon] of the start and goal cells
Action space (multi-discrete):
- Neighbor index (variable per cell)
- Speed index (5 levels from 8–22 knots)
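A minimal sketch of how such an action space could be declared with Gymnasium (the neighbor count of 6 is illustrative; MariNav exposes action masks so invalid neighbors can be skipped by MaskablePPO):

```python
from gymnasium import spaces

# Illustrative only: up to 6 H3 neighbors and 5 discrete speed levels (8-22 knots)
action_space = spaces.MultiDiscrete([6, 5])

sample = action_space.sample()  # e.g. array([3, 1]) -> neighbor 3, speed level 1
```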
An episode ends when one of the following occurs:
- Goal reached (success)
- Max episode length exceeded (truncated)
- Invalid move (failure)
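A minimal rollout loop illustrating these outcomes, assuming MariNav follows the Gymnasium-style step API (a random policy is used purely for illustration):

```python
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random policy, for illustration only
    obs, reward, terminated, truncated, info = env.step(action)
    # terminated: goal reached or invalid move; truncated: max episode length exceeded
    done = terminated or truncated
```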
Modify constants in MariNav.py:
PROGRESS_REWARD_FACTOR = 2
FUEL_PENALTY_SCALE = -0.001
WIND_PENALTY_VALUE = -1.0
ETA_PENALTY = -2.0

Run:

tensorboard --logdir ./logs/

Access at: http://localhost:6006
Tracked metrics include:
- episodic_return
- progress_reward
- fuel_penalty
- wind_penalty
- alignment_penalty
- eta_penalty
- step_penalty
- frequency_reward
MariNav tracks how often each (start_h3, goal_h3) pair is selected and successfully completed. These metrics help debug agent behavior and ensure balanced training.
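A quick way to inspect this after training, assuming the counts are stored as a dict keyed by (start_h3, goal_h3) on the underlying environment (attribute name taken from the factory above):

```python
# Print the ten most frequently sampled start/goal pairs
pair_counts = env.visited_path_counts
for (start, goal), count in sorted(pair_counts.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{start} -> {goal}: {count}")
```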
You must provide:
- WIND_MAP_PATH: CSV with wind speed & direction per H3 cell per timestamp
- GRAPH_PATH: GEXF file of navigable ocean routes
- H3_POOL: Valid H3 indices for start/goal sampling
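For example, pointing these at the quickstart data files (values are illustrative):

```python
# Illustrative configuration matching the quickstart above
WIND_MAP_PATH = "august_2018_60min_windmap_v2.csv"
GRAPH_PATH = "GULF_VISITS_CARGO_TANKER_AUGUST_2018.gexf"
H3_POOL = [
    "862b160d7ffffff", "860e4da17ffffff",
    "861b9101fffffff", "862b256dfffffff",
    "862b33237ffffff",
]
```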
To contribute:
- Fork the repo
- Make a feature branch
- Commit and push changes
- Open a pull request
The weighted graph used in MariNav was inspired by the following research:
Learning Spatio-Temporal Vessel Behavior using AIS Trajectory Data and Markovian Models in the Gulf of St. Lawrence
Google Scholar: https://scholar.google.ca/citations?view_op=view_citation&hl=en&user=aiL559gAAAAJ&citation_for_view=aiL559gAAAAJ:9yKSN-GCB0IC
This work helped guide the design of the frequency-weighted transition graph for maritime routing in the Gulf of St. Lawrence.


