diff --git a/docs/389ds/design/ansible-replication-monitoring-design.md b/docs/389ds/design/ansible-replication-monitoring-design.md new file mode 100644 index 0000000..5467e77 --- /dev/null +++ b/docs/389ds/design/ansible-replication-monitoring-design.md @@ -0,0 +1,203 @@ +--- +title: "Replication Monitoring With Ansible" +--- + +# Replication Monitoring With Ansible Design + +{% include toc.md %} + +## Document Version + +0.2 + +## Revision History + +| Version | Date | Description of Change | +|---------|------------|-----------------------| +| 0.1 | 03-11-2024 | First MVP version | +| 0.2 | 2026-03-23 | Aligned terminology with Replication Log Analyzer design, added hop lag details, clarified output formats | + +## Related Documents + +- [Replication Log Analyzer Tool](replication-lag-report-design.md): Detailed design for the underlying analysis engine (`ReplicationLogAnalyzer`), CLI interface (`dsconf replication lag-report`), and Cockpit WebUI integration. The Ansible role and the CLI/WebUI tool share the same core analysis concepts (CSN-based lag calculation, global and hop-by-hop metrics) but use different delivery mechanisms. + +## Introduction + +The ds389_repl_monitoring role is designed to monitor replication lag in 389 Directory Server instances. It gathers replication data from access log files, analyzes the data to identify replication lags, and generates visual representations of the replication lag over time. + +The role computes two types of replication lag metrics: +1. **Global Replication Lag**: Time difference between the earliest and latest appearance of a CSN (Change Sequence Number) across all servers — measures end-to-end replication delay +2. **Hop-by-Hop Replication Lag**: Time delays between individual consecutive server pairs in the replication topology — identifies specific bottleneck links + +## Design Considerations + +- The role should be able to handle multiple 389 Directory Server instances. +- It should provide flexibility in specifying the log directory and result directory paths. +- The role should allow filtering the replication data based on various criteria, such as fully replicated changes, not replicated changes, lag time, and execution time. +- It should generate CSV and PNG files for easy analysis and visualization of replication lag data. JSON intermediate data is used internally for cross-host merging. +- The role should be idempotent and handle cases where the replication lag files already exist. + +## System Architecture + +### Role Walkthrough + +The ds389_repl_monitoring role consists of the following main task files: + +1. setup.yml: Performs initial setup tasks such as ensuring connectivity to the hosts, installing necessary packages on the Ansible controller, and creating the log directory. + +2. gather_data.yml: Finds all access log files in the specified directory on each 389 Directory Server instance, parses the logs using the ds389_log_parser module to extract CSN-based replication events, and merges the per-host JSON data from all instances using the ds389_merge_logs module into a single dataset for cross-server lag analysis. + +3. log_replication_lag.yml: Processes the merged JSON data through the ds389_logs_plot module to calculate global and hop-by-hop replication lag metrics, then generates CSV and PNG output files. The CSV contains detailed per-CSN lag data; the PNG contains time-series visualizations of replication lag with optional threshold lines. The files are saved in a directory named with the current date and hour. 
+ +4. cleanup.yml: Removes temporary files created during the monitoring process on both the remote hosts and the Ansible controller. + +### Custom Modules + +The ds389_repl_monitoring role utilizes three custom Ansible modules: + +1. ds389_log_parser: Parses 389 Directory Server access logs and extracts CSN-based replication data. For each CSN found in the logs, the module records timestamps, server names, target DNs, suffixes, and operation execution times. The output is a JSON file containing per-CSN data keyed by server index. + - logfiles: List of paths to 389ds access log files (required). + - anonymous: Replace log file names with generic identifiers (default: false). + - output_file: Path to the output JSON file where the results will be written (required). + +2. ds389_logs_plot: Processes merged JSON log data to calculate global and hop-by-hop replication lag metrics, then generates visualizations. The module computes global lag (latest - earliest CSN appearance across all servers) and hop lag (delay between consecutive server pairs sorted by arrival time) for each CSN. + - input: Path to the input JSON file containing the merged log data (required). + - csv_output_path: Path where the CSV file should be generated (required). + - png_output_path: Path where the plot image should be saved. + - only_fully_replicated: Filter to show only changes replicated on all replicas (default: false). + - only_not_replicated: Filter to show only changes not replicated on all replicas (default: false). + - lag_time_lowest: Filter to show only changes with global lag time greater than or equal to the specified value (in seconds). + - etime_lowest: Filter to show only changes with execution time (etime) greater than or equal to the specified value (in seconds). + - utc_offset: UTC offset in seconds for timezone adjustment. + - repl_lag_threshold: Replication monitoring threshold value (in seconds). A horizontal line will be drawn in the plot to represent this threshold. + +3. ds389_merge_logs: Merges multiple per-host JSON log files into a single file. This step is necessary because each host produces its own JSON output from ds389_log_parser — merging combines CSN data from all hosts so that cross-server lag can be calculated. + - files: A list of paths to the JSON files to be merged (required). + - output: The path to the output file where the merged JSON will be saved (required). + + +### Parameters + +The role accepts the following parameters: + +| Variable | Default | Description | +|----------|---------|-------------| +| ds389_repl_monitoring_lag_threshold | 10 | Threshold for replication lag monitoring (in seconds). A line will be drawn in the plot to indicate the threshold value. | +| ds389_repl_monitoring_result_dir | '/tmp' | Directory to store replication monitoring results. The generated CSV and PNG files will be saved in this directory. | +| ds389_repl_monitoring_only_fully_replicated | false | Filter to show only changes replicated on all replicas. If set to true, only changes that have been replicated to all replicas will be considered. | +| ds389_repl_monitoring_only_not_replicated | false | Filter to show only changes not replicated on all replicas. If set to true, only changes that have not been replicated to all replicas will be considered. | +| ds389_repl_monitoring_lag_time_lowest | 0 | Filter to show only changes with lag time greater than or equal to the specified value (in seconds). Changes with a lag time lower than this value will be excluded from the monitoring results. 
| +| ds389_repl_monitoring_etime_lowest | 0 | Filter to show only changes with execution time (etime) greater than or equal to the specified value (in seconds). Changes with an execution time lower than this value will be excluded from the monitoring results. | +| ds389_repl_monitoring_utc_offset | 0 | UTC offset in seconds for timezone adjustment. This value will be used to adjust the log timestamps to the desired timezone. | +| ds389_repl_monitoring_tmp_path | "/tmp" | Temporary directory path for storing intermediate files. This directory will be used to store temporary files generated during the monitoring process. | +| ds389_repl_monitoring_tmp_analysis_output_file_path | "{{ ds389_repl_monitoring_tmp_path }}/{{ inventory_hostname }}_analysis_output.json" | Path to the temporary analysis output file for each host. This file will contain the parsed replication data for each individual host. | +| ds389_repl_monitoring_tmp_merged_output_file_path | "{{ ds389_repl_monitoring_tmp_path }}/merged_output.json" | Path to the temporary merged output file. This file will contain the merged replication data from all hosts. | + + +## Inventory Example + +```yaml +all: + children: + production: + vars: + ds389_repl_monitoring_lag_threshold: 20 + ds389_repl_monitoring_result_dir: '/var/log/ds389_repl_monitoring' + hosts: + ds389_instance_1: + ansible_host: 192.168.2.101 + ds389_repl_monitoring_log_dir: '/var/log/dirsrv/slapd-supplier1' + ds389_instance_2: + ansible_host: 192.168.2.102 + ds389_repl_monitoring_log_dir: '/var/log/dirsrv/slapd-supplier2' +``` + +## Playbook Examples + +These examples demonstrate how ds389_repl_monitoring role can be customized using different variable settings to suit specific monitoring requirements. The role can be applied to different host groups, and the variables can be adjusted to filter the monitoring results based on various criteria such as fully replicated changes, minimum lag time, timezone offset, and minimum etime. + +### Example 1: Monitoring with custom lag threshold and result directory + +```yaml +- name: Monitor 389ds Replication with custom settings + hosts: ds389_replicas + roles: + - role: ds389_repl_monitoring + vars: + ds389_repl_monitoring_lag_threshold: 30 + ds389_repl_monitoring_result_dir: '/var/log/ds389_monitoring' +``` + +In this example, the role is applied to the `ds389_replicas` host group. The `ds389_repl_monitoring_lag_threshold` is set to 30 seconds, meaning that replication lag line will be drawn across the PNG graph. The `ds389_repl_monitoring_result_dir` is set to `/var/log/ds389_monitoring`, specifying the directory where the CSV and PNG files will be stored. + +### Example 2: Monitoring with filters for fully replicated and minimum lag time + +```yaml +- name: Monitor 389ds Replication with filters + hosts: ds389_servers + roles: + - role: ds389_repl_monitoring + vars: + ds389_repl_monitoring_only_fully_replicated: true + ds389_repl_monitoring_lag_time_lowest: 5 +``` + +This playbook applies the role to the `ds389_servers` host group. The `ds389_repl_monitoring_only_fully_replicated` variable is set to `true`, which means that only changes that have been fully replicated across all replicas will be considered. The `ds389_repl_monitoring_lag_time_lowest` is set to 5 seconds, so only changes with a lag time greater than or equal to 5 seconds will be included in the monitoring results. The results will be put in `/tmp` directory, which is default for `ds389_repl_monitoring_result_dir`. 
+ +### Example 3: Monitoring with timezone offset and minimum etime + +```yaml +- name: Monitor 389ds Replication with timezone and etime filters + hosts: directory_servers + roles: + - role: ds389_repl_monitoring + vars: + ds389_repl_monitoring_utc_offset: -21600 + ds389_repl_monitoring_etime_lowest: 1.5 +``` + +In this example, the role is used to monitor the hosts in the `directory_servers` group. The `ds389_repl_monitoring_utc_offset` is set to -21600 seconds, which adjusts the log timestamps by -6 hours to match the desired timezone. The `ds389_repl_monitoring_etime_lowest` variable is set to 1.5 seconds, meaning that only changes with an etime greater than or equal to 1.5 seconds will be included in the monitoring output. The results will be put in `/tmp` directory, which is default for `ds389_repl_monitoring_result_dir`. + +## Molecule Testing + +The role includes a Molecule configuration for testing with Docker containers simulating 389ds replicas. The test sequence: + +1. Builds multiple containers +2. Copies mock access log files into each container +3. Runs the role against the containers +4. Verifies the role's functionality by: + - Checking CSV and PNG files are generated correctly + - Validating the content of the generated files + - Ensuring proper packages are installed + - Checking permissions on key directories + +## Relationship to CLI/WebUI Tool + +The Ansible role and the `dsconf replication lag-report` CLI tool (with its Cockpit WebUI) solve the same core problem — replication lag analysis — but for different operational contexts: + +| Aspect | Ansible Role | CLI/WebUI Tool | +|--------|-------------|----------------| +| **Use case** | Automated, scheduled monitoring across fleets | Ad-hoc investigation or WebUI-driven analysis | +| **Log collection** | Ansible gathers logs from remote hosts | User provides local log directories | +| **Data merging** | Explicit merge step across hosts | Analyzer processes all directories in one pass | +| **Output formats** | CSV, PNG | CSV, PNG, HTML (Plotly), JSON (PatternFly charts) | +| **Precision controls** | Not applicable (processes all data) | Configurable sampling (`fast`/`balanced`/`full`) | +| **Drill-down** | Not available | CSN detail drill-down in WebUI charts | + +Both approaches calculate the same metrics (global lag, hop-by-hop lag) using the same underlying algorithms. See [Replication Log Analyzer Tool](replication-lag-report-design.md) for the full technical specification. 
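+
+As an illustration of these shared metrics, here is a short Python sketch. It is not the code of either tool; the function name, the `arrivals` mapping, and the example timestamps are hypothetical and only show how the two documented metrics are derived for a single CSN:
+
+```python
+# Illustrative sketch of the two documented metrics for a single CSN.
+# `arrivals` maps server name -> timestamp (in seconds) at which the CSN
+# appeared in that server's access log; all names here are hypothetical.
+def lag_metrics(arrivals):
+    ordered = sorted(arrivals.items(), key=lambda item: item[1])
+    # Global lag: latest minus earliest appearance across all servers.
+    global_lag = ordered[-1][1] - ordered[0][1]
+    # Hop lags: delay between consecutive servers, ordered by arrival time.
+    hops = [
+        {"from": src, "to": dst, "lag": t_dst - t_src}
+        for (src, t_src), (dst, t_dst) in zip(ordered, ordered[1:])
+    ]
+    return global_lag, hops
+
+# A change seen on supplier1 at t=0.0, consumer1 at t=1.0, and consumer2 at
+# t=1.5 has a global lag of 1.5 s and hop lags of 1.0 s and 0.5 s.
+```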
+ +## Future Improvements + +- Support for additional log formats and directory server versions +- Support for sending metrics to monitoring systems (Prometheus, Grafana) +- Notifications on critical replication lag events (email, webhook) +- HTML report generation (leveraging Plotly, consistent with the CLI tool's HTML output) +- JSON summary output with aggregate statistics (min/max/avg lag, per-suffix breakdowns) +- Configurable sampling/precision for large-scale deployments +- Integration with the `dsconf replication lag-report` CLI for unified report format + + +Authors +======= + +Simon Pichugin (@droideck) \ No newline at end of file diff --git a/docs/389ds/design/design.md b/docs/389ds/design/design.md index 5ba8c01..a5f0a5a 100644 --- a/docs/389ds/design/design.md +++ b/docs/389ds/design/design.md @@ -36,6 +36,7 @@ If you are adding a new design document, use the [template](design-template.html ## Ansible - [Ansible DS](ansible-ds.html) +- [Replication Monitoring With Ansible](ansible-replication-monitoring-design.html) ## 389 Directory Server 3.1 @@ -47,6 +48,7 @@ If you are adding a new design document, use the [template](design-template.html - [Online certificates refresh](online-certificate-refresh.html) - [CLI Encryption Module Management](cli-encryption-module-design.html) - [Fine grain operation timing](fine-grain-operation-timing.html) +- [Replication Lag Report Design - Tech Preview](replication-lag-report-design.md) ## 389 Directory Server 3.0 diff --git a/docs/389ds/design/replication-lag-report-design.md b/docs/389ds/design/replication-lag-report-design.md new file mode 100644 index 0000000..0878896 --- /dev/null +++ b/docs/389ds/design/replication-lag-report-design.md @@ -0,0 +1,540 @@ +--- +title: "Replication Log Analyzer Tool" +--- + +# Directory Server Replication Lag Analyzer Tool + +## Document Version + +1.1 + +## Revision History + +| Version | Date | Description of Change | +|---------|------------|-----------------------| +| 1.0 | 2025-10-26 | Initial design document | +| 1.1 | 2026-03-23 | Added precision controls, sampling strategy, CSN drill-down, improved output format details, updated WebUI and CLI sections | + +## Executive Summary + +The Directory Server Replication Lag Analyzer Tool is designed to analyze replication performance in 389 Directory Server deployments. It processes access logs from multiple directory servers, calculates replication lag times, and generates comprehensive reports in various formats (Charts, CSV, and only for Fedora - HTML, PNG). The system is available both as a command-line tool and through an web-based interface in the 389 DS Cockpit WebUI. + +The tool focuses on two key metrics: +1. Global Replication Lag: Time difference between the earliest and latest appearance of a CSN across all servers +2. Hop-by-Hop Replication Lag: Time delays between individual server pairs in the replication topology + + +## Architecture Overview + +The system consists of three main components: +1. `DSLogParser`: Parses directory server access logs +2. `ReplicationLogAnalyzer`: Coordinates log analysis and report generation +3. `VisualizationHelper`: Handles data visualization and report formatting + +## Replication Lag Calculation Technical Details + +### Global Replication Lag +- For each CSN (Change Sequence Number): + 1. Track timestamp of first appearance across all servers + 2. Track timestamp of last appearance across all servers + 3. Global lag = latest_timestamp - earliest_timestamp + +### Hop Replication Lag +- For each CSN: + 1. 
Sort server appearances by timestamp + 2. For consecutive server pairs (supplier → consumer): + - Hop lag = consumer_timestamp - supplier_timestamp + 3. Track individual hop lags to identify bottlenecks + +### Input Parameters +1. Log Directories: + List of paths to server log directories. Each directory represents one server in topology. + +2. Filtering Parameters: + - `suffixes`: List of DN suffixes to analyze + - `time_range`: Optional start/end datetime range + - `lag_time_lowest`: Minimum lag threshold + - `etime_lowest`: Minimum operation execution time + +3. Analysis Options: + - `anonymous`: Hide server names in reports + - `only_fully_replicated`: Show only changes reaching all servers + - `only_not_replicated`: Show only incomplete replication + - `utc_offset`: Timezone handling + +4. Performance Options: + - `analysis_precision`: Controls sampling aggressiveness (`fast`, `balanced`, `full`) + - `max_chart_points`: Maximum data points across all chart series (overrides precision preset) + - `sampling_mode`: `auto` (default) or `none` — controls whether downsampling is applied + +### Output Parameters +1. Reports: + - CSV: Detailed event log with global and hop lags + - HTML: Interactive visualization with Plotly (3 subplots: global lag, duration, per-hop lags) + - PNG: Static visualization with matplotlib (2 subplots: lags and durations) + - JSON: PatternFly chart data with interactive drill-down support + - Summary JSON: Always generated alongside other formats — contains aggregate statistics + +2. Metrics: + - Global lag statistics (min/max/avg) + - Hop lag statistics (min/max/avg) + - Per-suffix update counts + - Total updates processed + - Server participation statistics (active, processed, skipped log dirs) + +## Component Details + +### DSLogParser +- Purpose: Efficient log file parsing +- Key Features: + - Batch processing for memory efficiency + - Timezone-aware timestamp handling + - Regular expression-based log parsing + +### ReplicationLogAnalyzer +- Purpose: Analysis coordination and report generation +- Key Features: + - Multi-server log correlation + - Flexible filtering options + - Multiple report format support + +### VisualizationHelper +- Purpose: Data visualization +- Key Features: + - Interactive Plotly charts + - Static matplotlib exports + - Consistent color schemes + +## Performance & Precision Controls + +The analyzer supports configurable precision modes that balance analysis speed against data fidelity. This is critical for large deployments where access logs can contain millions of entries. + +### Precision Presets + +| Preset | Max Chart Points | Description | +|--------|-----------------|-------------| +| `fast` | 2,000 | Quick preview with aggressive sampling. Suitable for initial investigation of large datasets | +| `balanced` | 6,000 | Default. Good trade-off between speed and detail for most deployments | +| `full` | None (unlimited) | No sampling cap. Processes all data points. Very large datasets may still trigger auto-sampling if they exceed the auto-sampling threshold | + +### Sampling Strategy + +When datasets exceed the configured limits, the analyzer applies uniform sampling to reduce data volume while preserving the statistical shape of the data: + +- **Auto-sampling threshold**: 4,000 CSN points. 
Below this, sampling is skipped even in `fast`/`balanced` modes +- **Hop series budget**: 25% of total chart points are allocated to hop-lag series, with the remaining 75% for global lag series +- **Minimum points per series**: 2 points are always preserved per series to maintain visual continuity +- **Uniform distribution**: Sampled points are evenly distributed across the original dataset using index-based selection, ensuring no time periods are disproportionately represented + +The `--max-chart-points` CLI parameter allows direct override of the preset limit for fine-tuned control. + +### CSN Details Limiting + +For the JSON output drill-down feature, the analyzer stores detailed per-CSN propagation data. To prevent excessive memory usage: +- Maximum of 10,000 CSN details are retained +- When exceeded, CSNs are ranked by global lag (descending) and the top 10,000 highest-lag CSNs are kept — these are the most diagnostically valuable entries + +### Sampling Metadata + +Reports include sampling metadata so consumers know whether data was reduced: +```json +{ + "applied": false, + "mode": "auto", + "samplingMode": "auto", + "precision": "balanced", + "maxChartPoints": 6000, + "originalTotalPoints": 15000, + "reducedTotalPoints": 6000 +} +``` + +## Data Flow + +1. Log Collection: + ``` + Server Logs → DSLogParser → Parsed Events + ``` + +2. Analysis: + ``` + Parsed Events → ReplicationLogAnalyzer → Lag Calculations + ``` + +3. Sampling (if needed): + ``` + Lag Calculations → Precision Controls → Sampled Data + ``` + +4. Reporting: + ``` + Sampled Data → VisualizationHelper → Reports (JSON/CSV/HTML/PNG) + ``` + +## Output Format Details + +### Summary JSON (`replication_analysis_summary.json`) + +Always generated alongside other formats. Contains aggregate statistics: +```json +{ + "analysis_summary": { + "total_servers": 3, + "configured_log_dirs": ["/var/log/dirsrv/slapd-supplier1", "..."], + "processed_log_dirs": ["/var/log/dirsrv/slapd-supplier1", "..."], + "skipped_log_dirs": [], + "analyzed_logs": 4521, + "total_updates": 12843, + "minimum_lag": 0.001, + "maximum_lag": 45.230, + "average_lag": 2.150, + "minimum_hop_lag": 0.001, + "maximum_hop_lag": 12.450, + "average_hop_lag": 1.030, + "total_hops": 8922, + "updates_by_suffix": {"dc=example,dc=com": 12843}, + "time_range": {"start": "2025-01-01 00:00:00", "end": "2025-01-31 23:59:59"} + } +} +``` + +### PatternFly JSON (`replication_analysis.json`) + +Designed for the Cockpit WebUI's PatternFly chart components. Top-level structure: + +```json +{ + "replicationLags": { + "title": "Global Replication Lag Over Time", + "yAxisLabel": "Lag Time (seconds)", + "xAxisLabel": "Time", + "series": [ + { + "datapoints": [ + { + "name": "supplier1", + "x": "2025-01-15T10:30:00+00:00", + "y": 2.150, + "duration": 0.003, + "hoverInfo": "Timestamp: ...
CSN: ...
...", + "csnId": "5a5b6c7d000000010000" + } + ], + "legendItem": {"name": "supplier1 (dc=example,dc=com)"}, + "color": "#0066cc" + } + ] + }, + "hopLags": { + "title": "Per-Hop Replication Lags", + "yAxisLabel": "Hop Lag Time (seconds)", + "xAxisLabel": "Time", + "series": [ + { + "datapoints": [ + { + "name": "supplier1 → consumer1", + "x": "2025-01-15T10:30:00+00:00", + "y": 1.050, + "hoverInfo": "...", + "csnId": "5a5b6c7d000000010000" + } + ], + "legendItem": {"name": "supplier1 → consumer1"}, + "color": "#ff6600" + } + ] + }, + "csnDetails": { + "5a5b6c7d000000010000": { + "csn": "5a5b6c7d000000010000", + "targetDn": "uid=user1,ou=people,dc=example,dc=com", + "suffix": "dc=example,dc=com", + "globalLag": 2.150, + "originServer": "supplier1", + "originTime": "2025-01-15T10:30:00+00:00", + "arrivals": [ + { + "server": "supplier1", + "timestamp": "2025-01-15T10:30:00+00:00", + "relativeDelay": 0.0, + "duration": 0.003 + }, + { + "server": "consumer1", + "timestamp": "2025-01-15T10:30:01.05+00:00", + "relativeDelay": 1.050, + "duration": 0.002, + "hopFrom": "supplier1", + "hopLag": 1.050 + } + ], + "hops": [ + {"from": "supplier1", "to": "consumer1", "lag": 1.050} + ], + "totalHops": 1, + "serverCount": 2, + "replicatedToAll": true + } + }, + "metadata": { + "totalServers": 3, + "configuredLogDirs": ["..."], + "processedLogDirs": ["..."], + "skippedLogDirs": [], + "analyzedLogs": 4521, + "totalUpdates": 12843, + "timeRange": {"start": "...", "end": "..."}, + "timezone": "UTC", + "sampling": { "...sampling metadata..." } + } +} +``` + +The `csnDetails` map enables click-through drill-down in the WebUI — clicking any chart point reveals the full propagation path for that CSN across all servers. + +### CSV (`replication_analysis.csv`) + +Tabular format with columns: timestamp, CSN, server, lag_time, duration, target_dn, suffix, and hop lag information. Suitable for spreadsheet analysis and external tooling. + +### HTML (`replication_analysis.html`) + +Standalone interactive Plotly visualization with 3 subplots: +1. Global Replication Lag Over Time +2. Operation Duration Over Time +3. Per-Hop Replication Lags + +Supports hover info, range selection, and zoom controls. Requires `python3-lib389-repl-reports` package (Plotly dependency). + +### PNG (`replication_analysis.png`) + +Static matplotlib export with 2 subplots (global lags and operation durations). Requires `python3-lib389-repl-reports` package (matplotlib dependency). No per-hop subplot due to matplotlib's limitations with many series. + +## Challenges and Mitigations + +1. Large Log Files: + - Challenge: Memory consumption + - Mitigation: Batch processing, generators + +2. Time Zone Handling: + - Challenge: Accurate timestamp comparison + - Mitigation: Consistent UTC conversion + +3. Visualization Performance: + - Challenge: Large datasets overwhelming chart rendering + - Mitigation: Configurable precision presets with automatic sampling, client-side downsampling in the WebUI (2,000 points for global lags, 600 for hop lags), CSN details capped at 10,000 entries + +4. Memory Safety: + - Challenge: Very large JSON payloads in the WebUI + - Mitigation: 64 MiB read limits, blob-based PNG handling with cleanup, lazy tab loading + +## Web User Interface (WebUI) + +The Replication Log Analyzer is accessible via **Monitor** → **Log Analyser** in the 389 DS Cockpit WebUI. The interface provides a form-based configuration system with real-time validation and integrated file browsing capabilities. 
+ +### Interface Structure + +The UI is organized into card-based sections with an expandable help section explaining the analysis process. Form validation occurs in real-time with error highlighting and helper text for invalid inputs. + +The tool starts with an expandable "About Replication Log Analysis" section that provides a clear overview of the analysis process. This isn't just documentation - it's an interactive guide that walks you through the five essential steps: selecting server log directories, specifying suffixes, adjusting filters, choosing report formats, and generating the report. + +### Log Directory Selection + +**File Browser Integration**: Modal dialog for directory selection opens to `/var/log/dirsrv` by default. Supports navigation via path input or folder browsing with checkbox-based multi-selection. + +**Directory Management**: Selected directories display in a DataList component with folder icons and remove buttons. The interface validates directory accessibility before allowing selection. + +### Suffix Configuration + +**Input Field**: Text input with real-time DN validation using the `valid_dn()` function. Invalid DNs trigger immediate error display. + +**Chip Display**: Selected suffixes appear as removable PatternFly chips. Interface pre-populates with existing replicated suffixes from server configuration. + +### Configuration Options + +**Display Options**: +- Server anonymization toggle (replaces hostnames with generic identifiers) +- Replication filter: all entries, fully replicated only, or failed replication only + +**Time Range Controls**: +- DatePicker and TimePicker components for start/end times +- UTC offset field with increment/decrement buttons (30-minute intervals) +- Linked controls prevent invalid date ranges + +**Threshold Configuration**: +- NumberInput components for lag time and etime thresholds +- Increment/decrement controls with validation for positive numbers + +### Report Output Configuration + +**Analysis Precision**: Radio button selector controlling backend sampling strategy: +- **Fast (preview, sampled)**: Quick preview using aggressive sampling for large datasets +- **Balanced (default)**: Good balance between speed and detail +- **Full Precision (slower)**: Process all points without sampling. Note: very large datasets may still be auto-sampled for performance; use time range or suffix filters to reduce data volume + +The selected precision value is passed to the backend via the `--precision` CLI parameter. + +**Format Options**: +- JSON: PatternFly chart data with drill-down support (always available) +- CSV: Tabular data export (always available) +- HTML/PNG: Requires `python3-lib389-repl-reports` package + +**Package Detection**: On mount, the interface runs `rpm -q python3-lib389-repl-reports` to check package availability. If missing, HTML and PNG checkboxes are disabled with explanatory tooltips. + +**Output Directory**: Defaults to `/tmp` with optional custom directory selection via the file browser. Each report run creates a subdirectory — either using a custom report name or an auto-generated name with ISO 8601 timestamp and random suffix (e.g., `repl_report_2025-01-15T14-23-45_a3f2b1`). + +**Report Naming**: Optional custom report names; defaults to timestamp-based naming. + +### Report Generation + +**Process Flow**: +1. Form validation before submission +2. Background command execution via Cockpit spawn +3. Loading state with progress indicators +4. 
JSON response parsing for report file locations + +**Command Construction**: Builds `dsconf replication lag-report` command with all configured parameters, including log directories, suffixes, time ranges, and output formats. + +### Report Viewing Modal + +**Tabbed Interface**: `LagReportModal` dialog with tabs dynamically adapting to available report formats. Each tab loads its data independently and asynchronously, with load tokens preventing race conditions. + +- **Summary Tab** (always present): Displays aggregate statistics in a card-based layout: + - Analysis Overview card: total servers, analyzed log events, total updates + - Replication Lag Statistics card: minimum, maximum, and average lag + - Skipped Log Directories card: warning alert if any directories couldn't be read + - Updates by Suffix card: per-suffix update counts + - Time Range card: analysis start and end times + +- **Charts Tab** (when JSON available): Interactive PatternFly scatter-line charts rendering two chart types: + - **Global Replication Lag Over Time**: Shows lag values for each server/suffix combination + - **Per-Hop Replication Lags**: Shows hop-by-hop delays between server pairs (format: "supplier → consumer") + - **CSN Drill-Down**: Clicking any chart point opens a `CSNDetailModal` showing the full propagation path for that CSN — origin server, arrival timeline with arrows showing hop-by-hop propagation, and detailed timing information for each server + - **Client-Side Sampling**: For very large JSON datasets, the WebUI applies additional client-side downsampling (2,000 points for global lags, 600 for hop lags) with a warning alert when sampling is active + +- **PNG Tab** (when PNG available): Static image display. PNG is read as binary data (up to 64 MiB), converted to a data URL via Blob/FileReader, and rendered as an `` element + +- **CSV Tab** (when CSV available): Shows a text preview of the first 20 lines of CSV data in a `
` code block
+
+- **Report Files Tab** (always present): Lists all generated files (summary JSON, chart JSON, CSV, PNG, HTML) with download buttons. Downloads use Cockpit's channel API with hidden iframe injection for browser-native file saving
+
+### Existing Report Management
+
+**Report Discovery**: The "Choose Existing Report" button opens the `ChooseLagReportModal` dialog, which scans the configured output directory for existing reports. Discovery logic (sketched below):
+1. Lists subdirectories in the output directory
+2. Checks each for the presence of known report files (`replication_analysis.json`, `_summary.json`, `.html`, `.csv`, `.png`)
+3. Rejects directories containing unexpected files (strict validation)
+4. Retrieves creation time via `stat` or by parsing the directory name timestamp
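+
+The Python sketch below illustrates that walk. The actual implementation is JavaScript inside the Cockpit WebUI; the file-name patterns, `discover_reports`, and the returned fields are illustrative assumptions only:
+
+```python
+# Python sketch of the report-discovery walk (the real WebUI code is JavaScript).
+import os
+
+KNOWN_SUFFIXES = ("replication_analysis.json", "_summary.json",
+                  ".html", ".csv", ".png")
+
+def discover_reports(output_dir):
+    reports = []
+    for name in sorted(os.listdir(output_dir)):
+        subdir = os.path.join(output_dir, name)
+        if not os.path.isdir(subdir):
+            continue
+        files = os.listdir(subdir)
+        # Strict validation: every file must match a known report pattern.
+        if not files or not all(f.endswith(KNOWN_SUFFIXES) for f in files):
+            continue
+        reports.append({
+            "name": name,
+            "created": os.stat(subdir).st_mtime,  # or parsed from the dir name
+            "formats": sorted(files),
+        })
+    # Newest first, matching the report table ordering.
+    return sorted(reports, key=lambda r: r["created"], reverse=True)
+```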
+
+**Report Table**: Displays report metadata sorted by creation time (newest first) with format availability indicators (checkmarks/X marks) and "View Report" actions that open the same `LagReportModal` used for new reports.
+
+## Command Line Interface
+
+The replication lag analyzer is also available as a CLI tool through `dsconf`:
+
+```
+dsconf INSTANCE replication lag-report [options]
+```
+
+### Required Parameters
+
+**--log-dirs**: List of log directories to analyze. Each directory represents one server in the replication topology.
+```
+--log-dirs /var/log/dirsrv/slapd-supplier1 /var/log/dirsrv/slapd-consumer1
+```
+
+**--suffixes**: List of suffixes (naming contexts) to analyze.
+```
+--suffixes "dc=example,dc=com" "dc=test,dc=com"
+```
+
+**--output-dir**: Directory where analysis reports will be written.
+```
+--output-dir /tmp/repl_analysis
+```
+
+### Output Options
+
+**--output-format**: Specify one or more output formats. Options: html, json, png, csv. Default: html.
+```
+--output-format json csv png
+```
+
+**--json**: Output results as JSON for programmatic use or UI integration.
+
+### Filtering Options
+
+**Replication Status Filters** (mutually exclusive):
+- **--only-fully-replicated**: Show only entries that successfully replicated to all servers
+- **--only-not-replicated**: Show only entries that failed to replicate to all servers
+
+**Threshold Filters**:
+- **--lag-time-lowest SECONDS**: Show only entries whose global lag time is at or above this threshold
+- **--etime-lowest SECONDS**: Show only entries whose operation execution time (etime) is at or above this threshold
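+
+A minimal sketch of how these filters combine (the thresholds are inclusive lower bounds, and the two status filters are mutually exclusive). Field names such as `lag`, `etime`, and `replicated_to_all` are illustrative, not the analyzer's internal representation:
+
+```python
+# Sketch only: selecting per-CSN records under the documented filter semantics.
+def select(records, lag_time_lowest=0.0, etime_lowest=0.0,
+           only_fully_replicated=False, only_not_replicated=False):
+    if only_fully_replicated and only_not_replicated:
+        raise ValueError("the replication status filters are mutually exclusive")
+    selected = []
+    for rec in records:
+        if rec["lag"] < lag_time_lowest or rec["etime"] < etime_lowest:
+            continue  # thresholds are inclusive lower bounds
+        if only_fully_replicated and not rec["replicated_to_all"]:
+            continue
+        if only_not_replicated and rec["replicated_to_all"]:
+            continue
+        selected.append(rec)
+    return selected
+```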
+
+### Time Range Options
+
+**--start-time**: Start time for analysis in YYYY-MM-DD HH:MM:SS format. Default: 1970-01-01 00:00:00
+
+**--end-time**: End time for analysis in YYYY-MM-DD HH:MM:SS format. Default: 9999-12-31 23:59:59
+
+### Additional Options
+
+**--utc-offset**: UTC offset in ±HHMM format for timezone handling (e.g., -0400, +0530)
+
+**--anonymous**: Anonymize server names in reports (replaces with generic identifiers)
+
+### Performance Options
+
+**--precision**: Analysis precision vs speed. Choices: `fast`, `balanced` (default), `full`. Controls the sampling aggressiveness when generating chart data. See [Performance & Precision Controls](#performance--precision-controls) for details.
+
+**--max-chart-points**: Maximum total data points to include across all chart series. Sampling is applied if exceeded. Default depends on `--precision` setting. Overrides the precision preset when specified explicitly.
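+
+The uniform, index-based selection can be sketched as follows. This is illustrative only and not the analyzer's actual code; it reduces a single time-ordered series to its share of the chart-point budget (for example, roughly 25% of the total for hop-lag series) while always keeping at least two points, including the first and last:
+
+```python
+# Illustrative sketch: uniform, index-based downsampling of one chart series.
+def downsample(points, budget):
+    budget = max(int(budget), 2)   # at least 2 points are kept per series
+    if len(points) <= budget:
+        return points              # under budget: no sampling needed
+    step = (len(points) - 1) / (budget - 1)
+    indices = [round(i * step) for i in range(budget)]  # evenly spaced, keeps both ends
+    return [points[i] for i in indices]
+```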
+
+### Usage Examples
+
+Basic analysis:
+```
+dsconf supplier1 replication lag-report \
+  --log-dirs /var/log/dirsrv/slapd-supplier1 /var/log/dirsrv/slapd-consumer1 \
+  --suffixes "dc=example,dc=com" \
+  --output-dir /tmp/repl_report
+```
+
+Advanced analysis with filtering:
+```
+dsconf supplier1 replication lag-report \
+  --log-dirs /var/log/dirsrv/slapd-supplier1 /var/log/dirsrv/slapd-consumer1 \
+  --suffixes "dc=example,dc=com" \
+  --output-dir /tmp/repl_report \
+  --output-format json csv png \
+  --lag-time-lowest 1.0 \
+  --only-fully-replicated \
+  --start-time "2025-01-01 00:00:00" \
+  --end-time "2025-01-31 23:59:59" \
+  --utc-offset "-0500"
+```
+
+Fast preview of a large dataset:
+```
+dsconf supplier1 replication lag-report \
+  --log-dirs /var/log/dirsrv/slapd-supplier1 /var/log/dirsrv/slapd-consumer1 /var/log/dirsrv/slapd-consumer2 \
+  --suffixes "dc=example,dc=com" \
+  --output-dir /tmp/repl_report \
+  --output-format json \
+  --precision fast
+```
+
+Full precision with custom chart point limit:
+```
+dsconf supplier1 replication lag-report \
+  --log-dirs /var/log/dirsrv/slapd-supplier1 /var/log/dirsrv/slapd-consumer1 \
+  --suffixes "dc=example,dc=com" \
+  --output-dir /tmp/repl_report \
+  --output-format json csv \
+  --precision full \
+  --max-chart-points 15000
+```
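+
+Because a summary JSON is always generated alongside the other formats, report runs can be checked programmatically. A minimal Python sketch, assuming the report was written with `--output-dir /tmp/repl_report` as in the basic example above (the threshold is only an example value):
+
+```python
+# Minimal sketch: read the generated summary and flag excessive replication lag.
+import json
+
+with open("/tmp/repl_report/replication_analysis_summary.json") as f:
+    summary = json.load(f)["analysis_summary"]
+
+print(f"updates: {summary['total_updates']}, "
+      f"avg lag: {summary['average_lag']:.3f}s, "
+      f"max lag: {summary['maximum_lag']:.3f}s")
+
+if summary["maximum_lag"] > 10:  # example threshold in seconds
+    raise SystemExit("replication lag exceeded threshold")
+```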
+
+## Authors
+
+Simon Pichugin (@droideck)