Checked for duplicates
Yes - I've already checked
User Persona(s)
Data Engineer, Node Operator
Motivation
...so that I can validate large Table_Character and Table_Delimited products with millions of records without multi-minute wall-clock runtimes, by reducing the per-record and per-field overhead that currently dominates validation time.
Additional Details
Profiling a 354 MB Table_Character product (1M records, 22 ASCII_Real fields) shows ~15.8s end-to-end validation time. Raw I/O accounts for less than 10% of that; the bottleneck is per-record field validation in FieldValueValidator and TableValidator. Specific sources of waste identified in the code:
1. value.trim() called 3–6 times per field
In FieldValueValidator.validate(), the same value.trim() expression appears in length checks, empty checks, checkType(), checkSpecialMinMax(), and error message strings. Each call allocates a new String whenever the value has leading or trailing whitespace, which is typically the case for padded fixed-width fields. For 1M records × 22 fields = 22M fields, this means 66–132M trim calls where 22M would suffice. Fix: call trim() once at the top of the field loop and reuse the result.
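The trim-once fix could look roughly like the sketch below. It is an illustration only, not the actual FieldValueValidator code: the class TrimOnceSketch, the methods validateRecord and isReal, and the maxLength parameter are hypothetical names, and the Double.parseDouble check is a simplified stand-in for the real ASCII_Real type check.

```java
import java.util.List;

// Hypothetical stand-in for the per-field checks; NOT the real FieldValueValidator.
final class TrimOnceSketch {

    /** Returns the number of problems found in one record, trimming each value once. */
    static int validateRecord(List<String> rawValues, int maxLength) {
        int problems = 0;
        for (String value : rawValues) {
            // Single trim at the top of the field loop; every check below reuses it.
            final String trimmed = value.trim();

            if (trimmed.isEmpty()) {            // empty check
                problems++;
                continue;
            }
            if (trimmed.length() > maxLength) { // length check
                problems++;
                continue;
            }
            if (!isReal(trimmed)) {             // type check (simplified assumption)
                problems++;
            }
        }
        return problems;
    }

    /** Simplified numeric check; the real ASCII_Real rules may differ. */
    static boolean isReal(String trimmed) {
        try {
            Double.parseDouble(trimmed);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // One valid value, one blank value, one non-numeric value.
        System.out.println(validateRecord(List.of("  1.5 ", "   ", " abc "), 10));
    }
}
```

Passing the already-trimmed value into the helper checks, rather than the raw value, is what keeps downstream methods from trimming again.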
2. Double-read for Table_Delimited
In TableValidator.validateTableDelimitedContent(), each iteration calls both readNextLine() (raw line string) and getRecord(currentRow) (parsed record) for the same row. The raw line is only used for delimiter/EOL checks; the record is used for field validation. These could be unified to avoid the dual object allocation per record.
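One possible shape for the unified read is sketched below: each raw line is fetched once, and both the delimiter check and the parsed fields are derived from that single string. This is an assumption-laden illustration, not the pds4-jparser or validate API: SingleReadSketch, badRows, and split are hypothetical names, the line source is modeled as a plain Iterable of strings, and quoted fields and EOL checks are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; NOT the real TableValidator or pds4-jparser reader.
final class SingleReadSketch {

    /** Returns the 1-based numbers of rows whose field count differs from expectedFields. */
    static List<Integer> badRows(Iterable<String> rawLines, char delimiter, int expectedFields) {
        List<Integer> bad = new ArrayList<>();
        int row = 0;
        for (String line : rawLines) {
            row++;
            // Parse the fields from the line we already hold instead of
            // asking the reader for the same row a second time.
            List<String> fields = split(line, delimiter);
            if (fields.size() != expectedFields) {
                bad.add(row);
            }
            // Field-level validation would consume `fields` here.
        }
        return bad;
    }

    /** Splits on a single delimiter character; ignores quoting (simplifying assumption). */
    static List<String> split(String line, char delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == delimiter) {
                fields.add(line.substring(start, i));
                start = i + 1;
            }
        }
        fields.add(line.substring(start)); // trailing field after the last delimiter
        return fields;
    }

    public static void main(String[] args) {
        // The second row has only two fields, so it is reported.
        System.out.println(badRows(List.of("a,b,c", "1,2", "x,y,z"), ',', 3));
    }
}
```

Whether the real fix parses from the raw line or derives the delimiter checks from the parsed record is a design choice for the implementer; the point of the sketch is only that one read per row can serve both purposes.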
Relationship to pds4-jparser#197:
The readNextLine() buffering fix in pds4-jparser (4.7× raw I/O throughput improvement) has measurable impact for Table_Delimited validation but zero impact for Table_Character validation, because TableValidator uses readNextFixedLine() for fixed-length tables, bypassing readNextLine() entirely. Per-record field validation is the dominant cost in both cases.
Relevant files:
src/main/java/gov/nasa/pds/tools/validate/content/table/FieldValueValidator.java
src/main/java/gov/nasa/pds/tools/validate/rule/pds4/TableValidator.java
Related requirements
NASA-PDS/pds4-jparser#197
For Internal Dev Team To Complete
Acceptance Criteria
Given a Table_Character or Table_Delimited product with ≥ 1M records
When I run validation
Then I expect wall-clock time is measurably reduced compared to the current baseline, with no change in validation correctness
Engineering Details
I&T
Generated with Claude Code