Skip to content

Feature/fpm 995 alt data window#83

Open
lcdurkan wants to merge 24 commits into
developfrom
feature/FPM-995-alt-data-window
Open

Feature/fpm 995 alt data window#83
lcdurkan wants to merge 24 commits into
developfrom
feature/FPM-995-alt-data-window

Conversation

@lcdurkan
Copy link
Copy Markdown
Contributor

@lcdurkan lcdurkan commented May 7, 2026

FPM-995

Overview

AltDataDynamic infills missing values using an alternative data source and a dynamic correction factor derived from surrounding data. For each contiguous gap in the original dataset, a time window is defined around the gap. A correction factor is computed as the ratio of the sum of the original data to the sum of the alternative data within this window. The alternative data corresponding to the missing interval is scaled by the correction factor to produce the infilled values.

window_size:
AltDataDynamic is instantiated by specifying a window_size, that must either be an iso string, a TimeStream.Period type, or a timedelta. The method _window_duration converts the window_size into a timedelta, and performs validation that the window is at least as long as the periodicity and is of the correct format.

Thresholds:

  • If no min or max threshold are provided, then by default all of the usable data in the window surrounding the missing data interval are used to calculate the correction factor.
  • If a min_threshold is provided, if there is not enough data in the windows around the missing data interval to meet this threshold, that gap is not infilled.
  • If a max_threshold is provided, and is less than the total number of datapoints available across windows surrounding the gap, then the first closest datapoints to the gap are used until the max_threshold of datapoints is met. Unless the optional parameter window_side is set to either of left or right, there is a preference for using the same number of datapoints either side of the gap unless it is not possible to do so.

Raising errors:

  • timedeltas only work for days, hours, seconds or smaller. If a window_size is given in months or years, an error is raised.
  • if a window_size corresponds to a timedelta less than the periodicity of the dataset, an error is raised.
  • if the min_threshold is too large to be met in the window_size specified, an error is raised.
  • if the min_threshold > max_threshold or max_threshold = 0, an error is raised.

Specifying the window side:
By default, 'windows' of data are created either side of the missing data interval. The user can also specify that only data to the left /right of the gap should be used, by initialising AltDataDynamic with window_size = "left" or "right". If window_size = "both", then the default behaviour is used such that a window left and right of the gap is used.

Main changes

Updates to infill.py:

  • Added new AltDataDynamic class, with _fill, _window_duration, _build_windowed_df and _build_correction_factors methods. There are some additional helper functions.

Updates to test_infill.py:

  • Added unit tests to check all functionality of new AltDataDynamic class.

Update to api/infilling.rst and user_guide/infilling.rst:

  • Added documentation

Update to examples_infilling.py:

  • Added examples on how to use AltDataDynamic

Questions on styling:

  • Should we update all column names to use "_column_name" for consistency? Some have this, others don't.
  • Should _build_correction_factors be skipped over if _build_windowed_df returns None? Currently _build_correction_factors handles this case internally.

@lcdurkan lcdurkan marked this pull request as ready for review May 7, 2026 15:01
Copy link
Copy Markdown
Collaborator

@richjam richjam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments more on the code itself rather than the logic - I'll leave that to @simonstanley

Comment thread src/time_stream/infill.py Outdated
Comment thread src/time_stream/infill.py Outdated
Comment thread src/time_stream/infill.py Outdated
Comment thread src/time_stream/infill.py Outdated
Comment thread src/time_stream/infill.py Outdated
@richjam
Copy link
Copy Markdown
Collaborator

richjam commented May 8, 2026

Sorry meant to say as well that we will definitely want worked examples in the docs user-guide page

Comment thread src/time_stream/infill.py Outdated
Comment thread src/time_stream/infill.py Outdated
Comment thread src/time_stream/infill.py Outdated
Comment thread src/time_stream/infill.py Outdated
Comment thread tests/time_stream/test_infill.py Outdated
Comment thread tests/time_stream/test_infill.py Outdated
Comment thread tests/time_stream/test_infill.py Outdated
Comment thread tests/time_stream/test_infill.py Outdated
Comment thread tests/time_stream/test_infill.py Outdated
Comment thread src/time_stream/infill.py Outdated
Comment thread src/time_stream/infill.py Outdated
Comment thread tests/time_stream/test_infill.py
Comment thread src/time_stream/infill.py Outdated
Comment thread src/time_stream/infill.py Outdated
Comment thread src/time_stream/infill.py Outdated
Comment thread src/time_stream/infill.py Outdated
Comment thread src/time_stream/infill.py Outdated
Comment thread src/time_stream/infill.py Outdated
@lcdurkan lcdurkan requested review from richjam and simsta87 May 29, 2026 09:44
Copy link
Copy Markdown
Collaborator

@richjam richjam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this keeps dragging on!!

I think it looks much better with the Polars vectorisation rather than the for loop.

Just a couple of suggestions for tidying up a couple things

Comment thread src/time_stream/infill.py

def _apply_max_threshold(
self,
max_threshold: int,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could get rid of the max_threshold parameter, just use self.max_threshold within this method

(the only thing calling this method sends self.max_threshold)

Comment thread src/time_stream/infill.py
Comment on lines +541 to +545
windowed_df = df.drop(gap_id_column_name).join_where(
gap_bounds,
pl.col(time_column_name) >= pl.col("__GAP_START__") - window_duration,
pl.col(time_column_name) <= pl.col("__GAP_END__") + window_duration,
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm reading the logic right, then this join_where includes all the null gap rows, which are immediately discarded in the _build_windowed_data method.

Can you do the filter nulls before the join_where for a bit of efficiency?

Could probably just get rid of the _filter_nulls method and inline that before the join_where:

windowed_df = df.drop(gap_id_column_name).filter(
      pl.col(infill_column).is_not_null() & pl.col(alt_data_column_name).is_not_null()
).join_where(gap_bounds, ...)

Comment thread src/time_stream/infill.py
Comment on lines +656 to +657
# Filter out all null values from both the original and alternative dataset.
windowed_df = self._filter_nulls(df, infill_column, alt_data_column_name)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above comment, I think this can move before the join_where, so remove from this method.

Comment thread src/time_stream/infill.py
Comment on lines +682 to +699
def _filter_nulls(
self,
df: pl.DataFrame,
infill_column: str,
alt_data_column_name: str,
) -> pl.DataFrame:
"""Remove rows where either the infill or alternative data column is null.

Args:
df: Input DataFrame.
infill_column: Name of the column to be infilled.
alt_data_column_name: Name of the alternative data column.

Returns:
DataFrame with rows containing nulls in either value column removed.
"""
return df.filter(pl.col(infill_column).is_not_null() & pl.col(alt_data_column_name).is_not_null())

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above comment, this could just be inlined - no need to be a static method really

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants