Feature/fpm 995 alt data window #83
…calculate dynamic correction factor
richjam
left a comment
Some comments more on the code itself rather than the logic - I'll leave that to @simonstanley
Sorry, meant to say as well that we will definitely want worked examples in the docs user-guide page.
…ta directly as argument in window_duration method
…/time-stream into feature/FPM-995-alt-data-window
richjam
left a comment
Sorry this keeps dragging on!!
I think it looks much better with the Polars vectorisation rather than the for loop.
Just a couple of suggestions for tidying up a couple things
def _apply_max_threshold(
    self,
    max_threshold: int,
could get rid of the max_threshold parameter, just use self.max_threshold within this method
(the only thing calling this method sends self.max_threshold)
windowed_df = df.drop(gap_id_column_name).join_where(
    gap_bounds,
    pl.col(time_column_name) >= pl.col("__GAP_START__") - window_duration,
    pl.col(time_column_name) <= pl.col("__GAP_END__") + window_duration,
)
If I'm reading the logic right, then this join_where includes all the null gap rows, which are immediately discarded in the _build_windowed_data method.
Can you do the filter nulls before the join_where for a bit of efficiency?
Could probably just get rid of the _filter_nulls method and inline that before the join_where:
windowed_df = df.drop(gap_id_column_name).filter(
    pl.col(infill_column).is_not_null() & pl.col(alt_data_column_name).is_not_null()
).join_where(gap_bounds, ...)
# Filter out all null values from both the original and alternative dataset.
windowed_df = self._filter_nulls(df, infill_column, alt_data_column_name)
As above comment, I think this can move before the join_where, so remove from this method.
def _filter_nulls(
    self,
    df: pl.DataFrame,
    infill_column: str,
    alt_data_column_name: str,
) -> pl.DataFrame:
    """Remove rows where either the infill or alternative data column is null.

    Args:
        df: Input DataFrame.
        infill_column: Name of the column to be infilled.
        alt_data_column_name: Name of the alternative data column.

    Returns:
        DataFrame with rows containing nulls in either value column removed.
    """
    return df.filter(pl.col(infill_column).is_not_null() & pl.col(alt_data_column_name).is_not_null())
As above comment, this could just be inlined - no need to be a static method really
FPM-995
Overview
AltDataDynamic infills missing values using an alternative data source and a dynamic correction factor derived from surrounding data. For each contiguous gap in the original dataset, a time window is defined around the gap. A correction factor is computed as the ratio of the sum of the original data to the sum of the alternative data within this window. The alternative data corresponding to the missing interval is scaled by the correction factor to produce the infilled values.

window_size:
AltDataDynamic is instantiated by specifying a window_size, which must be an ISO string, a TimeStream.Period type, or a timedelta. The method _window_duration converts the window_size into a timedelta and validates that the window is at least as long as the periodicity and is in the correct format.

Thresholds:
- If a min_threshold is provided and there is not enough data in the windows around the missing data interval to meet this threshold, that gap is not infilled.
- If a max_threshold is provided and is less than the total number of datapoints available across the windows surrounding the gap, then the datapoints closest to the gap are used until the max_threshold is met. Unless the optional parameter window_side is set to left or right, the same number of datapoints is used either side of the gap where possible.

Raising errors:
- timedeltas only work for days, hours, seconds or smaller. If a window_size is given in months or years, an error is raised.
- If the window_size corresponds to a timedelta less than the periodicity of the dataset, an error is raised.
- If the min_threshold is too large to be met in the window_size specified, an error is raised.
- If min_threshold > max_threshold or max_threshold = 0, an error is raised.

Specifying the window side:
By default, 'windows' of data are created either side of the missing data interval. The user can also specify that only data to the left or right of the gap should be used, by initialising AltDataDynamic with window_side="left" or "right". If window_side="both", the default behaviour applies and a window on each side of the gap is used.

Main changes
Updates to infill.py:
- AltDataDynamic class, with _fill, _window_duration, _build_windowed_df and _build_correction_factors methods. There are some additional helper functions.
Updates to test_infill.py:
- AltDataDynamic class.
Update to api/infilling.rst and user_guide/infilling.rst:
Update to examples_infilling.py:
- AltDataDynamic
Questions on styling:
- Should _build_correction_factors be skipped over if _build_windowed_df returns None? Currently _build_correction_factors handles this case internally.
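As a worked illustration of the Overview and Thresholds sections, here is a minimal pure-Python sketch of the correction-factor arithmetic for a single gap. It is not the Polars implementation in this PR: the function name dynamic_correction_factor, its arguments and the sample data are all hypothetical, and the tie-breaking when max_threshold is odd (left side first) is an assumption rather than the documented behaviour.

```python
from datetime import datetime, timedelta


def dynamic_correction_factor(times, original, alternative, gap_start, gap_end,
                              window, min_threshold=None, max_threshold=None):
    """Return sum(original) / sum(alternative) over the windows either side of
    a gap, or None if the gap should not be infilled. Hypothetical helper."""
    candidates = []
    for t, orig, alt in zip(times, original, alternative):
        if orig is None or alt is None:
            continue  # only rows with both values contribute to the ratio
        if gap_start - window <= t < gap_start:
            candidates.append((gap_start - t, orig, alt))  # left window
        elif gap_end < t <= gap_end + window:
            candidates.append((t - gap_end, orig, alt))  # right window

    # Closest datapoints first. The sort is stable, so equally distant points
    # keep their time order (assumption: left side wins a tie).
    candidates.sort(key=lambda c: c[0])
    if max_threshold is not None:
        candidates = candidates[:max_threshold]
    if min_threshold is not None and len(candidates) < min_threshold:
        return None  # not enough data around the gap

    alt_sum = sum(alt for _, _, alt in candidates)
    if not candidates or alt_sum == 0:
        return None
    return sum(orig for _, orig, _ in candidates) / alt_sum


# Hourly data with a two-hour gap in the original series (made-up values).
start = datetime(2024, 1, 1)
times = [start + timedelta(hours=h) for h in range(8)]
original = [11.0, 12.0, 10.0, None, None, 9.0, 9.0, 8.0]
alternative = [5.5, 6.0, 5.0, 5.0, 4.0, 4.5, 4.5, 4.0]
gap_start, gap_end = times[3], times[4]

factor = dynamic_correction_factor(
    times, original, alternative, gap_start, gap_end, window=timedelta(hours=2)
)
# Scale the alternative data inside the gap by the correction factor.
infilled = [alt * factor for t, alt in zip(times, alternative)
            if gap_start <= t <= gap_end]
```

With a two-hour window, the four points either side of the gap give 40.0 / 20.0, so the factor is 2.0 and the gap is infilled with 10.0 and 8.0. Passing min_threshold=5 returns None for the same data, which corresponds to the "gap is not infilled" case.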