Time Series Performance Statistics¶
TL;DR
Use an effective-sample-size-corrected paired z-test to compare methods which have been evaluated on moderately sized time series data.
The neural radar development kit ships with a small, self-contained statistical module (nrdk.tss) and CLI tool (tss) for analyzing time series performance metrics.
Time-Correlated Samples
Unlike scraped internet data, collected physical sensor data generally take the form of long time series, which have significant temporal correlations and cannot be viewed as independent samples1.
Finite Sample Size
Due to this temporal correlation, we find that in practice, datasets almost never have a large enough test split to be considered an "infinite sample size," even when their test set consists of thousands or tens of thousands of frames. This necessitates statistical testing to quantify the uncertainty in our evaluation.
Procedure¶
In practice
This methodology has so far been used by:
Assumptions
While our goal is really to estimate the effective sample size (ESS) of the underlying time series, we are not aware of any current methods which can do so for extremely high-dimensional spaces with low-dimensional structure2. As such, we apply a univariate analysis to model performance metrics, roughly assuming that metrics change if and only if the data changes. Equivalently, and more verbosely, we assume that:
- Changing metrics imply changing data,
- Constant metrics imply constant data, and
- The degree to which the first assumption is violated (i.e., different metrics result from random noise outside of the underlying data "signal") is roughly cancelled out by the degree to which the second is violated (i.e., the data changes, but the metrics are the same).
- Run a Paired Test: Paired tests on the difference between two models when applied to the same data control for "constant" sample variability which is unrelated to the performance of the underlying model. This allows for statistical tests on the relative performance of ablations with respect to their baselines.
    - Evaluate the baseline method, and each alternative method, on the same samples.
    - Take the difference in performance metrics between each alternative and the baseline.
    - Perform a statistical test on these differences, where the null hypothesis is that the alternative methods are equivalent in performance to the baseline, and the alternative hypothesis is that a given alternative is different (two-sided test) or better (one-sided test).
- Calculate the Effective Sample Size: To estimate the effective sample size given these assumptions, we use an autocorrelation-based estimate:

    $$N_\text{eff} = \frac{N}{1 + 2 \sum_{t=1}^{T} \rho_t},$$

    where $\rho_t$ is the $t$-lag autocorrelation. The sum of autocorrelations is empirically estimated from the performance metrics, and is computed up to $T = N / 2$ or the first negative autocorrelation, whichever comes first. For details about how we calculate this, see effective_sample_size.

    Why cut off when $\rho_t$ is negative?

    Mathematically, it is possible for the autocorrelation $\rho_t$ to be negative. This is called antithetic sampling: samples are inversely correlated with previous samples in order to reduce the variance of the overall estimate. However, since we use relative performance as a proxy for the data, and assume that this procedure only corrects for low-level temporal correlations, such negative autocorrelations are assumed to be spurious.
- Calculate the Standard Error: Using the effective sample size, we can calculate the standard error

    $$\text{SE} = \frac{\hat{\sigma}}{\sqrt{N_\text{eff}}}$$

    and perform a one- or two-sided z-test (assuming that $N_\text{eff}$ is relatively large); a sketch of the full computation follows this list.

    Correct for Multiple Inference

    In the case that multiple alternatives are compared against the baseline, it may be necessary to correct for multiple inference (i.e., the increased chance of obtaining a low p-value simply because many alternatives are evaluated at once). Since different methods which tackle the same problem are highly correlated, corrections that assume independent tests do not apply; use a Bonferroni correction, which remains valid under arbitrary dependence.
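The sketch below illustrates the full procedure with plain numpy. It is only illustrative: the effective_sample_size helper and the data shown here are stand-ins, not the nrdk.tss implementation.

```python
import numpy as np
from scipy import stats

def effective_sample_size(x: np.ndarray) -> float:
    """Illustrative ESS estimate: N / (1 + 2 * sum of autocorrelations).

    The sum runs up to N / 2 or the first negative autocorrelation,
    whichever comes first (negative lags are treated as spurious).
    """
    x = x - x.mean()
    n = len(x)
    denom = n * x.var()  # equals the sum of squared deviations
    rho_sum = 0.0
    for t in range(1, n // 2):
        rho = np.dot(x[:-t], x[t:]) / denom  # lag-t autocorrelation
        if rho < 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

# Hypothetical per-frame metrics: the baseline and the alternative are
# evaluated on the same frames, in the same order.
rng = np.random.default_rng(0)
baseline = rng.random(10_000)
alternative = baseline + rng.normal(0.01, 0.05, size=10_000)

diff = alternative - baseline                # paired differences
n_eff = effective_sample_size(diff)          # autocorrelation-corrected sample size
stderr = diff.std(ddof=1) / np.sqrt(n_eff)   # ESS-corrected standard error
z = diff.mean() / stderr                     # z statistic under H0: no difference
p_two_sided = 2 * stats.norm.sf(abs(z))      # two-sided p-value
```

Note that the ESS is computed on the paired differences themselves, consistent with the assumption above that the performance metrics act as a proxy for the underlying data.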
General Usage¶
In addition to the low-level API, we provide a CLI and a high-level API which can be used to index evaluation results, then load the evaluations and calculate statistics.
File Format¶
Time series evaluations are expected to be stored in .npz files, which contain multiple key-value pairs of metrics and/or timestamps.
- All metrics which might pass through this library must be 1D arrays with a time/sample index axis.
- Timestamps can have additional axes (e.g., for models which operate on a sequence of data); in this case, the last timestamp3 is used.
- The metrics and timestamps within each evaluation are expected to be synchronized.
Warning
We assume that data in different evaluations of the same experiment (e.g., traces with different names such as bike/ptbreeze.out.npz and bike/ptbreeze.back.npz) are not temporally correlated. Evaluations on data traces which are recorded back-to-back must be combined into the same file!
Example
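For illustration only (the metric names, trace name, and array contents below are placeholders, not requirements of the library), an evaluation file could be written as follows; note that each metric and the timestamps share the same sample axis:

```python
import numpy as np

# One combined trace: traces recorded back-to-back (e.g., *.out and *.back)
# should be concatenated before saving, per the warning above.
timestamps = np.arange(0.0, 120.0, 0.05)    # one timestamp per frame, in seconds
loss = np.random.rand(timestamps.shape[0])  # per-frame loss (1D, same length)
iou = np.random.rand(timestamps.shape[0])   # per-frame IoU (1D, same length)

# Keys are selected at load time via the `key` and `timestamps` arguments.
np.savez("ptbreeze.npz", loss=loss, iou=iou, timestamps=timestamps)
```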
Naming Convention¶
The file path to each evaluation, relative to some base path, should contain information about the experiment name and the sequence/trace name.
- These should be extractable using a regex.
- The regex should have two groups (experiment and trace), which are used to organize evaluations by experiment name and compare evaluations with the same trace name.
Example
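As an illustration, the pattern used in the High Level API example below yields the following groups for a hypothetical evaluation path:

```python
import re

# Named groups extract the experiment and trace names from each relative path.
pattern = r"^(?P<experiment>(.*))/eval/(?P<trace>(.*))\.npz$"

match = re.match(pattern, "small/base/eval/bike/ptbreeze.npz")  # hypothetical path
print(match.group("experiment"))  # -> small/base
print(match.group("trace"))       # -> bike/ptbreeze
```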
CLI¶
Usage
Calculate statistics for time series metrics.
- pipe tss ... > results.csv to save the results to a file.
- use --config config.yaml to avoid having to specify all these arguments; any arguments which are explicitly provided will override the values in the config file.
Warning
path (and --follow_symlinks, if specified) are required to be passed via the command line, and cannot be specified via the config.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | directory to find evaluations in. | required |
| pattern | str \| None | regex pattern to match the evaluation directories. | None |
| key | str \| None | name of the metric to load from the result files. | None |
| timestamps | str \| None | name of the timestamps to load from the result files. | None |
| experiments | list[str] \| None | explicit list of experiments to include in the results. | None |
| baseline | str \| None | baseline experiment for relative statistics. | None |
| follow_symlinks | bool | whether to follow symbolic links; may lead to infinite recursion if the traversed links form a cycle. | False |
| cut | float \| None | cut each time series when there is a gap in the timestamps larger than this value, if provided. | None |
| t_max | int \| None | maximum time delay to consider when computing effective sample size; if not specified, no additional constraint is used. | None |
| config | str \| None | load all of these values from a yaml configuration file instead. | None |
Source code in src/nrdk/tss/_cli.py
High Level API¶
Index evaluations: using index, provide a base path where the evaluations are stored, and a regex pattern for finding evaluation files and extracting their experiment and trace names.
import tss
path = "/shiraz/grt/results" # path to where the evaluations are stored
pattern = r"^(?P<experiment>(.*))/eval/(?P<trace>(.*))\.npz$"
index = tss.index_results(path, pattern)
Tip
This can take quite a long time if you have many evaluation files and are loading from a network drive (~3 seconds for ~3000 evaluations in a directory with 20k total files on an SMB share). You may want to cache the index or save it to disk somewhere!
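One simple way to cache it, assuming the returned index is picklable (this caching is not part of the tss API), is:

```python
import os
import pickle

CACHE = "index.pkl"  # arbitrary local cache file

if os.path.exists(CACHE):
    with open(CACHE, "rb") as f:
        index = pickle.load(f)
else:
    index = tss.index_results(path, pattern)
    with open(CACHE, "wb") as f:
        pickle.dump(index, f)
```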
Compute Statistics: we provide an all-inclusive dataframe_from_index function which returns a dataframe containing summary statistics for the specified index, given a key of interest and baseline method.
experiments = ["small/p10", "small/p20", "small/p50", "small/base"]
df = tss.dataframe_from_index(
index, "loss", baseline="small/base", experiments=experiments)
df
abs/mean abs/std abs/stderr abs/n abs/ess rel/mean rel/std rel/stderr rel/n rel/ess pct/mean pct/stderr p0.05
name
small/base 0.125371 0.070062 0.002442 162931 823.034877 0.000000 0.000000 NaN 162931 0.000000 0.000000 NaN False
small/p10 0.161236 0.088207 0.002991 162931 869.479694 0.035865 0.039024 0.000769 162931 2577.969172 28.607590 0.613055 True
small/p20 0.152850 0.097548 0.003209 162931 924.222609 0.027480 0.045835 0.000945 162931 2353.289155 21.918760 0.753636 True
small/p50 0.134158 0.076811 0.002594 162931 877.094752 0.008787 0.027099 0.000406 162931 4453.599831 7.009018 0.323892 True
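The result is an ordinary pandas dataframe, so it can be sliced as usual; for example, assuming the p0.05 column flags experiments whose difference from the baseline is significant at the 5% level:

```python
# Keep only experiments that differ significantly from the baseline,
# sorted by their mean paired difference in the chosen metric.
significant = df[df["p0.05"]].sort_values("rel/mean")
print(significant[["rel/mean", "rel/stderr", "pct/mean"]])
```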
1. Intuitively, sampling the same signal (e.g., radar-lidar-camera tuples) at a greater frequency yields diminishing information: sampling an infinitesimally short video at an infinite frame rate clearly does not yield an infinite sample size.
2. This concept is best explained via the "natural image manifold": images have a lot of dimensions (HxWxC), but take a np.random.random((h, w, c)) image, and you will almost surely not end up with a "natural" image that you might actually encounter. The space of all such natural images can be thought of as a low-dimensional manifold, embedded in the high-dimensional image space.
3. If more than one extra axis is provided, the last element when the extra axes are flattened in C-order is used.