
Time Series Performance Statistics

TL;DR

Use an effective-sample-size-corrected paired z-test to compare methods which have been evaluated on moderately sized time series data.

The neural radar development kit ships with a small, self-contained statistical module (nrdk.tss) and a CLI tool (tss) for analyzing time series performance metrics.

Time-Correlated Samples

Unlike scraped internet data, collected physical sensor data generally take the form of long time series, which have significant temporal correlations and cannot be viewed as independent samples1.

Finite Sample Size

Due to this temporal correlation, we find that in practice, datasets almost never have a large enough test split to be considered an "infinite sample size," even when their test set consists of thousands or tens of thousands of frames. This necessitates statistical testing to quantify the uncertainty in our evaluation.

Procedure

In practice

This methodology has so far been used by:

Assumptions

While our goal is really to estimate the effective sample size (ESS) of the underlying time series, we are not aware of any current methods which can do so for extremely high-dimensional spaces with low-dimensional structure2. As such, we apply a univariate analysis on model performance metrics, roughly assuming that metrics change if and only if the data changes. Equivalently, and more verbosely, we assume that:

  1. Changing metrics imply changing data,
  2. Constant metrics imply constant data, and
  3. The degree to which the first assumption is violated (i.e., different metrics result from random noise outside of the underlying data "signal") is roughly cancelled out by the degree to which the second is violated (i.e., the data changes, but the metrics are the same).
  1. Run a Paired Test: Paired tests on the difference between two models when applied to the same data control for "constant" sample variability which is unrelated to the performance of the underlying model. This allows for statistical tests on the relative performance of ablations with respect to their baselines.

    • Evaluate the baseline method, and each alternative method, on the same samples.
    • Take the difference in performance metrics between each alternative and the baseline.
    • Perform a statistical test on these differences, where the null hypothesis is that the alternative methods are equivalent in performance to the baseline, and the alternative hypothesis is that a given alternative is different (2-sided test) or better (1-sided test).
  2. Calculate the Effective Sample Size: To estimate the effective sample size given these assumptions, we use an autocorrelation-based metric:

    N_eff = N / (1 + 2 * (rho_1 + rho_2 + ...))
    
    where rho_t is the t-lag autocorrelation. The sum of autocorrelations is estimated empirically from the performance metrics, summing lags up to N / 2 or the first negative autocorrelation, whichever comes first (a sketch of the full procedure appears after this list).

    For details about how we calculate this, see effective_sample_size.

    Why cut off when rho_t is negative?

    Mathematically, it is possible for the autocorrelation rho_t to be negative. This is known as antithetic sampling: drawing samples which are inversely correlated with previous samples in order to reduce the variance of the overall estimate. However, since we use relative performance as a proxy for the data, and assume that this procedure is only used to correct for low-level temporal correlations, such negative autocorrelations are assumed to be spurious.

  3. Calculate the Standard Error: Using the effective sample size, we can calculate the standard error

    SE = std / sqrt(N_eff)
    
    and perform a one- or two-sided z-test (assuming that N_eff is relatively large).

    Correct for Multiple Inference

    In the case that multiple alternatives are compared against the baseline, it may be necessary to correct for multiple inference (i.e., the increased chance of obtaining a low p-value simply because many alternatives are evaluated at once). Since different methods which tackle the same problem are typically highly correlated, corrections which assume independent tests do not apply; we therefore use a Bonferroni correction, which remains valid under arbitrary dependence.
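
The following sketch ties these steps together on synthetic data. It is illustrative only: the simulated difference series and the helper effective_sample_size_sketch are made up for this example and are not the library's implementation (see effective_sample_size for that).

import numpy as np
from scipy import stats


def effective_sample_size_sketch(x, t_max=None):
    """N_eff = N / (1 + 2 * (rho_1 + rho_2 + ...)), summing lags up to N / 2
    or the first negative autocorrelation, whichever comes first."""
    x = x - x.mean()
    n = len(x)
    denom = float(np.dot(x, x))
    limit = n // 2 if t_max is None else min(t_max, n // 2)
    rho_sum = 0.0
    for t in range(1, limit):
        rho = float(np.dot(x[:-t], x[t:])) / denom
        if rho < 0:  # negative autocorrelations are assumed to be spurious
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)


# Step 1: paired differences (alternative minus baseline on the same samples).
# Here the differences are simulated as AR(1) noise plus a small improvement.
rng = np.random.default_rng(0)
n = 5000
eps = rng.normal(size=n)
ar = np.zeros(n)
for i in range(1, n):
    ar[i] = 0.95 * ar[i - 1] + eps[i]
diff = -0.01 + 0.02 * ar

# Step 2: effective sample size of the difference series.
n_eff = effective_sample_size_sketch(diff)

# Step 3: ESS-corrected standard error, z statistic, and two-sided p-value.
se = diff.std() / np.sqrt(n_eff)
z = diff.mean() / se
p = 2 * stats.norm.sf(abs(z))

# Bonferroni correction when k alternatives share the same baseline.
k = 3
print(f"N_eff={n_eff:.1f}  z={z:.2f}  p={p:.3g}  significant={p < 0.05 / k}")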

General Usage

In addition to the low-level API, we provide a CLI and a high-level API which can be used to index evaluation results, then load the evaluations and calculate statistics.

File Format

Time series evaluations are expected to be stored in .npz files, which contain multiple key-value pairs of metrics and/or timestamps.

  • All metrics passed through this library must be 1D arrays whose single axis is the time/sample index.
  • Timestamps can have additional axes (e.g., for models which operate on a sequence of data); in this case, the last timestamp3 is used.
  • The metrics and timestamps within each evaluation are expected to be synchronized.

Warning

We assume that data in different evaluation files of the same experiment (e.g., traces with different names - bike/ptbreeze.out.npz and bike/ptbreeze.back.npz) are not temporally correlated. Evaluations on data traces which are recorded back-to-back must therefore be combined into the same file!

Example
With Timestamps
metrics/loss: float32[439890]
metrics/map/bce: float32[439890]
metrics/map/depth: float32[439890]
timestamps/lidar: float64[439890]
timestamps/radar: float64[439890,8]
Without Timestamps
loss: float32[17606]
map_acc: float32[17606]
map_chamfer: float32[17606]
map_depth: float32[17606]
map_f1: float32[17606]
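
For illustration, one way such a file could be produced with plain numpy is sketched below; the array values, sizes, and output filename are hypothetical, and two back-to-back traces are concatenated into a single file as required by the warning above.

import numpy as np

# Hypothetical per-frame losses and lidar timestamps for two traces recorded
# back-to-back; they must be stored in the same evaluation file.
loss_out = np.random.rand(1200).astype(np.float32)
loss_back = np.random.rand(900).astype(np.float32)
t_out = np.arange(1200, dtype=np.float64) * 0.1
t_back = t_out[-1] + 0.1 + np.arange(900, dtype=np.float64) * 0.1

np.savez(
    "ptbreeze.npz",
    **{
        "metrics/loss": np.concatenate([loss_out, loss_back]),
        "timestamps/lidar": np.concatenate([t_out, t_back]),
    },
)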

Naming Convention

The file path to each evaluation, relative to some base path, should contain information about the experiment name and the sequence/trace name.

  • These should be extractable using a regex.
  • The regex should have two groups (experiment and trace), which are used to organize evaluations by experiment name and compare evaluations with the same trace name.
Example
With Experiment and Trace Name
pattern = r'^(?P<experiment>(.*))/eval/(?P<trace>(.*))\.npz$'
relpath = 'small/patch.rdae/eval/bike/ptbreeze.out.npz'
#          └──experiment──┘      └─────trace─────┘ 
With Experiment Name Only
pattern = r'^(?P<experiment>(.*))\.npz$'
relpath = 'reduced2x/t8_p100.npz'
#          └──experiment───┘      (trace=None)
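
As a quick sanity check, the first pattern can be tested with the standard library (purely illustrative):

import re

pattern = r'^(?P<experiment>(.*))/eval/(?P<trace>(.*))\.npz$'
match = re.match(pattern, 'small/patch.rdae/eval/bike/ptbreeze.out.npz')
assert match is not None
print(match.group("experiment"))  # small/patch.rdae
print(match.group("trace"))       # bike/ptbreeze.out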

CLI

Usage

uv run tss /path/to/results \
    --pattern "^(?P<experiment>(.*))/eval/bike/(?P<trace>(.*))\.npz$" \
    --experiments medium/p10 medium/p20 \
    --baseline medium/p10 \
    --key loss > results.csv

uv run tss /path/to/results --config config.yaml
config.yaml
pattern: ^(?P<experiment>(.*))/eval/bike/(?P<trace>(.*))\.npz$
experiments:
- medium/p10
- medium/p20
baseline: medium/p10
key: loss

Calculate statistics for time series metrics.

  • pipe tss ... > results.csv to save the results to a file.
  • use --config config.yaml to avoid having to specify all these arguments; any arguments which are explicitly provided will override the values in the config file.

Warning

path (and --follow_symlinks, if specified) are required to be passed via the command line, and cannot be specified via the config.

Parameters:

  • path (str, required): directory to find evaluations in.
  • pattern (str | None, default None): regex pattern to match the evaluation directories.
  • key (str | None, default None): name of the metric to load from the result files.
  • timestamps (str | None, default None): name of the timestamps to load from the result files.
  • experiments (list[str] | None, default None): explicit list of experiments to include in the results.
  • baseline (str | None, default None): baseline experiment for relative statistics.
  • follow_symlinks (bool, default False): whether to follow symbolic links. May lead to infinite recursion if True and the path contains self-referential links!
  • cut (float | None, default None): cut each time series when there is a gap in the timestamps larger than this value, if provided.
  • t_max (int | None, default None): maximum time delay to consider when computing effective sample size; if not specified, do not use any additional constraints.
  • config (str | None, default None): load all of these values from a yaml configuration file instead.

Source code in src/nrdk/tss/_cli.py
def _cli(
    path: str, /,
    pattern: str | None = None, key: str | None = None,
    timestamps: str | None = None, experiments: list[str] | None = None,
    baseline: str | None = None, follow_symlinks: bool = False,
    cut: float | None = None, t_max: int | None = None,
    config: str | None = None,
) -> int:
    """Calculate statistics for time series metrics.

    - pipe `tss ... > results.csv` to save the results to a file.
    - use `--config config.yaml` to avoid having to specify all these
        arguments; any arguments which are explicitly provided will override
        the values in the config file.

    !!! warning

        `path` (and `--follow_symlinks`, if specified) are required to be
        passed via the command line, and cannot be specified via the config.

    Args:
        path: directory to find evaluations in.
        pattern: regex pattern to match the evaluation directories.
        key: name of the metric to load from the result files.
        timestamps: name of the timestamps to load from the result files.
        experiments: explicit list of experiments to include in the results.
        baseline: baseline experiment for relative statistics.
        follow_symlinks: whether to follow symbolic links. May lead to infinite
            recursion if `True` and the `path` contains self-referential links!
        cut: cut each time series when there is a gap in the timestamps larger
            than this value if provided.
        t_max: maximum time delay to consider when computing effective sample
            size; if not specified, do not use any additional constraints.
        config: load all of these values from a yaml configuration file
            instead.
    """
    if config is not None:
        with open(config) as f:
            cfg = yaml.safe_load(f)
    else:
        cfg = {}

    def setdefault(value, param, default):
        if value is None:
            value = cfg.get(param, default)
        return value

    pattern = setdefault(pattern, "pattern", r"^(?P<experiment>(.*))\.npz$")
    key = setdefault(key, "key", "loss")
    timestamps = setdefault(timestamps, "timestamps", None)
    experiments = setdefault(experiments, "experiments", None)
    baseline = setdefault(baseline, "baseline", None)
    cut = setdefault(cut, "cut", None)
    t_max = setdefault(t_max, "t_max", None)

    index = api.index(
        path, pattern=pattern, follow_symlinks=follow_symlinks)  # type: ignore

    if len(index) == 0:
        print("No result files found!")
        print(
            "Hint: if `results` include symlinks (or is a symlink itself), "
            "try passing `--follow_symlinks`.")
        return -1

    df = api.dataframe_from_index(
        index, key=key, baseline=baseline,  # type: ignore
        experiments=experiments, cut=cut, t_max=t_max, timestamps=timestamps)

    buf = StringIO()
    df.to_csv(buf)
    print(buf.getvalue())
    return 0

High Level API

Index evaluations: using index, provide a base path where the evaluations are stored, and a regex pattern for finding evaluation files and extracting their experiment and trace names.

from nrdk import tss

path = "/shiraz/grt/results"  # path to where the evaluations are stored
pattern = r"^(?P<experiment>(.*))/eval/(?P<trace>(.*))\.npz$"

index = tss.index_results(path, pattern)

Tip

This can take quite a long time if you have many evaluation files and are loading from a network drive (~3 seconds for ~3000 evaluations in a directory with 20k total files on an SMB share). You may want to cache the index or save it to disk somewhere!
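
One possible way to do this is to pickle the index between runs (a sketch that assumes the index object is picklable and reuses path and pattern from above; the cache filename is arbitrary):

import pickle
from pathlib import Path

cache = Path("index.pkl")
if cache.exists():
    index = pickle.loads(cache.read_bytes())
else:
    index = tss.index_results(path, pattern)  # the indexing call from above
    cache.write_bytes(pickle.dumps(index))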

Compute Statistics: we provide an all-inclusive dataframe_from_index function which returns a dataframe containing summary statistics for the specified index, given a key of interest and a baseline method.

experiments = ["small/p10", "small/p20", "small/p50", "small/base"]
df = tss.dataframe_from_index(
    index, "loss", baseline="small/base", experiments=experiments)
df
output
            abs/mean   abs/std  abs/stderr   abs/n     abs/ess  rel/mean   rel/std  rel/stderr   rel/n      rel/ess   pct/mean  pct/stderr  p0.05
name                                                                                                                                             
small/base  0.125371  0.070062    0.002442  162931  823.034877  0.000000  0.000000         NaN  162931     0.000000   0.000000         NaN  False
small/p10   0.161236  0.088207    0.002991  162931  869.479694  0.035865  0.039024    0.000769  162931  2577.969172  28.607590    0.613055   True
small/p20   0.152850  0.097548    0.003209  162931  924.222609  0.027480  0.045835    0.000945  162931  2353.289155  21.918760    0.753636   True
small/p50   0.134158  0.076811    0.002594  162931  877.094752  0.008787  0.027099    0.000406  162931  4453.599831   7.009018    0.323892   True
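
For example, to keep only the alternatives whose difference from the baseline is significant at the 5% level (using the p0.05 column shown above):

# Keep only the rows whose p0.05 flag is True (NaN and False rows are dropped).
significant = df[df["p0.05"] == True]
print(significant.index.tolist())  # ['small/p10', 'small/p20', 'small/p50']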

  1. Intuitively, sampling the same signal (e.g., radar-lidar-camera tuples) with a greater frequency yields diminishing information: sampling an infinitesimally short video at an infinite frame rate clearly does not yield an infinite sample size. 

  2. This concept is best explained via the "natural image manifold": images have a lot of dimensions (HxWxC), but take an np.random.random((h, w, c)) image and you'll almost surely not end up with a "natural" image that you might actually encounter. The space of all such natural images can be thought of as a low-dimensional manifold, embedded in the high-dimensional image space.

  3. If more than one extra axis is provided, the last axis when the array is flattened in C-order is used.