Time Series Performance Statistics¶
TL;DR
Use an effective-sample-size-corrected paired z-test to compare methods which have been evaluated on moderately sized time series data.
The neural radar development kit ships with a small, self-contained statistical module (nrdk.tss) and CLI tool (tss) for analyzing time series performance metrics.
Time-Correlated Samples
Unlike scraped internet data, collected physical sensor data generally take the form of long time series, which have significant temporal correlations and cannot be viewed as independent samples1.
Finite Sample Size
Due to this temporal correlation, we find that in practice, datasets almost never have a large enough test split to be considered an "infinite sample size," even when their test set consists of thousands or tens of thousands of frames. This necessitates statistical testing to quantify the uncertainty in our evaluation.
Procedure¶
In practice
This methodology has so far been used by:
Assumptions
While our goal is really to estimate the effective sample size (ESS) of the underlying time series, we are not aware of any current methods which can do so for extremely high-dimensional spaces with low-dimensional structure2. As such, we apply a univariate analysis to model performance metrics, roughly assuming that metrics change if and only if the data changes. Equivalently, and more verbosely, we assume that:
- Changing metrics imply changing data,
- Constant metrics imply constant data, and
- The degree to which the first assumption is violated (i.e., different metrics result from random noise outside of the underlying data "signal") is roughly cancelled out by the degree to which the second is violated (i.e., the data changes, but the metrics are the same).
- Run a Paired Test: Paired tests on the difference between two models when applied to the same data control for "constant" sample variability which is unrelated to the performance of the underlying model. This allows for statistical tests on the relative performance of ablations with respect to their baselines.
    - Evaluate the baseline method, and each alternative method, on the same samples.
    - Take the difference in performance metrics between each alternative and the baseline.
    - Perform a statistical test on these differences, where the null hypothesis is that the alternative methods are equivalent in performance to the baseline, and the alternative hypothesis is that a given alternative is different (two-sided test) or better (one-sided test).
- Calculate the Effective Sample Size: To estimate the effective sample size given these assumptions, we use an autocorrelation-based estimate:

    $$N_\text{eff} = \frac{N}{1 + 2 \sum_{t=1}^{T} \rho_t},$$

    where $\rho_t$ is the $t$-lag autocorrelation. The sum of autocorrelations is empirically estimated from the performance metrics, and is computed up to $T = N / 2$ or the first negative autocorrelation, whichever comes first. For details about how we calculate this, see effective_sample_size.

    Why cut off when $\rho_t$ is negative?

    Mathematically, it is possible for the autocorrelation $\rho_t$ to be negative. This is called antithetic sampling: samples are inversely correlated with previous samples in order to reduce the variance of the overall estimate. However, since we use relative performance as a proxy for the data, and assume that this procedure only corrects for low-level temporal correlations, such negative autocorrelations are assumed to be spurious.
- Calculate the Standard Error: Using the effective sample size, we can calculate the standard error

    $$\text{SE} = \frac{\hat{\sigma}}{\sqrt{N_\text{eff}}}$$

    and perform a one- or two-sided z-test (assuming that $N_\text{eff}$ is relatively large); a sketch of the full computation follows this list.

    Correct for Multiple Inference

    In the case that multiple alternatives are compared against the baseline, it may be necessary to correct for multiple inference (i.e., the increased chance of obtaining a low p-value simply because many alternatives are evaluated at once). Since different methods which tackle the same problem are highly correlated, corrections that assume independent tests do not apply; use a Bonferroni correction, which remains valid under arbitrary dependence.
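The sketch below illustrates the full procedure with plain numpy. It is only illustrative: the effective_sample_size helper and the data shown here are stand-ins, not the nrdk.tss implementation.

```python
import numpy as np
from scipy import stats

def effective_sample_size(x: np.ndarray) -> float:
    """Illustrative ESS estimate: N / (1 + 2 * sum of autocorrelations).

    The sum runs up to N / 2 or the first negative autocorrelation,
    whichever comes first (negative lags are treated as spurious).
    """
    x = x - x.mean()
    n = len(x)
    denom = n * x.var()  # equals the sum of squared deviations
    rho_sum = 0.0
    for t in range(1, n // 2):
        rho = np.dot(x[:-t], x[t:]) / denom  # lag-t autocorrelation
        if rho < 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

# Hypothetical per-frame metrics: the baseline and the alternative are
# evaluated on the same frames, in the same order.
rng = np.random.default_rng(0)
baseline = rng.random(10_000)
alternative = baseline + rng.normal(0.01, 0.05, size=10_000)

diff = alternative - baseline                # paired differences
n_eff = effective_sample_size(diff)          # autocorrelation-corrected sample size
stderr = diff.std(ddof=1) / np.sqrt(n_eff)   # ESS-corrected standard error
z = diff.mean() / stderr                     # z statistic under H0: no difference
p_two_sided = 2 * stats.norm.sf(abs(z))      # two-sided p-value
```

Note that the ESS is computed on the paired differences themselves, consistent with the assumption above that the performance metrics act as a proxy for the underlying data.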
General Usage¶
In addition to the low-level API, we provide a CLI and a high-level API which can be used to index evaluation results, then load the evaluations and calculate statistics.
File Format¶
Time series evaluations are expected to be stored in .npz files, which contain multiple key-value pairs of metrics and/or timestamps.
- All metrics which might pass through this library must be 1D arrays with a time/sample index axis.
- Timestamps can have additional axes (e.g., for models which operate on a sequence of data); in this case, the last timestamp3 is used.
- The metrics and timestamps within each evaluation are expected to be synchronized.
Warning
We assume that data in different evaluations of the same experiment (e.g., traces with different names such as bike/ptbreeze.out.npz and bike/ptbreeze.back.npz) are not temporally correlated. Evaluations on data traces which are recorded back-to-back must be combined into the same file!
Example
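For illustration only (the metric names, trace name, and array contents below are placeholders, not requirements of the library), an evaluation file could be written as follows; note that each metric and the timestamps share the same sample axis:

```python
import numpy as np

# One combined trace: traces recorded back-to-back (e.g., *.out and *.back)
# should be concatenated before saving, per the warning above.
timestamps = np.arange(0.0, 120.0, 0.05)    # one timestamp per frame, in seconds
loss = np.random.rand(timestamps.shape[0])  # per-frame loss (1D, same length)
iou = np.random.rand(timestamps.shape[0])   # per-frame IoU (1D, same length)

# Keys are selected at load time via the `key` and `timestamps` arguments.
np.savez("ptbreeze.npz", loss=loss, iou=iou, timestamps=timestamps)
```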
Naming Convention¶
The file path to each evaluation, relative to some base path, should contain information about the experiment name and the sequence/trace name.
- These should be extractable using a regex.
- The regex should have two groups (experiment and trace), which are used to organize evaluations by experiment name and compare evaluations with the same trace name.
Example
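As an illustration, the pattern used in the High Level API example below yields the following groups for a hypothetical evaluation path:

```python
import re

# Named groups extract the experiment and trace names from each relative path.
pattern = r"^(?P<experiment>(.*))/eval/(?P<trace>(.*))\.npz$"

match = re.match(pattern, "small/base/eval/bike/ptbreeze.npz")  # hypothetical path
print(match.group("experiment"))  # -> small/base
print(match.group("trace"))       # -> bike/ptbreeze
```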
CLI¶
Usage
Calculate statistics for time series metrics.
- pipe tss ... > results.csv to save the results to a file.
- use --config config.yaml to avoid having to specify all these arguments; any arguments which are explicitly provided will override the values in the config file.
Warning
path (and --follow_symlinks, if specified) are required to be passed via the command line, and cannot be specified via the config.
Parameters:
| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | str | directory to find evaluations in. | required |
| pattern | str \| None | regex pattern to match the evaluation directories. | None |
| key | str \| None | name of the metric to load from the result files. | None |
| timestamps | str \| None | name of the timestamps to load from the result files. | None |
| experiments | list[str] \| None | explicit list of experiments to include in the results. | None |
| baseline | str \| None | baseline experiment for relative statistics. | None |
| follow_symlinks | bool | whether to follow symbolic links; may lead to infinite recursion if the traversed links form a cycle. | False |
| cut | float \| None | cut each time series when there is a gap in the timestamps larger than this value, if provided. | None |
| t_max | int \| None | maximum time delay to consider when computing effective sample size; if not specified, no additional constraint is used. | None |
| config | str \| None | load all of these values from a yaml configuration file instead. | None |
Source code in src/nrdk/tss/_cli.py
High Level API¶
Index evaluations: using index, provide a base path where the evaluations are stored, and a regex pattern for finding evaluation files and extracting their experiment and trace names.
import tss
path = "/shiraz/grt/results" # path to where the evaluations are stored
pattern = r"^(?P<experiment>(.*))/eval/(?P<trace>(.*))\.npz$"
index = tss.index_results(path, pattern)
Tip
This can take quite a long time if you have many evaluation files and are loading from a network drive (~3 seconds for ~3000 evaluations in a directory with 20k total files on an SMB share). You may want to cache the index or save it to disk somewhere!
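One simple way to cache it, assuming the returned index is picklable (this caching is not part of the tss API), is:

```python
import os
import pickle

CACHE = "index.pkl"  # arbitrary local cache file

if os.path.exists(CACHE):
    with open(CACHE, "rb") as f:
        index = pickle.load(f)
else:
    index = tss.index_results(path, pattern)
    with open(CACHE, "wb") as f:
        pickle.dump(index, f)
```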
Compute Statistics: we provide an all-inclusive dataframe_from_index function which returns a dataframe containing summary statistics for the specified index, given a key of interest and baseline method.
experiments = ["small/p10", "small/p20", "small/p50", "small/base"]
df = tss.dataframe_from_index(
index, "loss", baseline="small/base", experiments=experiments)
df
abs/mean abs/std abs/stderr abs/n abs/ess rel/mean rel/std rel/stderr rel/n rel/ess pct/mean pct/stderr p0.05
name
small/base 0.125371 0.070062 0.002442 162931 823.034877 0.000000 0.000000 NaN 162931 0.000000 0.000000 NaN False
small/p10 0.161236 0.088207 0.002991 162931 869.479694 0.035865 0.039024 0.000769 162931 2577.969172 28.607590 0.613055 True
small/p20 0.152850 0.097548 0.003209 162931 924.222609 0.027480 0.045835 0.000945 162931 2353.289155 21.918760 0.753636 True
small/p50 0.134158 0.076811 0.002594 162931 877.094752 0.008787 0.027099 0.000406 162931 4453.599831 7.009018 0.323892 True
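The result is an ordinary pandas dataframe, so it can be sliced as usual; for example, assuming the p0.05 column flags experiments whose difference from the baseline is significant at the 5% level:

```python
# Keep only experiments that differ significantly from the baseline,
# sorted by their mean paired difference in the chosen metric.
significant = df[df["p0.05"]].sort_values("rel/mean")
print(significant[["rel/mean", "rel/stderr", "pct/mean"]])
```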
1. Intuitively, sampling the same signal (e.g., radar-lidar-camera tuples) at a greater frequency yields diminishing information: sampling an infinitesimally short video at an infinite frame rate clearly does not yield an infinite sample size.
2. This concept is best explained via the "natural image manifold": images have a lot of dimensions (HxWxC), but take a np.random.random((h, w, c)) image, and you will almost surely not end up with a "natural" image that you might actually encounter. The space of all such natural images can be thought of as a low-dimensional manifold, embedded in the high-dimensional image space.
3. If more than one extra axis is provided, the last element when the extra axes are flattened in C-order is used.