How to compute statistics on dimensions (mean/var/std)#
Awkward Array provides several functions for statistical analysis that operate on ragged arrays. These are dimensional reducers, like ak.sum(), ak.min(), ak.any(), and ak.all() in the previous section, but they compute quantities such as the mean, variance, standard deviation, and higher moments. There are also functions for correlation and linear regression.
import awkward as ak
import numpy as np
Basic statistical functions#
Mean, variance, and standard deviation#
To compute the mean, variance, and standard deviation of an array, use ak.mean(), ak.var(), and ak.std(). Unlike the NumPy functions with the same names, these functions apply to arrays with variable-length dimensions and missing values (but not heterogeneous dimensionality or records; see the last section of reducing).
array = ak.Array([[0, 1.1, 2.2], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]])
ak.mean(array, axis=-1)
[1.1, 3.85, 5.5, 8.25] ----------------- backend: cpu nbytes: 32 B type: 4 * float64
ak.var(array, axis=-1)
[0.807, 0.302, 0, 1.51] ----------------- backend: cpu nbytes: 32 B type: 4 * float64
ak.std(array, axis=-1)
[0.898, 0.55, 0, 1.23] ----------------- backend: cpu nbytes: 32 B type: 4 * float64
These functions also have counterparts that ignore nan values: ak.nanmean(), ak.nanvar(), and ak.nanstd().
array_with_nan = ak.Array([[0, 1.1, np.nan], [3.3, 4.4], [np.nan], [6.6, np.nan, 8.8, 9.9]])
ak.nanmean(array_with_nan, axis=-1)
[0.55, 3.85, None, 8.43] ------------------ backend: cpu nbytes: 56 B type: 4 * ?float64
ak.nanvar(array_with_nan, axis=-1)
[0.303, 0.302, None, 1.88] ------------------ backend: cpu nbytes: 56 B type: 4 * ?float64
ak.nanstd(array_with_nan, axis=-1)
[0.55, 0.55, None, 1.37] ------------------ backend: cpu nbytes: 56 B type: 4 * ?float64
Note that floating-point nan is different from a missing value (None). Unlike nan, which exists only in floating-point data, missing values can appear in integer arrays, and whole lists can be missing as well. In both families of functions, a missing value is ignored if it lies in the dimension being reduced and otherwise passes through to the output, just as the nan-ignoring functions ignore nan.
array_with_None = ak.Array([[0, 1.1, 2.2], None, [None, 4.4], [5.5], [6.6, np.nan, 8.8, 9.9]])
ak.mean(array_with_None, axis=-1)
[1.1, None, 4.4, 5.5, nan] ------------------ backend: cpu nbytes: 72 B type: 5 * ?float64
ak.nanmean(array_with_None, axis=-1)
[1.1, None, 4.4, 5.5, 8.43] ------------------ backend: cpu nbytes: 72 B type: 5 * ?float64
Moments#
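For higher moments, use ak.moment(). It computes raw moments, that is, the mean of each value raised to the nth power, so the result is neither centered nor standardized. For example, the third raw moment of each list (which is not the same as skewness) is computed as follows: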
ak.moment(array, 3, axis=-1)
[3.99, 60.6, 166, 599] ----------------- backend: cpu nbytes: 32 B type: 4 * float64
Correlation and covariance#
For correlation and covariance between two arrays, use ak.corr() and ak.covar().
array_x = ak.Array([[0, 1.1, 2.2], [3.3, 4.4], [5.5], [6.6, 7.7, 8.8, 9.9]])
array_y = ak.Array([[0, 1, 2], [3, 4], [5], [6, 7, 8, 9]])
ak.corr(array_x, array_y, axis=-1)
[1, 1, nan, 1] ----------------- backend: cpu nbytes: 32 B type: 4 * float64
ak.covar(array_x, array_y, axis=-1)
[0.733, 0.275, 0, 1.38] ----------------- backend: cpu nbytes: 32 B type: 4 * float64
Linear fits#
To perform linear fits, use ak.linear_fit(). Instead of reducing each list to a number, it reduces each list to a record that has intercept, slope, intercept_error, and slope_error fields. (These “errors” are uncertainty estimates of the intercept and slope parameters, assuming that the underlying generator of data is truly linear.)
ak.linear_fit(array_x, array_y, axis=-1)
[{intercept: 0, slope: 0.909, intercept_error: 0.913, slope_error: 0.643},
 {intercept: 0, slope: 0.909, intercept_error: 5, slope_error: 1.29},
 {intercept: nan, slope: nan, intercept_error: inf, slope_error: inf},
 {intercept: 0, slope: 0.909, intercept_error: 3.39, slope_error: 0.407}]
-------------------------------------------------------------------------
backend: cpu
nbytes: 128 B
type: 4 * LinearFit[
    intercept: float64,
    slope: float64,
    intercept_error: float64,
    slope_error: float64
]
Ordinary least squares linear fits can be computed by a formula, without approximation or iteration, so they can be thought of like computing the mean or other moments, but with greater fidelity to the data because a fit models a general correlation. For example, some statistical models achieve high granularity by segmenting a dataset in some meaningful way and then summarizing the data in each segment (such as a regression decision tree). Performing a linear fit on each segment fine-tunes the model more than just taking the average of the data in each segment.
Peak to peak#
The peak-to-peak function ak.ptp() can be used to find the range (maximum - minimum) of data along an axis. It’s more convenient than calling ak.min() and ak.max() separately.
ak.ptp(array, axis=-1)
[2.2, 1.1, 0, 3.3] ------------------ backend: cpu nbytes: 64 B type: 4 * ?float64
Softmax#
The softmax function is useful in machine learning, particularly in the context of logistic regression and neural networks. Awkward Array provides ak.softmax() to compute softmax values of an array.
Note that this function does not reduce a dimension; it computes one output value for each input value, but each output value is normalized by all the other values in the same list.
Also note that only axis=-1 (innermost lists) is supported by ak.softmax().
ak.softmax(array, axis=-1)
[[0.0768, 0.231, 0.693], [0.25, 0.75], [1], [0.0249, 0.0748, 0.225, 0.675]] -------------------------------- backend: cpu nbytes: 120 B type: 4 * var * float64
Example uses in data analysis#
Here is an example that normalizes an input array to have an overall mean of 0 and standard deviation of 1:
array = ak.Array([[1.1, 2.2, 3.3], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])
(array - ak.mean(array)) / ak.std(array)
[[-1.55, -1.16, -0.775], [-0.387, 0], [0.387, 0.775, 1.16, 1.55]] ---------------------------- backend: cpu nbytes: 104 B type: 3 * var * float64
And here’s another example that normalizes each list within the array so that each has a mean of 0 and a standard deviation of 1:
(array - ak.mean(array, axis=-1)) / ak.std(array, axis=-1)
[[-1.22, 4.94e-16, 1.22], [-1, 1], [-1.34, -0.447, 0.447, 1.34]] ------------------------------ backend: cpu nbytes: 104 B type: 3 * var * float64