ak.to_categorical#

Defined in awkward.operations.ak_to_categorical on line 8.

ak.to_categorical(array)#
Parameters

Creates a categorical dataset, which has the following properties:

This is equivalent to R’s “factor”, Pandas’s “categorical”, and Arrow/Parquet’s “dictionary encoding.” It differs from generic uses of ak.contents.IndexedArray and ak.contents.IndexedOptionArray in Awkward Arrays by the guarantee of no duplicate categories and the "categorical" parameter.

>>> array = ak.Array([["one", "two", "three"], [], ["three", "two"]])
>>> categorical = ak.to_categorical(array)
>>> categorical
<Array [['one', 'two', 'three'], ..., [...]] type='3 * var * categorical[ty...'>
>>> categorical.type.show()
3 * var * categorical[type=string]
>>> categorical.to_list() == array.to_list()
True
>>> ak.categories(categorical)
<Array ['one', 'two', 'three'] type='3 * string'>
>>> ak.is_categorical(categorical)
True
>>> ak.from_categorical(categorical)
<Array [['one', 'two', 'three'], ..., ['three', ...]] type='3 * var * string'>

This function descends through nested lists, but not into the fields of records, so records can be categories. To make categorical record fields, split up the record, apply this function to each desired field, and ak.zip the results together.

>>> records = ak.Array([
...     {"x": 1.1, "y": "one"},
...     {"x": 2.2, "y": "two"},
...     {"x": 3.3, "y": "three"},
...     {"x": 2.2, "y": "two"},
...     {"x": 1.1, "y": "one"}
... ])
>>> records
    <Array [{x: 1.1, y: 'one'}, ..., {x: 1.1, ...}] type='5 * {x: float64, y: s...'>
>>> categorical_records = ak.zip({
...     "x": ak.to_categorical(records["x"]),
...     "y": ak.to_categorical(records["y"]),
... })
>>> categorical_records
<Array [{x: 1.1, y: 'one'}, ... y: 'one'}] type='5 * {"x": categorical[type=floa...'>
>>> categorical_records.type.show()
5 * {
    x: categorical[type=float64],
    y: categorical[type=string]
}
>>> categorical_records.to_list() == records.to_list()
True

The check for uniqueness is currently implemented in a Python loop, so conversion to categorical should be regarded as expensive. (This can change, but it would always be an _n log(n)_ operation.)

See also ak.is_categorical, ak.categories, ak.from_categorical.