How to create arrays of strings#

Awkward Arrays can contain strings, although these strings are just a special view of lists of uint8 numbers. Because of this, variable-length string data are stored efficiently.

By contrast, NumPy’s strings are padded to have equal width, and Pandas’s strings are Python objects. However, Awkward Array doesn’t have nearly as many functions for manipulating arrays of strings as NumPy and Pandas do.

import awkward as ak
import numpy as np

From Python strings#

The ak.Array constructor and ak.from_iter() recognize strings, and strings are returned by ak.to_list().

ak.Array(["one", "two", "three"])
['one',
 'two',
 'three']
----------------
backend: cpu
nbytes: 43 B
type: 3 * string

They may be nested within anything.

ak.Array([["one", "two"], [], ["three"]])
[['one', 'two'],
 [],
 ['three']]
----------------------
backend: cpu
nbytes: 75 B
type: 3 * var * string

From NumPy arrays#

NumPy arrays of strings are also recognized by ak.from_numpy() and ak.to_numpy(), as well as by the ak.Array constructor.

numpy_array = np.array(["one", "two", "three", "four"])
numpy_array
array(['one', 'two', 'three', 'four'], dtype='<U5')
awkward_array = ak.Array(numpy_array)
awkward_array
['one',
 'two',
 'three',
 'four']
----------------
backend: cpu
nbytes: 84 B
type: 4 * string

Operations with strings#

Since strings are really just lists, some of the list operations “just work” on strings.

ak.num(awkward_array)
[3,
 3,
 5,
 4]
---------------
backend: cpu
nbytes: 32 B
type: 4 * int64
awkward_array[:, 1:]
['ne',
 'wo',
 'hree',
 'our']
----------------
backend: cpu
nbytes: 51 B
type: 4 * string

Others, such as string equality, had to be specially overloaded for the string case. Without the overload, == would descend to the lowest level and compare numbers (characters, in this case).

awkward_array == "three"
[False,
 False,
 True,
 False]
--------------
backend: cpu
nbytes: 4 B
type: 4 * bool
awkward_array == ak.Array(["ONE", "TWO", "three", "four"])
[False,
 False,
 True,
 True]
--------------
backend: cpu
nbytes: 4 B
type: 4 * bool

Similarly, ak.sort() and ak.argsort() sort whole strings lexicographically, rather than sorting the individual characters within each string.

ak.sort(awkward_array)
['four',
 'one',
 'three',
 'two']
----------------
backend: cpu
nbytes: 79 B
type: 4 * string

Still other operations had to be inhibited, since they wouldn’t make sense for strings.

np.sqrt(awkward_array)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 np.sqrt(awkward_array)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1619, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   1554 """
   1555 Intercepts attempts to pass this Array to a NumPy
   1556 [universal functions](https://docs.scipy.org/doc/numpy/reference/ufuncs.html)
   (...)
   1616 See also #__array_function__.
   1617 """
   1618 name = f"{type(ufunc).__module__}.{ufunc.__name__}.{method!s}"
-> 1619 with ak._errors.OperationErrorContext(name, inputs, kwargs):
   1620     return ak._connect.numpy.array_ufunc(ufunc, method, inputs, kwargs)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_errors.py:80, in ErrorContext.__exit__(self, exception_type, exception_value, traceback)
     78     self._slate.__dict__.clear()
     79     # Handle caught exception
---> 80     raise self.decorate_exception(exception_type, exception_value)
     81 else:
     82     # Step out of the way so that another ErrorContext can become primary.
     83     if self.primary() is self:

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/highlevel.py:1620, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   1618 name = f"{type(ufunc).__module__}.{ufunc.__name__}.{method!s}"
   1619 with ak._errors.OperationErrorContext(name, inputs, kwargs):
-> 1620     return ak._connect.numpy.array_ufunc(ufunc, method, inputs, kwargs)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_connect/numpy.py:469, in array_ufunc(ufunc, method, inputs, kwargs)
    461         raise TypeError(
    462             "no {}.{} overloads for custom types: {}".format(
    463                 type(ufunc).__module__, ufunc.__name__, ", ".join(error_message)
    464             )
    465         )
    467     return None
--> 469 out = ak._broadcasting.broadcast_and_apply(
    470     inputs,
    471     action,
    472     depth_context=depth_context,
    473     lateral_context=lateral_context,
    474     allow_records=False,
    475     function_name=ufunc.__name__,
    476 )
    478 out_named_axis = functools.reduce(
    479     _unify_named_axis, lateral_context[NAMED_AXIS_KEY].named_axis
    480 )
    481 if len(out) == 1:

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1219, in broadcast_and_apply(inputs, action, depth_context, lateral_context, allow_records, left_broadcast, right_broadcast, numpy_to_regular, regular_to_jagged, function_name, broadcast_parameters_rule)
   1217 backend = backend_of(*inputs, coerce_to_common=False)
   1218 isscalar = []
-> 1219 out = apply_step(
   1220     backend,
   1221     broadcast_pack(inputs, isscalar),
   1222     action,
   1223     0,
   1224     depth_context,
   1225     lateral_context,
   1226     {
   1227         "allow_records": allow_records,
   1228         "left_broadcast": left_broadcast,
   1229         "right_broadcast": right_broadcast,
   1230         "numpy_to_regular": numpy_to_regular,
   1231         "regular_to_jagged": regular_to_jagged,
   1232         "function_name": function_name,
   1233         "broadcast_parameters_rule": broadcast_parameters_rule,
   1234     },
   1235 )
   1236 assert isinstance(out, tuple)
   1237 return tuple(broadcast_unpack(x, isscalar) for x in out)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1197, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, options)
   1195     return result
   1196 elif result is None:
-> 1197     return continuation()
   1198 else:
   1199     raise AssertionError(result)

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1166, in apply_step.<locals>.continuation()
   1164 # Any non-string list-types?
   1165 elif any(x.is_list and not is_string_like(x) for x in contents):
-> 1166     return broadcast_any_list()
   1168 # Any RecordArrays?
   1169 elif any(x.is_record for x in contents):

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:670, in apply_step.<locals>.broadcast_any_list()
    667         nextinputs.append(x)
    668         nextparameters.append(NO_PARAMETERS)
--> 670 outcontent = apply_step(
    671     backend,
    672     nextinputs,
    673     action,
    674     depth + 1,
    675     copy.copy(depth_context),
    676     lateral_context,
    677     options,
    678 )
    679 assert isinstance(outcontent, tuple)
    680 parameters = parameters_factory(nextparameters, len(outcontent))

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_broadcasting.py:1179, in apply_step(backend, inputs, action, depth, depth_context, lateral_context, options)
   1172     else:
   1173         raise ValueError(
   1174             "cannot broadcast: {}{}".format(
   1175                 ", ".join(repr(type(x)) for x in inputs), in_function(options)
   1176             )
   1177         )
-> 1179 result = action(
   1180     inputs,
   1181     depth=depth,
   1182     depth_context=depth_context,
   1183     lateral_context=lateral_context,
   1184     continuation=continuation,
   1185     backend=backend,
   1186     options=options,
   1187 )
   1189 if isinstance(result, tuple) and all(isinstance(x, Content) for x in result):
   1190     if any(content.backend is not backend for content in result):

File ~/micromamba/envs/awkward-docs/lib/python3.11/site-packages/awkward/_connect/numpy.py:405, in array_ufunc.<locals>.action(inputs, **ignore)
    400     # Do we have all-strings? If so, we can't proceed
    401     if all(
    402         x.is_list and x.parameter("__array__") in ("string", "bytestring")
    403         for x in contents
    404     ):
--> 405         raise TypeError(
    406             f"{type(ufunc).__module__}.{ufunc.__name__} is not implemented for string types. "
    407             "To register an implementation, add a name to these string(s) and register a behavior overload"
    408         )
    410 if ufunc is numpy.matmul:
    411     raise NotImplementedError(
    412         "matrix multiplication (`@` or `np.matmul`) is not yet implemented for Awkward Arrays"
    413     )

TypeError: numpy.sqrt is not implemented for string types. To register an implementation, add a name to these string(s) and register a behavior overload

This error occurred while calling

    numpy.sqrt.__call__(
        <Array ['one', 'two', 'three', 'four'] type='4 * string'>
    )

Categorical strings#

A large set of strings with few unique values is more efficiently manipulated as integers than as strings. In Pandas, this is called categorical data; in R, it’s called a factor; and in Arrow and Parquet, it’s called dictionary encoding.

The ak.str.to_categorical() function (which requires PyArrow) makes Awkward Arrays categorical in this sense. ak.to_arrow() and ak.to_parquet() recognize categorical data and convert it to the corresponding Arrow and Parquet types.

uncategorized = ak.Array(["three", "one", "two", "two", "three", "one", "one", "one"])
uncategorized
['three',
 'one',
 'two',
 'two',
 'three',
 'one',
 'one',
 'one']
----------------
backend: cpu
nbytes: 100 B
type: 8 * string
categorized = ak.str.to_categorical(uncategorized)
categorized
['three',
 'one',
 'two',
 'two',
 'three',
 'one',
 'one',
 'one']
----------------------------------
backend: cpu
nbytes: 107 B
type: 8 * categorical[type=string]

Internally, the data now have an index that selects from a set of unique strings.

categorized.layout.index
<Index dtype='int64' len='8'>[0 1 2 2 0 1 1 1]</Index>
ak.Array(categorized.layout.content)
['three',
 'one',
 'two']
----------------
backend: cpu
nbytes: 43 B
type: 3 * string

The main advantage of Awkward categorical data (other than proper conversions to Arrow and Parquet) is that equality is performed using the index integers rather than by comparing the strings themselves.

categorized == "one"
[False,
 True,
 False,
 False,
 False,
 True,
 True,
 True]
--------------
backend: cpu
nbytes: 8 B
type: 8 * bool

With ArrayBuilder#

ak.ArrayBuilder() is described in more detail in this tutorial, but you can add strings by calling its string method or by simply appending them.

(This is what ak.from_iter() uses internally to accumulate data.)

builder = ak.ArrayBuilder()

builder.string("one")
builder.append("two")
builder.append("three")

array = builder.snapshot()
array
['one',
 'two',
 'three']
----------------
backend: cpu
nbytes: 43 B
type: 3 * string