How to create arrays of strings

Awkward Arrays can contain strings. Internally, each string is just a special view of a variable-length list of uint8 numbers, so strings of different lengths are stored compactly, without padding.

By contrast, NumPy’s strings are padded to have equal width, and Pandas’s strings are Python objects. However, Awkward Array has far fewer functions for manipulating arrays of strings than NumPy and Pandas do.
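To make the storage idea concrete, here is a pure-Python sketch of a "content plus offsets" layout. It is only a model of the concept described above, not Awkward Array's actual implementation, and the names `content`, `offsets`, and `get_string` are hypothetical.

```python
# Illustrative model only: all strings share one byte buffer, and an
# offsets list records where each string starts and stops.
words = ["one", "two", "three"]

content = bytearray()   # concatenated UTF-8 bytes (the "uint8 numbers")
offsets = [0]           # offsets[i]:offsets[i + 1] delimits string i
for word in words:
    content += word.encode("utf-8")
    offsets.append(len(content))


def get_string(i):
    """Decode string i by slicing the shared byte buffer."""
    return bytes(content[offsets[i]:offsets[i + 1]]).decode("utf-8")


decoded = [get_string(i) for i in range(len(words))]
print(offsets)   # [0, 3, 6, 11]
print(decoded)   # ['one', 'two', 'three']
```

No padding bytes are stored in this model: the buffer holds exactly 11 bytes for the 11 characters.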

import awkward as ak
import numpy as np

From Python strings

The ak.Array constructor and ak.from_iter recognize strings, and strings are returned by ak.to_list.

ak.Array(["one", "two", "three"])
<Array ['one', 'two', 'three'] type='3 * string'>

They may be nested within anything.

ak.Array([["one", "two"], [], ["three"]])
<Array [['one', 'two'], [], ['three']] type='3 * var * string'>

From NumPy arrays

NumPy arrays of strings are also recognized by ak.from_numpy and ak.to_numpy, as well as by the ak.Array constructor, which is used here.

numpy_array = np.array(["one", "two", "three", "four"])
numpy_array
array(['one', 'two', 'three', 'four'], dtype='<U5')
awkward_array = ak.Array(numpy_array)
awkward_array
<Array ['one', 'two', 'three', 'four'] type='4 * string'>

Operations with strings

Since strings are really just lists, some of the list operations “just work” on strings.

ak.num(awkward_array)
<Array [3, 3, 5, 4] type='4 * int64'>
awkward_array[:, 1:]
<Array ['ne', 'wo', 'hree', 'our'] type='4 * string'>

Others, such as string equality, had to be specially overloaded for the string case. Without the overload, == would descend to the lowest level and compare numbers (characters, in this case) rather than whole strings.

awkward_array == "three"
<Array [False, False, True, False] type='4 * bool'>
awkward_array == ak.Array(["ONE", "TWO", "three", "four"])
<Array [False, False, True, True] type='4 * bool'>

Similarly, ak.sort and ak.argsort sort whole strings lexicographically, rather than sorting the individual characters within each string.

ak.sort(awkward_array)
<Array ['four', 'one', 'three', 'two'] type='4 * string'>

Still other operations had to be inhibited, since they wouldn’t make sense for strings.

np.sqrt(awkward_array)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In [11], line 1
----> 1 np.sqrt(awkward_array)

File ~/python3.8/lib/python3.8/site-packages/awkward/highlevel.py:1411, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   1354 """
   1355 Intercepts attempts to pass this Array to a NumPy
   1356 [universal functions](https://docs.scipy.org/doc/numpy/reference/ufuncs.html)
   (...)
   1408 See also #__array_function__.
   1409 """
   1410 if not hasattr(self, "_tracers"):
-> 1411     return ak._connect._numpy.array_ufunc(ufunc, method, inputs, kwargs)
   1412 else:
   1413     return ak._connect._jax.jax_utils.array_ufunc(
   1414         self, ufunc, method, inputs, kwargs
   1415     )

File ~/python3.8/lib/python3.8/site-packages/awkward/_connect/_numpy.py:250, in array_ufunc(ufunc, method, inputs, kwargs)
    240         raise ValueError(
    241             "no overloads for custom types: {}({})".format(
    242                 ufunc.__name__,
   (...)
    245             + ak._util.exception_suffix(__file__)
    246         )
    248     return None
--> 250 out = ak._util.broadcast_and_apply(
    251     inputs, getfunction, behavior, allow_records=False, pass_depth=False
    252 )
    253 assert isinstance(out, tuple) and len(out) == 1
    254 return ak._util.wrap(out[0], behavior)

File ~/python3.8/lib/python3.8/site-packages/awkward/_util.py:1172, in broadcast_and_apply(inputs, getfunction, behavior, allow_records, pass_depth, pass_user, user, left_broadcast, right_broadcast, numpy_to_regular, regular_to_jagged)
   1170 else:
   1171     isscalar = []
-> 1172     out = apply(broadcast_pack(inputs, isscalar), 0, user)
   1173     assert isinstance(out, tuple)
   1174     return tuple(broadcast_unpack(x, isscalar) for x in out)

File ~/python3.8/lib/python3.8/site-packages/awkward/_util.py:925, in broadcast_and_apply.<locals>.apply(inputs, depth, user)
    922     else:
    923         nextinputs.append(x)
--> 925 outcontent = apply(nextinputs, depth + 1, user)
    926 assert isinstance(outcontent, tuple)
    928 length = None

File ~/python3.8/lib/python3.8/site-packages/awkward/_util.py:739, in broadcast_and_apply.<locals>.apply(inputs, depth, user)
    736 if pass_user:
    737     args = args + (user,)
--> 739 custom = getfunction(inputs, *args)
    740 if callable(custom):
    741     return custom()

File ~/python3.8/lib/python3.8/site-packages/awkward/_connect/_numpy.py:240, in array_ufunc.<locals>.getfunction(inputs)
    238         else:
    239             custom_types.append(type(x).__name__)
--> 240     raise ValueError(
    241         "no overloads for custom types: {}({})".format(
    242             ufunc.__name__,
    243             ", ".join(custom_types),
    244         )
    245         + ak._util.exception_suffix(__file__)
    246     )
    248 return None

ValueError: no overloads for custom types: sqrt(string)

(https://github.com/scikit-hep/awkward-1.0/blob/1.10.1/src/awkward/_connect/_numpy.py#L245)

Categorical strings

A large set of strings with few unique values is more efficiently manipulated as integers than as strings. In Pandas, this is called categorical data; in R, a factor; and in Arrow and Parquet, dictionary encoding.

The ak.to_categorical function makes Awkward Arrays categorical in this sense. ak.to_arrow and ak.to_parquet recognize categorical data and convert it to the corresponding Arrow and Parquet types.

uncategorized = ak.Array(["three", "one", "two", "two", "three", "one", "one", "one"])
uncategorized
<Array ['three', 'one', ... 'one', 'one'] type='8 * string'>
categorized = ak.to_categorical(uncategorized)
categorized
<Array ['three', 'one', ... 'one', 'one'] type='8 * categorical[type=string]'>

Internally, the data now have an index that selects from a set of unique strings.

categorized.layout.index
<Index64 i="[0 1 2 2 0 1 1 1]" offset="0" length="8" at="0x000002603f50"/>
ak.Array(categorized.layout.content)
<Array ['three', 'one', 'two'] type='3 * string'>

The main advantage of Awkward categorical data (other than proper conversions to Arrow and Parquet) is that equality tests compare the index integers rather than the strings themselves.

categorized == "one"
<Array [False, True, False, ... True, True] type='8 * bool'>
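The idea behind categorical (dictionary-encoded) data can be sketched in pure Python. This is a conceptual model only, not how ak.to_categorical is implemented, and the names `categories`, `lookup`, and `index` are hypothetical.

```python
words = ["three", "one", "two", "two", "three", "one", "one", "one"]

categories = []   # unique strings, in order of first appearance
lookup = {}       # string -> position in categories
index = []        # one small integer per original string

for word in words:
    if word not in lookup:
        lookup[word] = len(categories)
        categories.append(word)
    index.append(lookup[word])

# Equality against "one" needs a single string lookup,
# followed by integer comparisons only.
target = lookup.get("one")
mask = [i == target for i in index]

print(categories)   # ['three', 'one', 'two']
print(index)        # [0, 1, 2, 2, 0, 1, 1, 1]
print(mask)
```

The `index` and `categories` values here correspond to the index and content shown for the categorized array.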

With ArrayBuilder

ak.ArrayBuilder is described in more detail in this tutorial, but you can add strings by calling the string method or simply appending them.

(This is what ak.from_iter uses internally to accumulate data.)

builder = ak.ArrayBuilder()

builder.string("one")
builder.append("two")
builder.append("three")

array = builder.snapshot()
array
<Array ['one', 'two', 'three'] type='3 * string'>