How to create arrays of strings

Awkward Arrays can contain strings, which are implemented as a special view of lists of uint8 numbers. Because they are stored as variable-length lists, strings of different lengths are stored efficiently, without padding.

By contrast, NumPy’s strings are padded to have equal width, and Pandas’s strings are Python objects. However, Awkward Array doesn’t have nearly as many functions for manipulating arrays of strings as NumPy and Pandas do.
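To make the storage contrast concrete, the following sketch uses only NumPy to compare a padded string array with a flat uint8 buffer plus offsets. It illustrates the general idea of variable-length storage and is not Awkward Array's actual implementation; the names content, offsets, and get_string are hypothetical.

```python
import numpy as np

words = ["one", "two", "three"]

# NumPy pads each string to the width of the longest one ("three").
padded = np.array(words)

# Variable-length alternative: concatenate the UTF-8 bytes of all strings into
# one uint8 buffer and record where each string starts and stops.
encoded = [word.encode("utf-8") for word in words]
content = np.frombuffer(b"".join(encoded), dtype=np.uint8)
offsets = np.cumsum([0] + [len(chunk) for chunk in encoded])


def get_string(i):
    """Decode the i-th string from the flat buffer (hypothetical helper)."""
    return content[offsets[i]:offsets[i + 1]].tobytes().decode("utf-8")


print(padded.dtype.itemsize * len(padded))  # bytes used by the padded array
print(content.nbytes)                       # bytes used by the flat buffer
print(get_string(2))
```

The flat buffer holds only the bytes that the characters need, at the cost of a small offsets array.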

import awkward as ak
import numpy as np

From Python strings

The ak.Array constructor and ak.from_iter recognize strings, and strings are returned by ak.to_list.

ak.Array(["one", "two", "three"])
<Array ['one', 'two', 'three'] type='3 * string'>

They may be nested within anything.

ak.Array([["one", "two"], [], ["three"]])
<Array [['one', 'two'], [], ['three']] type='3 * var * string'>

From NumPy arrays

Arrays of NumPy strings are also recognized by ak.from_numpy and ak.to_numpy, as well as by the ak.Array constructor used here.

numpy_array = np.array(["one", "two", "three", "four"])
numpy_array
array(['one', 'two', 'three', 'four'], dtype='<U5')
awkward_array = ak.Array(numpy_array)
awkward_array
<Array ['one', 'two', 'three', 'four'] type='4 * string'>

Operations with strings

Since strings are really just lists, some of the list operations “just work” on strings.

ak.num(awkward_array)
<Array [3, 3, 5, 4] type='4 * int64'>
awkward_array[:, 1:]
<Array ['ne', 'wo', 'hree', 'our'] type='4 * string'>

Other operations had to be specially overloaded for the string case, such as string equality. Without the overload, == would descend to the lowest level and compare numbers (characters, in this case); with it, == compares whole strings.

awkward_array == "three"
<Array [False, False, True, False] type='4 * bool'>
awkward_array == ak.Array(["ONE", "TWO", "three", "four"])
<Array [False, False, True, True] type='4 * bool'>

Similarly, ak.sort and ak.argsort sort whole strings lexicographically, rather than sorting the characters within each string.

ak.sort(awkward_array)
<Array ['four', 'one', 'three', 'two'] type='4 * string'>

Still other operations had to be inhibited, since they wouldn’t make sense for strings.

np.sqrt(awkward_array)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_1958/4196337735.py in <module>
----> 1 np.sqrt(awkward_array)

~/python3.8/lib/python3.8/site-packages/awkward/highlevel.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
   1415         """
   1416         if not hasattr(self, "_tracers"):
-> 1417             return ak._connect._numpy.array_ufunc(ufunc, method, inputs, kwargs)
   1418         else:
   1419             return ak._connect._jax.jax_utils.array_ufunc(

~/python3.8/lib/python3.8/site-packages/awkward/_connect/_numpy.py in array_ufunc(ufunc, method, inputs, kwargs)
    260         return None
    261 
--> 262     out = ak._util.broadcast_and_apply(
    263         inputs, getfunction, behavior, allow_records=False, pass_depth=False
    264     )

~/python3.8/lib/python3.8/site-packages/awkward/_util.py in broadcast_and_apply(inputs, getfunction, behavior, allow_records, pass_depth, pass_user, user, left_broadcast, right_broadcast, numpy_to_regular, regular_to_jagged)
   1153     else:
   1154         isscalar = []
-> 1155         out = apply(broadcast_pack(inputs, isscalar), 0, user)
   1156         assert isinstance(out, tuple)
   1157         return tuple(broadcast_unpack(x, isscalar) for x in out)

~/python3.8/lib/python3.8/site-packages/awkward/_util.py in apply(inputs, depth, user)
    915                     [len(x) for x in nextinputs if isinstance(x, ak.layout.Content)]
    916                 )
--> 917                 outcontent = apply(nextinputs, depth + 1, user)
    918                 assert isinstance(outcontent, tuple)
    919 

~/python3.8/lib/python3.8/site-packages/awkward/_util.py in apply(inputs, depth, user)
    726             args = args + (user,)
    727 
--> 728         custom = getfunction(inputs, *args)
    729         if callable(custom):
    730             return custom()

~/python3.8/lib/python3.8/site-packages/awkward/_connect/_numpy.py in getfunction(inputs)
    250                 else:
    251                     custom_types.append(type(x).__name__)
--> 252             raise ValueError(
    253                 "no overloads for custom types: {0}({1})".format(
    254                     ufunc.__name__,

ValueError: no overloads for custom types: sqrt(string)

(https://github.com/scikit-hep/awkward-1.0/blob/1.5.0/src/awkward/_connect/_numpy.py#L257)

Categorical strings

A large set of strings with few unique values is more efficiently manipulated as integers than as strings. In Pandas, this is called categorical data; in R, a factor; and in Arrow and Parquet, dictionary encoding.

The ak.to_categorical function makes Awkward Arrays categorical in this sense. ak.to_arrow and ak.to_parquet recognize categorical data and convert it to the corresponding Arrow and Parquet types.

uncategorized = ak.Array(["three", "one", "two", "two", "three", "one", "one", "one"])
uncategorized
<Array ['three', 'one', ... 'one', 'one'] type='8 * string'>
categorized = ak.to_categorical(uncategorized)
categorized
<Array ['three', 'one', ... 'one', 'one'] type='8 * categorical[type=string]'>

Internally, the data now have an index that selects from a set of unique strings.

categorized.layout.index
<Index64 i="[0 1 2 2 0 1 1 1]" offset="0" length="8" at="0x0000030ad610"/>
ak.Array(categorized.layout.content)
<Array ['three', 'one', 'two'] type='3 * string'>

The main advantage of Awkward categorical data (other than proper conversions to Arrow and Parquet) is that equality comparisons are performed on the index integers rather than on the strings themselves.

categorized == "one"
<Array [False, True, False, ... True, True] type='8 * bool'>
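The idea of comparing integers instead of strings can be sketched with NumPy alone. This is not how ak.to_categorical is implemented; it uses np.unique as a stand-in, which sorts the unique values instead of keeping them in order of first appearance, and the names categories, index, and code are illustrative.

```python
import numpy as np

values = np.array(["three", "one", "two", "two", "three", "one", "one", "one"])

# Dictionary-encode: a small array of unique strings plus an integer index.
categories, index = np.unique(values, return_inverse=True)

# Look up the target string once among the unique values...
matches = np.flatnonzero(categories == "one")
code = int(matches[0])

# ...then compare integers for every element instead of comparing strings.
mask = index == code

print(categories)
print(index)
print(mask)
```

The string comparison touches only the unique values; the per-element work is an integer comparison.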

With ArrayBuilder

ak.ArrayBuilder is described in more detail in its own tutorial. To add strings, call the string method or simply append them.

(This is what ak.from_iter uses internally to accumulate data.)

builder = ak.ArrayBuilder()

builder.string("one")
builder.append("two")
builder.append("three")

array = builder.snapshot()
array
<Array ['one', 'two', 'three'] type='3 * string'>