How to create arrays of strings#

Awkward Arrays can contain strings, although these strings are just a special view of lists of uint8 numbers (the bytes of the UTF-8 encoding). As such, variable-length strings are stored efficiently, without padding.

By contrast, NumPy’s strings are padded to have equal width, and Pandas’s strings are Python objects. Awkward Array doesn’t have nearly as many functions for manipulating arrays of strings as NumPy and Pandas do, though.

import awkward as ak
import numpy as np

From Python strings#

The ak.Array constructor and ak.from_iter() recognize strings, and strings are returned by ak.to_list().

ak.Array(["one", "two", "three"])
['one',
 'two',
 'three']
----------------
type: 3 * string

Strings may be nested within any other data structure, such as variable-length lists.

ak.Array([["one", "two"], [], ["three"]])
[['one', 'two'],
 [],
 ['three']]
----------------------
type: 3 * var * string

From NumPy arrays#

NumPy arrays of strings are also recognized by ak.from_numpy() and ak.to_numpy(), as well as by the ak.Array constructor used below.

numpy_array = np.array(["one", "two", "three", "four"])
numpy_array
array(['one', 'two', 'three', 'four'], dtype='<U5')
awkward_array = ak.Array(numpy_array)
awkward_array
['one',
 'two',
 'three',
 'four']
----------------
type: 4 * string

Operations with strings#

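Since strings are really just lists, some of the list operations “just work” on strings: ak.num() returns the length of each string and slicing in the second dimension takes substrings. (Lengths are counted in bytes of the UTF-8 encoding, which equals the number of characters only for ASCII text.)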

ak.num(awkward_array)
[3,
 3,
 5,
 4]
---------------
type: 4 * int64
awkward_array[:, 1:]
['ne',
 'wo',
 'hree',
 'our']
----------------
type: 4 * string

Other operations, such as string equality, had to be specially overloaded for the string case. Without the overload, == would descend to the lowest level and compare numbers (characters, in this case).

awkward_array == "three"
[False,
 False,
 True,
 False]
--------------
type: 4 * bool
awkward_array == ak.Array(["ONE", "TWO", "three", "four"])
[False,
 False,
 True,
 True]
--------------
type: 4 * bool

Similarly, ak.sort() and ak.argsort() sort whole strings lexicographically, rather than sorting the individual characters within each string.

ak.sort(awkward_array)
['four',
 'one',
 'three',
 'two']
----------------
type: 4 * string

Still other operations had to be inhibited, since they wouldn’t make sense for strings. Applying a mathematical function such as np.sqrt, for example, raises a TypeError.

np.sqrt(awkward_array)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[11], line 1
----> 1 np.sqrt(awkward_array)

TypeError: while calling

    numpy.sqrt.__call__(
        <Array ['one', 'two', 'three', 'four'] type='4 * string'>
    )

Error details: no numpy.sqrt overloads for custom types: string

Categorical strings#

A large set of strings with few unique values is more efficiently manipulated as integers than as strings. In Pandas, this is called categorical data; in R, it’s called a factor; and in Arrow and Parquet, it’s called dictionary encoding.

The ak.to_categorical() function makes Awkward Arrays categorical in this sense. ak.to_arrow() and ak.to_parquet() recognize categorical data and convert it to the corresponding Arrow and Parquet types.

uncategorized = ak.Array(["three", "one", "two", "two", "three", "one", "one", "one"])
uncategorized
['three',
 'one',
 'two',
 'two',
 'three',
 'one',
 'one',
 'one']
----------------
type: 8 * string
categorized = ak.to_categorical(uncategorized)
categorized
['three',
 'one',
 'two',
 'two',
 'three',
 'one',
 'one',
 'one']
----------------------------------
type: 8 * categorical[type=string]

Internally, the data now have an index that selects from a set of unique strings.

categorized.layout.index
<Index dtype='int64' len='8'>[0 1 2 2 0 1 1 1]</Index>
ak.Array(categorized.layout.content)
['three',
 'one',
 'two']
----------------
type: 3 * string

The main advantage of Awkward categorical data (other than proper conversions to Arrow and Parquet) is that equality is performed using the index integers, rather than by comparing the strings themselves.

categorized == "one"
[False,
 True,
 False,
 False,
 False,
 True,
 True,
 True]
--------------
type: 8 * bool
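
The idea behind integer-based equality can be sketched in plain NumPy. This is only an illustration of the principle under the index and unique values shown earlier, not the library's actual implementation.

```python
import numpy as np

unique = ["three", "one", "two"]
index = np.array([0, 1, 2, 2, 0, 1, 1, 1])

# Look up the integer code of the target string once...
code = unique.index("one")

# ...then compare integers instead of comparing every string.
matches = index == code

print(matches.tolist())
```

The string comparison happens once against the small set of unique values, and the per-entry work reduces to an integer comparison.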

With ArrayBuilder#

ak.ArrayBuilder is described in more detail in this tutorial. To add strings to a builder, call its string method or simply append them.

(This is what ak.from_iter() uses internally to accumulate data.)

builder = ak.ArrayBuilder()

builder.string("one")
builder.append("two")
builder.append("three")

array = builder.snapshot()
array
['one',
 'two',
 'three']
----------------
type: 3 * string