How to create arrays of strings#
Awkward Arrays can contain strings, although these strings are just a special view of lists of uint8 numbers. As such, the variable-length data are stored efficiently, with no padding and no per-string Python objects.
By contrast, NumPy’s strings are padded to have equal width, and Pandas’s strings are Python objects. Awkward Array doesn’t have nearly as many functions for manipulating arrays of strings as NumPy and Pandas do, though.
import awkward as ak
import numpy as np
From Python strings#
The ak.Array constructor and ak.from_iter() recognize strings, and strings are returned by ak.to_list().
ak.Array(["one", "two", "three"])
['one', 'two', 'three']
----------------
type: 3 * string
They may be nested within anything.
ak.Array([["one", "two"], [], ["three"]])
[['one', 'two'], [], ['three']]
----------------------
type: 3 * var * string
From NumPy arrays#
NumPy strings are also recognized by ak.from_numpy() and ak.to_numpy().
numpy_array = np.array(["one", "two", "three", "four"])
numpy_array
array(['one', 'two', 'three', 'four'], dtype='<U5')
awkward_array = ak.Array(numpy_array)
awkward_array
['one', 'two', 'three', 'four']
----------------
type: 4 * string
Operations with strings#
Since strings are really just lists, some of the list operations “just work” on strings.
ak.num(awkward_array)
[3, 3, 5, 4]
---------------
type: 4 * int64
awkward_array[:, 1:]
['ne', 'wo', 'hree', 'our']
----------------
type: 4 * string
Others had to be specially overloaded for the string case, such as string equality. Without the overload, the default meaning of == would be to descend to the lowest level and compare numbers (characters, in this case).
awkward_array == "three"
[False, False, True, False]
--------------
type: 4 * bool
awkward_array == ak.Array(["ONE", "TWO", "three", "four"])
[False, False, True, True]
--------------
type: 4 * bool
Similarly, ak.sort() and ak.argsort() sort whole strings lexicographically, rather than sorting the individual characters within each string.
ak.sort(awkward_array)
['four', 'one', 'three', 'two']
----------------
type: 4 * string
Still other operations had to be inhibited, since they wouldn’t make sense for strings.
np.sqrt(awkward_array)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[11], line 1
----> 1 np.sqrt(awkward_array)
File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/highlevel.py:1376, in Array.__array_ufunc__(self, ufunc, method, *inputs, **kwargs)
1374 arguments.update(kwargs)
1375 with ak._errors.OperationErrorContext(name, arguments):
-> 1376 return ak._connect.numpy.array_ufunc(ufunc, method, inputs, kwargs)
File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/_connect/numpy.py:282, in array_ufunc(ufunc, method, inputs, kwargs)
279 assert isinstance(result, tuple) and len(result) == 1
280 return result[0]
--> 282 out = ak._do.recursively_apply(
283 inputs[where],
284 unary_action,
285 behavior,
286 function_name=ufunc.__name__,
287 allow_records=False,
288 )
290 else:
291 out = ak._broadcasting.broadcast_and_apply(
292 inputs, action, behavior, allow_records=False, function_name=ufunc.__name__
293 )
File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/_do.py:35, in recursively_apply(layout, action, behavior, depth_context, lateral_context, allow_records, keep_parameters, numpy_to_regular, return_simplified, return_array, function_name, regular_to_jagged)
19 def recursively_apply(
20 layout: Content | Record,
21 action: ActionType,
(...)
31 regular_to_jagged=False,
32 ) -> Content | Record | None:
34 if isinstance(layout, Content):
---> 35 return layout._recursively_apply(
36 action,
37 behavior,
38 1,
39 copy.copy(depth_context),
40 lateral_context,
41 {
42 "allow_records": allow_records,
43 "keep_parameters": keep_parameters,
44 "numpy_to_regular": numpy_to_regular,
45 "regular_to_jagged": regular_to_jagged,
46 "return_simplified": return_simplified,
47 "return_array": return_array,
48 "function_name": function_name,
49 },
50 )
52 elif isinstance(layout, Record):
53 out = recursively_apply(
54 layout._array,
55 action,
(...)
64 function_name,
65 )
File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/contents/listarray.py:1441, in ListArray._recursively_apply(self, action, behavior, depth, depth_context, lateral_context, options)
1431 def continuation():
1432 content._recursively_apply(
1433 action,
1434 behavior,
(...)
1438 options,
1439 )
-> 1441 result = action(
1442 self,
1443 depth=depth,
1444 depth_context=depth_context,
1445 lateral_context=lateral_context,
1446 continuation=continuation,
1447 behavior=behavior,
1448 backend=self._backend,
1449 options=options,
1450 )
1452 if isinstance(result, Content):
1453 return result
File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/_connect/numpy.py:275, in array_ufunc.<locals>.unary_action(layout, **ignore)
273 def unary_action(layout, **ignore):
274 nextinputs[where] = layout
--> 275 result = action(tuple(nextinputs), **ignore)
276 if result is None:
277 return None
File ~/micromamba-root/envs/awkward-docs/lib/python3.10/site-packages/awkward/_connect/numpy.py:253, in array_ufunc.<locals>.action(inputs, **ignore)
251 else:
252 error_message.append(type(x).__name__)
--> 253 raise ak._errors.wrap_error(
254 TypeError(
255 "no {}.{} overloads for custom types: {}".format(
256 type(ufunc).__module__, ufunc.__name__, ", ".join(error_message)
257 )
258 )
259 )
261 return None
TypeError: while calling
numpy.sqrt.__call__(
<Array ['one', 'two', 'three', 'four'] type='4 * string'>
)
Error details: no numpy.sqrt overloads for custom types: string
Categorical strings#
A large set of strings with few unique values is more efficiently manipulated as integers than as strings. In Pandas, this is called categorical data; in R, it’s called a factor; and in Arrow and Parquet, it’s called dictionary encoding.
The ak.to_categorical() function makes Awkward Arrays categorical in this sense. ak.to_arrow() and ak.to_parquet() recognize categorical data and convert it to the corresponding Arrow and Parquet types.
uncategorized = ak.Array(["three", "one", "two", "two", "three", "one", "one", "one"])
uncategorized
['three', 'one', 'two', 'two', 'three', 'one', 'one', 'one']
----------------
type: 8 * string
categorized = ak.to_categorical(uncategorized)
categorized
['three', 'one', 'two', 'two', 'three', 'one', 'one', 'one']
----------------------------------
type: 8 * categorical[type=string]
Internally, the data now have an index that selects from a set of unique strings.
categorized.layout.index
<Index dtype='int64' len='8'>[0 1 2 2 0 1 1 1]</Index>
ak.Array(categorized.layout.content)
['three', 'one', 'two']
----------------
type: 3 * string
The main advantage to Awkward categorical data (other than proper conversions to Arrow and Parquet) is that equality is performed using the index integers.
categorized == "one"
[False, True, False, False, False, True, True, True]
--------------
type: 8 * bool
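The index-plus-unique-values layout can be imitated with plain NumPy, which may help to show why equality on categorical data is cheap. This is only an illustration of the idea, not how Awkward implements it: np.unique() sorts the categories, whereas the Awkward content shown earlier keeps them in order of first appearance.

```python
import numpy as np

values = np.array(["three", "one", "two", "two", "three", "one", "one", "one"])

# categories: the unique strings; index: one integer per original entry.
categories, index = np.unique(values, return_inverse=True)

# Find the integer code for "one" once, then compare integers only.
code = int(np.flatnonzero(categories == "one")[0])
is_one = index == code

print(categories.tolist())
print(index.tolist())
print(is_one.tolist())
```

The string comparison happens once per category rather than once per entry, which is where the saving comes from when there are few unique values.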
With ArrayBuilder#
ak.ArrayBuilder() is described in more detail in this tutorial, but you can add strings by calling the string method or simply appending them. (This is what ak.from_iter() uses internally to accumulate data.)
builder = ak.ArrayBuilder()
builder.string("one")
builder.append("two")
builder.append("three")
array = builder.snapshot()
array
['one', 'two', 'three']
----------------
type: 3 * string