How to create arrays of strings#
Awkward Arrays can contain strings, although these strings are just a special view of lists of uint8 numbers, so the variable-length data are stored efficiently.
By contrast, NumPy’s strings are padded to have equal width, and Pandas’s strings are Python objects. Awkward Array doesn’t have nearly as many functions for manipulating arrays of strings as NumPy and Pandas do, though.
import awkward as ak
import numpy as np
From Python strings#
The ak.Array constructor and ak.from_iter() recognize strings, and strings are returned by ak.to_list().
ak.Array(["one", "two", "three"])
['one', 'two', 'three'] ---------------- type: 3 * string
They may be nested within anything.
ak.Array([["one", "two"], [], ["three"]])
[['one', 'two'], [], ['three']] ---------------------- type: 3 * var * string
From NumPy arrays#
NumPy strings are also recognized by ak.from_numpy() and ak.to_numpy(), as well as by the ak.Array constructor used in this example.
numpy_array = np.array(["one", "two", "three", "four"])
numpy_array
array(['one', 'two', 'three', 'four'], dtype='<U5')
awkward_array = ak.Array(numpy_array)
awkward_array
['one', 'two', 'three', 'four'] ---------------- type: 4 * string
Operations with strings#
Since strings are really just lists, some of the list operations “just work” on strings.
ak.num(awkward_array)
[3, 3, 5, 4] --------------- type: 4 * int64
awkward_array[:, 1:]
['ne', 'wo', 'hree', 'our'] ---------------- type: 4 * string
Others had to be specially overloaded for the string case, such as string equality. The default meaning of == would be to descend to the lowest level and compare numbers (characters, in this case).
awkward_array == "three"
[False, False, True, False] -------------- type: 4 * bool
awkward_array == ak.Array(["ONE", "TWO", "three", "four"])
[False, False, True, True] -------------- type: 4 * bool
Similarly, ak.sort() and ak.argsort() sort strings lexicographically, rather than sorting the individual characters within each string.
ak.sort(awkward_array)
['four', 'one', 'three', 'two'] ---------------- type: 4 * string
Still other operations had to be inhibited, since they wouldn’t make sense for strings.
np.sqrt(awkward_array)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[11], line 1
----> 1 np.sqrt(awkward_array)
(...)
TypeError: numpy.sqrt is not implemented for string types. To register an implementation, add a name to these string(s) and register a behavior overload
This error occurred while calling
numpy.sqrt.__call__(
<Array ['one', 'two', 'three', 'four'] type='4 * string'>
)
Categorical strings#
A large set of strings with few unique values is more efficiently manipulated as integers than as strings. In Pandas, this is called categorical data; in R, it’s a factor; and in Arrow and Parquet, it’s dictionary encoding.
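The ak.to_categorical() function makes Awkward Arrays categorical in this sense. ak.to_arrow() and ak.to_parquet() recognize categorical data and convert it to the corresponding Arrow and Parquet types. As the warning in the output below reports, the general-purpose ak.to_categorical() is deprecated and has been replaced by ak.str.to_categorical().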
uncategorized = ak.Array(["three", "one", "two", "two", "three", "one", "one", "one"])
uncategorized
['three', 'one', 'two', 'two', 'three', 'one', 'one', 'one'] ---------------- type: 8 * string
categorized = ak.to_categorical(uncategorized)
categorized
/home/runner/micromamba/envs/awkward-docs/lib/python3.10/site-packages/awkward/operations/ak_to_categorical.py:92: DeprecationWarning: In version 2.5.0, this will be an error.
To raise these warnings as errors (and get stack traces to find out where they're called), run
import warnings
warnings.filterwarnings("error", module="awkward.*")
after the first `import awkward` or use `@pytest.mark.filterwarnings("error:::awkward.*")` in pytest.
Issue: The general purpose `ak.to_categorical` has been replaced by `ak.str.to_categorical`.
return _impl(array, highlevel, behavior)
['three', 'one', 'two', 'two', 'three', 'one', 'one', 'one'] ---------------------------------- type: 8 * categorical[type=string]
Internally, the data now have an index that selects from a set of unique strings.
categorized.layout.index
<Index dtype='int64' len='8'>[0 1 2 2 0 1 1 1]</Index>
ak.Array(categorized.layout.content)
['three', 'one', 'two'] ---------------- type: 3 * string
The main advantage of Awkward categorical data (other than proper conversions to Arrow and Parquet) is that equality is performed using the index integers.
categorized == "one"
[False, True, False, False, False, True, True, True] -------------- type: 8 * bool
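The idea of an integer index into a set of unique strings can be illustrated with plain NumPy. This is a conceptual sketch only, not Awkward Array's implementation: np.unique() sorts the categories, whereas the Awkward output above lists them in order of first appearance, and the names categories, codes, and one_code are illustrative.

```python
import numpy as np

values = np.array(["three", "one", "two", "two", "three", "one", "one", "one"])

# Unique strings plus, for each element, the position of its category.
categories, codes = np.unique(values, return_inverse=True)

# The original data are recovered by indexing the categories with the codes.
reconstructed = categories[codes]

# Equality with "one" becomes a single category lookup followed by an
# integer comparison over the codes.
one_code = np.flatnonzero(categories == "one")[0]
is_one = codes == one_code

print(categories.tolist())
print(is_one.tolist())
```

Comparing small integers avoids comparing the characters of every string, which is where the efficiency of categorical data comes from.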
With ArrayBuilder#
ak.ArrayBuilder() is described in more detail in this tutorial, but you can add strings by calling the string method or simply appending them.
(This is what ak.from_iter() uses internally to accumulate data.)
builder = ak.ArrayBuilder()
builder.string("one")
builder.append("two")
builder.append("three")
array = builder.snapshot()
array
['one', 'two', 'three'] ---------------- type: 3 * string