ak.from_json#

Defined in awkward.operations.ak_from_json on line 29.

ak.from_json(source, *, line_delimited=False, schema=None, nan_string=None, posinf_string=None, neginf_string=None, complex_record_fields=None, buffersize=65536, initial=1024, resize=8, highlevel=True, behavior=None, attrs=None)#

Parameters:

source (bytes/str, pathlib.Path, or file-like object) – Data source of the JSON-formatted string(s). If bytes/str, the string is parsed. If a pathlib.Path, a file with that name is opened, parsed, and closed. If that path has a URI protocol (like "https://" or "s3://"), this function attempts to open the file with the fsspec library. If a file-like object with a read method, this function reads from the object, but does not close it.
line_delimited (bool) – If False, a single JSON document is read as an entire array or record. If True, this function reads line-delimited JSON into an array (regardless of how many there are). The line delimiter is not actually checked, so it may be "\n", "\r\n" or anything else.
schema (None, JSON str or equivalent lists/dicts) – If None, the data type is discovered while parsing. If a JSONSchema (json-schema.org), that schema is used to parse the JSON more quickly by skipping type-discovery.
nan_string (None or str) – If not None, strings with this value will be interpreted as floating-point NaN values.
posinf_string (None or str) – If not None, strings with this value will be interpreted as floating-point positive infinity values.
neginf_string (None or str) – If not None, strings with this value will be interpreted as floating-point negative infinity values.
complex_record_fields (None or (str, str)) – If not None, defines a pair of field names to interpret 2-field records as complex numbers.
buffersize (int) – Number of bytes in each read from source: larger values use more memory but read less frequently. (Python GIL is released before and after read events.)
initial (int) – Initial size (in bytes) of buffers used by the ak::ArrayBuilder.
resize (float) – Resize multiplier for buffers used by the ak::ArrayBuilder; should be strictly greater than 1.
highlevel (bool) – If True, return an ak.Array; otherwise, return a low-level ak.contents.Content subclass.
behavior (None or dict) – Custom ak.behavior for the output array, if high-level.
attrs (None or dict) – Custom attributes for the output array, if high-level.

Converts a JSON string into an Awkward Array.

There are a few different dichotomies in JSON-reading; all of the combinations are supported:

Reading from in-memory str/bytes, on-disk or over-network file, or an arbitrary Python object with a read(num_bytes) method.
Reading a single JSON document or a sequence of line-delimited documents.
Unknown schema (slow and general) or with a provided JSONSchema (fast, but not all possible cases are supported).
Conversion of strings representing not-a-number, plus and minus infinity into the appropriate floating-point numbers.
Conversion of records with a real and imaginary part into complex numbers.

Non-JSON features not allowed, including literals for not-a-number or infinite numbers; they must be quoted strings for nan_string, posinf_string, and neginf_string to recognize. The document or line-delimited documents must adhere to the strict JSON schema.

Sources#

In-memory strings or bytes are simply passed as the first argument:

>>> ak.from_json("[[1.1, 2.2, 3.3], [], [4.4, 5.5]]")
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

File names/paths need to be wrapped in pathlib.Path, and remote files are recognized by URI protocol (like "https://" or "s3://") and handled by fsspec (which must be installed).

>>> import pathlib
>>> with open("tmp.json", "w") as file:
...     file.write("[[1.1, 2.2, 3.3], [], [4.4, 5.5]]")
...
33
>>> ak.from_json(pathlib.Path("tmp.json"))
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

And any object with a read(num_bytes) method can be used as the source.

>>> class HasReadMethod:
...     def __init__(self, data):
...         self.bytes = data.encode()
...         self.pos = 0
...     def read(self, num_bytes):
...         start = self.pos
...         self.pos += num_bytes
...         return self.bytes[start:self.pos]
...
>>> filelike_obj = HasReadMethod("[[1.1, 2.2, 3.3], [], [4.4, 5.5]]")
>>> ak.from_json(filelike_obj)
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

If this function opens a file or network connection (because it is passed as a pathlib.Path), then this function will also close that file or connection.

If this function is provided a file-like object with a read(num_bytes) method, this function will not close it. (It might not even have a close method.)

Data structures#

This function interprets JSON arrays and JSON objects in the same way that ak.from_iter interprets Python lists and Python dicts. It could be used as a synonym for Python’s json.loads followed by ak.from_iter, but the direct JSON-reading is faster (especially with a schema) and uses less memory.

Consider

>>> import json
>>> json_data = "[[1.1, 2.2, 3.3], [], [4.4, 5.5]]"
>>> ak.from_iter(json.loads(json_data))
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
>>> ak.from_json(json_data)
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>

and

>>> json_data = '{"x": 1.1, "y": [1, 2]}'
>>> ak.from_iter(json.loads(json_data))
<Record {x: 1.1, y: [1, 2]} type='{x: float64, y: var * int64}'>
>>> ak.from_json(json_data)
<Record {x: 1.1, y: [1, 2]} type='{x: float64, y: var * int64}'>

As shown above, reading JSON may result in ak.Array or ak.Record, but line-delimited (line_delimited=True) only results in ak.Array:

>>> ak.from_json(
...     '{"x": 1.1, "y": [1]}\n{"x": 2.2, "y": [1, 2]}\n{"x": 3.3, "y": [1, 2, 3]}',
...     line_delimited=True,
... )
<Array [{x: 1.1, y: [1]}, ..., {x: 3.3, ...}] type='3 * {x: float64, y: var...'>

Even arrays of length zero:

>>> ak.from_json("", line_delimited=True)
<Array [] type='0 * unknown'>

Note that JSON interpreted with line_delimited doesn’t actually need delimiters between JSON documents or an absence of delimiters within each document. Parsing with line_delimited=True continues to the end of a JSON document and starts again with the next JSON document. It may be necessary to require actual delimiters between and never within JSON documents to split a large source for parallel-processing, but that consideration is beyond this function.

If a JSONSchema is provided, the schema describes the structure of the JSON document, regardless of whether there’s only one of them (may be an ak.Record) or many of them (must be an ak.Array).

>>> schema = {
...     "type": "object",
...     "properties": {
...         "x": {"type": "number"},
...         "y": {"type": "array", "items": {"type": "integer"}},
...     },
...     "required": ["x", "y"],
... }

>>> ak.from_json(
...     '{"x": 1.1, "y": [1, 2, 3]}',
...     schema=schema,
... )
<Record {x: 1.1, y: [1, ..., 3]} type='{x: float64, y: var * int64}'>

>>> ak.from_json(
...     '{"x": 1.1, "y": [1]}\n{"x": 2.2, "y": [1, 2]}\n{"x": 3.3, "y": [1, 2, 3]}',
...     schema=schema,
...     line_delimited=True,
... )
<Array [{x: 1.1, y: [1]}, ..., {x: 3.3, ...}] type='3 * {x: float64, y: var...'>

All numbers in the final array are signed 64-bit (integers and floating-point).

JSONSchemas#

This function supports a subset of JSONSchema (see the JSONSchema specification). The schemas may be passed as JSON text or as Python lists and dicts representing JSON, but the following conditions apply:

The root of the schema must be "type": "array" or "type": "object".
Every level must have a "type", which can only name one type (as a string or length-1 list) or one type and "null" (as a length-2 list).
"type": "boolean" → 1-byte boolean values.
"type": "integer" → 8-byte integer values. If a part of the schema is declared to have integer type but the JSON numbers are expressed as floating-point, such as 3.14, 3.0, or 3e0, this function raises an error.
"type": "number" → 8-byte floating-point values. If used with this function’s nan_string, posinf_string, and/or neginf_string, the value in the JSON could be a string, as long as it matches one of these three.
"type": "string" → UTF-8 encoded strings. All JSON escape sequences are supported. Remember that the source data are ASCII; Unicode is derived from “\uXXXX” escape sequences. If an "enum" is given, strings are represented as categorical values (ak.contents.IndexedArray or ak.contents.IndexedOptionArray).
"type": "array" → nested lists. The "items" must be specified. If "minItems" and "maxItems" are specified and equal to each other, the list has regular-type (ak.types.RegularType); otherwise, it has variable-length type (ak.types.ListType).
"type": "object" → nested records. The "properties" must be specified, and any properties in the data not described by "properties" will not appear in the output.

Substitutions for non-finite and complex numbers#

JSON doesn’t support not-a-number values, infinite values, or complex number types (as in numbers with a real and imaginary part). Some work-arounds use non-JSON syntax, but this function converts valid JSON into these numbers with user-specified rules.

The nan_string, posinf_string, and neginf_string convert quoted strings into floating-point numbers. You can specify what these strings are.

>>> ak.from_json(
...     '[1, 2, "nan", "inf", "-inf"]',
...     nan_string="nan",
...     posinf_string="inf",
...     neginf_string="-inf",
... )
<Array [1, 2, nan, inf, -inf] type='5 * float64'>

Without these rules, the array would be interpreted as a union of numbers and strings:

>>> ak.from_json(
...     '[1, 2, "nan", "inf", "-inf"]',
... )
<Array [1, 2, 'nan', 'inf', '-inf'] type='5 * union[int64, string]'>

When combined with a JSONSchema, you need to say that these values have type "number", not a union of strings and numbers (i.e. the conversion is performed before schema-validation). Note that they can’t be "integer", since not-a-number and infinite values are only possible for floating-point numbers.

>>> ak.from_json(
...     '[1, 2, "nan", "inf", "-inf"]',
...     nan_string="nan",
...     posinf_string="inf",
...     neginf_string="-inf",
...     schema={"type": "array", "items": {"type": "number"}}
... )
<Array [1, 2, nan, inf, -inf] type='5 * float64'>

The complex_record_fields is a 2-tuple of field names (strings) of objects to identify as the real and imaginary parts of complex numbers. Complex number representations in JSON vary, though most are JSON objects with real and imaginary parts and possibly other fields. Any other fields will be excluded from the output array.

>>> ak.from_json(
...     '[{"r": 1, "i": 1.1, "other": ""}, {"r": 2, "i": 2.2, "other": ""}]',
...     complex_record_fields=("r", "i"),
... )
<Array [1+1.1j, 2+2.2j] type='2 * complex128'>

Without this rule, the array would be interpreted as an array of records:

>>> ak.from_json(
...     '[{"r": 1, "i": 1.1, "other": ""}, {"r": 2, "i": 2.2, "other": ""}]',
... )
<Array [{r: 1, i: 1.1, other: ''}, {...}] type='2 * {r: int64, i: float64, ...'>

When combined with a JSONSchema, you need to specify the object type (i.e. the conversion is performed after schema-validation). Note that even the fields that will be ignored by complex_record_fields need to be specified.

>>> ak.from_json(
...     '[{"r": 1, "i": 1.1, "other": ""}, {"r": 2, "i": 2.2, "other": ""}]',
...     complex_record_fields=("r", "i"),
...     schema={
...         "type": "array",
...         "items": {
...             "type": "object",
...             "properties": {
...                 "r": {"type": "number"},
...                 "i": {"type": "number"},
...                 "other": {"type": "string"},
...             },
...             "required": ["r", "i"],
...         },
...     },
... )
<Array [1+1.1j, 2+2.2j] type='2 * complex128'>