1 files changed, 470 insertions, 0 deletions
diff --git a/seed/0105-pw_tokenizer-pw_log-nested-tokens.rst b/seed/0105-pw_tokenizer-pw_log-nested-tokens.rst
new file mode 100644
index 000000000..a4f2966ab
--- /dev/null
+++ b/seed/0105-pw_tokenizer-pw_log-nested-tokens.rst
@@ -0,0 +1,470 @@
+.. _seed-0105:
+
+===============================================
+0105: Nested Tokens and Tokenized Log Arguments
+===============================================
+
+.. seed::
+   :number: 105
+   :name: Nested Tokens and Tokenized Log Arguments
+   :status: Accepted
+   :proposal_date: 2023-07-10
+   :cl: 154190
+
+-------
+Summary
+-------
+This SEED describes a number of extensions to the `pw_tokenizer <https://pigweed.dev/pw_tokenizer/>`_
+and `pw_log_tokenized <https://pigweed.dev/pw_log_tokenized>`_ modules to
+improve support for nesting tokens and add facilities for tokenizing arguments
+to logs such as strings or and enums. This SEED primarily addresses C/C++
+tokenization and Python/C++ detokenization.
+
+----------
+Motivation
+----------
+Currently, ``pw_tokenizer`` and ``pw_log_tokenized`` enable devices with limited
+memory to store long log format strings as hashed 32-bit tokens. When logs are
+moved off-device, host tooling can recover the full logs using token databases
+that were created when building the device image. However, logs may still have
+runtime string arguments that are stored and transferred 1:1 without additional
+encoding. This SEED aims to extend tokenization to these arguments to further
+reduce the weight of logging for embedded applications.
+
+The proposed changes affect both the tokenization module itself and the logging
+facilities built on top of tokenization.
+
+--------
+Proposal
+--------
+Logging enums such as ``pw::Status`` is one common special case where
+tokenization is particularly appropriate: enum values are conceptually
+already tokens mapping to their names, assuming no duplicate values. Logging
+enums frequently entails creating functions and string names that occupy space
+exclusively for logging purposes, which this proposal seeks to mitigate.
+Here, ``pw::Status::NotFound()`` is presented as an illustrative example of
+the several transformations that strings undergo during tokenization and
+detokenization, further complicated in the proposed design by nested tokens.
+
+.. list-table:: Enum Tokenization/Detokenization Phases
+   :widths: 20 45
+
+   * - (1) Source code
+     - ``PW_LOG("Status: " PW_LOG_ENUM_FMT(pw::Status), status.code())``
+   * - (2) Token database entries (token, string, domain)
+     - | ``16170adf, "Status: ${pw::Status}#%08x", ""``
+       | ``5       , "PW_STATUS_NOT_FOUND"       , "pw::Status"``
+   * - (3) Wire format
+     - ``df 0a 17 16 0a`` (5 bytes)
+   * - (4) Top-level detokenized and formatted
+     - ``"Status: ${pw::Status}#00000005"``
+   * - (5) Fully detokenized
+     - ``"Status: PW_STATUS_NOT_FOUND"``
+
+Compared to log tokenization without nesting, string literals in token
+database entries may not be identical to what is typed in source code due
+to the use of macros and preprocessor string concatenation. The
+detokenizer also takes an additional step to recursively detokenize any
+nested tokens. In exchange for this added complexity, nested enum tokenization
+allows us to gain the readability of logging value names with zero additional
+runtime space or performance cost compared to logging the integral values
+directly with ``pw_log_tokenized``.
+
+.. note::
+  Without nested enum token support, users can select either readability or
+  reduced binary and transmission size, but not easily both:
+
+  .. list-table::
+    :widths: 15 20 20
+    :header-rows: 1
+
+    * -
+      - Raw integers
+      - String names
+    * - (1) Source code
+      - ``PW_LOG("Status: %x" , status.code())``
+      - ``PW_LOG("Status: %s" , pw_StatusString(status))``
+    * - (2) Token database entries (token, string, domain)
+      - ``03a83461, "Status: %x", ""``
+      - ``069c3ef0, "Status: %s", ""``
+    * - (3) Wire format
+      - ``61 34 a8 03 0a`` (5 bytes)
+      - ``f0 3e 9c 06 09 4e 4f 54 5f 46 4f 55 4e 44`` (14 bytes)
+    * - (4) Top-level detokenized and formatted
+      - ``"Status: 5"``
+      - ``"Status: PW_STATUS_NOT_FOUND"``
+    * - (5) Fully detokenized
+      - ``"Status: 5"``
+      - ``"Status: PW_STATUS_NOT_FOUND"``
+
+Tokenization (C/C++)
+====================
+The ``pw_log_tokenized`` module exposes a set of macros for creating and
+formatting nested tokens. Within format strings in the source code, tokens
+are specified using function-like PRI-style macros. These can be used to
+encode static information like the token domain or a numeric base encoding
+and are macro-expanded to string literals that are concatenated with the
+rest of the format string during preprocessing. Since ``pw_log`` generally
+uses printf syntax, only bases 8, 10, and 16 are supported for integer token
+arguments via ``%[odiuxX]``.
+
+The provided macros enforce the token specifier syntax and keep the argument
+types in sync when switching between other ``pw_log`` backends like
+``pw_log_basic``. These macros for basic usage are as follows:
+
+* ``PW_LOG_TOKEN`` and ``PW_LOG_TOKEN_EXPR`` are used to tokenize string args.
+* ``PW_LOG_TOKEN_FMT`` is used inside the format string to specify a token arg.
+* ``PW_LOG_TOKEN_TYPE`` is used if the type of a tokenized arg needs to be
+  referenced, e.g. as a ``ToString`` function return type.
+
+.. code-block:: cpp
+
+  #include "pw_log/log.h"
+  #include "pw_log/tokenized_args.h"
+
+  // token with default options base-16 and empty domain
+  // token database literal: "The sun will come out $#%08x!"
+  PW_LOG("The sun will come out " PW_LOG_TOKEN_FMT() "!", PW_LOG_TOKEN_EXPR("tomorrow"))
+  // after detokenization: "The sun will come out tomorrow!"
+
+Additional macros are also provided specifically for enum handling. The
+``TOKENIZE_ENUM`` macro creates ELF token database entries for each enum
+value with the specified token domain to prevent token collision between
+multiple tokenized enums. This macro is kept separate from the enum
+definition to allow things like tokenizing a preexisting enum defined in an
+external dependency.
+
+.. code-block:: cpp
+
+  // enums
+  namespace foo {
+
+    enum class Color { kRed, kGreen, kBlue };
+
+    // syntax TBD
+    TOKENIZE_ENUM(
+      foo::Color,
+      kRed,
+      kGreen,
+      kBlue
+    )
+
+  } // namespace foo
+
+  void LogColor(foo::Color color) {
+    // token database literal:
+    // "Color: [${foo::Color}10#%010d]"
+    PW_LOG("Color: [" PW_LOG_ENUM_FMT(foo::Color, 10) "]", color)
+    // after detokenization:
+    // e.g. "Color: kRed"
+  }
+
+.. admonition:: Nested Base64 tokens
+
+  ``PW_LOG_TOKEN_FMT`` can accept 64 as the base encoding for an argument, in
+  which case the argument should be a pre-encoded Base64 string argument
+  (e.g. ``QAzF39==``). However, this should be avoided when possible to
+  maximize space savings. Fully-formatted Base64 including the token prefix
+  may also be logged with ``%s`` as before.
+
+Detokenization (Python)
+=======================
+``Detokenizer.detokenize`` in Python (``Detokenizer::Detokenize`` in C++)
+will automatically recursively detokenize tokens of all known formats rather
+than requiring a separate call to ``detokenize_base64`` or similar.
+
+To support detokenizing domain-specific tokens, token databases support multiple
+domains, and ``database.py create`` will build a database with tokens from all
+domains by default. Specifying a domain during database creation will cause
+that domain to be treated as the default.
+
+When detokenization fails, tokens appear as-is in logs. If the detokenizer has
+the ``show_errors`` option set to ``True``, error messages may be printed
+inline following the raw token.
+
+Tokens
+======
+Many details described here are provided via the ``PW_LOG_TOKEN_FMT`` macro, so
+users should typically not be manually formatting tokens. However, if
+detokenization fails for any reason, tokens will appear with the following
+format in the final logs and should be easily recognizable.
+
+Nested tokens have the following structure in partially detokenized logs
+(transformation stage 4):
+
+.. code-block::
+
+   $[{DOMAIN}][BASE#]TOKEN
+
+The ``$`` is a common prefix required for all nested tokens. It is possible to
+configure a different common prefix if necessary, but using the default ``$``
+character is strongly recommended.
+
+.. list-table:: Options
+   :widths: 10 30
+
+   * - ``{DOMAIN}``
+     - Specifies the token domain. If this option is omitted, the default
+       (empty) domain is assumed.
+   * - ``BASE#``
+     - Defines the numeric base encoding of the token. Accepted values are 8,
+       10, 16, and 64. If the hash symbol ``#`` is used without specifying a
+       number, the base is assumed to be 16. If the base option is omitted
+       entirely, the base defaults to 64 for backward compatibility. All
+       encodings except Base64 are not case sensitive.
+
+       This option may be expanded to support other bases in the future.
+   * - ``TOKEN`` (required)
+     - The numeric representation of the token in the given base encoding. All
+       encodings except Base64 are left-padded with zeroes to the maximum width
+       of a 32-bit integer in the given base. Base64 data may additionally encode
+       string arguments for the detokenized token, and therefore does not have a
+       maximum width. This is automatically handled by ``PW_LOG_TOKEN_FMT`` for
+       supported bases.
+
+When used in conjunction with ``pw_log_tokenized``, the token prefix (including
+any domain and base specifications) is tokenized as part of the log format
+string and therefore incurs zero additional memory or transmission cost over
+that of the original format string. Over the wire, tokens in bases 8, 10, and
+16 are transmitted as varint-encoded integers up to 5 bytes in size. Base64
+tokens continue to be encoded as strings.
+
+.. warning::
+  Tokens do not have a terminating character in general, which is why we
+  require them to be formatted with fixed width. Otherwise, following them
+  immediately with alphanumeric characters valid in their base encoding
+  will cause detokenization errors.
+
+.. admonition:: Recognizing raw nested tokens in strings
+
+  When a string is fully detokenized, there should no longer be any indication
+  of tokenization in the final result, e.g. detokenized logs should read the
+  same as plain string logs. However, if nested tokens cannot be detokenized for
+  any reason, they will appear in their raw form as below:
+
+  .. code-block::
+
+     // Base64 token with no arguments and empty domain
+     $QA19pfEQ
+
+     // Base-10 token
+     $10#0086025943
+
+     // Base-16 token with specified domain
+     ${foo_namespace::MyEnum}#0000001A
+
+     // Base64 token with specified domain
+     ${bar_namespace::MyEnum}QAQQQQ==
+
+
+---------------------
+Problem investigation
+---------------------
+Complex embedded device projects are perpetually seeking more RAM. For longer
+descriptive string arguments, even just a handful can take up hundreds of bytes
+that are frequently exclusively for logging purposes, without any impact on
+function.
+
+One of the most common potential use cases is for logging enum values.
+Inspection of one project revealed that enums accounted for some 90% of the
+string log arguments. We have encountered instances where, to save space,
+developers have avoided logging descriptive names in favor of raw enum values,
+forcing readers of logs look up or memorize the meanings of each number. Like
+with log format strings, we do know the set of possible string values that
+might be emitted in the final logs, so they should be able to be extracted
+into a token database at compile time.
+
+Another major challenge overall is maintaining a user interface
+that is easy to understand and use. The current primary interface through
+``pw_log`` provides printf-style formatting, which is familiar and succinct
+for basic applications.
+
+We also have to contend with the interchangeable backends of ``pw_log``. The
+``pw_log`` facade is intended as an opaque interface layer; adding syntax
+specifically for tokenized logging will break this abstraction barrier. Either
+this additional syntax would be ignored by other backends, or it might simply
+be incompatible (e.g. logging raw integer tokens instead of strings).
+
+Pigweed already supports one form of nested tokens via Base64 encoding. Base64
+tokens begin with ``'$'``, followed by Base64-encoded data, and may be padded
+with one or two trailing ``'='`` symbols. The Python
+``Detokenizer.detokenize_base64`` method recursively detokenizes Base64 by
+running a regex replacement on the formatted results of each iteration. Base64
+is not merely a token format, however; it can encode any binary data in a text
+format at the cost of reduced efficiency. Therefore, Base64 tokens may include
+not only a database token that may detokenize to a format string but also
+binary-encoded arguments. Other token types are not expected to include this
+additional argument data.
+
+---------------
+Detailed design
+---------------
+
+Tokenization
+============
+``pw_tokenizer`` and ``pw_log_tokenized`` already provide much of the necessary
+functionality to support tokenized arguments. The proposed API is fully
+backward-compatible with non-nested tokenized logging.
+
+Token arguments are indicated in log format strings via PRI-style macros that
+are exposed by a new ``pw_log/tokenized_args.h`` header. ``PW_LOG_TOKEN_FMT``
+supplies the ``$`` token prefix, brackets around the domain, the base specifier,
+and the printf-style specifier including padding and width, i.e. ``%011o`` for
+base-8, ``%010u`` for base-10, and ``%08X`` for base-16.
+
+For free-standing string arguments such as those where the literals are defined
+in the log statements themselves, tokenization is performed with macros from
+``pw_log/tokenized_args.h``. With the tokenized logging backend, these macros
+simply alias the corresponding ``PW_TOKENIZE`` macros, but they also revert to
+basic string formatting for other backends. This is achieved by placing an
+empty header file in the local ``public_overrides`` directory of
+``pw_log_tokenized`` and checking for it in ``pw_log/tokenized_args.h`` using
+the ``__has_include`` directive.
+
+For variable string arguments, the API is split across locations. The string
+literals are tokenized wherever they are defined, and the string format macros
+appear in the log format strings corresponding to those string arguments.
+
+When tokens use non-default domains, additional work may be required to create
+the domain name and store associated tokens in the ELF.
+
+Enum Tokenization
+-----------------
+We use existing ``pw_tokenizer`` utilities to record the raw enum values as
+tokens corresponding to their string names in the ELF. There is no change
+required for the backend implementation; we simply skip the token calculation
+step, since we already have a value to use, and specifying a token domain is
+generally required to isolate multiple enums from token collision.
+
+For ease of use, we can also provide a macro that wraps the enum value list
+and encapsulates the recording of each token value-string pair in the ELF.
+
+When actually logging the values, users pass the enum type name as the domain
+to format specifier macro ``PW_LOG_TOKEN()``, and the enum values can be
+passed as-is to ``PW_LOG`` (casting to integers as necessary for scoped enums).
+Since integers are varint-encoded over the wire, this will only require a
+single byte for most enums.
+
+.. admonition:: Logging pw::status
+
+  Note that while this immediately reduces transmission size, the code
+  space occupied by the string names in ``pw::Status::str()`` cannot be
+  recovered unless an entire project is converted to log ``pw::Status``
+  as tokens.
+
+  .. code:: cpp
+
+    #include "pw_log/log.h"
+    #include "pw_log/tokenized_args.h"
+    #include "pw_status/status.h"
+
+    pw::Status status = pw::Status::NotFound();
+
+    // "pw::Status: ${pw::Status}#%08d"
+    PW_LOG("pw::Status: " PW_LOG_TOKEN(pw::Status), status.code)
+    // "pw::Status: NOT_FOUND"
+
+Since the token mapping entries in the ELF are optimized out of the final
+binary, the enum domains are tokenized away as part of the log format strings,
+and we don't need to store separate tokens for each enum value, this addition
+to the API would would provide enum value names in logs with zero additional
+RAM cost. Compared to logging strings with ``ToString``-style functions, we
+save space on the string names as well as the functions themselves.
+
+Token Database
+==============
+Token databases will be expanded to include a column for domains, so that
+multiple domains can be encompassed in a single database rather than requiring
+separate databases for each domain. This is important because domains are being
+used to categorize tokens within a single project, rather than merely keeping
+separate projects distinct from each other. When creating a database
+from an ELF, a domain may be specified as the default domain instead of the
+empty domain. A list of domains or path to a file with a list of domains may
+also separately be specified to define which domains are to be included in
+the database; all domains are now included by default.
+
+When accessing a token database, both a domain and token value may be specified
+to access specific values. If a domain is not specified, the default domain
+will be assumed, retaining the same behavior as before.
+
+Detokenization
+==============
+Detokenization is relatively straightforward. When the detokenizer is called,
+it will first detokenize and format the top-level token and binary argument
+data. The detokenizer will then find and replace nested tokens in the resulting
+formatted string, then rescan the result for more nested tokens up to a fixed
+number of rescans.
+
+For each token type or format, ``pw_tokenizer`` defines a regular expression to
+match the expected formatted output token and a helper function to convert a
+token from a particular format to its mapped value. The regular expressions for
+each token type are combined into a single regex that matches any one of the
+formats. At each recursive step for every match, each detokenization format
+will be attempted, stopping at the first successful token type and then
+recursively replacing all nested tokens in the result. Only full data encoding-
+type tokens like Base64 will also require string/argument formatting as part of
+the recursive step.
+
+For non-Base64 tokens, a token's base encoding as specified by ``BASE#``
+determines its set of permissible alphanumeric characters and the
+maximum token width for regex matching.
+
+If nested detokenization fails for any reason, the formatted token will be
+printed as-is in the output logs. If ``show_errors`` is true for the
+detokenizer, errors will appear in parentheses immediately following the
+token. Supported errors include:
+
+* ``(token collision)``
+* ``(missing database)``
+* ``(token not found)``
+
+------------
+Alternatives
+------------
+
+Protobuf-based Tokenization
+===========================
+Tokenization may be expanded to function on structured data via protobufs.
+This can be used to make logging more flexible, as all manner of compile-time
+metadata can be freely attached to log arguments at effectively no cost.
+This will most likely involve a separate build process to generate and tokenize
+partially-populated protos and will significantly change the user API. It
+will also be a large break from the existing process in implementation, as
+the current system relies only on existing C preprocessor and C++ constexpr
+tricks to function.
+
+In this model, the token domain would likely be a fully-qualified
+namespace for or path to the proto definition.
+
+Implementing this approach also requires a method of passing ordered arguments
+to a partially-filled detokenized protobuf in a manner similar to printf-style
+string formatting, so that argument data can be efficiently encoded and
+transmitted alongside the protobuf's token, and the arguments to a particular
+proto can be disambiguated from arguments to the rest of a log statement.
+
+This approach will also most likely preclude plain string logging as is
+currently supported by ``pw_log``, as the implementations diverge dramatically.
+However, if pursued, this would likely be made the default logging schema
+across all platforms, including host devices.
+
+Custom Detokenization
+=====================
+Theoretically, individual projects could implement their own regex replacement
+schemes on top of Pigweed's detokenizer, allowing them to more flexibly define
+complex relationships between logged tokens via custom log format string
+syntax. However, Pigweed should provide utilities for nested tokenization in
+common cases such as logging enums.
+
+The changes proposed do not preclude additional custom detokenization schemas
+if absolutely necessary, and such practices do not appear to have been popular
+thus far in any case.
+
+--------------
+Open questions
+--------------
+Missing API definitions:
+
+* Updated APIs for creating and accessing token databases with multiple domains
+* Python nested tokenization
+* C++ nested detokenization
+