Diffstat (limited to 'seed/0105-pw_tokenizer-pw_log-nested-tokens.rst')
-rw-r--r-- | seed/0105-pw_tokenizer-pw_log-nested-tokens.rst | 470 |
1 files changed, 470 insertions, 0 deletions
diff --git a/seed/0105-pw_tokenizer-pw_log-nested-tokens.rst b/seed/0105-pw_tokenizer-pw_log-nested-tokens.rst
new file mode 100644
index 000000000..a4f2966ab
--- /dev/null
+++ b/seed/0105-pw_tokenizer-pw_log-nested-tokens.rst
@@ -0,0 +1,470 @@
+.. _seed-0105:
+
+===============================================
+0105: Nested Tokens and Tokenized Log Arguments
+===============================================
+
+.. seed::
+   :number: 105
+   :name: Nested Tokens and Tokenized Log Arguments
+   :status: Accepted
+   :proposal_date: 2023-07-10
+   :cl: 154190
+
+-------
+Summary
+-------
+This SEED describes a number of extensions to the `pw_tokenizer <https://pigweed.dev/pw_tokenizer/>`_
+and `pw_log_tokenized <https://pigweed.dev/pw_log_tokenized>`_ modules to
+improve support for nesting tokens and add facilities for tokenizing arguments
+to logs such as strings and enums. This SEED primarily addresses C/C++
+tokenization and Python/C++ detokenization.
+
+----------
+Motivation
+----------
+Currently, ``pw_tokenizer`` and ``pw_log_tokenized`` enable devices with limited
+memory to store long log format strings as hashed 32-bit tokens. When logs are
+moved off-device, host tooling can recover the full logs using token databases
+that were created when building the device image. However, logs may still have
+runtime string arguments that are stored and transferred 1:1 without additional
+encoding. This SEED aims to extend tokenization to these arguments to further
+reduce the weight of logging for embedded applications.
+
+The proposed changes affect both the tokenization module itself and the logging
+facilities built on top of tokenization.
+
+--------
+Proposal
+--------
+Logging enums such as ``pw::Status`` is one common special case where
+tokenization is particularly appropriate: enum values are conceptually
+already tokens mapping to their names, assuming no duplicate values.
+Logging enums frequently entails creating functions and string names that
+occupy space exclusively for logging purposes, which this proposal seeks to
+mitigate.
+Here, ``pw::Status::NotFound()`` is presented as an illustrative example of
+the several transformations that strings undergo during tokenization and
+detokenization, further complicated in the proposed design by nested tokens.
+
+.. list-table:: Enum Tokenization/Detokenization Phases
+   :widths: 20 45
+
+   * - (1) Source code
+     - ``PW_LOG("Status: " PW_LOG_ENUM_FMT(pw::Status), status.code())``
+   * - (2) Token database entries (token, string, domain)
+     - | ``16170adf, "Status: ${pw::Status}#%08x", ""``
+       | ``5, "PW_STATUS_NOT_FOUND", "pw::Status"``
+   * - (3) Wire format
+     - ``df 0a 17 16 0a`` (5 bytes)
+   * - (4) Top-level detokenized and formatted
+     - ``"Status: ${pw::Status}#00000005"``
+   * - (5) Fully detokenized
+     - ``"Status: PW_STATUS_NOT_FOUND"``
+
+Compared to log tokenization without nesting, string literals in token
+database entries may not be identical to what is typed in source code due
+to the use of macros and preprocessor string concatenation. The
+detokenizer also takes an additional step to recursively detokenize any
+nested tokens. In exchange for this added complexity, nested enum tokenization
+allows us to gain the readability of logging value names with zero additional
+runtime space or performance cost compared to logging the integral values
+directly with ``pw_log_tokenized``.
+
+.. note::
+   Without nested enum token support, users can select either readability or
+   reduced binary and transmission size, but not easily both:
+
+   .. list-table::
+      :widths: 15 20 20
+      :header-rows: 1
+
+      * -
+        - Raw integers
+        - String names
+      * - (1) Source code
+        - ``PW_LOG("Status: %x", status.code())``
+        - ``PW_LOG("Status: %s", pw_StatusString(status))``
+      * - (2) Token database entries (token, string, domain)
+        - ``03a83461, "Status: %x", ""``
+        - ``069c3ef0, "Status: %s", ""``
+      * - (3) Wire format
+        - ``61 34 a8 03 0a`` (5 bytes)
+        - ``f0 3e 9c 06 09 4e 4f 54 5f 46 4f 55 4e 44`` (14 bytes)
+      * - (4) Top-level detokenized and formatted
+        - ``"Status: 5"``
+        - ``"Status: NOT_FOUND"``
+      * - (5) Fully detokenized
+        - ``"Status: 5"``
+        - ``"Status: NOT_FOUND"``
+
+Tokenization (C/C++)
+====================
+The ``pw_log_tokenized`` module exposes a set of macros for creating and
+formatting nested tokens. Within format strings in the source code, tokens
+are specified using function-like PRI-style macros. These can be used to
+encode static information like the token domain or a numeric base encoding
+and are macro-expanded to string literals that are concatenated with the
+rest of the format string during preprocessing. Since ``pw_log`` generally
+uses printf syntax, only bases 8, 10, and 16 are supported for integer token
+arguments via ``%[odiuxX]``.
+
+The provided macros enforce the token specifier syntax and keep the argument
+types in sync when switching between ``pw_log`` backends such as
+``pw_log_basic``. These macros for basic usage are as follows:
+
+* ``PW_LOG_TOKEN`` and ``PW_LOG_TOKEN_EXPR`` are used to tokenize string args.
+* ``PW_LOG_TOKEN_FMT`` is used inside the format string to specify a token arg.
+* ``PW_LOG_TOKEN_TYPE`` is used if the type of a tokenized arg needs to be
+  referenced, e.g. as a ``ToString`` function return type.
+
+.. code-block:: cpp
+
+   #include "pw_log/log.h"
+   #include "pw_log/tokenized_args.h"
+
+   // token with default options base-16 and empty domain
+   // token database literal: "The sun will come out $#%08x!"
+   PW_LOG("The sun will come out " PW_LOG_TOKEN_FMT() "!",
+          PW_LOG_TOKEN_EXPR("tomorrow"));
+   // after detokenization: "The sun will come out tomorrow!"
+
+Additional macros are also provided specifically for enum handling. The
+``TOKENIZE_ENUM`` macro creates ELF token database entries for each enum
+value with the specified token domain to prevent token collision between
+multiple tokenized enums. This macro is kept separate from the enum
+definition to allow things like tokenizing a preexisting enum defined in an
+external dependency.
+
+.. code-block:: cpp
+
+   // enums
+   namespace foo {
+
+   enum class Color { kRed, kGreen, kBlue };
+
+   // syntax TBD
+   TOKENIZE_ENUM(
+     foo::Color,
+     kRed,
+     kGreen,
+     kBlue
+   )
+
+   }  // namespace foo
+
+   void LogColor(foo::Color color) {
+     // token database literal:
+     // "Color: [${foo::Color}10#%010d]"
+     // The scoped enum is cast to an integer to match the %010d specifier.
+     PW_LOG("Color: [" PW_LOG_ENUM_FMT(foo::Color, 10) "]",
+            static_cast<int>(color));
+     // after detokenization:
+     // e.g. "Color: [kRed]"
+   }
+
+.. admonition:: Nested Base64 tokens
+
+   ``PW_LOG_TOKEN_FMT`` can accept 64 as the base encoding for an argument, in
+   which case the argument should be a pre-encoded Base64 string argument
+   (e.g. ``QAzF39==``). However, this should be avoided when possible to
+   maximize space savings. Fully-formatted Base64 including the token prefix
+   may also be logged with ``%s`` as before.
+
+Detokenization (Python)
+=======================
+``Detokenizer.detokenize`` in Python (``Detokenizer::Detokenize`` in C++)
+will automatically recursively detokenize tokens of all known formats rather
+than requiring a separate call to ``detokenize_base64`` or similar.
+
+To support detokenizing domain-specific tokens, token databases support multiple
+domains, and ``database.py create`` will build a database with tokens from all
+domains by default. Specifying a domain during database creation will cause
+that domain to be treated as the default.
+
+When detokenization fails, tokens appear as-is in logs.
+If the detokenizer has the ``show_errors`` option set to ``True``, error
+messages may be printed inline following the raw token.
+
+Tokens
+======
+Many details described here are provided via the ``PW_LOG_TOKEN_FMT`` macro, so
+users should typically not be manually formatting tokens. However, if
+detokenization fails for any reason, tokens will appear with the following
+format in the final logs and should be easily recognizable.
+
+Nested tokens have the following structure in partially detokenized logs
+(transformation stage 4):
+
+.. code-block::
+
+   $[{DOMAIN}][BASE#]TOKEN
+
+The ``$`` is a common prefix required for all nested tokens. It is possible to
+configure a different common prefix if necessary, but using the default ``$``
+character is strongly recommended.
+
+.. list-table:: Options
+   :widths: 10 30
+
+   * - ``{DOMAIN}``
+     - Specifies the token domain. If this option is omitted, the default
+       (empty) domain is assumed.
+   * - ``BASE#``
+     - Defines the numeric base encoding of the token. Accepted values are 8,
+       10, 16, and 64. If the hash symbol ``#`` is used without specifying a
+       number, the base is assumed to be 16. If the base option is omitted
+       entirely, the base defaults to 64 for backward compatibility. All
+       encodings except Base64 are not case sensitive.
+
+       This option may be expanded to support other bases in the future.
+   * - ``TOKEN`` (required)
+     - The numeric representation of the token in the given base encoding. All
+       encodings except Base64 are left-padded with zeroes to the maximum width
+       of a 32-bit integer in the given base. Base64 data may additionally encode
+       string arguments for the detokenized token, and therefore does not have a
+       maximum width. This is automatically handled by ``PW_LOG_TOKEN_FMT`` for
+       supported bases.
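+
+As a rough illustration of the text form described above, the sketch below
+renders a 32-bit token for bases 8, 10, and 16 using the fixed widths listed
+in the options table. ``FormatNestedToken`` is a hypothetical helper written
+only for this example and is not part of ``pw_tokenizer``; Base64 tokens are
+deliberately left out.
+
+.. code-block:: cpp
+
+   #include <cinttypes>
+   #include <cstdint>
+   #include <cstdio>
+   #include <string>
+
+   // Hypothetical helper (not a pw_tokenizer API): renders a token in the
+   // $[{DOMAIN}][BASE#]TOKEN form. Returns an empty string for any base
+   // other than 8, 10, or 16.
+   std::string FormatNestedToken(const std::string& domain, int base,
+                                 uint32_t token) {
+     char digits[16];
+     std::string base_prefix;
+     switch (base) {
+       case 8:
+         base_prefix = "8#";
+         // A 32-bit value needs at most 11 octal digits.
+         std::snprintf(digits, sizeof(digits), "%011" PRIo32, token);
+         break;
+       case 10:
+         base_prefix = "10#";
+         // A 32-bit value needs at most 10 decimal digits.
+         std::snprintf(digits, sizeof(digits), "%010" PRIu32, token);
+         break;
+       case 16:
+         // A bare '#' with no number means base 16.
+         base_prefix = "#";
+         std::snprintf(digits, sizeof(digits), "%08" PRIX32, token);
+         break;
+       default:
+         return "";
+     }
+     std::string result = "$";
+     if (!domain.empty()) {
+       result += "{" + domain + "}";
+     }
+     result += base_prefix;
+     result += digits;
+     return result;
+   }
+
+Under these assumptions, ``FormatNestedToken("foo_namespace::MyEnum", 16, 0x1A)``
+yields ``${foo_namespace::MyEnum}#0000001A`` and
+``FormatNestedToken("", 10, 86025943)`` yields ``$10#0086025943``, which match
+the raw token forms shown in this section.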
+
+When used in conjunction with ``pw_log_tokenized``, the token prefix (including
+any domain and base specifications) is tokenized as part of the log format
+string and therefore incurs zero additional memory or transmission cost over
+that of the original format string. Over the wire, tokens in bases 8, 10, and
+16 are transmitted as varint-encoded integers up to 5 bytes in size. Base64
+tokens continue to be encoded as strings.
+
+.. warning::
+   Tokens do not have a terminating character in general, which is why we
+   require them to be formatted with fixed width. Otherwise, following them
+   immediately with alphanumeric characters valid in their base encoding
+   will cause detokenization errors.
+
+.. admonition:: Recognizing raw nested tokens in strings
+
+   When a string is fully detokenized, there should no longer be any indication
+   of tokenization in the final result, e.g. detokenized logs should read the
+   same as plain string logs. However, if nested tokens cannot be detokenized for
+   any reason, they will appear in their raw form as below:
+
+   .. code-block::
+
+      // Base64 token with no arguments and empty domain
+      $QA19pfEQ
+
+      // Base-10 token
+      $10#0086025943
+
+      // Base-16 token with specified domain
+      ${foo_namespace::MyEnum}#0000001A
+
+      // Base64 token with specified domain
+      ${bar_namespace::MyEnum}QAQQQQ==
+
+
+---------------------
+Problem investigation
+---------------------
+Complex embedded device projects are perpetually seeking more RAM. For longer
+descriptive string arguments, even just a handful can take up hundreds of bytes
+that frequently exist exclusively for logging purposes, without any impact on
+function.
+
+One of the most common potential use cases is for logging enum values.
+Inspection of one project revealed that enums accounted for some 90% of the
+string log arguments.
+We have encountered instances where, to save space, developers have avoided
+logging descriptive names in favor of raw enum values, forcing readers of logs
+to look up or memorize the meanings of each number. As with log format strings,
+we do know the set of possible string values that might be emitted in the
+final logs, so they should be able to be extracted into a token database at
+compile time.
+
+Another major challenge overall is maintaining a user interface
+that is easy to understand and use. The current primary interface through
+``pw_log`` provides printf-style formatting, which is familiar and succinct
+for basic applications.
+
+We also have to contend with the interchangeable backends of ``pw_log``. The
+``pw_log`` facade is intended as an opaque interface layer; adding syntax
+specifically for tokenized logging will break this abstraction barrier. Either
+this additional syntax would be ignored by other backends, or it might simply
+be incompatible (e.g. logging raw integer tokens instead of strings).
+
+Pigweed already supports one form of nested tokens via Base64 encoding. Base64
+tokens begin with ``'$'``, followed by Base64-encoded data, and may be padded
+with one or two trailing ``'='`` symbols. The Python
+``Detokenizer.detokenize_base64`` method recursively detokenizes Base64 by
+running a regex replacement on the formatted results of each iteration. Base64
+is not merely a token format, however; it can encode any binary data in a text
+format at the cost of reduced efficiency. Therefore, Base64 tokens may include
+not only a database token that may detokenize to a format string but also
+binary-encoded arguments. Other token types are not expected to include this
+additional argument data.
+
+---------------
+Detailed design
+---------------
+
+Tokenization
+============
+``pw_tokenizer`` and ``pw_log_tokenized`` already provide much of the necessary
+functionality to support tokenized arguments.
+The proposed API is fully backward-compatible with non-nested tokenized
+logging.
+
+Token arguments are indicated in log format strings via PRI-style macros that
+are exposed by a new ``pw_log/tokenized_args.h`` header. ``PW_LOG_TOKEN_FMT``
+supplies the ``$`` token prefix, brackets around the domain, the base specifier,
+and the printf-style specifier including padding and width, i.e. ``%011o`` for
+base-8, ``%010u`` for base-10, and ``%08X`` for base-16.
+
+For free-standing string arguments such as those where the literals are defined
+in the log statements themselves, tokenization is performed with macros from
+``pw_log/tokenized_args.h``. With the tokenized logging backend, these macros
+simply alias the corresponding ``PW_TOKENIZE`` macros, but they also revert to
+basic string formatting for other backends. This is achieved by placing an
+empty header file in the local ``public_overrides`` directory of
+``pw_log_tokenized`` and checking for it in ``pw_log/tokenized_args.h`` using
+the ``__has_include`` directive.
+
+For variable string arguments, the API is split across locations. The string
+literals are tokenized wherever they are defined, and the string format macros
+appear in the log format strings corresponding to those string arguments.
+
+When tokens use non-default domains, additional work may be required to create
+the domain name and store associated tokens in the ELF.
+
+Enum Tokenization
+-----------------
+We use existing ``pw_tokenizer`` utilities to record the raw enum values as
+tokens corresponding to their string names in the ELF. There is no change
+required for the backend implementation; we simply skip the token calculation
+step, since we already have a value to use, and specifying a token domain is
+generally required to isolate multiple enums from token collision.
+
+For ease of use, we can also provide a macro that wraps the enum value list
+and encapsulates the recording of each token value-string pair in the ELF.
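+
+The Proposal table gives ``df 0a 17 16 0a`` as the wire format for a tokenized
+``pw::Status`` log. The sketch below shows one way those five bytes could
+arise, assuming the format string token is sent as four little-endian bytes
+and the enum value follows as a ZigZag varint. ``EncodeTokenWithIntArg`` is a
+hypothetical function for illustration only, not a ``pw_tokenizer`` API, and
+the encoding details are an assumption chosen to reproduce the bytes in the
+table.
+
+.. code-block:: cpp
+
+   #include <cstdint>
+   #include <vector>
+
+   // Hypothetical helper: encodes a 32-bit token followed by one integer
+   // argument. The layout is an assumption for this example.
+   std::vector<uint8_t> EncodeTokenWithIntArg(uint32_t token, int32_t arg) {
+     std::vector<uint8_t> out;
+     // Token: four bytes, least significant byte first.
+     for (int shift = 0; shift < 32; shift += 8) {
+       out.push_back(static_cast<uint8_t>(token >> shift));
+     }
+     // ZigZag maps values of small magnitude, positive or negative, to
+     // small unsigned values.
+     uint32_t zigzag =
+         (static_cast<uint32_t>(arg) << 1) ^ (arg < 0 ? 0xFFFFFFFFu : 0u);
+     // Varint: 7 payload bits per byte, high bit set on all but the last.
+     while (zigzag >= 0x80) {
+       out.push_back(static_cast<uint8_t>(zigzag | 0x80));
+       zigzag >>= 7;
+     }
+     out.push_back(static_cast<uint8_t>(zigzag));
+     return out;
+   }
+
+With this encoding, ``EncodeTokenWithIntArg(0x16170adf, 5)`` produces
+``df 0a 17 16 0a``, and any enum value from 0 to 63 occupies a single argument
+byte.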
+
+When actually logging the values, users pass the enum type name as the domain
+to the format specifier macro ``PW_LOG_ENUM_FMT()``, and the enum values can be
+passed as-is to ``PW_LOG`` (casting to integers as necessary for scoped enums).
+Since integers are varint-encoded over the wire, this will only require a
+single byte for most enums.
+
+.. admonition:: Logging pw::Status
+
+   Note that while this immediately reduces transmission size, the code
+   space occupied by the string names in ``pw::Status::str()`` cannot be
+   recovered unless an entire project is converted to log ``pw::Status``
+   as tokens.
+
+   .. code:: cpp
+
+      #include "pw_log/log.h"
+      #include "pw_log/tokenized_args.h"
+      #include "pw_status/status.h"
+
+      pw::Status status = pw::Status::NotFound();
+
+      // "pw::Status: ${pw::Status}#%08x"
+      PW_LOG("pw::Status: " PW_LOG_ENUM_FMT(pw::Status), status.code());
+      // "pw::Status: NOT_FOUND"
+
+Since the token mapping entries in the ELF are optimized out of the final
+binary, the enum domains are tokenized away as part of the log format strings,
+and we don't need to store separate tokens for each enum value, this addition
+to the API would provide enum value names in logs with zero additional
+RAM cost. Compared to logging strings with ``ToString``-style functions, we
+save space on the string names as well as the functions themselves.
+
+Token Database
+==============
+Token databases will be expanded to include a column for domains, so that
+multiple domains can be encompassed in a single database rather than requiring
+separate databases for each domain. This is important because domains are being
+used to categorize tokens within a single project, rather than merely keeping
+separate projects distinct from each other. When creating a database
+from an ELF, a domain may be specified as the default domain instead of the
+empty domain.
+A list of domains or a path to a file with a list of domains may also
+separately be specified to define which domains are to be included in
+the database; all domains are now included by default.
+
+When accessing a token database, both a domain and token value may be specified
+to access specific values. If a domain is not specified, the default domain
+will be assumed, retaining the same behavior as before.
+
+Detokenization
+==============
+Detokenization is relatively straightforward. When the detokenizer is called,
+it will first detokenize and format the top-level token and binary argument
+data. The detokenizer will then find and replace nested tokens in the resulting
+formatted string, then rescan the result for more nested tokens up to a fixed
+number of rescans.
+
+For each token type or format, ``pw_tokenizer`` defines a regular expression to
+match the expected formatted output token and a helper function to convert a
+token from a particular format to its mapped value. The regular expressions for
+each token type are combined into a single regex that matches any one of the
+formats. At each recursive step for every match, each detokenization format
+will be attempted, stopping at the first successful token type and then
+recursively replacing all nested tokens in the result. Only full
+data-encoding tokens like Base64 will also require string/argument formatting
+as part of the recursive step.
+
+For non-Base64 tokens, a token's base encoding as specified by ``BASE#``
+determines its set of permissible alphanumeric characters and the
+maximum token width for regex matching.
+
+If nested detokenization fails for any reason, the formatted token will be
+printed as-is in the output logs. If ``show_errors`` is true for the
+detokenizer, errors will appear in parentheses immediately following the
+token.
+Supported errors include:
+
+* ``(token collision)``
+* ``(missing database)``
+* ``(token not found)``
+
+------------
+Alternatives
+------------
+
+Protobuf-based Tokenization
+===========================
+Tokenization may be expanded to function on structured data via protobufs.
+This can be used to make logging more flexible, as all manner of compile-time
+metadata can be freely attached to log arguments at effectively no cost.
+This will most likely involve a separate build process to generate and tokenize
+partially-populated protos and will significantly change the user API. It
+will also be a large break from the existing process in implementation, as
+the current system relies only on existing C preprocessor and C++ constexpr
+tricks to function.
+
+In this model, the token domain would likely be a fully-qualified
+namespace for or path to the proto definition.
+
+Implementing this approach also requires a method of passing ordered arguments
+to a partially-filled detokenized protobuf in a manner similar to printf-style
+string formatting, so that argument data can be efficiently encoded and
+transmitted alongside the protobuf's token, and the arguments to a particular
+proto can be disambiguated from arguments to the rest of a log statement.
+
+This approach will also most likely preclude plain string logging as is
+currently supported by ``pw_log``, as the implementations diverge dramatically.
+However, if pursued, this would likely be made the default logging schema
+across all platforms, including host devices.
+
+Custom Detokenization
+=====================
+Theoretically, individual projects could implement their own regex replacement
+schemes on top of Pigweed's detokenizer, allowing them to more flexibly define
+complex relationships between logged tokens via custom log format string
+syntax. However, Pigweed should provide utilities for nested tokenization in
+common cases such as logging enums.
+
+The changes proposed do not preclude additional custom detokenization schemas
+if absolutely necessary, and such practices do not appear to have been popular
+thus far in any case.
+
+--------------
+Open questions
+--------------
+Missing API definitions:
+
+* Updated APIs for creating and accessing token databases with multiple domains
+* Python nested tokenization
+* C++ nested detokenization