:tocdepth: 3

.. _module-pw_tokenizer-tokenization:

============
Tokenization
============
.. pigweed-module-subpage::
   :name: pw_tokenizer
   :tagline: Compress strings to shrink logs by 75%+

Tokenization converts a string literal to a token. If it's a printf-style
string, its arguments are encoded along with it. The results of tokenization
can be sent off device or stored in place of a full string.

--------
Concepts
--------
See :ref:`module-pw_tokenizer-get-started-overview` for a high-level
explanation of how ``pw_tokenizer`` works.

Token generation: fixed length hashing at compile time
=======================================================
String tokens are generated using a modified version of the x65599 hash used by
the SDBM project. All hashing is done at compile time.

In C code, strings are hashed with a preprocessor macro. For compatibility with
macros, the hash must be limited to a fixed maximum number of characters. This
value is set by ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. Increasing
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` increases the compilation time for C due to
the complexity of the hashing macros.

In C++, the tokenization macros use a ``constexpr`` hash function instead of a
preprocessor macro. This function works with strings of any length and has a
lower compilation time impact than the C macros. For consistency, C++
tokenization uses the same hash algorithm, but the calculated values will
differ between C and C++ for strings longer than
``PW_TOKENIZER_CFG_C_HASH_LENGTH`` characters.

Token encoding
==============
The token is a 32-bit hash calculated during compilation. The string is encoded
little-endian with the token followed by arguments, if any. For example, the
31-byte string ``You can go about your business.`` hashes to 0xdac9a244.
This is encoded as 4 bytes: ``44 a2 c9 da``.

Arguments are encoded as follows:

* **Integers** (1--10 bytes) --
  `ZigZag and varint encoded <https://developers.google.com/protocol-buffers/docs/encoding#signed-integers>`_,
  similarly to Protocol Buffers. Smaller values take fewer bytes.
* **Floating point numbers** (4 bytes) -- Single precision floating point.
* **Strings** (1--128 bytes) -- Length byte followed by the string contents.
  The top bit of the length byte indicates whether the string was truncated.
  The remaining 7 bits encode the string length, with a maximum of 127 bytes.
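
As a worked example, consider the message shown in the
:ref:`module-pw_tokenizer-cli-encoding` section later on this page: the format
string ``"There's... %d many of %s!"`` (token ``0xb6ef8b2d``) with the
arguments ``2`` and ``"them"`` encodes to 10 bytes.

.. code-block:: text

   2d 8b ef b6   token 0xb6ef8b2d, little-endian
   04            int argument 2 (ZigZag encoding maps 2 to 4)
   04            string length byte: top bit clear (not truncated), length 4
   74 68 65 6d   "them"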

.. TODO(hepler): insert diagram here!

.. tip::
   ``%s`` arguments can quickly fill a tokenization buffer. Keep ``%s``
   arguments short or avoid encoding them as strings (e.g. encode an enum as an
   integer instead of a string). See also
   :ref:`module-pw_tokenizer-nested-arguments`.

.. _module-pw_tokenizer-proto:

Tokenized fields in protocol buffers
====================================
Text may be represented in a few different ways:

- Plain ASCII or UTF-8 text (``This is plain text``)
- Base64-encoded tokenized message (``$ibafcA==``)
- Binary-encoded tokenized message (``89 b6 9f 70``)
- Little-endian 32-bit integer token (``0x709fb689``)

``pw_tokenizer`` provides the ``pw.tokenizer.format`` protobuf field option.
This option may be applied to a protobuf field to indicate that it may contain
a tokenized string. A string that is optionally tokenized is represented with a
single ``bytes`` field annotated with ``(pw.tokenizer.format) =
TOKENIZATION_OPTIONAL``.

For example, the following protobuf has one field that may contain a tokenized
string.

.. code-block:: protobuf

   message MessageWithOptionallyTokenizedField {
     bytes just_bytes = 1;
     bytes maybe_tokenized = 2 [(pw.tokenizer.format) = TOKENIZATION_OPTIONAL];
     string just_text = 3;
   }

-----------------------
Tokenization in C++ / C
-----------------------
To tokenize a string, include ``pw_tokenizer/tokenize.h`` and invoke one of the
``PW_TOKENIZE_*`` macros.

Tokenize string literals outside of expressions
===============================================
``pw_tokenizer`` provides macros for tokenizing string literals with no
arguments:

* :c:macro:`PW_TOKENIZE_STRING`
* :c:macro:`PW_TOKENIZE_STRING_DOMAIN`
* :c:macro:`PW_TOKENIZE_STRING_MASK`

The tokenization macros above cannot be used inside other expressions.

.. admonition:: **Yes**: Assign :c:macro:`PW_TOKENIZE_STRING` to a ``constexpr`` variable.
   :class: checkmark

   .. code-block:: cpp

      constexpr uint32_t kGlobalToken = PW_TOKENIZE_STRING("Wowee Zowee!");

      void Function() {
        constexpr uint32_t local_token = PW_TOKENIZE_STRING("Wowee Zowee?");
      }

.. admonition:: **No**: Use :c:macro:`PW_TOKENIZE_STRING` in another expression.
   :class: error

   .. code-block:: cpp

      void BadExample() {
        ProcessToken(PW_TOKENIZE_STRING("This won't compile!"));
      }

   Use :c:macro:`PW_TOKENIZE_STRING_EXPR` instead.

Tokenize inside expressions
===========================
An alternate set of macros is provided for use inside expressions. These make
use of lambda functions, so while they can be used inside expressions, they
require C++ and cannot be assigned to ``constexpr`` variables or be used with
special function variables like ``__func__``.

* :c:macro:`PW_TOKENIZE_STRING_EXPR`
* :c:macro:`PW_TOKENIZE_STRING_DOMAIN_EXPR`
* :c:macro:`PW_TOKENIZE_STRING_MASK_EXPR`

.. admonition:: When to use these macros

   Use :c:macro:`PW_TOKENIZE_STRING` and related macros to tokenize string
   literals that do not need %-style arguments encoded.

.. admonition:: **Yes**: Use :c:macro:`PW_TOKENIZE_STRING_EXPR` within other expressions.
   :class: checkmark

   .. code-block:: cpp

      void GoodExample() {
        ProcessToken(PW_TOKENIZE_STRING_EXPR("This will compile!"));
      }

.. admonition:: **No**: Assign :c:macro:`PW_TOKENIZE_STRING_EXPR` to a ``constexpr`` variable.
   :class: error

   .. code-block:: cpp

      constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR("This won't compile!");

   Instead, use :c:macro:`PW_TOKENIZE_STRING` to assign to a ``constexpr`` variable.

.. admonition:: **No**: Tokenize ``__func__`` in :c:macro:`PW_TOKENIZE_STRING_EXPR`.
   :class: error

   .. code-block:: cpp

      void BadExample() {
        // This compiles, but __func__ will not be the outer function's name, and
        // there may be compiler warnings.
        constexpr uint32_t wont_work = PW_TOKENIZE_STRING_EXPR(__func__);
      }

   Instead, use :c:macro:`PW_TOKENIZE_STRING` to tokenize ``__func__`` or similar macros.

Tokenize a message with arguments to a buffer
=============================================
* :c:macro:`PW_TOKENIZE_TO_BUFFER`
* :c:macro:`PW_TOKENIZE_TO_BUFFER_DOMAIN`
* :c:macro:`PW_TOKENIZE_TO_BUFFER_MASK`

.. admonition:: Why use this macro

   - Encode a tokenized message for consumption within a function.
   - Encode a tokenized message into an existing buffer.

   Avoid using ``PW_TOKENIZE_TO_BUFFER`` in widely expanded macros, such as a
   logging macro, because it will result in larger code size than passing the
   tokenized data to a function.
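
The following minimal sketch shows one way :c:macro:`PW_TOKENIZE_TO_BUFFER`
might be used. The buffer size and the ``SendLogData`` transport function are
hypothetical placeholders, and the sketch assumes the second macro argument is
a pointer to the buffer size that is updated with the encoded size; see the
macro's reference documentation for the exact interface.

.. code-block:: cpp

   #include <stddef.h>
   #include <stdint.h>

   #include "pw_tokenizer/tokenize.h"

   // Hypothetical function that transmits or stores the encoded message.
   void SendLogData(const uint8_t* data, size_t size_bytes);

   void LogTemperature(int temperature) {
     uint8_t buffer[32];  // Example size; pick one appropriate for the message.
     size_t size_bytes = sizeof(buffer);

     // Encodes the format string's token and the argument into the buffer.
     // size_bytes is updated to the number of bytes written.
     PW_TOKENIZE_TO_BUFFER(
         buffer, &size_bytes, "Temperature is %d degrees", temperature);

     SendLogData(buffer, size_bytes);
   }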

.. _module-pw_tokenizer-nested-arguments:

Tokenize nested arguments
=========================
Encoding ``%s`` string arguments is inefficient, since ``%s`` strings are
encoded 1:1, with no tokenization. Tokens can therefore be used to replace
string arguments to tokenized format strings.

* :c:macro:`PW_TOKEN_FMT`

.. admonition:: Logging nested tokens

   Users will typically interact with nested token arguments during logging.
   In this case there is a slightly different interface described by
   :ref:`module-pw_log-tokenized-args` that does not generally invoke
   ``PW_TOKEN_FMT`` directly.

The format specifier for a token is given by the PRI-style macro
``PW_TOKEN_FMT()``, which is concatenated to the rest of the format string by
the C preprocessor.

.. code-block:: cpp

   PW_TOKENIZE_FORMAT_STRING("margarine_domain",
                             UINT32_MAX,
                             "I can't believe it's not " PW_TOKEN_FMT() "!",
                             PW_TOKENIZE_STRING_EXPR("butter"));

This feature is currently only supported by the Python detokenizer.

Nested token format
-------------------
Nested tokens have the following format within strings:

.. code-block::

   $[BASE#]TOKEN

The ``$`` is a common prefix required for all nested tokens. It is possible to
configure a different common prefix if necessary, but using the default ``$``
character is strongly recommended.

The optional ``BASE`` defines the numeric base encoding of the token. Accepted
values are 8, 10, 16, and 64. If the hash symbol ``#`` is used without
specifying a number, the base is assumed to be 16. If the base option is
omitted entirely, the base defaults to 64 for backward compatibility. All
encodings except Base64 are case insensitive. Support for other bases may be
added in the future.

Non-Base64 tokens are encoded strictly as 32-bit integers with padding.
Base64 data may additionally encode string arguments for the detokenized token,
and therefore does not have a maximum width.

The meaning of ``TOKEN`` depends on the current phase of transformation for the
current tokenized format string. Within the format string's entry in the token
database, when the actual value of the token argument is not known, ``TOKEN``
is a printf argument specifier (e.g. ``%08x`` for a base-16 token with correct
padding). The actual tokens that will be used as arguments have separate
entries in the token database.

After the top-level format string has been detokenized and formatted, ``TOKEN``
should be the value of the token argument in the specified base, with any
necessary padding. This is the final format of a nested token if it cannot be
detokenized.

.. list-table:: Example tokens
   :widths: 10 25 25

   * - Base
     - | Token database
       | (within format string entry)
     - Partially detokenized
   * - 10
     - ``$10#%010d``
     - ``$10#0086025943``
   * - 16
     - ``$#%08x``
     - ``$#0000001A``
   * - 64
     - ``%s``
     - ``$QA19pfEQ``

.. _module-pw_tokenizer-custom-macro:

Tokenize a message with arguments in a custom macro
====================================================
Projects can leverage the tokenization machinery in whichever way best suits
their needs. The most efficient way to use ``pw_tokenizer`` is to pass
tokenized data to a global handler function.
A project's custom tokenization macro can handle tokenized data in a function
of its choosing. The function may accept any arguments, but its final
arguments must be:

* The 32-bit token (:cpp:type:`pw_tokenizer_Token`)
* The argument types (:cpp:type:`pw_tokenizer_ArgTypes`)
* Variadic arguments, if any

``pw_tokenizer`` provides two low-level macros to help projects create custom
tokenization macros:

* :c:macro:`PW_TOKENIZE_FORMAT_STRING`
* :c:macro:`PW_TOKENIZER_REPLACE_FORMAT_STRING`

.. caution::

   Note the spelling difference! The first macro begins with ``PW_TOKENIZE_``
   (no ``R``) whereas the second begins with ``PW_TOKENIZER_``.

Use these macros to invoke an encoding function with the token, argument types,
and variadic arguments. The function can then encode the tokenized message to a
buffer using helpers in ``pw_tokenizer/encode_args.h``:

.. Note: pw_tokenizer_EncodeArgs is a C function so you would expect to
.. reference it as :c:func:`pw_tokenizer_EncodeArgs`. That doesn't work because
.. it's defined in a header file that mixes C and C++.

* :cpp:func:`pw::tokenizer::EncodeArgs`
* :cpp:class:`pw::tokenizer::EncodedMessage`
* :cpp:func:`pw_tokenizer_EncodeArgs`

Example
-------
The following example implements a custom tokenization macro similar to
:ref:`module-pw_log_tokenized`.

.. code-block:: cpp

   #include "pw_tokenizer/tokenize.h"

   #ifdef __cplusplus
   extern "C" {
   #endif

   void EncodeTokenizedMessage(uint32_t metadata,
                               pw_tokenizer_Token token,
                               pw_tokenizer_ArgTypes types,
                               ...);

   #ifdef __cplusplus
   }  // extern "C"
   #endif

   #define PW_LOG_TOKENIZED_ENCODE_MESSAGE(metadata, format, ...)            \
     do {                                                                    \
       PW_TOKENIZE_FORMAT_STRING("logs", UINT32_MAX, format, __VA_ARGS__);   \
       EncodeTokenizedMessage(                                               \
           metadata, PW_TOKENIZER_REPLACE_FORMAT_STRING(__VA_ARGS__));       \
     } while (0)

In this example, the ``EncodeTokenizedMessage`` function would handle encoding
and processing the message. Encoding is done by the
:cpp:class:`pw::tokenizer::EncodedMessage` class or
:cpp:func:`pw::tokenizer::EncodeArgs` function from
``pw_tokenizer/encode_args.h``. The encoded message can then be transmitted or
stored as needed.

.. code-block:: cpp

   #include <cstdarg>

   #include "pw_log_tokenized/log_tokenized.h"
   #include "pw_tokenizer/encode_args.h"

   void HandleTokenizedMessage(pw::log_tokenized::Metadata metadata,
                               pw::span<std::byte> message);

   extern "C" void EncodeTokenizedMessage(const uint32_t metadata,
                                          const pw_tokenizer_Token token,
                                          const pw_tokenizer_ArgTypes types,
                                          ...) {
     va_list args;
     va_start(args, types);
     pw::tokenizer::EncodedMessage<kLogBufferSize> encoded_message(token, types, args);
     va_end(args);

     HandleTokenizedMessage(metadata, encoded_message);
   }

.. admonition:: Why use a custom macro

   - Optimal code size. Invoking a free function with the tokenized data
     results in the smallest possible call site.
   - Pass additional arguments, such as metadata, with the tokenized message.
   - Integrate ``pw_tokenizer`` with other systems.

Tokenizing function names
=========================
The string literal tokenization functions support tokenizing string literals or
constexpr character arrays (``constexpr const char[]``). In GCC and Clang, the
special ``__func__`` variable and ``__PRETTY_FUNCTION__`` extension are
declared as ``static constexpr char[]`` in C++ instead of the standard
``static const char[]``. This means that ``__func__`` and
``__PRETTY_FUNCTION__`` can be tokenized while compiling C++ with GCC or Clang.

.. code-block:: cpp

   // Tokenize the special function name variables.
   constexpr uint32_t function = PW_TOKENIZE_STRING(__func__);
   constexpr uint32_t pretty_function = PW_TOKENIZE_STRING(__PRETTY_FUNCTION__);

Note that ``__func__`` and ``__PRETTY_FUNCTION__`` are not string literals.
They are defined as static character arrays, so they cannot be implicitly
concatenated with string literals. For example, ``printf(__func__ ": %d",
123);`` will not compile.

Calculate minimum required buffer size
======================================
See :cpp:func:`pw::tokenizer::MinEncodingBufferSizeBytes`.
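
As an illustrative sketch, this can be combined with
:c:macro:`PW_TOKENIZE_TO_BUFFER` to size a stack buffer. The sketch assumes
``MinEncodingBufferSizeBytes`` takes the argument types as template parameters;
check ``pw_tokenizer/encode_args.h`` for the exact interface.

.. code-block:: cpp

   #include <stddef.h>
   #include <stdint.h>

   #include "pw_tokenizer/encode_args.h"
   #include "pw_tokenizer/tokenize.h"

   void EncodeCountAndRatio(int count, float ratio) {
     // Assumption: the argument types are passed as template parameters and
     // the result covers the 4-byte token plus worst-case argument encodings.
     constexpr size_t kBufferSize =
         pw::tokenizer::MinEncodingBufferSizeBytes<int, float>();

     uint8_t buffer[kBufferSize];
     size_t size_bytes = sizeof(buffer);
     PW_TOKENIZE_TO_BUFFER(buffer, &size_bytes, "%d at ratio %f", count, ratio);
     // buffer[0..size_bytes) now holds the encoded message.
   }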

.. _module-pw_tokenizer-base64-format:

Encoding Base64
===============
The tokenizer encodes messages to a compact binary representation. Applications
may desire a textual representation of tokenized strings. This makes it easy to
use tokenized messages alongside plain text messages, but comes at a small
efficiency cost: encoded Base64 messages occupy about 4/3 (133%) as much memory
as binary messages.

The Base64 format consists of a ``$`` character followed by the Base64-encoded
contents of the tokenized message. For example, consider tokenizing the string
``This is an example: %d!`` with the argument -1. The string's token is
0x4b016e66.

.. code-block:: text

   Source code: PW_LOG("This is an example: %d!", -1);

   Plain text: This is an example: -1! [23 bytes]

   Binary: 66 6e 01 4b 01 [ 5 bytes]

   Base64: $Zm4BSwE= [ 9 bytes]

To encode with the Base64 format, add a call to
``pw::tokenizer::PrefixedBase64Encode`` or ``pw_tokenizer_PrefixedBase64Encode``
in the tokenizer handler function. For example,

.. code-block:: cpp

   void TokenizedMessageHandler(const uint8_t encoded_message[],
                                size_t size_bytes) {
     pw::InlineBasicString base64 = pw::tokenizer::PrefixedBase64Encode(
         pw::span(encoded_message, size_bytes));

     TransmitLogMessage(base64.data(), base64.size());
   }

.. _module-pw_tokenizer-masks:

Reduce token size with masking
==============================
``pw_tokenizer`` uses 32-bit tokens. On 32-bit or 64-bit architectures, using
fewer than 32 bits does not improve runtime or code size efficiency. However,
when tokens are packed into data structures or stored in arrays, the size of
the token directly affects memory usage. In those cases, every bit counts, and
it may be desirable to use fewer bits for the token.

``pw_tokenizer`` allows users to provide a mask to apply to the token. This
masked token is used in both the token database and the code. The masked token
is not a masked version of the full 32-bit token; the masked token *is* the
token. This makes it trivial to decode tokens that use fewer than 32 bits.

Masking functionality is provided through the ``*_MASK`` versions of the
macros:

* :c:macro:`PW_TOKENIZE_STRING_MASK`
* :c:macro:`PW_TOKENIZE_STRING_MASK_EXPR`
* :c:macro:`PW_TOKENIZE_TO_BUFFER_MASK`

For example, the following generates 16-bit tokens and packs them into an
existing value.

.. code-block:: cpp

   constexpr uint32_t token = PW_TOKENIZE_STRING_MASK("domain", 0xFFFF, "Pigweed!");
   uint32_t packed_word = (other_bits << 16) | token;

Tokens are hashes, so tokens of any size have a collision risk. The fewer bits
used for tokens, the more likely two strings are to hash to the same token. See
:ref:`module-pw_tokenizer-collisions`.

Masked tokens without arguments may be encoded in fewer bytes. For example, the
16-bit token ``0x1234`` may be encoded as two little-endian bytes (``34 12``)
rather than four (``34 12 00 00``). The detokenizer tools zero-pad data smaller
than four bytes. Tokens with arguments must always be encoded as four bytes.

.. _module-pw_tokenizer-domains:

Keep tokens from different sources separate with domains
=========================================================
``pw_tokenizer`` supports having multiple tokenization domains. Domains are a
string label associated with each tokenized string. This allows projects to
keep tokens from different sources separate. Potential use cases include the
following:

* Keep large sets of tokenized strings separate to avoid collisions.
* Create a separate database for a small number of strings that use truncated
  tokens, for example only 10 or 16 bits instead of the full 32 bits.

If no domain is specified, the domain is empty (``""``). For many projects,
this default domain is sufficient, so no additional configuration is required.

.. code-block:: cpp

   // Tokenizes this string to the default ("") domain.
   PW_TOKENIZE_STRING("Hello, world!");

   // Tokenizes this string to the "my_custom_domain" domain.
   PW_TOKENIZE_STRING_DOMAIN("my_custom_domain", "Hello, world!");

The database and detokenization command line tools default to reading from the
default domain. The domain may be specified for ELF files by appending
``#DOMAIN_NAME`` to the file path. Use ``#.*`` to read from all domains. For
example, the following reads strings in ``some_domain`` from ``my_image.elf``.

.. code-block:: sh

   ./database.py create --database my_db.csv path/to/my_image.elf#some_domain

See :ref:`module-pw_tokenizer-managing-token-databases` for information about
the ``database.py`` command line tool.

Limitations, bugs, and future work
==================================

GCC bug: tokenization in template functions
-------------------------------------------
GCC incorrectly ignores the section attribute for template `functions
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70435>`_ and `variables
<https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88061>`_. For example, the
following won't work when compiling with GCC and tokenized logging:

.. code-block:: cpp

   template <...>
   void DoThings() {
     int value = GetValue();
     // This log won't work with tokenized logs due to the templated context.
     PW_LOG_INFO("Got value: %d", value);
     ...
   }

The bug causes tokenized strings in template functions to be emitted into
``.rodata`` instead of the special tokenized string section. This causes two
problems:

1. Tokenized strings will not be discovered by the token database tools.
2. Tokenized strings may not be removed from the final binary.

There are two workarounds.

#. **Use Clang.** Clang puts the string data in the requested section, as
   expected. No extra steps are required.

#. **Move tokenization calls to a non-templated context.** Creating a separate
   non-templated function and invoking it from the template resolves the issue.
   This enables tokenizing in most cases encountered in practice with
   templates.

   .. code-block:: cpp

      // In .h file:
      void LogThings(int value);

      template <...>
      void DoThings() {
        int value = GetValue();
        // This log will work: calls non-templated helper.
        LogThings(value);
        ...
      }

      // In .cc file:
      void LogThings(int value) {
        // Tokenized logging works as expected in this non-templated context.
        PW_LOG_INFO("Got value %d", value);
      }

There is a third option, which isn't implemented yet, which is to compile the
binary twice: once to extract the tokens, and once for the production binary
(without tokens). If this is interesting to you please get in touch.

64-bit tokenization
-------------------
The Python and C++ detokenizing libraries currently assume that strings were
tokenized on a system with 32-bit ``long``, ``size_t``, ``intptr_t``, and
``ptrdiff_t``. Decoding may not work correctly for these types if a 64-bit
device performed the tokenization.

Supporting detokenization of strings tokenized on 64-bit targets would be
simple. This could be done by adding an option to switch the 32-bit types to
64-bit. The tokenizer stores the sizes of these types in the
``.pw_tokenizer.info`` ELF section, so the sizes of these types can be verified
by checking the ELF file, if necessary.

Tokenization in headers
-----------------------
Tokenizing code in header files (inline functions or templates) may trigger
warnings such as ``-Wlto-type-mismatch`` under certain conditions. That is
because tokenization requires declaring a character array for each tokenized
string. If the tokenized string includes macros that change value, the size of
this character array changes, which means the same static variable is defined
with different sizes. It should be safe to suppress these warnings, but, when
possible, code that tokenizes strings with macros that can change value should
be moved to source files rather than headers.

----------------------
Tokenization in Python
----------------------
The Python ``pw_tokenizer.encode`` module has limited support for encoding
tokenized messages with the :func:`pw_tokenizer.encode.encode_token_and_args`
function. This function requires that a string's token has already been
calculated. Typically these tokens are provided by a database, but they can be
manually created using the tokenizer hash.

:func:`pw_tokenizer.tokens.pw_tokenizer_65599_hash` is particularly useful
for offline token database generation in cases where tokenized strings in a
binary cannot be embedded as parsable pw_tokenizer entries.

.. note::
   In C, the hash length of a string has a fixed limit controlled by
   ``PW_TOKENIZER_CFG_C_HASH_LENGTH``. To match tokens produced by C (as opposed
   to C++) code, ``pw_tokenizer_65599_hash()`` should be called with a matching
   hash length limit. When creating an offline database, it's a good idea to
   generate tokens for both, and merge the databases.

.. _module-pw_tokenizer-cli-encoding:

-----------------
Encoding CLI tool
-----------------
The ``pw_tokenizer.encode`` command line tool can be used to encode format
strings and optional arguments.

.. code-block:: bash

   python -m pw_tokenizer.encode [-h] FORMAT_STRING [ARG ...]

Example:

.. code-block:: text

   $ python -m pw_tokenizer.encode "There's... %d many of %s!" 2 them
   Raw input: "There's... %d many of %s!" % (2, 'them')
   Formatted input: There's... 2 many of them!
   Token: 0xb6ef8b2d
   Encoded: b'-\x8b\xef\xb6\x04\x04them' (2d 8b ef b6 04 04 74 68 65 6d) [10 bytes]
   Prefixed Base64: $LYvvtgQEdGhlbQ==

See ``--help`` for full usage details.

--------
Appendix
--------

Case study
==========

.. note::
   This section discusses the implementation, results, and lessons learned
   from a real-world deployment of ``pw_tokenizer``.

The tokenizer module was developed to bring tokenized logging to an
in-development product. The product already had an established text-based
logging system. Deploying tokenization was straightforward and had substantial
benefits.

Results
-------
* Log contents shrunk by over 50%, even with Base64 encoding.

  * Significant size savings for encoded logs, even using the less-efficient
    Base64 encoding required for compatibility with the existing log system.
  * Freed valuable communication bandwidth.
  * Allowed storing many more logs in crash dumps.

* Substantial flash savings.

  * Reduced the size of firmware images by up to 18%.

* Simpler logging code.

  * Removed CPU-heavy ``snprintf`` calls.
  * Removed complex code for forwarding log arguments to a low-priority task.

This section describes the tokenizer deployment process and highlights key
insights.

Firmware deployment
-------------------
* In the project's logging macro, calls to the underlying logging function
  were replaced with a tokenized log macro invocation.
* The log level was passed as the payload argument to facilitate runtime log
  level control.
* For this project, it was necessary to encode the log messages as text. In
  the handler function the log messages were encoded in the $-prefixed
  :ref:`module-pw_tokenizer-base64-format`, then dispatched as normal log
  messages.
* Asserts were tokenized using a callback-based API that has been removed (a
  :ref:`custom macro <module-pw_tokenizer-custom-macro>` is a better
  alternative).

.. attention::
   Do not encode line numbers in tokenized strings. This results in a huge
   number of lines being added to the database, since every time code moves,
   new strings are tokenized. If :ref:`module-pw_log_tokenized` is used, line
   numbers are encoded in the log metadata. Line numbers may also be included
   by adding ``"%d"`` to the format string and passing ``__LINE__``.

.. _module-pw_tokenizer-database-management:

Database management
-------------------
* The token database was stored as a CSV file in the project's Git repo.
* The token database was automatically updated as part of the build, and
  developers were expected to check in the database changes alongside their
  code changes.
* A presubmit check verified that all strings added by a change were added to
  the token database.
* The token database included logs and asserts for all firmware images in the
  project.
* No strings were purged from the token database.

.. tip::
   Merge conflicts may be a frequent occurrence with an in-source CSV database.
   Use the :ref:`module-pw_tokenizer-directory-database-format` instead.

Decoding tooling deployment
---------------------------
* The Python detokenizer in ``pw_tokenizer`` was deployed to two places:

  * Product-specific Python command line tools, using
    ``pw_tokenizer.Detokenizer``.
  * Standalone script for decoding prefixed Base64 tokens in files or
    live output (e.g. from ``adb``), using ``detokenize.py``'s command line
    interface.

* The C++ detokenizer library was deployed to two Android apps with a Java
  Native Interface (JNI) layer.

  * The binary token database was included as a raw resource in the APK.
  * In one app, the built-in token database could be overridden by copying a
    file to the phone.

.. tip::
   Make the tokenized logging tools simple to use for your project.

   * Provide simple wrapper shell scripts that fill in arguments for the
     project. For example, point ``detokenize.py`` to the project's token
     databases.
   * Use ``pw_tokenizer.AutoUpdatingDetokenizer`` to decode in
     continuously-running tools, so that users don't have to restart the tool
     when the token database updates.
   * Integrate detokenization everywhere it is needed. Integrating the tools
     takes just a few lines of code, and token databases can be embedded in
     APKs or binaries.