Diffstat (limited to 'pw_tokenizer/detokenization.rst')
-rw-r--r-- | pw_tokenizer/detokenization.rst | 583 |
1 files changed, 583 insertions, 0 deletions
diff --git a/pw_tokenizer/detokenization.rst b/pw_tokenizer/detokenization.rst new file mode 100644 index 000000000..7fbefec88 --- /dev/null +++ b/pw_tokenizer/detokenization.rst @@ -0,0 +1,583 @@ +:tocdepth: 3 + +.. _module-pw_tokenizer-detokenization: + +============== +Detokenization +============== +.. pigweed-module-subpage:: + :name: pw_tokenizer + :tagline: Compress strings to shrink logs by +75% + +Detokenization is the process of expanding a token to the string it represents +and decoding its arguments. ``pw_tokenizer`` provides Python, C++ and +TypeScript detokenization libraries. + +-------------------------------- +Example: decoding tokenized logs +-------------------------------- +A project might tokenize its log messages with the +:ref:`module-pw_tokenizer-base64-format`. Consider the following log file, which +has four tokenized logs and one plain text log: + +.. code-block:: text + + 20200229 14:38:58 INF $HL2VHA== + 20200229 14:39:00 DBG $5IhTKg== + 20200229 14:39:20 DBG Crunching numbers to calculate probability of success + 20200229 14:39:21 INF $EgFj8lVVAUI= + 20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk= + +The project's log strings are stored in a database like the following: + +.. code-block:: + + 1c95bd1c, ,"Initiating retrieval process for recovery object" + 2a5388e4, ,"Determining optimal approach and coordinating vectors" + 3743540c, ,"Recovery object retrieval failed with status %s" + f2630112, ,"Calculated acceptable probability of success (%.2f%%)" + +Using the detokenizing tools with the database, the logs can be decoded: + +.. 
code-block:: text + + 20200229 14:38:58 INF Initiating retrieval process for recovery object + 20200229 14:39:00 DBG Determining optimal algorithm and coordinating approach vectors + 20200229 14:39:20 DBG Crunching numbers to calculate probability of success + 20200229 14:39:21 INF Calculated acceptable probability of success (32.33%) + 20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY + +.. note:: + + This example uses the :ref:`module-pw_tokenizer-base64-format`, which + occupies about 4/3 (133%) as much space as the default binary format when + encoded. For projects that wish to interleave tokenized with plain text, + using Base64 is a worthwhile tradeoff. + +------------------------ +Detokenization in Python +------------------------ +To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer`` +package, and instantiate it with paths to token databases or ELF files. + +.. code-block:: python + + import pw_tokenizer + + detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf') + + def process_log_message(log_message): + result = detokenizer.detokenize(log_message.payload) + self._log(str(result)) + +The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer`` +class, which can be used in place of the standard ``Detokenizer``. This class +monitors database files for changes and automatically reloads them when they +change. This is helpful for long-running tools that use detokenization. The +class also supports token domains for the given database files in the +``<path>#<domain>`` format. + +For messages that are optionally tokenized and may be encoded as binary, +Base64, or plaintext UTF-8, use +:func:`pw_tokenizer.proto.decode_optionally_tokenized`. This will attempt to +determine the correct method to detokenize and always provide a printable +string. + +.. 
_module-pw_tokenizer-base64-decoding: + +Decoding Base64 +=============== +The Python ``Detokenizer`` class supports decoding and detokenizing prefixed +Base64 messages with ``detokenize_base64`` and related methods. + +.. tip:: + The Python detokenization tools support recursive detokenization for prefixed + Base64 text. Tokenized strings found in detokenized text are detokenized, so + prefixed Base64 messages can be passed as ``%s`` arguments. + + For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could be + passed as an argument to the printf-style string ``Nested message: %s``, which + encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer would decode the message + as follows: + + :: + + "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!" + +Base64 decoding is supported in C++ or C with the +``pw::tokenizer::PrefixedBase64Decode`` or ``pw_tokenizer_PrefixedBase64Decode`` +functions. + +Investigating undecoded Base64 messages +--------------------------------------- +Tokenized messages cannot be decoded if the token is not recognized. The Python +package includes the ``parse_message`` tool, which parses tokenized Base64 +messages without looking up the token in a database. This tool attempts to guess +the types of the arguments and displays potential ways to decode them. + +This tool can be used to extract argument information from an otherwise unusable +message. It could help identify which statement in the code produced the +message. This tool is not particularly helpful for tokenized messages without +arguments, since all it can do is show the value of the unknown token. + +The tool is executed by passing Base64 tokenized messages, with or without the +``$`` prefix, to ``pw_tokenizer.parse_message``. Pass ``-h`` or ``--help`` to +see full usage information. + +Example +^^^^^^^ +.. 
code-block:: + + $ python -m pw_tokenizer.parse_message '$329JMwA=' koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw== --specs %s %d + + INF Decoding arguments for '$329JMwA=' + INF Binary: b'\xdfoI3\x00' [df 6f 49 33 00] (5 bytes) + INF Token: 0x33496fdf + INF Args: b'\x00' [00] (1 bytes) + INF Decoding with up to 8 %s or %d arguments + INF Attempt 1: [%s] + INF Attempt 2: [%d] 0 + + INF Decoding arguments for '$koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw==' + INF Binary: b'\x92\x84\xa5\xe7n\x13FAILED_PRECONDITION\x02OK' [92 84 a5 e7 6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (28 bytes) + INF Token: 0xe7a58492 + INF Args: b'n\x13FAILED_PRECONDITION\x02OK' [6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (24 bytes) + INF Decoding with up to 8 %s or %d arguments + INF Attempt 1: [%d %s %d %d %d] 55 FAILED_PRECONDITION 1 -40 -38 + INF Attempt 2: [%d %s %s] 55 FAILED_PRECONDITION OK + + +.. _module-pw_tokenizer-protobuf-tokenization-python: + +Detokenizing protobufs +====================== +The :py:mod:`pw_tokenizer.proto` Python module defines functions that may be +used to detokenize protobuf objects in Python. The function +:py:func:`pw_tokenizer.proto.detokenize_fields` detokenizes all fields +annotated as tokenized, replacing them with their detokenized version. For +example: + +.. code-block:: python + + my_detokenizer = pw_tokenizer.Detokenizer(some_database) + + my_message = SomeMessage(tokenized_field=b'$YS1EMQ==') + pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message) + + assert my_message.tokenized_field == b'The detokenized string! Cool!' + +Decoding optionally tokenized strings +------------------------------------- +The encoding used for an optionally tokenized field is not recorded in the +protobuf. Despite this, the text can reliably be decoded. This is accomplished +by attempting to decode the field as binary or Base64 tokenized data before +treating it like plain text. 
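The fallback order described above (binary first, then Base64, then plain text) can be sketched in a few lines of standalone Python. This is only an illustration under stated assumptions, not the ``pw_tokenizer.proto.decode_optionally_tokenized`` implementation: ``TOY_DATABASE``, ``lookup_binary`` and ``decode_optionally_tokenized_sketch`` are hypothetical names, the database is a plain dict, arguments are not decoded, and only whole-message prefixed Base64 is handled.

```python
import base64
import struct

# Hypothetical toy database: 32-bit token -> string (no printf arguments).
TOY_DATABASE = {0x1C95BD1C: "Initiating retrieval process for recovery object"}


def lookup_binary(data, database):
    """Return the string for a binary tokenized message, or None if unknown."""
    if len(data) < 4:
        return None
    (token,) = struct.unpack_from("<I", data)  # token is little-endian
    return database.get(token)


def decode_optionally_tokenized_sketch(data, database, prefix="$"):
    """Try binary, then prefixed Base64, then plain text (assumed order)."""
    # 1. Attempt to detokenize the raw bytes as a binary tokenized message.
    text = lookup_binary(data, database)
    if text is not None:
        return text

    # 2. Bytes that are not valid UTF-8 are shown as prefixed Base64.
    try:
        as_text = data.decode("utf-8")
    except UnicodeDecodeError:
        return prefix + base64.b64encode(data).decode("ascii")

    # 3. Attempt to detokenize the text as a prefixed Base64 message.
    if as_text.startswith(prefix):
        try:
            raw = base64.b64decode(as_text[len(prefix):], validate=True)
        except ValueError:  # binascii.Error is a ValueError subclass
            raw = b""
        text = lookup_binary(raw, database)
        if text is not None:
            return text

    # 4. Printable text is shown as-is; anything else falls back to Base64.
    if as_text.isprintable():
        return as_text
    return prefix + base64.b64encode(data).decode("ascii")
```

With the toy database, both ``b"\x1c\xbd\x95\x1c"`` and ``b"$HL2VHA=="`` resolve to the same log string, while ordinary text such as ``b"Crunching numbers"`` passes through unchanged.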
+ +The following diagram describes the decoding process for optionally tokenized +fields in detail. + +.. mermaid:: + + flowchart TD + start([Received bytes]) --> binary + + binary[Decode as<br>binary tokenized] --> binary_ok + binary_ok{Detokenizes<br>successfully?} -->|no| utf8 + binary_ok -->|yes| done_binary([Display decoded binary]) + + utf8[Decode as UTF-8] --> utf8_ok + utf8_ok{Valid UTF-8?} -->|no| base64_encode + utf8_ok -->|yes| base64 + + base64_encode[Encode as<br>tokenized Base64] --> display + display([Display encoded Base64]) + + base64[Decode as<br>Base64 tokenized] --> base64_ok + + base64_ok{Fully<br>or partially<br>detokenized?} -->|no| is_plain_text + base64_ok -->|yes| base64_results + + is_plain_text{Text is<br>printable?} -->|no| base64_encode + is_plain_text-->|yes| plain_text + + base64_results([Display decoded Base64]) + plain_text([Display text]) + +Potential decoding problems +--------------------------- +The decoding process for optionally tokenized fields will yield correct results +in almost every situation. In rare circumstances, it is possible for it to fail, +but these can be avoided with a low-overhead mitigation if desired. + +There are two ways in which the decoding process may fail. + +Accidentally interpreting plain text as tokenized binary +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +If a plain-text string happens to decode as a binary tokenized message, the +incorrect message could be displayed. This is very unlikely to occur. While many +tokens will incidentally end up being valid UTF-8 strings, it is highly unlikely +that a device will happen to log one of these strings as plain text. The +overwhelming majority of these strings will be nonsense. + +If an implementation wishes to guard against this extremely improbable +situation, it is possible to prevent it. 
This situation is prevented by +appending 0xFF (or another byte never valid in UTF-8) to binary tokenized data +that happens to be valid UTF-8 (or all binary tokenized messages, if desired). +When decoding, if there is an extra 0xFF byte, it is discarded. + +Displaying undecoded binary as plain text instead of Base64 +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +If a message fails to decode as binary tokenized and it is not valid UTF-8, it +is displayed as tokenized Base64. This makes it easily recognizable as a +tokenized message and makes it simple to decode later from the text output (for +example, with an updated token database). + +A binary message for which the token is not known may coincidentally be valid +UTF-8 or ASCII. 6.25% of 4-byte sequences are composed only of ASCII characters. +When decoding with an out-of-date token database, it is possible that some +binary tokenized messages will be displayed as plain text rather than tokenized +Base64. + +This situation is likely to occur, but should be infrequent. Even if it does +happen, it is not a serious issue. A very small number of strings will be +displayed incorrectly, but these strings cannot be decoded anyway. One nonsense +string (e.g. ``a-D1``) would be displayed instead of another (``$YS1EMQ==``). +Updating the token database would resolve the issue, though the non-Base64 logs +would be difficult to decode later from a log file. + +This situation can be avoided with the same approach described in +`Accidentally interpreting plain text as tokenized binary`_. Appending +a byte that is never valid in UTF-8 prevents the undecoded binary message from +being interpreted as plain text. + +--------------------- +Detokenization in C++ +--------------------- +The C++ detokenization libraries can be used in C++ or any language that can +call into C++ with a C-linkage wrapper, such as Java or Rust. A reference +Java Native Interface (JNI) implementation is provided. 
+ +The C++ detokenization library uses binary-format token databases (created with +``database.py create --type binary``). Read a binary format database from a +file or include it in the source code. Pass the database array to +``TokenDatabase::Create``, and construct a detokenizer. + +.. code-block:: cpp + + Detokenizer detokenizer(TokenDatabase::Create(token_database_array)); + + std::string ProcessLog(span<uint8_t> log_data) { + return detokenizer.Detokenize(log_data).BestString(); + } + +The ``TokenDatabase`` class verifies that its data is valid before using it. If +it is invalid, the ``TokenDatabase::Create`` returns an empty database for which +``ok()`` returns false. If the token database is included in the source code, +this check can be done at compile time. + +.. code-block:: cpp + + // This line fails to compile with a static_assert if the database is invalid. + constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>(); + + Detokenizer OpenDatabase(std::string_view path) { + std::vector<uint8_t> data = ReadWholeFile(path); + + TokenDatabase database = TokenDatabase::Create(data); + + // This checks if the file contained a valid database. It is safe to use a + // TokenDatabase that failed to load (it will be empty), but it may be + // desirable to provide a default database or otherwise handle the error. + if (database.ok()) { + return Detokenizer(database); + } + return Detokenizer(kDefaultDatabase); + } + +---------------------------- +Detokenization in TypeScript +---------------------------- +To detokenize in TypeScript, import ``Detokenizer`` from the ``pigweedjs`` +package, and instantiate it with a CSV token database. + +.. 
code-block:: typescript + + import { pw_tokenizer, pw_hdlc } from 'pigweedjs'; + const { Detokenizer } = pw_tokenizer; + const { Frame } = pw_hdlc; + + const detokenizer = new Detokenizer(String(tokenCsv)); + + function processLog(frame: Frame){ + const result = detokenizer.detokenize(frame); + console.log(result); + } + +For messages that are encoded in Base64, use ``Detokenizer::detokenizeBase64``. +``detokenizeBase64`` will also attempt to detokenize nested Base64 tokens. There +is also ``detokenizeUint8Array``, which works just like ``detokenize`` but expects +a ``Uint8Array`` instead of a ``Frame`` argument. + + + +.. _module-pw_tokenizer-cli-detokenizing: + +--------------------- +Detokenizing CLI tool +--------------------- +``pw_tokenizer`` provides two standalone command line utilities for detokenizing +Base64-encoded tokenized strings. + +* ``detokenize.py`` -- Detokenizes Base64-encoded strings in files or from + stdin. +* ``serial_detokenizer.py`` -- Detokenizes Base64-encoded strings from a + connected serial device. + +If the ``pw_tokenizer`` Python package is installed, these tools may be executed +as runnable modules. For example: + +.. code-block:: + + # Detokenize Base64-encoded strings in a file + python -m pw_tokenizer.detokenize -i input_file.txt + + # Detokenize Base64-encoded strings in output from a serial device + python -m pw_tokenizer.serial_detokenizer --device /dev/ttyACM0 + +See the ``--help`` options for these tools for full usage information. + +-------- +Appendix +-------- + +.. _module-pw_tokenizer-python-detokenization-c99-printf-notes: + +Python detokenization: C99 ``printf`` compatibility notes +========================================================= +This implementation is designed to align with the +`C99 specification, section 7.19.6 +<https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf>`_. 
+Notably, this specification is slightly different than what is implemented +in most compilers due to each compiler choosing to interpret undefined +behavior in slightly different ways. Treat the following description as the +source of truth. + +This implementation supports: + +- Overall Format: ``%[flags][width][.precision][length][specifier]`` +- Flags (Zero or More) + - ``-``: Left-justify within the given field width; right justification is + the default (see Width modifier). + - ``+``: Forces the result to be preceded by a plus or minus sign (``+`` or + ``-``) even for positive numbers. By default, only negative numbers are + preceded with a ``-`` sign. + - (space): If no sign is going to be written, a blank space is inserted + before the value. + - ``#``: Specifies an alternative print syntax should be used. + - Used with the ``o``, ``x`` or ``X`` specifiers, the value is preceded with + ``0``, ``0x`` or ``0X``, respectively, for values other than zero. + - Used with ``a``, ``A``, ``e``, ``E``, ``f``, ``F``, ``g``, or ``G`` it + forces the written output to contain a decimal point even if no more + digits follow. By default, if no digits follow, no decimal point is + written. + - ``0``: Left-pads the number with zeroes (``0``) instead of spaces when + padding is specified (see width sub-specifier). +- Width (Optional) + - ``(number)``: Minimum number of characters to be printed. If the value to + be printed is shorter than this number, the result is padded with blank + spaces or ``0`` if the ``0`` flag is present. The value is not truncated + even if the result is larger. If the value is negative and the ``0`` flag + is present, the ``0``\s are padded after the ``-`` symbol. + - ``*``: The width is not specified in the format string, but as an + additional integer value argument preceding the argument that has to be + formatted. 
+- Precision (Optional) + - ``.(number)`` + - For ``d``, ``i``, ``o``, ``u``, ``x``, ``X``, specifies the minimum + number of digits to be written. If the value to be written is shorter + than this number, the result is padded with leading zeros. The value is + not truncated even if the result is longer. + + - A precision of ``0`` means that no character is written for the value + ``0``. + + - For ``a``, ``A``, ``e``, ``E``, ``f``, and ``F``, specifies the number + of digits to be printed after the decimal point. By default, this is + ``6``. + + - For ``g`` and ``G``, specifies the maximum number of significant digits + to be printed. + + - For ``s``, specifies the maximum number of characters to be printed. By + default all characters are printed until the ending null character is + encountered. + + - If the period is specified without an explicit value for precision, + ``0`` is assumed. + - ``.*``: The precision is not specified in the format string, but as an + additional integer value argument preceding the argument that has to be + formatted. +- Length (Optional) + - ``hh``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers + to convey the argument will be a ``signed char`` or ``unsigned char``. + However, this is largely ignored in the implementation due to it not being + necessary for Python or argument decoding (since the argument is always + encoded at least as a 32-bit integer). + - ``h``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers + to convey the argument will be a ``signed short int`` or + ``unsigned short int``. However, this is largely ignored in the + implementation due to it not being necessary for Python or argument + decoding (since the argument is always encoded at least as a 32-bit + integer). + - ``l``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers + to convey the argument will be a ``signed long int`` or + ``unsigned long int``. 
It is also usable with ``c`` and ``s`` to specify that + the arguments will be encoded with ``wchar_t`` values (which isn't + different from normal ``char`` values). However, this is largely ignored in + the implementation due to it not being necessary for Python or argument + decoding (since the argument is always encoded at least as a 32-bit + integer). + - ``ll``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers + to convey the argument will be a ``signed long long int`` or + ``unsigned long long int``. This is required to properly decode the + argument as a 64-bit integer. + - ``L``: Usable with ``a``, ``A``, ``e``, ``E``, ``f``, ``F``, ``g``, or + ``G`` conversion specifiers to convey the argument will be a + ``long double``. However, this is ignored in the implementation because the + encoding of floating point values is unaffected by bit width. + - ``j``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers + to convey the argument will be an ``intmax_t`` or ``uintmax_t``. + - ``z``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers + to convey the argument will be a ``size_t``. This will force the argument + to be decoded as an unsigned integer. + - ``t``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers + to convey the argument will be a ``ptrdiff_t``. + - If a length modifier is provided for an incorrect specifier, it is ignored. +- Specifier (Required) + - ``d`` / ``i``: Used for signed decimal integers. + + - ``u``: Used for unsigned decimal integers. + + - ``o``: Used for unsigned integers and specifies formatting should + be as an octal number. + + - ``x``: Used for unsigned integers and specifies formatting should + be as a hexadecimal number using all lowercase letters. + + - ``X``: Used for unsigned integers and specifies formatting should + be as a hexadecimal number using all uppercase letters. 
+ + - ``f``: Used for floating-point values and specifies to use lowercase, + decimal floating point formatting. + + - Default precision is ``6`` decimal places unless explicitly specified. + + - ``F``: Used for floating-point values and specifies to use uppercase, + decimal floating point formatting. + + - Default precision is ``6`` decimal places unless explicitly specified. + + - ``e``: Used for floating-point values and specifies to use lowercase, + exponential (scientific) formatting. + + - Default precision is ``6`` decimal places unless explicitly specified. + + - ``E``: Used for floating-point values and specifies to use uppercase, + exponential (scientific) formatting. + + - Default precision is ``6`` decimal places unless explicitly specified. + + - ``g``: Used for floating-point values and specifies to use ``f`` or ``e`` + formatting depending on which would be the shortest representation. + + - Precision specifies the number of significant digits, not just digits + after the decimal place. + + - If the precision is specified as ``0``, it is interpreted to mean ``1``. + + - ``e`` formatting is used if the exponent would be less than ``-4`` or + greater than or equal to the precision. + + - Trailing zeros are removed unless the ``#`` flag is set. + + - A decimal point only appears if it is followed by a digit. + + - ``NaN`` or infinities always follow ``f`` formatting. + + - ``G``: Used for floating-point values and specifies to use ``F`` or ``E`` + formatting depending on which would be the shortest representation. + + - Precision specifies the number of significant digits, not just digits + after the decimal place. + + - If the precision is specified as ``0``, it is interpreted to mean ``1``. + + - ``E`` formatting is used if the exponent would be less than ``-4`` or + greater than or equal to the precision. + + - Trailing zeros are removed unless the ``#`` flag is set. + + - A decimal point only appears if it is followed by a digit. 
+ + - ``NaN`` or infinities always follow ``F`` formatting. + + - ``c``: Used for formatting a ``char`` value. + + - ``s``: Used for formatting a string of ``char`` values. + + - If width is specified, the null terminator character is included as a + character for width count. + + - If precision is specified, no more ``char``\s than that value will be + written from the string (padding is used to fill additional width). + + - ``p``: Used for formatting a pointer address. + + - ``%``: Prints a single ``%``. Only valid as ``%%`` (supports no flags, + width, precision, or length modifiers). + +Underspecified details: + +- If both ``+`` and (space) flags appear, the (space) is ignored. +- The ``+`` and (space) flags will error if used with ``c`` or ``s``. +- The ``#`` flag will error if used with ``d``, ``i``, ``u``, ``c``, ``s``, or + ``p``. +- The ``0`` flag will error if used with ``c``, ``s``, or ``p``. +- Both ``+`` and (space) can work with the unsigned integer specifiers ``u``, + ``o``, ``x``, and ``X``. +- If a length modifier is provided for an incorrect specifier, it is ignored. +- The ``z`` length modifier will decode arguments as signed as long as ``d`` or + ``i`` is used. +- ``p`` is implementation defined. + + - For this implementation, it will print a ``0x`` prefix followed by the + pointer value printed using ``%08X``. + + - ``p`` supports the ``+``, ``-``, and (space) flags, but not the ``#`` or + ``0`` flags. + + - None of the length modifiers are usable with ``p``. + + - This implementation will try to adhere to user-specified width (assuming the + width provided is larger than the guaranteed minimum of ``10``). + + - Specifying precision for ``p`` is considered an error. +- Only ``%%`` is allowed with no other modifiers. Things like ``%+%`` will fail + to decode. Some C stdlib implementations accept modifiers between the two + ``%`` characters, but ignore them in the output. 
+- If a width is specified with the ``0`` flag for a negative value, the padded + ``0``\s will appear after the ``-`` symbol. +- A precision of ``0`` for ``d``, ``i``, ``u``, ``o``, ``x``, or ``X`` means + that no character is written for the value ``0``. +- Precision cannot be specified for ``c``. +- Using ``*`` or fixed precision with the ``s`` specifier still requires the + string argument to be null-terminated. This is due to argument encoding + happening on the C/C++-side while the precision value is not read or + otherwise used until decoding happens in this Python code. + +Non-conformant details: + +- ``n`` specifier: We do not support the ``n`` specifier since it is impossible + for us to retroactively tell the original program how many characters have + been printed since this decoding happens a great deal of time after the + device sent it, usually on a separate processing device entirely.
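Several of the formatting rules listed in these notes can be spot-checked with Python's built-in printf-style ``%`` operator, which happens to follow the same conventions for the cases below. This is a hedged illustration only: it does not call the ``pw_tokenizer`` decoder, the names ``CASES`` and ``render`` are hypothetical, and Python's operator is known to differ from C99 in other cases (for example ``%#o`` and a precision of ``0`` applied to the value ``0``), so it should not be treated as a reference implementation.

```python
# Spot checks with Python's own printf-style operator (NOT pw_tokenizer).
# Each entry is (format string, argument, expected text).
CASES = [
    ("%+d", 5, "+5"),              # '+' forces a sign on positive numbers
    ("% d", 5, " 5"),              # (space) inserts a blank when no sign is written
    ("%-5d|", 42, "42   |"),       # '-' left-justifies within the width
    ("%05d", -42, "-0042"),        # '0' padding goes after the '-' sign
    ("%#x", 255, "0xff"),          # '#' adds the 0x prefix for nonzero values
    ("%.3d", 7, "007"),            # integer precision is a minimum digit count
    ("%5.3s|", "abcdef", "  abc|"),  # string precision caps the characters written
    ("%.2f%%", 32.33, "32.33%"),   # '%%' prints a single '%'
    ("%g", 0.0001, "0.0001"),      # exponent -4 is not less than -4: f style
    ("%g", 0.00001, "1e-05"),      # exponent -5 is less than -4: e style
    ("%g", 1000000.0, "1e+06"),    # exponent 6 >= default precision 6: e style
]


def render(fmt, value):
    """Format a single argument with Python's printf-style operator."""
    return fmt % (value,)


for fmt, value, expected in CASES:
    assert render(fmt, value) == expected, (fmt, value)
```

The loop raises ``AssertionError`` naming the format string if any case disagrees, which makes the table easy to extend when comparing against another formatter.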