diff --git a/pw_tokenizer/detokenization.rst b/pw_tokenizer/detokenization.rst
new file mode 100644
index 000000000..7fbefec88
--- /dev/null
+++ b/pw_tokenizer/detokenization.rst
@@ -0,0 +1,583 @@
+:tocdepth: 3
+
+.. _module-pw_tokenizer-detokenization:
+
+==============
+Detokenization
+==============
+.. pigweed-module-subpage::
+ :name: pw_tokenizer
+ :tagline: Compress strings to shrink logs by +75%
+
+Detokenization is the process of expanding a token to the string it represents
+and decoding its arguments. ``pw_tokenizer`` provides Python, C++ and
+TypeScript detokenization libraries.
+
+--------------------------------
+Example: decoding tokenized logs
+--------------------------------
+A project might tokenize its log messages with the
+:ref:`module-pw_tokenizer-base64-format`. Consider the following log file, which
+has four tokenized logs and one plain text log:
+
+.. code-block:: text
+
+ 20200229 14:38:58 INF $HL2VHA==
+ 20200229 14:39:00 DBG $5IhTKg==
+ 20200229 14:39:20 DBG Crunching numbers to calculate probability of success
+ 20200229 14:39:21 INF $EgFj8lVVAUI=
+ 20200229 14:39:23 ERR $DFRDNwlOT1RfUkVBRFk=
+
+The project's log strings are stored in a database like the following:
+
+.. code-block::
+
+ 1c95bd1c, ,"Initiating retrieval process for recovery object"
+ 2a5388e4, ,"Determining optimal approach and coordinating vectors"
+ 3743540c, ,"Recovery object retrieval failed with status %s"
+ f2630112, ,"Calculated acceptable probability of success (%.2f%%)"
+
+Using the detokenizing tools with the database, the logs can be decoded:
+
+.. code-block:: text
+
+ 20200229 14:38:58 INF Initiating retrieval process for recovery object
+ 20200229 14:39:00 DBG Determining optimal approach and coordinating vectors
+ 20200229 14:39:20 DBG Crunching numbers to calculate probability of success
+ 20200229 14:39:21 INF Calculated acceptable probability of success (32.33%)
+ 20200229 14:39:23 ERR Recovery object retrieval failed with status NOT_READY
+
+.. note::
+
+ This example uses the :ref:`module-pw_tokenizer-base64-format`, which
+ occupies about 4/3 (133%) as much space as the default binary format when
+ encoded. For projects that wish to interleave tokenized messages with plain
+ text, using Base64 is a worthwhile tradeoff.
+
+------------------------
+Detokenization in Python
+------------------------
+To detokenize in Python, import ``Detokenizer`` from the ``pw_tokenizer``
+package, and instantiate it with paths to token databases or ELF files.
+
+.. code-block:: python
+
+ import pw_tokenizer
+
+ detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv', 'other/path.elf')
+
+ def process_log_message(log_message):
+     result = detokenizer.detokenize(log_message.payload)
+     print(result)
+
+The ``pw_tokenizer`` package also provides the ``AutoUpdatingDetokenizer``
+class, which can be used in place of the standard ``Detokenizer``. This class
+monitors database files for changes and automatically reloads them when they
+change. This is helpful for long-running tools that use detokenization. The
+class also supports token domains for the given database files in the
+``<path>#<domain>`` format.
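+
+For example, a long-running log processor might be set up as follows (a
+minimal sketch; the database paths and the ``#my_domain`` suffix are
+illustrative):
+
+.. code-block:: python
+
+ import pw_tokenizer
+
+ # Reloads the token databases whenever the underlying files change. The
+ # optional "#<domain>" suffix restricts a database to a single token domain.
+ detokenizer = pw_tokenizer.AutoUpdatingDetokenizer(
+     'path/to/database.csv',
+     'path/to/firmware_image.elf#my_domain',
+ )
+
+ def process_log_message(log_message):
+     print(detokenizer.detokenize(log_message.payload))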
+
+For messages that are optionally tokenized and may be encoded as binary,
+Base64, or plaintext UTF-8, use
+:func:`pw_tokenizer.proto.decode_optionally_tokenized`. This will attempt to
+determine the correct method to detokenize and always provide a printable
+string.
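+
+For instance (a sketch that assumes a ``detokenizer`` constructed as above and
+uses an illustrative payload):
+
+.. code-block:: python
+
+ from pw_tokenizer import proto
+
+ # Returns a printable string whether the payload was binary tokenized,
+ # prefixed Base64, or plain UTF-8 text.
+ print(proto.decode_optionally_tokenized(detokenizer, b'$EgFj8lVVAUI='))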
+
+.. _module-pw_tokenizer-base64-decoding:
+
+Decoding Base64
+===============
+The Python ``Detokenizer`` class supports decoding and detokenizing prefixed
+Base64 messages with ``detokenize_base64`` and related methods.
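+
+For example, prefixed Base64 messages like those in the log file above can be
+decoded directly (a sketch, assuming the same token database):
+
+.. code-block:: python
+
+ import pw_tokenizer
+
+ detokenizer = pw_tokenizer.Detokenizer('path/to/database.csv')
+
+ # Replaces recognized $-prefixed Base64 tokens with their detokenized text.
+ print(detokenizer.detokenize_base64(b'20200229 14:39:21 INF $EgFj8lVVAUI='))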
+
+.. tip::
+ The Python detokenization tools support recursive detokenization for prefixed
+ Base64 text. Tokenized strings found in detokenized text are detokenized, so
+ prefixed Base64 messages can be passed as ``%s`` arguments.
+
+ For example, the tokenized string for "Wow!" is ``$RhYjmQ==``. This could be
+ passed as an argument to the printf-style string ``Nested message: %s``, which
+ encodes to ``$pEVTYQkkUmhZam1RPT0=``. The detokenizer would decode the message
+ as follows:
+
+ ::
+
+ "$pEVTYQkkUmhZam1RPT0=" → "Nested message: $RhYjmQ==" → "Nested message: Wow!"
+
+Base64 decoding is supported in C++ or C with the
+``pw::tokenizer::PrefixedBase64Decode`` or ``pw_tokenizer_PrefixedBase64Decode``
+functions.
+
+Investigating undecoded Base64 messages
+---------------------------------------
+Tokenized messages cannot be decoded if the token is not recognized. The Python
+package includes the ``parse_message`` tool, which parses tokenized Base64
+messages without looking up the token in a database. This tool attempts to guess
+the types of the arguments and displays potential ways to decode them.
+
+This tool can be used to extract argument information from an otherwise unusable
+message. It could help identify which statement in the code produced the
+message. This tool is not particularly helpful for tokenized messages without
+arguments, since all it can do is show the value of the unknown token.
+
+The tool is executed by passing Base64 tokenized messages, with or without the
+``$`` prefix, to ``pw_tokenizer.parse_message``. Pass ``-h`` or ``--help`` to
+see full usage information.
+
+Example
+^^^^^^^
+.. code-block::
+
+ $ python -m pw_tokenizer.parse_message '$329JMwA=' koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw== --specs %s %d
+
+ INF Decoding arguments for '$329JMwA='
+ INF Binary: b'\xdfoI3\x00' [df 6f 49 33 00] (5 bytes)
+ INF Token: 0x33496fdf
+ INF Args: b'\x00' [00] (1 bytes)
+ INF Decoding with up to 8 %s or %d arguments
+ INF Attempt 1: [%s]
+ INF Attempt 2: [%d] 0
+
+ INF Decoding arguments for '$koSl524TRkFJTEVEX1BSRUNPTkRJVElPTgJPSw=='
+ INF Binary: b'\x92\x84\xa5\xe7n\x13FAILED_PRECONDITION\x02OK' [92 84 a5 e7 6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (28 bytes)
+ INF Token: 0xe7a58492
+ INF Args: b'n\x13FAILED_PRECONDITION\x02OK' [6e 13 46 41 49 4c 45 44 5f 50 52 45 43 4f 4e 44 49 54 49 4f 4e 02 4f 4b] (24 bytes)
+ INF Decoding with up to 8 %s or %d arguments
+ INF Attempt 1: [%d %s %d %d %d] 55 FAILED_PRECONDITION 1 -40 -38
+ INF Attempt 2: [%d %s %s] 55 FAILED_PRECONDITION OK
+
+
+.. _module-pw_tokenizer-protobuf-tokenization-python:
+
+Detokenizing protobufs
+======================
+The :py:mod:`pw_tokenizer.proto` Python module defines functions that may be
+used to detokenize protobuf objects in Python. The function
+:py:func:`pw_tokenizer.proto.detokenize_fields` detokenizes all fields
+annotated as tokenized, replacing them with their detokenized version. For
+example:
+
+.. code-block:: python
+
+ my_detokenizer = pw_tokenizer.Detokenizer(some_database)
+
+ my_message = SomeMessage(tokenized_field=b'$YS1EMQ==')
+ pw_tokenizer.proto.detokenize_fields(my_detokenizer, my_message)
+
+ assert my_message.tokenized_field == b'The detokenized string! Cool!'
+
+Decoding optionally tokenized strings
+-------------------------------------
+The encoding used for an optionally tokenized field is not recorded in the
+protobuf. Despite this, the text can reliably be decoded. This is accomplished
+by attempting to decode the field as binary or Base64 tokenized data before
+treating it like plain text.
+
+The following diagram describes the decoding process for optionally tokenized
+fields in detail.
+
+.. mermaid::
+
+ flowchart TD
+ start([Received bytes]) --> binary
+
+ binary[Decode as<br>binary tokenized] --> binary_ok
+ binary_ok{Detokenizes<br>successfully?} -->|no| utf8
+ binary_ok -->|yes| done_binary([Display decoded binary])
+
+ utf8[Decode as UTF-8] --> utf8_ok
+ utf8_ok{Valid UTF-8?} -->|no| base64_encode
+ utf8_ok -->|yes| base64
+
+ base64_encode[Encode as<br>tokenized Base64] --> display
+ display([Display encoded Base64])
+
+ base64[Decode as<br>Base64 tokenized] --> base64_ok
+
+ base64_ok{Fully<br>or partially<br>detokenized?} -->|no| is_plain_text
+ base64_ok -->|yes| base64_results
+
+ is_plain_text{Text is<br>printable?} -->|no| base64_encode
+ is_plain_text-->|yes| plain_text
+
+ base64_results([Display decoded Base64])
+ plain_text([Display text])
+
+Potential decoding problems
+---------------------------
+The decoding process for optionally tokenized fields will yield correct results
+in almost every situation. In rare circumstances it is possible for it to fail,
+but these failures can be avoided with a low-overhead mitigation if desired.
+
+There are two ways in which the decoding process may fail.
+
+Accidentally interpreting plain text as tokenized binary
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+If a plain-text string happens to decode as a binary tokenized message, the
+incorrect message could be displayed. This is very unlikely to occur. While many
+tokens will incidentally end up being valid UTF-8 strings, it is highly unlikely
+that a device will happen to log one of these strings as plain text. The
+overwhelming majority of these strings will be nonsense.
+
+If an implementation wishes to guard against this extremely improbable
+situation, it can do so by appending 0xFF (or another byte that is never valid
+in UTF-8) to binary tokenized data that happens to be valid UTF-8 (or to all
+binary tokenized messages, if desired). When decoding, an extra trailing 0xFF
+byte is simply discarded.
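+
+A sketch of this mitigation on the encoding side (the helper names are
+illustrative and not part of ``pw_tokenizer``):
+
+.. code-block:: python
+
+ def escape_tokenized(tokenized: bytes) -> bytes:
+     """Appends 0xFF if the tokenized payload happens to be valid UTF-8."""
+     try:
+         tokenized.decode('utf-8')
+     except UnicodeDecodeError:
+         return tokenized  # Cannot be mistaken for plain text; leave as is.
+     return tokenized + b'\xff'  # 0xFF never appears in valid UTF-8.
+
+ def unescape_tokenized(data: bytes) -> bytes:
+     """Discards a trailing 0xFF escape byte, if present, before decoding."""
+     return data[:-1] if data.endswith(b'\xff') else data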
+
+Displaying undecoded binary as plain text instead of Base64
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+If a message fails to decode as binary tokenized and it is not valid UTF-8, it
+is displayed as tokenized Base64. This makes it easily recognizable as a
+tokenized message and makes it simple to decode later from the text output (for
+example, with an updated token database).
+
+A binary message for which the token is not known may coincidentally be valid
+UTF-8 or ASCII. 6.25% of 4-byte sequences are composed only of ASCII characters.
+When decoding with an out-of-date token database, it is possible that some
+binary tokenized messages will be displayed as plain text rather than tokenized
+Base64.
+
+This situation is likely to occur, but should be infrequent. Even if it does
+happen, it is not a serious issue. A very small number of strings will be
+displayed incorrectly, but these strings cannot be decoded anyway. One nonsense
+string (e.g. ``a-D1``) would be displayed instead of another (``$YS1EMQ==``).
+Updating the token database would resolve the issue, though the non-Base64 logs
+would be difficult to decode later from a log file.
+
+This situation can be avoided with the same approach described in
+`Accidentally interpreting plain text as tokenized binary`_. Appending
+an invalid UTF-8 character prevents the undecoded binary message from being
+interpreted as plain text.
+
+---------------------
+Detokenization in C++
+---------------------
+The C++ detokenization libraries can be used in C++ or any language that can
+call into C++ with a C-linkage wrapper, such as Java or Rust. A reference
+Java Native Interface (JNI) implementation is provided.
+
+The C++ detokenization library uses binary-format token databases (created with
+``database.py create --type binary``). Read a binary format database from a
+file or include it in the source code. Pass the database array to
+``TokenDatabase::Create``, and construct a detokenizer.
+
+.. code-block:: cpp
+
+ Detokenizer detokenizer(TokenDatabase::Create(token_database_array));
+
+ std::string ProcessLog(span<uint8_t> log_data) {
+   return detokenizer.Detokenize(log_data).BestString();
+ }
+
+The ``TokenDatabase`` class verifies that its data is valid before using it. If
+it is invalid, ``TokenDatabase::Create`` returns an empty database for which
+``ok()`` returns false. If the token database is included in the source code,
+this check can be done at compile time.
+
+.. code-block:: cpp
+
+ // This line fails to compile with a static_assert if the database is invalid.
+ constexpr TokenDatabase kDefaultDatabase = TokenDatabase::Create<kData>();
+
+ Detokenizer OpenDatabase(std::string_view path) {
+   std::vector<uint8_t> data = ReadWholeFile(path);
+
+   TokenDatabase database = TokenDatabase::Create(data);
+
+   // This checks if the file contained a valid database. It is safe to use a
+   // TokenDatabase that failed to load (it will be empty), but it may be
+   // desirable to provide a default database or otherwise handle the error.
+   if (database.ok()) {
+     return Detokenizer(database);
+   }
+   return Detokenizer(kDefaultDatabase);
+ }
+
+----------------------------
+Detokenization in TypeScript
+----------------------------
+To detokenize in TypeScript, import ``Detokenizer`` from the ``pigweedjs``
+package, and instantiate it with a CSV token database.
+
+.. code-block:: typescript
+
+ import { pw_tokenizer, pw_hdlc } from 'pigweedjs';
+ const { Detokenizer } = pw_tokenizer;
+ const { Frame } = pw_hdlc;
+
+ const detokenizer = new Detokenizer(String(tokenCsv));
+
+ function processLog(frame: Frame) {
+   const result = detokenizer.detokenize(frame);
+   console.log(result);
+ }
+
+For messages that are encoded in Base64, use ``Detokenizer::detokenizeBase64``.
+``detokenizeBase64`` also attempts to detokenize nested Base64 tokens. There is
+also ``detokenizeUint8Array``, which works just like ``detokenize`` but expects
+a ``Uint8Array`` instead of a ``Frame`` argument.
+
+
+
+.. _module-pw_tokenizer-cli-detokenizing:
+
+---------------------
+Detokenizing CLI tool
+---------------------
+``pw_tokenizer`` provides two standalone command line utilities for detokenizing
+Base64-encoded tokenized strings.
+
+* ``detokenize.py`` -- Detokenizes Base64-encoded strings in files or from
+ stdin.
+* ``serial_detokenizer.py`` -- Detokenizes Base64-encoded strings from a
+ connected serial device.
+
+If the ``pw_tokenizer`` Python package is installed, these tools may be executed
+as runnable modules. For example:
+
+.. code-block::
+
+ # Detokenize Base64-encoded strings in a file
+ python -m pw_tokenizer.detokenize -i input_file.txt
+
+ # Detokenize Base64-encoded strings in output from a serial device
+ python -m pw_tokenizer.serial_detokenizer --device /dev/ttyACM0
+
+See the ``--help`` options for these tools for full usage information.
+
+--------
+Appendix
+--------
+
+.. _module-pw_tokenizer-python-detokenization-c99-printf-notes:
+
+Python detokenization: C99 ``printf`` compatibility notes
+=========================================================
+This implementation is designed to align with the
+`C99 specification, section 7.19.6
+<https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf>`_.
+Notably, this specification is slightly different from what most compilers
+implement, because each compiler interprets undefined behavior in slightly
+different ways. Treat the following description as the
+source of truth.
+
+This implementation supports:
+
+- Overall Format: ``%[flags][width][.precision][length][specifier]``
+- Flags (Zero or More)
+ - ``-``: Left-justify within the given field width; right justification is
+ the default (see Width modifier).
+ - ``+``: Forces the result to be preceded with a plus or minus sign (``+`` or
+ ``-``), even for positive numbers. By default, only negative numbers are
+ preceded with a ``-`` sign.
+ - (space): If no sign is going to be written, a blank space is inserted
+ before the value.
+ - ``#``: Specifies that an alternative print syntax should be used.
+ - Used with the ``o``, ``x`` or ``X`` specifiers, the value is preceded with
+ ``0``, ``0x`` or ``0X``, respectively, for values other than zero.
+ - Used with ``a``, ``A``, ``e``, ``E``, ``f``, ``F``, ``g``, or ``G``, it
+ forces the written output to contain a decimal point even if no more
+ digits follow. By default, if no digits follow, no decimal point is
+ written.
+ - ``0``: Left-pads the number with zeroes (``0``) instead of spaces when
+ padding is specified (see width sub-specifier).
+- Width (Optional)
+ - ``(number)``: Minimum number of characters to be printed. If the value to
+ be printed is shorter than this number, the result is padded with blank
+ spaces or ``0`` if the ``0`` flag is present. The value is not truncated
+ even if the result is larger. If the value is negative and the ``0`` flag
+ is present, the ``0``\s are padded after the ``-`` symbol.
+ - ``*``: The width is not specified in the format string, but as an
+ additional integer value argument preceding the argument that has to be
+ formatted.
+- Precision (Optional)
+ - ``.(number)``
+ - For ``d``, ``i``, ``o``, ``u``, ``x``, ``X``, specifies the minimum
+ number of digits to be written. If the value to be written is shorter
+ than this number, the result is padded with leading zeros. The value is
+ not truncated even if the result is longer.
+
+ - A precision of ``0`` means that no character is written for the value
+ ``0``.
+
+ - For ``a``, ``A``, ``e``, ``E``, ``f``, and ``F``, specifies the number
+ of digits to be printed after the decimal point. By default, this is
+ ``6``.
+
+ - For ``g`` and ``G``, specifies the maximum number of significant digits
+ to be printed.
+
+ - For ``s``, specifies the maximum number of characters to be printed. By
+ default all characters are printed until the ending null character is
+ encountered.
+
+ - If the period is specified without an explicit value for precision,
+ ``0`` is assumed.
+ - ``.*``: The precision is not specified in the format string, but as an
+ additional integer value argument preceding the argument that has to be
+ formatted.
+- Length (Optional)
+ - ``hh``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
+ to convey the argument will be a ``signed char`` or ``unsigned char``.
+ However, this is largely ignored in the implementation due to it not being
+ necessary for Python or argument decoding (since the argument is always
+ encoded at least as a 32-bit integer).
+ - ``h``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
+ to convey the argument will be a ``signed short int`` or
+ ``unsigned short int``. However, this is largely ignored in the
+ implementation due to it not being necessary for Python or argument
+ decoding (since the argument is always encoded at least as a 32-bit
+ integer).
+ - ``l``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
+ to convey the argument will be a ``signed long int`` or
+ ``unsigned long int``. It is also usable with ``c`` and ``s`` to specify
+ that the arguments will be encoded with ``wchar_t`` values (which are no
+ different from normal ``char`` values). However, this is largely ignored in
+ the implementation due to it not being necessary for Python or argument
+ decoding (since the argument is always encoded at least as a 32-bit
+ integer).
+ - ``ll``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
+ to convey the argument will be a ``signed long long int`` or
+ ``unsigned long long int``. This is required to properly decode the
+ argument as a 64-bit integer.
+ - ``L``: Usable with ``a``, ``A``, ``e``, ``E``, ``f``, ``F``, ``g``, or
+ ``G`` conversion specifiers to convey the argument will be a
+ ``long double``. However, this is ignored in the implementation because the
+ floating point encoding is unaffected by bit width.
+ - ``j``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
+ to convey the argument will be an ``intmax_t`` or ``uintmax_t``.
+ - ``z``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
+ to convey the argument will be a ``size_t``. This will force the argument
+ to be decoded as an unsigned integer.
+ - ``t``: Usable with ``d``, ``i``, ``o``, ``u``, ``x``, or ``X`` specifiers
+ to convey the argument will be a ``ptrdiff_t``.
+ - If a length modifier is provided for an incorrect specifier, it is ignored.
+- Specifier (Required)
+ - ``d`` / ``i``: Used for signed decimal integers.
+
+ - ``u``: Used for unsigned decimal integers.
+
+ - ``o``: Used for unsigned integers and specifies formatting as an octal
+ number.
+
+ - ``x``: Used for unsigned integers and specifies formatting as a hexadecimal
+ number using all lowercase letters.
+
+ - ``X``: Used for unsigned integers and specifies formatting as a hexadecimal
+ number using all uppercase letters.
+
+ - ``f``: Used for floating-point values and specifies to use lowercase,
+ decimal floating point formatting.
+
+ - Default precision is ``6`` decimal places unless explicitly specified.
+
+ - ``F``: Used for floating-point values and specifies to use uppercase,
+ decimal floating point formatting.
+
+ - Default precision is ``6`` decimal places unless explicitly specified.
+
+ - ``e``: Used for floating-point values and specifies to use lowercase,
+ exponential (scientific) formatting.
+
+ - Default precision is ``6`` decimal places unless explicitly specified.
+
+ - ``E``: Used for floating-point values and specifies to use uppercase,
+ exponential (scientific) formatting.
+
+ - Default precision is ``6`` decimal places unless explicitly specified.
+
+ - ``g``: Used for floating-point values and specifies to use ``f`` or ``e``
+ formatting depending on which would be the shortest representation.
+
+ - Precision specifies the number of significant digits, not just digits
+ after the decimal place.
+
+ - If the precision is specified as ``0``, it is interpreted to mean ``1``.
+
+ - ``e`` formatting is used if the exponent would be less than ``-4`` or
+ is greater than or equal to the precision.
+
+ - Trailing zeros are removed unless the ``#`` flag is set.
+
+ - A decimal point only appears if it is followed by a digit.
+
+ - ``NaN`` or infinities always follow ``f`` formatting.
+
+ - ``G``: Used for floating-point values and specifies to use ``F`` or ``E``
+ formatting depending on which would be the shortest representation.
+
+ - Precision specifies the number of significant digits, not just digits
+ after the decimal place.
+
+ - If the precision is specified as ``0``, it is interpreted to mean ``1``.
+
+ - ``E`` formatting is used if the exponent would be less than ``-4`` or
+ is greater than or equal to the precision.
+
+ - Trailing zeros are removed unless the ``#`` flag is set.
+
+ - A decimal point only appears if it is followed by a digit.
+
+ - ``NaN`` or infinities always follow ``F`` formatting.
+
+ - ``c``: Used for formatting a ``char`` value.
+
+ - ``s``: Used for formatting a string of ``char`` values.
+
+ - If width is specified, the null terminator character is included as a
+ character for width count.
+
+ - If precision is specified, no more ``char``\s than that value will be
+ written from the string (padding is used to fill additional width).
+
+ - ``p``: Used for formatting a pointer address.
+
+ - ``%``: Prints a single ``%``. Only valid as ``%%`` (supports no flags,
+ width, precision, or length modifiers).
+
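+Several of the flag and precision behaviors above can be illustrated with
+Python's own printf-style ``%`` operator, which follows the same C99
+conventions for these cases (shown purely for illustration; this does not
+invoke the ``pw_tokenizer`` decoder):
+
+.. code-block:: python
+
+ # Flags: '+' forces a sign, '#' adds the 0x prefix, '0' pads with zeros
+ # (placed after the '-' sign for negative values).
+ assert '%+d' % 42 == '+42'
+ assert '%#x' % 255 == '0xff'
+ assert '%08.2f' % -3.5 == '-0003.50'
+
+ # Precision: minimum digits for integers, digits after the point for floats,
+ # and the maximum number of characters for strings.
+ assert '%.5d' % 42 == '00042'
+ assert '%.3f' % 2.5 == '2.500'
+ assert '%.3s' % 'tokenizer' == 'tok'
+
+ # %g uses exponential notation when the exponent is less than -4 or at least
+ # the precision, and strips trailing zeros.
+ assert '%g' % 0.0001 == '0.0001'
+ assert '%g' % 0.00001 == '1e-05'
+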
+Underspecified details:
+
+- If both ``+`` and (space) flags appear, the (space) is ignored.
+- The ``+`` and (space) flags will error if used with ``c`` or ``s``.
+- The ``#`` flag will error if used with ``d``, ``i``, ``u``, ``c``, ``s``, or
+ ``p``.
+- The ``0`` flag will error if used with ``c``, ``s``, or ``p``.
+- Both ``+`` and (space) can work with the unsigned integer specifiers ``u``,
+ ``o``, ``x``, and ``X``.
+- If a length modifier is provided for an incorrect specifier, it is ignored.
+- The ``z`` length modifier will decode arguments as signed as long as ``d`` or
+ ``i`` is used.
+- ``p`` is implementation defined.
+
+ - For this implementation, the pointer is printed with a ``0x`` prefix
+ followed by the value formatted with ``%08X``.
+
+ - ``p`` supports the ``+``, ``-``, and (space) flags, but not the ``#`` or
+ ``0`` flags.
+
+ - None of the length modifiers are usable with ``p``.
+
+ - This implementation will try to adhere to user-specified width (assuming the
+ width provided is larger than the guaranteed minimum of ``10``).
+
+ - Specifying precision for ``p`` is considered an error.
+- Only ``%%`` is allowed with no other modifiers. Things like ``%+%`` will fail
+ to decode. Some C stdlib implementations accept modifiers between the two
+ ``%`` characters, but ignore them in the output.
+- If a width is specified with the ``0`` flag for a negative value, the padded
+ ``0``\s will appear after the ``-`` symbol.
+- A precision of ``0`` for ``d``, ``i``, ``u``, ``o``, ``x``, or ``X`` means
+ that no character is written for the value ``0``.
+- Precision cannot be specified for ``c``.
+- Using ``*`` or fixed precision with the ``s`` specifier still requires the
+ string argument to be null-terminated. This is due to argument encoding
+ happening on the C/C++-side while the precision value is not read or
+ otherwise used until decoding happens in this Python code.
+
+Non-conformant details:
+
+- ``n`` specifier: We do not support the ``n`` specifier because it is
+ impossible to retroactively tell the original program how many characters
+ have been printed; decoding happens long after the device sent the message,
+ usually on a separate processing device entirely.