doFormat counts bytes, not characters

February 09, 2005
Posted by Nick
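When calculating padding, doFormat counts the number of bytes in the string, not the number of characters. As a result, UTF-8 strings containing multibyte characters are padded incorrectly: each multibyte character makes the string look longer than it is, so too little padding is added.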
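
To make the discrepancy concrete, here is a minimal sketch in present-day D (the `string` type and selective imports postdate this post). The helper names `paddingByBytes` and `paddingByChars` are hypothetical; they only mirror the padding arithmetic without the prefix term and are not part of format.d.

```d
import std.utf : toUTF32;

// Hypothetical helper: padding derived from the byte count,
// which is what s.length yields for a UTF-8 string.
int paddingByBytes(string s, int fieldWidth)
{
    return fieldWidth - cast(int) s.length;
}

// Hypothetical helper: padding derived from the number of
// decoded characters (code points).
int paddingByChars(string s, int fieldWidth)
{
    return fieldWidth - cast(int) toUTF32(s).length;
}

unittest
{
    // "na\u00EFve" has 5 characters, but the i-diaeresis takes 2 bytes in UTF-8.
    string s = "na\u00EFve";
    assert(s.length == 6);
    assert(paddingByBytes(s, 8) == 2); // one space short
    assert(paddingByChars(s, 8) == 3); // the intended padding
}
```

With a field width of 8, the byte-based calculation pads by 2 where 3 is needed, so the column ends up one cell too narrow for every two-byte character in the string.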

A quick and dirty fix is to apply the following to format.d:

141c141
<           int padding = field_width - (strlen(prefix) + s.length);
---
>           int padding = field_width - (strlen(prefix) + toUTF32(s).length);

A better solution would be to add functions to std.utf for counting the number of characters in a string. This is slightly faster and avoids the unnecessary memory allocation that toUTF32 incurs.

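One possible way to do this (for UTF-8) follows below. It's basically a stripped-down version of decode(): it validates each sequence in the same way but only counts the characters instead of decoding them.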

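# // Returns the number of characters encoded in the UTF-8 string s.
# // Throws UtfError if s contains an invalid sequence.
# size_t countChars(char[] s)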
# {
#   size_t len = s.length;
#   size_t chars, i;
#   for( i = 0; i != len;)
#   {
#       char u = s[i];
#
#       if (u & 0x80)
#         {
#           uint n;
#           char u2;
#
#           // Check for valid encodings
#           for (n = 1; ; n++)
#             {
#               if (n > 4)
#                 goto Lerr;          // only do the first 4 of 6 encodings
#
#               if (((u << n) & 0x80) == 0)
#                 {
#                   if (n == 1)
#                     goto Lerr;
#                   break;
#                 }
#             }
#
#           if (i + (n - 1) >= len)
#             goto Lerr;                      // off end of string
#
#           u2 = s[i + 1];
#           if ((u & 0xFE) == 0xC0 ||
#             (u == 0xE0 && (u2 & 0xE0) == 0x80) ||
#             (u == 0xF0 && (u2 & 0xF0) == 0x80) ||
#             (u == 0xF8 && (u2 & 0xF8) == 0x80) ||
#             (u == 0xFC && (u2 & 0xFC) == 0x80))
#               goto Lerr;                      // overlong combination
#
#           for (uint j = 1; j != n; j++)
#           {
#               u = s[i + j];
#               if ((u & 0xC0) != 0x80)
#                 goto Lerr;                  // trailing bytes are 10xxxxxx
#           }
#           i += n;
#        }
#        else i++;
#        chars++;
#   }
#   return chars;
#
# Lerr:
#   throw new UtfError("invalid UTF-8 sequence", i);
# }
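
If the input is already known to be valid UTF-8, the validation can be skipped. The following is a sketch of that simpler alternative, not a replacement for the function above: it counts every byte that is not a continuation byte (10xxxxxx), since each character has exactly one non-continuation byte. The name `countCharsUnchecked` is hypothetical, and the code is written for present-day D.

```d
// Hypothetical helper: counts characters in a string that is assumed
// to be valid UTF-8. No validation is performed and nothing is allocated.
size_t countCharsUnchecked(const(char)[] s)
{
    size_t chars = 0;
    foreach (char c; s) // iterate over code units (bytes), not decoded characters
    {
        // Continuation bytes have the form 10xxxxxx; any other byte
        // is either ASCII or the lead byte of a multibyte sequence.
        if ((c & 0xC0) != 0x80)
            ++chars;
    }
    return chars;
}

unittest
{
    assert(countCharsUnchecked("") == 0);
    assert(countCharsUnchecked("abc") == 3);
    assert(countCharsUnchecked("na\u00EFve") == 5);  // 6 bytes, 5 characters
    assert(countCharsUnchecked("\U0001F600") == 1);  // one 4-byte sequence
}
```

The trade-off is that malformed input yields a wrong count rather than an error, so this only suits strings that have already been validated elsewhere.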

Nick