UTF-16 endianness

Jan 29, 2016

Marek Janukowicz


I have trouble understanding how endianness works for UTF-16.

For example, the UTF-16 code for the 'ł' character is 0x0142. But this program shows otherwise:

import std.stdio;

public void main () {
   ubyte[] properOrder = [0x01, 0x42];
   ubyte[] reverseOrder = [0x42, 0x01];
   writefln( "proper: %s, reverse: %s",
      cast(wchar[])properOrder,
      cast(wchar[])reverseOrder );
}

output:

proper: 䈁, reverse: ł

Is there anything I should know about UTF endianness?

--
Marek Janukowicz

On 1/29/16 5:36 PM, Marek Janukowicz wrote:
> I have trouble understanding how endianness works for UTF-16.
>
> For example UTF-16 code for 'ł' character is 0x0142. But this program shows
> otherwise:
>
> import std.stdio;
>
> public void main () {
>    ubyte[] properOrder = [0x01, 0x42];
>    ubyte[] reverseOrder = [0x42, 0x01];
>    writefln( "proper: %s, reverse: %s",
>       cast(wchar[])properOrder,
>       cast(wchar[])reverseOrder );
> }
>
> output:
>
> proper: 䈁, reverse: ł
>
> Is there anything I should know about UTF endianness?

It's not any different from other endianness.

In other words, a UTF-16 code unit is expected to be in the endianness of the platform you are running on.

If you are on x86 or x86_64 (very likely), then it should be little endian.

If your source of data is big-endian (or opposite from your native endianness), it will have to be converted before treating it as a wchar[].

Note the version identifiers BigEndian and LittleEndian can be used to compile the correct code.

-Steve
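To make the point about byte order concrete, here is a minimal sketch that combines the same two bytes from the original post both ways, without relying on the host CPU. The helper names readBigEndian and readLittleEndian are made up for this illustration and are not part of Phobos.

```d
import std.stdio;

/// Hypothetical helper: read two bytes as a big-endian UTF-16 code unit
/// (the first byte is the high byte).
wchar readBigEndian(ubyte first, ubyte second)
{
    return cast(wchar)((first << 8) | second);
}

/// Hypothetical helper: read two bytes as a little-endian UTF-16 code unit
/// (the first byte is the low byte).
wchar readLittleEndian(ubyte first, ubyte second)
{
    return cast(wchar)((second << 8) | first);
}

void main()
{
    // The bytes 0x01, 0x42 from the original post, read in both orders.
    writefln("big-endian reading:    U+%04X", cast(uint) readBigEndian(0x01, 0x42));
    writefln("little-endian reading: U+%04X", cast(uint) readLittleEndian(0x01, 0x42));
}
```

The big-endian reading gives U+0142 ('ł') and the little-endian reading gives U+4201 ('䈁'). The second value is what cast(wchar[]) produced for the "proper" array on an x86_64 machine.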

On Friday, 29 January 2016 at 22:36:37 UTC, Marek Janukowicz wrote:
> I have trouble understanding how endianness works for UTF-16.

UTF-16 (as well as UTF-32) comes in both little-endian and big-endian variants. A byte order mark (BOM) in the file can help you detect which variant it uses. See this table: http://www.unicode.org/faq/utf_bom.html#gen6
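As a rough illustration of BOM detection, the sketch below checks the first two bytes of a buffer for the UTF-16 byte order mark (U+FEFF). The ByteOrder enum and detectUtf16Bom function are hypothetical names for this example. It also assumes the data is UTF-16 at all, since the bytes FF FE can also begin a UTF-32 little-endian BOM (FF FE 00 00).

```d
import std.stdio;

/// Hypothetical result type for this sketch.
enum ByteOrder { unknown, littleEndian, bigEndian }

/// Look for a UTF-16 byte order mark (U+FEFF) at the start of the data.
/// Returns ByteOrder.unknown when no BOM is present; the caller then has
/// to know the byte order from the protocol or file format.
ByteOrder detectUtf16Bom(const(ubyte)[] data)
{
    if (data.length >= 2)
    {
        if (data[0] == 0xFE && data[1] == 0xFF)
            return ByteOrder.bigEndian;    // FE FF: big-endian
        if (data[0] == 0xFF && data[1] == 0xFE)
            return ByteOrder.littleEndian; // FF FE: little-endian
    }
    return ByteOrder.unknown;
}

void main()
{
    writeln(detectUtf16Bom([0xFE, 0xFF, 0x01, 0x42])); // BOM followed by 'ł' in big-endian
    writeln(detectUtf16Bom([0x01, 0x42]));             // no BOM
}
```

Data without a BOM returns unknown here, so this check only helps when the producer actually writes one.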

On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
>> Is there anything I should know about UTF endianness?
>
> It's not any different from other endianness.
>
> In other words, a UTF16 code unit is expected to be in the endianness of the platform you are running on.
>
> If you are on x86 or x86_64 (very likely), then it should be little endian.
>
> If your source of data is big-endian (or opposite from your native endianness),

To be precise - my case is IMAP UTF7 folder name encoding and I finally found out it's indeed big endian, which explains my problem (as I'm indeed on x86_64).

> it will have to be converted before treating as a wchar[].

Is there any clever way to do the conversion? Or do I need to swap the bytes manually?

> Note the version identifiers BigEndian and LittleEndian can be used to compile the correct code.

This solution is of no use to me as I don't want to change the endianness in general.

--
Marek Janukowicz

January 29, 2016

Re: UTF-16 endianness

Posted by Steven Schveighoffer
in reply to Marek Janukowicz


On 1/29/16 6:03 PM, Marek Janukowicz wrote:
> On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
>>> Is there anything I should know about UTF endianness?
>>
>> It's not any different from other endianness.
>>
>> In other words, a UTF16 code unit is expected to be in the endianness of
>> the platform you are running on.
>>
>> If you are on x86 or x86_64 (very likely), then it should be little endian.
>>
>> If your source of data is big-endian (or opposite from your native
>> endianness),
>
> To be precise - my case is IMAP UTF7 folder name encoding and I finally found
> out it's indeed big endian, which explains my problem (as I'm indeed on x86_64).
>
>> it will have to be converted before treating as a wchar[].
>
> Is there any clever way to do the conversion? Or do I need to swap the bytes
> manually?

No clever way, just the straightforward way ;)

Swapping the endianness of a 32-bit value can be done with core.bitop.bswap. For 16 bits, I believe you have to do bit shifting. Something like:

foreach(ref elem; wcharArr)
   elem = cast(wchar)(((elem << 8) & 0xff00) | ((elem >> 8) & 0x00ff));

Or you can do it with the bytes directly, before casting.
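The "bytes directly before casting" variant could look roughly like the sketch below. The function name swapPairsThenCast is invented for this example. It swaps unconditionally, so the result is only right when the input byte order is the opposite of the host's (for example, big-endian data on x86_64).

```d
import std.algorithm.mutation : swap;
import std.stdio;

/// Hypothetical helper: copy the bytes, swap each adjacent pair, then
/// reinterpret the copy as UTF-16 code units in the host's byte order.
wchar[] swapPairsThenCast(const(ubyte)[] bytes)
{
    assert(bytes.length % 2 == 0, "UTF-16 data must have an even byte count");

    ubyte[] copy = bytes.dup; // leave the caller's buffer untouched
    for (size_t i = 0; i + 1 < copy.length; i += 2)
        swap(copy[i], copy[i + 1]);

    return cast(wchar[]) copy; // reinterprets the memory, no conversion
}

void main()
{
    // Big-endian UTF-16 bytes for 'ł' (U+0142), as in the original post.
    ubyte[] fromWire = [0x01, 0x42];
    writeln(swapPairsThenCast(fromWire));
}
```

Working on a copy costs an allocation, but it avoids surprising the caller by modifying the input buffer in place.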

>
>> Note the version identifiers BigEndian and LittleEndian can be used to
>> compile the correct code.
>
> This solution is of no use to me as I don't want to change the endianness in
> general.

What I mean is that you can annotate your code with version statements like:

version(LittleEndian)
{
   // perform the byteswap
   ...
}

so your code is portable to BigEndian systems (where you would not want to byte swap).

-Steve
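Putting the two suggestions together, a portable version might be sketched as follows. The function name bigEndianUnitsToNative is an assumption for this example. The swap is compiled in only on little-endian targets, so big-endian data passes through unchanged on a big-endian host.

```d
import std.stdio;

/// Hypothetical helper: the code units were read from big-endian data and
/// reinterpreted as wchar; fix them up in place for the host's byte order.
void bigEndianUnitsToNative(wchar[] units)
{
    version (LittleEndian)
    {
        // The host order is opposite to the data, so swap the two bytes of each unit.
        foreach (ref unit; units)
            unit = cast(wchar)(((unit << 8) & 0xff00) | ((unit >> 8) & 0x00ff));
    }
    // On a BigEndian host the units are already in native order: nothing to do.
}

void main()
{
    ubyte[] raw = [0x01, 0x42];     // big-endian bytes for 'ł' (U+0142)
    auto units = cast(wchar[]) raw; // reinterpret in host order
    bigEndianUnitsToNative(units);
    writeln(units);
}
```

Because the swap is selected at compile time, the same source should produce 'ł' on both kinds of host.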

On Fri, 29 Jan 2016 18:58:17 -0500, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
> On 1/29/16 6:03 PM, Marek Janukowicz wrote:
> > On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
> >>> Is there anything I should know about UTF endianness?
> >>
> >> It's not any different from other endianness.
> >>
> >> In other words, a UTF16 code unit is expected to be in the endianness of the platform you are running on.
> >>
> >> If you are on x86 or x86_64 (very likely), then it should be
> >> little endian.
> >>
> >> If your source of data is big-endian (or opposite from your native
> >> endianness),
> >
> > To be precise - my case is IMAP UTF7 folder name encoding and I finally found out it's indeed big endian, which explains my problem (as I'm indeed on x86_64).
> >
> >> it will have to be converted before treating as a wchar[].
> >
> > Is there any clever way to do the conversion? Or do I need to swap the bytes manually?
>
> No clever way, just the straightforward way ;)
>
> Swapping endianness of 32-bits can be done with core.bitop.bswap. Doing it with 16 bits I believe you have to do bit shifting. Something like:
>
> foreach(ref elem; wcharArr) elem = ((elem << 8) & 0xff00) | ((elem >> 8) & 0x00ff);
>
> Or you can do it with the bytes directly before casting

There's also a Phobos solution: bigEndianToNative in std.bitmanip.
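A minimal sketch of the std.bitmanip route, assuming the input is a plain byte buffer holding big-endian UTF-16. The wrapper name decodeUtf16BE is made up for this example; only bigEndianToNative comes from Phobos. It takes a fixed-size ubyte[2] here because the target type is ushort.

```d
import std.bitmanip : bigEndianToNative;
import std.stdio;

/// Hypothetical wrapper: decode big-endian UTF-16 bytes into native wchar
/// code units, letting Phobos decide whether a swap is needed.
wchar[] decodeUtf16BE(const(ubyte)[] bytes)
{
    assert(bytes.length % 2 == 0, "UTF-16 data must have an even byte count");

    auto result = new wchar[bytes.length / 2];
    foreach (i, ref unit; result)
    {
        // bigEndianToNative wants a static array of exactly ushort.sizeof bytes.
        ubyte[2] pair = [bytes[2 * i], bytes[2 * i + 1]];
        unit = cast(wchar) bigEndianToNative!ushort(pair);
    }
    return result;
}

void main()
{
    writeln(decodeUtf16BE([0x01, 0x42])); // big-endian bytes for 'ł' (U+0142)
}
```

This needs no version block in user code, since the library function handles the host's byte order itself.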

On Fri, 29 Jan 2016 18:58:17 -0500, Steven Schveighoffer wrote:
>>> Note the version identifiers BigEndian and LittleEndian can be used to compile the correct code.
>>
>> This solution is of no use to me as I don't want to change the endianness in general.
>
> What I mean is that you can annotate your code with version statements like:
>
> version(LittleEndian)
> {
>    // perform the byteswap
>    ...
> }
>
> so your code is portable to BigEndian systems (where you would not want to byte swap).

That's a good point, thanks.

--
Marek Janukowicz
