Thread overview
UTF-16 endianness
Jan 29, 2016
Marek Janukowicz
Jan 29, 2016
Marek Janukowicz
Jan 30, 2016
Johannes Pfau
Jan 30, 2016
Marek Janukowicz
Jan 29, 2016
Adam D. Ruppe
January 29, 2016
I have trouble understanding how endianness works for UTF-16.

For example, the UTF-16 code for the 'ł' character is 0x0142, but this program shows otherwise:

import std.stdio;

public void main () {
	ubyte[] properOrder = [0x01, 0x42];
	ubyte[] reverseOrder = [0x42, 0x01];
	writefln( "proper: %s, reverse: %s",
		cast(wchar[])properOrder,
		cast(wchar[])reverseOrder );
}

output:

proper: 䈁, reverse: ł

Is there anything I should know about UTF endianness?

-- 
Marek Janukowicz
January 29, 2016
On 1/29/16 5:36 PM, Marek Janukowicz wrote:
> I have trouble understanding how endianness works for UTF-16.
>
> For example UTF-16 code for 'ł' character is 0x0142. But this program shows
> otherwise:
>
> import std.stdio;
>
> public void main () {
>    ubyte[] properOrder = [0x01, 0x42];
> 	ubyte[] reverseOrder = [0x42, 0x01];
> 	writefln( "proper: %s, reverse: %s",
> 		cast(wchar[])properOrder,
> 		cast(wchar[])reverseOrder );
> }
>
> output:
>
> proper: 䈁, reverse: ł
>
> Is there anything I should know about UTF endianness?

It's not any different from other endianness.

In other words, a UTF16 code unit is expected to be in the endianness of the platform you are running on.

If you are on x86 or x86_64 (very likely), then it should be little endian.

If your source of data is big-endian (or opposite from your native endianness), it will have to be converted before treating as a wchar[].

Note the version identifiers BigEndian and LittleEndian can be used to compile the correct code.
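To make the byte-order point concrete, here is a minimal sketch that assembles one UTF-16 code unit from two bytes in each order. The helper names `readBE` and `readLE` are made up for illustration and are not part of Phobos. On a little-endian machine, `cast(wchar[])` on a `ubyte[]` behaves like `readLE`, which is why `[0x01, 0x42]` came out as 0x4201 rather than 0x0142.

```d
// Hypothetical helpers (not from Phobos): build one UTF-16 code unit
// from the first two bytes of a slice.

// Big-endian: the most significant byte comes first.
wchar readBE(const(ubyte)[] b)
{
    return cast(wchar)((b[0] << 8) | b[1]);
}

// Little-endian: the least significant byte comes first.
wchar readLE(const(ubyte)[] b)
{
    return cast(wchar)(b[0] | (b[1] << 8));
}
```

With these, `readBE([0x01, 0x42])` and `readLE([0x42, 0x01])` both give 0x0142, while `readLE([0x01, 0x42])` gives 0x4201.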

-Steve
January 29, 2016
On Friday, 29 January 2016 at 22:36:37 UTC, Marek Janukowicz wrote:
> I have trouble understanding how endianness works for UTF-16.

UTF-16 (as well as UTF-32) comes in both little-endian and big-endian variants. A byte-order marker in the file can help you detect which one it is in.

See this table:

http://www.unicode.org/faq/utf_bom.html#gen6
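As a rough sketch of the byte-order-marker check, assuming the data is already known to be UTF-16: the bytes FE FF at the start indicate big-endian, and FF FE indicate little-endian. The enum `Utf16Order` and the function `detectUtf16Bom` are names invented for this example. Note that the UTF-32 little-endian marker (FF FE 00 00) starts with the same two bytes, which this sketch does not try to distinguish.

```d
// Hypothetical names, for illustration only.
enum Utf16Order { unknown, littleEndian, bigEndian }

// Inspect the first two bytes for a UTF-16 byte-order marker.
// Returns `unknown` when there is no marker or the input is too short.
Utf16Order detectUtf16Bom(const(ubyte)[] data)
{
    if (data.length >= 2)
    {
        if (data[0] == 0xFE && data[1] == 0xFF)
            return Utf16Order.bigEndian;
        if (data[0] == 0xFF && data[1] == 0xFE)
            return Utf16Order.littleEndian;
    }
    return Utf16Order.unknown;
}
```

Data without a marker (such as a protocol field whose byte order is fixed by its specification) will report `unknown`, and the order then has to come from the format's documentation.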

January 29, 2016
On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
>> Is there anything I should know about UTF endianness?
>
> It's not any different from other endianness.
>
> In other words, a UTF16 code unit is expected to be in the endianness of the platform you are running on.
>
> If you are on x86 or x86_64 (very likely), then it should be little endian.
>
> If your source of data is big-endian (or opposite from your native endianness),

To be precise - my case is IMAP UTF7 folder name encoding and I finally found out it's indeed big endian, which explains my problem (as I'm indeed on x86_64).

> it will have to be converted before treating as a wchar[].

Is there any clever way to do the conversion? Or do I need to swap the bytes manually?

> Note the version identifiers BigEndian and LittleEndian can be used to compile the correct code.

This solution is of no use to me as I don't want to change the endianness in general.

-- 
Marek Janukowicz
January 29, 2016
On 1/29/16 6:03 PM, Marek Janukowicz wrote:
> On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
>>> Is there anything I should know about UTF endianness?
>>
>> It's not any different from other endianness.
>>
>> In other words, a UTF16 code unit is expected to be in the endianness of
>> the platform you are running on.
>>
>> If you are on x86 or x86_64 (very likely), then it should be little endian.
>>
>> If your source of data is big-endian (or opposite from your native
>> endianness),
>
> To be precise - my case is IMAP UTF7 folder name encoding and I finally found
> out it's indeed big endian, which explains my problem (as I'm indeed on x86_64).
>
>> it will have to be converted before treating as a wchar[].
>
> Is there any clever way to do the conversion? Or do I need to swap the bytes
> manually?

No clever way, just the straightforward way ;)

Swapping endianness of 32-bits can be done with core.bitop.bswap. Doing it with 16 bits I believe you have to do bit shifting. Something like:

foreach(ref elem; wcharArr) elem = ((elem << 8) & 0xff00) | ((elem >> 8) & 0x00ff);

Or you can do it with the bytes directly, before casting.
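For the byte-level variant, a sketch could look like the following. The function name `swapBytePairs` is made up for this example. It swaps each adjacent pair in place, so the buffer can be cast to `wchar[]` afterwards; a trailing odd byte is left alone, and the cast itself still requires an even length.

```d
// Hypothetical helper: swap every adjacent pair of bytes in place.
// A trailing unpaired byte, if any, is left untouched.
void swapBytePairs(ubyte[] bytes)
{
    for (size_t i = 0; i + 1 < bytes.length; i += 2)
    {
        immutable tmp = bytes[i];
        bytes[i] = bytes[i + 1];
        bytes[i + 1] = tmp;
    }
}
```

After `swapBytePairs(buf)`, `cast(wchar[]) buf` reinterprets the swapped bytes in native order. This swaps unconditionally, so it is only correct when the source byte order really is the opposite of the platform's.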

>
>> Note the version identifiers BigEndian and LittleEndian can be used to
>> compile the correct code.
>
> This solution is of no use to me as I don't want to change the endianness in
> general.

What I mean is that you can annotate your code with version statements like:

version(LittleEndian)
{
   // perform the byteswap
   ...
}

so your code is portable to BigEndian systems (where you would not want to byte swap).
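Filled out, that could look roughly like this, assuming the input is big-endian UTF-16 with an even number of bytes. The function name `utf16beToNative` is invented for the example. The swap only compiles in on little-endian targets; on big-endian targets the bytes are already in native order.

```d
// Hypothetical helper: convert big-endian UTF-16 bytes to native wchar[].
wchar[] utf16beToNative(const(ubyte)[] bytes)
{
    assert(bytes.length % 2 == 0, "UTF-16 data must have an even byte count");

    // Copy first so the caller's buffer is not modified,
    // then reinterpret the copy as 16-bit code units.
    auto units = cast(wchar[]) bytes.dup;

    version (LittleEndian)
    {
        // Native order differs from the source order: swap each unit.
        foreach (ref u; units)
            u = cast(wchar)(((u << 8) & 0xff00) | ((u >> 8) & 0x00ff));
    }
    // On BigEndian targets no swap is needed.

    return units;
}
```

Calling `utf16beToNative([0x01, 0x42])` should then yield a single code unit 0x0142 on either kind of platform.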

-Steve
January 30, 2016
On Fri, 29 Jan 2016 18:58:17 -0500,
Steven Schveighoffer <schveiguy@yahoo.com> wrote:

> On 1/29/16 6:03 PM, Marek Janukowicz wrote:
> > On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
> >>> Is there anything I should know about UTF endianness?
> >>
> >> It's not any different from other endianness.
> >>
> >> In other words, a UTF16 code unit is expected to be in the endianness of the platform you are running on.
> >>
> >> If you are on x86 or x86_64 (very likely), then it should be
> >> little endian.
> >>
> >> If your source of data is big-endian (or opposite from your native
> >> endianness),
> >
> > To be precise - my case is IMAP UTF7 folder name encoding and I finally found out it's indeed big endian, which explains my problem (as I'm indeed on x86_64).
> >> it will have to be converted before treating as a wchar[].
> >
> > Is there any clever way to do the conversion? Or do I need to swap the bytes manually?
> 
> No clever way, just the straightforward way ;)
> 
> Swapping endianness of 32-bits can be done with core.bitop.bswap. Doing it with 16 bits I believe you have to do bit shifting. Something like:
> 
> foreach(ref elem; wcharArr) elem = ((elem << 8) & 0xff00) | ((elem >> 8) & 0x00ff);
> 
> Or you can do it with the bytes directly before casting


There's also a Phobos solution: bigEndianToNative in std.bitmanip.
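A minimal sketch of how that might be used for a whole buffer, assuming big-endian input with an even byte count. The wrapper name `decodeUtf16be` is made up for the example; `bigEndianToNative` takes a fixed-size `ubyte[2]` for a `wchar`, so each pair is copied into a static array first.

```d
import std.bitmanip : bigEndianToNative;

// Hypothetical wrapper: decode big-endian UTF-16 bytes into native wchar[].
wchar[] decodeUtf16be(const(ubyte)[] bytes)
{
    assert(bytes.length % 2 == 0, "UTF-16 data must have an even byte count");

    auto result = new wchar[bytes.length / 2];
    foreach (i, ref unit; result)
    {
        // Copy one pair into the fixed-size array the function expects.
        ubyte[2] pair = bytes[2 * i .. 2 * i + 2];
        unit = bigEndianToNative!wchar(pair);
    }
    return result;
}
```

This version needs no explicit version block, since bigEndianToNative handles the platform check itself.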

January 30, 2016
On Fri, 29 Jan 2016 18:58:17 -0500, Steven Schveighoffer wrote:
>>> Note the version identifiers BigEndian and LittleEndian can be used to compile the correct code.
>>
>> This solution is of no use to me as I don't want to change the endianness in general.
>
> What I mean is that you can annotate your code with version statements like:
>
> version(LittleEndian)
> {
>     // perform the byteswap
>     ...
> }
>
> so your code is portable to BigEndian systems (where you would not want to byte swap).

That's a good point, thanks.

-- 
Marek Janukowicz