January 29, 2016 UTF-16 endianess
I have trouble understanding how endianness works for UTF-16.
For example, the UTF-16 code for the 'ł' character is 0x0142, but this program shows otherwise:
    import std.stdio;

    public void main()
    {
        ubyte[] properOrder = [0x01, 0x42];
        ubyte[] reverseOrder = [0x42, 0x01];
        writefln("proper: %s, reverse: %s",
                 cast(wchar[]) properOrder,
                 cast(wchar[]) reverseOrder);
    }
output:
proper: 䈁, reverse: ł
Is there anything I should know about UTF endianness?
--
Marek Janukowicz

January 29, 2016 Re: UTF-16 endianess
Posted in reply to Marek Janukowicz
On 1/29/16 5:36 PM, Marek Janukowicz wrote:
> I have trouble understanding how endianness works for UTF-16.
>
> For example UTF-16 code for 'ł' character is 0x0142. But this program shows
> otherwise:
>
> import std.stdio;
>
> public void main () {
> ubyte[] properOrder = [0x01, 0x42];
> ubyte[] reverseOrder = [0x42, 0x01];
> writefln( "proper: %s, reverse: %s",
> cast(wchar[])properOrder,
> cast(wchar[])reverseOrder );
> }
>
> output:
>
> proper: 䈁, reverse: ł
>
> Is there anything I should know about UTF endianness?
It's not any different from other endianness.
In other words, a UTF-16 code unit is expected to be in the endianness of the platform you are running on.
If you are on x86 or x86_64 (very likely), then it should be little-endian.
If your source data is big-endian (or otherwise opposite to your native endianness), it has to be converted before being treated as a wchar[].
Note that the version identifiers BigEndian and LittleEndian can be used to compile the correct code.
-Steve
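
To illustrate the conversion described above: on a little-endian machine the low byte of a code unit is stored first, so 0x0142 sits in memory as the bytes 0x42, 0x01, which is why the "reverse" array printed ł. The following is a minimal sketch, assuming the input is known to be big-endian UTF-16. The helper name fromBigEndianBytes is hypothetical and does not come from the thread. It builds each code unit arithmetically, so it should behave the same on either kind of platform:

```d
import std.stdio;

// Hypothetical helper (not from the thread): assembles UTF-16 code units
// from big-endian byte pairs using shifts, so the result does not depend
// on the byte order of the machine running the code.
wchar[] fromBigEndianBytes(const(ubyte)[] bytes)
{
    assert(bytes.length % 2 == 0, "UTF-16 data must have an even byte count");
    auto units = new wchar[](bytes.length / 2);
    foreach (i, ref unit; units)
    {
        // High byte comes first in big-endian data.
        unit = cast(wchar)((bytes[2 * i] << 8) | bytes[2 * i + 1]);
    }
    return units;
}

void main()
{
    ubyte[] bigEndian = [0x01, 0x42]; // U+0142, high byte first
    writeln(fromBigEndianBytes(bigEndian));
}
```

Because no cast from ubyte[] to wchar[] is involved, this variant needs no version blocks, at the cost of allocating a new array.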

January 29, 2016 Re: UTF-16 endianess
Posted in reply to Marek Janukowicz
On Friday, 29 January 2016 at 22:36:37 UTC, Marek Janukowicz wrote:
> I have trouble understanding how endianness works for UTF-16.

UTF-16 (as well as UTF-32) comes in both little-endian and big-endian variants. A byte-order mark (BOM) at the start of the file can help you detect which one it uses. See this table: http://www.unicode.org/faq/utf_bom.html#gen6
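
A minimal sketch of BOM detection for UTF-16, assuming the data actually starts with a byte-order mark (many sources, such as protocol fields with a fixed byte order, carry none). The ByteOrder enum and the detectUtf16Bom function are hypothetical names used only for illustration:

```d
import std.stdio;

enum ByteOrder { unknown, bigEndian, littleEndian }

// Hypothetical helper: looks at the first two bytes for the UTF-16
// encoding of U+FEFF. FE FF means big-endian, FF FE means little-endian.
// Caveat: FF FE 00 00 is also the UTF-32 little-endian BOM, so this check
// alone cannot tell UTF-16LE from UTF-32LE.
ByteOrder detectUtf16Bom(const(ubyte)[] data)
{
    if (data.length >= 2)
    {
        if (data[0] == 0xFE && data[1] == 0xFF)
            return ByteOrder.bigEndian;
        if (data[0] == 0xFF && data[1] == 0xFE)
            return ByteOrder.littleEndian;
    }
    return ByteOrder.unknown;
}

void main()
{
    ubyte[] data = [0xFE, 0xFF, 0x01, 0x42]; // BOM followed by U+0142
    writeln(detectUtf16Bom(data));
}
```

When the result is unknown, the byte order has to come from somewhere else, for example from the specification of the format being read.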

January 29, 2016 Re: UTF-16 endianess
Posted in reply to Steven Schveighoffer
On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
>> Is there anything I should know about UTF endianness?
>
> It's not any different from other endianness.
>
> In other words, a UTF16 code unit is expected to be in the endianness of the platform you are running on.
>
> If you are on x86 or x86_64 (very likely), then it should be little endian.
>
> If your source of data is big-endian (or opposite from your native endianness),

To be precise - my case is IMAP UTF-7 folder name encoding, and I finally found out it's indeed big-endian, which explains my problem (as I'm indeed on x86_64).

> it will have to be converted before treating as a wchar[].

Is there any clever way to do the conversion? Or do I need to swap the bytes manually?

> Note the version identifiers BigEndian and LittleEndian can be used to compile the correct code.

This solution is of no use to me as I don't want to change the endianness in general.

--
Marek Janukowicz

January 29, 2016 Re: UTF-16 endianess
Posted in reply to Marek Janukowicz
On 1/29/16 6:03 PM, Marek Janukowicz wrote:
> On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
>>> Is there anything I should know about UTF endianness?
>>
>> It's not any different from other endianness.
>>
>> In other words, a UTF16 code unit is expected to be in the endianness of
>> the platform you are running on.
>>
>> If you are on x86 or x86_64 (very likely), then it should be little endian.
>>
>> If your source of data is big-endian (or opposite from your native
>> endianness),
>
> To be precise - my case is IMAP UTF7 folder name encoding and I finally found
> out it's indeed big endian, which explains my problem (as I'm indeed on x86_64).
>
>> it will have to be converted before treating as a wchar[].
>
> Is there any clever way to do the conversion? Or do I need to swap the bytes
> manually?

No clever way, just the straightforward way ;)

Swapping the endianness of a 32-bit value can be done with core.bitop.bswap. For 16 bits, I believe you have to do bit shifting. Something like:

    foreach (ref elem; wcharArr)
        elem = ((elem << 8) & 0xff00) | ((elem >> 8) & 0x00ff);

Or you can do it with the bytes directly before casting.

>> Note the version identifiers BigEndian and LittleEndian can be used to
>> compile the correct code.
>
> This solution is of no use to me as I don't want to change the endianness in
> general.

What I mean is that you can annotate your code with version statements like:

    version(LittleEndian)
    {
        // perform the byteswap
        ...
    }

so your code is portable to BigEndian systems (where you would not want to byte swap).

-Steve
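
Putting the two suggestions from this reply together, here is a sketch of the 16-bit swap guarded by version(LittleEndian), assuming the wchar[] was obtained by casting bytes that hold big-endian UTF-16. The function name bigEndianToNativeInPlace is hypothetical. The explicit cast(wchar) is used because the shifts are evaluated as int:

```d
import std.stdio;

// Hypothetical helper: converts code units that were read from big-endian
// data into the platform's native byte order, modifying the array in place.
void bigEndianToNativeInPlace(wchar[] units)
{
    version (LittleEndian)
    {
        foreach (ref unit; units)
        {
            // Swap the two bytes; the cast truncates the int result
            // back to 16 bits.
            unit = cast(wchar)((unit << 8) | (unit >> 8));
        }
    }
    // On BigEndian targets the data is already in native order,
    // so nothing is compiled in here.
}

void main()
{
    ubyte[] raw = [0x01, 0x42];          // U+0142 in big-endian byte order
    wchar[] units = cast(wchar[]) raw;   // reinterprets the same memory
    bigEndianToNativeInPlace(units);
    writeln(units);
}
```

Note that the cast does not copy, so the swap also changes the contents of raw; duplicate the array first if the original bytes are still needed.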

January 30, 2016 Re: UTF-16 endianess
Posted in reply to Steven Schveighoffer
On Fri, 29 Jan 2016 18:58:17 -0500, Steven Schveighoffer <schveiguy@yahoo.com> wrote:
> On 1/29/16 6:03 PM, Marek Janukowicz wrote:
> > On Fri, 29 Jan 2016 17:43:26 -0500, Steven Schveighoffer wrote:
> >>> Is there anything I should know about UTF endianness?
> >>
> >> It's not any different from other endianness.
> >>
> >> In other words, a UTF16 code unit is expected to be in the endianness of the platform you are running on.
> >>
> >> If you are on x86 or x86_64 (very likely), then it should be
> >> little endian.
> >>
> >> If your source of data is big-endian (or opposite from your native
> >> endianness),
> >
> > To be precise - my case is IMAP UTF7 folder name encoding and I finally found out it's indeed big endian, which explains my problem (as I'm indeed on x86_64).
> >> it will have to be converted before treating as a wchar[].
> >
> > Is there any clever way to do the conversion? Or do I need to swap the bytes manually?
>
> No clever way, just the straightforward way ;)
>
> Swapping endianness of 32-bits can be done with core.bitop.bswap. Doing it with 16 bits I believe you have to do bit shifting. Something like:
>
> foreach(ref elem; wcharArr) elem = ((elem << 8) & 0xff00) | ((elem >> 8) & 0x00ff);
>
> Or you can do it with the bytes directly before casting

There's also a Phobos solution: bigEndianToNative in std.bitmanip.
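
A sketch of how that Phobos function might be applied to a whole buffer, assuming the input is big-endian UTF-16 with an even number of bytes. The wrapper name decodeUtf16BE is hypothetical. bigEndianToNative takes a fixed-size ubyte[n] array, so each pair of bytes is copied into a ubyte[2] first:

```d
import std.bitmanip : bigEndianToNative;
import std.stdio;

// Hypothetical wrapper: decodes big-endian byte pairs into native-order
// UTF-16 code units using std.bitmanip.bigEndianToNative.
wchar[] decodeUtf16BE(const(ubyte)[] bytes)
{
    assert(bytes.length % 2 == 0, "UTF-16 data must have an even byte count");
    auto units = new wchar[](bytes.length / 2);
    foreach (i, ref unit; units)
    {
        // Copy one byte pair into the static array the function expects.
        ubyte[2] pair = bytes[2 * i .. 2 * i + 2];
        unit = cast(wchar) bigEndianToNative!ushort(pair);
    }
    return units;
}

void main()
{
    ubyte[] folderName = [0x01, 0x42]; // U+0142 in big-endian byte order
    writeln(decodeUtf16BE(folderName));
}
```

The library call handles the platform check internally, so no version(LittleEndian) block is needed in user code.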

January 30, 2016 Re: UTF-16 endianess
Posted in reply to Steven Schveighoffer
On Fri, 29 Jan 2016 18:58:17 -0500, Steven Schveighoffer wrote:
>>> Note the version identifiers BigEndian and LittleEndian can be used to compile the correct code.
>>
>> This solution is of no use to me as I don't want to change the endianness in general.
>
> What I mean is that you can annotate your code with version statements like:
>
> version(LittleEndian)
> {
>     // perform the byteswap
>     ...
> }
>
> so your code is portable to BigEndian systems (where you would not want to byte swap).

That's a good point, thanks.

--
Marek Janukowicz

Copyright © 1999-2021 by the D Language Foundation