December 18, 2006 ASCII to UTF8 Conversion - is this right?
Here's something that came up recently. As some of you may already know, I've been doing some work with forum data recently.
I wanted to move some old forum data, which was stored in ASCII over to UTF8 via D. The problem is that some of the data has characters in the 0x80-0xff range, which causes UTF-BOM detection to fail.
So I rolled the following function to 'transcode' these characters:
// debug output below needs: import std.stdio;
char[] ASCII2UTF8(char[] value){
    char[] result;
    for(uint i=0; i<value.length; i++){
        char ch = value[i];
        if(ch < 0x80){
            result ~= ch;  // 7-bit ASCII passes through unchanged
        }
        else{
            // encode 0x80-0xFF as a two-byte UTF-8 sequence
            result ~= cast(char)(0xC0 | (ch >> 6));
            result ~= cast(char)(0x80 | (ch & 0x3F));
            debug writefln("converted: %02X to %02X %02X", ch, result[$-2], result[$-1]);
        }
    }
    return result;
}
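For comparison, the same byte-by-byte transformation can be sketched outside of D. The Python below is only an illustration of the scheme, not the code used on the forum data:

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin-1 (ISO-8859-1) bytes as UTF-8, byte by byte."""
    out = bytearray()
    for b in data:
        if b < 0x80:
            out.append(b)                  # 7-bit ASCII passes through
        else:
            out.append(0xC0 | (b >> 6))    # leading byte: 110xxxxx
            out.append(0x80 | (b & 0x3F))  # continuation byte: 10xxxxxx
    return bytes(out)

# Every byte >= 0x80 becomes a two-byte sequence, matching Python's codecs:
assert latin1_to_utf8(b"caf\xe9") == b"caf\xe9".decode("latin-1").encode("utf-8")
```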
So my question is: this conversion follows a literal reading of the UTF-8 spec, but is it the correct way to treat these characters?
Should I be taking the user's locale into account? Are high-ASCII characters considered universal?
--
- EricAnderton at yahoo
December 18, 2006 Re: ASCII to UTF8 Conversion - is this right?
Posted in reply to Pragma

Pragma wrote:
> Here's something that came up recently. As some of you may already know, I've been doing some work with forum data recently.
>
> I wanted to move some old forum data, which was stored in ASCII over to UTF8 via D. The problem is that some of the data has characters in the 0x80-0xff range, which causes UTF-BOM detection to fail.
>
> So I rolled the following function to 'transcode' these characters:
>
> char[] ASCII2UTF8(char[] value){
>     char[] result;
>     for(uint i=0; i<value.length; i++){
>         char ch = value[i];
>         if(ch < 0x80){
>             result ~= ch;
>         }
>         else{
>             result ~= 0xC0 | (ch >> 6);
>             result ~= 0x80 | (ch & 0x3F);
>             debug writefln("converted: %0.2X to %0.2X %0.2X",ch, result[$-2], result[$-1]);
>         }
>     }
>     return result;
> }
>
> So my question is, while this conversion is done against a literal interpretation of the UTF-8 spec: is this the correct way to treat these characters?

First, ASCII is a 7-bit encoding that only defines characters <= 0x7f. The encoding of the upper 128 bytes is locale dependent and cannot be called "ASCII". There are numerous different encodings used for the upper 128 code points.

The above is correct if the source text is in Latin-1 (ISO-8859-1) coding. This is probably the most common single-byte encoding for Western Europe and the US. The Windows English standard charset 1252 is a superset of Latin-1 and defines the range 0x80-0x9f differently.

> Should I be taking user locale into account? Are high-ASCII chars considered to be universal?

Rename the function Latin12UTF8 and you have something that behaves correctly according to spec. :)

Best regards,

/Oskar
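The point about 0x80-0x9f differing between Latin-1 and Windows-1252 can be checked directly; a quick sketch in Python (for illustration only):

```python
# In Latin-1, 0x80-0x9F are the C1 control characters; Windows-1252
# assigns most of that range to printable characters instead.
assert bytes([0x80]).decode("latin-1") == "\x80"   # U+0080, a control char
assert bytes([0x80]).decode("cp1252") == "\u20ac"  # the euro sign

# From 0xA0 upward the two encodings agree:
assert bytes([0xE9]).decode("latin-1") == bytes([0xE9]).decode("cp1252") == "\xe9"
```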
December 18, 2006 Re: ASCII to UTF8 Conversion - is this right?
Posted in reply to Oskar Linde

Oskar Linde wrote:
> The above is correct if the source text is in Latin1 (ISO-8859-1) coding. This is probably the most common single byte encoding for Western Europe and the US. The windows english standard charset 1252 is a superset of latin1 and defines the range 0x80-0x9f differently.
Some European sites/users also use ISO-8859-15. I think it might have the euro (€) sign and some minor other differences too.
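A quick way to see the difference (Python used here only as a convenient checker):

```python
# ISO-8859-15 swaps out a handful of Latin-1 code points; the most
# visible change is 0xA4, which becomes the euro sign.
assert bytes([0xA4]).decode("latin-1") == "\xa4"        # currency sign
assert bytes([0xA4]).decode("iso-8859-15") == "\u20ac"  # euro sign
```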
December 18, 2006 Re: ASCII to UTF8 Conversion - is this right?
Posted in reply to Oskar Linde

Oskar Linde wrote:
> Pragma wrote:
>> Here's something that came up recently. As some of you may already know, I've been doing some work with forum data recently.
>>
>> I wanted to move some old forum data, which was stored in ASCII over to UTF8 via D. The problem is that some of the data has characters in the 0x80-0xff range, which causes UTF-BOM detection to fail.
>>
>> So I rolled the following function to 'transcode' these characters:
>>
>> char[] ASCII2UTF8(char[] value){
>>     char[] result;
>>     for(uint i=0; i<value.length; i++){
>>         char ch = value[i];
>>         if(ch < 0x80){
>>             result ~= ch;
>>         }
>>         else{
>>             result ~= 0xC0 | (ch >> 6);
>>             result ~= 0x80 | (ch & 0x3F);
>>             debug writefln("converted: %0.2X to %0.2X %0.2X",ch, result[$-2], result[$-1]);
>>         }
>>     }
>>     return result;
>> }
>>
>> So my question is, while this conversion is done against a literal interpretation of the UTF-8 spec: is this the correct way to treat these characters?
>
> First, ASCII is a 7 bit encoding that only defines characters <= 0x7f. The encoding of the upper 128 bytes are locale dependent and can not be called "ASCII". There are numerous different encodings used for the upper 128 code points.

Precisely the reason why I posted this. :) The 'ASCII2UTF8' name was taken for lack of a better title. Admittedly, it's a misnomer. Same goes for my use of "high-ASCII".

> The above is correct if the source text is in Latin1 (ISO-8859-1) coding. This is probably the most common single byte encoding for Western Europe and the US. The windows english standard charset 1252 is a superset of latin1 and defines the range 0x80-0x9f differently.
>
>> Should I be taking user locale into account? Are high-ASCII chars considered to be universal?
>
> Rename the function Latin12UTF8 and you have something that behaves correctly according to spec. :)

Makes sense to me. If I can't find a way to determine what codepage users are using in the forum for non-Latin1 posts, I'll just try Latin-1 and see what happens. Thanks!

--
- EricAnderton at yahoo
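That "try Latin-1 and see what happens" fallback can be sketched as a decode chain. This Python sketch (illustrative only; `decode_post` is a hypothetical helper, not from the thread) tries strict UTF-8 first and only then assumes Latin-1, which can never raise since all 256 byte values are defined:

```python
def decode_post(data: bytes) -> str:
    """Decode forum text: strict UTF-8 first, Latin-1 as the fallback.

    Latin-1 maps every byte 0x00-0xFF to a code point, so the fallback
    always succeeds (though it may mislabel cp1252 or 8859-15 text).
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("latin-1")
```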
December 18, 2006 Re: ASCII to UTF8 Conversion - is this right?
Posted in reply to Jari-Matti Mäkelä

Jari-Matti Mäkelä wrote:
> Oskar Linde wrote:
>> The above is correct if the source text is in Latin1 (ISO-8859-1) coding. This is probably the most common single byte encoding for Western Europe and the US. The windows english standard charset 1252 is a superset of latin1 and defines the range 0x80-0x9f differently.
>
> Some European sites/users also use ISO-8859-15. I think it might have the euro (€) sign and some minor other differences too.

Ah. Good to know. I'll take that into consideration as well. Thanks!

--
- EricAnderton at yahoo
December 19, 2006 Re: ASCII to UTF8 Conversion - is this right?
Posted in reply to Pragma

Pragma wrote:
> Oskar Linde wrote:
>
>> Pragma wrote:
>>
>>> Here's something that came up recently. As some of you may already know, I've been doing some work with forum data recently.
>>>
>>> I wanted to move some old forum data, which was stored in ASCII over to UTF8 via D. The problem is that some of the data has characters in the 0x80-0xff range, which causes UTF-BOM detection to fail.
>>>
>>> So I rolled the following function to 'transcode' these characters:
>>>
>>> char[] ASCII2UTF8(char[] value){
>>>     char[] result;
>>>     for(uint i=0; i<value.length; i++){
>>>         char ch = value[i];
>>>         if(ch < 0x80){
>>>             result ~= ch;
>>>         }
>>>         else{
>>>             result ~= 0xC0 | (ch >> 6);
>>>             result ~= 0x80 | (ch & 0x3F);
>>>             debug writefln("converted: %0.2X to %0.2X %0.2X",ch, result[$-2], result[$-1]);
>>>         }
>>>     }
>>>     return result;
>>> }
>>>
>>> So my question is, while this conversion is done against a literal interpretation of the UTF-8 spec: is this the correct way to treat these characters?
>>
>>
>> First, ASCII is a 7 bit encoding that only defines characters <= 0x7f. The encoding of the upper 128 bytes are locale dependent and can not be called "ASCII". There are numerous different encodings used for the upper 128 code points.
>
>
> Precisely the reason why I posted this. :) The 'ASCII2UTF8' name was taken for lack of a better title. Admittedly, it's a misnomer. Same goes for my use of "high-ASCII".
>
>>
>> The above is correct if the source text is in Latin1 (ISO-8859-1) coding. This is probably the most common single byte encoding for Western Europe and the US. The windows english standard charset 1252 is a superset of latin1 and defines the range 0x80-0x9f differently.
>>
>>> Should I be taking user locale into account? Are high-ASCII chars considered to be universal?
>>
>>
>> Rename the function Latin12UTF8 and you have something that behaves correctly according to spec. :)
>
>
> Makes sense to me. If I can't find a way to determine what codepage users are using in the forum for non-Latin1 posts, I'll just try Latin-1 and see what happens.
You might also want to look at the message headers:
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 8bit
Especially the Content-Type header often tells you directly what the coding is.
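Pulling the charset out of such headers is straightforward with a MIME parser. A Python sketch for illustration, using a hypothetical sample message:

```python
import email

# A header block like the one quoted above (hypothetical sample message)
raw = (
    "Content-Type: text/plain; charset=ISO-8859-1; format=flowed\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "body text\n"
)
msg = email.message_from_string(raw)
charset = msg.get_content_charset()  # normalized to lower case
body = msg.get_payload(decode=True).decode(charset)
```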
Copyright © 1999-2021 by the D Language Foundation