August 09, 2005
UTF8 Encoding: Again
Greetings!

Sorry about this, but I have hit a wall with UTF8, again.  Perhaps some of
you may be able to help me jump it or break through it.

I have this code,
char[] GetMonthDigit(char[] mon)
{
    char[][char[]] sMon;
    sMon["août"] = "08";
    return sMon[mon];
}

if I call this from within the program

char[] mon = GetMonthDigit("août");

mon will have "08"; however, if the string comes from another text file, say an
ASCII file, the lookup fails.  So, if you take a look at this small piece of code,

import std.stream;
void main()
{
    char[] fn = "test.log";
    char[] txt = "août";
    File log = new File(fn, FileMode.Append);
    log.writeLine(txt);
    log.close();
}

this will create a file, if it does not exist, called test.log after compile
and run, and it will write, supposedly, the content of txt.  However, when you
open the test.log file, the content of it is,

août

Hmmmm... I tried setting the editor settings to UTF8, and others, however,
nothing has worked.  Any ideas how I can fix this?

thanks,

josé
August 09, 2005
Re: UTF8 Encoding: Again
> ...  However, when you
> open the test.log file, the content of it is,
>
>août

Seems correct to me: 'août' is 61 6f c3 bb 74 in hex when encoded as UTF8,
and those bytes will be displayed as 'août' when the file is read as a
single-byte encoding (such as Latin-1) instead of UTF8. So I think the editor
that you use to view the file is the problem. A lot of editors (e.g. Notepad)
need a 'magic' byte order mark (BOM) at the beginning of the file to be able to
recognize a UTF8 file. For example, the (Notepad) BOM for UTF8 is
ef bb bf. Delete these bytes with a hex editor and it'll give you the same
output. Anyway, you have to make sure that you read/interpret your
UTF8 file as UTF8, not ASCII.
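
To check those byte values yourself, here is a minimal sketch. It assumes a
current D2 compiler (the code in this thread is from the D1 era) and a source
file saved as UTF8; nothing in it comes from the original program.

```d
import std.stdio : writef, writeln;

void main()
{
    // The literal is UTF-8 because the source file is saved as UTF-8.
    string txt = "août";

    // Print every UTF-8 code unit (byte) in hex: 61 6f c3 bb 74
    foreach (ubyte b; cast(immutable(ubyte)[]) txt)
        writef("%02x ", b);
    writeln();

    // Misread each byte as one Latin-1 character, the way a viewer
    // that does not decode UTF-8 would.
    dchar[] misread;
    foreach (ubyte b; cast(immutable(ubyte)[]) txt)
        misread ~= cast(dchar) b;
    assert(misread == "août"d);
}
```

The two bytes c3 bb are the single character û in UTF8; shown one byte at a
time they become Ã and », which is exactly what the log file appears to contain.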

HTH,
Stefan



August 09, 2005
Re: UTF8 Encoding: Again (correction)
In article <ddb5u6$fmn$1@digitaldaemon.com>, Stefan says...
>
>> ...  However, when you
>> open the test.log file, the content of it is,
>>
>>août
>
> ... A lot of editors (e.g. Notepad)
>need a 'magic' byte order mark (BOM) in the beginning of the file to be able to
>recognize an UTF8 file. For example, the (Notepad) BOM for UTF8 is 
>ef bb bf. Delete these with an hex editor and it'll give you the same
>output.

Just realized that Notepad on XP works even without the BOM now.
You can reproduce it with WordPad, however. Any decent XML editor
should be able to display correctly without BOMs.

Best regards,
Stefan
August 09, 2005
Re: UTF8 Encoding: Again (correction)
Stefan says...
>
>In article <ddb5u6$fmn$1@digitaldaemon.com>, Stefan says...
>>
>>> ...  However, when you
>>> open the test.log file, the content of it is,
>>>
>>>août
>>
>> ... A lot of editors (e.g. Notepad)
>>need a 'magic' byte order mark (BOM) in the beginning of the file to be able to
>>recognize an UTF8 file. For example, the (Notepad) BOM for UTF8 is 
>>ef bb bf. Delete these with an hex editor and it'll give you the same
>>output.
>
>Just realized that Notepad on XP works even without the BOM now.
>You can reproduce it with WordPad, however. Any decent XML editor
>should be able to display correctly without BOMs.
>
>Best regards,
>Stefan

Thanks for the help, Stefan.  The problem is much deeper than that.  I get an
input string from a file which contains certain words, say "août", which I need
to test on.  The problem is that even though I have a test in the program, which
displays correctly, the test never matches because the input data, "août", and
the test data inside the program, "août", never match.

I guess I still don't understand UTF. :-o

I will have to rewrite this whole check function because of this.

Thanks for the help.

josé
August 09, 2005
Re: UTF8 Encoding: Again (correction)
In article <ddb989$jkm$1@digitaldaemon.com>, jicman says...
>
>Thanks for the help, Stefan.  The problem is much deeper than that.  I get input
>string from a file which contains certain words, say "août" which I need to test
>on.  The problem is that even though I have a test on the program, which
>displays correctly, the test never matches because the input data, "août", and
>the test data inside the program, "août", never match.

Are you sure you're reading/interpreting the file as UTF8 (and it actually is
UTF8 encoded)? Nevertheless, good luck! If there's something to learn from
your investigations let us other D newbies know ;-)

Best regards,
Stefan


>I guess I still don't understand UTF. :-o
>
>I will have to rewrite this whole check check function because of this.
>
>Thanks for the help.
>
>josé
August 10, 2005
Re: UTF8 Encoding: Again
jicman escribió:
> Greetings!
> 
> Sorry about this, but I have found a wall with UTF8, again.  Perhaps, some of
> you may be able to help me jump it or break through it.
> 
> I have this code,
> char[] GetMonthDigit(char[] mon)
> {
> char [][char[]] sMon;
> sMon["août"] = "08";
> return sMon[mon];
> }
> 
> if I call this from within the program
> 
> char[] mon = GetMonthDigit("août");
> 
> mon will have "08", however if it comes from another text file, say an ASCII
> file, it fails.  So, if you take a look at this small piece of code,
> 
> import std.stream;
> void main()
> {
> char[] fn = "test.log";
> char[] txt = "août";
> File log = new File(fn,FileMode.Append);
> log.writeLine(txt);
> log.close();
> }
> 
> this will create a file, if it does not exist, called test.log after compile
> and run, and it will write, supposedly, the content of txt.  However, when you
> open the test.log file, the content of it is,
> 
> août
> 
> Hmmmm... I tried setting the editor settings to UTF8, and others, however,
> nothing has worked.  Any ideas how I can fix this?
> 
> thanks,
> 
> josé
> 
> 

It must be something with the editor or with the console, because I just 
tried with gdc-0.13 on Mac and it worked ok.

-- 
Carlos Santander Bernal
August 10, 2005
Re: UTF8 Encoding: Again (correction)
Stefan Zobel says...
>
>In article <ddb989$jkm$1@digitaldaemon.com>, jicman says...
>>
>>Thanks for the help, Stefan.  The problem is much deeper than that.  I get input
>>string from a file which contains certain words, say "août" which I need to test
>>on.  The problem is that even though I have a test on the program, which
>>displays correctly, the test never matches because the input data, "août", and
>>the test data inside the program, "août", never match.
>
>Are you sure you're reading/interpreting the file as UTF8 (and it actually is
>UTF8 encoded)?

Well, here is a question: do I have to change the data that I work with?  For
example, I have these 200+ files with text data which contain ASCII data with
many different accented characters.  Do I need to change this input data to UTF8
to be able to work with it?  I know that I have to save the source code files as
UTF8, but do I also have to change the other text files that I work with to
UTF8?  That is my problem.  I am saving the source files ok, but the input that
I read from text files does not match the source code.  Again, do I need
to change that input data to UTF8?


> Nevertheless, good luck! If there's something to learn from
>your investigations let us other D newbies know ;-)

There is nothing to learn. ;-)  All I am going to do is to change any character
higher than 127 to +. :-)  That's how I have been able to work with this UTF
stuff. :-)

thanks,

josé
August 10, 2005
Re: UTF8 Encoding: Again (correction)
On Wed, 10 Aug 2005 02:57:12 +0000 (UTC), jicman wrote:

> Stefan Zobel says...
>>
>>In article <ddb989$jkm$1@digitaldaemon.com>, jicman says...
>>>
>>>Thanks for the help, Stefan.  The problem is much deeper than that.  I get input
>>>string from a file which contains certain words, say "août" which I need to test
>>>on.  The problem is that even though I have a test on the program, which
>>>displays correctly, the test never matches because the input data, "août", and
>>>the test data inside the program, "août", never match.
>>
>>Are you sure you're reading/interpreting the file as UTF8 (and it actually is
>>UTF8 encoded)?
> 
> Well, here is a question: do I have to change the data that I work with?  For
> example, I have these 200+ files with text data which contain ASCII data with
> many different accented characters.  Do I need to change this input data to UTF8
> to be able to work with it?  I know that I have to save the source code files as
> UTF8, but do I also have to change the other text files that I work with to
> UTF8?  That is my problem.  I am saving the source files ok, but the input that
> I read from text files does not match the source code.  Again, do I need
> to change that input data to UTF8?

Technically, if it contains accented characters it is *not* ASCII. It is
some other form of character encoding. For example, my Windows XP has Code
Page 850 set for the DOS console. 

( http://en.wikipedia.org/wiki/Code_page_850 )

You would need to find out which character encoding standard was used in
your file, then read the file in as a stream of *bytes* not chars, and
convert each of the byte values into the equivalent Unicode character. You
could then use UTF8 "char[]", UTF16 "wchar[]", or UTF32 "dchar[]" as your
preferred coding in your program. 
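
A rough sketch of those steps, assuming the files turn out to be ISO 8859-1
(Latin-1), where every byte value equals the Unicode code point of the same
number. It is written for a current D2 compiler; the helper name latin1ToUtf8
and the file name latin1-sample.txt are made up for the illustration.

```d
import std.file : read, remove, write;
import std.utf : toUTF8;

/// Hypothetical helper: decodes ISO 8859-1 (Latin-1) bytes to UTF-8.
string latin1ToUtf8(const(ubyte)[] raw)
{
    dchar[] decoded;
    decoded.reserve(raw.length);
    foreach (ubyte b; raw)
        decoded ~= cast(dchar) b; // in Latin-1, byte value == code point
    return toUTF8(decoded);
}

void main()
{
    // Stand-in for one of the data files: "août" saved as Latin-1,
    // where û is the single byte 0xFB.
    ubyte[] sample = [0x61, 0x6F, 0xFB, 0x74];
    write("latin1-sample.txt", sample);
    scope (exit) remove("latin1-sample.txt");

    // Read the file as raw bytes, not as char[].
    auto raw = cast(ubyte[]) read("latin1-sample.txt");

    // After decoding, the text matches the UTF-8 literal in the source.
    assert(latin1ToUtf8(raw) == "août");
}
```

For any other code page the cast in the loop would have to be replaced by a
lookup table for that code page.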

Also have a look at 

 http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues 

for further help.
-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
10/08/2005 1:03:43 PM
August 10, 2005
Re: UTF8 Encoding: Again (correction)
On Wed, 10 Aug 2005 02:57:12 +0000 (UTC), jicman  
<jicman_member@pathlink.com> wrote:
> Stefan Zobel says...
>>
>> In article <ddb989$jkm$1@digitaldaemon.com>, jicman says...
>>>
>>> Thanks for the help, Stefan.  The problem is much deeper than that.  I  
>>> get input
>>> string from a file which contains certain words, say "août" which I  
>>> need to test
>>> on.  The problem is that even though I have a test on the program,  
>>> which
>>> displays correctly, the test never matches because the input data,  
>>> "août", and
>>> the test data inside the program, "août", never match.
>>
>> Are you sure you're reading/interpreting the file as UTF8 (and it  
>> actually is
>> UTF8 encoded)?
>
> Well, here is a question: do I have to change the data that I work  
> with?  For example, I have this 200+ files with text data which contain  
> ASCII data with many different accented characters.  Do I need to change  
> this input data to UTF8  to be able to work with it?

Yes and No.

As Derek said, if it has characters above 127 it's not ASCII, see:
  http://www.columbia.edu/kermit/csettables.html

I suspect your data is "Microsoft Windows Code Page 1252" or "ISO 8859-1  
Latin Alphabet 1" which are very similar.

To figure it out open the text file in a binary editor and check the value  
of an accented character and compare it to the tables in the links above.

You can read and write these non-UTF characters into a char[] etc,
provided you don't use writef or any other routine that actually checks
whether the characters are valid UTF, i.e. writeString works but writef
will give an exception.
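
The same kind of validity check can be run on purpose to find out whether a
file is well-formed UTF8 at all. This is only a sketch for a current D2
compiler using std.utf.validate; the helper name isValidUtf8 is made up here.

```d
import std.utf : UTFException, validate;

/// Hypothetical helper: true if the bytes form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] raw)
{
    try
    {
        // validate throws UTFException on the first malformed sequence.
        validate(cast(const(char)[]) raw);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    ubyte[] utf8Bytes   = [0x61, 0x6F, 0xC3, 0xBB, 0x74]; // "août" in UTF-8
    ubyte[] latin1Bytes = [0x61, 0x6F, 0xFB, 0x74];       // "août" in Latin-1 / CP1252

    assert(isValidUtf8(utf8Bytes));
    assert(!isValidUtf8(latin1Bytes));
}
```

A file that contains only characters below 128 passes this check as well, so a
pass does not prove the file was saved as UTF8, only that it can be read as such.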

If you want to compare the data to a static string you'll need to convert
the data to UTF. I have a small module which will convert Windows code
page 1252 into UTF8, 16, and 32 and back again (though the back again is
totally untested). I needed it for much the same thing as you do. This
code is public domain.
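
For readers on a current D2 compiler, the conversion can also be sketched with
the standard std.encoding module. This is not the module mentioned above, just
an illustration; cp1252ToUtf8 and getMonthDigit are names invented for it, and
it assumes the input really is code page 1252.

```d
import std.encoding : transcode, Windows1252String;

/// Hypothetical helper: decodes Windows code page 1252 bytes to UTF-8.
string cp1252ToUtf8(const(ubyte)[] raw)
{
    string utf8;
    transcode(cast(Windows1252String) raw.idup, utf8);
    return utf8;
}

/// Month lookup keyed by UTF-8 literals; returns null when not found.
string getMonthDigit(string mon)
{
    string[string] months = ["août": "08"];
    if (auto found = mon in months)
        return *found;
    return null;
}

void main()
{
    // "août" as stored in a code page 1252 file: û is the single byte 0xFB.
    ubyte[] fromFile = [0x61, 0x6F, 0xFB, 0x74];

    // Raw bytes used directly as a key do not match the UTF-8 literal.
    assert(getMonthDigit(cast(string) fromFile.idup) is null);

    // After conversion the lookup succeeds.
    assert(getMonthDigit(cp1252ToUtf8(fromFile)) == "08");
}
```

Converting once, right after reading, keeps every later comparison in UTF8, so
the table of month names in the source does not need to change.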

Regan
August 11, 2005
Re: UTF8 Encoding: Again
Carlos Santander escribió:
> 
> It must be something with the editor or with the console, because I just 
> tried with gdc-0.13 on Mac and it worked ok.
> 

That should've been 0.15

-- 
Carlos Santander Bernal