UTF8 Encoding: Again
August 09, 2005
Greetings!

Sorry about this, but I have hit a wall with UTF8, again.  Perhaps some of you can help me jump over it or break through it.

I have this code,
char[] GetMonthDigit(char[] mon)
{
    char [][char[]] sMon;
    sMon["août"] = "08";
    return sMon[mon];
}

if I call this from within the program

char[] mon = GetMonthDigit("août");

mon will have "08", however if it comes from another text file, say an ASCII file, it fails.  So, if you take a look at this small piece of code,

import std.stream;
void main()
{
    char[] fn = "test.log";
    char[] txt = "août";
    File log = new File(fn,FileMode.Append);
    log.writeLine(txt);
    log.close();
}

this will create a file called test.log, if it does not exist, after compile and run, and it will write, supposedly, the content of txt.  However, when you open the test.log file, the content of it is,

août

Hmmmm... I tried setting the editor settings to UTF8, and others, however, nothing has worked.  Any ideas how I can fix this?

thanks,

josé


August 09, 2005
> ...  However, when you
> open the test.log file, the content of it is,
>
>août

Seems correct to me: 'août' is 61 6f c3 bb 74 in hex as UTF8, which
shows up as 'août' when each byte is read as a separate character
(strictly speaking that is Latin-1 or Windows-1252, since plain ASCII
stops at 127). So I think the editor that you use to view the file is
the problem. A lot of editors (e.g. Notepad) need a 'magic' byte order
mark (BOM) at the beginning of the file to be able to recognize a UTF8
file. For example, the (Notepad) BOM for UTF8 is ef bb bf. Delete these
with a hex editor and it'll give you the same output. Anyway, you have
to make sure that you read/interpret your UTF8 file as UTF8, not ASCII.
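
The byte-level view above can be reproduced with a small sketch. It assumes present-day D (D2 with Phobos) rather than the 2005 compiler used in this thread, and a source file saved as UTF8; the helper name misreadAsLatin1 is made up for illustration.

```d
import std.stdio : writef, writeln;

// Interpret every byte as one ISO 8859-1 (Latin-1) character, which is
// roughly what a viewer does when it does not decode UTF8.
dstring misreadAsLatin1(const(ubyte)[] bytes)
{
    dchar[] result;
    foreach (b; bytes)
        result ~= cast(dchar) b; // a Latin-1 byte value equals its code point
    return result.idup;
}

void main()
{
    string txt = "août"; // string literals are stored as UTF8
    auto bytes = cast(const(ubyte)[]) txt;

    foreach (b; bytes)
        writef("%02x ", b); // 61 6f c3 bb 74
    writeln();

    // On a UTF8-aware terminal this shows the garbled form seen in the editor.
    writeln(misreadAsLatin1(bytes));
}
```

The file content is the same in both cases; only the interpretation of the two bytes c3 bb differs.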

HTH,
Stefan





August 09, 2005
In article <ddb5u6$fmn$1@digitaldaemon.com>, Stefan says...
>
>> ...  However, when you
>> open the test.log file, the content of it is,
>>
>>août
>
> ... A lot of editors (e.g. Notepad)
>need a 'magic' byte order mark (BOM) in the beginning of the file to be able to
>recognize an UTF8 file. For example, the (Notepad) BOM for UTF8 is
>ef bb bf. Delete these with an hex editor and it'll give you the same
>output.

Just realized that Notepad on XP works even without the BOM now. You can reproduce it with WordPad, however. Any decent XML editor should be able to display correctly without BOMs.
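
For reference, checking for the UTF8 BOM mentioned above is a three-byte comparison. This is a minimal sketch in present-day D (an assumption, not the 2005 toolchain); hasUtf8Bom and stripUtf8Bom are hypothetical helper names.

```d
import std.stdio : writeln;

// True if the data starts with the UTF8 byte order mark ef bb bf.
bool hasUtf8Bom(const(ubyte)[] data)
{
    return data.length >= 3
        && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF;
}

// Return the data without a leading UTF8 BOM, if one is present.
const(ubyte)[] stripUtf8Bom(const(ubyte)[] data)
{
    return hasUtf8Bom(data) ? data[3 .. $] : data;
}

void main()
{
    // BOM followed by the UTF8 bytes of "août".
    immutable ubyte[] withBom = [0xEF, 0xBB, 0xBF, 0x61, 0x6F, 0xC3, 0xBB, 0x74];
    writeln(hasUtf8Bom(withBom));          // true
    writeln(stripUtf8Bom(withBom).length); // 5
}
```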

Best regards,
Stefan


August 09, 2005
Stefan says...
>
>In article <ddb5u6$fmn$1@digitaldaemon.com>, Stefan says...
>>
>>> ...  However, when you
>>> open the test.log file, the content of it is,
>>>
>>>août
>>
>> ... A lot of editors (e.g. Notepad)
>>need a 'magic' byte order mark (BOM) in the beginning of the file to be able to
>>recognize an UTF8 file. For example, the (Notepad) BOM for UTF8 is
>>ef bb bf. Delete these with an hex editor and it'll give you the same
>>output.
>
>Just realized that Notepad on XP works even without the BOM now. You can reproduce it with WordPad, however. Any decent XML editor should be able to display correctly without BOMs.
>
>Best regards,
>Stefan

Thanks for the help, Stefan.  The problem is much deeper than that.  I get an input string from a file which contains certain words, say "août", which I need to test for.  The problem is that even though I have a test in the program, which displays correctly, the test never matches because the input data, "août", and the test data inside the program, "août", never match.
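
One plausible cause of such a mismatch, sketched in present-day D (an assumption; the thread uses a 2005 compiler): the word in the input file is stored in a single-byte encoding while the literal in the source is UTF8, so the two differ byte for byte. The input encoding and the helper name matchesLiteral are assumptions made for illustration.

```d
import std.stdio : writeln;

// Compare raw file bytes with a string literal, byte for byte.
bool matchesLiteral(const(ubyte)[] raw, string literal)
{
    return cast(const(char)[]) raw == literal;
}

void main()
{
    // "août" as it would be stored in an ISO 8859-1 or Windows-1252 file
    // (assumed encoding): a single byte, fb, for the accented letter.
    immutable ubyte[] fromFile = [0x61, 0x6F, 0xFB, 0x74];

    // The same word as a literal in a UTF8 source file: 61 6f c3 bb 74.
    string literal = "août";

    writeln(literal.length);                    // 5
    writeln(fromFile.length);                   // 4
    writeln(matchesLiteral(fromFile, literal)); // false

    // An associative-array lookup compares keys the same way, so it misses too.
    string[string] months = ["août": "08"];
    auto key = cast(string) fromFile; // reinterpretation only, no conversion
    writeln((key in months) is null); // true
}
```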

I guess I still don't understand UTF. :-o

I will have to rewrite this whole check function because of this.

Thanks for the help.

josé


August 09, 2005
In article <ddb989$jkm$1@digitaldaemon.com>, jicman says...
>
>Thanks for the help, Stefan.  The problem is much deeper than that.  I get input string from a file which contains certain words, say "août" which I need to test on.  The problem is that even though I have a test on the program, which displays correctly, the test never matches because the input data, "août", and the test data inside the program, "août", never match.

Are you sure you're reading/interpreting the file as UTF8 (and it actually is UTF8 encoded)? Nevertheless, good luck! If there's something to learn from your investigations let us other D newbies know ;-)

Best regards,
Stefan


>I guess I still don't understand UTF. :-o
>
>I will have to rewrite this whole check function because of this.
>
>Thanks for the help.
>
>josé


August 10, 2005
jicman wrote:
> Greetings!
> 
> Sorry about this, but I have found a wall with UTF8, again.  Perhaps, some of
> you may be able to help me jump it or break through it.
> 
> I have this code,
> char[] GetMonthDigit(char[] mon)
> {
> char [][char[]] sMon;
> sMon["août"] = "08";
> return sMon[mon];
> }
> 
> if I call this from within the program
> 
> char[] mon = GetMonthDigit("août");
> 
> mon will have "08", however if it comes from another text file, say an ASCII
> file, it fails.  So, if you take a look at this small piece of code,
> 
> import std.stream;
> void main()
> {
> char[] fn = "test.log";
> char[] txt = "août";
> File log = new File(fn,FileMode.Append);
> log.writeLine(txt);
> log.close();
> }
> 
> this will create a file, if it does not exist, called test.log after compile
> and run, and it will write, supposedly, the content of txt.  However, when you
> open the test.log file, the content of it is,
> 
> août
> 
> Hmmmm... I tried setting the editor settings to UTF8, and others, however,
> nothing has worked.  Any ideas how I can fix this?
> 
> thanks,
> 
> josé
> 
> 

It must be something with the editor or with the console, because I just tried with gdc-0.13 on Mac and it worked ok.

-- 
Carlos Santander Bernal
August 10, 2005
Stefan Zobel says...
>
>In article <ddb989$jkm$1@digitaldaemon.com>, jicman says...
>>
>>Thanks for the help, Stefan.  The problem is much deeper than that.  I get input string from a file which contains certain words, say "août" which I need to test on.  The problem is that even though I have a test on the program, which displays correctly, the test never matches because the input data, "août", and the test data inside the program, "août", never match.
>
>Are you sure you're reading/interpreting the file as UTF8 (and it actually is
>UTF8 encoded)?

Well, here is a question: do I have to change the data that I work with?  For example, I have these 200+ files with text data which contain ASCII data with many different accented characters.  Do I need to change this input data to UTF8 to be able to work with it?  I know that I have to save the source code files as UTF8, but do I also have to change the other text files that I work with to UTF8?  That is my problem.  I am saving the source files ok, but the input that I read from the text files does not match the source code.  Again, do I need to change that input data to UTF8?


> Nevertheless, good luck! If there's something to learn from
>your investigations let us other D newbies know ;-)

There is nothing to learn. ;-)  All I am going to do is to change any character higher than 127 to +. :-)  That's how I have been able to work with this UTF stuff. :-)

thanks,

josé


August 10, 2005
On Wed, 10 Aug 2005 02:57:12 +0000 (UTC), jicman wrote:

> Stefan Zobel says...
>>
>>In article <ddb989$jkm$1@digitaldaemon.com>, jicman says...
>>>
>>>Thanks for the help, Stefan.  The problem is much deeper than that.  I get input string from a file which contains certain words, say "août" which I need to test on.  The problem is that even though I have a test on the program, which displays correctly, the test never matches because the input data, "août", and the test data inside the program, "août", never match.
>>
>>Are you sure you're reading/interpreting the file as UTF8 (and it actually is
>>UTF8 encoded)?
> 
> Well, here is a question: do I have to change the data that I work with?  For example, I have these 200+ files with text data which contain ASCII data with many different accented characters.  Do I need to change this input data to UTF8 to be able to work with it?  I know that I have to save the source code files as UTF8, but do I also have to change the other text files that I work with to UTF8?  That is my problem.  I am saving the source files ok, but the input that I read from the text files does not match the source code.  Again, do I need to change that input data to UTF8?

Technically, if it contains accented characters it is *not* ASCII. It is some other form of character encoding. For example, my Windows XP has Code Page 850 set for the DOS console.

( http://en.wikipedia.org/wiki/Code_page_850 )

You would need to find out which character encoding standard was used in
your file, then read the file in as a stream of *bytes* not chars, and
convert each of the byte values into the equivalent Unicode character. You
could then use UTF8 "char[]", UTF16 "wchar[]", or UTF32 "dchar[]" as your
preferred coding in your program.
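
The approach described above (read bytes, then map each byte to its Unicode character) can be sketched as follows. It assumes present-day D (D2 with Phobos) and that the input turns out to be ISO 8859-1, where each byte value equals its Unicode code point; other code pages need a lookup table instead. The function name latin1ToUtf8 is made up for illustration.

```d
import std.file : read;
import std.stdio : writeln;

// Convert ISO 8859-1 bytes to a UTF8 string.
string latin1ToUtf8(const(ubyte)[] input)
{
    char[] result;
    foreach (b; input)
    {
        if (b < 0x80)
        {
            result ~= cast(char) b; // ASCII range: unchanged
        }
        else
        {
            // Code points 0x80..0xFF take two bytes in UTF8.
            result ~= cast(char)(0xC0 | (b >> 6));
            result ~= cast(char)(0x80 | (b & 0x3F));
        }
    }
    return result.idup;
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: latin1toutf8 <file>");
        return;
    }
    // Read the file as raw bytes, not as chars.
    auto raw = cast(const(ubyte)[]) read(args[1]);
    writeln(latin1ToUtf8(raw));
}
```

After this conversion the text can be compared against UTF8 literals in the source, for example as a key into a char[]-keyed associative array.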

Also have a look at

  http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues

for further help.
-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
10/08/2005 1:03:43 PM
August 10, 2005
On Wed, 10 Aug 2005 02:57:12 +0000 (UTC), jicman <jicman_member@pathlink.com> wrote:
> Stefan Zobel says...
>>
>> In article <ddb989$jkm$1@digitaldaemon.com>, jicman says...
>>>
>>> Thanks for the help, Stefan.  The problem is much deeper than that.  I
>>> get input string from a file which contains certain words, say "août"
>>> which I need to test on.  The problem is that even though I have a test
>>> on the program, which displays correctly, the test never matches because
>>> the input data, "août", and the test data inside the program, "août",
>>> never match.
>>
>> Are you sure you're reading/interpreting the file as UTF8 (and it
>> actually is UTF8 encoded)?
>
> Well, here is a question: do I have to change the data that I work with?  For example, I have these 200+ files with text data which contain ASCII data with many different accented characters.  Do I need to change this input data to UTF8 to be able to work with it?

Yes and No.

As Derek said, if it has characters above 127 it's not ASCII, see:
   http://www.columbia.edu/kermit/csettables.html

I suspect your data is "Microsoft Windows Code Page 1252" or "ISO 8859-1 Latin Alphabet 1" which are very similar.

To figure it out open the text file in a binary editor and check the value of an accented character and compare it to the tables in the links above.
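
If no binary editor is at hand, a few lines of code can list the bytes to look up. This is a sketch in present-day D (an assumption); HighByte and findHighBytes are hypothetical names.

```d
import std.file : read;
import std.stdio : writefln, writeln;

// Position and value of one byte outside the 7-bit ASCII range.
struct HighByte
{
    size_t offset;
    ubyte value;
}

// Collect every byte >= 0x80 together with its offset in the data.
HighByte[] findHighBytes(const(ubyte)[] data)
{
    HighByte[] found;
    foreach (i, b; data)
    {
        if (b >= 0x80)
            found ~= HighByte(i, b);
    }
    return found;
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: highbytes <file>");
        return;
    }
    auto data = cast(const(ubyte)[]) read(args[1]);
    foreach (h; findHighBytes(data))
        writefln("offset %s: 0x%02X", h.offset, h.value);
}
```

For example, a lone fb where a û is expected points to Latin-1 or Windows-1252, while the pair c3 bb points to UTF8.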

You can read and write these non-UTF characters into a char[] etc., provided you don't use writef or any other routine that actually checks whether the characters are valid UTF, i.e. writeString works but writef will throw an exception.
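
To see which data would trip such a check, the validity test can be run explicitly. This sketch uses std.utf.validate from present-day Phobos as a stand-in for the check described above (an assumption; the 2005 library differed), and isValidUtf8 is a hypothetical helper name.

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

// True if the bytes form well-formed UTF8.
bool isValidUtf8(const(ubyte)[] data)
{
    try
    {
        validate(cast(const(char)[]) data); // throws on malformed input
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    immutable ubyte[] latin1 = [0x61, 0x6F, 0xFB, 0x74];       // "août" in Latin-1
    immutable ubyte[] utf8   = [0x61, 0x6F, 0xC3, 0xBB, 0x74]; // "août" in UTF8
    writeln(isValidUtf8(latin1)); // false
    writeln(isValidUtf8(utf8));   // true
}
```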

If you want to compare the data to a static string you'll need to convert the data to UTF. I have a small module which will convert Windows code page 1252 into UTF8, 16, and 32 and back again (though the back-conversion is totally untested). I needed it for much the same thing as you do. This code is public domain.
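
The module itself is not included in this post. As a rough illustration only (not Regan's code), a Windows-1252 to UTF8 conversion could look like this in present-day D: bytes 80 to 9f go through a lookup table, everything else maps directly to the code point of the same value. The names cp1252High and cp1252ToUtf8 are made up, and unassigned bytes are mapped to U+FFFD by choice.

```d
import std.stdio : writeln;
import std.utf : encode;

// Code points for Windows-1252 bytes 0x80..0x9F.
// 0xFFFD stands in for the five bytes the code page leaves unassigned.
immutable dchar[32] cp1252High = [
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
];

// Convert Windows-1252 bytes to a UTF8 string.
string cp1252ToUtf8(const(ubyte)[] input)
{
    char[] result;
    char[4] buf;
    foreach (b; input)
    {
        // Outside 0x80..0x9F the byte value is already the code point.
        dchar c = (b >= 0x80 && b <= 0x9F) ? cp1252High[b - 0x80] : cast(dchar) b;
        size_t n = encode(buf, c); // UTF8-encode one code point into buf
        result ~= buf[0 .. n];
    }
    return result.idup;
}

void main()
{
    immutable ubyte[] raw = [0x61, 0x6F, 0xFB, 0x74]; // "août" in Windows-1252
    writeln(cp1252ToUtf8(raw) == "août"); // true
}
```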

Regan

August 11, 2005
Carlos Santander wrote:
> 
> It must be something with the editor or with the console, because I just tried with gdc-0.13 on Mac and it worked ok.
> 

That should've been 0.15

-- 
Carlos Santander Bernal