reading an unicode file - D Programming Language Discussion Forum


May 11, 2007

reading an unicode file

Posted by jicman

Greetings!

I am reading this file into a char[][] array, and every character comes out separated by a space.  So, if a line read from the file has,

hi there folks!

the string contains,

h i  t h e r e  f o l k s !

I know this has to do with UTF8 and unicode, but how do I fix that?

Any help would be greatly appreciated.

thanks,

josé

May 11, 2007

Re: reading an unicode file

Posted by Bill Baxter
in reply to jicman

jicman wrote:
> Greetings!
> 
> I am reading this file into a char[][] array, and every character comes out
> separated by a space.  So, if a line read from the file has,
> 
> hi there folks!
> 
> the string contains,
> 
> h i  t h e r e  f o l k s !
> 
> I know this has to do with UTF8 and unicode, but how do I fix that?

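Yeh, the file is probably UTF-16 (or UCS-2, its older fixed-width form) rather than UTF-8, meaning every character is 2 bytes (with a few exceptions: characters outside the Basic Multilingual Plane take 4 bytes in UTF-16).  The things between the characters are probably not spaces, but rather null characters (0 bytes).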

> Any help would be greatly appreciated.

Try to read it as binary and use the std.utf functions to convert?
Or maybe read it as wchars with the functions in std.stream (then convert to UTF-8 if necessary with the std.utf functions).

Never done this stuff myself, but that's where I'd look.

--bb
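
A minimal sketch of the "read as binary, then convert" suggestion above. It assumes a current D2 compiler, where std.stream is no longer part of Phobos, so it relies only on std.utf. The helper name utf16leToUtf8 is hypothetical, and the sketch assumes the data is little-endian UTF-16 (the usual case for files written on Windows); a big-endian file would need the two bytes swapped.

```d
import std.stdio : writeln;
import std.utf : toUTF8;

/// Decode raw UTF-16 little-endian bytes (optionally starting with a BOM)
/// into a UTF-8 string. Hypothetical helper, not part of Phobos.
string utf16leToUtf8(const(ubyte)[] raw)
{
    // Skip the UTF-16LE byte order mark (FF FE) if present.
    if (raw.length >= 2 && raw[0] == 0xFF && raw[1] == 0xFE)
        raw = raw[2 .. $];

    // Assemble 16-bit code units by hand so the result does not depend
    // on the host byte order. A trailing odd byte is ignored.
    auto units = new wchar[](raw.length / 2);
    foreach (i, ref u; units)
        u = cast(wchar)(raw[2 * i] | (raw[2 * i + 1] << 8));

    return toUTF8(units);
}

void main()
{
    // "hi" as UTF-16LE with a BOM: each ASCII letter is followed by a 0 byte.
    const(ubyte)[] sample = [0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00];
    writeln(utf16leToUtf8(sample)); // prints "hi"
}
```

For a real file, the bytes could come from std.file.read, for example utf16leToUtf8(cast(const(ubyte)[]) read("data.txt")), where data.txt is a placeholder name.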

May 11, 2007

Re: reading an unicode file

Posted by jicman
in reply to Bill Baxter

Thanks BB.

I should stop using char and do more wchars.  But that is a whole new world for me. :-)

Interestingly enough, I ran this command on the string,

char[] n = std.string.replace(s,"\000","");

and now strings show correctly.  The problem is that I work with accented characters, which will probably break something. I am going to have to look into this, but for now, it's working for this task.

Thanks for the help.

May 11, 2007

Re: reading an unicode file

Posted by Bill Baxter
in reply to jicman

jicman wrote:
> Thanks BB.
> 
> I should stop using char and do more wchars.  But that is a whole new world for
> me. :-)
> 
> Interestingly enough, I ran this command on the string,
> 
> char[] n = std.string.replace(s,"\000","");
> 
> and now strings show correctly.  The problem is that I work with accented
> characters, which will probably break something. I am going to have to look into
> this, but for now, it's working for this task.

Yep. That is probably going to break in horrible ways when you start to encounter more than just plain 7-bit ASCII.

--bb
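
To illustrate why stripping the 0 bytes only works for plain ASCII, here is a small sketch, assuming little-endian UTF-16 input and a current D2 compiler. The helpers stripZeroBytes and isValidUtf8 are hypothetical names; only std.utf.validate comes from Phobos. In UTF-16LE, "é" (U+00E9) is stored as E9 00, and removing the 0 leaves a lone E9, which is not a complete UTF-8 sequence.

```d
import std.stdio : writeln;
import std.utf : validate, UTFException;

/// Mimics the workaround from the thread: drop every 0 byte.
ubyte[] stripZeroBytes(const(ubyte)[] raw)
{
    ubyte[] kept;
    foreach (b; raw)
        if (b != 0)
            kept ~= b;
    return kept;
}

/// True if the bytes form well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "hi" in UTF-16LE: 68 00 69 00
    const(ubyte)[] ascii = [0x68, 0x00, 0x69, 0x00];
    // "josé" in UTF-16LE: 6A 00 6F 00 73 00 E9 00
    const(ubyte)[] accented = [0x6A, 0x00, 0x6F, 0x00, 0x73, 0x00, 0xE9, 0x00];

    writeln(isValidUtf8(stripZeroBytes(ascii)));    // true
    writeln(isValidUtf8(stripZeroBytes(accented))); // false
}
```

Characters above U+00FF fare worse: their high byte is not 0, so nothing is stripped and the two bytes are read as unrelated characters.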

May 11, 2007

Re: reading an unicode file

Posted by Dejan Lekic
in reply to jicman

By reading the BOM of the file you should be able to detect which text format to use. More about the BOM: http://unicode.org/faq/utf_bom.html#BOM
So, I would first check which BOM it is, then use the appropriate readLine() or readLineW() InputStream method to read the file line by line, or, if you prefer just to read until EOF, the appropriate read() method.
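
The BOM check described above could look roughly like this. It is a hand-rolled sketch: the TextEncoding enum and the detectBom function are hypothetical names, and the byte patterns are the ones listed in the Unicode FAQ linked above. A file without a BOM comes back as unknown, so the caller still needs a fallback.

```d
import std.stdio : writeln;

/// Encodings this sketch can recognise from a byte order mark.
enum TextEncoding { unknown, utf8, utf16le, utf16be, utf32le, utf32be }

/// Inspect the first bytes of a file and report which BOM, if any, is there.
TextEncoding detectBom(const(ubyte)[] head)
{
    // The UTF-32LE BOM (FF FE 00 00) starts with the UTF-16LE BOM (FF FE),
    // so the 4-byte patterns have to be tested first.
    if (head.length >= 4 && head[0] == 0xFF && head[1] == 0xFE
            && head[2] == 0x00 && head[3] == 0x00)
        return TextEncoding.utf32le;
    if (head.length >= 4 && head[0] == 0x00 && head[1] == 0x00
            && head[2] == 0xFE && head[3] == 0xFF)
        return TextEncoding.utf32be;
    if (head.length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
        return TextEncoding.utf8;
    if (head.length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
        return TextEncoding.utf16le;
    if (head.length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
        return TextEncoding.utf16be;
    return TextEncoding.unknown;
}

void main()
{
    // First bytes of a UTF-16LE file that begins with the letter "h".
    const(ubyte)[] head = [0xFF, 0xFE, 0x68, 0x00];
    writeln(detectBom(head));
}
```

One known ambiguity: a UTF-16LE file whose first character after the BOM is U+0000 is indistinguishable from UTF-32LE by this test alone.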

May 11, 2007

Re: reading an unicode file

Posted by Chris Nicholson-Sauls
in reply to jicman

jicman wrote:
> Thanks BB.
> 
> I should stop using char and do more wchars.  But that is a whole new world for
> me. :-)
> 
> Interestingly enough, I ran this command on the string,
> 
> char[] n = std.string.replace(s,"\000","");
> 
> and now strings show correctly.  The problem is that I work with accented
> characters, which will probably break something. I am going to have to look into
> this, but for now, it's working for this task.
> 
> Thanks for the help.

Even though it increases sizes, I find using dchar provides vast convenience in cases where you /know/ you want or need to support various sorts of characters outside ASCII.  Of course you ought to experiment to see if wchar is fine for your use case.

-- Chris Nicholson-Sauls
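
A short sketch of the size versus convenience trade-off mentioned above, assuming a current D2 compiler. The .length of a D string counts code units, not characters, so only the dchar-based dstring gives one element per code point. The sample strings are illustrative and written with escapes so the example does not depend on the source file's own encoding.

```d
import std.stdio : writeln;

void main()
{
    // "josé": é is U+00E9.
    string  s8  = "jos\u00E9";   // UTF-8: é takes two code units
    wstring s16 = "jos\u00E9"w;  // UTF-16: one code unit per character here
    dstring s32 = "jos\u00E9"d;  // UTF-32: always one code unit per code point

    writeln(s8.length);  // 5
    writeln(s16.length); // 4
    writeln(s32.length); // 4

    // U+1D11E (musical G clef) lies outside the Basic Multilingual Plane,
    // so UTF-16 needs a surrogate pair for it.
    wstring clef16 = "\U0001D11E"w;
    dstring clef32 = "\U0001D11E"d;

    writeln(clef16.length); // 2
    writeln(clef32.length); // 1
}
```

So wchar is enough as long as the text stays inside the Basic Multilingual Plane, while dchar keeps indexing simple for everything at the cost of 4 bytes per character.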
