Reading unicode chars..

How do I read Unicode characters with code points \u1FFF and higher from a file? file.getcw() reads only part of the character, and D treats the character as an array of three or four characters. Importing std.uni does not change the behavior. Thank you.

On Tuesday, 2 September 2014 at 14:06:04 UTC, seany wrote:
> How do I read unicode chars that has code points \u1FFF and higher from a file?
>
> file.getcw() reads only part of the char, and D identifies this character as an array of three or four characters.
>
> Importing std.uni does not change the behavior.
>
> Thank you.

Maybe someone else here will recognize this, but for me you would need to supply more information. std.file doesn't have getcw. I see one in std.stream, which carries an "outdated" warning, and that getcw is documented as implementation-specific. So what platform are you on? Better yet, can you make a small code sample that shows what you're seeing?

September 02, 2014

Re: Reading unicode chars..

Posted by Ali Çehreli
in reply to seany

On 09/02/2014 07:06 AM, seany wrote:
> How do I read unicode chars that has code points \u1FFF and higher from
> a file?
>
> file.getcw() reads only part of the char, and D identifies this
> character as an array of three or four characters.
>
> Importing std.uni does not change the behavior.
>
> Thank you.

One way is to use std.stdio.File just like you would use stdin and stdout:

import std.stdio;

void main()
{
    string fileName = "unicode_test_file";
    doWrite(fileName);
    doRead(fileName);
}

void doWrite(string fileName)
{
    auto file = File(fileName, "w");
    file.writeln("abcçdef");
}

void doRead(string fileName)
{
    auto file = File(fileName, "r");

    foreach (line; file.byLine) {        // (1)
        foreach (dchar c; line) {        // (2)
            writeln(c);
        }

        import std.range;
        foreach (c; line.stride(1)) {    // (3)
            writeln(c);
        }
    }
}

Notes:

1) To avoid a common gotcha, note that 'line' is reused at every iteration here. You must make copies of portions of it if you need to.

2) dchar is important there: it makes foreach decode the UTF-8 code units in 'line' into whole code points.

3) Any algorithm that treats a string as a range exposes decoded dchars. Here, I used stride.

Ali
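Note 2 can be illustrated with a minimal sketch. It assumes the source file is saved as UTF-8; countCodePoints is a hypothetical helper written for this illustration, not a Phobos function. It shows that „ (U+201E) occupies three char code units in a string but comes out as one dchar when foreach is asked to decode.

```d
import std.stdio;

// Hypothetical helper: counts code points by letting foreach decode to dchar.
size_t countCodePoints(const(char)[] s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string s = "„";   // U+201E, DOUBLE LOW-9 QUOTATION MARK

    // A string is an array of UTF-8 code units, so .length counts bytes.
    writeln("code units:  ", s.length);

    // Iterating as char visits each code unit separately.
    foreach (char unit; s)
        writefln("  unit 0x%02X", cast(ubyte) unit);

    // Iterating as dchar decodes the code units into one code point.
    writeln("code points: ", countCodePoints(s));
    foreach (dchar c; s)
        writefln("  U+%04X", cast(uint) c);
}
```

This is why a reader that hands back one char (or one byte) at a time appears to return "three or four characters" for a single symbol.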

Hi Ali, I know this example from your book.

But try to capture „, the low quotation mark with code point \u201e, which appears in the General Punctuation block of Unicode. As I wrote, I am having problems with \u1FFF and up.

This particular symbol is seen as a dchar array "\x1e\x20", so two dchars. Using wchar returns the same result when I provide the symbol directly to the code.

So I was thinking of using two dchars and printing the dstring. The problem then is that I do not know beforehand whether a particular character read out of the file is a pair of dchars or a single dchar.

And yes, it was stream.getcw(), sorry, not file.getcw(). Indeed, reading this character from a file (using a while loop until EOF) produces an â and an unknown character, shown as a question mark in a white polygon.

On Tuesday, 2 September 2014 at 17:10:57 UTC, Ali Çehreli wrote:
> 1) To avoid a common gotcha, note that 'line' is reused at every iteration here. You must make copies of portions of it if you need to.
>
> Ali

I don't know if you are aware, but "byLineCopy" was recently introduced. It will be available in 2.067. Just spreading info.
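As a rough sketch of the difference, assuming a compiler at 2.067 or later so that byLineCopy is available: the file name and the readLines helper below are hypothetical, made up for this illustration. With byLine, each line has to be copied (here with .idup) before it is stored; byLineCopy hands out a fresh string per line.

```d
import std.array : array;
import std.stdio;

// Hypothetical helper: collects lines with byLine, copying each one
// because byLine reuses its internal buffer between iterations.
string[] readLines(string fileName)
{
    string[] kept;
    foreach (line; File(fileName).byLine)
        kept ~= line.idup;
    return kept;
}

void main()
{
    enum fileName = "byline_demo.txt";   // hypothetical scratch file
    {
        auto f = File(fileName, "w");
        f.writeln("first");
        f.writeln("second„");
    }

    // byLineCopy yields an independent copy of every line.
    auto copied = File(fileName).byLineCopy.array;

    assert(readLines(fileName) == copied);
    writeln(copied);
}
```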

On 09/02/2014 11:11 AM, seany wrote:
> But try to capture „ the low quotation mark, appearing in the
> All-purpose punctuations plane of unicode, with \u201e - I worte I am
> having problems with \u1FFF and up.

You are doing it differently. Can you show us a minimal example? Otherwise, there is nothing special about „. Continuing with my example, just change one line and it still works:

    file.writeln("abcçd„ef");

> This particular symbol, is seen as a dchar array "\x1e\x20" - so two
> dchars

That would happen when you treat the chars on the input as individual dchars. Those chars must be decoded as a single dchar. My example has shown two different ways of doing it. :)

> using wchar returns the same result

Same issue: You are treating individual chars as individual wchars.

Ali
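To make the "must be decoded as a single dchar" point concrete, here is a small sketch using std.utf.decode. decodeAll is a hypothetical helper for this illustration, and the input is assumed to be valid UTF-8 (decode throws a UTFException otherwise).

```d
import std.stdio;
import std.utf : decode;

// Hypothetical helper: decodes a whole UTF-8 buffer into code points.
dchar[] decodeAll(const(char)[] units)
{
    dchar[] result;
    size_t index = 0;
    while (index < units.length)
        result ~= decode(units, index);   // advances index past one code point
    return result;
}

void main()
{
    string s = "„";

    // Wrong: widening each code unit on its own gives unrelated values.
    foreach (char unit; s)
        writefln("unit widened: U+%04X", cast(uint) unit);

    // Right: decode the code units together.
    writefln("decoded:      U+%04X", cast(uint) decodeAll(s)[0]);
}
```

The first loop shows the kind of result the thread describes: several small values instead of the single code point U+201E.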

On Tuesday, 2 September 2014 at 18:22:54 UTC, Ali Çehreli wrote:
> That would happen when you you treat the chars on the input and individual dchars.

That is precisely where the problem is. If you put the character in a file, open the file as a stream, and then call File.getc() or file.getcw() until EOF is reached, you get this problem.

I want to read the file char by char, and the problem is that I don't know where this char will appear, meaning I don't know where I have to treat multiple dchars, read by getc() or getcw(), as a single char.

On Tuesday, 2 September 2014 at 18:30:55 UTC, seany wrote:
> Your example reads the file by lines, i need to get them by chars.

If you are intent on reading the stream character (or wcharacter) one by one, then you will have to decode them manually, as there is no "getcd". Unfortunately, the "newer" std.stdio module does not really provide facilities for such unitary reads.

I'd suggest you create a range out of your std.stream.File that reads it byte by byte. Then you pass it to the "byDchar()" range, which will auto-decode those characters. That is, if you really want to do it "character by character".

What's wrong with reading line by line, but processing the characters in said lines one by one? That works "out of the box".
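One possible shape for the manual decoding mentioned above, sketched with std.stdio.File.rawRead rather than std.stream (a swapped-in technique, not the range-plus-byDchar approach from the post). readDchar and the file name are hypothetical, and the file is assumed to contain UTF-8. The lead byte alone tells std.utf.stride how many bytes belong to the character, so the reader knows how many more bytes to fetch before decoding.

```d
import std.stdio;
import std.utf : decode, stride;

// Hypothetical helper, not part of Phobos: reads one code point from a
// UTF-8 file. Returns false at end of file.
bool readDchar(ref File file, out dchar result)
{
    char[4] buf;   // a UTF-8 sequence is at most four bytes long

    // Read the lead byte; an empty slice means end of file.
    if (file.rawRead(buf[0 .. 1]).length == 0)
        return false;

    // The lead byte determines the length of the whole sequence.
    // stride throws a UTFException if the byte cannot start a sequence.
    immutable len = stride(buf[0 .. 1], 0);

    // Fetch the continuation bytes, if there are any.
    if (len > 1 && file.rawRead(buf[1 .. len]).length != len - 1)
        throw new Exception("truncated UTF-8 sequence");

    size_t index = 0;
    result = decode(buf[0 .. len], index);
    return true;
}

void main()
{
    enum fileName = "dchar_demo.txt";   // hypothetical scratch file
    {
        auto f = File(fileName, "w");
        f.write("a„b");
    }

    auto file = File(fileName, "r");
    dchar c;
    while (readDchar(file, c))
        writefln("U+%04X", cast(uint) c);
}
```

Reading line by line and iterating with foreach (dchar c; line) remains the simpler route; this sketch only applies when the input really has to be consumed one character at a time.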
