Read a unicode character from the terminal - D Programming Language Discussion Forum

March 31, 2012

Read a unicode character from the terminal

Posted by Jacob Carlborg

How would I read a unicode character from the terminal? I've tried using "std.cstream.din.getc" but it seems to only work for ascii characters. If I try to read and print something that isn't ascii, it just prints a question mark.

-- 
/Jacob Carlborg

March 31, 2012

Re: Read a unicode character from the terminal

Posted by Ali Çehreli
in reply to Jacob Carlborg

On 03/31/2012 08:56 AM, Jacob Carlborg wrote:
> How would I read a unicode character from the terminal? I've tried using
> "std.cstream.din.getc"

I recommend using stdin. The destiny of std.cstream is uncertain and stdin is sufficient. (I know that it lacks support for BOM but I don't need them.)

> but it seems to only work for ascii characters.
> If I try to read and print something that isn't ascii, it just prints a
> question mark.

The word 'character' used to mean characters of the Latin-based alphabets but with Unicode support that's not the case anymore. In D, 'character' means UTF code unit, nothing else. Unfortunately, although 'Unicode character' is just the correct term to use, it conflicts with D's characters which are not Unicode characters.

'Unicode code point' is the non-conflicting term that matches what we mean with 'Unicode character.' Only dchar can hold code points.

That's the part about characters.

The other side is what is being fed into the program through its standard input. On my Linux consoles, the text comes as a stream of chars, i.e. a UTF-8 encoded text. You must ensure that your terminal is capable of supporting Unicode through its settings. On Windows terminals, one must enter 'chcp 65001' to set the terminal to UTF-8.

Then, it is the program that must know what the data represents. If you are expecting a Unicode code point, then you may think that it should be as simple as reading into a dchar:

import std.stdio;

void main()
{
    dchar letter;
    readf("%s", &letter);    // <-- does not work!
    writeln(letter);
}

The output:

$ ./deneme
ç
Ã  <-- will be different on different consoles

The problem is, char can implicitly be converted to dchar. Since the letter ç consists of two chars (two UTF-8 code units), dchar gets the first one converted as a dchar.
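The same effect can be reproduced without a console. The following is only a sketch: it uses a string literal in place of real input, and the helper name firstUnitAsDchar is made up for illustration.

```d
import std.stdio : writefln;

// Hypothetical helper: widen only the first UTF-8 code unit of s,
// which is effectively what the failing readf example ends up doing.
dchar firstUnitAsDchar(string s)
{
    return s[0];   // char -> dchar is an implicit conversion; nothing is decoded
}

void main()
{
    string s = "ç";   // one code point, U+00E7, stored as two UTF-8 code units

    writefln("%s code units", s.length);
    foreach (char unit; s)
        writefln("0x%02X", cast(uint) unit);   // the raw code units, C3 then A7

    // 0xC3 taken on its own is the code point U+00C3, which displays as 'Ã'
    writefln("U+%04X", cast(uint) firstUnitAsDchar(s));
}
```

The first code unit of ç is 0xC3, and widening it gives U+00C3 rather than U+00E7, which would explain the Ã in the output above.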

To see this, read and write two chars in a loop without a newline in between:

import std.stdio;

void main()
{
    foreach (i; 0 .. 2) {
        char code;
        readf("%s", &code);
        write(code);
    }

    writeln();
}

This time two code units are read and then outputted to form a Unicode character on the console:

$ ./deneme
ç
ç   <-- result of two write(code) expressions

The solution is to use ranges when pulling Unicode characters out of strings. std.stdio does not provide this yet, but it will eventually happen (so I've heard :)).

For now, this is a way of getting Unicode characters from the input:

import std.stdio;

void main()
{
    string line = readln();

    foreach (dchar c; line) {
        writeln(c);
    }
}

Once you have the input as a string, std.utf.decode can also be used.
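A minimal sketch of the std.utf.decode variant, assuming the whole line is already in memory as a string. The helper name codePoints is hypothetical. decode takes the string and an index by reference, returns the code point that starts at that index, and advances the index past it.

```d
import std.stdio : readln, writefln;
import std.utf : decode;

// Hypothetical helper: collect the code points of a UTF-8 string.
dchar[] codePoints(string s)
{
    dchar[] result;
    size_t i = 0;
    while (i < s.length)
        result ~= decode(s, i);   // decodes one code point and moves i past it
    return result;
}

void main()
{
    string line = readln();   // includes the trailing newline, if any
    foreach (c; codePoints(line))
        writefln("U+%04X", cast(uint) c);
}
```

Note that decode throws a UTFException on malformed input, so real code may want to catch that.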

Ali

March 31, 2012

Re: Read a unicode character from the terminal

Posted by Jordi Sayol
in reply to Ali Çehreli

Many thanks for being so educational.

Best regards,
-- 
Jordi Sayol

March 31, 2012

Re: Read a unicode character from the terminal

Posted by Jordi Sayol
in reply to Ali Çehreli

BTW, for those who do not know, Ali Çehreli is writing a book to learn "D" from scratch. It's very educational.
There are two formats: HTML (on-line) and PDF.
http://ddili.org/ders/d.en/index.html

Best regards,
-- 
Jordi Sayol

March 31, 2012

Re: Read a unicode character from the terminal

Posted by Ali Çehreli
in reply to Jordi Sayol

On 03/31/2012 02:31 PM, Jordi Sayol wrote:
> BTW, for those who do not know, Ali Çehreli is writing a book to learn "D" from scratch. It's very educational.
> There are two formats: HTML (on-line) and PDF.
> http://ddili.org/ders/d.en/index.html
>
> Best regards,

Thank you very much for the free plug! :)

I have translated eleven more chapters since the last announcement. I am on the assert chapter as we speak. It is taking longer than I had expected because I constantly make improvements to the original: corrections, consistency improvements, additions, adapting code samples to the current state of D, etc.

Ali

March 31, 2012

Re: Read a unicode character from the terminal

Posted by Stewart Gordon
in reply to Jacob Carlborg

On 31/03/2012 16:56, Jacob Carlborg wrote:
> How would I read a unicode character from the terminal? I've tried using
> "std.cstream.din.getc" but it seems to only work for ascii characters. If I try to read
> and print something that isn't ascii, it just prints a question mark.

What OS are you using?

And what codepage is the console set to?

You might want to try the console module in my utility library:

http://pr.stewartsplace.org.uk/d/sutil/

(For D1 at the moment, but a D2 version will be available any day now!)

Stewart.

March 31, 2012

Re: Read a unicode character from the terminal

Posted by Ali Çehreli
in reply to Ali Çehreli

On 03/31/2012 11:53 AM, Ali Çehreli wrote:

> The solution is to use ranges when pulling Unicode characters out of
> strings. std.stdio does not provide this yet, but it will eventually
> happen (so I've heard :)).

Here is a Unicode character range, which is unfortunately pretty inefficient because it relies on an exception to detect an incomplete sequence (the exception is thrown when codes.front tries to decode, before isValidDchar even runs)! :p

import std.stdio;
import std.utf;
import std.array;

struct UnicodeRange
{
    File file;
    char[4] codes;
    bool ready;

    this(File file)
    {
        this.file = file;
        this.ready = false;
    }

    bool empty() const @property
    {
        return file.eof();
    }

    dchar front() const @property
    {
        if (!ready) {
            // Sorry, no 'mutable' in D! :p
            UnicodeRange * mutable_this = cast(UnicodeRange*)&this;
            mutable_this.readNext();
        }
        return codes.front;
    }

    void popFront()
    {
        codes = codes.init;
        ready = false;
    }

    void readNext()
    {
        foreach (ref code; codes) {
            file.readf("%s", &code);

            if (file.eof()) {
                codes[] = '\0';
                ready = false;
                break;
            }

            // Expensive way of determining "ready"!
            try {
                if (isValidDchar(codes.front)) {
                    ready = true;
                    break;
                }

            } catch (Exception) {
                // not ready
            }
        }
    }
}

UnicodeRange byUnicode(File file = stdin)
{
    return UnicodeRange(file);
}

void main()
{
    foreach(c; byUnicode()) {
        writeln(c);
    }
}
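One possible way to avoid the exception on every partial sequence is to work out the sequence length from the first byte, since a UTF-8 lead byte encodes how many code units follow. This is only a sketch of that alternative, not part of the range above. The names sequenceLength and readCodePoint are hypothetical, and it assumes that reading raw bytes with File.rawRead is acceptable for the input in question.

```d
import std.stdio : File, stdin, writeln;
import std.utf : decode;

// Hypothetical helper: number of UTF-8 code units in the sequence that
// starts with lead, or 0 if lead cannot start a sequence.
size_t sequenceLength(char lead)
{
    if (lead < 0x80) return 1;             // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;   // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;   // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;   // 11110xxx
    return 0;                              // continuation byte or invalid lead
}

// Hypothetical helper: read one code point from file into result.
// Returns false at end of input or when the bytes cannot start a sequence.
bool readCodePoint(File file, out dchar result)
{
    char[4] buf;
    if (file.rawRead(buf[0 .. 1]).length == 0)
        return false;                          // end of input

    immutable len = sequenceLength(buf[0]);
    if (len == 0)
        return false;                          // not a valid lead byte
    if (len > 1 && file.rawRead(buf[1 .. len]).length != len - 1)
        return false;                          // input ended mid-sequence

    size_t index = 0;
    result = decode(buf[0 .. len], index);     // still validates the continuation bytes
    return true;
}

void main()
{
    dchar c;
    while (readCodePoint(stdin, c))
        writeln(c);
}
```

With this approach decode only throws for genuinely malformed input, not once per multi-byte character.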

Ali

April 01, 2012

Re: Read a unicode character from the terminal

Posted by Jacob Carlborg
in reply to Ali Çehreli

On 2012-03-31 20:53, Ali Çehreli wrote:
> I recommend using stdin. The destiny of std.cstream is uncertain and
> stdin is sufficient. (I know that it lacks support for BOM but I don't
> need them.)

I thought std.cstream was a stream wrapper around stdin.

> The word 'character' used to mean characters of the Latin-based
> alphabets but with Unicode support that's not the case anymore. In D,
> 'character' means UTF code unit, nothing else. Unfortunately, although
> 'Unicode character' is just the correct term to use, it conflicts with
> D's characters which are not Unicode characters.
>
> 'Unicode code point' is the non-conflicting term that matches what we
> mean with 'Unicode character.' Only dchar can hold code points.
>
> That's the part about characters.

Yeah, exactly. When I think about it, I don't know why I thought "getc" would work since it only returns a "char" and not a "dchar".

> The other side is what is being fed into the program through its
> standard input. On my Linux consoles, the text comes as a stream of
> chars, i.e. a UTF-8 encoded text. You must ensure that your terminal is
> capable of supporting Unicode through its settings. On Windows
> terminals, one must enter 'chcp 65001' to set the terminal to UTF-8.

I'm on Mac OS X, the terminal is capable of handling Unicode.

> Then, it is the program that must know what the data represents. If you
> are expecting a Unicode code point, then you may think that it should be
> as simple as reading into a dchar:
>
> import std.stdio;
>
> void main()
> {
>     dchar letter;
>     readf("%s", &letter); // <-- does not work!
>     writeln(letter);
> }
>
> The output:
>
> $ ./deneme
> ç
> Ã <-- will be different on different consoles

I tried that as well.

> The problem is, char can implicitly be converted to dchar. Since the
> letter ç consists of two chars (two UTF-8 code units), dchar gets the
> first one converted as a dchar.
>
> To see this, read and write two chars in a loop without a newline in
> between:
>
> import std.stdio;
>
> void main()
> {
>     foreach (i; 0 .. 2) {
>         char code;
>         readf("%s", &code);
>         write(code);
>     }
>
>     writeln();
> }
>
> This time two code units are read and then outputted to form a Unicode
> character on the console:
>
> $ ./deneme
> ç
> ç <-- result of two write(code) expressions
>
> The solution is to use ranges when pulling Unicode characters out of
> strings. std.stdio does not provide this yet, but it will eventually
> happen (so I've heard :)).
>
> For now, this is a way of getting Unicode characters from the input:
>
> import std.stdio;
>
> void main()
> {
>     string line = readln();
>
>     foreach (dchar c; line) {
>         writeln(c);
>     }
> }
>
> Once you have the input as a string, std.utf.decode can also be used.
>
> Ali
>

I'll give that a try, thanks.

-- 
/Jacob Carlborg

April 01, 2012

Re: Read a unicode character from the terminal

Posted by Jacob Carlborg
in reply to Ali Çehreli

On 2012-04-01 01:17, Ali Çehreli wrote:
> On 03/31/2012 11:53 AM, Ali Çehreli wrote:
>
>  > The solution is to use ranges when pulling Unicode characters out of
>  > strings. std.stdio does not provide this yet, but it will eventually
>  > happen (so I've heard :)).
>
> Here is a Unicode character range, which is unfortunately pretty
> inefficient because it relies on an exception that is thrown from
> isValidDchar! :p

OK, what are the differences compared to the example in your first post?

void main()
{
    string line = readln();

    foreach (dchar c; line) {
        writeln(c);
    }
}

-- 
/Jacob Carlborg

April 01, 2012

Re: Read a unicode character from the terminal

Posted by Jacob Carlborg
in reply to Stewart Gordon

On 2012-04-01 00:14, Stewart Gordon wrote:
> On 31/03/2012 16:56, Jacob Carlborg wrote:
>> How would I read a unicode character from the terminal? I've tried using
>> "std.cstream.din.getc" but it seems to only work for ascii characters.
>> If I try to read
>> and print something that isn't ascii, it just prints a question mark.
>
> What OS are you using?
>
> And what codepage is the console set to?

I'm using Mac OS X and the terminal is set to handle UTF-8.

> You might want to try the console module in my utility library:
>
> http://pr.stewartsplace.org.uk/d/sutil/
>
> (For D1 at the moment, but a D2 version will be available any day now!)
>
> Stewart.

I'll have a look, thanks.

-- 
/Jacob Carlborg
