how to localize console and GUI apps in Windows
December 28
There is one everlasting problem with writing Cyrillic programs in Windows: Microsoft successively introduced two very different code pages for Russia and other Cyrillic-alphabet countries: first MS-DOS 866 (and the like), then Windows-1251. Nowadays MS Windows uses the first code page for console programs and the second for GUI applications, and there are always many workarounds to get proper translation between them. Mostly a programmer has to write program sources either in one code page for console and the other for GUI, or use .NET, which basically uses UTF8 in sources and translates seamlessly depending on the back end.

In the D language, which uses only UTF8 for string encoding, I can write program texts in neither the MS866 code page nor Windows-1251: both cases end in a compiler error like "Invalid trailing code unit" or "Outside Unicode code space". And writing Cyrillic strings in UTF8 format is fatal for both console and GUI Windows targets.

My question is: is there any standard means in the D libraries to translate Cyrillic or any other localized UTF8 strings for console and GUI output? If so, where can I get more information and a good example? Google did not help.

Thanks.

December 28
On Thu, Dec 28, 2017 at 05:56:32PM +0000, Andrei via Digitalmars-d-learn wrote:
> There is one everlasting problem with writing Cyrillic programs in Windows: Microsoft successively introduced two very different code pages for Russia and other Cyrillic-alphabet countries: first MS-DOS 866 (and the like), then Windows-1251. Nowadays MS Windows uses the first code page for console programs and the second for GUI applications, and there are always many workarounds to get proper translation between them. Mostly a programmer has to write program sources either in one code page for console and the other for GUI, or use .NET, which basically uses UTF8 in sources and translates seamlessly depending on the back end.
> 
> In the D language, which uses only UTF8 for string encoding, I can write program texts in neither the MS866 code page nor Windows-1251: both cases end in a compiler error like "Invalid trailing code unit" or "Outside Unicode code space". And writing Cyrillic strings in UTF8 format is fatal for both console and GUI Windows targets.
> 
> My question is: is there any standard means in the D libraries to translate Cyrillic or any other localized UTF8 strings for console and GUI output? If so, where can I get more information and a good example? Google did not help.
[...]

The string / wstring / dstring types in D are intended to be Unicode strings.  If you need to use other encodings, you really should be using ubyte[] or const(ubyte)[] or immutable(ubyte)[], instead of string.

One approach is to use UTF-8 in your code, and only translate to one of the code pages when you need to produce output.  I wrote a small module for translating to/from KOI8-R when dealing with Russian text; you might find it helpful:

-------------------------------------------------------------------------------
/**
 * Module to convert between UTF and KOI8-R
 */
module koi8r;

import std.string;
import std.range;

static immutable ubyte[0x450 - 0x410] utf2koi8r = [
    225, 226, 247, 231, 228, 229, 246, 250, // АБВГДЕЖЗ
    233, 234, 235, 236, 237, 238, 239, 240, // ИЙКЛМНОП
    242, 243, 244, 245, 230, 232, 227, 254, // РСТУФХЦЧ
    251, 253, 255, 249, 248, 252, 224, 241, // ШЩЪЫЬЭЮЯ
    193, 194, 215, 199, 196, 197, 214, 218, // абвгдежз
    201, 202, 203, 204, 205, 206, 207, 208, // ийклмноп
    210, 211, 212, 213, 198, 200, 195, 222, // рстуфхцч
    219, 221, 223, 217, 216, 220, 192, 209  // шщъыьэюя
];

/**
 * Translates a range of UTF characters into KOI8-R characters.
 * Returns: Range of KOI8-R characters (as ubyte).
 */
auto toKOI8r(R)(R range)
    if (isInputRange!R && is(ElementType!R : dchar))
{
    static struct Result
    {
        R _range;

        @property bool empty() { return _range.empty; }

        @property ubyte front()
        {
            dchar ch = _range.front;

            // ASCII
            if (ch < 128)
                return cast(ubyte)ch;

            // Primary alphabetic range
            if (ch >= 0x410 && ch < 0x450)
                return utf2koi8r[ch - 0x410];

            // Special case: Ё and ё are outside the usual range.
            if (ch == 0x401) return 179;
            if (ch == 0x451) return 163;

            throw new Exception(
                "Encoding error: unable to convert '%c' to KOI8-R".format(ch));
        }

        void popFront() { _range.popFront(); }

        static if (isForwardRange!R)
        {
            @property Result save()
            {
                Result copy;
                copy._range = _range.save;
                return copy;
            }
        }
    }
    return Result(range);
}

unittest
{
    import std.string;
    import std.algorithm : equal;

    assert("юабцдефгхийклмнопярстужвьызшэщчъ".toKOI8r.equal(iota(192, 224)));
    assert("ЮАБЦДЕФГХИЙКЛМНОПЯРСТУЖВЬЫЗШЭЩЧЪ".toKOI8r.equal(iota(224, 256)));
}

unittest
{
    auto r = "abc абв".toKOI8r;
    static assert(isForwardRange!(typeof(r)));
    import std.algorithm.comparison : equal;
    assert(r.equal(['a', 'b', 'c', ' ', 193, 194, 215]));
}

static immutable dchar[0x100 - 0xC0] koi8r2utf = [
    'ю', 'а', 'б', 'ц', 'д', 'е', 'ф', 'г', // 192-199
    'х', 'и', 'й', 'к', 'л', 'м', 'н', 'о', // 200-207
    'п', 'я', 'р', 'с', 'т', 'у', 'ж', 'в', // 208-215
    'ь', 'ы', 'з', 'ш', 'э', 'щ', 'ч', 'ъ', // 216-223
    'Ю', 'А', 'Б', 'Ц', 'Д', 'Е', 'Ф', 'Г', // 224-231
    'Х', 'И', 'Й', 'К', 'Л', 'М', 'Н', 'О', // 232-239
    'П', 'Я', 'Р', 'С', 'Т', 'У', 'Ж', 'В', // 240-247
    'Ь', 'Ы', 'З', 'Ш', 'Э', 'Щ', 'Ч', 'Ъ'  // 248-255
];

/**
 * Translates a range of KOI8-R characters to UTF.
 * Returns: Range of UTF characters (as dchar).
 */
auto fromKOI8r(R)(R range)
    if (isInputRange!R && is(ElementType!R : ubyte))
{
    static struct Result
    {
        R _range;
        @property bool empty() { return _range.empty; }
        @property dchar front()
        {
            ubyte b = _range.front;
            if (b < 128) return b;
            if (b >= 192)
                return koi8r2utf[b - 192];

            switch (b)
            {
                case 128: return '─';
                case 152: return '≤';
                case 153: return '≥';
                case 163: return 'ё';
                case 179: return 'Ё';
                default:
                    import std.string : format;
                    throw new Exception(
                        "KOI8-R character %d not implemented yet".format(b));
            }
        }
        void popFront() { _range.popFront(); }
        static if (isForwardRange!R)
        {
            @property Result save()
            {
                Result copy;
                copy._range = _range.save;
                return copy;
            }
        }
    }
    return Result(range);
}

unittest
{
    import std.algorithm.comparison : equal;
    ubyte[] lower = [
        193, 194, 215, 199, 196, 197, 163, 214,
        218, 201, 202, 203, 204, 205, 206, 207,
        208, 210, 211, 212, 213, 198, 200, 195,
        222, 219, 221, 223, 217, 216, 220, 192,
        209
    ];
    assert(lower.fromKOI8r.equal("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"));

    ubyte[] upper = [
        225, 226, 247, 231, 228, 229, 179, 246,
        250, 233, 234, 235, 236, 237, 238, 239,
        240, 242, 243, 244, 245, 230, 232, 227,
        254, 251, 253, 255, 249, 248, 252, 224,
        241
    ];
    assert(upper.fromKOI8r.equal("АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"));
}
-------------------------------------------------------------------------------

As the unittests show, you just call toKOI8r or fromKOI8r to translate between encodings.  All non-Unicode strings are treated as ubyte[], so that you won't accidentally mix up a Unicode string with a KOI8-R string.

And the code should be straightforward enough to be adapted for other encodings as well.

Hope this helps.


T

-- 
For every argument for something, there is always an equal and opposite argument against it. Debates don't give answers, only wounded or inflated egos.
December 28
you can just set console CP to UTF-8:

https://github.com/CyberShadow/ae/blob/master/sys/console.d

December 29
On Thursday, 28 December 2017 at 18:45:39 UTC, H. S. Teoh wrote:
> On Thu, Dec 28, 2017 at 05:56:32PM +0000, Andrei via Digitalmars-d-learn wrote:
> ...
> The string / wstring / dstring types in D are intended to be Unicode strings.  If you need to use other encodings, you really should be using ubyte[] or const(ubyte)[] or immutable(ubyte)[], instead of string.

Thank you, Teoh, for the advice and the good example! I was looking towards writing something like that if no solution exists. Still, this way of deliberate translation does not seem to be the best. It implies an explicit workaround for every ahchoo in Russian and constant conversion of ubyte[] to string and back. No formatting gems, no simple and elegant I/O statements or string/char comparisons. This may be bearable if you write an application where Russian is only one of several rare options, but what if your whole environment is totally Russian?

Or some other non-ASCII locale... Many other cultures have the same mix of DOS/Windows/Unix code pages. The decision to use only Unicode for strings in the D language seems excellent precisely because of this, but the implementation turns out to be deceptive. Folks in such countries won’t appreciate a language which is elegant only for English-language communication.

This problem is common to most programming languages and runtimes I know of. The only system which has solved the whole problem is .NET, I think.

The way proposed by zabruk70 below seems more appropriate, though more specific too: I feel it suits only console applications. Alas, this type of application proved to be buggy too.

On Thursday, 28 December 2017 at 22:49:30 UTC, zabruk70 wrote:
> you can just set console CP to UTF-8:
>
> https://github.com/CyberShadow/ae/blob/master/sys/console.d

Yes! This seems to be what is required, thank you very much! Though it is not suitable for GUI type of a Windows application.

Still, some testing showed that this approach fixes only console output. The simple read/write/compare script listed below works very well until the user enters something in Russian. It then prints an **empty** response and falls into an infinite loop, printing the prompt and then immediately an empty response without actually reading anything.

But I think this is a subject for the “Issues” part of this forum.

p.s. I’ve found that I can set the “Consolas” font for a console window and then output properly localized UTF8 strings without any special code in the D script for managing code pages. Still, this does not solve the localized input problem: any localized input throws an exception “std.utf.UTFException... Invalid UTF-8 sequence”.

The script:

import core.sys.windows.windows;
import std.stdio;
import std.string;

int main(string[] args)
{
    const UTF8CP = 65001;
    UINT oldCP, oldOutputCP;
    oldCP = GetConsoleCP();
    oldOutputCP = GetConsoleOutputCP();

    SetConsoleCP(UTF8CP);
    SetConsoleOutputCP(UTF8CP);

    writeln("hello world, привет всем!");

    bool quit = false;
    string response;
    while (!quit)
    {
        write("respond with something: ");
        response=readln().strip();
        writefln("your response is \"%s\"", response);
        if (response == "quit")
        {
            writeln("goodbye then!");
            quit = true;
        }
    }

    SetConsoleCP(oldCP);
    SetConsoleOutputCP(oldOutputCP);

    return 0;
}

December 29
On Friday, 29 December 2017 at 10:35:53 UTC, Andrei wrote:
> Though it is not suitable for GUI type of a Windows application.

AFAIK, Windows GUI has no ANSI/OEM problem.
You can use Unicode.

For the Windows ANSI/OEM problem you can also use
https://dlang.org/phobos/std_windows_charset.html

December 29
On Fri, Dec 29, 2017 at 10:35:53AM +0000, Andrei via Digitalmars-d-learn wrote:
> On Thursday, 28 December 2017 at 18:45:39 UTC, H. S. Teoh wrote:
> > On Thu, Dec 28, 2017 at 05:56:32PM +0000, Andrei via Digitalmars-d-learn
> > wrote:
> > ...
> > The string / wstring / dstring types in D are intended to be Unicode
> > strings.  If you need to use other encodings, you really should be
> > using ubyte[] or const(ubyte)[] or immutable(ubyte)[], instead of
> > string.
> 
> Thank you, Teoh, for the advice and the good example! I was looking towards writing something like that if no solution exists. Still, this way of deliberate translation does not seem to be the best. It implies an explicit workaround for every ahchoo in Russian and constant conversion of ubyte[] to string and back. No formatting gems, no simple and elegant I/O statements or string/char comparisons. This may be bearable if you write an application where Russian is only one of several rare options, but what if your whole environment is totally Russian?

You mean if your environment uses a non-UTF encoding?  If your environment uses UTF, there is no problem.  I have code with strings in Russian (and other languages) embedded, and it's no problem because everything is in Unicode, all input and all output.

But I understand that in Windows you may not have this luxury. So you have to deal with codepages and what-not.

Converting back and forth is not a big problem, and it actually also solves the problem of string comparisons, because std.uni provides utilities for collating strings, etc. But it only works for Unicode, so you have to convert to Unicode internally anyway.  Also, for static strings, it's not hard to make the codepage mapping functions CTFE-able, so you can actually write string literals in a codepage and have the compiler automatically convert it to UTF-8.
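
A minimal sketch of the CTFE idea, shown in the UTF-8 to code page direction and assuming the target is the console code page 866. The name toCP866 is a hypothetical helper for illustration, not a Phobos function, and it handles only ASCII and the Russian alphabet:

```d
/// Hypothetical helper (not part of Phobos): encodes UTF-8 text as CP866 bytes.
/// Covers ASCII, the letters А..я and Ё/ё; anything else is rejected.
immutable(ubyte)[] toCP866(string s)
{
    immutable(ubyte)[] result;
    foreach (dchar ch; s) // decodes the UTF-8 input one code point at a time
    {
        if (ch < 0x80)
            result ~= cast(ubyte) ch;                     // ASCII is unchanged
        else if (ch >= 0x410 && ch <= 0x43F)
            result ~= cast(ubyte)(ch - 0x410 + 0x80);     // А..п -> 0x80..0xAF
        else if (ch >= 0x440 && ch <= 0x44F)
            result ~= cast(ubyte)(ch - 0x440 + 0xE0);     // р..я -> 0xE0..0xEF
        else if (ch == 0x401)
            result ~= cast(ubyte) 0xF0;                   // Ё
        else if (ch == 0x451)
            result ~= cast(ubyte) 0xF1;                   // ё
        else
            throw new Exception("character not representable in this CP866 sketch");
    }
    return result;
}

// A static initializer forces compile-time evaluation, so the conversion
// costs nothing at run time and the binary contains the CP866 bytes directly.
static immutable ubyte[] greeting = toCP866("привет");
static assert(greeting.length == 6);
```

Because the table is just arithmetic on code points, the same function also works at run time on strings that are not known in advance; an unsupported character in a literal shows up as a compile error instead of an exception.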

The other approach, if you don't like the idea of converting codepages all the time, is to explicitly work in ubyte[] for all strings. Or, preferably, create your own string type with ubyte[] representation underneath, and implement your own comparison functions, etc., then use this type for all strings. Better yet, contribute this to code.dlang.org so that others who have the same problem can reuse your code instead of needing to write their own.

[...]
> p.s. I’ve found that I can set the “Consolas” font for a console window
> and then output properly localized UTF8 strings without any special
> code in the D script for managing code pages. Still, this does not
> solve the localized input problem: any localized input throws an
> exception “std.utf.UTFException...  Invalid UTF-8 sequence”.

Is the exception thrown in readln() or in writeln()? If it's in
writeln(), it shouldn't be a big deal, you just have to pass the data
returned by readln() to fromKOI8r (or the equivalent function for
whatever other codepage you're using).

If the problem is in readln(), then you probably need to read the input in binary (i.e., as ubyte[]) and convert it manually. Unfortunately, there's no other way around this if you're forced to use codepages. The ideal situation is if you can just use Unicode throughout your environment. But of course, sometimes you have no choice.
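
A rough sketch of the manual conversion step, assuming the console input code page is 866 and that the raw bytes have already been read into a ubyte[] buffer (for example with File.rawRead). The name fromCP866 is a hypothetical helper, not a library function, and box-drawing and other symbols are deliberately left out:

```d
import std.array : appender;

/// Hypothetical helper (not part of Phobos): decodes CP866 bytes into UTF-8.
/// Bytes outside ASCII and the Russian alphabet become U+FFFD in this sketch.
string fromCP866(const(ubyte)[] bytes)
{
    auto result = appender!string();
    foreach (b; bytes)
    {
        dchar ch;
        if (b < 0x80)
            ch = b;                                   // ASCII is unchanged
        else if (b <= 0xAF)
            ch = cast(dchar)(0x410 + (b - 0x80));     // 0x80..0xAF -> А..п
        else if (b >= 0xE0 && b <= 0xEF)
            ch = cast(dchar)(0x440 + (b - 0xE0));     // 0xE0..0xEF -> р..я
        else if (b == 0xF0)
            ch = 0x401;                               // Ё
        else if (b == 0xF1)
            ch = 0x451;                               // ё
        else
            ch = '\uFFFD';                            // not handled here
        result.put(ch); // the appender encodes the code point as UTF-8
    }
    return result.data;
}
```

Once the input has passed through a function like this, the rest of the program can use ordinary string operations such as strip() and == without hitting a UTFException.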


T

-- 
Heuristics are bug-ridden by definition. If they didn't have bugs, they'd be algorithms.
January 03
On Friday, 29 December 2017 at 11:14:39 UTC, zabruk70 wrote:
> On Friday, 29 December 2017 at 10:35:53 UTC, Andrei wrote:
>> Though it is not suitable for GUI type of a Windows application.
>
> AFAIK, Windows GUI has no ANSI/OEM problem.
> You can use Unicode.

Partly, yes. Just for a test I tried to "russify" the example Windows GUI program that comes with the D installation package (samples\d\winsamp.d). Window captions, button captions and message box texts written in UTF8 all show fine. But the direct text output functions CreateFont()/TextOut() render all Cyrillic from UTF8 strings as garbage.

> For the Windows ANSI/OEM problem you can also use
> https://dlang.org/phobos/std_windows_charset.html

Thank you very much, toMBSz() does the required translation for the TextOut() function, with some workarounds.



January 03
On Wednesday, 3 January 2018 at 06:42:42 UTC, Andrei wrote:
>> AFAIK, Windows GUI has no ANSI/OEM problem.
>> You can use Unicode.
>
> Partly, yes. Just for a test I tried to "russify" the example Windows GUI program that comes with the D installation package (samples\d\winsamp.d). Window captions, button captions and message box texts written in UTF8 all show fine. But the direct text output functions CreateFont()/TextOut() render all Cyrillic from UTF8 strings as garbage.

The Windows API contains two sets of functions: those whose names end with A (meaning ANSI) and those whose names end with W (wide characters, meaning Unicode). The sample uses TextOutA, a function that expects an 8-bit encoding. Properly, you need to use TextOutW that accepts 16-bit Unicode, so just convert your UTF-8 D strings to 16-bit Unicode wstrings, there are appropriate conversion functions in Phobos.

January 03
On Wednesday, 3 January 2018 at 09:11:32 UTC, thedeemon wrote:
> you need to use TextOutW that accepts 16-bit Unicode, so just convert your UTF-8 D strings to 16-bit Unicode wstrings, there are appropriate conversion functions in Phobos.

Some details:
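import core.sys.windows.windows : TextOutW;
import std.utf : toUTF16;
...
string s = "привет";
wstring ws = s.toUTF16; // UTF-8 to UTF-16
// hdc, x and y come from your paint handler; the last argument is the length in UTF-16 code units
TextOutW(hdc, x, y, ws.ptr, cast(int) ws.length);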
January 03
On Friday, 29 December 2017 at 11:14:39 UTC, zabruk70 wrote:
>
> AFAIK, Windows GUI has no ANSI/OEM problem.
> You can use Unicode.

Be advised there are some problems with console UTF-8 input/output in Windows. The most usable is the new Windows 10 console window, but I recommend using the Windows API (WriteConsole) instead. It works correctly regardless of the code page setting, OS version and C library.
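
A minimal sketch of that approach, assuming the wide-character variant WriteConsoleW is used. The name consoleWrite is a hypothetical wrapper for illustration; when the output is not a real console (redirected to a file or pipe), or on other systems, it falls back to the normal std.stdio path:

```d
import std.stdio : write;
import std.utf : toUTF16;

version (Windows) import core.sys.windows.windows;

/// Hypothetical wrapper: prints UTF-8 text to the console without
/// depending on the console code page.
void consoleWrite(string text)
{
    version (Windows)
    {
        HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
        DWORD mode;
        // GetConsoleMode succeeds only when the handle is a real console,
        // so redirected output goes through the fallback below.
        if (GetConsoleMode(h, &mode))
        {
            wstring w = text.toUTF16; // WriteConsoleW expects UTF-16
            DWORD written;
            WriteConsoleW(h, w.ptr, cast(DWORD) w.length, &written, null);
            return;
        }
    }
    write(text); // non-console output or non-Windows system
}
```

Usage is the same as write(): consoleWrite("привет всем!\n"). The console font still has to contain Cyrillic glyphs for the text to be displayed.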
