Russian and other national languages support

Feb 03, 2009

zorran

Russian language not working
in comments and strings by default
with ANSI coding (code page)

Compiler write error - "invalid UTF-8 sequence"

==============
void main()
{
    string s = "Что-то по русски"; // some text in russian
    printf("hello, world!"); // Здравствуй, мир!
}
==============

(D version 1.039)

in Delphi, C#, and many C++ compilers - All OK!
Why?
it can reduce popularity D!
Russian text not needs two-byte code-page! its not Chinese!

On Tue, 3 Feb 2009 17:13:38 +0000 (UTC), zorran <zorran@tut.by> wrote:

>Russian language not working
>in comments and strings by default
>with ANSI coding (code page)
>
>Compiler write error - "invalid UTF-8 sequence"
>
>==============
>void main()
>{
>    string s = "Что-то по русски"; // some text in russian
>    printf("hello, world!"); // Здравствуй, мир!
>}
>==============
>
>(D version 1.039)
>
>in Delphi, C#, and many C++ compilers - All OK!
>Why?
>it can reduce popularity D!
>Russian text not needs two-byte code-page! its not Chinese!

D strings are supposed to be UTF-8. Source files can be ASCII or UTF. To escape a Unicode code point, use \u0000 or \U00000000, where 0 is a hexadecimal digit.

Be aware that dmd/phobos still have some minor problems with Unicode support. For example, messages produced by static asserts are not output correctly.
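As an aside, the "invalid UTF-8 sequence" error can be reproduced outside the compiler. The following is only a rough sketch in Python (not D), and it assumes the "ANSI" code page in question is Windows-1251, which the thread never actually names. The helper `is_valid_utf8` is a made-up name for illustration. It shows that the same Russian string is a valid UTF-8 byte sequence when saved as UTF-8 and an invalid one when saved in the code page.

```python
# Assumption: the "ANSI" code page is Windows-1251; the thread does not say which one.
text = "Что-то по русски"

ansi_bytes = text.encode("cp1251")  # what an "ANSI" editor would write to disk
utf8_bytes = text.encode("utf-8")   # what a UTF-8 editor would write to disk


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(ansi_bytes))  # False
print(is_valid_utf8(utf8_bytes))  # True
```

The code-page bytes fail because the first byte looks like the start of a two-byte UTF-8 sequence but is not followed by a continuation byte, which is presumably the kind of thing the compiler is complaining about.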

February 03, 2009

Re: Russian and other national languages support

Posted by BCS
in reply to zorran

Reply to Zorran,

> Russian language not working
> in comments and strings by default
> with ANSI coding (code page)
> Compiler write error - "invalid UTF-8 sequence"
> 
> ==============
> void main()
> {
> string s = "&#1063;&#1090;&#1086;-&#1090;&#1086; &#1087;&#1086;
> &#1088;&#1091;&#1089;&#1089;&#1082;&#1080;"; // some text in russian
> printf("hello, world!"); //
> &#1047;&#1076;&#1088;&#1072;&#1074;&#1089;&#1090;&#1074;&#1091;&#1081;
> , &#1084;&#1080;&#1088;!
> }
> ==============
> 
> (D version 1.039)
> 
> in Delphi, C#, and many C++ compilers - All OK!
> Why?
> it can reduce popularity D!
> Russian text not needs two-byte code-page! its not Chinese!

IIRC D doesn't use codepages at all, it is pure UTF-8/16/32. Code pages have all kinds of nasty side effects. For instance, the above code is a garbled mess of number codes in my NG reader. Also this kind of thing: http://www.viprasys.com/vb/f44/hole-notepad-12276/

Way back (2-3 years) I remember a long thread about the use of UTF in D, and the upshot was that it's not great but it's a lot better than anything else anyone has come up with.

On Tue, 03 Feb 2009 20:13:38 +0300, zorran <zorran@tut.by> wrote:

> Russian language not working
> in comments and strings by default
> with ANSI coding (code page)
>
> Compiler write error - "invalid UTF-8 sequence"
>
> ==============
> void main()
> {
>     string s = "Что-то по русски"; // some text in russian
>     printf("hello, world!"); // Здравствуй, мир!
> }
> ==============
>
> (D version 1.039)
>
> in Delphi, C#, and many C++ compilers - All OK!
> Why?
> it can reduce popularity D!
> Russian text not needs two-byte code-page! its not Chinese!

Just save your file as UTF-8 and you are done.
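If the editor at hand cannot save as UTF-8, the same step can be done with a small script. This is a minimal sketch in Python (not D), again assuming the file is currently in Windows-1251. The function name `reencode_to_utf8` and the file name in the usage line are hypothetical.

```python
from pathlib import Path


def reencode_to_utf8(path: Path, source_encoding: str = "cp1251") -> None:
    """Rewrite the file at `path` as UTF-8.

    Assumes the file is currently stored in `source_encoding`
    (Windows-1251 by default, which is a guess, not something the thread states).
    """
    text = path.read_bytes().decode(source_encoding)
    path.write_bytes(text.encode("utf-8"))
```

Usage would look like `reencode_to_utf8(Path("hello.d"))`. Note that running it twice on the same file would garble the text, since the second run would misread the UTF-8 bytes as code-page bytes.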

BCS wrote:
<snip>
> IIRC D doesn't use codepages at all, it is pure UTF-8/16/32. Code pages have all kinds of nasty side effects. For instance, the above code is a garbled mess of number codes in my NG reader. Also this kind of thing: http://www.viprasys.com/vb/f44/hole-notepad-12276/
<snip>

Seems to be a bug in the web newsgroup interface. Indeed:
http://validator.w3.org/check?uri=http://www.digitalmars.com/webnews/newsgroups.php

Knowing PHP, it should be trivial to insert a meta tag to fix this. Though really, www.digitalmars.com should be configured to declare all text/* content as UTF-8 in the HTTP headers.

Meanwhile, best bet is to stop using the web interface and get oneself a newsreader.

Stewart.

Reply to Stewart,

> Meanwhile, best bet is to stop using the web interface and get oneself
> a newsreader.

If the web interface is the problem then it's the posting bit, as /I'm/ not using the web interface.

BCS wrote:
> Reply to Stewart,
>
>> Meanwhile, best bet is to stop using the web interface and get oneself
>> a newsreader.
>
> If the web interface is the problem then it's the posting bit

It can't be just the posting bit. If it doesn't declare a sensible encoding, it can't properly display UTF-8 encoded posts either.

JTAI it doesn't just need to declare an encoding for the HTML output - it also needs to declare a suitable encoding when posting and handle encoding properly when displaying messages. But how easy or not is this in PHP?

> as /I'm/ not using the web interface.

My comment wasn't aimed at you particularly - I just needed somewhere to put it. Sorry if it seemed otherwise.

Stewart.

Hello Stewart,

> BCS wrote:
>
>> Reply to Stewart,
>>
>>> Meanwhile, best bet is to stop using the web interface and get
>>> oneself a newsreader.
>>>
>> If the web interface is the problem then it's the posting bit
>>
> It can't be just the posting bit. If it doesn't declare a sensible
> encoding, it can't properly display UTF-8 encoded posts either.

It could be converting client side to ASCII :)

BCS wrote:
> Hello Stewart,
>
>> BCS wrote:
<snip>
>>> If the web interface is the problem then it's the posting bit
>>>
>> It can't be just the posting bit. If it doesn't declare a sensible
>> encoding, it can't properly display UTF-8 encoded posts either.
>
> it could be converting client side to ASCII :)

AIUI form posts are transmitted in the encoding of the HTML page containing the form. If the user supplies a character that can't be represented in this encoding, it gets converted on the client side to an HTML entity reference.

Look at
http://d.puremagic.com/issues/show_bug.cgi?id=111

When this issue was filed, Bugzilla was configured to serve pages in ISO-8859-1; hence the bug report was mangled, with

# t ~= "我";

having become

# t ~= "&#25105;";

Now our Bugzilla is on UTF-8, but this instance remains because it is what went through to the server at the time and is therefore stored in the database.

Stewart.
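The mangling described above can be imitated with a few lines of Python (not D). This is only an approximation of what a browser does on form submission, not a claim about any particular browser: Python's `xmlcharrefreplace` error handler replaces characters the target encoding cannot represent with decimal numeric character references.

```python
import html

original = 't ~= "我";'

# Encode as if for an ISO-8859-1 page. Characters the encoding cannot
# represent are replaced by decimal numeric character references.
submitted = original.encode("iso-8859-1", errors="xmlcharrefreplace")
print(submitted)  # b't ~= "&#25105;";'

# The reference can be turned back into the character, but only by software
# that knows the text went through this step; a database just stores the bytes.
recovered = html.unescape(submitted.decode("iso-8859-1"))
print(recovered == original)  # True
```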

zorran Wrote:
> in Delphi, C#, and many C++ compilers - All OK!
> Why?
> it can reduce popularity D!
> Russian text not needs two-byte code-page! its not Chinese!

In C# all strings are two-byte encoded (UTF-16), in C++ L"..." strings are (usually) two-byte encoded, Delphi is a legacy technology, but some people enabled it with some WideStrings and TNT, which are Unicode too. Modern projects usually use modern technologies like Unicode. If you really want to work with ANSI strings, you can do it, but then you should not use D libraries, which expect strings to be Unicode.
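To put rough numbers on the "two-byte" point, here is a small Python sketch (not D) that counts how many bytes the string from the original post takes in each encoding. Windows-1251 is again an assumption standing in for "ANSI", and `sizes` is just an illustrative name.

```python
text = "Что-то по русски"

# Byte length of the same 16-character string in three encodings.
sizes = {
    encoding: len(text.encode(encoding))
    for encoding in ("cp1251", "utf-8", "utf-16-le")
}

for encoding, size in sizes.items():
    print(encoding, size)
# cp1251 16
# utf-8 29
# utf-16-le 32
```

So for this string UTF-8 costs two bytes per Cyrillic letter while spaces and the hyphen stay at one byte, which still comes out a little smaller than the UTF-16 form that C# would hold in memory.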
