Thread overview
Russian and other national languages support
Feb 03, 2009
zorran
Feb 03, 2009
Max Samukha
Feb 03, 2009
BCS
Feb 03, 2009
Stewart Gordon
Feb 03, 2009
BCS
Feb 04, 2009
Stewart Gordon
Feb 04, 2009
BCS
Feb 04, 2009
Stewart Gordon
Feb 03, 2009
Denis Koroskin
Feb 05, 2009
Kagamin
Feb 07, 2009
zorran
Feb 21, 2009
Walter Bright
February 03, 2009
Russian text does not work in comments and strings by default
when the source file uses an ANSI encoding (code page).

The compiler reports the error "invalid UTF-8 sequence".

==============
void main()
{
	string s = "Что-то по русски"; // some text in russian
	printf("hello, world!"); // Здравствуй, мир!
}
==============

(D version 1.039)

In Delphi, C#, and many C++ compilers this all works!
Why not here?
It could reduce D's popularity!
Russian text does not need a two-byte code page! It's not Chinese!
February 03, 2009
On Tue, 3 Feb 2009 17:13:38 +0000 (UTC), zorran <zorran@tut.by> wrote:

>Russian language not working
>in comments and strings by default
>with ANSI coding (code page)
>
>Compiler write error - "invalid UTF-8 sequence"
>
>==============
>void main()
>{
>	string s = "&#1063;&#1090;&#1086;-&#1090;&#1086; &#1087;&#1086; &#1088;&#1091;&#1089;&#1089;&#1082;&#1080;"; // some text in russian
>	printf("hello, world!"); // &#1047;&#1076;&#1088;&#1072;&#1074;&#1089;&#1090;&#1074;&#1091;&#1081;, &#1084;&#1080;&#1088;!
>}
>==============
>
>(D version 1.039)
>
>in Delphi, C#, and many C++ compilers - All OK!
>Why?
>it can reduce popularity D!
>Russian text not needs two-byte code-page! its not Chinese!

D strings are supposed to be UTF-8. Source files can be ASCII or UTF. To escape a Unicode code point, use \u0000 or \U00000000, where 0 is a hexadecimal digit. Be aware that dmd/phobos still have some minor problems with Unicode support. For example, messages produced by static asserts are not output correctly.
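
A minimal sketch of the escape syntax described above, assuming a current D2 compiler with Phobos (the thread itself concerns D 1.039, where `writeln` is not available). The helper name `russianSample` is made up for illustration. It spells the phrase from the original post with `\u` escapes, so the source file contains only ASCII bytes and can be saved in any ASCII-compatible encoding.

```d
import std.stdio : writeln;

// Hypothetical helper (not from the thread): returns the phrase from the
// original post, written entirely with \u escapes.
string russianSample()
{
    return "\u0427\u0442\u043E-\u0442\u043E \u043F\u043E "
         ~ "\u0440\u0443\u0441\u0441\u043A\u0438";
}

void main()
{
    string s = russianSample();

    // 13 Cyrillic letters at 2 UTF-8 bytes each, plus one hyphen and two spaces.
    assert(s.length == 29);

    // The 8-digit \U form names the same code point as the 4-digit \u form.
    assert("\U00000427" == "\u0427");

    writeln(s);
}
```

Whether the console displays the text correctly depends on the terminal's own encoding, which is a separate matter from how the compiler reads the source file.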
February 03, 2009
Reply to Zorran,

> Russian language not working
> in comments and strings by default
> with ANSI coding (code page)
> Compiler write error - "invalid UTF-8 sequence"
> 
> ==============
> void main()
> {
> string s = "&#1063;&#1090;&#1086;-&#1090;&#1086; &#1087;&#1086;
> &#1088;&#1091;&#1089;&#1089;&#1082;&#1080;"; // some text in russian
> printf("hello, world!"); //
> &#1047;&#1076;&#1088;&#1072;&#1074;&#1089;&#1090;&#1074;&#1091;&#1081;
> , &#1084;&#1080;&#1088;!
> }
> ==============
> 
> (D version 1.039)
> 
> in Delphi, C#, and many C++ compilers - All OK!
> Why?
> it can reduce popularity D!
> Russian text not needs two-byte code-page! its not Chinese!

IIRC D doesn't use codepages at all, it is pure UTF-8/16/32. Code pages have all kinds of nasty side effects. For instance, the above code is a garbled mess of number codes in my NG reader. Also this kind of thing: http://www.viprasys.com/vb/f44/hole-notepad-12276/

Way back (2-3 years) I remember a long thread about the use of UTF in D, and the upshot was that it's not great but it's a lot better than anything else anyone has come up with.


February 03, 2009
On Tue, 03 Feb 2009 20:13:38 +0300, zorran <zorran@tut.by> wrote:

> Russian language not working
> in comments and strings by default
> with ANSI coding (code page)
>
> Compiler write error - "invalid UTF-8 sequence"
>
> ==============
> void main()
> {
> 	string s = "&#1063;&#1090;&#1086;-&#1090;&#1086; &#1087;&#1086; &#1088;&#1091;&#1089;&#1089;&#1082;&#1080;"; // some text in russian
> 	printf("hello, world!"); // &#1047;&#1076;&#1088;&#1072;&#1074;&#1089;&#1090;&#1074;&#1091;&#1081;, &#1084;&#1080;&#1088;!
> }
> ==============
>
> (D version 1.039)
>
> in Delphi, C#, and many C++ compilers - All OK!
> Why?
> it can reduce popularity D!
> Russian text not needs two-byte code-page! its not Chinese!

Just save your file as UTF-8 and you are done.
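
Why the encoding of the file matters can be sketched as follows, assuming D2 and Phobos's `std.utf.validate`. The helper `isValidUtf8` and the sample byte arrays are illustrative only: the first array is the word "Что" as a Windows-1251 editor would store it, the second is the same word stored as UTF-8. This does not reproduce the compiler's own check, only the same kind of test.

```d
import std.utf : validate, UTFException;

// Hypothetical helper: reports whether a byte sequence is well-formed UTF-8.
bool isValidUtf8(const(ubyte)[] bytes)
{
    try
    {
        validate(cast(const(char)[]) bytes);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // "Что" in Windows-1251: one byte per letter.
    immutable ubyte[] cp1251 = [0xD7, 0xF2, 0xEE];
    // "Что" in UTF-8: two bytes per letter.
    immutable ubyte[] utf8 = [0xD0, 0xA7, 0xD1, 0x82, 0xD0, 0xBE];

    // 0xD7 announces a two-byte sequence, but 0xF2 is not a continuation byte.
    assert(!isValidUtf8(cp1251));
    assert(isValidUtf8(utf8));
}
```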
February 03, 2009
BCS wrote:
<snip>
> IIRC D doesn't use codepages at all, it is pure UTF-8/16/32. Code pages have all kinds of nasty side effects. For instance, the above code is a garbled mess of number codes in my NG reader. Also this kind of thing: http://www.viprasys.com/vb/f44/hole-notepad-12276/
<snip>

Seems to be a bug in the web newsgroup interface.  Indeed:
http://validator.w3.org/check?uri=http://www.digitalmars.com/webnews/newsgroups.php

Knowing PHP, it should be trivial to insert a meta tag to fix this. Though really, www.digitalmars.com should be configured to declare all text/* content as UTF-8 in the HTTP headers.

Meanwhile, best bet is to stop using the web interface and get oneself a newsreader.

Stewart.
February 03, 2009
Reply to Stewart,

> 
> Meanwhile, best bet is to stop using the web interface and get oneself
> a newsreader.
> 

If the web interface is the problem then it's the posting bit, as /I'm/ not using the web interface.


February 04, 2009
BCS wrote:
> Reply to Stewart,
> 
>> Meanwhile, best bet is to stop using the web interface and get oneself
>> a newsreader.
> 
> If the web interface is the problem than it's the posting bit

It can't be just the posting bit.  If it doesn't declare a sensible encoding, it can't properly display UTF-8 encoded posts either.

JTAI it doesn't just need to declare an encoding for the HTML output - it also needs to declare a suitable encoding when posting and handle encoding properly when displaying messages.  But how easy or not is this in PHP?

> as /I'm/ not using the web interface.

My comment wasn't aimed at you particularly - I just needed somewhere to put it.  Sorry if it seemed otherwise.

Stewart.
February 04, 2009
Hello Stewart,

> BCS wrote:
> 
>> Reply to Stewart,
>> 
>>> Meanwhile, best bet is to stop using the web interface and get
>>> oneself a newsreader.
>>> 
>> If the web interface is the problem than it's the posting bit
>> 
> It can't be just the posting bit.  If it doesn't declare a sensible
> encoding, it can't properly display UTF-8 encoded posts either.
> 

it could be converting client side to ASCII :)


February 04, 2009
BCS wrote:
> Hello Stewart,
> 
>> BCS wrote:
<snip>
>>> If the web interface is the problem than it's the posting bit
>>>
>> It can't be just the posting bit.  If it doesn't declare a sensible
>> encoding, it can't properly display UTF-8 encoded posts either.
> 
> it could be converting client side to ASCII :)

AIUI form posts are transmitted in the encoding of the HTML page containing the form.  If the user supplies a character that can't be represented in this encoding, it gets converted on the client side to an HTML entity reference.  Look at
http://d.puremagic.com/issues/show_bug.cgi?id=111

When this issue was filed, Bugzilla was configured to serve pages in ISO-8859-1; hence the bug report was mangled, with

#  t ~= "我";

having become

#  t ~= "&#25105;";

Now our Bugzilla is on UTF-8, but this instance remains because it is what went through to the server at the time and is therefore stored in the database.
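
The client-side conversion described above can be imitated with a short sketch, assuming D2 and Phobos. The function name `toLatin1WithEntities` is made up, and the code only mimics the observed behaviour for a page served as ISO-8859-1; it is not taken from any browser or from Bugzilla.

```d
import std.conv : to;

// Hypothetical helper: keeps characters that ISO-8859-1 can represent
// (U+0000 to U+00FF) and replaces everything else with a decimal
// numeric character reference such as &#25105;.
string toLatin1WithEntities(string text)
{
    string result;
    foreach (dchar c; text)
    {
        if (c <= 0xFF)
            result ~= c;
        else
            result ~= "&#" ~ to!string(cast(uint) c) ~ ";";
    }
    return result;
}

void main()
{
    // U+6211 is 25105 in decimal, matching the mangled bug report.
    assert(toLatin1WithEntities("t ~= \"\u6211\";") == "t ~= \"&#25105;\";");

    // U+0427 is 1063 in decimal, matching the quoted posts in this thread.
    assert(toLatin1WithEntities("\u0427") == "&#1063;");
}
```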

Stewart.
February 05, 2009
zorran Wrote:

> in Delphi, C#, and many C++ compilers - All OK!
> Why?
> it can reduce popularity D!
> Russian text not needs two-byte code-page! its not Chinese!

In C# all strings are two-byte encoded (UTF-16). In C++, L"..." strings are (usually) two-byte encoded. Delphi is a legacy technology, but some people have enabled Unicode in it with WideStrings and TNT. Modern projects usually use modern technologies like Unicode. If you really want to work with ANSI strings, you can, but then you should not use D libraries, which expect strings to be Unicode.
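
To make the sizes concrete, here is a small sketch, assuming D2, that counts code units for the same text in D's three string types. The literals use `\u` escapes so the example does not depend on how the source file is saved.

```d
import std.stdio : writeln;

void main()
{
    // The word "мир" as UTF-8, UTF-16 and UTF-32 literals.
    string  utf8  = "\u043C\u0438\u0440";
    wstring utf16 = "\u043C\u0438\u0440"w;
    dstring utf32 = "\u043C\u0438\u0440"d;

    // .length counts code units, not characters.
    assert(utf8.length == 6);   // 3 letters, 2 bytes each
    assert(utf16.length == 3);  // 3 letters, one 16-bit unit each
    assert(utf32.length == 3);  // 3 letters, one 32-bit unit each

    // A Chinese character needs 3 bytes in UTF-8 but still one UTF-16 unit.
    assert("\u6211".length == 3);
    assert("\u6211"w.length == 1);

    writeln(utf8.length, " ", utf16.length, " ", utf32.length);
}
```

So Russian text takes two bytes per letter in both UTF-8 and UTF-16, while plain ASCII stays at one byte per character in UTF-8.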