January 28, 2005 writef crashes on international string output

writef crashes when printing international (Russian) strings that are not UTF-encoded but in an ordinary 8-bit codepage.

January 28, 2005 Re: writef crashes on international string output
Posted in reply to Dr.Dizel

Dr.Dizel wrote in news:ctea06$k6q$1@digitaldaemon.com...
> Writef crashes on international (russian) string output not UTF but generic.

Platform? OS? Compiler version? Sample string? Which shell?

Thomas

January 29, 2005 Ouch! It is a dmd parsing bug too.
Posted in reply to Thomas Kuehne

In article <cteamj$ku0$1@digitaldaemon.com>, Thomas Kuehne says...
>Dr.Dizel wrote in news:ctea06$k6q$1@digitaldaemon.com...
>> Writef crashes on international (russian) string output not UTF but generic.

Ouch! It is a dmd parsing bug. I cannot write source files in my national language: not identifiers, just plain strings for output. If I do, dmd cannot parse them in any encoding (ANSI, OEM, KOI8-R ...) except UTF-16. If I use UTF-16, dmd does strange codepage conversions. However, I need to write and print my strings in Russian!

Examples with DOS codepage (866):

------------------------------------
import std.stdio;

int main(char[][] args)
{
    char[] hello_on_russian = "Привет, мир!";
    return 0;
}

C:\dmd\bin>dmd helloworld.d
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
helloworld.d(6): invalid UTF-8 sequence
--------------------------------------------------
import std.stdio;

int main(char[][] args)
{
    char[] hello_on_russian = `Привет, мир!`; // backquotes here
    writef(hello_on_russian);
    return 0;
}

C:\dmd\bin>dmd helloworld.d
C:\dmd\bin\..\..\dm\bin\link.exe helloworld,,,user32+kernel32/noi;

C:\dmd\bin>helloworld
Error: invalid UTF-8 sequence
------------------------------------
import std.stdio;

int main(char[][] args)
{
    char[] hello_on_russian = `Привет, мир!`; // backquotes here
    printf(hello_on_russian);
    return 0;
}

C:\dmd\bin>helloworld
Привет, мир!
------------------------------------

The old printf way works. I think other parts of the dmd library also have bugs in handling national-language strings.

P.S. I use Windows XP and the dmd version is 0.111.

January 29, 2005 Re: Ouch! It is a dmd parsing bug too.
Posted in reply to Dr.Dizel

Dr.Dizel wrote:
> Ouch! It is a dmd parsing bug.

It's not a dmd bug, but a limitation by design...

> I cannot write source files on my national language not identifiers but for
> example just simple strings for output. If I do so dmd cannot parse they in any
> encoding: ANSI, OEM, KOI8R ... except UTF-16. If I use UTF-16 dmd do strange
> codepage conversions. However, I need to write and print my strings on Russian!

D *only* supports Unicode (UTF-8, UTF-16, UTF-32).

This means:
1) Your source code must be in UTF-8
2) Your console input must be UTF-8
3) Your console output will be UTF-8

Otherwise you *will* get errors such as "invalid UTF-8 sequence" or wrong output.

However, Unicode does have full support for Russian / Cyrillic - and so does D.

This means that if you want to run D programs on an unsupported console, you need to cast and change encoding on the char[] before input/output.

The input you get will be in ubyte[], in the local encoding, and can be converted to wchar[] with a lookup table... Similarly, you can convert your char[] to a ubyte[] for output by using the reverse of that table. The lookup table, "wchar[256] mapping", is different for each encoding.

I can post some sample code, if wanted?

You can also use routines from the Windows API to convert to and from the current console code page. They should be somewhere in D, as well.

--anders

PS. Lookup from codepage 866 (ubyte) to unicode (wchar) can be found at:
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP866.TXT

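The lookup-table approach described in the post above could look roughly like this. It is only a minimal sketch, not the sample code offered in the thread: it assumes a present-day D2 compiler and Phobos (not dmd 0.111), the names makeCp866Table, cp866ToUtf8 and utf8ToCp866 are made up for illustration, and only ASCII and the Cyrillic letters of codepage 866 are mapped. Everything else (box-drawing characters and so on) falls back to U+FFFD on input and '?' on output; a complete table would be filled from CP866.TXT.

```d
import std.utf : toUTF8;

/// Build the "wchar[256] mapping" for codepage 866 (hypothetical helper).
/// Only ASCII and the Cyrillic letters are filled in for this sketch.
wchar[256] makeCp866Table()
{
    wchar[256] table;
    foreach (i; 0 .. 256)
    {
        if (i < 0x80)
            table[i] = cast(wchar) i;                        // ASCII maps to itself
        else if (i <= 0xAF)
            table[i] = cast(wchar)(0x0410 + (i - 0x80));     // А..Я, а..п
        else if (i >= 0xE0 && i <= 0xEF)
            table[i] = cast(wchar)(0x0440 + (i - 0xE0));     // р..я
        else if (i == 0xF0)
            table[i] = '\u0401';                             // Ё
        else if (i == 0xF1)
            table[i] = '\u0451';                             // ё
        else
            table[i] = '\uFFFD';                             // not covered here
    }
    return table;
}

immutable wchar[256] cp866 = makeCp866Table();   // computed at compile time

/// Input direction: local-encoding bytes -> UTF-8 string.
string cp866ToUtf8(const(ubyte)[] input)
{
    wchar[] decoded;
    foreach (b; input)
        decoded ~= cp866[b];
    return toUTF8(decoded);
}

/// Output direction: UTF-8 string -> local-encoding bytes (reverse lookup).
ubyte[] utf8ToCp866(string text)
{
    ubyte[] encoded;
    foreach (dchar c; text)            // foreach decodes UTF-8 into code points
    {
        ubyte b = 0x3F;                // '?' for anything the table lacks
        foreach (i, w; cp866)
        {
            if (w == c && w != '\uFFFD')
            {
                b = cast(ubyte) i;
                break;
            }
        }
        encoded ~= b;
    }
    return encoded;
}

unittest
{
    // "Привет" as it would be stored in a CP866 file.
    immutable ubyte[] raw = [0x8F, 0xE0, 0xA8, 0xA2, 0xA5, 0xE2];
    assert(cp866ToUtf8(raw) == "Привет");
    assert(utf8ToCp866("Привет") == raw);
}
```

The bytes returned by utf8ToCp866 are meant to be written to the console as raw bytes rather than passed through writef, which expects UTF-8. The linear reverse search keeps the sketch short; a second table or an associative array would be the obvious choice for real use.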
January 30, 2005 Re: Ouch! It is a dmd parsing bug too.
Posted in reply to Anders F Björklund

In article <ctgl26$4jh$1@digitaldaemon.com>, Anders F Björklund says...
>
>>Dr.Dizel wrote:
>> Ouch! It is a dmd parsing bug.
>
>It's not a dmd bug, but a limitation by design...

Then the backquotes in my example break this design. Why can I use only English strings but not others? Is it the tyranny of the US? :-)

>> I cannot write source files on my national language not identifiers but for example just simple strings for output. If I do so dmd cannot parse they in any encoding: ANSI, OEM, KOI8R ... except UTF-16. If I use UTF-16 dmd do strange codepage conversions. However, I need to write and print my strings on Russian!
>
>D *only* supports Unicode (UTF-8, UTF-16, UTF-32)

However, backquotes ...

>This means:
>1) Your source code must be in UTF-8
>2) Your console input must be UTF-8
>3) Your console output will be UTF-8

Where did you see such a console? Which programs can use it? Is it a spherical horse in a vacuum? :-)

If module std.stdio has no input functions, how can I do it? Is it codepage-safe? How can I read from and write to a non-UTF console? Is it a big problem or a difficult thing to use dmd for programs that run in a multilingual environment?

>Otherwise you *will* get errors such as
>"invalid UTF-8 sequence" or wrong output.
>
>However, Unicode does have full support
>for Russian / Kyrillic - and so does D.
>
>This means that if you want to run D programs on an unsupported console, you need to cast and change encoding on the char[] before input/output.

How can I do that? char[] can hold only UTF-8 chars, and writef cannot output other codepages (see my example).

>The input you get will be in ubyte[], in the local encoding, and can be converted to wchar[] with a lookup table... Similarly, you can convert your char[] to an ubyte[] for output by using the reverse of that table. The lookup table, "wchar[256] mapping", is different for each encoding.

How can I output ubyte[] with writef?

>I can post some sample code, if wanted ?

Yes.

In addition, the developers must rename char to utf8, because it is not a real char, and wchar to utf16 and dchar to utf32. A char must be able to store any character from 0x00 to 0xFF.

January 30, 2005 Re: Ouch! It is a dmd parsing bug too.
Posted in reply to Dr.Dizel

Dr.Dizel wrote:
> In article <ctgl26$4jh$1@digitaldaemon.com>,
> Anders F Björklund says...
>
>>>Dr.Dizel wrote:
>>>Ouch! It is a dmd parsing bug.
>>
>>It's not a dmd bug, but a limitation by design...
>
> Then backquotes in my example destroy this design.
> Why I can use only English strings but cannot others? Is it tyranny of US? :-)

Funny how one gets accused of being a tyrant when using the most liberal and general encoding available... ;)

>>>I cannot write source files on my national language not identifiers but for
>>>example just simple strings for output. If I do so dmd cannot parse they in any
>>>encoding: ANSI, OEM, KOI8R ... except UTF-16. If I use UTF-16 dmd do strange
>>>codepage conversions. However, I need to write and print my strings on Russian!
>>
>>D *only* supports Unicode (UTF-8, UTF-16, UTF-32)
>
> However backquotes ...

You ought to make sure your text editor saves the source code correctly. If you wish to use UTF-16 or UTF-32, be sure that there is a Byte Order Mark at the start of the file. I use jEdit and save files in UTF-8, which works fine.

>>This means:
>>1) Your source code must be in UTF-8
>>2) Your console input must be UTF-8
>>3) Your console output will be UTF-8
>
> Where did you see such console? Which programs can use it? Is it sferic horse in
> vacuum? :-)

I guess your best bet currently would be to not use the console, sad as that is. Alternatively, you might use something like iconv, but I have no idea if it's available for D.

How does Russian console input work, anyway? I'd be interested in that ^^

> In addition, developers must rename char to utf8 because it is not real char and
> wchar to utf16 and dchar to utf32. Char must store any char from 0x00 to 0xFF.

This has come up for discussion many times, actually. IMHO, it doesn't really matter what you call them; the docs state clearly enough what they *are*.

-Sebastian

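To illustrate the Byte Order Mark advice in the post above, here is a small sketch of how a tool might tell the Unicode encodings apart by looking at the first bytes of a file. It is not how dmd itself is implemented; SourceEncoding, startsWithMark and detectByBom are hypothetical names, the code assumes a present-day D2 compiler, and a file without any mark is simply assumed to be UTF-8.

```d
enum SourceEncoding { utf8, utf16le, utf16be, utf32le, utf32be }

/// True if `bytes` begins with the byte sequence `mark`.
bool startsWithMark(const(ubyte)[] bytes, const(ubyte)[] mark)
{
    return bytes.length >= mark.length && bytes[0 .. mark.length] == mark;
}

/// Guess the encoding of a source file from its leading bytes.
SourceEncoding detectByBom(const(ubyte)[] data)
{
    // Test the 4-byte marks first: the UTF-32 LE mark (FF FE 00 00)
    // begins with the UTF-16 LE mark (FF FE).
    if (startsWithMark(data, [0xFF, 0xFE, 0x00, 0x00])) return SourceEncoding.utf32le;
    if (startsWithMark(data, [0x00, 0x00, 0xFE, 0xFF])) return SourceEncoding.utf32be;
    if (startsWithMark(data, [0xFF, 0xFE]))             return SourceEncoding.utf16le;
    if (startsWithMark(data, [0xFE, 0xFF]))             return SourceEncoding.utf16be;
    // EF BB BF (the UTF-8 mark) or no mark at all: assume UTF-8 in this sketch.
    return SourceEncoding.utf8;
}

unittest
{
    assert(detectByBom([0xFF, 0xFE, 0x69, 0x00]) == SourceEncoding.utf16le); // "i" in UTF-16 LE
    assert(detectByBom([0x00, 0x00, 0xFE, 0xFF]) == SourceEncoding.utf32be);
    assert(detectByBom([0xEF, 0xBB, 0xBF, 0x69]) == SourceEncoding.utf8);
    assert(detectByBom([0x69, 0x6D]) == SourceEncoding.utf8);               // no mark
}
```

A file saved in codepage 866 or KOI8-R carries no such mark, so under this scheme it is read as UTF-8, which is consistent with the "invalid UTF-8 sequence" errors reported earlier in the thread.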
January 30, 2005 Re: Ouch! It is a dmd parsing bug too.
Posted in reply to Dr.Dizel

Dr.Dizel wrote:
> Why I can use only English strings but cannot others? Is it tyranny of US? :-)

On the contrary, you can now use a lot more than just Western languages.

>>This means:
>>1) Your source code must be in UTF-8

This implies that your text editor must also be able to handle UTF-8.

>>2) Your console input must be UTF-8
>>3) Your console output will be UTF-8
>
> Where did you see such console? Which programs can use it?

Linux has one. Mac OS X has one. I hope Windows XP can get one...

> If module std.stdio has no any input, how can I do it? Is it codepage safe?
> How can I input from and output to none UTF console?
> Is it a big problem or difficult thing to use dmd for programs,
> which use multilanguage envieroment?

Non-UTF consoles are unsupported, but it can still be done.

> How can I do so: char[] can hold only UTF-8 chars and writef cannot output other
> codepages (see my example)?

Yes.

> How can I output ubyte[] with writef?

That I am not 100% sure of, since I used printf instead. writef works just fine for Unicode, but not for 8-bit...

>>I can post some sample code, if wanted ?
>
> Yes.

See http://www.algonet.se/~afb/d/mapping.zip
I haven't added CP866, but CP437 is there for reference.

Note: There are better versions of this, for Windows only.
(Maybe someone else can post a version using the Win32 API?)

> In addition, developers must rename char to utf8 because it is not real char and
> wchar to utf16 and dchar to utf32. Char must store any char from 0x00 to 0xFF.

The "char" type in D is, by definition, a UTF-8 type. It holds 0x00-0x7F directly, and all other Unicode characters by using up to four chars (char[4])...

To store any so-called character from 0x00-0xFF, you *need* ubyte.

Note: The "real char", if we are talking C/C++, is called "byte" in D.

--anders

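The point about char being a UTF-8 code unit rather than a character can be checked with a few assertions. This is a small sketch assuming a present-day D2 compiler and Phobos (string, std.utf, std.range and std.string.representation as they exist today, not as in dmd 0.111) and a source file saved as UTF-8.

```d
import std.range : walkLength;
import std.string : representation;
import std.utf : toUTF16, toUTF32;

unittest
{
    string s = "Привет";            // six Cyrillic letters

    // .length counts UTF-8 code units (chars), not characters:
    // each of these letters takes two chars.
    assert(s.length == 12);

    // Walking the string decodes it, giving six code points.
    assert(s.walkLength == 6);

    // The same text as wchar (UTF-16) and dchar (UTF-32) code units.
    assert(s.toUTF16.length == 6);
    assert(s.toUTF32.length == 6);

    // The raw bytes are available as ubyte: 'П' (U+041F) is D0 9F in UTF-8.
    immutable(ubyte)[] bytes = s.representation;
    assert(bytes[0] == 0xD0 && bytes[1] == 0x9F);
}
```

So a single char can never hold 'П'; the 8-bit value that stands for 'П' in a codepage such as 866 belongs in a ubyte, as the post says.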
January 30, 2005 Re: Ouch! It is a dmd parsing bug too.
Posted in reply to Anders F Björklund

Anders F Björklund wrote:
> Linux has one. Mac OS X has one. I hope Windows XP can get one...

Michael Walter has demonstrated that the WinXP console is indeed capable of UTF-8:
<http://ilfirin.org/unicode.png>

January 30, 2005 Re: Ouch! It is a dmd parsing bug too.
Posted in reply to Benjamin Herr

Benjamin Herr wrote:
> <http://ilfirin.org/unicode.png>

OMG, don't open the homepage!

January 30, 2005 Re: Ouch! It is a dmd parsing bug too.
Posted in reply to Sebastian Beschke

Sebastian Beschke wrote:
> Benjamin Herr wrote:
>
>> <http://ilfirin.org/unicode.png>
>
> OMG, don't open the homepage!

Sorry if I offended you :(

Copyright © 1999-2021 by the D Language Foundation