What's left for 1.0? - string class (page 3) - D Programming Language Discussion Forum

November 17, 2006

Re: What's left for 1.0? - string class

Posted by Samuel MV
in reply to Bill Baxter


I don't think this is a library matter, because it's the way char[] works. If 'hög' or 'aún' aren't 3 chars, it's broken ... :(

Best regards,

              Samuel.

Bill Baxter wrote:
> Aarti_pl wrote:
>> I cannot believe that no one uses UTF-8 characters in their programs or is concerned about the issues with the current D char[] implementation, so I am reposting my previous post. Sorry about the repost: if no one comments, I will take the lesson and conclude that maybe this issue is not so important.
>>
>> But I would prefer to get some comments, even negative ones, on the importance of having a built-in string class...

>> For me, a string class is something that could significantly improve the quality of libraries for D.
> 
>  From previous discussions it seemed to me like there was a fair amount of support for a string class.   I think the lack of response could be just that not so many folks feel like it is a "must-have" for 1.0.
> 
> I think anything that can be done in a library can wait till post 1.0. C++ had very little in the way of a standard library at "1.0" (and really it still has very little).  But for 1.0, the language itself better be in a state that it is *possible* to write every library on the wish list.  If there is anything in the language itself that would prevent creating a string class like the one you speak of, then I think that needs to be addressed.  But the string class itself can come later.
> 
> Of course if dstring really is good enough as-is, then it might as well be in the std library for 1.0.
> 
> --bb

Samuel MV wrote:
> This is *very* serious for i18n:
>
> >> char[] foo = "hög";
> >> assert(foo.length == 3); // Sorry UTF-8, this is == 4
> >> assert(foo[1] == 'ö'); // Not a chance!
>
> char[] should be a real char[], not a sort of byte[] for text. It needs to be fixed for non-English text.

That's what wchar and dchar are for. If all you want is to make sure your chars are chars, then use dchar everywhere and be happy. Just be aware that dchars are 32 bits apiece. Not a big deal for most apps, but could be for a few.

Is there any problem with dchar other than its size being massive overkill for Western languages?

--bb
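The difference between code units and code points discussed above can be checked directly. The following is a minimal sketch, assuming a present-day D2 compiler and Phobos rather than the 2006 compiler under discussion, and assuming the source file is saved as UTF-8 with precomposed characters. The helper names `codeUnits` and `codePoints` are made up for illustration.

```d
import std.range : walkLength;
import std.utf : toUTF32;

// Number of UTF-8 code units (bytes) in the text.
size_t codeUnits(string s) { return s.length; }

// Number of code points; walkLength decodes the UTF-8 string as it walks.
size_t codePoints(string s) { return s.walkLength; }

void main()
{
    string foo = "hög";
    assert(codeUnits(foo) == 4);   // 'ö' takes two code units in UTF-8
    assert(codePoints(foo) == 3);

    dstring wide = toUTF32(foo);   // one dchar per code point
    assert(wide.length == 3);
    assert(wide[1] == 'ö');        // indexing by code point works on dchar data
}
```

This trades memory for simple indexing, which is the trade-off raised in the post: each dchar takes four bytes.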

Yep, memory is cheap, but then libraries have to support char/wchar/dchar well (quite unusual) ... I think that there should be only one kind of char, that internally works as UTF8, UTF16 or UTF32 (automatically or on demand), but you don't care about it except when you need to interface with non-D libraries, files, etc. (solved with a couple of functions)

Best Regards,

              Samuel.

Bill Baxter wrote:
> Samuel MV wrote:
>> This is *very* serious for i18n:
>>
>> >> char[] foo = "hög";
>> >> assert(foo.length == 3); // Sorry UTF-8, this is == 4
>> >> assert(foo[1] == 'ö'); // Not a chance!
>>
>> char[] should be a real char[], not a sort of byte[] for text. It needs to be fixed for non-English text.
>
> That's what wchar and dchar are for. If all you want is to make sure your chars are chars, then use dchar everywhere and be happy. Just be aware that dchars are 32 bits apiece. Not a big deal for most apps, but could be for a few.
>
> Is there any problem with dchar other than its size being massive overkill for Western languages?
>
> --bb
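Samuel's idea of working in one text type and converting only at the boundaries ("a couple of functions") can be approximated with the conversion functions in `std.utf`. This is a hedged sketch assuming present-day D2 Phobos; the wrapper names `forUtf8Api` and `forUtf16Api` are hypothetical and stand in for whatever file or foreign-library boundary is involved.

```d
import std.utf : toUTF8, toUTF16, toUTF32;

// Hypothetical boundary helpers: text is kept as code points internally
// and transcoded only when it leaves the program.
string  forUtf8Api(dstring text)  { return toUTF8(text); }
wstring forUtf16Api(dstring text) { return toUTF16(text); }

void main()
{
    dstring text = "aún"d;                    // one dchar per code point
    assert(text.length == 3);

    string utf8 = forUtf8Api(text);
    wstring utf16 = forUtf16Api(text);
    assert(utf8.length == 4);                 // 'ú' is two UTF-8 code units
    assert(utf16.length == 3);                // each character fits one UTF-16 unit
    assert(toUTF32(utf8) == text);            // the round trip preserves the text
}
```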

Bill Baxter wrote:
> Samuel MV wrote:
>> This is *very* serious for i18n:
>>
>> >> char[] foo = "hög";
>> >> assert(foo.length == 3); // Sorry UTF-8, this is == 4
>> >> assert(foo[1] == 'ö'); // Not a chance!
>>
>> char[] should be a real char[], not a sort of byte[] for text. It needs to be fixed for non-English text.
>
> That's what wchar and dchar are for. If all you want is to make sure your chars are chars, then use dchar everywhere and be happy. Just be aware that dchars are 32 bits apiece. Not a big deal for most apps, but could be for a few.
>
> Is there any problem with dchar other than its size being massive overkill for Western languages?
>
> --bb

From my point of view, char is currently just an "alias" for ubyte, and could/should be removed because it is superfluous. You cannot even write

    char letter="ą"; // Polish character: a with ogonek

and in its current state it is confusing... Maybe only dchar should be kept, and dchar should be renamed to char?... But OK, I can live with char... But I think a good string class is really necessary in any case...

Regards
Marcin Kuszczak

Bill Baxter wrote:
>> For me string class is something what could significantly improve quality of libraries for D.
>
> From previous discussions it seemed to me like there was a fair amount of support for a string class. I think the lack of response could be just that not so many folks feel like it is a "must-have" for 1.0.

Since D is a hybrid language, it needs string types AND a String class.

And since Phobos isn't a pure OOP library, the lack of a pure OOP string class isn't all that surprising. Especially after bashing std::string...

But it would be nice to have one "official" String class, instead of everyone inventing their own, which seems to be inevitable otherwise?

--anders

Aarti_pl wrote:
> from my point of view currently char is just an "alias" for ubyte, and could/should be removed because it is superfluous.

char and wchar are nothing special, but char[] and wchar[] are magic.
If those used ubyte[] and ushort[], code point looping wouldn't work.

--anders

Very good. I never tried it, but for some reason always thought it could not be done. Will Linux be getting this capability within, say, a year or so? Also, is there a utility in Phobos to load a DLL at run-time?

-Craig

"Walter Bright" <newshound@digitalmars.com> wrote in message news:ejis3v$499$1@digitaldaemon.com...
> Craig Black wrote:
>> Can you load a DLL that implements an abstract class at run-time?
>
> Sure. Just like in C++. (Note that shared library support isn't there in the linux DMD yet, but this should work under Windows.)

Anders F Björklund wrote:
> Aarti_pl wrote:
>
>> from my point of view currently char is just an "alias" for ubyte, and could/should be removed because it is superfluous.
>
> char and wchar are nothing special, but char[] and wchar[] are magic.
> If those used ubyte[] and ushort[], code point looping wouldn't work.
>
> --anders

So it works now? On dmd 0.172 it exits with:

Error: invalid UTF8 sequence

------------
import std.stdio;

void main() {
    char[] text="Łóżko";
    foreach(c; text) {
        writefln(c);
    }
}
--------------

With a string class (e.g. dstring) it could work, and some of the magic could be removed from the compiler :-). But right now, running this program is just a disaster :-<

It doesn't seem that char is in any way different from ubyte...

BR
Marcin Kuszczak

Aarti_pl wrote:
>> char and wchar are nothing special, but char[] and wchar[] are magic.
>> If those used ubyte[] and ushort[], code point looping wouldn't work.
>
> so it works now?

Sure:

import std.stdio;

void main() {
    char[] text="Łóżko";
    foreach(wchar c; text) {
        writefln(c);
    }
}

--anders
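The "magic" Anders refers to is that declaring a wider loop variable makes foreach decode the UTF-8 data. A small sketch of the same idea with dchar, assuming a present-day D2 compiler and a UTF-8 source file; `decodeAll` is a hypothetical helper name.

```d
import std.stdio : writeln;

// Collect the code points of UTF-8 text by letting foreach decode it.
dchar[] decodeAll(const(char)[] text)
{
    dchar[] result;
    foreach (dchar c; text)   // the dchar loop variable triggers decoding
        result ~= c;
    return result;
}

void main()
{
    auto text = "Łóżko";
    assert(text.length == 8);             // three two-byte letters plus 'k' and 'o'
    assert(decodeAll(text).length == 5);  // five code points

    foreach (dchar c; decodeAll(text))
        writeln(c);                       // prints one character per line
}
```

Looping with a plain char variable instead visits the individual code units, which is why the earlier program emitted invalid UTF-8 fragments.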

Anders F Björklund wrote:
> Aarti_pl wrote:
>
>>> char and wchar are nothing special, but char[] and wchar[] are magic.
>>> If those used ubyte[] and ushort[], code point looping wouldn't work.
>>
>> so it works now?
>
> Sure:
>
> import std.stdio;
>
> void main() {
>     char[] text="Łóżko";
>     foreach(wchar c; text) {
>         writefln(c);
>     }
> }
>
> --anders

True... You are right...

But anyway, I don't think that we need magic in the compiler... I would say that we need straightforward solutions which can be easily used. Making char 4 bytes in size, and having a string class which can optimize for different encodings, would allow us to get rid of all the magic... :-)

Below is just a dream:

# void main() {
#     string text="Łóżko"; // string class which uses ubyte/ushort/char for different internal representation; string could optimize texts for speed or memory consumption
#     foreach(char c; text) { // char is always unicode 4 bytes
#         writefln(c);
#     }
# }

Regards
Marcin Kuszczak
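The "dream" above can be roughly approximated in library code. The following is only an illustrative sketch, assuming a present-day D2 compiler; the `Text` struct and its members are hypothetical, not an existing Phobos type and not the dstring library mentioned in the thread. It stores UTF-8 internally and exposes only code points to the caller.

```d
import std.range : walkLength;
import std.stdio : writeln;

// Hypothetical wrapper type: UTF-8 storage, code point interface.
struct Text
{
    private string data;   // internal representation (an assumption: UTF-8)

    this(string s) { data = s; }

    // Length in code points, not bytes. O(n), since UTF-8 is variable width.
    size_t length() { return data.walkLength; }

    // Iterate one code point at a time, hiding the encoding from the caller.
    int opApply(scope int delegate(dchar) dg)
    {
        foreach (dchar c; data)
        {
            if (auto r = dg(c))
                return r;
        }
        return 0;
    }
}

void main()
{
    auto text = Text("Łóżko");
    assert(text.length == 5);
    foreach (dchar c; text)
        writeln(c);
}
```

A fuller design would also have to decide how indexing and slicing behave, which is where the choice of internal encoding starts to matter for speed and memory.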
