November 25, 2005
On Fri, 25 Nov 2005 00:04:06 -0800, John Reimer <terminal.node@gmail.com> wrote:
> I have a proposal... okay it's not about strings.  After trying to follow all these posts, I now can say I'm thoroughly confused about everything UTF-like, unicodish, pointless, valuable, or characteristically encoded in 8, 16, and 32 discrete portions.
>
> I propose a name change to this thread.
>
> String Theory by Relentless Debate
>
> Regan you're a great guy, but you sure are insatiably persistent!

Fair comment.

> Kris, is it worth it? I don't think it's getting through to him yet. :)

I sure wish I knew what "it" was :)

Regan
November 25, 2005
On Fri, 25 Nov 2005 00:47:25 -0800, kris <fu@bar.org> wrote:
> John Reimer wrote:
>> I propose a name change to this thread.
>>  String Theory by Relentless Debate
>
> :-D
>
>>  Regan you're a great guy, but you sure are insatiably persistent!
>> Kris, is it worth it? I don't think it's getting through to him yet. :)
>
> We're learning how to be nice to each other ;-)

Yeah, seems to be working.

Regan
November 25, 2005
Kris wrote:
> It seems clear that any unified string notion would be better off as a library suite; not built into the compiler. It's difficult enough to evolve the code within Phobos, let alone something hard-coded into the compiler.

D has associative arrays as a hard-coded feature too.

> b) a String class to support Unicode is hardly a trivial undertaking. You really have to consider very hard what the goals are before putting something in stone (as in getting it added to Phobos). I say that from experience with the ICU project ~ there's code in there to handle the kinds of things that would frighten many people. Unicode ain't trivial and, frankly, I think AJ would have a hard time coming up with a "suitable" set of compromises. The latter is important: there will be many compromises one way or another.

I don't see it as a major compromise if one wants to have an abstract UTF-representation independent string type. If we could create a basic string type that does all its major operations in O(1) or O(n) time, these 'advanced' operations would be fast enough (even if they're not, you can always handle the string as a raw stream of bytes).

Even the current implementation is a compromise. The language doesn't want to take care of any Unicode operations; all the 'hard' work (including char[] symbol-based indexing) is left to the programmer.
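
As a rough illustration of what symbol-based indexing on UTF-8 data involves, here is a minimal sketch. It is written in Python rather than D, purely as a stand-in: bytes plays the role of char[] (UTF-8 code units) and str plays the role of a dchar[] (one element per code point). The helper name code_point_at is hypothetical and is not part of Phobos or any other library.

```python
# Illustration only: Python standing in for D.
# "M\u00e4kel\u00e4" is "Mäkelä": 6 code points, two of them non-ASCII.
text = "M\u00e4kel\u00e4"
utf8 = text.encode("utf-8")   # each non-ASCII letter takes two bytes here


def code_point_at(data: bytes, index: int) -> str:
    """Return the index-th code point of valid UTF-8 data.

    The scan starts at the beginning every time, so this is O(n),
    unlike plain byte indexing, which is O(1).
    """
    count = -1
    for pos, byte in enumerate(data):
        # Bytes of the form 10xxxxxx continue a sequence; anything else starts one.
        if byte & 0xC0 != 0x80:
            count += 1
            if count == index:
                end = pos + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1
                return data[pos:end].decode("utf-8")
    raise IndexError(index)


print(len(text), len(utf8))    # code points versus code units
print(utf8[1:2])               # byte index 1 is only half of a character
print(code_point_at(utf8, 1))  # the whole second character
```

The point of the sketch is only that indexing by code unit and indexing by symbol give different answers as soon as the text leaves ASCII, and that the symbol-based version has to walk the data.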
November 25, 2005
John Reimer wrote:
> I have a proposal... okay it's not about strings.  After trying to follow all these posts, I now can say I'm thoroughly confused about everything UTF-like, unicodish, pointless, valuable, or characteristically encoded in 8, 16, and 32 discrete portions.

Yes, and I bet a bunch of other folks who don't write are even more confused. And that gets us conveniently to the exact point of all this: we who carry on this debate, do it precisely so that future D users could gain a few things:

  1. Not get drowned in the current utf maze of glass walls and mirrors.

  2. Real Soon Now, be able to do their coding without being forced to know a single thing about utf.

  3. Not have downright disinformation stuffed down their throats by D documentation, specs, or the existing choice of data types in D.

  4. Get rid of all the gotchas (especially the unobvious) hidden in the current framework of what appears to be character types and handling.

At least I personally expect this whole "utf" issue to be over and done with, in a couple of weeks. (Heh, knowing SW projects, that probably means before year's end.) Once this is fixed, we have a _much_ smoother API, both factually, but especially in concept. And then -- we'll not hear a word about utf during the entire next year. 8-|

And you know what: I actually think we don't have to do very much coding to get that done. Most of the issues here are with the language spec, removing and renaming existing datatypes. The major part of needed code actually exists already within Phobos, so this is technically trivial (the others in this ng. may not agree, but I think so).

> I propose a name change to this thread.
> 
> String Theory by Relentless Debate

Nice quip! :-)

And hey, I guess most of the already bewildered skipped this thread ages ago. Not only is it precisely the wrong thing to read if one needs to learn about International Character Set Issues, this thread is downright counterproductive for that.

> Regan you're a great guy, but you sure are insatiably persistent!

While I may be a thinker, visionary and a loudmouth, at least Regan is one who gets things done! And he's persistent, I agree!

> Kris, is it worth it? I don't think it's getting through to him yet.

Well, IMHO, Kris and Regan have been talking about apples and oranges, without either noticing.

Regan is talking about this utf thing in terms of what we here have been discussing, while Kris means the entire ICU issue. At least I believe this is so, and that they've not necessarily believed the other one understands the ("same") issue. (Kris, Regan, correct me if I'm wrong here.)

And John, I assume you think, taking on the whole ICU issue (I'm using a wrong term here, I know, but you know what I mean) is a little too big job for us, right? Which I wholeheartedly agree with.

More specifically, the ICU thing is not something I believe D should even tackle. For the next couple of years, I think *those* application programmers who care about such, should use a library (like ICU, or whatever). What D provides will be adequate UTF handling -- as far as slicin' n' streamin' are concerned, nothing fancier. After a couple of years, we can always check the issue again. Maybe by that time a bunch of now broken issues have been settled, maybe there actually is some need for such functionality, maybe by that time they have stopped fighting with the available compilers (bugs), maybe... Then we can check it out.

Upon our solid (but not all encompassing) basement, everybody can build a Galaxy Wide Character set. But we do need the basement first, and it has to be solid.

November 25, 2005
kris wrote:
> Georg Wrede wrote:
> 
>> kris wrote:
>> 
>>> Regan Heath wrote:
>>> 
>>> Designing it with respect to performance and immutability are
>>> also not so tough (though D badly needs read-only arrays).
>> 
>> (OT) never thought about that! Please elaborate.
> 
> On Read-Only arrays? Sure.
> 
> One can easily design a class such that it cannot be mutated when
> passed from one function to another. However, when it comes to
> arrays, access to content by the callee is wide open to abuse. That
> is, if funcA wants to give funcB read-only access to a large quantity
> of data, one should clone the thing /just in case/ funcB mutates it.
> This then pervades throughout structs and classes without respect to
> attribute visibility.
> 
> The D notion is that CoW will somehow be adhered to by the callee
> ~ it will be a "good" function, and clone the array before touching
> it. Yet this is not enforced by the compiler, to any degree. Thus the
> caller ends up doing the work, just to be sure.
> 
> This, I'm sure you'll agree, is a bit daft. It's also a significant
> performance problem for server-code, or anywhere where immutability
> is a high priority. Anyone who regularly uses multiple threads will
> attest that enforced immutability is a welcoming lifeboat within a
> cold sea of unrest and uncertainty.

Ah, right. Interesting that there's no convenient hardware support for such. Well, can't have everything, can we. :-)

Should we take this up, like after the holiday season? (Not that I'm expecting a sudden panacea invented, but who knows, maybe we could make some small steps.)

>> *******************************************
>> 
>> I seriously suggest, or actually ask all here:
>> 
>> Looking at the riot we had before _any_ understanding of utf or unicode things percolated, we just HAVE TO decide that BEFORE D 2.0
>> we _will_not_ touch any of the above issues!!
> 
> Talk about making things compatible with certain libraries has to
> take some general requirements into consideration. That requires
> research.

Agreed.

Right now, we've come a long way since what folks understood a month ago. We even have a kind of consensus that the current "utf" state of affairs is, ehh, not perhaps very good.

So, I feel, right now would be a bad spot to drop everything and go out among the dragons and lizards, looking for Wisdom. Instead, I feel it is absolutely vital that we fix the few issues we're at -- and get that over and done with, before we poison the minds of the next thousand D newcomers.

Compared to that, I must confess, compatibility with ICU-like things is not a great priority. Next summer, or sometime, but not right now. I'd love it if you'd agree with me on this?
November 25, 2005
kris wrote:
> Georg Wrede wrote:
>> Regan Heath wrote:
>> 
>>> This doesn't prove anything but it suggests that using a dchar
>>> sized variable for characters will have little or no real effect
>>> on performance.. maybe, a conclusive test should really be made.
>> 
>> Well, the neat thing here is that since i/o is inherently very
>> slow, at that particular point one can afford to do just about
>> anything -- for free, so to say!
> 
> Forgive me, Georg; but that sounds like codswallop. 

No panic.

> You're making an assumption there's just one task taking place, which
> may be partly true for your machine at home, but it ain't true for
> real-time systems or servers of any variety.
>
> Asynchronous I/O exists for a reason ~ so that one can do as much as
> possible /whilst/ waiting. Alternatively, one uses multiple threads
> to keep the CPU occupied whilst others are I/O-bound. For example, I
> seriously doubt this NG server sits down and twiddles its thumbs
> whilst waiting for socket transfers <g>

I'm thinking of the average speed of i/o. Every once in a while a swoosh of data comes into the in-buffer, then it takes some time before we get the next swoosh, even if we'd "use" the data in zero time.

If the time before the next buffer fill is used to decode-and-read, then we get the decode "for free". (See below, before answering. :-) )

> Even in this day and age there's little excuse for slothfulness
> (though it appears less egregious at the high level). Besides; the
> wprintf thing is a total red-herring, since the goal there is
> convenience; it's pretty obvious performance was not a priority.

Slothfulness... I'll tell Bill you're picking on me!

Of course, on a multiuser system it is part of table manners not to waste clock cycles, there's no disputing that. So you are right.

Even more (as I think you also mean), there's no place anywhere where code can be slothful without eating into the total of clock cycles available, so other processes of course get less. Be it doing "for free" i/o conversion, or whatever else.

But I'm trying to maintain a balance here. Right now we are pressured for time (or actually Walter is -- I can't imagine that his wife hasn't left him already, especially considering how quickly .140 came out with all those things), and we should get this utf thing out of the way, so other things can be tackled.

I'd say we can right now implement stuff less-than well -- _as_long_ as the setup is drawn right. In other words, so that later we (without changing the API) can rewrite and optimize the individual routines.

Kind of "Why start rocking the boat, when we've just got off the underwater rock, with such effort, too." (...whatever "underwater rock" is in proper English...)

PS, I've already suggested doing such tests. ;-)

So, fixing a print function that does superfluous intermediate conversions, should, IMHO, not be on our agenda at all before spring. DMD 1.0 or not.
November 25, 2005
Jari-Matti Mäkelä wrote:
> If we could create a basic string type that does all its major
> operations in O(1) or O(n) time, these 'advanced' operations would be
> fast enough (even if they're not, you can always handle the string as
> a raw stream of bytes)

????

> Even the current implementation is a compromise. The language doesn't
> want to take care of any Unicode operations; all the 'hard' work
> (including char[] symbol-based indexing) is left to the programmer.

That change is just around the corner.
November 25, 2005
Georg Wrede wrote:
> kris wrote:
> 
>> Georg Wrede wrote:
>>
>>> kris wrote:
>>>
>>>> Regan Heath wrote:
>>>>
>>>> Designing it with respect to performance and immutability are
>>>> also not so tough (though D badly needs read-only arrays).
>>>
>>>
>>> (OT) never thought about that! Please elaborate.
>>
>>
>> On Read-Only arrays? Sure.
>>
>> One can easily design a class such that it cannot be mutated when
>> passed from one function to another. However, when it comes to
>> arrays, access to content by the callee is wide open to abuse. That
>> is, if funcA wants to give funcB read-only access to a large quantity
>> of data, one should clone the thing /just in case/ funcB mutates it.
>> This then pervades throughout structs and classes without respect to
>> attribute visibility.
>>
>> The D notion is that CoW will somehow be adhered to by the callee
>> ~ it will be a "good" function, and clone the array before touching
>> it. Yet this is not enforced by the compiler, to any degree. Thus the
>> caller ends up doing the work, just to be sure.
>>
>> This, I'm sure you'll agree, is a bit daft. It's also a significant
>> performance problem for server-code, or anywhere where immutability
>> is a high priority. Anyone who regularly uses multiple threads will
>> attest that enforced immutability is a welcoming lifeboat within a
>> cold sea of unrest and uncertainty.
> 
> 
> Ah, right. Interesting that there's no convenient hardware support for such. Well, can't have everything, can we. :-)

Hardware support is not needed for such things. Instead the language needs a means to decorate a return-type as being read only (or something akin), and enforce subsequent usage as an rValue only. At compile-time. Support is already there for arrays-as-arguments (the 'in' modifier), though I wonder if that is robust enough? I mean, it's the caller who's concerned about the immutability; not the callee (whose sig could easily change).

Yes, there's another case whereby it's the callee who's concerned about the caller changing the content on the fly. But that one is purely the responsibility of the caller, and can thus be managed. The fundamental issue is ensuring an unknown callee can be trusted with the family jewels.
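
To make the clone-versus-trust trade-off concrete, here is a minimal sketch. It uses Python as a stand-in for D, and Python can only enforce read-only access at run time, whereas the request above is for a compile-time check. The function name try_to_mutate and the sample data are made up for illustration; memoryview.toreadonly() is a real standard-library call (Python 3.8 and later).

```python
def try_to_mutate(buf) -> bool:
    """Play the part of an unknown callee that tries to write to its argument."""
    try:
        buf[0] = ord("F")
        return True
    except TypeError:
        # A read-only view refuses the write.
        return False


original = bytearray(b"family jewels")

# Option 1: the defensive clone described above. Safe, but O(n) per call.
clone = bytearray(original)

# Option 2: hand out a read-only view of the same memory. No copy is made.
read_only = memoryview(original).toreadonly()

print(try_to_mutate(read_only))   # the callee cannot write through the view
print(bytes(original))            # so the caller's data is untouched
print(try_to_mutate(clone))       # the clone can be scribbled on freely
```

The read-only view gives the caller the guarantee without the copy, which is the property being asked for here; a compile-time version would reject the write before the program ever ran.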
November 25, 2005
Georg Wrede wrote:

> 
> Slothfulness... I'll tell Bill you're picking on me!

Sorry. That wasn't intended to be a personal attribution <g>

> But I'm trying to maintain a balance here. Right now we are pressured for time (or actually Walter is -- I can't imagine that his wife hasn't left him already, especially considering how quickly .140 came out with all those things), and we should get this utf thing out of the way, so other things can be tackled.

Which UTF thing? There are so many threads going on it's hard to keep track.

There's the one that says "default all argument string-literals to char[]". That would increase consistency on a number of fronts, so that would be great. Bring it on!

There's the one that says "add some array properties as a convenience for transcoding", such as adding .utf8 .utf16 and .utf32 properties as appropriate. That would be nice!

There's a call for a "unified" string, which is a String class by any other name. Yet there's precious little evidence of a well-considered class at this time. I hope you're not referring to the latter?
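
For readers unsure what the transcoding properties in the second item would amount to, here is a minimal sketch in Python (a stand-in, not D). It shows the same text in the three encoding forms and how many code units each one needs. The function name code_units and the sample string are hypothetical; the codec names "utf-8", "utf-16-le" and "utf-32-le" are standard Python codecs.

```python
def code_units(text: str) -> dict:
    """Count the code units needed for text in each UTF encoding form."""
    return {
        "utf8": len(text.encode("utf-8")),             # 1 byte per unit
        "utf16": len(text.encode("utf-16-le")) // 2,   # 2 bytes per unit
        "utf32": len(text.encode("utf-32-le")) // 4,   # 4 bytes per unit
    }


# Four code points: U+0061, U+00E4, U+20AC and U+1D11E (outside the BMP).
sample = "a\u00e4\u20ac\U0001D11E"

print(code_units(sample))

# Transcoding is lossless: going to UTF-16 and back returns the same text.
round_trip = sample.encode("utf-16-le").decode("utf-16-le")
print(round_trip == sample)
```

Only the UTF-32 count matches the number of code points; the other two vary with the content, which is why the array length alone says little about the number of characters.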


> I'd say we can right now implement stuff less-than well -- _as_long_ as the setup is drawn right. In other words, so that later we (without changing the API) can rewrite and optimize the individual routines.

I'm really missing something here. You're talking about an API for what? It must be a String class, yes? Getting it "right" so that it doesn't change, is not something that can be done on a whim. I know you know that, so what's the huge rush all of a sudden? Don't you think it would be better to build something and let it mature with use for a period of time?

Why not pick up one of the String classes that's been around for a year or more? There are at least three of them that old. As you might guess, this is not a new topic at all :-)


> So, fixing a print function that does superfluous intermediate conversions, should, IMHO, not be on our agenda at all before spring. DMD 1.0 or not.

I believe you're badly misconstruing something here, Georg. Who ever said anything about fixing writef? I certainly countered an argument that was effectively stating "if writef can do it like that, then that's probably good enough for everything else". Merely pointing out that printf exists for convenience, not performance, should hardly be interpreted in this manner :-D

Can you please tell me about this Spring date? Is something important happening then?
November 25, 2005
Georg Wrede wrote:
> John Reimer wrote:
> 
>> I have a proposal... okay it's not about strings.  After trying to follow all these posts, I now can say I'm thoroughly confused about everything UTF-like, unicodish, pointless, valuable, or characteristically encoded in 8, 16, and 32 discrete portions.
> 
> 
> Yes, and I bet a bunch of other folks who don't write are even more confused. And that gets us conveniently to the exact point of all this: we who carry on this debate, do it precisely so that future D users could gain a few things:
> 
>   1. Not get drowned in the current utf maze of glass walls and mirrors.
> 
>   2. Real Soon Now, be able to do their coding without being forced to know a single thing about utf.
> 
>   3. Not have downright disinformation stuffed down their throats by D documentation, specs, or the existing choice of data types in D.
> 
>   4. Get rid of all the gotchas (especially the unobvious) hidden in the current framework of what appears to be character types and handling.


I don't doubt in the least the importance of this debate.  Despite being unable to wholly understand even half the material presented, I respect the necessity of the wrangling... although in Kris' and Regan's case, I don't think it was getting anywhere.


> At least I personally expect this whole "utf" issue to be over and done with, in a couple of weeks. (Heh, knowing SW projects, that probably means before year's end.) Once this is fixed, we have a _much_ smoother API, both factually, but especially in concept. And then -- we'll not hear a word about utf during the entire next year. 8-|


Really? Such optimism! :)  Who says the smoother API will be agreed upon or even adopted?  I hope it is, whatever that API is or wherever that API currently exists.  If that API resides in ICU, then there's a stopgap solution until people make up their minds (specifically people like Walter).  If otherwise, then a solution will be long in coming, I think, and will be debated until the end of our days. Although, I'm still curious to know why people think they can change D without Walter's input in the matter.  I expect you people are planning on making a submission to Phobos or something and hoping Walter will agree?


> And you know what: I actually think we don't have to do very much coding to get that done. Most of the issues here are with the language spec, removing and renaming existing datatypes. The major part of needed code actually exists already within Phobos, so this is technically trivial (the others in this ng. may not agree, but I think so).

Could be.  I don't know much to agree or not... but I'm always hopeful. :D

>> I propose a name change to this thread.
>>
>> String Theory by Relentless Debate
> 
> 
> Nice quip! :-)
> 
> And hey, I guess most of the already bewildered skipped this thread ages ago. Not only is it precisely the wrong thing to read if one needs to learn about International Character Set Issues, this thread is downright counterproductive for that.

Ho.. yeah.  I don't think it helped me.  I tried reading some of this but it certainly just confused the issue for me.  But that's okay. The purpose of the thread wasn't for education of UTF novices... I can accept that. :)


>> Regan you're a great guy, but you sure are insatiably persistent!
> 
> 
> While I may be a thinker, visionary and a loudmouth, at least Regan is one who gets things done! And he's persistent, I agree!


What can I say? This board wouldn't be the same without gents like you two.  This topic is an obvious necessity. Discourse is important.


>> Kris, is it worth it? I don't think it's getting through to him yet.
> 
> 
> Well, IMHO, Kris and Regan have been talking about apples and oranges, without either noticing.
> 
> Regan is talking about this utf thing in terms of what we here have been discussing, while Kris means the entire ICU issue. At least I believe this is so, and that they've not necessarily believed the other one understands the ("same") issue. (Kris, Regan, correct me if I'm wrong here.)


I'm not so sure they're on different wavelengths.  I think they merely see a different solution to the same problem.  One sees a broad solution that is feasible in the current D universe.  The other sees a narrower one that requires an intimate influence over the development of D. I'll let you decide which is which. Yet the debate is only productive insofar as the language can be influenced; so, of necessity, one will be more successful than the other, no matter which solution is best.

Nonetheless, it is good to discuss alternative solutions, on the off chance that language change may happen.  But it appears that any sort of change will create a tremendously large commitment to solving an extremely complicated problem.  Have we got the language designer behind us on this? Because if we don't, it'll be the toughest ride of our lives.


> 
> And John, I assume you think, taking on the whole ICU issue (I'm using a wrong term here, I know, but you know what I mean) is a little too big job for us, right? Which I wholeheartedly agree with.


Well, it is.  But, I certainly respect the necessity of the discussion.  I get a little listless, though, wondering whether it's going anywhere.


> More specifically, the ICU thing is not something I believe D should even tackle. For the next couple of years, I think *those* application programmers who care about such, should use a library (like ICU, or whatever). What D provides will be adequate UTF handling -- as far as slicin' n' streamin' are concerned, nothing fancier. After a couple of years, we can always check the issue again. Maybe by that time a bunch of now broken issues have been settled, maybe there actually is some need for such functionality, maybe by that time they have stopped fighting with the available compilers (bugs), maybe... Then we can check it out.


That's a fair assessment of the situation.  That makes much more sense.


> Upon our solid (but not all encompassing) basement, everybody can build a Galaxy Wide Character set. But we do need the basement first, and it has to be solid.
> 

Yes.  I can appreciate that.

Thanks for your remarks, Georg.  It's always a pleasure.

-John