November 25, 2005
Regan Heath wrote:
> 
> This doesn't prove anything but it suggests that using a dchar sized
>  variable for characters will have little or no real effect on performance... maybe a conclusive test should really be made.

Well, the neat thing here is that since i/o is inherently very slow, at that particular point one can afford to do just about anything -- for free, so to say!

I/o goes to/from the display, to a file, to the net, to the printer. They're all so slow that I'd say one can do the transformations ten times over(!), and nobody could see the difference.

Probably, Walter's so familiar with this idea that he hardly noticed. So he instinctively was liberal with clock cycles in the right place. (No use optimising to death where it doesn't count. Pun intended.)

What I left out from the i/o list above was pipes. But on a Windows machine I guess that's slow anyhow. So we're left with Unix command line chaining, which I guess is about the only place where one would see a difference. (And even there the data ultimately comes from the disk (or the others) and goes somewhere. At least in real life.)

---

This actually gives an idea: to compare the efficiency of the different UTF widths in some specific job, it might be a good idea to first have the input data collected in memory, then time whatever operations one wants to test, and then "stop the clock" before either the output or the discard of the resultant data. (Oh yes, and large datasets should absolutely be tested on a quiet machine, or they will get swapped out in between. So a single-user mode Unix might be pretty close to what's needed.)
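A minimal sketch of that measurement, in Python rather than D and purely for illustration. The sample text, the `time_decode` helper and the choice of "decode the buffer" as the operation under test are all assumptions made up for this example; any other in-memory operation could be substituted.

```python
import time

def time_decode(encoded: bytes, codec: str, repeats: int = 5) -> float:
    """Best wall-clock time in seconds to decode a buffer already in memory."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()             # clock starts: data is already in memory
        text = encoded.decode(codec)            # the operation being compared
        elapsed = time.perf_counter() - start   # clock stops before any output or discard
        best = min(best, elapsed)
        del text                                # discarding happens outside the timed region
    return best

# Hypothetical sample input, built once and never read from disk.
sample = "Gr\u00fc\u00dfe, \u4e16\u754c! " * 10_000

timings = {}
for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    buffer = sample.encode(codec)               # preparing the input is not timed
    timings[codec] = time_decode(buffer, codec)

# Reporting happens only after all measurements are finished.
for codec, seconds in timings.items():
    print(f"{codec}: {seconds * 1e6:.1f} microseconds")
```

Taking the best of several repeats is one common way to reduce noise from a machine that is not perfectly quiet; it is a design choice here, not part of the suggestion above.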

Similarly, when one talks about the real-life efficiency of utf-this or utf-that, it is _imperative_ to include the i/o (as from+to disk or whatever) in the comparisons.

November 25, 2005
On Thu, 24 Nov 2005 23:12:45 -0800, kris <fu@bar.org> wrote:
> Regan Heath wrote:
>> On Thu, 24 Nov 2005 22:19:33 -0800, kris <fu@bar.org> wrote:
>>  <snip good advice>
>>
>>> I strongly suspect, based on experience, that you'd end up with a class-based interface anyway. And why not? What on earth is wrong with classes? Especially when they're native to the language?
>> To answer that question you have to ask "what is the difference between a class and the built in array types?".
>>
>> Regan
>
> You don't know? :-)
>
> If I get your drift, the question should perhaps be thus: at what point of complexity does it become generally acceptable to leave native types behind?

Yes, or rather to what degree should the built in types go in order to support feature X. X in this case being string handling and/or unicode string handling.

What I'd like is for the built in types to go as far as providing support for indexing characters in strings regardless of the encoding(*). I think that is the right degree because, once they do, anyone can write a function in D which will correctly handle any string in any encoding(*) without having to think about UTF code fragments and the problems associated with them.

(*) The 3 UTF encodings are all it needs to support. Other encodings should be handled by libraries, e.g. ICU.
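To make "indexing characters regardless of the encoding" concrete, here is a rough sketch in Python rather than D. It is not how any D built-in works; the function names are invented for this example, and the input is assumed to be well-formed. The point is only that the same index yields the same character whichever of the three encodings holds the data.

```python
from typing import List, Sequence

def utf8_len(lead: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with lead byte `lead`."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")

def utf8_char_at(units: bytes, index: int) -> str:
    """Return character number `index`, skipping whole sequences, not bytes."""
    if index < 0:
        raise IndexError("negative index")
    pos = 0
    for _ in range(index):
        pos += utf8_len(units[pos])
    return units[pos:pos + utf8_len(units[pos])].decode("utf-8")

def utf16_len(unit: int) -> int:
    """A high surrogate starts a two-unit pair; anything else stands alone."""
    return 2 if 0xD800 <= unit <= 0xDBFF else 1

def utf16_char_at(units: Sequence[int], index: int) -> str:
    """Return character number `index` from a sequence of 16-bit code units."""
    if index < 0:
        raise IndexError("negative index")
    pos = 0
    for _ in range(index):
        pos += utf16_len(units[pos])
    first = units[pos]
    if utf16_len(first) == 1:
        return chr(first)
    low = units[pos + 1]
    return chr(0x10000 + ((first - 0xD800) << 10) + (low - 0xDC00))

def utf32_char_at(units: Sequence[int], index: int) -> str:
    """One code unit per character, so plain indexing is enough."""
    return chr(units[index])

def utf16_units(text: str) -> List[int]:
    """Helper for the demo: split a string into 16-bit code units."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

# Characters that need 1, 2, 3 and 4 bytes in UTF-8; the fourth needs a surrogate pair in UTF-16.
text = "a\u00e9\u20ac\U0001d11e!"
u8 = text.encode("utf-8")
u16 = utf16_units(text)
u32 = [ord(c) for c in text]

fourth = (utf8_char_at(u8, 3), utf16_char_at(u16, 3), utf32_char_at(u32, 3))
```

Note that the UTF-8 and UTF-16 lookups walk from the start of the data, so indexing costs time proportional to the index; only UTF-32 is constant time.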

> The key to powerful, easy-to-use, practical, and extensible Unicode handling is, IMO, far away on the other side of that divide. I suspect/hope you'd ultimately agree.

A complete solution is certainly, as you say, something a library should handle. But I don't want a complete solution, just a small step really.

Regan
November 25, 2005
kris wrote:
> Regan Heath wrote:
> 
> Designing it with respect to performance and immutability are also
> not so tough (though D badly needs read-only arrays).

(OT) never thought about that! Please elaborate.

> What's really hard is getting the initial set of compromises worked
> out, as I keep repeating. Then comes the hard work of dealing with
> the edge-conditions, special cases, unexpected gotchas and, in some
> cases, just plain old grey-matter and hard work.

I take it you refer here to character classification, collation and other cans of (Unicode-related) _real_ boas and anacondas? I agree.

> You mentioned before that this built-in notion would somehow interface with ICU? Well, that would be a consideration. But first you need to review how ICU, and other packages like it, operate before assuming some binding to a native type (other than a class) could make it an attractive marriage.

*******************************************

I seriously suggest, or actually ask all here:

Looking at the riot we had before _any_ understanding of utf or unicode things percolated, we just HAVE TO decide that BEFORE D 2.0 we _will_not_ touch any of the above issues!!

Promise, everybody?

Let's only do the character widths, i/o, and polishing of the utf API (and the language spec) -- and do that well.

After D 1.0 we'll have all the time in the world to do the rest of the world.
November 25, 2005
Georg Wrede wrote:
> Regan Heath wrote:
> 
>>
>> This doesn't prove anything but it suggests that using a dchar sized
>>  variable for characters will have little or no real effect on performance... maybe a conclusive test should really be made.
> 
> 
> Well, the neat thing here is that since i/o is inherently very slow, at that particular point one can afford to do just about anything -- for free, so to say!

Forgive me, Georg; but that sounds like codswallop. You're making an assumption there's just one task taking place, which may be partly true for your machine at home, but it ain't true for real-time systems or servers of any variety.

Asynchronous I/O exists for a reason ~ so that one can do as much as possible /whilst/ waiting. Alternatively, one uses multiple threads to keep the CPU occupied whilst others are I/O-bound. For example, I seriously doubt this NG server sits down and twiddles its thumbs whilst waiting for socket transfers <g>
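A minimal sketch of the threaded variant, in Python rather than D and only as an illustration. `slow_read` is a made-up stand-in for a blocking socket or disk read (it just sleeps), and counting code points is an arbitrary example of CPU work.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_read() -> bytes:
    """Stand-in for a blocking read; the sleep simulates waiting on the device."""
    time.sleep(0.05)
    return "r\u00e9sum\u00e9".encode("utf-8")

def count_code_points(data: bytes) -> int:
    """Example CPU work: decode UTF-8 and count the characters."""
    return len(data.decode("utf-8"))

# Data from an earlier read, ready to be processed.
previous_chunk = "na\u00efve ".encode("utf-8") * 1000

with ThreadPoolExecutor(max_workers=1) as pool:
    next_chunk = pool.submit(slow_read)             # the I/O wait begins in a worker thread
    processed = count_code_points(previous_chunk)   # the CPU is kept busy in the meantime
    received = next_chunk.result()                  # block only once the data is needed
```

Any cycles spent on conversion in the main thread are cycles not available to other work sharing the machine, which is why the wait is not simply free time.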

Even in this day and age there's little excuse for slothfulness (though it appears less egregious at the high level). Besides; the wprintf thing is a total red herring, since the goal there is convenience; it's pretty obvious performance was not a priority.
November 25, 2005
I have a proposal... okay it's not about strings.  After trying to follow all these posts, I now can say I'm thoroughly confused about everything UTF-like, unicodish, pointless, valuable, or characteristically encoded in 8, 16, and 32 discrete portions.

I propose a name change to this thread.

String Theory by Relentless Debate

Regan you're a great guy, but you sure are insatiably persistent!
Kris, is it worth it? I don't think it's getting through to him yet. :)

Cheers to both of you!

-JJR
November 25, 2005
kris wrote:

...

I asked you about where you work, etc. and never answered your post.

I consider that bad manners from myself, especially when such might even be considered a bit personal to ask on a public ng. Sorry!

I saw the post, and decided to look at the two links to your own projects, before I'd comment. (As you've probably seen) I've spent quite some time writing about this utf thing, so now I've lost your post. Shame on me!

---

But I do remember that you worked at PARC!

I think that's about as cool as working next door to Linus Torvalds or (for the other half of mankind) next door to Bill Gates.

That's one place where I'll do a Tourist Pilgrimage one day!
November 25, 2005
kris wrote:

> Since this thread is called "String theory by example", I'll
> encourage those interested to take a critical look at the ICU project
> here:

Aaaaarrrghhh, nooooo!

I've been to that place, now THAT was scary.

Let's avoid that all till DMD v 1.0!
November 25, 2005
Georg Wrede wrote:
> kris wrote:
> 
>> Regan Heath wrote:
>>
>> Designing it with respect to performance and immutability are also
>> not so tough (though D badly needs read-only arrays).
> 
> 
> (OT) never thought about that! Please elaborate.

On Read-Only arrays? Sure.

One can easily design a class such that it cannot be mutated when passed from one function to another. However, when it comes to arrays, access to content by the callee is wide open to abuse. That is, if funcA wants to give funcB read-only access to a large quantity of data, one should clone the thing /just in case/ funcB mutates it. This then pervades throughout structs and classes without respect to attribute visibility.

The D notion is that CoW will somehow be adhered to by the callee ~ it will be a "good" function, and clone the array before touching it. Yet this is not enforced by the compiler, to any degree. Thus the caller ends up doing the work, just to be sure.

This, I'm sure you'll agree, is a bit daft. It's also a significant performance problem for server-code, or anywhere where immutability is a high priority. Anyone who regularly uses multiple threads will attest that enforced immutability is a welcoming lifeboat within a cold sea of unrest and uncertainty.
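The difference between the two approaches can be sketched as follows, in Python rather than D and purely as an illustration. `func_b` is a hypothetical badly behaved callee. Python's read-only `memoryview` (via `toreadonly()`, available from Python 3.8) rejects the write at run time, which is weaker than the compile-time enforcement being asked for here, but it shows the caller handing over data without cloning it.

```python
def func_b(data) -> str:
    """A badly behaved callee: it tries to write to data it was only meant to read."""
    try:
        data[0] = 0
    except TypeError:
        return "rejected"
    return "mutated"

buffer = bytearray(b"large shared data")

# Approach 1: defensive clone. Safe, but the caller pays for a full copy every call.
outcome_copy = func_b(bytearray(buffer))

# Approach 2: read-only view. No copy is made, and the write attempt fails.
readonly_view = memoryview(buffer).toreadonly()
outcome_view = func_b(readonly_view)

print(outcome_copy, outcome_view, bytes(buffer))
```

In both cases the caller's buffer is left intact; only the first one costs a copy proportional to the size of the data.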


> *******************************************
> 
> I seriously suggest, or actually ask all here:
> 
> Looking at the riot we had before _any_ understanding of utf or unicode things percolated, we just HAVE TO decide that BEFORE D 2.0 we _will_not_ touch any of the above issues!!

Talk about making things compatible with certain libraries has to take some general requirements into consideration. That requires research.
November 25, 2005
Georg Wrede wrote:
> I asked you about where you work, etc. and never answered your post.
> 
> I consider that bad manners from myself, especially when such might even be considered a bit personal to ask on a public ng. Sorry!
> 
> I saw the post, and decided to look at the two links to your own projects, before I'd comment. (As you've probably seen) I've spent quite some time writing about this utf thing, so now I've lost your post. Shame on me!

Yes ~ terribly poor form :-p

No problem. Appreciate the thought.
November 25, 2005
John Reimer wrote:
> I propose a name change to this thread.
> 
> String Theory by Relentless Debate

:-D

> 
> Regan you're a great guy, but you sure are insatiably persistent!
> Kris, is it worth it? I don't think it's getting through to him yet. :)

We're learning how to be nice to each other ;-)


> Cheers to both of you!

To you too ~ you don't frequent here as much as you once did. That's a shame.