October 29, 2004
"Walter" <newshound@digitalmars.com> wrote in message news:clsfce$1dlm$1@digitaldaemon.com...

> True, Win32 processes strings in UTF-16 and Linux in UTF-8. But I'll argue
> that the string conversion costs are insignificant, because very rarely does
> one write code that crosses from the app to the OS in a tight loop. In fact,
> one actively tries to avoid doing that because crossing the process boundary
> layer is expensive anyway.
>
> If profiling indicates that the conversion cost is significant, then use an
> alias, sure. But I'll wager that's very unlikely.

Wait a minute. Aren't these pretty close to the same arguments I made for why the difference between the performance of a consistent default string class and a byte array wouldn't generally matter? "X is usually insignificant, and if it is ever significant use a profiler and do something non-default, but in general keep it simple...."

What is the performance difference between the two approaches? In one, you send 1000 wchar[] strings into a filter library function that wants char[] strings: it converts them all to char[] on the way in, finds the ones that qualify, and converts those back to wchar[] on the way out. In the other, you send a thousand default string objects, by reference of course, into a library written for default string objects, and it returns the qualifiers as default string objects. (I'm actually asking. It's not rhetorical.)
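
In code, the first approach would look roughly like this sketch (the filter and its qualifies() predicate are made up; toUTF8 and toUTF16 are the actual Phobos conversions from std.utf):

    import std.utf;   // toUTF8 / toUTF16

    // Hypothetical predicate, standing in for whatever the filter tests.
    bool qualifies(char[] s)
    {
        return s.length > 0;
    }

    // A char[] library used from wchar[] code: two conversions per string.
    wchar[][] filterStrings(wchar[][] input)
    {
        wchar[][] result;
        foreach (wchar[] s; input)
        {
            char[] utf8 = toUTF8(s);       // convert on the way in
            if (qualifies(utf8))
                result ~= toUTF16(utf8);   // convert back on the way out
        }
        return result;
    }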

Having a consistent, default string that's used (almost) everywhere and never suffers any conversion costs inside the app could greatly reduce complexity, with no need for aliases anywhere, and it might not even carry a performance penalty compared with code that repeatedly converts back and forth within the app. And if it ever did perform more slowly, you could use your profiler, as you suggest, and tweak the hot spot with a byte array.

October 29, 2004
James McComb wrote:

> But you need to use aliases for the following scenario:
> 
> Suppose that:
>   1. I want to write code for both Windows and Unix.
>   2. I don't want to pay any string conversion costs at all.
> 
> I assume the way to do this in D is:
>   1. Use wchar[] on Windows and make UTF-16 API calls.
>   2. Use char[] on Linux and make UTF-8 API calls.
>   3. Use an alias to toggle between wchar[] and char[].
>   4. Use a string library that defines all functions in both wchar[] and char[] versions.
> 
> If I just used char[], I would be forced to pay string conversion costs, as Windows ultimately processes all strings in UTF-16.

Couldn't a new "tchar" alias be introduced for OS / platform strings ?

(mapping to either char or wchar)

Similar to how the pointer-sized aliases work on both 32- and 64-bit platforms ?

(that is: size_t and ptrdiff_t)
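
Something like this minimal sketch, perhaps ?
("tchar" being a hypothetical name, of course)

    version (Windows)
        alias wchar tchar;  // UTF-16, matches the Win32 "W" entry points
    else
        alias char  tchar;  // UTF-8, matches the usual Unix convention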


It would be similar to using the macro (TCHAR *) in Windows C or C++.
(with _tcs macro versions of all the functions like: strlen and wcslen)

With overloading and templates in D it is easier to maintain, though...
(compared to the preprocessor tricks one has to resort to, back in C)


Or just use the standard type "char[]" and cast(), like Walter said ?
(which seems to be a little biased towards ASCII or UNIX, but anyway)

But if such a platform-character alias is indeed added eventually, using
the same name (tchar) that Windows / Linux use would be good...

--anders


PS. I think that it's only Windows NT (2K,XP) that uses Unicode,
    while Windows 95 (98,ME) uses the ANSI code pages... But I could be wrong?
October 29, 2004
"Regan Heath" <regan@netwin.co.nz> wrote in message news:opsglk19qr5a2sq9@digitalmars.com...
> On Thu, 28 Oct 2004 08:35:24 -0400, Ben Hinkle <bhinkle4@juno.com> wrote:
> > So defining multiple aliases for strings or any other type is
> > a pretty harmless thing to do. It should only affect the readability and
> > maintainability of the code.
>
> I'd argue that it's not harmless for the very reasons you just mentioned. Readability and maintainability are important when working on any large-ish project.

But introducing more names doesn't always make something more readable or maintainable. One has to factor in the size of the group and the lifetime of the code. A wrapper or alias might seem obvious to the couple of people who started the project, but years down the road, with a group orders of magnitude larger, a little helper wrapper can add up to more overhead than it is worth. Also, notions of "this code is readable" and "maintainable" are much more subjective than "this code doesn't compile" or "this code uses the wrong type". My personal preference is that keeping things simple is the best way to make something readable and maintainable.


October 30, 2004
Glen Perkins wrote:
> I'd heard a bit about D, but this is the first time I've taken a bit of time to look it over. I'm glad I did, because I love the design.
> 
> I am wondering about something, though, and that's the apparent decision to have three different standard string types, each with its encoding exposed to the developer. I've had some experience designing text models--I worked with Sun upgrading Java's string model from UCS-2 to UTF-16 and for Macromedia upgrading the string types within Flash and ColdFusion, for example--but every case has its unique constraints.
> 
> I don't know enough about D to be sure of the issues and constraints in this case, but I'm wondering if it wouldn't make sense to have a single standard "String" class for the majority of text handling plus something like char/wchar/dchar/ubyte arrays reserved for special cases.

(I've read some of the posts in this thread. Sorry if I'm repeating what  someone else has already written.)

It seems to me that D would support a string class such as the one you seem to be proposing. Since Walter is busy getting the bugs out of the compiler, he's not likely to write an official string class anytime soon. But someone else could write it. And if that string class were good and lots of people liked it, I'd be surprised if Walter didn't add it to the standard library, Phobos.

If you're not up to writing it yourself, maybe you could persuade someone else to do the work by proposing a design.

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
October 31, 2004
On Fri, 29 Oct 2004 10:39:08 -0400, Ben Hinkle <bhinkle@mathworks.com> wrote:
> "Regan Heath" <regan@netwin.co.nz> wrote in message
> news:opsglk19qr5a2sq9@digitalmars.com...
>> On Thu, 28 Oct 2004 08:35:24 -0400, Ben Hinkle <bhinkle4@juno.com> wrote:
>> > So defining multiple aliases for strings or any other type is
>> > a pretty harmless thing to do. It should only affect the readability
>> > and maintainability of the code.
>>
>> I'd argue that it's not harmless for the very reasons you just mentioned.
>> Readability and maintainability are important when working on any
>> large-ish project.
>
> But introducing more names doesn't always make something more readable
> or maintainable. One has to factor in the size of the group and the
> lifetime of the code. A wrapper or alias might seem obvious to the
> couple of people who started the project, but years down the road,
> with a group orders of magnitude larger, a little helper wrapper can
> add up to more overhead than it is worth. Also, notions of "this code
> is readable" and "maintainable" are much more subjective than "this
> code doesn't compile" or "this code uses the wrong type". My personal
> preference is that keeping things simple is the best way to make
> something readable and maintainable.

That's what *I* implied/said, isn't it?

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
November 03, 2004
>>> I'd argue that it's not harmless for the very reasons you just
>>> mentioned. Readability and maintainability are important when
>>> working on any large-ish project.
..
>> My personal preference is that keeping things simple is the
>> best way to make something readable and maintainable.
>
> That's what *I* implied/said, isn't it?

As an old man, I cannot help thinking that these (obviously both) talented young men are stuck somewhere between their hormones and the writing on the wall. Had I been that age, I'd have participated vigorously in this myself.

I hope we get Walter with us in introducing a new name for the Canonical String. Be it an alias, a type, a class, or whatever. -- The main point is that we do need A Type that "everyone" uses.

Sure, we can claim that it's the char, wchar, dchar, or whatever, but hey, please, do remember the very purpose of a programming language:

"We may create a programming language from the point of the computer.
We may create a programming language from the point of the programmer.
We may create a ... ... sw-developing company.
We ... ... education.
W... ... maintainability.
" ... -- the story has other leaves.

Psychology, practice, history, just about everything "non-reality" related tells us 10-to-1 that we should create a name and tell everyone to use that.

Technically we do not need this, but that, I'm sorry, is not the issue here.


November 05, 2004
Just posting to let you know I also think "string" should be standardized. Be it char[] or whatever, but standardized.

Maybe in the future "string" could get some other members/operators that have no equivalent for int[]. (The fact that char[] is being treated as UTF-8 when converting to wchar[] proves that it's not simply an int8[] array.)

> Virtually all traditional tokenization and parsing tasks can be done
> with 8-bit types, because they require searching for delimiters that
> are themselves 8-bit chars. I've not seen "U+umlaut" delimited fields ;)

Indeed :-)

Lio.


November 05, 2004
Lionello Lunesu wrote:

> Maybe in the future "string" could get some other members/operators that have no equivalent for int[]. (The fact that char[] is being treated as UTF-8 when converting to wchar[] proves that it's not simply an int8[] array.)

The 8-bit integer type in D is "byte". D's "char" is *defined* as UTF-8.

This means that a "char" only holds an ASCII character.
You need a wchar to hold e.g. a Latin-1 character, and
a full (32-bit) dchar to hold all Unicode possibilities...
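
A small sketch of what fits in a *single* code unit of each type:

    char  c = 'A';            // ASCII fits in one UTF-8 code unit
    wchar w = '\u00E9';       // 'é' is one wchar (but two chars in UTF-8)
    dchar d = '\U0001D11E';   // outside the BMP: two wchars, four chars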

--anders
November 05, 2004
Yes, I've noticed that. I was referring to how the array is treated.

char[] array;
wchar[] warray = array;

This is doing some magic that has nothing to do with simply copying members, extending them as necessary. OK, I guess they're both arrays of UTF characters and the prefix only shows the memory representation, so it's still a member-by-member copy...

Can I do a similar assignment from byte[] to uint[]? (I know I could simply test, but I've never written a D program.) If not, then there is something special about char[] that might perhaps be more obvious if it were a built-in string type (the [] is confusing).

Lio.

"Anders F Björklund" <afb@algonet.se> wrote in message news:cmfjbd$22d5$1@digitaldaemon.com...
> Lionello Lunesu wrote:
>
>> Maybe in the future "string" could get some other members/operators that have no equivalent for int[]. (The fact that char[] is being treated as UTF-8 when converting to wchar[] proves that it's not simply an int8[] array.)
>
> The 8-bit integer type in D is "byte". D's "char" is *defined* as UTF-8.
>
> This means that a "char" only holds an ASCII character.
> You need a wchar to hold e.g. a Latin-1 character, and
> a full (32-bit) dchar to hold all Unicode possibilities...
>
> --anders


November 05, 2004
Lionello Lunesu wrote:

> Yes, I've noticed that. I was referring to the how the array is treated.
> 
> char[] array;
> wchar[] warray = array;

That D code just gives an error when you actually try to compile it:
"cannot implicitly convert expression array of type char[] to wchar[]"

If you insert an explicit cast, the result is probably NOT what you want...
(you CAN cast string *constants*)
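
For example (just a sketch; the cast repaints the bytes, it does not transcode):

    char[]  a = "héllo";           // 6 UTF-8 code units ('é' takes two)
    wchar[] w = cast(wchar[]) a;   // same 6 bytes seen as 3 wchars:
                                   // wrong length, garbage characters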

> This is doing some magic that has nothing to do with simply copying members, extending them as necessary. OK, I guess they're both arrays of UTF characters and the prefix only shows the memory representation, so it's still a member-by-member copy...

The compiler (well, the runtime) needs real conversion code for the different UTF arrays. Each code point, one dchar in UTF-32, takes 1-4 chars in UTF-8 and 1-2 wchars in UTF-16.

It's not a simple memory copy, as you can see in the std/utf.d code:
wchar[] toUTF16(char[] s);
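
You can see it in the lengths alone, too:
(string literals convert to all three array types)

    char[]  u8  = "é";   // length 2: two UTF-8 code units
    wchar[] u16 = "é";   // length 1: one UTF-16 code unit
    dchar[] u32 = "é";   // length 1: one code point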

> Can I do a similar assignment from byte[] to uint[] ?

Nope: "cannot implicitly convert expression a of type byte[] to uint[]"

You would have to do something like:
    byte[] a;
    uint[] b;
    foreach (byte c; a)
        b ~= c;   // widen and append one element at a time
Again, a cast() would just reinterpret the raw bytes.

> If not, then there is something special about char[] that might
> perhaps be more obvious if it was a built-in string type (the [] is
> confusing.)

Type char[] has a few "stringish" properties,
and bit has some magic "boolean" properties.

This is somehow better than built-in types...
(and a frequent source of D discussions/wars)


We'll just have to live with the type aliases,
"string" and "bool", as the types aren't changing ?

alias char[] string;
alias bit bool;
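
With those in scope, code reads as if the types were built-in:
(a trivial sketch)

    string s = "hello";           // really just a char[]
    bool   ok = (s.length == 5);  // really just a bit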

--anders