August 28, 2004
What you fail to understand, Jill, is that such arguments are but pinpricks upon the World's foremost authority on everything from language-design to server-software to ease-of-use.

Better to just build a wchar-based String class (and all the supporting goodies), and those who care about such things will naturally migrate to it; they'll curse D for the short-sighted approach to Object.toString, leaving the door further open for a D successor.
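
For concreteness, a rough sketch of what such a class might look like, in present-day D syntax (which postdates this thread); the type and its members are hypothetical, not a real Phobos API:

// Hypothetical sketch only. A thin wrapper that keeps its text as
// UTF-16 and transcodes at the boundaries.
struct WString
{
    wchar[] data;   // UTF-16 code units

    // Accept D's native UTF-8 strings, transcoding on the way in.
    this(const(char)[] s)
    {
        import std.conv : to;
        data = s.to!(wchar[]);
    }

    // The "supporting goodies" would hang off here: slicing, searching,
    // concatenation, and a toString that transcodes back out only when
    // something really needs char[].
    string toString() const
    {
        import std.conv : to;
        return data.to!string;
    }
}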

V

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgp845$2ea2$1@digitaldaemon.com...
> In article <cgobse$237t$2@digitaldaemon.com>, Walter says...
>
> >There are also twice as many bytes to scan for the gc, and half the data until your machine starts thrashing the swap disk. The latter is a very real
> >issue for server apps, since it means that you reach the point of having to
> >double the hardware in half the time.
>
> There you go again, assuming that wchar[] strings are double the length of char[] strings. THIS IS NOT TRUE IN GENERAL. In Chinese, wchar[] strings are
> shorter than char[] strings. In Japanese, wchar[] strings are shorter than char[] strings. In Mongolian, wchar[] strings are shorter than char[] strings.
> In Tibetan, wchar[] strings are shorter than char[] strings. I assume I don't
> need to go on...?
>
> <sarcasm>But I guess server apps never have to deliver text in those languages.</sarcasm>
>
> Walter, servers are one of the places where internationalization matters most. XML
> and HTML documents, for example, could be (a) stored and (b) requested in any
> encoding whatsoever. A server would have to push them through a transcoding
> function. For this, wchar[]s are more sensible.
>
> I don't understand the basis of your determination. It seems ill-founded.
>
> Jill
>
>
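
Jill's claim is easy to check. A small test in present-day D syntax (which postdates this thread), counting the bytes one four-character Chinese string occupies in each encoding:

import std.conv : to;
import std.stdio : writeln;

void main()
{
    string  u8  = "你好世界";        // four CJK characters, UTF-8
    wstring u16 = u8.to!wstring;     // the same text, UTF-16

    writeln(u8.length  * char.sizeof);   // 12 bytes: 3 per character
    writeln(u16.length * wchar.sizeof);  //  8 bytes: 2 per character
}

For pure ASCII the ratio flips to 1:2 in char[]'s favour, which is exactly the disagreement being argued here.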


August 30, 2004
Walter wrote:

> 
> But D does not have a default. Programmers can use the encoding which is
> optimal for the data they expect to see. Even if UTF-8 were the default,
> UTF-8 still supports full internationalization and Unicode. I am certainly
> not talking about supporting only ASCII or having ASCII as the default.
> 

Umm, what about the toString() function?  Doesn't that assume char[]?
Hence, it is the default by example.

I'll be honest, I don't get why optimization is so important when a need
hasn't even been determined yet.  I am sure there can be quicker ways
of dealing with allocation and de-allocation--this would make the system
faster for all objects, not just strings.  If that can be done, why not
concentrate on that?

More advanced memory utilization can mean better overall performance,
and reduce the cost of one type of string over another.  Heck, if a
page of memory is being allocated for string storage (multiple strings
mind you), what about a really fast bit blit for the whole page?  That
would make the strings default to initialization state and speed things
up.  Ideally, the difference between a char[] and a dchar[] would be
how much of that page is allocated.
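
As a toy illustration of that page-wide clear (this is not how D's garbage collector actually allocates; every name here is invented):

// Toy bump allocator: clear one page once with a single fast blit,
// then hand out already-initialized string buffers with no
// per-string initialization cost.
import core.stdc.string : memset;

enum pageSize = 4096;
ubyte[pageSize] page;
size_t next;

void wipePage()
{
    memset(page.ptr, 0, pageSize);  // one blit covers many strings
    next = 0;
}

ubyte[] allocString(size_t bytes)
{
    auto s = page[next .. next + bytes];  // already zeroed
    next += bytes;
    return s;
}

A char[] and a dchar[] of the same character count would then differ only in how many bytes allocString hands out, which is Berin's point.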
August 30, 2004
Walter wrote:

> "Arcane Jill" <Arcane_member@pathlink.com> wrote in message
> news:cgp845$2ea2$1@digitaldaemon.com...
> 
> Are you sure? Even European languages are mostly ASCII.

But not completely.  There is the euro symbol (I dare say it would
be quite common).  In Spanish the enye (n with ~ on top, can't
really do that well in Windows) is fairly common, and important.
There is a big difference between an anus and a year, but the
only difference in Spanish is n vs. enye.  Not to mention all
those words that use an accent to mark an abnormally stressed
syllable.

Then we get to French, which uses the circumflex, the acute accent, and
the grave accent.  Oh, then there's German which uses those two little
dots (the umlaut) a lot.  And I haven't even touched on Greek or Russian,
both European languages.

You can only make that assumption about English-speaking countries.
Yes, almost everyone is exposed to English in some way, and it is
the current "lingua franca" (language of business, like French
used to be--hence the term).  The bottom line is that there are
sufficient exceptions to your "rule" that it would be a shame to
assume the world was America and Great Britain.
August 30, 2004
"Berin Loritsch" <bloritsch@d-haven.org> wrote in message news:cgv9m0$2g71$1@digitaldaemon.com...
> Walter wrote:
>
> > "Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cgp845$2ea2$1@digitaldaemon.com...
> >
> > Are you sure? Even European languages are mostly ASCII.
>
> But not completely.  There is the euro symbol (I dare say it would be quite common).  In Spanish the enye (n with ~ on top, can't really do that well in Windows) is fairly common, and important. There is a big difference between an anus and a year, but the only difference in Spanish is n vs. enye.  Not to mention all those words that use an accent to mark an abnormally stressed syllable.
>
> Then we get to French, which uses the circumflex, the acute accent, and the grave accent.  Oh, then there's German which uses those two little dots (the umlaut) a lot.  And I haven't even touched on Greek or Russian, both European languages.
>
> You can only make that assumption about English-speaking countries. Yes, almost everyone is exposed to English in some way, and it is the current "lingua franca" (language of business, like French used to be--hence the term).  The bottom line is that there are sufficient exceptions to your "rule" that it would be a shame to assume the world was America and Great Britain.

Even Britain has a non-ASCII character used quite extensively: the
pound sign, £. Norway/Denmark/Sweden have three non-ASCII characters
(used all the time). The Sami peoples have their own characters (they
live in Norway, Sweden, Russia). Finland, Estonia, Lithuania, Poland,
++ all have their own characters in addition to ASCII. Russia has
its own alphabet! All Latin-family languages (French/Spanish/Italian/
Portuguese) have all sorts of special characters (accents forwards/
backwards ++)... And now I have not even gone through HALF of Europe.
In Asia there are wildly different systems, and several systems in use,
_in_ _each_ _country_.

As I have stated before: I agree with Walter's concern for performance.
But where I think there is some disagreement in these discussions is
where to put the effort to "adapt" the environment: on those who
only need ASCII (most of the time), or on all those who would prefer
the language to default to the more general needs of application and
server programmers all over the world. My view is that speed freaks
are used to tuning their tools for best speed, and the general case
should reflect newbies and the 5 billion+ potential non-English-speaking
markets. Everything else is selling D short, in a shortsighted quest
for the best default speed as a language feature.

I have a rather radical suggestion. It may make sense, or someone may shoot it down right away because of something I have not thought of:

1. Remove wchar and dchar from the language.
2. Make char mean 8-bit unsigned byte, containing
   US-ASCII/Latin1/ISO-8859/cp1252, with one character in each byte.
   Null termination is expected. AFAIK all the sets mentioned are compatible
   with each other. Char *may* contain characters from any 8-bit based
   encoding, given that either an existing conversion table or the
   application can convert to/from one of the types below. This type makes
   for a clean, minimum-effort port from C and C++, and interaction with
   the current crop of OS and libraries. It also takes care of US/Western
   Europe speed freaks.
3. New types, utf8, utf16 and utf32 as suggested by others.
4. String based on utf16 as default storage, with overridden storage type
   like (sketched in code after this list):
   new String(200, utf8)   // 200 bytes
   new String(200, utf16)  // 400 bytes
   new String(200)         // 400 bytes
   new String(200, utf32)  // 800 bytes
   Anyone can use string with the optimal performance for them.
5. String literals in source, default assumed to be utf16 encoded.
   Can be changed by app programmer like:
   c"text"    // char[] 4 bytes
   u"text"    // String() 4 bytes
   w"text"    // String() 8 bytes
   "text"     // String() 8 bytes
   d"text"    // String() 16 bytes

I am open to the fact that I am not at all experienced in language design, but I hope this may bring the discussion along. I think making char the same as in C/C++ (but with a slightly better-defined default character set) and going with an entirely different type for the rest is a sound idea.
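
To make point 4 concrete, here is a hypothetical sketch in D-style syntax (every name below is invented for illustration) of a String whose storage encoding is chosen at construction:

enum Encoding { utf8, utf16, utf32 }

class String
{
    Encoding enc;
    void[] storage;

    // `chars` is a character count; the bytes used depend on the
    // encoding (taking the proposal's 1-byte-per-char utf8 case).
    this(size_t chars, Encoding e = Encoding.utf16)
    {
        enc = e;
        final switch (e)
        {
            case Encoding.utf8:  storage = new ubyte[chars];  break; // 200 -> 200 bytes
            case Encoding.utf16: storage = new ushort[chars]; break; // 200 -> 400 bytes
            case Encoding.utf32: storage = new uint[chars];   break; // 200 -> 800 bytes
        }
    }
}

// Mirroring the proposal:
//   auto a = new String(200, Encoding.utf8);   // 200 bytes
//   auto b = new String(200);                  // 400 bytes (default)
//   auto c = new String(200, Encoding.utf32);  // 800 bytes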

Roald


August 30, 2004
I couldn't agree more about Walter's ASCII argument. It's way out there and alienates all of us with non-English first languages (maybe I should start writing my messages using runes, just like my forefathers...).
If toString is only really useful for debugging anyway, it might as well return dchar[]. I'd rather remove it altogether, though.

Lars Ivar Igesund

Roald Ribe wrote:
> "Berin Loritsch" <bloritsch@d-haven.org> wrote in message
> news:cgv9m0$2g71$1@digitaldaemon.com...
> 
>>Walter wrote:
>>
>>
>>>"Arcane Jill" <Arcane_member@pathlink.com> wrote in message
>>>news:cgp845$2ea2$1@digitaldaemon.com...
>>>
>>>Are you sure? Even european languages are mostly ascii.
>>
>>But not completely.  There is the euro symbol (I dare say would
>>be quite common).  In Spanish the enye (n with ~ on top, can't
>>really do that well in Windows) is fairly common, and important.
>>There is a big difference between an anus and a year, but the
>>only difference in Spanish is n vs. enye.  Not to mention all
>>those words that use an accent to mark an abnormally stressed
>>sylable.
>>
>>Then we get to French, which uses the circumflex, accents, and
>>accent grave.  Oh, then there's German which uses those two little
>>dots alot.  And I haven't even touched on Greek or Russian, both
>>European countries.
>>
>>You can only make that assumption about English speaking countries.
>>Yes almost everyone is exposed to English in some way, and it is
>>the current "lingua de franca" (language of business, like French
>>used to be--hense the term).  The bottom line is that there are
>>sufficient exceptions to your "rule" that it would be a shame to
>>assume the world was America and Great Britain.
> 
> 
> Even Britain has a non-ASCII used quite extensively: Pound. £
> Norway/Denmark/Sweden has three non ASCII characters (used all the
> time). The Sami peoples has their own characters (they live in
> Norway, Sweden, Russia). Finland, Estonia, Lituania, Poland,
> ++ all have their own characters in addition to ASCII. Russia has
> its own alphabet! All latin family languages (French/Spanish/Italian/
> Portuguese) have all sorts of special characters (accents forwards/
> backwards ++)... And now I have not even gone through HALF of Europe.
> In Asia there are wildly different systems, and several systems in use,
> _in_ each_ _country_.
> 
> As I have stated before: I agree with Walter's concern for performance.
> But where I think there is some disagreement in these discussions is
> where to put the effort to "adapt" the environment, on those who
> only needs ASCII (most of the time), or on all those who would prefer
> the language to default to the more general need of application and
> server programmers all over the world. My view is that speed freaks
> are used to tune the tools for best speed, and the general case should
> reflect newbies and the 5 billion+ potential non English using markets.
> Everything else is selling D short, in a shortsighted quest for best
> speed as default as one of the language features.
> 
> I have a rather radical suggestion, that may make sense, or it may
> happen that someone will shoot it down right away because of something
> I have not thought of:
> 
> 1. Remove wchar and dchar from the language.
> 2. Make char mean 8-bit unsigned byte, containing
>    US-ASCII/Latin1/ISO-8859/cp1252, with one character in each byte.
>    Null termination is expected. AFAIK all the sets mentioned are compatible
>    with each other. Char *may* contain characters from any
>    8-bit based encoding, given that either existing conv. table or
> application
>    can convert to/from one of the types below. This type makes for a clean,
>    minimum effort port, from C and C++, and interaction with current crop of
>    OS and libraries. It also takes care of US/Western Europe speed freaks.
> 3. New types, utf8, utf16 and utf32 as suggested by others.
> 4. String based on utf16 as default storage. With overidden storage type
> like:
>    new String(200, utf8)   // 200 bytes
>    new String(200, utf16)  // 400 bytes
>    new String(200)         // 400 bytes
>    new String(200, utf32)  // 800 bytes
>    Anyone can use string with the optimal performance for them.
> 5. String literals in source, default assumed to be utf16 encoded.
>    Can be changed by app programmer like:
>    c"text"    // char[] 4 bytes
>    u"text"    // String() 4 bytes
>    w"text"    // String() 8 bytes
>    "text"     // String() 8 bytes
>    d"text"    // String() 16 bytes
> 
> I am open to the fact that I am not at all experienced in language
> design, but I hope this may bring the discussion along. I think making
> char the same as in C/C++ (but slightly better defined default char set)
> and go with entirely different type for the rest is a sound idea.
> 
> Roald
> 
> 
August 30, 2004
Walter did use the word "most". Does anyone know of any studies on the frequency of non-ASCII chars for different document content and languages? There must be solid numbers about these things given all the zillions of electronic documents out there. A quick google for French just dug up a posting where someone scanned 86 million characters from Swiss-French news agency reports and got 22M non-accented vowels (aeiou) and 1.8M accented chars. That's a factor of roughly 10. That seems significant. But I don't want to read too much into one posting found in a minute of googling - I'm just curious what the data says.

"Lars Ivar Igesund" <larsivar@igesund.net> wrote in message news:cgvoid$2nt9$1@digitaldaemon.com...
> I couldn't agree more about Walter's ASCII argument. It's way out there
> and alienates all of us with non-English first languages (maybe I should
> start writing my messages using runes, just like my forefathers...).
> If toString is only really useful for debugging anyway, it might as well
> return dchar[]. I'd rather remove it altogether, though.
>
> Lars Ivar Igesund
>
> Roald Ribe wrote:
> [snip]


August 30, 2004
"Berin Loritsch" <bloritsch@d-haven.org> wrote in message news:cgv92r$2fvv$1@digitaldaemon.com...
> Umm, what about the toString() function?  Doesn't that assume char[]? Hence, it is the default by example.

Yes, but it isn't char(!)acteristic of D.

> I'll be honest, I don't get why optimization is so important when a need hasn't even been determined yet.

Efficiency, or at least potential efficiency, has always been a strong attraction that programmers have to C/C++. Since D is targeted at that market, efficiency will be a major consideration. If D acquires an early reputation for being "slow", like Java did, that reputation can be very, very hard to shake.

> I am sure there can be quicker ways
> of dealing with allocation and de-allocation--this would make the system
> faster for all objects, not just strings.  If that can be done, why not
> concentrate on that?

There's no way to just wipe away the costs of using double the storage.

> More advanced memory utilization can mean better overall performance, and reduce the cost of one type of string over another.  Heck, if a page of memory is being allocated for string storage (multiple strings mind you), what about a really fast bit blit for the whole page?  That would make the strings default to initialization state and speed things up.  Ideally, the difference between a char[] and a dchar[] would be how much of that page is allocated.


August 30, 2004
Berin Loritsch wrote:

> Walter wrote:
> 
>> "Arcane Jill" <Arcane_member@pathlink.com> wrote in message
>> news:cgp845$2ea2$1@digitaldaemon.com...
>>
>> Are you sure? Even European languages are mostly ASCII.
> 
> 
> But not completely.  There is the euro symbol (I dare say it would
> be quite common).  In Spanish the enye (n with ~ on top, can't
> really do that well in Windows) is fairly common, and important.
> There is a big difference between an anus and a year, but the
> only difference in Spanish is n vs. enye.  Not to mention all
> those words that use an accent to mark an abnormally stressed
> syllable.
>
> Then we get to French, which uses the circumflex, the acute accent, and
> the grave accent.  Oh, then there's German which uses those two little
> dots (the umlaut) a lot.  And I haven't even touched on Greek or Russian,
> both European languages.

When serving HTML, extended European characters are usually not served as Latin-1 or Unicode. Instead, the &name; entity escapes (like &eacute;) are preferred. There are ASCII escapes for all Latin-1 characters, as far as I know.

But what bothers me with all Unicode is that Cyrillic languages cannot be handled with 8 bits either. What would be nice is if we found an encoding which would work on 2 buffers: the primary one containing the ASCII and data in some codepage, the secondary one containing packed codepage changes, so that Russian, English, Hebrew and other text can be mixed and would still need about one byte per character on average. For Asian languages, the encoding should use, on average, one symbol in the primary string and one symbol in the secondary per character. The length of the primary stream must be exactly the length of the string in characters; all of the overhang must be placed in the secondary one. I have a feeling that this could be great for most uses and most efficient in total.
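
A very rough sketch of the decode side of such a scheme (everything here is invented for illustration; Unicode's real SCSU compression scheme works along broadly similar "window" lines):

// Primary buffer: exactly one byte per character. Secondary buffer:
// rare "switch the active 256-code-point window" events, each tagged
// with the position at which it takes effect.
struct Switch { size_t pos; dchar windowBase; }

dchar[] decode(const(ubyte)[] primary, const(Switch)[] overhang)
{
    dchar base = 0;        // start in the ASCII/Latin-1 window
    size_t pending = 0;    // index of the next unapplied switch
    auto text = new dchar[primary.length];

    foreach (i, b; primary)
    {
        if (pending < overhang.length && overhang[pending].pos == i)
            base = overhang[pending++].windowBase;
        text[i] = cast(dchar)(base + b);  // one byte per character
    }
    return text;
}

A run of Russian text would then cost one window switch plus one byte per character, which is the "about one byte per character on average" aimed at above.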

We should also not forget that the world is mostly Chinese, and soon the computer users will also be. The European race will lose its importance.

-eye
August 31, 2004
In article <ch07mc$30ai$1@digitaldaemon.com>, Ilya Minkov says...

>But what bothers me with all Unicode is that Cyrillic languages cannot be handled with 8 bits either.

One option would be the encoding WINDOWS-1251. Quote...

"The Cyrillic text used in the data sets are encoded using the CP1251 Cyrillic system.  Users will require CP1251 fonts to read or print such text correctly. CP1251 is the Cyrillic encoding used in Windows products as developed by Microsoft. The system replaces the underused upper 128 characters of the typical Latin character set with Cyrillic characters, leaving the full set of Latin type in the lower 128 characters. Thus the user may mix Cyrillic and Latin text without changing fonts."

(-- source: http://polyglot.lss.wisc.edu/creeca/kaiser/cp1251.html)

But that's just a transcoding issue, surely? Internally, we'd use Unicode, no?
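
The transcoding itself is trivial for any single-byte codepage; a sketch, with the 128-entry table elided (the authoritative CP1251-to-Unicode mapping is published by the Unicode consortium):

// High half of CP1251 (0x80..0xFF) mapped to Unicode code points.
// Contents elided here; e.g. byte 0xC0 maps to U+0410, Cyrillic 'А'.
immutable dchar[128] cp1251High;

dchar cp1251ToUnicode(ubyte b)
{
    return b < 0x80 ? cast(dchar) b      // low half is plain ASCII
                    : cp1251High[b - 0x80];
}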



>We should also not forget that the world is mostly Chinese, and soon the computer users will also be.

Well, Chinese /certainly/ can't be handled with 8 bits. Traditionally, Chinese users have made use of encodings such as Big5 and GB2312 (and Japanese users SHIFT-JIS), which are (shock! horror!) /multi-byte encodings/ (there being vastly more than 256 "letters" in the Chinese writing system). These legacy encodings are seriously horrible to work with, compared with the elegant simplicity of UTF-16.
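
That "elegant simplicity" shows up directly in code: a UTF-16 decoder is a few lines, because everything outside the surrogate range is one code unit per character. A minimal sketch (it assumes well-formed input, so no error handling):

dchar decodeUtf16(const(wchar)[] s, ref size_t i)
{
    wchar hi = s[i++];
    if (hi < 0xD800 || hi > 0xDBFF)
        return hi;                       // BMP: one unit, one character
    wchar lo = s[i++];                   // trailing surrogate
    return cast(dchar)(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
}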

Arcane Jill



September 01, 2004
Arcane Jill wrote:

> One option would be the encoding WINDOWS-1251. Quote...

Oh come on. Do you really think I don't know 1251 and all the other Windows codepages? Oh, how would a person who natively speaks Russian ever know that? They are all into typewriters and handwriting, aren't they?

> But that's just a transcoding issue, surely? Internally, we'd use Unicode, no?

You have apparently ignored what I tried to say. What is used externally is determined by external conditions, and is not the subject of this part of the post. I have suggested investigating and possibly developing another *internal* representation which would provide optimal performance. It should consist of 2 storages, the 8-bit primary storage and the variable-length "overhang" storage, and should be able to represent *all* Unicode characters. We are back at the question of an efficient String class or struct.

The idea is that characters are not self-contained, but instead context-dependent. For example, the most commonly used escape in the overhang string would be "select a new Unicode subrange to work on". Unicode documents are not just random data! They are words or sentences written in a combination of a few languages, with a change of language happening perhaps every few words. But you don't have every single symbol switching language. So why does every symbol need to carry the complete information, if most of it is more efficiently stored as a relatively rare state change?

>>We should also not forget that the world is mostly chinese, and soon the computer users will also be.
> 
> Well, Chinese /certainly/ can't be handled with 8 bits. Traditionally, Chinese
> users have made use of encodings such as Big5 and GB2312 (and Japanese users
> SHIFT-JIS), which are (shock! horror!) /multi-byte encodings/ (there being
> vastly more than 256 "letters" in the Chinese writing system). These legacy
> encodings are seriously horrible to work with, compared with the elegant
> simplicity of UTF-16.

Again, you have chosen to ignore my post. As you are much more familiar with Unicode than I am, could you possibly develop an encoding which takes, amortized:

1 byte per character for usual codepages (not including the fixed-length subrange-select command at the beginning)

2 bytes per character for all multibyte encodings which fit into UTF-16 (not including the fixed-length subrange-select command at the beginning)

The rest of the Unicode characters should be representable as well. Besides, I would like only the first byte of each character's encoding to be stored in the primary string, and the rest in the "overhang". I have my reasons to suggest that, and *if* you care to pay attention I would also like to explain them in detail.

-eye