October 27, 2004
Glen Perkins wrote:

>> What did you think about the "string" (char[]) and "ustring" (wchar[]) ?
> 
> I don't think you were asking me, but my concern applies to any "let a hundred flowers bloom" design approach for strings. If you have multiple string types with no dominant leader, plus an "alias" feature, plus strong support for OOP but no standard string class, [...]

Walter has earlier ruled out a built-in "native" string type in D,
and a String class brings us back to the earlier "boxing" discussion.

Currently the D language treats strings as arrays of Unicode code units,
and one can still use char[] as ASCII strings, just like one could in C.


There are a lot of things discussed regarding Unicode and strings at:
http://www.prowiki.org/wiki4d/wiki.cgi?UnicodeIssues

A "transcoding" string type with a built-in hash code would have been
welcome, but it is *not* in the current D language specification...
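Something along these lines, purely as a sketch (the names and the
hashing scheme are made up; nothing like this exists in Phobos):

import std.utf;

struct TranscodingString
{
  char[] data;        // stored internally as UTF-8
  uint   cachedHash;  // 0 means "not computed yet"

  wchar[] utf16() { return toUTF16(data); }
  dchar[] utf32() { return toUTF32(data); }

  uint hash()
  {
    if (cachedHash == 0)
    {
      uint h = 0;
      foreach (char c; data)   // any hash function would do here
        h = h * 31 + c;
      cachedHash = h ? h : 1;  // keep 0 reserved for "not computed"
    }
    return cachedHash;
  }
}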


I just wanted a reasonable alias to use while the theological debate rages on?
(and the reason for submitting it was so that we could all use the same one)

--anders
October 27, 2004
Glen Perkins wrote:

> 
> "Ben Hinkle" <bhinkle4@juno.com> wrote in message news:clk269$haj$1@digitaldaemon.com...
>> Glen Perkins wrote:
>>
>> welcome.
> 
> Thanks.
> 
>> There is a port of IBM's ICU unicode library underway and that will
>> help fill in various unicode shortcomings of phobos. What else do you
>> see a class doing that isn't in phobos?
> 
> I don't know enough to comment at this point. I don't even know how modularity works for compiled executables in D, and I don't want to propose something that would violate D's priorities by, for example, creating a heavyweight string full of ICU features, that would end up being statically linked into every little "hello, world" written in D, ruining the goal of tiny executables if, for example, that is a high priority in D.
> 
> If there's no chance of a standard string class for general string operations in D, then there's no point in designing one. If there is a chance, then the design would have to start with the priorities and constraints of this particular language.
> 
> My sense is that a string class similar to that in C#, but noncommittal regarding its internal encoding, would be nice for a language like D.
> 
>>> ...I've become increasingly convinced that programmers don't need to
>>> know, much less be forced to decide, how most of their text is encoded.
>>> They should be thinking in terms of text semantically most of the time,
>>> without concerning themselves with its byte representation.
>>
>> are you referring to indexing and slicing being character lookup and not
>> byte lookup?
> 
> Yes, that's a specific example of what I'm referring to, which is the general notion of just thinking about algorithms for working with the text in terms of text itself without regard to how the computer might be representing that text inside (except in the minority of cases where you MUST work explicitly with the representation.)
> 
> And though it's probably too radical for D (so nobody freak out), we may well evolve to the point where the most reasonable default for walking through the "characters" in general text data is something like 'foreach char ch in mystring do {}', where the built-in "char" datatype in the language is a variable length entity designed to hold a complete grapheme. Only where optimization was required would you drop down to the level of "foreach codepoint cp in mytext do {}', where mytext was defined as 'codepoint[] mytext', or even more radically to 'foreach byte b in mytext do {}', where mytext was defined as 'byte[] mytext'.

One can foreach over dchars from either a char[] or wchar[]:

int main() {
  char[] t = "hello 中国 world";
  foreach(dchar x;t) printf("%x ",x);
  return 0;
}

prints

68 65 6c 6c 6f 20 4e2d 56fd 20 77 6f 72 6c 64

Similarly, structs and classes can provide overloaded opApply implementations to customize what it means to foreach over them in different situations.
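For example, here is a rough sketch (purely illustrative, not anything from Phobos) of a little struct whose opApply makes foreach visit the space-separated words of a char[] instead of its characters:

struct Words
{
  char[] src;

  int opApply(int delegate(inout char[]) dg)
  {
    int result = 0;
    size_t start = 0;
    for (size_t i = 0; i <= src.length; i++)
    {
      // a word ends at a space or at the end of the array;
      // splitting on the ASCII space is safe in UTF-8
      if (i == src.length || src[i] == ' ')
      {
        if (i > start)
        {
          char[] word = src[start .. i];
          result = dg(word);
          if (result)
            break;
        }
        start = i + 1;
      }
    }
    return result;
  }
}

// used as:
//   Words w;
//   w.src = "hello 中国 world";
//   foreach (char[] word; w) printf("%.*s\n", word);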

> Once again, I'm not proposing that for D, I'm just promoting the general notion of keeping the developer's mind on the text and off of the representation details to the extent that it is *reasonable*.
> 
> 
>>> Since the internal encoding of the standard String would not be
>>> exposed to the programmer, it could be optimized differently on every
>>> platform. I would probably implement my String class in UTF-16 on
>>> Windows and UTF-8 on Linux to make interactions with the OS and
>>> neighboring processes as lightweight as possible.
>>
>> Aliases can introduce a symbol that can mean different things on
>> different platforms:
>>
>> // "Operating System" character
>> version (Win32) {
>>     alias wchar oschar;
>> } else {
>>     alias char oschar;
>> }
>> oschar[] a_string_in_the_OS_preferred_format;
> 
> Thanks for pointing out this feature. I like it. It provides a mechanism for manual optimization at the cost of greater complexity for those special cases where optimization is called for. You could have different string representations for different zones in your app, labeled by zone name: oschar for internal and OS API calls, xmlchar for an XML I/O boundary etc., so you could change the OS format from OS to OS while leaving the XML format unchanged.
> 
> I can't help thinking, though, that it would be best reserved for optimization cases, with a simple works-everywhere, called "string" everywhere, string class for the general case. Otherwise, your language tutorials would be teaching you that a string is "char[]" but real production code would almost always be based on locally-invented names for string types. Libraries, which are also trying hard to be real production quality code, would use the above alias approach and invent their own names. Not just at points you needed to manually optimize but literally everywhere you did anything with a string internally, you'd have to choose among the three standard names, char, wchar, and dchar, plus your own custom oschar and xmlchar, plus your GUI library's gchar or kchar, and your ICU library's unichar, plus a database orachar designed to match the database encoding, etc.
> 
> You could easily end up with so many conversions going on between types locally optimized for each zone in your app that you are globally unoptimized.

That's possible, but so far it doesn't seem so bad to have three core string types. Storing the encoding in the instance instead of the type would turn today's compile-time decisions into run-time decisions, though. That would most likely slow things down since it can't inline as completely.

>> One disadvantage of a String class is that the methods of the class are
>> fixed. With arrays and functions anyone can add a string "method". A
>> class will actually reduce flexibility in the eyes of the user IMO.
>> Another disadvantage is that classes in D are by reference (like Java)
>> and so slicing will have to allocate memory - today a slice is a length
>> and pointer to shared data so no allocation is needed. A String struct
>> would be an option if a class isn't used, though.
> 
> It's true what you're saying about the relative lack of flexibility of built-in methods vs. external functions. You can always apply functions to strings, though, and the conservative approach would be to have a few clearly important methods in the string, implement other operations as functions that take string arguments, and over time consider migrating those operations into the string itself.
> 
> Another possibility might be to have this "oschar" approach above actually built-in, with everybody (starting from the first "hello, world" tutorial) encouraged to use that one by default. That's tricky, though, because when you asked for mystring[3] from your oschar-based string, what would you get? People would expect the third text character, but as you know it would depend on the platform, and would not have any useful meaning in general, which seems pretty awkward for a standard string. It doesn't seem very useful to present something in an array format without the individual elements of the array being very useful. You could make them useful by making dchar[] the default, but everybody would probably fuss about the wasted memory, and production code would end up using char or wchar. So that brings us back to a string class where operator overloading could make the [] array-type access yield consistent, complete codepoints on every platform.
> 
> I'm sympathetic to performance arguments. That would be one of the big attractions of D. I still can't help thinking that sticking to a single string class shared by almost all of your tutorials, your own code, your downloaded snippets, and all of your libraries might not only be the easiest for programmers to work with but could result in apps that tended to be at least as performant as the existing approach.

Yeah - any design will have trade-offs. dchar[] takes up too much space. On-the-fly character lookup is too slow to make the default. char[] is too fat for Asian languages. Judgements like "too much space" and "too slow" are subjective, and Walter made his choices. I'm sure he's open to more information that would sway those choices, but the best chance of influencing things is to add some solid data that is missing. With your experience in string handling in different languages, I'm guessing your opinions are based on accumulated knowledge about what is fast or slow etc., so trying to articulate that accumulated knowledge would be very useful.

-Ben
October 27, 2004
>> Walter has earlier ruled out a built-in "native" string type in D, and a String class brings us back to the earlier "boxing" discussion.

a) Built-in or library? (Standard library or 3rd party?)

b) 0, 1, 3 or 3+1 "approved" string kinds?

c) Unicode (which?), native (which?), other?

These 3 questions are orthogonal to each other.

To (a) I have no strong opinion. Maybe just building facilities in the language itself that are geared towards making it easy to implement an efficient string library would be adequate?

I have no problem with 3+1 in (b). Why not let the 3 existing strings live on? But I would really like to have an additional string type which would be advertised as the one you should use.

(c) I leave to smarter people.


If we don't have exactly _one_ type that everyone _should_ use, then programmers in, say, the Midwest would all use an 8-bit kind. People of, say, Chinese origin would probably use a 32-bit type -- even if they were coding in the US. And even if they were working on a project that is to manipulate ASCII strings, because they'd expect the application to sooner or later get exposed to non-ASCII characters anyway.

Actually, rednecks would be happy with 7 bits.

What if all these guys happen to work for the same global company?

<joke-mode>
I can hear a crowd all over the D-community shouting to their screens: "Well,
that company would have their global coding policy on strings. NO problem."

Right. But what about when (not if) that company gets merged into another? Would they have happened to choose the very same string coding policy? Maybe they began with different operating systems; maybe the other one originally came from another continent?

I don't even want to guess what the crowd says to this. </joke-mode>

This ought to be a no-brainer!


October 27, 2004
On Wed, 27 Oct 2004 08:26:52 -0400, Ben Hinkle <bhinkle4@juno.com> wrote:
> That's possible, but so far it doesn't seem so bad to have three core string types. Storing the encoding in the instance instead of the type would turn today's compile-time decisions into run-time decisions, though. That would most likely slow things down since it can't inline as completely.

Ben, can you give me/us an example where this would be the case?
How much slower do you think it would make it?

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
October 27, 2004
"Regan Heath" <regan@netwin.co.nz> wrote in message news:opsgjm7mx55a2sq9@digitalmars.com...
> On Wed, 27 Oct 2004 08:26:52 -0400, Ben Hinkle <bhinkle4@juno.com> wrote:
> > That's possible, but so far it doesn't seem so bad to have three core string types. Storing the encoding in the instance instead of the type would turn today's compile-time decisions into run-time decisions, though. That would most likely slow things down since it can't inline as completely.
>
> Ben, can you give me/us an example where this would be the case. How much slower do you think it would make it?

I don't know about impact on typical string usage but it certainly makes a difference with a super-cheezy made-up example like:

import std.c.windows.windows;
enum Encoding { UTF8, UTF16, UTF32 };
struct my_string {
  Encoding encoding;
  int length;
  void* data;
}

char index(char[] s, int n) { return s[n]; }
wchar index(wchar[] s, int n) { return s[n]; }
dchar index(dchar[] s, int n) { return s[n]; }
dchar index(my_string s, int n) {
  switch (s.encoding) {
    case Encoding.UTF8:
      return (cast(char*)s.data)[n];
    case Encoding.UTF16:
      return (cast(wchar*)s.data)[n];
    case Encoding.UTF32:
      return (cast(dchar*)s.data)[n];
  }
}

int main() {

  char[] s = "hello";
  int t1 = GetTickCount();
  for(int k=0;k<100_000_000; k++) {
    index(s,3);
  }
  int t2 = GetTickCount();

  my_string s2;
  s2.data = s;
  s2.encoding = Encoding.UTF8;
  s2.length = s.length;
  int t3 = GetTickCount();
  for(int k=0;k<100_000_000; k++) {
    index(s2,3);
  }
  int t4 = GetTickCount();

  printf("compile time %d\n",t2-t1);
  printf("run time %d\n",t4-t3);
  return 0;
}

compiling with "dmd main.d -O -inline" and running gives
compile time 110
run time 531

Any particular example doesn't mean much, though. My statement was meant as a general statement about compile-time vs run-time decisions.


October 27, 2004
ac wrote:

>>>Walter has earlier ruled out a built-in "native" string type in D,
>>>and a String class brings us back to the earlier "boxing" discussion.
> 
> a) Built-in or library? (Standard library or 3rd party?)

There is no built-in string type in D, and it does not look like there
will be a standard class either.
(as in: there will probably be no Integer, Character, String classes?)

> b) 0, 1, 3 or 3+1 "approved" string kinds?

There are *two* approved string types: char[] and wchar[]
(there is also a dchar[] type, but hardly any use for it?)

> c) Unicode (which?), native (which?), other?

"Unicode is the future", so there is no Latin-1 support...
(I assume you meant something like ISO-8859-1 by "native"?)


> These 3 questions are orthogonal to each other. 

I thought they were a bit strange, but I tried anyway ?


> If we don't have exactly _one_ type that everyone _should_ use, then programmers
> in, say, the Mid West would all use an 8-bit kind. People from, say, Chinese
> origin, probably would use a 32-bit type -- even if they were coding in the US.
> And even if they would be working on a project that is to manipulate ASCII
> strings, because they'd expect the application to sooner or later get exposed to
> non-USASCII characters anyway.

Western people that earlier had Latin-1 tend to use "char[]";
the only trick is to dimension as [length * 2], since some
characters occupy two bytes when encoded as UTF-8. To be
i18n-savvy, they should use [length * 4], which allows for
all of Unicode.

A single "char" is only useful for ASCII characters, as one
has to use at least a wchar to fit a Latin-1 character,
for instance.


Other people tend to use "wchar[]", which is also the string
(and character) encoding that Java chose. Nowadays one has to
be prepared to handle "surrogates", since Unicode no longer
fits in 16 bits - it has spilled over to 21 bits...

"wchar" *usually* works for Unicode characters, but to be
able to handle all characters a dchar must be used.


Nobody in their right mind uses "dchar[]" to store strings,
but the "dchar" type is useful for storing one code point.

A big disadvantage of UTF-16 (over UTF-8) is that it is
platform-dependent (byte order), and that it is not
ASCII-compatible. At least not with C and UNIX, since it
will have a "BOM" and since every other byte of an ASCII
string will be NUL.

(more details at http://www.unicode.org/faq/utf_bom.html)
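To make the above concrete, a small example (assuming the toUTF16 and
toUTF32 conversions from Phobos' std.utf; note that .length counts
code *units*, not characters):

import std.utf;

int main()
{
  char[]  u8  = "Grüße 中国";      // UTF-8
  wchar[] u16 = toUTF16(u8);       // UTF-16
  dchar[] u32 = toUTF32(u8);       // UTF-32

  printf("char[]  length = %d\n", u8.length);   // 14 - ü, ß take 2 bytes; 中, 国 take 3
  printf("wchar[] length = %d\n", u16.length);  //  8 - one unit each (no surrogates here)
  printf("dchar[] length = %d\n", u32.length);  //  8 - one unit per code point
  return 0;
}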



And imagine that all I wanted was two simpler aliases. :-)
(I thought "string" and "ustring" were easier to "pronounce"
than "char[]" and "wchar[]", and that was about it really.
Just a simple: alias char[] string; alias wchar[] ustring;)
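Written out, with a trivial use, just to show there really is
nothing more to it:

alias char[]  string;   // UTF-8
alias wchar[] ustring;  // UTF-16

string  s = "hello";    // exactly the same as: char[]  s = "hello";
ustring u = "hello";    // exactly the same as: wchar[] u = "hello";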

Not any new types or classes or other magic incantations...
--anders
October 27, 2004
On Wed, 27 Oct 2004 16:19:14 -0400, Ben Hinkle <bhinkle@mathworks.com> wrote:
> "Regan Heath" <regan@netwin.co.nz> wrote in message
> news:opsgjm7mx55a2sq9@digitalmars.com...
>> On Wed, 27 Oct 2004 08:26:52 -0400, Ben Hinkle <bhinkle4@juno.com> wrote:
>> > That's possible, but so far it doesn't seem so bad to have three core
>> > string types. Storing the encoding in the instance instead of the type
>> > would turn today's compile-time decisions into run-time decisions,
>> > though. That would most likely slow things down since it can't inline
>> > as completely.
>>
>> Ben, can you give me/us an example where this would be the case.
>> How much slower do you think it would make it?
>
> I don't know about impact on typical string usage but it certainly makes a
> difference with a super-cheezy made-up example like:
>
> [Ben's code example and timings snipped - see the previous message]
>
> Any particular example doesn't mean much, though. My statement was meant as
> a general statement about compile-time vs run-time decisions.

Thanks. I was hacking around with your example, basically inventing a string type which did not have runtime decisions, and it is giving me some very strange results. I wonder if you can spot where it's going awry.

D:\D\src\temp>dmd string.d -O -release -inline
d:\d\dmd\bin\..\..\dm\bin\link.exe string,,,user32+kernel32/noi;

D:\D\src\temp>string
compile time 156
run time 1000

(string.d is your example, unmodified, as a comparison to what I get below)

D:\D\src\temp>dmd string2.d -O -release -inline
d:\d\dmd\bin\..\..\dm\bin\link.exe string2,,,user32+kernel32/noi;

D:\D\src\temp>string2
compile time 219
run time 1156
template 157

I ran both several times, the results above are typical for my system.

Notice:
1- the "compile time" figure for string2.d is slower than for string.d
2- the template one is faster than the "compile time" one

I don't understand how either of the above can be true.

--[string2.d]--
import std.c.windows.windows;
enum Encoding { UTF8, UTF16, UTF32 };
struct my_string {
  Encoding encoding;
  void opCall(char[] s)  { encoding = Encoding.UTF8;  cs = s.dup; }
  void opCall(wchar[] s) { encoding = Encoding.UTF16; ws = s.dup; }
  void opCall(dchar[] s) { encoding = Encoding.UTF32; ds = s.dup; }
  union {
  	char[] cs;
  	wchar[] ws;
  	dchar[] ds;
  }
}
struct my_string2(Type) {
	Type[] data;
	void opCall(char[] s)  { data = cast(Type[])s.dup; }
	void opCall(wchar[] s) { data = cast(Type[])s.dup; }
	void opCall(dchar[] s) { data = cast(Type[])s.dup; }
	Type opIndex(int i) { return data[i]; }
}

char index(char[] s, int n) { return s[n]; }
wchar index(wchar[] s, int n) { return s[n]; }
dchar index(dchar[] s, int n) { return s[n]; }
dchar index(my_string s, int n) {
  switch (s.encoding) {
    case Encoding.UTF8:
      return s.cs[n];
    case Encoding.UTF16:
      return s.ws[n];
    case Encoding.UTF32:
      return s.ds[n];
  }
}
char index(my_string2!(char) s, int n) {
	return s.data[n];
}
wchar index(my_string2!(wchar) s, int n) {
	return s.data[n];
}
dchar index(my_string2!(dchar) s, int n) {
	return s.data[n];
}

int main() {

  char[] s = "hello";
  int t1 = GetTickCount();
  for(int k=0;k<100_000_000; k++) {
    index(s,3);
  }
  int t2 = GetTickCount();

  my_string s2;
  s2(s);
  int t3 = GetTickCount();
  for(int k=0;k<100_000_000; k++) {
    index(s2,3);
  }
  int t4 = GetTickCount();

  my_string2!(char) s3;
  s3(s);
  int t5 = GetTickCount();
    for(int k=0;k<100_000_000; k++) {
      index(s3,3);
    }
  int t6 = GetTickCount();

  printf("compile time %d\n",t2-t1);
  printf("run time %d\n",t4-t3);
  printf("template %d\n",t6-t5);
  return 0;
}

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
October 28, 2004
"Ben Hinkle" <bhinkle4@juno.com> wrote in message news:clo463$21js$1@digitaldaemon.com...
>>
>> You could easily end up with so many conversions going on between
>> types locally optimized for each zone in your app that you are
>> globally unoptimized.
>
> That's possible, but so far it doesn't seem so bad to have three core string
> types.

I think you'll end up with many more than that. Since people will be required to make what is essentially an optimization decision every time they do anything with text, the choice will typically be different on different platforms. Rather than letting the implementation deal with that so that source code can be ported and still remain close to optimal, this design requires the programmer to either 1) live with suboptimal performance when porting, 2) manually rewrite most of his code and live with separate versions that are harder to keep in sync, or 3) use the "alias" feature to invent a local name for a "standard" string.

Of the three, I think #3 is the most attractive. If lots of people agree with me, then when we end up reusing each other's code, we'll end up with the standard three string types, plus our own type, plus those invented by others. And there's no guarantee that our various alias types will all make the same decisions for when to be what.

So now I have a whole bunch of string types to deal with, some of which are the same on some platforms but different on others, so when I try to optimize my code so that I don't have lots of unnecessary back and forth and back and forth encoding conversions, I have to further de-sync my different platform versions or use more aliases to manage the aliases or, once again, live with the lack of optimization, attempting to repair it only where necessary.

If it's going to end up being #3 - and it probably should, because of what we know about optimization: the majority of your operations of all types could execute instantly without a noticeable improvement in overall app performance - then you could probably get about the same performance, without the design nightmare, by using a single, standard string type (optimized by the implementation for the platform) for almost everything.

> Storing the encoding in the instance instead of the type would turn
> today's compile-time decisions into run-time decisions, though. That would
> most likely slow things down since it can't inline as completely.

I'm not suggesting a string type that would have a field to hold its encoding, so that two instances of the same string class on the same platform could have two different internal encodings and functions would have to decide at runtime what code to run for each instance. I'm talking about a situation similar to the alias idea where every instance of a standard string on a given platform, whether in your own code or the libraries, would be in the same encoding, an encoding known at compile time.

The information to early-bind the methods would be available at compile time, and a smart compiler might be able to use that fact for compile time optimization, but I can't completely disagree with you. There may be other reasons why the compiler might not be able to do the binding at compile time, perhaps due to the general implementation of OO support.

Even if this is the case, you don't have to dismiss an idea because it doesn't optimize performance for each instance in which it is used. GC itself doesn't optimize performance for each instance, but it's still the way to go (in my opinion) because the performance of most parts is irrelevant to the performance of the whole, as long as those parts are reasonable, and you have a manual option for special cases. I think the same argument implies having a single default string type and letting the compiler optimize it.

>> I'm sympathetic to performance arguments. That would be one of the big
>> attractions of D. I still can't help thinking that sticking to a
>> single string class shared by almost all of your tutorials, your own
>> code, your downloaded snippets, and all of your libraries might not
>> only be the easiest for programmers to work with but could result in
>> apps that tended to be at least as performant as the existing
>> approach.
>
> Yeah - any design will have trade-offs. dchar[] takes up too much space.
> On-the-fly character lookup is too slow to make the default.

I'm not sure I understand this. I realize that you're just quoting things that "people say", but if this means it's better to have byte fetching from UTF-8 be the default instead of character fetching, it sounds as though it's claiming that it's a better default to do something useless than useful if the useless operation is faster. For the majority of text work, byte fetching is useless. What you care about is the text, not its representation. Only in a minority of cases would byte fetching matter. Those special cases are definitely important--the general cases will be built on top of byte fetching so fast byte fetching is mandatory--but defaults should be based on the typical need, not the exceptional need. If the typical need requires more work, well, it's still the typical need and the default, almost by definition, should be designed for the typical need.

Of course, I may have misunderstood you completely. ;-) Even more likely, this particular point doesn't matter, but it has been a source of some frustration how often people with a "C mindset" (and I'm not talking about you but am thinking of countless design meetings over the years) end up optimizing the insignificant at the expense of the significant because the insignificant is always in their face.

I think you can have the best of both worlds with a design based roughly on the idea of programmer productivity for defaults plus fine-grained manual optimization features (that integrate easily with the default features) for bottlenecks (defined as anyplace where a local optimization will produce a global optimization). D is quite close to such a design, but it seems to me that the string approach doesn't quite match.

> char[] is too
> fat for asian languages. Judgements like "too much space" and "too slow"
> are subjective and Walter made his choices. I'm sure he's open to more
> information that would sway those choices but the best chance of
> influencing things is to add some solid data that is missing.

I'm not sure who "Walter" is, but it sounds like he's the guy to thank for such a nice language design. (If I didn't mean that, I wouldn't waste my time writing any of this.) For the specific issue of strings, the information that I think is most relevant (and, as I said before, I still can't be *sure* that it's relevant in D's case), is not "data" per se, but a reminder that C++ is about the worst case scenario among major languages when it comes to programmer productivity in text handling, in large part because you ALWAYS end up getting stuck with multiple string types in any significant app. The problem is NOT that nobody ever managed to create a useful string type for C++, it's that EVERYBODY did so because Stroustrup wouldn't.

The "data", I suppose, is what happened in the case of C++ and didn't happen to any language with a built-in standard string class, but of course you can argue about the relevance of the comparisons.


> With your
> experience in string handling in different languages I'm guessing your
> opinions are based on accumulated knowledge about what is fast or slow etc
> so trying to articulate that accumulated knowledge would be very useful.

My accumulated knowledge tells me that what's fast or slow for a string design should NOT be your primary consideration, even when performance of the app as a whole IS (and it usually isn't.) I DO care about performance. Java's design prohibits the kind of performance I'm looking for, which is one reason I'm curious about D. But I care about global performance not local performance, and I also care about other significant global issues such as programmer productivity, lack of bugs, source portability, maintenance costs, etc. that almost always matter more than the microscale performance of your strings.

A factory doing manual labor can double its output by doubling the people at every station, or can pull people off of some stations, reducing the local "performance" at those stations, and reallocate them to double the staff at the bottleneck only. One approach improves global performance by improving local performance everywhere. The other either doesn't improve or actually loses performance everywhere except at the bottleneck. Both produce the same doubling of total factory output.

Which approach is better? Well gaining performance everywhere is obviously better--you never know when it might come in handy, right?--until you factor in the cost.

I don't want to get too tangled in the details of the analogy, but the cost of having three standard strings for fine-grained performance tuning everywhere, plus homemade and 3rd party aliases, plus multiple 3rd party string classes that will fill the void in the standard, is the complexity that it will add to designs with all of the implications that has for debugging, code reuse, architectural decisions, portability, maintenance, and general programmer productivity. All of those factors have costs, and some of them may even negatively impact global performance, which was the reason for the extra complexity to begin with.

I just have a hard time imagining that MAKING people micromanage their string implementations in all cases will produce superior global performance to simply ALLOWING them to do so where it impacted global performance. Doubling the staff at every factory station results in no more total production than simply doubling the staff at the bottleneck. I have an even harder time imagining that the benefits of the unavoidable additional complexity (which you can never avoid if you ever use other people's code) will be worth the performance benefit that may not even exist.

I could still be wrong about any of this. Am I overlooking something?




October 28, 2004
On Wed, 27 Oct 2004 17:34:14 -0700, Glen Perkins <please.dont@email.com> wrote:
> "Ben Hinkle" <bhinkle4@juno.com> wrote in message news:clo463$21js$1@digitaldaemon.com...
>> Storing the encoding in the instance instead of the type would turn
>> today's compile-time decisions into run-time decisions, though. That would
>> most likely slow things down since it can't inline as completely.
>
> I'm not suggesting a string type that would have a field to hold its encoding, so that two instances of the same string class on the same platform could have two different internal encodings and functions would have to decide at runtime what code to run for each instance.

No, but I did :)

I am starting to think it's unnecessary, however, given that converting from one encoding to another necessitates a copy of the data anyway.

So, instead: a single string type that could be encoded internally as any of the available encodings, couldn't change encoding itself, but could be cast/converted to another encoding (creating a new string).

Plus, it needs all the functionality of our current arrays, i.e. indexing, slicing, and being able to write methods for it, e.g.:

void foo(char[] a, int b);
char[] aa;
aa.foo(5);  <-- calls 'foo' above.

I'm pretty sure the above idea is not possible without some sort of compiler magic.
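Something like this is roughly what I have in mind for the struct part (just a sketch, names made up; it leaves out the array-method-call syntax above, which is where the compiler magic would come in):

import std.utf;

enum Encoding { UTF8, UTF16, UTF32 }

struct String
{
  Encoding encoding;   // fixed for the lifetime of this instance
  union
  {
    char[]  cs;
    wchar[] ws;
    dchar[] ds;
  }

  // conversion never mutates in place - it builds a new String,
  // since changing encoding forces a copy of the data anyway
  String toUTF16String()
  {
    String r;
    r.encoding = Encoding.UTF16;
    switch (encoding)
    {
      case Encoding.UTF8:  r.ws = toUTF16(cs); break;
      case Encoding.UTF16: r.ws = ws.dup;      break;
      case Encoding.UTF32: r.ws = toUTF16(ds); break;
      default: assert(0);
    }
    return r;
  }
}

Indexing, slicing etc. would then switch on 'encoding' much like Ben's my_string example does.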

Regan

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
October 28, 2004
"Glen Perkins" <please.dont@email.com> wrote in message news:clpeud$lql$1@digitaldaemon.com...
> I'm not sure I understand this. I realize that you're just quoting things that "people say", but if this means it's better to have byte fetching from UTF-8 be the default instead of character fetching, it sounds as though it's claiming that it's a better default to do something useless than useful if the useless operation is faster. For the majority of text work, byte fetching is useless. What you care about is the text, not its representation. Only in a minority of cases would byte fetching matter. Those special cases are definitely important--the general cases will be built on top of byte fetching so fast byte fetching is mandatory--but defaults should be based on the typical need, not the exceptional need. If the typical need requires more work, well, it's still the typical need and the default, almost by definition, should be designed for the typical need.

I'm not so sure this is correct. For a number of common string operations, such as copying and searching, byte indexing of UTF-8 is faster than codepoint indexing. For sequential codepoint access, the foreach() statement does the job. For random access of codepoints, one always has to start from the beginning and count forward anyway, and foreach() does exactly that.
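Spelled out, counting forward with foreach looks like this (a throwaway helper, not a library function):

// n-th code point of a UTF-8 string, counting forward from the start
dchar codepointAt(char[] s, int n)
{
  int i = 0;
  dchar result = '\uFFFD';   // replacement character if n is out of range
  foreach (dchar c; s)
  {
    if (i++ == n)
    {
      result = c;
      break;
    }
  }
  return result;
}

// codepointAt("hello 中国 world", 7) returns '国'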

As for a single string type, there is no answer for that. Each has significant tradeoffs. For a speed oriented language, the choice needs to be under the control of the application programmer, not the language. The three types are readilly convertible into each other. I don't really see the need for application programmers to layer on more string types.