April 01, 2003 Re: Unicode Character and String Intrinsics
Posted in reply to Mark Evans

In article <b6al79$1ahd$1@digitaldaemon.com>, Mark Evans says...
>
> Hi again Bill
>
> After your 'meta-programming' talk I shudder to think what your idea of a maximalist is...maybe a computer that writes a compiler that generates source code for a computer motherboard design program to construct another computer that...

A maximalist wants many built-in features, from functional programming support, to multimethods, to support of every character format known to man. Not in libraries, where we could all contribute, but built-in, where Walter has to write it.

As a minimalist, I'd settle for features that allow me to add the features I need to the language in libraries. The meta-programming stuff I'd mentioned leads in that direction.

Bill
April 01, 2003 Re: Unicode Character and String Intrinsics
Posted in reply to Bill Cox

And a pragmatist wants as much as possible in libraries, but also what he/she feels must be in the compiler because of the likelihood of stuff-ups if left to the full spectrum of the developer community (such as meaningful ==, string types, and my auto-stringise thingo with char null *).

"Bill Cox" <Bill_member@pathlink.com> wrote in message news:b6b05r$1hsv$1@digitaldaemon.com...
> [...]
April 01, 2003 Re: Unicode Character and String Intrinsics
Posted in reply to Bill Cox

Bill, the point is that trying to paint me this or that color, instead of focusing on something specific, is ad hominem. I find it patronizing, especially since on this point you've already agreed with me explicitly.

We can quibble on specifics. I want 3 char types; you want 2 (UTF8 + char) or maybe even 3 (UTF8 + char + wchar).

I have much to say about those bizarre meta-programming concepts. I have worked in EDA and know that domain - you can't blow smoke in my face, even if others are impressed. All I would say here is that by your own admission, you're trying to write code for 'average' or 'dumb' programmers, so please focus on doing just that.

Mark
April 01, 2003 Re: Unicode Character and String Intrinsics
Posted in reply to Matthew Wilson

Well, I went through character and code page problems too, about a year ago. Very bad experience in C/C++ ... (I'm from a place where 7 bits is not enough). I have two points about this:

1) D should support characters, not bytes (8 bits) or words (16 bits). When I'm indexing a string I do so by characters, not by a byte multiple; if I wanted to index by, e.g., bytes, I would ask for the string's byte length and cast to a byte array.

2) Support for 3 character types (UTF8, UTF16, UTF32) is handy, but not critical (it can be solved by conversion functions); actually, for a single character, UTF32 has the shortest representation. It may also be interesting not to be able to specify the exact encoding for a string (as opposed to an encoding for a character) - let the compiler decide what the best representation is (maybe some optimization can be achieved based on this later; e.g. the compiler can decide to store strings in partially balanced trees like STLport does for ropes, but with possibly different encodings for different nodes ... whatever, just writing down my thoughts).

"Matthew Wilson" <dmd@synesis.com.au> wrote in message news:b6aq84$1dn4$1@digitaldaemon.com...
> I'm sold. Where can I sign up?
>
> I presume you'll be working on the libraries ... ;)
>
> To suck up: I've been faffing around with this issue for years, and have been (unjustifiably, in my opinion) called on numerous times to expertly opine on it for clients. (My expertise is limited to the C/C++ char/wchar_t/horrid-TCHAR type stuff, which I'm well aware is not the full picture.) Your discussion here is the first time I even get a hint that I'm listening to someone that knows what they're talking about. It's nasty, nasty stuff, and I hope that your promise can bear fruit for D. If it can, then it'll earn massive brownie points for D over its peer languages. There's a big market out there of people whose character sets don't fall into 7 bits ...
>
> "Mark Evans" <Mark_member@pathlink.com> wrote in message news:b6al79$1ahd$1@digitaldaemon.com...
> > Hi again Bill
> >
> > After your 'meta-programming' talk I shudder to think what your idea of a maximalist is...maybe a computer that writes a compiler that generates source code for a computer motherboard design program to construct another computer that...
> >
> > Under my scheme we gain 3 character types and drop 2: net gain 1. We gain 3 string types and drop 1: net gain 2. Total net gain, 3 types. What does that buy us? Complete internationalization of D, complete freedom from ugly C string idioms, data portability across platforms, ease of interfacing with Win32 APIs and other software languages.
> >
> > The idea of "just one" Unicode type holds little water. Why don't you make the same argument about numeric types, of which we have some twenty-odd? Or how about if D offered just one data type, the bit, and let you construct everything else from that? If D does Unicode then D should do it right. It's a poor, asymmetric design to have some Unicode built-in and the rest tacked on as library routines.
> >
> > Mark
>
> This is a rare occasion when I agree with Mark. The fact that a minimalist like me, and a maximalist like Mark, and a pragmatist like yourself seem to agree is something Walter should consider. I would want to hold built-in string support to just UTF-8. D could offer some support for the other formats through conversion routines in a standard library. Having a single string format would surely be simpler than supporting them all. Bill
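The character-versus-byte indexing point above is easy to demonstrate. A small sketch (Python used purely for illustration, since it exposes both the character view and the encoded-byte view of a string):

```python
# Point 1 above: in UTF-8, one character may occupy several bytes, so
# byte indices and character indices diverge as soon as the text leaves
# 7-bit ASCII.
s = "žluť"               # 4 characters, but not 4 bytes in UTF-8
raw = s.encode("utf-8")  # the explicit "cast to a byte array" step

assert len(s) == 4          # character count - what string indexing should use
assert len(raw) == 6        # byte count - 'ž' and 'ť' take 2 bytes each
assert s[1] == "l"          # character indexing finds the right letter
assert raw[1] != ord("l")   # byte index 1 lands inside a multi-byte sequence
```

This is exactly why "indexing by a byte multiple" breaks for any character set that does not fit in 7 bits.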
April 01, 2003 Re: Unicode Character and String Intrinsics
Posted in reply to Mark Evans

Walter -

On a positive and constructive note, an implementation concept might hold some interest. I'm just bringing it to attention, not advocating yet <g>.

There's no hard requirement for serial bytewise storage of the proposed intrinsic Unicode strings. Other ways to build Unicode strings exist. The one offered here would do little or no damage to the current compiler. Really it's just a set of small additions.

Consider a Unicode string made of two data structures: a C-style array, and a lookup table. The C-style array holds the first code word for each character. The table holds all second, third, and additional code words. (A 'code word' meaning 8/16/32 bits for UTF 8/16/32 respectively.) The keys to the table are the indices of the string. So if character #100 has extra code words, they are accessed via some function like table_access(100).

This setup unifies C array indices with Unicode character indices. So D can employ straight pointer arithmetic to find any character in the string. Character index = array index. String length (in chars) = implementation array size (in elements). These features may address your hesitation over implementation issues that are complex in the serial case.

Having found the character, D need only check the high bit(s) which flag additional code words. Unicode requires such a test in any case; it's unavoidable. If flagged, D performs a table lookup. This table lookup is the only serious runtime cost. The table could take whatever form is most efficient.

* UTF-32 has no extended codes, so UTF-32 strings don't need tables.
* UTF-16 characters involve only a few percent with extended codes. Ergo - the table is small, and the runtime cost is, say, 2-3%.
* UTF-8 needs the biggest and most table entries, but manageably so.

A downside might be file and network serialization - but we might skate by. D could supply streams on demand, without an intermediate serialized format. If I tell D "write(myFile, myString)" no intermediate format is required. D can just empty the internal array and table to disk in proper byte sequence. The disk or network won't care how D gets the bytes from memory.

The only hard serialization requirement would be actual user conversion to byte arrays. (If the user is doing that, let him suffer!)

This scheme supports 7-bit ASCII. An optimization could yield raw C speed. Put an extra boolean flag inside each string structure. This flag is the logical OR of all contained Unicode bit flags. If the string has no extended chars, the flag is FALSE, and D can use alternate string code on that basis. (No bit tests, no table lookups.) That works for UTF-32, 7-bit ASCII, and the majority of UTF-16 strings.

The idea can be nitpicked to death, but it's a concept. Unicode strings and characters will never enjoy the simplicity or speed of 7-bit ASCII. That's a fact of life, meaning that implementation concepts cannot be faulted on such a basis.

What would be nice is to make Unicode maximally simple and maximally efficient for D users.

Thanks again Walter,

Best-
Mark
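The two-structure proposal above is easy to prototype. The sketch below (Python, for illustration only; the name `table_access` follows the post, while the class name and everything else are invented) stores the first UTF-8 code word of each character in a flat array, the continuation bytes in a side table keyed by character index, and carries the proposed ASCII fast-path flag:

```python
class TwoPartString:
    """Sketch of the proposed layout for UTF-8: a flat array of first
    code units plus a table of continuation bytes keyed by char index."""

    def __init__(self, text):
        self.firsts = []   # first byte of each character (array index == char index)
        self.extras = {}   # char index -> second/third/... bytes
        for i, ch in enumerate(text):
            b = ch.encode("utf-8")
            self.firsts.append(b[0])
            if len(b) > 1:             # high bits of b[0] flag the extra bytes
                self.extras[i] = b[1:]
        self.all_ascii = not self.extras  # the proposed fast-path flag

    def __len__(self):
        return len(self.firsts)        # char count == implementation array size

    def table_access(self, i):
        return self.extras.get(i, b"")

    def char_at(self, i):
        # O(1): direct array index, table lookup only when flagged
        return (bytes([self.firsts[i]]) + self.table_access(i)).decode("utf-8")

    def serialize(self):
        # "Empty the internal array and table to disk in proper byte sequence"
        return b"".join(bytes([f]) + self.extras.get(i, b"")
                        for i, f in enumerate(self.firsts))

s = TwoPartString("naïve")   # 'ï' is the only multi-byte character
assert len(s) == 5
assert s.char_at(2) == "ï"
assert not s.all_ascii
assert s.serialize() == "naïve".encode("utf-8")
```

A real compiler would of course use a flat C array and a more compact table, but the sketch shows the key property: character index equals array index, with the table consulted only for flagged entries.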
April 01, 2003 Re: Unicode Character and String Intrinsics
Posted in reply to Bill Cox
Bill Cox wrote,
>Not in libraries, where we could all contribute,
>but built-in, where Walter has to write it.
The compiler is open-source. Contributions are welcome. (Wasn't it you who said recently, 'I had a few days off and rewrote the D compiler' or words to that effect? Forgive me if memory fails, I think it was you.)
Whatever reasons you accept for UTF-8 as a native type hold equally well for UTF-16 and UTF-32. The only rationale advanced otherwise was a vague impression of unease (coupled with slurs on my design sense).
Dividing type families is a war crime. It's more complex having one member in the compiler and the rest stranded in a library. Think about slicing Unicode strings. Suppose the compiler includes code for slicing UTF-8 strings. Why do we want to duplicate that in a library for UTF-16? We have to write identical logic, in C for the compiler and in D for the library? Yuk! And what about the conversions between Unicode formats? They are easier with the strings all living in the same place. Either these strings belong in the language together, or they belong in a library together. I see no objective reason to divide them up.
Just think about what you're saying in terms of numeric types and the fallacy will jump out at you. C has trained people too well about what strings really are. Suppose for example that we put all floats in the compiler and all doubles in the library. Silly! <g>
Maybe it will mend fences to say in public that UTF-32 could be dropped. I have objective reasons for saying so, not vague unease: UTF-32 is rarely used and truly fixed-width (so it can be 'faked' as Walter suggests). Nonetheless intrinsic UTF-32 is just as reasonable to support as, say, the equally rarely used, and equally fake-able 'ifloat' type.
Mark
April 01, 2003 Re: Unicode Character and String Intrinsics
Posted in reply to Mark Evans

Qualifying this again with the stipulation that I am far from an expert on this issue (aside from having a fair amount of experience in a negative sense):

This sounds like a nice idea - array of 1st-byte plus lookups. I'm intrigued as to the nature of the lookup table. Is this a constant, process-wide entity?

If I had time when it was introduced I'd be keen to participate in the serialisation stuff, on which I have firmer footing.

It's not clear now whether you've dropped the suggestion for a separate string class, or just that arrays of "char" types would be dealt with in the fashion that you've outlined.

Finally, I'm troubled by your comments "on a positive and constructive note" and "maybe it will mend fences to " (other post). Have I missed some animus that everyone else has perceived? If so, I don't know which side to be on. Seriously, though, I don't think anyone's getting shirty, so chill, baby. :)

Keep those great comments coming. I'm learning heaps.

"Mark Evans" <Mark_member@pathlink.com> wrote in message news:b6bb6i$1ont$1@digitaldaemon.com...
> [...]
April 01, 2003 Re: Unicode Character and String Intrinsics
Posted in reply to Walter

"Walter" <walter@digitalmars.com> wrote in message news:b6aoge$1cnp$1@digitaldaemon.com...
> > > > if someone could point out why 'ubyte' and 'ubyte[]' do not suffice.
> > > Function overloading.
> > This comment is a logical contradiction with prior remarks. If the distinction between ubyte and char matters for this reason, then the same reason makes a difference between uint and UTF-32. But in the latter case you say to just use uint. You can't have it both ways.
> I think uint[] will well serve for UTF-32 because there is no need to be concerned about multiword encoding. But there is still concern for there to be a separate type, for function overloading.

Otherwise, how shall we print a Unicode character higher than position 0xFFFF? Perhaps the basic char type would actually be 32 bits and capable of holding any Unicode character? And when used in array form, char[] would transmogrify into UTF-8? Would we then even need wchar?

Obviously this Unicode thing is a whole can of worms. Too bad we can't get everyone to forget about enough characters that they all fit in 16 bits! ;)

Sean
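Sean's question about characters above 0xFFFF is worth making concrete. A quick check (Python used purely for illustration) with U+1D11E, MUSICAL SYMBOL G CLEF, shows why such a character cannot live in a single 16-bit unit:

```python
# A character above U+FFFF does not fit in 16 bits: UTF-16 must spend a
# surrogate pair on it, and UTF-8 spends four bytes.
ch = "\U0001D11E"                          # MUSICAL SYMBOL G CLEF
assert ord(ch) == 0x1D11E                  # code point is above 0xFFFF
assert len(ch.encode("utf-16-le")) == 4    # two 16-bit code units (surrogate pair)
assert len(ch.encode("utf-8")) == 4        # four 8-bit code units
assert len(ch.encode("utf-32-le")) == 4    # exactly one 32-bit code unit
```

This is the case for a 32-bit character type: only UTF-32 represents every Unicode character in a single fixed-width code unit.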
April 01, 2003 Re: Unicode Character and String Intrinsics
Posted in reply to Mark Evans

That's so crazy it just might work! ;)

I think it's a fine concept. One point I'd like to add is that when straight iterating over the string, the library function can iterate over both the main array and the secondary map at the same time, in sync, with no map lookups, only iteration.

This would be an interesting bit to actually implement. But no harder than the many other possible solutions, and easier and more efficient than most, especially for random-access indexing, which seems to be what D is leaning toward in general.

I'd prefer iteration to be the normal way of using D arrays, rather than explicit loops and indexing. Those are, for obvious reasons, difficult to optimize. But Walter has not decided on a good foreach construct, and newsgroup discussion on the topic has died down. Anyone have any good proposals? I haven't used any language that has good iterators, except if you count C++ STL.

Sean

"Mark Evans" <Mark_member@pathlink.com> wrote in message news:b6bb6i$1ont$1@digitaldaemon.com...
> [...]
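Sean's in-sync iteration idea can be sketched as a merge of the flat first-byte array with the sorted entries of the side table, so a full traversal does no per-character map lookups (Python, for illustration; the two-part layout mirrors the proposal quoted above, and the function name is invented):

```python
def iterate_chars(firsts, extras):
    """Walk a two-part UTF-8 string (flat array of first bytes plus a
    side table of continuation bytes keyed by char index), yielding whole
    characters. The table is visited in index order, in sync with the
    array, so each step is a cheap comparison rather than a map lookup."""
    pending = iter(sorted(extras.items()))   # (char index, continuation bytes)
    nxt = next(pending, None)
    for i, first in enumerate(firsts):
        tail = b""
        if nxt is not None and nxt[0] == i:  # in-sync advance, no lookup
            tail = nxt[1]
            nxt = next(pending, None)
        yield (bytes([first]) + tail).decode("utf-8")

# Two-part layout for "naïve": 'ï' (U+00EF) is 0xC3 0xAF in UTF-8
firsts = [0x6E, 0x61, 0xC3, 0x76, 0x65]
extras = {2: b"\xaf"}
assert "".join(iterate_chars(firsts, extras)) == "naïve"
```

A foreach construct over such a string could compile down to exactly this merge loop, which is why sequential iteration stays cheap even when random-access indexing is the layout's main selling point.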
April 01, 2003 Re: Unicode Character and String Intrinsics
Posted in reply to Matthew Wilson

"Matthew Wilson" <dmd@synesis.com.au> wrote in message news:b6bgt5$1sai$1@digitaldaemon.com...
> This sounds like a nice idea - array of 1st-byte plus lookups. I'm intrigued as to the nature of the lookup table. Is this a constant, process-wide entity?

No, because the map is indexed by the same index used to index into the flat array. Unless I'm misunderstanding something.

Perhaps these could be grouped into separate maps by the total size of the char, which I think is determinable from the first char? May speed lookups a tad, or slow them down, not sure.

Sean
Copyright © 1999-2021 by the D Language Foundation