January 18, 2003
"Ben Hinkle" <bhinkle@mathworks.com> wrote in message news:b0bvoh$1hm5$1@digitaldaemon.com...
> I've gotten a little confused reading this thread. Here are some questions
> swimming in my head:
> 1) What does it mean to make UTF-8 the native type?

From a compiler standpoint, all it really means is that string literals are encoded as UTF-8. The real support for it will be in the runtime library, such as UTF-8 support in printf().
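
For example (just illustrating the encoding - the literal syntax itself doesn't change):

    char[] s = "héllo";   // the 'é' is stored as the two bytes 0xC3 0xA9
    // s.length is 6: five characters, six bytes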

> 2) What is char.size?

It'll be 1.

> 3) Does char[] differ from byte[] or is it a typedef?

It differs in that it can be overloaded differently, and the compiler recognizes char[] as special when doing casts to other array types - it can do conversions between UTF-8 and UTF-16, for example.
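
For instance, these would resolve as distinct overloads instead of colliding:

    void put(byte[] b);   // raw bytes
    void put(char[] s);   // string data - a separate overload from byte[]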

> 4) How does one get a UTF-16 encoding of a char[],

At the moment, I'm thinking:
    wchar[] w;
    char[] c;
    w = cast(wchar[])c;
to do a UTF-8 to UTF-16 conversion.

> or get the length,

To get the length in bytes:
    c.length
to get the length in UCS-4 characters, perhaps:
    c.nchars ??
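
Counting characters rather than bytes means skipping the UTF-8 continuation bytes, so nchars (just a strawman name) would amount to something like:

    uint nchars(char[] c)
    {
        uint n = 0;
        for (uint i = 0; i < c.length; i++)
            if ((c[i] & 0xC0) != 0x80)   // don't count continuation bytes
                n++;
        return n;
    }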

> or get
> the 5th character, or set the 5th character to a given unicode character
> (expressed in UTF-16, say)?

Probably a library function.


January 18, 2003
The best way to handle Unicode is, as a previous poster suggested, to make UTF-16 the default and tack on ASCII conversions in the runtime library.  Not the other way around.  Legacy stuff should be runtime lib, modern stuff built-in.  Otherwise we are building a language on outdated standards.

I don't like typecasting hacks or half-measures.  Besides, typecasting by definition should not change the size of its argument.

Mark


January 18, 2003
Walter wrote:
> "Ben Hinkle" <bhinkle@mathworks.com> wrote in message
> news:b0bvoh$1hm5$1@digitaldaemon.com...
> 
>>4) How does one get a UTF-16 encoding of a char[],
> 
> 
> At the moment, I'm thinking:
>     wchar[] w;
>     char[] c;
>     w = cast(wchar[])c;
> to do a UTF-8 to UTF-16 conversion.

This is less complex than "w = toWideStringz(c);" somehow?  I can't speak for anyone else, but this won't help my work with dig at all - I already have to preprocess any strings sent to the API with toStringz, while the public interface will still use char[].  So constant casting is the name of the game by necessity, and if I want to be conservative I have to cache the conversion and delete it anyway.  Calling these APIs directly - the case where this casting becomes a win - just doesn't happen for me.
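
For the record, the dance I mean looks like this (someApiCall is a stand-in for whatever C-style entry point dig wraps):

    char[] c;                 // the public interface trades in char[]
    char* p = toStringz(c);   // zero-terminated copy for the C-style API
    someApiCall(p);           // hypothetical API entry point
    // being conservative means caching p per string and deleting it later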

January 18, 2003
Walter wrote:
> "Daniel Yokomiso" <daniel_yokomiso@yahoo.com.br> wrote in message
> news:b0bpq9$1d3d$1@digitaldaemon.com...
> 
>>Current D uses char[] as the string type. If we declare each char to be
>>UTF-8 we'll have all the problems with what "myString[13] = someChar;"
>>means. I think an opaque string datatype may be better in this case. We could
>>have a glyph datatype that represents one Unicode glyph in UTF-8 encoding,
>>and use it together with a string class.
> 
> I'm thinking that myString[13] should simply set the byte at myString[13].
> Trying to fiddle with the multibyte stuff with simple array access semantics
> just looks to be too confusing and error-prone. Accessing the Unicode
> characters in it would be via a function or property.

I disagree.  Returning the character makes indexing expensive, but it gives the expected result and for the most part hides the fact that compaction is going on automatically; the only rule change is that indexed assignment can invalidate any slices and copies, which isn't any worse than D's current rules.  Then char.size will be 4 and char.max will be 0x10FFFF or 0x7FFFFFFF, depending upon whether we use Unicode or ISO 10646 for our UTF-8.
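
The read side isn't even that expensive; the per-index decode amounts to something like this (a sketch, ignoring the old 5- and 6-byte ISO 10646 forms):

    // decode the sequence starting at byte offset i, advancing i past it
    uint decode(char[] s, inout uint i)
    {
        uint c = s[i++];
        if (c < 0x80)
            return c;                        // plain ASCII, one byte
        int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
        c &= (1 << (6 - extra)) - 1;         // keep the lead byte's payload bits
        while (extra--)
            c = (c << 6) | (s[i++] & 0x3F);  // fold in continuation bytes
        return c;
    }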

I also think that incrementing a char pointer should read the data to determine how many bytes it needs to skip.  It should be as transparent as possible!  If it can't be transparent, then it should use a class or be limited: no indexing, no char pointers.  I don't like either option.
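
The skip itself is cheap, since the lead byte alone says how far to move:

    // sketch: byte count of the sequence that starts with lead byte b
    int stride(char b)
    {
        if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx: ASCII
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        return 4;                          // 11110xxx (and longer ISO forms)
    }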

[snip]

January 18, 2003
"Walter" <walter@digitalmars.com> wrote in message news:b0c66n$1mq6$2@digitaldaemon.com...
>
> "Ben Hinkle" <bhinkle@mathworks.com> wrote in message news:b0bvoh$1hm5$1@digitaldaemon.com...
> > I've gotten a little confused reading this thread. Here are some questions
> > swimming in my head:
> > 1) What does it mean to make UTF-8 the native type?
>
> From a compiler standpoint, all it really means is that string literals are
> encoded as UTF-8. The real support for it will be in the runtime library,
> such as UTF-8 support in printf().
>
> > 2) What is char.size?
>
> It'll be 1.

D'oh! char.size=8 is a tad big ;)

> > 3) Does char[] differ from byte[] or is it a typedef?
>
> It differs in that it can be overloaded differently, and the compiler
> recognizes char[] as special when doing casts to other array types - it can
> do conversions between UTF-8 and UTF-16, for example.

The semantics of casting (across all of D) need to be nice and predictable. I'd hate to track down a bug because a cast I thought was trivial turned out to allocate new memory and copy data around...

> > 4) How does one get a UTF-16 encoding of a char[],
>
> At the moment, I'm thinking:
>     wchar[] w;
>     char[] c;
>     w = cast(wchar[])c;
> to do a UTF-8 to UTF-16 conversion.
>
> > or get the length,
>
> To get the length in bytes:
>     c.length
> to get the length in UCS-4 characters, perhaps:
>     c.nchars ??

Could arrays (or some types that want to have array-like behavior) have some semantics that distinguish between the memory layout and the array indexing and length? Another example of this comes up in sparse matrices, where you want to have an array-like thing that has a non-trivial memory layout. Perhaps not full-blown operator overloading for [] and .length, etc - but some kind of special syntax to differentiate between running around in the memory layout and running around in the "high-level interface".
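
Something like this split, in class form (the names are made up, just to show the distinction):

    class SparseVector
    {
        // memory layout: only the nonzero entries are stored
        uint[]   index;    // logical position of each stored entry
        double[] value;    // the stored entries themselves
        uint     length;   // logical length - not value.length

        // "high-level interface": logical indexing
        double get(uint i)
        {
            for (uint j = 0; j < index.length; j++)
                if (index[j] == i)
                    return value[j];   // stored entry
            return 0;                  // anything not stored reads as zero
        }
    }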

> > or get
> > the 5th character, or set the 5th character to a given unicode character
> > (expressed in UTF-16, say)?
>
> Probably a library function.
>
>


January 18, 2003
You're probably right: the typecasting hack is inconsistent enough with the way the rest of the language works that it's a bad idea.

As for why UTF-16 instead of UTF-8, why do you find it preferable?

"Mark Evans" <Mark_member@pathlink.com> wrote in message news:b0ccek$1qnh$1@digitaldaemon.com...
> The best way to handle Unicode is, as a previous poster suggested, to make
> UTF-16 the default and tack on ASCII conversions in the runtime library.  Not
> the other way around.  Legacy stuff should be runtime lib, modern stuff
> built-in.  Otherwise we are building a language on outdated standards.
>
> I don't like typecasting hacks or half-measures.  Besides, typecasting by definition should not change the size of its argument.
>
> Mark
>
>


January 18, 2003
"Walter" <walter@digitalmars.com> escreveu na mensagem news:b0c66n$1mq6$1@digitaldaemon.com...
>
> "Daniel Yokomiso" <daniel_yokomiso@yahoo.com.br> wrote in message news:b0bpq9$1d3d$1@digitaldaemon.com...
> >     Current D uses char[] as the string type. If we declare each char to be
> > UTF-8 we'll have all the problems with what "myString[13] = someChar;"
> > means. I think an opaque string datatype may be better in this case. We
> > could have a glyph datatype that represents one Unicode glyph in UTF-8
> > encoding, and use it together with a string class.
>
> I'm thinking that myString[13] should simply set the byte at myString[13].
> Trying to fiddle with the multibyte stuff with simple array access semantics
> just looks to be too confusing and error-prone. Accessing the Unicode
> characters in it would be via a function or property.

That's why I think it should be an opaque, immutable data type.
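
Roughly this shape (a sketch of the idea, not a full design):

    // opaque: the bytes never escape; immutable: no setters at all
    class String
    {
        private char[] data;   // UTF-8, never touched after construction
        private uint   hash;   // cacheable precisely because data is fixed

        this(char[] utf8) { data = utf8.dup; }
    }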

>
> > Also I don't think a mutable string
> > type is a good idea. Python and Java use immutable strings, and this leads
> > to better programs (you don't need to worry about copying your strings when
> > you get or give them). Some nice tricks, like caching hashCode results for
> > strings, are possible, because the values won't change. We could also
> > provide a mutable string class.
>
> I think the copy-on-write approach to strings is the right idea.
> Unfortunately, if done by the language semantics, it can have severe adverse
> performance results (think of a toupper() function, copying the string again
> each time a character is converted). Using it instead as a coding style,
> which is currently how it's done in Phobos, seems to work well. My JavaScript
> implementation (DMDScript) does cache the hash for each string, and that
> works well for the semantics of JavaScript. But I don't think it is
> appropriate for a lower-level language like D to do as much for strings.
>
>
> >     If this is the way to go we need lots of test cases, especially from
> > people with experience writing Unicode libraries. The Unicode spec has lots
> > of particularities, like correct regular expression support, that may lead
> > to subtle bugs.
>
> Regular expression implementations naturally lend themselves to subtle bugs
> :-(. Having a good test suite is a lifesaver.
>

Not if you write a "correct" regular expression implementation. If you implement it right from scratch, using simple NFAs, you probably won't have any headaches. I've implemented a toy regex machine in Java based on Mark Jason Dominus's excellent article "How Regexes Work" (http://perl.plover.com/Regex/). It's very simple and quite fast considering it's a dumb implementation without any kind of optimizations (4 times slower than a fast bytecode regex interpreter in Java, http://jakarta.apache.org/regexp/index.html). The source code is also many times cleaner. BTW, I've written a unit test suite based on the Jakarta Regexp set of tests. I can port it to D if you like, so you can use it with your regex implementation.
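
To give the flavor of the state-set simulation Dominus describes, here's a toy hard-coded NFA for "ab*c" in D (the real thing builds the tables from the parsed regex):

    int match(char[] s)
    {
        // states: 0 --a--> 1,  1 --b--> 1,  1 --c--> 2 (accepting)
        int[3] live;
        int[3] next;
        live[0] = 1;                      // start in state 0
        for (uint i = 0; i < s.length; i++)
        {
            for (int j = 0; j < 3; j++)
                next[j] = 0;
            if (live[0] && s[i] == 'a') next[1] = 1;
            if (live[1] && s[i] == 'b') next[1] = 1;
            if (live[1] && s[i] == 'c') next[2] = 1;
            live[] = next[];              // advance every live state at once
        }
        return live[2];                   // did we end in the accepting state?
    }

Because the whole state set advances together, there's no backtracking, which is why even a dumb implementation behaves predictably.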




January 19, 2003
"Daniel Yokomiso" <daniel_yokomiso@yahoo.com.br> wrote in message news:b0cond$222q$1@digitaldaemon.com...
> BTW I've written a unit test suite based on
> Jakarta Regexp set of tests. I can port it to D if you like and use it with
> your regex implementation.

At the moment I'm using Spencer's regex test suite augmented with a bunch of new test vectors. More testing is better, so yes I'm interested in better & more comprehensive tests.


January 19, 2003
"Burton Radons" <loth@users.sourceforge.net> wrote in message news:b0cgdd$1t4o$1@digitaldaemon.com...
> Walter wrote:
> > "Daniel Yokomiso" <daniel_yokomiso@yahoo.com.br> wrote in message news:b0bpq9$1d3d$1@digitaldaemon.com...
> >
> >>Current D uses char[] as the string type. If we declare each char to be
> >>UTF-8 we'll have all the problems with what "myString[13] = someChar;"
> >>means. I think an opaque string datatype may be better in this case. We
> >>could have a glyph datatype that represents one Unicode glyph in UTF-8
> >>encoding, and use it together with a string class.
> >
> > I'm thinking that myString[13] should simply set the byte at myString[13].
> > Trying to fiddle with the multibyte stuff with simple array access
> > semantics just looks to be too confusing and error-prone. Accessing the
> > Unicode characters in it would be via a function or property.
>
> I disagree.  Returning the character makes indexing expensive, but it gives
> the expected result and for the most part hides the fact that compaction is
> going on automatically; the only rule change is that indexed assignment can
> invalidate any slices and copies, which isn't any worse than D's current
> rules.  Then char.size will be 4 and char.max will be 0x10FFFF or
> 0x7FFFFFFF, depending upon whether we use Unicode or ISO 10646 for our
> UTF-8.
>
> I also think that incrementing a char pointer should read the data to
> determine how many bytes it needs to skip.  It should be as transparent as
> possible!  If it can't be transparent, then it should use a class or be
> limited: no indexing, no char pointers.  I don't like either option.

Obviously, this needs more thought by me.


January 19, 2003
"Walter" <walter@digitalmars.com> wrote in message news:b0c66n$1mq6$1@digitaldaemon.com...
> I think the copy-on-write approach to strings is the right idea.
> Unfortunately, if done by the language semantics, it can have severe adverse
> performance results (think of a toupper() function, copying the string again
> each time a character is converted). Using it instead as a coding style,

Copy-on-write usually doesn't copy unless there's more than one live reference to the string.  If you're actively modifying it, it'll only make one copy until you distribute the new reference.  Of course that means reference counting.  Perhaps the GC could store info about string use.
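
In other words (the refs parameter stands in for whatever bookkeeping the GC or a refcount would provide):

    // sketch: copy only when the string is actually shared
    void setByte(inout char[] s, inout uint refs, uint i, char c)
    {
        if (refs > 1)      // someone else can see s: take a private copy
        {
            s = s.dup;
            refs = 1;
        }
        s[i] = c;          // sole owner now: mutate in place, no copy
    }

A toupper() loop would then pay for one copy on its first write, not one per character.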