November 25, 2005
Derek Parnell wrote:
> On Fri, 25 Nov 2005 05:16:15 +0200, Georg Wrede wrote:
>>Derek Parnell wrote:
>>>On Fri, 25 Nov 2005 02:21:52 +0200, Georg Wrede wrote:
>>
>>>>>I like "uchar". I agree "char" should go back to being C's char
>>>>>type. I don't think we need a char[] all the C functions expect a
>>>>>null terminated char*.
>>>>
>>>>That would be nice! What if we even decided that char[] is null-terminated? That'd massively reduce all kinds of bugs when
>>>>(under pressure) converting code from C(++)!
>>>
>>>I think that would interfere with the slice concept.
>>>  char[] a = "some text";
>>>  char[] b = a[4 .. 7];
>>>  // Making 'b' a reference into 'a'.
>>
>>Slicing C's char[] implies byte-wide, and non-UTF.
> 
> 
> Exactly, and that's why I'm worried by the suggestion that char[] be
> automatically zero-terminated, because slices are usually not
> zero-terminated. 
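
Derek's slice example makes the conflict concrete. A short sketch in present-day D (string literals are immutable now, hence the .dup):

```d
import std.stdio;

void main()
{
    char[] a = "some text".dup;  // mutable copy of the literal
    char[] b = a[0 .. 4];        // "some": a view into a's buffer, no copy

    // If char[] were implicitly zero-terminated, b would need a '\0'
    // at a[4] -- but that byte is the space inside a, so terminating
    // the slice would corrupt the original string.
    writeln(b);                  // prints "some"
    assert(a == "some text");    // intact only because no '\0' was written
}
```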

With what we're doing with UTF, it would be a small additional job to have the "C char" arrays take care of the null byte at the end.

So the programmer would not have to think about it.

(I admit this takes some further thinking first! So you are right in your concerns!)
November 25, 2005
Derek Parnell wrote:
> On Thu, 24 Nov 2005 22:49:36 +1300, Regan Heath wrote:
> 
> 
>>Derek, I must have done a terrible job explaining this, because you've completely misunderstood me; in fact, your counter-proposal is essentially what my proposal was intended to be.
> 
> 
> You seemed to want data types that could only hold character
> fragments. I can't see the point of that.
> 
> If strings must be arrays, then let there be an atomic data type that
> represents a character and then strings can be arrays of characters. The
> UTF encoding of the string is just an implementation detail then. All
> indexing would be done on a character basis regardless of the underlying
> encoding. 
> 
> In other words, if 'uchar' is the data type that holds a character then 
> 
>   alias uchar[] string;
> 
> could be used. The main 'new thinking' is that uchar could actually have
> variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on
> the character it holds and the encoding used at the time.
> 
> However I still prefer my earlier suggestion.
>  
> 
Whoa, did you ever stop to think on the implications of having a
primitive type with *variable size* ? It's plain nuts to implement, no
wait, it's actually downright impossible.
If you have a uchar variable (not an array), how much space do you
allocate for it, if it has variable-size?
The only way to implement this would be with a fixed-size equal to the
max possible size (4 bytes). That would be a dchar then...


-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
November 25, 2005
In article <ops0rhghyl23k2f5@nrage.netwin.co.nz>, Regan Heath says...
>
>On Thu, 24 Nov 2005 12:01:04 +0100, Oskar Linde <oskar.lindeREM@OVEgmail.com> wrote:
>>>> As I see it, there are only two views you need on a unicode string:
>>>> a) The code units
>>>> b) The unicode characters
>>>  (a) is seldom required. (b) is the common and thus goal view IMO.
>>
>> Actually, I think it is the other way around. (b) is seldom required. You can search, split, trim, parse, etc., on D's char[] without any regard for encoding.
>
>If you split it without regard for encoding you can get 1/2 a character, which is then an illegal UTF-8 sequence.

By split, I meant this:
char[][] words = "abc def ghi åäö jkl".split(" ");
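
That kind of split is safe precisely because of how UTF-8 is laid out: every code unit of a multi-byte sequence has its high bit set, so an ASCII delimiter byte can never occur inside a character. A quick check in present-day D (std.array.split and std.utf.validate are today's Phobos names):

```d
import std.array : split;
import std.utf : validate;  // throws a UTFException on malformed UTF-8

void main()
{
    string[] words = "abc def ghi åäö jkl".split(" ");
    assert(words.length == 5);
    assert(words[3] == "åäö");  // the multi-byte word came through intact
    foreach (w; words)
        validate(w);            // every piece is still well-formed UTF-8
}
```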

>> This is the beauty of UTF-8 and the reason D strings all work on code units rather than characters.
>
>But people don't care about code units, they care about characters. [...]

Most of the time people care about string contents, neither code units nor
characters.
Naturally, I'm biased by my own experience. I have written a few applications in
D dealing with UTF-8 data, including parsing grammar definition files and
communicating with web servers, but not once have I needed character based
indexing. One reason may be that all delimiters used are ASCII, and therefore
only occupy a single code unit, but I would assume that this is typical for most
data.

When dealing with UTF-8 streams, you want searching and parsing to work on indices (positions) within this stream, not on the character count up to this position. A code unit index gives you the direct byte position of the stream, whereas a character index would require iterating the entire stream up to the indexed position. The performance difference is hardly negligible.
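
The two index types can be contrasted directly: a code-unit index is plain pointer arithmetic, while a character index has to decode the whole prefix. A sketch in present-day D (std.utf.stride reports how many code units the code point at a given position occupies):

```d
import std.utf : stride;

/// Translate a character index into a code-unit (byte) index: O(n),
/// because every code point before it must be stepped over.
size_t charToByteIndex(string s, size_t charIndex)
{
    size_t byteIndex = 0;
    foreach (_; 0 .. charIndex)
        byteIndex += stride(s, byteIndex);  // skip one whole code point
    return byteIndex;
}

void main()
{
    string s = "smörgåsbord";
    // The 4th character, 'r' (character index 3), lives at byte index 4,
    // because 'ö' occupies two code units.
    assert(charToByteIndex(s, 3) == 4);
    assert(s[4] == 'r');  // the code-unit index is a direct lookup
}
```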

>[...] When do you want to inspect or modify a single code unit? I would say, just about never. On the other hand you might want to change the 4th character of "smörgåsbord" which may be the 4th, 5th, 6th, or 7th index in a char[] array. Ick.

How often do you need to change the 4th character of a string? I think that
scenario is just as unlikely.
Of course, there are cases where you need character based access, and that is
what dchar[] is ideal for*. If you instead want to sacrifice performance for
better memory footprint, use a wrapper class. What I don't agree with is making
this sacrifice in performance the default, when its gains are so seldom needed.

*) In many cases, such as word processors and similar, you need more efficient data structures than flat arrays. A basic character based string would not be of much help.

>> When would you actually need character based indexing?
>> I believe the answer is less often than you think.
>
>Really? How often do you care what UTF-8 fragments are used to represent your characters? The answer _should_ be never, however D forces you to know, you cannot replace the 4th letter of "smörgåsbord" without knowing. This is the problem, IMO.

I very seldom, if ever, care what UTF-8 fragments are used to represent the data, as long as I know that ASCII characters (those whose character literals are assignable to a char) are represented by a single code unit.

You say that users of char[] need to know some things about UTF-8, and I can't argue with that. Maybe the docs should recommend dchar[] for users that want to remain UTF ignorant. :)

Regards,

Oskar


November 25, 2005
"Oskar Linde" <oskar.lindeREM@OVEgmail.com> wrote
> Most of the time people care about string contents, neither code units nor
> characters.
> Naturally, I'm biased by my own experience. I have written a few
> applications in
> D dealing with UTF-8 data, including parsing grammar definition files and
> communicating with web servers, but not once have I needed character based
> indexing. One reason may be that all delimiters used are ASCII, and
> therefore
> only occupy a single code unit, but I would assume that this is typical
> for most
> data.

Absolutely right. This is why, for example, URI classes will remain char[] based. IRI extensions are applied simply by assuming the content is UTF-8.


>>[...] When
>>do you want to inspect or modify a single code unit? I would say, just
>>about never. On the other hand you might want to change the 4th character
>>of "smörgåsbord" which may be the 4th, 5th, 6th, or 7th index in a char[]
>>array. Ick.
>
> How often do you need to change the 4th character of a string? I think
> that
> scenario is just as unlikely.
> Of course, there are cases where you need character based access, and that
> is
> what dchar[] is ideal for*. If you instead want to sacrifice performance
> for
> better memory footprint, use a wrapper class. What I don't agree with is
> making
> this sacrifice in performance the default, when its gains are so seldom
> needed.
>
> *) In many cases, such as word processors and similar, you need more
> efficient
> data structures than flat arrays. A basic character based string would not
> be of
> much help.

Right on. Such things are very much application specific. That, IMO, is where much of the general confusion stems from.




November 25, 2005
On Fri, 25 Nov 2005 14:06:34 +0000, Bruno Medeiros wrote:

[snip]

>> could be used. The main 'new thinking' is that uchar could actually have variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on the character it holds and the encoding used at the time.
>> 
>> However I still prefer my earlier suggestion.
>> 
>> 
> Whoa, did you ever stop to think on the implications of having a
> primitive type with *variable size* ? It's plain nuts to implement, no
> wait, it's actually downright impossible.
> If you have a uchar variable (not an array), how much space do you
> allocate for it, if it has variable-size?
> The only way to implement this would be with a fixed-size equal to the
> max possible size (4 bytes). That would be a dchar then...

Well, not *actually* impossible, but certainly something you'd only do if you didn't care about performance. However, I was really talking at a conceptual level rather than an implementation level. As you and others have said, it would most likely be implemented as a 32-bit unsigned integer; however, certain bits are redundant and are thus (conceptually) not significant.

And as I have said earlier, I already work in such a world. The Euphoria programming language only has 32-bit characters.

-- 
Derek Parnell
Melbourne, Australia
26/11/2005 8:40:39 AM
November 26, 2005
Derek Parnell wrote:
> On Fri, 25 Nov 2005 14:06:34 +0000, Bruno Medeiros wrote:
> 
> [snip]
> 
> 
>>>could be used. The main 'new thinking' is that uchar could actually have
>>>variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on
>>>the character it holds and the encoding used at the time.
>>>
>>>However I still prefer my earlier suggestion.
>>> 
>>>
>>
>>Whoa, did you ever stop to think on the implications of having a
>>primitive type with *variable size* ? It's plain nuts to implement, no
>>wait, it's actually downright impossible.
>>If you have a uchar variable (not an array), how much space do you
>>allocate for it, if it has variable-size?
>>The only way to implement this would be with a fixed-size equal to the
>>max possible size (4 bytes). That would be a dchar then...
> 
> 
> Well not *actually* impossible but certainly something you'd only do if you
> didn't care about performance. 
Another alternative would be to use a reference type, but that would use
even more space. I honestly don't see how it could be done while using
less than 4 bytes and maintaining all other D features/properties
(performance not considered).

> However, I was really talking on a
> conceptual level rather than an implementation level. As you and others
> have said, it would most likely be implemented as a 32-bit unsigned integer
> however certain bits are redundant and are thus (conceptually) not
> significant.
> 
Thus one would have the dchar type, and this new "Unified String" would
simply be a dchar[]. I've only skimmed through this discussion, but
people (Regan & others?) wanted a string type that was space-efficient,
allowing itself to be encoded in UTF-8, UTF-16, etc., so
dchar[]/uchar[] would not be acceptable. Unless you wanted this uchar[]
to be a basic type by itself, and not an array of basic uchars (which
would work, but would be a horrible design).

--

In fact, I'm gonna go a bit into rant mode here (not directed at you
in particular, Derek): I've skimmed through this whole series of
threads about Unicode and strings, and I'm getting a bit pissed with all
of these meaningless posts based on wrong assumptions, wrong
terminology, and crazy or unfeasible language changes, all for a
problem that, as far as I can see, could be fully solved with a
dchar array or with a custom-made String class (custom-made, that is,
*user-coded*, not part of the language). I admit I have no Unicode
coding experience, so indeed *I may be* missing something, but in every
new thread all I see is progressively more crazy, ridiculous ideas
about a problem I do not see.

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to
be... unnatural."

November 26, 2005
On Sat, 26 Nov 2005 13:30:28 +0000, Bruno Medeiros wrote:


[snip]
> Thus one would have the dchar type, and this new "Unified String" would simply be a dchar[].
[snip]
> I've skimmed through this whole series of
> threads about Unicode and strings, and
[snip]
> I have yet to grasp why it cannot be fully solved with a
> dchar array or with a custom-made String class (custom-made, that is,
> *user-coded*, not part of the language). I admit I have no Unicode
> coding experience, so indeed *I may be* missing something, but on every
> new thread made all I see is progressively more crazy, ridiculous ideas
> about a problem I do not see.

I agree that it is much better to identify the problem *before* one tries to fix it. I'm sure Walter has been having a nice little chuckle at our meandering ways.

I see 'the problem' as ...

** We have a choice about the representation of strings in D, and thus at times we introduce a degree of ambiguity into our code that the compiler has trouble resolving.

** The 'char' data type is performing multiple roles. In one aspect it is a fragment of a character in a UTF-8 encoded string, and in other aspects it is a byte-sized character for ASCII and C/C++ compatibility purposes. This can be confusing to coders not used to thinking internationally.

** Indexing strings that are based on 'char' and 'wchar' can cause bugs, because it is possible to access character fragments rather than complete characters.
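
That fragment hazard is easy to show (a sketch in present-day D):

```d
void main()
{
    string s = "smörgåsbord";
    // s[2] is NOT the character 'ö': it is the first of the two code
    // units that encode it (0xC3), i.e. a character fragment.
    assert(s[2] == 0xC3);
    assert(s[2 .. 4] == "ö");  // both units together form the character
}
```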

There are some other issues which are not language related but have to do with string manipulation routines that assume ASCII-only strings - such as the 'strip()' function, which recognizes only the ASCII white-space characters, not all the Unicode ones.
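
The white-space point can be illustrated with the two tests present-day Phobos provides (std.ascii.isWhite versus the Unicode-aware std.uni.isWhite; these modern module names are used here only for illustration, not the 2005 library under discussion):

```d
import std.ascii : asciiWhite = isWhite; // ASCII-only white-space test
import std.uni : isWhite;                // Unicode-aware white-space test

void main()
{
    dchar nbsp = '\u00A0';     // NO-BREAK SPACE
    assert(!asciiWhite(nbsp)); // an ASCII-only strip() would keep it
    assert(isWhite(nbsp));     // a Unicode-aware strip() removes it
}
```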

-- 
Derek Parnell
Melbourne, Australia
27/11/2005 7:40:57 AM