String Class (was other things)
July 28, 2004
Just to tidy up a few loose ends...

The Unicode Consortium /claim/ that the definition of a character is "The smallest component of written language that has semantic value", but in fact, a more accurate definition would be "a character is anything that the Unicode Consortium say is a character". This is because when Unicode was first invented, it wanted to be a superset of all other character standards, so it imported "characters" from all of those other standards - no matter that those legacy standards conflicted with each other. This is just "historical baggage", and we now have to live with it.

A Unicode string can be interpreted on many levels. As you have pointed out, char[] arrays, etc., work at the level of code units. You say that it would be more appropriate to work at the character level. That's a reasonable point of view, but the truth is we can already do this. A simple dchar[] array will do the job. Or, if you want to stick with char[] arrays, there are functions in std.utf which can cope with this level of abstraction quite happily.
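For instance, a minimal sketch of character-level indexing into a char[] (assuming std.utf's toUTFindex and decode keep their usual signatures; nthChar is just an illustrative name):

#    import std.utf;
#
#    // Return the n-th /character/ (code point) of a UTF-8 string.
#    dchar nthChar(char[] s, uint n)
#    {
#        uint i = toUTFindex(s, n);  // character index -> code unit index
#        return decode(s, i);        // decode one whole UTF-8 sequence
#    }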

But I don't think that even the /character/ level is particularly useful, myself. I'd say you have to go up by at least one further level of abstraction - to the /grapheme/ level - before you start dealing with what we /really/ think of as a string. Let me give you an example. Consider the following two UTF-8 sequences:

(a) 0x63 0x61 0x66 0xC3 0xA9
(b) 0x63 0x61 0x66 0x65 0xCC 0x81

Clearly these are different sequences of code units. They have different lengths, for a start. If stored in char[] arrays then a.length would be 5 and b.length would be 6. So now let's look at them, not at the code unit level, but at the character level. Now we have:

(a) "caf\u00E9"
(b) "cafe\u0301"

Again, these are different sequences of characters. The first one is four characters long, and contains the character U+00E9 (lowercase e with acute accent). The second one is five characters long, and contains the character U+0301 (combining acute accent). These strings would not compare equal either as char[]s or as dchar[]s. But now let's look at them at the /grapheme/ level. Now it gets interesting, because here we get:

(a) "café"
(b) "café"

This is what you /see/ when the string is rendered in a plain text editor such as we programmers use. The strings look alike, and, conceptually, they /are/ identical at this level of abstraction. (They are "canonically equivalent", to be pedantic). They both contain exactly four graphemes, and the final grapheme is "é" in both cases. So - are you /sure/ that the number of "characters" is the RIGHT THING to be focusing on?
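To make the two lower levels concrete, here is a minimal sketch (assuming std.utf.toUTF32) that verifies the lengths and inequality described above; a grapheme-level comparison would additionally need the normalization machinery mentioned below:

#    import std.utf;
#
#    void main()
#    {
#        char[] a = "caf\u00E9";   // precomposed: ends 0xC3 0xA9
#        char[] b = "cafe\u0301";  // decomposed: ends 0xCC 0x81
#        assert(a.length == 5 && b.length == 6);    // code units differ
#
#        dchar[] da = toUTF32(a);
#        dchar[] db = toUTF32(b);
#        assert(da.length == 4 && db.length == 5);  // characters differ
#        assert(a != b && da != db);                // unequal at both levels
#    }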

But it doesn't even stop there. There's another level of abstraction even beyond graphemes. These objects are called "glyphs", and they are what you would see in a non-plain text renderer, such as a web browser or Micro$oft Word. You can make a glyph by joining one or more graphemes together by using join-controls. For example, the character/grapheme/glyph 'æ' (U+00E6) is a ligature formed by ligating 'a' and 'e'. It is the same glyph as the one formed from 'a' followed by U+200D (ZERO WIDTH JOINER) followed by 'e', but you'll need a good text rendering engine before they'll look identical.

Different levels are important for different people. Right now, D gives you the code unit level in char[], wchar[] and dchar[] arrays, and the character level in dchar[] arrays. Phase 2 of etc.unicode, when it arrives, will give you functions enabling you to work at the grapheme level. All of these different conceptual viewpoints could, in theory, be plumbed into the workings of a string class, but only after the Unicode algorithms are actually implemented in etc.unicode.

I'm not trying to throw a spanner into the works - I just wanted to challenge the viewpoint that defining char as a UTF-8 fragment makes D deficient somehow. I don't believe it does, and, even if that low level is not one that you would use, there certainly will be programmers who need to work at that level.

Arcane Jill


July 28, 2004
Arcane Jill wrote:
> Just to tidy up a few loose ends...
> 
> The Unicode Consortium /claim/ that the definition of a character is "The
> smallest component of written language that has semantic value", but in fact, a
> more accurate definition would be "a character is anything that the Unicode
> Consortium say is a character". This is because when Unicode was first invented,
> it wanted to be a superset of all other character standards, so it imported
> "characters" from all of those other standards - no matter that those legacy
> standards conflicted with each other. This is just "historical baggage", and we
> now have to live with it.
> 

>
> [lots of stuff]
>
I cut the large part of your post which explains the details of
the relationships:

    Character has a UTFImplementation
    Grapheme is a Character
    Glyph is a Grapheme

Which suggests an OO implementation of:
================================
    interface UTFImplementation { /* ??? */ }
    class Character : UTFImplementation { /* ??? */ }
    class Grapheme : Character { /* ??? */ }
    class Glyph : Grapheme { /* ??? */ }
================================

As I have said, any String class implementation that satisfies the most basic string operations is a step in the right direction. It sounds like you would like to see the following implementation:
================================
    interface String { /* ??? */ }
    class CharacterString : String { /* ??? */ }
    class GraphemeString : String { /* ??? */ }
    class GlyphString : String { /* ??? */ }

    or perhaps

    interface String { /* ??? */ }
    class CharacterString : String { /* ??? */ }
    class GraphemeString : CharacterString { /* ??? */ }
    class GlyphString : GraphemeString { /* ??? */ }
================================

> 
> Different levels are important for different people. Right now, D gives you the
> code unit level in char[], wchar[] and dchar[] arrays, and the character level
> in dchar[] arrays. Phase 2 of etc.unicode, when it arrives, will give you
> functions enabling you to work at the grapheme level. All of these different
> conceptual viewpoints could, in theory, be plumbed into the workings of a string
> class, but only after the Unicode algorithms are actually implemented in
> etc.unicode.

You are trying (unsuccessfully) to make use of the fact that UTF-32 code units (dchar) always correspond to a Character, but are missing the fact that they are still defined as code units in the D Docs Types section:
================================================================
    char    unsigned 8 bit UTF-8     init: 0xFF
    wchar   unsigned 16 bit UTF-16   init: 0xFFFF
    dchar   unsigned 32 bit UTF-32   init: 0x0000FFFF
================================================================
If what you were suggesting were really the case, then anybody who has a reason to use UTF-8 or UTF-16 would have to count the Characters by converting to UTF-32 and then checking the length of the resulting dchar array. This implementation is a crude hack that works by virtue of the fact that the number of code units in UTF-32 is guaranteed to be 1:1 with the Characters defined by the UC.
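In code, the hack amounts to something like this sketch (assuming std.utf.toUTF32) - an O(N) pass plus an O(N) allocation just to count Characters:
================================================================
    import std.utf;

    // Count the Characters in a UTF-8 string by round-tripping
    // through UTF-32.
    uint countChars(char[] s)
    {
        return toUTF32(s).length;
    }
================================================================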

> I'm not trying to throw a spanner into the works - I just wanted to challenge
> the viewpoint that defining char as a UTF-8 fragment makes D deficient somehow.

The reason D uses the precise name 'char' is a direct result of D's C heritage and has nothing to do with the unfortunate and misleading association 'char' sometimes has with Character.



July 28, 2004
Seems like more spam filters need to start tokenizing at the grapheme level... or perhaps one level higher, where char sequences that render similarly are compared similarly; maybe more like text recognition.
Arcane Jill wrote:

> think of as a string. Let me give you an example. Consider the following two
> UTF-8 sequences:
> 
> (a) 0x63 0x61 0x66 0xC3 0xA9
> (b) 0x63 0x61 0x66 0x65 0xCC 0x81
> 
> Clearly these are different sequences of code units. They have different
> lengths, for a start. If stored in char[] arrays then a.length would be 5 and
> b.length would be 6. So now let's look at them, not at the code unit level, but
> at the character level. Now we have:
> 
> (a) "caf\u00E9"
> (b) "cafe\u0301"
> 
> Again, these are different sequences of characters. The first one is four
> characters long, and contains the character U+00E9 (lowercase e with acute
> accent). The second one is five characters long, and contains the character
> U+0301 (combining acute accent). These strings would not compare equal either as
> char[]s or as dchar[]s. But now let's look at them at the /grapheme/ level. Now
> it gets interesting, because here we get:
> 
> (a) "café"
> (b) "café"
> 
July 28, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:ce869q$7o5$1@digitaldaemon.com...
> A Unicode string can be interpreted on many levels. As you have pointed out,
> char[] arrays, etc., work at the level of code units. You say that it would be
> more appropriate to work at the character level. That's a reasonable point of
> view, but the truth is we can already do this. A simple dchar[] array will do
> the job. Or, if you want to stick with char[] arrays, there are functions in
> std.utf which can cope with this level of abstraction quite happily.

There's a little-known feature of foreach loops:

    char[] s;
    foreach (dchar c; s)
    {
            ... c is each dchar decoded from s[] ...
    }

which is pretty handy for decoding UTF-8 strings. Of course, it also works for any combination of char, wchar, or dchar for either argument of foreach, making it pretty convenient for writing generic code.
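For example, a short sketch of counting the characters in a UTF-8 string with that feature (no explicit decoding calls needed):

    char[] s = "cafe\u0301";
    uint count = 0;
    foreach (dchar c; s)    // each c is one decoded code point
        count++;
    // count == 5, although s.length == 6 code units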


July 28, 2004
Walter wrote:

>>std.utf which can cope with this level of abstraction quite happily.
> 
> 
> There's a little-known feature of foreach loops:
> 
>     char[] s;
>     foreach (dchar c; s)
>     {
>             ... c is each dchar decoded from s[] ...
>     }
> 
> which is pretty handy for decoding UTF-8 strings. Of course, it also works
> for any combination of char, wchar, or dchar for either argument of foreach,
> making it pretty convenient for writing generic code.
> 

This was actually taken from a thread in which I explained some of the benefits of a class String implementation. It ended with this assessment:

So to recap in big O terms:
    1) The memory requirements of String are identical to UTF16
    2) For length()
       2a) The time requirements of String are 1
       2b) The time requirements of UTF16 are N
    3) For substring()
       3a) The time requirements of String are 1
       3b) The time requirements of UTF16 are N
July 28, 2004
In article <ce8pfd$fm5$1@digitaldaemon.com>, parabolis says...

>This was actually taken from a thread in which I explained some of the benefits of a class String implementation. It ended with this assessment:
>
>So to recap in big O terms:
>     1) The memory requirements of String are identical to UTF16
>     2) For length()
>        2a) The time requirements of String are 1
>        2b) The time requirements of UTF16 are N
>     3) For substring()
>        3a) The time requirements of String are 1
>        3b) The time requirements of UTF16 are N

This confuses me, but perhaps that's because I haven't understood how you imagine implementing such a class. Example - let me construct a string:

#    dchar[] s = new dchar[0x110000];
#    for (uint i=0; i<0x110000; ++i)
#    {
#        s[(i*0x10001L)%0x110000L] = cast(dchar) i;
#    }

If I've got my maths right, all elements of s will be assigned; all elements of s will be unique; and to compute the value of s[i] given only i is hard (requires a modinv(), which is expensive). So, how can any class store s in such a way as to guarantee the memory and time requirements which you claim?

Arcane Jill



July 28, 2004
In article <ce8pfd$fm5$1@digitaldaemon.com>, parabolis says...
>
>So to recap in big O terms:
>     1) The memory requirements of String are identical to UTF16
>     2) For length()
>        2a) The time requirements of String are 1
>        2b) The time requirements of UTF16 are N
>     3) For substring()
>        3a) The time requirements of String are 1
>        3b) The time requirements of UTF16 are N

Out of curiosity, why care about printable string length anyway?  Unless this is a UI component where I might want to restrict the number of printable characters, I can't think of any reason.


Sean


July 28, 2004
Sean Kelly wrote:

> In article <ce8pfd$fm5$1@digitaldaemon.com>, parabolis says...
> 
>>So to recap in big O terms:
>>    1) The memory requirements of String are identical to UTF16
>>    2) For length()
>>       2a) The time requirements of String are 1
>>       2b) The time requirements of UTF16 are N
>>    3) For substring()
>>       3a) The time requirements of String are 1
>>       3b) The time requirements of UTF16 are N
> 
> 
> Out of curiosity, why care about printable string length anyway?  Unless this is
> a UI component where I might want to restrict the number of printable characters,
> I can't think of any reason.

Sometimes it helps to get at the tail end of a string.  For example,
separating the extension from a file name:

String filename = "filename.txt";
String extension = filename.substring(filename.length() - 4);

(Using Java conventions).

It would be so much easier if there were a "tail" method.

dchar[] filename = "filename.txt";
dchar[] extension = tail(filename, 4);
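Such a tail could be a one-liner - a minimal sketch, with a bounds check as the only subtlety:

dchar[] tail(dchar[] s, uint n)
{
    // slicing is O(1); just guard against n exceeding the string length
    return (n >= s.length) ? s : s[s.length - n .. s.length];
}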

Most operating systems still don't have proper Unicode support, so a
filename with international characters is not all that likely in practice.

It's just one example of a need for proper string length detection that
has nothing to do with printing.
July 28, 2004
Arcane Jill wrote:

> This confuses me, but perhaps that's because I haven't understood how you
> imagine implementing such a class. Example - let me construct a string:
> 
> #    dchar[] s = new dchar[0x110000];
> #    for (uint i=0; i<0x110000; ++i)
> #    {
> #        s[(i*0x10001L)%0x110000L] = cast(dchar) i;
> #    }
> 
> If I've got my maths right, all elements of s will be assigned; all elements of
> s will be unique; and to compute the value of s[i] given only i is hard
> (requires a modinv(), which is expensive). So, how can any class store s in such
> a way as to guarantee the memory and time requirements which you claim?
> 

I am assuming you accept 2b) and 3b):
    1) The memory requirements of String are identical to UTF16
    2) For length()
       2a) The time requirements of String are 1
       2b) The time requirements of UTF16 are N
    3) For substring()
       3a) The time requirements of String are 1
       3b) The time requirements of UTF16 are N

To show that 1), 2a) and 3a) hold, I will make the example a bit simpler: let me throw away the sparse-array portion and just construct a class that stores Characters with two wchar arrays:
================================================================
class FatString {
    wchar[] loBits;  // low 16 bits of each Character
    wchar[] hiBits;  // high 16 bits of each Character

    private this( wchar[] lo, wchar[] hi ) {
        loBits = lo;
        hiBits = hi;
    }

    uint length() { return loBits.length; }

    FatString substring( uint len, uint off ) {
        // slicing copies nothing - each slice is an O(1) stub
        return new FatString(
            loBits[off..(off+len)],
            hiBits[off..(off+len)]
        );
    }
}
================================================================
Assuming we have N Characters to store...

The memory complexity is simply:

1) The FatString class uses 2*2*N bytes, which is big O of N. A wchar[] (UTF-16) uses at least 2*N bytes and at most 2*2*N bytes, which is also big O of N. Thus they have the same memory requirements.

Now for the time complexity:

2a) length() is a simple example. There is one statement with no tricky side effects, so length() has a big O time complexity of 1.

3a) substring() is a more interesting example. It is obvious that the big O of the constructor is 1. The wonder that is array slicing simply creates dynamic array stubs which point into the appropriate array with the appropriate length. Again, that is a big O of 1. Thus substring() has a big O of 1.
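For completeness, here is a sketch of how a FatString might be populated from UTF-32 data - this fromUTF32 factory is illustrative only (it would go inside the class, since the constructor is private) and is not part of the design above:
================================================================
    static FatString fromUTF32(dchar[] s)
    {
        wchar[] lo = new wchar[s.length];
        wchar[] hi = new wchar[s.length];
        for (uint i = 0; i < s.length; i++)
        {
            lo[i] = cast(wchar) (s[i] & 0xFFFF);  // low 16 bits
            hi[i] = cast(wchar) (s[i] >> 16);     // high 16 bits
        }
        return new FatString(lo, hi);
    }
================================================================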


July 28, 2004
Sean Kelly wrote:

> In article <ce8pfd$fm5$1@digitaldaemon.com>, parabolis says...
> 
>>So to recap in big O terms:
>>    1) The memory requirements of String are identical to UTF16
>>    2) For length()
>>       2a) The time requirements of String are 1
>>       2b) The time requirements of UTF16 are N
>>    3) For substring()
>>       3a) The time requirements of String are 1
>>       3b) The time requirements of UTF16 are N
> 
> 
> Out of curiosity, why care about printable string length anyway?  Unless this is
> a UI component where I might want to restrict the number of printable characters,
> I can't think of any reason.
> 

I tend to deal quite a bit with text data.