July 28, 2004
>> #    dchar[] s = new dchar[0x110000];
>> #    for (uint i=0; i<0x110000; ++i)
>> #    {
>> #        s[(i*0x10001L)%0x110000L] = cast(dchar) i;
>> #    }

Whoops! I just realized that this example contains invalid characters! Ho hum. (Hangs head in shame). Let's pretend no-one noticed and change it to:

#    dchar[] s = new dchar[0x110000];
#    for (uint i=0; i<0x110000; ++i)
#    {
#        dchar c = cast(dchar) i;
#        if (isValidDchar(c)) s[(i*0x10001L)%0x110000L] = c;
#    }

This doesn't change the following reasoning in any way, however.




In article <ce915n$jld$1@digitaldaemon.com>, parabolis says...

>> How can any class store s in such
>> a way as to guarantee the memory and time requirements which you claim?

>     1) The memory requirements of String are identical to UTF16

>1) The FatString class uses 2*2*N bytes or a big o N.
>A wchar[] uses at least 2*N bytes and at most 2*2*N which is
>also a big O N. Thus they have the same memory requirements.

The string s, when converted to wchar[], will require precisely 0x42000 bytes. Your FatString representing the same string will contain 0x44000 bytes. How is that identical?

Jill



July 28, 2004
Arcane Jill wrote:

>>>#    dchar[] s = new dchar[0x110000];
>>>#    for (uint i=0; i<0x110000; ++i)
>>>#    {
>>>#        s[(i*0x10001L)%0x110000L] = cast(dchar) i;
>>>#    }
> 
> 
> Whoops! I just realized that this example contains invalid characters! Ho hum.
> (Hangs head in shame). Let's pretend no-one noticed and change it to:
> 
> #    dchar[] s = new dchar[0x110000];
> #    for (uint i=0; i<0x110000; ++i)
> #    {
> #        dchar c = cast(dchar) i;
> #        if (isValidDchar(c)) s[(i*0x10001L)%0x110000L] = c;
> #    }
> 
> This doesn't change the following reasoning in any way, however.
> 
> 
> 
> 
> In article <ce915n$jld$1@digitaldaemon.com>, parabolis says...
> 
> 
>>>How can any class store s in such
>>>a way as to guarantee the memory and time requirements which you claim?
> 
> 
>>    1) The memory requirements of String are identical to UTF16
> 
> 
>>1) The FatString class uses 2*2*N bytes or a big o N.
>>A wchar[] uses at least 2*N bytes and at most 2*2*N which is also a big O N. Thus they have the same memory requirements.
> 
> 
> The string s, when converted to wchar[], will require precisely 0x42000 bytes.
> Your FatString representing the same string will contain 0x44000 bytes. How is
> that identical?
> 
> Jill

If you want to compare precise memory requirements in bytes then they do not have identical requirements. They have roughly identical requirements (which is the claim I made in another thread when not talking about big O).

The big O requirements are not precise. They are intended to give a feeling for how things compare as they get arbitrarily larger while at the same time simplifying certain issues.

How many bytes will a wchar[] with N characters use? If you knew the probability P(I) that the Ith character requires only 2 bytes then the answer would be simple: the expected size is just the sum over I of 2*P(I) + 4*(1 - P(I)) bytes. However you do not know P(I).

So how are we to compare how the memory requirements of a wchar[] generally compare to those of a FatString?
July 28, 2004
In article <ce9583$l7u$1@digitaldaemon.com>, parabolis says...

>So how are we to compare how the memory requirements of a wchar[] generally compare to those of a FatString?

sizeof wchar[] string <= sizeof dchar[] string == sizeof FatString

So far, FatString has shown no advantage over dchar[], and it looks a lot more complicated.

Jill


July 28, 2004
Arcane Jill wrote:
>>So how are we to compare how the memory requirements of a wchar[] generally compare to those of a FatString?
> 
> 
> sizeof wchar[] string <= sizeof dchar[] string == sizeof FatString

I somehow suspected that you would suggest this. Remember FatString is actually a gross oversimplification of what I suggested. Let's try a less overly simplified version...

The RLEString

Still not as good as a well written sparse array implementation but it shows more of the nature of the implementation I am suggesting and I believe the workings of RLE are self evident enough that I do not need to explain how it works.

How would you compare RLEString if instead of wchar[] for hiBits it used a uint[] whose entries are a run length encoding of the upper 16 bits of the Characters in a String?

For N Characters which are all less than u+10000 the wchar[] string uses exactly 8 bytes less than RLEString.

That is an 8 byte overhead no matter how long the string is. So in this case you save 8 bytes, but should somebody want the last 4 Characters in a 4K string then a wchar[] string must do 4000 parsing iterations as opposed to the RLEString, which does effectively a single operation.
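
To make the comparison concrete, the wchar[] lookup has to be something along these lines (a rough, untested sketch; nthChar is just a name I made up):

#    // Find the nth Character in a UTF-16 string by scanning from the start.
#    // Assumes the wchar[] holds valid UTF-16.
#    dchar nthChar(wchar[] s, uint n)
#    {
#        uint seen = 0;
#        for (uint i = 0; i < s.length; ++i)
#        {
#            // trailing surrogates (0xDC00..0xDFFF) do not start a Character
#            if (s[i] >= 0xDC00 && s[i] < 0xE000) continue;
#            if (seen++ == n)
#            {
#                if (s[i] >= 0xD800 && s[i] < 0xDC00)    // leading surrogate
#                    return cast(dchar)
#                        (0x10000 + ((s[i] - 0xD800) << 10) + (s[i+1] - 0xDC00));
#                return cast(dchar) s[i];
#            }
#        }
#        assert(0);    // n out of range
#    }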

But now how do we compare degenerate cases? Does a 4k string with a single Character above u+FFFF and the associated memory cost of 16 additional bytes negate the 4000 iteration loop? What about 2? 3? Eventually you are going to make a value judgment that involves P(I).

Finally do not forget that (from rfc2781) :
================================================================
Characters with values greater than 0x10FFFF cannot be encoded in UTF-16.
================================================================

A great deal of this space overhead is forced because an RLEString can represent u+00000000 to u+FFFFFFFF inclusively.
July 29, 2004
In article <ce98li$n8v$1@digitaldaemon.com>, parabolis says...
>
>Arcane Jill wrote:
>>>So how are we to compare how the memory requirements of a wchar[] generally compare to those of a FatString?
>> 
>> 
>> sizeof wchar[] string <= sizeof dchar[] string == sizeof FatString
>
>I somehow suspected that you would suggest this. Remember FatString is actually a gross oversimplification of what I suggested.

Which is what, exactly?


>Let's try a less overly simplified version...
>
>The RLEString
>
>Still not as good as a well written sparse array implementation but it shows more of the nature of the implementation I am suggesting and I believe the workings of RLE are self evident enough that I do not need to explain how it works.

Yes you do. Without an implementation, you can claim anything. Just an algorithm in pseudocode would do.


>How would you compare RLEString if instead of wchar[] for hiBits it used a uint[] whose entries are a run length encoding of the upper 16 bits of the Characters in a String?

Show me an implementation which provably has the characteristics you claim, and obviously the proof will speak for itself. This is logical, Captain.



>For N Characters which are all less than u+10000 the wchar[] string uses exactly 8 bytes less than RLEString.

Implementation? Algorithm?


>That is an 8 byte overhead no matter how long the string is. So in this case you save 8 bytes, but should somebody want the last 4 Characters in a 4K string then a wchar[] string must do 4000 parsing iterations as opposed to the RLEString, which does effectively a single operation.

I am not clairvoyant. The details of your algorithm have not been specified.


>But now how do we compare degenerate cases? Does a 4k string with a single Character above u+FFFF and the associated memory cost of 16 additional bytes negate the 4000 iteration loop? What about 2? 3? Eventually you are going to make a value judgment that involves P(I).

And what exactly is P(I)?



>Finally do not forget that (from rfc2781) :
>================================================================
>Characters with values greater than 0x10FFFF cannot be encoded in UTF-16.
>================================================================

"do not forget"!? As if!

Of COURSE they can't, by definition of UTF-16. I don't need an RFC to tell me that. This is, however, irrelevant, because there *are* no characters with codepoints greater than 0x10FFFF. Unless you have discovered some character standard of which I am unaware, and which has even more characters in it than Unicode.



>A great deal of this space overhead is forced because an RLEString can represent u+00000000 to u+FFFFFFFF inclusively.

Obviously nobody can argue with this without a reference implementation.


Summary:
Thus far,
* UTF-8 has the edge on memory requirements /if/ the text is mostly ASCII
* UTF-16 has the edge on memory requirements otherwise
* UTF-32 has the edge on speed of random character access

Nothing you've come up with so far has, simultaneously, the memory requirements of UTF-16 and the lookup speed of UTF-32, which was my understanding of your original claim.

/Why/ do you want a string class? I ask, because, so far, the only /sensible/ arguments in favor of a string class involve the explication of levels of abstraction not present in char[], wchar[] or dchar[]. I may have misunderstood you, in which case I apologize, but you seem to want to craft something to replace char arrays *just because*, and offering no advantage. I don't get it.

Jill


July 29, 2004
In article <ce8mjf$eh3$1@digitaldaemon.com>, parabolis says...

I missed this one. I think you're posting from a time-machine or something, because the D forum seems to sort your posts into the wrong place.


>You are trying (unsuccessfully) to make use of the fact that UTF-32 code units (dchar) always correspond to a Character but are missing the fact they are still defined as code units in the D Docs Types section:
>================================================================
>         char  unsigned 8 bit UTF-8  0xFF
>         wchar  unsigned 16 bit UTF-16  0xFFFF
>         dchar  unsigned 32 bit UTF-32
>================================================================

Strictly speaking, there are distinctions between (a) an ASCII code unit, (b) an
ASCII codepoint, and (c) an ASCII character. But so what? The mapping between
them is utterly trivial so nobody pays it a blind bit of notice - and nor should
they. So it is with dchar.



>If what you were suggesting were really the case then anybody who has a reason to use UTF-8 or UTF-16 would have to count the Characters by converting to UTF-32 and then checking the length of the resulting dchar array.

That's not the /only/ way of doing it. You could iterate through the chars of UTF-8 and count only bytes < 0x80 or >= 0xC0. Similarly, you could iterate through the wchars of UTF-16 and count only code units < 0xD800 or >= 0xDC00. That would get you the right answer faster than doing an actual conversion (no need to allocate memory, etc.).
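
Something like this, for instance (untested, straight off the top of my head, so treat it as a sketch rather than a reference):

#    // Count Characters in UTF-8 by skipping continuation bytes (0x80..0xBF).
#    uint countChars(char[] s)
#    {
#        uint n = 0;
#        foreach (char c; s)
#        {
#            if (c < 0x80 || c >= 0xC0) ++n;
#        }
#        return n;
#    }
#
#    // Count Characters in UTF-16 by skipping leading surrogates (0xD800..0xDBFF),
#    // so each surrogate pair is counted exactly once via its trailing unit.
#    uint countChars(wchar[] s)
#    {
#        uint n = 0;
#        foreach (wchar w; s)
#        {
#            if (w < 0xD800 || w >= 0xDC00) ++n;
#        }
#        return n;
#    }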

But yes, if you want indexing by character, use dchar[]. Why is that a problem?


>This implementation is a crude hack

The implementation against which you argued was not one that I suggested. I can spot a strawman argument a mile away, and I won't fall for it. I suggested precisely two implementations: (1) the counting technique described above, and (2) keep everything in dchars throughout - no conversion required.



>that works by virtue of the fact that the number of code units in UTF32 are guaranteed to be 1:1 with Characters defined by the UC.

Isn't that a bit like saying "that works by virtue of the fact that one plus one equals two"? I mean, there /is/ such a guarantee.


>The reason D uses the precise name 'char' is a direct result of D's ability to compile C directly

?


>and has nothing to do with the unfortunate and misleading association char sometimes has with Character.

Look, it's just jargon, dude. It's just words. It doesn't matter what things are called, so long as they are well-defined. I'm not greatly in love with all of the Unicode Consortium's names for things either, to be honest. Even the regular correspondents in the Unicode public forum will say "glyph" to each other when the technical term as defined by the UC is actually "default grapheme cluster". A bit of common sense is in order here. And why do you keep capitalizing "character"? - it's very disconcerting.

Jill


July 29, 2004
Arcane Jill wrote:

> 
> 
>>You are trying (unsuccessfully) to make use of the fact that UTF-32 code units (dchar) always correspond to a Character but are missing the fact they are still defined as code units in the D Docs Types section:
>>================================================================
>>        char  unsigned 8 bit UTF-8  0xFF
>>        wchar  unsigned 16 bit UTF-16  0xFFFF
>>        dchar  unsigned 32 bit UTF-32
>>================================================================
> 
> 
> Strictly speaking, there are distinctions between (a) an ASCII code unit, (b) an
> ASCII codepoint, and (c) an ASCII character. But so what? The mapping between
> them is utterly trivial so nobody pays it a blind bit of notice - and nor should
> they. So it is with dchar.

It is only trivial if you know how to do it. Otherwise UTF-32 data might as well be encrypted for someone who does not want to bother to spend the time puzzling out the UTF encodings.

My assumption before reading the UTF docs was that UTF-n were roughly just Unicode mapped variables using 8, 16 or 32 bits. I would have used char slicing, .length and perhaps even .sort happily until I came upon an 8bit character that caused some /really/ obscure bug. I don't mean to be rude but for people in such a situation your suggestion that just using a dchar works seems pedantic in the worst way.

Consider an analogy. Let us suppose that in D any byte[] array access sequence of 135 followed by 134 would cause all future array index calculations to assume you meant (.length - index). Let us also suppose that any short[] array access sequence of 2086 followed by 2085 would produce the same results.

Now let us further assume there are technical reasons for storing array data using these WTF-8, WTF-16 and WTF-32 encodings (the WTF-32 encoding being trivially similar to the way normal array access works in other languages).

The fact that there are no 'normal' arrays in D is covered only by a notation in the doc section about types, and everywhere else they are referred to as byte arrays and short arrays.

To your suggestion that if you want common sense array results then use an int array in all cases, I say you are insane. I would not seriously consider using D if there were no byte or short arrays.

More importantly, D would not call these things arrays, and by using sensible terminology it would not have confused these WTF things with arrays, and thus would have provided support for common sense arrays *from the start*.


>>If what you were suggesting were really the case then anybody who has a reason to use UTF-8 or UTF-16 would have to count the Characters by converting to UTF-32 and then checking the length of the resulting dchar array.
> 
> 
> That's not the /only/ way of doing it. You could iterate through the chars of
> UTF-8 and count only bytes < 0x80 or >= 0xC0. Similarly, you could iterate
> through the wchars of UTF-16 and count only code units < 0xD800 or >= 0xDC00. That
> would get you the right answer faster than doing an actual conversion (no need
> to allocate memory, etc.).

Quite true ... I could. But only for char/wchar arrays on which I have used std.utf.validate() and to which I have not written since. Or I could implement my suggested String class and get all the benefits I believe it will offer. But these only solve my problem.

However my problem is that if D did not confuse Character and Code Unit then I am sure the character counting issue would not have been overlooked and /something/ sensible would have been implemented much earlier in the development process than it has been.

>  
>>This implementation is a crude hack
> 
> 
> The implementation against which you argued was not one that I suggested. I can
> spot a strawman argument a mile away, and I won't fall for it. I suggested
> precisely two implementations: (1) the counting technique described above, and (2)
> keep everything in dchars throughout - no conversion required.

I am not arguing against an implementation. I am arguing against /your/ suggestion that dchar[] was an intentional device. I believe basic functionality was overlooked that would have been included if the Character/Code Unit ambiguity did not exist (more on ambiguity below).

> 
>>that works by virtue of the fact that the number of code units in UTF32 are guaranteed to be 1:1 with Characters defined by the UC.
> 
> 
> Isn't that a bit like saying "that works by virtue of the fact that one plus one
> equals two"? I mean, there /is/ such a guarantee.
> 


> 
> Look, it's just jargon, dude. It's just words. It doesn't matter what things are
> called, so long as they are well-defined. I'm not greatly in love with all of

I agree. But in this case I am saying string is not well-defined. It is used to mean both a String of Characters and a Unicode String.

You were surprised char[].sort invalidated an encoding. The library implicitly suggests char[].length and char[i..k] implement length-query and substring.

Separating the two terms in a meaningful way would avoid these kinds of ambiguities.

> A bit of common sense is in order here. And why do you keep capitalizing
> "character"? - it's very disconcerting.

Actually, because it is disconcerting, it is useful for drawing attention to the fact that I do not mean char or a UTF-n code unit.
July 29, 2004
Arcane Jill wrote:

> In article <ce98li$n8v$1@digitaldaemon.com>, parabolis says...
>>I somehow suspected that you would suggest this. Remember FatString is actually a gross oversimplification of what I suggested.
> 
> 
> Which is what, exactly?

As I said in the next paragraph... A sparse array implementation (quoted below):
>>The RLEString
>>
>>Still not as good as a well written sparse array implementation but it shows more of the nature of the implementation I am suggesting and I believe the workings of RLE are self evident enough that I do not need to explain how it works.
> 
> 
> Yes you do. Without an implementation, you can claim anything. Just an algorithm
> in pseudocode would do.
I do not believe I am being unreasonable to expect that RLE is a commonly understood algorithm. If you are not familiar with it then here is the first Google result which explains it:

================================================
The RLE compression algorithm
http://www.prepressure.com/techno/compressionrle.htm

Or perhaps some of these links would help more:

The Stony Brook Algorithm Repository
  http://www.cs.sunysb.edu/~algorith/

Dictionary of Algorithms and Data Structures
  http://www.nist.gov/dads/

The Complete Collection of Algorithm Animations (CCAA)
  http://www.cs.hope.edu/~alganim/ccaa/

Or try searching for "RLE implementation" or "RLE compression implementation."
================================================

I will clarify that I was assuming the entries in uint[] are to be taken by twos (assuming index i is even) to mean:
    hiBits[i+0] upper 16 bits of Character
    hiBits[i+1] number of times hiBits[i+0] is repeated.

So for all Characters less than u+10000 the upper 16 bits can be represented with only two entries (hence my 8 byte overhead):
    hiBits[0] = 0
    hiBits[1] = loBits.length
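
So a lookup goes roughly like this (just a sketch to show the idea, not a worked-out class; charAt is a made-up name):

#    // Sketch: recover Character idx from loBits (a wchar[] with one entry per
#    // Character) and hiBits (a uint[] of (upper 16 bits, run length) pairs).
#    dchar charAt(uint idx)
#    {
#        uint start = 0;
#        for (uint i = 0; i < hiBits.length; i += 2)
#        {
#            if (idx < start + hiBits[i+1])
#                return cast(dchar) ((hiBits[i] << 16) | loBits[idx]);
#            start += hiBits[i+1];
#        }
#        assert(0);    // idx out of range
#    }

For a String where every Character is below u+10000 the loop body runs exactly once, which is why I said the RLEString does effectively a single operation.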

>>But now how do we compare degenerate cases? Does a 4k string with a single Character above u+FFFF and the associated memory cost of 16 additional bytes negate the 4000 iteration loop? What about 2? 3? Eventually you are going to make a value judgment that involves P(I).
> 
> 
> And what exactly is P(I)?

The second post of mine above this suggests you will have a difficult time judging an implementation without big O notation because a more detailed assessment will eventually stumble on:
================================================================
How many bytes will a wchar[] with N characters use? If you knew the probability that the Ith character requires 2 bytes then the answer would be simple. However you do not know P(I).
================================================================

> 
>>Finally do not forget that (from rfc2781) :
>>================================================================
>>Characters with values greater than 0x10FFFF cannot be encoded in UTF-16.
>>================================================================
> 
> 
> "do not forget"!? As if!
> 
> Of COURSE they can't, by definition of UTF-16. I don't need an RFC to tell me
> that. This is, however, irrelevant, because there *are* no characters with
> codepoints greater than 0x10FFFF. Unless you have discovered some character
> standard of which I am unaware, and which has even more characters in it than
> Unicode.

Let us hope they do not discover they need more bits in the future...

> 
> Nothing you've come up with so far has, simultaneously, the memory requirements
> of UTF-16 and the lookup speed of UTF-32, which was my understanding of your
> original claim.

This is invalid reasoning. If my suggestions have so far all failed because a lack of specifics does not allow me to make claims /for/ the implementation then certainly you cannot say the implementation fails. That would assume you know enough about it to make judgments which you have refused to do.

> 
> /Why/ do you want a string class? I ask, because, so far, the only /sensible/
> arguments in favor of a string class involve the explication of levels of
> abstraction not present in char[], wchar[] or dchar[]. I may have misunderstood
> you, in which case I apologize, but you seem to want to craft something to
> replace char arrays *just because*, and offering no advantage. I don't get it.

Again you know why I believe a String class would be better and that my stated reasons have nothing to do with abstraction. I have only cited efficiency concerns.

July 29, 2004
In article <ceate1$1e06$1@digitaldaemon.com>, parabolis says...

>>>The RLEString
>I do not believe I am being unreasonable to expect that RLE is a commonly understood algorithm. If you are not familiar with it

Of course I am, but being familiar with the concept of run length encoding does not tell me anything about how you imagine a class called RLEString would be implemented.

>The second post of mine above this suggests you will have a difficult time judging an implementation without big O notation

Come on! I'm a mathematician and I've been coding for three decades. The science of cryptography, with which I have a passing familiarity, relies heavily upon big O notation. But you are abusing that notation. It measures algorithmic complexity, not memory requirements. You just can't get away with saying that /memory storage requirements/ are O(n) because to do so is meaningless. Yes /of course/ the length of a character string is going to be less than or equal to some multiple of the number of characters. Saying that the /memory requirements/ of a string are O(n) tells you nothing. It is a useless piece of information. What you want to know (and compare) is the average number of bytes per character.


>> This is, however, irrelevant, because there *are* no characters with codepoints greater than 0x10FFFF. Unless you have discovered some character standard of which I am unaware, and which has even more characters in it than Unicode.
>
>Let us hope they do not discover they need more bits in the future...

With all due respect, that was a naive comment. It is trivial to invent a new encoding. I could do it standing on my head blindfold, /and/ make it backwardly compatible with any existing UTF. So, alright then, let's imagine a post-Unicode character set with millions and millions of characters. It will utilise new encodings - let's call them XUTF-8, XUTF-16 and XUTF-32 - which will be backwardly compatible with our current UTF family. Applications will upgrade relatively smoothly. Future D will still be okay in this world: it would only have to redefine its char types to use the XUTF encodings instead of the UTF encodings (and that's the great thing about backward compatibility - nothing breaks). You're getting hung up on academic arguments. The real world runs on practicality. And in any case, that imaginary character set is not going to happen any time soon.



>> Nothing you've come up with so far has, simultaneously, the memory requirements of UTF-16 and the lookup speed of UTF-32, which was my understanding of your original claim.
>
>This is invalid reasoning. If my suggestions have so far all failed because a lack of specifics does not allow me to make claims /for/ the implementation then certainly you cannot say the implementation fails. That would assume you know enough about it to make judgments which you have refused to do.

Fine. So write your class, make it open source and freely distributable, and I guarantee that /if/ people find it useful, they will use it.


>Again you know why I believe a String class would be better and that my stated reasons have nothing to do with abstraction.

Right - but that's precisely /why/ I don't see such a class as being useful. According to the wiki (or some article I read somewhere) there was much discussion here on the D forum about a string class a while back (before I joined it), and consensus came down on the side of "we don't need one". I don't see that changing unless a String class offered some higher level of abstraction than that which can currently be achieved with D's built-in dchar[].


>I have only cited efficiency concerns.

So write your class. I have absolutely no problem with my being wrong. (Every time I'm wrong, I learn something new, and that's /great/). Your implementation will be your evidence, and it will speak for itself.

Arcane Jill


July 29, 2004
Arcane Jill wrote:
> In article <ceate1$1e06$1@digitaldaemon.com>, parabolis says...
> 
> 
>>>>The RLEString
>>
>>I do not believe I am being unreasonable to expect that RLE is a commonly understood algorithm. If you are not familiar with it
> 
> 
> Of course I am, but being familiar with the concept of run length encoding does
> not tell me anything about how you imagine a class called RLEString would be
> implemented.

No but telling you it would use a uint[] to implement an RLE encoding of the upper 16 bits of the Characters should be sufficient.

> 
> 
>>The second post of mine above this suggests you will have a difficult time judging an implementation without big O notation 
> 
> 
> Come on! I'm a mathematician and I've been coding for three decades. The science
> of cryptography, with which I have a passing familiarity, relies heavily upon big
> O notation. But you are abusing that notation. It measures algorithmic
> complexity, not memory requirements. You just can't get away with saying that

Sorry but big O is useful (and used) for both. I understand that the results can seem somewhat deceptive but as I said in my original suggestion: O(1) vs O(N) is a non-trivial improvement and the memory requirements do not do anything unexpected like being N^2. Any finer detail and we will end up debating over values for P(I).

> /memory storage requirements/ are O(n) because to do so is meaningless. Yes /of
> course/ the length of a character string is going to be less than or equal to
> some multiple of the number of characters. Saying that the /memory requirements/

Actually I could suggest a storage algorithm that stores any sized String in 8 bytes. The conditions require some insane assumptions but it /is/ possible. But I only mention it as more of an amusement than anything else. I do get what you meant.

> of a string are O(n) tells you nothing. It is a useless piece of information.
> What you want to know (and compare) is the average number of bytes per
> character.

Yes, which would relate to P(I) as I said before. I would like to know but it is not possible to find precisely, and it would require a prohibitively immense amount of resources to determine an approximation experimentally.


>>Let us hope they do not discover they need more bits in the future...
> 
> 
> With all due respect, that was a naive comment. It is trivial to invent a new

I agree.

>>>Nothing you've come up with so far has, simultaneously, the memory requirements
>>>of UTF-16 and the lookup speed of UTF-32, which was my understanding of your
>>>original claim.
>>
>>This is invalid reasoning. If my suggestions have so far all failed because a lack of specifics does not allow me to make claims /for/ the implementation then certainly you cannot say the implementation fails. That would assume you know enough about it to make judgments which you have refused to do.
> 
> 
> Fine. So write your class, make it open source and freely distributable, and I
> guarantee that /if/ people find it useful, they will use it.

I was actually hoping this discussion would head more in the direction of somebody pointing out that the length and substring methods of a String class do not compose the whole class itself and thus only under certain circumstances would you want to optimize them.

I was hoping to learn more about how people would use a String class and thus how it should be tuned. I have no class implementation in mind.

>>Again you know why I believe a String class would be better and that my stated reasons have nothing to do with abstraction.
> 
> 
> Right - but that's precisely /why/ I don't see such a class as being useful.
> According to the wiki (or some article I read somewhere) there was much
> discussion here on the D forum about a string class a while back (before I
> joined it), and consensus came down on the side of "we don't need one". I don't
> see that changing unless a String class offered some higher level of abstraction
> than that which can currently be achieved with D's built-in dchar[].
> 

I also did not say that it would /fail/ to offer a level of abstraction. What I was talking about was the bonus class implementations get over non-class implementations because they get to make assumptions about data and leverage those assumptions for overall better performance.