December 29, 2011
On 12/29/2011 12:12 AM, Gor Gyolchanyan wrote:
> This is a great idea! In this case the default string will be a
> random-access range, not a bidirectional range. Also, processing
> dstring is faster than string, because no encoding needs to be done.
> Processing power is more expensive than memory. utf-8 is valuable
> only to pass it as an ASCII string (which is not too common) and to
> store large chunks of it. Both these cases are much less common than
> all the rest of string processing.

dstring consumes 4x the memory, and this can easily cause perf degradations due to thrashing and poor cache locality.
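
To put a number on the 4x figure, here is a rough sketch. It assumes ASCII-only text, which is the worst case for UTF-32; `payloadBytes` and `utf32Overhead` are hypothetical helpers, not Phobos functions.

```d
import std.conv : to;

// Hypothetical helper: bytes occupied by the character data of any D string type.
size_t payloadBytes(S)(S s)
{
    return s.length * typeof(s[0]).sizeof;
}

// Ratio of UTF-32 storage to UTF-8 storage for the same (non-empty) text.
double utf32Overhead(string s)
{
    return cast(double) payloadBytes(to!dstring(s)) / payloadBytes(s);
}
```

For "hello, world" the UTF-8 payload is 12 bytes and the UTF-32 payload is 48, so the ratio is exactly 4. Text with many non-ASCII characters narrows the gap, since those already take two or more bytes in UTF-8.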
December 29, 2011
What if the string converted itself from utf-8 to utf-32 back and forth as necessary (utf-8 for storing and utf-32 for processing):

struct String
{
public:
    bool encoded() @property const
    {
        return _encoded;
    }

    bool encoded(bool should) @property
    {
        if(should)
            if(!encoded)
            {
                _utf8 = to!string(_utf32);
                encoded = true;
            }
        else
            if(encoded)
            {
                _utf32 = to!dstring(_utf8);
                encoded = false;
            }
    }

    // Here goes the part where you get to use the string

private:
    bool _encoded;
    union
    {
        string _utf8;
        dstring _utf32;
    }
}

This has a lot of drawbacks and is purely a curiosity. The idea is to express the encoding of a string as a property of the string, rather than as a difference between separate string types.

-- 
Bye,
Gor Gyolchanyan.
December 29, 2011
Oops, I accidentally made a recursive call in the setter. Scratch that: it should assign to the _encoded attribute instead.

-- 
Bye,
Gor Gyolchanyan.
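
For reference, a minimal sketch of the struct with the correction described above applied: the setter assigns to `_encoded` directly instead of calling itself. The constructor, the `length` property, the explicit `return`, and the flattened if/else are assumptions added so the sketch compiles and can be exercised; they are not part of the original post.

```d
import std.conv : to;

struct String
{
public:
    // Assumed constructor: start out in the UTF-8 (storage) form.
    this(string s)
    {
        _utf8 = s;
        _encoded = true;
    }

    @property bool encoded() const
    {
        return _encoded;
    }

    @property bool encoded(bool should)
    {
        if (should && !_encoded)
        {
            // UTF-32 -> UTF-8; the right-hand side is read before the union is overwritten.
            _utf8 = to!string(_utf32);
            _encoded = true;          // assign the field, do not call the setter again
        }
        else if (!should && _encoded)
        {
            // UTF-8 -> UTF-32
            _utf32 = to!dstring(_utf8);
            _encoded = false;
        }
        return _encoded;
    }

    // Assumed helper: number of code units in whichever form is currently active.
    @property size_t length() const
    {
        return _encoded ? _utf8.length : _utf32.length;
    }

private:
    bool _encoded;
    union
    {
        string _utf8;
        dstring _utf32;
    }
}
```

Each switch still allocates and converts the whole string, which is one of the drawbacks mentioned above.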
December 29, 2011
On 12/29/11 2:04 AM, Vladimir Panteleev wrote:
> I think it would be simpler to just make dstring the default string type.
>
> dstring is simple and safe. People who want better memory usage can use
> UTF-8 at their own discretion.

memory == time

Andrei
December 29, 2011
On Thursday, 29 December 2011 at 06:09:17 UTC, Andrei Alexandrescu wrote:
> Nah, that still breaks a lotta code because people parameterize on T[], use isSomeString/isSomeChar etc.

/* snip struct string */

import std.traits;
void tem(T)(T t) if(isSomeString!T) {}
void tem2(T : immutable(char)[])(T t) {}

string a = "test";
tem(a); // works
tem2(a); // works


It's the alias this magic again.

(btw I also tried renaming struct string to
struct STRING, and it still worked, so it wasn't
just naming coincidence!)
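
Since the struct itself was snipped, here is a small stand-in to make the alias this effect concrete. `STRING`, its `data` field, and `lengthOf` are hypothetical. The sketch deliberately uses a plain implicit-conversion constraint rather than isSomeString, because how isSomeString treats such wrappers may depend on the Phobos version.

```d
// Hypothetical wrapper: forwards to the wrapped immutable(char)[] via alias this.
struct STRING
{
    immutable(char)[] data;
    alias data this;
}

// Accepts anything that implicitly converts to const(char)[]:
// string, char[], const(char)[], and wrappers that use alias this.
size_t lengthOf(T)(T t)
    if (is(T : const(char)[]))
{
    const(char)[] view = t;   // conversion goes through alias this for STRING
    return view.length;
}
```

With this, `lengthOf(STRING("test"))` and `lengthOf("test")` both compile and return 4.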
December 29, 2011
On 28/12/2011 21:43, Jonathan M Davis wrote:
> On Wednesday, December 28, 2011 10:27:15 Andrei Alexandrescu wrote:
>> I'm afraid you're wrong here. The current setup is very good, and much
>> better than one in which "string" would be an alias for const(char)[].
>>
>> The problem is escaping. A function that transitorily operates on a
>> string indeed does not care about the origin of the string, but storing
>> a string inside an object is a completely different deal. The setup
>>
>> class Query
>> {
>>       string name;
>>       ...
>> }
>>
>> is safe, minimizes data copying, and never causes surprises to anyone
>> ("I set the name of my query and a little later it's all messed up!").
>>
>> So immutable(char)[] is the best choice for a correct string abstraction
>> compared against both char[] and const(char)[]. In fact it's in a way
>> good that const(char)[] takes longer to type, because it also carries
>> larger liabilities.
>>
>> If you want to create a string out of a char[] or const(char)[], use
>> std.conv.to or the unsafe assumeUnique.
>
> Agreed. And for a number of functions, taking const(char)[] would be worse,
> because they would have to dup or idup the string, whereas with
> immutable(char)[], they can safely slice it without worrying about its value
> changing.
>

Is inout a solution for the standard lib here?

The user could idup if a string is needed from a const/mutable char[].
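
A small sketch of both points, under stated assumptions: `LooseQuery`, `escapeDemo`, and `firstWord` are hypothetical names made up for illustration. The first part shows why a stored const(char)[] can change behind your back while a string cannot; the second shows an inout function that slices whatever it is given and leaves the copy (idup or std.conv.to) to the caller.

```d
import std.conv : to;

// The setup from the quoted post: the field is immutable(char)[].
class Query
{
    string name;
    this(string name) { this.name = name; }
}

// Hypothetical counter-example: const(char)[] only promises that *we* won't write.
class LooseQuery
{
    const(char)[] name;
    this(const(char)[] name) { this.name = name; }
}

// Returns true if the immutable copy survived and the const view was clobbered.
bool escapeDemo()
{
    char[] buf = "users table".dup;
    auto safe  = new Query(to!string(buf));   // copies: same effect as buf.idup
    auto loose = new LooseQuery(buf);         // no copy: still aliases buf
    buf[0] = 'X';
    return safe.name == "users table" && loose.name == "Xsers table";
}

// inout: one body for char[], const(char)[] and string; the result
// carries the same qualifier as the argument, and nothing is copied.
inout(char)[] firstWord(inout(char)[] s)
{
    foreach (i, c; s)
        if (c == ' ')
            return s[0 .. i];
    return s;
}
```

So inout helps for functions that only pass slices through. It does not help a type that needs to store the data, which is where string (or an explicit idup) still matters.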
December 29, 2011
On 29/12/2011 07:48, Jonathan M Davis wrote:
> On Thursday, December 29, 2011 07:33:28 Jakob Ovrum wrote:
>> I don't think this is a problem you can solve without educating
>> people. They will need to know a thing or two about how UTF works
>> to know the performance implications of many of the "safe" ways
>> to handle UTF strings. Further, for much use of Unicode strings
>> in D you can't get away with not knowing anything anyway because
>> D only abstracts up to code points, not graphemes. Imagine trying
>> to explain to the unknowing programmer what is going on when an
>> algorithm function broke his grapheme and he doesn't know the
>> first thing about Unicode.
>>
>> I'm not claiming to be an expert myself, but I believe D offers
>> Unicode the right way as it is.
>
> Ultimately, the programmer _does_ need to understand unicode properly if
> they're going to write code which is both correct and efficient. However, if the
> easy way to use strings in D is correct, even if it's not as efficient as we'd
> like, then at least code will tend to be correct in its use of unicode. And
> then if the programmer wants their string processing to be more efficient,
> they need to actually learn how unicode works so that they can code for it more
> efficiently.
>
> The issue, however, is that it's currently _way_ too easy to use strings
> completely incorrectly and operate on code units as if they were characters. A
> _lot_ of programmers will be using string and char[] as if a char were a
> character, and that's going to create a lot of bugs. Making it harder to
> operate on a char[] or string as if it were an array of characters will
> seriously reduce such bugs and on some level will force people to become
> better educated about unicode.
>
> No, it doesn't completely solve the problem, since then we're operating at the
> code point level rather than the unicode level, but it's still a _lot_ better
> than operating on the code unit level as is likely to happen now.
>
> - Jonathan M Davis

That is the whole point of D IMO. I think we shouldn't let an ego question dictate language decisions.
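
To make the three levels in this subthread concrete, here is a sketch that counts the same text as code units, code points, and graphemes. It assumes a Phobos recent enough to ship std.uni.byGrapheme; the three function names are hypothetical.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

// UTF-8 code units: what .length on a string reports.
size_t codeUnits(string s)  { return s.length; }

// Code points: what range algorithms see when they iterate a string.
size_t codePoints(string s) { return s.walkLength; }

// Graphemes: what a reader would call "characters".
size_t graphemes(string s)  { return s.byGrapheme.walkLength; }
```

For "e\u0301" (an e followed by a combining acute accent) the counts are 3, 2 and 1 respectively, so an algorithm that is correct at the code point level can still split what the user sees as one letter.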
December 29, 2011
On 12/29/2011 07:53 AM, Walter Bright wrote:
> On 12/28/2011 10:08 PM, Andrei Alexandrescu wrote:
>> The only solution is to explain to Walter that no other programmer in the
>> world codes UTF like him. Really. I emulate that sometimes (learned from
>> him) but I see code from hundreds of people day in and day out - it's
>> never like his.
>>
>> Once we convince him, he'll be like "ah, I see what you mean. Requiring
>> .rep is awesome. Let's do it."
>
> If that ever happens, I owe you a beer. Maybe two!
>
> Maybe it's hubris, but I think D nails what a string type should be. I'm
> extremely reluctant to mess with its success. It strikes the right
> balance between aesthetics, efficiency and utility.
>

I fully agree. If I had to design an imperative programming language, this is how its strings would work.

> C++11 and C11 appear to have copied it.
December 29, 2011
On 12/29/2011 07:45 AM, foobar wrote:
> On Wednesday, 28 December 2011 at 22:39:15 UTC, Timon Gehr wrote:
>> On 12/28/2011 11:12 PM, foobar wrote:
>>> On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr wrote:
>>>>
>>>> I was educated enough not to make that mistake, because I read the
>>>> entire language specification before deciding the language was awesome
>>>> and downloading the compiler. I find it strange that the product
>>>> should be made less usable because we do not expect users to read the
>>>> manual. But it is of course a valid point.
>>>>
>>>
>>> That's awfully optimistic to expect people to read the manual.
>>>
>>
>> Well, if the alternative is slowly butchering the language I will be
>> awfully optimistic about it all day long.
>>
>>>> There is nothing wrong with operating at the code unit level.
>>>> Efficient slicing is very desirable.
>>>>
>>>
>>> I agree that it's useful. It is however the incorrect abstraction level
>>> when you need a "string" which is by far the common case in user code.
>>
>> I would not go as far as to call it 'incorrect'.
>>
>>> i.e. if I need a name variable in a class: codeUnit[] name; // bug!
>>> string Name; // correct
>>>
>>
>> From a pragmatic viewpoint it does not matter because if string is
>> used like this, then codeUnit[] does exactly the same thing. Nobody
>> forces anyone to index or slice into a string variable when they don't
>> need that functionality. All engineers have to work with leaky
>> abstractions. Why is it such a big deal?
>>
>>
>>> I expect that most uses of code-unit arrays should be in the standard
>>> library anyway since it provides the string manipulation routines. It
>>> all boils down to making the common case trivial and the rare case
>>> possible. You can use the underlying data structure (code units) if you
>>> need it but the default "string" is what people expect when thinking
>>> about what such a type does (a string of letters). D's already 80% there
>>> since Phobos already treats strings as bi-directional ranges of
>>> code-points which is much closer to the mental image of a string of
>>> letters, so I think this is about bringing the current design to its
>>> final conclusion.
>>>
>>
>> Well, that mental image is just not the right one when dealing with
>> Unicode.
>>
>>>>
>>>> Exactly. It is acting less and less like an array of code units. But
>>>> it *is* an array of code units. If the general consensus is that we
>>>> need a string data type that acts at a different abstraction level by
>>>> default (with which I'd disagree, but apparently I don't have a
>>>> popular opinion here), then we need a string type in the standard
>>>> library to do that. Changing the language so that an array of code
>>>> units stops behaving like an array of code units is not a solution.
>>>>
>>>
>>> I agree that we should not break T[] for any T and instead introduce a
>>> library type. While I personally believe that such a change will expose
>>> hidden bugs (certainly when unaware programmers treat string as ASCII
>>> and the product is later on localized), it's a big disturbance in
>>> people's code and it's worth considering whether the benefit is worth the
>>> costs. Perhaps, some middle ground could be found such that existing
>>> code can rely on existing behavior and the new library type will be an
>>> opt-in.
>>
>> What will such a type offer, except that it disallows indexing and
>> slicing?
>
>
>  From a pragmatic view point people can also continue programming in C++
> instead of investing a lot of effort learning a new language.
>

I disagree.

Pragmatism: "Dealing with things sensibly and realistically in a way that is based on practical rather than theoretical considerations."

In practice, programming in D beats the pants off programming in C++.

> The only difference between programming languages is the human interface
> aspect.

No. There is also the aspect of how well it maps to the machine it will run on. An interface always has two sides.

> Anything you can program with D you could also do in assembly
> yet you prefer D because it's more convenient.

I prefer D because it is more productive.

> In that regard, a code-unit array is definitely worse than a string type.
>

A code-unit array type is a string type, albeit a simple one.

> A programmer can choose to either change his 'naive' mental image or
> change the programming language.  Most will do the latter.

Either a programmer does not care about how D strings work, or he is happy that they are so simple to work with.

> Computers need to adapt and be human friendly, not vice-versa.

When I meet a computer that adapts itself in order to be human friendly, I'll buy you a cookie.
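
For readers following the slicing argument, a sketch of what is and is not safe at the code-unit level. `dropFirst` and `isWellFormed` are hypothetical helpers; std.utf.stride and std.utf.validate are the Phobos calls assumed here.

```d
import std.utf : stride, validate, UTFException;

// Slices on a code-point boundary: stride tells how many code units
// the code point starting at index 0 occupies. No copying, no decoding of the rest.
string dropFirst(string s)
{
    return s.length == 0 ? s : s[stride(s, 0) .. $];
}

// True if s is valid UTF-8 from start to end.
bool isWellFormed(string s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

With s = "épée", `dropFirst(s)` gives "pée", while the naive `s[1 .. $]` starts in the middle of the two-byte é and is no longer well-formed. The slice itself is cheap either way; the cost is knowing where the boundaries are.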
December 29, 2011
On 28.12.2011 20:00, Andrei Alexandrescu wrote:
> Oh, one more thing - one good thing that could come out of this thread
> is abolition (through however slow a deprecation path) of s.length and
> s[i] for narrow strings. Requiring s.rep.length instead of s.length and
> s.rep[i] instead of s[i] would improve the quality of narrow strings
> tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.
> Then, people would access the decoding routines on the needed occasions,
> or would consciously use the representation.
>
> Yum.


If I understand this correctly, most others don't. Effectively, .rep just means, "I know what I'm doing", and there's no change to existing semantics, purely a syntax change.

If you change s[i] into s.rep[i], it does the same thing as now. There's no loss of functionality -- it just stops you from accidentally doing the wrong thing. Like .ptr for getting the address of an array.
Typically all the ".rep" everywhere would get annoying, so you would write:
ubyte [] u = s.rep;
and use u from then on.

I don't like the name 'rep'. Maybe 'raw' or 'utf'?
Apart from that, I think this would be perfect.
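
A sketch of what the proposed usage could look like, using std.string.representation as a stand-in for the hypothetical `.rep` (an assumption: `.rep` itself is only a proposal in this thread). The three function names are also hypothetical.

```d
import std.string : representation;
import std.utf : decode;

// "I know what I'm doing": explicit access to the raw UTF-8 code units.
size_t codeUnitCount(string s)
{
    return s.representation.length;
}

// Returns a ubyte, not a char, so it cannot be mistaken for a character.
ubyte codeUnitAt(string s, size_t i)
{
    immutable(ubyte)[] u = s.representation;   // take it once, then use u
    return u[i];
}

// The other route: go through the decoding routines when a character is wanted.
// i must be the index of the first code unit of a code point.
dchar codePointAt(string s, size_t i)
{
    return decode(s, i);   // decode advances its local copy of i
}
```

For "héllo", `codeUnitCount` is 6, `codeUnitAt(s, 1)` is 0xC3 (the first byte of é), and `codePointAt(s, 1)` is 'é'.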