December 29, 2011
Re: string is rarely useful as a function argument
On 12/29/2011 12:12 AM, Gor Gyolchanyan wrote:
> This is a great idea! In this case the default string will be a
> random-access range, not a bidirectional range. Also, processing
> dstring is faster than string, because no encoding needs to be done.
> Processing power is more expensive than memory. utf-8 is valuable
> only to pass it as an ASCII string (which is not too common) and to
> store large chunks of it. Both these cases are much less common than
> all the rest of string processing.

dstring consumes 4x the memory, and this can easily cause perf degradations due 
to thrashing and poor cache locality.
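
To put a number on the 4x figure, here is a minimal sketch. The helper name payloadBytes is hypothetical (it is not from the thread), and it only counts the character data, not the slice header. The ratio is exactly 4x only for ASCII text; text with many multi-byte UTF-8 sequences widens less.

```d
import std.conv : to;

/// Bytes occupied by the character data of a string (hypothetical helper).
size_t payloadBytes(S)(S s)
{
    return s.length * typeof(S.init[0]).sizeof;
}

unittest
{
    string  narrow = "hello, world";      // 12 ASCII characters
    dstring wide   = narrow.to!dstring;   // same text, one dchar per character

    assert(payloadBytes(narrow) == 12);   // 1 byte per code unit
    assert(payloadBytes(wide) == 48);     // 4 bytes per code unit
}
```

This can be run with rdmd -main -unittest.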
December 29, 2011
Re: string is rarely useful as a function argument
What if the string converted itself from utf-8 to utf-32 back and
forth as necessary (utf-8 for storing and utf-32 for processing):

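import std.conv : to;

struct String
{
public:
   bool encoded() @property const
   {
       return _encoded;
   }

   bool encoded(bool should) @property
   {
       if(should)
       {
           if(!_encoded)
           {
               // The right-hand side is read before the union is overwritten.
               _utf8 = to!string(_utf32);
               // Assign the field; assigning the property here would recurse.
               _encoded = true;
           }
       }
       else
       {
           if(_encoded)
           {
               _utf32 = to!dstring(_utf8);
               _encoded = false;
           }
       }
       return _encoded;
   }

   // Here goes the part where you get to use the string

private:
   bool _encoded; // true: _utf8 is the active member; false: _utf32 is
   union
   {
       string _utf8;
       dstring _utf32;
   }
}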

This has a lot of drawbacks and is purely a curiosity. The idea is to
express the encoding of a string as a property of the string, rather
than as a difference between separate string types.




-- 
Bye,
Gor Gyolchanyan.
December 29, 2011
Re: string is rarely useful as a function argument
Oops. I accidentally made a recursive call in the setter. Scratch
that: it should assign the _encoded field directly instead of going
through the encoded property.




-- 
Bye,
Gor Gyolchanyan.
December 29, 2011
Re: string is rarely useful as a function argument
On 12/29/11 2:04 AM, Vladimir Panteleev wrote:
> I think it would be simpler to just make dstring the default string type.
>
> dstring is simple and safe. People who want better memory usage can use
> UTF-8 at their own discretion.

memory == time

Andrei
December 29, 2011
Re: string is rarely useful as a function argument
On Thursday, 29 December 2011 at 06:09:17 UTC, Andrei 
Alexandrescu wrote:
> Nah, that still breaks a lotta code because people parameterize 
> on T[], use isSomeString/isSomeChar etc.

/* snip struct string */

import std.traits;
void tem(T)(T t) if(isSomeString!T) {}
void tem2(T : immutable(char)[])(T t) {}

string a = "test";
tem(a); // works
tem2(a); // works


It's the alias this magic again.

(btw I also tried renaming struct string to
struct STRING, and it still worked, so it wasn't
just naming coincidence!)
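
Since the struct itself was snipped, here is a minimal sketch of the mechanism being relied on. The names Str and countSpaces are hypothetical stand-ins, not the code from the post; the sketch only shows that alias this makes a wrapper struct implicitly usable wherever a string is expected.

```d
/// Hypothetical wrapper; the struct from the post was snipped.
struct Str
{
    string data;
    alias data this; // Str converts implicitly to string
}

/// An ordinary function that knows nothing about Str.
size_t countSpaces(string s)
{
    size_t n;
    foreach (char c; s)
        if (c == ' ')
            ++n;
    return n;
}

unittest
{
    auto a = Str("a b c");

    static assert(is(Str : string)); // implicit conversion through alias this
    assert(countSpaces(a) == 2);     // accepted where a string is expected
    assert(a.length == 5);           // array properties are forwarded too
}
```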
December 29, 2011
Re: string is rarely useful as a function argument
On 28/12/2011 21:43, Jonathan M Davis wrote:
> On Wednesday, December 28, 2011 10:27:15 Andrei Alexandrescu wrote:
>> I'm afraid you're wrong here. The current setup is very good, and much
>> better than one in which "string" would be an alias for const(char)[].
>>
>> The problem is escaping. A function that transitorily operates on a
>> string indeed does not care about the origin of the string, but storing
>> a string inside an object is a completely different deal. The setup
>>
>> class Query
>> {
>>       string name;
>>       ...
>> }
>>
>> is safe, minimizes data copying, and never causes surprises to anyone
>> ("I set the name of my query and a little later it's all messed up!").
>>
>> So immutable(char)[] is the best choice for a correct string abstraction
>> compared against both char[] and const(char)[]. In fact it's in a way
>> good that const(char)[] takes longer to type, because it also carries
>> larger liabilities.
>>
>> If you want to create a string out of a char[] or const(char)[], use
>> std.conv.to or the unsafe assumeUnique.
>
> Agreed. And for a number of functions, taking const(char)[] would be worse,
> because they would have to dup or idup the string, whereas with
> immutable(char)[], they can safely slice it without worrying about its value
> changing.
>

Is inout a solution for the standard lib here?

The user could idup if a string is needed from a const/mutable char[].
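
A minimal sketch of what that could look like, assuming a hypothetical library function firstWord (not an actual Phobos function). With inout the function slices without copying and hands back the same qualifier it was given, and the caller pays for an idup only when the result has to be stored as a string.

```d
/// Returns the part of s before the first space, or all of s if there is none.
/// Hypothetical example function; it slices and never copies.
inout(char)[] firstWord(inout(char)[] s)
{
    foreach (i, c; s)
        if (c == ' ')
            return s[0 .. i];
    return s;
}

unittest
{
    string fixed  = "hello world";
    char[] buffer = "hello world".dup;

    // The qualifier of the argument carries over to the result.
    static assert(is(typeof(firstWord(fixed)) == string));
    static assert(is(typeof(firstWord(buffer)) == char[]));

    // Storing the result safely: idup makes an immutable copy.
    string kept = firstWord(buffer).idup;
    buffer[0] = 'J';         // later mutation of the buffer...
    assert(kept == "hello"); // ...does not affect the stored string
}
```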
December 29, 2011
Re: string is rarely useful as a function argument
On 29/12/2011 07:48, Jonathan M Davis wrote:
> On Thursday, December 29, 2011 07:33:28 Jakob Ovrum wrote:
>> I don't think this is a problem you can solve without educating
>> people. They will need to know a thing or two about how UTF works
>> to know the performance implications of many of the "safe" ways
>> to handle UTF strings. Further, for much use of Unicode strings
>> in D you can't get away with not knowing anything anyway because
>> D only abstracts up to code points, not graphemes. Imagine trying
>> to explain to the unknowing programmer what is going on when an
>> algorithm function broke his grapheme and he doesn't know the
>> first thing about Unicode.
>>
>> I'm not claiming to be an expert myself, but I believe D offers
>> Unicode the right way as it is.
>
> Ultimately, the programmer _does_ need to understand unicode properly if
> they're going to write code which is both correct and efficient. However, if the
> easy way to use strings in D is correct, even if it's not as efficient as we'd
> like, then at least code will tend to be correct in its use of unicode. And
> then if the programmer wants their string processing to be more efficient,
> they need to actually learn how unicode works so that they can code for it
> more efficiently.
>
> The issue, however, is that it's currently _way_ too easy to use strings
> completely incorrectly and operate on code units as if they were characters. A
> _lot_ of programmers will be using string and char[] as if a char were a
> character, and that's going to create a lot of bugs. Making it harder to
> operate on a char[] or string as if it were an array of characters will
> seriously reduce such bugs and on some level will force people to become
> better educated about unicode.
>
> No, it doesn't completely solve the problem, since then we're operating at the
> code point level rather than the unicode level, but it's still a _lot_ better
> than operating on the code unit level as is likely to happen now.
>
> - Jonathan M Davis

That is the whole point of D IMO. I think we shouldn't let a question
of ego dictate language decisions.
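
For readers following the code unit / code point / grapheme distinction in the quoted text, here is a small sketch. The three helper names are hypothetical; walkLength and byGrapheme are the Phobos functions from std.range and std.uni. The strings use escapes so the result does not depend on how the source file is normalized.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helpers, one per level of abstraction.
size_t codeUnits(string s)  { return s.length; }                // UTF-8 bytes
size_t codePoints(string s) { return s.walkLength; }            // decoded dchars
size_t graphemes(string s)  { return s.byGrapheme.walkLength; } // user-perceived characters

unittest
{
    string precomposed = "\u00E9";  // e-acute as a single code point
    string combining   = "e\u0301"; // 'e' followed by a combining acute accent

    assert(codeUnits(precomposed) == 2);
    assert(codePoints(precomposed) == 1);
    assert(graphemes(precomposed) == 1);

    assert(codeUnits(combining) == 3);
    assert(codePoints(combining) == 2); // code points still split the character
    assert(graphemes(combining) == 1);
}
```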
December 29, 2011
Re: string is rarely useful as a function argument
On 12/29/2011 07:53 AM, Walter Bright wrote:
> On 12/28/2011 10:08 PM, Andrei Alexandrescu wrote:
>> The only solution is to explain to Walter that no other programmer in the
>> world codes UTF like him. Really. I emulate that sometimes (learned from
>> him) but I see code from hundreds of people day in and day out - it's
>> never like his.
>>
>> Once we convince him, he'll be like "ah, I see what you mean. Requiring
>> .rep is awesome. Let's do it."
>
> If that ever happens, I owe you a beer. Maybe two!
>
> Maybe it's hubris, but I think D nails what a string type should be. I'm
> extremely reluctant to mess with its success. It strikes the right
> balance between aesthetics, efficiency and utility.
>

I fully agree. If I had to design an imperative programming language, 
this is how its strings would work.

> C++11 and C11 appear to have copied it.
December 29, 2011
Re: string is rarely useful as a function argument
On 12/29/2011 07:45 AM, foobar wrote:
> On Wednesday, 28 December 2011 at 22:39:15 UTC, Timon Gehr wrote:
>> On 12/28/2011 11:12 PM, foobar wrote:
>>> On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr wrote:
>>>>
>>>> I was educated enough not to make that mistake, because I read the
>>>> entire language specification before deciding the language was awesome
>>>> and downloading the compiler. I find it strange that the product
>>>> should be made less usable because we do not expect users to read the
>>>> manual. But it is of course a valid point.
>>>>
>>>
>>> That's awfully optimistic to expect people to read the manual.
>>>
>>
>> Well, if the alternative is slowly butchering the language I will be
>> awfully optimistic about it all day long.
>>
>>>> There is nothing wrong with operating at the code unit level.
>>>> Efficient slicing is very desirable.
>>>>
>>>
>>> I agree that it's useful. It is however the incorrect abstraction level
>>> when you need a "string" which is by far the common case in user code.
>>
>> I would not go as far as to call it 'incorrect'.
>>
>>> i.e. if I need a name variable in a class: codeUnit[] name; // bug!
>>> string Name; // correct
>>>
>>
>> From a pragmatic viewpoint it does not matter because if string is
>> used like this, then codeUnit[] does exactly the same thing. Nobody
>> forces anyone to index or slice into a string variable when they don't
>> need that functionality. All engineers have to work with leaky
>> abstractions. Why is it such a big deal?
>>
>>
>>> I expect that most uses of code-unit arrays should be in the standard
>>> library anyway since it provides the string manipulation routines. It
>>> all boils down to making the common case trivial and the rare case
>>> possible. You can use the underlying data structure (code units) if you
>>> need it but the default "string" is what people expect when thinking
>>> about what such a type does (a string of letters). D's already 80% there
>>> since Phobos already treats strings as bi-directional ranges of
>>> code-points which is much closer to the mental image of a string of
>>> letters, so I think this is about bringing the current design to its
>>> final conclusion.
>>>
>>
>> Well, that mental image is just not the right one when dealing with
>> Unicode.
>>
>>>>
>>>> Exactly. It is acting less and less like an array of code units. But
>>>> it *is* an array of code units. If the general consensus is that we
>>>> need a string data type that acts at a different abstraction level by
>>>> default (with which I'd disagree, but apparently I don't have a
>>>> popular opinion here), then we need a string type in the standard
>>>> library to do that. Changing the language so that an array of code
>>>> units stops behaving like an array of code units is not a solution.
>>>>
>>>
>>> I agree that we should not break T[] for any T and instead introduce a
>>> library type. While I personally believe that such a change will expose
>>> hidden bugs (certainly when unaware programmers treat string as ASCII
>>> and the product is later on localized), it's a big disturbance in
>>> people's code and it's worth considering whether the benefit is worth the
>>> costs. Perhaps, some middle ground could be found such that existing
>>> code can rely on existing behavior and the new library type will be an
>>> opt-in.
>>
>> What will such a type offer, except that it disallows indexing and
>> slicing?
>
>
> From a pragmatic viewpoint, people can also continue programming in C++
> instead of investing a lot of effort learning a new language.
>

I disagree.

Pragmatism: "Dealing with things sensibly and realistically in a way 
that is based on practical rather than theoretical considerations."

In practice, programming in D beats the pants off programming in C++.

> The only difference between programming languages is the human interface
> aspect.

No. There is also the aspect of how well it maps to the machine it will 
run on. An interface always has two sides.

> Anything you can program with D you could also do in assembly
> yet you prefer D because it's more convenient.

I prefer D because it is more productive.

> In that regard, a code-unit array is definitely worse than a string type.
>

A code-unit array type is a string type, albeit a simple one.

> A programmer can choose to either change his 'naive' mental image or
> change the programming language.  Most will do the latter.

A programmer does not care about how D strings work or he is happy that 
they are so simple to work with.

> Computers need to adapt and be human friendly, not vice-versa.

When I meet a computer that adapts itself in order to be human friendly, 
I'll buy you a cookie.
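
To illustrate the point made earlier in this exchange that operating at the code unit level is fine and slicing is cheap, here is a minimal sketch. tailFrom is a hypothetical helper; indexOf (std.string) and validate (std.utf) are Phobos functions. indexOf reports a code-unit offset, so the slice lands on a sequence boundary and no decoding or copying is needed.

```d
import std.string : indexOf;
import std.utf : validate;

/// Tail of haystack starting at the first occurrence of needle, or null if absent.
/// Hypothetical helper: it slices at a code-unit offset and never copies.
string tailFrom(string haystack, string needle)
{
    immutable i = haystack.indexOf(needle); // code-unit offset, or -1
    return i < 0 ? null : haystack[i .. $];
}

unittest
{
    string s = "na\u00EFve caf\u00E9";   // contains two 2-byte sequences

    // Offset 7 in code units; counting code points would give 6.
    assert(s.indexOf("caf") == 7);

    string tail = tailFrom(s, "caf");
    assert(tail == "caf\u00E9");
    assert(tail.ptr is s.ptr + 7);       // same memory, no copy
    validate(tail);                      // throws if the slice were invalid UTF-8
}
```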
December 29, 2011
Re: string is rarely useful as a function argument
On 28.12.2011 20:00, Andrei Alexandrescu wrote:
> Oh, one more thing - one good thing that could come out of this thread
> is abolition (through however slow a deprecation path) of s.length and
> s[i] for narrow strings. Requiring s.rep.length instead of s.length and
> s.rep[i] instead of s[i] would improve the quality of narrow strings
> tremendously. Also, s.rep[i] should return ubyte/ushort, not char/wchar.
> Then, people would access the decoding routines on the needed occasions,
> or would consciously use the representation.
>
> Yum.


If I understand this correctly, most others don't. Effectively, .rep 
just means, "I know what I'm doing", and there's no change to existing 
semantics, purely a syntax change.

If you change s[i] into s.rep[i], it does the same thing as now. There's 
no loss of functionality -- it just stops you from accidentally doing 
the wrong thing. Like .ptr for getting the address of an array.
Typically all the ".rep" everywhere would get annoying, so you would write:
ubyte[] u = s.rep;
and use u from then on.

I don't like the name 'rep'. Maybe 'raw' or 'utf'?
Apart from that, I think this would be perfect.
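
A sketch of the "take the representation once, then work on it" pattern described above. The .rep property is only a proposal in this thread; as a stand-in this uses std.string.representation, which Phobos provides and which returns immutable(ubyte)[] for a string. The helper nonAsciiUnits is hypothetical.

```d
import std.string : representation;

/// Counts code units with the high bit set, i.e. bytes belonging to
/// multi-byte UTF-8 sequences. Hypothetical helper for illustration.
size_t nonAsciiUnits(string s)
{
    immutable(ubyte)[] u = s.representation; // explicit "I want the raw bytes"
    size_t n;
    foreach (b; u)
        if (b >= 0x80)
            ++n;
    return n;
}

unittest
{
    static assert(is(typeof("abc".representation) == immutable(ubyte)[]));

    assert(nonAsciiUnits("abc") == 0);
    assert(nonAsciiUnits("caf\u00E9") == 2); // the accented letter takes two bytes
}
```

The elements come back as ubyte rather than char, which matches the suggestion in the quoted post that the raw view should not look like characters.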