December 29, 2011
Re: string is rarely useful as a function argument
On 12/28/11 11:36 PM, Walter Bright wrote:
> On 12/28/2011 8:32 PM, Adam D. Ruppe wrote:
>> On Thursday, 29 December 2011 at 04:17:37 UTC, Andrei Alexandrescu wrote:
>>> If we have two facilities (string and e.g. String) we've lost. We'd
>>> need to
>>> slowly change the built-in string type.
>>
>> Have you actually tried to do it?
>
> I've seen the damage done in C++ with multiple string types. Being able
> to convert from one to the other doesn't help much.

This.

The only solution is to explain to Walter that no other programmer in the world 
codes UTF like him. Really. I emulate that sometimes (learned from him) 
but I see code from hundreds of people day in and day out - it's never 
like his.

Once we convince him, he'll be like "ah, I see what you mean. Requiring 
.rep is awesome. Let's do it."


Andrei
December 29, 2011
Re: string is rarely useful as a function argument
On 12/29/11 12:01 AM, Adam D. Ruppe wrote:
> On Thursday, 29 December 2011 at 05:37:00 UTC, Walter Bright wrote:
>> I've seen the damage done in C++ with multiple string types. Being
>> able to convert from one to the other doesn't help much.
>
> Note that I'm on your side here re strings, but you're
> underselling the D language too! These conversions
> are implicit both ways, and completely free. D structs
> can wrap other D types perfectly well.

Nah, that still breaks a lotta code because people parameterize on T[], 
use isSomeString/isSomeChar etc.
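
To make the objection concrete, here is a minimal sketch, assuming a hypothetical wrapper named `String` and a hypothetical generic function `unitCount` (neither comes from the thread or from Phobos). It suggests that a struct wrapping `string` via `alias this` still fails the `isSomeString` check that much generic code relies on:

```d
import std.traits : isSomeString;

// Hypothetical wrapper type, named String only for illustration.
struct String
{
    string data;
    alias data this; // lets a String be used where a string is expected
}

// Hypothetical generic function, constrained the way much existing code is.
size_t unitCount(S)(S s) if (isSomeString!S)
{
    return s.length;
}

void main()
{
    static assert(isSomeString!string);
    static assert(!isSomeString!String); // the wrapper is not a built-in string

    auto s = String("abc");
    assert(s.length == 3); // forwarded to the wrapped string by alias this

    // The constrained template rejects the wrapper itself...
    static assert(!__traits(compiles, unitCount(s)));
    // ...and accepts it only after unwrapping by hand.
    assert(unitCount(s.data) == 3);
}
```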

Nagonna.


Andrei
December 29, 2011
Re: string is rarely useful as a function argument
On Thursday, 29 December 2011 at 06:08:05 UTC, Andrei 
Alexandrescu wrote:
> On 12/28/11 11:36 PM, Walter Bright wrote:
>> On 12/28/2011 8:32 PM, Adam D. Ruppe wrote:
>>> On Thursday, 29 December 2011 at 04:17:37 UTC, Andrei 
>>> Alexandrescu wrote:
>>>> If we have two facilities (string and e.g. String) we've 
>>>> lost. We'd
>>>> need to
>>>> slowly change the built-in string type.
>>>
>>> Have you actually tried to do it?
>>
>> I've seen the damage done in C++ with multiple string types. 
>> Being able
>> to convert from one to the other doesn't help much.
>
> This.
>
> The only solution is to explain to Walter that no other programmer in 
> the world codes UTF like him. Really. I emulate that sometimes 
> (learned from him) but I see code from hundreds of people day 
> in and day out - it's never like his.
>
> Once we convince him, he'll be like "ah, I see what you mean. 
> Requiring .rep is awesome. Let's do it."
>
>
> Andrei

I don't think this is a problem you can solve without educating 
people. They will need to know a thing or two about how UTF works 
to know the performance implications of many of the "safe" ways 
to handle UTF strings. Further, for much use of Unicode strings 
in D you can't get away with not knowing anything anyway because 
D only abstracts up to code points, not graphemes. Imagine trying 
to explain to the unknowing programmer what is going on when an 
algorithm function broke his grapheme and he doesn't know the 
first thing about Unicode.

I'm not claiming to be an expert myself, but I believe D offers 
Unicode the right way as it is.
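
A small sketch of the grapheme problem described above, assuming current Phobos (`std.uni.byGrapheme`, `std.range.retro`); the helper name `graphemeCount` is made up for the example. It shows an algorithm that is correct per code point still separating an accent from its base letter:

```d
import std.algorithm.comparison : equal;
import std.range : retro, walkLength;
import std.uni : byGrapheme;

// Hypothetical helper: number of user-perceived characters in s.
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'x'.
    string s = "e\u0301x";

    assert(s.length == 4);         // code units (U+0301 takes two bytes)
    assert(s.walkLength == 3);     // code points: e, U+0301, x
    assert(graphemeCount(s) == 2); // graphemes: accented e, then x

    // Reversing by code point yields x, U+0301, e: the accent now
    // follows 'x' instead of 'e', so the original grapheme is broken.
    assert(equal(s.retro, "x\u0301e"));
}
```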
December 29, 2011
Re: string is rarely useful as a function argument
On Wednesday, 28 December 2011 at 22:39:15 UTC, Timon Gehr wrote:
> On 12/28/2011 11:12 PM, foobar wrote:
>> On Wednesday, 28 December 2011 at 21:17:49 UTC, Timon Gehr 
>> wrote:
>>>
>>> I was educated enough not to make that mistake, because I 
>>> read the
>>> entire language specification before deciding the language 
>>> was awesome
>>> and downloading the compiler. I find it strange that the 
>>> product
>>> should be made less usable because we do not expect users to 
>>> read the
>>> manual. But it is of course a valid point.
>>>
>>
>> That's awfully optimistic to expect people to read the manual.
>>
>
> Well, if the alternative is slowly butchering the language I 
> will be awfully optimistic about it all day long.
>
>>> There is nothing wrong with operating at the code unit level.
>>> Efficient slicing is very desirable.
>>>
>>
>> I agree that it's useful. It is however the incorrect 
>> abstraction level
>> when you need a "string" which is by far the common case in 
>> user code.
>
> I would not go as far as to call it 'incorrect'.
>
>> i.e. if I need a name variable in a class: codeUnit[] name; // 
>> bug!
>> string Name; // correct
>>
>
> From a pragmatic viewpoint it does not matter because if string 
> is used like this, then codeUnit[] does exactly the same thing. 
> Nobody forces anyone to index or slice into a string variable 
> when they don't need that functionality. All engineers have to 
> work with leaky abstractions. Why is it such a big deal?
>
>
>> I expect that most uses of code-unit arrays should be in the 
>> standard
>> library anyway since it provides the string manipulation 
>> routines. It
>> all boils down to making the common case trivial and the rare 
>> case
>> possible.  You can use the underlying data structure (code 
>> units) if you
>> need it but the default "string" is what people expect when 
>> thinking
>> about what such a type does (a string of letters). D's already 
>> 80% there
>> since Phobos already treats strings as bi-directional ranges of
>> code-points which is much closer to the mental image of a 
>> string of
>> letters, so I think this is about bringing the current design 
>> to its
>> final conclusion.
>>
>
> Well, that mental image is just not the right one when dealing 
> with Unicode.
>
>>>
>>> Exactly. It is acting less and less like an array of code 
>>> units. But
>>> it *is* an array of code units. If the general consensus is 
>>> that we
>>> need a string data type that acts at a different abstraction 
>>> level by
>>> default (with which I'd disagree, but apparently I don't have 
>>> a
>>> popular opinion here), then we need a string type in the 
>>> standard
>>> library to do that. Changing the language so that an array of 
>>> code
>>> units stops behaving like an array of code units is not a 
>>> solution.
>>>
>>
>> I agree that we should not break T[] for any T and instead 
>> introduce a
>> library type. While I personally believe that such a change 
>> will expose
>> hidden bugs (certainly when unaware programmers treat string 
>> as ASCII
>> and the product is later on localized), it's a big disturbance 
>> in
>> people's code and it's worth considering whether the benefit is 
>> worth the
>> costs. Perhaps, some middle ground could be found such that 
>> existing
>> code can rely on existing behavior and the new library type 
>> will be an
>> opt-in.
>
> What will such a type offer, except that it disallows indexing 
> and slicing?


From a pragmatic viewpoint, people can also continue programming 
in C++ instead of investing a lot of effort learning a new 
language.

The only difference between programming languages is the human 
interface aspect.  Anything you can program with D you could also 
do in assembly yet you prefer D because it's more convenient. In 
that regard, a code-unit array is definitely worse than a string 
type.

A programmer can choose to either change his 'naive' mental image 
or change the programming language. Most will do the latter. 
Computers need to adapt and be human friendly, not vice-versa.
December 29, 2011
Re: string is rarely useful as a function argument
On Thursday, December 29, 2011 07:33:28 Jakob Ovrum wrote:
> I don't think this is a problem you can solve without educating
> people. They will need to know a thing or two about how UTF works
> to know the performance implications of many of the "safe" ways
> to handle UTF strings. Further, for much use of Unicode strings
> in D you can't get away with not knowing anything anyway because
> D only abstracts up to code points, not graphemes. Imagine trying
> to explain to the unknowing programmer what is going on when an
> algorithm function broke his grapheme and he doesn't know the
> first thing about Unicode.
> 
> I'm not claiming to be an expert myself, but I believe D offers
> Unicode the right way as it is.

Ultimately, the programmer _does_ need to understand Unicode properly if 
they're going to write code which is both correct and efficient. However, if the 
easy way to use strings in D is correct, even if it's not as efficient as we'd 
like, then at least code will tend to be correct in its use of Unicode. And 
then if the programmer wants their string processing to be more efficient, 
they need to actually learn how Unicode works so that they can code for it more 
efficiently.

The issue, however, is that it's currently _way_ too easy to use strings 
completely incorrectly and operate on code units as if they were characters. A 
_lot_ of programmers will be using string and char[] as if a char were a 
character, and that's going to create a lot of bugs. Making it harder to 
operate on a char[] or string as if it were an array of characters will 
seriously reduce such bugs and on some level will force people to become 
better educated about Unicode.

No, it doesn't completely solve the problem, since then we're operating at the 
code point level rather than the grapheme level, but it's still a _lot_ better 
than operating on the code unit level as is likely to happen now.
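
As an illustration of the code unit versus code point distinction, here is a minimal sketch; the helper names `codeUnits` and `codePoints` are invented for the example:

```d
import std.range : walkLength;

// Hypothetical helper: number of UTF-8 code units, which is what s.length reports.
size_t codeUnits(string s) { return s.length; }

// Hypothetical helper: number of code points, found by decoding while walking.
size_t codePoints(string s) { return s.walkLength; }

void main()
{
    string s = "h\u00E9llo"; // "héllo" with a precomposed é (U+00E9)

    assert(codeUnits(s) == 6);  // é takes two bytes in UTF-8
    assert(codePoints(s) == 5); // but it is a single code point

    // Indexing yields a code unit, not a character: s[1] and s[2] are
    // the two bytes that together encode é.
    assert(s[1] == 0xC3);
    assert(s[2] == 0xA9);

    // A slice that ends between those bytes holds an incomplete sequence.
    assert(s[0 .. 2].length == 2);
}
```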

- Jonathan M Davis
December 29, 2011
Re: string is rarely useful as a function argument
On 12/28/2011 10:08 PM, Andrei Alexandrescu wrote:
> The only solution is to explain to Walter that no other programmer in the world codes
> UTF like him. Really. I emulate that sometimes (learned from him) but I see code
> from hundreds of people day in and day out - it's never like his.
>
> Once we convince him, he'll be like "ah, I see what you mean. Requiring .rep is
> awesome. Let's do it."

If that ever happens, I owe you a beer. Maybe two!

Maybe it's hubris, but I think D nails what a string type should be. I'm 
extremely reluctant to mess with its success. It strikes the right balance 
between aesthetics, efficiency and utility.

C++11 and C11 appear to have copied it.
December 29, 2011
Re: string is rarely useful as a function argument
On 12/28/2011 10:33 PM, Jakob Ovrum wrote:
> I don't think this is a problem you can solve without educating people. They
> will need to know a thing or two about how UTF works to know the performance
> implications of many of the "safe" ways to handle UTF strings. Further, for much
> use of Unicode strings in D you can't get away with not knowing anything anyway
> because D only abstracts up to code points, not graphemes. Imagine trying to
> explain to the unknowing programmer what is going on when an algorithm function
> broke his grapheme and he doesn't know the first thing about Unicode.
>
> I'm not claiming to be an expert myself, but I believe D offers Unicode the
> right way as it is.

I think this goes to the point that, at some level, the language is no longer able to hide the 
realities of the underlying machine. This happens with floating point (they are 
NOT mathematical real numbers), integers (they overflow), etc.

Keep in mind that D already has a string type where each element is a whole 
code point:

     dstring
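
A short sketch of how that type behaves, assuming `std.conv.to` for the conversion; the helper name `widen` is made up here. Note the last assertion: a combining sequence still occupies two elements, so one element per code point is not the same as one element per displayed character.

```d
import std.conv : to;

// Hypothetical helper: convert UTF-8 text to UTF-32.
dstring widen(string s) { return s.to!dstring; }

void main()
{
    string  narrow = "h\u00E9llo"; // UTF-8, "héllo"
    dstring wide   = widen(narrow);

    assert(narrow.length == 6);  // code units
    assert(wide.length == 5);    // one element per code point
    assert(wide[1] == '\u00E9'); // indexing yields a whole code point

    // 'e' plus U+0301 COMBINING ACUTE ACCENT is still two elements.
    dstring combined = "e\u0301"d;
    assert(combined.length == 2);
}
```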
December 29, 2011
Re: string is rarely useful as a function argument
On Wednesday, 28 December 2011 at 19:00:53 UTC, Andrei 
Alexandrescu wrote:
> On 12/28/11 12:46 PM, Walter Bright wrote:
>> On 12/28/2011 10:35 AM, Peter Alexander wrote:
>>> On 28/12/11 6:15 PM, Walter Bright wrote:
>>>> If such a change is made, then people will use const string 
>>>> when they
>>>> mean immutable, and the values underneath are not guaranteed 
>>>> to be
>>>> consistent.
>>>
>>> Then people should learn what const and immutable mean!
>>>
>>> I don't think it's fair to dismiss my suggestion on the 
>>> grounds that
>>> people
>>> don't understand the language.
>>
>> People do what is convenient, and as endless experience shows, 
>> doing the
>> right thing should be easier than doing the wrong thing. If 
>> you present
>> people with a choice:
>>
>> #1: string s;
>> #2: immutable(char)[] s;
>>
>> sure as the sun rises, they will type the former, and it will 
>> be subtly
>> incorrect if string is const(char)[].
>>
>> Telling people they should know better and pick #2 instead is 
>> a strategy
>> that never works very well - not for programming, nor any 
>> other endeavor.
>
> Oh, one more thing - one good thing that could come out of this 
> thread is abolition (through however slow a deprecation path) 
> of s.length and s[i] for narrow strings. Requiring s.rep.length 
> instead of s.length and s.rep[i] instead of s[i] would improve 
> the quality of narrow strings tremendously. Also, s.rep[i] 
> should return ubyte/ushort, not char/wchar. Then, people would 
> access the decoding routines on the needed occasions, or would 
> consciously use the representation.

I think it would be simpler to just make dstring the default 
string type.

dstring is simple and safe. People who want better memory usage 
can use UTF-8 at their own discretion.
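
For comparison, the `.rep` idea quoted above can be approximated with `std.string.representation`, which current Phobos provides (it is a library function, not the language change being proposed). A minimal sketch:

```d
import std.string : representation;

void main()
{
    string s = "h\u00E9llo"; // "héllo"

    // The raw code units, typed as ubyte rather than char.
    auto raw = s.representation;
    static assert(is(typeof(raw) == immutable(ubyte)[]));

    assert(raw.length == 6); // explicit: this counts bytes
    assert(raw[1] == 0xC3);  // explicit: this is a byte, not a character

    // For UTF-16 strings the element type is ushort.
    wstring w = "h\u00E9llo"w;
    static assert(is(typeof(w.representation) == immutable(ushort)[]));
}
```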
December 29, 2011
Re: string is rarely useful as a function argument
This is a great idea! In this case the default string will be a
random-access range, not a bidirectional range. Also, processing
dstring is faster than string, because no decoding needs to be done.
Processing power is more expensive than memory. UTF-8 is valuable
only to pass it as an ASCII string (which is not too common) and to
store large chunks of it. Both these cases are much less common than
all the rest of string processing.

+1
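
The range categories mentioned here can be checked with the Phobos range traits. A minimal sketch, assuming current `std.range.primitives`:

```d
import std.range.primitives : hasLength, isBidirectionalRange,
    isRandomAccessRange;

// As a range, string is decoded on the fly: bidirectional only.
static assert( isBidirectionalRange!string);
static assert(!isRandomAccessRange!string);
static assert(!hasLength!string);

// dstring needs no decoding and is a random-access range with length.
static assert(isRandomAccessRange!dstring);
static assert(hasLength!dstring);

void main()
{
    dstring d = "h\u00E9llo"d;
    assert(d[1] == '\u00E9'); // constant-time access to the second code point
}
```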

On Thu, Dec 29, 2011 at 12:04 PM, Vladimir Panteleev
<vladimir@thecybershadow.net> wrote:
> On Wednesday, 28 December 2011 at 19:00:53 UTC, Andrei Alexandrescu wrote:
>>
>> On 12/28/11 12:46 PM, Walter Bright wrote:
>>>
>>> On 12/28/2011 10:35 AM, Peter Alexander wrote:
>>>>
>>>> On 28/12/11 6:15 PM, Walter Bright wrote:
>>>>>
>>>>> If such a change is made, then people will use const string when they
>>>>> mean immutable, and the values underneath are not guaranteed to be
>>>>> consistent.
>>>>
>>>>
>>>> Then people should learn what const and immutable mean!
>>>>
>>>> I don't think it's fair to dismiss my suggestion on the grounds that
>>>> people
>>>> don't understand the language.
>>>
>>>
>>> People do what is convenient, and as endless experience shows, doing the
>>> right thing should be easier than doing the wrong thing. If you present
>>> people with a choice:
>>>
>>> #1: string s;
>>> #2: immutable(char)[] s;
>>>
>>> sure as the sun rises, they will type the former, and it will be subtly
>>> incorrect if string is const(char)[].
>>>
>>> Telling people they should know better and pick #2 instead is a strategy
>>> that never works very well - not for programming, nor any other endeavor.
>>
>>
>> Oh, one more thing - one good thing that could come out of this thread is
>> abolition (through however slow a deprecation path) of s.length and s[i] for
>> narrow strings. Requiring s.rep.length instead of s.length and s.rep[i]
>> instead of s[i] would improve the quality of narrow strings tremendously.
>> Also, s.rep[i] should return ubyte/ushort, not char/wchar. Then, people
>> would access the decoding routines on the needed occasions, or would
>> consciously use the representation.
>
>
> I think it would be simpler to just make dstring the default string type.
>
> dstring is simple and safe. People who want better memory usage can use
> UTF-8 at their own discretion.



-- 
Bye,
Gor Gyolchanyan.
December 29, 2011
Re: string is rarely useful as a function argument
On Thu, 29 Dec 2011 16:36:59 +1100, Walter Bright  
<newshound2@digitalmars.com> wrote:

> I've seen the damage done in C++ with multiple string types. Being able  
> to convert from one to the other doesn't help much.

I'm not quite sure about that last sentence. I suspect that the better way  
for applications to handle strings of characters would be to internally  
store and manipulate them as utf-32 (dchar[]) and only when doing I/O use  
the other utf forms. So converting from the different forms is very  
helpful.
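
A sketch of that approach, assuming `std.conv.to` for the conversions at the boundaries; `reverseCodePoints` is a hypothetical helper chosen only to have some manipulation in the middle:

```d
import std.algorithm.mutation : reverse;
import std.conv : to;

// Hypothetical helper: reverse text by code point, working in UTF-32.
string reverseCodePoints(string utf8)
{
    dchar[] work = utf8.to!dstring.dup; // decode once on the way in
    reverse(work);                      // in-place work on whole code points
    return work.to!string;              // encode once on the way out
}

void main()
{
    // "héllo" reversed, with the two-byte é kept intact.
    assert(reverseCodePoints("h\u00E9llo") == "oll\u00E9h");
}
```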

-- 
Derek Parnell
Melbourne, Australia