December 31, 2011
Re: string is rarely useful as a function argument
I don't know that Unicode expertise is really required here anyway.  All one has to know is that UTF8 is a multibyte encoding and built-in string attributes talk in bytes. Knowing when one wants bytes vs characters isn't rocket science. That said, I'm on the fence about this change. It breaks consistency for a benefit I'm still weighing. With this change, the char type will still be a single byte, correct?  What happens to foreach on strings?
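The bytes-vs-characters distinction (and the answer to the foreach question) can be seen directly in D; a minimal sketch using `std.utf.count`, which the thread itself mentions:

```d
import std.stdio;
import std.utf : count;

void main()
{
    string s = "héllo";           // 'é' occupies two UTF-8 code units
    writeln(s.length);            // 6  -- .length counts code units (bytes)
    writeln(count(s));            // 5  -- code points
    writeln(cast(ubyte) s[1]);    // 195: first byte of 'é', not a character
    foreach (dchar c; s)          // foreach decodes when asked for dchar
        writef("%s ", c);
    writeln();
}
```

With `foreach (char c; s)` the same loop iterates raw code units instead, which is the consistency question Sean is weighing.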

Sent from my iPhone

On Dec 31, 2011, at 8:20 AM, Timon Gehr <timon.gehr@gmx.ch> wrote:

> On 12/31/2011 03:17 PM, Michel Fortin wrote:
>> 
>> As for Walter being the only one coding by looking at the code units
>> directly, that's not true. All my parser code looks at code units
>> directly and only decodes to code points where necessary (just look at
>> the XML parsing code I posted a while ago to get an idea of how it can
>> apply to ranges). And I don't think it's because I've seen Walter code
>> before, I think it is because I know how Unicode works and I want to
>> make my parser efficient. I've done the same for a parser in C++ a while
>> ago. I can hardly imagine I'm the only one (with Walter and you). I
>> think this is how efficient algorithms dealing with Unicode should be
>> written.
>> 
> 
> +1.
December 31, 2011
Re: string is rarely useful as a function argument
On 12/30/2011 05:27 PM, Timon Gehr wrote:
> On 12/30/2011 10:36 PM, deadalnix wrote:
>>
>> The #1 quality of a programmer is to act like he/she is a moron.
>> Because sometimes we all are morons.
> 
> The #1 quality of a programmer is to write correct code. If he/she acts
> as if he/she is a moron, he/she will write code that acts like a moron.
> Simple as that.

Tsk tsk.  Missing the point.

I believe what deadalnix is trying to say is this:
Programmers should try to write correct code, but should never trust
themselves to write correct code.

...

Programs worth writing are complex enough that there is no way any of us
can write them perfectly correctly on the first draft.  There is always
going to be some polishing, and maybe even /a lot/ of polishing, and
perhaps some complete tear downs and rebuilds from time to time.  "Build
one to throw away; you will anyways."  If you tell me that you can
always write correct code the first time and you never need to go back
and fix anything when you do testing (you do test right?) then I will
have a hard time taking you seriously.

That said, it is extremely pleasant to have a language that catches you
when you inevitably fall.
December 31, 2011
Re: string is rarely useful as a function argument
On 12/31/2011 06:32 PM, Chad J wrote:
> On 12/30/2011 05:27 PM, Timon Gehr wrote:
>> On 12/30/2011 10:36 PM, deadalnix wrote:
>>>
>>> The #1 quality of a programmer is to act like he/she is a moron.
>>> Because sometimes we all are morons.
>>
>> The #1 quality of a programmer is to write correct code. If he/she acts
>> as if he/she is a moron, he/she will write code that acts like a moron.
>> Simple as that.
>
> Tsk tsk.  Missing the point.

Not at all. And I don't take anyone seriously who feels the need to 'Tsk 
tsk' btw.

>
> I believe what deadalnix is trying to say is this:
> Programmers should try to write correct code, but should never trust
> themselves to write correct code.
>

No, programmers should write correct code and then test it thoroughly. 
'Trying to' is the wrong way to go about anything. And there is no need 
to distrust oneself.

Anyway, I have a _very hard time_ translating 'acting like a moron' to 
'writing correct code'.

> ...
>
> Programs worth writing are complex enough that there is no way any of us
> can write them perfectly correctly on the first draft.  There is always
> going to be some polishing, and maybe even /a lot/ of polishing, and
> perhaps some complete tear downs and rebuilds from time to time.  "Build
> one to throw away; you will anyways."  If you tell me that you can
> always write correct code the first time and you never need to go back
> and fix anything when you do testing (you do test right?) then I will
> have a hard time taking you seriously.

Testing is the main part of my development. Furthermore, I use 
assertions all over the place.

>
> That said, it is extremely pleasant to have a language that catches you
> when you inevitably fall.

That is why I also like Haskell.
December 31, 2011
Re: string is rarely useful as a function argument
On 12/30/2011 02:55 PM, Timon Gehr wrote:
> On 12/30/2011 08:33 PM, Joshua Reusch wrote:
>> Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
>>> On 12/29/11 12:28 PM, Don wrote:
>>>> On 28.12.2011 20:00, Andrei Alexandrescu wrote:
>>>>> Oh, one more thing - one good thing that could come out of this thread
>>>>> is abolition (through however slow a deprecation path) of s.length and
>>>>> s[i] for narrow strings. Requiring s.rep.length instead of s.length
>>>>> and
>>>>> s.rep[i] instead of s[i] would improve the quality of narrow strings
>>>>> tremendously. Also, s.rep[i] should return ubyte/ushort, not
>>>>> char/wchar.
>>>>> Then, people would access the decoding routines on the needed
>>>>> occasions,
>>>>> or would consciously use the representation.
>>>>>
>>>>> Yum.
>>>>
>>>>
>>>> If I understand this correctly, most others don't. Effectively, .rep
>>>> just means, "I know what I'm doing", and there's no change to existing
>>>> semantics, purely a syntax change.
>>>
>>> Exactly!
>>>
>>>> If you change s[i] into s.rep[i], it does the same thing as now.
>>>> There's
>>>> no loss of functionality -- it's just stops you from accidentally doing
>>>> the wrong thing. Like .ptr for getting the address of an array.
>>>> Typically all the ".rep" everywhere would get annoying, so you would
>>>> write:
>>>> ubyte [] u = s.rep;
>>>> and use u from then on.
>>>>
>>>> I don't like the name 'rep'. Maybe 'raw' or 'utf'?
>>>> Apart from that, I think this would be perfect.
>>>
>>> Yes, I mean "rep" as a short for "representation" but upon first sight
>>> the connection is tenuous. "raw" sounds great.
>>>
>>> Now I'm twice sorry this will not happen...
>>>
>>
>> Maybe it could happen if we
>> 1. make dstring the default strings type --
> 
> Inefficient.
> 

But correct (enough).

>> code units and characters would be the same
> 
> Wrong.
> 

*sigh*, FINE.  Code units and /code points/ would be the same.

>> or 2. forward string.length to std.utf.count and opIndex to
>> std.utf.toUTFindex
> 
> Inconsistent and inefficient (it blows up the algorithmic complexity).
> 

Inconsistent?  How?

Inefficiency is a lot easier to deal with than incorrectness.  If something
is inefficient, then in the right places I will NOTICE.  If something is
incorrect, it can hide for years until that one person (or country, in
this case) with a different usage pattern than the others uncovers it.
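The "hides for years until one usage pattern uncovers it" failure mode is easy to reproduce; a minimal D sketch of code that passes every ASCII test and corrupts the first non-ASCII input:

```d
import std.stdio;

void main()
{
    string s = "café";
    // Byte-wise reversal: correct for pure ASCII, silently wrong the
    // moment a multi-byte character appears -- the two bytes of 'é'
    // end up swapped, yielding an invalid UTF-8 sequence.
    auto bytes = cast(ubyte[]) s.dup;
    foreach (i; 0 .. bytes.length / 2)
    {
        auto tmp = bytes[i];
        bytes[i] = bytes[$ - 1 - i];
        bytes[$ - 1 - i] = tmp;
    }
    writeln(cast(string) bytes); // mojibake where 'é' should be
}
```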

>>
>> so programmers could use the slices/indexing/length (no lazyness
>> problems), and if they really want codeunits use .raw/.rep (or better
>> .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)
>>
> 
> Anyone who intends to write efficient string processing code needs this.
> Anyone who does not want to write string processing code will not need
> to index into a string -- standard library functions will suffice.
> 

What about people who want to write correct string processing code AND
want to use this handy slicing feature?  Because I totally want both of
these.  Slicing is super useful for script-like coding.

>> But generally I liked the idea of just having an alias for strings...
> 
> Me too. I think the way we have it now is optimal. The only reason we
> are discussing this is because of fear that uneducated users will write
> code that does not take into account Unicode characters above code point
> 0x80. But what is the worst thing that can happen?
> 
> 1. They don't notice. Then it is not a problem, because they are
> obviously only using ASCII characters and it is perfectly reasonable to
> assume that code units and characters are the same thing.
> 

How do you know they are only working with ASCII?  They might be /now/.
But what if someone else uses the program a couple years later when the
original author is no longer maintaining that chunk of code?

> 2. They get screwed up string output, look for the reason, patch up
> their code with some functions from std.utf and will never make the same
> mistakes again.
> 

Except they don't.  Because there are a lot of programmers that will
never put in non-ascii strings to begin with.  But that has nothing to
do with whether or not the /users/ or /maintainers/ of that code will
put non-ascii strings in.  This could make some messes.

> 
> I have *never* seen a user in D.learn complain about it. There might
> have been some I missed, but it is certainly not a prevalent problem.
> Also, just because a user can type .rep does not mean he understands
> Unicode: He is able to make just the same mistakes as before, even more
> so, as the array he is getting back has the _wrong element type_.
> 

You know, here in America (Amurica?) we don't know that other countries
exist.  I think there is a large population of programmers here that
don't even know how to enter non-latin characters, much less would think
to include such characters in their test cases.  These programmers won't
necessarily be found on the internet much, but they will be found in
cubicles all around, doing their 9-to-5 and writing mediocre code that
the rest of us have to put up with.  Their code will pass peer review
(their peers are also from America) and continue working just fine until
someone from one of those confusing other places decides to type in the
characters they feel comfortable typing in.  No, there will not be
/tests/ for code points greater than 0x80, because there is no one
around to write those.  I'd feel a little better if D herds people into
writing correct code to begin with, because they won't otherwise.

...

There's another issue at play here too: efficiency vs correctness as a
default.

Here's the tradeoff --

Option A:
char[i] returns the i'th byte of the string as a (char) type.
Consequences:
(1) Code is efficient and INcorrect.
(2) It requires extra effort to write correct code.
(3) Detecting the incorrect code may take years, as these errors can
hide easily.

Option B:
char[i] returns the i'th codepoint of the string as a (dchar) type.
Consequences:
(1) Code is INefficient and correct.
(2) It requires extra effort to write efficient code.
(3) Detecting the inefficient code happens in minutes.  It is VERY
noticeable when your program runs too slowly.
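The two options can be contrasted directly in D; Option A is today's behavior, and Option B is approximated here as a sketch with the std.utf routines named earlier in the thread, not as an actual proposal implementation:

```d
import std.utf : decode, toUTFindex;

// Option A (status quo): constant time, but may return a fragment
// of a multi-byte character.
char unitAt(string s, size_t i)
{
    return s[i];
}

// Option B (as proposed): always a whole code point, but reaching
// the i'th one requires scanning from the start -- the O(n) cost
// under discussion.
dchar pointAt(string s, size_t i)
{
    size_t u = toUTFindex(s, i); // walk i code points forward
    return decode(s, u);         // decode the code point at that unit
}
```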


This is how I see it.

And I really like my correct code.  If it's too slow, and I'll /know/
when it's too slow, then I'll profile->tweak->profile->etc until the
slowness goes away.  I'm totally digging option B.
December 31, 2011
Re: string is rarely useful as a function argument
On 12/31/2011 01:13 PM, Timon Gehr wrote:
> On 12/31/2011 06:32 PM, Chad J wrote:
>> On 12/30/2011 05:27 PM, Timon Gehr wrote:
>>> On 12/30/2011 10:36 PM, deadalnix wrote:
>>>>
>>>> The #1 quality of a programmer is to act like he/she is a moron.
>>>> Because sometimes we all are morons.
>>>
>>> The #1 quality of a programmer is to write correct code. If he/she acts
>>> as if he/she is a moron, he/she will write code that acts like a moron.
>>> Simple as that.
>>
>> Tsk tsk.  Missing the point.
> 
> Not at all. And I don't take anyone seriously who feels the need to 'Tsk
> tsk' btw.
> 

Well, you've certainly a right to it.

I just take it a little rough when it seems like someone's words are
being intentionally misread.

>>
>> I believe what deadalnix is trying to say is this:
>> Programmers should try to write correct code, but should never trust
>> themselves to write correct code.
>>
> 
> No, programmers should write correct code and then test it thoroughly.
> 'Trying to' is the wrong way to go about anything. And there is no need
> to distrust oneself.
> 

There's a perfect reason to distrust oneself: oneself is a squishy
meatbag that makes mistakes.

Repeated "trying" with rigor applied will lead to success.

> Anyway, I have a _very hard time_ translating 'acting like a moron' to
> 'writing correct code'.
> 

I'm pretty sure it's suggestive.  If an intelligent or careful person
acts like a moron, then they will be forced to assume that they will
make mistakes, and therefore take measures to ensure that ALL
mistakes are caught and fixed or mitigated.  That is how you get from
'acting like a moron' to 'writing correct code'.

>> ...
>>
>> Programs worth writing are complex enough that there is no way any of us
>> can write them perfectly correctly on the first draft.  There is always
>> going to be some polishing, and maybe even /a lot/ of polishing, and
>> perhaps some complete tear downs and rebuilds from time to time.  "Build
>> one to throw away; you will anyways."  If you tell me that you can
>> always write correct code the first time and you never need to go back
>> and fix anything when you do testing (you do test right?) then I will
>> have a hard time taking you seriously.
> 
> Testing is the main part of my development. Furthermore, I use
> assertions all over the place.
> 
>>
>> That said, it is extremely pleasant to have a language that catches you
>> when you inevitably fall.
> 
> That is why I also like Haskell.

I hear ya.  I feel Haskell is an important language to understand, if
not know how to use effectively.  I wish I knew how to use it better
than I do, but I haven't had too many projects that are amenable to it.
December 31, 2011
Re: string is rarely useful as a function argument
On 12/31/11 10:47 AM, Michel Fortin wrote:
> On 2011-12-31 15:03:13 +0000, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> said:
>
>> On 12/31/11 8:17 CST, Michel Fortin wrote:
>>> At one time Java and other frameworks started to use UTF-16 code units as
>>> if they were characters, and that turned out badly for them. Now we know that not
>>> even code points should be considered characters, thanks to characters
>>> spanning multiple code points. You might call it perfect, but for
>>> that you have made two assumptions:
>>>
>>> 1. treating code points as characters is good enough, and
>>> 2. the performance penalty of decoding everything is tolerable
>>
>> I'm not sure how you concluded I drew such assumptions.
>
> 1: Because treating UTF-8 strings as ranges of code points encourages
> people to think so. 2: From things you posted on the newsgroup
> previously. Sorry I don't have the references, but it'd take too long to
> dig them back.

That's sort of difficult to refute. Anyhow, I think it's great that 
algorithms can use types to go down to the representation if needed, and 
stay up at bidirectional range level otherwise.

>>> Ranges of code points might be perfect for you, but it's a tradeoff that
>>> won't work in every situations.
>>
>> Ranges can be defined to span logical glyphs that span multiple code
>> points.
>
> I'm talking about the default interpretation, where string ranges are
> ranges of code units, making that tradeoff the default.
>
> And also, I think we can agree that a logical glyph range would be
> terribly inefficient in practice, although it could be a nice teaching
> tool.

Well people who want that could use byGlyph() or something. If you want 
glyphs, you gotta pay the price.

>>> The whole concept of generic algorithms working on strings efficiently
>>> doesn't work.
>>
>> Apparently std.algorithm does.
>
> First, it doesn't really work.

Oh yes it does.

> It seems to work fine, but it doesn't
> handle (yet) characters spanning multiple code points.

That's the job of std.range, not std.algorithm.

> To handle this
> case, you could use a logical glyph range, but that'd be quite
> inefficient. Or you can improve the algorithm working on code points so
> that it checks for combining characters on the edges, but then is it
> still a generic algorithm?
>
> Second, it doesn't work efficiently. Sure you can specialize the
> algorithm so it does not decode all code units when it's not necessary,
> but then does it still classify as a generic algorithm?
>
> My point is that *generic* algorithms cannot work *efficiently* with
> Unicode, not that they can't work at all. And even then, for the
> inefficient generic algorithm to work correctly with all input, the user
> needs to choose the correct Unicode representation for the problem at
> hand, which requires some general knowledge of Unicode.
>
> Which is why I'd just discourage generic algorithms for strings.

I think you are in a position that is defensible, but not generous and 
therefore undesirable. The military equivalent would be defending a 
fortified landfill drained by a sewer. You don't _want_ to be there. 
Taking your argument to its ultimate conclusion is that we give up on 
genericity for strings and go home.

Strings are a variable-length encoding on top of an array. That is a 
relatively easy abstraction to model. Currently we don't have a 
dedicated model for that - we offer the encoded data as a bidirectional 
range and also the underlying array. Algorithms that work with 
bidirectional ranges work out of the box. Those that can use the 
representation gainfully can opportunistically specialize on isSomeString!R.

You contend that that doesn't "work", and I think you're wrong. But to 
the extent you have a case, an abstraction could be defined for 
variable-length encodings, and algorithms could be defined to work with 
that abstraction. I thought several times about that, but couldn't 
gather enough motivation for the simple reason that the current approach 
_works_.

>>> I'm not against making strings more opaque to encourage people to use
>>> the Unicode algorithms from the standard library instead of rolling
>>> their own.
>>
>> I'd say we're discussing making the two kinds of manipulation (encoded
>> sequence of logical character vs. array of code units) more
>> distinguished from each other. That's a Good Thing(tm).
>
> It's a good abstraction to show the theory of Unicode. But it's not the
> way to go if you want efficiency. For efficiency you need for each
> element in the string to use the lowest abstraction required to handle
> this element, so your algorithm needs to know about the various
> abstraction layers.

Correct.

> This is the kind of "range" I'd use to create algorithms dealing with
> Unicode properly:
>
> struct UnicodeRange(U)
> {
> U frontUnit() @property;
> dchar frontPoint() @property;
> immutable(U)[] frontGlyph() @property;
>
> void popFrontUnit();
> void popFrontPoint();
> void popFrontGlyph();
>
> ...
> }

We already have most of that. For a string s, s[0] is frontUnit, s.front 
is frontPoint, s = s[1 .. $] is popFrontUnit(), s.popFront() is 
popFrontPoint. We only need to define the glyph routines.

But I think you'd be stopping short. You want generic variable-length 
encoding, not the above.

> Not really a range per your definition of ranges, but basically it lets
> you intermix working with units, code points, and glyphs. Add a way to
> slice at the unit level and a way to know the length at the unit level
> and it's all I need to make an efficient parser, or any algorithm really.

Except for the glyphs implementation, we're already there. You are 
talking about existing capabilities!

> The problem with .raw is that it creates a separate range for the units.

That's the best part about it.

> This means you can't look at the frontUnit and then decide to pop the
> unit and then look at the next, decide you need to decode using
> frontPoint, then call popPoint and return to looking at the front unit.

Of course you can.

while (condition) {
  if (s.raw.front == someFrontUnitThatICareAbout) {
     s.raw.popFront();
     auto c = s.front;
     s.popFront();
  }
}

Now that I wrote it I'm even more enthralled with the coolness of the 
scheme. You essentially have access to two separate ranges on top of the 
same fabric.


Andrei
December 31, 2011
Re: string is rarely useful as a function argument
On 12/31/2011 07:22 PM, Chad J wrote:
> On 12/30/2011 02:55 PM, Timon Gehr wrote:
>> On 12/30/2011 08:33 PM, Joshua Reusch wrote:
>>> Am 29.12.2011 19:36, schrieb Andrei Alexandrescu:
>>>> On 12/29/11 12:28 PM, Don wrote:
>>>>> On 28.12.2011 20:00, Andrei Alexandrescu wrote:
>>>>>> Oh, one more thing - one good thing that could come out of this thread
>>>>>> is abolition (through however slow a deprecation path) of s.length and
>>>>>> s[i] for narrow strings. Requiring s.rep.length instead of s.length
>>>>>> and
>>>>>> s.rep[i] instead of s[i] would improve the quality of narrow strings
>>>>>> tremendously. Also, s.rep[i] should return ubyte/ushort, not
>>>>>> char/wchar.
>>>>>> Then, people would access the decoding routines on the needed
>>>>>> occasions,
>>>>>> or would consciously use the representation.
>>>>>>
>>>>>> Yum.
>>>>>
>>>>>
>>>>> If I understand this correctly, most others don't. Effectively, .rep
>>>>> just means, "I know what I'm doing", and there's no change to existing
>>>>> semantics, purely a syntax change.
>>>>
>>>> Exactly!
>>>>
>>>>> If you change s[i] into s.rep[i], it does the same thing as now.
>>>>> There's
>>>>> no loss of functionality -- it's just stops you from accidentally doing
>>>>> the wrong thing. Like .ptr for getting the address of an array.
>>>>> Typically all the ".rep" everywhere would get annoying, so you would
>>>>> write:
>>>>> ubyte [] u = s.rep;
>>>>> and use u from then on.
>>>>>
>>>>> I don't like the name 'rep'. Maybe 'raw' or 'utf'?
>>>>> Apart from that, I think this would be perfect.
>>>>
>>>> Yes, I mean "rep" as a short for "representation" but upon first sight
>>>> the connection is tenuous. "raw" sounds great.
>>>>
>>>> Now I'm twice sorry this will not happen...
>>>>
>>>
>>> Maybe it could happen if we
>>> 1. make dstring the default strings type --
>>
>> Inefficient.
>>
>
> But correct (enough).
>
>>> code units and characters would be the same
>>
>> Wrong.
>>
>
> *sigh*, FINE.  Code units and /code points/ would be the same.

Relax.

>
>>> or 2. forward string.length to std.utf.count and opIndex to
>>> std.utf.toUTFindex
>>
>> Inconsistent and inefficient (it blows up the algorithmic complexity).
>>
>
> Inconsistent?  How?

int[]
bool[]
float[]
char[]

>
> Inefficiency is a lot easier to deal with than incorrectness.  If something
> is inefficient, then in the right places I will NOTICE.  If something is
> incorrect, it can hide for years until that one person (or country, in
> this case) with a different usage pattern than the others uncovers it.
>
>>>
>>> so programmers could use the slices/indexing/length (no lazyness
>>> problems), and if they really want codeunits use .raw/.rep (or better
>>> .utf8/16/32 with std.string.representation(std.utf.toUTF8/16/32)
>>>
>>
>> Anyone who intends to write efficient string processing code needs this.
>> Anyone who does not want to write string processing code will not need
>> to index into a string -- standard library functions will suffice.
>>
>
> What about people who want to write correct string processing code AND
> want to use this handy slicing feature?  Because I totally want both of
> these.  Slicing is super useful for script-like coding.
>

Except that the proposal would make slicing strings go away.

>>> But generally I liked the idea of just having an alias for strings...
>>
>> Me too. I think the way we have it now is optimal. The only reason we
>> are discussing this is because of fear that uneducated users will write
>> code that does not take into account Unicode characters above code point
>> 0x80. But what is the worst thing that can happen?
>>
>> 1. They don't notice. Then it is not a problem, because they are
>> obviously only using ASCII characters and it is perfectly reasonable to
>> assume that code units and characters are the same thing.
>>
>
> How do you know they are only working with ASCII?  They might be /now/.
>   But what if someone else uses the program a couple years later when the
> original author is no longer maintaining that chunk of code?

Then they obviously need to fix the code, because the requirements have 
changed. Most of it will already work correctly though, because UTF-8 
extends ASCII in a natural way.
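The "UTF-8 extends ASCII" claim is a structural property of the encoding: every byte below 0x80 stands for itself, and every byte of a multi-byte sequence is 0x80 or above, so byte-level searches for ASCII delimiters stay correct on any UTF-8 input. A sketch:

```d
import std.stdio;

void main()
{
    string s = "prix: 10€";
    // The bytes of '€' are all >= 0x80, so scanning for the ASCII
    // colon byte cannot produce a false match inside it.
    foreach (i, char c; s)
        if (c == ':')
            writeln("colon at byte ", i); // byte 4, same as for pure ASCII
}
```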

>
>> 2. They get screwed up string output, look for the reason, patch up
>> their code with some functions from std.utf and will never make the same
>> mistakes again.
>>
>
> Except they don't.  Because there are a lot of programmers that will
> never put in non-ascii strings to begin with.  But that has nothing to
> do with whether or not the /users/ or /maintainers/ of that code will
> put non-ascii strings in.  This could make some messes.
>
>>
>> I have *never* seen a user in D.learn complain about it. There might
>> have been some I missed, but it is certainly not a prevalent problem.
>> Also, just because a user can type .rep does not mean he understands
>> Unicode: He is able to make just the same mistakes as before, even more
>> so, as the array he is getting back has the _wrong element type_.
>>
>
> You know, here in America (Amurica?) we don't know that other countries
> exist.  I think there is a large population of programmers here that
> don't even know how to enter non-latin characters, much less would think
> to include such characters in their test cases.  These programmers won't
> necessarily be found on the internet much, but they will be found in
> cubicles all around, doing their 9-to-5 and writing mediocre code that
> the rest of us have to put up with.  Their code will pass peer review
> (their peers are also from America) and continue working just fine until
> someone from one of those confusing other places decides to type in the
> characters they feel comfortable typing in.  No, there will not be
> /tests/ for code points greater than 0x80, because there is no one
> around to write those.  I'd feel a little better if D herds people into
> writing correct code to begin with, because they won't otherwise.
>

There is no way to 'herd people into writing correct code' and UTF-8 is 
quite easy to deal with.

> ...
>
> There's another issue at play here too: efficiency vs correctness as a
> default.
>
> Here's the tradeoff --
>
> Option A:
> char[i] returns the i'th byte of the string as a (char) type.
> Consequences:
> (1) Code is efficient and INcorrect.

Do you have an example of impactful incorrect code resulting from those 
semantics?

> (2) It requires extra effort to write correct code.
> (3) Detecting the incorrect code may take years, as these errors can
> hide easily.

None of those is a direct consequence of char[i] returning char. They 
are the consequence of at least 3 things:

1. char[] is an array of char
2. immutable(char)[] is the default string type
3. the programmer does not know about 1. and/or 2.

I say, 1. is inevitable. You say 3. is inevitable. If we are both right, 
then 2. is the culprit.

>
> Option B:
> char[i] returns the i'th codepoint of the string as a (dchar) type.
> Consequences:
> (1) Code is INefficient and correct.

It is awfully optimistic to assume the code will be correct.

> (2) It requires extra effort to write efficient code.
> (3) Detecting the inefficient code happens in minutes.  It is VERY
> noticeable when your program runs too slowly.
>

Except when in testing only small inputs are used and only 2 years later 
maintainers throw your program at a larger problem instance and wonder 
why it does not terminate. Or your program is DOS'd. Polynomial blowup 
in runtime can be just as large a problem in practice as a correctness bug.

>
> This is how I see it.
>
> And I really like my correct code.  If it's too slow, and I'll /know/
> when it's too slow, then I'll profile->tweak->profile->etc until the
> slowness goes away.  I'm totally digging option B.

Those kinds of inefficiencies build up and make the whole program run 
sluggish, and it will possibly be too late by the time you notice.

Option B is not even on the table. This thread is about a breaking 
interface change and special casing T[] for T in {char, wchar}.
December 31, 2011
Re: string is rarely useful as a function argument
On 12/31/11 10:47 AM, Sean Kelly wrote:
> I don't know that Unicode expertise is really required here anyway.
> All one has to know is that UTF8 is a multibyte encoding and
> built-in string attributes talk in bytes. Knowing when one wants
> bytes vs characters isn't rocket science. That said, I'm on the fence
> about this change. It breaks consistency for a benefit I'm still
> weighing. With this change, the char type will still be a single
> byte, correct? What happens to foreach on strings?

Clearly this is a what-if debate. The best level of agreement we could
ever reach is "well, it would've been nice... sigh".

It's possible that we'll define a Rope type in std.container - a
heavy-duty string type with small string optimization, interning, the
works. That type may use insights we are deriving from this exchange.


Andrei
December 31, 2011
Re: string is rarely useful as a function argument
On 12/31/2011 08:06 PM, Andrei Alexandrescu wrote:
> On 12/31/11 10:47 AM, Sean Kelly wrote:
>> I don't know that Unicode expertise is really required here anyway.
>> All one has to know is that UTF8 is a multibyte encoding and
>> built-in string attributes talk in bytes. Knowing when one wants
>> bytes vs characters isn't rocket science. That said, I'm on the fence
>> about this change. It breaks consistency for a benefit I'm still
>> weighing. With this change, the char type will still be a single
>> byte, correct? What happens to foreach on strings?
>
> Clearly this is a what-if debate. The best level of agreement we could
> ever reach is "well, it would've been nice... sigh".
>
> It's possible that we'll define a Rope type in std.container - a
> heavy-duty string type with small string optimization, interning, the
> works. That type may use insights we are deriving from this exchange.
>
>
> Andrei

That would be great.
December 31, 2011
Re: string is rarely useful as a function argument
On 2011-12-31 18:56:01 +0000, Andrei Alexandrescu 
<SeeWebsiteForEmail@erdani.org> said:

> On 12/31/11 10:47 AM, Michel Fortin wrote:
>> It seems to work fine, but it doesn't
>> handle (yet) characters spanning multiple code points.
> 
> That's the job of std.range, not std.algorithm.

As I keep saying, if you handle combining code points at the range 
level you'll have very inefficient code. But I think you get that.

>> To handle this
>> case, you could use a logical glyph range, but that'd be quite
>> inefficient. Or you can improve the algorithm working on code points so
>> that it checks for combining characters on the edges, but then is it
>> still a generic algorithm?
>> 
>> Second, it doesn't work efficiently. Sure you can specialize the
>> algorithm so it does not decode all code units when it's not necessary,
>> but then does it still classify as a generic algorithm?
>> 
>> My point is that *generic* algorithms cannot work *efficiently* with
>> Unicode, not that they can't work at all. And even then, for the
>> inefficient generic algorithm to work correctly with all input, the user
>> needs to choose the correct Unicode representation for the problem at
>> hand, which requires some general knowledge of Unicode.
>> 
>> Which is why I'd just discourage generic algorithms for strings.
> 
> I think you are in a position that is defensible, but not generous and 
> therefore undesirable. The military equivalent would be defending a 
> fortified landfill drained by a sewer. You don't _want_ to be there.

I don't get the analogy.

> Taking your argument to its ultimate conclusion is that we give up on 
> genericity for strings and go home.

That is more or less what I am saying. Genericity for strings leads to 
inefficient algorithms, and you don't want inefficient algorithms, at 
least not without being warned in advance. This is why for instance you 
give a special name to inefficient (linear) operations in 
std.container. In the same way, I think generic operations on strings 
should be disallowed unless you opt in by explicitly saying on which 
representation you want the algorithm to perform its task.


>> This is the kind of "range" I'd use to create algorithms dealing with
>> Unicode properly:
>> 
>> struct UnicodeRange(U)
>> {
>> U frontUnit() @property;
>> dchar frontPoint() @property;
>> immutable(U)[] frontGlyph() @property;
>> 
>> void popFrontUnit();
>> void popFrontPoint();
>> void popFrontGlyph();
>> 
>> ...
>> }
> 
> We already have most of that. For a string s, s[0] is frontUnit, 
> s.front is frontPoint, s = s[1 .. $] is popFrontUnit(), s.popFront() is 
> popFrontPoint. We only need to define the glyph routines.

Indeed. I came up with this concept when writing my XML parser: I defined 
frontUnit and popFrontUnit and used it all over the place (in 
conjunction with slicing). And I rarely needed to decode whole code 
points using front and popFront.

> But I think you'd be stopping short. You want generic variable-length 
> encoding, not the above.

Really? How'd that work?

> Except for the glyphs implementation, we're already there. You are 
> talking about existing capabilities!
> 
>> The problem with .raw is that it creates a separate range for the units.
> 
> That's the best part about it.

Depends. It should create a *linked* range, not a *separate* one, in 
the sense that if you advance the "raw" range with popFront, it should 
advance the underlying "code point" range too.
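A "linked" raw view of this kind could be sketched as a wrapper holding a pointer to the string, so popping a unit advances the code-point view as well (hypothetical type, not in Phobos):

```d
struct RawView
{
    string* s; // shares the underlying string, so pops are seen by both views

    @property bool empty() const { return (*s).length == 0; }
    @property ubyte front() const { return cast(ubyte) (*s)[0]; }
    void popFront() { *s = (*s)[1 .. $]; } // advances the original string too
}

RawView raw(ref string s)
{
    return RawView(&s);
}
```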

>> This means you can't look at the frontUnit and then decide to pop the
>> unit and then look at the next, decide you need to decode using
>> frontPoint, then call popPoint and return to looking at the front unit.
> 
> Of course you can.
> 
> while (condition) {
>    if (s.raw.front == someFrontUnitThatICareAbout) {
>       s.raw.popFront();
>       auto c = s.front;
>       s.popFront();
>    }
> }

But will s.raw.popFront() also pop a single unit from s? "raw" would 
need to be defined as a reinterpret cast of the reference to the char[] 
to do what I want, something like this:

	ref ubyte[] raw(ref char[] s) { return *cast(ubyte[]*)&s; }

The current std.string.representation doesn't do that at all.

Also, how does it work with slicing? It can work with raw, but you'll 
have to cast things everywhere because raw is a ubyte[]:

	string s = "éà";
	s = cast(typeof(s))s.raw[0..4];


> Now that I wrote it I'm even more enthralled with the coolness of the 
> scheme. You essentially have access to two separate ranges on top of 
> the same fabric.

Glad you like the concept.


-- 
Michel Fortin
michel.fortin@michelf.com
http://michelf.com/