earthquake changes of std.regexp to come (page 3) - D Programming Language Discussion Forum

February 17, 2009

Re: earthquake changes of std.regexp to come

Posted by Andrei Alexandrescu
in reply to Bill Baxter

Bill Baxter wrote:
> On Wed, Feb 18, 2009 at 3:36 AM, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
> 
>> Besides std.regexp only works with (narrow) strings and we want it to work
>> on streams of all widths and structures. One pet complaint I have is that
>> std.regexp puts a class around it all as if everybody's favorite pastime
>> would be to inherit Regexp and override some random function in it.
> 
> So what do you think it should be, a struct?

Yes.

> That would imply to me that everybody's favorite pastime is making
> value copies of regex structures, when in fact nobody does that.

Well, you'd be surprised. The RegExp class saves the state of the last search, which is a sensible thing to do. But then consider a simple range Splitter that, when iterated, nicely gives you...

string a = ",a,  bcd, def,gh,";
foreach (e; splitter(a, pattern(", *")))
    writeln("[", e, "]");

writes

[]
[a]
[bcd]
[def]
[gh]

This is similar to the function std.regex.split with the notable difference that no extra memory is allocated. Now Splitter is an input range. This means you wouldn't expect that copying a Splitter and then iterating the original would affect the copy. Well, that's exactly what happens when you use the "good" reference semantics of the RegExp stored inside Splitter. Worse, RegExp has no cloning primitive, so I need to resort to storing the pattern and recompiling it from scratch at every copy of Splitter. So essentially the "good" semantics of RegExp are useless when it comes to composing it in larger objects.
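To make the aliasing problem concrete, here is a minimal sketch in present-day D. The names `Cursor`, `SharedWalker`, and `ValueWalker` are hypothetical and invented for illustration; they are not part of std.regexp, and the ranges simply walk characters instead of splitting on a pattern. The only point is that copying a struct which holds a class reference copies the reference, not the state behind it.

```d
import std.stdio;

// Hypothetical stand-in for a class that remembers where the last
// search stopped (not the real std.regexp.RegExp).
class Cursor
{
    size_t pos;
}

// Hypothetical range that keeps its position inside a class object.
struct SharedWalker
{
    string input;
    Cursor state;

    @property bool empty() { return state.pos >= input.length; }
    @property char front() { return input[state.pos]; }
    void popFront() { ++state.pos; }
}

// The same range with the position stored as a plain value member.
struct ValueWalker
{
    string input;
    size_t pos;

    @property bool empty() { return pos >= input.length; }
    @property char front() { return input[pos]; }
    void popFront() { ++pos; }
}

void main()
{
    auto a = SharedWalker("abc", new Cursor);
    auto b = a;        // copies the struct, but both share one Cursor
    a.popFront();
    writeln(b.front);  // prints "b": advancing a also moved the copy

    auto c = ValueWalker("abc");
    auto d = c;        // independent copy of the position
    c.popFront();
    writeln(d.front);  // prints "a": the copy is unaffected
}
```

With the value-member version, a copy behaves like a snapshot, which is what one would expect from copying a range.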

> Regex is a class in order to give it reference semantics and provide
> encapsulation of some re-usable state.  Maybe it should be a final
> class, but my impression is "final class" doesn't really work in D.

Re-usable state is provided by structs too. In addition they can choose value vs. reference semantics with ease.

Andrei

February 17, 2009

Re: earthquake changes of std.regexp to come

Posted by Andrei Alexandrescu
in reply to Bill Baxter

Bill Baxter wrote:
> On Wed, Feb 18, 2009 at 3:36 AM, Andrei Alexandrescu
> <SeeWebsiteForEmail@erdani.org> wrote:
>> I'm quite unhappy with the API of std.regexp. It's a chaotic design that
>> provides a hodgepodge of functionality and tries to use as many synonyms of
>> "to find" in the dictionary (e.g. search, match). I could swear Walter never
>> really cared for using regexps, and that is felt throughout the design: it
>> fills the bullet point but it's asinine to use.
>>
>> Besides std.regexp only works with (narrow) strings and we want it to work
>> on streams of all widths and structures. One pet complaint I have is that
>> std.regexp puts a class around it all as if everybody's favorite pastime
>> would be to inherit Regexp and override some random function in it.
>>
>> In the upcoming releases of D 2.0 there will be rather dramatic breaking
>> changes of phobos. I just wanted to ask whether y'all could stomach yet
>> another rewritten API or you'd rather use std.regexp as it is for the time
>> being.
>>
> 
> Btw, I've got no problems with you breaking the API of 2.0 either.
> Though you might consider moving the current implementation to
> std.deprecated.regex and leaving it there for a year with a
> pragma(msg, "This module is deprecated").
> 
> That way making a quick fix to broken code is just a matter of
> inserting ".deprecated" into your import statements.


I was thinking of moving older stuff to etc, is that ok?

Andrei

February 17, 2009

Re: earthquake changes of std.regexp to come

Posted by Andrei Alexandrescu
in reply to Daniel de Kok

Daniel de Kok wrote:
> On Tue, Feb 17, 2009 at 8:39 PM, BCS <ao@pathlink.com> wrote:
>> For what it's worth, I have a partial clone of the .NET API built on top of
>> PCRE. I would have to ask my boss but I expect I could donate it if anyone
>> wants to use it as a basis.
> 
> Actually, I was wondering why nobody is considering real regular
> languages anymore, that can be compiled to a normal finite state
> recognizer or transducer. While this may not be as fancy as Perl-like
> extensions, they are much faster, and it's easier to do fun stuff such
> as composition.

I am considering that. One nice feature of "classic" regexes is that they never backtrack, so they work with pure input iterators. This has crucial consequences with regard to where and how regexes fit the range concept hierarchy.
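As a rough illustration of why a non-backtracking matcher only needs an input range, here is a sketch in present-day D. The automaton is hand-written for the single pattern `ab*c`, and `step`, `matches`, and `OneShot` are hypothetical names; a real library would build the state table from the pattern. Each element is read once and the range is never rewound or saved.

```d
import std.range;
import std.stdio;

// Hand-written DFA for the regular language a b* c (illustrative only).
// States: 0 = start, 1 = seen 'a' plus any number of 'b',
//         2 = accept, 3 = dead.
int step(int state, dchar ch)
{
    switch (state)
    {
        case 0: return ch == 'a' ? 1 : 3;
        case 1: return ch == 'b' ? 1 : (ch == 'c' ? 2 : 3);
        default: return 3; // input after acceptance, or already dead
    }
}

// Works on any input range: only empty, front and popFront are used.
bool matches(R)(R input) if (isInputRange!R)
{
    int state = 0;
    for (; !input.empty; input.popFront())
        state = step(state, input.front);
    return state == 2;
}

// Hypothetical single-pass source with no save(), so it is an input
// range but not a forward range.
struct OneShot
{
    string data;
    @property bool empty() { return data.length == 0; }
    @property char front() { return data[0]; }
    void popFront() { data = data[1 .. $]; }
}

static assert(isInputRange!OneShot && !isForwardRange!OneShot);

void main()
{
    writeln(matches("abbbc"));        // true
    writeln(matches(OneShot("ac")));  // true
    writeln(matches("abca"));         // false
}
```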


Andrei

February 17, 2009

Re: earthquake changes of std.regexp to come

Posted by Andrei Alexandrescu
in reply to Joel C. Salomon

Joel C. Salomon wrote:
> bearophile wrote:
>> I agree that the API of regexes in Phobos is not much good, but I think designing a good API for it is quite hard.
> 
> So steal one, rather than invent something new. My suggestion would be
> to expose the DFA object, as in Plan 9’s library (documentation at
> <http://plan9.bell-labs.com/magic/man2html/2/regexp>, implementation at
> <http://plan9.bell-labs.com/sources/plan9/sys/src/libregexp/>,
> discussion and links to a Unix implementation at
> <http://swtch.com/~rsc/regexp/>).
> 
> Simple API:
> • regcomp: Compile a regexp DFA;
> • regexec: Apply it to a string, returning a slice of the string that
> matches the first hit (or an array of slices if parenthesized
> expressions are used); and

s/string/input range/

Also returning a range instead of an array of slices is more flexible.
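For comparison, this is roughly what "returning a range" looks like in the std.regex module that ships with present-day Phobos, which is not necessarily the API being sketched in this thread. `matchFirst` returns a captures object that can be iterated like a range and also indexed; the input string and pattern below are made up for the example.

```d
import std.regex : matchFirst, regex;
import std.stdio;

void main()
{
    auto m = matchFirst("posted 2009-02-17", regex(`(\d+)-(\d+)-(\d+)`));
    if (!m.empty)
    {
        writeln(m.hit);        // the whole match: 2009-02-17
        foreach (group; m)     // whole match first, then each group
            writeln(group);
        writeln(m[1]);         // groups are also indexable: 2009
    }
}
```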

Andrei

February 17, 2009

Re: earthquake changes of std.regexp to come

Posted by bearophile
in reply to Andrei Alexandrescu

Andrei Alexandrescu:
> string a = ",a,  bcd, def,gh,";
> foreach (e; splitter(a, pattern(", *")))
>      writeln("[", e, "]");

(I often use xsplit(), which is like split but yields items lazily; for larger strings it's much faster.)

A better approach is to fuse xsplit and this splitter into a single lazy generator whose second argument can be a string, a char, or an RE pattern.
A third, optional argument can be the maximum number of splits (once that maximum is reached, it yields the rest of the string as the last item).

You can then add an eager splitter function with the same signature, that outputs an array.
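A minimal sketch of that idea in present-day D, under some loud assumptions: `LazySplit` and `eagerSplit` are hypothetical names, and only a plain string separator is handled (no char or RE pattern overloads). The lazy version hands out slices of the input, and the eager one just collects the lazy one into an array.

```d
import std.array : array;
import std.stdio;
import std.string : indexOf;

// Hypothetical lazy splitter; name and signature are assumptions.
struct LazySplit
{
    private string rest;       // unconsumed part of the input
    private string sep;
    private size_t splitsLeft; // how many more cuts are allowed
    private string current;    // item returned by front
    private bool lastEmitted;  // true once the final item is in current
    private bool done;         // true once the range is exhausted

    this(string input, string sep, size_t maxSplits = size_t.max)
    {
        assert(sep.length > 0, "separator must not be empty");
        this.rest = input;
        this.sep = sep;
        this.splitsLeft = maxSplits;
        advance();
    }

    @property bool empty() { return done; }
    @property string front() { return current; }
    void popFront() { advance(); }

    private void advance()
    {
        if (lastEmitted) { done = true; return; }
        // Stop searching once the split budget is used up.
        ptrdiff_t i = splitsLeft > 0 ? rest.indexOf(sep) : -1;
        if (i < 0)
        {
            current = rest;      // everything that is left, unsplit
            lastEmitted = true;
        }
        else
        {
            current = rest[0 .. i];            // a slice, not a copy
            rest = rest[i + sep.length .. $];
            --splitsLeft;
        }
    }
}

// Eager counterpart with the same signature, returning an array.
string[] eagerSplit(string input, string sep, size_t maxSplits = size_t.max)
{
    return LazySplit(input, sep, maxSplits).array;
}

void main()
{
    foreach (item; LazySplit("a,b,c,d", ",", 2))
        writeln("[", item, "]");      // [a] then [b] then [c,d]
    writeln(eagerSplit(",a,", ","));  // ["", "a", ""]
}
```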

Bye,
bearophile

February 17, 2009

Re: earthquake changes of std.regexp to come

Posted by Bill Baxter
in reply to Andrei Alexandrescu

On Wed, Feb 18, 2009 at 6:56 AM, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
> Bill Baxter wrote:
>>
>> On Wed, Feb 18, 2009 at 3:36 AM, Andrei Alexandrescu <SeeWebsiteForEmail@erdani.org> wrote:
>>
>>> Besides std.regexp only works with (narrow) strings and we want it to
>>> work
>>> on streams of all widths and structures. One pet complaint I have is that
>>> std.regexp puts a class around it all as if everybody's favorite pastime
>>> would be to inherit Regexp and override some random function in it.
>>
>> So what do you think it should be, a struct?
>
> Yes.
>
>> That would imply to me that everybody's favorite pastime is making value copies of regex structures, when in fact nobody does that.
>
> Well you'd be surprised. The RegEx class saves the state of the last search, which is a sensible thing to do. But then consider a simple range Splitter that, when iterated, nicely gives you...
>
> string a = ",a,  bcd, def,gh,";
> foreach (e; splitter(a, pattern(", *")))
>    writeln("[", e, "]");
>
> writes
>
> []
> [a]
> [bcd]
> [def]
> [gh]
>
> This is similar to the function std.regex.split with the notable difference that no extra memory is allocated. Now Splitter is an input range. This means you wouldn't expect that you copy a Splitter and then have iterating the original value affect the copy. Well, that's exactly what happens when you use the "good" reference semantics of the RegEx stored inside splitter. Worse, RegExp has no cloning primitive, so I need to resort to storing the pattern and recompiling it from scratch at every copy of Splitter. So essentially the "good" semantics of RegEx are useless when it comes to composing it in larger objects.

So that sounds to me like RegEx should have a .dup, and then it would be fine, no?  I agree it should have a dup for the odd occasion when you do want to make a copy for some reason.

>> Regex is a class in order to give it reference semantics and provide encapsulation of some re-usable state.  Maybe it should be a final class, but my impression is "final class" doesn't really work in D.

> Re-usable state is provided by structs too. In addition they can choose value vs. reference semantics with ease.

I think this choice is not so much available with D1, plus the constructor situation with D1 is less than ideal.  Given that, I think the choice of class for RegEx was appropriate.   But if the struct problems are all going away in D2, then that's great.  Sounds like you're saying we'll really be able to use D structs just like one uses a non-polymorphic C++ class.  If so, then that's super.

--bb

February 17, 2009

Re: earthquake changes of std.regexp to come

Posted by Andrei Alexandrescu
in reply to dsimcha

dsimcha wrote:
> == Quote from dsimcha (dsimcha@yahoo.com)'s article
>> == Quote from Andrei Alexandrescu (SeeWebsiteForEmail@erdani.org)'s article
>>> I'm quite unhappy with the API of std.regexp. It's a chaotic design that
>>> provides a hodgepodge of functionality and tries to use as many synonyms
>>> of "to find" in the dictionary (e.g. search, match). I could swear
>>> Walter never really cared for using regexps, and that is felt throughout
>>> the design: it fills the bullet point but it's asinine to use.
>>> Besides std.regexp only works with (narrow) strings and we want it to
>>> work on streams of all widths and structures. One pet complaint I have
>>> is that std.regexp puts a class around it all as if everybody's favorite
>>> pastime would be to inherit Regexp and override some random function in it.
>>> In the upcoming releases of D 2.0 there will be rather dramatic breaking
>>> changes of phobos. I just wanted to ask whether y'all could stomach yet
>>> another rewritten API or you'd rather use std.regexp as it is for the
>>> time being.
>>> Andrei
>> As I've said before, anyone who can't stomach breaking changes w/o complaining has
>> no business using D2 at this point.  I'd rather deal with the aggravation of stuff
>> breaking in the short run to have a nice language and libraries to go with it in
>> the long run.
>> This whole concept of ranges as you've created them seems to have achieved the
>> holy grail of both making simple things simple and complex things possible, where
>> "complex things" includes needing code to be efficient, so I can see your reason
>> for wanting to redo all kinds of stuff in them.  This compares favorably to C++
>> STL iterators, which are very flexible and efficient but a huge PITA to use for
>> simple things because the syntax is so low-level and ugly, and to the D1/early D2
>> way, which gives beautiful, simple notation for the more common cases (basic
>> dynamic arrays), at the expense of flexibility when doing more complicated things
>> like streams, chaining, strides, etc.
> 
> BTW, can you elaborate on how arrays, both builtin and any library versions, will
> work when everything is finalized?

Well, finalization hinges not only on me but also on Walter (bug fixes and a couple of new features) and on all of you, with your continuous stream of great suggestions and ideas. Again, without being able to experiment much, I don't have a clear idea of what arrays/containers should ideally look like. The interesting challenge is reconciling good, precise semantics with the freedom given by garbage collection. Here are some highlights:

* Today's T[] will be firmly an incarnation of the random-access range concept, to the extent that all code expecting a random-access range can always be passed a T[] without any impedance adaptation.

* $ will be generalized to mean "end of range" even for infinite ranges.

* We don't have a solution to address the perils of extending a slice by using ~=. We're considering adding the type T[new], but I'm not sure we should take the hit of a new built-in type constructor, particularly when it's implementable as a library.

* Fixed-size arrays will in all likelihood be value types. We couldn't find any other semantics that works.

* Containers will have value semantics.

* "Resources come and go; memory is forever" is the likely default in D resource management. This means that destroying e.g. an array of File objects will close the underlying files, but will not deallocate the memory allocated for them. In essence, destroying values means calling the destructor but not delete-ing them (unless of course they're on the stack). This approach has a number of disadvantages, but plenty of advantages that compensate for them in most applications.

* std.matrix will define memory layouts for a variety of popular libraries and also the common means to iterate said layouts.

* For those who want containers with reference semantics, they can use the type Class!(T) for any value type T. That includes built-in value types (int, float...) and whichever value containers we define. It's unclear to me whether this is enough to satisfy those in need of complex container hierarchies.
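Two of the points above (T[] as a random-access range, and fixed-size arrays as value types) can be checked against present-day D, where they did end up holding. The sketch below says nothing about the other items in the list, such as T[new] or Class!(T), which were still open at the time of this post.

```d
import std.range.primitives : isRandomAccessRange;
import std.stdio;

// A built-in slice satisfies the random-access range concept as is.
static assert(isRandomAccessRange!(int[]));

void main()
{
    int[] xs = [10, 20, 30, 40];
    auto tail = xs[1 .. $];  // inside the brackets, $ is xs.length
    tail[0] = 99;            // slices alias the same memory
    writeln(xs);             // [10, 99, 30, 40]

    int[3] fixed = [1, 2, 3];
    auto copy = fixed;       // fixed-size arrays are copied by value
    copy[0] = 7;
    writeln(fixed);          // [1, 2, 3]
    writeln(copy);           // [7, 2, 3]
}
```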

Andrei

February 17, 2009

Re: earthquake changes of std.regexp to come

Posted by Walter Bright
in reply to Andrei Alexandrescu

Andrei Alexandrescu wrote:
> I was thinking of moving older stuff to etc, is that ok?

Yes. But you should also rename the new one, perhaps to std.regex. That way, legacy code will refuse to compile, rather than compile wrongly.

February 17, 2009

Re: earthquake changes of std.regexp to come

Posted by Andrei Alexandrescu
in reply to Bill Baxter

Bill Baxter wrote:
> I think this choice is not so much available with D1, plus the
> constructor situation with D1 is less than ideal.  Given that, I think
> the choice of class for RegEx was appropriate.   But if the struct
> problems are all going away in D2, then that's great.  Sounds like
> you're saying we'll really be able to use D structs just like one uses
> a non-polymorphic C++ class.  If so, then that's super.

I lost that perspective when criticizing RegExp, you're right. But still the API is lousy - every single time I am using a RegExp, I find myself fumbling through the thoroughly overlapping primitives in the documentation, and never seem to find an idiom that's simple, comfortable, and memorable.

Andrei

February 17, 2009

Re: earthquake changes of std.regexp to come

Posted by Andrei Alexandrescu
in reply to Walter Bright

Walter Bright wrote:
> Andrei Alexandrescu wrote:
>> I was thinking of moving older stuff to etc, is that ok?
> 
> Yes. But you should also rename the new one, perhaps to std.regex. That way, legacy code will refuse to compile, rather than compile wrongly.

Terrific. I prefer "regex" to "regexp" because it's easier to pronounce, particularly if you're a foreigner. "Regex" sounds like a frog utterance by a forest lake, "regexp" sounds like nothing in particular.

Andrei

Copyright © 1999-2021 by the D Language Foundation