June 13, 2014
On Thursday, 12 June 2014 at 08:42:49 UTC, Dmitry Olshansky wrote:
> 11-Jun-2014 22:03, Atila Neves wrote:
>> On Tuesday, 10 June 2014 at 19:36:57 UTC, bearophile wrote:
>>> At about 40:42 in the "Thoughts on static regex" talk it says
>>> "even compile-time printf would be awesome". There is a patch for
>>> __ctWrite on GitHub; it should be fixed and merged.
>>>
>>> Bye,
>>> bearophile
>>
>> I wish I'd taken the mic at the end, and 2 days later Adam D. Ruppe said
>> what I was thinking of saying: unit test and debug the CTFE function at
>> runtime and then use it at compile-time when it's ready for production.
>>
>
> Yes, that's a starting point - a function working at R-T.
>
>> Yes, Dmitry brought up compiler bugs. But if you write a compile-time UT
>> and it fails, you'll know it wasn't your own code's fault, because the
>> run-time ones still pass.
>
> It doesn't help that it's not your fault :)
> And with a bit of __ctfe's to work around compiler bugs you won't be so sure of your code anymore.
>
>>
>> Maybe there's still a place for something more than pragma(msg), but I'd
>> definitely advocate for the above, at least in the beginning. If
>> anything, easier ways to write compile-time UTs would be, to me,
>> preferable to a compile-time printf.
>>
>
> There is a nice assertCTFEable written by Kenji in Phobos. I think it's our private magic for now, but I see no reason not to expose it somewhere.
>
>> Atila

It helps; you won't lose time looking at your code and wondering. I thought of the __ctfe problem though: that would mean different code paths, and what I said wouldn't be valid anymore.
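
For the record, here is a minimal sketch of that workflow. The assertCTFEable helper below is only my guess at the shape of the internal Phobos helper mentioned above (the real one is package-private), and makeGetter is a hypothetical function under test:

// run the same block once through CTFE and once at run time
void assertCTFEable(alias dg)()
{
    static assert({ dg(); return true; }());
    dg();
}

// the function under test: plain, CTFE-able code generation
string makeGetter(string field)
{
    return "int get_" ~ field ~ "() { return " ~ field ~ "; }";
}

unittest
{
    // debug it at run time first, then keep verifying it under CTFE too
    assertCTFEable!({
        assert(makeGetter("x") == "int get_x() { return x; }");
    })();
}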

Atila
June 14, 2014
On Thursday, 12 June 2014 at 16:42:38 UTC, Dmitry Olshansky wrote:
> It's always nice to ask something on the D NG - so many good answers that I can hardly choose whom to reply to ;) So this is kind of a broadcast.
>
> Yes, the answer seems spot on - reflection! But allow me to retort.
>
> I'm not talking about a completely stand-alone generator. The generator tool could just as well be written in D, using the same exact sources as your D program does, including the static introspection and type-awareness. Then the generator itself is a library + "an invocation script" in D.
>
> The question is specifically about CTFE in this scenario, including not only the obvious design shortcomings, but the fundamental ones of compilation inside of compilation. Unlike proper compilation it has nothing persistent to back it up. It feels backwards - a bit like C++ TMP but, of course, much, much better.
>
>> 1)
>>
>> Reflection. It is less of an issue for pure DSL solutions because those
>> don't provide any good reflection capabilities anyway, but other code
>> generation approaches have very similar problems.
>>
>> By doing all code generation in a separate build step you potentially lose
>> many of the guarantees of keeping various parts of your application in sync.
>>
>
> Use the same sources for the generator. In essence everything is the same, just relying on separate runs and linkage instead of mixins. The necessary "hooks" to link to later could indeed be generated with a tiny bit of CTFE.
>
> Yes, deeply embedded stuff might not be that easy. The scope and the damage are smaller, though.
>
>> 2)
>>
>> Moving forward. You use the traditional reasoning of a DSL generally being
>> something rare and normally stable. This fits most common DSL usage, but
>> the tight in-language integration D makes possible brings new opportunities
>> for using DSLs and code generation casually all over your program.
>>
>
> Well, I'm biased by heavy-handed ones. Say I have a (no longer) secret plan of doing a next-gen parser generator in D. Needless to say, swaths of non-trivial code generation. I'm all for embedding nicely, but I see very little _practical_ gain in CTFE+mixin here EVEN if CTFE didn't suck. See the point above about using the same metadata and types as the user application would.

Consider something like the REST API generator I described during DConf. Different code is generated in different contexts from the same declarative description - both for server and client. Right now the simple fact that you import the very same module from both gives a solid 100% guarantee that the API usage between those two programs stays in sync.
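
As a toy sketch of that principle (hypothetical names, nowhere near the real generator; the fixed signature is hardcoded purely for brevity):

// the single declarative description imported by both programs
interface Api
{
    int add(int a, int b);
}

// one CTFE generator, different method bodies per context
string implement(I)(string bodyCode)
{
    string code;
    foreach (m; __traits(allMembers, I))
        code ~= "int " ~ m ~ "(int a, int b) { " ~ bodyCode ~ " }\n";
    return code;
}

// the server gets the real logic...
class Server : Api { mixin(implement!Api(q{ return a + b; })); }

// ...while the client would serialize the call instead
class Client : Api { mixin(implement!Api(q{ assert(0, "send over the wire"); })); }

Renaming Api.add on either side now breaks both builds at once - that is exactly the in-sync guarantee.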

In your proposed scenario there will be two different generated files imported by server and client respectively. A tiny typo in writing your build script will result in a hard-to-detect run-time bug while the code itself still happily compiles.

You may keep the convenience, but losing the guarantees hurts a lot. To be able to verify the static correctness of your program / group of programs, the type system needs to be aware of how the generated code relates to the original source.

Also, this approach does not scale. I can totally imagine you doing it for two or three DSLs in a single program, probably even a dozen. But something like 100+? A huge mess to maintain. According to my experience, all build systems are incredibly fragile beasts; trusting them with something that impacts program correctness and won't be detected at compile time is just too dangerous.

>> I totally expect programming culture to evolve to the point where
>> something like 90% of all application code is generated in a typical
>> project. D has a good base for promoting such a paradigm switch, and
>> reducing any unnecessary mental context switches is very important here.
>>
>> This was pretty much the point I was trying to make with my DConf talk
>> (and have probably failed :) )
>
> I liked the talk, but you know... by the 4th or 5th talk involving CTFE/mixin I think I might have been distracted :)
>
> More specifically, this bright future of 90%+ concise DSL-driven programs is undermined by a simple truth - no amount of improvement in CTFE would make generators run faster than an optimized standalone tool invocation. The tool (a library written in D) may read D metadata just fine.
>
> I heard D build times are an important part of its adoption, so...

Adoption - yes. Production usage - less so (though still important). Difference between 1 second and 5 seconds is very important. Between 10 seconds and 1 minute - not so much.

JIT will probably be slower than stand-alone generators, but not that much slower.

> It might solve most of the _current_ problems, but I foresee fundamental issues with "no global state" in CTFE that, say, 10 years from now will look a lot like `#include` in C++.

I hope that 10 years from now we will consider having global state in RTFE a stone-age relic :P

> A major one is that there is no way for the compiler not to recompile generated code, as it has no knowledge of how it might have changed from the previous run.

Why can't we merge basic build system functionality akin to rdmd into the compiler itself? It makes perfect sense to me, as the build process can benefit a lot from being semantically aware.
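
(rdmd already does the dependency-tracking half from the outside: as far as I know it queries the compiler for the import graph and recompiles only when some dependency actually changed, e.g.

rdmd --build-only src/main.d

so the missing parts are mostly the caching and JIT-ing inside the compiler itself.)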
June 14, 2014
On 6/14/14, 8:05 AM, Dicebot wrote:
> Adoption - yes. Production usage - less so (though still important).
> Difference between 1 second and 5 seconds is very important. Between 10
> seconds and 1 minute - not so much.

Wait, what? -- Andrei
June 14, 2014
On Saturday, 14 June 2014 at 15:25:11 UTC, Andrei Alexandrescu wrote:
> On 6/14/14, 8:05 AM, Dicebot wrote:
>> Adoption - yes. Production usage - less so (though still important).
>> Difference between 1 second and 5 seconds is very important. Between 10
>> seconds and 1 minute - not so much.
>
> Wait, what? -- Andrei

If the build time becomes long enough that it forces you to switch mental context, it matters less how long it takes - you are much more likely to do something else and return to it later. Of course it can also get to the famous C++ hours of build time, which is the next level of inconvenience :)

But a reasonably big and complicated project won't build in 5 seconds anyway (even with a perfect compiler), so eventually pure build time becomes less of a selling point. Still important, but not _that_ important.
June 14, 2014
14-Jun-2014 19:05, Dicebot wrote:
> On Thursday, 12 June 2014 at 16:42:38 UTC, Dmitry Olshansky wrote:
[snip]
>> Well, I'm biased by heavy-handed ones. Say I have a (no longer) secret
>> plan of doing a next-gen parser generator in D. Needless to say, swaths
>> of non-trivial code generation. I'm all for embedding nicely, but I see
>> very little _practical_ gain in CTFE+mixin here EVEN if CTFE didn't
>> suck. See the point above about using the same metadata and types as
>> the user application would.
>
> Consider something like the REST API generator I described during
> DConf. Different code is generated in different contexts from the same
> declarative description - both for server and client. Right now the simple
> fact that you import the very same module from both gives a solid 100%
> guarantee that the API usage between those two programs stays in sync.

But let's face it - it's a one-time job to get it right in your favorite build tool. Then you have a fast and cached (re)build. Comparatively, the costs of CTFE generation are paid in full during _each_ build.

> In your proposed scenario there will be two different generated files
> imported by server and client respectively. A tiny typo in writing your
> build script will result in a hard-to-detect run-time bug while the code
> itself still happily compiles.

Or a link error, if we go a hybrid path where the imported module emits declarations/hooks via CTFE, to be linked against the generated code proper. This is something I'm thinking could be a practical solution.

I.e. currently, to get around wasting cycles again and again:

module a;
import std.regex;

bool verify(string s)
{
    // the compile-time engine is regenerated on every build of this module
    static re = ctRegex!"....";
    return !match(s, re).empty;
}

// ---
module b;
import a;

void foo()
{
    // ...
    verify("blah");
    // ...
}

vs the would-be hybrid approach:

module gen_re;

void main() // or wrap it in a tiny template mixin
{
    generateCtRegex(
        // all patterns
    );
}

// ---
module b;
import std.regex;
// notice: no import of a

void foo()
{
    // ...
    static re = ctRegex!(...); // resolved against the pre-generated engine
    // ...
}

Using ctRegex as usual in b, any miss of the compiled cache would lead to a link error.

In fact it might be the best of both worlds if there were a switch to try full CTFE vs the link-time external option.

>
> You may keep the convenience, but losing the guarantees hurts a lot. To be
> able to verify the static correctness of your program / group of programs,
> the type system needs to be aware of how the generated code relates to the original source.

The build system does it. We have this problem with all external deps anyway (i.e. who verifies that the right version of libXYZ is linked and not some other one?)

> Also, this approach does not scale. I can totally imagine you doing it
> for two or three DSLs in a single program, probably even a dozen. But
> something like 100+?

Not everything is suitable, of course. Some stuff is good only inline and on the spot. But it does use the same sources; in the case of REST generators it may look a lot like this:

import everything;

void main()
{
    // pseudocode: walk the program's modules' meta-data
    foreach (m; allTheModules)
    {
        // ... generate client code from the meta-data ...
    }
}

Waiting for 100+ DSLs to be compiled by a JIT interpreter that can't optimize a thing (pretty much by definition - or do we use separate flags for that?) is not going to be fun either.

> A huge mess to maintain. According to my experience,
> all build systems are incredibly fragile beasts; trusting them with
> something that impacts program correctness and won't be detected at
> compile time is just too dangerous.

Could be, but we have dub, which should be simple and nice.
I had a very positive experience with scons and half-generated sources.

>>
>> I heard D build times are an important part of its adoption, so...
>
> Adoption - yes. Production usage - less so (though still important).
> Difference between 1 second and 5 seconds is very important. Between 10
> seconds and 1 minute - not so much.
>
> JIT will probably be slower than stand-alone generators, but not that
> much slower.
>
>> It might solve most of the _current_ problems, but I foresee fundamental
>> issues with "no global state" in CTFE that, say, 10 years from now will
>> look a lot like `#include` in C++.
>
> I hope that 10 years from now we will consider having global state in
> RTFE a stone-age relic :P

Well, no amount of purity dismisses the point that a cache is a cache. And when I say "global" in D, I mean thread/fiber-local.

>
>> A major one is that there is no way for the compiler not to recompile
>> generated code, as it has no knowledge of how it might have changed from
>> the previous run.
>
> Why can't we merge basic build system functionality akin to rdmd into
> the compiler itself? It makes perfect sense to me, as the build process
> can benefit a lot from being semantically aware.

I wouldn't cross my fingers, but yes - ideally it would need to have the powers of a build system, making it that much more complicated. Then it could cache results, including template instantiations, across modules and across separate invocations of the tool. It's a distant dream though.

Currently available caching at the level of object files is very coarse-grained and not really helpful to the problem at hand.

-- 
Dmitry Olshansky
June 15, 2014
On Saturday, 14 June 2014 at 16:34:35 UTC, Dmitry Olshansky wrote:
>> Consider something like the REST API generator I described during
>> DConf. Different code is generated in different contexts from the same
>> declarative description - both for server and client. Right now the simple
>> fact that you import the very same module from both gives a solid 100%
>> guarantee that the API usage between those two programs stays in sync.
>
> But let's face it - it's a one-time job to get it right in your favorite build tool. Then you have a fast and cached (re)build. Comparatively, the costs of CTFE generation are paid in full during _each_ build.

There is no such thing as a one-time job in programming, unless you work alone and abandon any long-term maintenance. As time goes on, any mistake that can possibly happen will inevitably happen.

>> In your proposed scenario there will be two different generated files
>> imported by server and client respectively. A tiny typo in writing your
>> build script will result in a hard-to-detect run-time bug while the code
>> itself still happily compiles.
>
> Or a link error, if we go a hybrid path where the imported module emits declarations/hooks via CTFE, to be linked against the generated code proper. This is something I'm thinking could be a practical solution.
>
> <snip>

What is the benefit of this approach over simply keeping all ctRegex bodies in a separate package, compiling it as a static library, and referring to it from the actual app by its own unique symbol? This is something that does not need any changes in the compiler or Phobos; it's just a matter of project layout.

It does not work for more complicated cases where you actually need access to the generated sources (generating templates, for example).

>> You may keep the convenience, but losing the guarantees hurts a lot. To be able
>> to verify the static correctness of your program / group of programs, the type
>> system needs to be aware of how the generated code relates to the original source.
>
> The build system does it. We have this problem with all external deps anyway (i.e. who verifies that the right version of libXYZ is linked and not some other one?)

It is somewhat worse because you don't routinely change external libraries, as opposed to local sources.

>> A huge mess to maintain. According to my experience,
>> all build systems are incredibly fragile beasts; trusting them with
>> something that impacts program correctness and won't be detected at
>> compile time is just too dangerous.
>
> Could be, but we have dub which should be simple and nice.
> I had very positive experience with scons and half-generated sources.

dub is terrible at defining any complicated build model. Pretty much anything that is not a single-step compile-them-all approach can only be done by calling an external shell script. If using external generators is necessary, I will take make over anything else :)

> <snip>

tl;dr: I believe that we should improve compiler technology to achieve the same results, instead of promoting temporary hacks as the true way to do things. Relying on the build system is likely the most practical solution today, but it is not a solution I am satisfied with, and hardly one I can accept as an accomplished target.

An imaginary compiler that runs continuously as a daemon/service, is capable of JIT-ing, and provides basic dependency tracking as part of the compilation step should behave as well as any external solution, with much better correctness guarantees and overall user experience out of the box.
June 15, 2014
15-Jun-2014 20:21, Dicebot wrote:
> On Saturday, 14 June 2014 at 16:34:35 UTC, Dmitry Olshansky wrote:
>> But let's face it - it's a one-time job to get it right in your
>> favorite build tool. Then you have a fast and cached (re)build.
>> Comparatively, the costs of CTFE generation are paid in full during
>> _each_ build.
>
> There is no such thing as a one-time job in programming, unless you work
> alone and abandon any long-term maintenance. As time goes on, any mistake
> that can possibly happen will inevitably happen.

The frequency of such events is orders of magnitude smaller. Let's not take the argument to the extreme, as then doing anything is futile due to the potential for mistakes it introduces sooner or later.

>>> In your proposed scenario there will be two different generated files
>>> imported by server and client respectively. A tiny typo in writing your
>>> build script will result in a hard-to-detect run-time bug while the
>>> code itself still happily compiles.
>>
>> Or a link error, if we go a hybrid path where the imported module emits
>> declarations/hooks via CTFE, to be linked against the generated code
>> proper. This is something I'm thinking could be a practical solution.
>>
>> <snip>
>
> What is the benefit of this approach over simply keeping all ctRegex
> bodies in a separate package, compiling it as a static library, and
> referring to it from the actual app by its own unique symbol? This is
> something that does not need any changes in the compiler or Phobos;
> it's just a matter of project layout.

Automation. Dumping the body of a ctRegex is manual work after all, including putting it under the right symbol. In the proposed scheme it's just a matter of copy-pasting a pattern after the initial setup has been done.

> It does not work for more complicated cases where you actually need
> access to the generated sources (generating templates, for example).

Indeed, this is a limitation, and there the import of the generated source would be required.

>>> You may keep the convenience, but losing the guarantees hurts a lot. To be able
>>> to verify the static correctness of your program / group of programs, the type
>>> system needs to be aware of how the generated code relates to the original source.
>>
>> The build system does it. We have this problem with all external deps
>> anyway (i.e. who verifies that the right version of libXYZ is linked and
>> not some other one?)
>
> It is somewhat worse because you don't routinely change external
> libraries, as opposed to local sources.
>

But surely we have libraries that are built as separate projects and are "external" dependencies, right? There is nothing new here, except that "d-->obj-->lib file" is changed to "generator-->generated D file-->obj file".

>>> A huge mess to maintain. According to my experience,
>>> all build systems are incredibly fragile beasts; trusting them with
>>> something that impacts program correctness and won't be detected at
>>> compile time is just too dangerous.
>>
>> Could be, but we have dub, which should be simple and nice.
>> I had a very positive experience with scons and half-generated sources.
>
> dub is terrible at defining any complicated build model. Pretty much
> anything that is not a single-step compile-them-all approach can only be
> done by calling an external shell script.

I'm not going to like dub then ;)

> If using external generators is
> necessary, I will take make over anything else :)

Then I understand your point about inevitable mistakes, it's all in the tool.

>> <snip>
>
> tl;dr: I believe that we should improve compiler technology to achieve
> the same results, instead of promoting temporary hacks as the true way to
> do things. Relying on the build system is likely the most practical
> solution today, but it is not a solution I am satisfied with, and hardly
> one I can accept as an accomplished target.
> An imaginary compiler that runs continuously as a daemon/service, is
> capable of JIT-ing, and provides basic dependency tracking as part of the
> compilation step should behave as well as any external solution, with
> much better correctness guarantees and overall user experience out of the box.

What I want to point out is that we should not mistake the goal for the means to an end. Whatever we call it, CTFE code generation is just a means to an end, with serious limitations (especially as it stands today, in the real world).

Seamless integration is not about packing everything into a single compiler invocation:

dmd src/*.d

Generation is generation; as long as it's fast and automatic, it solves the problem(s) metaprogramming was established to solve.

For instance, if the D compiler allowed external tools as plugins (just an example to show the means-vs-ends distinction) with some form of the following construct:

mixin(call_external_tool("args", 3, 14, 15, .92));

it would make any generation totally practical *today*. This was proposed before, and dismissed out of fear of security risks, without ever identifying a proper set of restrictions. After all, we already live with the potential security risk of textual mixins, no problem.

Let's focus on the fact that this has the benefits of:
- sane debugging of the plug-in (it's just a program with the usual symbols)
- speed, as the tool could be built with full optimization flags or run as a service
- trivial caching of things across builds, even per AST node
- ease of implementation (as in: doable by the next release)
- allowing things inexpressible in CTFE, like calling into external systems and vendor-specific tools

That would, for instance, give us the ability to have practical, transparent C-->D header inclusion, as in:

extern mixin(htod("some_header.h"));

How long till the C preprocessor works in CTFE? How long till it's practical to do:

mixin(htod(import("some_header.h")));

and have it done optimally fast at CTFE?

My answer is: no amount of JIT-ing CTFE and compiler architecture improvements in the foreseeable future will make it better than standalone tool(s), due to the mentioned _fundamental_ limitations.

There are real practical boundaries on where an internal interpreter can stay competitive.


-- 
Dmitry Olshansky
June 16, 2014
On Sunday, 15 June 2014 at 21:38:18 UTC, Dmitry Olshansky wrote:
> 15-Jun-2014 20:21, Dicebot wrote:
>> On Saturday, 14 June 2014 at 16:34:35 UTC, Dmitry Olshansky wrote:
>>> But let's face it - it's a one-time job to get it right in your
>>> favorite build tool. Then you have a fast and cached (re)build.
>>> Comparatively, the costs of CTFE generation are paid in full during
>>> _each_ build.
>>
>> There is no such thing as a one-time job in programming, unless you work
>> alone and abandon any long-term maintenance. As time goes on, any mistake
>> that can possibly happen will inevitably happen.
>
> The frequency of such events is orders of magnitude smaller. Let's not take the argument to the extreme, as then doing anything is futile due to the potential for mistakes it introduces sooner or later.

It is more likely to happen if you change your build scripts more often. And this is exactly what you propose.

I am not going to say it is impractical, just mentioning flaws that make me seek a better solution.

> Automation. Dumping the body of a ctRegex is manual work after all, including putting it under the right symbol. In the proposed scheme it's just a matter of copy-pasting a pattern after the initial setup has been done.

I think defining regexes in a separate module is even less effort than adding a few lines to the build script ;)
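
E.g. a minimal sketch of such a layout (module name and pattern are hypothetical, purely for illustration):

module patterns;
import std.regex;

// the one rarely-rebuilt module where all compiled patterns live
enum identifierRe = ctRegex!r"[a-zA-Z_]\w*";

// elsewhere:
//   import patterns;
//   auto hit = match(input, identifierRe);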

>> It is somewhat worse because you don't routinely change external
>> libraries, as opposed to local sources.
>>
>
> But surely we have libraries that are built as separate projects and are "external" dependencies, right? There is nothing new here, except that "d-->obj-->lib file" is changed to "generator-->generated D file-->obj file".

Ok, I am probably convinced on this one. Incidentally, I always prefer full-source builds as opposed to static library separation inside the application itself. When there is enough RAM for dmd, of course :)

>>>> A huge mess to maintain. According to my experience, ...
>> dub is terrible at defining any complicated build model. Pretty much
>> anything that is not a single-step compile-them-all approach can only be
>> done by calling an external shell script.
>
> I'm not going to like dub then ;)

It is primarily a source dependency manager, not a build tool. I remember Sonke mentioning that it is intentionally kept simplistic, to guarantee that no platform-unique features are ever needed.

For anything complicated I'd probably wrap the dub call inside a makefile to prepare all the necessary extra files.

>> If using external generators is
>> necessary, I will take make over anything else :)
>
> Then I understand your point about inevitable mistakes, it's all in the tool.

make is actually pretty good if you don't care about platforms other than Linux. Well, apart from the stupid whitespace sensitivity. It is incredibly good at defining build systems with chained dependencies.

> What I want to point out is that we should not mistake the goal for the means to an end. Whatever we call it, CTFE code generation is just a means to an end, with serious limitations (especially as it stands today, in the real world).

I agree. What I disagree about is the definition of the goal. It is not just "generating code", it is "generating code in a manner understood by the compiler".

> For instance, if the D compiler allowed external tools as plugins (just an example to show the means-vs-ends distinction) with some form of the following construct:
>
> mixin(call_external_tool("args", 3, 14, 15, .92));
>
> it would make any generation totally practical *today*.

But this is exactly the case where language integration gives you nothing over a build-system solution :) If the compiler itself is not aware of how the code gets generated from the arguments, there is no real advantage in putting the tool invocation inline.

> How long till the C preprocessor works in CTFE? How long till it's practical to do:
>
> mixin(htod(import("some_header.h")));
>
> and have it done optimally fast at CTFE?

Never, but it is not really about being fast or convenient. For htod you don't want just C grammar / preprocessor support; you want it as good as the one in real C compilers.

> My answer is: no amount of JIT-ing CTFE and compiler architecture improvements in the foreseeable future will make it better than standalone tool(s), due to the mentioned _fundamental_ limitations.
>
> There are real practical boundaries on where an internal interpreter can stay competitive.

I don't see any fundamental practical boundaries - quality-of-implementation ones, sure. Quite the contrary: I totally see how a better compiler could easily outperform any external tool for most build tasks, despite somewhat worse JIT codegen - it has the huge advantage of being able to work on semantic language entities and not just files. That allows much smarter caching and dependency tracking, something external tools will never be able to achieve.