December 17, 2012 Re: Compilation strategy
Posted in reply to Michel Fortin
On Monday, 17 December 2012 at 04:49:46 UTC, Michel Fortin wrote:
> On 2012-12-17 03:18:45 +0000, Walter Bright <newshound2@digitalmars.com> said:
>
>> Whether the file format is text or binary does not make any fundamental difference.
>
> I too expect the difference in performance to be negligible in binary form if you maintain the same structure. But if you're translating it to another format you can improve the structure to make it faster.
>
> If the file had a table of contents (TOC) of publicly visible symbols right at the start, you could read that table of content alone to fill symbol tables while lazy-loading symbol definitions from the file only when needed.
>
> Often, most of the file beyond the TOC wouldn't be needed at all. Having to parse and construct the syntax tree for the whole file incurs many memory allocations in the compiler, which you could avoid if the file was structured for lazy-loading. With a TOC you have very little to read from disk and very little to allocate in memory and that'll make compilation faster.
>
> More importantly, if you use only fully-qualified symbol names in the translated form, then you'll be able to load lazily privately imported modules because they'll only be needed when you need the actual definition of a symbol. (Template instantiation might require loading privately imported modules too.)
>
> And then you could structure it so a whole library could fit in one file, putting all the TOCs at the start of the same file so it loads from disk in a single read operation (or a couple of *sequential* reads).
>
> I'm not sure of the speedup all this would provide, but I'd hazard a guess that it wouldn't be so negligible when compiling a large project incrementally.
>
> Implementing any of this in the current front end would be a *lot* of work however.
Precisely. That is the correct solution, and it is also how [turbo?] pascal units (==libs) were implemented *decades ago*.
I'd like to also emphasize the importance of using a *single* encapsulated file. This prevents the synchronization hazards that D inherited from the broken C/C++ model.
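The TOC scheme Michel describes is easy to sketch. A minimal Python model (the binary layout, field widths, and symbol names below are invented for illustration, not any real D compiler format): the TOC of public symbols sits at the front, so a reader can fill symbol tables from one small read and fetch definitions lazily.

```python
import io
import struct

# Hypothetical single-file module format: a TOC of public symbols up
# front, definitions after it.  Layout per TOC entry:
#   [u32 name_len][name bytes][u64 offset][u64 size]

def write_module(defs):
    """Pack {symbol_name: definition_bytes} with the TOC first."""
    names = list(defs)
    toc_size = 4 + sum(4 + len(n.encode()) + 16 for n in names)
    out = io.BytesIO()
    out.write(struct.pack("<I", len(names)))
    offset = toc_size
    for n in names:
        enc = n.encode()
        out.write(struct.pack("<I", len(enc)) + enc)
        out.write(struct.pack("<QQ", offset, len(defs[n])))
        offset += len(defs[n])
    for n in names:
        out.write(defs[n])
    return out.getvalue()

def read_toc(data):
    """Read only the TOC (symbol -> (offset, size)); no definitions parsed."""
    (count,) = struct.unpack_from("<I", data)
    pos, toc = 4, {}
    for _ in range(count):
        (nlen,) = struct.unpack_from("<I", data, pos)
        name = data[pos + 4:pos + 4 + nlen].decode()
        toc[name] = struct.unpack_from("<QQ", data, pos + 4 + nlen)
        pos += 4 + nlen + 16
    return toc

def load_symbol(data, toc, name):
    """Lazily pull one definition only when it is actually needed."""
    off, size = toc[name]
    return data[off:off + size]
```

A front end would call read_toc once per imported module and load_symbol only for definitions it really resolves, which is the allocation saving Michel points at.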
December 17, 2012 Re: Compilation strategy
Posted in reply to foobar
On 17.12.2012 21:09, foobar wrote:
> On Monday, 17 December 2012 at 04:49:46 UTC, Michel Fortin wrote:
>> On 2012-12-17 03:18:45 +0000, Walter Bright
>> <newshound2@digitalmars.com> said:
>>
>>> Whether the file format is text or binary does not make any
>>> fundamental difference.
>>
>> I too expect the difference in performance to be negligible in binary
>> form if you maintain the same structure. But if you're translating it
>> to another format you can improve the structure to make it faster.
>>
>> If the file had a table of contents (TOC) of publicly visible symbols
>> right at the start, you could read that table of content alone to fill
>> symbol tables while lazy-loading symbol definitions from the file only
>> when needed.
>>
>> Often, most of the file beyond the TOC wouldn't be needed at all.
>> Having to parse and construct the syntax tree for the whole file
>> incurs many memory allocations in the compiler, which you could avoid
>> if the file was structured for lazy-loading. With a TOC you have very
>> little to read from disk and very little to allocate in memory and
>> that'll make compilation faster.
>>
>> More importantly, if you use only fully-qualified symbol names in the
>> translated form, then you'll be able to load lazily privately imported
>> modules because they'll only be needed when you need the actual
>> definition of a symbol. (Template instantiation might require loading
>> privately imported modules too.)
>>
>> And then you could structure it so a whole library could fit in one
>> file, putting all the TOCs at the start of the same file so it loads
>> from disk in a single read operation (or a couple of *sequential* reads).
>>
>> I'm not sure of the speedup all this would provide, but I'd hazard a
>> guess that it wouldn't be so negligible when compiling a large project
>> incrementally.
>>
>> Implementing any of this in the current front end would be a *lot* of
>> work however.
>
> Precisely. That is the correct solution and is also how [turbo?] pascal
> units (==libs) were implemented *decades ago*.
>
> I'd like to also emphasize the importance of using a *single*
> encapsulated file. This prevents synchronization hazards that D
> inherited from the broken c/c++ model.
I really miss it, but at least it has been picked up by Go as well.
I still find it strange that many C and C++ developers are unaware that we have had modules since the early 80's.
--
Paulo
December 17, 2012 Re: Compilation strategy
Posted in reply to Paulo Pinto
On 12/18/2012 12:34 AM, Paulo Pinto wrote:
> On 17.12.2012 21:09, foobar wrote:
>> [...]
>> Precisely. That is the correct solution and is also how [turbo?] pascal
>> units (==libs) were implemented *decades ago*.
>>
>> I'd like to also emphasize the importance of using a *single*
>> encapsulated file. This prevents synchronization hazards that D
>> inherited from the broken c/c++ model.

I really loved the way Turbo Pascal units were made. I wish D would go the same route. Object files would then be looked at as a minimal and stupid variation of a module, where symbols are identified by mangling (not plain metadata, as (would be) in a module) and no source for templates is emitted.

AFAIK Delphi is able to produce both DCU and OBJ files (and link with them). Dunno what it does with generics (and which kind these are) and how.

> I really miss it, but at least it has been picked up by Go as well.
>
> Still find strange that many C and C++ developers are unaware that we
> have modules since the early 80's.

+1

I suspect it's one of the prime examples where the UNIX philosophy of combining a bunch of simple (~dumb) programs in place of one more complex program was taken *far* beyond reasonable lengths. Having a pipeline, preprocessor -> compiler -> (still?) assembler -> linker, where every program tries hard to know nothing about the previous ones (and to be as simple as it possibly can), is bound to get inadequate results on many fronts:
- efficiency & scalability
- cross-border error reporting and detection (linker errors? errors for expanded macro magic?)
- cross-file manipulation (e.g. optimization; see _how_ LTO is done in GCC)
- multiple problems from the loss of information across the pipeline*

*Semantic info on the interdependency of symbols in a source file is destroyed right before the linker, and thus each .obj file is included as a whole or not at all. All C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication.

While simplicity (and correspondingly in-memory size) of programs was king in the 70's, that is long past due. Nowadays I think it's all about getting the highest throughput and more powerful features.

--
Dmitry Olshansky
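The whole-or-nothing inclusion Dmitry footnotes can be shown with a toy linker model in Python (the symbol names, the unit split, and the call graph are all invented for illustration): at object-file granularity, pulling in one needed symbol drags in its unused siblings, which is exactly why C run-times resort to one function per file.

```python
# Toy linker: a symbol lives in a compilation unit, and inclusion is
# either whole-unit (classic .obj behavior) or per-symbol (function-level
# sections, or the one-function-per-file trick).

calls = {
    "main":   ["printf"],
    "printf": [],
    "scanf":  [],            # sits next to printf, but is never called
}
units = {"app.o": ["main"], "stdio.o": ["printf", "scanf"]}

def link(roots, granularity):
    """Return the set of symbols carried into the final binary."""
    sym_to_unit = {s: u for u, syms in units.items() for s in syms}
    kept, todo = set(), list(roots)
    while todo:
        sym = todo.pop()
        if sym in kept:
            continue
        # Whole-object inclusion drags in every sibling symbol too.
        group = units[sym_to_unit[sym]] if granularity == "unit" else [sym]
        for s in group:
            if s not in kept:
                kept.add(s)
                todo.extend(calls[s])
    return kept
```

With unit granularity, linking "main" also carries the never-called "scanf"; with per-symbol granularity it does not.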
December 17, 2012 Re: Compilation strategy
Posted in reply to Dmitry Olshansky
On 12/17/2012 2:08 PM, Dmitry Olshansky wrote:
> I really loved the way Turbo Pascal units were made. I wish D would go the
> same route. Object files would then be looked at as a minimal and stupid
> variation of a module, where symbols are identified by mangling (not plain
> metadata, as (would be) in a module) and no source for templates is
> emitted.

I'll bite. How is this superior to D's system? I have never used TP.

> *Semantic info on the interdependency of symbols in a source file is
> destroyed right before the linker, and thus each .obj file is included as
> a whole or not at all. All C run-times I've seen _sidestep_ this by
> writing each function in its own file(!). Even this alone should have been
> a clear indication.

This is done using COMDATs in C++ and D today.
December 17, 2012 Re: Compilation strategy
Posted in reply to Walter Bright
On 17.12.2012 23:23, Walter Bright wrote:
> On 12/17/2012 2:08 PM, Dmitry Olshansky wrote:
>> I really loved the way Turbo Pascal units were made. I wish D would go the same
>> route. Object files would then be looked at as minimal and stupid
>> variation of
>> module where symbols are identified by mangling (not plain meta data
>> as (would
>> be) in module) and no source for templates is emitted.
>> +1
>
> I'll bite. How is this superior to D's system? I have never used TP.
>
Just explaining the TP way, not doing comparisons.
Each unit (module) is a single file that contains all declarations; there is a separation between the public part and the implementation part.
Multiple units can be mutually (circularly) dependent, as long as they depend on each other only in their implementation parts.
The compiler and IDE are able to extract all the necessary information from a unit file, so a single file is all that is required to make the compiler happy, avoiding synchronization errors.
As with any language using modules, the compiler is pretty fast, and it uses a built-in linker optimized for the information stored in the units.
Besides the IDE, there are command line utilities that dump the public information of a given unit, as a way for programmers to read the available exported API.
Basically not much different from what Java and .NET do, but with a language that by default uses native compilation tooling.
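The circular-dependency rule above can be modeled in a few lines of Python (unit names and the tuple layout are made up for illustration): interface-level imports must form a DAG, while cycles confined to the implementation parts are legal, as in Turbo Pascal.

```python
def interface_cycle(units):
    """True if the interface ('public part') imports contain a cycle.

    units maps name -> (interface_uses, implementation_uses).  Only
    interface edges constrain compilation order; implementation edges
    may be circular.
    """
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {u: WHITE for u in units}

    def visit(u):
        color[u] = GRAY
        for dep in units[u][0]:               # interface edges only
            if color[dep] == GRAY:
                return True                   # back edge -> cycle
            if color[dep] == WHITE and visit(dep):
                return True
        color[u] = BLACK
        return False

    return any(color[u] == WHITE and visit(u) for u in units)

# Mutual dependency only in the implementation parts: allowed.
ok = {"A": ([], ["B"]), "B": ([], ["A"])}
# Mutual dependency in the public parts: rejected.
bad = {"A": (["B"], []), "B": (["A"], [])}
```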
--
Paulo
December 18, 2012 Re: Compilation strategy
Posted in reply to Walter Bright
On 12/18/2012 2:23 AM, Walter Bright wrote:
> On 12/17/2012 2:08 PM, Dmitry Olshansky wrote:
>> I really loved the way Turbo Pascal units were made. I wish D would go the
>> same route. Object files would then be looked at as a minimal and stupid
>> variation of a module, where symbols are identified by mangling (not plain
>> metadata, as (would be) in a module) and no source for templates is
>> emitted.
>
> I'll bite. How is this superior to D's system? I have never used TP.

One superiority is having a compiled module with its public interface (a la .di, but in some binary format) in one file. Along with the public interface it retains dependency information. Basically, things that describe one entity should not be separated. I can say that the advantage of "grab this single file and you are good to go" should not be underestimated. Thus there is no mess with header files that are out of date and/or object files that fail to link because of that.

Now, back then there were no templates nor CTFE, so the module structure was simple. There were no packages either (they landed in Delphi). I'd expect D to have a format built around modules and packages of these; pre-compiled libraries would then commonly be distributed as packages.

The upside of having our own special format is being able to tailor it to our needs, e.g. storing type info & metadata plainly (no mangle/demangle), having separately compiled (and checked) pure functions, better cross-symbol dependency tracking, etc. To link with C we could still compile all of the D modules into one huge object file (split into a monstrous number of sections).

>> *Semantic info on the interdependency of symbols in a source file is
>> destroyed right before the linker, and thus each .obj file is included as
>> a whole or not at all. All C run-times I've seen _sidestep_ this by
>> writing each function in its own file(!). Even this alone should have
>> been a clear indication.
>
> This is done using COMDATs in C++ and D today.
Well, that's terse. Either way, it looks like a workaround for templates that dump identical code into object files during separate compilation, so that the linker can auto-merge the copies. More than that, the end result is the same: to avoid carrying junk into an app, you (or the compiler) still have to put each function in its own section.

Doing separate compilation I always (unless doing LTO or template-heavy code) see either the whole object included or nothing (D included). Most likely the compiler will do it for you only with a special switch. This begs another question: why not eliminate the junk by default?

P.S. Looking at M$: http://msdn.microsoft.com/en-us/library/xsa71f43.aspx it needs 2 switches, 1 for the linker and 1 for the compiler. Hilarious.

--
Dmitry Olshansky
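The auto-merge step itself is trivial to model in Python (the mangled names below are invented, not real D manglings): each object file contributes COMDAT sections keyed by symbol, and the linker keeps one copy per key, roughly the "pick any" selection rule.

```python
def merge_comdats(objects):
    """Keep a single definition per COMDAT symbol across object files.

    objects is a list of {mangled_name: code_bytes} dicts; the same
    template instantiation may have been emitted into several objects.
    """
    final = {}
    for obj in objects:
        for name, code in obj.items():
            # First definition wins, duplicates are silently dropped.
            final.setdefault(name, code)
    return final

# Two objects that both instantiated the same (made-up) template symbol.
obj1 = {"_D3fooFiZi": b"\x55\xc3", "_D4algo__T3maxTiZ": b"\x89\xc8\xc3"}
obj2 = {"_D3barFiZi": b"\x31\xc0\xc3", "_D4algo__T3maxTiZ": b"\x89\xc8\xc3"}
```

The duplicated instantiation survives exactly once in the merged output, which is the deduplication COMDATs buy; it still says nothing about dropping sections that nobody references.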
December 18, 2012 Re: Compilation strategy
Posted in reply to Walter Bright
On Monday, 17 December 2012 at 22:24:00 UTC, Walter Bright wrote:
> On 12/17/2012 2:08 PM, Dmitry Olshansky wrote:
>> I really loved the way Turbo Pascal units were made. I wish D would go the same
>> route. Object files would then be looked at as minimal and stupid variation of
>> module where symbols are identified by mangling (not plain meta data as (would
>> be) in module) and no source for templates is emitted.
>> +1
>
> I'll bite. How is this superior to D's system? I have never used TP.
>
>
>> *Semantic info on interdependency of symbols in a source file is destroyed right
>> before the linker and thus each .obj file is included as a whole or not at all.
>> Thus all C run-times I've seen _sidestep_ this by writing each function in its
>> own file(!). Even this alone should have been a clear indication.
>
> This is done using COMDATs in C++ and D today.
Honest question: if D already has all the semantic info in COMDAT sections, why do we still require additional auxiliary files? Surely a single binary library (lib/so) should be enough to encapsulate a library without the need to re-parse the source files or additional header files?
You yourself seem to agree that a single zip file is superior to what we currently have, and as an aside the entire Java community agrees with us - Java Jar/War/etc. formats are all renamed zip archives.
Regarding the obfuscation and portability issues - the zip file can contain whatever we want. This means it should be possible to tailor the contents to support different use-cases:
* provide fat-libraries as in OSX - internally store multiple binaries for different architectures, those binary objects are very hard to decompile back to source code thus answering the obfuscation need.
* provide a byte-code solution to support the portability case, e.g. Java byte-code or Google's PNaCl solution, which relies on LLVM bitcode.
Also, different work-flows can be implemented - Java uses a JIT to gain efficiency, whereas .NET supports install-time AOT compilation, basically storing the native executable in a special cache.
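The zip-container idea is easy to prototype in Python (the internal member names, meta/interface.json and bin/<arch>, are purely hypothetical): per-architecture binaries cover the fat-library/obfuscation case, and a single metadata member holds the public API so a front end only reads that.

```python
import io
import json
import zipfile

def pack_library(binaries, interface):
    """Bundle per-architecture binaries plus the public API metadata.

    binaries: {arch: object_code_bytes}; interface: a JSON-able dict
    describing the exported symbols.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("meta/interface.json", json.dumps(interface))
        for arch, blob in binaries.items():
            z.writestr("bin/" + arch, blob)
    return buf.getvalue()

def read_interface(blob):
    """A compiler front end would read only this member, lazily."""
    with zipfile.ZipFile(io.BytesIO(blob)) as z:
        return json.loads(z.read("meta/interface.json"))
```

The same container could carry bytecode members instead of (or alongside) native ones without changing the reader.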
December 18, 2012 Re: Compilation strategy
Posted in reply to foobar
On Tuesday, 18 December 2012 at 11:43:18 UTC, foobar wrote:
> On Monday, 17 December 2012 at 22:24:00 UTC, Walter Bright wrote:
>> On 12/17/2012 2:08 PM, Dmitry Olshansky wrote:
>>> I really loved the way Turbo Pascal units were made. I wish D would go the same
>>> route. Object files would then be looked at as minimal and stupid variation of
>>> module where symbols are identified by mangling (not plain meta data as (would
>>> be) in module) and no source for templates is emitted.
>>> +1
>>
>> I'll bite. How is this superior to D's system? I have never used TP.
>>
>>
>>> *Semantic info on interdependency of symbols in a source file is destroyed right
>>> before the linker and thus each .obj file is included as a whole or not at all.
>>> Thus all C run-times I've seen _sidestep_ this by writing each function in its
>>> own file(!). Even this alone should have been a clear indication.
>>
>> This is done using COMDATs in C++ and D today.
>
> Honest question - If D already has all the semantic info in COMDAT sections, why do we still require additional auxiliary files? Surely, a single binary library (lib/so) should be enough to encapsulate a library without the need to re-parse the source files or additional header files?
>
> You yourself seem to agree that a single zip file is superior to what we currently have, and as an aside the entire Java community agrees with us - Java Jar/War/etc. formats are all renamed zip archives.
>
> Regarding the obfuscation and portability issues - the zip file can contain whatever we want. This means it should be possible to tailor the contents to support different use-cases:
> * provide fat-libraries as in OSX - internally store multiple binaries for different architectures, those binary objects are very hard to decompile back to source code thus answering the obfuscation need.
> * provide a byte-code solution to support the portability case, e.g. Java byte-code or Google's PNaCl solution, which relies on LLVM bitcode.
>
> Also, there are different work-flows that can be implemented - Java uses JIT to gain efficiency vs. .NET that supports install-time AOT compilation. It basically stores the native executable in a special cache.
In Windows 8 RT, .NET binaries are actually compiled to native code when uploaded to the Windows App Store.
December 18, 2012 Re: Compilation strategy
Posted in reply to Dmitry Olshansky
On 12/18/2012 1:33 AM, Dmitry Olshansky wrote:
> More than that, the end result is the same: to avoid carrying junk into an
> app you (or the compiler) still have to put each function in its own
> section.

That's what COMDATs are.

> Doing separate compilation I always (unless doing LTO or template-heavy
> code) see either the whole object included or nothing (D included). Most
> likely the compiler will do it for you only with a special switch.

dmd emits COMDATs for all global functions. You can see this by running dumpobj on the output.
December 18, 2012 Re: Compilation strategy
Posted in reply to foobar
On 12/18/2012 3:43 AM, foobar wrote:
> Honest question - If D already has all the semantic info in COMDAT
> sections,

It doesn't. COMDATs are object file sections. They do not contain type info, for example.

> * provide a byte-code solution to support the portability case, e.g. Java
> byte-code or Google's PNaCl solution that relies on LLVM bit-code.

There is no advantage to bytecodes. Putting them in a zip file does not make them produce better results.
Copyright © 1999-2021 by the D Language Foundation