December 17, 2012
On 12/17/2012 1:53 PM, Rob T wrote:
> I mentioned in a previous post that we should perhaps focus on making the .di
> concept more efficient rather than focus on obfuscation.

We're not going to do obfuscation, because as I explained such cannot work, and we shouldn't do a disservice to users by pretending it does. There are many ways that *do* work, such as PIMPL, which work today and should be used by any organization wishing to obfuscate their implementation.
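
For example (just a sketch with made-up names, not a recommendation of any particular layout), a pimpl-style D module ships only a hand-written .di interface with an opaque implementation pointer, while the real data layout and function bodies stay in the separately compiled widget.d:

   // widget.di -- the only file distributed; implementation details stay hidden
   module widget;

   struct WidgetImpl;              // opaque declaration: size and members are not revealed

   struct Widget
   {
       private WidgetImpl* impl;   // pimpl pointer is all that users ever see
       void draw();                // body lives only in the compiled widget.d
   }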


> Shorter file sizes are a potential use case, and you could even allow a
> distributor of byte code to optionally supply it in compressed form that is
> automatically uncompressed when compiling, although as a trade-off that would
> add a small compilation performance hit.

I suspect most file transport protocols already compress the data, so compressing it ourselves probably accomplishes nothing. There are also compressed filesystems, so storing files in a compressed manner likely accomplishes little.

I have toyed with the idea many times, however, of having dmd support zip files. Zip files can contain an arbitrary file hierarchy, with individual files in compressed, encrypted, or plaintext at the selection of the zip builder. An entire project, or library, or collection of source modules can be distributed as a zip file, and could be compiled with nothing more than:

   dmd foo.zip

or:

   dmd myfile ThirdPartyLib.zip

and have it work. The advantage here is simply that everything can be contained in one simple file.

The concept is simple. The files in the zip file simply replace the zip file in the command line. So, if foo.zip contains a.d, b/c.d, and d.obj, then:

   dmd xx foo.zip

is equivalent to:

   unzip foo
   dmd xx a.d b/c.d d.obj
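
As a rough sketch (not dmd's actual code; the helper name and the scratch directory are made up), the driver could expand each .zip argument into the member files it contains before doing anything else:

   import std.file : mkdirRecurse, read, write;
   import std.path : buildPath, dirName, extension;
   import std.zip : ZipArchive;

   // Expand any .zip arguments into the files they contain, unpacked into
   // tmpDir, so "dmd xx foo.zip" behaves like "dmd xx a.d b/c.d d.obj".
   string[] expandZipArgs(string[] args, string tmpDir)
   {
       string[] expanded;
       foreach (arg; args)
       {
           if (arg.extension != ".zip")
           {
               expanded ~= arg;                 // ordinary argument, pass through
               continue;
           }
           auto zip = new ZipArchive(read(arg));
           foreach (name, member; zip.directory)
           {
               auto dest = buildPath(tmpDir, name);
               mkdirRecurse(dest.dirName);
               write(dest, zip.expand(member)); // decompress member to disk
               expanded ~= dest;
           }
       }
       return expanded;
   }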

P.S. I've also wanted to use .zip files as the .lib file format (!), as the various .lib formats have nothing over .zip files.

December 17, 2012
Walter Bright:

>    dmd foo.zip
>
> or:
>
>    dmd myfile ThirdPartyLib.zip
>
> and have it work. The advantage here is simply that everything can be contained in one simple file.

This was discussed a long time ago (even using a "rock" suffix for those zip files) and it seems like a nice idea.

Bye,
bearophile
December 17, 2012
On 12/17/12, Walter Bright <newshound2@digitalmars.com> wrote:
> I have toyed with the idea many times, however, of having dmd support zip files.

I think such a feature is better suited for RDMD. Then many other compilers would benefit, since RDMD can be used with GDC and LDC.
December 17, 2012
On Monday, 17 December 2012 at 22:12:01 UTC, Walter Bright wrote:
> On 12/17/2012 1:53 PM, Rob T wrote:
>> I mentioned in a previous post that we should perhaps focus on making the .di
>> concept more efficient rather than focus on obfuscation.
>
> We're not going to do obfuscation, because as I explained such cannot work, and we shouldn't do a disservice to users by pretending it does. There are many ways that *do* work, such as PIMPL, which work today and should be used by any organization wishing to obfuscate their implementation.
>
>
>> Shorter file sizes are a potential use case, and you could even allow a
>> distributor of byte code to optionally supply it in compressed form that is
>> automatically uncompressed when compiling, although as a trade-off that would
>> add a small compilation performance hit.
>
> I suspect most file transport protocols already compress the data, so compressing it ourselves probably accomplishes nothing. There are also compressed filesystems, so storing files in a compressed manner likely accomplishes little.
>
> I have toyed with the idea many times, however, of having dmd support zip files. Zip files can contain an arbitrary file hierarchy, with individual files in compressed, encrypted, or plaintext at the selection of the zip builder. An entire project, or library, or collection of source modules can be distributed as a zip file, and could be compiled with nothing more than:
>
>    dmd foo.zip
>
> or:
>
>    dmd myfile ThirdPartyLib.zip
>
> and have it work. The advantage here is simply that everything can be contained in one simple file.
>
> The concept is simple. The files in the zip file simply replace the zip file in the command line. So, if foo.zip contains a.d, b/c.d, and d.obj, then:
>
>    dmd xx foo.zip
>
> is equivalent to:
>
>    unzip foo
>    dmd xx a.d b/c.d d.obj
>
> P.S. I've also wanted to use .zip files as the .lib file format (!), as the various .lib formats have nothing over .zip files.

Yes please.
This is successfully used in Java: their jar files are actually zip files that can contain source code, binary files, resources, documentation, package metadata, etc. A jar can also store multiple binaries to support multiple architectures (as on OS X).
.NET assemblies accomplish similar goals (although I don't know if they are zip archives internally as well). D can support both "legacy" C/C++ compatible formats (lib/obj/headers) and this new dlib format.
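
Something along these lines could build such a "dlib" (a hypothetical sketch; the file names, the metadata file, and the .dlib extension are made up, not an agreed format):

   import std.file : read, write;
   import std.zip : ArchiveMember, CompressionMethod, ZipArchive;

   // Pack interface files, object code for several targets, and a metadata
   // file into one zip-based library, much like a Java jar.
   void main()
   {
       auto dlib = new ZipArchive();
       foreach (name; ["mylib.di", "mylib_x86.obj", "mylib_x86_64.obj", "dlib.json"])
       {
           auto m = new ArchiveMember();
           m.name = name;
           m.expandedData = cast(ubyte[]) read(name);
           m.compressionMethod = CompressionMethod.deflate;
           dlib.addMember(m);
       }
       write("mylib.dlib", dlib.build());       // the finished library file
   }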
December 17, 2012
On Monday, 17 December 2012 at 21:36:46 UTC, Walter Bright wrote:
> On 12/17/2012 12:49 PM, deadalnix wrote:
>> Granted, this is still easier than assembly, but you neglected the fact that
>> Java is rather simple, whereas D isn't. It is unlikely that optimized D
>> bytecode can ever be decompiled in a satisfying way.
>
> Please listen to me.
>
> You have FULL TYPE INFORMATION in the Java bytecode.

That is not true for Scala or Clojure. Java bytecode doesn't allow expressing closures and similar concepts. Decompiled Scala bytecode is frankly hard to understand.

Java bytecode is nice for decompiling Java, nothing else.

> You have ZERO, ZERO, ZERO type information in object code. (Well, you might be able to extract some from mangled global symbol names, for C++ and D (not C), if they haven't been stripped.) Do not underestimate what the loss of ALL the type information means to be able to do meaningful decompilation.
>
> Please understand that I actually do know what I'm talking about with this stuff. I have written a Java compiler. I know what it emits. I know what's in Java bytecode, and how it is TRIVIALLY reversed back into Java source.
>

I know that. I'm not arguing against that. I'm arguing against the idea that this is a blocker. It is a blocker in very few use cases, in fact. I'm just looking at the whole picture here. People needing that are the exception, not the rule.

And what prevents us from using a bytecode that loses information? As long as it is CTFEable, most people will be happy.
December 17, 2012
On Monday, 17 December 2012 at 22:12:01 UTC, Walter Bright wrote:
> On 12/17/2012 1:53 PM, Rob T wrote:
>> I mentioned in a previous post that we should perhaps focus on making the .di
>> concept more efficient rather than focus on obfuscation.
>
> We're not going to do obfuscation, because as I explained such cannot work, and we shouldn't do a disservice to users by pretending it does. There are many ways that *do* work, such as PIMPL, which work today and should be used by any organization wishing to obfuscate their implementation.

I agree.

>
> I suspect most file transport protocols already compress the data, so compressing it ourselves probably accomplishes nothing. There are also compressed filesystems, so storing files in a compressed manner likely accomplishes little.

Yes, however my understanding is that HTTP-based file transfers are often not compressed despite the protocol specifically supporting the feature. The problem is not with the protocol, it's that some clients and servers simply do not implement the feature or are misconfigured. HTTP, as you know, is very widely used for transferring files.
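
For instance, a quick probe (hypothetical URL, just a sketch using std.net.curl) shows whether a given server actually honours a request for compression:

   import std.net.curl : HTTP;
   import std.stdio : writeln;

   void main()
   {
       // Made-up URL; ask for gzip and see whether a Content-Encoding header comes back.
       auto http = HTTP("http://downloads.example.com/somelib.zip");
       http.method = HTTP.Method.head;              // fetch headers only
       http.addRequestHeader("Accept-Encoding", "gzip");
       http.perform();
       foreach (name, value; http.responseHeaders)  // look for Content-Encoding
           writeln(name, ": ", value);
   }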

Another thing to consider is using bytecode for interpretation; that way D could be used directly in game engines in place of Lua or other scripting methods, or even as a replacement for JavaScript. Of course you know best if this is practical for a language like D, but maybe a subset of D is practical, I don't know.

> I have toyed with the idea many times, however, of having dmd support zip files. Zip files can contain an arbitrary file hierarchy, with individual files in compressed, encrypted, or plaintext at the selection of the zip builder. An entire project, or library, or collection of source modules can be distributed as a zip file, and could be compiled with nothing more than:
>
>    dmd foo.zip
>
> or:
>
>    dmd myfile ThirdPartyLib.zip
>
> and have it work. The advantage here is simply that everything can be contained in one simple file.
>
> The concept is simple. The files in the zip file simply replace the zip file in the command line. So, if foo.zip contains a.d, b/c.d, and d.obj, then:
>
>    dmd xx foo.zip
>
> is equivalent to:
>
>    unzip foo
>    dmd xx a.d b/c.d d.obj
>
> P.S. I've also wanted to use .zip files as the .lib file format (!), as the various .lib formats have nothing over .zip files.

Wow, I think that's exactly what we could use! It serves multiple optional use cases all at once!

Was there a technical reason for not getting around to implementing it, or just a lack of time?

--rt
December 17, 2012
On Mon, 17 Dec 2012 13:47:36 -0800, Walter Bright <newshound2@digitalmars.com> wrote:

> I've often thought Java bytecode was a complete joke. It doesn't deliver any of its promises. You could tokenize Java source code, run the result through an lzw compressor, and get the equivalent functionality in every way.
>

Not true at all. Bytecode is semi-optimized, it is easier to manipulate (obfuscate, instrument, etc.), JVM/CLR bytecode is shared by many languages (Java and Scala; C# and F#) so you don't need a separate parser for each language, and there is hardware that supports running JVM bytecode on the metal. Try doing the same with LZW'd source code.
December 18, 2012
On Tue, Dec 18, 2012 at 02:08:55AM +0400, Dmitry Olshansky wrote: [...]
> I suspect it's one of the prime examples where the UNIX philosophy of combining a bunch of simple (~ dumb) programs together in place of one more complex program was taken *far* beyond reasonable lengths.
> 
> Having a pipe-line:
> preprocessor -> compiler -> (still?) assembler -> linker
> 
> where every program tries hard to know nothing about the previous
> ones (and to be as simple as it possibly can be) is bound to get
> inadequate results on many fronts:
> - efficiency & scalability
> - cross-border error reporting and detection (linker errors? errors
> for expanded macro magic?)
> - cross-file manipulations (e.g. optimization, see _how_ LTO is done in GCC)
> - multiple problems from a loss of information across pipeline*

The problem is not so much the structure preprocessor -> compiler -> assembler -> linker; the problem is that these logical stages have been arbitrarily assigned to individual processes residing in their own address space, communicating via files (or pipes, whatever it may be).

The fact that they are separate processes is in itself not that big of a problem, but the fact that they reside in their own address space is a big problem, because you cannot pass any information down the chain except through rudimentary OS interfaces like files and pipes. Even that wouldn't have been so bad, if it weren't for the fact that user interface (in the form of text input / object file format) has also been conflated with program interface (the compiler has to produce the input to the assembler, in *text*, and the assembler has to produce object files that do not encode any direct dependency information because that's the standard file format the linker expects).

Now consider if we keep the same stages, but each stage is not a separate program but a *library*. The code then might look, in greatly simplified form, something like this:

	import libdmd.compiler;
	import libdmd.assembler;
	import libdmd.linker;
	import std.stdio : File;  // for writing out the final executable

	void main(string[] args) {
		// typeof(asmCode) is some arbitrarily complex data
		// structure encoding assembly code, inter-module
		// dependencies, etc.
		auto asmCode = compiler.lex(args)
			.parse()
			.optimize()
			.codegen();

		// Note: no stupid redundant convert to string, parse,
		// convert back to internal representation.
		auto objectCode = assembler.assemble(asmCode);

		// Note: linker has direct access to dependency info,
		// etc., carried over from asmCode -> objectCode.
		auto executable = linker.link(objectCode);
		auto output = File(outfile, "w"); // outfile: desired executable name
		executable.generate(output);
	}

Note that the types asmCode, objectCode, executable, are arbitrarily complex, and may contain lazily-evaluated data structures, references to on-disk temporary storage (for large projects you can't hold everything in RAM), etc. Dependency information in asmCode is propagated to objectCode, as necessary. The linker has full access to all the info the compiler has access to, and can perform inter-module optimization, etc., by accessing information available to the *compiler* front-end, not just some crippled object file format.

The root of the current nonsense is that perfectly-fine data structures are arbitrarily required to be flattened into some kind of intermediate form, written to some file (or sent down some pipe), often with loss of information, then read from the other end, interpreted, and reconstituted into other data structures (with incomplete info), then processed. In many cases, information that didn't make it through the channel has to be reconstructed (often imperfectly), and then used. Most of these steps are redundant. If the compiler data structures were already directly available in the first place, none of this baroque dance is necessary.


> *Semantic info on interdependency of symbols in a source file is destroyed right before the linker and thus each .obj file is included as a whole or not at all. Thus all C run-times I've seen _sidestep_ this by writing each function in its own file(!). Even this alone should have been a clear indication.
> 
> While simplicity (and correspondingly small size in memory) of programs was king in the 70's, that's long past now. Nowadays I think it's all about getting the highest throughput and more powerful features.
[...]

Simplicity is good. Simplicity lets you modularize a very complex piece of software (a compiler that converts D source code into executables) into manageable chunks. Simplicity does not require shoe-horning modules into separate programs with separate address spaces with separate (and deficient) input/output formats.

The problem isn't with simplicity, the problem is with carrying over the archaic mapping of compilation stage -> separate program. I mean, imagine if std.regex was written so that regex compilation runs in a separate program with a separate address space, and the regex matcher that executes the match runs in another separate program with a separate address space, and the two talk to each other via pipes, or worse, intermediate files.

I've mentioned a few times before a horrendous C++ project that I had to work with once, where to make a single function call to a particular subsystem, it had to go through 6 layers of abstraction, one of which was IPC through a local UNIX socket, *and* another of which involved fwrite()ing function parameters into a file and fread()ing said parameters from the file in another process, with the 6 layers repeating in reverse to propagate the return value of the function back to the caller.

In the new version of said project, that subsystem exposes a library API where to make a function call, you, um, just call the function (gee, what a concept).  Needless to say, it didn't take a lot of effort to convince customers to upgrade, upon which we proceeded with great relish to delete every single source file having to do with that 6-layered monstrosity, and had a celebration afterwards.

From the design POV, though, the layout of the old version of the project utterly made sense. It was superbly (over)engineered, and if you made UML diagrams of it, they would be works of art fit for the British Museum. The implementation, however, was "somewhat" disappointing.


T

-- 
The irony is that Bill Gates claims to be making a stable operating system and Linus Torvalds claims to be trying to take over the world. -- Anonymous
December 18, 2012
On 12/17/2012 3:03 PM, deadalnix wrote:
> I know that. I'm not arguing against that. I'm arguing against the idea that this
> is a blocker. It is a blocker in very few use cases, in fact. I'm just looking at
> the whole picture here. People needing that are the exception, not the rule.

I'm not sure what you mean. A blocker for what?


> And what prevents us from using a bytecode that loses information?

I'd turn that around and ask why have a bytecode?


> As long as it is CTFEable, most people will be happy.

CTFE needs the type information and AST trees and symbol table. Everything needed for decompilation.
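
To make that concrete (a trivial sketch, not anything from Phobos): any function you want callable at compile time has to ship with its full body visible to the importer, which is exactly the information a decompiler needs.

   // square must keep its body available to every importing module for this to compile:
   int square(int x) { return x * x; }

   enum nine = square(3);   // CTFE evaluates the body at compile time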

I know that bytecode has been around since 1995 in its current incarnation, and there's an ingrained assumption that, since there's such an extensive ecosystem around it, there must be some advantage to it.

But there isn't.
December 18, 2012
On Tuesday, 18 December 2012 at 00:42:13 UTC, Walter Bright wrote:
> On 12/17/2012 3:03 PM, deadalnix wrote:
>> I know that. I'm not arguing against that. I'm arguing against the idea that this
>> is a blocker. It is a blocker in very few use cases, in fact. I'm just looking at
>> the whole picture here. People needing that are the exception, not the rule.
>
> I'm not sure what you mean. A blocker for what?
>
>
>> And what prevents us from using a bytecode that loses information?
>
> I'd turn that around and ask why have a bytecode?
>

Because it is efficiently CTFEable, without requiring the source code to be either recompiled or even distributed.

>
>> As long as it is CTFEable, most people will be happy.
>
> CTFE needs the type information and AST trees and symbol table. Everything needed for decompilation.
>

You do not need more information than what is in a .di file. Java and C# put more info in there because of runtime reflection (and still, there are tools to strip most of it; not the type info, granted, but everything else), something we don't need.