November 17, 2014
On 11/16/2014 5:43 PM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> On Monday, 17 November 2014 at 01:39:38 UTC, Walter Bright wrote:
>> Notice the total lack of strlen()'s in Warp.
>
> Why would you need that? You know where the lexeme begins and ends? If we are
> talking about old architectures you have to acknowledge that storage was at a premium
> and that the major cost was getting the strings into memory in the first place.

The preprocessor stores lots of strings. Things like identifiers, keywords, string literals, expanded macro text, etc.

The C preprocessor I wrote in C years ago is filled with strlen(), as is just about every C string-processing program ever written. Heck, how do you think strcat() works?

(Another problem with strlen() is that the string pointed to is in a different piece of memory, and it'll have to be loaded into the cache to scan for the 0. Whereas with slices, the length data is in the hot cache.)


>>> Nah, if you know that the file ends with zero then you can build an efficient
>>> finite automaton as a classifier.
>>
>> deadalnix busted that myth a while back with benchmarks.
>
> I haven't seen it,

It's in the n.g. archives somewhere in a thread about implementing lexers.


> but it is difficult to avoid lexers being bandwidth limited
> these days.
>
> Besides, how do you actually implement a lexer without constructing a FA one way
> or the other?

That's the wrong question. The question is does a trailing sentinel result in a faster FA? deadalnix demonstrated that the answer is 'no'.

You know, Ola, I've been in the trenches with this problem for decades. Sometimes I still learn something new, as I did with deadalnix's benchmark. But the stuff you are positing is well-trodden ground. There's a damn good reason why D uses slices and not 0 terminated strings.
November 17, 2014
On Monday, 17 November 2014 at 10:18:41 UTC, Walter Bright wrote:
> (Another problem with strlen() is that the string pointed to is in a different piece of memory, and it'll have to be loaded into the cache to scan for the 0. Whereas with slices, the length data is in the hot cache.)

Oh, I am not saying that strlen() is a good contemporary solution. I am saying that when you have <32KiB RAM total it makes sense to save space by not storing the string length.

>>> deadalnix busted that myth a while back with benchmarks.
>>
>> I haven't seen it,
>
> It's in the n.g. archives somewhere in a thread about implementing lexers.

Well, then it is just words.

> That's the wrong question. The question is does a trailing sentinel result in a faster FA? deadalnix demonstrated that the answer is 'no'.

I hear that, but the fact remains: you do less work, so it should be faster. If it is not, then you're either doing something wrong, or you have pipeline bubbles on a specific CPU that you fail to fill. Newer CPUs have a tiny loop buffer for tight inner loops that runs micro-ops without decoding, and you want to keep the code size down there.

Does it matter a lot in the world of SIMD? Probably not, but then you get a more complex lexer to maintain.

> deadalnix's benchmark. But the stuff you are positing is well-trodden ground. There's a damn good reason why D uses slices and not 0 terminated strings.

I've never said that D should use 0 terminated strings. Now you twist the debate.
November 17, 2014
Remember that the alternative to zero-terminated strings at that time was to have 2 string types, one with a one-byte length and one with a larger length. So I think C made the right choice for its time, to have a single string type without a length.
November 17, 2014
On Monday, 17 November 2014 at 11:43:45 UTC, Ola Fosheim Grøstad wrote:
> Remember that the alternative to zero-terminated strings at that time was to have 2 string types, one with a one-byte length and one with a larger length. So I think C made the right choice for its time, to have a single string type without a length.

Black hat hackers, virus writers, and security tool vendors around the world rejoice over that decision...

It was anything but right.
November 17, 2014
On Monday, 17 November 2014 at 12:36:49 UTC, Paulo Pinto wrote:
> On Monday, 17 November 2014 at 11:43:45 UTC, Ola Fosheim Grøstad wrote:
>> Remember that the alternative to zero-terminated strings at that time was to have 2 string types, one with a one-byte length and one with a larger length. So I think C made the right choice for its time, to have a single string type without a length.
>
> Black hat hackers, virus writers, and security tool vendors around the world rejoice over that decision...
>
> It was anything but right.

I don't think buffer overflow and string fundamentals are closely related, if used reasonably, but I'm not surprised you favour Pascal's solution of having two string types: one for strings up to 255 bytes and another one for longer strings.

Anyway, here is the real reason for how C implemented strings:

«None of BCPL, B, or C supports character data strongly in the language; each treats strings much like vectors of integers and supplements general rules by a few conventions. In both BCPL and B a string literal denotes the address of a static area initialized with the characters of the string, packed into cells. In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled `*e'. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.

Individual characters in a BCPL string were usually manipulated by spreading the string out into another array, one character per cell, and then repacking it later; B provided corresponding routines, but people more often used other library functions that accessed or replaced individual characters in a string.»

http://cm.bell-labs.com/cm/cs/who/dmr/chist.html

November 17, 2014
On Monday, 17 November 2014 at 12:49:16 UTC, Ola Fosheim Grøstad wrote:
On Monday, 17 November 2014 at 12:36:49 UTC, Paulo Pinto wrote:
>> On Monday, 17 November 2014 at 11:43:45 UTC, Ola Fosheim Grøstad wrote:
>>> Remember that the alternative to zero-terminated strings at that time was to have 2 string types, one with a one-byte length and one with a larger length. So I think C made the right choice for its time, to have a single string type without a length.
>>
>> Black hat hackers, virus writers, and security tool vendors around the world rejoice over that decision...
>>
>> It was anything but right.
>
> I don't think buffer overflow and string fundamentals are closely related, if used reasonably, but I'm not surprised you favour Pascal's solution of having two string types: one for strings up to 255 bytes and another one for longer strings.
>
> Anyway, here is the real reason for how C implemented strings:
>
> «None of BCPL, B, or C supports character data strongly in the language; each treats strings much like vectors of integers and supplements general rules by a few conventions. In both BCPL and B a string literal denotes the address of a static area initialized with the characters of the string, packed into cells. In BCPL, the first packed byte contains the number of characters in the string; in B, there is no count and strings are terminated by a special character, which B spelled `*e'. This change was made partially to avoid the limitation on the length of a string caused by holding the count in an 8- or 9-bit slot, and partly because maintaining the count seemed, in our experience, less convenient than using a terminator.
>
> Individual characters in a BCPL string were usually manipulated by spreading the string out into another array, one character per cell, and then repacking it later; B provided corresponding routines, but people more often used other library functions that accessed or replaced individual characters in a string.»
>
> http://cm.bell-labs.com/cm/cs/who/dmr/chist.html

I am fully aware how UNIX designers decided to ignore the systems programming being done in Algol variants, PL/I variants and many other wannabe systems programming languages that came before C.

Which they are repeating again with Go.

--
Paulo
November 17, 2014
On Monday, 17 November 2014 at 13:39:05 UTC, Paulo Pinto wrote:
> I am fully aware how UNIX designers decided to ignore the systems programming being done in Algol variants, PL/I variants and many other wannabe systems programming languages that came before C.

I wouldn't say that Algol is a systems programming language, and Pascal originally only had fixed width strings!

(But Simula actually had decent GC-backed string support, with substrings pointing into the same buffer and a link to the full buffer from substrings, thus somewhat more advanced than D ;-)

November 17, 2014
On 11/17/2014 3:43 AM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> Remember that the alternative to zero-terminated strings at that time was to
> have 2 string types, one with a one byte length and one with a larger length.

No, that was not the alternative.


November 17, 2014
On 11/17/2014 3:00 AM, "Ola Fosheim Grøstad" <ola.fosheim.grostad+dlang@gmail.com> wrote:
> I am saying
> that when you have <32KiB RAM total it makes sense to save space by not storing
> the string length.

I know what you're saying.

You're saying without evidence that sentinels are faster. They are not.
You're saying without evidence that 0 terminated strings use less memory. They do not.

(It does not save space when "filename" and "filename.ext" cannot be overlapped.)

November 17, 2014
On Monday, 17 November 2014 at 19:24:49 UTC, Walter Bright wrote:
> You're saying without evidence that sentinels are faster. They are not.

You are twisting and turning so much in discussions that you make me dizzy.

I've been saying that for SOME OPERATIONS they are, and that is not without evidence. Just plot it out for a 65xx, 680xx, or Z80 CPU and it becomes self-evident. Any system-level programmer should be able to do it in a few minutes.

Using sentinels is a common trick for speeding up algorithms; it has some downsides and some upsides, but they are used for a reason (speed, convenience, or both).

Pretending that sentinels are entirely useless is not a sane line of argument. I use sentinels in many situations and for many purposes, and they can greatly speed up and/or simplify code.

> You're saying without evidence that 0 terminated strings use less memory. They do not.
>
> (It does not save space when "filename" and "filename.ext" cannot be overlapped.)

0-terminated strings and shortstrings (first byte used for the length) take the same amount of space, but permanent substring reference slices are very wasteful of memory in low-memory situations:

1. you need a ref count on the base buffer (2-4 bytes)
2. you need pointer to base + 2 offsets (4-12 bytes)

And worst of all, you retain the whole buffer even if you only reference a tiny portion of it. Yuk! In such a scenario you are generally better off reallocating or using compaction. For non-permanent substrings you can still use begin/end pointers.

And please, no, GC is not the answer. Simula had GC and the kind of strings and substrings you argued for, but it was not intended for system-level programming and it was not resource efficient. It was convenient. Scripty-style concatenation and substring slicing is fun, but it is not system-level programming. System-level programming is about taking control over the hardware and using it most efficiently. Abstractions "that lie" mess this up.

Is wasting space on meta information less critical today? YES, OF COURSE! It does matter that we have 100,000 times more RAM available.