February 17, 2011
David Nadlinger wrote:
> On 2/17/11 8:56 AM, Denis Koroskin wrote:
>> I second that. word/uword are shorter than ssize_t/size_t and more in
>> line with other type names.
>>
>> I like it.
> 
> I agree that size_t/ptrdiff_t are misnomers and I'd love to kill them with fire, but when I read about »word«, I intuitively associated it with »two bytes« first – blame Intel or whoever else, but the potential for confusion is definitely not negligible.
> 
> David

Me too. A word is two bytes. Any other definition seems to be pretty useless.

The whole concept of "machine word" seems very archaic and incorrect to me anyway. It assumes that the data registers and address registers are the same size, which is very often not true.
For example, on an 8-bit machine (eg, 6502 or Z80), the accumulator was only 8 bits, yet size_t was definitely 16 bits.
It's quite plausible that at some time in the future we'll get a machine with 128-bit registers and data bus, but retaining the 64 bit address bus. So we could get a size_t which is smaller than the machine word.

In summary: size_t is not the machine word.
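
A minimal sketch of the point, in C only because size_t comes from there. The helper names are made up for illustration. It reports the bit width of int, size_t, a data pointer and the widest integer type on the build target; nothing in the language guarantees that any two of them agree.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers: bit widths of several candidate "word" types. */
unsigned bits_of_int(void)     { return (unsigned)(sizeof(int) * CHAR_BIT); }
unsigned bits_of_size_t(void)  { return (unsigned)(sizeof(size_t) * CHAR_BIT); }
unsigned bits_of_pointer(void) { return (unsigned)(sizeof(void *) * CHAR_BIT); }
unsigned bits_of_uintmax(void) { return (unsigned)(sizeof(uintmax_t) * CHAR_BIT); }

/* Print all four so they can be compared on the current target. */
void print_widths(void) {
    printf("int: %u, size_t: %u, void*: %u, uintmax_t: %u\n",
           bits_of_int(), bits_of_size_t(), bits_of_pointer(), bits_of_uintmax());
}
```

On a typical 64-bit Linux build, int and size_t already come out at different widths, so "the" word size depends on which of these you ask about.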
February 17, 2011
<minor-rant>

On Thu, 2011-02-17 at 10:13 +0100, Don wrote:
[ . . . ]
> Me too. A word is two bytes. Any other definition seems to be pretty useless.

Sounds like people have been living with 8- and 16-bit processors for too long.

A word is the natural length of an integer item in the processor.  It is necessarily machine specific.  cf. DEC-10 had 9-bit bytes and 36-bit word, IBM 370 has an 8-bit byte and a 32-bit word, though addresses were 24-bit.  ix86 follows IBM 8-bit byte and 32-bit word.

The really interesting question is whether on x86_64 the word is 32-bit or 64-bit.

> The whole concept of "machine word" seems very archaic and incorrect to me anyway. It assumes that the data registers and address registers are the same size, which is very often not true.

Machine words are far from archaic. Even on the JVM, if you don't know the length of the word on the machine you are executing on, how do you know the set of values that can be represented?  In floating point numbers, if you don't know the length of the word, how do you know the accuracy of the computation?

Clearly data registers and address registers can be different lengths; it is not the job of a programming language that compiles to native code to ignore this and attempt to homogenize things beyond what is reasonable.

If you are working in native code then word length is a crucial property since it can change depending on which processor you compile for.

> For example, on an 8-bit machine (eg, 6502 or Z80), the accumulator was only 8 bits, yet size_t was definitely 16 bits.

The 8051 was only surpassed a couple of years ago by ARMs as the most numerous processor on the planet.  8-bit processors may only have had 8-bit ALUs -- leading to a hypothesis that the word was 8 bits -- but the word length was effectively 16-bit due to the hardware support for multi-byte integer operations.

> It's quite plausible that at some time in the future we'll get a machine with 128-bit registers and data bus, but retaining the 64 bit address bus. So we could get a size_t which is smaller than the machine word.
> 
> In summary: size_t is not the machine word.

Agreed !

As long as the address bus is less wide than an integer, there are no apparent problems using integers as addresses.  The problem comes when addresses are wider than integers.  A good statically-typed programming language should manage this by having integers and addresses as distinct sets.  C and C++ have led people astray.  There should be an appropriate set of integer types and an appropriate set of address types and using one from the other without active conversion is always going to lead to problems.
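
As a rough illustration of "active conversion", here is a C sketch that keeps the address-to-integer step explicit. The function names are hypothetical, and uintptr_t is an optional C99 type that common platforms provide; the sketch assumes it exists.

```c
#include <stdint.h>

/* Explicit, visible conversion from an address to an integer type
   that is wide enough to hold it (assumes uintptr_t is available). */
uintptr_t address_to_integer(const void *p) {
    return (uintptr_t)p;
}

/* The reverse direction, equally explicit. C99 guarantees that a valid
   void pointer survives the round trip through uintptr_t. */
const void *integer_to_address(uintptr_t n) {
    return (const void *)n;
}
```

Storing the address in a plain int instead would silently truncate it on targets where addresses are wider than int, which is the failure mode described above.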

Do not be afraid of the word.  Fear leads to anger.  Anger leads to hate.  Hate leads to suffering. (*)

</minor-rant>

(*) With apologies to Master Yoda (**) for any misquote.

(**) Or more likely whoever his script writer was.
-- 
Russel.
=============================================================================
Dr Russel Winder      t: +44 20 7585 2200   voip: sip:russel.winder@ekiga.net
41 Buckmaster Road    m: +44 7770 465 077   xmpp: russel@russel.org.uk
London SW11 1EN, UK   w: www.russel.org.uk  skype: russel_winder


February 17, 2011
On 02/17/2011 05:19 AM, Kevin Bealer wrote:
> == Quote from spir (denis.spir@gmail.com)'s article
>> On 02/16/2011 03:07 AM, Jonathan M Davis wrote:
>>> On Tuesday, February 15, 2011 15:13:33 spir wrote:
>>>> On 02/15/2011 11:24 PM, Jonathan M Davis wrote:
>>>>> Is there some low level reason why size_t should be signed or something
>>>>> I'm completely missing?
>>>>
>>>> My personal issue with unsigned ints in general as implemented in C-like
>>>> languages is that the range of non-negative signed integers is half of the
>>>> range of corresponding unsigned integers (for same size).
>>>> * practically: known issues, and bugs if not checked by the language
>>>> * conceptually: contradicts the "obvious" idea that unsigned (aka naturals)
>>>> is a subset of signed (aka integers)
>>>
>>> It's inevitable in any systems language. What are you going to do, throw away a
>>> bit for unsigned integers? That's not acceptable for a systems language. On some
>>> level, you must live with the fact that you're running code on a specific machine
>>> with a specific set of constraints. Trying to do otherwise will pretty much
>>> always harm efficiency. True, there are common bugs that might be better
>>> prevented, but part of it ultimately comes down to the programmer having some
>>> clue as to what they're doing. On some level, we want to prevent common bugs,
>>> but the programmer can't have their hand held all the time either.
>> I cannot prove it, but I really think you're wrong on that.
>> First, the question of 1 bit. Think about this -- speaking of 64 bit size:
>> * 99.999% of all uses of unsigned fit under 2^63
>> * To benefit from the last bit, you must have the need to store a value
>> 2^63 <= v < 2^64
>> * Not only this, you must hit a case where /any/ possible value for v
>> (depending on execution data) could be >= 2^63, but /all/ possible values for v
>> are guaranteed < 2^64
>> This can only be a very small fraction of cases where your value does not fit
>> in 63 bits, don't you think? Has it ever happened to you (even in 32 bits)?
>> Something like: "what luck! this value would not (always) fit in 31 bits, but
>> (due to this constraint), I can be sure it will fit in 32 bits (always,
>> whatever input data it depends on)".
>> In fact, n bits do the job because (1) nearly all unsigned values are very
>> small (2) the size used at a time covers the memory range at the same time.
>> On efficiency, if unsigned is not a subset of signed, then at a low level you
>> may be forced to add checks in numerous utility routines, the kind constantly
>> used, everywhere one type may play with the other. I'm not sure where the gain is.
>> On correctness, intuitively I guess (just a wild guess indeed) if unsigned
>> values form a subset of signed ones programmers will more easily reason
>> correctly about them.
>> Now, I perfectly understand the "sacrifice" of one bit sounds like a sacrilege ;-)
>> (*)
>> Denis
>> (*) But you know, when as a young guy you have coded for 8 & 16-bit machines,
>> having 63 or 64...
>
> If you write low level code, it happens all the time.  For example, you can copy
> memory areas quickly on some machines by treating them as arrays of "long" and
> copying the values -- which requires the upper bit to be preserved.
>
> Or you compute a 64 bit hash value using an algorithm that is part of some
> standard protocol.  Oops -- requires an unsigned 64 bit number, the signed version
> would produce the wrong result.  And since the standard expects normal behaving
> int64's you are stuck -- you'd have to write a little class to simulate unsigned
> 64 bit math.  E.g. a library that computes md5 sums.
>
> Not to mention all the code that uses 64 bit numbers as bit fields where the
> different bits or sets of bits are really subfields of the total range of values.
>
> What you are saying is true of high level code that models real life -- if the
> value is someone's salary or the number of toasters they are buying from a store
> you are probably fine -- but a lot of low level software (ipv4 stacks, video
> encoders, databases, etc) are based on designs that require numbers to behave a
> certain way, and losing a bit is going to be a pain.
>
> I've run into this with Java, which lacks unsigned types, and once you run into a
> case that needs that extra bit it gets annoying right quick.

You're right indeed, but this is a different issue. If you need to perform bit-level manipulation, then the proper type to use is u-somesize.
What we were discussing, I guess, is the standard type used by both stdlib and application code for indices/positions and counts/sizes/lengths:
	SomeType count (E) (E[] elements, E element)
	SomeType search (E) (E[] elements, E element, SomeType fromPos=0)
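
A rough C rendering of these two signatures, to make the choice of SomeType concrete. Everything here is an assumption for illustration: index_t stands in for SomeType and is taken to be the signed ptrdiff_t, the element type is fixed to int, and -1 is used as the "not found" result. The last function shows the signed/unsigned mismatch under discussion.

```c
#include <stddef.h>

/* Hypothetical stand-in for SomeType: a signed, pointer-sized index. */
typedef ptrdiff_t index_t;

/* Number of occurrences of element in elements[0..length). */
index_t count(const int *elements, index_t length, int element) {
    index_t n = 0;
    for (index_t i = 0; i < length; i++) {
        if (elements[i] == element) n++;
    }
    return n;
}

/* Index of the first occurrence at or after from_pos, or -1 if none. */
index_t search(const int *elements, index_t length, int element, index_t from_pos) {
    for (index_t i = from_pos; i < length; i++) {
        if (elements[i] == element) return i;
    }
    return -1;
}

/* Mixed comparison: -1 is converted to size_t first, so on platforms where
   size_t is at least as wide as int this returns 0 ("-1 is not less than 1"). */
int minus_one_less_than_one_as_size(void) {
    return -1 < (size_t)1;
}
```

With an unsigned SomeType, search would need a different sentinel (for example the maximum value), and every caller comparing against a signed quantity would run into the conversion shown in the last function.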

Denis
-- 
_________________
vita es estrany
spir.wikidot.com

February 17, 2011
On 02/17/2011 10:13 AM, Don wrote:
> David Nadlinger wrote:
>> On 2/17/11 8:56 AM, Denis Koroskin wrote:
>>> I second that. word/uword are shorter than ssize_t/size_t and more in
>>> line with other type names.
>>>
>>> I like it.
>>
>> I agree that size_t/ptrdiff_t are misnomers and I'd love to kill them with
>> fire, but when I read about »word«, I intuitively associated it with »two
>> bytes« first – blame Intel or whoever else, but the potential for confusion
>> is definitely not negligible.
>>
>> David
>
> Me too. A word is two bytes. Any other definition seems to be pretty useless.
>
> The whole concept of "machine word" seems very archaic and incorrect to me
> anyway. It assumes that the data registers and address registers are the same
> size, which is very often not true.
> For example, on an 8-bit machine (eg, 6502 or Z80), the accumulator was only 8
> bits, yet size_t was definitely 16 bits.
> It's quite plausible that at some time in the future we'll get a machine with
> 128-bit registers and data bus, but retaining the 64 bit address bus. So we
> could get a size_t which is smaller than the machine word.
>
> In summary: size_t is not the machine word.

Right, there is no single native machine word size; but I guess what we're interested in is, among those sizes, the one that ensures minimal processing time. I mean, the data size for which there are native computation instructions (logical, numeric), so that if we use it we get the fewest cycles for a given operation.
Also, this size (on common modern architectures, at least) allows directly accessing all of the memory address space; not a negligible property ;-).
Or are there points I'm overlooking?

Denis
-- 
_________________
vita es estrany
spir.wikidot.com

February 17, 2011
spir wrote:
> On 02/17/2011 10:13 AM, Don wrote:
>> David Nadlinger wrote:
>>> On 2/17/11 8:56 AM, Denis Koroskin wrote:
>>>> I second that. word/uword are shorter than ssize_t/size_t and more in
>>>> line with other type names.
>>>>
>>>> I like it.
>>>
>>> I agree that size_t/ptrdiff_t are misnomers and I'd love to kill them with
>>> fire, but when I read about »word«, I intuitively associated it with »two
>>> bytes« first – blame Intel or whoever else, but the potential for confusion
>>> is definitely not negligible.
>>>
>>> David
>>
>> Me too. A word is two bytes. Any other definition seems to be pretty useless.
>>
>> The whole concept of "machine word" seems very archaic and incorrect to me
>> anyway. It assumes that the data registers and address registers are the same
>> size, which is very often not true.
>> For example, on an 8-bit machine (eg, 6502 or Z80), the accumulator was only 8
>> bits, yet size_t was definitely 16 bits.
>> It's quite plausible that at some time in the future we'll get a machine with
>> 128-bit registers and data bus, but retaining the 64 bit address bus. So we
>> could get a size_t which is smaller than the machine word.
>>
>> In summary: size_t is not the machine word.
> 
> Right, there is no single native machine word size; but I guess what we're interested in is, among those sizes, the one that ensures minimal processing time. I mean, the data size for which there are native computation instructions (logical, numeric), so that if we use it we get the fewest cycles for a given operation.

There's frequently more than one such size.

> Also, this size (on common modern architectures, at least) allows directly accessing all of the memory address space; not a negligible property ;-).

This is not necessarily the same.

> Or are there points I'm overlooking?
> 
> Denis
February 17, 2011
Walter Bright Wrote:

> Actually, you can have a segmented model on a 32 bit machine rather than a flat model, with separate segments for code, data, and stack. The Digital Mars DOS Extender actually does this. The advantage of it is you cannot execute data on the stack.

AFAIK you inevitably have segments in the flat model; x86 just doesn't work any other way. On Windows the stack segment seems to be the same as the data segment, while the code segment is different. Are they needed for access checks? I thought access modes were checked in the page tables.
February 17, 2011
Russel Winder wrote:
> <minor-rant>
> 
> On Thu, 2011-02-17 at 10:13 +0100, Don wrote:
> [ . . . ]
>> Me too. A word is two bytes. Any other definition seems to be pretty useless.
> 
> Sounds like people have been living with 8- and 16-bit processors for
> too long.
> 
> A word is the natural length of an integer item in the processor.  It is
> necessarily machine specific.  cf. DEC-10 had 9-bit bytes and 36-bit
> word, IBM 370 has an 8-bit byte and a 32-bit word, though addresses were
> 24-bit.  ix86 follows IBM 8-bit byte and 32-bit word.

Yes, I know. It's true but I think rather useless.
We need a name for an 8 bit quantity, and a 16 bit quantity, and higher powers of two. 'byte' is an established name for the first one, even though historically there were 9-bit bytes. IMHO 'word' wasn't such a bad name for the second one, even though its etymology comes from the machine word size of some specific early processors. But the equally arbitrary name 'short' has become widely accepted.

> The really interesting question is whether on x86_64 the word is 32-bit
> or 64-bit.

With the rising importance of the SIMD instruction set, you could even argue that it is 128 bits in many cases...


>> The whole concept of "machine word" seems very archaic and incorrect to me anyway. It assumes that the data registers and address registers are the same size, which is very often not true.
> 
> Machine words are far from archaic. Even on the JVM, if you don't know
> the length of the word on the machine you are executing on, how do you
> know the set of values that can be represented?  In floating point
> numbers, if you don't know the length of the word, how do you know the
> accuracy of the computation?

Yes, but they're not necessarily the same number. There is a native size for every type of operation, but it's not universal across all operations.

I don't think there's a way you can define "machine word" in a way which is terribly useful. By the time you've got something unambiguous and well-defined, it doesn't have many interesting properties. It's valid in such limited cases that you'd be better off with a clearer name.

> Clearly data registers and address registers can be different lengths;
> it is not the job of a programming language that compiles to native code
> to ignore this and attempt to homogenize things beyond what is
> reasonable.

Agreed, and this is I think what makes the concept of "machine word" not very helpful.

> 
> If you are working in native code then word length is a crucial property
> since it can change depending on which processor you compile for.
> 
>> For example, on an 8-bit machine (eg, 6502 or Z80), the accumulator was only 8 bits, yet size_t was definitely 16 bits.
> 
> The 8051 was only surpassed a couple of years ago by ARMs as the most
> numerous processor on the planet.  8-bit processors may only have had
> 8-bit ALUs -- leading to a hypothesis that the word was 8 bits -- but
> the word length was effectively 16-bit due to the hardware support for
> multi-byte integer operations.

The 6502 was restricted to 8 bits in almost every way. About half of the instructions that involved 16 bit quantities would wrap on page boundaries. jmp (0x7FF) would do an indirect jump, getting the low byte from address 0x7FF and the high byte from 0x700 !!
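
The page wrap on the indirect jump can be sketched in C as follows. This is an illustrative emulation of the original NMOS 6502 behaviour, not code from any real emulator; the function names are invented, and memory is assumed to point at a full 64 KiB image.

```c
#include <stdint.h>

/* Target of JMP (ptr) as an NMOS 6502 fetches it: only the low byte of the
   pointer is incremented, so there is no carry into the high byte. */
uint16_t jmp_indirect_nmos6502(const uint8_t *memory, uint16_t ptr) {
    uint8_t lo = memory[ptr];
    uint16_t hi_addr = (uint16_t)((ptr & 0xFF00u) | ((ptr + 1u) & 0x00FFu));
    uint8_t hi = memory[hi_addr];
    return (uint16_t)(lo | (hi << 8));
}

/* What a true 16-bit pointer increment would fetch instead. */
uint16_t jmp_indirect_expected(const uint8_t *memory, uint16_t ptr) {
    uint8_t lo = memory[ptr];
    uint8_t hi = memory[(uint16_t)(ptr + 1u)];
    return (uint16_t)(lo | (hi << 8));
}
```

For ptr = 0x07FF the first function reads its high byte from 0x0700 and the second from 0x0800, so the two only agree when the pointer does not sit on the last byte of a page.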


>> It's quite plausible that at some time in the future we'll get a machine with 128-bit registers and data bus, but retaining the 64 bit address bus. So we could get a size_t which is smaller than the machine word.
>>
>> In summary: size_t is not the machine word.
> 
> Agreed !
> 
> As long as the address bus is less wide than an integer, there are no
> apparent problems using integers as addresses.  The problem comes when
> addresses are wider than integers.  A good statically-typed programming
> language should manage this by having integers and addresses as distinct
> sets.  C and C++ have led people astray.  There should be an appropriate
> set of integer types and an appropriate set of address types and using
> one from the other without active conversion is always going to lead to
> problems.

Indeed.

> 
> Do not be afraid of the word.  Fear leads to anger.  Anger leads to
> hate.  Hate leads to suffering. (*)
> 
> </minor-rant>
> 
> (*) With apologies to Master Yoda (**) for any misquote.
> 
> (**) Or more likely whoever his script writer was.
February 17, 2011
Le 17/02/2011 13:28, Don a écrit :
>
> Yes, I know. It's true but I think rather useless.
> We need a name for an 8 bit quantity, and a 16 bit quantity, and higher
> powers of two. 'byte' is an established name for the first one, even
> though historically there were 9-bit bytes. IMHO 'word' wasn't such a
> bad name for the second one, even though its etymology comes from the
> machine word size of some specific early processors. But the equally
> arbitrary name 'short' has become widely accepted.

8 bits: octet -> http://en.wikipedia.org/wiki/Octet_%28computing%29
February 17, 2011
== Quote from Daniel Gibson (metalcaedes@gmail.com)'s article
> It was not proposed to alter ulong (int64), but only a size_t equivalent. ;)
> And I agree that not having unsigned types (like in Java) just sucks.
> Wasn't Java even advertised as a programming language for network stuff? Quite
> ridiculous without unsigned types..
> Cheers,
> - Daniel

Ah yes, but if you want to copy data quickly you want to use the efficient size for doing so.  Since architectures vary, size_t (or the new name if one is added) would seem to new users to be the natural choice for that size.  So it becomes a likely error if it doesn't behave as expected.

My personal reaction to this thread is that I think most of the arguments of the people who want to change the name or add a new one are true -- but not sufficient to make it worthwhile.  There is always some learning curve and size_t is not that hard to learn or that hard to accept.

Kevin
February 17, 2011
Russel Winder wrote:
> Do not be afraid of the word.  Fear leads to anger.  Anger leads to
> hate.  Hate leads to suffering. (*)

> (*) With apologies to Master Yoda (**) for any misquote.

"Luke, trust your feelings!" -- Oggie Ben Doggie

Of course, expecting consistency from Star Wars is a waste of time.