June 30, 2004
In article <opsadsu8f75a2sq9@digitalmars.com>, Regan Heath says...

>s = p.getValue("foo");
>if (s) ..
>s = p.getValue("bar");
>if (s) ..
>
>Right...
>
>If I cannot return null, then (using the code above) I cannot tell the difference between whether foo or bar was passed or had an empty value.


And indeed that very situation is ALSO true with integer parameters. How can you tell the difference between an integer parameter being present and zero, and no integer parameter being present at all?

But of course, there are various solutions to this problem, many much simpler than you propose. For a start, you could return an int* instead of an int, or indeed a char[]* instead of a char[]. Then you could explicitly test for ===null in both cases.
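For example, a rough sketch of the pointer version (getIntValue and the params store are invented names, not anything from Phobos):

int[char[]] params;   // hypothetical store of parsed settings

int* getIntValue(char[] name)
{
    if (name in params)
        return &params[name];   // present, even if the value is 0
    return null;                // absent
}

The caller then tests getIntValue("foo") === null and knows the parameter was absent rather than merely zero.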

In C++, I'd just return a std::pair<bool, T>. I'm sure that once we have a good supply of standard templates in D we'll be able to do much the same thing. (Even without templates, you could define a struct and return it).
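A sketch of that struct version, again with invented names:

struct IntResult
{
    bool present;   // defaults to false
    int  value;
}

IntResult getIntValue(char[] name)
{
    IntResult r;
    if (name in params)   // params as in the previous sketch
    {
        r.present = true;
        r.value = params[name];
    }
    return r;
}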

Anything wrong with either of these approaches?

Arcane Jill


June 30, 2004
Arcane Jill wrote:

> In article <cbsufo$a8u$1@digitaldaemon.com>, Sam McCall says...
> 
> 
>>Okay, suppose java had a 21- or 32-bit char type.
> 
> 
> I'm led to believe there was a lot of debate about this. Some folk said that
> Java's char could NOT be anything other than 16 bits wide because it was defined
> that way and changing it would break things. Other folk looked under the hood of
> the JVM and decided that actually it probably wouldn't break anything after all.
> I don't know the ins and outs of it, but I gather the first lot won. The way
> it's going to go is UTF-16 support, with functions like isLetter() taking an int
> rather than a char.
Sorry, I meant "if Java had originally been defined to have char being 21 bits instead of 16, and storing a Unicode codepoint instead of a UTF-16 fragment". All Java's string manipulation stuff is char-based, and I was convinced there was a one-to-one correspondence between chars and characters (or possibly allowing some too-big char values). Clearly I was mistaken, but if they had made chars 21 bits and kept the rest the same, it looks to me like it'd be just about perfect. (Well, I'm sure the APIs could be improved in minor ways, etc, but relatively speaking.)

>>Glyphs aren't really a practical option as the logical element type of strings if they can't be easily represented as a fixed-width number, I'd imagine.
> 
> Well, they can, with a bit of sneaky manipulation. The trick is to map only
> those ones you actually USE to the unused codepoints between 0x110000 and
> 0xFFFFFFFF. So long as such a mapping stays within the application (like, don't
> try to export it), you can indeed have one dchar per glyph. But it would be a
> temporary one - not one you could write to a file, for example.
Ooh, clever :) But I don't see this working in a situation where you have dynamic libraries, for example.
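For what it's worth, here is roughly how I picture the trick working (all names invented; note the table is strictly process-local):

dchar[dchar[]] glyphToCode;   // glyph (codepoint sequence) -> private code
dchar[][dchar] codeToGlyph;   // private code -> glyph
dchar nextCode = 0x110000;    // first value beyond real Unicode

dchar internGlyph(dchar[] glyph)
{
    if (glyph in glyphToCode)
        return glyphToCode[glyph];
    dchar c = nextCode++;     // fits, since dchar is 32 bits wide
    glyphToCode[glyph] = c;
    codeToGlyph[c] = glyph;
    return c;                 // meaningless outside this process
}

And that sketch shows the dynamic-library problem: two libraries each keeping their own table would hand out the same private codes for different glyphs.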

>>>It /is/ okay to use ASCII. All valid ASCII also happens to be valid UTF-8. UTF-8
>>>was designed that way.
>>
>>So this means a char[] has two purposes depending on the app?
> 
> I'm not sure I follow that. If you say char[] a = "hello world"; then you will
> get a string containing eleven chars, and it will be both valid ASCII and valid
> UTF-8. It's not like you have to choose.
>
>> On the one hand, ASCII/Unicode being a per-app decision is fair
>> enough.
>
> That isn't what I said. It's possible we may be misunderstanding each other somehow.

Sorry, what I originally meant:
> Especially in places where ASCII works fine, that's certainly easy and
> consistent! The current way seems to suggest that officially it's all
> unicode and happy, but (don't tell anyone) feel free to use ascii
Was that although Unicode is the officially designated content of these types, char[] looks and feels (and the standard library treats it) like it's ASCII, and people won't bother to use Unicode, because it requires calling conversion functions and so on.
Especially since, if you assume the language will take care of Unicode for you like Java (almost) does, you'll end up with code that only works properly for ASCII data. That's probably all a lot of people will test it with. We should get Unicode by default.

>>If it were documented as only working for ASCII, sure; otherwise you might assume it was a UTF-8 encoded character list. And I'm still not sure it'd be reasonable unless a wchar/dchar version was provided; how good is a language's Unicode support if string manipulation functions only work on ASCII?
> 
> 
> I'm not completely clear what functions you're talking about, as I haven't read
> the source code for std.string. Am I correct in assuming that the quote below is
> an extract?
std.string.maketrans and std.string.translate.

>>Anyway:
>>/************************************
>> * Construct translation table for translate().
>> */
>>
>>char[] maketrans(char[] from, char[] to)
>>    in
>>    {
>>	assert(from.length == to.length);
>>    }
>>    body
>>    {
>>	char[] t = new char[256];
>>	int i;
>>
>>	for (i = 0; i < 256; i++)
>>	    t[i] = cast(char)i;
>>
>>	for (i = 0; i < from.length; i++)
>>	    t[from[i]] = to[i];
>>
>>	return t;
>>    }
>>
> 
> 
> This is a bug. ASCII stops at 0x7F. Characters above 0x7F are not ASCII. If this
> function is intended as an ASCII-only function then (a) it should be documented
> as such, and (b) it should leave all bytes >0x7F unmodified. Char values between
> 0x80 and 0xFF are reserved for the role they play in UTF-8. You CANNOT mess
> with them (unless you're a UTF-8 engine).
It's got a single-line explanation that doesn't mention encoding. I'll report it.
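For the record, a sketch of the documented-as-ASCII version Jill describes (maketransAscii is a made-up name, not a proposed Phobos signature):

/************************************
 * ASCII-only translation table: bytes above 0x7F are never remapped.
 */
char[] maketransAscii(char[] from, char[] to)
    in
    {
        assert(from.length == to.length);
    }
    body
    {
        char[] t = new char[256];
        int i;

        for (i = 0; i < 256; i++)
            t[i] = cast(char)i;    // identity: UTF-8 trail bytes pass through untouched

        for (i = 0; i < from.length; i++)
        {
            assert(from[i] <= 0x7F && to[i] <= 0x7F);    // refuse non-ASCII mappings
            t[from[i]] = to[i];
        }
        return t;
    }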
Sam
June 30, 2004
In article <cbts89$1poh$1@digitaldaemon.com>, Sam McCall says...

>Sorry, I meant "if Java had originally been defined to have char being 21 bits instead of 16, and storing a Unicode codepoint instead of a UTF-16 fragment". All Java's string manipulation stuff is char-based, and I was convinced there was a one-to-one correspondence between chars and characters (or possibly allowing some too-big char values). Clearly I was mistaken,

You weren't mistaken. You were spot on.

When Java was invented, Unicode stood at version 2.0. Possibly even earlier. At that time, Unicode was touted as a 16-bit standard, and its maximum codepoint was U+FFFF. At that time, there was no such thing as UTF-16. A Unicode char was 16 bits wide, and that was that. The only relevant 16-bit encodings were UCS-2LE (which meant, emit the 16-bit codepoint low order byte first), and UCS-2BE (which meant, emit the codepoint high order byte first).

Java simply took that on board and went with it.

But as time went by, the Unicode folk realized that sixty-five thousand characters weren't actually ENOUGH for all the world's scripts (including historical ones that nobody ever uses any more), so they managed to find a way to squeeze even more characters into that 16-bit model. They called it UTF-16, and it extends the range beyond U+FFFF, up to U+10FFFF.
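For anyone who hasn't met the mechanism: a codepoint above U+FFFF is split across two reserved 16-bit "surrogate" units. A sketch of the decoding arithmetic:

// combine a UTF-16 surrogate pair into a codepoint in U+10000..U+10FFFF
dchar decodeSurrogates(wchar hi, wchar lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);   // high (leading) surrogate
    assert(lo >= 0xDC00 && lo <= 0xDFFF);   // low (trailing) surrogate
    return cast(dchar)(0x10000 + (((hi - 0xD800) << 10) | (lo - 0xDC00)));
}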

There has been some discussion on the Unicode public forum as to whether even THIS limit will ever be extended. The Unicode Consortium currently are stating flat out that there will never, ever, be Unicode characters with codepoints above U+10FFFF. So, you can choose to believe them, or you can regard this statement with as much credibility as statements like "64K should be enough memory for anyone" which were touted in the ZX81 days.

Java got caught out by the changing of the times. D's chars should probably be wider than 21 bits, just in case... (Not that I'm choosing to disbelieve the Unicode Consortium, of course!)  32 bits seems safe enough for the foreseeable future.




>but if they had made chars 21 bits and kept the rest the same, it looks to me like it'd be just about perfect.

Yes. I'll bet the Java folk thought that at the time.



>Was that although unicode is the officially designated content of these types, char[] looks and feels (and the standard library uses it) like it's ASCII, and people won't bother to use unicode, because it's requires calling conversion functions and so on.

Well, of course UTF-8 was /designed/ to be compatible with ASCII, to ease transition. That's not such a bad thing. Bugs will happen, of course, just as they happen with any other encoding, but they can be found and fixed (and fixing them will be easier, the more library support there is). It's just one of those things which is going to get better with time.

Arcane Jill



June 30, 2004
Arcane Jill wrote:

> In article <cbts89$1poh$1@digitaldaemon.com>, Sam McCall says...
> 
> You weren't mistaken. You were spot on.
<snip>
Wow, thanks for that explanation, I really appreciate it :-)
> 
> 
>>but if they had made chars 21 bits and kept the rest the same, it looks to me like it'd be just about perfect.
> 
> 
> Yes. I'll bet the Java folk thought that at the time.
> 
Okay, we'll stick with 32 bits. If they reach that in my lifetime, someone is going to die...

Anyway, by the time I work out how to efficiently character-index UTF-8 in mutable strings, I'm sure I'll think Unicode is thoroughly overrated :-D
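The crux of it, sketched below: a UTF-8 character occupies one to four bytes, so finding character number n means scanning from the front rather than jumping straight to an index.

// bytes occupied by the UTF-8 sequence whose first byte is c
uint utf8Stride(char c)
{
    if (c < 0x80)           return 1;   // plain ASCII
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    assert(0);                          // trail byte or invalid lead byte
}

// byte offset of character number n: O(n), not O(1)
uint byteIndex(char[] s, uint n)
{
    uint i = 0;
    while (n-- > 0)
        i += utf8Stride(s[i]);
    return i;
}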
Sam


June 30, 2004
In article <cbt5vu$kdb$1@digitaldaemon.com>, Sam McCall says...
>
>We're talking about pointers for low-level iteration; this doesn't apply to associative arrays, whose data structure is opaque. I don't think we're moving towards iterators, just talking about pointers. The fact that iterators pretend to be pointers in their syntax is neither here nor there ;)

This is easy enough to do with free functions anyway.  Something like:

alias char[][char[]] StrMap;
StrMap map;
Iterator!(Pair!(char[],char[])) i = begin!(StrMap)( map );

I'm sure the syntax could be improved but you get the idea.  I've already experimented with such iterators for associative arrays and they work just fine.
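For the curious, a non-template sketch of the same idea, specialized to StrMap and using the built-in .keys property (the names are mine):

struct StrPair
{
    char[] key;
    char[] value;
}

struct StrMapIterator
{
    char[][char[]] map;
    char[][] keys;      // snapshot of the keys
    uint pos;

    bool more()  { return pos < keys.length; }
    void next()  { pos++; }

    StrPair get()
    {
        StrPair p;
        p.key = keys[pos];
        p.value = map[keys[pos]];
        return p;
    }
}

StrMapIterator beginStrMap(char[][char[]] map)
{
    StrMapIterator it;
    it.map = map;
    it.keys = map.keys;   // .keys: array of all keys in the AA
    return it;
}

Iteration is then just: for (StrMapIterator i = beginStrMap(map); i.more(); i.next()) { ... i.get() ... }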



Sean


June 30, 2004
On Wed, 30 Jun 2004 07:27:33 +0000 (UTC), Arcane Jill <Arcane_member@pathlink.com> wrote:

> In article <opsadsu8f75a2sq9@digitalmars.com>, Regan Heath says...
>
>> s = p.getValue("foo");
>> if (s) ..
>> s = p.getValue("bar");
>> if (s) ..
>>
>> Right...
>>
>> If I cannot return null, then (using the code above) I cannot tell the
>> difference between whether foo or bar was passed or had an empty value.
>
>
> And indeed that very situation is ALSO true with integer parameters. How can you
> tell the difference between an integer parameter being present and zero, and no
> integer parameter being present at all?

Yep. As another poster noted he had the same problem with integers, resulting in him using a value of -1 to represent null. Yuck.

> But of course, there are various solutions to this problem, many much simpler
> than you propose. For a start, you could return an int* instead of an int, or
> indeed a char[]* instead of a char[]. Then you could explicitly test for ===null
> in both cases.

This is the C solution. For int I cannot think of a good D solution. For char[] (or any array) we already have one: the array emulates/acts like a reference type; it's just inconsistent.
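To illustrate, with arrays as they are today the two cases are already distinguishable (relying on the empty-literal behaviour discussed elsewhere in this thread):

char[] absent;        // default-initialized: null, length 0
char[] empty = "";    // length 0, but not null

// both have .length == 0, yet only absent compares === null
assert(absent === null);
assert(empty !== null);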

> In C++, I'd just return a std::pair<bool, T>. I'm sure that once we have a good
> supply of standard templates in D we'll be able to do much the same thing. (Even
> without templates, you could define a struct and return it).

You're emulating a reference type, why not just have one? This may be the best solution for int and other strict value types.

> Anything wrong with either of these approaches?

Yep. Neither is as simple, elegant or clean as a reference type, which we already have in D arrays albeit inconsistently.

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
June 30, 2004
On Tue, 29 Jun 2004 22:35:28 -0700, Andy Friesen <andy@ikagames.com> wrote:
> Regan Heath wrote:
>> This is taken from a real life example: I have a config file with 10 different settings, all optional. I want 3 of them at this point in the code, so I process the file once and load the 3 settings, which may or may not be present, and may or may not have zero-length values.
>
> I guess it's just a matter of preference.  I don't have a problem with something like this:
>
>      char[][char[]] attribs = ...;
>
>      if ("a" in attribs && "b" in attribs && "c" in attribs) {

It's more like:

if ("a" in attribs) {
}
if ("b" in attribs) {
}
if ("c" in attribs) {
}

but you seem to have completely ignored the fact that, *if* we remove the ability to return null when an array type is expected (you suggested removing the ability to assign null to an array, which is the same thing), the above will cease to work altogether, as I imagine the above is simply doing:

if (attribs["a"] != null)

which is the same as

char[] s;

s = attribs["a"];
if (s != null)

which is impossible if you cannot use null with arrays.

> If nonexistence is an alias for some default, fill the array before parsing the file.  Attributes that are present will then override the defaults.
>
> Python offers a get() method which takes two arguments: a key, and a default value which is returned should the key not exist.  I use this a lot.

but if there is no default, you're left doing the "nadda" thing below, which is simply an ugly hack (explanation below)

>>> things could get very weird if you need to express a non-null array of 0 length.
>>
>> char[] s = ""
>>
>> s is a non-null array of 0 length.
>
> What about non-char types?
>
>>> If you *really* need to, you could probably get away with doing something like:
>>>
>>>      const char[] nadda = "nadda";
>>>      if (s is not nadda) { ... }
>>
>>
>> True, but this is yucky and what if a setting actually had a value of "nadda"?
>
> That's why you use 'is' and not ==.  'is' performs a pointer comparison.  The array has to point into that exact string literal for the comparison to be true.  The only catch is string pooling.  It'd be okay as long as the string literal "nadda" isn't declared anywhere else in the source code.

ahh, gotcha, so basically you're creating null with another name. Why not just have null? :)
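For anyone following along, a small sketch of the distinction being leaned on:

char[] sentinel = "nadda";
char[] copy = sentinel.dup;    // same contents, different memory

// '==' compares contents; 'is' compares the reference itself
assert(copy == sentinel);      // equal element-wise
assert(!(copy is sentinel));   // but not the same array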

> Come to think of it, this is better:
>
>     char[] nonString = new char[1]; // don't mutate me!  Just compare with 'is'!

Another face for the same entity, null.

> I'm officially out of ideas now.  heh.

Think of it from the other point of view: assume we make the minor adjustments to arrays that I suggested. What effect does it have on the people who cannot see themselves needing a null array? Hmm.. I think none. IMO it simply gives us more flexibility of expression at no cost.

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
June 30, 2004
Sam McCall <tunah.d@tunah.net> wrote in news:cbsupg$anb$1@digitaldaemon.com:

> Farmer wrote:
> 
>> Arcane Jill <Arcane_member@pathlink.com> wrote in news:cbr53s$op8$1@digitaldaemon.com:
>> 
>>>Maybe the real solution would be to make it a compile error to assign an array with null, or to compare it with null. This would then force people to say what they mean, and all such problems would go away.
>> 
>> 
>> I agree, that would help to avoid some confusion. Unfortunately, people would be forced to either say 'I mean empty' or to shut up completely and use something completely different.
> We don't have array literals, so we can't do this:
> foo( [] );
> At the moment we can do this:
> foo( null );
> If we outlawed using nulls as arrays, we'd be left with
> foo( new int[0] )
> which is maybe a bit messy?
> Sam

What's messy here? A bit more typing, that's it.

One disadvantage of
   foo( null );
is that there is no type information.


If you had
    	foo(int[])
    	foo(float[])
you would need a cast, because it gets ambiguous.
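A sketch of the ambiguity:

void foo(int[] a)   { }
void foo(float[] a) { }

void test()
{
    // foo(null);           // error: matches both overloads
    foo(cast(int[]) null);  // OK: the cast selects the int[] version
}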


Farmer.
June 30, 2004
Sean Kelly <sean@f4.ca> wrote in news:cbsqnf$547$1@digitaldaemon.com:

> In article <Xns9517F3F654C29itsFarmer@63.105.9.61>, Farmer says...
>>
>>The .length parameter would still work with null-arrays (as they
>>currently do).
>>But why would you want to initialize an array to null/empty and then
>>resize it, instead of 'newing' it with the correct size in first place?
> 
> Consider the following:
> 
> char[] str = new char[100];
> str.length = 0; // A
> str.length = 5; // B
> str = new char[10]; // C
> 
> In A, AFAIK it's legal for the compiler to retain the memory and merely change the length parameter for the string.  B then just changes the length parameter again, and no reallocation is performed.  C forces a reallocation even if the array already has the (hidden) capacity in place.  Lacking allocators, this is a feature I consider rather nice in D.
I agree with you that this feature is quite useful.
The problem with (A) is that DMD doesn't do that; the function
'arraysetlength' explicitly checks whether the new length is zero, and if so
destroys the data pointer. Furthermore it seems that it is not allowed to
call the .length property for null-arrays.
How do I know? Well, the function in the Phobos file internal\gc.d
    	byte[] _d_arraysetlength(uint newlength, uint sizeelem, Array *p)
contains this assertion
    	assert(!p.length || p.data);

Ironically, this assertion permits the data pointer to be non-null even while the length is 0.



> 
>>Extra coding is not required if you don't need null-arrays: if some user passes a null-array, the user gets a nice access violation/array bounds exception and will quickly learn to not pass null-arrays to such functions. A quick check in the DbC section of your function would do the job, too. (But I suppose, the user might not adapt that fast that way :-)
> 
> I originally thought D worked the way you describe and added DBC clauses to all my functions to check for null array parameters.  After some testing I realized I'd been mistaken and happily removed most of these clauses.  The result IMO was tighter, cleaner code that was easier to understand.  I suppose it's really a matter of opinion.  I like that arrays work the same as the other primitive types.

I always love it when this happens. Code that isn't written is bug-free, maintainable, and super-fast ;-)



> 
>>If your function should deal with both null-arrays and empty-arrays, no extra code is required, since the .length property can be accessed for both null-arrays and empty-arrays.
> 
> Could it?  I suppose so, but the concept seems a tad odd.  I kind of expect none of the parameters (besides sizeof, perhaps) to work for dynamic types that have not been initialized.  Though perhaps that's the C way of thinking.

Yes, I think it is a bit odd, too. Reading the length property makes sense, but resizing it is more questionable. But I am definitely thinking the C way here.


Farmer.
June 30, 2004
Andy Friesen <andy@ikagames.com> wrote in news:cbpsi6$1u7d$1@digitaldaemon.com:

> Regan Heath wrote:
> 
>> ... I could return existence and
>> fill a passed char[]...  so my code now looks like...
>> 
>> char[] s;
>> if (getValue("foo",s))
> 
> I like this.  It's simple and obvious.

An expression like
    	if (getValue("foo",s) == true)
doesn't tell much to the maintainer. An enumeration is needed to fully
express the intent.
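Something along these lines, say (GetResult and the settings store are made-up names):

enum GetResult { Missing, Found }

char[][char[]] settings;   // hypothetical parsed-settings store

GetResult getValue(char[] name, out char[] value)
{
    if (name in settings)
    {
        value = settings[name];
        return GetResult.Found;
    }
    return GetResult.Missing;
}

Then the call site documents itself:

if (getValue("foo", s) == GetResult.Found) { ... }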


[snip]

> 
> Exposing POST data as an associative array seems like a win to me; it's faster and can be iterated over conveniently.  Also, as a language intrinsic, it's a bit more likely to plug into other APIs easily.
> 
> If you *really* need to, you could probably get away with doing something like:
> 
>      const char[] nadda = "nadda";
>      if (s is not nadda) { ... }
> 
>   -- andy

I see one issue with associative arrays here.
It would break up the encapsulation of the class. The internal data would be
revealed. If your internal data structure is different, you must convert the
internal data to the associative array. At the least, a call of .dup would be
needed as a safety practice.


Farmer.