November 15, 2005
"Bruno Medeiros" <daiphoenixNO@SPAMlycos.com> wrote
> Dear gods, no! Think of the code readability, man.
> One should keep all the elements necessary to understand a piece of code
> to a minimum, (and as close and local as possible). Keep things simple.

Oh, I most fully agree, so I suspect I may have been misunderstood.


November 15, 2005
Damn... :(
I'm no Unicode expert, but from what I've just read on Wikipedia about Unicode, quite a few of the above posts in this discussion have been working with an incorrect notion of "code point". A code point is an integer number assigned to each Unicode character/symbol. What actually varies between encodings is the code unit:

http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
"All normal unicode encodings use some form of fixed size code unit. Depending on the format and the code point to be encoded one or more of these code units will represent a Unicode code point."

http://en.wikipedia.org/wiki/Unicode
"Unicode takes the role of providing a unique code point — a number, not a glyph — for each character."

Is this not correct?

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
November 15, 2005
Bruno Medeiros wrote:
> Damn... :(
> I'm no Unicode expert, but from what I've just read on Wikipedia about Unicode, quite a few of the above posts in this discussion have been working with an incorrect notion of "code point". A code point is an integer number assigned to each Unicode character/symbol. What actually varies between encodings is the code unit:
> 
> http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
> "All normal unicode encodings use some form of fixed size code unit. Depending on the format and the code point to be encoded one or more of these code units will represent a Unicode code point."
> 
> http://en.wikipedia.org/wiki/Unicode
> "Unicode takes the role of providing a unique code point — a number, not a glyph — for each character."
> 
> Is this not correct?
> 

Also, it seems the term "code value" is used with the same meaning as "code unit":

http://en.wikipedia.org/wiki/Unicode
"In UTF-32 and UCS-4, one 32-bit code value serves as a fairly direct representation of any character's code point (although the endianness, which varies across different platforms, affects how the code value actually manifests as a bit sequence). In the other cases, each code point may be represented by a variable number of code values."

(Frankly, I don't like that term very much; I prefer "code unit".)

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
November 15, 2005

Regan Heath wrote:
> On Tue, 15 Nov 2005 02:58:23 +0200, Georg Wrede wrote:

> After modification I get:
> i: 255  ii: 255  m:   ff m:255
> 
>> Seriously, if D advertises "char" as a UTF-only entity instead of "C  char", then _it_should_be_illegal_ to Promote it to Integer!
>>
>> That's because a "char" octet can be part of a multibyte UTF character.
> 
> Correct, a single char (octet) represents a single UTF-8 codepoint, part  of a complete grapheme (character).
> 
>> In that case any Promotion would rip it off of context.
> 
> It's true, it takes it out of context. So you're proposing that an explicit cast be required to obtain the value of a codepoint (char) as an int, or vice versa? Or are you proposing we prevent it altogether?

Naa, D is a Pragmatic language. :-) But I definitely think Promotions are out of the question!

Having to do a cast reminds the programmer that he ought to watch what he's doing.
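To illustrate what I mean (a minimal sketch, assuming DMD's current promotion rules):

  void main()
  {
      char u = "ä"c[0];     // the first octet of a two-octet UTF-8 sequence
      int  i = u;           // the implicit promotion I object to:
                            // i == 195 (0xC3), half a character,
                            // ripped out of context
      int  j = cast(int) u; // the explicit cast I'd rather require
  }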

>> That's like Promoting the second octet of a "real". It would make just  as much sense.
> 
> True, provided that a single UTF-8 codepoint cannot be useful out of  context. I have this vague recollection that a grapheme may be made up of  a base codepoint followed by modifying codepoints, I mentioned this before  with your Finnish example. eg. The letter "Ä" may be made up of 'A'  followed by the "add two dots" codepoint.

Of course anything at all processable by computers _has_ to be accessible to a Systems Language, so yes. But:

There are two kinds of programmers. One kind is like us, we do compilers, language libraries, operating systems, and language development. Thus, we're a minority.

(( Below this point I got carried away.
You might want to skip it and search
down for LABEL: CONTINUE HERE ))

Regular programmers (and here I'm not suggesting they're dumber, less educated, or lazy) are the group who will make up the vast majority of D users (if we ever get D out in the world big time).

Regular programmers should be considered a lot more when we make decisions about things. Their ease of use, their productivity, and their opinion of the language are very important, and will become more so as D gets used more universally.

There are actually two other groups of programmers, both of them important too: professional programmers switching to D, and those who learn programming using D. Contrary to Walter, I believe D is an excellent First Language, and it's strong enough to last all the way to Post Graduate level.

Actually, _all_ of these groups stand to lose if we go on with things like

 - char not being char (and us too lazy to give it another name)
 - auto keyword doing disparate stuff
 - (I promise not to mention bit here.)
 - loose functions that sometimes look like instance methods
 - some other stuff

You get the picture.

One of the hardest things for developers (of languages, user interfaces, hardware control panels, power plant monitor graphics, mobile phones, etc.) is to _become_ aware of the fact that their product may not _appear_ the same to other people!

Case in point: in the late '80s, I took a UI class. The teacher was a Mac fanatic, so we used Macs with HyperCard. We each had to do a UI-related assignment to completion.

I designed a (to my own belief) superior User Interface Paradigm (which was not asked for), and proudly presented my assignment before the class. It turned out that the teacher had not understood a bit about how to use my program, and he had to call me the next week so I could show him how to use it.

And I had thought even an old lady from the woods could use my program right off the bat.

---

Now a UTF code point gets the name char. The word auto is used here and there.

 ** The absolutely worst thing you can do to others is to expose them to something where things that look alike aren't, and things that look different aren't. **

Go to the amusement park. You might find a labyrinth, maybe with glass walls and mirrors. It is frustrating to navigate there. What looks like an opening, is a glass wall, another opening is a mirror. And the walls are set up so that the real opening looks like a mirror or glass, unless you look very carefully.

At one time I did some visual basic programming. VB was just full of this. "If you want to save the state of this entity, absolutely do not use the save method, because then you will lose the state, unless [long description of exceptional and unlikely situation deleted]. Rather, register that entity with the Logger, and pass it to the Logger's permanize method."

In C, even today, I have no way of understanding a nontrivial pointer-to-array-of-arrays-of-pointers-to-functions taking-an-array-of-arrays-of-pointers-to-strings type declaration.

Thank Bob (or Walter!) I understand _any_ such thing in D.
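For instance (my own sketch; the C line is there only for comparison):

  // An array of three pointers to functions, each returning a pointer
  // to a static array of five chars. In D the type builds up as one
  // linear chain of suffixes:
  char[5]* function() [3] table;

  // The rough C equivalent, read inside-out:
  //   char (*(*table[3])(void))[5];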

---

Oh, so C# has a char that's not char either. Well, do we really have to repeat what M$ does?????

>> Another example of where even we ourselves have no clue where we are. And all just because of the fact that '"char" represents "UTF-8"'.
> 
> I don't have a problem with 'char' representing a single UTF-8 codepoint, provided it's explained correctly in the docs. I think it's fair to expect people to learn about Unicode issues and learn that 'char' in D is not 'char' from C.

So, as long as it is explained correctly in the docs, we could have star as the operator for subtraction and dash as the operator for multiplication. Sure, we'd have no problem with it once we got used to it.

But who's gonna pay for all the time that other people waste because of it?

A little arithmetic: if, during the first 6 months of using D before they get used to it, people waste 10 hours looking for bugs caused by this -- and if we expect 20 million people to eventually use D -- and if we assume $40 as the cost per hour -- then: swapping *&- will cost 8 billion dollars.
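(Check: 20,000,000 programmers × 10 hours × $40/hour = $8,000,000,000.)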

What do we care? It's from their employers' pockets, not ours. And actually not even that, because the D newcomers would perceive the bugs as their own fault, so they'd make up for it on their own time!

Besides, we could earn extra sorting out their problems!

> I also believe that we need to provide the tools to manage Unicode issues in the standard library. Certainly we haven't got those tools yet. E.g., to begin with, a function that could tell you the grapheme length of a char[]. There was talk about porting a popular Unicode library to D; this would likely solve many of the problems.

Yes, porting an existing library would probably be a lot smarter than doing it from scratch. We could always publish our own at 2.0 or something, if we really wanted. (Which I doubt.)
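Meanwhile, just to show the flavour of what such a library should hide from the regular programmer, a rough sketch using std.utf.stride (note it counts code points, not graphemes -- "A" plus combining dots is two code points but one grapheme, which is exactly what the ported library's tables would be needed for):

  import std.utf;

  // Counts code points (not graphemes!) in a UTF-8 string.
  size_t codePointLength(char[] s)
  {
      size_t n;
      for (size_t i = 0; i < s.length; i += stride(s, i))
          n++;  // stride() yields the octet count of one UTF-8 sequence
      return n;
  }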

LABEL: CONTINUE HERE

A good language (or, actually, implementation) should cover the entire UTF issue with library calls. The regular programmer should never _have_ to access octets within a UTF string.

I mean, every C user knows how to search a string, reverse another, pad a third, etc. But even there one has library routines for all that.
November 15, 2005
Kris wrote:
> "Bruno Medeiros" <daiphoenixNO@SPAMlycos.com> wrote
> 
>>Dear gods, no! Think of the code readability, man.
>>One should keep all the elements necessary to understand a piece of code to a minimum, (and as close and local as possible). Keep things simple.
> 
> 
>> Oh, I most fully agree, so I suspect I may have been misunderstood.
> 
> 
Didn't you just say you wanted a pragma directive that could change/specify the type of undecorated string literals *in an individual class* (i.e., not a global, program-wide directive, but a class-specific one)?

-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
November 15, 2005
Walter Bright wrote:
> "Kris" <fu@bar.com> wrote in message news:dl0ngf$2f7s$1@digitaldaemon.com...
> 
>>Still, it troubles me that one has to decorate string literals in this
>>manner, when the type could be inferred by the content, or by a cast()
>>operator where explicit conversion is required. Makes it a hassle to
> 
> create
> 
>>and use APIs that deal with all three array types.
> 
> 
> One doesn't have to decorate them in that manner, one could use a cast
> instead.
> 
> 
Speaking of string casting, correct me if this is wrong: the only string cast that performs a (Unicode encoding) conversion is the cast of string literals, right? The other string casts are opaque/conversion-less casts.

Also, should the following be allowed?
  cast(wchar[])("123456"c)
If so, what should the meaning be? (I already know what the DMD.139 meaning *is*; I'm just wondering if it's conceptually correct.)




-- 
Bruno Medeiros - CS/E student
"Certain aspects of D are a pathway to many abilities some consider to be... unnatural."
November 15, 2005
"Walter Bright" <newshound@digitalmars.com> wrote...
>
> "kris" <fu@bar.org> wrote in message news:dlc9iv$13n4$1@digitaldaemon.com...
>> Doesn't work for opX aliases, such as opShl() for C++ compatibility :-(
>
> That's true, though I'd argue that emulating C++ iostreams' use of << and
>  >>
> needs to die <g>.

:-)

That doesn't mean the use of operators should be cursed with this behaviour though <g>. In other words, the usage of operators in D is currently limited with respect to char/wchar/dchar and their corresponding array types (when it comes to literals, sans suffix). One could argue that D provides the tools to implement a rather compelling and succinct IO model, using operators ~ a model far superior to the iostreams model <g>.

That aside; let's, for a moment, try to look at this from a different angle?

The problem is that method-overloading has a hard time resolving between various signatures when provided with literals. What if, just in speculation, the method signatures had an opportunity to differentiate between literal and non-literal parameters? I just noticed the specialized "class TypeInfo_StaticArray : TypeInfo" in Object.d, which prompted the thought ... Yes, I can hear you wringing your hands ... yet it could be a solution <g>
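For reference, the trouble being worked around looks something like this:

  void write (char[] s)  {}
  void write (wchar[] s) {}
  void write (dchar[] s) {}

  void main()
  {
      write ("hello"c);  // fine: the suffix selects write(char[])
      write ("hello");   // error: the literal matches all three equally
  }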


November 15, 2005
On Tue, 15 Nov 2005 22:50:19 +0000, Bruno Medeiros wrote:

> Walter Bright wrote:
>> "Kris" <fu@bar.com> wrote in message news:dl0ngf$2f7s$1@digitaldaemon.com...
>> 
>>>Still, it troubles me that one has to decorate string literals in this manner, when the type could be inferred by the content, or by a cast() operator where explicit conversion is required. Makes it a hassle to
>> 
>> create
>> 
>>>and use APIs that deal with all three array types.
>> 
>> One doesn't have to decorate them in that manner, one could use a cast instead.
>> 
> Speaking of string casting, correct me if this is wrong: the only string cast that performs a (Unicode encoding) conversion is the cast of string literals, right? The other string casts are opaque/conversion-less casts.
> 
> Also, should the following be allowed? :
>    cast(wchar[])("123456"c)
> If so what should the meaning be? (I already know what the DMD.139
> meaning *is*, just wondering if it's conceptually correct)

I believe it's saying that "123456" is a UTF-8 encoded string, and that you are tricking the compiler into thinking it's a UTF-16 encoded string. Not a wise thing to do in most situations.

The 'cast(<X>char[])' idiom never converts one UTF encoding to another, neither for literals nor for variables. The apparent exception is that an undecorated string literal with a 'cast' is syntactically equivalent to a decorated string literal.
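A small example of the difference (my sketch; the lengths assume a four-octet ASCII string):

  void main()
  {
      wchar[] a = cast(wchar[]) "abcd"; // undecorated literal: the cast
                                        // merely types the literal, so 'a'
                                        // is genuine UTF-16 (a.length == 4)
      char[]  s = "abcd";
      wchar[] b = cast(wchar[]) s;      // variable: the four UTF-8 octets
                                        // are reinterpreted as two 16-bit
                                        // units (b.length == 2), and the
                                        // contents depend on endianness
  }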

-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
16/11/2005 10:19:56 AM
November 15, 2005
"Bruno Medeiros" <daiphoenixNO@SPAMlycos.com> wrote...
> Kris wrote:
>> "Bruno Medeiros" <daiphoenixNO@SPAMlycos.com> wrote
>>
>>>Dear gods, no! Think of the code readability, man.
>>>One should keep all the elements necessary to understand a piece of code
>>>to a minimum, (and as close and local as possible). Keep things simple.
>>
>>
>> Oh, I most fully agree, so I suspect I may have been misunderstood.
>>
>>
> Didn't you just say you wanted a pragma directive that could change/specify the type of undecorated string literals *in an individual class* (i.e., not a global, program-wide directive, but a class-specific one)?

No, and yes. The post didn't say "wanted a pragma directive" per se (it said "pragma/something", in reference to the preceding posts). And it was a suggestion for the purposes of discussion, not a "want", as you indicate. Don't mean to split hairs, Bruno, but implications are sometimes tough to manage <g>

I did suggest that a mechanism might exist at the class/struct/interface level to describe how the {overloaded methods therein expect to handle literals}. Said mechanism might be scoped, it might not require additional syntax at all, it might be inherited, or might not. But any mechanism whose effect is constrained to a group of highly related methods (such as class/struct/interface) is, in my opinion, "local and as close as possible", just as you described it. It also, IMO, keeps those "elements necessary to understand a piece of code to a minimum", since it's all tied up in a nice little code bundle. Being optional, that also helps to keep things simple.
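Purely as a strawman, something in this direction (the pragma name and syntax here are invented for illustration; nothing like it exists in DMD):

  class Print
  {
      // hypothetical: undecorated string literals passed to the
      // overloaded methods of this class would default to wchar[]
      pragma (literal, wchar);

      void write (char[] s)  {}
      void write (wchar[] s) {}
      void write (dchar[] s) {}
  }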

I'm afraid code maintenance (and hence readability) is a /big/ soapbox of mine ~ but I'm beginning to suspect that it matters less and less these days. T'was discussed in another thread ages ago, which you might be interested in.

Perhaps I misunderstood you :-)


November 15, 2005
Kris wrote:
> 
> That doesn't mean the use of operators should be cursed with this behaviour though <g>. In other words, the usage of operators in D is currently limited with respect to char/wchar/dchar and their corresponding array types (when it comes to literals, sans suffix). One could argue that D provides the tools to implement a rather compelling and succinct IO model, using operators ~ a model far superior to the iostreams model <g>.
> 
> That aside; let's, for a moment, try to look at this from a different angle?
> 
> The problem is that method-overloading has a hard-time resolving between various signatures, when provided with literals. What if, just in speculation, the method signatures had an opportunity to differentiate between literal and non-literal parameters? I just noticed the specialized "class TypeInfo_StaticArray : TypeInfo" in Object.d, which prompted the thought ... Yes, I can hear you wringing your hands ... yet it could be a solution <g> 

Still over-engineering the problem a bit, I think.  If decoration really isn't sufficient, then I think the most practical approach would be to simply choose one encoding as the default to be used when overload ambiguities arise.  Or perhaps a more general rule?  "When an ambiguity arises as a result of overload resolution for UTF-based types, the narrowest suitable encoding will be chosen as the default."  Thus:

write( char[] );
write( wchar[] );
write( dchar[] );

would result in the char[] method being chosen, but:

write( wchar[] );
write( dchar[] );

would result in the wchar[] method being chosen.


Sean