July 27, 2004
Arcane Jill wrote:

> In article <ce3o9u$1enc$1@digitaldaemon.com>, Berin Loritsch says...
> 
> 
>>Another place to look, if you want to see how they are planning on improving things is here:
>>
>>http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html
> 
> 
> Hey, cool. They can parse non-Latin digits.
> 
> # NonASCIIDigit ::= A non-ASCII character c for which Character.isDigit(c) returns true
> 
> So, Arab digits, Bengali digits, no problem.

I am not surprised. I expect one of the areas of programming language development in the near future to be the inclusion of non-ASCII names for things like classes and variables. Just imagine trying to program with class names in Hiragana. I don't think support would be difficult. However, I am still trying to fathom the depths of that Unicode beast.

> 
> Not sure how they'd cope with Osmanya digits though - these have codepoints
> U+0104A0 to U+0104A9 inclusive - too big to fit into a Java char.
> 
(been working on Unicode stuff so I know this...)
I believe they would cope using an escape sequence (surrogate pairs). Which I suppose means the String.length() function lies sometimes.

From java.nio.charset.Charset
================================
The native coded character set of the Java programming language is that of the first seventeen planes of the Unicode version 3.0 character set; that is, it consists in the basic multilingual plane (BMP) of Unicode version 1 plus the next sixteen planes of Unicode version 3. This is because the language's internal representation of characters uses the UTF-16 encoding, which encodes the BMP directly and uses surrogate pairs, a simple escape mechanism, to encode the other planes
================================
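To make the String.length() point concrete, here is a minimal Java sketch. It assumes J2SE 5.0 or later, where the code point methods on String exist; the class and method names are made up for illustration.

```java
// Hypothetical demo class; only the java.lang calls are real API.
final class SurrogateLength {
    // OSMANYA DIGIT ZERO (U+104A0), written as a UTF-16 surrogate pair.
    static final String OSMANYA_ZERO = "\uD801\uDCA0";

    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println("code units:  " + codeUnits(OSMANYA_ZERO));   // 2
        System.out.println("code points: " + codePoints(OSMANYA_ZERO));  // 1
    }
}
```

So length() is accurate about storage, but it overstates the character count for anything outside the BMP.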

July 27, 2004
In article <ce3qsf$1fqt$1@digitaldaemon.com>, Sean Kelly says...

>I've been wondering about this.  readf (was scanf) still uses some lame shortcuts like "x - '0'" but that wouldn't be too terribly hard to fix.  I don't suppose the unicode isdigit function currently supports these numbering schemes?

The function getDecimalDigit(dchar) in etc.unicode returns the numeric value in the range 0 to 9 of all Unicode decimal digits. It returns -1 for all non-digits. You can find source code for this function in Deimos on dsource. For the moment there is no prebuilt library, but the source code works fine.

When I get back to writing code, my very next task will be to tidy up etc.unicode, release the codebuilder code, and so on. Right now I'm taking a few weeks off coding because I'm still a bit blown away by my gran's death, so for now you'll just have to put up with me ranting on this forum without actually /doing/ anything. I imagine I'll get back onto the task in hand in a couple of weeks or so.

There is also a similar function, getDigit(dchar), which is similarly defined,
except that it also considers things like SUPERSCRIPT TWO and CIRCLED THREE to
be "digits". I imagine, therefore, that for readf(), getDecimalDigit() would be
more appropriate than getDigit().
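As a rough illustration of replacing the "x - '0'" shortcut with a property lookup, here is a Java sketch. It is not etc.unicode: it uses Java's Character.digit(int, int) as a stand-in for getDecimalDigit(), and the class and method names are hypothetical. It assumes J2SE 5.0 or later for the code point methods.

```java
// Hypothetical stand-in for a readf-style digit scanner.
final class DecimalDigits {
    // Returns 0..9 for a Unicode decimal digit, -1 for anything else.
    static int decimalDigit(int codePoint) {
        return Character.digit(codePoint, 10);
    }

    // Parses a run of decimal digits from any script.
    // Returns -1 for empty input or on the first non-digit.
    // No overflow checking; this is only a sketch.
    static long parseDecimal(String text) {
        if (text.length() == 0) {
            return -1;
        }
        long value = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int d = decimalDigit(cp);
            if (d < 0) {
                return -1;
            }
            value = value * 10 + d;
            i += Character.charCount(cp); // step over both halves of a surrogate pair
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseDecimal("2004"));
        System.out.println(parseDecimal("\u0662\u0660\u0660\u0664")); // Arabic-Indic digits
    }
}
```

SUPERSCRIPT TWO (U+00B2) is rejected here, which matches the getDecimalDigit() behaviour described above rather than getDigit(). The sketch happily accepts digits from mixed scripts in one number; whether readf() should is a separate question.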



>Also, is it reasonable to assume that every numbering scheme is base 10?  I'd certainly think so, but I suppose it's worth asking.

As far as Unicode is concerned, yes.
As far as reality is concerned, no.

In the Tamil script, for example, they use base twelve. Unicode simply cannot comprehend this, and (erroneously) declares Tamil digits 0 to 9 to be "decimal". However - for our purposes, /this doesn't matter/. Our job is to implement the Unicode standard, even if it's wrong. Fixing the Unicode code charts is a job for the Unicode Consortium, and that may happen in some future release. For now - as Walter said - we put metaphorical blinkers on and go with what the standard says.


For hexadecimal, there's the function getHexValue(), which returns a value in the range 0 to 15 for hex digits, -1 otherwise. (It's possible I may not have implemented that yet, or that I implemented it inefficiently. When I get back to D-coding, I'll fix this).
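The contract described for getHexValue() can be sketched in Java as well. Again this is only a stand-in built on Character.digit(int, int); the class and method names are hypothetical, and whether the real function accepts non-ASCII decimal digits the way Java does is not something this post states.

```java
// Hypothetical stand-in for a hex-digit lookup.
final class HexDigits {
    // Returns 0..15 for a hex digit, -1 for anything else.
    static int hexValue(int codePoint) {
        return Character.digit(codePoint, 16);
    }

    public static void main(String[] args) {
        System.out.println(hexValue('9'));
        System.out.println(hexValue('f'));
        System.out.println(hexValue('g'));
    }
}
```

Note that a hex lookup returns 10 to 15 for the letters A to F, so a caller parsing decimal input still has to reject any result above 9.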

Jill


July 27, 2004
In article <ce4ec5$1moq$2@digitaldaemon.com>, parabolis says...

>> # NonASCIIDigit ::= A non-ASCII character c for which Character.isDigit(c) returns true
>> 
>> Not sure how they'd cope with Osmanya digits though - these have codepoints U+0104A0 to U+0104A9 inclusive - too big to fit into a Java char.
>> 
>(been working on Unicode stuff so I know this...)
>I believe they would cope using an escape sequence (surrogate pairs).
>Which I suppose means the String.length() function lies sometimes.

Yes, Java uses UTF-16 (which is what you meant by "escape sequence" or "surrogate pairs"). However, that doesn't change the definition above: "A non-ASCII character c for which Character.isDigit(c) returns true". The function Character.isDigit(c) takes a Java char as its parameter, not a UTF-16 sequence.
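A small Java sketch of the distinction. It assumes the J2SE 5.0 API, which adds an int (code point) overload of Character.isDigit alongside the char one; the class and method names are made up for illustration.

```java
// Hypothetical demo class; only the java.lang calls are real API.
final class IsDigitOverloads {
    // OSMANYA DIGIT ZERO (U+104A0) as a UTF-16 surrogate pair.
    static final String OSMANYA_ZERO = "\uD801\uDCA0";

    // Applies the char overload to each UTF-16 code unit in turn.
    static boolean anyCodeUnitIsDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Applies the int overload to the whole code point at index 0.
    static boolean firstCodePointIsDigit(String s) {
        return Character.isDigit(s.codePointAt(0));
    }

    public static void main(String[] args) {
        System.out.println(anyCodeUnitIsDigit(OSMANYA_ZERO));
        System.out.println(firstCodePointIsDigit(OSMANYA_ZERO));
    }
}
```

Neither half of the surrogate pair is a digit on its own, so a scanner that tests one char at a time cannot see the Osmanya digit; one that tests code points can.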

It doesn't matter for me, though, as I don't use Java, and I intend for D to do better.

Jill


July 27, 2004
parabolis wrote:
> Berin Loritsch wrote:
> 
>>
>> Another place to look, if you want to see how they are planning on improving things is here:
>>
>> http://java.sun.com/j2se/1.5.0/docs/api/java/util/Scanner.html
>> http://java.sun.com/j2se/1.5.0/docs/api/
> 
> 
> Sadly the one thing I really want them to implement will never happen - unsigned primitive types.

I think the main reason for that is the focus of Java.  Java is designed
for applications, while C/C++/D is designed to include systems
development as well.  I have not come across many instances where an
unsigned primitive would be useful in the application space.  And those
few times where it does make a difference the signed primitives can be
used because there are no comparisons to be done.  In short, for the
things Java is good at, I haven't run into the need for unsigned
primitives myself.

July 27, 2004
Arcane Jill wrote:

> 
> It doesn't matter for me, though, as I don't use Java, and I intend for D to do
> better.
> 

In my opinion D is off to a really bad start with Unicode.



July 27, 2004
Berin Loritsch wrote:

> parabolis wrote:
> 
>>
>> Sadly the one thing I really want them to implement will never happen - unsigned primitive types.
> 
> 
> I think the main reason for that is the focus of Java.  Java is designed
> for applications, while C/C++/D is designed to include systems
> development as well.  I have not come across many instances where an
> unsigned primitive would be useful in the application space.  And those
> few times where it does make a difference the signed primitives can be
> used because there are no comparisons to be done.  In short, for the
> things Java is good at, I haven't run into the need for unsigned
> primitives myself.
================================================================
I thought I remembered reading that Java was originally designed
for appliance microprocessors, but I could be wrong.

As for the unsigned primitive... Consider java.lang.String's:

    copyValueOf(char[] data, int offset, int count)

Functions of this type are everywhere in the library code. Does
a negative offset or count ever make sense? Almost never... So
the first few lines of the code check to make sure the values
are in fact non-negative.

The same thing is true with reading and writing to arrays and
other sequential data structures in general. Every read/write
index is checked to make sure it is actually non-negative.
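The kind of guard being described might look like the following Java sketch. The helper inRange is hypothetical and is not the actual String source; rejects() simply observes that String.copyValueOf throws for out-of-range arguments.

```java
// Hypothetical illustration of the sign and bounds checks a signed API needs.
final class RangeCheck {
    // True when [offset, offset + count) lies inside an array of the given length.
    // The first two tests exist only because the parameters are signed.
    static boolean inRange(int length, int offset, int count) {
        return offset >= 0 && count >= 0 && offset <= length - count;
    }

    // True when String.copyValueOf refuses the given offset and count.
    static boolean rejects(char[] data, int offset, int count) {
        try {
            String.copyValueOf(data, offset, count);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        char[] data = {'a', 'b', 'c'};
        System.out.println(rejects(data, -1, 2)); // negative offset
        System.out.println(rejects(data, 1, 2));  // valid range
    }
}
```

With unsigned parameters the two sign tests would disappear, though the upper-bound test would still be needed.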

The reason it bothers me is that I almost never write any code
using signed primitives in any other language. Being forced
to declare function parameters as signed and then check that
the values are not negative is a double whammy...

You probably wouldn't think about it but does it really make sense
to use a signed value in most of the for loops you write? It may
seem like an odd question but I tend to use unsigned by default
and signed when I must. So when I see a for loop with a signed
condition variable I wonder why someone would choose to do that.

July 27, 2004
parabolis wrote:

> Berin Loritsch wrote:
> 
>> parabolis wrote:
>>
>>>
>>> Sadly the one thing I really want them to implement will never happen - unsigned primitive types.
>>
>>
>>
>> I think the main reason for that is the focus of Java.  Java is designed
>> for applications, while C/C++/D is designed to include systems
>> development as well.  I have not come across many instances where an
>> unsigned primitive would be useful in the application space.  And those
>> few times where it does make a difference the signed primitives can be
>> used because there are no comparisons to be done.  In short, for the
>> things Java is good at, I haven't run into the need for unsigned
>> primitives myself.
> 
> ================================================================
> I thought I remembered reading that Java was originally designed
> for appliance microprocessors, but I could be wrong.

OK, you are going pre-Sun involvement here...

> 
> As for the unsigned primitive... Consider java.lang.String's:
> 

<snip/>

Let me just say that it doesn't have a serious impact on day to day
programming activities--even if it is not "ideologically pure".  Most
values used in day to day development fall well within the positive
range of the signed types.  Most folks don't even worry about
whether it would be more efficient to use a byte or an int.  We just
use ints because the performance gains of using the smaller primitive
are nowhere near the gains of improving the algorithm.

But that's just my experience (public projects I have worked on include
Apache Avalon, Apache Cocoon, Apache JMeter, Apache Axis, and the
D-Haven projects).  I know this is a D forum, but I am including these
to add weight to the argument that signed vs. unsigned arguments really
don't impact most average programs all that much.

Does it affect some people?  Sure.  But the most common solution is
either to ignore the sign or jump up to the next larger data size.
It's no biggy.
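
The "jump up to the next larger data size" approach can be sketched in Java like this. The class and method names are made up; the masking idiom itself is ordinary Java.

```java
// Hypothetical helpers showing how unsigned values are usually handled in Java.
final class UnsignedWidening {
    // Reads a byte as an unsigned value 0..255 by widening to int and masking.
    static int unsignedByte(byte b) {
        return b & 0xFF;
    }

    // Reads an int as an unsigned value 0..4294967295 by widening to long and masking.
    static long unsignedInt(int i) {
        return i & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        System.out.println(unsignedByte((byte) 0xF0)); // 240
        System.out.println(unsignedInt(-1));           // 4294967295
    }
}
```

"Ignoring the sign" works without any masking for addition, subtraction, and equality tests, since the bit patterns come out the same; it is comparisons, division, and widening where the mask is needed.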

July 27, 2004
In article <ce68g6$2h7r$1@digitaldaemon.com>, parabolis says...
>
>Arcane Jill wrote:
>
>> 
>> It doesn't matter for me, though, as I don't use Java, and I intend for D to do better.
>> 
>
>In my opinion D is off to a really bad start with Unicode.

The "start" hasn't even happened yet. What we have now isn't anything like what we're /going/ to have. There are /loads/ of (other) things that D doesn't have yet (like decent streams support), but most of these things are *in progress*. I'd say you made your call too early.

Look at it like this. D has only been around for three or four years, and it was basically a one-person project. We're not even at version 1.0 yet, so the best is most definitely yet to come. Now Walter planned for good Unicode support from the start, and, with that in mind, he laid down the foundations, for example by insisting that D strings be Unicode. Those foundations are now being built upon. For example, the library etc.unicode (temporarily on hold for a few weeks due to a family death) currently gives you access to (almost) every Unicode character property. C doesn't give you this. C++ doesn't give you this. Even Java only gives you this for codepoints up to U+FFFF. D covers the lot - and that's /right now/. What's more, this library is robot-built from the actual Unicode database files, and so can be rebuilt with every new version of Unicode as it comes out, /and/ can be rebuilt for old versions of Unicode should that need arise. We're way ahead of Java there, which leaves you stuck with whatever version happens to come with your JVM.

And as for the future - well, for stage 2 we've got the normalization, canonical and compatibility equivalence stuff all planned, grapheme boundary detection, full localized casing ... which I think will take us way ahead of Java. And meanwhile, there are guys working on strings and streams who are getting transcoding issues sussed.

For stage three - and by this stage we'll be way ahead of the field - we'll have fuzzy matching, collation, and so on, all of which are locale-aware, plus full support for PUA properties. And meanwhile, there will be other guys working on other internationalization translation issues like number formatting and whatnot.

I think you have made your judgement too early. Phobos is tiny right now, compared with Java's vast array of classes. Deimos is even tinier, and somewhat more piecemeal. But already D's Unicode support is:

* Better than C
* Better than C++
* Catching up with Java (and better in some areas)

To expect the full whack right at the start is unrealistic (and we /are/ still right at the start). Walter was way too busy getting the core of the language together to start worrying about how you do uppercasing in Deseret*, but the language has now reached the point where we can do that.

So tell me. Against what are you comparing D? Java? Tell me in what ways you think D is behind? Tell me what does better than D, and in what way? I suspect you may be hard pressed to come up with examples.

Arcane Jill


* something which Java can't do, but D can, right now.


July 27, 2004
In article <ce6cl2$2j39$1@digitaldaemon.com>, parabolis says...

>So when I see a for loop with a signed
>condition variable I wonder why someone would choose to do that.

Well, here's one possible reason:

# for (int i=9; i>=0; --i) { /* blah */ }

is likely to be a few cycles faster than

# for (uint i=0; i<10; ++i) { /* blah */ }

(depending on how good the compiler is at optimizing - a black art about which I
know nothing)
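
Speed aside, there is also a correctness angle to the countdown form: with an unsigned counter the test i>=0 is always true, because decrementing zero wraps around to the largest value. Java has no uint, so the sketch below only emulates that wraparound, assuming Java 8 or later for the unsigned helper methods on Integer. The class and method names are hypothetical.

```java
// Hypothetical emulation of what an unsigned 32-bit loop counter would do.
final class UnsignedCountdown {
    // The value an unsigned 32-bit counter holds after decrementing from zero.
    static long afterDecrementFromZero() {
        int i = 0;
        --i; // bit pattern is now all ones
        return Integer.toUnsignedLong(i);
    }

    // Under unsigned comparison the decremented counter is still >= 0,
    // so a countdown loop testing i >= 0 would never terminate.
    static boolean stillNonNegativeWhenUnsigned() {
        int i = 0;
        --i;
        return Integer.compareUnsigned(i, 0) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(afterDecrementFromZero());       // 4294967295
        System.out.println(stillNonNegativeWhenUnsigned()); // true
    }
}
```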

Jill


July 27, 2004
Arcane Jill wrote:
> In article <ce3qsf$1fqt$1@digitaldaemon.com>, Sean Kelly says...
> 
>>Also, is it reasonable to assume that every numbering scheme is base 10?  I'd
>>certainly think so, but I suppose it's worth asking.
> 
> As far as Unicode is concerned, yes.
> As far as reality is concerned, no.
> 
> In the Tamil script, for example, they use base twelve. Unicode simply cannot
> comprehend this, and (erroneously) declares Tamil digits 0 to 9 to be "decimal".
> However - for our purposes, /this doesn't matter/. Our job is to implement the
> Unicode standard, even if it's wrong. Fixing the Unicode code charts is a job
> for the Unicode Consortium, and that may happen in some future release. For now
> - as Walter said - we put metaphorical blinkers on and go with what the standard
> says.

Makes sense.  The scanf spec that I was working off of makes no concession for a base 12 numbering scheme anyway.  And I hesitate to add it as it would confuse things.

> For hexadecimal, there's the function getHexValue(), which returns a value in
> the range 0 to 15 for hex digits, -1 otherwise. (It's possible I may not have
> implemented that yet, or that I implemented it inefficiently. When I get back to
> D-coding, I'll fix this).

Perfect.  I'll just use this function for everything.  It will simplify the code a bit anyway.


Sean