June 04, 2005
Derek Parnell wrote:

> The current Phobos routines are heavily biased to char[]. Also, the use of
> templates is not always the best solution because there are some
> optimizations available, depending on the UTF encoding format used.

Not that anyone cares, but templates also have severe problems
on other D platforms such as with the GDC compiler on Mac OS X...

It's getting better, but it's like "the early days of C++" or so.

--anders
June 04, 2005
Hasan Aljudy wrote:

> I think that toString or any std function that takes a string and processes it, should always take dchar and return dchar.

That's like saying that booleans should always be represented
with "int", and I'm afraid it won't fly around here since we're
obsessed with the size of variables more than processing time :-)

Conversion is a real problem, but at least you can do:
   char[] str; foreach(dchar c; str) { ... }
Plus some ASCII shortcuts, when the high bit isn't set.
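
A minimal sketch of both ideas, assuming a current D2 compiler (where string literals are immutable, hence the const(char)[] parameters). The helper names countCodePoints and allAscii are made up for illustration; they are not Phobos functions.

```d
import std.stdio;

// Count code points by letting foreach decode the UTF-8 input.
// (countCodePoints is a hypothetical helper, not part of Phobos.)
size_t countCodePoints(const(char)[] str)
{
    size_t n = 0;
    foreach (dchar c; str)   // the compiler decodes UTF-8 into code points
        n++;
    return n;
}

// ASCII shortcut: a code unit with the high bit clear is a complete character,
// so a string made only of such units needs no decoding at all.
// (allAscii is also a hypothetical helper.)
bool allAscii(const(char)[] str)
{
    foreach (char c; str)    // iterates raw code units, no decoding
        if (c & 0x80)
            return false;
    return true;
}

void main()
{
    const(char)[] s = "Bj\u00F6rklund";  // U+00F6 takes two UTF-8 code units
    writeln(s.length);                   // code units: 10
    writeln(countCodePoints(s));         // code points: 9
    writeln(allAscii(s));                // false
}
```

So the foreach form costs a decode per character, while the high-bit test lets pure-ASCII input skip that work entirely.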


Much more on http://prowiki.org/wiki4d/wiki.cgi?CharsAndStrs
(and several other pages on the Wiki4D, like Derek's RFE:
 "FeatureRequestList/ImplicitConversionBetweenUTF")

--anders

PS. You probably meant to say "dchar[]", and not dchar ?
June 04, 2005
Anders F Björklund wrote:
> Hasan Aljudy wrote:
> 
>> I think that toString or any std function that takes a string and processes it, should always take dchar and return dchar.
> 
> 
> That's like saying that booleans should always be represented
> with "int", and I'm afraid it won't fly around here since we're
> obsessed with the size of variables more than processing time :-)
> 

No, it's not like representing booleans with ints. It's actually like saying ints should always be represented by doubles.

Booleans are not numbers; there is no reason to represent them as numbers, and no one should ever store numbers in booleans.

But char, wchar, and dchar are all characters, just with different storage space.
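
To make the storage difference concrete, here is a small sketch, assuming a current D2 compiler. It only uses built-in properties (.sizeof, .length) and the w/d string literal suffixes. Note that .length counts code units, so one character may occupy more than one char or wchar.

```d
import std.stdio;

void main()
{
    // Size of one code unit in each encoding, in bytes.
    writeln(char.sizeof, " ", wchar.sizeof, " ", dchar.sizeof);  // 1 2 4

    // The same single character, U+00F6, as UTF-8, UTF-16 and UTF-32.
    string  s8  = "\u00F6";
    wstring s16 = "\u00F6"w;
    dstring s32 = "\u00F6"d;

    // .length counts code units, not characters.
    writeln(s8.length, " ", s16.length, " ", s32.length);        // 2 1 1
}
```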

I don't really think anybody cares about size; most people who care would care most about performance (processing time).

Imagine if all std functions used short instead of int ;) That could be a serious problem.

> Conversion is a real problem, but at least you can do:
>    char[] str; foreach(dchar c; str) { ... }
> Plus some ASCII shortcuts, when the high bit isn't set.
> 

I don't like having to read the unicode specs to be able to deal with simple things like char. Your "ASCII shortcuts" would be low-level stuff dealing with how char and dchar are represented in memory.

C'mon people, D is a high level language.
June 04, 2005
It would be great to resolve this ongoing concern. However, you might consider trying the ICU project for all your unicode needs ~ it's what Java uses under the covers: http://www-306.ibm.com/software/globalization/icu/index.jsp

There's a D interface available over here, along with a well-rounded String class: http://dsource.org/forums/viewtopic.php?t=148

- Kris

"Hasan Aljudy" <hasan.aljudy@gmail.com> wrote in message news:d7t8tc$b40$1@digitaldaemon.com...
> Anders F Björklund wrote:
> > Hasan Aljudy wrote:
> >
> >> I think that toString or any std function that takes a string and processes it, should always take dchar and return dchar.
> >
> >
> > That's like saying that booleans should always be represented with "int", and I'm afraid it won't fly around here since we're obsessed with the size of variables more than processing time :-)
> >
>
> No, it's not like representing booleans with ints. It's actually like saying ints should always be represented by doubles.
>
> Booleans are not numbers; there is no reason to represent them as numbers, and no one should ever store numbers in booleans.
>
> But char, wchar, and dchar are all characters, just with different storage space.
>
> I don't really think anybody cares about size; most people who care would care most about performance (processing time).
>
> Imagine if all std functions used short instead of int ;) That could be a serious problem.
>
> > Conversion is a real problem, but at least you can do:
> >    char[] str; foreach(dchar c; str) { ... }
> > Plus some ASCII shortcuts, when the high bit isn't set.
> >
>
> I don't like having to read the unicode specs to be able to deal with simple things like char. Your "ASCII shortcuts" would be low-level stuff dealing with how char and dchar are represented in memory.
>
> C'mon people, D is a high level language.


June 04, 2005
> I don't like having to read the unicode specs to be able to deal with simple things like char. Your "ASCII shortcuts" would be low-level stuff dealing with how char and dchar are represented in memory.
>
> C'mon people, D is a high level language.

Maybe there should be isascii(char) somewhere :)
Would be inlined and self documenting.
June 05, 2005
Vathix wrote:

>> I don't like having to read the unicode specs to be able to deal with  simple things like char. Your "ASCII shortcuts" would be low-level stuff  dealing with how char and dchar are represented in memory.
>>
>> C'mon people, D is a high level language.
> 
> Maybe there should be isascii(char) somewhere :)
> Would be inlined and self documenting.

I suggested that enhancement last year, but it wasn't popular...

http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/2154

Or maybe it just got lost in this crippled "bug reporting system" ?

--anders
June 05, 2005
On Sun, 05 Jun 2005 09:25:09 +0200, Anders F Björklund wrote:

> Vathix wrote:
> 
>>> I don't like having to read the unicode specs to be able to deal with simple things like char. Your "ASCII shortcuts" would be low-level stuff  dealing with how char and dchar are represented in memory.
>>>
>>> C'mon people, D is a high level language.
>> 
>> Maybe there should be isascii(char) somewhere :)
>> Would be inlined and self documenting.
> 
> I suggested that enhancement last year, but it wasn't popular...
> 
> http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/2154
> 
> Or maybe it just got lost in this crippled "bug reporting system" ?

You mean like this ...
//---------------------------
//  --- isASCII --
// Returns true if the supplied argument is an ASCII character.
//
// Parameters:
//      (1)   -- char -- The character to test.
//   (return) -- bool -- 'true' if the character is ASCII otherwise false.
//---------------------------
bool isASCII(char c)
out(result)
{
    assert(result == (UTF8stride[c] == 1));
}
body{
    return (cast(uint)c <= 127U ? true : false);
}
unittest
{
   assert(isASCII('a') == true);
   assert(isASCII('~') == true);
   assert(isASCII('\xFF') == false);
   assert(isASCII('\x80') == false);
   assert(isASCII('\x00') == true);
   assert(isASCII(cast(char) -1) == false);
}
//---------------------------



-- 
Derek Parnell
Melbourne, Australia
5/06/2005 7:13:16 PM
June 05, 2005
Derek Parnell wrote:

> You mean like this ...
> //---------------------------
> //  --- isASCII --
> // Returns true if the supplied argument is an ASCII character.
> //
> // Parameters:
> //      (1)   -- char -- The character to test.
> //   (return) -- bool -- 'true' if the character is ASCII otherwise false.
> //---------------------------

Is that the "Natural Docs" format ?

I think I prefer Doxygen, myself:
/// Is the supplied code unit an ASCII character ?
/// @param c    The UTF-8 code unit to test.
/// @return     'true' if the character is ASCII

> bool isASCII(char c)
> out(result)
> {
>     assert(result == (UTF8stride[c] == 1));
> }
> body{
>     return (cast(uint)c <= 127U ? true : false);
> }

But surely this workaround shouldn't be needed ?

If a "bool" function can't return a comparison,
then there's something severely broken somewhere...

--anders
June 05, 2005
On Sun, 05 Jun 2005 12:09:47 +0200, Anders F Björklund wrote:

> Derek Parnell wrote:
> 
>> You mean like this ...
>> //---------------------------
>> //  --- isASCII --
>> // Returns true if the supplied argument is an ASCII character.
>> //
>> // Parameters:
>> //      (1)   -- char -- The character to test.
>> //   (return) -- bool -- 'true' if the character is ASCII otherwise false.
>> //---------------------------
> 
> Is that the "Natural Docs" format ?
>
Dunno. What's that ? I just made this up on the spot.

> I think I prefer Doxygen, myself:
> /// Is the supplied code unit an ASCII character ?
> /// @param c    The UTF-8 code unit to test.
> /// @return     'true' if the character is ASCII

Good on ya.

>> bool isASCII(char c)
>> out(result)
>> {
>>     assert(result == (UTF8stride[c] == 1));
>> }
>> body{
>>     return (cast(uint)c <= 127U ? true : false);
>> }
> 
> But surely this workaround shouldn't be needed ?
> 
> If a "bool" function can't return a comparison,
> then there's something severely broken somewhere...

I make a distinction between the machine code that is generated by a compiler and the source code that is read by a human.

Yes, the compiler is able to work out that a bool is returned from a comparison, but by writing it out explicitly, we also get a clear and unambiguous statement of intent by the coder. We get the same machine code generated and now it's human readable too.

In other words, it is self-documenting and does not rely on the sophistication of the compiler.

-- 
Derek Parnell
Melbourne, Australia
5/06/2005 8:39:19 PM
June 05, 2005
Derek Parnell wrote:

>>Is that the "Natural Docs" format ?
> 
> Dunno. What's that ? I just made this up on the spot.

http://www.naturaldocs.org/

Whatever style is used, it should be parsable ?

> Yes, the compiler is able to work out that a bool is returned from a
> comparison, but by writing it out explicitly, we also get a clear and
> unambiguous statement of intent by the coder. We get the same machine code
> generated and now it's human readable too.

Ah, OK, then it wasn't a compiler bug <phew>.
Just a matter of opinion on readability... :-)

Like: "a < b" versus "(a < b) ? true : false"
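
Spelled out as a sketch, assuming a current D compiler where a comparison already has type bool (lessPlain and lessExplicit are hypothetical names, only here for the side-by-side):

```d
// Returns the comparison directly; its type is already bool.
bool lessPlain(int a, int b)
{
    return a < b;
}

// Same result, with the bool written out explicitly.
bool lessExplicit(int a, int b)
{
    return (a < b) ? true : false;
}

void main()
{
    // The two forms agree for less-than, greater-than and equal inputs.
    assert(lessPlain(1, 2) == lessExplicit(1, 2));
    assert(lessPlain(2, 1) == lessExplicit(2, 1));
    assert(lessPlain(2, 2) == lessExplicit(2, 2));
}
```

Both functions return the same value for every input, so the choice really is about how it reads.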

--anders