April 24, 2007
Tomas Lindquist Olsen Wrote:

> okibi wrote:
> >> 
> >> Why not just do:
> >> 
> >> char[] text = "some text";
> >> char num5 = text[5];
> >> 
> >> 
> > 
> > Because it isn't working for me. That was what I was trying to do seeing as char[] is simply an array of characters. However, it's returning an int and not a char.
> 
> import std.stdio;
> 
> void main()
> {
>     char[] text = "this is a sentence";
>     int loc = 5;
>     writefln("%s", typeid(typeof(text[loc])));
> }
> 
> this prints 'char' as expected...

That fixed the problem, thanks!
April 24, 2007
On Tue, 24 Apr 2007 11:56:19 -0400, okibi wrote:

> Derek Parnell Wrote:
> 
>> On Tue, 24 Apr 2007 10:30:16 -0400, okibi wrote:
>> 
>>> Is there a getCharAt() function for D?
>> 
>> Get a character from what? A string, a file, a console screen, ... ?
>> 
>> -- 
>> Derek Parnell
>> Melbourne, Australia
>> "Justice for David Hicks!"
>> skype: derek.j.parnell
> 
> Such as this:
> 
> char[] text = "This is a test sentence.";
> 
> int loc = 5;
> 
> char num5 = text.getCharAt(loc);
> 
> Something along those lines.

Because char[] represents a UTF-8 encoded Unicode string, to get the Nth character (the first character is at position 1), try this ...

   import std.stdio;
   import std.utf;

   T getCharAt(T)(T pText, uint pPos)
   {
       size_t lUTF_Index;
       uint   lStride;

       // Firstly, find out where the character starts in the string.
       lUTF_Index = std.utf.toUTFindex(pText, pPos-1);

       // Then find out its width (in bytes)
       lStride = std.utf.stride(pText, lUTF_Index);

       // Return the character encoded in UTF format.
       return pText[lUTF_Index .. lUTF_Index + lStride];
  }

  void main()
  {
    char[] text = "a\ua034bcdef";
    uint loc = 4;
    writefln("%s", getCharAt(text, loc)); // shows "c"
    writefln("%s", text[loc-1]); // correctly fails
  }


If you just use 'text[loc]', you may not get the correct character, and you actually only get a UTF code point fragment anyway.

Remember that char[] is not an array of characters. It is an array of UTF-8
code point fragments (each 1-byte wide) and a UTF-8 encoded character (code
point) can have from 1 to 4 fragments.
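
A minimal sketch of the distinction described above, assuming a current D2 compiler (so the literal is typed `string` rather than the mutable `char[]` used elsewhere in this thread). It uses `std.utf.count` and `std.utf.stride` to show that the array length counts bytes, not characters:

```d
import std.stdio;
import std.utf;

void main()
{
    // Same text as the example above: 'a', U+A034, then "bcdef".
    string text = "a\ua034bcdef";

    // .length counts UTF-8 code units (bytes): 1 + 3 + 5.
    writefln("bytes:       %s", text.length);

    // std.utf.count walks the string and counts whole code points.
    writefln("code points: %s", std.utf.count(text));

    // stride reports how many bytes the sequence starting at an index occupies.
    writefln("stride at 0: %s", stride(text, 0)); // 'a' is a single byte
    writefln("stride at 1: %s", stride(text, 1)); // U+A034 takes three bytes

    assert(text.length == 9);
    assert(std.utf.count(text) == 7);
}
```

Indexing `text[3]` in this string lands on the last byte of the three-byte sequence, which is why plain indexing does not give the fourth character.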

-- 
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell
April 25, 2007
Derek Parnell wrote:
> On Tue, 24 Apr 2007 11:56:19 -0400, okibi wrote:
> 
>> Derek Parnell Wrote:
>>
>>> On Tue, 24 Apr 2007 10:30:16 -0400, okibi wrote:
>>>
>>>> Is there a getCharAt() function for D?
>>> Get a character from what? A string, a file, a console screen, ... ?
>>>
>>> -- 
>>> Derek Parnell
>>> Melbourne, Australia
>>> "Justice for David Hicks!"
>>> skype: derek.j.parnell
>> Such as this:
>>
>> char[] text = "This is a test sentence.";
>>
>> int loc = 5;
>>
>> char num5 = text.getCharAt(loc);
>>
>> Something along those lines.
> 
> Because char[] represents a UTF-8 encoded Unicode string, to get the Nth
> character (the first character is at position 1), try this ...
> 
>    import std.stdio;
>    import std.utf;
> 
>    T getCharAt(T)(T pText, uint pPos)
>    {
>        size_t lUTF_Index;
>        uint   lStride;
> 
>        // Firstly, find out where the character starts in the string.
>        lUTF_Index = std.utf.toUTFindex(pText, pPos-1);
> 
>        // Then find out its width (in bytes)
>        lStride = std.utf.stride(pText, lUTF_Index);
> 
>        // Return the character encoded in UTF format.
>        return pText[lUTF_Index .. lUTF_Index + lStride];
>   }
> 
>   void main()
>   {
>     char[] text = "a\ua034bcdef";
>     uint loc = 4;
>     writefln("%s", getCharAt(text, loc)); // shows "c"
>     writefln("%s", text[loc-1]); // correctly fails
>   }
> 
> 
> If you just use 'text[loc]', you may not get the correct character, and you
> actually only get a UTF code point fragment anyway.
> 
> Remember that char[] is not an array of characters. It is an array of UTF-8
> code point fragments (each 1-byte wide) and a UTF-8 encoded character (code
> point) can have from 1 to 4 fragments.
>  

Which is why I tend to bite the bullet and just use dchar[] for general-purpose things.  I only use char[] in cases where I know it's "safe" to do so (that is, cases where I know what the input will be, and know it will be within the single-byte character range).  That said, it's a darn good thing Phobos has std.utf and Tango has tango.utils.Utf, otherwise we'd often be in a pickle.  (Avoiding potential tango.io joke.)

-- Chris Nicholson-Sauls
April 25, 2007

Derek Parnell wrote:
> On Tue, 24 Apr 2007 11:56:19 -0400, okibi wrote:
> 
>> Derek Parnell Wrote:
>>
>>> On Tue, 24 Apr 2007 10:30:16 -0400, okibi wrote:
>>>
>>>> Is there a getCharAt() function for D?
>>> Get a character from what? A string, a file, a console screen, ... ?
>>>
>>> -- 
>>> Derek Parnell
>>> Melbourne, Australia
>>> "Justice for David Hicks!"
>>> skype: derek.j.parnell
>> Such as this:
>>
>> char[] text = "This is a test sentence.";
>>
>> int loc = 5;
>>
>> char num5 = text.getCharAt(loc);
>>
>> Something along those lines.
> 
> Because char[] represents a UTF-8 encoded Unicode string, to get the Nth character (the first character is at position 1), try this ...
> 
>    import std.stdio;
>    import std.utf;
> 
>    T getCharAt(T)(T pText, uint pPos)
>    {
>        size_t lUTF_Index;
>        uint   lStride;
> 
>        // Firstly, find out where the character starts in the string.
>        lUTF_Index = std.utf.toUTFindex(pText, pPos-1);
> 
>        // Then find out its width (in bytes)
>        lStride = std.utf.stride(pText, lUTF_Index);
> 
>        // Return the character encoded in UTF format.
>        return pText[lUTF_Index .. lUTF_Index + lStride];
>   }
> 
>   void main()
>   {
>     char[] text = "a\ua034bcdef";
>     uint loc = 4;
>     writefln("%s", getCharAt(text, loc)); // shows "c"
>     writefln("%s", text[loc-1]); // correctly fails
>   }
> 
> 
> If you just use 'text[loc]', you may not get the correct character, and you actually only get a UTF code point fragment anyway.
> 
> Remember that char[] is not an array of characters. It is an array of UTF-8
> code point fragments (each 1-byte wide) and a UTF-8 encoded character (code
> point) can have from 1 to 4 fragments.

I was going to post a link to my old Text In D article[1], but I guess that'd be redundant now :P

Incidentally, I don't suppose you know anything about the relative performance of your method up there ^^ and the one in my article down here vv:

> dchar nthCharacter(char[] string, int n)
> {
>     int curChar = 0;
>     foreach( dchar cp ; string )
>         if( curChar++ == n )
>             return cp;
>     return dchar.init;
> }

I'm curious since I don't want to recommend a slow solution if I can help it :)

	-- Daniel

[1] http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

-- 
int getRandomNumber()
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}

http://xkcd.com/

v2sw5+8Yhw5ln4+5pr6OFPma8u6+7Lw4Tm6+7l6+7D i28a2Xs3MSr2e4/6+7t4TNSMb6HTOp5en5g6RAHCP  http://hackerkey.com/
April 25, 2007
On Wed, 25 Apr 2007 13:41:25 +1000, Daniel Keep wrote:

> Incidentally, I don't suppose you know anything about the relative performance of your method up there ^^ and the one in my article down here vv:

It seems that your routine is about 3 times slower than the one I had shown. Here is my test program ... I modified your routine slightly because the idiom "if (x++ == n)" is a dangerous one as it is unclear if 'x' gets incremented before or after the comparison. I changed it to be more clear. I also changed my routine to output a dchar rather than a char[] and to test for invalid position input.

//-----------------------------
import std.perf;
import std.stdio;
import std.utf;


 dchar getCharAt(T)(T pText, int pPos)
 {
       size_t lUTF_Index;
       uint   lStride;

       if (pPos < 0 || pPos >= pText.length)
        return dchar.init;
       // Firstly, find out where the character starts in the string.
       lUTF_Index = std.utf.toUTFindex(pText, pPos);

       // Then find out its width (in bytes)
       lStride = std.utf.stride(pText, lUTF_Index);

       // Return the character encoded in UTF format.
       return std.utf.toUTF32(
                pText[lUTF_Index .. lUTF_Index + lStride])[0];
}

dchar nthCharacter(T)(T string, int n)
{
    int curChar = 0;
    foreach( dchar cp ; string )
    {
        if( curChar == n )
            return cp;
        curChar++;
    }
    return dchar.init;
}

void main()
{
    char[] text = "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg1"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg2"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg3"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg4"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg5"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg6"
                  ;
    // Test must locate the last character.
    int loc = std.utf.toUTF32(text).length-1;

    assert(getCharAt(text, loc) == '6');
    assert(nthCharacter(text, loc) == '6');

    PerformanceCounter    counter = new PerformanceCounter();

    counter.start();
    volatile for(int i = 0; i < 10_000_000; ++i)
    {  getCharAt(text, loc); }
    counter.stop();

    writefln("Derek Parnell: %10d", counter.microseconds());

    counter.start();
    volatile for(int i = 0; i < 10_000_000; ++i)
    {  nthCharacter(text, loc); }
    counter.stop();

    writefln("  Daniel Keep: %10d", counter.microseconds());
}
//-----------------------------

On my machine (Intel Core 2 6600 @ 2.40GHz, 2GB RAM) I got this result ...

c:\temp>test
Derek Parnell:    7939664
  Daniel Keep:   26683373

-- 
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell
April 25, 2007
Derek Parnell wrote:
> On Wed, 25 Apr 2007 13:41:25 +1000, Daniel Keep wrote:
> 
>> Incidentally, I don't suppose you know anything about the relative
>> performance of your method up there ^^ and the one in my article down
>> here vv:
> 
> It seems that your routine is about 3 times slower than the one I had
> shown. Here is my test program ... I modified your routine slightly because
> the idiom "if (x++ == n)" is a dangerous one as it is unclear if 'x' gets
> incremented before or after the comparison. I changed it to be more clear.

How is it unclear? Postfix-increment clearly means that the value before incrementation is returned (and thus compared to n in that expression).
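
A small sketch of the postfix/prefix difference under discussion (variable names are illustrative only): with postfix `++` the comparison sees the old value, so a counter starting at 0 matches a 0-based position.

```d
void main()
{
    int x = 0;
    // Postfix: the value *before* the increment is compared, then x becomes 1.
    bool postfixHit = (x++ == 0);
    assert(postfixHit);
    assert(x == 1);

    int y = 0;
    // Prefix: y is incremented first, so the comparison sees 1.
    bool prefixHit = (++y == 0);
    assert(!prefixHit);
    assert(y == 1);
}
```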

> I also changed my routine to output a dchar rather than a char[] and to
> test for invalid position input.
> 
> //-----------------------------
> import std.perf;
> import std.stdio;
> import std.utf;
> 
> 
>  dchar getCharAt(T)(T pText, int pPos)
>  {
>        size_t lUTF_Index;
>        uint   lStride;
> 
>        if (pPos < 0 || pPos >= pText.length)
>         return dchar.init;
>        // Firstly, find out where the character starts in the string.
>        lUTF_Index = std.utf.toUTFindex(pText, pPos);
> 


>        // Then find out its width (in bytes)
>        lStride = std.utf.stride(pText, lUTF_Index);
> 
>        // Return the character encoded in UTF format.
>        return std.utf.toUTF32(
>                 pText[lUTF_Index .. lUTF_Index + lStride])[0];

I think you can change these last two statements to just:
---
	return pText.decode(lUTF_Index);
---
(that's std.utf.decode, just to be clear)
That changes the index variable passed, but that doesn't matter here.
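
For reference, a self-contained sketch of what the whole function might look like with that change, assuming a current D2 compiler. The name `charAt` and the bounds check via `std.utf.count` are assumptions made for this sketch, not part of the benchmarked code (which compares the position against the byte length instead):

```d
import std.stdio;
import std.utf;

// Hypothetical helper: returns the code point at 0-based character
// position pos, or dchar.init if pos is past the last character.
dchar charAt(string text, size_t pos)
{
    // Count code points first so an out-of-range pos is rejected safely.
    // This costs an extra pass over the string.
    if (pos >= std.utf.count(text))
        return dchar.init;

    // Byte offset at which the requested character starts.
    size_t index = toUTFindex(text, pos);

    // decode reads one code point and advances index past it;
    // the updated index is simply discarded here.
    return decode(text, index);
}

void main()
{
    string text = "a\ua034bcdef";
    writefln("%s", charAt(text, 3));
    assert(charAt(text, 1) == '\ua034');
    assert(charAt(text, 7) == dchar.init); // only 7 characters, positions 0..6
}
```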

> }
[snip]
> //-----------------------------
> 
> On my machine (Intel Core 2 6600 @ 2.40GHz, 2GB RAM) I got this result ...
> 
> c:\temp>test
> Derek Parnell:    7939664
>   Daniel Keep:   26683373

With mine added: (and obviously on _my_ machine)
---
urxae@urxae:~/tmp$ dmd -O -release -inline -run test.d
   Derek Parnell:   17693368
     Daniel Keep:   54037341
Frits van Bommel:   12045495
urxae@urxae:~/tmp$ gdc -O3 -finline -frelease -o test test.d && ./test
   Derek Parnell:   19567337
     Daniel Keep:   26750383
Frits van Bommel:   14332419
---
(My machine & compilers: AMD Sempron 3200+, 1GB RAM, 64-bit Ubuntu 6.10, running DMD 1.013 and GDC 0.23/x86_64)

So my version is even faster (about 30%), at least on my machine. And IMHO it's also more readable. No need to know what "stride" is, for example.
April 25, 2007

Frits van Bommel wrote:
> Derek Parnell wrote:
>> On Wed, 25 Apr 2007 13:41:25 +1000, Daniel Keep wrote:
>>
>>> Incidentally, I don't suppose you know anything about the relative performance of your method up there ^^ and the one in my article down here vv:
>>
>> It seems that your routine is about 3 times slower than the one I had
>> shown. Here is my test program ... I modified your routine slightly
>> because
>> the idiom "if (x++ == n)" is a dangerous one as it is unclear if 'x' gets
>> incremented before or after the comparison. I changed it to be more
>> clear.
> 
> How is it unclear? Postfix-increment clearly means that the value before incrementation is returned (and thus compared to n in that expression).
> 
>> I also changed my routine to output a dchar rather than a char[] and to test for invalid position input.
>>
>> //-----------------------------
>> import std.perf;
>> import std.stdio;
>> import std.utf;
>>
>>
>>  dchar getCharAt(T)(T pText, int pPos)
>>  {
>>        size_t lUTF_Index;
>>        uint   lStride;
>>
>>        if (pPos < 0 || pPos >= pText.length)
>>         return dchar.init;
>>        // Firstly, find out where the character starts in the string.
>>        lUTF_Index = std.utf.toUTFindex(pText, pPos);
>>
> 
> 
>>        // Then find out its width (in bytes)
>>        lStride = std.utf.stride(pText, lUTF_Index);
>>
>>        // Return the character encoded in UTF format.
>>        return std.utf.toUTF32(
>>                 pText[lUTF_Index .. lUTF_Index + lStride])[0];
> 
> I think you can change these last two statements to just:
> ---
>     return pText.decode(lUTF_Index);
> ---
> (that's std.utf.decode, just to be clear)
> That changes the index variable passed, but that doesn't matter here.
> 
>> }
> [snip]
>> //-----------------------------
>>
>> On my machine (Intel Core 2 6600 @ 2.40GHz, 2GB RAM) I got this result
>> ...
>>
>> c:\temp>test
>> Derek Parnell:    7939664
>>   Daniel Keep:   26683373
> 
> With mine added: (and obviously on _my_ machine)
> ---
> urxae@urxae:~/tmp$ dmd -O -release -inline -run test.d
>    Derek Parnell:   17693368
>      Daniel Keep:   54037341
> Frits van Bommel:   12045495
> urxae@urxae:~/tmp$ gdc -O3 -finline -frelease -o test test.d && ./test
>    Derek Parnell:   19567337
>      Daniel Keep:   26750383
> Frits van Bommel:   14332419
> ---
> (My machine & compilers: AMD Sempron 3200+, 1GB RAM, 64-bit Ubuntu 6.10, running DMD 1.013 and GDC 0.23/x86_64)
> 
> So my version is even faster (about 30%), at least on my machine. And IMHO it's also more readable. No need to know what "stride" is, for example.

Yoikes!  I'm rather amazed that the "simple" foreach method is that much slower.  I'll add the faster version to the article as soon as I get the chance.

Thanks, guys.

	-- Daniel

-- 
int getRandomNumber()
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}

http://xkcd.com/

v2sw5+8Yhw5ln4+5pr6OFPma8u6+7Lw4Tm6+7l6+7D i28a2Xs3MSr2e4/6+7t4TNSMb6HTOp5en5g6RAHCP  http://hackerkey.com/
April 25, 2007
On Wed, 25 Apr 2007 15:52:45 +0200, Frits van Bommel wrote:

> Derek Parnell wrote:
>> On Wed, 25 Apr 2007 13:41:25 +1000, Daniel Keep wrote:
>> 
>>> Incidentally, I don't suppose you know anything about the relative performance of your method up there ^^ and the one in my article down here vv:
>> 
>> It seems that your routine is about 3 times slower than the one I had shown. Here is my test program ... I modified your routine slightly because the idiom "if (x++ == n)" is a dangerous one as it is unclear if 'x' gets incremented before or after the comparison. I changed it to be more clear.
> 
> How is it unclear? Postfix-increment clearly means that the value before incrementation is returned (and thus compared to n in that expression).

Yes, I know what it is supposed to do, but when written as it is, it can either be mistakenly thought that the variable gets incremented before the comparison, or it requires that extra bit of thinking to 'see' the process flow. For that reason, I prefer to either have ++ written as its own statement or write it so the casual reader can explicitly see the process flow.

For example, in the original code by Daniel, I was unsure as to whether he was using a 0-based index or a 1-based index, as I had done in my example. The code he supplied assumed a 0-based index if the ++ worked as you describe, but a 1-based index if it worked the other way. As my example was 1-based, and I assumed that Daniel knew how to use ++ correctly, I figured he had thus changed my definition of the Position parameter. But the point is, because it was not absolutely clear what Daniel's *intention* was, I decided to code it so the intention was more clear.


> I think you can change these last two statements to just:
 ...
> So my version is even faster (about 30%), at least on my machine. And IMHO it's also more readable. No need to know what "stride" is, for example.

Well, if we were really into a pissing contest, we'd both remove the calls to library routines and code it inline, in assembler etc ... but that was not the point. Daniel's code is another example of 'foreach' not producing the best machine code to solve the problem at hand.

-- 
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell
April 25, 2007
Derek Parnell wrote:
> On Wed, 25 Apr 2007 15:52:45 +0200, Frits van Bommel wrote:
> 
>> I think you can change these last two statements to just:
>  ... 
>> So my version is even faster (about 30%), at least on my machine. And IMHO it's also more readable. No need to know what "stride" is, for example.
> 
> Well, if we were really into a pissing contest, we'd both remove the calls
> to library routines and code it inline, in assembler etc ... but that was

I was just mentioning that you seemed to be over-complicating the code, and as a side-benefit the simpler code was faster as well.

> not the point. Daniel's code is another example of 'foreach' not producing
> the best machine code to solve the problem at hand.

Well to be fair, I don't think that's purely the fault of 'foreach' implementation problems in this case.
'foreach' is doing genuinely more work in this case. Specifically, the foreach loop is decoding all characters up to the one it returns while the getCharAt() variants only actually decode the character asked for, using no more than the stride of the preceding ones.

What the foreach version does is therefore more like the following:
-----
dchar nthCharacter2(T)(T string, int n)
{
    int curChar = 0;
    for(size_t index = 0 ; index < string.length ; string.decode(index))
    {
        if( curChar == n )
            return string.decode(index);	// return _next_ char
        curChar++;
    }
    return dchar.init;
}
-----
Which is also on the slow side. (Though on DMD this version is still faster than the 'foreach' version :( )
The results with this added as well:
=====
urxae@urxae:~/tmp$ dmd -O -release -inline -run test.d
   Derek Parnell:   14416041
Frits van Bommel:    9803830
     Daniel Keep:   37386228
      for-decode:   33767606
urxae@urxae:~/tmp$ gdc -O3 -finline -frelease -o test test.d && ./test
   Derek Parnell:   17267995
Frits van Bommel:   11836242
     Daniel Keep:   21390295
      for-decode:   25339226
=====
("for-decode" is the code above)
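
A sketch contrasting the two strategies described above, assuming a current D2 compiler; the function names are hypothetical. The first decodes every code point on the way to position n, the second only steps over the earlier sequences with `stride` and decodes a single code point at the end:

```d
import std.stdio;
import std.utf;

// Decodes every code point up to and including position n.
dchar nthByForeach(string text, size_t n)
{
    size_t cur = 0;
    foreach (dchar cp; text) // each iteration fully decodes one code point
    {
        if (cur == n)
            return cp;
        ++cur;
    }
    return dchar.init;
}

// Skips the first n sequences by length, then decodes only the one requested.
dchar nthByStride(string text, size_t n)
{
    size_t index = 0;
    while (n > 0 && index < text.length)
    {
        index += stride(text, index); // length of the sequence at index
        --n;
    }
    if (index >= text.length)
        return dchar.init;
    return decode(text, index);
}

void main()
{
    string text = "a\ua034bcdef";
    // Both approaches agree, including past the end of the text.
    foreach (size_t n; 0 .. 8)
        assert(nthByForeach(text, n) == nthByStride(text, n));
    writefln("%s", nthByStride(text, 3));
}
```

Both helpers assume valid UTF-8 input; `stride` and `decode` throw on malformed sequences.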