April 24, 2007 Re: Get Character At? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Tomas Lindquist Olsen | Tomas Lindquist Olsen Wrote:
> okibi wrote:
> >>
> >> Why not just do:
> >>
> >> char[] text = "some text";
> >> char num5 = text[5];
> >>
> >>
> >
> > Because it isn't working for me. That was what I was trying to do seeing as char[] is simply an array of characters. However, it's returning an int and not a char.
>
> import std.stdio;
>
> void main()
> {
> char[] text = "this is a sentence";
> int loc = 5;
> writefln("%s", typeid(typeof(text[loc])));
> }
>
> this prints 'char' as expected...
That fixed the problem, thanks!
|
April 24, 2007 Re: Get Character At? | ||||
---|---|---|---|---|
| ||||
Posted in reply to okibi | On Tue, 24 Apr 2007 11:56:19 -0400, okibi wrote:
> Derek Parnell Wrote:
>
>> On Tue, 24 Apr 2007 10:30:16 -0400, okibi wrote:
>>
>>> Is there a getCharAt() function for D?
>>
>> Get a character from what? A string, a file, a console screen, ... ?
>>
>> --
>> Derek Parnell
>> Melbourne, Australia
>> "Justice for David Hicks!"
>> skype: derek.j.parnell
>
> Such as this:
>
> char[] text = "This is a test sentence.";
>
> int loc = 5;
>
> char num5 = text.getCharAt(loc);
>
> Something along those lines.

Because char[] represents a UTF-8 encoded Unicode string, to get the Nth character (the first character is at position 1), try this ...

import std.stdio;
import std.utf;

T getCharAt(T)(T pText, uint pPos)
{
    size_t lUTF_Index;
    uint lStride;

    // Firstly, find out where the character starts in the string.
    lUTF_Index = std.utf.toUTFindex(pText, pPos-1);

    // Then find out its width (in bytes)
    lStride = std.utf.stride(pText, lUTF_Index);

    // Return the character encoded in UTF format.
    return pText[lUTF_Index .. lUTF_Index + lStride];
}

void main()
{
    char[] text = "a\ua034bcdef";
    uint loc = 4;
    writefln("%s", getCharAt(text, loc)); // shows "c"
    writefln("%s", text[loc-1]); // correctly fails
}

If you just use 'text[loc]', you may not get the correct character, and you actually only get a UTF code point fragment anyway.

Remember that char[] is not an array of characters. It is an array of UTF-8 code point fragments (each 1-byte wide) and a UTF-8 encoded character (code point) can have from 1 to 4 fragments.

--
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell
|
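The difference between bytes and characters described above can be checked directly. This is only a minimal sketch, and it assumes a current D2 compiler and Phobos rather than the D1 toolchain used in this thread (hence `string` instead of a mutable `char[]` literal). The helper `isContinuation` is a hypothetical name introduced for illustration; `count` and `stride` are taken from `std.utf`.

```d
import std.stdio : writeln;
import std.utf : count, stride;

// Hypothetical helper: true if c is a UTF-8 continuation byte (10xxxxxx).
bool isContinuation(char c)
{
    return (c & 0xC0) == 0x80;
}

void main()
{
    // Same sample as in the post: 'a', U+A034, then "bcdef".
    string text = "a\uA034bcdef";

    writeln(text.length);             // UTF-8 code units (bytes): 9
    writeln(count(text));             // code points: 7
    writeln(stride(text, 1));         // width of the sequence starting at index 1: 3
    writeln(isContinuation(text[3])); // true: index 3 lies inside U+A034
}
```

U+A034 is encoded as three bytes, so plain indexing with `text[3]` lands on the last byte of that sequence rather than on the fourth character.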
April 25, 2007 Re: Get Character At? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Derek Parnell | Derek Parnell wrote:
> On Tue, 24 Apr 2007 11:56:19 -0400, okibi wrote:
>
>> Derek Parnell Wrote:
>>
>>> On Tue, 24 Apr 2007 10:30:16 -0400, okibi wrote:
>>>
>>>> Is there a getCharAt() function for D?
>>> Get a character from what? A string, a file, a console screen, ... ?
>>>
>>> --
>>> Derek Parnell
>>> Melbourne, Australia
>>> "Justice for David Hicks!"
>>> skype: derek.j.parnell
>> Such as this:
>>
>> char[] text = "This is a test sentence.";
>>
>> int loc = 5;
>>
>> char num5 = text.getCharAt(loc);
>>
>> Something along those lines.
>
> Because char[] represents a UTF-8 encoded unicode string, to get the Nth
> character (first character is a position 1), try this ...
>
> import std.stdio;
> import std.utf;
>
> T getCharAt(T)(T pText, uint pPos)
> {
> size_t lUTF_Index;
> uint lStride;
>
> // Firstly, find out where the character starts in the string.
> lUTF_Index = std.utf.toUTFindex(pText, pPos-1);
>
> // Then find out its width (in bytes)
> lStride = std.utf.stride(pText, lUTF_Index);
>
> // Return the character encoded in UTF format.
> return pText[lUTF_Index .. lUTF_Index + lStride];
> }
>
> void main()
> {
> char[] text = "a\ua034bcdef";
> uint loc = 4;
> writefln("%s", getCharAt(text, loc)); // shows "c"
> writefln("%s", text[loc-1]); // correctly fails
> }
>
>
> If you just use 'text[loc]', you may not get the correct character, and you
> actually only get a UTF code point fragment anyway.
>
> Remember that char[] is not an array of characters. It is an array of UTF-8
> code point fragments (each 1-byte wide) and a UTF-8 encoded character (code
> point) can have from 1 to 4 fragments.
>
Which is why I tend to try and bite the bullet and just use dchar[] for general purpose things. I only use char[] in cases where I know it's "safe" to do so (that is, cases where I know what the input will be, and know it will be within the single-byte character range). That said, it's a darn good thing Phobos has std.utf and Tango has tango.utils.Utf, otherwise we'd often be in a pickle. (Avoiding potential tango.io joke.)
-- Chris Nicholson-Sauls
|
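The dchar[] approach can be sketched as follows. This assumes a current D2 compiler, where `std.conv.to` performs the UTF-8 to UTF-32 conversion and `dstring` is the immutable counterpart of `dchar[]`; it is an illustration of the idea, not code from the thread.

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    string utf8 = "a\uA034bcdef";

    // Convert once; every element of the result is one whole code point.
    dstring utf32 = to!dstring(utf8);

    writeln(utf32.length); // 7
    writeln(utf32[3]);     // c (0-based index, constant-time lookup)
}
```

The trade-off is memory: each code point takes four bytes regardless of how short its UTF-8 encoding was.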
April 25, 2007 Re: Get Character At? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Derek Parnell | Derek Parnell wrote:
> On Tue, 24 Apr 2007 11:56:19 -0400, okibi wrote:
>
>> Derek Parnell Wrote:
>>
>>> On Tue, 24 Apr 2007 10:30:16 -0400, okibi wrote:
>>>
>>>> Is there a getCharAt() function for D?
>>> Get a character from what? A string, a file, a console screen, ... ?
>>>
>>> --
>>> Derek Parnell
>>> Melbourne, Australia
>>> "Justice for David Hicks!"
>>> skype: derek.j.parnell
>> Such as this:
>>
>> char[] text = "This is a test sentence.";
>>
>> int loc = 5;
>>
>> char num5 = text.getCharAt(loc);
>>
>> Something along those lines.
>
> Because char[] represents a UTF-8 encoded unicode string, to get the Nth character (first character is a position 1), try this ...
>
> import std.stdio;
> import std.utf;
>
> T getCharAt(T)(T pText, uint pPos)
> {
>     size_t lUTF_Index;
>     uint lStride;
>
>     // Firstly, find out where the character starts in the string.
>     lUTF_Index = std.utf.toUTFindex(pText, pPos-1);
>
>     // Then find out its width (in bytes)
>     lStride = std.utf.stride(pText, lUTF_Index);
>
>     // Return the character encoded in UTF format.
>     return pText[lUTF_Index .. lUTF_Index + lStride];
> }
>
> void main()
> {
>     char[] text = "a\ua034bcdef";
>     uint loc = 4;
>     writefln("%s", getCharAt(text, loc)); // shows "c"
>     writefln("%s", text[loc-1]); // correctly fails
> }
>
>
> If you just use 'text[loc]', you may not get the correct character, and you actually only get a UTF code point fragment anyway.
>
> Remember that char[] is not an array of characters. It is an array of UTF-8
> code point fragments (each 1-byte wide) and a UTF-8 encoded character (code
> point) can have from 1 to 4 fragments.
I was going to post a link to my old Text In D article[1], but I guess that'd be redundant now :P

Incidentally, I don't suppose you know anything about the relative performance of your method up there ^^ and the one in my article down here vv:

> dchar nthCharacter(char[] string, int n)
> {
>     int curChar = 0;
>     foreach( dchar cp ; string )
>         if( curChar++ == n )
>             return cp;
>     return dchar.init;
> }

I'm curious since I don't want to recommend a slow solution if I can help it :)

-- Daniel

[1] http://www.prowiki.org/wiki4d/wiki.cgi?DanielKeep/TextInD

--
int getRandomNumber()
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}

http://xkcd.com/

v2sw5+8Yhw5ln4+5pr6OFPma8u6+7Lw4Tm6+7l6+7D
i28a2Xs3MSr2e4/6+7t4TNSMb6HTOp5en5g6RAHCP http://hackerkey.com/
|
April 25, 2007 Re: Get Character At? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Daniel Keep | On Wed, 25 Apr 2007 13:41:25 +1000, Daniel Keep wrote:
> Incidentally, I don't suppose you know anything about the relative performance of your method up there ^^ and the one in my article down here vv:

It seems that your routine is about 3 times slower than the one I had shown. Here is my test program ... I modified your routine slightly because the idiom "if (x++ == n)" is a dangerous one as it is unclear if 'x' gets incremented before or after the comparison. I changed it to be more clear. I also changed my routine to output a dchar rather than a char[] and to test for invalid position input.

//-----------------------------
import std.perf;
import std.stdio;
import std.utf;

dchar getCharAt(T)(T pText, int pPos)
{
    size_t lUTF_Index;
    uint lStride;

    if (pPos < 0 || pPos >= pText.length)
        return dchar.init;
    // Firstly, find out where the character starts in the string.
    lUTF_Index = std.utf.toUTFindex(pText, pPos);

    // Then find out its width (in bytes)
    lStride = std.utf.stride(pText, lUTF_Index);

    // Return the character encoded in UTF format.
    return std.utf.toUTF32(
        pText[lUTF_Index .. lUTF_Index + lStride])[0];
}

dchar nthCharacter(T)(T string, int n)
{
    int curChar = 0;
    foreach( dchar cp ; string )
    {
        if( curChar == n )
            return cp;
        curChar++;
    }
    return dchar.init;
}

void main()
{
    char[] text = "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg1"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg2"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg3"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg4"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg5"
                  "a\ua034bcdefa\ua034bcdefa\ua034bcdefa\ua034bcdefg6"
                  ;
    // Test must locate the last character.
    int loc = std.utf.toUTF32(text).length-1;
    assert(getCharAt(text, loc) == '6');
    assert(nthCharacter(text, loc) == '6');

    PerformanceCounter counter = new PerformanceCounter();
    counter.start();
    volatile for(int i = 0; i < 10_000_000; ++i)
    {
        getCharAt(text, loc);
    }
    counter.stop();
    writefln("Derek Parnell: %10d", counter.microseconds());

    counter.start();
    volatile for(int i = 0; i < 10_000_000; ++i)
    {
        nthCharacter(text, loc);
    }
    counter.stop();
    writefln("  Daniel Keep: %10d", counter.microseconds());
}
//-----------------------------

On my machine (Intel Core 2 6600 @ 2.40GHz, 2GB RAM) I got this result ...

c:\temp>test
Derek Parnell:    7939664
  Daniel Keep:   26683373

--
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell
|
April 25, 2007 Re: Get Character At? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Derek Parnell | Derek Parnell wrote:
> On Wed, 25 Apr 2007 13:41:25 +1000, Daniel Keep wrote:
>
>> Incidentally, I don't suppose you know anything about the relative
>> performance of your method up there ^^ and the one in my article down
>> here vv:
>
> It seems that your routine is about 3 times slower than the one I had
> shown. Here is my test program ... I modified your routine slightly because
> the idiom "if (x++ == n)" is a dangerous one as it is unclear if 'x' gets
> incremented before or after the comparision. I changed it to be more clear.

How is it unclear? Postfix-increment clearly means that the value before incrementation is returned (and thus compared to n in that expression).

> I also changed my routine to output a dchar rather than a char[] and to
> test for invalid position input.
>
> //-----------------------------
> import std.perf;
> import std.stdio;
> import std.utf;
>
>
> dchar getCharAt(T)(T pText, int pPos)
> {
>     size_t lUTF_Index;
>     uint lStride;
>
>     if (pPos < 0 || pPos >= pText.length)
>         return dchar.init;
>     // Firstly, find out where the character starts in the string.
>     lUTF_Index = std.utf.toUTFindex(pText, pPos);
>
>     // Then find out its width (in bytes)
>     lStride = std.utf.stride(pText, lUTF_Index);
>
>     // Return the character encoded in UTF format.
>     return std.utf.toUTF32(
>         pText[lUTF_Index .. lUTF_Index + lStride])[0];

I think you can change these last two statements to just:
---
return pText.decode(lUTF_Index);
---
(that's std.utf.decode, just to be clear)
That changes the index variable passed, but that doesn't matter here.

> }
[snip]
> //-----------------------------
>
> On my machine (Intel Core 2 6600 @ 2.40GHz, 2GB RAM) I got this result ...
>
> c:\temp>test
> Derek Parnell:    7939664
>   Daniel Keep:   26683373

With mine added: (and obviously on _my_ machine)
---
urxae@urxae:~/tmp$ dmd -O -release -inline -run test.d
Derek Parnell:     17693368
Daniel Keep:       54037341
Frits van Bommel:  12045495
urxae@urxae:~/tmp$ gdc -O3 -finline -frelease -o test test.d && ./test
Derek Parnell:     19567337
Daniel Keep:       26750383
Frits van Bommel:  14332419
---
(My machine & compilers: AMD Sempron 3200+, 1GB RAM, 64-bit Ubuntu 6.10, running DMD 1.013 and GDC 0.23/x86_64)

So my version is even faster (about 30%), at least on my machine. And IMHO it's also more readable. No need to know what "stride" is, for example.
|
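For readers on a current compiler, the decode-based variant suggested above might look roughly like this. It is a sketch under stated assumptions: it targets D2 Phobos (`std.utf.toUTFindex` and `std.utf.decode`), it takes a 0-based position, and it assumes the caller passes a position that actually exists in the string, so the bounds handling from the benchmark version is deliberately left out.

```d
import std.stdio : writeln;
import std.utf : decode, toUTFindex;

// Hypothetical D2 adaptation; pos must be a valid 0-based code point position.
dchar getCharAt(const(char)[] text, size_t pos)
{
    // Byte offset at which code point number pos starts.
    size_t index = toUTFindex(text, pos);

    // decode reads one code point and advances index past it;
    // the advanced index is simply discarded here.
    return decode(text, index);
}

void main()
{
    writeln(getCharAt("a\uA034bcdef", 3)); // c
}
```

Only the requested code point is decoded; everything before it is skipped by `toUTFindex`.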
April 25, 2007 Re: Get Character At? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Frits van Bommel | Frits van Bommel wrote:
> Derek Parnell wrote:
>> On Wed, 25 Apr 2007 13:41:25 +1000, Daniel Keep wrote:
>>
>>> Incidentally, I don't suppose you know anything about the relative performance of your method up there ^^ and the one in my article down here vv:
>>
>> It seems that your routine is about 3 times slower than the one I had
>> shown. Here is my test program ... I modified your routine slightly because
>> the idiom "if (x++ == n)" is a dangerous one as it is unclear if 'x' gets
>> incremented before or after the comparision. I changed it to be more clear.
>
> How is it unclear? Postfix-increment clearly means that the value before incrementation is returned (and thus compared to n in that expression).
>
>> I also changed my routine to output a dchar rather than a char[] and to test for invalid position input.
>>
>> //-----------------------------
>> import std.perf;
>> import std.stdio;
>> import std.utf;
>>
>>
>> dchar getCharAt(T)(T pText, int pPos)
>> {
>>     size_t lUTF_Index;
>>     uint lStride;
>>
>>     if (pPos < 0 || pPos >= pText.length)
>>         return dchar.init;
>>     // Firstly, find out where the character starts in the string.
>>     lUTF_Index = std.utf.toUTFindex(pText, pPos);
>>
>>     // Then find out its width (in bytes)
>>     lStride = std.utf.stride(pText, lUTF_Index);
>>
>>     // Return the character encoded in UTF format.
>>     return std.utf.toUTF32(
>>         pText[lUTF_Index .. lUTF_Index + lStride])[0];
>
> I think you can change these last two statements to just:
> ---
> return pText.decode(lUTF_Index);
> ---
> (that's std.utf.decode, just to be clear)
> That changes the index variable passed, but that doesn't matter here.
>
>> }
> [snip]
>> //-----------------------------
>>
>> On my machine (Intel Core 2 6600 @ 2.40GHz, 2GB RAM) I got this result ...
>>
>> c:\temp>test
>> Derek Parnell:    7939664
>>   Daniel Keep:   26683373
>
> With mine added: (and obviously on _my_ machine)
> ---
> urxae@urxae:~/tmp$ dmd -O -release -inline -run test.d
> Derek Parnell:     17693368
> Daniel Keep:       54037341
> Frits van Bommel:  12045495
> urxae@urxae:~/tmp$ gdc -O3 -finline -frelease -o test test.d && ./test
> Derek Parnell:     19567337
> Daniel Keep:       26750383
> Frits van Bommel:  14332419
> ---
> (My machine & compilers: AMD Sempron 3200+, 1GB RAM, 64-bit Ubuntu 6.10, running DMD 1.013 and GDC 0.23/x86_64)
>
> So my version is even faster (about 30%), at least on my machine. And IMHO it's also more readable. No need to know what "stride" is, for example.

Yoikes! I'm rather amazed that the "simple" foreach method is that much slower. I'll add the faster version to the article as soon as I get the chance.

Thanks, guys.

-- Daniel

--
int getRandomNumber()
{
    return 4; // chosen by fair dice roll.
              // guaranteed to be random.
}

http://xkcd.com/

v2sw5+8Yhw5ln4+5pr6OFPma8u6+7Lw4Tm6+7l6+7D
i28a2Xs3MSr2e4/6+7t4TNSMb6HTOp5en5g6RAHCP http://hackerkey.com/
|
April 25, 2007 Re: Get Character At? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Frits van Bommel | On Wed, 25 Apr 2007 15:52:45 +0200, Frits van Bommel wrote:
> Derek Parnell wrote:
>> On Wed, 25 Apr 2007 13:41:25 +1000, Daniel Keep wrote:
>>
>>> Incidentally, I don't suppose you know anything about the relative performance of your method up there ^^ and the one in my article down here vv:
>>
>> It seems that your routine is about 3 times slower than the one I had shown. Here is my test program ... I modified your routine slightly because the idiom "if (x++ == n)" is a dangerous one as it is unclear if 'x' gets incremented before or after the comparision. I changed it to be more clear.
>
> How is it unclear? Postfix-increment clearly means that the value before incrementation is returned (and thus compared to n in that expression).

Yes, I know what it is supposed to do, but when written as it is, it can either be mistakenly thought that the variable gets incremented before the comparison or requires that extra bit of thinking to 'see' the process flow. For that reason, I prefer to either have ++ written as its own statement or write it so the casual reader can explicitly see the process flow.

For example, in the original code by Daniel, I was unsure as to whether he was using a 0-based index or a 1-based index, as I had done in my example. The code he supplied assumed a 0-based index if the ++ worked as you describe, but it assumed a 1-based index if it worked the other way. As my example was 1-based, and I assumed that Daniel knew how to use ++ correctly, I figured he had thus changed my definition of the Position parameter. But the point is, because it was not absolutely clear what Daniel's *intention* was, I decided to code it so the intention was more clear.

> I think you can change these last two statements to just:
...
> So my version is even faster (about 30%), at least on my machine. And IMHO it's also more readable. No need to know what "stride" is, for example.

Well, if we were really into a pissing contest, we'd both remove the calls to library routines and code it inline, in assembler etc ... but that was not the point. Daniel's code is another example of 'foreach' not producing the best machine code to solve the problem at hand.

--
Derek Parnell
Melbourne, Australia
"Justice for David Hicks!"
skype: derek.j.parnell
|
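The behaviour both posters agree on can be demonstrated in a few lines. This is a minimal illustration (valid on a current D2 compiler, and not taken from the thread) of what the comparison sees with postfix and prefix increment.

```d
import std.stdio : writeln;

void main()
{
    int x = 5;
    // Postfix: the comparison sees the old value 5, then x becomes 6.
    bool postfixMatched = (x++ == 5);
    writeln(postfixMatched, " ", x); // true 6

    int y = 5;
    // Prefix: y becomes 6 first, so the comparison sees 6.
    bool prefixMatched = (++y == 5);
    writeln(prefixMatched, " ", y);  // false 6
}
```

With the postfix form a counter starting at 0 matches n on the (n+1)th iteration, which is why the loop in question behaves as a 0-based lookup.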
April 25, 2007 Re: Get Character At? | ||||
---|---|---|---|---|
| ||||
Posted in reply to Derek Parnell | Derek Parnell wrote:
> On Wed, 25 Apr 2007 15:52:45 +0200, Frits van Bommel wrote:
>
>> I think you can change these last two statements to just:
> ...
>> So my version is even faster (about 30%), at least on my machine. And IMHO it's also more readable. No need to know what "stride" is, for example.
>
> Well, if we were really into a pissing contest, we'd both remove the calls
> to library routines and code it inline, in assembler etc ... but that was

I was just mentioning that you seemed to be over-complicating the code, and as a side-benefit the simpler code was faster as well.

> not the point. Daniel's code is another example of 'foreach' not producing
> the best machine code to solve the problem at hand.

Well to be fair, I don't think that's purely the fault of 'foreach' implementation problems in this case. 'foreach' is doing genuinely more work in this case.
Specifically, the foreach loop is decoding all characters up to the one it returns while the getCharAt() variants only actually decode the character asked for, using no more than the stride of the preceding ones.

What the foreach version does is therefore more like the following:
-----
dchar nthCharacter2(T)(T string, int n)
{
    int curChar = 0;
    for(size_t index = 0 ; index < string.length ; string.decode(index))
    {
        if( curChar == n )
            return string.decode(index); // return _next_ char
        curChar++;
    }
    return dchar.init;
}
-----
Which is also on the slow side. (Though on DMD this version is still faster than the 'foreach' version :( )

The results with this added as well:
=====
urxae@urxae:~/tmp$ dmd -O -release -inline -run test.d
Derek Parnell:     14416041
Frits van Bommel:   9803830
Daniel Keep:       37386228
for-decode:        33767606
urxae@urxae:~/tmp$ gdc -O3 -finline -frelease -o test test.d && ./test
Derek Parnell:     17267995
Frits van Bommel:  11836242
Daniel Keep:       21390295
for-decode:        25339226
=====
("for-decode" is the code above)
|
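The distinction drawn above, skipping by stride versus decoding every character, can be sketched side by side. The function names `nthByStride` and `nthByForeach` are hypothetical, the code assumes a current D2 compiler with `std.utf.stride` and `std.utf.decode`, and positions are 0-based. It says nothing about timings on any particular machine; it only shows that the two strategies return the same result while doing different amounts of work per skipped character.

```d
import std.stdio : writeln;
import std.utf : decode, stride;

// Skips pos code points by looking only at their width, then decodes one.
dchar nthByStride(const(char)[] text, size_t pos)
{
    size_t index = 0;
    foreach (_; 0 .. pos)
    {
        if (index >= text.length)
            return dchar.init;        // pos is past the end
        index += stride(text, index); // no decoding, just the byte width
    }
    return index < text.length ? decode(text, index) : dchar.init;
}

// Fully decodes every code point up to and including the requested one.
dchar nthByForeach(const(char)[] text, size_t pos)
{
    size_t cur = 0;
    foreach (dchar cp; text)
    {
        if (cur == pos)
            return cp;
        ++cur;
    }
    return dchar.init;
}

void main()
{
    string text = "a\uA034bcdef";
    writeln(nthByStride(text, 3));  // c
    writeln(nthByForeach(text, 3)); // c
}
```

Both functions return `dchar.init` for a position past the end of the string, mirroring the convention used in the thread.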
Copyright © 1999-2021 by the D Language Foundation