length of string result not as expected

Aug 14, 2013

jicman

Aug 14, 2013

Jonathan M Davis


Greetings.

import std.stdio;

void main()
{
 char[] str = "不良反應事件和產品客訴報告"; // 13 chinese characters...
 writefln(str.length);
}

this program returns 39.  I expected to return 13.  How do I know the exact length of the characters that I have in a char[] variable?  Thanks.

josé

On Wednesday, August 14, 2013 04:53:34 jicman wrote:
> Greetings.
>
> import std.stdio;
>
> void main()
> {
>  char[] str = "不良反應事件和產品客訴報告"; // 13 chinese characters...
>  writefln(str.length);
> }
>
> this program returns 39. I expected to return 13. How do I know the exact length of the characters that I have in a char[] variable? Thanks.

length gives you the length of the array, which is 39, because it contains 39 chars. If you want to know the number of code points in the string as opposed to the number of code units (char is a UTF-8 code unit), then use std.range.walkLength. e.g.

writeln(walkLength(str));

It'll iterate through the string and count up the number of code points.

- Jonathan M Davis
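For reference, a minimal D2 sketch of the difference between code units and code points described above. It assumes a D2 compiler, so the variable is declared as string rather than char[], and writeln is used instead of writefln:

```d
import std.range : walkLength;
import std.stdio : writeln;

void main()
{
    // D2 string literals are immutable, hence string instead of char[].
    string str = "不良反應事件和產品客訴報告";

    writeln(str.length);      // UTF-8 code units (bytes): 39
    writeln(str.walkLength);  // code points: 13
}
```

Each of these 13 characters takes three UTF-8 code units, which is where 39 comes from.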

On Wednesday, 14 August 2013 at 02:53:43 UTC, jicman wrote:
> know the exact length of the characters that I have in a char[] variable? Thanks.

Your code looks like D1...

in D1 or D2:

import std.utf;
dstring s2 = toUTF32(str);
writeln(s2.length); // 13

in D2 you can do it a little more efficiently like this:

import std.range;
writeln(walkLength(str)); // 13

The reason it shows 39 instead of 13 is that the char[] is UTF-8, and Chinese characters are multi-byte characters in UTF-8. The .length property gives the number of elements in the array, which are bytes in UTF-8.

dstring uses UTF-32, which has a consistent size for each code point. Which isn't technically quite the same as a character actually, but close enough that it works here.

Bottom line though, char[] for non-English text tends to have a longer length than you expect because a lot of characters are multi-byte in UTF-8. If you use dstring, the length is more consistent.
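To illustrate the remark that a code point is not quite the same thing as a character, here is a small D2 sketch. It assumes a Phobos release that ships std.uni.byGrapheme (not available in D1 or in older D2 releases), and uses a combining accent as the example:

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
    string s = "e\u0301";

    writeln(s.length);                 // UTF-8 code units: 3
    writeln(s.walkLength);             // code points: 2
    writeln(s.byGrapheme.walkLength);  // graphemes (user-perceived characters): 1
}
```

For the Chinese text in this thread every character is a single code point, so the code point count and the grapheme count are both 13.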

On Wednesday, 14 August 2013 at 03:00:00 UTC, Jonathan M Davis wrote:
> On Wednesday, August 14, 2013 04:53:34 jicman wrote:
>> Greetings.
>>
>> import std.stdio;
>>
>> void main()
>> {
>>  char[] str = "不良反應事件和產品客訴報告"; // 13 chinese characters...
>>  writefln(str.length);
>> }
>>
>> this program returns 39. I expected to return 13. How do I know the exact length of the characters that I have in a char[] variable? Thanks.
>
> length gives you the length of the array, which is 39, because it contains 39 chars. If you want to know the number of code points in the string as opposed to the number of code units (char is a UTF-8 code unit), then use std.range.walkLength. e.g.
>
> writeln(walkLength(str));
>
> It'll iterate through the string and count up the number of code points.
>
> - Jonathan M Davis

thanks, Jonathan. That looks like D2, since D1 does not have std.range in its phobos library.

On Wednesday, 14 August 2013 at 02:53:43 UTC, jicman wrote:
>
> Greetings.
>
> import std.stdio;
>
> void main()
> {
>  char[] str = "不良反應事件和產品客訴報告"; // 13 chinese characters...
>  writefln(str.length);
> }
>
> this program returns 39. I expected to return 13. How do I know the exact length of the characters that I have in a char[] variable? Thanks.
>
> josé

What version of DMD are you using? This code doesn't even compile for me. It gives me errors about not being able to convert type string to char[], like it should since a string literal is immutable data. To test the code I changed char[] to string. I also got an error for "writefln(str.length);" so I just changed that to "writeln(str.length);"

Anyways, from what I understand, the reason you get this is because each of those characters needs more than a single 8-bit representation. D's chars are UTF-8, so that means it takes more than a single char to store the data needed to represent one of the Chinese characters. str.length will give you the length of the string with respect to each char it contains. You have 13 characters in your string, but you need 39 chars to store the data to represent them.

Alternatively, you can use a different encoding to see the actual number of characters in your string, e.g. wstring or dstring. I usually use dstrings when working with Unicode personally.
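A rough D2 sketch of the encodings mentioned above, using one character from the original string plus a character outside the Basic Multilingual Plane (U+1D11E, chosen here only for illustration). It suggests that wstring lengths match the character count for this particular text, but that only dstring has one element per code point in general:

```d
import std.stdio : writefln, writeln;

void main()
{
    string  s8  = "不";   // UTF-8
    wstring s16 = "不"w;  // UTF-16
    dstring s32 = "不"d;  // UTF-32

    // The three UTF-8 code units that encode U+4E0D.
    writefln("%(%02X %)", cast(immutable(ubyte)[]) s8); // E4 B8 8D

    writeln(s8.length);   // 3
    writeln(s16.length);  // 1
    writeln(s32.length);  // 1

    // A code point above U+FFFF needs a surrogate pair in UTF-16.
    wstring clef16 = "\U0001D11E"w;
    dstring clef32 = "\U0001D11E"d;
    writeln(clef16.length);  // 2
    writeln(clef32.length);  // 1

    // If a mutable char[] is really needed in D2, copy the literal.
    char[] buf = "不".dup;
    writeln(buf.length);     // 3
}
```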

August 14, 2013

Re: length of string result not as expected

Posted by jicman
in reply to Jeremy DeHaan


On Wednesday, 14 August 2013 at 03:16:08 UTC, Jeremy DeHaan wrote:
> On Wednesday, 14 August 2013 at 02:53:43 UTC, jicman wrote:
>>
>> Greetings.
>>
>> import std.stdio;
>>
>> void main()
>> {
>>  char[] str = "不良反應事件和產品客訴報告"; // 13 chinese characters...
>>  writefln(str.length);
>> }
>>
>> this program returns 39.  I expected to return 13.  How do I know the exact length of the characters that I have in a char[] variable?  Thanks.
>>
>> josé
>
> What version of DMD are you using? This code doesn't even compile for me. It gives me errors about not being able to convert type string to char[], like it should since a string literal is immutable data. To test the code I changed char[] to string. I also got an error for "writefln(str.length);" so I just changed that to "writeln(str.length);"
>
> Anyways, from what I understand, the reason you get this is because each of those characters needs more than a single 8-bit representation. D's chars are UTF-8, so that means it takes more than a single char to store the data needed to represent one of the Chinese characters. str.length will give you the length of the string with respect to each char it contains. You have 13 characters in your string, but you need 39 chars to store the data to represent them.
>
> Alternatively,  you can use a different encoding to see the actual number of characters in your string, eg. wstring or dstring. I usually use dstrings when working with unicode personally.

This is D1. Forgot to mention that. I am still in the old ages. :-) Thanks for the insight. I figured that much, but now I need to go and figure out how to handle the Western character set as well as Asian, Hebrew, etc. Thanks.

On 2013-08-14 05:05, Adam D. Ruppe wrote:
> Your code looks like D1...
>
> in D1 or D2:
> import std.utf;
> dstring s2 = toUTF32(str);
> writeln(s2.length); // 13
>
> in D2 you can do it a little more efficiently like this:
>
> import std.range;
> writeln(walkLength(str)); // 13

In D1 you can easily implement walkLength yourself:

import std.utf;

// Counts code points by jumping from one UTF sequence to the next.
size_t walkLength (C) (C[] arr)
{
    size_t i;
    size_t len;

    while (i < arr.length)
    {
        // stride returns the number of code units in the sequence starting at i.
        i += arr.stride(i);
        len++;
    }

    return len;
}

void main ()
{
    auto a = "不良反應事件和產品客訴報告";
    assert(walkLength(a) == 13);
}

-- 
/Jacob Carlborg
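To make the role of stride in the loop above more concrete, here is a small D2 sketch that records the stride of each code point in a mixed string. The helper name codePointWidths and the sample string are made up for illustration; only std.utf.stride comes from Phobos:

```d
import std.stdio : writeln;
import std.utf : stride;

// Hypothetical helper: the number of UTF-8 code units used by each code point.
size_t[] codePointWidths(string s)
{
    size_t[] widths;
    size_t i = 0;

    while (i < s.length)
    {
        immutable n = stride(s, i); // length of the sequence starting at i
        widths ~= n;
        i += n;
    }

    return widths;
}

void main()
{
    // 'a' (U+0061), 'é' (U+00E9), '不' (U+4E0D), musical G clef (U+1D11E)
    writeln(codePointWidths("a\u00E9\u4E0D\U0001D11E")); // [1, 2, 3, 4]
}
```

The length of the returned array is the code point count, which is what the hand-written walkLength computes without storing the widths.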
