Thread overview
Best way to count character spaces.
Jun 30, 2015
Taylor Hillegeist
Jul 01, 2015
Rikki Cattermole
Jul 01, 2015
H. S. Teoh
June 30, 2015
So I am aware that Unicode is not simple... I have been working on a project similar to boxes: http://boxes.thomasjensen.com/

It basically puts a pretty border around the characters read from stdin, like so:
 ________________________
/\                       \
\_|Different all twisty a|
  |of in maze are you,   |
  |passages little.      |
  |   ___________________|_
   \_/_____________________/

But I find that I need to know a bit more than the length of the string, because of encoding differences.

I had a thought at one point to do this:

MyString.splitLines.map!(a => a.toUTF32.length).reduce!max();

Should get me the longest line.
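
For reference, a minimal sketch of what the different lengths measure. The helper name codePointCount and the sample word are illustrative only and not part of any library. On a string, .length counts UTF-8 code units, while toUTF32.length (or walkLength, which decodes without making a copy) counts code points.

```d
import std.range : walkLength;
import std.utf : toUTF32;

/// Hypothetical helper: number of code points in `s`.
size_t codePointCount(string s)
{
    return s.walkLength; // a string is decoded to dchar as it is walked
}

void main()
{
    import std.stdio : writeln;

    string s = "na\u00EFve";     // U+00EF needs two bytes in UTF-8
    writeln(s.length);           // UTF-8 code units
    writeln(s.toUTF32.length);   // code points, via a UTF-32 copy
    writeln(codePointCount(s));  // code points, no copy
}
```

Neither number says anything about how many columns the text occupies on screen.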

But this has a problem too, because control characters might not take up any space (backspace?), leaving an unwanted nasty space :( or might take up a weird amount of space (\t). Perhaps the first isn't really something to worry about.

https://en.wikipedia.org/wiki/Unicode_control_characters

Or should I do something like:

MyString.splitLines
    .map!(line => line.count!isGraphical) // code points accepted by std.uni.isGraphical
    .reduce!max;

Mostly I am just curious about best practice in this situation.

Both of the above fail with the input:
"hello \n People \nP\u0008ofEARTH"
on my command prompt at least.
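
To illustrate why that input trips up both counts, here is a rough sketch that tracks the cursor column instead of counting characters. Everything in it is an assumption for illustration: the function name displayColumns is made up, backspace is assumed to move the cursor one column left, tab stops are assumed to sit every 8 columns, every other control character is assumed to print nothing, and every remaining code point is assumed to be one column wide. Real terminals may behave differently.

```d
import std.algorithm : max;
import std.uni : isControl;

/// Hypothetical helper: rough column count for one line,
/// under the assumptions listed above.
size_t displayColumns(string line, size_t tabWidth = 8)
{
    size_t col = 0;     // current cursor column
    size_t widest = 0;  // furthest column the cursor has reached
    foreach (dchar c; line)
    {
        if (c == '\b')
        {
            if (col > 0)
                --col;                         // backspace: step left, print nothing
        }
        else if (c == '\t')
            col += tabWidth - col % tabWidth;  // tab: jump to the next tab stop
        else if (!isControl(c))
            ++col;                             // ordinary code point: one column
        widest = max(widest, col);
    }
    return widest;
}

void main()
{
    import std.stdio : writeln;
    writeln(displayColumns("P\u0008ofEARTH"));
}
```

With these assumptions the third line of the input above occupies 7 columns, because the backspace lets "o" overwrite "P".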


July 01, 2015
On 1/07/2015 6:33 a.m., Taylor Hillegeist wrote:
> So I am aware that Unicode is not simple... I have been working on a
> boxes like project http://boxes.thomasjensen.com/
>
> it basically puts a pretty border around stdin characters. like so:
>   ________________________
> /\                       \
> \_|Different all twisty a|
>    |of in maze are you,   |
>    |passages little.      |
>    |   ___________________|_
>     \_/_____________________/
>
> but I find that I need to know a bit more than the length of the string
> because of encoding differences
>
> I had a thought at one point to do this:
>
> MyString.splitlines.map!(a => a.toUTF32.length).reduce!max();
>
> Should get me the longest line.
>
> but this has a problem too because control characters might not take up
> space (backspace?).
>
> https://en.wikipedia.org/wiki/Unicode_control_characters
>
> leaving an unwanted nasty space :( or take weird amount of space \t. And
> perhaps the first isn't really something to worry about.
>
> Or should i do something like:
>
> MyString.splitLines
>          .map!(a => a
>                .map!(a => a
>                      .isGraphical)
>                .map!(a => cast(int) a?1:0)
>                .array
>                .reduce!((a,b) => a+b))
>          .reduce!max
>
> Mostly I am just curious of best practice in this situation.
>
> Both of the above fail with the input:
> "hello \n People \nP\u0008ofEARTH"
> on my command prompt at least.


Well I would personally use isWhite[0].
I would also use filter and count along with it.

So something like this:
size_t[] lengths = MyString.splitLines
    .map!(line => line.filter!isWhite.count) // one count per line
    .array;

Untested of course, but may give you ideas :)

[0] http://dlang.org/phobos/std_uni.html#.isWhite


July 01, 2015
On Tue, Jun 30, 2015 at 06:33:32PM +0000, Taylor Hillegeist via Digitalmars-d-learn wrote:
> So I am aware that Unicode is not simple... I have been working on a boxes like project http://boxes.thomasjensen.com/
> 
> it basically puts a pretty border around stdin characters. like so:
>  ________________________
> /\                       \
> \_|Different all twisty a|
>   |of in maze are you,   |
>   |passages little.      |
>   |   ___________________|_
>    \_/_____________________/
> 
> but I find that I need to know a bit more than the length of the string because of encoding differences
[...]

Use std.uni.byGrapheme. That's the only reliable way to count anything remotely resembling the display length of the string. Display length is not to be confused with the number of code points, which in turn is different from the length of the string in bytes or the number of code units.
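
Something along these lines, as a minimal sketch. The function name longestLineInGraphemes is made up for illustration, and the seed of 0 is my own choice so that empty input does not make reduce throw.

```d
import std.algorithm : map, max, reduce;
import std.range : walkLength;
import std.string : splitLines;
import std.uni : byGrapheme;

/// Hypothetical helper: length, in graphemes, of the longest line of `text`
/// (0 if there are no lines).
size_t longestLineInGraphemes(string text)
{
    return reduce!max(size_t(0),
        text.splitLines.map!(line => line.byGrapheme.walkLength));
}

void main()
{
    import std.stdio : writeln;

    // "e" followed by U+0308 (combining diaeresis) is two code points
    // but a single grapheme, so the second line counts as 4.
    writeln(longestLineInGraphemes("ab\nnoe\u0308l"));
}
```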

Note that even with byGrapheme, you may still need some post-processing, because certain terminals render East Asian wide characters (CJK ideographs, for example) in double width, meaning that one grapheme takes up two columns on the screen. But byGrapheme should get you started on the right footing.
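
A rough sketch of that post-processing step follows. Both function names are hypothetical, and the ranges in isWideApprox are a hand-picked approximation of the commonly double-width blocks, not the full Unicode East Asian Width data; whether a given terminal actually draws them two columns wide is also an assumption.

```d
import std.uni : byGrapheme;

/// Hypothetical helper: true for a few well-known wide ranges only.
bool isWideApprox(dchar c)
{
    return (c >= 0x1100 && c <= 0x115F)    // Hangul Jamo leading consonants
        || (c >= 0x2E80 && c <= 0xA4CF && c != 0x303F) // CJK radicals through Yi, incl. kana and ideographs
        || (c >= 0xAC00 && c <= 0xD7A3)    // Hangul syllables
        || (c >= 0xF900 && c <= 0xFAFF)    // CJK compatibility ideographs
        || (c >= 0xFF00 && c <= 0xFF60)    // fullwidth forms
        || (c >= 0xFFE0 && c <= 0xFFE6);   // fullwidth signs
}

/// Hypothetical helper: estimated columns for one line, counting a grapheme
/// as 2 columns when its first code point looks wide, and 1 otherwise.
size_t approxColumns(string line)
{
    size_t cols = 0;
    foreach (g; line.byGrapheme)
        cols += isWideApprox(g[0]) ? 2 : 1;
    return cols;
}

void main()
{
    import std.stdio : writeln;
    writeln(approxColumns("\u65E5\u672C")); // two CJK ideographs
}
```

Only the first code point of each grapheme is inspected, on the assumption that any combining marks after it do not change the width.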


T

-- 
If the comments and the code disagree, it's likely that *both* are wrong. -- Christopher


July 01, 2015
On 7/1/15 1:25 AM, H. S. Teoh via Digitalmars-d-learn wrote:
> On Tue, Jun 30, 2015 at 06:33:32PM +0000, Taylor Hillegeist via Digitalmars-d-learn wrote:
>> So I am aware that Unicode is not simple... I have been working on a boxes
>> like project http://boxes.thomasjensen.com/
>>
>> it basically puts a pretty border around stdin characters. like so:
>>   ________________________
>> /\                       \
>> \_|Different all twisty a|
>>    |of in maze are you,   |
>>    |passages little.      |
>>    |   ___________________|_
>>     \_/_____________________/
>>
>> but I find that I need to know a bit more than the length of the string
>> because of encoding differences
> [...]
>
> Use std.uni.byGrapheme. That's the only reliable way to count anything
> remotely resembling the display length of the string, which is not to be
> confused with the number of code points, which is also different from
> the length of the string in bytes or the number of code units.
>
> Note that even with byGrapheme, you may still need some post-processing,
> because certain terminals may output Asian block characters in double
> width, meaning that 1 grapheme takes up two columns on the screen. But
> byGrapheme should get you started on the right footing.
>
>

BTW, this exercise would make an EXCELLENT blog post highlighting both the power of D's Unicode support and the hairy issues of Unicode.

I like the ASCII, er... Unicode art concept :)

-Steve