what's the correct way to handle unicode? - trying to print out graphemes here.
July 03, 2018
Hi, trying to figure out how to loop through a string of characters and then spit them back out.

Eg:

foreach (c; "👩‍👩‍👦‍👦🏳️‍🌈") {
  writeln(c);
}

So basically the above just doesn't work. Prints gibberish.

So I figured, std.uni.byGrapheme would help, since that's what they are, but I can't get it to print them back out? Is there a way?

foreach (c; "👩‍👩‍👦‍👦🏳️‍🌈".byGrapheme) {
  writeln(c.<????>);
}

And then if I type the loop variable as dchar, then it seems that the family emoji is printed out as 4 faces - so the code points I guess - and the rainbow flag is other stuff (also its code points I assume).

Is there a type that I can use to store graphemes and then output them as a grapheme as well? Or do I have to use like lib ICU maybe or something similar?

Cheers,
- Ali
July 03, 2018
On Tuesday, 3 July 2018 at 13:32:52 UTC, aliak wrote:
> Hi, trying to figure out how to loop through a string of characters and then spit them back out.
>
> Eg:
>
> foreach (c; "👩‍👩‍👦‍👦🏳️‍🌈") {
>   writeln(c);
> }
>
> So basically the above just doesn't work. Prints gibberish.
>
> So I figured, std.uni.byGrapheme would help, since that's what they are, but I can't get it to print them back out? Is there a way?
>
> foreach (c; "👩‍👩‍👦‍👦🏳️‍🌈".byGrapheme) {
>   writeln(c.<????>);
> }
>
> And then if I type the loop variable as dchar, then it seems that the family emoji is printed out as 4 faces - so the code points I guess - and the rainbow flag is other stuff (also its code points I assume).
>
> Is there a type that I can use to store graphemes and then output them as a grapheme as well? Or do I have to use like lib ICU maybe or something similar?
>
> Cheers,
> - Ali

Hehe I guess the forum really is using D :p

The two graphemes I'm talking about (which seem to not be rendered correctly above) are:

family emoji: https://emojipedia.org/family-woman-woman-boy-boy/
rainbow flag: https://emojipedia.org/rainbow-flag/

July 03, 2018
On 7/3/18 9:32 AM, aliak wrote:
> Hi, trying to figure out how to loop through a string of characters and then spit them back out.
> 
> Eg:
> 
> foreach (c; "👩‍👩‍👦‍👦🏳️‍🌈") {
>    writeln(c);
> }
> 
> So basically the above just doesn't work. Prints gibberish.
> 
> So I figured, std.uni.byGrapheme would help, since that's what they are, but I can't get it to print them back out? Is there a way?
> 
> foreach (c; "👩‍👩‍👦‍👦🏳️‍🌈".byGrapheme) {
>    writeln(c.<????>);
> }
> 
> And then if I type the loop variable as dchar, then it seems that the family emoji is printed out as 4 faces - so the code points I guess - and the rainbow flag is other stuff (also its code points I assume).

Yeah, it appears that you can't actually print a grapheme. I would have assumed writeln(c) works. It does compile and run, but it prints the struct data instead of converting back to UTF.

> Is there a type that I can use to store graphemes and then output them as a grapheme as well? Or do I have to use like lib ICU maybe or something similar?

I honestly can't figure it out. I think directly writing graphemes as viewable UTF was not something that was considered.

Definitely needs a bugzilla issue.

-Steve
July 03, 2018
On Tuesday, 3 July 2018 at 13:32:52 UTC, aliak wrote:
> So basically the above just doesn't work. Prints gibberish.

What system are you on? Successfully printing this stuff depends on a lot of display details too, like writeln goes to a terminal/console and they are rarely configured to support such characters by default.

You might actually be better off printing it to a file instead of to a display, then opening that file in your browser or something, just to confirm the code printed is correctly displayed by the other program.
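
A minimal sketch of the file-based check suggested above, assuming a writable working directory. The file name `graphemes.txt` is hypothetical, and the sample string uses a combining accent rather than the emoji from the original post:

```d
import std.file : readText, write;

void main()
{
    // Hypothetical output file name; any writable path works.
    enum path = "graphemes.txt";

    // 'e' + U+0301 (combining acute accent), written as raw UTF-8 bytes.
    string s = "e\u0301\n";
    write(path, s);

    // Reading it back confirms the bytes on disk match the string.
    assert(readText(path) == s);
}
```

Opening the resulting file in a browser or editor then shows whether the text itself is fine and only the terminal is failing to render it.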

> foreach (c; "👩‍👩‍👦‍👦🏳️‍🌈".byGrapheme) {
>   writeln(c.<????>);

prolly just printing `c` itself would work and if not try `c[]`

but then again it might see it as multiple graphemes, idk if it is even implemented.
July 03, 2018
On Tuesday, 3 July 2018 at 13:32:52 UTC, aliak wrote:
> foreach (c; "👩‍👩‍👦‍👦🏳️‍🌈") {
>   writeln(c);
> }
>
> So basically the above just doesn't work. Prints gibberish.

Because you're printing one UTF-8 code unit (`char`) per line.
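
To make the distinction concrete, here is a small sketch that counts the same string three ways. It uses a combining accent instead of the emoji from the post, on the assumption that any `std.uni` version groups that sequence into one grapheme:

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0301 (combining acute accent): renders as one "é".
    string s = "e\u0301";

    writeln(s.length);                // 3: UTF-8 code units (char)
    writeln(s.walkLength);            // 2: code points (dchar)
    writeln(s.byGrapheme.walkLength); // 1: grapheme cluster
}
```

A plain `foreach (c; s)` iterates at the first level (code units), which is why each line of output is a fragment of a multi-byte sequence.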

> So I figured, std.uni.byGrapheme would help, since that's what they are, but I can't get it to print them back out? Is there a way?
>
> foreach (c; "👩‍👩‍👦‍👦🏳️‍🌈".byGrapheme) {
>   writeln(c.<????>);
> }

You're looking for `c[]`. But that won't work, because std.uni apparently doesn't recognize those as grapheme clusters. The emojis may be too new. std.uni is based on Unicode version 6.2, which is a couple years old.
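
A minimal sketch of the `c[]` approach, using a sequence that `std.uni` is known to group (a base letter plus a combining mark) rather than the emoji above. Whether the emoji sequences are grouped depends on the Unicode version your Phobos release implements:

```d
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // "noël" spelled with 'e' + U+0308 (combining diaeresis):
    // 5 code points, 4 graphemes.
    foreach (g; "noe\u0308l".byGrapheme)
    {
        // g[] is a range over the grapheme's code points (dchar);
        // writeln encodes them back to UTF-8 on output.
        writeln(g[]);
    }
}
```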
July 03, 2018
On Tuesday, 3 July 2018 at 13:36:56 UTC, aliak wrote:
> Hehe I guess the forum really is using D :p
>
> The two graphemes I'm talking about (which seem to not be rendered correctly above) are:
>
> family emoji: https://emojipedia.org/family-woman-woman-boy-boy/
> rainbow flag: https://emojipedia.org/rainbow-flag/

Looks like forum.dlang.org has a problem when they appear side by side.

Works (in the preview): 👩‍👩‍👦‍👦 🏳️‍🌈
Doesn't work: 👩‍👩‍👦‍👦🏳️‍🌈
July 03, 2018
On 7/3/18 10:37 AM, ag0aep6g wrote:
> On Tuesday, 3 July 2018 at 13:32:52 UTC, aliak wrote:
>> foreach (c; "👩‍👩‍👦‍👦🏳️‍🌈") {
>>   writeln(c);
>> }
>>
>> So basically the above just doesn't work. Prints gibberish.
> 
> Because you're printing one UTF-8 code unit (`char`) per line.
> 
>> So I figured, std.uni.byGrapheme would help, since that's what they are, but I can't get it to print them back out? Is there a way?
>>
>> foreach (c; "👩‍👩‍👦‍👦🏳️‍🌈".byGrapheme) {
>>   writeln(c.<????>);
>> }
> 
> You're looking for `c[]`. But that won't work, because std.uni apparently doesn't recognize those as grapheme clusters. The emojis may be too new. std.uni is based on Unicode version 6.2, which is a couple years old.

Oops! I didn't realize this, ignore my message about reporting a bug.

I still think it's very odd for printing a grapheme to print the data structure.

-Steve
July 04, 2018
On Tuesday, 3 July 2018 at 14:43:37 UTC, Steven Schveighoffer wrote:
> On 7/3/18 10:37 AM, ag0aep6g wrote:
>> On Tuesday, 3 July 2018 at 13:32:52 UTC, aliak wrote:
>>> foreach (c; "👩‍👩‍👦‍👦🏳️‍🌈") {
>>>   writeln(c);
>>> }
>>>
>>> So basically the above just doesn't work. Prints gibberish.
>> 
>> Because you're printing one UTF-8 code unit (`char`) per line.
>> 
>>> So I figured, std.uni.byGrapheme would help, since that's what they are, but I can't get it to print them back out? Is there a way?
>>>
>>> foreach (c; "👩‍👩‍👦‍👦🏳️‍🌈".byGrapheme) {
>>>   writeln(c.<????>);
>>> }
>> 
>> You're looking for `c[]`. But that won't work, because std.uni apparently doesn't recognize those as grapheme clusters. The emojis may be too new. std.uni is based on Unicode version 6.2, which is a couple years old.
>
> Oops! I didn't realize this, ignore my message about reporting a bug.
>
> I still think it's very odd for printing a grapheme to print the data structure.
>
> -Steve


Aha, ok I see. Many gracias!

Though, it seems that by "a couple years old" you mean 6 years! :) Is updating the Unicode stuff to the latest a matter of some config file somewhere with the code point configurations that result in specific graphemes? Feels kinda ... quite bad that we're 6 years behind the current standard.

Also, is there any reason (technical or otherwise) that we have to slice a grapheme to get it printed? Or has just no one implemented something like toString or the like? It's quite unintuitive as it is right now IMO. I can't really imagine anyone figuring out that they have to slice a grapheme to get it to print 🤔
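
In the absence of a built-in `toString`, one possible workaround is a small wrapper. The sketch below assumes a helper named `graphemeToString`, which is hypothetical and not part of Phobos:

```d
import std.algorithm.iteration : map;
import std.array : array;
import std.conv : to;
import std.stdio : writeln;
import std.uni : byGrapheme, Grapheme;

// Hypothetical helper (not part of Phobos): re-encode one grapheme as UTF-8.
string graphemeToString(Grapheme g)
{
    // g[] is a range of dchar; collect it, then transcode to a UTF-8 string.
    return g[].array.to!string;
}

void main()
{
    // Two graphemes: "é" (as 'e' + U+0301) and "x".
    foreach (str; "e\u0301x".byGrapheme.map!graphemeToString)
        writeln(str);
}
```

Returning a `string` also gives a type that can be stored and printed later, at the cost of one allocation per grapheme.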

Cheers,
- Ali
July 04, 2018
On Tuesday, 3 July 2018 at 14:37:32 UTC, Adam D. Ruppe wrote:
> On Tuesday, 3 July 2018 at 13:32:52 UTC, aliak wrote:
>> [...]
>
> What system are you on? Successfully printing this stuff depends on a lot of display details too, like writeln goes to a terminal/console and they are rarely configured to support such characters by default.
>
> You might actually be better off printing it to a file instead of to a display, then opening that file in your browser or something, just to confirm the code printed is correctly displayed by the other program.
>
>>   [...]
>
> prolly just printing `c` itself would work and if not try `c[]`
>
> but then again it might see it as multiple graphemes, idk if it is even implemented.

Just 'c' didn't work, but 'c[]' seems like the thing to do! Thankies!

Terminal on osx, and yeah, you're right. Seems like just trying to paste the rainbow flag right into the terminal results in the 3 separate code points.

July 04, 2018
On 07/04/2018 05:12 PM, aliak wrote:
> Is updating unicode stuff to the latest a matter of some config file
> somewhere with the code point configurations that result in specific
> graphemes?

I don't know.

[...]
> Also, any reason (technical or otherwise) that we have to slice a grapheme to get it printed? Or just no one implemented something like
> toString or the like?

I don't know.

[...]
> I can't really imagine anyone figuring out that they have to slice a
> grapheme to get it to print 🤔

You can figure it out by reading the documentation for `Grapheme`.
However, the documentation doesn't make it clear that `byGrapheme` is a
range of `Grapheme`s. That's an easy fix, though:

https://github.com/dlang/phobos/pull/6627
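
As a quick compile-time check of that relationship, the following sketch asserts that the element type of a `byGrapheme` range is `Grapheme`. It assumes nothing beyond `std.range.ElementType` and the two `std.uni` names already mentioned in this thread:

```d
import std.range : ElementType;
import std.uni : byGrapheme, Grapheme;

// byGrapheme yields Grapheme values, so Grapheme's documented API
// (length, indexing, slicing with []) applies to each element.
static assert(is(ElementType!(typeof("abc".byGrapheme)) == Grapheme));

void main()
{
    Grapheme g = "abc".byGrapheme.front;
    assert(g.length == 1); // one code point in the first grapheme
}
```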