July 03, 2018 what's the correct way to handle unicode? - trying to print out graphemes here. | ||||
---|---|---|---|---|
| ||||
Hi, I'm trying to figure out how to loop through a string of characters and then print them back out.
E.g.:
    foreach (c; "👩👩👦👦🏳️🌈") {
        writeln(c);
    }
So basically the above just doesn't work. It prints gibberish.
So I figured std.uni.byGrapheme would help, since that's what they are, but I can't get it to print them back out. Is there a way?
    foreach (c; "👩👩👦👦🏳️🌈".byGrapheme) {
        writeln(c.<????>);
    }
And if I type the loop variable as dchar, the family emoji is printed out as 4 faces (so the code points, I guess) and the rainbow flag as other stuff (also its code points, I assume).
Is there a type that I can use to store graphemes and then output them as graphemes as well? Or do I have to use something like ICU?
Cheers,
- Ali
|
July 03, 2018 Re: what's the correct way to handle unicode? - trying to print out graphemes here. | ||||
---|---|---|---|---|
| ||||
Posted in reply to aliak | On Tuesday, 3 July 2018 at 13:32:52 UTC, aliak wrote:
> [...]
Hehe, I guess the forum really is using D :p
The two graphemes I'm talking about (which seem not to be rendered correctly above) are:
family emoji: https://emojipedia.org/family-woman-woman-boy-boy/
rainbow flag: https://emojipedia.org/rainbow-flag/
|
July 03, 2018 Re: what's the correct way to handle unicode? - trying to print out graphemes here. | ||||
---|---|---|---|---|
| ||||
Posted in reply to aliak | On 7/3/18 9:32 AM, aliak wrote:
> Hi, trying to figure out how to loop through a string of characters and then spit them back out.
>
> Eg:
>
> foreach (c; "👩👩👦👦🏳️🌈") {
> writeln(c);
> }
>
> So basically the above just doesn't work. Prints gibberish.
>
> So I figured, std.uni.byGrapheme would help, since that's what they are, but I can't get it to print them back out? Is there a way?
>
> foreach (c; "👩👩👦👦🏳️🌈".byGrapheme) {
> writeln(c.<????>);
> }
>
> And then if I type the loop variable as dchar, then it seems that the family emoji is printed out as 4 faces - so the code points I guess - and the rainbow flag is other stuff (also its code points I assume)
Yeah, it appears that you can't actually print a grapheme. I would have assumed writeln(c) works. It does run, but it prints the struct data instead of converting back to UTF.
> Is there a type that I can use to store graphemes and then output them as a grapheme as well? Or do I have to use like lib ICU maybe or something similar?
I honestly can't figure it out. I think directly writing graphemes as viewable UTF was not something that was considered. Definitely needs a Bugzilla issue.
-Steve
|
July 03, 2018 Re: what's the correct way to handle unicode? - trying to print out graphemes here. | ||||
---|---|---|---|---|
| ||||
Posted in reply to aliak | On Tuesday, 3 July 2018 at 13:32:52 UTC, aliak wrote:
> So basically the above just doesn't work. Prints gibberish.
What system are you on? Successfully printing this stuff depends on a lot of display details too: writeln goes to a terminal/console, and those are rarely configured to support such characters by default.
You might actually be better off printing it to a file instead of to a display, then opening that file in your browser or something, just to confirm that the other program displays the printed output correctly.
> foreach (c; "👩👩👦👦🏳️🌈".byGrapheme) {
> writeln(c.<????>);
prolly just printing `c` itself would work, and if not, try `c[]`
but then again it might see it as multiple graphemes, idk if it is even implemented.
|
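The file-based check suggested above could look roughly like this. It is only a sketch: `dumpToFile` and the file name are made up for illustration, and the test string is "noël" spelled with a combining diaeresis rather than the emoji from the question.

```d
import std.file : readText, write;

// Hypothetical helper: writes `s` to `path` as raw UTF-8, bypassing the terminal.
void dumpToFile(string path, string s)
{
    write(path, s);
}

void main()
{
    // The file name is an arbitrary choice; open it in a browser or editor
    // that is known to render Unicode well.
    dumpToFile("unicode-check.txt", "noe\u0308l\n");

    // Reading the file back confirms the bytes survived unchanged.
    assert(readText("unicode-check.txt") == "noe\u0308l\n");
}
```

If the file looks right in another program but the terminal output does not, the problem is in the terminal's rendering and not in the D code.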
July 03, 2018 Re: what's the correct way to handle unicode? - trying to print out graphemes here. | ||||
---|---|---|---|---|
| ||||
Posted in reply to aliak | On Tuesday, 3 July 2018 at 13:32:52 UTC, aliak wrote:
> foreach (c; "👩👩👦👦🏳️🌈") {
> writeln(c);
> }
>
> So basically the above just doesn't work. Prints gibberish.
Because you're printing one UTF-8 code unit (`char`) per line.
> So I figured, std.uni.byGrapheme would help, since that's what they are, but I can't get it to print them back out? Is there a way?
>
> foreach (c; "👩👩👦👦🏳️🌈".byGrapheme) {
> writeln(c.<????>);
> }
You're looking for `c[]`. But that won't work, because std.uni apparently doesn't recognize those as grapheme clusters. The emojis may be too new. std.uni is based on Unicode version 6.2, which is a couple years old.
|
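To make the three levels concrete, here is a minimal sketch that counts code units, code points and grapheme clusters for one string. The helper name `counts` is hypothetical, and the sample is "noël" written with a combining diaeresis, which is old enough that std.uni should handle it whichever Unicode version it is built on.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helper: counts UTF-8 code units, code points and grapheme clusters.
size_t[3] counts(string s)
{
    size_t units, points;
    foreach (char c; s) ++units;    // what a plain foreach over a string iterates
    foreach (dchar c; s) ++points;  // typing the variable as dchar decodes code points
    return [units, points, s.byGrapheme.walkLength];
}

void main()
{
    // "noël" written as e + U+0308 (combining diaeresis)
    writeln(counts("noe\u0308l")); // [6, 5, 4]
}
```

Printing one `char` per line therefore splits every multi-byte character into fragments that are not valid UTF-8 on their own, which is where the gibberish comes from.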
July 03, 2018 Re: what's the correct way to handle unicode? - trying to print out graphemes here. | ||||
---|---|---|---|---|
| ||||
Posted in reply to aliak | On Tuesday, 3 July 2018 at 13:36:56 UTC, aliak wrote:
> Hehe I guess the forum really is using D :p
>
> The two graphemes I'm talking about (which seem to not be rendered correctly above) are:
>
> family emoji: https://emojipedia.org/family-woman-woman-boy-boy/
> rainbow flag: https://emojipedia.org/rainbow-flag/
Looks like forum.dlang.org has a problem when they appear side by side.
Works (in the preview): 👩👩👦👦 🏳️🌈
Doesn't work: 👩👩👦👦🏳️🌈
|
July 03, 2018 Re: what's the correct way to handle unicode? - trying to print out graphemes here. | ||||
---|---|---|---|---|
| ||||
Posted in reply to ag0aep6g | On 7/3/18 10:37 AM, ag0aep6g wrote:
> On Tuesday, 3 July 2018 at 13:32:52 UTC, aliak wrote:
>> foreach (c; "👩👩👦👦🏳️🌈") {
>> writeln(c);
>> }
>>
>> So basically the above just doesn't work. Prints gibberish.
>
> Because you're printing one UTF-8 code unit (`char`) per line.
>
>> So I figured, std.uni.byGrapheme would help, since that's what they are, but I can't get it to print them back out? Is there a way?
>>
>> foreach (c; "👩👩👦👦🏳️🌈".byGrapheme) {
>> writeln(c.<????>);
>> }
>
> You're looking for `c[]`. But that won't work, because std.uni apparently doesn't recognize those as grapheme clusters. The emojis may be too new. std.uni is based on Unicode version 6.2, which is a couple years old.
Oops! I didn't realize this, ignore my message about reporting a bug.
I still think it's very odd for printing a grapheme to print the data structure.
-Steve
|
July 04, 2018 Re: what's the correct way to handle unicode? - trying to print out graphemes here. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Steven Schveighoffer | On Tuesday, 3 July 2018 at 14:43:37 UTC, Steven Schveighoffer wrote:
> On 7/3/18 10:37 AM, ag0aep6g wrote:
>> On Tuesday, 3 July 2018 at 13:32:52 UTC, aliak wrote:
>>> foreach (c; "👩👩👦👦🏳️🌈") {
>>> writeln(c);
>>> }
>>>
>>> So basically the above just doesn't work. Prints gibberish.
>>
>> Because you're printing one UTF-8 code unit (`char`) per line.
>>
>>> So I figured, std.uni.byGrapheme would help, since that's what they are, but I can't get it to print them back out? Is there a way?
>>>
>>> foreach (c; "👩👩👦👦🏳️🌈".byGrapheme) {
>>> writeln(c.<????>);
>>> }
>>
>> You're looking for `c[]`. But that won't work, because std.uni apparently doesn't recognize those as grapheme clusters. The emojis may be too new. std.uni is based on Unicode version 6.2, which is a couple years old.
>
> Oops! I didn't realize this, ignore my message about reporting a bug.
>
> I still think it's very odd for printing a grapheme to print the data structure.
>
> -Steve
Aha, ok I see. Many gracias!
Though it seems that by "a couple years old" you mean 6 years! :) Is updating the Unicode data to the latest version just a matter of changing some config file somewhere that lists which code point sequences form graphemes? It feels quite bad that we're 6 years behind the current standard.
Also, is there any reason (technical or otherwise) that we have to slice a grapheme to get it printed? Or has no one implemented something like toString? It's quite unintuitive as it is right now, IMO. I can't really imagine anyone figuring out that they have to slice a grapheme to get it to print 🤔
Cheers,
- Ali
|
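On the question of storing graphemes and printing them later, one possible approach is to slice each `Grapheme` and re-encode it as a `string` with `std.conv.text`. This is only a sketch: the helper name `clusters` is hypothetical, and the sample string uses a combining mark because, as noted in the thread, std.uni may not recognize newer emoji sequences as single clusters.

```d
import std.algorithm : map;
import std.array : array;
import std.conv : text;
import std.stdio : writeln;
import std.uni : byGrapheme;

// Hypothetical helper: returns one UTF-8 string per grapheme cluster.
string[] clusters(string s)
{
    // g[] is a range of dchar; text() encodes it back into a string.
    return s.byGrapheme.map!(g => g[].text).array;
}

void main()
{
    // "noël" written as e + U+0308 (combining diaeresis)
    foreach (cluster; clusters("noe\u0308l"))
        writeln(cluster); // one cluster per line
}
```

Plain strings have the advantage that they print, compare and concatenate like any other text, at the cost of an allocation per cluster.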
July 04, 2018 Re: what's the correct way to handle unicode? - trying to print out graphemes here. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Adam D. Ruppe | On Tuesday, 3 July 2018 at 14:37:32 UTC, Adam D. Ruppe wrote:
> On Tuesday, 3 July 2018 at 13:32:52 UTC, aliak wrote:
>> [...]
>
> What system are you on? Successfully printing this stuff depends on a lot of display details too, like writeln goes to a terminal/console and they are rarely configured to support such characters by default.
>
> You might actually be better off printing it to a file instead of to a display, then opening that file in your browser or something, just to confirm the code printed is correctly displayed by the other program.
>
>> [...]
>
> prolly just printing `c` itself would work and if not try `c[]`
>
> but then again it might see it as multiple graphemes, idk if it is even implemented.
Just `c` didn't work, but `c[]` seems like the thing to do! Thankies!
I'm using Terminal on OS X, and yeah, you're right. Just pasting the rainbow flag straight into the terminal shows it as 3 separate code points.
|
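To see which code points a sequence actually contains, independent of how a terminal renders it, something like the following sketch can help. The helper name `codePointLabels` is hypothetical; the rainbow flag is spelled out with escapes so that the source file does not depend on editor or forum rendering.

```d
import std.format : format;
import std.stdio : writeln;

// Hypothetical helper: labels each code point of `s` in U+XXXX notation.
string[] codePointLabels(string s)
{
    string[] labels;
    foreach (dchar c; s)
        labels ~= format("U+%04X", cast(uint) c);
    return labels;
}

void main()
{
    // Rainbow flag: white flag, variation selector-16, zero-width joiner, rainbow
    writeln(codePointLabels("\U0001F3F3\uFE0F\u200D\U0001F308"));
}
```

Two of the four code points (the variation selector and the joiner) are normally invisible, so a terminal that does not combine the sequence may show fewer visible pieces than there are code points.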
July 04, 2018 Re: what's the correct way to handle unicode? - trying to print out graphemes here. | ||||
---|---|---|---|---|
| ||||
Posted in reply to aliak | On 07/04/2018 05:12 PM, aliak wrote:
> Is updating unicode stuff to the latest a matter of some config file
> somewhere with the code point configurations that result in specific
> graphemes?
I don't know.
[...]
> Also, any reason (technical or otherwise) that we have to slice a grapheme to get it printed? Or just no one implemented something like
> toString or the like?
I don't know.
[...]
> I can't really imagine anyone figuring out that they have to slice a
> grapheme to get it to print 🤔
You can figure it out by reading the documentation for `Grapheme`. However, the documentation doesn't make it clear that `byGrapheme` is a range of `Grapheme`s. That's an easy fix, though:
https://github.com/dlang/phobos/pull/6627
|
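The point that `byGrapheme` yields `Grapheme` values can be checked at compile time, and `std.uni.byCodePoint` flattens such a range back into code points. The sketch below follows the pattern used in the std.uni documentation; the helper name `reverseByGrapheme` is made up for illustration.

```d
import std.array : array;
import std.conv : text;
import std.range : ElementType, retro;
import std.uni : byCodePoint, byGrapheme, Grapheme;

// Hypothetical helper: reverses a string cluster by cluster.
string reverseByGrapheme(string s)
{
    auto graphemes = s.byGrapheme;
    // Each element of the range is a std.uni.Grapheme.
    static assert(is(ElementType!(typeof(graphemes)) == Grapheme));
    // Collect the clusters, reverse their order, flatten to code points, re-encode.
    return graphemes.array.retro.byCodePoint.text;
}

void main()
{
    // The combining diaeresis stays attached to its "e" after reversal.
    assert(reverseByGrapheme("noe\u0308l") == "le\u0308on");
}
```

Reversing by code point instead would move the diaeresis onto the wrong letter, which is the kind of mistake grapheme-level iteration is meant to avoid.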