[Issue 7054] format() aligns using code units instead of graphemes
[Issue 7054] std.format.formattedWrite uses code units count as width instead of characters count
August 21, 2014
https://issues.dlang.org/show_bug.cgi?id=7054

hsteoh@quickfur.ath.cx changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hsteoh@quickfur.ath.cx

--
August 21, 2014
https://issues.dlang.org/show_bug.cgi?id=7054

hsteoh@quickfur.ath.cx changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |13348

--
August 21, 2014
https://issues.dlang.org/show_bug.cgi?id=7054

--- Comment #3 from hsteoh@quickfur.ath.cx ---
Tried to fix this today, unfortunately it's blocked by std.uni.byGrapheme being impure, which causes a ripple of impurity down the call chain causing several unittest compile errors and CTFE errors.

--
August 29, 2014
https://issues.dlang.org/show_bug.cgi?id=7054

Dmitry Olshansky <dmitry.olsh@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |dmitry.olsh@gmail.com

--- Comment #4 from Dmitry Olshansky <dmitry.olsh@gmail.com> ---
(In reply to hsteoh from comment #3)
> Tried to fix this today, unfortunately it's blocked by std.uni.byGrapheme being impure, which causes a ripple of impurity down the call chain causing several unittest compile errors and CTFE errors.

Why should it call byGrapheme? Doesn't seem likely that we are doing grapheme clustering only to output some damn text.

--
August 29, 2014
https://issues.dlang.org/show_bug.cgi?id=7054

--- Comment #5 from hsteoh@quickfur.ath.cx ---
Because grapheme clustering is the only sane way to handle output to a field of fixed length. For example, writefln("%5s", "a\u0301") should treat "a\u0301" as occupying only a single position in the 5-position-wide output field.

Any other solution would introduce further problems, e.g. if we count code points instead, then the width field in the format string would be basically useless (the caller will have to manually count output positions -- with byGrapheme -- and adjust the width accordingly). Furthermore, it would introduce more special cases (precomposed characters will format differently from base char + combining diacritic; non-spacing characters will consume field width but occupy no space in the actual output, etc.).
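
To make the distinction concrete, here is a minimal sketch of padding by grapheme count rather than by code units. It is not the Phobos implementation; `padLeftByGrapheme` is a hypothetical helper name introduced only for illustration, and it assumes one grapheme occupies one output position (which does not hold for double-width characters).

```d
import std.array : replicate;
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical helper (not part of Phobos): right-align `s` in a field
/// that is `width` graphemes wide, padding on the left with spaces.
string padLeftByGrapheme(string s, size_t width)
{
    // Count user-perceived characters, not UTF-8 code units.
    immutable n = s.byGrapheme.walkLength;
    if (n >= width)
        return s;                       // already fills the field
    return replicate(" ", width - n) ~ s;
}

unittest
{
    string s = "a\u0301";                   // 'a' followed by a combining acute accent
    assert(s.length == 3);                  // UTF-8 code units
    assert(s.walkLength == 2);              // code points
    assert(s.byGrapheme.walkLength == 1);   // graphemes
    assert(padLeftByGrapheme(s, 5) == "    a\u0301"); // four spaces of padding
}
```

The unittest block can be run with `dmd -unittest -main`. Padding by `s.length` would add only two spaces here, and padding by code points only three.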

--
February 09, 2016
https://issues.dlang.org/show_bug.cgi?id=7054

Marco Leise <Marco.Leise@gmx.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |Marco.Leise@gmx.de

--- Comment #6 from Marco Leise <Marco.Leise@gmx.de> ---
Graphemes work until you meet full-width characters.
Ｇｒａｐｈｅｍｅｓ　ｗｏｒｋ　ｕｎｔｉｌ　ｙｏｕ　ｍｅｅｔ　ｆｕｌｌ－ｗｉｄｔｈ　ｃｈａｒａｃｔｅｒｓ．

From Wikipedia: "With fixed-width fonts, a halfwidth character occupies half
the width of a fullwidth character, hence the name."

https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

We need UTF decoding, grapheme clustering, character categorizing, super-cow-power width specifiers in our writeln.

--
February 12, 2016
https://issues.dlang.org/show_bug.cgi?id=7054

--- Comment #7 from hsteoh@quickfur.ath.cx ---
Argh. Welcome to Unicode, where exceptions *are* the norm, and no simple algorithm is simple in practice.

And this is a double-argh, because when it comes to double-width characters, whether or not the output will even *look* right depends on what kind of terminal you're using, and how it handles double-width characters. Older terminals may not recognize double-width characters, and such characters may end up formatted as if they were single-width. (But then again, such terminals will already make a big unreadable mess of double-width characters anyway, so perhaps it's not so important to cater to them.)

But once you start down this slippery slope, the next thing that will come up is making `writefln` support right-to-left text, then vertical text, etc., and before you know it, we'll be reinventing libpango except poorly (and for a text terminal where it's questionable whether such things are even relevant anymore).

--
February 12, 2016
https://issues.dlang.org/show_bug.cgi?id=7054

--- Comment #8 from Stewart Gordon <smjg@iname.com> ---
(In reply to Marco Leise from comment #6)
> https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms

So "halfwidth" means the width of a character cell, and "fullwidth" means double that width.  Seems counter-intuitive.  I would have expected them to be something like "singlewidth" and "doublewidth" respectively.

So there are a few different units at work here:
- code units
- codepoints
- graphemes
- width units

A further complication is whether formattedWrite should be geared towards text terminals, writing data to a text file designed for human reading, writing data to a text file that follows a rigid format for machine processing or what.  So it looks like there's no simple solution.  But in 99% of cases, using code units (as it does at the moment) is bound to be wrong.
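
The four units listed above can be compared side by side with a rough sketch. Phobos has no East Asian Width lookup that I am aware of, so `isWideApprox` and `columnWidthApprox` are hypothetical helpers, and the code point ranges are a coarse approximation of the wide and fullwidth blocks, not the complete Unicode table. The sketch also assumes every other grapheme takes exactly one column.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

/// Hypothetical approximation: true for code points that terminals
/// usually render two cells wide. The ranges are deliberately coarse.
bool isWideApprox(dchar c) pure nothrow @nogc @safe
{
    return (c >= 0x1100 && c <= 0x115F)    // Hangul Jamo initial consonants
        || (c >= 0x2E80 && c <= 0xA4CF)    // CJK radicals through Yi
        || (c >= 0xAC00 && c <= 0xD7A3)    // Hangul syllables
        || (c >= 0xF900 && c <= 0xFAFF)    // CJK compatibility ideographs
        || (c >= 0xFE30 && c <= 0xFE6F)    // CJK compatibility forms
        || (c >= 0xFF00 && c <= 0xFF60)    // fullwidth forms
        || (c >= 0xFFE0 && c <= 0xFFE6);   // fullwidth signs
}

/// Hypothetical helper: estimate terminal columns by classifying the
/// first code point of each grapheme as wide (2) or narrow (1).
size_t columnWidthApprox(string s)
{
    size_t cols = 0;
    foreach (g; s.byGrapheme)
        cols += isWideApprox(g[0]) ? 2 : 1;
    return cols;
}

unittest
{
    string cjk = "\u65E5\u672C";            // two CJK ideographs
    assert(cjk.length == 6);                // code units
    assert(cjk.walkLength == 2);            // code points
    assert(cjk.byGrapheme.walkLength == 2); // graphemes
    assert(columnWidthApprox(cjk) == 4);    // width units
}
```

For plain ASCII all four counts agree, which is why the mismatch went unnoticed in the printf era.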

--
February 15, 2016
https://issues.dlang.org/show_bug.cgi?id=7054

--- Comment #9 from Marco Leise <Marco.Leise@gmx.de> ---
I always regarded it as merely a means to print stuff with a non-proportional font for humans to read that extends to text files. The match-up of bytes and visual characters in the early days of printf is only a historical coincidence.

Most terminals - like programming languages and GUI toolkits - have to adapt to the Unicode reality and I believe it is safe to assume that when someone calls writefln or format with full-width symbols they use a terminal that can handle them. The popular VTE library used by many recent Linux terminal emulators works great for example.

That said, printf is no better, and we could just claim that the width is meant to mean bytes or ASCII characters and you are supposed to use writefln only for English text in debugging output and not user interaction. std.stdio never cared about the user locale anyways. For all we know the output terminal might expect KOI-8 (Cyrillic) or some Indian script. In Java for example you are supposed to use an encoding wrapper if your stdout goes to a terminal, IIRC. But as Unicode is kind of ubiquitous now, we might as well say that Dlang only works on Unicode enabled systems. Sorry for the derail ... :)

--
February 18, 2016
https://issues.dlang.org/show_bug.cgi?id=7054

--- Comment #10 from hsteoh@quickfur.ath.cx ---
Even if we concede that modern terminals ought to be Unicode-aware (if not fully supporting Unicode), there is still the slippery slope of how to print bidirectional text, vertical text, scripts that require glyph mutation, etc. Where does one draw the line as to what writefln ought/ought not handle?

--