The D Programming Language Vision Document (page 2)

On Sunday, 3 July 2022 at 18:33:29 UTC, rikki cattermole wrote:
> It's just an unnecessary goal, when most of the string algorithms we have probably don't care about the encoding and those that do probably will be using dstrings.

To the contrary, I find this goal coherent with the end of autodecoding. That will probably make Phobos simpler: fewer template overloads, fewer template constraints to evaluate, no more isNarrowString, etc.

On 04/07/2022 7:18 AM, Ola Fosheim Grøstad wrote:
> I hardly ever use anything outside UTF-8, and if I do then I use a well tested unicode library as it has to be correct and up to date to be useful. The utility of going beyond UTF-8 seems to be limited:
>
> https://en.wikipedia.org/wiki/UTF-32#Analysis

I have just finished implementing string normalization, which is based around UTF-32. It is required for string equivalence comparisons (which is what you should be doing in a LOT more cases!). Anything user-provided should be normalized first when compared.
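To make the point about normalizing before comparing concrete, here is a minimal sketch. It is written in Python rather than D purely for brevity, and the helper name `equivalent` is hypothetical. It relies only on the standard `unicodedata.normalize` function and is not the implementation referred to in the post.

```python
import unicodedata

def equivalent(a: str, b: str) -> bool:
    """Compare two strings for canonical equivalence by normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

precomposed = "caf\u00e9"   # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)            # False: the code point sequences differ
print(equivalent(precomposed, decomposed))  # True: equal after NFC normalization
```

Both strings render identically, which is why a plain comparison of user-provided text can give surprising results.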

On Sunday, 3 July 2022 at 19:32:56 UTC, rikki cattermole wrote:
> I have just finished implementing string normalization which is based around UTF-32.

There's a difference between UTF-32 and Unicode code points.

> It is required for string equivalence comparisons (which is what you should be doing in a LOT more cases!). Anything user-provided should be normalized first when compared.

Which you can do on any transformation format.

On 04/07/2022 8:16 AM, Ola Fosheim Grøstad wrote:
> On Sunday, 3 July 2022 at 19:32:56 UTC, rikki cattermole wrote:
>> It is required for string equivalence comparisons (which is what you should be doing in a LOT more cases!). Anything user-provided should be normalized first when compared.
>
> Well, I think it is reasonable for a protocol to require that the input is NFC, and just check it and reject it or call out to an external library to convert it into NFC.
>
> Anyway, UTF-8 is the only format that isn't affected by network byte order… So if you support more than UTF-8 then you have to support UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE…
>
> That is five formats for just a simple string… and only UTF-8 will be well tested by users. :-/

https://issues.dlang.org/show_bug.cgi?id=23186

We only support UTF-16/UTF-32 for the target endian.

Text input comes from many sources; stdin, files and, say, the windowing system are three common sources that do not make any such guarantees.
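The byte-order point in the quoted post can be seen by encoding one short string in the five formats listed. This is a small Python sketch for illustration only (the variable names are arbitrary). It uses the standard codec names `utf-8`, `utf-16-le`, `utf-16-be`, `utf-32-le` and `utf-32-be`, and `bytes.hex` with a separator needs Python 3.8 or later.

```python
text = "D\u00e9"  # 'D' followed by U+00E9

encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
encoded = {name: text.encode(name) for name in encodings}

# Print the raw bytes of each encoding side by side.
for name, data in encoded.items():
    print(f"{name:10} {data.hex(' ')}")

# Decoding with the wrong byte order does not necessarily fail;
# here it silently yields different characters.
misread = encoded["utf-16-le"].decode("utf-16-be")
print(misread == text)
```

Only the UTF-8 bytes are the same regardless of the machine's endianness. The UTF-16 and UTF-32 variants differ from each other only in byte order, so a reader has to know or guess which one it was handed.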

July 03, 2022

Re: The D Programming Language Vision Document

Posted by Ola Fosheim Grøstad
in reply to rikki cattermole


On Sunday, 3 July 2022 at 20:28:18 UTC, rikki cattermole wrote:
> We only support UTF-16/UTF-32 for the target endian.
>
> Text input comes from many sources; stdin, files and, say, the windowing system are three common sources that do not make any such guarantees.

Well, then the application author will use an external Unicode library anyway. If you support UTF-16 or UTF-32 there might not be a BOM, so you might need to use heuristics to figure out the LE/BE endianness issue.
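A rough Python sketch of what that involves: check for a BOM first, and fall back to a guess when there is none. The function names `sniff_encoding` and `guess_utf16_endianness` are hypothetical, and the fallback is a deliberately crude heuristic that only works for mostly-ASCII text. It is meant to show why this is fragile, not to be a recommended detector.

```python
import codecs

# The 4-byte UTF-32 marks are tested before the UTF-16 marks because the
# UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_encoding(data: bytes):
    """Return (encoding, bom_length) if data starts with a BOM, else (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def guess_utf16_endianness(data: bytes):
    """Crude guess for BOM-less UTF-16 (assumes mostly-ASCII content).

    ASCII characters have a zero high byte, which sits at odd offsets in
    little-endian data and at even offsets in big-endian data.
    Returns None when the counts give no indication.
    """
    even_zeros = data[0::2].count(0)
    odd_zeros = data[1::2].count(0)
    if odd_zeros > even_zeros:
        return "utf-16-le"
    if even_zeros > odd_zeros:
        return "utf-16-be"
    return None

print(sniff_encoding(codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")))
print(guess_utf16_endianness("hello".encode("utf-16-le")))
```

Even the BOM check has an ambiguity: UTF-16LE data that begins with a BOM followed by U+0000 is byte-for-byte identical to the start of UTF-32LE data. The zero-byte heuristic gives no useful answer for text that is mostly outside the Latin range.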

For things like gzip, png, crypto and unicode there are most likely faster and better tested open source alternatives than a small community can come up with. Maybe just use whatever Chromium or Clang uses?

What I never liked about C++ is the string mess: char, signed char, unsigned char, char8_t, char16_t, char32_t, wchar_t, string, wstring, u8string, u16string, u32string, pmr::string, pmr::wstring, pmr::u8string, pmr::u16string, pmr::u32string… And this doesn't even account for endianness!! This is what happens over time as new needs pop up. One of the best things about Python 3 and JavaScript is that there is one commonly used string type that is well supported.

Having one common string representation is a good thing for API authors.

(But make sure to have a maintained binding to a versatile C unicode library.)

We have a perfectly good Unicode handling library already. (Okay, a little out of date and doesn't handle Turkic stuff, but fixable.) The standard one is called ICU. Anyway, we are straying from my original point, that limiting ourselves to the string alias and not supporting wstring or dstring in Phobos is going to bite us. It's not what people expect, it's not what we have supported, and code that looks like it should work won't. There had better be a good reason for this that isn't just removing templates.

On 04/07/2022 5:30 PM, Andrej Mitrovic wrote:
> Aren't these the polar opposites of each other? The GC is one of D's strengths, yet we should avoid it as much as possible in the standard library.

Not necessarily. It could and should most likely mean that it won't do any heap allocations. Heap allocations are expensive, after all.
