July 03, 2022
On Sunday, 3 July 2022 at 18:33:29 UTC, rikki cattermole wrote:
> Its just an unnecessary goal, when most of the string algorithms we have probably don't care about the encoding and those that do probably will be using dstrings.

To the contrary, I find this goal coherent with the end of autodecoding.
That will probably make Phobos simpler: fewer template overloads, fewer template constraints to evaluate, no more isNarrowString, etc.
July 03, 2022

On Sunday, 3 July 2022 at 18:33:29 UTC, rikki cattermole wrote:
> On 04/07/2022 6:10 AM, Ola Fosheim Grøstad wrote:
>> People who are willing to use 4 bytes per code point are probably using third party C-libraries that have their own representation, so you have to convert anyway?
>
> If you use Unicode and follow their recommendations, you are going to be using dstrings at some point.

I hardly ever use anything outside UTF-8, and if I do then I use a well tested unicode library as it has to be correct and up to date to be useful. The utility of going beyond UTF-8 seems to be limited:

https://en.wikipedia.org/wiki/UTF-32#Analysis

July 04, 2022
On 04/07/2022 7:18 AM, Ola Fosheim Grøstad wrote:
> I hardly ever use anything outside UTF-8, and if I do then I use a well tested unicode library as it has to be correct and up to date to be useful. The utility of going beyond UTF-8 seems to be limited:
> 
> https://en.wikipedia.org/wiki/UTF-32#Analysis

I have just finished implementing string normalization which is based around UTF-32.

It is required for string equivalence comparisons (which is what you should be doing in a LOT more cases!). Anything user-provided should be normalized before it is compared.
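
A minimal sketch of why normalization matters for comparisons, written in Python with the standard `unicodedata` module rather than the implementation mentioned above. The helper name `equivalent` is made up for this example. Two strings can render identically and still differ code point by code point until both are brought to the same normalization form:

```python
import unicodedata

def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

print(composed == decomposed)            # False: the code points differ
print(equivalent(composed, decomposed))  # True: both normalize to the same NFC string
```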
July 03, 2022
On Sunday, 3 July 2022 at 19:32:56 UTC, rikki cattermole wrote:
> I have just finished implementing string normalization which is based around UTF-32.

There's a difference between UTF-32 and Unicode code points.

> It is required for string equivalent comparisons (which is what you should be doing in a LOT more cases! Anything user provided when compared should be normalized first.

Which you can do on any transformation format.
July 03, 2022

On Sunday, 3 July 2022 at 19:32:56 UTC, rikki cattermole wrote:
> It is required for string equivalent comparisons (which is what you should be doing in a LOT more cases! Anything user provided when compared should be normalized first.

Well, I think it is reasonable for a protocol to require that the input is NFC, and just check it and reject it or call out to an external library to convert it into NFC.
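
The "check it and reject it" option could look roughly like this. It is only a sketch in Python, assuming Python 3.8 or later for `unicodedata.is_normalized`; the function name `accept_nfc` is hypothetical:

```python
import unicodedata

def accept_nfc(text: str) -> str:
    """Return text unchanged if it is already in NFC, otherwise raise ValueError."""
    if not unicodedata.is_normalized("NFC", text):
        raise ValueError("input is not in NFC")
    return text
```

A protocol handler built this way never converts anything itself; it either passes the input through or refuses it.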

Anyway, UTF-8 is the only format that isn't affected by network byte order… So if you support more than UTF-8 then you have to support UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE…

That is five formats for just a simple string… and only UTF-8 will be well tested by users. :-/
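
To make the five formats concrete, here is a small Python illustration (not tied to any D library) that encodes one code point in each of them. The codec names are Python's own; the variable names are just for the example:

```python
text = "\u00e9"  # a single code point, U+00E9 ('é')

formats = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
encoded = {name: text.encode(name) for name in formats}

for name, data in encoded.items():
    # Print each encoding's bytes as space-separated hex.
    print(name, data.hex(" "))
```

Only the UTF-8 bytes (`c3 a9`) are the same on every platform. The UTF-16 and UTF-32 variants carry the same code unit with its bytes in opposite orders (`e9 00` versus `00 e9`, and `e9 00 00 00` versus `00 00 00 e9`).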

July 04, 2022
On 04/07/2022 8:16 AM, Ola Fosheim Grøstad wrote:
> On Sunday, 3 July 2022 at 19:32:56 UTC, rikki cattermole wrote:
>> It is required for string equivalent comparisons (which is what you should be doing in a LOT more cases! Anything user provided when compared should be normalized first.
> 
> Well, I think it is reasonable for a protocol to require that the input is NFC, and just check it and reject it or call out to an external library to convert it into NFC.
> 
> Anyway, UTF-8 is the only format that isn't affected by network byte order… So if you support more than UTF-8 then you have to support UTF-8, UTF16-LE, UTF16-BE, UTF-32LE, UTF-32BE…
> 
> That is five formats for just a simple string… and only UTF-8 will be well tested by users. :-/

https://issues.dlang.org/show_bug.cgi?id=23186

We only support UTF-16/UTF-32 for the target endian.

Text input comes from many sources; stdin, files and, say, the windowing system are three common ones that do not make any such guarantees.
July 03, 2022

On Sunday, 3 July 2022 at 20:28:18 UTC, rikki cattermole wrote:
> We only support UTF-16/UTF-32 for the target endian.
>
> Text input comes from many sources, stdin, files and say the windowing system are three common sources that do not make any such guarantees.

Well, then the application author will use an external Unicode library anyway. If you support UTF-16 or UTF-32 there might not be a BOM, so you might need to use heuristics to figure out whether the data is little-endian or big-endian.
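
A rough sketch of the BOM part of that problem, in Python using the BOM constants from the standard `codecs` module. The function name `sniff_bom` and the table `_BOMS` are made up for this example, and a `None` result is exactly the case where heuristics would have to take over:

```python
import codecs
from typing import Optional

# Longer BOMs first: the UTF-32LE BOM (FF FE 00 00) begins with the
# same two bytes as the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading BOM, or None if there is no BOM."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Even with a BOM this is not airtight: UTF-16LE data whose first character after the BOM is U+0000 is byte-for-byte indistinguishable from a UTF-32LE BOM.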

For things like gzip, PNG, crypto and Unicode there are most likely faster and better-tested open source alternatives than a small community can come up with. Maybe just use whatever Chromium or Clang uses?

What I never liked about C++ is the string mess: char, signed char, unsigned char, char8_t, char16_t, char32_t, wchar_t, string, wstring, u8string, u16string, u32string, pmr::string, pmr::wstring, pmr::u8string, pmr::u16string, pmr::u32string… And this doesn't even account for endianness!! This is what happens over time as new needs pop up. One of the best things about Python 3 and JavaScript is that there is one commonly used string type that is well supported.

Having one common string representation is a good thing for API authors.

(But make sure to have a maintained binding to a versatile C unicode library.)

July 04, 2022
We have a perfectly good Unicode handling library already.

(Okay, a little out of date and doesn't handle Turkic stuff, but fixable).

The standard one is called ICU.

Anyway, we are straying from my original point, that limiting ourselves to the string alias and not supporting wstring or dstring in Phobos is going to bite us.

It's not what people expect, it's not what we have supported, and code that looks like it should work won't. There had better be a good reason for this that isn't just removing templates.
July 04, 2022

On Sunday, 3 July 2022 at 08:46:31 UTC, Mike Parker wrote:
> You can find the final draft of the high-level goals for the D programming language at the following link:
>
> https://github.com/dlang/vision-document

Under 'Memory safety':

> Allow the continued use of garbage collection as the default memory management strategy without impact. The GC is one of D's strengths, and we should not "throw the baby out with the bath water".

Under 'Phobos and DRuntime':

> @nogc as much as possible.

Aren't these the polar opposites of each other? The GC is one of D's strengths, yet we should avoid it as much as possible in the standard library.

Then it's not part of D's strengths.

July 04, 2022
On 04/07/2022 5:30 PM, Andrej Mitrovic wrote:
> Aren't these the polar opposites of each other? The GC is one of D's strengths, yet we should avoid it as much as possible in the standard library.

Not necessarily.

It could and should most likely mean that it won't do any heap allocations.

Heap allocations are expensive after all.