Why is std.regex slow, well here is one reason!

On Sunday, 26 February 2023 at 07:25:45 UTC, Richard (Rikki) Andrew Cattermole wrote:
> Basically right now globals are not leading to anything in the output.
>
> ```
> void func() {
>     static immutable Thing thing = Thing(123);
> }
> ```
>
> The constructor call for Thing won't show up. This is the big one for std.regex basically.

https://github.com/ldc-developers/ldc/pull/4339

-Johan
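For background on why this case matters: D requires the initializer of a `static immutable` variable to be evaluated at compile time (CTFE), so the constructor call runs inside the compiler rather than at run time. A minimal sketch, assuming a hypothetical `Thing` (the doubling in its constructor is illustrative only and not taken from the thread):

```d
struct Thing
{
    int value;

    // Hypothetical constructor body. When the constructor is used in a
    // static initializer, this work is done by the compiler via CTFE.
    this(int v)
    {
        value = v * 2;
    }
}

int func()
{
    // The initializer of a static immutable must be evaluated at compile time.
    static immutable Thing thing = Thing(123);
    return thing.value;
}

// The same constructor call, forced through CTFE explicitly.
enum int viaCtfe = Thing(123).value;
static assert(viaCtfe == 246);

void main()
{
    assert(func() == 246);
}
```

Since the constructor never executes at run time here, any cost it has is a compile-time cost, which is why it is relevant to compile-time tracing.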

On Friday, 24 February 2023 at 20:44:17 UTC, Walter Bright wrote:
> On 2/24/2023 12:05 PM, Max Samukha wrote:
>> Is Latin 'A' the same character as Cyrillic 'A'? Should they have the same code?
>
> It's the same glyph, and so should have the same code. The definitive test is, when printed out or displayed, can you see a difference? If the answer is "no" then they should be the same code.

You’d be surprised but there are typesets where Cyrillic A is visually different from ASCII A.

—
Dmitry Olshansky

On Thursday, 2 March 2023 at 07:35:06 UTC, Dmitry Olshansky wrote:
> On Friday, 24 February 2023 at 20:44:17 UTC, Walter Bright wrote:
>> On 2/24/2023 12:05 PM, Max Samukha wrote:
>>> Is Latin 'A' the same character as Cyrillic 'A'? Should they have the same code?
>>
>> It's the same glyph, and so should have the same code. The definitive test is, when printed out or displayed, can you see a difference? If the answer is "no" then they should be the same code.
>
> You’d be surprised but there are typesets where Cyrillic A is visually different from ASCII A.

Also your idea of “what it looks on paper” is basically NFKC or NFKD, which is compatibility normalization that folds lookalikes into the same canonical codepoint. I would insist that there are times when “looks the same” is not a good option. Typically programs do not have the context, that we as humans use to disambiguate.

> —
> Dmitry Olshansky
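As a reference point for the NFKC/NFKD remark, here is a small sketch of what compatibility normalization does as the Unicode standard defines it: it folds compatibility variants of a letter (fullwidth, mathematical Fraktur) into the base letter, while letters from different scripts stay distinct even when they look alike. The sketch uses Phobos' `std.uni.normalize`; the `nfkc` helper name is hypothetical.

```d
import std.uni : normalize, NFKC;

// Hypothetical helper: returns the NFKC-normalized form of s.
string nfkc(string s)
{
    return normalize!NFKC(s);
}

void main()
{
    // Compatibility variants of Latin A fold to U+0041.
    assert(nfkc("\uFF21") == "A");     // FULLWIDTH LATIN CAPITAL LETTER A
    assert(nfkc("\U0001D504") == "A"); // MATHEMATICAL FRAKTUR CAPITAL A

    // CYRILLIC CAPITAL LETTER A (U+0410) has no compatibility mapping,
    // so it is left unchanged and never becomes Latin A.
    assert(nfkc("\u0410") == "\u0410");
    assert(nfkc("\u0410") != "A");
}
```

Cross-script lookalikes of this kind are tracked separately from normalization, in the Unicode confusables data.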

On 2/25/2023 6:26 AM, Herbie Melbourne wrote:
> But it is the same Latin 'A'

When it's printed, how do you know the difference?

> My understanding of Unicode has always been that it's merely a mapping of a number, a code point, to a letter, word, symbol, icon, an idea and nothing more. Unicode is agnostic to layout. That's defined in a font.

It started out that way, but it is no more. There are Fraktur fonts embedded in Unicode. There are also direction instructions to turn the rendering right-to-left.

On 3/1/2023 11:49 PM, Dmitry Olshansky wrote:
> I would insist that there are times when “looks the same” is not a good option. Typically programs do not have the context, that we as humans use to disambiguate.

Programs can't tell if "die" means "the" or "expire" without context, either.

The point is, once invisible semantic meaning is added, an infinite number of Unicode code points is required.

> You’d be surprised

Not at all. People use different fonts to assert different meanings all the time.

> but there are typesets where Cyrillic A is visually different from ASCII A.

Yes, and there are italic fonts, and people embed them in text using markup, not different code points.

March 03, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Dmitry Olshansky
in reply to Walter Bright

On Thursday, 2 March 2023 at 20:11:14 UTC, Walter Bright wrote:
> On 3/1/2023 11:49 PM, Dmitry Olshansky wrote:
>> I would insist that there are times when “looks the same” is not a good option. Typically programs do not have the context, that we as humans use to disambiguate.
>
> Programs can't tell if "die" means "the" or "expire" without context, either.
>

We are talking about characters. Yes, we can’t tell the meaning, but we can upper/lowercase it or word-break it with ease.

> The point is, once invisible semantic meaning is added, an infinite number of Unicode code points is required.
>
>> You’d be surprised
>
> Not at all. People use different fonts to assert different meanings all the time.
>
>> but there are typesets where Cyrillic A is visually different from ASCII A.
>
> Yes, and there are italic fonts, and people embed them in text using markup, not different code points.

Let’s see another example. The Cyrillic letter ‘В’ looks the same as ASCII ‘B’ when capitalized, hence by your reasoning it’s the same codepoint. Now lowercase ‘в’ and ‘b’ don’t look the same, hence different codepoints. Voilà, you just made lowercasing/uppercasing impossible without some external context, so <cyrillic>В</cyrillic>?

I’d rather live in a world where codepoints belong to a particular alphabet, allowing us to manipulate text generically according to each language’s standards even if we do not know the semantics of the words. Context, if required, is for high-level meaning.
—
Dmitry Olshansky
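The case-mapping argument above can be checked directly. A minimal sketch using Phobos' `std.uni.toLower` and `std.uni.toUpper`, which work per code point with no knowledge of the surrounding text (the variable names are illustrative):

```d
import std.uni : toLower, toUpper;

void main()
{
    dchar latinB = 'B';          // U+0042 LATIN CAPITAL LETTER B
    dchar cyrillicVe = '\u0412'; // U+0412 CYRILLIC CAPITAL LETTER VE
    dchar cyrillicVeLower = '\u0432'; // U+0432 CYRILLIC SMALL LETTER VE

    // The capitals look alike in many fonts but are distinct code points.
    assert(latinB != cyrillicVe);

    // Because they are distinct, each has an unambiguous lowercase form.
    assert(toLower(latinB) == 'b');
    assert(toLower(cyrillicVe) == cyrillicVeLower);

    // The mapping also round-trips without any external context.
    assert(toUpper(cyrillicVeLower) == cyrillicVe);
}
```

If both capitals shared one code point, `toLower` would need to know the language of the text to choose between ‘b’ and ‘в’.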

On Friday, 24 February 2023 at 20:44:17 UTC, Walter Bright wrote:
> Is 'A' in German different from the 'A' in English? Yes.

Except they are literally the same Latin A and in no way are different.

>> Is Latin 'A' the same character as Cyrillic 'A'? Should they have the same code?
>
> It's the same glyph, and so should have the same code.

Except they aren't, and it's a mere coincidence that in this particular font they look the same way. Cyrillic А is traditionally written more as c\ with c being tilted to the left about 45 degrees. Even in fonts with Cyrillic A looking more like Latin A, a lot of fonts put extra emphasis on the right stroke, making it wider than the left.

> The definitive test is, when printed out or displayed, can you see a difference? If the answer is "no" then they should be the same code.

The definitive test would be understanding what you're talking about

On Friday, 3 March 2023 at 11:29:34 UTC, GrimMaple wrote:
>> The definitive test is, when printed out or displayed, can you see a difference? If the answer is "no" then they should be the same code.
>
> The definitive test would be understanding what you're talking about

Indeed, it's a stupid argument, since printed glyphs are even more likely to differ than glyphs on a screen, which has a lower resolution than a printer (600 dpi printers are standard nowadays; screens with more than 200 dpi are not so frequent).

On Thursday, 2 March 2023 at 20:06:38 UTC, Walter Bright wrote:
> On 2/25/2023 6:26 AM, Herbie Melbourne wrote:
>> But it is the same Latin 'A'
>
> When it's printed, how do you know the difference?

Heuristically, from context. For example, we know "6:26 AM" is Latin because it's an abbreviation of "ante meridiem"; you would need pretty heavy AI to do this programmatically.
