Why is std.regex slow, well here is one reason! (page 2) - D Programming Language Discussion Forum

Forums

New users
- Learn
Community
- General
- Announce
Improvements
- DIP Ideas
- DIP Devel.
Ecosystem
- GDC
- LDC
- Debuggers
- IDEs
- DWT
Development
- Internals
- Issues
- Beta
- DMD
- Phobos
- Druntime
- Study
Turkish
- Genel
- Duyuru

Index » General » Why is std.regex slow, well here is one reason! (page 2)

February 25, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Richard (Rikki) Andrew Cattermole
in reply to Walter Bright

Richard (Rikki) Andrew Cattermole

Posted in reply to Walter Bright

On 24/02/2023 9:26 AM, Walter Bright wrote:
> At minimum, please file a bugzilla issue with your analysis.

https://issues.dlang.org/show_bug.cgi?id=23737

One fix (removal of formattedWrite call) https://github.com/dlang/phobos/pull/8698



This function is taking like 700ms! https://github.com/dlang/phobos/blob/master/std/regex/internal/ir.d#L52

I don't know how to minimize it, but it does need to be memorized based upon std.uni tables. There is getMatcher above it, but yeah wordMatcher also needs to pure so that isn't a solution.

February 24, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Ali Çehreli
in reply to Richard (Rikki) Andrew Cattermole

Ali Çehreli

Posted in reply to Richard (Rikki) Andrew Cattermole

On 2/24/23 05:07, Richard (Rikki) Andrew Cattermole wrote:

> Okay looks like I'm at the 62ms mark.

Too good to be true! :p

Thank you for working on this. This kind of effort improves everybody's life.

Ali

February 24, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Walter Bright
in reply to Max Samukha

Walter Bright

Posted in reply to Max Samukha

On 2/23/2023 11:28 PM, Max Samukha wrote:
> On Thursday, 23 February 2023 at 23:11:56 UTC, Walter Bright wrote:
>> Unicode is a brilliant idea, but its doom comes from the execrable decision to apply semantic meaning to glyphs.
> 
> Unicode did not start that. For example, all Cyrillic encodings encode Latin А, K, H, etc. differently than the similarly looking Cyrillic counterparts. Whether that decision was execrable is highly debatable.

Let's say I write "x". Is that the letter x, or the math symbol x? I know which it is from the context. But in Unicode, there's a letter x and the math symbol x, although they look identical.

There is no end to semantic meanings for "x", and so any attempt to encode semantics into Unicode is doomed from the outset.

Printed media do not seem to require these hidden semantics, why should Unicode? If you print the Unicode on paper, thereby losing its meaning, what again is the purpose of Unicode?

Equally stupid are:

1. encoding of various fonts

2. multiple encodings of the same character, leading to "normalization" problems

3. encodings to enable/disable the direction the glyphs are to be read

Implementing all this stuff is hopelessly complex, which is why Unicode had to introduce "levels" of Unicode support.

February 24, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Walter Bright
in reply to Richard (Rikki) Andrew Cattermole

Walter Bright

Posted in reply to Richard (Rikki) Andrew Cattermole

On 2/24/2023 2:27 AM, Richard (Rikki) Andrew Cattermole wrote:
> who knew those innocent looking symbols, all in their tables could be so complicated!

Because the Unicode designers are in love with complexity (like far too many engineers).

February 24, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Max Samukha
in reply to Walter Bright

Max Samukha

Posted in reply to Walter Bright

On Friday, 24 February 2023 at 18:34:42 UTC, Walter Bright wrote:

> Let's say I write "x". Is that the letter x, or the math symbol x? I know which it is from the context. But in Unicode, there's a letter x and the math symbol x, although they look identical.

Same as 'A' in KOI8 or Windows-1251? Latin and Cyrillic 'A' look identical but have different codes. Not that I disagree with you, but Unicode just upheld the tradition.

>
> There is no end to semantic meanings for "x", and so any attempt to encode semantics into Unicode is doomed from the outset.

That is similar to attempts to encode semantics in, say, binary operators - they are nothing but functions, but...

>
> Printed media do not seem to require these hidden semantics, why should Unicode? If you print the Unicode on paper, thereby losing its meaning, what again is the purpose of Unicode?

Looks like another case of caching, one of the two hard problems in computing. The meaning of a code point can be inferred without the need to keep track of the context.

>
> Equally stupid are:
>
> 1. encoding of various fonts
>
> 2. multiple encodings of the same character, leading to "normalization" problems

I agree that multiple encodings for the same abstract character is not a great idea, but "same character" is unfortunately not well defined. Is Latin 'A' the same character as Cyrillic 'A'? Should they have the same code?

>
> 3. encodings to enable/disable the direction the glyphs are to be read
>
> Implementing all this stuff is hopelessly complex, which is why Unicode had to introduce "levels" of Unicode support.

That's true.

February 24, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Johan
in reply to Richard (Rikki) Andrew Cattermole

Johan

Posted in reply to Richard (Rikki) Andrew Cattermole

On Friday, 24 February 2023 at 10:52:37 UTC, Richard (Rikki) Andrew Cattermole wrote:

>

I'm going to be totally honest, I have no idea how to use that information.

Its not in a format that is easy to figure out.

What I would want is stuff like this:

module
|- function
| |- initialize template module thing @~200ms
| | |- ran CTFE on thingie @~150us

Give me that, and this type of hunting would be a cake walk I think.

This was a nice small project. See here: https://gist.github.com/JohanEngelen/907c7681e4740f82d37fd2f2244ba7bf

Looking forward to your feedback about improvements to --ftime-trace ;-)

Cheers,
Johan

February 24, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Walter Bright
in reply to Max Samukha

Walter Bright

Posted in reply to Max Samukha

On 2/24/2023 12:05 PM, Max Samukha wrote:
> On Friday, 24 February 2023 at 18:34:42 UTC, Walter Bright wrote:
> 
>> Let's say I write "x". Is that the letter x, or the math symbol x? I know which it is from the context. But in Unicode, there's a letter x and the math symbol x, although they look identical.
> 
> Same as 'A' in KOI8 or Windows-1251? Latin and Cyrillic 'A' look identical but have different codes. Not that I disagree with you, but Unicode just upheld the tradition.

Is 'A' in German different from the 'A' in English? Yes. Do they have different keys on the keyboard? No. Do they have different Unicode code points? No. How do you tell a German 'A' from an English 'A'? By the context.

The same for the word "die". Is it the German "the"? Or is it the English "expire"? Should we embed this in the letters themselves? Of course not.

> Not that I disagree with you, but Unicode just upheld the
> tradition.

Inventing a new code encoding needn't follow tradition, or take tradition to such an extreme that it makes everyone who uses Unicode miserable.

>> There is no end to semantic meanings for "x", and so any attempt to encode semantics into Unicode is doomed from the outset.
> 
> That is similar to attempts to encode semantics in, say, binary operators - they are nothing but functions, but...

We know the meaning by context.

> The meaning of a code point can be inferred without the need to keep track of the context.

Meaning in a character set simply should not exist outside the visual appearance.

> Is Latin 'A' the same character as Cyrillic 'A'? Should they have the same code?

It's the same glyph, and so should have the same code. The definitive test is, when printed out or displayed, can you see a difference? If the answer is "no" then they should be the same code.

It's fine if one wishes to develop another layer over Unicode which encodes semantics, style, font, language, emphasis, bold face, italics, etc. But these just do not belong in Unicode. They belong in a separate markup language.

February 24, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Adam D Ruppe
in reply to Walter Bright

Adam D Ruppe

Posted in reply to Walter Bright

On Friday, 24 February 2023 at 20:44:17 UTC, Walter Bright wrote:
> It's the same glyph, and so should have the same code. The definitive test is, when printed out or displayed, can you see a difference? If the answer is "no" then they should be the same code.

1lI 5S

i guess it depends on the font

February 24, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Andrea Fontana
in reply to Walter Bright

Andrea Fontana

Posted in reply to Walter Bright

On Friday, 24 February 2023 at 20:44:17 UTC, Walter Bright wrote:
> It's the same glyph, and so should have the same code. The definitive test is, when printed out or displayed, can you see a difference? If the answer is "no" then they should be the same code.

It sounds like you're saying as "piano" in Italian means "slow" but also "plane" so we can merge "plane" and "slow" in English as well.

Andrea

February 24, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by H. S. Teoh
in reply to Walter Bright

H. S. Teoh

Posted in reply to Walter Bright

On Fri, Feb 24, 2023 at 10:34:42AM -0800, Walter Bright via Digitalmars-d wrote:
> On 2/23/2023 11:28 PM, Max Samukha wrote:
> > On Thursday, 23 February 2023 at 23:11:56 UTC, Walter Bright wrote:
> > > Unicode is a brilliant idea, but its doom comes from the execrable decision to apply semantic meaning to glyphs.
> > 
> > Unicode did not start that. For example, all Cyrillic encodings encode Latin А, K, H, etc. differently than the similarly looking Cyrillic counterparts. Whether that decision was execrable is highly debatable.
> 
> Let's say I write "x". Is that the letter x, or the math symbol x? I know which it is from the context. But in Unicode, there's a letter x and the math symbol x, although they look identical.

Actually x and × are *not* identical if you're using a sane font. They have different glyph shapes (that though very similar are actually different -- × for example will never have serifs even in a serif font), and different font metrics (× has more space around it on either side; x may be kerned against an adjacent letter). If you print them they will have a different representation of dots on the paper, even if the difference is fine enough you don't notice it.

With all due respect, writing systems aren't as simple as you think. Sometimes what to you seems like a lookalike glyph may be something completely different. For example, in English if you see:

	m

you can immediately tell that it's a lowercase M.  So it makes sense to have just one Unicode codepoint to encode this, right?

Now take the lowercase Cyrillic letter т.  Completely different glyph, so completely different Unicode codepoint, right?  The problem is, the *cursive* version of this letter looks like this:

	m

According to your logic, we should encode this exactly the same way you encode the English lowercase M.  But now you have two completely different codepoints for the same letter, which makes no sense because it implies that changing the display font (from upright to cursive) requires re-encoding your string.

This isn't the only instance of this. Another example is lowercase Cyrillic П, which looks like this in upright font:

	п

but in cursive:

	n

Again, you have the same problem.

It's not reasonable to expect that changing your display font requires reencoding the string. But then you must admit that the English lowercase n must be encoded differently from the Cyrillic cursive n.

Which means that you must encode the *logical* symbol rather than the physical representation of it. I.e., semnatics.

> There is no end to semantic meanings for "x", and so any attempt to encode semantics into Unicode is doomed from the outset.

If we were to take your suggestion that "x" and "×" should be encoded identically, we would quickly run into readability problems with English text that contains mathematical fragments, say, the text talks about 3×3 matrices.  How will your email reader render the ×?  Not knowing any better, it sees the exact same codepoint as x and prints it as an English letter x, say in a serif font.  Which looks out-of-place in a mathematical expression. To fix that, you have to explicitly switch to a different font in order to have a nicer symbol.  The computer can't do this for you, because, as you said, the interpretation of a symbol is context-dependent --- and computers are bad at context-dependent stuff. So you'll need complex information outside of the text itself (e.g. use HTML or some other markup) to tell the computer which meaning of "x" is intended here.  The *exact same kind of complex information* that Unicode currently deals with.

So you're not really solving anything, just pushing the complexity from one place to another.  And not having this information directly encoded in the string means that you're now going back to the bad ole days where there is no standard for marking semantics in a piece of text; everybody does it differently, and copy-n-pasting text from one program to another will almost guarantee the loss of this information (that you then have to re-input in the target software).

[...]
> Implementing all this stuff is hopelessly complex, which is why Unicode had to introduce "levels" of Unicode support.

Human writing systems are hopelessly complex.  It's just par for the course. :-D

T

-- 
You have to expect the unexpected. -- RL

Top | Forum index | About this forum

Copyright © 1999-2021 by the D Language Foundation