Why is std.regex slow, well here is one reason! (page 3)

February 25, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Richard (Rikki) Andrew Cattermole
in reply to Walter Bright

On 25/02/2023 7:39 AM, Walter Bright wrote:
> On 2/24/2023 2:27 AM, Richard (Rikki) Andrew Cattermole wrote:
>> who knew those innocent looking symbols, all in their tables could be so complicated!
> 
> Because the Unicode designers are in love with complexity (like far too many engineers).

Not entirely.

Humans have made pretty much every form of writing system imaginable.

If there is an assumption you can make about Latin, there is another script that violates it so thoroughly that you need a table for it.

I find Unicode pretty impressive. It represents some of the hardest parts of human society, it supports thousands of years of content, and on top of that it is backwards and forwards compatible!

Date/time is easy in comparison lol!

February 25, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Dmitry Olshansky
in reply to Walter Bright

On Thursday, 23 February 2023 at 23:11:56 UTC, Walter Bright wrote:
>> Unicode keeps growing, which is good for compilers, but horrible for standard libraries!
>
> Unicode is a brilliant idea, but its doom comes from the execrable decision to apply semantic meaning to glyphs.

Its doom comes from its success. The initial design was simple enough, and 16 bits should have been enough for everyone. Then it was gradually extended to more and more writing systems. The marvel here is that it managed to:
- remain compatible with earlier versions
- accommodate the vast complexity with fairly few algorithms and concepts
- handle technical debt, which is probably what you dislike about it, but at the scale of the project it’s inevitable

—
Dmitry Olshansky

February 25, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Patrick Schluter
in reply to Walter Bright

On Friday, 24 February 2023 at 18:39:02 UTC, Walter Bright wrote:
> On 2/24/2023 2:27 AM, Richard (Rikki) Andrew Cattermole wrote:
>> who knew those innocent looking symbols, all in their tables could be so complicated!
>
> Because the Unicode designers are in love with complexity (like far too many engineers).

Languages are complex and often contradictory. The moment you want to handle, for example, letter case, you're in for the complexity. Uppercase i is different in Turkish than in any other language. ß does not have an uppercase (its uppercase is SS) but has a titlecase (titlecase is not the same thing as uppercase) ß. Changing case is not reversible in general (Greek has two lowercase sigmas but only one uppercase; German again with ß, which becomes SS in uppercase, but not every SS can become ß when lowercased). These were just some simple examples in Latin scripts.
Unicode is complex because language is complex. Is it perfect? No. Is it bad? Far from it.
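
To make the non-reversibility concrete, here is a minimal sketch using the single-code-point mappings in Phobos' `std.uni`. The helper `roundTrip` is a hypothetical name introduced only for this illustration, and the mappings are the locale-independent simple ones, so the Turkish rules are deliberately not applied.

```d
import std.uni : toLower, toUpper;

/// Uppercases and then lowercases one code point with the simple,
/// locale-independent mappings from std.uni.
dchar roundTrip(dchar c)
{
    return toLower(toUpper(c));
}

void main()
{
    // Greek: final sigma (U+03C2) and sigma (U+03C3) share one uppercase (U+03A3).
    assert(toUpper('\u03C2') == '\u03A3');
    assert(toUpper('\u03C3') == '\u03A3');

    // So the final sigma does not survive a round trip.
    assert(roundTrip('\u03C2') == '\u03C3');

    // Dotless i (U+0131) uppercases to plain 'I', which lowercases to 'i'.
    assert(roundTrip('\u0131') == 'i');
}
```

Once two lowercase letters map to the same uppercase letter, no lookup can recover the original without knowing the language or the surrounding word.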

February 26, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Richard (Rikki) Andrew Cattermole
in reply to Johan

So there is a problem with time trace handling: it doesn't escape Windows paths, so you end up with an exception on \T rather than \\T.

I've gone ahead and modified the tool, did some cleaning up, added a second output file that allows for consumption in a spreadsheet application, sorted by duration automatically.

I'd love to see the time trace switch upstreamed into dmd. We can then distribute this tool for an out of the box visualization experience that doesn't require a web browser. And of course globals need work, not just Windows path escaping ;)

It is an absolutely lovely tool that will ease a lot of people's concerns over debugging compile times. Gonna be worth a blog article!

I'll put my version here also:


// Run using: rdmd timeTraceTree.d <your .time-trace file>
// Outputs timetrace.txt (tree view) and timetrace.tsv (sorted by duration) in current dir
module timeTraceTree;
import std.stdio;
import std.file;
import std.json;
import std.range;
import std.conv;
import std.algorithm;

File outputTextFile, outputTSVFile;
static string duration_format_string = "%13.3f ";

JSONValue sourceFile;
JSONValue[] metadata; // "M"
JSONValue[] counterdata; // "C"
JSONValue[] processes; // "X"

ulong lineNumberCounter = 1;

int main(string[] args)
{
    // args[0] is the program name, so the trace file must be args[1].
    if (args.length < 2)
        return 1;

    auto input_json = read(args[1]).to!string;
    outputTextFile = File("timetrace.txt", "w");
    outputTSVFile = File("timetrace.tsv", "w");

    {
        sourceFile = parseJSON(input_json);
        readMetaData;
        constructTree;
        constructList;
    }

    {
        outputTextFile.writeln("Timetrace: ", args[1]);
        lineNumberCounter++;

        outputTextFile.writeln("Metadata:");
        lineNumberCounter++;

        foreach (node; metadata)
        {
            outputTextFile.write("  ");
            outputTextFile.writeln(node);
            lineNumberCounter++;
        }

        outputTextFile.writeln("Duration (ms)");
        lineNumberCounter++;
    }

    foreach (i, ref child; Node.root.children)
        child.print(0, false);

    outputTSVFile.writeln("Duration\tText Line Number\tName\tLocation\tDetail");
    foreach (node; Node.all)
        outputTSVFile.writeln(node.duration, "\t", node.lineNumber, "\t",
                node.name, "\t", node.location, "\t", node.detail);

    return 0;
}

void readMetaData()
{
    auto beginningOfTime = sourceFile["beginningOfTime"].get!ulong;
    auto traceEvents = sourceFile["traceEvents"].get!(JSONValue[]);

    // Read meta data
    foreach (value; traceEvents)
    {
        switch (value["ph"].get!string)
        {
        case "M":
            metadata ~= value;
            break;
        case "C":
            counterdata ~= value;
            break;
        case "X":
            processes ~= value;
            break;
        default: //drop
        }
    }

    // process node = {"ph":"X","name": "Sema1: Module object","ts":26825,"dur":1477,"loc":"<no file>","args":{"detail": "","loc":"<no file>"},"pid":101,"tid":101},
    // Sort time processes
    multiSort!(q{a["ts"].get!ulong < b["ts"].get!ulong}, q{a["dur"].get!ulong > b["dur"].get!ulong})(
            processes);
}

void constructTree()
{
    // Build tree (to get nicer looking structure lines)
    Node*[] parent_stack = [&Node.root]; // each stack item represents the first uncompleted node of that level in the tree

    foreach (ref process; processes)
    {
        auto last_ts = process["ts"].get!ulong + process["dur"].get!ulong;
        size_t parent_idx = 0; // index in parent_stack to which this item should be added.

        foreach (i; 0 .. parent_stack.length)
        {
            if (last_ts > parent_stack[i].last_ts)
            {
                // The current process outlasts stack item i. Stop traversing, parent is i-1;
                parent_idx = i - 1;
                parent_stack.length = i;
                break;
            }

            parent_idx = i;
        }

        parent_stack[parent_idx].children ~= Node(&process, last_ts);
        parent_stack ~= &parent_stack[parent_idx].children[$ - 1];
        Node.count++;
    }
}

void constructList()
{
    size_t offset;

    Node.all.length = Node.count - 1;

    void handle(Node* root)
    {
        Node.all[offset++] = root;

        foreach (ref child; root.children)
            handle(&child);
    }

    foreach (ref child; Node.root.children)
        handle(&child);

    Node.all.sort!((a, b) => a.duration > b.duration);
}

struct Node
{
    Node[] children;
    JSONValue* json;
    ulong last_ts; // represents the last timestamp of this node (i.e. ts + dur)
    ulong lineNumber;

    string name;
    ulong duration;
    string location;
    string detail;

    this(JSONValue* json, ulong last_ts)
    {
        this.json = json;
        this.last_ts = last_ts;

        if ((*json).type == JSONType.object && "dur" in (*json))
        {
            this.duration = (*json)["dur"].get!ulong;
            this.name = (*json)["name"].get!string;
            this.location = (*json)["args"]["loc"].get!string;
            this.detail = (*json)["args"]["detail"].get!string;
        }
    }

    void print(uint indentLevel, bool last_child)
    {
        char[] identPrefix = getIdentPrefix(indentLevel, last_child);

        import std.stdio;

        if (last_child)
        {
            identPrefix[$ - 4] = ' ';
            identPrefix[$ - 3 .. $] = "\u2514";
        }
        else
            identPrefix[$ - 2 .. $] = " |";

        outputTextFile.writef(duration_format_string,
                cast(double)(*this.json)["dur"].get!ulong / 1000);

        outputTextFile.write(identPrefix);
        outputTextFile.write("- ", this.name);
        outputTextFile.write(", ", this.detail);
        outputTextFile.writeln(", ", this.location);

        this.lineNumber = lineNumberCounter;
        lineNumberCounter++;

        if (last_child)
            identPrefix[$ - 4 .. $] = ' ';

        foreach (i, ref child; this.children)
            child.print(indentLevel + 1, i == this.children.length - 1);
    }

    static Node root = Node(new JSONValue("Tree root"), ulong.max);
    static Node*[] all;
    static size_t count = 1;
}

char[] getIdentPrefix(uint indentLevel, bool last_child)
{
    static char[] buffer;

    size_t needed = ((indentLevel + 1) * 2) + (last_child * 2);

    if (buffer.length < needed)
        buffer.length = needed;

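    // Return only the part for this level; the static buffer may still be
    // longer from a deeper level, which would misplace the `$`-relative writes.
    return buffer[0 .. needed];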
}
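
The nesting rule that `readMetaData` and `constructTree` rely on can be shown in isolation. What follows is a simplified sketch, not the tool's actual code: `Event` and `nestingDepths` are hypothetical names, and it only computes depths with a stack of end timestamps instead of building `Node` objects. It assumes, as the tool does, that a child event never ends after its parent.

```d
import std.algorithm : multiSort;

// Hypothetical stand-in for one "X" (complete) trace event.
struct Event
{
    string name;
    ulong ts;  // start timestamp
    ulong dur; // duration
}

/// Sorts the events in place and returns the nesting depth of each one.
size_t[] nestingDepths(Event[] events)
{
    // Earlier start first; for equal starts the longer event comes first,
    // so a parent always precedes the events it encloses.
    multiSort!((a, b) => a.ts < b.ts, (a, b) => a.dur > b.dur)(events);

    ulong[] openEnds; // end timestamps of the ancestors that are still open
    size_t[] depths;
    foreach (event; events)
    {
        immutable end = event.ts + event.dur;

        // Close every ancestor that this event outlasts.
        while (openEnds.length > 0 && end > openEnds[$ - 1])
            openEnds = openEnds[0 .. $ - 1];

        depths ~= openEnds.length;
        openEnds ~= end;
    }
    return depths;
}

void main()
{
    auto events = [
        Event("sema B", 10, 20),
        Event("module", 0, 100),
        Event("sema D", 40, 10),
        Event("sema C", 10, 5),
    ];
    size_t[] expected = [0, 1, 2, 1];
    assert(nestingDepths(events) == expected);
}
```

Sorting by start time and then by descending duration is what makes the single pass work: by the time an event is visited, every event that could contain it is already on the stack.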

February 25, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Herbie Melbourne
in reply to Walter Bright

On Friday, 24 February 2023 at 20:44:17 UTC, Walter Bright wrote:

Is 'A' in German different from the 'A' in English? Yes. Do they have different keys on the keyboard? No. Do they have different Unicode code points? No. How do you tell a German 'A' from an English 'A'? By the context.

But it is the same Latin 'A' just like '0' is the same digit (which may look like an O) only it's pronounced differently.

The same for the word "die". Is it the German "the"? Or is it the English "expire"? Should we embed this in the letters themselves? Of course not.

Some languages, like Chinese, use pictograms for words instead of letters, but I don't know whether they are encoded with different code points for each language. Also, Chinese has traditional and simplified forms, so are there multiple code points for the same word?

My understanding of Unicode has always been that it's merely a mapping of a number, a code point, to a letter, word, symbol, icon, an idea and nothing more. Unicode is agnostic to layout. That's defined in a font.

February 25, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Herbie Melbourne
in reply to Patrick Schluter

On Saturday, 25 February 2023 at 13:19:55 UTC, Patrick Schluter wrote:

> ß does not have uppercase (uppercase is SS) but has a titlecase (titlecase is not the same thing as uppercase) ß.

It does. See for example https://www.sueddeutsche.de/bildung/rechtschreibung-das-alphabet-bekommt-einen-neuen-buchstaben-1.3566309

February 26, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Richard (Rikki) Andrew Cattermole
in reply to Herbie Melbourne

On 26/02/2023 3:31 AM, Herbie Melbourne wrote:
> On Saturday, 25 February 2023 at 13:19:55 UTC, Patrick Schluter wrote:
>> ß does not have uppercase (uppercase is SS) but has a titlecase (titlecase is not the same thing as uppercase) ß.
> 
> It does. See for example https://www.sueddeutsche.de/bildung/rechtschreibung-das-alphabet-bekommt-einen-neuen-buchstaben-1.3566309

Both of you are correct.

00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;

No uppercase for simple casing (that line is from UnicodeData.txt, where the uppercase field is empty).

00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S

But it does for special casing (that line is from SpecialCasing.txt: code; lower; title; upper).
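
The two data lines above can be read mechanically. This is only a sketch for illustration: `decodeMapping` and `simpleUppercase` are hypothetical helpers, and the field positions assume the standard layouts of UnicodeData.txt (simple uppercase in field 12, counting from 0) and SpecialCasing.txt (code; lower; title; upper).

```d
import std.array : split;
import std.conv : to;
import std.string : strip;

/// Turns a field such as " 0053 0053" into the text it encodes.
string decodeMapping(string field)
{
    string result;
    foreach (hex; field.strip.split) // whitespace-separated hex code points
        result ~= cast(dchar) hex.to!uint(16);
    return result;
}

/// Simple uppercase mapping of a UnicodeData.txt line; empty if there is none.
string simpleUppercase(string unicodeDataLine)
{
    return unicodeDataLine.split(";")[12];
}

void main()
{
    // Both lines are copied from the post above.
    string unicodeData = "00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;;;;";
    string specialCasing = "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S";

    assert(simpleUppercase(unicodeData) == ""); // no simple uppercase

    auto fields = specialCasing.split(";");
    assert(decodeMapping(fields[1]) == "\u00DF"); // lowercase: ß
    assert(decodeMapping(fields[2]) == "Ss");     // titlecase
    assert(decodeMapping(fields[3]) == "SS");     // uppercase
}
```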

February 25, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Johan
in reply to Richard (Rikki) Andrew Cattermole

On Saturday, 25 February 2023 at 13:55:00 UTC, Richard (Rikki) Andrew Cattermole wrote:
> So there is a problem with time trace handling: it doesn't escape Windows paths, so you end up with an exception on \T rather than \\T.

I don't quite understand what you mean.

> I've gone ahead and modified the tool, did some cleaning up, added a second output file that allows for consumption in a spreadsheet application, sorted by duration automatically.
>
> I'd love to see the time trace switch upstreamed into dmd.

https://github.com/dlang/dmd/pull/13965

> We can then distribute this tool for an out of the box visualization experience that doesn't require a web browser. And of course globals need work, not just Windows path escaping ;)

I'll add the tool to LDC.

> It is an absolutely lovely tool that will ease a lot of people's concerns over debugging compile times. Gonna be worth a blog article!

Thanks. Looking forward.
I don't remember adding CTFE times to the traces, so that sounds like a clear improvement point? Or was it still useful for you to tackle the issue of the OP?

> I'll put my version here also:

Thanks :)

February 26, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by Richard (Rikki) Andrew Cattermole
in reply to Johan

On 26/02/2023 5:49 AM, Johan wrote:
> On Saturday, 25 February 2023 at 13:55:00 UTC, Richard (Rikki) Andrew Cattermole wrote:
>> So there is a problem with time trace handling: it doesn't escape Windows paths, so you end up with an exception on \T rather than \\T.
> 
> I don't quite understand what you mean.

{"ph":"M","ts":0,"args":{"name":"C:\Tools\D\ldc2-1.30.0-beta1-windows-multilib\bin\ldc2.exe"},"name":"process_name","pid":101,"tid":101},

Needs to be:

{"ph":"M","ts":0,"args":{"name":"C:\\Tools\\D\\ldc2-1.30.0-beta1-windows-multilib\\bin\\ldc2.exe"},"name":"process_name","pid":101,"tid":101},
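
A small sketch of why the first form fails, using `std.json` from Phobos. The shortened path `C:\Tools\D\ldc2.exe` and the helper `isValidJson` are made up for this example; the real trace contains the longer path shown above.

```d
import std.exception : collectException;
import std.json : JSONException, JSONValue, parseJSON;

/// True if std.json accepts `text`.
bool isValidJson(string text)
{
    return collectException!JSONException(parseJSON(text)) is null;
}

void main()
{
    // Backtick strings are WYSIWYG in D, so the backslashes below reach
    // the parser unchanged.
    assert(!isValidJson(`{"name":"C:\Tools\D\ldc2.exe"}`)); // \T is not a JSON escape
    assert(isValidJson(`{"name":"C:\\Tools\\D\\ldc2.exe"}`));

    // After parsing, the value holds single backslashes again.
    assert(parseJSON(`{"name":"C:\\Tools"}`)["name"].str == `C:\Tools`);

    // A JSON writer does the escaping for you.
    assert(JSONValue(`C:\Tools`).toString == `"C:\\Tools"`);
}
```

Emitting strings through a JSON writer rather than concatenating them by hand is the usual way to avoid this class of bug.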

>> I've gone ahead and modified the tool, did some cleaning up, added a second output file that allows for consumption in a spreadsheet application, sorted by duration automatically.
>>
>> I'd love to see the time trace switch upstreamed into dmd.
> 
> https://github.com/dlang/dmd/pull/13965
> 
>> We can then distribute this tool for an out of the box visualization experience that doesn't require a web browser. And of course globals need work, not just Windows path escaping ;)
> 
> I'll add the tool to LDC.
> 
>> It is an absolutely lovely tool that will ease a lot of people's concerns over debugging compile times. Gonna be worth a blog article!
> 
> Thanks. Looking forward.
> I don't remember adding CTFE times to the traces, so that sounds like a clear improvement point? Or was it still useful for you to tackle the issue of the OP?

Basically right now globals are not leading to anything in the output.

```
void func() {
	static immutable Thing thing = Thing(123);
}
```

The constructor call for Thing won't show up. This is the big one for std.regex basically.
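
The reason that constructor costs compile time at all is that the initializer of a `static immutable` is evaluated by the compiler (CTFE). A minimal sketch that makes this visible: `Thing` here is a hypothetical struct standing in for the one above, and the `builtAtCompileTime` field exists only to record the value of `__ctfe`.

```d
struct Thing
{
    int value;
    bool builtAtCompileTime;

    this(int value)
    {
        this.value = value;
        // __ctfe is true only while the compiler is interpreting this code.
        this.builtAtCompileTime = __ctfe;
    }
}

bool globalWasBuiltAtCompileTime()
{
    // The initializer must be known at compile time, so the constructor
    // runs inside the compiler.
    static immutable Thing thing = Thing(123);
    return thing.builtAtCompileTime;
}

bool localWasBuiltAtCompileTime()
{
    // An ordinary local is constructed when the function is called.
    auto thing = Thing(123);
    return thing.builtAtCompileTime;
}

void main()
{
    assert(globalWasBuiltAtCompileTime());
    assert(!localWasBuiltAtCompileTime());
}
```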

February 27, 2023

Re: Why is std.regex slow, well here is one reason!

Posted by FeepingCreature
in reply to Patrick Schluter

On Saturday, 25 February 2023 at 13:19:55 UTC, Patrick Schluter wrote:
> Languages are complex and often contradictory. The moment you want to handle, for example, letter case, you're in for the complexity. Uppercase i is different in Turkish than in any other language. ß does not have an uppercase (its uppercase is SS) but has a titlecase (titlecase is not the same thing as uppercase) ß. Changing case is not reversible in general (Greek has two lowercase sigmas but only one uppercase; German again with ß, which becomes SS in uppercase, but not every SS can become ß when lowercased). These were just some simple examples in Latin scripts.
> Unicode is complex because language is complex. Is it perfect? No. Is it bad? Far from it.

Note: ß has an official uppercase version in German, ẞ, that can be used in parallel to SS since 2017, and is preferred since 2020.
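
For what it's worth, here is a sketch of how this looks from D, assuming the default (locale-independent) mappings in Phobos' `std.uni`: the capital ẞ (U+1E9E) lowercases to ß, but uppercasing ß does not produce ẞ.

```d
import std.uni : toLower, toUpper;

void main()
{
    // U+1E9E LATIN CAPITAL LETTER SHARP S lowercases to U+00DF ...
    assert(toLower('\u1E9E') == '\u00DF');

    // ... but the single-code-point mapping leaves U+00DF unchanged,
    assert(toUpper('\u00DF') == '\u00DF');

    // and the string overload applies the special-casing rule instead.
    assert(toUpper("\u00DF") == "SS");
}
```

Getting ẞ as the uppercase form therefore needs an explicit, language-specific step on top of the default mappings.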
