February 20, 2017
https://issues.dlang.org/show_bug.cgi?id=17210

          Issue ID: 17210
           Summary: DMD's Failure to Inline Calls in
                    std.array.Appender.put Causes 3x Slowdown
           Product: D
           Version: D2
          Hardware: All
                OS: All
            Status: NEW
          Severity: major
          Priority: P1
         Component: dmd
          Assignee: nobody@puremagic.com
          Reporter: jack@jackstouffer.com

Consider this code in Appender:

    void put(U)(U item) if (canPutItem!U)
    {
        static if (isSomeChar!T && isSomeChar!U && T.sizeof < U.sizeof)
        {
            /* may throwable operation:
             * - std.utf.encode
             */
            // must do some transcoding around here
            import std.utf : encode;
            Unqual!T[T.sizeof == 1 ? 4 : 2] encoded;
            auto len = encode(encoded, item);
            put(encoded[0 .. len]);
        }
        else
        {
            ensureAddable(1);
            immutable len = _data.arr.length;

            import std.conv : emplaceRef;

            auto bigData = (() @trusted => _data.arr.ptr[0 .. len + 1])();
            emplaceRef!(Unqual!T)(bigData[len], cast(Unqual!T)item);
            //We do this at the end, in case of exceptions
            _data.arr = bigData;
        }
    }

Manually inlining the call to emplaceRef for basic types leads to 3x faster code. Replace the non-char code path (the else branch above) with this code:

            static if (isBasicType!U)
            {
                auto d = _data.arr.ptr[0 .. len + 1];
                d[len] = cast(Unqual!T) item;
                _data.arr = d;
            }
            else
            {
                import std.conv : emplaceRef;

                auto bigData = (() @trusted => _data.arr.ptr[0 .. len + 1])();
                emplaceRef!(Unqual!T)(bigData[len], cast(Unqual!T)item);
                //We do this at the end, in case of exceptions
                _data.arr = bigData;
            }

Functionally, the two code paths are exactly the same for basic types; only the call to emplaceRef is gone.
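
To illustrate that equivalence outside of Appender, here is a minimal sketch. It uses the public std.conv.emplace as a stand-in for emplaceRef (which Appender uses internally), and the helper names fillWithEmplace and fillWithAssign are hypothetical, not part of Phobos. For a basic type such as int, constructing in place and plain assignment should leave the buffer in the same state:

```d
import std.conv : emplace;

// Hypothetical helper: writes each element through emplace,
// standing in for the emplaceRef call made by Appender.put.
int[] fillWithEmplace(size_t n)
{
    auto buf = new int[n];
    foreach (i; 0 .. n)
        emplace(&buf[i], cast(int) i);
    return buf;
}

// Hypothetical helper: writes each element by plain assignment,
// as in the manually inlined code path.
int[] fillWithAssign(size_t n)
{
    auto buf = new int[n];
    foreach (i; 0 .. n)
        buf[i] = cast(int) i;
    return buf;
}

void main()
{
    // Both strategies produce the same contents for a basic type.
    assert(fillWithEmplace(8) == fillWithAssign(8));
}
```

This only demonstrates that the results match; it says nothing about which calls DMD chooses to inline.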

Here are the timings before and after the change:

Before: 3 secs, 29 ms, and 842 μs
After:  1 sec, 109 ms, 734 μs, and 6 hnsecs
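
The benchmark that produced these numbers is not included in this report. A rough sketch of how such a measurement could be taken is below. It assumes a recent Phobos where benchmark lives in std.datetime.stopwatch; the element type, the item count, and the helper name appendInts are all hypothetical choices, so the absolute timings will not match the ones above.

```d
import std.array : Appender;
import std.datetime.stopwatch : benchmark;
import std.stdio : writeln;

// Hypothetical workload size; not taken from the report.
enum size_t itemCount = 10_000_000;

// Appends n ints one at a time through Appender.put and
// returns how many elements ended up in the appender.
size_t appendInts(size_t n)
{
    Appender!(int[]) app;
    foreach (i; 0 .. n)
        app.put(cast(int) i);
    return app.data.length;
}

void main()
{
    void run() { appendInts(itemCount); }

    // benchmark returns one Duration per benchmarked function.
    auto results = benchmark!run(1);
    writeln(results[0]);
}
```

Building this once against an unmodified Phobos and once against a Phobos with the patched put, using the same compiler flags for both, would give a before/after comparison.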

--