foreach (dchar c; s) is too slow

The runtime support for string foreach with decoding has some serious
performance issues.
It's currently implemented by creating a delegate for the foreach body
and calling _aApply*.
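
To make the cost visible, here is a rough user-level model of that lowering. This is only a sketch: applyDecoded and countViaDelegate are hypothetical names standing in for the runtime helpers, and the real _aApply* functions have different signatures. The point is that every iteration goes through an indirect delegate call.

```d
import std.utf : decode;

// Hypothetical stand-in for the runtime's _aApply* helpers (not the real code).
// Decodes one code point at a time and hands it to the loop body delegate.
int applyDecoded(string s, scope int delegate(dchar) dg)
{
    size_t i = 0;
    while (i < s.length)
    {
        dchar c = decode(s, i);   // advances i past the decoded code point
        if (auto r = dg(c))       // non-zero result means break/return in the body
            return r;
    }
    return 0;
}

// Roughly what `foreach (dchar c; s) ++n;` turns into under this model.
size_t countViaDelegate(string s)
{
    size_t n;
    applyDecoded(s, (dchar c) { ++n; return 0; });
    return n;
}
```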
I recently submitted this pull to vibe.d
https://github.com/rejectedsoftware/vibe.d/pull/327.
Using the explicit loop
    for (auto tmp = s; !tmp.empty; tmp.popFront()) { auto c = tmp.front; /*CODE*/ }
is about 3 times faster.
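
For reference, a minimal sketch of the two forms side by side. The function names viaForeach and viaRange are made up for illustration, and this only shows that both loops yield the same code points; it does not measure anything.

```d
import std.range : empty, front, popFront;

// Decoding foreach: the body is currently wrapped in a delegate by the compiler.
dchar[] viaForeach(string s)
{
    dchar[] result;
    foreach (dchar c; s)
        result ~= c;
    return result;
}

// Explicit range loop: front decodes a code point, popFront skips past it.
dchar[] viaRange(string s)
{
    dchar[] result;
    for (auto tmp = s; !tmp.empty; tmp.popFront())
    {
        auto c = tmp.front;   // c is a dchar
        result ~= c;
    }
    return result;
}
```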
I'd like to avoid such hidden performance pitfalls. Any ideas how to
transition to templated library code?
For dchar iteration the compiler could simply prefer the range
interface, but that wouldn't work for wchar (or for char when iterating
over a wstring or dstring).
Is it something that should only be added to phobos, e.g.
    foreach (c; s.byChar!dchar())
?
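
A rough sketch of what such a library helper could look like for string input. ByChar and byChar here are hypothetical and written only to illustrate the idea; the sketch decodes with std.utf.decode and re-encodes to the requested code unit type with std.utf.encode.

```d
import std.utf : decode, encode;

// Hypothetical input range over a string, yielding code units of type C.
struct ByChar(C)
{
    private string s;
    private size_t idx;            // next undecoded position in s
    private C[4 / C.sizeof] buf;   // code units of the current code point
    private size_t len, pos;       // valid units in buf, current unit

    this(string s) { this.s = s; fill(); }

    @property bool empty() const { return pos == len; }
    @property C front() const { return buf[pos]; }
    void popFront() { if (++pos == len) fill(); }

    // Decode the next code point and store it as C code units.
    private void fill()
    {
        pos = len = 0;
        if (idx < s.length)
        {
            dchar c = decode(s, idx);
            static if (is(C == dchar)) { buf[0] = c; len = 1; }
            else len = encode(buf, c);
        }
    }
}

// Hypothetical entry point, usable as s.byChar!wchar() via UFCS.
auto byChar(C)(string s) { return ByChar!C(s); }
```

With this, foreach (c; s.byChar!wchar()) iterates UTF-16 code units without going through a delegate, since the loop is lowered to plain empty/front/popFront calls on the struct.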
Or should this be a combination of compiler and library support, i.e.
when an explicit char type is given, the compiler
instantiates a template with that char type?
