Unicode Character and String Intrinsics (page 5)

If you've got a UTF-32 string, UTF-16 is really only needed when calling things like Win32 APIs.

Dan

"Walter" <walter@digitalmars.com> wrote in message news:bagjlo$308t$1@digitaldaemon.com...
>
> "Mark Evans" <Mark_member@pathlink.com> wrote in message news:b6beep$1qom$1@digitaldaemon.com...
> > Maybe it will mend fences to say in public that UTF-32 could be dropped. I have
> > objective reasons for saying so, not vague unease: UTF-32 is rarely used and
> > truly fixed-width (so it can be 'faked' as Walter suggests). Nonetheless
> > intrinsic UTF-32 is just as reasonable to support as, say, the equally rarely
> > used, and equally fake-able 'ifloat' type.
>
> My understanding is that the linux wchar_t type is UTF-32, which puts it in
> common use. UTF-32 is also handy as an intermediate form when converting
> between UTF-8 and UTF-16.
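
The quoted remark about UTF-32 as an intermediate form can be illustrated with a minimal sketch in C: decode one UTF-8 sequence to a 32-bit code point, then encode that code point as UTF-16. The function names `utf8_decode` and `utf16_encode` are hypothetical, chosen for this illustration, and the decoder is deliberately simplified (it does not reject overlong forms or surrogate code points).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s into a UTF-32 code point.
   Returns the number of bytes consumed, or 0 on malformed or truncated input.
   Simplified: overlong forms and surrogate code points are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0) return 0;

    unsigned char b = s[0];
    size_t n;
    if (b < 0x80)                { *cp = b;        n = 1; }
    else if ((b & 0xE0) == 0xC0) { *cp = b & 0x1F; n = 2; }
    else if ((b & 0xF0) == 0xE0) { *cp = b & 0x0F; n = 3; }
    else if ((b & 0xF8) == 0xF0) { *cp = b & 0x07; n = 4; }
    else return 0;               /* continuation byte or invalid lead byte */

    if (n > len) return 0;
    for (size_t i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return n;
}

/* Encode one UTF-32 code point as UTF-16.
   Returns the number of 16-bit units written (1 or 2), or 0 if out of range.
   Simplified: a code point in the surrogate range is passed through as-is. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp < 0x10000) { out[0] = (uint16_t)cp; return 1; }
    if (cp > 0x10FFFF) return 0;
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

Converting a whole string is then a loop that calls `utf8_decode`, advances by the returned byte count, and appends the units produced by `utf16_encode`.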

That lets you index sequentially pretty fast, but not randomly.

Sean

"Walter" <walter@digitalmars.com> wrote in message news:bagk8l$30ti$2@digitaldaemon.com...
>
> "Sean L. Palmer" <palmer.sean@verizon.net> wrote in message news:b6bjg5$1ut5$1@digitaldaemon.com...
> > "Matthew Wilson" <dmd@synesis.com.au> wrote in message news:b6bgt5$1sai$1@digitaldaemon.com...
> > > This sounds like a nice idea - array of 1st-byte plus lookups. I'm intrigued
> > > as to the nature of the lookup table. Is this a constant, process-wide,
> > > entity?
> >
> > No, because the map is indexed by the same index used to index into the flat
> > array. Unless I'm misunderstanding something.
>
> You could use a static 256 byte lookup table to give you the 'stride' to the
> next char.
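
A minimal sketch in C of the static 256-byte stride table described in the quoted text, assuming the string is well-formed UTF-8. The names `utf8_stride` and `utf8_offset` and the helper macro `R16` are hypothetical. Continuation bytes and invalid lead bytes are given a stride of 1 so that a scan always advances.

```c
#include <assert.h>
#include <stddef.h>

/* Repeat a value 16 times; used only to keep the initializer readable. */
#define R16(v) v,v,v,v,v,v,v,v,v,v,v,v,v,v,v,v

/* Stride in bytes from a lead byte to the start of the next character. */
static const unsigned char utf8_stride[256] = {
    R16(1), R16(1), R16(1), R16(1),
    R16(1), R16(1), R16(1), R16(1),    /* 0x00-0x7F: ASCII */
    R16(1), R16(1), R16(1), R16(1),    /* 0x80-0xBF: continuation bytes */
    R16(2), R16(2),                    /* 0xC0-0xDF: 2-byte lead */
    R16(3),                            /* 0xE0-0xEF: 3-byte lead */
    4,4,4,4,4,4,4,4,                   /* 0xF0-0xF7: 4-byte lead */
    1,1,1,1,1,1,1,1                    /* 0xF8-0xFF: invalid in UTF-8 */
};

/* Byte offset of the n-th character (0-based), or len if n is past the end.
   Cost is proportional to n: the walk is sequential, not random access. */
size_t utf8_offset(const unsigned char *s, size_t len, size_t n)
{
    assert(s != NULL);
    size_t i = 0;
    while (n > 0 && i < len) {
        i += utf8_stride[s[i]];
        n--;
    }
    return i < len ? i : len;
}
```

Stepping to the next character costs one table lookup, which matches the "sequentially pretty fast" case. Reaching character n still means walking all n characters before it, so random indexing stays linear.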

May 23, 2003

Re: Unicode Character and String Intrinsics

Posted by Mark Evans
in reply to Walter

Walter wrote:
>I appreciate the thought, but carrying around an extra array for each string seems difficult to make work, especially in view of slicing, etc.

I would need a specific implementation code example to understand your thinking. (Clarification: I did not propose an extra array per string, but a lookup table -- something considerably smaller and often empty.)  My gut says it would be easy.

>I don't
>think there's any way to design the language so it is both efficient at
>dealing with ordinary ascii, and transparently able to do multibytes.

The problem here is either/or thinking.  Both are possible.  People who desperately want C byte arrays can declare them, irrespective of Unicode strings.

If the idea is that an intrinsic string type must simultaneously support Unicode and ASCII at equal performance levels, then I think the problem is one of definition.  In the first place, D lacks an honest string intrinsic, so a new one could be defined just for Unicode, leaving the current whatever-it-is in place. If people don't care for Unicode, then they can use whatever-it-is D offers currently.

However my gut says that a Unicode string intrinsic holding just ASCII vs. an ASCII string as currently implemented would be neck and neck in terms of performance.  Remember that you don't necessarily need a bit test on every character every time.  The table object can flag callers when it's totally empty and they can proceed with manipulations on that basis.  In that sense the Unicode concept is really just a superset of what you already have.
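
Since no concrete implementation appears in this thread, here is one possible reading of the table idea, sketched in C under stated assumptions. The string holds UTF-8 bytes plus a sparse table with one entry per multibyte character, and an empty table marks the string as pure ASCII. The types `WideEntry` and `UString` and the function `ustring_offset` are hypothetical and are not taken from D or from any post here.

```c
#include <assert.h>
#include <stddef.h>

/* One entry per multibyte character; a pure-ASCII string has no entries. */
typedef struct {
    size_t char_index;   /* index of the character within the string */
    size_t extra_bytes;  /* bytes beyond the first (1 to 3 in UTF-8) */
} WideEntry;

typedef struct {
    const unsigned char *bytes;  /* UTF-8 data */
    size_t byte_len;
    const WideEntry *table;      /* sorted by char_index; may be NULL if empty */
    size_t table_len;            /* 0 flags the string as pure ASCII */
} UString;

/* Byte offset of character n (0-based). */
size_t ustring_offset(const UString *s, size_t n)
{
    assert(s != NULL);
    if (s->table_len == 0)
        return n;   /* fast path: character index equals byte index */

    /* Slow path: add the extra bytes of every multibyte character before n. */
    size_t off = n;
    for (size_t i = 0; i < s->table_len && s->table[i].char_index < n; i++)
        off += s->table[i].extra_bytes;
    return off;
}
```

With an empty table, indexing costs a single test on `table_len` on top of plain byte indexing, which is the "neck and neck" case. With entries present, the cost grows with the number of multibyte characters before n rather than with n itself. A slice of such a string would need its own rebased table, which is where the concern about slicing applies.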

Considering the number of languages now being retrofitted for Unicode, I think it would be a mistake, and one that will be regretted later, not to build it into D while the chance to do it cleanly exists.

Best,
Mark
