Unicode library now in Deimos (page 3) - D Programming Language Discussion Forum

July 02, 2004

Re: Unicode library now in Deimos

Posted by Walter
in reply to Arcane Jill

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cboirj$1ca$1@digitaldaemon.com...
> In article <cbnbdl$182e$1@digitaldaemon.com>, Walter says...
> >
> >Cool! Is this a supplement or a replacement for Hauke's earlier work?
> It's a replacement for PART of Hauke's work.

Hmm. This can be confusing. Can the functionality of each be made unique?

July 02, 2004

Re: Unicode library now in Deimos

Posted by Arcane Jill
in reply to Walter

In article <cc297e$ik$2@digitaldaemon.com>, Walter says...
>
>"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cboirj$1ca$1@digitaldaemon.com...
>> In article <cbnbdl$182e$1@digitaldaemon.com>, Walter says...
>> >
>> >Cool! Is this a supplement or a replacement for Hauke's earlier work?
>> It's a replacement for PART of Hauke's work.
>
>Hmm. This can be confusing.

I meant, it's a replacement for "unichar", but not for "utype".


>Can the functionality of each be made unique?

Hauke's "utype" module (a drop-in replacement for ctype) is unique. It is not duplicated in etc.unicode. (The FUNCTIONALITY of some functions is duplicated, of course, but what makes utype special is that all of its functions have the same _name_ as the corresponding ctype functions.) For example, to convert a character to uppercase using simple casing, you can currently use any of:

(a) toupper(c)                // ASCII only, from ctype
(b) toupper(c)                // all Unicode chars, from utype
(c) charToUpper(c)            // all Unicode chars, from unichar
(d) getUppercaseMapping(c)    // all Unicode chars, from etc.unicode

All of the above are locale-unaware, but very shortly there will also be:

(e) getUppercaseMapping(c, locale)    // locale-aware simple casing for all Unicode chars
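For anyone reading this on a present-day D toolchain, here is a minimal sketch of the difference between the ASCII-only, full-Unicode and locale-aware variants listed above. It assumes today's Phobos functions `std.ascii.toUpper` and `std.uni.toUpper` as stand-ins; these are not the ctype, utype, unichar or etc.unicode functions named in the list. The helper names and the two-letter language tag are hypothetical, and the locale-aware helper handles only the well-known Turkish/Azeri dotted I.

```d
static import std.ascii;
static import std.uni;

// Hypothetical helper names, used only for this sketch.

// Like (a): maps 'a'..'z' only and leaves every other character untouched.
dchar upperAsciiOnly(dchar c)
{
    return std.ascii.toUpper(c);
}

// Like (b)-(d): locale-unaware simple uppercase mapping for any code point.
dchar upperSimple(dchar c)
{
    return std.uni.toUpper(c);
}

// Like (e), heavily simplified: only the Turkish/Azeri rule is handled,
// where 'i' uppercases to U+0130 (capital I with dot above).
dchar upperForLocale(dchar c, string lang)
{
    if ((lang == "tr" || lang == "az") && c == 'i')
        return '\u0130';
    return upperSimple(c);
}

void main()
{
    assert(upperAsciiOnly('a') == 'A');
    assert(upperAsciiOnly('\u00E9') == '\u00E9'); // e-acute is left alone
    assert(upperSimple('\u00E9') == '\u00C9');    // e-acute -> E-acute
    assert(upperForLocale('i', "en") == 'I');
    assert(upperForLocale('i', "tr") == '\u0130');
}
```

A real locale-aware mapping would be driven by the Unicode data files rather than a hard-coded special case.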


It is not possible, however, to make "etc.unicode" and "unichar" unique. The former is a superset of the latter. The codebuilder could, of course, be instructed NOT to generate those functions for which similar functionality exists in unichar, but then you'd have problems with keeping both versions in step with each other, function names being inconsistent, linking strategy being different, and so on.


My vote would go to retaining "utype" (because people are familiar with ctype), but using "etc.unicode" in place of "unichar". What "etc.unicode" can do is a superset of what "unichar" can do, in terms of provided functions, and in addition can be rebuilt for any future (or even past) version of Unicode by any end-user in a matter of minutes (once the codebuilder program goes public). It's also likely to be smaller, because you link in only those parts you need, whereas with "unichar" it's all or nothing.

It should be noted of course that "etc.unicode" is well optimized for space (but still with guaranteed constant-time lookup), and Hauke's input is what made this possible.
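The post does not say how the tables are laid out, so the following is only a generic sketch of one common way to get a lookup that is both compact and constant-time: a two-stage table in which identical blocks of the flat table are stored once. It should not be read as the layout etc.unicode actually uses. The names `TwoStageTable` and `buildTable` and the block size of 256 are assumptions made for the illustration.

```d
// Generic two-stage lookup sketch; NOT taken from etc.unicode.
enum blockBits = 8;
enum blockSize = 1 << blockBits; // 256 code points per block

struct TwoStageTable
{
    ushort[] stage1; // block number -> index of a (possibly shared) block
    bool[] stage2;   // the distinct blocks, stored back to back

    // Constant time: two array reads, no searching.
    // Code points beyond the built range are out of bounds.
    bool opIndex(dchar c) const
    {
        immutable size_t block = stage1[c >> blockBits];
        return stage2[block * blockSize + (c & (blockSize - 1))];
    }
}

// Compress a flat per-code-point table by storing identical blocks once.
TwoStageTable buildTable(const bool[] flat)
{
    assert(flat.length % blockSize == 0, "pad to a whole number of blocks");
    TwoStageTable t;
    for (size_t start = 0; start < flat.length; start += blockSize)
    {
        const blk = flat[start .. start + blockSize];
        size_t found = size_t.max;
        // Linear search is fine here: this runs at build time only.
        for (size_t b = 0; b * blockSize < t.stage2.length; ++b)
        {
            if (t.stage2[b * blockSize .. (b + 1) * blockSize] == blk)
            {
                found = b;
                break;
            }
        }
        if (found == size_t.max)
        {
            found = t.stage2.length / blockSize;
            t.stage2 ~= blk;
        }
        t.stage1 ~= cast(ushort) found;
    }
    return t;
}

void main()
{
    // Toy property: "is an ASCII capital letter", over 16 blocks.
    auto flat = new bool[0x1000];
    foreach (i; 0x41 .. 0x5B) // 'A' .. 'Z'
        flat[i] = true;

    auto t = buildTable(flat);
    assert(t['A'] && t['Z'] && !t['a']);
    assert(t.stage1.length == 16);
    assert(t.stage2.length == 2 * blockSize); // 15 all-false blocks share one copy
}
```

The saving comes from the fact that most blocks of a Unicode property table are identical (typically all-false), so they collapse into a single shared block.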

The function names are definitely confusing, I agree. But in the case of "etc.unicode", the names originate from the Unicode Consortium. These folk define property names, such as "Simple_Uppercase_Mapping", and so on. The names are part of the Unicode standard - so I just slavishly put my metaphorical blinkers on, removed the underscores and added a "get" or "is" prefix ("get" for non-boolean properties, "is" for boolean properties) to conform to the D style guide. Such names may be cumbersome, but I still think it's better than using made-up names, and it's a consistent methodology to extend to the remaining properties we haven't added yet.
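The naming rule described above (drop the underscores, then prefix "get" or "is") is mechanical enough to sketch in a few lines. The helper name `propertyFunctionName` is hypothetical, and the results are simply what the stated rule yields. They are not guaranteed to match the names etc.unicode actually exports; the list earlier in this post shows getUppercaseMapping, for instance.

```d
import std.array : replace;

// Hypothetical helper: derive a D-style function name from a Unicode
// property name by removing underscores and adding a "get"/"is" prefix.
string propertyFunctionName(string property, bool isBoolean)
{
    return (isBoolean ? "is" : "get") ~ property.replace("_", "");
}

void main()
{
    assert(propertyFunctionName("Simple_Uppercase_Mapping", false)
        == "getSimpleUppercaseMapping");
    assert(propertyFunctionName("White_Space", true) == "isWhiteSpace");
}
```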

I hope that makes things less confusing. Unfortunately, right now, Deimos is not well organized, because it is in the hands of many people. To me it makes more sense that people should be able to download the whole of Deimos in one go (instead of individual packages), just like they can currently download the whole of Phobos in one go. That sort of organization would be easy if Deimos were a one-person-project, or even a project with one leader whose word was law, but it's a collective effort, and I think those involved are going to HAVE to put some effort into making it look like a unified effort. This will happen in time, but I mention it because, right now (and I hope this is a temporary phase), "unichar" is easier to use than "etc.unicode", even though both are currently supplied in source code form, if only for the simple reason that "unichar" is one file and "etc.unicode" is many files. In the (near?) future, I would hope to have the following:

(1) Deimos being easy-to-download and easy-to-use, with pre-built linkable libraries for all platforms, in both Debug and Release builds.

(2) Headers for etc.unicode (by which I mean, stripped versions of the source
code, with large tables removed), to speed up compilation time.

(3) The codebuilder program (which generates etc.unicode) being made public,
along with documentation, so that people can compile Unicode lookups for any
version of Unicode, past, present or future (or even customized).

Until this is done, unichar is likely to be easier to use. However, once these steps are taken, I would then have no hesitation in suggesting that we use as standard:


(i) etc.unicode
(possibly renamed to std.unicode - the codebuilder can locate it anywhere).

(ii) utype
(possibly renamed to std.utype).


Arcane Jill

July 02, 2004

Re: Unicode library now in Deimos

Posted by Hauke Duden
in reply to Arcane Jill

Arcane Jill wrote:
> In article <cc297e$ik$2@digitaldaemon.com>, Walter says...
> 
>>"Arcane Jill" <Arcane_member@pathlink.com> wrote in message
>>news:cboirj$1ca$1@digitaldaemon.com...
>>
>>>In article <cbnbdl$182e$1@digitaldaemon.com>, Walter says...
>>>
>>>>Cool! Is this a supplement or a replacement for Hauke's earlier work?
>>>
>>>It's a replacement for PART of Hauke's work.
>>
>>Hmm. This can be confusing.
> 
> 
> I meant, it's a replacement for "unichar", but not for "utype".
> 
> 
> 
>>Can the functionality of each be made unique?
> 
> 
> Hauke's "utype" module (drop-in replacement for ctype) is unique. It is not
> duplicated in etc.unicode. (The FUNCTIONALITY of some functions is duplicated,
> of course, but that's because what makes utype special is the fact that all
> functions have the same _name_ as the corresponding ctype functions). For
> example, to convert a character to uppercase using simple-casing, you can
> currently use any of:
> 
> (a) toupper(c)                // ASCII only, from ctype
> (b) toupper(c)                // all Unicode chars, from utype
> (c) charToUpper(c)            // all Unicode chars, from unichar
> (d) getUppercaseMapping(c)    // all Unicode chars, from etc.unicode
> 
> All of the above are locale-unaware, but very shortly there will also be:
> 
> (e) getUppercaseMapping(c, locale)    // locale-aware simple casing for all Unicode chars
> 
> 
> It is not possible, however, to make "etc.unicode" and "unichar" unique. The
> former is a superset of the latter. The codebuilder could, of course, be
> instructed NOT to generate those functions for which similar functionality
> exists in unichar, but then you'd have problems with keeping both versions in
> step with each other, function names being inconsistent, linking strategy being
> different, and so on.
> 
> 
> My vote would go to retaining "utype" (because people are familiar with ctype),
> but using "etc.unicode" in place of "unichar". What "etc.unicode" can do is a
> superset of what "unichar" can do, in terms of provided functions, and in
> addition can be rebuilt for any future (or even past) version of Unicode by any
> end-user in a matter of minutes (once the codebuilder program goes public). It's
> also likely to be smaller, because you link in only those parts you need,
> whereas with "unichar" it's all or nothing.
> 
> It should be noted of course that "etc.unicode" is well optimized for space (but
> still with guaranteed constant-time lookup), and Hauke's input is what made this
> possible.
> 
> The function names are definitely confusing, I agree. But in the case of
> "etc.unicode", the names originate from the Unicode Consortium. These folk
> define property names, such as "Simple_Uppercase_Mapping", and so on. The names
> are part of the Unicode standard - so I just slavishly put my metaphorical
> blinkers on, removed the underscores and added a "get" or "is" prefix ("get" for
> non-boolean properties, "is" for boolean properties) to conform to the D style
> guide. Such names may be cumbersome, but I still think it's better than using
> made-up names, and it's a consistent methodology to extend to the remaining
> properties we haven't added yet.
> 
> I hope that makes things less confusing. Unfortunately, right now, Deimos is not
> well organized, because it is in the hands of many people. To me it makes more
> sense that people should be able to download the whole of Deimos in one go
> (instead of individual packages), just like they can currently download the
> whole of Phobos in one go. That sort of organization would be easy if Deimos
> were a one-person-project, or even a project with one leader whose word was law,
> but it's a collective effort, and I think those involved are going to HAVE to
> put some effort into making it look like a unified effort. This will happen in
> time, but I mention it because, right now (and I hope this is a temporary
> phase), "unichar" is easier to use than "etc.unicode", even though both are
> currently supplied in source code form, if only for the simple reason that
> "unichar" is one file and "etc.unicode" is many files. In the (near?) future, I
> would hope to have the following:
> 
> (1) Deimos being easy-to-download and easy-to-use, with pre-built linkable
> libraries for all platforms, in both Debug and Release builds.
> 
> (2) Headers for etc.unicode (by which I mean, stripped versions of the source
> code, with large tables removed), to speed up compilation time.
> 
> (3) The codebuilder program (which generates etc.unicode) being made public,
> along with documentation, so that people can compile Unicode lookups for any
> version of Unicode, past, present or future (or even customized).
> 
> Until this is done, unichar is likely to be easier to use. However, once these
> steps are taken, I would then have no hesitation in suggesting that we use as
> standard:
> 
> 
> (i) etc.unicode
> (possibly renamed to std.unicode - the codebuilder can locate it anywhere).
> 
> (ii) utype
> (possibly renamed to std.utype).


I agree that Phobos should only have one Unicode package - everything else would be a bad idea.

My hope is that I'll be able to integrate some of the current advantages of unichar (faster lookup, smaller footprint) with AJ's work so that they can apply to all Unicode functions. I'll know more about the possibilities once AJ releases the code generator.

I also have a feeling that it is not necessary to have as many separate modules as etc.unicode currently has (but since I have only glanced at etc.unicode I could be wrong). Since the linker will always throw out uncalled functions and unaccessed data (correct?) it should be possible to make it easier to use. The main thing we'd have to keep an eye on is that static module constructors do not pull in all the data and functions.

The function names could also use some tuning - right now they feel a little clunky (as do the unichar function names, of course). The main challenge here is to make sure that people who know Unicode will still recognize the property names, and that the functions stay sufficiently different from utype/ctype to prevent confusion (since utype/ctype define quite a few properties with the same names in a different way).

Hauke

July 02, 2004

Re: Unicode library now in Deimos

Posted by Arcane Jill
in reply to Hauke Duden

In article <cc3a30$1ni0$1@digitaldaemon.com>, Hauke Duden says...
>
>I agree that Phobos should only have one Unicode package - everything else would be a bad idea.
>
>My hope is that I'll be able to integrate some of the current advantages of unichar (faster lookup, smaller footprint) with AJ's work so that they can apply to all Unicode functions. I'll know more about the possibilities once AJ releases the code generator.

No problem. I'll do that soon, like within a week. You should definitely be given write access, and if you can make it better/faster/whatever that would certainly be great.


>I also have a feeling that it is not necessary to have as many separate modules as etc.unicode currently has (but since I have only glanced at etc.unicode I could be wrong).

As currently written, the codebuilder generates between two and four modules per Unicode property. One is just a wrapper to present a usable interface to humans; the others are purely robot-generated and (in general) comprise one module per lookup table (but you still need a module even if there are zero lookup tables). You could certainly have a bash at reducing the number of modules per Unicode property, but it would be bad (in my opinion) to go below one module per property. That is the minimum you need.

I don't know if having many modules is actually a problem though. From the end users' point of view, they will still need to import ONE module, and link with ONE library, and the rest happens automatically. But yeah, if you can make that happen better or faster, great!



>Since the linker will always throw out uncalled functions and unaccessed data (correct?) it should be possible to make it easier to use.

No, the linker can only throw out whole unused modules. It cannot throw out /parts/ of modules. Therefore, each "optional thing" needs to be in its own module or modules.




>The main thing we'd have to keep an eye on is that static module constructors do not pull in all the data and functions.

There aren't any static module constructors. None at all. Zero. So keeping an eye on that should be fairly easy, I'd say.



>The function names could also use some tuning - right now they feel a little clunky (as do the unichar function names, of course). The main challenge here is to make sure that people who know Unicode will still recognize the property names, and that the functions stay sufficiently different from utype/ctype to prevent confusion (since utype/ctype define quite a few properties with the same names in a different way).

Yes, I agree. The problem is that the names are defined by the Unicode Consortium (though I did tweak them to match D style). They are the standard, official names for properties. I think we'd have to be quite imaginative to come up with any reasonable alternative.

Arcane Jill

July 03, 2004

Re: Unicode library now in Deimos

Posted by Hauke Duden
in reply to Arcane Jill

Arcane Jill wrote:
>>Since the linker will always throw out uncalled functions and unaccessed data (correct?) it should be possible to make it easier to use.
> 
> 
> No, the linker can only throw out whole unused modules. It cannot throw out
> /parts/ of modules. Therefore, each "optional thing" needs to be in its own
> module or modules.

Hmmm. I ran a few tests and it seems that you're right. As soon as you specify a module to the linker it will be included fully in the executable, regardless of whether it is used or not. It is unfortunate that it isn't a little more sophisticated :(.

GDC should be able to do better (since GCC/G++ does for C/C++), but since the unicode lib should work well with all compilers it seems that your current approach is the only feasible one.

Hauke

Copyright © 1999-2021 by the D Language Foundation