July 19, 2004
Arcane Jill wrote:

<snip>
> Another problem is that someone might compile it for version(en_US), without
> realizing that they should have been using version(en).
<snip>

Yes, as I said in my original post:
>>> It would be necessary either for the compiler to automatically set en if en_GB or en_US or en_anything is set, and similarly for other language codes, or to persuade all D users to do this. Of course, this would be done in the aforementioned default language setting.
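(The "persuade all D users" option is nothing more than this module-scope idiom - the identifiers being whatever convention gets agreed on:)

version (en_GB) { version = en; }
version (en_US) { version = en; }
// ...and likewise for every other dialect code in use.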

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment.  Please keep replies on the 'group where everyone may benefit.
July 19, 2004
Juanjo Álvarez:
> Anyway if we all can discuss the matter and come up with a better solution
> than gettext (which I'm sure is possible) I doubt many will be opposed.

Just to point out some other - not necessarily better - localization libs:

qt/kde: http://doc.trolltech.com/3.0/linguist-manual.html
java: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Formattable.html


July 19, 2004
In article <cdg2mr$7dn$1@digitaldaemon.com>, Juanjo Álvarez says...

>The encoding of the file is declared in the header of the po files; so it can be (I think) anything, for example:

>the Python implementation (about 500 lines) doesn't use any external C lib at all; it's 100% pure Python.

>It can be done as OO.

>Our implementation could [use exceptions].

>gettext seems to use ISO 3166 country codes and ISO 639 language codes.

Wow!

Well, you've quashed all of my objections, then, so I'll change my vote. Looks like a D implementation of gettext is the way to go.

Just one last thing though - we do need a D definition of a locale. In effect, we need a class Locale (or possibly a struct Locale) containing those ISO codes. Java uses strings internally (I _think_), but there are a whole bunch of reasons why that's not such a good idea - such as the fact that "fr", "fra" and "fre" are all, equivalently, the language code for French, and should all compare as equal; such as case and other punctuation concerns ("en-us" == "en-US" == "en_us" == "en_US", etc.). I'd vote for putting enums inside the class (enum Language and enum Country - the variant field will still need to be a string). I imagine that the gettext implementation will need to use our yet-to-be-invented Locale class, and the unicode lib certainly will (and soon). Any thoughts? Class or struct? Strings or enums? Something else?
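To make the question concrete, here is roughly the sort of thing I mean - a sketch only, with made-up names and a toy parser, not a worked-out design:

import std.string;

// Placeholder enums: the real ones would carry the full ISO 639 / ISO 3166 lists.
enum Language { unknown, en, fr, de, es }
enum Country  { none, US, GB, MX, DE }

struct Locale
{
    Language language;
    Country  country;
    char[]   variant;   // free-form tail, e.g. "posix"
}

// Parse "en-US", "en_us", "EN-us", ... into the same Locale value, so that
// comparing locales is just field-by-field comparison of the parsed form.
Locale parseLocale(char[] tag)
{
    Locale loc;
    char[][] parts = split(replace(tag, "_", "-"), "-");
    if (parts.length > 0 && tolower(parts[0]) == "en")
        loc.language = Language.en;   // real code: table lookup, also mapping "fra"/"fre" -> fr etc.
    if (parts.length > 1 && toupper(parts[1]) == "US")
        loc.country = Country.US;     // real code: table lookup over the ISO 3166 codes
    if (parts.length > 2)
        loc.variant = parts[2];
    return loc;
}

(Struct versus class, and how much parsing intelligence to build in, are exactly the open questions.)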

Arcane Jill




July 19, 2004
In article <cdg8ab$98s$1@digitaldaemon.com>, Stewart Gordon says...

>ISO language codes are language-dialect pairs.  So en-GB is British English, es-MX is Mexican Spanish.

I know.


>AIUI they don't cover other aspects of locale, such as time zones, date formats and the like.

In a sense, they don't cover ANYTHING. They are just tuples of language/country/variant tags. However, if you think of these as map keys, you can turn them into anything else quite straightforwardly.
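A throwaway illustration of what I mean (made-up data; in practice the key would be a proper Locale type rather than a raw string):

// Hypothetical: per-locale data is just an associative array keyed by locale.
char[][char[]] dateFormats;

void initFormats()
{
    dateFormats["en-US"] = "MM/dd/yyyy";
    dateFormats["en-GB"] = "dd/MM/yyyy";
    dateFormats["de-DE"] = "dd.MM.yyyy";
}

// dateFormats["en-GB"] then yields "dd/MM/yyyy"; the same keys can index
// number formats, currencies, collation tables, message catalogues, and so on.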


>They tend to be managed by the OS

That would be nice, but collation, number formats, date formats,  etc. (to give just a few examples) are not handled very well at all by any OS of which I am aware.


>it would seem pointless to try and write apps to override this.

But not pointless for a library. The CLDR, which is maintained by the Unicode Consortium, contains just about every fragment of information you could possibly imagine wanting (short of actual language translation). Its data files are in XML - actually a custom format called LDML (Locale Data Markup Language). It absolutely DOES include such information as time zones, currencies, number formats, and so on. It's a resource we would be foolish to ignore, and since it's XML, it can be robot-parsed far, far more easily than the Unicode database. I will most certainly be using /some/ of the CLDR data for the Unicode collation algorithm.

If you want to write several monolingual applications from the same source, no one is going to stop you. Go ahead and do it, and use whatever version identifiers you want. There's room in this world (and indeed in D) for BOTH compile-time language selection AND true internationalization/localization, so I guess we can all be happy.

Arcane Jill



July 19, 2004
As far as I remember (I looked at gettext a few years ago) gettext has some serious drawbacks. The worst is that parameters that are inserted into the translated string have to be specified in "printf" formatting. That means that their order in the translated string must be the same as the order in the original text, which is not always possible and is often awkward.

My memory is a little fuzzy about the specifics, so please correct me if I'm wrong.

Hauke

Arcane Jill wrote:

> In article <cdeva5$2p2o$1@digitaldaemon.com>, Juanjo Álvarez says...
> 
>>What about just porting GNU gettext to phobos? This way you have a
>>semi-standard way of localizing programs (which a lot of translators know
>>about), and a set of pre-written tools (even nice GUI ones).
> 
> 
> I'm not sure. I'll admit I don't know much about gettext, so perhaps you could
> clear a few things up for me (and others)?
> 
> What is the encoding of a .po or .mo file? Specifically, is it UTF-8? (Or CAN it
> be UTF-8?). Put another way - how good is its Unicode support?
> 
> I guess I have to say I'd be disappointed if we had to rely on yet another C
> library. Maybe I'm just completely mad, but I'd prefer a pure D solution. (That
> said, I have no objection to using files of the same format). gettext looks like
> it does some cool stuff, but it's ... well ... C. It's not OO, unless I've
> misunderstood. It doesn't use exceptions. Moreover, it assumes the Linux meaning
> of "locale", which is (again, in my opinion) not right for D.
> 
> The way I see it, D should define locales exclusively in terms of ISO language
> and country codes, plus variant extensions. Unicode defines locales that way,
> and the etc.unicode library will have no choice but to use the ISO codes.
> Collation and stuff like that will need to rely on data from the CLDR (Common
> Locale Data Repository - see http://www.unicode.org/cldr/).
> 
> I suppose my gut feeling is that internationalization /isn't that hard/, so it
> ought to be a relatively simple task to come up with a native-D solution.
> gettext seems to do string translation only (again, correct me if I'm wrong),
> which is only a small part of internationalization/localization.
> 
> So, I guess, on balance, I'd vote against this one, at least pending some
> persuasive argument. That said, I'm way too busy to volunteer for any work (plus
> I'm still taking a bit of time off from coding for personal reasons) - although
> I /do/ intend to tackle Unicode localization quite soon.
> 
> Does that help? Probably not, I guess. Ah well.
> 
> Tell you what, let's start an open discussion. (I've changed the thread title).
> I think we should hear lots of opinions before anyone actually DOES anything. A
> wrong early decision here could hamper D's potential future as /the/ language
> for internationalization (which I'd like it to become).
> 
> Arcane Jill
> 
> 
> 
July 19, 2004
Thomas Kuehne wrote:

> Juanjo Álvarez:
>> Anyway if we all can discuss the matter and come up with a better solution
>> than gettext (which I'm sure is possible) I doubt many will be opposed.
> 
> Just to point out some other - not necessarily better - localization libs:
> 
> qt/kde: http://doc.trolltech.com/3.0/linguist-manual.html
> java: http://java.sun.com/j2se/1.5.0/docs/api/java/util/Formattable.html

I don't know about the Java implementation, but Qt's tr() is very similar to
gettext (the format of the translation files is different). I don't know if
KDE uses gettext internally, but they use po/mo files just like gettext (and
in the same format).
July 19, 2004
Hauke Duden wrote:

> As far as I remember (I looked at gettext a few years ago) gettext has some serious drawbacks. The worst is that parameters that are inserted into the translated string have to be specified in "printf" formatting. That means that their order in the translated string must be the same as the order in the original text, which is not always possible and is often awkward.
> 
> My memory is a little fuzzy about the specifics, so please correct me if I'm wrong.
> 
> Hauke

Mmm, not GNU gettext - there you can write:

printf(_("There are %d %s %s\n"), count, _(color), _(name));

And the output po file will be:

"There are %1$d %2$s %3$s"

So the translator can change the numbers, thus changing the word order.
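For instance, a (made-up) Spanish entry could swap the last two arguments:

msgid  "There are %d %s %s\n"
msgstr "Hay %1$d %3$s %2$s\n"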
July 19, 2004
Arcane Jill wrote:


> Just one last thing though - we do need a D definition of a locale. In effect, we need a class Locale (or possibly a struct Locale) containing those ISO codes.

Yes, definitely.

> the language code for
> French, and should all compare as equal; such as case and other
> punctuation concerns ("en-us" == "en-US" == "en_us" == "en_US", etc.). I'd
> vote for putting enums inside the class (enum Language and enum Country -
> the variant field will still need to be a string).

I vote for that too.

> I imagine that the
> gettext implementation will need to use our yet-to-be-invented Locale
> class, and the unicode lib certainly will (and soon). Any thoughts? Class
> or struct? Strings or enums? Something else?

Class + enums, IMHO :)

Also the Python locale module[1] (sorry, I'm a pythonist :) could be a good source of inspiration; it supports Unix, Windows and Mac style locales with a bunch of useful functions (getlocale, getdefaultlocale, setlocale, normalize, locale-aware atoi+atof+str+format+strcoll, etc...).

I'll take a look at it this weekend.

[1] http://doc.astro-wise.org/locale.html
July 19, 2004
In article <cdgiqm$dua$1@digitaldaemon.com>, Juanjo Álvarez says...
>
>"There are %1$d %2$s %3$s"
>
>So translator can change the numbers thus changing the word order.

Is this a feature of printf()? If so, is it a Linux thing or an all-platform thing?
And (probably a silly question, but someone might know the answer) is this
functionality available in the new writef()?


July 19, 2004
In article <cdgjdq$e82$1@digitaldaemon.com>, Juanjo Álvarez says...

>Also the Python locale module[1] (sorry, I'm a pythonist :) could be a good source of inspiration; it supports Unix, Windows and Mac style locales with a bunch of useful functions (getlocale, getdefaultlocale, setlocale, normalize, locale-aware atoi+atof+str+format+strcoll, etc...).
>
>I'll take a look at it this weekend.

Be careful not to go too over-the-top here. I think that stuff like locale-aware atoi(), etc., should NOT be member functions of class Locale. I'll explain my reasoning below. class Locale itself should be short, sweet and simple - little more than the embodiment of those ISO codes, in fact. Locales can identify a resource by being used as a map key, so you don't need tons of other stuff built in.

The reason I say this is circularity, or bootstrapping, or simplicity, depending on your point of view. To implement (say) a locale-aware atoi() would require actual KNOWLEDGE of how to do that, for every locale. Now, we COULD pull all that data from the CLDR and implement it by hand, but it would be a lot of work.

Conceptually, it's simpler if we get the very basics up and running first, and then overload functions such as strcoll() later. In fact, in this /particular/ example (collation) we are most certainly better off leaving this until later. I plan to implement the Unicode Collation Algorithm later, based on the data in the CLDR. That will end up as a function which takes a Locale as one of its parameters, and whose behavior is controlled by that parameter. Same with full casing. Where such an algorithm exists (along with all the data) it makes sense to take advantage of it, but we are not in a position to do that yet, because not enough of the basics are there.
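Just to indicate the eventual shape - placeholder names only, and a stand-in body, since none of this is designed yet:

import std.string;

// Placeholder sketch: the real thing would build sort keys from the CLDR
// collation tailoring selected by loc; this stand-in ignores the locale
// entirely and just falls back to plain code-point order.
int localeCompare(char[] a, char[] b, Locale loc)
{
    return cmp(a, b);
}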

But do take a look anyway. Look also at Java's class Locale. It basically does nothing, except identify a locale. That's the kind of line I'm thinking along, as it allows for unlimited expansion later without tying us down to anything.

Jill