Thread overview
[Suggestion] Standard version identifiers for language
Jul 16, 2004
Stewart Gordon
Jul 16, 2004
Arcane Jill
Jul 16, 2004
J C Calvarese
Jul 18, 2004
Arcane Jill
Jul 18, 2004
Juanjo Álvarez
Internationalization - an open discussion (was: Standard version identifiers...)
Jul 19, 2004
Arcane Jill
Jul 19, 2004
Juanjo Álvarez
Jul 19, 2004
Thomas Kuehne
Jul 19, 2004
Juanjo Álvarez
Re: Internationalization - an open discussion
Jul 19, 2004
Berin Loritsch
Jul 19, 2004
Arcane Jill
Jul 19, 2004
Juanjo Álvarez
Re: Internationalization - an open discussion
Jul 19, 2004
Arcane Jill
Re: Internationalization - an open discussion
Jul 19, 2004
Hauke Duden
Jul 19, 2004
Juanjo Álvarez
Jul 19, 2004
Arcane Jill
Jul 19, 2004
Sean Kelly
Jul 20, 2004
Jonathan Leffler
Jul 19, 2004
Arcane Jill
Jul 19, 2004
Hauke Duden
Jul 19, 2004
J C Calvarese
Jul 19, 2004
Arcane Jill
Jul 19, 2004
Stewart Gordon
Jul 19, 2004
Arcane Jill
Jul 19, 2004
Stewart Gordon
July 16, 2004
Suppose one wants to write an application with versions in different languages.  That's human languages, not programming languages.

At face value, that's simple - use version blocks to hold the various languages' UI text.  Or for Windows, define a separate resource file for each language.  (Or lists of string macros to be imported by one resource file.)

But what if you want to use one or more libraries from various sources, which also may have language versions?  Then you'd have to set all the version identifiers that the different library designers have chosen for your choice of language, which could lead to quite long command lines. It would be simpler if there could be a standard system of language identifiers for everyone to follow.

A system based on ISO 639-1 would probably be good.  One could then write

----------
version (en) {
    const char[][] DAY = [ "Sunday", "Monday", "Tuesday", "Wednesday",
      "Thursday", "Friday", "Saturday" ];
} else version (fr) {
    const char[][] DAY = [ "dimanche", "lundi", "mardi", "mercredi",
      "jeudi", "vendredi", "samedi" ];
} else {
    static assert (false);
}
----------

There are a few matters of debate to be considered:

1. Should we create a prefixed language namespace?  Or will these codes by themselves do?

2. How should a lib really be written to deal with unsupported languages, or with no language having been specified?  Two possibilities I can see (a sketch of (a) follows this list):
(a) the lib programmer would put his/her own language (or maybe the one predicted to be most popular) as the default.
(b) a static assert as above, effectively telling the app programmer "please set a language, or create a version block in me for your language".  Maybe a future D compiler could be configured to use a certain language version as the default if none is specified on the command line.

3. Should we really have them as version identifiers?  Or invent a new CC block called 'language' that would have the specifics of language designation built in, a corresponding command line option and a corresponding compiler configuration setting?
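
As a rough illustration of option 2(a), here's a minimal sketch (the GREETING constant and the English fallback are made up purely for this example):

----------
version (fr) {
    const char[] GREETING = "bonjour";
} else version (de) {
    const char[] GREETING = "hallo";
} else {
    // no language version set - fall back to the lib writer's default
    const char[] GREETING = "hello";
}
----------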

Dialects of a language could be indicated by replacing the hyphen in the ISO code with an underscore.  Libs would then have something like

----------
version (en_GB) {
    ...
} else version (en_US) {
    ...
} else version (en) {
    ...
}
----------

It would be necessary either for the compiler to automatically set en if en_GB or en_US or en_anything is set, and similarly for other language codes, or to persuade all D users to do this.  Of course, this would be done in the aforementioned default language setting.

This would give lib programmers a choice of writing for each dialect of each language, just covering the basic languages, or a mixture.  A default fallback for an unsupported dialect would, I guess, typically be some 'default' dialect if that makes sense.

This provides for compile-time localisation.  Of course, some might want run-time l10n, in which case the app would be explicitly programmed to do this.

In writing a lib, one might choose to support run-time localisation.  In that case the version/language blocks would be used to select the default language, which would make the lib usable by monolingual apps, compile-time-localised apps and run-time-localised apps alike.  Of course, one could argue that there should be some global variable in Phobos or somewhere for the run-time language....

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment.  Please keep replies on the 'group where everyone may benefit.
July 16, 2004
In article <cd8rji$ar0$1@digitaldaemon.com>, Stewart Gordon says...
>
><some interesting ideas>

Can I respond to this next week?

Short answer - this is a run-time problem, not a compile-time problem. A third party should (IMO) be able to add additional human languages given access only to the executable binary and some ini files (or something similar).

Long answer - can you wait? I've got lots of ideas about this, but I'm really not up to a debate just yet.

Arcane Jill


July 16, 2004
Stewart Gordon wrote:
> Suppose one wants to write an application with versions in different languages.  That's human languages, not programming languages.
> 
> At face value, that's simple - use version blocks to hold the various languages' UI text.  Or for Windows, define a separate resource file for each language.  (Or lists of string macros to be imported by one resource file.)
> 
> But what if you want to use one or more libraries from various sources, which also may have language versions?  Then you'd have to set all the version identifiers that the different library designers have chosen for your choice of language, which could lead to quite long command lines. It would be simpler if there could be a standard system of language identifiers for everyone to follow.
> 
> A system based on ISO 639-1 would probably be good.  One could then write
> 
> ----------
> version (en) {
>     const char[][] DAY = [ "Sunday", "Monday", "Tuesday", "Wednesday",
>       "Thursday", "Friday", "Saturday" ];
> } else version (fr) {
>     const char[][] DAY = [ "dimanche", "lundi", "mardi", "mercredi",
>       "jeudi", "vendredi", "samedi" ];
> } else {
>     static assert (false);
> }
> ----------

There seem to be two schools of thought in the area of localization (of which language issues are a subset):

1. Compile-time generated using version() as you described in your post.

2. Run-time generated using some sort of a plugin architecture or language resource files (as Arcane Jill alludes to in her reply).


Since D is capable enough for either method, both parties can be happy.

Personally, I think I'd prefer to use compile-time localization, but that doesn't prevent others from designing run-time localization functions.

I have a quick comment about the specifics of your ideas. I think the version identifiers should have a prefix (such as "lang_"). This should make it clear to most viewers of the code what's happening. Not everyone would intuitively know that ky_KG is a language feature, but lang_ky_KG is guessable.

version (lang_en_GB) {
    ...
} else version (lang_en_US) {
    ...
} else version (lang_en) {
    ...
}

Since I don't have any real experience with localization, I'd love to hear some opinions from those who have actually worked with it. Which programming languages make localization easy? Which libraries are helpful?

-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
July 18, 2004
In article <cd9peo$oce$1@digitaldaemon.com>, J C Calvarese says...

>There seems to be 2 schools of thought in the area of localization (of which language issues are a subset):
>
>1. Compile-time generated using version() as you described in your post.
>
>2. Runtime-time generated using some sort of a plugin architecture or language resource files (as Arcane Jill alludes to in her reply).
>
>Since D is capable enough for either method, both parties can be happy.
>
>Personally, I think I'd prefer to use compile-time localization, but that doesn't prevent others from designing runtime-time localization functions.

Compile-time localization is not something I'd thought about before, so it's been kind of an interesting thing to think about. The thing is, though, we already _have_ compile-time localization. We've always had it. As Stewart said, you can do:

#    version(fr)
#    {
#        writef("something in French");
#    }
#    else version(de)
#    {
#        writef("something in German");
#    }
#    else
#    {
#        writef("something in English");
#    }

And we've been able to do that in C ever since we learned how to use #ifdef. So it requires no new language features. It's already there. But, since this technique has been around for so long, you'd expect it to be widely used ... unless, that is, it turns out to be not very useful.

There are a number of disadvantages I can think of. For a start, your locale-specific code will end up distributed throughout your source code, instead of all in one place. This could be a nightmare if you decide to support a new locale. Another problem is that, if you choose the locale at compile-time, then the end-user (as opposed to the developer) has to have the source code, OR an executable which was compiled especially for their locale. And that's just for executables. For libraries, the situation is even worse. A compile-time-localized executable would have to be linked with compile-time-localized libraries, compiled for the same locale. It would be a serious headache.

Another problem is that someone might compile it for version(en_US), without
realizing that they should have been using version(en).

And all for what? To save a small amount of run-time overhead. Well, /how much/ run-time overhead? In most cases, run-time-localization amounts to looking something up in a map. Is that bad? A matter of judgement, maybe, but I'd say it was insignificant compared to the overhead incurred by writing that localized string to printf() or a file.

Localization, to me, is the flip-side of internationalization (or i18n for lazy typists). The way it's traditionally done is you "internationalize" your code - a compile-time thing, and then "localize" it at run-time. Here's an example. Start with some normal, unlocalized code:

#    printf("hello world\n");

Now, internationalize it. It will become something like this:

#    printf(local("hello world\n"));

Not much different really. The function local() would probably be something like this, and would get inlined:

#    char[][char[]] localizedLookup; // global variable
#
#    char[] local(char[] s)
#    {
#        return localizedLookup[s];
#    }

"Localizing" this program now consists only of initializing the localizedLookup map, which could happen in any number of ways.

My guess is that what would be most useful in terms of internationalization/localization would be some classes and functions to make stuff like the above easier, providing localized number formats and so on. Plus of course the D definition of a "locale" - we have to standardize that somehow. Me, I'd prefer enums to strings - saves all that messing about with case, for one thing, and string-splitting to get at the two (possibly three) parts.
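
Something along these lines, say (the names are purely illustrative, not a proposal for actual Phobos identifiers):

#    // locale = ISO 639 language code + optional ISO 3166 country code
#    enum Language { en, fr, de, ky /* ... */ }
#    enum Country  { none, GB, US, FR, DE, KG /* ... */ }
#
#    struct Locale
#    {
#        Language language;
#        Country  country;    // the optional second part
#    }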



>I have a quick comment about the specifics of your ideas. I think the version identifiers should have a prefix (such as "lang_"). This should make it clear to most viewers of the code what's happening. Not everyone would intuitively know that ky_KG is a language feature, but lang_ky_KG is guessable.

Strictly speaking it should be "locale_", not "lang_". ("ky" is a lang, or language; "KG" is a country. The combination "ky_KG" is a locale). Sorry for being a pedantic bugger here.

But the original question was, should there be "standard version identifiers"? Thing is - I don't see how there can be. A version identifier is just D's name for a #define, and there's nothing to stop anyone from using whatever they want as such. A note in the style guide might help, but even that won't force people to use said standard.

Just my thoughts.

Arcane Jill



July 18, 2004
What about just porting GNU gettext to phobos? This way you have a semi-standard way of localizing programs (which a lot of translators know about), and a set of pre-written tools (even nice GUI ones).

Looking at the Python implementation, it should not be difficult; it's only 493 lines (gettext.py). I'll see if I can find enough time to do it in the next few weeks.

July 19, 2004
Arcane Jill wrote:
> In article <cd9peo$oce$1@digitaldaemon.com>, J C Calvarese says...
> 
> 
>>There seems to be 2 schools of thought in the area of localization (of which language issues are a subset):
>>
>>1. Compile-time generated using version() as you described in your post.
>>
>>2. Runtime-time generated using some sort of a plugin architecture or language resource files (as Arcane Jill alludes to in her reply).
>>
>>Since D is capable enough for either method, both parties can be happy.
>>
>>Personally, I think I'd prefer to use compile-time localization, but that doesn't prevent others from designing runtime-time localization functions.
> 
> 
> Compile-time localization not something I'd thought about before, so it's been
> kind of an interesting thing to think about. The thing is, though, we already
> _have_ compile-time localization. We've always had it. As Stewart said, you can
> do:
> 
> #    version(fr)
> #    {
> #        writef("something in French");
> #    }
> #    else version(de)
> #    {
> #        writef("something in German");
> #    }
> #    else
> #    {
> #        writef("something in English");
> #    }
> 
> And we've been able to do that in C ever since we learned how to use #ifdef. So
> it requires no new language features. It's already there. But, since this
> technique has been around for so long, you'd expect it be widely used ...
> unless, that is, it turns out to be not very useful.

It's nothing revolutionary, but it's a start.

> 
> There are a number of disadvantages I can think of. For a start, your
> locale-specific code will end up distributed throughout your source code,
> instead of all in one place. This could be a nightmare if you decide to support
> a new locale. Another problem is that, if you choose the locale at compile-time,
> then the end-user (as opposed to the developer) has to have the source code, OR
> an executable which was compiled especially for their locale. And that's just
> for executables. For libraries, the situation is even worse. A
> compile-time-localized executable would have to be linked with
> compile-time-localized libraries, compiled for the same locale. It would be a
> serious headache.
> 
> Another problem is that someone might compile it for version(en_US), without
> realizing that they should have been using version(en).
> 
> And all for what? To save a small amount of run-time overhead. Well, /how much/
> run-time overhead? In most cases, run-time-localization amounts to looking
> something up in a map. Is that bad? A matter of judgement, maybe, but I'd say it
> was insignificant compared to the overhead incurred by writing that localized
> string to printf() or a file. 

It could be a lot of run-time overhead. It could be a little. But ultimately, it should be left up to the individual programmer.

> 
> Localization, to me, is the flip-side of internationalization (or i18n for lazy
> typists). The way it's traditionally done is you "internationalize" your code -
> a compile-time thing, and then "localize" it at run-time. Here's an example.

I didn't realize there was a difference between localization and internationalization. The OP was mostly concerned with "human languages", but I think that other issues such as date formats would naturally be discussed at the same time.

> Start with some normal, unlocalized code:
> 
> #    printf("hello world\n");
> 
> Now, internationalize it. It will become something like this:
> 
> #    printf(local("hello world\n"));
> 
> Not much different really. The function local() would probably be something like
> this, and would get inlined:
> 
> #    char[][char[]] localizedLookup; // global variable
> #
> #    char[] local(char[] s)
> #    {
> #        return localizedLookup[s];
> #    }
> 
> "Localizing" this program now consists only of initializing the localizedLookup
> map, which could happen in any number of ways.
> 
> My guess is that what would be most useful in terms of
> internationalization/localization would be some classes and functions to make
> stuff like the above easier, providing localized number formats and so on. Plus
> of course the D definition of a "locale" - we have to standardize that somehow.
> Me, I'd prefer enums to strings - saves all that messing about with case, for
> one thing, and string-splitting to get at the two (possibly three) parts.

Sure, Phobos should include some modules for run-time support (and maybe compile-time support, too).

> 
> 
>>I have a quick comment about the specifics of your ideas. I think the version identifiers should have a prefix (such as "lang_"). This should make it clear to most viewers of the code what's happening. Not everyone would intuitively know that ky_KG is a language feature, but lang_ky_KG is guessable.
> 
> 
> Strictly speaking it should be "locale_", not "lang_". ("ky" is a lang, or
> language; "KG" is a country. The combination "ky_KG" is a locale). Sorry for
> being a pedantic bugger here.

I don't mind nit-picking (I do a lot of it myself). I hereby retract "lang_" in favor of either "locale_" or "loc_".

> 
> But the original questions was, should there be "standard version identifiers"?
> Thing is - I don't see how there can be. A version identifier is just D's name
> for a #define, and there's nothing to stop anyone from using they want as such.
> A note in the style guide might help, but even that won't force people to use
> said standard.

Right. I was thinking "convention" when I read "standard". I don't intend to compel anyone (and as you state, they can't really be compelled), but I think if the convention makes sense, many people would use it.

> 
> Just my thoughts.
> 
> Arcane Jill
> 
> 
> 


-- 
Justin (a/k/a jcc7)
http://jcc_7.tripod.com/d/
July 19, 2004
In article <cdfgli$2vdr$1@digitaldaemon.com>, J C Calvarese says...
>I didn't realize there was a difference between localization and internationalization.

I found a good explanation about this when looking up gettext on the web. Have a look at http://www.gnu.org/software/gettext/manual/html_mono/gettext.html#SEC3.

To quote its summary of itself: "Also, very roughly said, when it comes to multi-lingual messages, internationalization is usually taken care of by programmers, and localization is usually taken care of by translators."

I consider myself a good programmer, but I'd make a lousy translator. I'd want to leave that job to someone else.

Arcane Jill


July 19, 2004
In article <cdeva5$2p2o$1@digitaldaemon.com>, Juanjo Álvarez says...
>
>What about just porting GNU gettext to phobos? This way you have a
>semi-standard way of localizing programs (which a lot of translators know
>about), and a set of pre-written tools (even nice GUI ones).

I'm not sure. I'll admit I don't know much about gettext, so perhaps you could clear a few things up for me (and others)?

What is the encoding of a .po or .mo file? Specifically, is it UTF-8? (Or CAN it be UTF-8?). Put another way - how good is its Unicode support?

I guess I have to say I'd be disappointed if we had to rely on yet another C library. Maybe I'm just completely mad, but I'd prefer a pure D solution. (That said, I have no objection to using files of the same format). gettext looks like it does some cool stuff, but it's ... well ... C. It's not OO, unless I've misunderstood. It doesn't use exceptions. Moreover, it assumes the Linux meaning of "locale", which is (again, in my opinion) not right for D.

The way I see it, D should define locales exclusively in terms of ISO language and country codes, plus variant extensions. Unicode defines locales that way, and the etc.unicode library will have no choice but to use the ISO codes. Collation and stuff like that will need to rely on data from the CLDR (Common Locale Data Repository - see http://www.unicode.org/cldr/).

I suppose my gut feeling is that internationalization /isn't that hard/, so it ought to be a relatively simple task to come up with a native-D solution. gettext seems to do string translation only (again, correct me if I'm wrong), which is only a small part of internationalization/localization.

So, I guess, on balance, I'd vote against this one, at least pending some persuasive argument. That said, I'm way too busy to volunteer for any work (plus I'm still taking a bit of time off from coding for personal reasons) - although I /do/ intend to tackle Unicode localization quite soon.

Does that help? Probably not, I guess. Ah well.

Tell you what, let's start an open discussion. (I've changed the thread title). I think we should hear lots of opinions before anyone actually DOES anything. A wrong early decision here could hamper D's potential future as /the/ language for internationalization (which I'd like it to become).

Arcane Jill



July 19, 2004
First things first, you can have all the documentation about GNU gettext here:

http://www.gnu.org/software/gettext/manual/html_chapter/gettext_toc.html

Arcane Jill wrote:

> I'm not sure. I'll admit I don't know much about gettext, so perhaps you could clear a few things up for me (and others)?

Let's try.

> What is the encoding of a .po or .mo file? Specifically, is it UTF-8? (Or CAN it be UTF-8?). Put another way - how good is its Unicode support?

The encoding of the file is declared in the header of the .po file, so it can (I think) be anything; for example:

"Project-Id-Version: animail\n"
"POT-Creation-Date: 2003-12-07 02:02+0100\n"
"PO-Revision-Date: 2004-07-08 20:19+0200\n"
"Last-Translator: XXX XXX <XXX@XXX.de>\n"
"Language-Team: Deutsch <de@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
^^^^^^^
"Content-Transfer-Encoding: 8bit\n"
^^^^^^^
"X-Generator: KBabel 1.3.1\n"

> I guess I have to say I'd be disappointed if we had to rely on yet another C library.

We don't; the Python implementation (the one of about 500 lines) doesn't use any external C lib at all; it's 100% pure Python.

> gettext looks like it does some cool stuff, but it's ... well ... C.

I repeat, it doesn't have to be C. gettext is more like a set of tools and formats than a library (although the library exists, of course, not only for C but for a lot of languages).

> It's not OO.

It can be done as OO.

> unless I've misunderstood. It doesn't use exceptions.

Our implementation could.

> Moreover, it assumes the Linux meaning of "locale", which is (again, in my
> opinion) not right for D.
> The way I see it, D should define locales exclusively in terms of ISO
> language and country codes, plus variant extensions.
> Unicode defines
> locales that way, and the etc.unicode library will have no choice but to
> use the ISO codes. Collation and stuff like that will need to rely on data
> from the CDLR (Common Locale Data Repository - see
> http://www.unicode.org/cldr/).

I don't know the answer to this one; gettext seems to use ISO 3166 country codes and ISO 639 language codes.

> I suppose my gut feeling is that internationalization /isn't that hard/, so it ought to be relatively simple a task to come up with a native-D solution.

> gettext seems to do string translation only (again, correct me if I'm wrong), which is only a small part of internationalization/localization.

That's true. It also handles plural forms, which is not so simple (http://www.gnu.org/software/gettext/manual/html_chapter/gettext_10.html#SEC150).
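
To give an idea of why plurals aren't trivial (a sketch only, not gettext's real API): the catalogue has to carry one string per plural form, and each language supplies its own rule for picking the index - English uses two forms selected by (n != 1), while some languages have three or more.

char[][][char[]] pluralForms;   // msgid -> [form 0, form 1, ...]

// ngettext-style lookup; the rule below is the English/Germanic one
// (nplurals=2; plural=(n != 1)) and would differ per language
char[] plural(char[] msgid, int n)
{
    char[][] forms = pluralForms[msgid];
    return forms[(n != 1) ? 1 : 0];
}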

Anyway, if we can all discuss the matter and come up with a better solution than gettext (which I'm sure is possible), I doubt many will be opposed.
July 19, 2004
Arcane Jill wrote:

<snip>
> There are a number of disadvantages I can think of. For a start, your
> locale-specific code will end up distributed throughout your source code,
> instead of all in one place.

Only if you choose to do it that way.  You can just as well have one version block per module, with the locale-specific data and code in it, and have the rest of the module use what's defined there.

> This could be a nightmare if you decide to support a new locale.  Another problem is that, if you choose the locale at compile-time, then the end-user (as opposed to the developer) has to have the source code, OR an executable which was compiled especially for their locale.

I believe it's quite common to offer separate downloadable versions in each language.  That way, a unilingual end-user isn't faced with the bloat of a multilingual UI or the overhead of compiling it, and you can choose whether to release the source or not.

> And that's just for executables. For libraries, the situation is even worse. A
> compile-time-localized executable would have to be linked with
> compile-time-localized libraries, compiled for the same locale. It would be a
> serious headache.

To me it would seem straightforward to build a copy of the lib, and give it an identifying name, for each language that your app supports.

<snip>
> Strictly speaking it should be "locale_", not "lang_". ("ky" is a lang, or
> language; "KG" is a country. The combination "ky_KG" is a locale). Sorry for
> being a pedantic bugger here.
<snip>

ISO language codes are language-dialect pairs.  So en-GB is British English, es-MX is Mexican Spanish.  AIUI they don't cover other aspects of locale, such as time zones, date formats and the like.  Those aspects tend to be managed by the OS - it would seem pointless to try and write apps to override this.

Stewart.

-- 
My e-mail is valid but not my primary mailbox, aside from its being the unfortunate victim of intensive mail-bombing at the moment.  Please keep replies on the 'group where everyone may benefit.