March 02, 2009
Mon, 02 Mar 2009 07:02:10 -0800, Andrei Alexandrescu wrote:

> Consider some code in phobos that must throw an exception:
> 
> throw Exception("File `%s' not found, system error is %s.",
>      filename, errnomsg);
> 
> The localized version will look like this:
> 
> auto format = "File `%s' not found, system error is %s.";
> auto localFormat = currentLocale ? currentLocale.peek(format) : null;
> if (!localFormat) localFormat = format;
> throw Exception(localFormat, filename, errnomsg);

This example does not address the encoding problem.  Currently, errnomsg is in Russian, UTF-8 encoded.  So I get "system error is <garbage>" on the console.  If you adopt locales I'll get garbage not only for the system error but for the rest of the exception message as well.

To actually solve this problem the default exception handler must be fixed to convert any UTF-8 into the current OEM code page before printing.  It would also help if default stdin and stdout performed such a conversion.

> What happens is that the default format string _is_ the key for looking up the localized strings.

Nice.  This means that error messages become a part of API and are subject to backward and forward compatibility issues.  Isn't it too much?
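
A minimal sketch of that coupling, assuming a hypothetical `catalog` table and `localize` helper (neither is Phobos API): once the English format string is the lookup key, any rewording of the default message makes the lookup miss and fall back to English without any diagnostic.

```d
import std.format : format;
import std.stdio : writeln;

// Hypothetical translation table: the English format string is the key.
string[string] catalog;

// Returns the localized format for `key`, or `key` itself as the fallback.
string localize(string key)
{
    if (auto p = key in catalog)
        return *p;
    return key;
}

void main()
{
    catalog["File `%s' not found, system error is %s."] =
        "Файл `%s' не найден, системная ошибка: %s.";

    // Exact key: the translation is found.
    writeln(format(localize("File `%s' not found, system error is %s."),
                   "a.txt", "ENOENT"));

    // The library rewords its message slightly: the lookup misses
    // and the untranslated English text is used instead.
    writeln(format(localize("File '%s' not found; system error is %s."),
                   "a.txt", "ENOENT"));
}
```

In this scheme every existing translation file depends on the exact wording of the default message, which is the compatibility burden in question.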
March 02, 2009
Andrei Alexandrescu wrote:
> Georg Wrede wrote:
> You see, we're not communicating. I sent this link:
> 
> http://www.unicode.org/cldr/
> 
> Did you look at it? It is essentially a database of locale information in a highly structured format. All I want is to define a structure expressive enough to gobble the part of that database that is of interest. The Phobos documentation will say, we just adopt their schema. If users don't want to load any, then fine - everything is just like today.

I read the page. It says "This data is used by a wide spectrum of companies for their software internationalization and localization".

The first link in the text part is to the CLDR Overview ppt. I read it. On page 5 it says:

"Companies / Organizations
Adobe, Apple (Mac OS X), abas Software, Ascential Software, Avaya, BEA, BluePhoenix Solutions, BMC Software (Remedy), Business Objects, caris, CERN, ClearCommerce, Cognos, Debian Linux, D programming language, Gentoo Linux, GNU Classpath, HP, Hyperion, IBM, Inktomi, Innodata Isogen, Isogon, Informatica, Intel, Interlogics, IONA, IXOS, Macromedia, Mathworks, OpenOffice, Language Analysis Systems, Lawson Software, Leica Geosystems GIS & Mapping LLC, Mandrake Linux, Novell (SuSE), Optio Software, PayPal, Progress Software, Python, QNX, Quark, Rogue Wave, SAP, Siebel, SIL, SPSS, Software AG, Sun Microsystems (Solaris, Java), Sybase, Teradata (NCR), Trados, Trend Micro, Virage, webMethods, WMS Gaming, Xerox, Yahoo!, and many more…"

One sees here major companies, operating systems, and three languages: D, Python and Java. The page is from 2005.

So D "has had this since at least 2005". What can I say? I guess we have to implement it then...

>> What I'm saying is, it's debatable whether this stuff belongs to "the programming language itself" at all. Rather, it should be an external library, provided by someone else than us. It belongs to SourceForge or Dsource, not here.
> 
> http://www.unicode.org/cldr/
> 
> We just need to load it if there is such a need.

In another post you sounded as if there is a connection between this stuff and printing arrays. I'm not sure I see the connection.

> Let me try again: I don't want to define locale support. I want to provide the basics for people to roll it out themselves.

I downloaded the files in http://unicode.org/Public/cldr/1.6.1/ which were core.zip, posix.zip, tests.zip and tools.zip. They unzipped to 140MB, containing some 200 java files and some 800 xml files, among others.

The readme.txt in tools.zip says:

"The code is very preliminary, so don't expect stability from the APIs (or documentation!), since we still have to work out how we want to do the architecture."

The main web page says "CLDR 1.7 Tentative Schedule: 2008-09", but it still isn't on the download page. The last version is 2008-07-23 Version1.6.1.

==============

My take:

 * This is still a moving target
 * Using this is a major hassle for the programmer
 * With D2 itself a moving target, nobody is going to invest enough time in this to actually use it for something worthwhile in the next 6 to 12 months anyway
 * This is more application level stuff than language level stuff
 * Doing this now will steal time from you, Walter, and many of us, both directly, and indirectly by leeching bandwidth in the newsgroup -- time that should be spent on more urgent or more important things, or even documentation
 * If it's so easy to do, then why not do it a week before the release of final D2

I really can't help it, but this is how I see it.
March 02, 2009
Georg Wrede wrote:
> Andrei Alexandrescu wrote:
>> Georg Wrede wrote:
>> You see, we're not communicating. I sent this link:
>>
>> http://www.unicode.org/cldr/
>>
>> Did you look at it? It is essentially a database of locale information in a highly structured format. All I want is to define a structure expressive enough to gobble the part of that database that is of interest. The Phobos documentation will say, we just adopt their schema. If users don't want to load any, then fine - everything is just like today.
> 
> I read the page. It says "This data is used by a wide spectrum of companies for their software internationalization and localization".
> 
> The first link in the text part is to the CLDR Overview ppt. I read it. On page 5 it says:
> 
> "Companies / Organizations
> Adobe, Apple (Mac OS X), abas Software, Ascential Software, Avaya, BEA, BluePhoenix Solutions, BMC Software (Remedy), Business Objects, caris, CERN, ClearCommerce, Cognos, Debian Linux, D programming language, Gentoo Linux, GNU Classpath, HP, Hyperion, IBM, Inktomi, Innodata Isogen, Isogon, Informatica, Intel, Interlogics, IONA, IXOS, Macromedia, Mathworks, OpenOffice, Language Analysis Systems, Lawson Software, Leica Geosystems GIS & Mapping LLC, Mandrake Linux, Novell (SuSE), Optio Software, PayPal, Progress Software, Python, QNX, Quark, Rogue Wave, SAP, Siebel, SIL, SPSS, Software AG, Sun Microsystems (Solaris, Java), Sybase, Teradata (NCR), Trados, Trend Micro, Virage, webMethods, WMS Gaming, Xerox, Yahoo!, and many more…"
> 
> One sees here major companies, operating systems, and three languages: D, Python and Java. The page is from 2005.
> 
> So D "has had this since at least 2005". What can I say? I guess we have to implement it then...

Hehe, didn't see that.

>>> What I'm saying is, it's debatable whether this stuff belongs to "the programming language itself" at all. Rather, it should be an external library, provided by someone else than us. It belongs to SourceForge or Dsource, not here.
>>
>> http://www.unicode.org/cldr/
>>
>> We just need to load it if there is such a need.
> 
> In another post you sounded as if there is a connection between this stuff and printing arrays. I'm not sure I see the connection.

Very simple. If we have a locale table, I am thinking of dedicating a branch "std" in it to stuff that's in std. For example, I can use currentLocale.get("std", "array-separator") or something.
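
A rough sketch of what such a lookup could look like. Everything here is an assumption made for illustration: the `LocaleTable` type, the three-argument `get` with an explicit fallback, and the `formatArray` helper are hypothetical, not an existing or proposed Phobos interface.

```d
import std.algorithm : map;
import std.array : join;
import std.conv : to;
import std.stdio : writeln;

// Hypothetical two-level locale table: branch -> key -> value.
struct LocaleTable
{
    string[string][string] table;

    // Returns the stored value, or `fallback` if the branch or key is absent.
    string get(string branch, string key, string fallback)
    {
        if (auto b = branch in table)
            if (auto v = key in *b)
                return *v;
        return fallback;
    }
}

// Formats an int array using the separator registered under the "std" branch.
string formatArray(ref LocaleTable loc, int[] values)
{
    auto sep = loc.get("std", "array-separator", ", ");
    return "[" ~ values.map!(v => v.to!string).join(sep) ~ "]";
}

void main()
{
    LocaleTable loc;
    writeln(formatArray(loc, [1, 2, 3]));   // nothing loaded: default separator

    loc.table["std"] = ["array-separator": "; "];
    writeln(formatArray(loc, [1, 2, 3]));   // separator now comes from the table
}
```

With no table loaded the behavior is unchanged, which matches the "everything is just like today" case.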

>> Let me try again: I don't want to define locale support. I want to provide the basics for people to roll it out themselves.
> 
> I downloaded the files in http://unicode.org/Public/cldr/1.6.1/ which were core.zip, posix.zip, tests.zip and tools.zip. They unzipped to 140MB, containing some 200 java files and some 800 xml files, among others.
> 
> The readme.txt in tools.zip says:
> 
> "The code is very preliminary, so don't expect stability from the APIs (or documentation!), since we still have to work out how we want to do the architecture."
> 
> The main web page says "CLDR 1.7 Tentative Schedule: 2008-09", but it still isn't on the download page. The last version is 2008-07-23 Version1.6.1.
> 
> ==============
> 
> My take:
> 
>  * This is still a moving target
>  * Using this is a major hassle for the programmer
>  * With D2 itself a moving target, nobody is going to invest enough time in this to actually use it for something worthwhile in the next 6 to 12 months anyway
>  * This is more application level stuff than language level stuff
>  * Doing this now will steal time from you, Walter, and many of us, both directly, and indirectly by leeching bandwidth in the newsgroup -- time that should be spent on more urgent or more important things, or even documentation
>  * If it's so easy to do, then why not do it a week before the release of final D2
> 
> I really can't help it, but this is how I see it.

I understand.


Andrei
March 02, 2009
Sergey Gromov wrote:
> Mon, 02 Mar 2009 07:02:10 -0800, Andrei Alexandrescu wrote:
> 
>> Consider some code in phobos that must throw an exception:
>>
>> throw Exception("File `%s' not found, system error is %s.",
>>      filename, errnomsg);
>>
>> The localized version will look like this:
>>
>> auto format = "File `%s' not found, system error is %s.";
>> auto localFormat = currentLocale ? currentLocale.peek(format) : null;
>> if (!localFormat) localFormat = format;
>> throw Exception(localFormat, filename, errnomsg);
> 
> This example does not address the encoding problem.  Currently, errnomsg
> is in Russian, UTF-8 encoded.  So I get "system error is <garbage>" on
> the console.  If you adopt locales I'll get garbage not only for the
> system error but for the rest of the exception message as well.
> 
> To actually solve this problem the default exception handler must be
> fixed to convert any UTF-8 into the current OEM code page before
> printing.  It would also help if default stdin and stdout performed such
> a conversion.

I see.

>> What happens is that the default format string _is_ the key for looking up the localized strings.
> 
> Nice.  This means that error messages become a part of API and are
> subject to backward and forward compatibility issues.  Isn't it too
> much?

I think it isn't too much, considering the sorry state of affairs of today's exceptions. You can't even answer the question: "Given this FileException object, what file name was concerned?" And each module defines its own exception class that is equally useless. It's ridiculous. 95% of them must be removed. And we must have systematic formatting of all strings initiated by Phobos.
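
As an illustration only, an exception that can answer the "what file name was concerned?" question might look like the sketch below. The class name `FileNotFoundException` and its fields are hypothetical; this is not the `FileException` that Phobos ships.

```d
import std.format : format;
import std.stdio : writeln;

// Hypothetical exception type that keeps its data in fields
// instead of only baking it into the message text.
class FileNotFoundException : Exception
{
    string fileName;     // which file the failed operation concerned
    string systemError;  // the OS error text, kept separately

    this(string fileName, string systemError,
         string file = __FILE__, size_t line = __LINE__)
    {
        super(format("File `%s' not found, system error is %s.",
                     fileName, systemError), file, line);
        this.fileName = fileName;
        this.systemError = systemError;
    }
}

void main()
{
    try
    {
        throw new FileNotFoundException("config.ini",
                                        "No such file or directory");
    }
    catch (FileNotFoundException e)
    {
        // The handler reads the field; it does not have to parse e.msg.
        writeln("missing: ", e.fileName);
    }
}
```

Because the data lives in fields, the message text can be reformatted or translated without breaking handlers that need the file name.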


Andrei
March 02, 2009
Christopher Wright wrote:
>> -- All very nice, but no cigar. That's about as smart as letting people define *unlimited* length variable names!)
> 
> I recently dealt with a programming language that specified a limit of 63 characters for identifier names. This wouldn't have been a significant problem, except that I was generating code automatically, and some of my identifiers were over 90 characters. Identifier length limits are evil, unless they're ridiculously large (C#, I think, limits identifiers to 4096 characters).

As soon as you put in a limit on identifier name length, sooner or later you'll get a bug report on it.

For example, C++ can be compiled to C code. C++ templates encode their entire state into the template instance identifier, and these can easily reach 10,000 characters or more. So if your C compiler has a length limit on identifiers, then C++ templates become severely limited.

Another thing to consider is it's actually *more* work to put a limit on, where you have to document it, explain it, detect it, diagnose it, recover from it, than if you just make it unlimited.

There are really only 3 numbers in computer programming: 0, 1, and unlimited. I always chuckle when I see an ad for like, an editor, that says "up to 5 files open at once!".
March 02, 2009
Georg Wrede wrote:
> So D "has had this since at least 2005". What can I say? I guess we have to implement it then...

Wow, D usually gets slammed for not having a feature that even a cursory glance at the documentation shows it has. This is the first vaporware feature!
March 02, 2009
Michel Fortin wrote:
> Translating strings is a little harder because 1) strings are application-defined, 2) strings are often not available in the user's preferred language, adding the need for a fallback mechanism, and 3) different applications will want to store those strings in different ways. Perhaps we could define a base class for getting translated strings, then allow the program to use whatever subclass it wants.

It's a silly thing, but I love the little google widget you can add to a web page to automatically translate the pages. All the D site pages have it in the left column.
March 02, 2009
Sergey Gromov wrote:
> Mon, 02 Mar 2009 09:34:32 +0200, Georg Wrede wrote:
>
>>>> Of course, eventually we will want to "do something" about this. But
>>>> that should be left to the day when real issues are all sorted out in
>>>> D. This is a non-urgent, low-priority thing.
>> Had there been any need for locales, believe me, the "foreigners" in
>> this NG would have asked for it.
>
> I'm Russian.  For me, encoding problems are a PITA of such epic
> proportions that little format inconsistencies simply fade away.  Yes
> it's sometimes hard to decipher what 02/03/08 means since our custom is
> to put day first and separate with dots.  But compare this to Adobe Flex
> SDK which prints half compiler error messages in Russian (thank you
> Adobe!) using system default code page, 1251, while default /console/
> code page is actually so-called IBM 866.  Whenever I use MXML compiler
> from console I get rubbish for error messages.  And there is no way to
> disable translation--I've found none.  Phobos is no better.  Any
> exception resulting from an invalid OS call dumps UTF-8 garbage instead
> of an error message.  std.file.read("non-existent") for instance.
>
> I think games are not an issue.  I've worked for a company producing
> cell phone games for a long time.  I've localized my game for Chinese
> market, too.  The thing is, game interfaces are always custom, always
> ad-hoc.  They *never* work in untested locales.  Well, with some
> experience you can make them work most of the time in languages you are
> familiar with, from localization perspective.  Anyway, all you need to
> know is an ID of a supported locale so that you can replace text and
> locale-specific images accordingly.  Then you have correctors and native
> testing to make sure the localization works.

Encoding isn't that hard compared to other issues.
For instance, have you ever tried to make a website go both ways?
March 02, 2009
Sergey Gromov wrote:
> To actually solve this problem the default exception handler must be fixed to convert any UTF-8 into the current OEM code page before printing.  It would also help if default stdin and stdout performed such a conversion.

No, stdin/stdout *must* perform this conversion.  It is a serious bug if they don't.

The conversion cannot be performed at any other level.  D uses unicode internally.  The console uses a specific encoding.  Therefore all data passing between D and the console must be encoded/decoded.
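
A small sketch of that boundary, using `transcode` from `std.encoding`. Two assumptions: Latin-1 stands in for the console code page (the thread's CP 866 case would need a matching encoding type), and `toConsoleBytes` is a hypothetical helper, not something the standard streams currently do.

```d
import std.encoding : Latin1String, transcode;

// Encodes internal UTF-8 text into the bytes a Latin-1 console expects.
immutable(ubyte)[] toConsoleBytes(string utf8)
{
    Latin1String encoded;
    transcode(utf8, encoded);
    return cast(immutable(ubyte)[]) encoded;
}

void main()
{
    string text = "caf\u00E9";        // "café", held as UTF-8 inside the program
    assert(text.length == 5);         // the last character takes two code units

    auto bytes = toConsoleBytes(text);
    ubyte[] expected = [0x63, 0x61, 0x66, 0xE9];
    assert(bytes == expected);        // one byte per character on the wire
}
```

A stream wrapper would apply the same step on every write and the inverse on every read, so the rest of the program never sees the console encoding.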


-- 
Rainer Deyke - rainerd@eldwood.com
March 02, 2009
On Mon, Mar 2, 2009 at 1:52 PM, Georg Wrede <georg.wrede@iki.fi> wrote:
>
> My take:
>
>  * This is still a moving target
>  * Using this is a major hassle for the programmer
>  * With D2 itself a moving target, nobody is going to invest enough time in
> this to actually use it for something worthwhile in the next 6 to 12 months
> anyway
>  * This is more application level stuff than language level stuff
>  * Doing this now will steal time from you, Walter, and many of us, both
> directly, and indirectly by leeching bandwidth in the newsgroup -- time that
> should be spent on more urgent or more important things, or even
> documentation
>  * If it's so easy to do, then why not do it a week before the release of
> final D2

I agree entirely.  Localization and internationalization seem like things that should be at a much higher level than a standard library. Everyone's going to want to do it differently.  Providing a thin, cross-platform wrapper over what the OS exposes is fine, but creating a proper i18n/l10n framework is a huge project in and of itself (I think the 140MB Java package makes that abundantly clear).

I'd much rather see a rewritten std.stream and proper Unicode support in std.string (support for types other than string, functions for indexing and slicing on character boundaries) before this.
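
To illustrate the kind of function meant here, the sketch below slices a string by code point position instead of by byte offset. It relies on `std.utf.toUTFindex`; the `sliceChars` helper is hypothetical. Note that it works on code points, which is still not the same as user-perceived characters when combining marks are involved.

```d
import std.stdio : writeln;
import std.utf : toUTFindex;

// Slices `s` by code point positions [from, to) rather than by code units.
string sliceChars(string s, size_t from, size_t to)
{
    return s[toUTFindex(s, from) .. toUTFindex(s, to)];
}

void main()
{
    string s = "Привет, D";
    assert(s.length == 15);                    // 6 * 2 bytes + 3 ASCII bytes
    assert(sliceChars(s, 0, 6) == "Привет");   // byte range 0 .. 12
    assert(sliceChars(s, 8, 9) == "D");
    writeln(sliceChars(s, 0, 6));
}
```

A plain `s[0 .. 6]` would instead cut after the third Cyrillic letter, since each of those letters occupies two UTF-8 code units.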