Thread overview
Internationalization library - advice/help wanted
May 15, 2005
Uwe Salomon
Re: Internationalization library - advice/help wanted
May 15, 2005
Andrew Fedoniouk
May 15, 2005
Uwe Salomon
Re: Internationalization library - advice/help wanted
May 17, 2005
Thomas Kuehne
May 17, 2005
Uwe Salomon
May 18, 2005
Lars Ivar Igesund
May 18, 2005
Uwe Salomon
UTF conversion API draft [was: Internationalization library - advice/help wanted]
May 18, 2005
Uwe Salomon
Re: UTF conversion API draft [was: Internationalization library - advice/help wanted]
May 18, 2005
Ben Hinkle
May 18, 2005
Uwe Salomon
May 18, 2005
Uwe Salomon
May 18, 2005
Ben Hinkle
May 18, 2005
Uwe Salomon
May 18, 2005
Ben Hinkle
May 18, 2005
Uwe Salomon
Re: UTF conversion API draft 2 [was: Internationalization library - advice/help wanted]
May 21, 2005
Uwe Salomon
May 21, 2005
Uwe Salomon
May 15, 2005
While writing a string class for my Indigo library I "discovered" the need for a thorough internationalization library for D. I think a good implementation of i18n functionality would be very important for the development of applications in D, and thus for the future of D. There is the ICU port in the Mango tree, but as ICU is a C/C++ library, it is not as natural and fast as it could be. I would like to write a native D i18n library that is independent of third-party libraries.

As this is too big a project to develop by myself, and (I hope) of public interest to the D community, I would like to ask for:

- Advice: What is needed? How should it be implemented?
- Help: Who has the time and wants to help me? A total of 2 or 3 developers should be sufficient?

My idea is to write a compact core library that contains the most important features (character properties, UTF encodings, basic message translation), and then write some localization modules on top of it (number formatting, date formatting, comparing and searching). The goals should be simplicity and speed (but perhaps the community wants other things more?), avoiding complicated implementations and "template magic". And it should be well documented from the beginning, not a construction site on every corner.

But those are just some ideas that come to my mind right now. I hope everybody will state what he/she thinks the library should definitely cover, and what would be very nice to have.

Thanks & ciao
uwe
May 15, 2005
Good idea, I like it.

FYI: On Windows, MultiByteToWideChar and WideCharToMultiByte support many encodings other than those mentioned directly in MSDN. I am using this list:

// lang_t is not shown here; something like this will do:
//   typedef struct { const char *name; unsigned int codepage; } lang_t;
lang_t langs[] = {
    {"asmo-708",708},
    {"dos-720",720},
    {"iso-8859-6",28596},
    {"x-mac-arabic",10004},
    {"windows-1256",1256},
    {"ibm775",775},
    {"iso-8859-4",28594},
    {"windows-1257",1257},
    {"ibm852",852},
    {"iso-8859-2",28592},
    {"x-mac-ce",10029},
    {"windows-1250",1250},
    {"euc-cn",51936},
    {"gb2312",936},
    {"hz-gb-2312",52936},
    {"x-mac-chinesesimp",10008},
    {"big5",950},
    {"x-chinese-cns",20000},
    {"x-chinese-eten",20002},
    {"x-mac-chinesetrad",10002},
    {"cp866",866},
    {"iso-8859-5",28595},
    {"koi8-r",20866},
    {"koi8-u",21866},
    {"x-mac-cyrillic",10007},
    {"windows-1251",1251},
    {"x-europa",29001},
    {"x-ia5-german",20106},
    {"ibm737",737},
    {"iso-8859-7",28597},
    {"x-mac-greek",10006},
    {"windows-1253",1253},
    {"ibm869",869},
    {"dos-862",862},
    {"iso-8859-8-i",38598},
    {"iso-8859-8",28598},
    {"x-mac-hebrew",10005},
    {"windows-1255",1255},
    {"x-ebcdic-arabic",20420},
    {"x-ebcdic-cyrillicrussian",20880},
    {"x-ebcdic-cyrillicserbianbulgarian",21025},
    {"x-ebcdic-denmarknorway",20277},
    {"x-ebcdic-denmarknorway-euro",1142},
    {"x-ebcdic-finlandsweden",20278},
    {"x-ebcdic-finlandsweden-euro",1143},
    {"x-ebcdic-finlandsweden-euro",1143},
    {"x-ebcdic-france-euro",1147},
    {"x-ebcdic-germany",20273},
    {"x-ebcdic-germany-euro",1141},
    {"x-ebcdic-greekmodern",875},
    {"x-ebcdic-greek",20423},
    {"x-ebcdic-hebrew",20424},
    {"x-ebcdic-icelandic",20871},
    {"x-ebcdic-icelandic-euro",1149},
    {"x-ebcdic-international-euro",1148},
    {"x-ebcdic-italy",20280},
    {"x-ebcdic-italy-euro",1144},
    {"x-ebcdic-japaneseandkana",50930},
    {"x-ebcdic-japaneseandjapaneselatin",50939},
    {"x-ebcdic-japaneseanduscanada",50931},
    {"x-ebcdic-japanesekatakana",20290},
    {"x-ebcdic-koreanandkoreanextended",50933},
    {"x-ebcdic-koreanextended",20833},
    {"cp870",870},
    {"x-ebcdic-simplifiedchinese",50935},
    {"x-ebcdic-spain",20284},
    {"x-ebcdic-spain-euro",1145},
    {"x-ebcdic-thai",20838},
    {"x-ebcdic-traditionalchinese",50937},
    {"cp1026",1026},
    {"x-ebcdic-turkish",20905},
    {"x-ebcdic-uk",20285},
    {"x-ebcdic-uk-euro",1146},
    {"ebcdic-cp-us",37},
    {"x-ebcdic-cp-us-euro",1140},
    {"ibm861",861},
    {"x-mac-icelandic",10079},
    {"x-iscii-as",57006},
    {"x-iscii-be",57003},
    {"x-iscii-de",57002},
    {"x-iscii-gu",57010},
    {"x-iscii-ka",57008},
    {"x-iscii-ma",57009},
    {"x-iscii-or",57007},
    {"x-iscii-pa",57011},
    {"x-iscii-ta",57004},
    {"x-iscii-te",57005},
    {"euc-jp",51932},
    {"iso-2022-jp",50220},
    {"iso-2022-jp",50222},
    {"csiso2022jp",50221},
    {"x-mac-japanese",10001},
    {"shift_jis",932},
    {"ks_c_5601-1987",949},
    {"euc-kr",51949},
    {"iso-2022-kr",50225},
    {"johab",1361},
    {"x-mac-korean",10003},
    {"iso-8859-3",28593},
    {"iso-8859-15",28605},
    {"x-ia5-norwegian",20108},
    {"ibm437",437},
    {"x-ia5-swedish",20107},
    {"windows-874",874},
    {"ibm857",857},
    {"iso-8859-9",28599},
    {"x-mac-turkish",10081},
    {"windows-1254",1254},
    //{(const char *)L"unicode",1200},
    //{"unicodefffe",1201},
    {"utf-7",65000},
    {"utf-8",65001},
    //{"us-ascii",20127},
    {"us-ascii",1252},
    {"windows-1258",1258},
    {"ibm850",850},
    {"x-ia5",20105},
    {"iso-8859-1",1252}, //was 28591
    {"macintosh",10000},
    {"windows-1252",1252},
    {"system",CP_ACP}
  };

The second member in these structs is the codepage id, used directly as the first parameter of MultiByteToWideChar and WideCharToMultiByte.
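
For example, decoding with one of these codepage ids from D could look roughly like this (just a sketch: the declaration is written out by hand, and decodeCodepage is a made-up name, not part of any library):

extern (Windows) int MultiByteToWideChar(uint codePage, uint flags,
    char* mbStr, int mbLen, wchar* wideStr, int wideLen);

// Hypothetical helper: decode a byte string in the given codepage to UTF-16.
wchar[] decodeCodepage(char[] input, uint codepage)
{
  // First call with a null output buffer: returns the required length in wchars.
  int len = MultiByteToWideChar(codepage, 0, input.ptr, input.length, null, 0);
  wchar[] result = new wchar[len];
  MultiByteToWideChar(codepage, 0, input.ptr, input.length, result.ptr, len);
  return result;
}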

Hope this will help. At least it might help to build translation tables automatically. :)

Andrew.




"Uwe Salomon" <post@uwesalomon.de> wrote in message news:op.sqtvopik6yjbe6@sandmann.maerchenwald.net...
> While writing a string class for my Indigo library I "discovered" the need for a thorough internationalization library for D.
> [snip]


May 15, 2005
> FYI: On Windows MultiByteToWideChar and WideCharToMultiByte
> support many encodings other than those mentioned directly in MSDN.

Hmm, thanks for that. As libiconv is not standard on Windows :) this will come in handy. Is there anyone who knows about encoding/decoding (and programming specialties in general) on the Mac? Regrettably, I don't know a thing about the Mac programming environment. :(

uwe
May 17, 2005
Uwe Salomon wrote on Sun, 15 May 2005 19:47:03 +0200:
> While writing a string class for my Indigo library I "discovered" the need for a thorough internationalization library for D. I think a good implementation of i18n functionality would be very important for the development of applications in D, and thus for the future of D. There is the ICU port in the Mango tree, but as ICU is a C/C++ library, it is not as natural and fast as it could be. I would like to write a native D i18n library that is independent of third-party libraries.

[snip]

some links:
http://www.i18ngurus.com/
http://www.openi18n.org/
http://java.sun.com/j2se/corejava/intl/
http://doc.trolltech.com/3.3/i18n.html

Thomas


May 17, 2005
> some links:
[snip]

These are very good and informative, thanks a lot!

uwe
May 18, 2005
Uwe Salomon wrote:

>> some links:
> [snip]
> 
> These are very good and informative, thanks a lot!
> 
> uwe

Also, look at

http://i18n.kde.org

and

http://developer.kde.org/documentation/library/kdeqt/kde3arch/kde-i18n-howto.html/

While KDE is based on Qt, it seems like they've expanded on the functionality, especially the part that has to do with translations of messages and gui.

Lars Ivar Igesund
May 18, 2005
> While KDE is based on Qt, it seems like they've expanded on the
> functionality, especially the part that has to do with translations of
> messages and gui.

Hmm, they are using GNU gettext() instead of Qt's tr(). Perhaps it would be a good idea to go at least one of these ways instead of inventing something totally new. I like the KDE markup i18n("String to translate"). If I used that, all the existing tools (KBabel, Emacs PO mode) as well as string extractors and friends would already be available. But it would make the lib dependent on GNU gettext(), or I would have to write my own .mo reader.
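
Roughly, such a markup function boils down to something like this (just a sketch; catalogLookup() is a made-up placeholder for whatever reads the catalog, .mo or otherwise):

// Placeholder: would search the loaded catalog for the current locale.
char[] catalogLookup(char[] text) { return null; }

// Wrap user-visible strings so extraction tools can find them, and fall
// back to the original text if no translation is available.
char[] i18n(char[] text)
{
  char[] translated = catalogLookup(text);
  if (translated.length == 0)
    return text;
  return translated;
}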

gettext() is nonstandard on Windows, right? Would anybody please be so kind as to explain to me (roughly) how translation of user messages works under Windows? I remember them using resource files. Does the application load the right resource file at runtime? And how does it work on the Mac?

Thanks for the help!
uwe
May 18, 2005
This is a first implementation for conversion between UTF encodings. I used UTF-8 <=> UTF-16 as an example. In sum, this is what I thought of:


char[] toUtf8(wchar[] str, inout size_t eaten, char[] buffer);
char[] toUtf8(wchar[] str, inout size_t eaten);
char[] toUtf8(wchar[] str, char[] buffer);
char[] toUtf8(wchar[] str);

* The first function converts str into UTF-8, beginning at str[eaten], adjusting eaten up to where it converted (stopping before an incomplete sequence at the end of str), and using buffer if it is large enough, reallocating it if the space is not sufficient. It throws an exception when faced with an invalid input encoding. (See the usage sketch below the list.)

* The second function allocates a sufficient buffer itself.

* The third function converts str as a whole, asserting on an incomplete sequence at the end of str. It uses buffer if possible.

* The fourth function does like the third, and allocates the buffer itself.

* For every function there is a variant called fast_toUtf8() with the same parameters which relies on valid input, producing invalid output otherwise. It can be used if the input is guaranteed to be valid, and is much faster in that case.
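
A small usage sketch for the first form (not taken from the module, just how I imagine calling it):

wchar[] input = "Gr\u00FC\u00DFe, 100 \u20AC"w;  // some UTF-16 text ("Grüße, 100 €")
char[] buffer = new char[64];                    // reused between calls
size_t eaten = 0;

char[] utf8 = toUtf8(input, eaten, buffer);
// utf8 holds the converted text (a slice of buffer if it was big enough);
// eaten tells how many wchars were consumed -- here all of them, because
// the input ends on a complete sequence.
assert(eaten == input.length);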


For more explanations and a coding example visit:
http://www.uwesalomon.de/code/unicode/files/conversion-d.html

The source is at
http://www.uwesalomon.de/code/unicode/conversion.d


This is a draft, and I will be very happy if everyone who is interested comments on it, especially the API "design" (I know, fast_toUtf8() is a clumsy name :). And another question (I hope this is not arrogant): should these functions (or especially the simple forms, without eaten) be included in Phobos' std.utf? They are *much* faster than the current implementation. If someone said, "Nice stuff, kiddo. Debug that properly, adapt it to the std.utf module (use its exception etc.) and submit a patch. Perhaps we will look at it then.", I would surely do that. :)  But I am afraid that this kind of guerrilla action is rather unwanted, and I had better keep my mouth shut and code some useful stuff...

Thanks
uwe
May 18, 2005
"Uwe Salomon" <post@uwesalomon.de> wrote in message news:op.sqyw3zok6yjbe6@sandmann.maerchenwald.net...
> This is a first implementation for conversion between UTF encodings. I used UTF-8 <=> UTF-16 as an example. In sum, this is what I thought of:
> [snip]

Speeding up std.utf would be good - how can one argue with that? :-)
Three thoughts come to mind:
1) fast_toUtf8 should be something like toUtf8Unsafe or toUtf8Unchecked to indicate to the user that it's not just a faster version of another routine (since I'd call fast_foo over foo every time!) but one that makes significant assumptions about the input. I'm not actually sure how often it would be ok to call such a function anyway, so maybe it isn't even needed. Getting the wrong answer quickly is not a good trade-off.
2) It looks like you reallocate the output buffer inside the loop - can it be moved outside?
3) The formatting of the source code is somewhat unusual. I missed the loop at first:
 // Now do the conversion.
  if (pIn < endIn) do
  {
    // Check for enough space left in the buffer.
    if (pOut >= endOut)
    [snip 50 lines of code or so]
  }
  while (++pIn < endIn);

My eye skipped right over the "do" on that first line, and I had to backtrack once I saw the "while" down at the bottom.
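
Something along these lines (only the structure, not the real body) would make the loop boundaries harder to miss:

  // Same control flow as the "if (...) do ... while (...)" above,
  // but with the loop header visible up front.
  while (pIn < endIn)
  {
    // Check for enough space left in the buffer.
    if (pOut >= endOut)
    {
      // ... enlarge the buffer ...
    }
    // ... convert one code unit ...
    ++pIn;
  }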


May 18, 2005
> 1) fast_toUtf8 should be something like toUtf8Unsafe or toUtf8Unchecked

Yes, one of them sounds much better. I did not think long about fast_xxx()... Perhaps also toUtf8Unverified(); regrettably that is quite long.

> I'm not actually sure how often it
> would be ok to call such a function anyway so maybe it isn't even needed.
> Getting the wrong answer quickly is not a good trade-off.

You are right, that is an important point, especially for a standard library. Easy test: I converted a German email (mostly ASCII, some special characters) with 5000 characters from UTF-8 to UTF-16. I provided the buffer, because both functions allocate memory equally well.

Normal compilation:
  * safe function: 0.100 ms
  * unsafe function: 0.088 ms (12% faster)

Compilation -release -O:
  * safe function: 0.050 ms
  * unsafe function: 0.046 ms (8% faster)

I am not sure how much all this could benefit from an assembler implementation. Anyway, the speed gain is minimal (actually, I thought it would be a lot more!). Well, no need to search for a good "unsafe" name then. ;)

> 2) it looks like you reallocate the output buffer inside the loop - can it be moved to outside?

Why? To shorten the loop? I thought the buffer should only be reallocated if the conversion itself shows that it is too small. Do you want to move it before the loop (so that a reallocation *cannot* occur inside it), or just outside (with a goto SomeWhereOutsideTheLoop and, after the reallocation, a goto BackIntoTheLoop)?
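
If you mean the former: since one UTF-16 code unit never produces more than 3 UTF-8 bytes (a surrogate pair, 2 units, yields 4 bytes), something like this before the loop would guarantee it (just a sketch):

  // Worst case for UTF-16 -> UTF-8 is 3 output bytes per input code unit,
  // so after this the loop never has to check for remaining space.
  if (buffer.length < str.length * 3)
    buffer.length = str.length * 3;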

> 3) the formatting of the source code is somewhat unusual. I missed the loop at first.

Changed.

Thanks for the reply,
uwe