June 24, 2004
Mango.io may not be ready to go into the D std package as of yet; any discussion of the sort might be considered presumptuous.  But I still think it should go somewhere where people will give it serious thought.  Of late, my general feeling is that mango.io is not getting the chance it needs to make an impression on the D community. To improve this, maybe these ideas should be considered:

1) A web page could be set up listing the various D stream packages for the D community to peruse

2) The page could have official endorsement from the www.digitalmars.com site to encourage people to sample, comment on, and peruse the competing stream packages.

3) Perhaps that web page should indicate: "D Stream Packages to be considered for official D Phobos integration"

4) Each package should be well documented and easily extensible... (oops.. that just eliminated a few, didn't it?)  ;-)

5) Each package should get a review in some journal, perhaps even in the new journal that Walter mentioned in C++.announce.

Most of these ideas depend on how serious the D community is about seeing a successful stream library appear.  If we are all that serious, then we should be pushing for better publicity for the existing packages.

I really want to see Mango.io given a fair chance.  At the very least, mango.io should get consideration for inclusion in the more publicly developed Deimos.

Later,

John
June 24, 2004
In article <cbd7ac$2284$1@digitaldaemon.com>, Sean Kelly says...
>
>Just thought of another one.  What constitutes whitespace is locale dependent as well.  So add that to the localization list.

Well, this is not true in Unicode, so - however you define that - you will have to do it without the aid of a Unicode library. Whitespace according to Unicode is defined by the property file http://www.unicode.org/Public/UNIDATA/PropList.txt. The following characters - and ONLY the following characters - are deemed to be whitespace according to Unicode:

0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>
0020          ; White_Space # Zs       SPACE
0085          ; White_Space # Cc       <control-0085>
00A0          ; White_Space # Zs       NO-BREAK SPACE
1680          ; White_Space # Zs       OGHAM SPACE MARK
180E          ; White_Space # Zs       MONGOLIAN VOWEL SEPARATOR
2000..200A    ; White_Space # Zs  [11] EN QUAD..HAIR SPACE
2028          ; White_Space # Zl       LINE SEPARATOR
2029          ; White_Space # Zp       PARAGRAPH SEPARATOR
202F          ; White_Space # Zs       NARROW NO-BREAK SPACE
205F          ; White_Space # Zs       MEDIUM MATHEMATICAL SPACE
3000          ; White_Space # Zs       IDEOGRAPHIC SPACE

My personal opinion (for what it's worth) is that this does not need to be localized. If a particular application (a D compiler, say) required that U+00A0 be defined as non-whitespace, then the application could simply work around that - I believe Java does something similar, defining a "Java whitespace" with a slightly different list which excludes U+00A0.

But as for localizing per locale, my personal opinion is that Unicode's definition is good enough. If you're going to use a different list, then you're not using Unicode! Part of the whole point of Unicode is to ease localization issues by simply eliminating them wherever possible. ONE standard. So far as I can tell, the argument that "what is whitespace is locale-dependent" is 8-bit-standard thinking, and we simply don't need that any more.
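
To make that concrete, the whole property boils down to a single table-driven test. This is just a sketch - the function name is made up, it isn't taken from any existing library:

#    // Sketch only: a whitespace test built directly from the PropList data above.
#    bool isUnicodeWhitespace(dchar c)
#    {
#        return (c >= 0x0009 && c <= 0x000D)  // <control-0009>..<control-000D>
#            || c == 0x0020                   // SPACE
#            || c == 0x0085                   // <control-0085>
#            || c == 0x00A0                   // NO-BREAK SPACE
#            || c == 0x1680                   // OGHAM SPACE MARK
#            || c == 0x180E                   // MONGOLIAN VOWEL SEPARATOR
#            || (c >= 0x2000 && c <= 0x200A)  // EN QUAD..HAIR SPACE
#            || c == 0x2028 || c == 0x2029    // LINE SEPARATOR, PARAGRAPH SEPARATOR
#            || c == 0x202F                   // NARROW NO-BREAK SPACE
#            || c == 0x205F                   // MEDIUM MATHEMATICAL SPACE
#            || c == 0x3000;                  // IDEOGRAPHIC SPACE
#    }

An application that wanted the Java-style variant would only have to drop one line from that table - which is rather my point: it's an application decision, not a locale one.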

Arcane Jill


June 24, 2004
In article <cbd6v9$21mu$1@digitaldaemon.com>, Sean Kelly says...
>
>Say I have a string of dchars and I want to write it to a text file.  Is it enough to just do an unformatted write of the string or does it have to be converted to UTF-16 or some such?  ie. is there ever any difference between internal character representation and external representation that an i/o lib may care about?

Gotcha. Okay, I do know about that - it's just that I've always called it ENcoding (and DEcoding) rather than TRANScoding - but I guess "transcoding" is just as correct, if not more so.

Yeah, well, that sounds like a simple enough thing to do. You know that the encoding at one end (inside a D program) is going to be one of the Unicode UTFs. Presumably, you have some information about the encoding at the other end (the console, a text window, a file, whatever), which is likely to be WINDOWS-1252 on Windows machines in western Europe, etc. So all you have to do is convert from one to the other.

My guess is, you'd want some sort of Transcoder class (see, I've learnt a new word), so you could do something like:

#    Transcoder t1 = new Transcoder("ISO-8859-1");
#    Transcoder t2 = new Transcoder("SHIFT-JIS");
#    Transcoder t3 = new Transcoder("MAC-ROMAN");
#    Transcoder t4 = new Transcoder("ASCII");

and so on, and then use this either as a parameter to your stream constructor, or a parameter to a formatted input/output command. A transcoder would simply have to convert from a sequence of dchars to a sequence of ubytes (and vice versa). (Note that the names of encodings are universally understood - we don't have to make them up).

Currently, of course, D assumes UTF-8 on the console, which is, in fact, usually wrong - which is why, if you printf() a non-ASCII character to an MS-DOS command window, it comes out wrong. A transcoder scheme would fix this.
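
To pin down what I mean by "convert from a sequence of dchars to a sequence of ubytes", the interface might look something like the following. This is only a sketch - every name in it is invented, and none of it exists in Phobos or Mango as they stand:

#    // Sketch only - invented names throughout.
#    interface Transcoder
#    {
#        ubyte[] encode(dchar[] text);   // internal (UTF-32) -> external bytes
#        dchar[] decode(ubyte[] data);   // external bytes -> internal (UTF-32)
#    }
#
#    // A trivial concrete transcoder: ISO-8859-1, whose byte values 0..255
#    // map directly onto U+0000..U+00FF.
#    class Latin1Transcoder : Transcoder
#    {
#        ubyte[] encode(dchar[] text)
#        {
#            ubyte[] bytes = new ubyte[text.length];
#            for (size_t i = 0; i < text.length; i++)
#                bytes[i] = (text[i] < 0x100) ? cast(ubyte) text[i] : cast(ubyte) '?';
#            return bytes;
#        }
#
#        dchar[] decode(ubyte[] data)
#        {
#            dchar[] chars = new dchar[data.length];
#            for (size_t i = 0; i < data.length; i++)
#                chars[i] = cast(dchar) data[i];
#            return chars;
#        }
#    }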


>
>>I'm not completely sure I understand the question "how would people like to see formatted IO operate?", or what it has to do with Unicode.
>
>What I meant was any internationalization.  As I mentioned, the issues I know of have to do with numeric formatting, but is there anything else?
>
>Sean
>
>


June 24, 2004
In article <cbd6v9$21mu$1@digitaldaemon.com>, Sean Kelly says...

>As I mentioned, the issues I know of
>have to do with numeric formatting,

Unfortunately, Unicode won't help you with numeric formatting. As you know, the decimal point (for example) is encoded as U+002E if you're English, but as U+002C if you're French. At one point, I did suggest to the Consortium folk that they add an UNAMBIGUOUS DECIMAL POINT character, which would render as "." in an English font or "," in a French font, but they didn't like that idea. As far as the Unicode Consortium are concerned, Unicode is a repertoire of characters, not a repertoire of semantic meanings. Therefore, when it comes to this sort of localization issue, we're on our own.
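
So the decimal separator ends up being ordinary application- or library-level data, not a property of the character set. Something like this, perhaps - a sketch only, with invented names:

#    // Sketch only - invented names, just to show where the information has to
#    // live once Unicode washes its hands of it.
#    struct NumberFormat
#    {
#        dchar decimalSeparator;    // '.' for English, ',' for French
#        dchar groupingSeparator;   // ',' for English, a no-break space for French
#    }
#
#    // The same value rendered under two such formats:
#    //     English:  1,234.5        French:  1 234,5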

In a way, Java does this quite well - at least in the sense that it is sufficiently powerful. Where it falls down is in ease of use. It is not easy to do even simple localization in Java. I'm wondering if that's something else we're stuck with - is it even possible to make it BOTH easy AND powerful?

Unfortunately, I only have questions on this one, not answers.

Arcane Jill


June 24, 2004
On Thu, 24 Jun 2004 06:57:13 +0000 (UTC), Arcane Jill <Arcane_member@pathlink.com> wrote:
> In article <cbd6v9$21mu$1@digitaldaemon.com>, Sean Kelly says...
>>
>> Say I have a string of dchars and I want to write it to a text file.  Is it
>> enough to just do an unformatted write of the string or does it have to be
>> converted to UTF-16 or some such?  ie. is there ever any difference between
>> internal character representation and external representation that an i/o lib
>> may care about?
>
> Gotcha. Okay, I do know about that - it's just I've always called it ENcoding
> (and DEcoding) rather than TRANScoding - but I guess "transcoding" is just as
> correct, if not more correct.
>
> Yeah, well that sounds like a simple enough thing to do. You know that the
> encoding at one end (inside a D program) is going to be one of the Unicode
> UTF-s. Presumably, you have some information about the encoding at the other end
> (the console, a text window, a file, whatever), and is likely to be WINDOWS-1252
> on Windows machines in western Europe, etc.. So all you have to do is convert
> from one to the other.
>
> My guess is, you'd want some sort of Transcoder class (see I've leant a new
> word), so you could do something like:
>
> #    Transcoder t1 = new Transcoder("ISO-8859-1");
> #    Transcoder t2 = new Transcoder("SHIFT-JIS");
> #    Transcoder t3 = new Transcoder("MAC-ROMAN");
> #    Transcoder t4 = new Transcoder("ASCII");
>
> and so on, and then use this either as a parameter to your stream constructor,
> or a parameter to a formatted input/output command. A transcoder would simply
> have to convert from a sequence of dchars to a sequence of ubytes (and vice
> versa). (Note that the names of encodings are universally understood - we don't
> have to make them up).
>
> Currently, of course, D assumes UTF-8 on the console, which is, in fact, usually
> wrong, which is why if you printf() a non-ASCII character to an MS-DOS command
> window, they come out wrong. A transcoder scheme would fix this.

I always thought the best way to handle this sort of thing would be to have a pluggable, modular system: you'd have your input or 'source', your output or 'sink', and you could plug any number of transcoders in between them.

So, for example, your source might be a file and your sink a socket; you plug in the transcoders for SSL, UTF-8 to Unicode, or anything else you want, and they all modify the data going through them, so that the data you get from the source arrives at the sink transcoded in those ways.

I always wanted to try and write a stream library based on that idea...
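
In rough code - just a sketch of the shape of it, with every name invented - it might come out as:

#    // Sketch only - invented names throughout.
#    interface Source { ubyte[] read(); }             // produces raw data
#    interface Sink   { void write(ubyte[] data); }   // consumes raw data
#
#    // A transcoder/filter is itself a Source: it wraps an upstream Source and
#    // transforms whatever passes through it.
#    class Filter : Source
#    {
#        private Source upstream;
#        this(Source upstream) { this.upstream = upstream; }
#        ubyte[] read() { return transform(upstream.read()); }
#        ubyte[] transform(ubyte[] data) { return data; }  // identity; subclasses override
#    }
#
#    // A file-to-socket chain with SSL and a UTF-8 transcoder plugged in would
#    // then be built something like:
#    //     Source chain = new SslFilter(new Utf8Filter(new FileSource("data.txt")));
#    //     sink.write(chain.read());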

Regan.

-- 
Using M2, Opera's revolutionary e-mail client: http://www.opera.com/m2/
June 24, 2004
Works for me.  I'll admit that most of what I know of localization has come from C++, which hasn't quite embraced Unicode yet :)

Sean


June 24, 2004
John Reimer wrote:
> (...)
> 5) Each package should get a review in some journal, perhaps even in the new journal that Walter mentioned in C++.announce.
> (...)

6) Included in the compiler distribution.

-- 
Julio César Carrascal Urquijo
June 24, 2004
In article <cbe04d$6ri$1@digitaldaemon.com>, Arcane Jill says...
>
>In a way, Java does this quite well - at least in the sense that it is sufficiently powerful. Where it falls down is in ease of use. It is not easy to do even simple localization in Java. I'm wondering if that's something else we're stuck with - is it even possible to make it BOTH easy AND powerful?

Good question.  C++ has a global locale setting and then IIRC you can associate a stream instance with a different locale.  Formatting information is stored in a class or set of classes that defines the various separator characters and such.  This is quite easy to use but fairly complicated to extend.  But the basic idea does seem to work pretty well.  How does Java work?
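
Translated into D-ish terms, the C++ arrangement is roughly this - a sketch only, with invented names:

#    // Sketch only - invented names, mimicking the C++ "global locale plus
#    // per-stream override" arrangement.
#    class Locale
#    {
#        dchar decimalSeparator;
#        this(dchar sep) { decimalSeparator = sep; }
#    }
#
#    class FormattedStream
#    {
#        private Locale loc;
#        this(Locale defaultLocale) { loc = defaultLocale; }  // start from the global default
#        void imbue(Locale l) { loc = l; }                    // per-stream override, as in C++
#    }
#
#    // Usage:  stream.imbue(new Locale(','));  // this stream now formats the French way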

Sean


June 24, 2004
In article <opr93b3wzg5a2sq9@digitalmars.com>, Regan Heath says...
>
>I always thought the best way to handle this sort of thing would be to have a pluggable modular system such that you'd have your input or 'source', your output or 'sink', and you could plug any number of transcoders in between them.
>
>So for example your source might be a file, and your sink a socket, you plug the transcoders for SSL, UTF-8 to Unicode, or anything else you want in, they all modify the data going thru them such that the data you get
> from the source arrives at the sink transcoded in those ways.
>
>I always wanted to try and write a stream library based on that idea...

I've written stream adaptors for this sort of thing.  So I might chain together a GZip stream, a Base64 stream, and a socket stream, and each would encode the data and push it on to the next layer.  But for things like SSL this can get pretty complicated, as you may have to deal with password prompting and other configuration for that specific level.  This is one area where aspect-oriented programming has a lot of promise.
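
For the simple cases, each adaptor just wraps the next layer down, transforms the data, and forwards it. A sketch, with every name invented:

#    // Sketch only - invented names throughout.
#    interface OutputStream
#    {
#        void write(ubyte[] data);
#    }
#
#    class Base64Stream : OutputStream
#    {
#        private OutputStream next;
#        this(OutputStream next) { this.next = next; }
#        void write(ubyte[] data)
#        {
#            // (real base64 encoding omitted - this only shows where it would go)
#            next.write(data);
#        }
#    }
#
#    // With GZipStream and SocketStream built the same way, the chain becomes:
#    //     OutputStream s = new GZipStream(new Base64Stream(new SocketStream()));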

Sean


June 24, 2004
Julio César Carrascal Urquijo wrote:

> John Reimer wrote:
>> (...)
>> 5) Each package should get a review in some journal, perhaps even in the
>> new journal that Walter mentioned in C++.announce.
>> (...)
> 
> 6) Included in the compiler distribution.
> 

Good point!