August 16, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cfog97$12n2$1@digitaldaemon.com...
>
> There have been loads and loads of discussions in recent weeks about Unicode,
> streams, and transcodings. There seems to be a general belief that "things are
> happening", but I'm not quite clear on the specifics - hence this post, which is
> basically a question.

What I am excited about is that D is becoming the premier language to do Unicode in, by a wide margin. And that's thanks to you guys!


August 16, 2004
On Sun, 15 Aug 2004 17:12:25 -0700, antiAlias wrote:

> Might I suggest something along the following lines:
> 
> int utf8ToDChar (char[] input, dchar[] output);
> int dCharToUtf8 (dchar[] input, char[] output);

That's what I was getting at... I don't know much about Unicode transcoding, but I don't see a reason for the core functionality to be any more complicated than that.

John
August 16, 2004
In article <cfou0o$1a91$1@digitaldaemon.com>, antiAlias says...

>Might I suggest something along the following lines:
>
>int utf8ToDChar (char[] input, dchar[] output);
>int dCharToUtf8 (dchar[] input, char[] output);
>where both return the number of bytes converted (or something like that).

That would be bad. I think it's possible you haven't understood the issues, so I'll try to explain in this post what some of them are, and why you would want to do certain things in certain ways.


>I think it's perhaps best to make these kind of things completely independent of any other layer, if at all possible.

I don't have any problem with that.


>These also happen to be the kind of
>functions that might be worth optimizing with a smattering of assembly ...

I disagree. Transcoding almost never happens in performance-critical code. It happens during input and output. A typical scenario is to get input from a console and then decode it, or to encode a string and then write it to a file. The CPU time utilized in the I/O will outweigh the time spent transcoding by a very large factor. Of course it still makes sense to do this efficiently, but assembler - given that it's not portable, decreases maintainability, etc. - is probably going a bit too far.

Okay, back to these function signatures:

>int utf8ToDChar (char[] input, dchar[] output);
>int dCharToUtf8 (dchar[] input, char[] output);

(1) The encoding is not necessarily known at compile time. This problem would also exist had you used classes/interfaces, of course, but at least with classes or interfaces instead of plain functions, you can rely on polymorphism and factory methods to do the dispatching, giving you a single point of decision. Functions like the above would lead to switch statements all over the place, and also to inconsistent encoding names (e.g. "ISO-8859-1" vs "iso-8859-1" vs "LATIN-1" vs "Latin1"). Only a single point of decision can enforce the IANA encoding names, case conventions, etc.

I see that in "charset.d" you made the encoding name a runtime parameter - but that too is bad, partly because you don't have a single point of decision, but partly also because you're now having to make that runtime check with /every/ fragment of text - not merely at construction time.

(2) (Trivial) You forgot "out" on the output variables. You cannot expect the
caller to be aware in advance of the resulting required buffer size.

(3) /This is most important/. In the typical scenario, the caller will be reading bytes from some source - which /could/ be a stream - and will want to get a single dchar. We're talking about a "get the next Unicode character" function, which is about as low level as it gets (in terms of functionality). But you can't build such a function out of your string routines, because you have no way of knowing in advance how many bytes will need to be consumed from the stream in order to build one character. So what do you do? Read too many and then put some back? Not all byte sources will allow you to "put back" or "unconsume" bytes.

In fact, the minimal functionality that a decoder requires, is this:

#    interface DcharGetter
#    {
#        bool hasMore();
#        dchar next();
#    }

(next() could be called get(), or read(), or whatever). The minimal
functionality upon which a decoder would rely, is this:

#    interface UbyteGetter
#    {
#        bool hasMore();
#        ubyte next();
#    }

For comparison, look at the way Walter's format() function uses an underlying put() function to write a single character. He /could/ have used strings throughout, but he recognised (correctly) that the one-byte-at-a-time approach was conceptually at a lower level. Strings can then be handled /in terms of/ those lower-level functions.

With these two interfaces, you can put together the concept of a decoder. Thus:

#    class Decoder : DcharGetter
#    {
#        this(UbyteGetter underlyingUG) { ug = underlyingUG; }
#        bool hasMore() { return ug.hasMore(); }
#        abstract dchar next();
#
#        protected UbyteGetter ug;
#    }

And a /specific/ decoder could derive from this, thus:

#    class UTF8Decoder : Decoder
#    {
#        this(UbyteGetter underlyingUG) { super(underlyingUG); }
#
#        dchar next()
#        {
#            ubyte b = ug.next();
#            if (b < 0x80) return cast(dchar) b;
#            uint sequenceLength = b < 0xE0 ? 2 : (b < 0xF0 ? 3 : 4);
#            char[4] sequence;
#            sequence[0] = b;
#            if (sequenceLength >= 2) sequence[1] = ug.next();
#            if (sequenceLength >= 3) sequence[2] = ug.next();
#            if (sequenceLength >= 4) sequence[3] = ug.next();
#            dchar[] a = std.utf.toUTF32(sequence[0..sequenceLength]);
#            return a[0];
#        }
#    }

This could be implemented more efficiently, but I wrote it that way to illustrate the point that the decoder - not the caller - is the only entity capable of knowing the length of the byte sequence corresponding to the next (dchar) character.

So, NOW, if you want to plug this into a std.Stream, you could make one of these:

#    class StdStreamUbyteGetter : UbyteGetter
#    {
#        this(Stream underlyingStream) { s = underlyingStream; }
#        bool hasMore() { return !s.eof(); }
#        ubyte next() { ubyte b; s.read(b); return b; }
#
#        private Stream s;
#    }

And then simply make the magic decoder like so:

#    Decoder d = new UTF8Decoder(new StdStreamUbyteGetter(stdin));

And similarly for mango streams, InputStreams, strings, and so on. Strings are just not sufficiently low-level. We can rely on the compiler to inline these very simple functions.
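For instance, a caller wanting every character from stdin would write no more than this (a sketch, using the names above):

#    Decoder d = new UTF8Decoder(new StdStreamUbyteGetter(stdin));
#    while (d.hasMore())
#    {
#        dchar c = d.next();
#        // ... one whole Unicode character at a time, regardless of encoding ...
#    }

The caller never has to know or care how many underlying bytes each character consumed.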

Encoding - the reverse process - would follow a similar pattern. You wouldn't
need hasMore(), but something like done() or close() might be appropriate to
indicate that you've finished.
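A minimal sketch of that direction (names hypothetical, mirroring the getters; UbytePutter would be the byte-sink counterpart of UbyteGetter):

#    interface DcharPutter
#    {
#        void put(dchar c);
#        void done();    // flush/close when finished
#    }
#
#    class UTF8Encoder : DcharPutter
#    {
#        this(UbytePutter underlyingUP) { up = underlyingUP; }
#
#        void put(dchar c)
#        {
#            dchar[1] tmp;
#            tmp[0] = c;
#            char[] s = std.utf.toUTF8(tmp);            // encode one character
#            foreach (char ch; s) up.put(cast(ubyte) ch);
#        }
#
#        void done() {}    // UTF-8 is stateless; nothing to flush
#
#        private UbytePutter up;
#    }

A stateful encoding (UTF-7, say) would use done() to emit any trailing shift sequence.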

Arcane Jill



August 16, 2004
In article <cfp7v5$1h84$1@digitaldaemon.com>, Nick says...

>Ok, here's my shot at it:

I think we should establish what we need, who needs what and why, etc., before committing any code to a public library. Although the transcoding issue is "urgent" in the sense that lots of people want it, I'd say it was more important to get it right, than to write it fast.

There's nothing wrong with your code. I just think that it addresses a different problem than the ones faced by stream developers.

Jill


August 16, 2004
That is ok. You raise some interesting points in your other post, and I might rewrite my code later based on what you said, if I have the time. My code is more a proof of concept, and the point was that encoding can be done easily through libiconv and you don't have to reinvent the wheel. The library already supports all the features you want, and rewriting my code for use with streams shouldn't be very hard.

Nick

In article <cfpvc5$2297$1@digitaldaemon.com>, Arcane Jill says...
>
>I think we should establish what we need, who needs what and why, etc., before committing any code to a public library. Although the transcoding issue is "urgent" in the sense that lots of people want it, I'd say it was more important to get it right, than to write it fast.
>
>There's nothing wrong with your code. I just think that it addresses a different problem than the ones faced by stream developers.
>
>Jill
>
>


August 16, 2004
In article <pan.2004.08.16.06.29.47.206851@teqdruid.com>, teqDruid says...
>
>On Sun, 15 Aug 2004 17:12:25 -0700, antiAlias wrote:
>
>> Might I suggest something along the following lines:
>> 
>> int utf8ToDChar (char[] input, dchar[] output);
>> int dCharToUtf8 (dchar[] input, char[] output);
>
>That's what I was getting at... I don't know much about Unicode transcoding, but I don't see a reason for the core functionality to be any more complicated than that.
>
>John

Suppose you want to decode a dchar from a stream, and then immediately read a ubyte from the same stream. The above functions won't let you do that.

To decode a dchar from a stream you must first read /some/ bytes from that stream, in order to pass those bytes to the above function. But how many? One? Two? Four? In UTF-7, some Unicode characters require no less than /eight/ bytes. (One can invent or imagine encodings that require even more). If you've read too few bytes from the stream, your conversion function will throw an exception. If you've read too many, the stream's seek position will be incorrect for the next read.

You could argue that streams themselves could be rewritten to call functions like the above internally, but now you're adding complexity to something that doesn't need it.

You said: "I don't see a reason for the core functionality to be any more complicated than that". But those functions are not "core" - they are constructable from yet lower level functionality. The lowest level of abstraction about which it makes sense to talk is "get one Unicode character from somewhere" and "write one Unicode character somewhere". The minute you start talking about /strings/ instead of merely /characters/, you've made an implementation assumption.

Anyway, it's not the function/class/interface/whatever that needs to be simple, it's the code which calls it. We make classes do complicated things so that callers don't have to.

Arcane Jill


August 16, 2004
On Mon, 16 Aug 2004 09:34:36 +0000, Arcane Jill wrote:

> In article <cfou0o$1a91$1@digitaldaemon.com>, antiAlias says...
> 
>>Might I suggest something along the following lines:
>>
>>int utf8ToDChar (char[] input, dchar[] output);
>>int dCharToUtf8 (dchar[] input, char[] output);
>>where both return the number of bytes converted (or something like that).
...
> (3) /This is most important/. In the typical scenario, the caller will be reading bytes from some source - which /could/ be a stream - and will want to get a single dchar. We're talking about a "get the next Unicode character" function, which is about as low level as it gets (in terms of functionality). But you can't build such a function out of your string routines, because you have no way of knowing in advance how many bytes will need to be consumed from the stream in order to build one character. So what do you do? Read too many and then put some back? Not all byte sources will allow you to "put back" or "unconsume" bytes.
> 
> <snip>
> Arcane Jill

Understood.  This code looks reasonably agnostic, and even simple enough
to use.  The only difference is in thinking - streams vs strings.  I might
note, however, that you use:
dchar[] toUTF32(char[] s);
Which could also be written as:
int toUTF32(char[] s, out dchar[] output);
Which looks very similar to:
int utf8ToDChar (char[] input, dchar[] output);

This is the function that I would define as implementing the "core" functionality.  You then (to quote myself) "use those to create the readers and writers."

The stream implementation is a bit more complex than I imagined, but I can chalk that up to a total lack of experience with variable-width character encodings. (And hey, I'm a first-year undergrad... what'dya expect?)

John
August 16, 2004
Confusion abounds! I follow you Jill, but please don't underestimate the usefulness of D arrays. I'll try to explain as we go along ...


"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cfpv3c$2253$1@digitaldaemon.com...
> In article <cfou0o$1a91$1@digitaldaemon.com>, antiAlias says...
>
> >Might I suggest something along the following lines:
> >
> >int utf8ToDChar (char[] input, dchar[] output);
> >int dCharToUtf8 (dchar[] input, char[] output);
> >where both return the number of bytes converted (or something like that).
>
> That would be bad. I think it's possible you haven't understood the issues, so
> I'll try to explain in this post what some of them are, and why you would want
> to do certain things in certain ways.
>
>
> >I think it's perhaps best to make these kind of things completely independent
> >of any other layer, if at all possible.
>
> I don't have any problem with that.
>
>
> >These also happen to be the kind of
> >functions that might be worth optimizing with a smattering of assembly
...
>
> I disagree. Transcoding almost never happens in performance-critical code. It
> happens during input and output. A typical scenario is to get input from a
> console and then decode it, or to encode a string and then write it to a file.
> The CPU time utilized in the I/O will outweigh the time spent transcoding by a
> very large factor. Of course it still makes sense to do this efficiently, but
> assembler - given that it's not portable, decreases maintainability, etc. - is
> probably going a bit too far.

What about HTTP servers? What about SOAP servers? Pretty much anything XML oriented has to at least think about doing this kind of thing often and efficiently. The latter still matters, and perhaps always will. Still, it was just a suggestion.

>
> Okay, back to these function signatures:
>
> >int utf8ToDChar (char[] input, dchar[] output);
> >int dCharToUtf8 (dchar[] input, char[] output);
>
> (1) The encoding is not necessarily known at compile time. This problem would
> also exist had you used classes/interfaces, of course, but at least with
> classes or interfaces instead of plain functions, you can rely on polymorphism
> and factory methods to do the dispatching, giving you a single point of
> decision. Functions like the above would lead to switch statements all over
> the place, and also to inconsistent encoding names (e.g. "ISO-8859-1" vs
> "iso-8859-1" vs "LATIN-1" vs "Latin1"). Only a single point of decision can
> enforce the IANA encoding names, case conventions, etc.

Agreed. I wouldn't presume to fashion a "complete" solution on /this/ NG <g>. Thus, encoding was deliberately omitted to clarify the means of getting data into and out of these converters. As far as encoding-names go, I would have expected such converters to be implemented as methods in a class; the constructor would be given the encoding identifier.


> I see that in "charset.d" you made the encoding name a runtime parameter - but
> that too is bad, partly because you don't have a single point of decision, but
> partly also because you're now having to make that runtime check with /every/
> fragment of text - not merely at construction time.

Not sure what you mean. I've never written anything called "charset.d" ... besides, you can safely assume that efficiency is important to me.


> (2) (Trivial) You forgot "out" on the output variables. You cannot expect the
> caller to be aware in advance of the resulting required buffer size.

Au contraire! Both input and output are /provided/ by the caller. This is why the return value specifies the number of items converted. D arrays have some wonderful properties worth taking advantage of -- the length is always provided, you can slice and dice to your heart's content, and void[] arrays can easily be mapped onto pretty much anything (including a single char or dchar instance). The caller has already said "here's a set of input data, and here's a place to put the output. Convert what you can within the constraints of input & output limits, and tell me the resultant outcome".

If (for example) there's only space in the output for one dchar, the algorithm will halt after converting just one. If there's not enough input provided to construct a dchar, the algorithm indicates nothing was converted. Of course, this points out a flaw in the original prototypes: two return values are needed instead of one (the number of items used from the input, as well as the number of items placed into the output). Alternatively, the implementing class could provide its own output buffer during initial construction.


> (3) /This is most important/. In the typical scenario, the caller will be
> reading bytes from some source - which /could/ be a stream - and will want to
> get a single dchar. We're talking about a "get the next Unicode character"
> function, which is about as low level as it gets (in terms of functionality).
> But you can't build such a function out of your string routines, because you
> have no way of knowing in advance how many bytes will need to be consumed from
> the stream in order to build one character. So what do you do? Read too many
> and then put some back? Not all byte sources will allow you to "put back" or
> "unconsume" bytes.

Wholly agreed: pushback is a big "no no". But it's not an issue when using a pair of arrays in the suggested manner.


> In fact, the minimal functionality that a decoder requires, is this:
>
> #    interface DcharGetter
> #    {
> #        bool hasMore();
> #        dchar next();
> #    }
>
> (next() could be called get(), or read(), or whatever). The minimal
> functionality upon which a decoder would rely, is this:
>
> #    interface UbyteGetter
> #    {
> #        bool hasMore();
> #        ubyte next();
> #    }
>
> For comparison, look at the way Walter's format() function uses an underlying
> put() function to write a single character. He /could/ have used strings
> throughout, but he recognised (correctly) that the one-byte-at-a-time approach
> was conceptually at a lower level. Strings can then be handled /in terms of/
> those lower-level functions.

There are several valid ways to skin that particular cat <g>

<snip>


Here's a fuller implementation of the array approach (in pseudo-code)

class Transcoder
{
      this (char[] encoding) {...}

      dchar[] toUnicode (char[] input, dchar[] output, out int consumed)
      {
          while (room_for_more_output)
                    while (enough_input_for_another_dchar)
                              do_actual_conversion_into_output_buffer;

          emit_quantity_of_input_consumed;
          return_slice_of_output_representing_converted_dchars;
      }

      char[] toUtf8 (dchar[] input, char[] output, out int consumed)
      {
          while (room_for_more_output)
                    while (enough_input_for_another_char)
                              do_actual_conversion_into_output_buffer;

          emit_quantity_of_input_consumed;
          return_slice_of_output_representing_converted_chars;
      }
}


This would be wrapped at some higher level such as within a Phobos Stream, or a Mango Reader/Writer, to handle the mapping of arrays to variables. The benefit of this approach is its throughput, and the ability for the 'controller' to direct the input and output arrays to anywhere it likes (including scalar variables), leading to further efficiencies. Functions such as these do not need to be exposed to the typical programmer. In fact, I vaguely recall Java has something along these lines that's hidden in some sun.x.x library, which the Java Streams utilize at some level.
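To make the pseudo-code concrete for the UTF-8 case, toUnicode could lean on std.utf - stride() reads a sequence length from the lead byte, decode() converts and advances the index. A rough sketch, not a tuned implementation:

      dchar[] toUnicode (char[] input, dchar[] output, out int consumed)
      {
          uint i = 0;    // index into input
          uint o = 0;    // index into output

          // stop when the output is full, or when the remaining input
          // doesn't hold one complete sequence
          while (i < input.length && o < output.length
                 && i + std.utf.stride(input, i) <= input.length)
          {
              output[o++] = std.utf.decode(input, i);  // advances i past the sequence
          }

          consumed = i;              // input items used
          return output[0 .. o];     // slice of converted dchars
      }

Any partial trailing sequence is simply left unconsumed, for the caller to prepend to the next batch - no pushback required.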

A variation on the theme might initially provide a buffer to house the conversion output instead. There's pros and cons to both approaches. In this case, you'd probably want to split the transcoding into separate encoding and decoding:

class Decoder
{
      private dchar[] unicode;

      this (char[] encoding, dchar[] output)
     {
        do_something_with_encoding;
        unicode = output;
     }

      this (char[] encoding, int outputSize)
     {
        this (encoding,  new dchar[outputSize]);
     }

      dchar[] convert (char[] input, out int consumed)
      {
          while (room_for_more_output_in_output_buffer)
                    while (enough_input_for_another_dchar)
                              do_actual_conversion_into_output_buffer;

          emit_quantity_of_input_consumed;
          return_slice_of_output_representing_converted_dchars;
      }
}

class Encoder
{
   // similar approach to Decoder
}


These are just suggestions, to take or leave at one's discretion.



August 16, 2004
> class Transcoder
> {
>       this (char[] encoding) {...}
>
>       dchar[] toUnicode (char[] input, dchar[] output, out int consumed)
>       {
>           while (room_for_more_output)
>                     while (enough_input_for_another_dchar)
>                               do_actual_conversion_into_output_buffer;
>
>           emit_quantity_of_input_consumed;
>           return_slice_of_output_representing_converted_dchars;
>       }
> }


Whoops! Those twin while loops should, of course, be a single while() with an && between the two conditions.


August 16, 2004
"Arcane Jill" <Arcane_member@pathlink.com> skrev i en meddelelse news:cfqcrh$2cs2$1@digitaldaemon.com...
> In article <pan.2004.08.16.06.29.47.206851@teqdruid.com>, teqDruid says...
> To decode a dchar from a stream you must first read /some/ bytes from that
> stream, in order to pass those bytes to the above function. But how many? One?
> Two? Four? In UTF-7, some Unicode characters require no less than /eight/
> bytes. (One can invent or imagine encodings that require even more).

Another verbose, yet useful, representation is the set of character entities used in
HTML:
http://www.w3.org/TR/REC-html40/sgml/entities.html

Regards,
Martin M. Pedersen