Transcoding - who's doing what?
August 15, 2004
There have been loads and loads of discussions in recent weeks about Unicode, streams, and transcodings. There seems to be a general belief that "things are happening", but I'm not quite clear on the specifics - hence this post, which is basically a question.

To clarify my own plans on the Unicode front, the purpose of the etc.unicode library is to implement all of the algorithms defined by the Unicode standard on the Unicode website. ("All" is quite ambitious, actually, and it will take a long time to achieve that, but obviously the core ones will come first, and most of the property-getting functions are already there). But I'm /not/ planning on writing any transcoding functions, simply because they're not part of the Unicode standard. Transcoding, in fact, is all about converting /to/  Unicode from something else (and vice versa).

Transcoding functions are easy to write - for most encodings a simple 256-entry lookup table will suffice, at least in one direction. But transcoding whole strings in memory is not necessarily the best architecture; it would probably be better to do it at a lower level, using streams (aka filters/readers/writers) - basically just classes which implement a read() function and/or a write() function.
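
To illustrate the lookup-table approach, here's a minimal sketch (the table itself is hypothetical - it would be filled from the encoding's published mapping):

#    // table-driven decode for a single-byte encoding:
#    // one Unicode codepoint per possible byte value
#    dchar[256] table;    // filled from the encoding's mapping table
#
#    dchar decode(ubyte b)
#    {
#        return table[b];
#    }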

I don't know who, if anyone, is currently working on this. In post http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/5925, Hauke said: "I'm currently working on ... a string interface that abstracts from the specific encoding + a bunch of implementations for the most common ones (UTF-8, 16, 32, system codepage, etc...).", but it's possible I may have read too much into that.

I also know that Sean is doing some stream stuff, and that in post http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/8236, he said 'Rather than having the encoding handled directly by the Stream layer perhaps it should be dropped into another class. I can't imagine coding a base lib to support "Joe's custom encoding scheme." For the moment though, I think I'll leave stream.d as-is. This seems like a design issue that will take a bit of talk to get right.' and 'I'll probably have the first cut of stream.d done in a few more days and after that we can talk about what's wrong with it, etc.'

I also need a bit of educating on the future of D's streams. Are we going to get separate InputStream and OutputStream interfaces, or what?

Sean, is your stuff part of Phobos-to-be? Or is it external to Phobos? I don't mind either way, but if Phobos is going to go off in some completely tangential direction, I want to know that too.

So, the simple, easy peasy task of converting between Latin-1 and Unicode hasn't been done yet, basically because we haven't agreed on an architecture, and I for one am not really sure who's doing it anyway.

Therefore, (1), I would like to ask, is anyone /actually/ writing transcoders yet, or is it still up in the air?

And, (2), if the answer to (1) is no, I'd like to suggest that a couple of simple classes be written which, I believe, will slot nicely into whatever architecture we eventually come up with. This is what I suspect will do the job. Two classes:

#    class ISO_8859_1Reader
#    {
#        this(Stream underlyingStream) // an input stream
#        {
#            s = underlyingStream;
#        }
#
#        void read(out dchar c)
#        {
#            ubyte b;
#            s.read(b);
#            c = cast(dchar) b;
#        }
#
#        private Stream s;
#    }
#
#    class ISO_8859_1Writer
#    {
#        this(Stream underlyingStream) // an output stream
#        {
#            s = underlyingStream;
#        }
#
#        void write(dchar c)
#        {
#            if (c > 0xFF) s.write(replacementChar);
#            else s.write(cast(ubyte) c);
#        }
#
#        public ubyte replacementChar = '?';
#        private Stream s;
#    }

Now these will probably need some adapting to fit into our final architecture. (Should they derive from Stream? Or from some yet-to-be-defined transcoding Reader/Writer base classes? Should they implement some interface? Should they be merged into a single class? etc.) BUT - they won't need /much/ adaptation, and once we've got Latin-1 working, we'll have an example on which to model all the others. So feel free to take the above code and adapt it as necessary.

I do think we should nail down the architecture soon, though, as we're getting a lot of questions and discussion on this. But one thing at a time. Someone tell me where streams are going (with regard to the above questions) and then I'll have more suggestions.

Arcane Jill


August 15, 2004
> I also need a bit of educating on the future of D's streams. Are we going to get separate InputStream and OutputStream interfaces, or what?

std.stream.InputStream and OutputStream interfaces already exist (since 0.89). All the "new" stuff in std.stream isn't in the phobos.html doc. Are you thinking of a different InputStream and OutputStream?

August 15, 2004
I'm not doing anything specific for transcoding (yet), Jill; but I will as soon as the appropriate knowledge is made available in the shape of some low-level libraries. If etc.unicode already has those, well, I'll get on the job pronto.

As for architecture, this is how mango.io approaches it:

One might consider mango.io to have three separate, but related and bindable, entities: Conduit, Buffer, and Reader/Writer. Conduits represent things like files, sockets, and other 'physical', block-oriented devices. You can talk to a Conduit directly (via read/write methods) with an instance of a Buffer. The next stage up in the pecking order is the Buffer, which acts as a bi-directional queue for Conduit data (or can be used independently, like OutBuffer, for that matter). You can read and write a buffer using void[], or map it directly to a local array if desired.

Buffers are intended as an abstraction over the more physical Conduit. You can use a common Buffer for both read and write purposes, or you can have a separate instance for both read and write purposes. On top of the Buffer, one can map either a set of Tokenizers (for scanf like processing), or a set of Readers/Writers. The latter convert between representations: usually programmer-idioms to Conduit-idioms and back again. For example, a Reader might convert Buffer content into ints, longs, char[] arrays and so on. Writer does the opposite.

You can make a Reader/Writer pair do whatever you wish in terms of conversion: a classic example is endian conversion, but others might include various transcoding tasks, including Unicode. In addition, you can map multiple Readers/Writers onto a common Buffer, and they will all behave sequentially as one might imagine. The latter is handy for when you need to see what the content is before reading it in some other manner (think HTTP headers, followed by content that's been zip-compressed). You might think of the Reader/Writer layer as "piecemeal" IO: they usually work with small amounts of data at a time.

Finally, the Conduit actually has an optional filter "intercept" layer: you can build a filter to modify either the input or output in void[] style. That is, an output filter is given a void[], and does whatever it wants with it (usually calls the next filter in the chain, which will ultimately cause the modulated content to be written somewhere).
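
Purely as an illustration - these names are hypothetical, not mango's actual signatures - an output filter in that style might look like:

class PassThroughFilter
{
    // hypothetical sketch, not mango.io's real API
    private void delegate (void[] data) next;

    this (void delegate (void[] data) next)
    {
        this.next = next;
    }

    // inspect or modify the outgoing content, then pass it on
    void write (void[] data)
    {
        next (data);
    }
}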

This sounds somewhat complex, but the APIs make it really easy (certainly as easy as phobos.io) to get things hooked up. For example, when reading a file you typically do the following:

FileConduit fc = new FileConduit ("file.name");
Reader r = new Reader (fc);

r.get(x).get(y).get(z);

(or r >> x >> y >> z;)

etc.

So, whenever the appropriate unicode converters are available, I (or someone else) can hook them up either at the Buffer layer, or at the Conduit-filter layer. If you'd be interested in doing that, I'd be very, very, grateful!

August 15, 2004
On Sun, 15 Aug 2004 20:15:35 +0000, Arcane Jill wrote:
>
> <snip ISO_8859_1Reader / ISO_8859_1Writer code>
>

I, for one, would prefer that the core functionality NOT be phobos-streams specific. I.e., make a set of functions to do the transcoding, then use those to create the readers and writers. This way, it'll be easier to put the transcoding stuff into mango, which I prefer over std.stream.
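
For instance (just a sketch - the function name is invented for illustration), the stream-agnostic core for Latin-1 could be as little as:

int latin1ToDChar (ubyte[] input, dchar[] output)
{
    // decode as many characters as both arrays allow;
    // Latin-1 bytes map one-to-one onto Unicode codepoints
    int n = input.length < output.length ? cast(int) input.length
                                         : cast(int) output.length;
    for (int i = 0; i < n; i++)
        output[i] = cast(dchar) input[i];
    return n;
}

Readers and writers for phobos streams or mango buffers would then be thin wrappers over calls like this.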

John

August 15, 2004
In article <cfogis$12on$1@digitaldaemon.com>, Ben Hinkle says...
>
>
>> I also need a bit of educating on the future of D's streams. Are we going to get separate InputStream and OutputStream interfaces, or what?
>
>std.stream.InputStream and OutputStream interfaces already exist (since
>0.89).

I didn't know that. Thanks.

>All the "new" stuff in std.stream isn't in the phobos.html doc.

Ah. That would be why I didn't know it. I've only read the HTML, not the D source. I know a lot of folk have suggested that I should read the source, but I guess it's an ideological thing - using the specifics of the source smacks of relying on undocumented features to me, something not guaranteed to work in future incarnations. How hard would it be to update the documentation?


>Are you thinking of a different InputStream and OutputStream?

I wasn't thinking of anything. I just didn't know there was such a beast. Thanks for educating me.

Jill


August 15, 2004
In article <cfoiln$13re$1@digitaldaemon.com>, antiAlias says...
>
>I'm not doing anything specific for transcoding (yet) Jill; but will as soon as the appropriate knowledge is made available in the shape of some low-level libraries. If etc.unicode already has those, well, I'll get on the job pronto.

I'm afraid it doesn't have anything relevant to encoding or decoding, sorry - just character properties, like isWhitespace(dchar) and so on. Transcoding is a different issue: basically just a mapping between a sequence of bytes and Unicode characters, and the actual mapping will be different for each encoding. Latin-1 is easy, because its codepoints are identical to the first 256 of Unicode.


>As for architecture, this is how mango.io approaches it:
<snip>

Cool.

>So, whenever the appropriate unicode converters are available, I (or someone else) can hook them up either at the Buffer layer, or at the Conduit-filter layer. If you'd be interested in doing that, I'd be very, very, grateful!

I think I follow that. But presumably, if people don't want it to be std-specific, then it shouldn't be mango-specific either.

I can write a converter for Latin-1, once we're all happy with the architecture. (Actually, I think any of us could.) But I certainly wouldn't be able to do (for example) SHIFT-JIS. I imagine once we have the architecture nailed down, lots of transcoder classes will get written (one for each encoding).

Jill



August 15, 2004
In article <pan.2004.08.15.21.19.34.123236@teqdruid.com>, teqDruid says...

>I, for one, would prefer that the core functionality NOT be phobos-streams specific.

Fair enough.

>IE, make a set of functions to do the transcoding, then use
>those to create the readers and writers.  This way, it'll be easier to put
>the transcoding stuff into mango, which I prefer over std.streams.

Right, but this "set of functions" (or classes, which I'd prefer) would still have to have a common format, or you wouldn't be able to call them polymorphically at runtime.
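
Something like this, perhaps (just a sketch of the shape I mean - the names are invented):

#    interface CharReader
#    {
#        void read(out dchar c);    // decode the next character
#    }
#
#    interface CharWriter
#    {
#        void write(dchar c);       // encode and emit one character
#    }

Each encoding's transcoder would implement one or both, so callers could hold a CharReader without caring which encoding sits behind it.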

Would you have a problem if they just implemented (or relied upon) the InputStream and OutputStream interfaces which I only just learned about a few posts ago?

Jill


August 16, 2004
Might I suggest something along the following lines:

int utf8ToDChar (char[] input, dchar[] output);
int dCharToUtf8 (dchar[] input, char[] output);

where both return the number of bytes converted (or something like that). I think it's perhaps best to make these kinds of things completely independent of any other layer, if at all possible. These also happen to be the kind of functions that might be worth optimizing with a smattering of assembly ...
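
For the decode direction, something along these lines, perhaps (a rough sketch - no validation of continuation bytes, and it returns the number of input bytes consumed):

int utf8ToDChar (char[] input, dchar[] output)
{
    int i = 0;    // input bytes consumed
    int o = 0;    // output characters produced
    while (i < input.length && o < output.length)
    {
        ubyte b = input[i];
        if (b < 0x80)                                         // 1 byte
        {
            output[o++] = b;
            i += 1;
        }
        else if ((b & 0xE0) == 0xC0 && i + 1 < input.length)  // 2 bytes
        {
            output[o++] = cast(dchar) (((b & 0x1F) << 6)
                                      | (input[i+1] & 0x3F));
            i += 2;
        }
        else if ((b & 0xF0) == 0xE0 && i + 2 < input.length)  // 3 bytes
        {
            output[o++] = cast(dchar) (((b & 0x0F) << 12)
                                      | ((input[i+1] & 0x3F) << 6)
                                      |  (input[i+2] & 0x3F));
            i += 3;
        }
        else if ((b & 0xF8) == 0xF0 && i + 3 < input.length)  // 4 bytes
        {
            output[o++] = cast(dchar) (((b & 0x07) << 18)
                                      | ((input[i+1] & 0x3F) << 12)
                                      | ((input[i+2] & 0x3F) << 6)
                                      |  (input[i+3] & 0x3F));
            i += 4;
        }
        else
            break;    // truncated or invalid sequence: stop here
    }
    return i;
}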


"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cforuu$191p$1@digitaldaemon.com...
> In article <pan.2004.08.15.21.19.34.123236@teqdruid.com>, teqDruid says...
>
> >I, for one, would prefer that the core functionality NOT be
phobos-streams
> >specific.
>
> Fair enough.
>
> >IE, make a set of functions to do the transcoding, then use
> >those to create the readers and writers.  This way, it'll be easier to
put
> >the transcoding stuff into mango, which I prefer over std.streams.
>
> Right, but this "set of functions" (or classes, which I'd prefer) would
still
> have to have a common format, or you wouldn't be able to call them polymorphically at runtime.
>
> Would you have a problem if they just implemented (or relied upon) the InputStream and OutputStream interfaces which I only just learned about a
few
> posts ago?
>
> Jill
>
>


August 16, 2004
"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cforp2$18vc$1@digitaldaemon.com...
> >So, whenever the appropriate unicode converters are available, I (or someone else) can hook them up either at the Buffer layer, or at the Conduit-filter layer. If you'd be interested in doing that, I'd be very, very, grateful!

Oops. Should have written "either at the Reader/Writer layer, or at the Conduit-filter layer" instead.

> I think I follow that. But presumably, if people don't want it to be std-specific, then it shouldn't be mango-specific either.

Yep; I think it's feasible to avoid all dependencies by limiting the API to arrays.


August 16, 2004
In article <cfou0o$1a91$1@digitaldaemon.com>, antiAlias says...
>
>Might I suggest something along the following lines:
>
>int utf8ToDChar (char[] input, dchar[] output);
>int dCharToUtf8 (dchar[] input, char[] output);
>
>where both return the number of bytes converted (or something like that). I think it's perhaps best to make these kind of things completely independent of any other layer, if at all possible. These also happen to be the kind of functions that might be worth optimizing with a smattering of assembly ...

Ok, here's my shot at it:
http://folk.uio.no/mortennk/encoding/ (released under LGPL)

I'm not a professional programmer, so please excuse bad programming style, naming conventions or other crimes against humanity.

As mentioned earlier, I use iconv() from libiconv, which can convert between a large set of encodings with little hassle. It's only been tested on Linux - I'll leave the Windows porting/testing to someone else. A Win32 port of libiconv can be found here:

http://gnuwin32.sourceforge.net/packages/libiconv.htm
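
For reference, the part of the iconv() C API my code calls, declared from D (a sketch - exact types can vary slightly between platforms):

extern (C)
{
    alias void* iconv_t;    // opaque conversion descriptor

    iconv_t iconv_open (char* tocode, char* fromcode);
    size_t  iconv (iconv_t cd,
                   char** inbuf,  size_t* inbytesleft,
                   char** outbuf, size_t* outbytesleft);
    int     iconv_close (iconv_t cd);
}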

Nick

