Transcoding - who's doing what? (page 3) - D Programming Language Discussion Forum

August 16, 2004

Re: Transcoding - who's doing what?

Posted by Sean Kelly
in reply to Arcane Jill

In article <cfog97$12n2$1@digitaldaemon.com>, Arcane Jill says...
>
>I also need a bit of educating on the future of D's streams. Are we going to get separate InputStream and OutputStream interfaces, or what?

I'd like to have them.  In fact my partial rewrite of stream.d already has this.

>Sean, is your stuff part of Phobos-to-be? Or is it external to Phobos? I don't mind either way, but if Phobos is going to go off in some completely tangential direction, I want to know that too.

Hard to answer as I don't really know what will happen with Phobos in the long term, however...

I would like to see unFormat/readf get into Phobos, though that may have to wait for TypeInfo to be working for pointers since the current calling convention is still a bit inconsistent with writef (ie. it requires a format string as the first argument, just like scanf).  I could work around this with a big if/else block to get the underlying type of pointer arguments but I'd prefer to just work off classinfo.name like Walter does for doFormat.  Perhaps I'll just drop that into a separate function and replace it later when TypeInfo gets fixed.
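For anyone reading along on a current compiler, the scanf-like shape described here (format string first, then pointers to the targets) can be sketched with std.format.formattedRead from present-day Phobos. This is not the unFormat discussed above, whose source isn't shown in this thread; the input text and the variable names are made up for illustration.

```d
import std.format : formattedRead;
import std.stdio : writeln;

void main()
{
    // Hypothetical input; formattedRead consumes from the front of the string.
    string input = "42 3.5";
    int count;
    double ratio;

    // Format string first, then pointers to the targets, as with scanf.
    auto filled = formattedRead(input, "%s %s", &count, &ratio);

    writeln(filled, " values read: ", count, " and ", ratio);
}
```

The return value is the number of arguments that were filled, so a caller can tell a partial match from a complete one.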

As for my std.stream rewrite... I like it better than what's in std.stream now but I have no idea what will sort out in the long term.  Is adopting Mango.io a better idea?  Perhaps streams should be dropped from Phobos completely?  I consider my version of stream.d to be more of a prototype than a full-featured replacement.

>So, the simple, easy peasy task of converting between Latin-1 and Unicode hasn't been done yet, basically because we haven't agreed on an architecture, and I for one am not really sure who's doing it anyway.

Well, std.doFormat/writef will take char/wchar/dchar strings and output UTF-8 or UTF-16.  My unFormat will read UTF-8 and UTF-16 following the same convention as std.doFormat and will convert everything to char/wchar/dchar strings as appropriate.  Both of these functions use the functions in std.utf for conversion.  Is this enough to start with?

>Therefore, (1), I would like to ask, is anyone /actually/ writing transcoders yet, or is it still up in the air?

I guess that depends on what still needs to be done.

>But I do think we should nail down the architecture soon, as we're getting a lot of questions and discussion on this. But one thing at a time. Someone tell me where streams are going (with regard to above questions) and then I'll have more suggestions.

Frankly, I can live without streams so long as there is *some* way to do formatted i/o that can handle Unicode.  I think doFormat/unFormat might be the answer to this, but I don't know the remaining issues well enough to say for sure.


Sean

August 17, 2004

Re: Transcoding - who's doing what?

Posted by Arcane Jill
in reply to teqDruid

In article <pan.2004.08.16.18.58.22.898270@teqdruid.com>, teqDruid says...

>Understood.  This code looks reasonably agnostic, and even simple enough to use.  The only difference is in thinking: streams vs. strings.

Yes. That's because a string can always be viewed as a stream, but a stream cannot always be viewed as a string.


>I might note, however, that you use:
>dchar[] toUTF32(char[] s);
>Which could also be written as:
>int toUTF32(char[] s, out dchar[]);

Actually I was just calling the function in std.utf. For any other encoding, I probably would have inlined the code right there, rather than written a function, but I figured, why re-invent stuff? std.utf.toUTF32() throws exceptions if the input is wrong, so it's just what you'd need in this circumstance. (The tests I made to determine the length didn't weed out illegal sequences - I was relying on std.utf to do that for me).
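As a rough illustration of leaning on std.utf rather than re-inventing it, here is a small sketch against present-day Phobos (not the 2004 library discussed in this thread). It uses toUTF32 for the conversion and std.utf.validate to show malformed input being rejected with an exception; the sample strings are invented.

```d
import std.stdio : writeln;
import std.utf : toUTF32, validate, UTFException;

void main()
{
    string text = "h\u00E9llo";    // the accented letter takes two UTF-8 code units
    dstring wide = toUTF32(text);  // one dchar per code point
    writeln(text.length, " code units, ", wide.length, " code points");

    // A lead byte with no continuation byte after it: malformed UTF-8.
    char[] broken = ['a', cast(char) 0xC3];
    try
    {
        validate(broken);
    }
    catch (UTFException e)
    {
        writeln("rejected: ", e.msg);
    }
}
```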


>Which looks very similar to:
>int utf8ToDChar (char[] input, dchar[] output);
>
>This is the function that I would define as implementing the "core" functionality.

Fair enough. Guess it just depends what you call "core". The main thing is the dispatch mechanism.



>The stream implementation is a bit more complex than I imagined, but I can chalk that up to a total lack of experience with variable-width character encodings.

There's more. Some encodings are not merely variable-width, but are also /stateful/. Consider UTF-7. A UTF-7 stream is always in one of two states: "ASCII" or "Radix 64". A '+' character in the stream changes the state to "Radix 64", and a '-' character changes the state back to "ASCII". A UTF-7 decoder needs to be aware at all times of the state of the stream. Incoming bytes are interpreted differently (as though they were two entirely different encodings) depending on the stream state. A function such as:

>int utf7ToDchar (char[] input, dchar[] output);

just wouldn't do the job, because it doesn't preserve/know the state of the stream. You'd need a class, with a member variable to contain the current state of the stream (unless you wanted to use a global variable to store the state - yuk!)

So, in general, basing your architecture on a set of functions with similar signature just wouldn't be adequate to do the job.
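To make the point concrete, here is a minimal sketch of such a class in present-day D. Utf7Decoder and its members are hypothetical names, not an existing library API. It is deliberately simplified: it emits UTF-16 code units (which is what UTF-7's Radix 64 sections carry) rather than dchars, and it does no error reporting. What it shows is that the shift state and the leftover bits have to survive from one call to the next.

```d
import std.stdio : writeln;

/// Hypothetical stateful UTF-7 decoder (simplified, no error reporting).
class Utf7Decoder
{
    private bool inBase64;    // false = "ASCII" state, true = "Radix 64" state
    private bool justShifted; // true only for the character right after '+'
    private uint bits;        // bits collected so far in the Radix 64 state
    private int bitCount;     // how many of those bits are still pending

    private static int base64Value(char c)
    {
        if (c >= 'A' && c <= 'Z') return c - 'A';
        if (c >= 'a' && c <= 'z') return c - 'a' + 26;
        if (c >= '0' && c <= '9') return c - '0' + 52;
        if (c == '+') return 62;
        if (c == '/') return 63;
        return -1; // not a Radix 64 digit
    }

    /// Decodes one chunk; state carries over to the next call.
    wchar[] decode(const(char)[] input)
    {
        wchar[] output;
        foreach (char c; input)
        {
            if (!inBase64)
            {
                if (c == '+')
                {
                    inBase64 = true;
                    justShifted = true;
                    bits = 0;
                    bitCount = 0;
                }
                else
                    output ~= cast(wchar) c;
                continue;
            }
            if (justShifted && c == '-')
            {
                // "+-" is the escape for a literal plus sign.
                output ~= cast(wchar) '+';
                inBase64 = false;
                justShifted = false;
                continue;
            }
            justShifted = false;
            immutable v = base64Value(c);
            if (v < 0)
            {
                // Any non-digit ends the Radix 64 run; '-' itself is absorbed.
                inBase64 = false;
                if (c != '-')
                    output ~= cast(wchar) c;
                continue;
            }
            bits = (bits << 6) | cast(uint) v;
            bitCount += 6;
            if (bitCount >= 16)
            {
                bitCount -= 16;
                output ~= cast(wchar) ((bits >> bitCount) & 0xFFFF);
                bits &= (1u << bitCount) - 1;
            }
        }
        return output;
    }
}

void main()
{
    auto decoder = new Utf7Decoder;
    // "+Jjo-" encodes U+263A; the chunk boundary falls inside that sequence.
    writeln(decoder.decode("Hi +Jj"));
    writeln(decoder.decode("o-!"));
}
```

A stateless function given the second chunk on its own would have no way of knowing that the leading 'o' is a Radix 64 digit rather than a plain letter.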

Arcane Jill

August 17, 2004

Re: Transcoding - who's doing what?

Posted by Arcane Jill
in reply to Sean Kelly

In article <cfrc06$31g8$1@digitaldaemon.com>, Sean Kelly says...

>>I also need a bit of educating on the future of D's streams. Are we going to get separate InputStream and OutputStream interfaces, or what?
>
>I'd like to have them.  In fact my partial rewrite of stream.d already has this.

Apparently, so does Phobos, although I didn't know that at the time I posted the question. Now isn't that cute - an interface with an undocumented interface!


>>Sean, is your stuff part of Phobos-to-be? Or is it external to Phobos? I don't mind either way, but if Phobos is going to go off in some completely tangential direction, I want to know that too.
>
>Hard to answer as I don't really know what will happen with Phobos in the long term, however...

Okay - I just wasn't sure if you were working for Walter in some capacity. Forgive the dumb question.


>As for my std.stream rewrite... I like it better than what's in std.stream now

It's hard to know what's in std.stream now without reading the source. I /really/ wish someone would document it.


>but I have no idea what will sort out in the long term.  Is adopting Mango.io a better idea?

Many people think so. Others argue that we should wait for the new-improved std.stream. But I don't know what that future is. The mango folk have said that they don't want mango.io moved into std, so it will always be an external library. That isn't a problem for applications, of course, since mango is free and open-source, but it might be considered a problem for libraries (when one third-party library depends on a different third-party library, things start to get messy).


>Perhaps streams should be dropped from Phobos completely?

Perhaps, but I find it unlikely that that will happen. Only Walter is empowered to do that.


>I consider my version of stream.d to be more of a prototype than a full-featured replacement.

Well, that's good and bad. A prototype is good - it implies that better, future versions will exist. But "not a replacement"? If it's not a replacement, are you envisaging that people will use both? Do they interact somehow?



>Well, std.doFormat/writef will take char/wchar/dchar strings and output UTF-8 or UTF-16.  My unFormat will read UTF-8 and UTF-16 following the same convention as std.doFormat and will convert everything to char/wchar/dchar strings as appropriate.  Both of these functions use the functions in std.utf for conversion.  Is this enough to start with?

It's certainly enough for now, but it's not transcoding in the more general sense. UTF8/16/32 are fundamental to D - they simply have to be there.



>Frankly, I can live without streams so long as there is *some* way to do formatted i/o that can handle Unicode.

Yes, "formatted" - that is an interesting and important one. printf()/writef() are currently not very Unicode-aware. A format string like "%5s" will output at least five /bytes/, not at least five /characters/. What is needed in this department is a printf() replacement written exclusively for dchars.


>I think doFormat/unFormat might be the answer to this, but I don't know the remaining issues well enough to say for sure.
>
>Sean

Well, thanks. I think I've got a picture now of what's going on. I'll post a summary shortly, then we can start calling for volunteers for the missing bits.

Jill

August 17, 2004

Re: Transcoding - who's doing what?

Posted by antiAlias
in reply to Arcane Jill

"Arcane Jill" <Arcane_member@pathlink.com> wrote in message news:cfscq6$s8u$1@digitaldaemon.com...
> std.stream. But I don't know what that future is. The mango folk have said that they don't want mango.io moved into std, so it will always be an external library. That isn't a problem for applications, of course, since mango is free

Please allow me to clarify: from recollection, the position has always been that both mango.io and std.stream should be excluded from Phobos. I think that was Matthew's position also. If mango.io turns out to be a better solution, then it can certainly move from its current home if that's what people want; but I'm not holding my breath waiting for a consensus on that one <g>

BTW; mango.io is completely independent from the rest of the Mango Tree (has no dependencies), so it can be easily cut away. In fact, it's almost totally independent of Phobos too ...


> and open-source, but it might be considered a problem for libraries (when one third-party library depends on a different third-party library, things start to get messy).

Right -- that thread regarding placing the .lib dependencies inside the D source-code might help out with this (to the extent that it can).

August 17, 2004

Re: Transcoding - who's doing what?

Posted by Arcane Jill
in reply to antiAlias

In article <cfr0tf$2pgm$1@digitaldaemon.com>, antiAlias says...

>> >These also happen to be the kind of functions that might be worth optimizing with a smattering of assembly
>>
>> The CPU time utilized in the I/O will outweigh the time spent transcoding by a very large factor.
>
>What about HTTP servers? What about SOAP servers? Pretty much anything XML oriented has to at least think about doing this kind of thing often and efficiently.

I would say that here the time spent getting the web page from the server to client across the internet will outweigh the time spent encoding by many orders of magnitude.

But I'm not /against/ efficiency. If people want to recode this stuff in assembler then obviously I'm not going to object.



>Not sure what you mean. I've never written anything called "charset.d" ... besides, you can safely assume that efficiency is important to me.

I think I was confusing you with Nick. My bad.


>> (2) (Trivial) you forgot "out" on the output variables. You cannot expect the caller to be aware in advance of the resulting required buffer size.
>
>Au contraire! Both input and output are /provided/ by the caller. This is why the return value specifies the number of items converted. D arrays have some wonderful properties worth taking advantage of -- the length is always provided, you can slice and dice to your heart's content, and void[] arrays can easily be mapped onto pretty much anything (including a single char or dchar instance). The caller has already said "here's a set of input data, and here's a place to put the output. Convert what you can within the constraints of input & output limits, and tell me the resultant outcome".
>
>If (for example) there's only space in the output for one dchar, the algorithm will halt after converting just one. If there's not enough input provided to construct a dchar, the algorithm indicates nothing was converted. Of course, this points out a flaw in the original prototypes: two return values are needed instead of one (the number of items used from the input, as well as the number of items placed into the output). Alternatively, the implementing class could provide its own output buffer during initial construction.

Gotcha. Sorry - I misinterpreted the intent of the function signatures.
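For the record, the caller-provides-both-arrays idea with the two counts might look something like the following in present-day D. This is only a sketch, not the snipped pseudo-code from the quoted post: the Converted struct and the body of utf8ToDchar are assumptions, built on std.utf.stride and std.utf.decode from current Phobos.

```d
import std.stdio : writeln;
import std.utf : decode, stride;

/// Hypothetical result type: how much input was used, how much output was filled.
struct Converted
{
    size_t consumed; // UTF-8 code units taken from the input
    size_t produced; // dchars written to the output
}

/// Converts as much as fits. Stops at a full output array or at an
/// incomplete trailing sequence; malformed input still throws UTFException.
Converted utf8ToDchar(const(char)[] input, dchar[] output)
{
    size_t inPos = 0;
    size_t outPos = 0;
    while (inPos < input.length && outPos < output.length)
    {
        immutable len = stride(input, inPos);
        if (inPos + len > input.length)
            break; // the rest of this character has not arrived yet
        output[outPos++] = decode(input, inPos); // decode advances inPos
    }
    return Converted(inPos, outPos);
}

void main()
{
    string text = "a\u00E9"; // 'a' followed by a two-unit character
    dchar[4] buffer;

    // Only the first two code units are available: the second character is cut short.
    auto partial = utf8ToDchar(text[0 .. 2], buffer[]);
    writeln(partial.consumed, " consumed, ", partial.produced, " produced");

    auto whole = utf8ToDchar(text, buffer[]);
    writeln(whole.consumed, " consumed, ", whole.produced, " produced");
}
```

Because the caller learns how much input was actually used, it can keep the unconsumed tail and retry once more bytes arrive, with no pushback on the source.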


>Wholly agreed: pushback is a big "no no". But it's not an issue when using a pair of arrays in the suggested manner.

Wholly agreed.


>There are several valid ways to skin that particular cat <g>
>Here's a fuller implementation of the array approach (in pseudo-code)
>
> <snip>
>
>This would be wrapped at some higher level such as within a Phobos Stream, or a Mango Reader/Writer, to handle the mapping of arrays to variables. The benefit of this approach is its throughput, and the ability for the 'controller' to direct the input and output arrays to anywhere it likes (including scalar variables), leading to further efficiencies. Functions such as these do not need to be exposed to the typical programmer. In fact, I vaguely recall Java has something along these lines that's hidden in some sun.x.x library, which the Java Streams utilize at some level.
>
>A variation on the theme might initially provide a buffer to house the conversion output instead. There are pros and cons to both approaches. In this case, you'd probably want to split the transcoding into separate encoding and decoding:
>
> <snip>
>
>These are just suggestions, to take or leave at one's discretion.

They are good suggestions. They have the benefit of efficiency without losing generality. They have the disadvantage of having a slightly confusing signature, but good documentation should solve that.

Nice one.

Arcane Jill

August 17, 2004

Re: Transcoding - who's doing what?

Posted by Ben Hinkle
in reply to Arcane Jill

> (3) /This is most important/. In the typical scenario, the caller will be reading bytes from some source - which /could/ be a stream - and will want to get a single dchar. We're talking about a "get the next Unicode character" function, which is about as low level as it gets (in terms of functionality). But you can't build such a function out of your string routines, because you have no way of knowing in advance how many bytes will need to be consumed from the stream in order to build one character. So what do you do? Read too many and then put some back? Not all byte sources will allow you to "put back" or "unconsume" bytes.

std.stream supports ungetc, which pushes a character back by maintaining an array of pushed-back characters. Right now only the text functions check this array for content, though. I think the idea was that if one is storing text and binary data mixed together, the text is stored with writeString, which puts a length byte followed by the text.

August 17, 2004

Re: Transcoding - who's doing what?

Posted by stonecobra
in reply to Arcane Jill

Arcane Jill wrote:

>>but I have no idea what will sort out in the long term.  Is adopting Mango.io a better idea?
>
> Many people think so. Others argue that we should wait for the new-improved std.stream. But I don't know what that future is. The mango folk have said that they don't want mango.io moved into std, so it will always be an external library. That isn't a problem for applications, of course, since mango is free and open-source, but it might be considered a problem for libraries (when one third-party library depends on a different third-party library, things start to get messy).

Or, since it is open source, you can just compile it in à la std.* and not have a library dependency.

Scott

August 17, 2004

Re: Transcoding - who's doing what?

Posted by Sean Kelly
in reply to Arcane Jill

In article <cfscq6$s8u$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <cfrc06$31g8$1@digitaldaemon.com>, Sean Kelly says...
>
>>I consider my version of stream.d to be more of a prototype than a full-featured replacement.
>
>Well, that's good and bad. A prototype is good - it implies that better, future versions will exist. But "not a replacement"? If it's not a replacement, are you envisaging that people will use both? Do they interact somehow?

It's only a prototype in the sense that I haven't really finished it yet.  There are some notable functions missing (like ignore), etc.  If people are interested then I'll flesh it out a bit.  I don't have a ton of free time so I figured I'd see what the response was before I worked any more on it.

>>Well, std.doFormat/writef will take char/wchar/dchar strings and output UTF-8 or UTF-16.  My unFormat will read UTF-8 and UTF-16 following the same convention as std.doFormat and will convert everything to char/wchar/dchar strings as appropriate.  Both of these functions use the functions in std.utf for conversion.  Is this enough to start with?
>
>It's certainly enough for now, but it's not transcoding in the more general sense. UTF8/16/32 are fundamental to D - they simply have to be there.
>
>>Frankly, I can live without streams so long as there is *some* way to do formatted i/o that can handle Unicode.
>
>Yes, "formatted" - that is an interesting and important one. printf()/writef() are currently not very Unicode-aware. A format string like "%5s" will output at least five /bytes/, not at least five /characters/. What is needed in this department is a printf() replacement written exclusively for dchars.

unFormat operates entirely in terms of dchars.  So the width modifiers are in terms of UTF-32 characters, etc.  But I agree.  If doFormat doesn't work this way then it probably should.  The results are unpredictable otherwise.
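To show the difference between the two ways of measuring a width, here is a small sketch in present-day D. padToWidth is a made-up helper for illustration, not part of doFormat or unFormat; it counts code points (via std.range.walkLength), not grapheme clusters or display columns.

```d
import std.array : replicate;
import std.range : walkLength;
import std.stdio : writeln;

/// Hypothetical helper: right-aligns s in a field measured in code points.
string padToWidth(string s, size_t width)
{
    immutable points = s.walkLength; // code points, not UTF-8 code units
    if (points >= width)
        return s;
    return replicate(" ", width - points) ~ s;
}

void main()
{
    string word = "n\u00E9"; // two characters, three UTF-8 code units
    writeln(word.length, " code units, ", word.walkLength, " code points");
    writeln("[", padToWidth(word, 5), "]");
}
```

A field padded by code units would come out one column short here, since the accented letter occupies two of them.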


Sean

August 17, 2004

Re: Transcoding - who's doing what?

Posted by Sean Kelly
in reply to Ben Hinkle

In article <cfsu27$122d$1@digitaldaemon.com>, Ben Hinkle says...
>
>
>> (3) /This is most important/. In the typical scenario, the caller will be reading bytes from some source - which /could/ be a stream - and will want to get a single dchar. We're talking about a "get the next Unicode character" function, which is about as low level as it gets (in terms of functionality). But you can't build such a function out of your string routines, because you have no way of knowing in advance how many bytes will need to be consumed from the stream in order to build one character. So what do you do? Read too many and then put some back? Not all byte sources will allow you to "put back" or "unconsume" bytes.

For the record, this is exactly what my mods to std.utf are for.  In fact, unFormat and my stream mods already use them.

>std.stream supports ungetc, which pushes a character back by maintaining an array of pushed-back characters. Right now only the text functions check this array for content, though. I think the idea was that if one is storing text and binary data mixed together, the text is stored with writeString, which puts a length byte followed by the text.

Most stream routines allow at least one byte to be put back.  Obviously this isn't possible in all cases, but it *is* always possible to carry an unget buffer around with the stream, as std.stream already does.  Only the formatted routines check this area for content and I consider that correct behavior, as some translation may have been done between the stream and the buffer.
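A minimal sketch of the unget-buffer idea, in present-day D. PushbackReader and its methods are hypothetical names for illustration, not the std.stream API, and an in-memory array stands in for whatever the real byte source would be.

```d
import std.stdio : writefln;

/// Hypothetical reader that carries its own pushback buffer.
struct PushbackReader
{
    private const(ubyte)[] source; // stands in for any byte source
    private ubyte[] pushed;        // pushed-back bytes, last in, first out

    /// Fetches the next byte, preferring anything that was pushed back.
    bool get(out ubyte b)
    {
        if (pushed.length)
        {
            b = pushed[$ - 1];
            pushed = pushed[0 .. $ - 1];
            return true;
        }
        if (source.length == 0)
            return false;
        b = source[0];
        source = source[1 .. $];
        return true;
    }

    /// Hands a byte back, even if the underlying source cannot rewind.
    void unget(ubyte b)
    {
        pushed ~= b;
    }
}

void main()
{
    immutable(ubyte)[] data = [0xC3, 0xA9, 0x21];
    auto reader = PushbackReader(data);

    ubyte b;
    reader.get(b);   // read one byte too many...
    reader.unget(b); // ...and hand it back

    while (reader.get(b))
        writefln("%02X", b);
}
```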


Sean
