July 20, 2004 Encoding and doFormat

Is there any instance where we might want to output UTF-16 or UTF-32 encoded strings from doFormat? Alternately, is there any instance where we might want to read UTF-16 or UTF-32 encoded streams? I've been kicking around how best to handle this aspect of unFormat and was suddenly struck by the UTF-8 limitation in the current implementation of doFormat.

Sean
July 20, 2004 Re: Encoding and doFormat

Posted in reply to Sean Kelly

"Sean Kelly" <sean@f4.ca> wrote in message news:cdjpll$1pks$1@digitaldaemon.com...
> Is there any instance where we might want to output UTF-16 or UTF-32 encoded
> strings from doFormat? Alternately, is there any instance where we might want
> to read UTF-16 or UTF-32 encoded streams? I've been kicking around how best to
> handle this aspect of unFormat and was suddenly struck by the UTF-8 limitation
> in the current implementation of doFormat.

doFormat() isn't limited by UTF-8. In fact, its output is in dchars, which are UTF-32. Look at std.stdio.writef(), it writes its output based on what the stream's format is set to.
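The division of labour described above (the formatter emits whole code points, and whatever receives them decides the byte encoding) can be sketched as follows. This is a minimal illustration in Python rather than D, not the Phobos implementation: `format_to` and the sink variables are hypothetical names, and Python's `%` operator stands in for the real format-string handling.

```python
from typing import Callable, List


def format_to(putc: Callable[[str], None], template: str, *args: object) -> None:
    """Hypothetical stand-in for a doFormat-style routine.

    It never sees bytes: it hands the sink one code point at a time.
    """
    for code_point in template % args:  # iterating a Python str yields code points
        putc(code_point)


# Sink 1: in-memory formatting, no encoding involved at all.
chars: List[str] = []
format_to(chars.append, "%s = %d", "x", 42)
in_memory = "".join(chars)

# Sinks 2 and 3: the same formatter feeding byte streams; each sink picks the encoding.
utf8_out = bytearray()
format_to(lambda cp: utf8_out.extend(cp.encode("utf-8")), "%s", "na\u00efve")

utf16le_out = bytearray()
format_to(lambda cp: utf16le_out.extend(cp.encode("utf-16-le")), "%s", "na\u00efve")
```

Because the formatter only ever produces code points, supporting another output encoding means writing another sink and leaves the formatting code untouched.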
July 20, 2004 Re: Encoding and doFormat

Posted in reply to Walter

In article <cdk2sj$1tvi$1@digitaldaemon.com>, Walter says...
>
>doFormat() isn't limited by UTF-8. In fact, its output is in dchars, which
>are UTF-32. Look at std.stdio.writef(), it writes its output based on what
>the stream's format is set to.

Oops. I misread part of the documentation on std.format where it was talking about printing portions of the format string. So input and output are in dchars? That makes life quite easy. I guess unFormat is pretty much finished then.

Sean
July 20, 2004 Re: Encoding and doFormat

Posted in reply to Sean Kelly

In article <cdjpll$1pks$1@digitaldaemon.com>, Sean Kelly says...
>
>Is there any instance where we might want to output UTF-16 or UTF-32 encoded strings from doFormat? Alternately, is there any instance where we might want to read UTF-16 or UTF-32 encoded streams? I've been kicking around how best to handle this aspect of unFormat and was suddenly struck by the UTF-8 limitation in the current implementation of doFormat.
>
>Sean

Conventionally, streams are considered to be a sequence of octets (bytes). So, by that reasoning, you would never want to read or write UTF-16 or UTF-32 to/from a stream because the units of those formats are not eight bits wide.

However, the formats UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE are eight-bit-wide standards, merely consisting of UTF-16 and UTF-32 in little-endian and big-endian representation respectively. You would certainly expect to encounter these. Text files can use them, for example.

Going beyond streams, there are also "wide" streams, called Readers and Writers in Java, and maybe filters more generically. They exist for such purposes as transcoding, uppercasing, etc., but don't, in general, send their data to consoles, files, sockets, etc. (because those devices expect 8-bit wide input). Transcoders of course are capable of transforming a "wide" stream to a normal stream. A UTF-16LE encoder, for example, is utterly trivial.

I don't know what doFormat does, so I can't answer your specific question. I can say, though, that if you're processing bytes or ubytes, you don't need to bother with UTF-32 or UTF-16 (or even UTF-8). If you're processing characters, however, you may well be better off keeping everything in dchars throughout.

I don't know if that's helpful or not. What does doFormat() do? What's it for? (I should probably know that, but I'm not that on-the-case right now).

Arcane Jill
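To give a sense of how small a UTF-16LE encoder is, here is a minimal sketch in Python (not D, and not code from this thread). The function name `encode_utf16le` is hypothetical; it takes a "wide" sequence of code points and produces an ordinary byte sequence, using surrogate pairs for code points above U+FFFF.

```python
from typing import Iterable


def encode_utf16le(code_points: Iterable[int]) -> bytes:
    """Transcode code points (a 'wide' stream) to UTF-16LE bytes (a byte stream)."""
    out = bytearray()
    for cp in code_points:
        # Reject values that are not Unicode scalar values.
        if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {cp:#x}")
        if cp >= 0x10000:
            # Supplementary plane: split into a high and a low surrogate.
            v = cp - 0x10000
            units = (0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF))
        else:
            units = (cp,)
        for unit in units:
            out.append(unit & 0xFF)  # low byte first: little-endian
            out.append(unit >> 8)    # then the high byte
    return bytes(out)


# U+0041, U+20AC and U+1F600 (one, one and two 16-bit units respectively).
sample = [0x41, 0x20AC, 0x1F600]
encoded = encode_utf16le(sample)
```

Swapping the two `append` calls would give the big-endian (UTF-16BE) form.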
July 20, 2004 Re: Encoding and doFormat

Posted in reply to Arcane Jill

In article <cdk49v$1ump$1@digitaldaemon.com>, Arcane Jill says...
>
>In article <cdjpll$1pks$1@digitaldaemon.com>, Sean Kelly says...
>>
>>Is there any instance where we might want to output UTF-16 or UTF-32 encoded strings from doFormat? Alternately, is there any instance where we might want to read UTF-16 or UTF-32 encoded streams? I've been kicking around how best to handle this aspect of unFormat and was suddenly struck by the UTF-8 limitation in the current implementation of doFormat.
>>
>>Sean
>
>Conventionally, streams are considered to be a sequence of octets (bytes). So, by that reasoning, you would never want to read or write UTF-16 or UTF-32 to/from a stream because the units of those formats are not eight bits wide.
>
>However, the formats UTF-16LE, UTF-16BE, UTF-32LE and UTF-32BE are eight-bit-wide standards, merely consisting of UTF-16 and UTF-32 in little-endian and big-endian representation respectively. You would certainly expect to encounter these. Text files can use them, for example.
>
>Going beyond streams, there are also "wide" streams, called Readers and Writers in Java, and maybe filters more generically. They exist for such purposes as transcoding, uppercasing, etc., but don't, in general, send their data to consoles, files, sockets, etc. (because those devices expect 8-bit wide input). Transcoders of course are capable of transforming a "wide" stream to a normal stream. A UTF-16LE encoder, for example, is utterly trivial.
>
>I don't know what doFormat does, so I can't answer your specific question. I can say, though, that if you're processing bytes or ubytes, you don't need to bother with UTF-32 or UTF-16 (or even UTF-8). If you're processing characters, however, you may well be better off keeping everything in dchars throughout.
>
>I don't know if that's helpful or not. What does doFormat() do? What's it for?
>(I should probably know that, but I'm not that on-the-case right now).

doFormat handles all the work for writef. And while writef is really for console and file output, doFormat can do in-memory string formatting as well.

It turns out I miswrote my original question, as writef uses the wide char output routines but doFormat deals entirely in dchars. So I think it would be safe for me to have readf use wide char input routines and leave unFormat as-is. This puts aside the issue of configurable encoding, but from what you've said perhaps that isn't much of a problem.

Sean
July 20, 2004 Re: Encoding and doFormat

Posted in reply to Sean Kelly

In article <cdk74m$1vqj$1@digitaldaemon.com>, Sean Kelly says...
>
>It turns out I miswrote my original question, as writef uses the wide char output routines but doFormat deals entirely in dchars. So I think it would be safe for me to have readf use wide char input routines and leave unFormat as-is.

That's what I get for posting hastily. writef actually seems to switch between UTF-8 and (possibly) UTF-16 based on information gleaned from the file pointer. Looks like I'm going to have to play with readf a little more, though I'm not looking forward to handling unget. Stinkin multibyte encoding schemes.

Sean
July 21, 2004 Re: Encoding and doFormat

Posted in reply to Sean Kelly

"Sean Kelly" <sean@f4.ca> wrote in message news:cdk425$1uid$1@digitaldaemon.com...
> In article <cdk2sj$1tvi$1@digitaldaemon.com>, Walter says...
> >
> >doFormat() isn't limited by UTF-8. In fact, its output is in dchars, which
> >are UTF-32. Look at std.stdio.writef(), it writes its output based on what
> >the stream's format is set to.
>
> Oops. I misread part of the documentation on std.format where it was talking
> about printing portions of the format string. So input and output are in dchars?
> That makes life quite easy. I guess unFormat is pretty much finished then.

Input is chars, wchars, or dchars.
July 21, 2004 Re: Encoding and doFormat

Posted in reply to Walter

Walter wrote:
>
> Input is chars, wchars, or dchars.

Right, because all char types can be implicitly cast to dchar, correct?

Sean
July 21, 2004 Re: Encoding and doFormat

Posted in reply to Sean Kelly

"Sean Kelly" <sean@f4.ca> wrote in message news:cdkm15$25m7$1@digitaldaemon.com...
> Walter wrote:
> >
> > Input is chars, wchars, or dchars.
>
> Right, because all char types can be implicitly cast to dchar, correct?

Not exactly. doFormat() examines the type of each argument, and does any conversions as necessary. Implicit conversions are not necessary.
July 21, 2004 Re: Encoding and doFormat

Posted in reply to Sean Kelly

In article <cdkm15$25m7$1@digitaldaemon.com>, Sean Kelly says...
>
>Walter wrote:
>>
>> Input is chars, wchars, or dchars.
>
>Right, because all char types can be implicitly cast to dchar, correct?

They can be implicitly cast, but they cannot be /correctly/ cast. I have mentioned this before (and suggested that it be considered a bug) but Walter was adamant that the runtime overhead involved in checking would be undesirable.

The problem can be demonstrated by example. If you cast a char containing the UTF-8 fragment 0xC0 (which would ordinarily be the first byte of a two-byte UTF-8 sequence) into a dchar, it will be erroneously converted to U+00C0, instead of (as I would prefer) throwing a UTF conversion exception.

In general, char values >0x7F should not be cast to wchars or dchars, because these values are /not characters/.

Arcane Jill
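The difference between widening a UTF-8 code unit and decoding it can be shown with a short sketch. It is written in Python rather than D, so it only mirrors the behaviour described above and does not reproduce D's cast semantics. `widen` and `decode_checked` are hypothetical helper names; `chr` plays the role of the unchecked cast and `bytes.decode` the role of a checked conversion.

```python
def widen(code_unit: int) -> str:
    """Unchecked conversion: reinterpret a single byte value as a code point."""
    return chr(code_unit)


def decode_checked(data: bytes) -> str:
    """Checked conversion: decode as UTF-8, raising UnicodeDecodeError on bad input."""
    return data.decode("utf-8")


# Widening the lone byte 0xC0 silently yields U+00C0.
widened = widen(0xC0)

# Decoding the same lone byte is rejected instead.
try:
    decode_checked(b"\xc0")
    lone_byte_rejected = False
except UnicodeDecodeError:
    lone_byte_rejected = True

# The real UTF-8 encoding of U+00C0 is the two-byte sequence C3 80.
decoded = decode_checked(b"\xc3\x80")
```

Both paths can end at U+00C0, but only the decoding path gets there from input that actually encodes that character.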
Copyright © 1999-2021 by the D Language Foundation