[Contribution] std.windows.charset
August 31, 2005
I've put together some functions for converting between Windows character sets and UTF-8.  I propose that this module be added to Phobos, since it's a significant step forward in compatibility between D and Windows.

It's basically the stuff taken from std.file with a few additions:
- ability to specify the ANSI or OEM codepage
- the corresponding fromMBSz
- throws an exception on error such as an invalid codepage
- toUTF8(wchar*), essential for converting null-terminated UTF-16 strings received from the WinAPI back to UTF-8 (though this ought to be moved to std.utf).

Once this is done, they can be removed/deprecated from std.file, and the file functions adjusted to use the ones in the new module.  And listdir can certainly shrink.
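(The fromMBSz direction can be sketched like this, in Python for illustration since the attachment isn't preserved in this archive; the name from_mbsz and the codepage handling are assumptions, not taken from the actual code.)

```python
# Rough sketch of fromMBSz: take a null-terminated byte string in a given
# Windows codepage and return its text, throwing on an invalid codepage.
def from_mbsz(buf: bytes, codepage: int = 1252) -> str:
    raw = buf.split(b"\x00", 1)[0]       # stop at the terminating NUL
    try:
        text = raw.decode("cp%d" % codepage)
    except LookupError:                  # unknown codepage number
        raise ValueError("invalid codepage: %d" % codepage)
    return text                          # .encode("utf-8") gives UTF-8 bytes

print(from_mbsz(b"caf\xe9\x00"))         # 0xE9 is é in codepage 1252
```

The toMBSz direction is just the reverse (encode instead of decode), and an OEM codepage would simply be a different codepage number.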

I'd thought about making it auto-detect MSLU, so that the same app can run without MSLU or make use of it if it's there.  But having read up a bit more, it would appear that an app has to be linked to depend on MSLU anyway, in which case this won't work.

Stewart.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/M d- s:- a->--- UB@ P+ L E@ W++@ N+++ o K- w++@ O? M V? PS- PE- Y? PGP- t- 5? X? R b DI? D G e++>++++ h-- r-- !y
------END GEEK CODE BLOCK------

My e-mail is valid but not my primary mailbox.  Please keep replies on the 'group where everyone may benefit.
August 31, 2005
On Wed, 31 Aug 2005 11:20:15 +0100, Stewart Gordon <smjg_1998@yahoo.com> wrote:
> I've got together some functions for converting between Windows character sets and UTF-8.  I propose that it be added to Phobos, since it's a significant step forward in compatibility between D and Windows.
<snip>

Good idea.  I don't know if this will help, or if you have something already, but here are the functions I used for converting code page 1252 to/from UTF-8/16/32.  I used them for logging cp1252 data to the screen, which doesn't actually display correctly anyway (unless you tell your Windows console to go into UTF-8 mode), but it did stop writef from throwing an exception.
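(The gist of such conversions, sketched in Python since the functions themselves aren't preserved here; these names are invented.  cp1252 is single-byte, so every conversion pivots through the codepoints, and bytes 0x80-0x9F map to characters outside Latin-1.)

```python
# cp1252 <-> UTF conversions all pivot through the codepoints; note that
# byte 0x80 is the euro sign U+20AC in cp1252, not U+0080 as in Latin-1.
def cp1252_to_utf8(data: bytes) -> bytes:
    return data.decode("cp1252").encode("utf-8")

def cp1252_to_utf16(data: bytes) -> bytes:
    return data.decode("cp1252").encode("utf-16-le")

def cp1252_to_utf32(data: bytes) -> bytes:
    return data.decode("cp1252").encode("utf-32-le")

def utf8_to_cp1252(data: bytes) -> bytes:
    return data.decode("utf-8").encode("cp1252")

print(cp1252_to_utf8(b"\x80"))   # the 1-byte euro becomes 3 UTF-8 bytes
```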

Regan

August 31, 2005
Regan Heath wrote:
<snip>
> Good idea. I don't know if this will help or if you have something already but here are the functions I used for converting code page 1252 to/from UTF-8/16/32.

Both my contribution and yours are written with specific aims in mind: mine to convert UTF-8 strings to/from null-terminated strings in Windows character sets, and hence to facilitate communication with the Windows API; yours to convert D arrays between 1252 and the UTFs.

Neither is designed to be a comprehensive converter between character encodings.  There's a project underway (part of Indigo) to implement such a thing, and I'm involved in it.  But what I've contributed here is designed to be something simple to do the job it's made for, as is yours.

The only things yours adds are:
- cross-platform support, though only for one codepage
- direct translation between 1252 and the other two UTFs
- treating 1252 text as a D array

none of which are needed for a std.windows module, though admittedly one or two WinAPI functions (such as TextOut) could find the (address, length) form of D strings useful.

> I used them for logging cp1252 data to the screen, which doesn't actually display correctly anyway (unless you tell your windows console to go into utf-8 mode), but it did stop writef from throwing an exception.

For that matter, which versions of Windows support UTF-8 console mode? Windows 9x doesn't, and so writef is of little use if you want to output anything but plain ASCII to the console.  And even if it did, it would still be good to be able to detect the codepage and output appropriately.  I'm working towards getting TextStream implemented in Indigo, which will facilitate this.

Stewart.

August 31, 2005
On Wed, 31 Aug 2005 15:23:53 +0100, Stewart Gordon <smjg_1998@yahoo.com> wrote:
<snip>

Ahh. No problem.

>> I used them for logging cp1252 data to the screen, which doesn't actually display correctly anyway (unless you tell your windows console to go into utf-8 mode), but it did stop writef from throwing an exception.
>
> For that matter, which versions of Windows support UTF-8 console mode? Windows 9x doesn't, and so writef is of little use if you want to output anything but plain ASCII to the console.  And even if it did, it would still be good to be able to detect the codepage and output appropriately.  I'm working towards getting TextStream implemented in Indigo, which will facilitate this.

I briefly tried to add a conversion stream to my code.  The immediate problem was that when you try to implement, say, readExact to read exactly x bytes, you might find that after conversion x bytes of cp1252 become x+y bytes of UTF (since some single cp1252 bytes convert to multi-byte sequences).

In fact I believe it can be impossible to return exactly x bytes: say you had x-1 bytes and the next character converted to a 2-byte sequence.  That applies if you're using char or wchar; it would probably work fine if you used dchar.
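(Concretely, as a Python illustration of the expansion: every cp1252 byte is one byte, but its UTF-8 form is one to three bytes, so x bytes in can be up to 3x bytes out.)

```python
# Each cp1252 byte maps to one codepoint, whose UTF-8 encoding is 1-3 bytes.
for raw, expect in [(b"A", 1),      # ASCII survives as 1 byte
                    (b"\xe9", 2),   # é -> U+00E9 -> 2 bytes
                    (b"\x80", 3)]:  # € -> U+20AC -> 3 bytes
    out = raw.decode("cp1252").encode("utf-8")
    assert len(out) == expect
```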

Regan
September 01, 2005
Regan Heath wrote:
<snip>
> I briefly tried to add a conversion stream to my code. The immediate problem was that when you try to implement say, readExact, to read exactly x bytes, you might find after conversion x bytes of cp1252 becomes x+y bytes of UTF (due to some multibyte codepoints).

What is the practical use of being able to read x UTF-8 bytes from a stream that isn't in UTF-8?

Stewart.

September 01, 2005
On Thu, 01 Sep 2005 12:09:24 +0100, Stewart Gordon <smjg_1998@yahoo.com> wrote:
> Regan Heath wrote:
> <snip>
>> I briefly tried to add a conversion stream to my code. The immediate problem was that when you try to implement say, readExact, to read exactly x bytes, you might find after conversion x bytes of cp1252 becomes x+y bytes of UTF (due to some multibyte codepoints).
>
> What is the practical use of being able to read x UTF-8 bytes from a stream that isn't in UTF-8?

Being able to read exactly x bytes wasn't the purpose, it was simply the problem I encountered writing a conversion stream. A stream that would allow:

ConversionStream s = new ConversionStream(new BufferedFile("a.txt"));
char[] line;

while(!s.eof()) {
	line = s.readLine(line);
	..etc..
}

s.close();

This would read my cp1252-encoded file as char[] lines, instead of reading it into ubyte[] or char[] (technically illegally) and then converting each line.

The problem arose in implementing the readExact method for the ConversionStream.  Firstly, it would have required a buffer to deal with the expansion of the data; secondly, as I mentioned in my last post, it is possible to fail to get exactly x bytes, ending up with x-1 or x+1 instead.

So, while I didn't want to read exactly x bytes, the Stream interface requires that it be possible, right?
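(One way the buffering could look, as a Python sketch; the class and method names here are invented, not from the code under discussion: convert eagerly into an internal buffer and serve exact-length reads from that.)

```python
import io

class ConvertingReader:
    """Serves exactly-n-byte reads of the UTF-8 form of a cp1252 source."""
    def __init__(self, source, chunk=4096):
        self.source = source
        self.chunk = chunk
        self.buf = b""                    # converted but undelivered bytes

    def read_exact(self, n):
        while len(self.buf) < n:
            raw = self.source.read(self.chunk)
            if not raw:
                raise EOFError("only %d of %d bytes" % (len(self.buf), n))
            # cp1252 is single-byte, so chunk boundaries are safe to decode
            self.buf += raw.decode("cp1252").encode("utf-8")
        out, self.buf = self.buf[:n], self.buf[n:]
        return out

r = ConvertingReader(io.BytesIO(b"caf\xe9!"))
print(r.read_exact(4))                    # b'caf\xc3': splits é mid-character
```

The exact-length contract is satisfiable this way, though the caller can still receive half of a multi-byte character.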

Regan
September 01, 2005
On Fri, 02 Sep 2005 00:11:05 +1200, Regan Heath <regan@netwin.co.nz> wrote:
<snip>
> So, while I didn't want to read exactly x bytes the Stream interface requires that it is possible, right?

I didn't pursue this far because I didn't need it, but the more I think on it...

I guess breaking the stream in the middle of a char 'character' (one which is several bytes long) isn't a problem, as the next read will grab the rest and append it to the first part, unless the user is reading a char at a time and expecting each one to be complete and/or valid (which is just wrong).

When reading lines it will keep reading till it gets a '\n' or '\r\n', so it shouldn't ever return part of a character.
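(That "next read grabs the rest" behaviour is what an incremental decoder provides; a quick Python illustration of a multi-byte character split across two reads:)

```python
import codecs

# The decoder holds the incomplete trailing bytes of one chunk and
# completes the character when the next chunk arrives.
dec = codecs.getincrementaldecoder("utf-8")()
first = dec.decode(b"caf\xc3")    # é arrives half-finished: yields "caf"
second = dec.decode(b"\xa9\n")    # the other half plus newline: yields "é\n"
print(repr(first), repr(second))
```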

So, perhaps there is no problem after all :)

Regan
September 07, 2005
Stewart Gordon wrote:
> I've got together some functions for converting between Windows character sets and UTF-8.  I propose that it be added to Phobos, since it's a significant step forward in compatibility between D and Windows.
<snip>

I could've sworn I'd attached the thing!  Why did nobody point this out?

Stewart.
