August 31, 2005
[Contribution] std.windows.charset
I've got together some functions for converting between Windows 
character sets and UTF-8.  I propose that it be added to Phobos, since 
it's a significant step forward in compatibility between D and Windows.

It's basically the stuff taken from std.file with a few additions:
- ability to specify the ANSI or OEM codepage
- the corresponding fromMBSz
- throws an exception on error such as an invalid codepage
- toUTF8(wchar*), essential for converting back null-terminated UTF-16 
strings received from WinAPI (though this ought to be moved to std.utf).
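
For illustration, the last item above could be sketched in present-day D
roughly as follows. The name fromUTF16z is hypothetical (it is not the name
used in the proposed module), and std.utf.toUTF8 here is the current Phobos
function, which may differ from the 2005 library:

```d
import std.utf : toUTF8;

/// Hypothetical helper: convert a null-terminated UTF-16 string, such as
/// one returned by a wide-character WinAPI call, to a UTF-8 D string.
string fromUTF16z(const(wchar)* s)
{
    if (s is null)
        return null;

    // Count UTF-16 code units up to (not including) the terminator.
    size_t len = 0;
    while (s[len] != 0)
        ++len;

    // Slice the buffer as a D array and let Phobos do the transcoding.
    return toUTF8(s[0 .. len]);
}
```

D string literals are zero-terminated, so fromUTF16z("text"w.ptr) is enough
to try this out without calling into Windows.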

Once this is done, they can be removed/deprecated from std.file, and the 
file functions adjusted to use the ones in the new module.  And listdir 
can certainly shrink.

I'd thought about making it auto-detect MSLU, so that the same app can 
run without MSLU or make use of it if it's there.  But having read up a 
bit more, it would appear that an app has to be linked to depend on MSLU 
anyway, in which case this won't work.

Stewart.

-- 
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCS/M d- s:- a->--- UB@ P+ L E@ W++@ N+++ o K- w++@ O? M V? PS- PE- Y? 
PGP- t- 5? X? R b DI? D G e++>++++ h-- r-- !y
------END GEEK CODE BLOCK------

My e-mail is valid but not my primary mailbox.  Please keep replies on 
the 'group where everyone may benefit.
August 31, 2005
Re: [Contribution] std.windows.charset
On Wed, 31 Aug 2005 11:20:15 +0100, Stewart Gordon <smjg_1998@yahoo.com>  
wrote:
> I've got together some functions for converting between Windows  
> character sets and UTF-8.  I propose that it be added to Phobos, since  
> it's a significant step forward in compatibility between D and Windows.
>
> It's basically the stuff taken from std.file with a few additions:
> - ability to specify the ANSI or OEM codepage
> - the corresponding fromMBSz
> - throws an exception on error such as an invalid codepage
> - toUTF8(wchar*), essential for converting back null-terminated UTF-16  
> strings received from WinAPI (though this ought to be moved to std.utf).
>
> Once this is done, they can be removed/deprecated from std.file, and the  
> file functions adjusted to use the ones in the new module.  And listdir  
> can certainly shrink.
>
> I'd thought about making it auto-detect MSLU, so that the same app can  
> run without MSLU or make use of it if it's there.  But having read up a  
> bit more, it would appear that an app has to be linked to depend on MSLU  
> anyway, in which case this won't work.

Good idea. I don't know if this will help or if you have something already  
but here are the functions I used for converting code page 1252 to/from  
UTF-8/16/32. I used them for logging cp1252 data to the screen, which  
doesn't actually display correctly anyway (unless you tell your windows  
console to go into utf-8 mode), but it did stop writef from throwing an  
exception.

Regan
August 31, 2005
Re: [Contribution] std.windows.charset
Regan Heath wrote:
<snip>
> Good idea. I don't know if this will help or if you have something 
> already but here are the functions I used for converting code page 1252 
> to/from UTF-8/16/32.

Both my contribution and yours are written with specific aims in mind. 
Mine is meant to convert UTF-8 strings to/from null-terminated strings in 
Windows character sets and hence facilitate communication with the Windows 
API.  Yours is made to convert D arrays between 1252 and the UTFs.

Neither is designed to be a comprehensive converter between character 
encodings.  There's a project underway (part of Indigo) to implement 
such a thing, and I'm involved in it.  But what I've contributed here is 
designed to be something simple to do the job it's made for, as is yours.

The only things yours adds are:
- cross-platform support, though only for one codepage
- direct translation between 1252 and the other two UTFs
- treating 1252 text as a D array

none of which are needed for a std.windows module, though admittedly one 
or two WinAPI functions (such as TextOut) could find the (address, 
length) form of D strings useful.

> I used them for logging cp1252 data to the screen, 
> which doesn't actually display correctly anyway (unless you tell your 
> windows console to go into utf-8 mode), but it did stop writef from 
> throwing an exception.

For that matter, which versions of Windows support UTF-8 console mode? 
Windows 9x doesn't, and so writef is of little use if you want to output 
anything but plain ASCII to the console.  And even if it did, it would 
still be good to be able to detect the codepage and output 
appropriately.  I'm working towards getting TextStream implemented in 
Indigo, which will facilitate this.

Stewart.

August 31, 2005
Re: [Contribution] std.windows.charset
On Wed, 31 Aug 2005 15:23:53 +0100, Stewart Gordon <smjg_1998@yahoo.com>  
wrote:
<snip>

Ahh. No problem.

>> I used them for logging cp1252 data to the screen, which doesn't  
>> actually display correctly anyway (unless you tell your windows console  
>> to go into utf-8 mode), but it did stop writef from throwing an  
>> exception.
>
> For that matter, which versions of Windows support UTF-8 console mode?  
> Windows 9x doesn't, and so writef is of little use if you want to output  
> anything but plain ASCII to the console.  And even if it did, it would  
> still be good to be able to detect the codepage and output  
> appropriately.  I'm working towards getting TextStream implemented in  
> Indigo, which will facilitate this.

I briefly tried to add a conversion stream to my code. The immediate  
problem was that when you try to implement say, readExact, to read exactly  
x bytes, you might find after conversion x bytes of cp1252 becomes x+y  
bytes of UTF (due to some multibyte codepoints).

In fact, I believe it can be impossible to return exactly x bytes: if, say,  
you already have x-1 bytes and the next character converts to a 2-byte  
sequence. That applies if you're using char or wchar; it would probably  
work fine if you used dchar.
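
A small sketch of that expansion in present-day D, assuming std.encoding's
Windows1252String and transcode (the helper name cp1252ToUTF8 is made up for
this example and is not from the code discussed in this thread):

```d
import std.encoding : transcode, Windows1252String;

/// Hypothetical helper: convert raw cp1252 bytes to a UTF-8 string.
string cp1252ToUTF8(immutable(ubyte)[] raw)
{
    string utf8;
    // Reinterpret the bytes as cp1252 text and transcode to UTF-8.
    transcode(cast(Windows1252String) raw, utf8);
    return utf8;
}
```

Given the six cp1252 bytes 63 61 66 E9 20 80 ("café €"), the result is nine
bytes long: é (E9) becomes two UTF-8 bytes and € (80) becomes three, so a
request for "exactly x bytes" no longer lines up with x input bytes.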

Regan
September 01, 2005
Re: [Contribution] std.windows.charset
Regan Heath wrote:
<snip>
> I briefly tried to add a conversion stream to my code. The immediate 
> problem was that when you try to implement say, readExact, to read 
> exactly x bytes, you might find after conversion x bytes of cp1252 
> becomes x+y bytes of UTF (due to some multibyte codepoints).

What is the practical use of being able to read x UTF-8 bytes from a 
stream that isn't in UTF-8?

Stewart.

September 01, 2005
Re: [Contribution] std.windows.charset
On Thu, 01 Sep 2005 12:09:24 +0100, Stewart Gordon <smjg_1998@yahoo.com>  
wrote:
> Regan Heath wrote:
> <snip>
>> I briefly tried to add a conversion stream to my code. The immediate  
>> problem was that when you try to implement say, readExact, to read  
>> exactly x bytes, you might find after conversion x bytes of cp1252  
>> becomes x+y bytes of UTF (due to some multibyte codepoints).
>
> What is the practical use of being able to read x UTF-8 bytes from a  
> stream that isn't in UTF-8?

Being able to read exactly x bytes wasn't the purpose, it was simply the  
problem I encountered writing a conversion stream. A stream that would  
allow:

ConversionStream s = new ConversionStream(new BufferedFile("a.txt"));
char[] line;

while(!s.eof()) {
	line = s.readLine(line);
	..etc..
}

s.close();

This would read my cp1252 encoded file as char[] lines, instead of reading  
them into ubyte[] or char[] (technically illegally) and then converting  
each line.

The problem arose in implementing the readExact method for the  
ConversionStream. Firstly it would have required a buffer to deal with the  
expansion of the data, secondly as I mentioned last post it is possible to  
fail to get exactly x bytes, but have x-1 or x+1 instead.
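
To make the buffering concrete, here is a rough sketch in present-day D. It
is not the ConversionStream described above: the type Cp1252Reader, its
fields, and its readExact signature are all invented for illustration, and
it assumes std.encoding's transcode and Windows1252String:

```d
import std.encoding : transcode, Windows1252String;

/// Hypothetical sketch: hands out UTF-8 bytes converted from cp1252 input.
struct Cp1252Reader
{
    immutable(ubyte)[] source; // cp1252 bytes not yet converted
    string pending;            // converted UTF-8 not yet handed out

    /// Return exactly n UTF-8 bytes, or fewer once the input runs out.
    /// The result may end in the middle of a multi-byte character.
    string readExact(size_t n)
    {
        // Convert one source byte at a time until enough output is buffered.
        while (pending.length < n && source.length > 0)
        {
            string piece;
            transcode(cast(Windows1252String) source[0 .. 1], piece);
            pending ~= piece;
            source = source[1 .. $];
        }

        // Hand out n bytes and keep any surplus for the next call.
        immutable take = n < pending.length ? n : pending.length;
        auto result = pending[0 .. take];
        pending = pending[take .. $];
        return result;
    }
}
```

With the input bytes 61 E9 62 ("aéb"), readExact(2) returns 'a' plus the
first byte of é, and the second byte of é stays in the pending buffer for
the next call. Exactly x bytes can be returned, but only by allowing a
character to be split across reads.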

So, while I didn't want to read exactly x bytes the Stream interface  
requires that it is possible, right?

Regan
September 01, 2005
Re: [Contribution] std.windows.charset
On Fri, 02 Sep 2005 00:11:05 +1200, Regan Heath <regan@netwin.co.nz> wrote:
> On Thu, 01 Sep 2005 12:09:24 +0100, Stewart Gordon <smjg_1998@yahoo.com>  
> wrote:
>> Regan Heath wrote:
>> <snip>
>>> I briefly tried to add a conversion stream to my code. The immediate  
>>> problem was that when you try to implement say, readExact, to read  
>>> exactly x bytes, you might find after conversion x bytes of cp1252  
>>> becomes x+y bytes of UTF (due to some multibyte codepoints).
>>
>> What is the practical use of being able to read x UTF-8 bytes from a  
>> stream that isn't in UTF-8?
>
> Being able to read exactly x bytes wasn't the purpose, it was simply the  
> problem I encountered writing a conversion stream. A stream that would  
> allow:
>
> ConversionStream s = new ConversionStream(new BufferedFile("a.txt"));
> char[] line;
>
> while(!s.eof()) {
> 	line = s.readLine(line);
> 	..etc..
> }
>
> s.close();
>
> This would read my cp1252 encoded file as char[] lines, instead of  
> reading them into ubyte[] or char[] (technically illegally) and then  
> converting each line.
>
> The problem arose in implementing the readExact method for the  
> ConversionStream. Firstly it would have required a buffer to deal with  
> the expansion of the data, secondly as I mentioned last post it is  
> possible to fail to get exactly x bytes, but have x-1 or x+1 instead.
>
> So, while I didn't want to read exactly x bytes the Stream interface  
> requires that it is possible, right?

I didn't pursue this far because I didn't need it, but the more I think on  
it...

I guess breaking the stream in the middle of a character (one which is  
encoded as several char code units) isn't a problem, as the next read will  
grab the rest and append it to the first part. The exception is a user who  
reads one char at a time and expects each to be a complete and/or valid  
character (which is just wrong).
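
A quick way to check this in present-day D, assuming std.utf.validate (the
helper isValidUTF8 is a made-up name for this example):

```d
import std.utf : validate, UTFException;

/// Hypothetical helper: true if `s` is well-formed UTF-8 on its own.
bool isValidUTF8(const(char)[] s)
{
    try
    {
        validate(s); // throws UTFException on malformed input
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}
```

Slicing "naïve\n" after its third byte cuts ï in half: neither piece passes
isValidUTF8 by itself, but the two pieces concatenated do, which is what a
caller that appends successive reads would end up with.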

When reading lines it will keep reading until it gets a '\n' or '\r\n', so  
it should never return part of a character.

So, perhaps there is no problem after all :)

Regan
September 07, 2005
Re: [Contribution] std.windows.charset
Stewart Gordon wrote:
> I've got together some functions for converting between Windows 
> character sets and UTF-8.  I propose that it be added to Phobos, since 
> it's a significant step forward in compatibility between D and Windows.
<snip>

I could've sworn I'd attached the thing!  Why did nobody point this out?

Stewart.
