June 04, 2005
char, wchar and dchar should be supported equally
I like D having char, wchar and dchar. And I like the way that they will 
(soon?) implicitly convert between each other. But I don't like the way 
that D is biased towards char. I think that char, dchar and wchar should 
be supported equally.

For example, modern Windows systems support UTF-16 (via the W 
functions). So you might decide to use wchar, because that is also 
UTF-16. The Windows API expects zero-terminated strings, and you can 
clearly indicate this in your code by calling toStringz. But toStringz 
takes char, so your wchar will be implicitly converted to char and then 
implicitly converted back to wchar. So there is no point using wchar!
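
(A rough sketch of the round trip being described, with the conversions 
written out explicitly via std.utf rather than done implicitly; the 
function and variable names here are only illustrative:)

<code>
import std.string;   // toStringz takes char[] and returns char*
import std.utf;      // toUTF8, toUTF16

void callSomeWFunction(wchar[] text)
{
    char[]  narrow = toUTF8(text);       // wchar[] -> char[], just so toStringz accepts it
    char*   zt     = toStringz(narrow);  // zero-terminated, but as UTF-8
    wchar[] wide   = toUTF16(narrow);    // ...and back to UTF-16 for the W API,
                                         // which still needs its own zero-terminated wchar*
}
</code>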

But what if every function in std.string had wchar and dchar versions?
Then you could use wchar and call wtoStringz. (At the end of this email, 
there is some working code showing how this could be implemented using 
templates and aliases. There are other ways that std.string could 
support wchar and dchar, such as function overloading or function 
templates.)

Also, in order for char, wchar and dchar to be supported equally, Object 
should have wtoString and dtoString methods. (Because toString cannot be 
overloaded based on its return type.)
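
(Purely as a sketch of what that might look like on a user class; the 
wtoString/dtoString overrides and the std.utf calls below are 
illustrative, not anything Object actually provides:)

<code>
import std.utf;

class Point
{
    char[]  toString()  { return "Point"; }              // what Object defines today
    wchar[] wtoString() { return toUTF16(toString()); }  // hypothetical UTF-16 variant
    dchar[] dtoString() { return toUTF32(toString()); }  // hypothetical UTF-32 variant
}
</code>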

Does anyone else out there feel the same? Or should I get over it and 
JUC (Just Use Char) like I already JUB (Just Use Bit)?

James McComb

<code>
import std.stdio;

template TStringFunctions(T) {
    T[] toStringz(T[] str) {
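        // Match the Phobos convention: a null input yields an empty literal
        // (D string literals are zero-terminated in memory).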
        if (!str)
            return "";

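        // Duplicate so the caller's array is untouched, then append the terminator.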
        T[] copy = str.dup;
        return copy ~= '\0';
    }

    // Other string functions...
}

alias TStringFunctions!(char)  stringFunctions;
alias TStringFunctions!(wchar) wstringFunctions;
alias TStringFunctions!(dchar) dstringFunctions;

alias stringFunctions.toStringz  toStringz;
alias wstringFunctions.toStringz wtoStringz;
alias dstringFunctions.toStringz dtoStringz;

// Other string function aliases...

// Example usage
void main() {
    char[]   str = "utf-8 string";
    wchar[] wstr = "utf-16 string";

    str  = toStringz(str);
    wstr = wtoStringz(wstr);
}
</code>
June 04, 2005
Re: char, wchar and dchar should be supported equally
James McComb wrote:
> I like D having char, wchar and dchar. And I like the way that they will 
> (soon?) implicitly convert between each other. But I don't like the way 
> that D is biased towards char. I think that char, dchar and wchar should 
> be supported equally.
> 
> For example, modern Windows systems support UTF-16 (via the W 
> functions). So you might decide to use wchar, because that is also 
> UTF-16. The Windows API expects zero-terminated strings, and you can 
> clearly indicate this in your code by calling toStringz. But toStringz 
> takes char, so your wchar will be implicitly converted to char and then 
> implicitly converted back to wchar. So there is no point using wchar!
> 
> But what if every function in std.string had wchar and dchar versions?
> Then you could use wchar and call wtoStringz. (At the end of this email, 
> there is some working code showing how this could be implemented using 
> templates and aliases. There are other ways that std.string could 
> support wchar and dchar, such as function overloading or function 
> templates.)
> 
> *snip* Object should have wtoString and dtoString methods. 
> 

Well, wtoString is a bad naming convention. I think toWString or 
toDString makes a little more sense, but to be honest, I think it should 
work like read and write, and return char[], wchar[], or dchar[] based 
on what you cast to.
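
(A hypothetical sketch of that "pick the type at the call site" idea, 
using one template name specialised per character type; toStringAs and 
the std.utf calls are illustrative only, not Phobos:)

<code>
import std.utf;

template toStringAs(T : char)  { char[]  toStringAs(Object o) { return o.toString(); } }
template toStringAs(T : wchar) { wchar[] toStringAs(Object o) { return toUTF16(o.toString()); } }
template toStringAs(T : dchar) { dchar[] toStringAs(Object o) { return toUTF32(o.toString()); } }

// usage: wchar[] w = toStringAs!(wchar)(someObject);
</code>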

That's my two cents anyhoo, as an avid dchar[] user.

-- 
Thanks,
Trevor Parscal
www.trevorparscal.com
trevorparscal@hotmail.com
June 04, 2005
Re: char, wchar and dchar should be supported equally
Trevor Parscal wrote:
> James McComb wrote:
> 
>> I like D having char, wchar and dchar. And I like the way that they 
>> will (soon?) implicitly convert between each other. But I don't like 
>> the way that D is biased towards char. I think that char, dchar and 
>> wchar should be supported equally.
>>
>> For example, modern Windows systems support UTF-16 (via the W 
>> functions). So you might decide to use wchar, because that is also 
>> UTF-16. The Windows API expects zero-terminated strings, and you can 
>> clearly indicate this in your code by calling toStringz. But toStringz 
>> takes char, so your wchar will be implicitly converted to char and 
>> then implicitly converted back to wchar. So there is no point using 
>> wchar!
>>
>> But what if every function in std.string had wchar and dchar versions?
>> Then you could use wchar and call wtoStringz. (At the end of this 
>> email, there is some working code showing how this could be 
>> implemented using templates and aliases. There are other ways that 
>> std.string could support wchar and dchar, such as function overloading 
>> or function templates.)
>>
>> *snip* Object should have wtoString and dtoString methods.
> 
> 
> Well, wtoString is a bad naming convention. I think toWString or 
> toDString makes a little more sense, but to be honest, I think it should 
> work like read and write, and return char[], wchar[], or dchar[] based 
> on what you cast to.
> 
> That's my two cents anyhoo, as an avid dchar[] user.
> 

I think that toString or any std function that takes a string and 
processes it, should always take dchar and return dchar.

Assuming that dchar is implicitly convertible to char and wchar, there 
can be no loss of information when doing something like:

<code>
dchar[] someFunction(dchar[]) ...

...

wchar[] wtest = ...
wtest = someFunction(wtest); //no loss

...

char[] test = ...
test = someFunction(test); //no loss
</code>

Of course I may be wrong, but I'm assuming that converting a char to 
wchar is like converting an int to double, where any extra space is 
just filled with zeros (speaking at the bit level), and you can convert 
an int to double, process it, and convert it back to int, and assume 
that no information will be lost because of the conversion to double. 
Of course, information can be lost if an int is not enough to store the 
value returned from the function, but that has nothing to do with 
converting back and forth to double and then to int.
June 04, 2005
Re: char, wchar and dchar should be supported equally
Hasan Aljudy wrote:
> 
> I think that toString or any std function that takes a string and 
> processes it, should always take dchar and return dchar.
> 

The best idea for this I have heard thus far. Especially since, anytime 
you are doing a toString, you aren't going to be worried about the 
additional overhead of a dchar[] (or so I believe).

-- 
Thanks,
Trevor Parscal
www.trevorparscal.com
trevorparscal@hotmail.com
June 04, 2005
Re: char, wchar and dchar should be supported equally
On Fri, 03 Jun 2005 21:37:23 -0600, Hasan Aljudy <hasan.aljudy@gmail.com>  
wrote:
> Of course I may be wrong, but I'm assuming that converting a char to  
> wchar is like converting an int to double, where any extra space is  
> just filled with zeros (speaking at the bit level)

Yes and No. In many cases, yes, especially where ASCII is used. However  
some UTF-8 'characters'/'glyphs' (not sure what the correct term is  
exactly) take 2 or more chars (UTF-8 codepoints) to represent, so when  
converting them you might go from 3 chars to 1 wchar (1 UTF-16 codepoint)  
which is a decrease in byte space required, and often a change in the  
value of the codepoint.
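
For instance (a small sketch, relying on std.utf.toUTF16; the euro sign  
is a single character that takes 3 chars in UTF-8 but 1 wchar in UTF-16):

<code>
import std.utf;

void main()
{
    char[]  u8  = "\u20AC";      // the euro sign: 3 chars in UTF-8
    wchar[] u16 = toUTF16(u8);   // a single wchar in UTF-16

    assert(u8.length  == 3);
    assert(u16.length == 1);
}
</code>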

> , and you can convert an int to double, process it, and convert it back  
> to int, and assume that no information will be lost because of the  
> conversion to double.

Converting to/from char[], wchar[] and dchar[] causes no loss of data,  
ever. All existing glyphs can be represented in UTF-8(char[]),  
UTF-16(wchar[]) and UTF-32(dchar[]), thus all existing strings can be  
represented in all types. Of course that representation uses a different  
number of bytes and may in fact use different bit patterns(codepoints) as  
well.
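
A quick round trip showing the point (a sketch only, using std.utf's  
conversion functions):

<code>
import std.utf;

void main()
{
    char[]  original = "caf\u00E9 \u20AC5";   // mixed ASCII and non-ASCII
    dchar[] wide     = toUTF32(original);     // char[] -> dchar[]
    char[]  back     = toUTF8(wide);          // dchar[] -> char[]

    assert(back == original);                 // nothing was lost either way
}
</code>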

Regan
June 04, 2005
Re: char, wchar and dchar should be supported equally
Regan Heath wrote:
> Converting to/from char[], wchar[] and dchar[] causes no loss of data,
> ever. All existing glyphs can be represented in UTF-8(char[]),
> UTF-16(wchar[]) and UTF-32(dchar[]), thus all existing strings can be
> represented in all types. Of course that representation uses a
> different number of bytes and may in fact use different bit
> patterns(codepoints) as well.
> 
> Regan

What then is the point of having all of these different types?

How does UTF-8 work when you only have 256 possible values?
June 04, 2005
Re: char, wchar and dchar should be supported equally
On Fri, 03 Jun 2005 20:42:25 -0700, Trevor Parscal  
<trevorparscal@hotmail.com> wrote:
> Hasan Aljudy wrote:
>>  I think that toString or any std function that takes a string and  
>> processes it, should always take dchar and return dchar.
>>
>
> The best idea for this I have heard thus far.. Especially since, anytime  
> you are doing a toString you aren't going to be worried about the  
> addtional overhead of a dchar[] (or so I believe)

If you're using char[] then it gets converted to dchar[], processed, then  
converted back. That's not ideal IMO.

Ideally we only want conversion to happen in 1, or at most 2 places.

1. Data is converted on input from <input format> to <internal format>.
2. Data is converted on output from <internal format> to <output format>.

Sometimes applications will do #1, sometimes they will do #2, sometimes  
they will do both (for one reason or another). Each application will have  
a different <internal format> chosen for some specific reason, perhaps  
even a different <internal format> for each group of data.

So, ideally we require 3 variants of every single string function. But of  
course, we don't want to be repeating ourselves all the time; in fact we  
want only one 'function' that we re-use for all 3 string types. So, might  
I suggest using templates, e.g.

import std.stdio;
import std.ctype;

template toLowerT(Type) {
  Type[] toLowerT(Type[] input) {
    Type[] res = input.dup;
    foreach(inout Type c; res)
      c = cast(Type) tolower(c);
    return res;
  }
}

alias toLowerT!(char) toLower;
alias toLowerT!(wchar) toLower;
alias toLowerT!(dchar) toLower;

void main()
{
	char[] a = "REGAN";
	wchar[] b = "WAS";
	dchar[] c = "HERE";
	
	//we can even use the x.fn() form as opposed to fn(x) if we wish.
	writefln("%s=%s",a,a.toLower());
	writefln("%s=%s",b,b.toLower());
	writefln("%s=%s",c,c.toLower());
}

NOTE: I realise using ctype's tolower function will only work with ASCII,  
not the full complement of Unicode characters. This is a semi-functional  
example only.

Regan
June 04, 2005
Re: char, wchar and dchar should be supported equally
On Sat, 04 Jun 2005 00:05:46 -0600, Hasan Aljudy <hasan.aljudy@gmail.com>  
wrote:
> Regan Heath wrote:
>> Converting to/from char[], wchar[] and dchar[] causes no loss of data,
>> ever. All existing glyphs can be represented in UTF-8(char[]),
>> UTF-16(wchar[]) and UTF-32(dchar[]), thus all existing strings can be
>> represented in all types. Of course that representation uses a
>> different number of bytes and may in fact use different bit
>> patterns(codepoints) as well.
>> Regan
>
> What then is the point of having all of these different types?

They're each better or worse depending on the data you're operating on.

Terminology: (I think this is correct)
  Codepoint == one char, wchar, or dchar.
  Character == a symbol, made up of 1 or more codepoints.

UTF-8 is perfect if most/all of your data is ASCII, as UTF-8 characters  
have the same values as they do in ASCII; ASCII is a sub-set of UTF-8  
(which can represent characters that do not exist in ASCII).

UTF-16 is better than UTF-8 in cases where most/all of your data would  
take 2 or more UTF-8 codepoints to represent. Essentially UTF-16 can store  
some characters in less space than UTF-8 can.

UTF-32 is better than UTF-16 in cases where most/all of your data would  
take 2 or more UTF-16 codepoints to represent.

Some people choose to use UTF-32 as you can guarantee a codepoint == a  
character, meaning a dchar[]'s length property is the 'string' length  
(this is not always the case with wchar, or char, due to some characters  
taking more than 1 codepoint).
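
A small example of that guarantee (a sketch; the character outside the  
BMP is chosen to force a surrogate pair in UTF-16):

<code>
import std.utf;

void main()
{
    dchar[] d = "A\U0001D11E";   // 'A' plus a character outside the BMP
    wchar[] w = toUTF16(d);
    char[]  c = toUTF8(d);

    assert(d.length == 2);   // dchar[]: always one codepoint per character
    assert(w.length == 3);   // the second character needs a surrogate pair
    assert(c.length == 5);   // 1 + 4 chars in UTF-8
}
</code>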

> How does UTF-8 work when you only have 256 possible values?

In essence it uses between 1 and 4 codepoints to represent a single  
character.

Someone probably has a better reference than this:
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-AppendixA

I just quickly googled that up.

Regan
June 04, 2005
Re: char, wchar and dchar should be supported equally
Regan Heath wrote:

> template toLowerT(Type) {
>   Type[] toLowerT(Type[] input) {
>     Type[] res = input.dup;
>     foreach(inout Type c; res)
>       c = cast(Type) tolower(c);
>     return res;
>   }
> }
> 
> alias toLowerT!(char) toLower;
> alias toLowerT!(wchar) toLower;
> alias toLowerT!(dchar) toLower;

Thinks: so that's how you do it! :)

This is the kind of thing I had in mind. Is there any chance that 
std.string actually *will* be implemented like this?

James McComb
June 04, 2005
Re: char, wchar and dchar should be supported equally
On Sat, 04 Jun 2005 11:20:47 +1000, James McComb wrote:

> I like D having char, wchar and dchar. And I like the way that they will 
> (soon?) implicitly convert between each other. But I don't like the way 
> that D is biased towards char. I think that char, dchar and wchar should 
> be supported equally.

Yes please. I've had to write dchar[] versions of a lot of things in
std.string and others. 

I tend to use char[] only when reading from and writing to files/streams,
and use dchar[] for internal routines. The application I'm working on now
does a lot of text processing and it is too slow to convert char[] ->
dchar[], process it, then convert dchar[] -> char[].

The simplicity of dchar[] is that the array index always points to the
start of a character, whereas with char[] and wchar[] the index can point
to somewhere inside a character. (Remembering that each character in a
dchar[] string is the same size - a dchar - but characters in wchar[] and
char[] have variable sizes.)
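
For example (a sketch using std.utf; the accented 'e' is the only
non-ASCII character):

<code>
import std.utf;

void main()
{
    char[]  c = "h\u00E9llo";   // the accented 'e' takes two chars in UTF-8
    dchar[] d = toUTF32(c);

    assert(c.length == 6);   // six chars for five characters
    assert(d.length == 5);   // one dchar per character

    // c[1] and c[2] are the two halves of the accented 'e', so an index
    // into c can land mid-character; every index into d starts a character.
}
</code>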

The current Phobos routines are heavily biased to char[]. Also, the use of
templates is not always the best solution because there are some
optimizations available, depending on the UTF encoding format used.


-- 
Derek Parnell
Melbourne, Australia
4/06/2005 6:08:29 PM