char, wchar and dchar should be supported equally
June 04, 2005
I like D having char, wchar and dchar. And I like the way that they will (soon?) implicitly convert between each other. But I don't like the way that D is biased towards char. I think that char, dchar and wchar should be supported equally.

For example, modern Windows systems support UTF-16 (via the W functions). So you might decide to use wchar, because that is also UTF-16. The Windows API expects zero-terminated strings, and you can clearly indicate this in your code by calling toStringz. But toStringz takes char, so your wchar will be implicitly converted to char and then implicitly converted back to wchar. So there is no point using wchar!

But what if every function in std.string had wchar and dchar versions?
Then you could use wchar and call wtoStringz. (At the end of this email, there is some working code showing how this could be implemented using templates and aliases. There are other ways that std.string could support wchar and dchar, such as function overloading or function templates.)

Also, in order for char, wchar and dchar to be supported equally, Object should have wtoString and dtoString methods. (Because toString cannot be overloaded based on its return type.)

Does anyone else out there feel the same? Or should I get over it and JUC (Just Use Char) like I already JUB (Just Use Bit)?

James McComb

<code>
import std.stdio;

template TStringFunctions(T) {
    T[] toStringz(T[] str) {
        if (!str)
            return "";

        T[] copy = str.dup;
        return copy ~= '\0';
    }

    // Other string functions...
}

alias TStringFunctions!(char)  stringFunctions;
alias TStringFunctions!(wchar) wstringFunctions;
alias TStringFunctions!(dchar) dstringFunctions;

alias stringFunctions.toStringz  toStringz;
alias wstringFunctions.toStringz wtoStringz;
alias dstringFunctions.toStringz dtoStringz;

// Other string function aliases...

// Example usage
void main() {
    char[]   str = "utf-8 string";
    wchar[] wstr = "utf-16 string";

    str  = toStringz(str);
    wstr = wtoStringz(wstr);
}
</code>
June 04, 2005
James McComb wrote:
> I like D having char, wchar and dchar. And I like the way that they will (soon?) implicitly convert between each other. But I don't like the way that D is biased towards char. I think that char, dchar and wchar should be supported equally.
> 
> For example, modern Windows systems support UTF-16 (via the W functions). So you might decide to use wchar, because that is also UTF-16. The Windows API expects zero-terminated strings, and you can clearly indicate this in your code by calling toStringz. But toStringz takes char, so your wchar will be implicitly converted to char and then implicitly converted back to wchar. So there is no point using wchar!
> 
> But what if every function in std.string had wchar and dchar versions?
> Then you could use wchar and call wtoStringz. (At the end of this email, there is some working code showing how this could be implemented using templates and aliases. There are other ways that std.string could support wchar and dchar, such as function overloading or function templates.)
> 
> *snip* Object should have wtoString and dtoString methods. 
> 

Well, wtoString is a bad naming convention. I think toWString or toDString makes a little more sense, but to be honest, I think it should work like read and write, and return char[], wchar[], or dchar[] based on what you cast.

That's my two cents anyhoo, as an avid dchar[] user.

-- 
Thanks,
Trevor Parscal
www.trevorparscal.com
trevorparscal@hotmail.com
June 04, 2005
Trevor Parscal wrote:
> James McComb wrote:
> 
>> I like D having char, wchar and dchar. And I like the way that they will (soon?) implicitly convert between each other. But I don't like the way that D is biased towards char. I think that char, dchar and wchar should be supported equally.
>>
>> For example, modern Windows systems support UTF-16 (via the W functions). So you might decide to use wchar, because that is also UTF-16. The Windows API expects zero-terminated strings, and you can clearly indicate this in your code by calling toStringz. But toStringz takes char, so your wchar will be implicitly converted to char and then implicitly converted back to wchar. So there is no point using wchar!
>>
>> But what if every function in std.string had wchar and dchar versions?
>> Then you could use wchar and call wtoStringz. (At the end of this email, there is some working code showing how this could be implemented using templates and aliases. There are other ways that std.string could support wchar and dchar, such as function overloading or function templates.)
>>
>> *snip* Object should have wtoString and dtoString methods.
> 
> 
> Well, wtoString is a bad naming convention. I think toWString or toDString makes a little more sense, but to be honest, I think it should work like read and write, and return char[], wchar[], or dchar[] based on what you cast.
> 
> That's my two cents anyhoo, as an avid dchar[] user.
> 

I think that toString or any std function that takes a string and processes it, should always take dchar and return dchar.

Assuming that dchar is implicitly convertible to char and wchar, there can be no loss of information when doing something like:

<code>
dchar[] someFunction(dchar[]) ...

...

wchar[] wtest = ...
wtest = someFunction(wtest); //no loss

...

char[] test = ..
test = someFunction(test); //no loss
</code>

Of course I may be wrong, but I'm assuming that converting a char to wchar is like converting an int to double, where any extra space is just filled with zeros (speaking at the bit level), and you can convert an int to double, process it, and convert it back to int, and assume that no information will be lost because of the conversion to double.
Of course, information can be lost if "int" is not enough to store the value returned from the function, but this has nothing to do with converting back and forth between int and double.
June 04, 2005
Hasan Aljudy wrote:
> 
> I think that toString or any std function that takes a string and processes it, should always take dchar and return dchar.
> 

The best idea for this I have heard thus far. Especially since, any time you are doing a toString, you aren't going to be worried about the additional overhead of a dchar[] (or so I believe).

-- 
Thanks,
Trevor Parscal
www.trevorparscal.com
trevorparscal@hotmail.com
June 04, 2005
On Fri, 03 Jun 2005 21:37:23 -0600, Hasan Aljudy <hasan.aljudy@gmail.com> wrote:
> Of course I may be wrong, but I'm assuming that converting a char to wchar is like converting an int to double, where any extra space is just filled with zeros (speaking at the bit level)

Yes and no. In many cases, yes, especially where ASCII is used. However, some characters (code points, to use the Unicode term) take 2 or more chars (UTF-8 code units) to represent, so when converting them you might go from 3 chars to 1 wchar (1 UTF-16 code unit), which is a decrease in the space required (3 bytes down to 2) and a change in the stored bit patterns, not just zero-padding.

> , and you can convert an int to double, process it, and convert it back to int, and assume that no information will be lost because of the conversion to double.

Converting between char[], wchar[] and dchar[] causes no loss of data, ever. Every Unicode character can be represented in UTF-8 (char[]), UTF-16 (wchar[]) and UTF-32 (dchar[]), so every valid string can be represented in all three types. Of course each representation uses a different number of bytes and may in fact use different bit patterns (code units) as well.
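
As a rough illustration of the lossless round trip described above, here is a minimal sketch. It assumes a current D compiler and Phobos (`std.utf.toUTF8`, `toUTF16`, `toUTF32`), which postdate this thread; the sample text and the helper name `roundTrips` are made up for the example.

```d
import std.stdio : writeln;
import std.utf : toUTF8, toUTF16, toUTF32;

// Hypothetical helper: true if s survives UTF-8 -> UTF-16 -> UTF-32 -> UTF-8.
bool roundTrips(string s)
{
    wstring w = toUTF16(s);
    dstring d = toUTF32(w);
    return toUTF8(d) == s;
}

void main()
{
    // ASCII, a 2-byte and a 3-byte UTF-8 character, and one outside the BMP.
    string s = "na\u00EFve \u20AC \U0001F600";

    writeln(s.length);            // number of UTF-8 code units (bytes)
    writeln(toUTF16(s).length);   // number of UTF-16 code units
    writeln(toUTF32(s).length);   // number of UTF-32 code units
    writeln(roundTrips(s));       // whether the text came back unchanged
}
```

For this sample the three lengths differ (15, 10 and 9 code units), yet the text that comes back is identical.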

Regan
June 04, 2005
Regan Heath wrote:
> Converting between char[], wchar[] and dchar[] causes no loss of data, ever. Every Unicode character can be represented in UTF-8 (char[]), UTF-16 (wchar[]) and UTF-32 (dchar[]), so every valid string can be represented in all three types. Of course each representation uses a different number of bytes and may in fact use different bit patterns (code units) as well.
> 
> Regan

What then is the point of having all of these different types?

How does UTF-8 work when you only have 256 possible values?
June 04, 2005
On Fri, 03 Jun 2005 20:42:25 -0700, Trevor Parscal <trevorparscal@hotmail.com> wrote:
> Hasan Aljudy wrote:
>>  I think that toString or any std function that takes a string and processes it, should always take dchar and return dchar.
>>
>
> The best idea for this I have heard thus far. Especially since, any time you are doing a toString, you aren't going to be worried about the additional overhead of a dchar[] (or so I believe).

If you're using char[] then it gets converted to dchar[], processed, then converted back. That's not ideal IMO.

Ideally we only want conversion to happen in 1, or at most 2 places.

1. Data is converted on input from <input format> to <internal format>.
2. Data is converted on output from <internal format> to <output format>.

Sometimes applications will do #1, sometimes they will do #2, sometimes they will do both (for one reason or another). Each application will have a different <internal format> chosen for some specific reason, perhaps even a different <internal format> for each group of data.
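
The two conversion points above could be sketched roughly as follows. This assumes a current D compiler and Phobos rather than the 2005 one, picks dchar[] as the internal format purely for illustration, and uses a made-up task and function name (`reverseText`).

```d
import std.algorithm.mutation : reverse;
import std.stdio : writeln;
import std.utf : toUTF8, toUTF32;

// Hypothetical example task: reverse a string code point by code point.
string reverseText(string input)
{
    dchar[] work = toUTF32(input).dup;  // #1: input format -> internal format
    reverse(work);                      // all processing stays in dchar[]
    return toUTF8(work);                // #2: internal format -> output format
}

void main()
{
    writeln(reverseText("a\u20ACb"));
}
```

Everything between the two conversions works on one representation, so no intermediate step pays for a transcoding.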

So, ideally we require 3 variants of every single string function. But of course we don't want to be repeating ourselves all the time; in fact we want to write only one 'function' and re-use it for all 3 string types. So, might I suggest using templates, e.g.

<code>
import std.stdio;
import std.ctype;

template toLowerT(Type) {
    Type[] toLowerT(Type[] input) {
        Type[] res = input.dup;
        foreach (inout Type c; res)
            c = tolower(c);
        return res;
    }
}

alias toLowerT!(char)  toLower;
alias toLowerT!(wchar) toLower;
alias toLowerT!(dchar) toLower;

void main()
{
    char[]  a = "REGAN";
    wchar[] b = "WAS";
    dchar[] c = "HERE";

    // We can even use the x.fn() form as opposed to fn(x) if we wish.
    writefln("%s=%s", a, a.toLower());
    writefln("%s=%s", b, b.toLower());
    writefln("%s=%s", c, c.toLower());
}
</code>

NOTE: I realise using ctype's tolower function will only work with ASCII, not the full complement of Unicode characters. This is a semi-functional example only.

Regan

June 04, 2005
On Sat, 04 Jun 2005 00:05:46 -0600, Hasan Aljudy <hasan.aljudy@gmail.com> wrote:
> Regan Heath wrote:
>> Converting between char[], wchar[] and dchar[] causes no loss of data, ever. Every Unicode character can be represented in UTF-8 (char[]), UTF-16 (wchar[]) and UTF-32 (dchar[]), so every valid string can be represented in all three types. Of course each representation uses a different number of bytes and may in fact use different bit patterns (code units) as well.
>> Regan
>
> What then is the point of having all of these different types?

They're each better or worse depending on the data you're operating on.

Terminology: (I think this is correct)
  Code unit == one char, wchar, or dchar.
  Code point (character) == a Unicode value, stored as 1 or more code units.

UTF-8 is perfect if most/all of your data is ASCII: ASCII is a subset of UTF-8, so ASCII characters have the same values in both, and UTF-8 can also represent characters that do not exist in ASCII.

UTF-16 is better than UTF-8 in cases where most/all of your data would take 3 UTF-8 code units per character, since those characters fit in a single wchar (2 bytes). Essentially UTF-16 can store some characters in less space than UTF-8 can.

UTF-32 never takes less space than UTF-16 (a character that needs 2 wchars takes 4 bytes either way), but it is the simplest to work with when your data has many such characters.

Some people choose to use UTF-32 because it guarantees that one code unit == one character, meaning a dchar[]'s length property is the 'string' length (this is not always the case with wchar[] or char[], because some characters take more than 1 code unit).

> How does UTF-8 work when you only have 256 possible values?

In essence it uses between 1 and 4 chars (code units) to represent a single character.
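
To make the "between 1 and 4" concrete, here is a small sketch that asks how many char and wchar elements a few characters need. It assumes a current D compiler and Phobos (`std.utf.codeLength`), not the 2005 library, and the four sample characters are arbitrary picks.

```d
import std.stdio : writefln;
import std.utf : codeLength;

void main()
{
    // U+0041 'A', U+00E9, U+20AC (euro sign), U+1F600 (outside the BMP)
    foreach (dchar c; "A\u00E9\u20AC\U0001F600"d)
    {
        writefln("U+%04X: %s char(s), %s wchar(s)",
                 cast(uint) c, codeLength!char(c), codeLength!wchar(c));
    }
}
```

The four samples need 1, 2, 3 and 4 chars respectively, and 1, 1, 1 and 2 wchars.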

Someone probably has a better reference than this:
http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-AppendixA

I just quickly googled that up.

Regan
June 04, 2005
Regan Heath wrote:

> template toLowerT(Type) {
>   Type[] toLowerT(Type[] input) {
>     Type[] res = input.dup;
>     foreach(inout Type c; res)
>         c = tolower(c);
>     return res;
>   }
> }
> 
> alias toLowerT!(char) toLower;
> alias toLowerT!(wchar) toLower;
> alias toLowerT!(dchar) toLower;

Thinks: so that's how you do it! :)

This is the kind of thing I had in mind. Is there any chance that std.string actually *will* be implemented like this?

James McComb
June 04, 2005
On Sat, 04 Jun 2005 11:20:47 +1000, James McComb wrote:

> I like D having char, wchar and dchar. And I like the way that they will (soon?) implicitly convert between each other. But I don't like the way that D is biased towards char. I think that char, dchar and wchar should be supported equally.

Yes please. I've had to write dchar[] versions of a lot of things in std.string and others.

I tend to use char[] only when reading from and writing to files/streams, and use dchar[] for internal routines. The application I'm working on now does a lot of text processing and it is too slow to convert char[] -> dchar[], process it, convert dchar[] -> char[].

The simplicity of dchar[] is that the array index always points to the start of a character, whereas with char[] and wchar[] the index can point to somewhere inside a character. (Remember that each character in a dchar[] string is the same size, a dchar, but characters in wchar[] and char[] have variable sizes.)
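
The indexing difference can be seen in a short sketch like the one below. It assumes a current D compiler and Phobos (`std.utf.stride`), not the 2005 toolchain, and the sample word is arbitrary.

```d
import std.stdio : writeln;
import std.utf : stride;

void main()
{
    string  s8  = "h\u00E9llo";    // U+00E9 takes two chars in UTF-8
    dstring s32 = "h\u00E9llo"d;   // one dchar per character

    writeln(s8.length, " ", s32.length);

    // Index 2 of the dchar[] is the third character.
    assert(s32[2] == 'l');

    // Index 2 of the char[] is the second byte of U+00E9, not a character start.
    assert(s8[2] == 0xA9);

    // stride reports how many chars the character starting at index 1 occupies.
    assert(stride(s8, 1) == 2);
}
```

With char[] you have to step by `stride` (or decode) to stay on character boundaries; with dchar[] plain indexing is enough.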

The current Phobos routines are heavily biased to char[]. Also, the use of templates is not always the best solution because there are some optimizations available, depending on the UTF encoding format used.


-- 
Derek Parnell
Melbourne, Australia
4/06/2005 6:08:29 PM