October 01, 2006 Re: First Impressions | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | Walter Bright wrote:
>>
>> And yet we have "toString" and not "toCharArray" or "toUTF"!
>
> True, and some have called for renaming char to utf8. While that would
> be technically more correct (as toUTF would be, too), it just looks awful.
>
Nope, it just looks correct.
--
Lars Ivar Igesund
blog at http://larsivi.net
DSource & #D: larsivi
|
October 01, 2006 Re: First Impressions | ||||
---|---|---|---|---|
| ||||
Posted in reply to BCS | BCS wrote:
> One alternative that I could live with would use 4 character types:
>
> char one codeunit in whatever encoding the runtime uses
> schar one 8 bit code unit (ASCII or utf-8)
> wchar one 16 bit code unit (same as before)
> dchar one 32 bit code unit (same as before)
We have that already:
ubyte one codeunit in whatever encoding the runtime uses
char one 8 bit code unit (ASCII or utf-8)
There is no support in Phobos for runtime/native encodings, but you can use the "iconv" library to do such conversions ?
> (using the same thing for ASCII and UTF-8 may be a problem, but this isn't my field)
All ASCII characters are valid UTF-8 code units, so it's OK.
--anders
|
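The claim that ASCII text is already valid UTF-8 can be checked with a small sketch. It assumes a current D compiler and Phobos' `std.utf.validate`; the helper `isAscii` is a hypothetical name introduced only for this illustration.

```d
import std.utf : validate;

/// Hypothetical helper: true when every code unit is below 0x80,
/// i.e. the text is pure ASCII.
bool isAscii(const(char)[] s)
{
    foreach (char c; s)      // iterates UTF-8 code units, not code points
        if (c >= 0x80)
            return false;
    return true;
}

void main()
{
    string plain = "Motorhead";
    string accented = "Motörhead";

    // Both are well-formed UTF-8; validate throws on malformed input.
    validate(plain);
    validate(accented);

    assert(isAscii(plain));       // ASCII is a subset of UTF-8
    assert(!isAscii(accented));   // the reverse does not hold
    assert(plain.length == 9);
    assert(accented.length == 10); // 'ö' occupies two code units
}
```

The last two assertions show why the reverse direction fails: `.length` counts code units, so one non-ASCII letter already breaks the "one char per letter" assumption.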
October 01, 2006 Re: First Impressions | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter Bright | Walter Bright wrote:
> True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.
Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.
|
October 01, 2006 Re: First Impressions | ||||
---|---|---|---|---|
| ||||
Posted in reply to Anders F Björklund | Anders F Björklund wrote:
> BCS wrote:
>
>> One alternative that I could live with would use 4 character types:
>>
>> char one codeunit in whatever encoding the runtime uses
>> schar one 8 bit code unit (ASCII or utf-8)
>> wchar one 16 bit code unit (same as before)
>> dchar one 32 bit code unit (same as before)
>
> We have that already:
>
> ubyte one codeunit in whatever encoding the runtime uses
> char one 8 bit code unit (ASCII or utf-8)
ubyte is an 8 bit unsigned number, not a character encoding.
[after some more reading]
I may be just rambling but...
how about having the type of the value denote the encoding. One for ASCII would only ever store ASCII (UTF-8 is invalid), same for UTF-8, 16 and 32. Direct assignment would be illegal (as with, say, int[] -> Object) or implicitly converted (as with int -> real). Casts would be provided. Indexing would be by codepoint. Non-array variables would be big enough to store any codepoint (ASCII -> 8bit, !ASCII -> 32-bit).
Some sort of "whatever the system uses" data type (à la C's int) could be used for actual output, maybe even escaping anything that won't get displayed correctly.
This all sort of follows the idea of "call it what it is and don't hide the overhead".
1) Characters are a different type of data than numbers (see the threads on bool) and as such, that should be reflected in the type system.
2) I have no problem with high overhead operations as long as I can avoid using them when I don't want to.
>
> There is no support in Phobos for runtime/native encodings,
> but you can use the "iconv" library to do such conversions ?
>
>> (using the same thing for ASCII and UTF-8 may be a problem, but this isn't my field)
>
> All ASCII characters are valid UTF-8 code units, so it's OK.
>
But UTF-8 is not ASCII.
> --anders
|
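The idea of letting the type denote the encoding can be roughed out as a library type. The sketch below is only one possible reading of the proposal: `AsciiText` and its members are hypothetical names, it is written for a current D compiler, and it checks at run time rather than in the type system proper.

```d
import std.exception : assertThrown, enforce;

/// Hypothetical wrapper: holds only ASCII, so a code-unit index and a
/// code-point index are the same thing and indexing stays O(1).
struct AsciiText
{
    private string data;

    this(string s)
    {
        foreach (char c; s)   // inspect every code unit once
            enforce(c < 0x80, "non-ASCII code unit in AsciiText");
        data = s;
    }

    char opIndex(size_t i) const { return data[i]; }
    size_t length() const @property { return data.length; }

    /// Explicit conversion: every ASCII string is also valid UTF-8.
    string toUTF8() const { return data; }
}

void main()
{
    auto ok = AsciiText("Jose");
    assert(ok.length == 4);
    assert(ok[3] == 'e');
    assert(ok.toUTF8() == "Jose");

    // The accented spelling is rejected instead of being stored silently.
    assertThrown(AsciiText("José"));
}
```

The conversion to UTF-8 is free, while the conversion in the other direction has to check every code unit, which matches the "don't hide the overhead" point.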
October 01, 2006 Re: First Impressions | ||||
---|---|---|---|---|
| ||||
Posted in reply to BCS | BCS wrote:
> I may be just rambling but...
>
> how about having the type of the value denote the encoding. One for ASCII would only ever store ASCII (UTF-8 is invalid)
Then all Americans would use that instead of UTF-8.
This is natural, since first you code for yourself, later maybe for your boss, etc. And, you'd only become aware of any problems when a Latino tries to use his own name José, talk about Motörhead, or Anaïs the fragrance. And the mail and newsreader you wrote in D simply would not work.
Guess if anybody would heed the warning "Only use this new ASCII encoding when you are absolutely positive the program never will encounter a single foreign sentence or letter".
So, better not.
---
D's current setup and documentation encourage this kind of suggestion, and I don't blame you.
Things being like they are, a programmer who wants to write a crossword puzzle generator, would of course begin with:
char[20][20] theGrid;
It's a shame that an otherwise so excellent language ( + the wording in its docs) downright leads you to do this.
The guy naturally assumes that D being a "UTF-8" language, this would work even in Chinese. (Hey, char[] foo = "José Motörhead from the band Anaïs is on stage!"; works, so why wouldn't theGrid?) Poor guy.
I can't blame anyone then wanting to stay within ASCII for the rest of D's life.
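To make the crossword problem concrete: a cell of type char holds one UTF-8 code unit, not one letter, so a grid of dchar is the safer choice. This is a minimal sketch for a current D compiler; `placeWord` is a hypothetical helper, not something from Phobos.

```d
/// Hypothetical helper: writes `word` into one row of the grid, one code
/// point per cell, and returns the number of cells used.
size_t placeWord(ref dchar[20][20] grid, size_t row, string word)
{
    size_t col = 0;
    foreach (dchar c; word)      // foreach with dchar decodes the UTF-8
        grid[row][col++] = c;
    return col;
}

void main()
{
    dchar[20][20] theGrid;       // one cell = one code point
    foreach (ref r; theGrid)
        r[] = ' ';               // blank out every cell

    immutable used = placeWord(theGrid, 0, "José");

    assert(used == 4);                       // four letters, four cells
    assert("José".length == 5);              // but five UTF-8 code units
    assert(theGrid[0][0 .. used] == "José"d);
    assert(theGrid[0][3] == 'é');            // the accented letter fits in one cell
}
```

With `char[20][20]` the same word would need five cells, and the fourth letter would be split across two of them.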
|
October 01, 2006 Re: First Impressions | ||||
---|---|---|---|---|
| ||||
Posted in reply to BCS | BCS wrote:
> ubyte is an 8 bit unsigned number not a character encoding.
Right, I actually meant ubyte[] but void[] might have been more accurate for representing any (even non-UTF) encoding. (I used ubyte[] in my mapping functions, since they only used legacy 8-bit encodings like "cp1252" or "macroman")
Re-reading your post, it seems to me that you were more talking about doing an alias to the UTF type most suitable for the OS ?
I guess UTF-8 would be a good choice if the operating system doesn't use Unicode, since then it'll have to do lookups anyway. Otherwise the existing "wchar_t" isn't bad for such an UTF type, it will be UTF-16 on Windows and UTF-32 on Unix (linux,darwin,...)
>> All ASCII characters are valid UTF-8 code units, so it's OK.
>
> But UTF-8 is not ASCII.
So you would like a char "type" that would only take ASCII ? I guess that is *one* way of dealing with it, you could also have a wchar type that wouldn't accept surrogates (BMP only)
Then it would be OK to index them by code unit / character... (since each allowed character would fit into one code unit)
Sounds a little like signed vs. unsigned integers actually ?
Then again, 5 character types is even worse than the 3 now.
--anders
|
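The "BMP only" restriction for wchar can be illustrated with a short sketch. It assumes a current D compiler and Phobos' `std.utf.toUTF16` and `toUTF32`; `isBmpOnly` is a hypothetical helper introduced here.

```d
import std.utf : toUTF16, toUTF32;

/// Hypothetical check: true when the UTF-16 text contains no surrogates,
/// so that every code unit is a complete character.
bool isBmpOnly(const(wchar)[] s)
{
    foreach (wchar c; s)     // iterates UTF-16 code units
        if (c >= 0xD800 && c <= 0xDFFF)
            return false;
    return true;
}

void main()
{
    string bmp = "ö";              // U+00F6, inside the BMP
    string astral = "\U0001D11E";  // U+1D11E, outside the BMP

    assert(toUTF16(bmp).length == 1);
    assert(toUTF16(astral).length == 2);  // encoded as a surrogate pair
    assert(toUTF32(astral).length == 1);  // always one unit per code point

    assert(isBmpOnly(toUTF16(bmp)));
    assert(!isBmpOnly(toUTF16(astral)));
}
```

Only text that passes such a check can safely be indexed by code unit as if each unit were a character.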
October 01, 2006 Re: First Impressions | ||||
---|---|---|---|---|
| ||||
Posted in reply to Bruno Medeiros | Bruno Medeiros wrote:
> Precisely! And even if such conceptual difference didn't exist, or is very rare, 'string' is nonetheless more readable than 'char[]', a fact I am constantly reminded of when I see 'int main(char[][] args)' instead of 'int main(string[] args)', which translates much more quickly into the brain as 'array of strings' than its current counterpart.
>
There are also many cases where char arrays are not strings:
Single array of characters, not strings:
char GAME_10PT_LETTERS[] = { 'x', 'z' };
Two-dimensional array of characters, not string arrays:
char GAME_LETTERS[][] = { GAME_0PT_LETTERS, GAME_1PT_LETTERS, .. };
char m_scrabbleBoard[20][20];
|
October 02, 2006 Re: First Impressions | ||||
---|---|---|---|---|
| ||||
Posted in reply to Lars Ivar Igesund | Lars Ivar Igesund wrote:
> Walter Bright wrote:
>
>>> And yet we have "toString" and not "toCharArray" or "toUTF"!
>> True, and some have called for renaming char to utf8. While that would
>> be technically more correct (as toUTF would be, too), it just looks awful.
>>
>
> Nope, it just looks correct.
>
I don't think renaming toString to toUTF gets rid of any confusion. AFAIK, toString is meant for debugging and char[] should be enough, and yet flexible enough for unicode strings.
In fact, "string toString()" would be a good solution too.
---
My 4 reasons for the "string" aliases:
* readability: less [] pairs;
* safety: char[] is not zero-terminated, so let's not pretend there's a relation with C's char*. In fact: let's hide any relation;
* clarity: a char[] should not be iterated 1 char at a time, which makes it different from an int[].
* consistency: "string toString()"
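The clarity point, that a char[] should not be walked one char at a time, can be seen in a few lines. This is a small sketch for a current D compiler; the two counting helpers are hypothetical names used only here.

```d
/// Counts UTF-8 code units: what iterating "1 char at a time" sees.
size_t countUnits(const(char)[] s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

/// Counts code points: asking foreach for dchar makes it decode the UTF-8.
size_t countPoints(const(char)[] s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    assert(countUnits("Anais") == 5 && countPoints("Anais") == 5);

    // One accented letter is enough to make the two views disagree.
    assert(countUnits("Anaïs") == 6);
    assert(countPoints("Anaïs") == 5);
}
```

An int[] has no such second view, which is the difference the list item refers to.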
L.
|
October 02, 2006 Re: First Impressions | ||||
---|---|---|---|---|
| ||||
Posted in reply to Anders F Björklund | Anders F Björklund wrote:
[...]
>
> Then again, 5 character types is even worse than the 3 now.
>
> --anders
The more I think about it the worse this gets.
What I really would like is a system that allows O(1) operations on strings (slice out char 7 to 27), allows somewhat compact encoding (8bit) and allows safe operations on UTF (if I do something dumb, it complains). All at the same time would be nice, but is not needed.
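As a rough illustration of that trade-off, assuming a current D compiler and Phobos' `std.utf`: converting once to UTF-32 buys O(1) slicing by character at four bytes per character, while the compact UTF-8 form can only be sliced in O(1) by code unit, and a careless slice is no longer valid text.

```d
import std.exception : assertThrown;
import std.utf : toUTF32, validate, UTFException;

void main()
{
    string s = "Motörhead";
    dstring d = toUTF32(s);        // one-time O(n) conversion

    // O(1) slice; the indices are code points.
    assert(d[2 .. 5] == "tör"d);

    // On the UTF-8 form the indices are code units. s[0 .. 4] ends in the
    // middle of the two-unit 'ö', and validate reports the damage.
    assertThrown!UTFException(validate(s[0 .. 4]));

    // A slice that respects the code point boundary is fine.
    validate(s[0 .. 5]);
    assert(s[0 .. 5] == "Motö");
}
```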
Come to think about it, a lib that will do good FAST conversion between buffers:
//note: "in" is intentional, it won't allocate anything
UTF8to16(in char[], in wchar[]);
UTF8to32(in char[], in dchar[]);
UTF16to32(in wchar[], in dchar[]);
...
would get most of what I want.
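One way such a buffer-to-buffer conversion could look is sketched below. It is written for a current D compiler; `utf8to16` is a hypothetical function shaped after the signatures above, and it relies on the `std.utf.encode` overload that writes into a `wchar[2]` buffer.

```d
import std.exception : enforce;
import std.utf : encode;

/// Hypothetical helper: transcodes UTF-8 into a caller-supplied UTF-16
/// buffer and returns the part of `dst` that was written. Nothing is
/// allocated as long as the input is valid and the buffer is big enough.
wchar[] utf8to16(const(char)[] src, wchar[] dst)
{
    size_t used = 0;
    foreach (dchar c; src)              // decodes UTF-8, throws on malformed input
    {
        wchar[2] tmp;
        immutable n = encode(tmp, c);   // 1 or 2 UTF-16 code units
        enforce(used + n <= dst.length, "destination buffer too small");
        dst[used .. used + n] = tmp[0 .. n];
        used += n;
    }
    return dst[0 .. used];
}

void main()
{
    wchar[16] buf;
    auto w = utf8to16("José", buf[]);
    assert(w == "José"w);
    assert(w.length == 4);              // all four letters are in the BMP

    auto clef = utf8to16("\U0001D11E", buf[]);
    assert(clef.length == 2);           // surrogate pair
}
```

The other two conversions would follow the same pattern, with the caller owning the destination buffer in each case.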
<sarcasm>
And while I'm at it, I'd like a million bucks please.
</sarcasm>
|
October 03, 2006 Re: First Impressions | ||||
---|---|---|---|---|
| ||||
Posted in reply to Georg Wrede | Georg Wrede wrote:
> Walter Bright wrote:
>> True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.
>
> Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.
I would kind of agree with this, but I think it's a two-edged knife.
If we say 'char[]' then users don't know it's a string until they read the 'why D arrays are great' page (which they should read, but...)
If we say 'string' then we hide the fact that [] can be applied and that other array-like operations can work.
For instance, from a Java perspective:
char[] : Users don't know that it's "String"; users see it as low-level.
Some will try to write things like 'find()' by hand since they
will figure arrays are low level and not expect this to exist.
string : Users will think it's immutable, special; they will ask "how do
I get one of the characters out of a string", "how do I convert
string to char[]?", and other things that would be obvious
without the alias.
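For reference, later versions of D did adopt the alias: `string` is `immutable(char)[]`, so it is still an array and the questions listed above have short answers. A small sketch, assuming a current D compiler and Phobos' `std.string.indexOf`:

```d
import std.string : indexOf;

void main()
{
    string s = "hello";          // alias for immutable(char)[]

    char first = s[0];           // getting a character out: plain indexing
    string tail = s[1 .. $];     // slicing works as on any array
    assert(first == 'h');
    assert(tail == "ello");

    char[] copy = s.dup;         // string -> char[]: take a mutable copy
    copy[0] = 'J';
    assert(copy == "Jello");
    assert(s == "hello");        // the original is untouched

    assert(s.indexOf('l') == 2); // no need to write find() by hand
}
```

Indexing and slicing here still work on code units, so the caveats about non-ASCII text discussed earlier in the thread apply to the alias as well.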
Kevin
|
Copyright © 1999-2021 by the D Language Foundation