First Impressions (page 7) - D Programming Language Discussion Forum

October 01, 2006

Re: First Impressions

Posted by Lars Ivar Igesund
in reply to Walter Bright

Walter Bright wrote:

>> 
>> And yet we have "toString" and not "toCharArray" or "toUTF"!
> 
> True, and some have called for renaming char to utf8. While that would
> be technically more correct (as toUTF would be, too), it just looks awful.
> 

Nope, it just looks correct.

-- 
Lars Ivar Igesund
blog at http://larsivi.net
DSource & #D: larsivi

October 01, 2006

Re: First Impressions

Posted by Anders F Björklund
in reply to BCS

BCS wrote:

> One alternative that I could live with would use 4 character types:
> 
> char    one codeunit in whatever encoding the runtime uses
> schar    one 8 bit code unit (ASCII or utf-8)
> wchar    one 16 bit code unit (same as before)
> dchar    one 32 bit code unit (same as before)

We have that already:

ubyte   one codeunit in whatever encoding the runtime uses
char    one 8 bit code unit (ASCII or utf-8)

There is no support in Phobos for runtime/native encodings,
but you can use the "iconv" library to do such conversions ?

> (using the same thing for ASCII and UTF-8 may be a problem, but this isn't my field)

All ASCII characters are valid UTF-8 code units, so it's OK.

--anders

October 01, 2006

Re: First Impressions

Posted by Georg Wrede
in reply to Walter Bright

Walter Bright wrote:
> True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.

Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.

October 01, 2006

Re: First Impressions

Posted by BCS
in reply to Anders F Björklund

Anders F Björklund wrote:
> BCS wrote:
> 
>> One alternative that I could live with would use 4 character types:
>>
>> char    one codeunit in whatever encoding the runtime uses
>> schar    one 8 bit code unit (ASCII or utf-8)
>> wchar    one 16 bit code unit (same as before)
>> dchar    one 32 bit code unit (same as before)
> 
> 
> We have that already:
> 
> ubyte   one codeunit in whatever encoding the runtime uses
> char    one 8 bit code unit (ASCII or utf-8)

ubyte is an 8 bit unsigned number not a character encoding.

[after some more reading]
I may be just rambling but...

How about having the type of the value denote the encoding? One for ASCII would only ever store ASCII (UTF-8 is invalid), and the same goes for UTF-8, 16 and 32. Direct assignment would be illegal (as with, say, int[] -> Object) or implicitly converted (as with int -> real). Casts would be provided. Indexing would be by codepoint. Non-array variables would be big enough to store any codepoint (ASCII -> 8-bit, !ASCII -> 32-bit). Some sort of "whatever the system uses" data type (à la C's int) could be used for actual output, maybe even escaping anything that won't get displayed correctly.

This all sort of follows the idea of "call it what it is and don't hide the overhead". 1) Characters are a different type of data than numbers (see the threads on bool) and as such, that should be reflected in the type system. 2) I have no problem with high overhead operations as long as I can avoid using them when I don't want to.

> 
> There is no support in Phobos for runtime/native encodings,
> but you can use the "iconv" library to do such conversions ?
> 
>> (using the same thing for ASCII and UTF-8 may be a problem, but this isn't my field)
> 
> 
> All ASCII characters are valid UTF-8 code units, so it's OK.
> 

But UTF-8 is not ASCII.

> --anders

October 01, 2006

Re: First Impressions

Posted by Georg Wrede
in reply to BCS

BCS wrote:
> I may be just rambling but...
> 
> How about having the type of the value denote the encoding? One for ASCII would only ever store ASCII (UTF-8 is invalid)

Then all Americans would use that instead of UTF-8.

This is natural, since first you code for yourself, later maybe for your boss, etc. And, you'd only become aware of any problems when a Latino tries to use his own name José, talk about Motörhead, or Anaïs the fragrance. And the mail and newsreader you wrote in D simply would not work.

Guess if anybody would heed the warning "Only use this new ASCII encoding when you are absolutely positive the program never will encounter a single foreign sentence or letter".

So, better not.

---

D's current setup and documentation encourage this kind of suggestion, and I don't blame you.

Things being like they are, a programmer who wants to write a crossword puzzle generator, would of course begin with:

char[20][20] theGrid;

It's a shame that an otherwise so excellent language (+ the wording in its docs) downright leads you to do this.

The guy naturally assumes that D being a "UTF-8" language, this would work even in Chinese. (Hey, char[] foo = "José Motörhead from the band Anaïs is on stage!"; works, so why wouldn't theGrid?) Poor guy.

I can't blame anyone then wanting to stay within ASCII for the rest of D's life.
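
A minimal sketch of the crossword pitfall described above, assuming a present-day D compiler (the helper name `placeWord` is made up for illustration and does not come from the thread). A `char` cell holds one UTF-8 code unit, so "José" needs five of them; a `dchar` cell holds a whole code point, so the same word needs four squares:

```d
import std.stdio : writeln;

/// Hypothetical helper: copies `word` into `row`, one code point per cell,
/// and returns the number of cells used.
size_t placeWord(ref dchar[20] row, const(char)[] word)
{
    size_t col;
    // With a dchar loop variable, foreach decodes the UTF-8 code units
    // of `word` into whole code points.
    foreach (dchar c; word)
    {
        row[col] = c; // bounds-checked: fails if the word exceeds 20 cells
        ++col;
    }
    return col;
}

void main()
{
    dchar[20][20] theGrid;              // one code point per square
    const(char)[] name = "Jos\u00e9";   // "José", written with an escape

    writeln(name.length);                 // UTF-8 code units: 5
    writeln(placeWord(theGrid[0], name)); // grid squares used: 4
}
```

The price of the `dchar` grid is four bytes per square instead of one, which is the kind of visible overhead the earlier posts argue about.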

October 01, 2006

Re: First Impressions

Posted by Anders F Björklund
in reply to BCS

BCS wrote:

> ubyte is an 8 bit unsigned number not a character encoding.

Right, I actually meant ubyte[] but void[] might have been
more accurate for representing any (even non-UTF) encoding.
(I used ubyte[] in my mapping functions, since they only
used legacy 8-bit encodings like "cp1252" or "macroman")

Re-reading your post, it seems to me that you were more talking
about doing an alias to the UTF type most suitable for the OS ?

I guess UTF-8 would be a good choice if the operating system
doesn't use Unicode, since then it'll have to do lookups anyway.
Otherwise the existing "wchar_t" isn't bad for such a UTF type,
it will be UTF-16 on Windows and UTF-32 on Unix (linux,darwin,...)

>> All ASCII characters are valid UTF-8 code units, so it's OK.
> 
> But UTF-8 is not ASCII.

So you would like a char "type" that would only take ASCII ?
I guess that is *one* way of dealing with it, you could also
have a wchar type that wouldn't accept surrogates (BMP only)

Then it would be OK to index them by code unit / character...
(since each allowed character would fit into one code unit)
Sounds a little like signed vs. unsigned integers actually ?

Then again, 5 character types is even worse than the 3 now.

--anders

October 01, 2006

Re: First Impressions

Posted by Geoff Carlton
in reply to Bruno Medeiros

Bruno Medeiros wrote:
> Precisely! And even if such conceptual difference didn't exist, or is very rare, 'string' is nonetheless more readable than 'char[]', a fact I am constantly reminded of when I see 'int main(char[][] args)' instead of 'int main(string[] args)', which translates much more quickly into the brain as 'array of strings' than its current counterpart.
> 

There are also many cases where char arrays are not strings:

Single array of characters, not strings:
 char GAME_10PT_LETTERS[] = { 'x', 'z' };

Two-dimensional array of characters, not string arrays:
 char GAME_LETTERS[][] = { GAME_0PT_LETTERS, GAME_1PT_LETTERS, .. };
 char m_scrabbleBoard[20][20];

October 02, 2006

Re: First Impressions

Posted by Lionello Lunesu
in reply to Lars Ivar Igesund

Lars Ivar Igesund wrote:
> Walter Bright wrote:
> 
>>> And yet we have "toString" and not "toCharArray" or "toUTF"!
>> True, and some have called for renaming char to utf8. While that would
>> be technically more correct (as toUTF would be, too), it just looks awful.
>>
> 
> Nope, it just looks correct.
> 

I don't think renaming toString to toUTF gets rid of any confusion. AFAIK, toString is meant for debugging, and char[] should be enough for that, yet flexible enough for Unicode strings.

In fact, "string toString()" would be a good solution too.

---
My 4 reasons for the "string" aliases:

* readability: less [] pairs;
* safety: char[] is not zero-terminated, so let's not pretend there's a relation with C's char*. In fact: let's hide any relation;
* clarity: a char[] should not be iterated 1 char at a time, which makes it different from an int[].
* consistency: "string toString()"

L.
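
To illustrate the "clarity" point, here is a small sketch assuming a present-day D compiler, where `string` is a built-in alias for `immutable(char)[]`; the two function names are hypothetical. Looping with a `char` variable visits UTF-8 code units, while a `dchar` variable makes `foreach` decode whole code points:

```d
import std.stdio : writeln;

/// Counts loop iterations when walking `s` one UTF-8 code unit at a time.
size_t stepsByCodeUnit(string s)
{
    size_t n;
    foreach (char c; s)
        ++n;
    return n;
}

/// Counts loop iterations when foreach decodes `s` into code points.
size_t stepsByCodePoint(string s)
{
    size_t n;
    foreach (dchar c; s)
        ++n;
    return n;
}

void main()
{
    string band = "Mot\u00f6rhead"; // "Motörhead", written with an escape

    writeln(stepsByCodeUnit(band));  // 10: the 'ö' occupies two code units
    writeln(stepsByCodePoint(band)); // 9
}
```

For an `int[]` the two notions of "element" coincide, which is why a text array behaves differently from other arrays under naive iteration.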

October 02, 2006

Re: First Impressions

Posted by BCS
in reply to Anders F Björklund

Anders F Björklund wrote:
[...]
> 
> Then again, 5 character types is even worse than the 3 now.
> 
> --anders

The more I think about it the worse this gets.

What I really would like is a system that allows O(1) operations on strings (slice out char 7 to 27), allows somewhat compact encoding (8bit) and allows safe operations on UTF (if I do something dumb, it complains). All at the same time would be nice, but is not needed.

Come to think about it, a lib that will do good FAST conversion between buffers:

//note: "in" is intentional, it won't allocate anything
UTF8to16(in char[], in wchar[]);
UTF8to32(in char[], in dchar[]);
UTF16to32(in wchar[], in dchar[]);
...

would get most of what I want.

<sarcasm>
And while I'm at it, I'd like a million bucks please.
</sarcasm>
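
Something close to the buffer-to-buffer conversion asked for above can be built on Phobos' `std.utf`. This is only a sketch under assumptions: `utf8to16` is a hypothetical helper rather than an existing library function, and it takes the destination as a plain mutable slice instead of `in`. It does not allocate on the success path; malformed input or a too-small buffer raises an exception.

```d
import std.exception : enforce;
import std.stdio : writeln;
import std.utf : decode, encode;

/// Hypothetical helper: transcodes UTF-8 `src` into the caller-supplied
/// buffer `dst` and returns the part of `dst` that was filled.
wchar[] utf8to16(const(char)[] src, wchar[] dst)
{
    size_t i; // read position in src, in UTF-8 code units
    size_t n; // write position in dst, in UTF-16 code units
    while (i < src.length)
    {
        // decode advances i past one code point and throws
        // UTFException on malformed UTF-8.
        immutable dchar c = decode(src, i);

        wchar[2] units; // a code point needs one or two UTF-16 code units
        immutable len = encode(units, c);
        enforce(n + len <= dst.length, "destination buffer too small");
        dst[n .. n + len] = units[0 .. len];
        n += len;
    }
    return dst[0 .. n];
}

void main()
{
    wchar[16] buffer;
    auto result = utf8to16("Jos\u00e9", buffer[]); // "José"
    writeln(result.length); // 4 UTF-16 code units
}
```

Phobos' own `std.utf.toUTF16` and `std.utf.toUTF32` perform the same conversions but return newly allocated strings, so a caller-owned buffer is the main difference in this sketch.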

October 03, 2006

Re: First Impressions

Posted by Kevin Bealer
in reply to Georg Wrede

Georg Wrede wrote:
> Walter Bright wrote:
>> True, and some have called for renaming char to utf8. While that would be technically more correct (as toUTF would be, too), it just looks awful.
> 
> Let's just say it would be a first step in lessening the confusion _we_ create in newcomers' heads.

I would kind of agree with this, but I think it's a two-edged knife.

If we say 'char[]' then users don't know it's a string until they read the 'why D arrays are great' page (which they should read, but...)

If we say 'string' then we hide the fact that [] can be applied and that other array-like operations can work.

For instance, from a Java perspective:

char[] : Users don't know that it's "String"; users see it as low-level.
         Some will try to write things like 'find()' by hand since they
         will figure arrays are low level and not expect this to exist.

string : Users will think it's immutable, special; they will ask "how do
         I get one of the characters out of a string", "how do I convert
         string to char[]?", and other things that would be obvious
         without the alias.

Kevin

Copyright © 1999-2021 by the D Language Foundation