Evolution (Hello World)
February 10, 2005
Taking a quick look at "Hello World"
shows a remarkable language evolution...


From C:
> #include <stdio.h>
> #include <stdlib.h>
> 
> int main(void)
> {
>   puts("Hello, World!");
>   return EXIT_SUCCESS;
> }
> 
> int main(int argc, char *argv[])
> {
>   int i;
>   for (i = 0; i < argc; i++)
>     printf("%d %s\n", i, argv[i]);
>   return EXIT_SUCCESS;
> }

To "old D":
> import std.c.stdio;
> import std.c.stdlib;
> 
> int main()
> {
>   puts("Hello, World!");
>   return EXIT_SUCCESS;
> }
> 
> int main(char[][] args)
> {
>   for (int i = 0; i < args.length; i++)
>     printf("%d %.*s\n", i, args[i]);
>   return EXIT_SUCCESS;
> }

To "new D":
> import std.stdio;
> 
> void main()
> {
>   writeln("Hello, World!");
> }
> 
> void main(str[] args)
> {
>   foreach (int i, str a; args)
>     writefln("%d %s", i, a);
> }


Where I took the liberty of adding a few of my own RFEs:
1) "void main" should return 0 back to the operating system
2) new std.stdio.writeln, a formatless version of writefln
3) the "str" alias for the char[] type, like "bool" for bit

Not too bad for the first five years, if I say so myself ?
Then again, I couldn't even use it here until GDC arrived...
And it was not even a year ago, that David Friedman did that.

--anders
February 10, 2005
Anders F Björklund wrote:
> 3) the "str" alias for the char[] type, like "bool" for bit

This is an obvious question, but what would you propose for wchar[] and dchar[]?

I think, considering that UTF-8 is not the optimal encoding in a huge number of cases, promoting it as "The One and Only String Type" would not be wise.

-Sebastian
February 10, 2005
Sebastian Beschke wrote:

>> 3) the "str" alias for the char[] type, like "bool" for bit
> 
> This is an obvious question, but what would you propose for wchar[] and dchar[]?

My proposition was "ustr" for wchar[] (since it rhymes with "uint",
and is to be pronounced as Unicode string, as in Unicode-optimized)

There will be no alias for dchar[], as it is a silly type anyway.
dchar's are cool, but UTF-32 strings are just too much lossage...

> I think, considering that UTF-8 is not the optimal encoding in a huge number of cases, promoting it as "The One and Only String Type" would not be wise.

There is no "one and only one string type" in D, just as there
is no "one and only one boolean type". There are three of each,
and char[] is the preferred string type (what does "main" use ?)
so it gets to be the str. And bit is the type of "true" and "false"
so it gets to be the default bool type. If you want to speed up
or optimize your code, you can change to using wchar[] or wbool[]...
And when it is needed in a few places, you have dchar[] and dbool

The contents are exactly the same, just encoded differently -
all strings are in Unicode and all booleans are in Zero-is-False.

I just find this shortform to be easier on the eyes:

    void main(str[] args);

    str[str] dictionary;

UTF-8 has two major advantages: 1) it's optimized for ASCII and
does not require a BOM mark, making it compatible for files too
2) it is Endian agnostic, no more X86 vs PPC gruffs like the others

If you do a lot of Unicode, or non-Western languages, switch to
ustr instead? It's equally well supported in all D std libraries.
(the only downside of using str is that it's a little bigger/slower)
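
To put rough numbers on the size trade-off, here is a minimal sketch in C (not from the original post; the function names are made up for this illustration) that returns how many bytes a single code point takes in each encoding form. It assumes the argument is a valid Unicode scalar value.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes needed to store one code point in UTF-8 (1 to 4).
   Assumption: cp is a valid Unicode scalar value. */
size_t utf8_bytes(uint32_t cp)
{
    if (cp < 0x80)    return 1;   /* ASCII */
    if (cp < 0x800)   return 2;   /* e.g. Latin-1 supplement, Greek, Cyrillic */
    if (cp < 0x10000) return 3;   /* rest of the Basic Multilingual Plane */
    return 4;                     /* supplementary planes */
}

/* Bytes needed in UTF-16: one 16-bit unit, or a surrogate pair. */
size_t utf16_bytes(uint32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

/* Bytes needed in UTF-32: always one 32-bit unit. */
size_t utf32_bytes(uint32_t cp)
{
    (void)cp;
    return 4;
}
```

For ASCII text UTF-8 is half the size of UTF-16, while for most CJK text (three bytes per code point in UTF-8, two in UTF-16) the relation is reversed.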

--anders
February 10, 2005
I wrote:

>>> 3) the "str" alias for the char[] type, like "bool" for bit
>>
>> This is an obvious question, but what would you propose for wchar[] and dchar[]?
> 
> My proposition was "ustr" for wchar[] (since it rhymes with "uint",
> and is to be pronounced as Unicode string, as in Unicode-optimized)
> 
> There will be no alias for dchar[], as it is a silly type anyway.
> dchar's are cool, but UTF-32 strings are just too much lossage...

Another possibility is wstr for wchar[] ("wide string")
and ustr for dchar[] ("Unicode string"), which might
perhaps work better and be a tad more logical too...

I like "str" better than "string", because it:
1) rhymes with int, char, bool and the others
2) is shorter to type, easily 50% saved
3) doesn't confuse anyone with C++ std::string

--anders
February 10, 2005
"Anders F Björklund" <afb@algonet.se> wrote in message news:cugd6f$2ptf$1@digitaldaemon.com...
>I wrote:
>
>>>> 3) the "str" alias for the char[] type, like "bool" for bit
>>>
>>> This is an obvious question, but what would you propose for wchar[] and dchar[]?
>>
>> My proposition was "ustr" for wchar[] (since it rhymes with "uint", and is to be pronounced as Unicode string, as in Unicode-optimized)
>>
>> There will be no alias for dchar[], as it is a silly type anyway. dchar's are cool, but UTF-32 strings are just too much lossage...
>
> Another possibility is wstr for wchar[] ("wide string")
> and ustr for dchar[] ("Unicode string"), which might
> perhaps work better and be a tad more logical too...
>
> I like "str" better than "string", because it:
> 1) rhymes with int, char, bool and the others
> 2) is shorter to type, easily 50% saved
> 3) doesn't confuse anyone with C++ std::string

I don't think char[] should have an alias. Strings in D are slices, for very good reason, and it's good for that to be foremost in peoples' minds.



February 10, 2005
Matthew wrote:

> I don't think char[] should have an alias. Strings in D are slices, for very good reason, and it's good for that to be foremost in peoples' minds.

I thought that strings in D were sliceable codepoint arrays,
but not necessarily slices always ? It's just an alias, the
type is still char[] ? (and wchar[] and dchar[], but anyway)

But I rethought and found "ustr" to be silly altogether...

alias  char[]   str; // ASCII-optimized
alias wchar[]  wstr; // Unicode-optimized
alias dchar[]  dstr; // codepoint-optimized

More orthogonal that way ? (with "char[]" = "str", always)

char[] by itself is actually not that bad, but this is:
	int main(char[][] args);
	char[][char[]] dictionary;

--anders
February 10, 2005
> char[] by itself is actually not that bad, but this is:
> int main(char[][] args);
> char[][char[]] dictionary;

Now you've got something of a point there. But, still, I'd prefer to leave it as char[]. The example you give is only 1-dim string / 2-dim char. What about higher dimensionality (of anything)? We could end up in the cow-dung of LPPPCSTR, etc.


February 10, 2005
Matthew wrote:

> Now you've got something of a point there. But, still, I'd prefer to leave it as char[]. The example you give is only 1-dim string / 2-dim char. What about higher dimensionality (of anything)? We could end up in the cow-dung of LPPPCSTR, etc. 

Ehrm, nooooo ? "The line must be drawn here". :-)

I just wanted some easier basics, for beginners ?
For the higher levels, you still need to learn
about bit and char[] and other behind-the-scenes.

It's just similar to the "alias foo* fooPtr;",
that seems to always enter the picture after
one has seen one too many stars fly by...

I'll (re)post my Grand Scheme of Std Aliases.

--anders
February 10, 2005
On Thu, 10 Feb 2005 20:01:50 +0100, Anders F Björklund wrote:

> Sebastian Beschke wrote:
> 
>>> 3) the "str" alias for the char[] type, like "bool" for bit
>> 
>> This is an obvious question, but what would you propose for wchar[] and dchar[]?
> 
> My proposition was "ustr" for wchar[] (since it rhymes with "uint", and is to be pronounced as Unicode string, as in Unicode-optimized)
> 
> There will be no alias for dchar[], as it is a silly type anyway. dchar's are cool, but UTF-32 strings are just too much lossage...
> 
>> I think, considering that UTF-8 is not the optimal encoding in a huge number of cases, promoting it as "The One and Only String Type" would not be wise.
> 
> There is no "one and only one string type" in D, just as there
> is no "one and only one boolean type". There are three of each,
> and char[] is the preferred string type (what does "main" use ?)
> so it gets to be the str. And bit is the type of "true" and "false"
> so it gets to be the default bool type. If you want to speed up
> or optimize your code, you can change to using wchar[] or wbool[]...
> And when it is needed in a few places, you have dchar[] and dbool
> 
> The contents are exactly the same, just encoded differently - all strings are in Unicode and all booleans are in Zero-is-False.
> 
> I just find this shortform to be easier on the eyes:
> 
>      void main(str[] args);
> 
>      str[str] dictionary;
> 
> UTF-8 has two major advantages: 1) it's optimized for ASCII and
> does not require a BOM mark, making it compatible for files too
> 2) it is Endian agnostic, no more X86 vs PPC gruffs like the others
> 
> If you do a lot of Unicode, or non-Western languages, switch to ustr instead? It's equally well supported in all D std libraries. (the only downside of using str is that it's a little bigger/slower)

One cannot easily address individual code points using utf8. For example...

  char[] SomeText;

You cannot be sure whether SomeText[5] addresses the beginning of a code point, remembering that code points in utf8 are variable length but are fixed length in utf32.

So if using utf8, and one is doing some form of character manipulation, one should first convert to utf32, do the work, then convert back to utf8.
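
The boundary problem can be sketched in C (an illustration only, not code from this thread; both function names are hypothetical). It relies on the fact that in well-formed UTF-8 every continuation byte has the bit pattern 10xxxxxx, so an arbitrary index can be tested and, if needed, moved back to the start of its code point.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if the byte at text[i] starts a code point.
   Assumption: text holds well-formed UTF-8 and i is in range. */
bool utf8_is_boundary(const unsigned char *text, size_t i)
{
    /* Continuation bytes look like 10xxxxxx; anything else is a lead byte. */
    return (text[i] & 0xC0) != 0x80;
}

/* Move index i back to the first byte of the code point containing it. */
size_t utf8_align_back(const unsigned char *text, size_t i)
{
    while (i > 0 && !utf8_is_boundary(text, i))
        i--;
    return i;
}
```

With the bytes of "naïve" (where ï is stored as 0xC3 0xAF), index 2 is a boundary and index 3 is not.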

-- 
Derek
Melbourne, Australia
February 10, 2005
Derek wrote:

>>UTF-8 has two major advantages: 1) it's optimized for ASCII and
>>does not require a BOM mark, making it compatible for files too
>>2) it is Endian agnostic, no more X86 vs PPC gruffs like the others
>>
>>If you do a lot of Unicode, or non-Western languages, switch to
>>ustr instead? It's equally well supported in all D std libraries.
>>(the only downside of using str is that it's a little bigger/slower)
> 
> One cannot easily address individual code points using utf8. For example...
> 
>   char[] SomeText;
> 
> You cannot be sure whether SomeText[5] addresses the beginning of a code
> point, remembering that code points in utf8 are variable length but are
> fixed length in utf32.

This is not that much of a problem, since you should not address
individual code points anyway but treat the code units as a string.

See http://oss.software.ibm.com/icu/docs/papers/forms_of_unicode:

> Code-point boundaries, iteration, and indexing are very fast with
> UTF-32. Code-point boundaries, accessing code points at a given offset,
> and iteration involve a few extra machine instructions for UTF-16; UTF-8
> is a bit more cumbersome. Indexing is slow for both of them, but in
> practice indexing by different code units is done very rarely, except
> when communicating with specifications that use UTF-32 code units, such
> as XSL.
> 
> This point about indexing is true unless an API for strings allows
> access only by code point offsets. This is a very inefficient design:
> strings should always allow indexing with code unit offsets.

But char[] works fine for ASCII and wchar[] works fine for Unicode,
*as long* as you watch out for any surrogates in the code units...

Which means you can have a fast standard route, and extra code
to handle the exceptional characters if and when they occur ?
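
A minimal sketch of that "fast route plus exceptional route" idea for UTF-16, written in C purely as an illustration (the function name is made up, and well-formed input is assumed, so a high surrogate is always followed by a low one):

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s[*i] and advance *i past it.
   Assumption: s holds well-formed UTF-16 in native byte order. */
uint32_t utf16_next(const uint16_t *s, size_t *i)
{
    uint16_t u = s[(*i)++];

    /* Fast route: anything that is not a high surrogate is returned as-is. */
    if (u < 0xD800 || u > 0xDBFF)
        return u;

    /* Exceptional route: combine the high and low surrogate. */
    uint16_t lo = s[(*i)++];
    return 0x10000 + (((uint32_t)(u - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

Only characters outside the Basic Multilingual Plane ever reach the second branch.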

> So if using utf8, and one is doing some form of character manipulation, one
> should first convert to utf32, do the work, then convert back to utf8.

Yes, and this is easily done with a foreach(dchar c; SomeText) loop,
as D can transparently handle the transition between char[] and dchar...

There are also readily available functions in the std.utf module:
"encode" and "decode", and the toUTF8 / toUTF16 / toUTF32 wrappers.
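
For readers who want to see roughly what such a decoding loop does under the hood, here is a hand-rolled sketch in C. It is not the std.utf implementation, the names are hypothetical, and it assumes well-formed UTF-8 (malformed input is not detected).

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s[*i] and advance *i past it.
   Assumption: s holds well-formed UTF-8. */
uint32_t utf8_next(const unsigned char *s, size_t *i)
{
    unsigned char b = s[(*i)++];
    uint32_t cp;
    int extra;                      /* number of continuation bytes to read */

    if (b < 0x80)      { cp = b;        extra = 0; }   /* 0xxxxxxx */
    else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }   /* 110xxxxx */
    else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }   /* 1110xxxx */
    else               { cp = b & 0x07; extra = 3; }   /* 11110xxx */

    /* Each continuation byte contributes its low six bits. */
    while (extra-- > 0)
        cp = (cp << 6) | (uint32_t)(s[(*i)++] & 0x3F);
    return cp;
}

/* Count the code points in the first len bytes of s. */
size_t utf8_count(const unsigned char *s, size_t len)
{
    size_t i = 0, n = 0;
    while (i < len) {
        utf8_next(s, &i);
        n++;
    }
    return n;
}
```

Each call to utf8_next plays the role of one iteration of the foreach loop: it consumes one to four code units and yields a single code point.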

If you have a lot of loops like that, you can use a dchar[] (dstr alias) as
intermediate storage. But char[] and wchar[] are better for the long term.


--anders