March 12, 2002 Re: String type. | ||||
---|---|---|---|---|
| ||||
Posted in reply to J. Daniel Smith | "J. Daniel Smith" <j_daniel_smith@HoTMaiL.com> wrote in message news:a6j90i$1me0$1@digitaldaemon.com...
> UTF-8 is fine for strings that are mostly ASCII with some UNICODE (source code, Western European languages). But if the string is entirely UNICODE (something in Chinese for example), the UTF-8 encoding can consume MORE memory since the UTF-8 transformation can be as many as six bytes long.
Actually, UTF-8 can represent all Unicode 3.2 characters with 1..4 bytes. Which means it simply cannot consume more memory than UTF-32. (ISO/IEC 10646 may require up to 6 bytes in UTF-8, but it is a superset of Unicode.)
|
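The byte counts in this reply can be checked with a small sketch. It assumes a present-day D compiler; `utf8Length` and `utf8Size` are hypothetical helper names, not library functions, and the thresholds are the standard UTF-8 range boundaries.

```d
import std.stdio;

/// Bytes needed to encode one code point in UTF-8 (1..4 for Unicode).
/// Hypothetical helper, written out by hand for illustration.
size_t utf8Length(dchar c)
{
    if (c < 0x80)    return 1; // ASCII
    if (c < 0x800)   return 2; // e.g. accented Latin, Greek, Cyrillic
    if (c < 0x10000) return 3; // rest of the BMP, including most Chinese
    return 4;                  // supplementary planes
}

/// Total UTF-8 size, in bytes, of a string given as UTF-32 code points.
size_t utf8Size(const(dchar)[] s)
{
    size_t total = 0;
    foreach (c; s)
        total += utf8Length(c);
    return total;
}

void main()
{
    // Two Chinese characters, written with escapes so the source stays ASCII.
    dstring zh = "\u4E2D\u6587"d;
    writeln("UTF-8 bytes:  ", utf8Size(zh));             // 3 per character here
    writeln("UTF-32 bytes: ", zh.length * dchar.sizeof); // always 4 per character
}
```

For text like this, UTF-8 needs three bytes per character where a 2-byte wchar encoding needs two, so the original concern holds against wchar but not against UTF-32.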
March 12, 2002 Re: String type. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter | "Walter" <walter@digitalmars.com> wrote in message news:a6jfpg$2vd$1@digitaldaemon.com...
> Actually, string literals are uncommitted by default. They then get converted to char[], wchar[], char, or wchar depending on the context.
So how is the context determined?
void foo(char[] s) { ... }
void foo(wchar[] s) { ... }
foo("Hello, world!");
My tests show that in the above snippet "Hello, world!" is passed to the function that takes a char[] argument. If the whole program text were in UNICODE, would the string be UNICODE as well?
And what if I insert some UNICODE chars into the literal? Will the compiler complain about "invalid characters"?
|
March 12, 2002 Re: String type. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jakob Kemi | "Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6jitg$118$1@digitaldaemon.com...
> You already have this problem in windows with linebreaks being two bytes. Just use custom iterators for your string class
There are no iterators in D, nor is there a string class.
|
March 12, 2002 Re: String type. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Pavel Minayev | "Pavel Minayev" <evilone@omen.ru> wrote in message news:a6l1sf$l7f$1@digitaldaemon.com...
> "Walter" <walter@digitalmars.com> wrote in message news:a6jfpg$2vd$1@digitaldaemon.com...
> > Actually, string literals are uncommitted by default. They then get converted to char[], wchar[], char, or wchar depending on the context.
>
> So how is the context determined?
>
> void foo(char[] s) { ... }
> void foo(wchar[] s) { ... }
> foo("Hello, world!");
>
> My tests show that in the above snippet "Hello, world!" is passed to the function that takes a char[] argument.
That's a bug; it should give an ambiguity error.
> If the whole program text were in UNICODE, would the string be UNICODE as well?
Yes, but if the string doesn't contain any characters with the high bits set, it can be implicitly converted to ASCII.
> And what if I insert some UNICODE chars into the literal? Will the compiler complain about "invalid characters"?
It won't implicitly convert it to char[], then.
|
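One way to sidestep the ambiguity is to commit the literal explicitly at the call site. The sketch below assumes a present-day D compiler, where a `c` or `w` suffix fixes the literal's encoding. It does not describe the 2002 compiler discussed in this thread, and the return values are only there to show which overload was chosen.

```d
import std.stdio;

// Two overloads that differ only in the character type they accept.
string foo(const(char)[] s)  { return "char[]"; }
string foo(const(wchar)[] s) { return "wchar[]"; }

void main()
{
    // The suffix commits the literal, so only one overload can match.
    writeln(foo("Hello, world!"c)); // literal committed to UTF-8
    writeln(foo("Hello, world!"w)); // literal committed to UTF-16
}
```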
March 12, 2002 Re: String type. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Pavel Minayev | On Tue, 12 Mar 2002 20:06:34 +0100, Pavel Minayev wrote:
> "Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6jitg$118$1@digitaldaemon.com...
>
>> You already have this problem in windows with linebreaks being two bytes. Just use custom iterators for your string class
>
> There are no iterators in D, nor is there a string class.
I'm not talking about STL iterators here; what I mean is that you just design your loops like this:
for (char* s = string; get_char(s) != '\0'; s = next_char(s) ) {
...
}
Loops operating on strings are rare anyway; most string functions
should be optimized library functions. get_char() & next_char()
should be inlined and can use whatever syntactic sugar is applicable.
Jakob
|
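The loop above leaves get_char() and next_char() undefined. A minimal sketch of what they might look like for UTF-8 follows; `seqLen`, `getChar`, `nextChar` and `countChars` are hypothetical names, and the code assumes well-formed, NUL-terminated input with no validation.

```d
import std.stdio;

/// Length in bytes of the UTF-8 sequence that starts with lead byte b.
/// Assumes b really is a lead byte of well-formed UTF-8.
int seqLen(char b)
{
    if (b < 0x80)           return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    return 4;
}

/// Decode the code point that starts at s (hypothetical get_char).
dchar getChar(const(char)* s)
{
    immutable n = seqLen(*s);
    if (n == 1)
        return *s;
    // Payload bits of the lead byte, then 6 bits per continuation byte.
    uint c = *s & (0xFF >> (n + 1));
    foreach (i; 1 .. n)
        c = (c << 6) | (s[i] & 0x3F);
    return cast(dchar) c;
}

/// Advance to the start of the next code point (hypothetical next_char).
const(char)* nextChar(const(char)* s)
{
    return s + seqLen(*s);
}

/// Count characters with the loop shape from the post.
size_t countChars(const(char)* text)
{
    size_t count = 0;
    for (auto s = text; getChar(s) != '\0'; s = nextChar(s))
        ++count;
    return count;
}

void main()
{
    // D string literals are NUL-terminated, so .ptr can be walked C-style.
    writeln(countChars("h\u00E9llo, \u4E2D".ptr)); // 8 characters in 11 bytes
}
```

Both helpers are small enough to inline, but each step still inspects the lead byte, which is the per-character cost debated in the replies.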
March 12, 2002 Re: String type. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Jakob Kemi | "Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6jitg$118$1@digitaldaemon.com...
> On Tue, 12 Mar 2002 01:00:49 +0100, Walter wrote:
>
> > "Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j4v5$1koq$1@digitaldaemon.com...
> >> As I understand from the docs, D is supposed to use wchars (2 to 4 bytes) for representing non-ASCII strings. I think it would be better to let all string functions only handle UTF-8 (which is fully backwards compatible with ASCII). UTF-8 is slowly becoming standard in UNIX. (just look at X and Gtk+ 2.0)
> >
> > At one time I had written a lexer that handled utf-8 source. It turned out to cause a lot of problems because strings could no longer be simply indexed by character position, nor could pointers be arbitrarily incremented and decremented.
> >
> > It turned out to be a lot of trouble :-( and I finally converted it to wchar's.
>
> You already have this problem in windows with linebreaks being two bytes. Just use custom iterators for your string class implementation and if you need to set/get positions in streams you use tell and seek (you're not supposed to assume that 1 character == 1 byte anyway according to standards.) There should be no real _need_ to index characters in strings with pointers.
That's true, but I was never comfortable using such things, and hiding the performance hit of it behind syntactic sugar doesn't make the hit go away. When you're trying to compile 100,000 lines of code, every cycle in the lexer matters.
|
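The indexing problem described here can be seen in a few lines. This is a sketch assuming a present-day D compiler and its standard `std.conv.to`; the sample string is made up for illustration.

```d
import std.conv : to;
import std.stdio;

void main()
{
    string  utf8  = "h\u00E9llo";     // the accented letter takes two UTF-8 code units
    dstring utf32 = utf8.to!dstring;  // fixed width: one code unit per character

    writeln(utf8.length);            // code units, not characters
    writeln(utf32.length);           // characters
    writeln(cast(ubyte) utf8[1]);    // only the lead byte of the second character
    writeln(utf32[1] == '\u00E9');   // direct indexing finds the character
}
```

With a fixed-width encoding, position n is a constant-time index. With UTF-8, finding character n means walking the string from the start.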
March 16, 2002 Re: String type. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Walter | "Walter" <walter@digitalmars.com> wrote in message news:a6lccr$psj$2@digitaldaemon.com...
> > So how is the context determined?
> >
> > void foo(char[] s) { ... }
> > void foo(wchar[] s) { ... }
> > foo("Hello, world!");
> >
> > My tests show that in the above snippet "Hello, world!" is passed to the function that takes a char[] argument.
>
> That's a bug; it should give an ambiguity error.
Hmmm... I'm thinking that flagging an ambiguity here would still be bad. How about using attributes to add the ability to resolve ambiguities in a user-defined manner? For example:
priority(9) void foo(char[] s) { ... }
priority(5) void foo(wchar[] s) { ... }
foo("Hello, world!"); // Calls using char[], as it's higher priority.
This way, ambiguities will only be flagged if multiple possibilities exist that have the same priority. The default priority could be 5 for all functions, and the range could be 0 to 9, so you can always define higher or lower ones as needed.
I admit that this might open a whole new can of worms, but I'd definitely be willing to explore this if the language supported it.
Salutaciones,
JCAB
> > If the whole program text were in UNICODE, would the string be UNICODE as well?
>
> Yes, but if the string doesn't contain any characters with the high bits set, it can be implicitly converted to ASCII.
>
> > And what if I insert some UNICODE chars into the literal? Will the compiler complain about "invalid characters"?
>
> It won't implicitly convert it to char[], then.
|
March 27, 2002 Re: String type. | ||||
---|---|---|---|---|
| ||||
Posted in reply to Juan Carlos Arevalo Baeza | "Juan Carlos Arevalo Baeza" <jcab@roningames.com> wrote in message news:a6u823$bip$1@digitaldaemon.com...
> priority(9) void foo(char[] s) { ... }
> priority(5) void foo(wchar[] s) { ... }
> foo("Hello, world!"); // Calls using char[], as it's higher priority.
>
> This way, ambiguities will only be flagged if multiple possibilities exist that have the same priority. The default priority could be 5 for all functions, and the range could be 0 to 9, so you can always define higher or lower ones as needed.
>
> I admit that this might open a whole new can of worms, but I'd definitely be willing to explore this if the language supported it.
D was trying to migrate to a simpler overloading scheme <g>. It can be less convenient at times, but I think it's more than made up for by having simple and obvious rules.
|
Copyright © 1999-2021 by the D Language Foundation