String type. (page 2)

"J. Daniel Smith" <j_daniel_smith@HoTMaiL.com> wrote in message news:a6j90i$1me0$1@digitaldaemon.com...

> UTF-8 is fine for strings that are mostly ASCII with some UNICODE (source code, Western European languages). But if the string is entirely UNICODE (something in Chinese for example), the UTF-8 encoding can consume MORE memory since the UTF-8 transformation can be as many as six bytes long.

Actually, UTF-8 can represent all Unicode 3.2 characters with 1..4 bytes, which means it simply cannot consume more memory than UTF-32. (ISO/IEC 10646 may require up to 6 bytes in UTF-8, but it is a superset of Unicode.)
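The byte counts in this exchange can be checked with a small C sketch. This is only an illustration and not code from any poster: the function names `utf8_encoded_len` and `utf8_total_len` are made up for this example, and the ranges assume the Unicode code space ending at U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes UTF-8 needs for one code point, or 0 if the value is not a
   Unicode scalar value (a surrogate, or above U+10FFFF). */
size_t utf8_encoded_len(uint32_t cp)
{
    if (cp < 0x80)                    return 1;  /* ASCII */
    if (cp < 0x800)                   return 2;  /* e.g. Latin-1 letters */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogates are not characters */
    if (cp < 0x10000)                 return 3;  /* rest of the BMP, incl. most CJK */
    if (cp <= 0x10FFFF)               return 4;  /* supplementary planes */
    return 0;
}

/* Total UTF-8 size in bytes of an array of n code points. */
size_t utf8_total_len(const uint32_t *cps, size_t n)
{
    assert(cps != NULL || n == 0);
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += utf8_encoded_len(cps[i]);
    return total;
}
```

Since no code point needs more than 4 bytes, the UTF-8 total never exceeds the 4 bytes per character of UTF-32. Typical Chinese text in the BMP takes 3 bytes per character in UTF-8, so it can still be larger than the same text in 16-bit units, which may be the comparison the original remark had in mind.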

"Walter" <walter@digitalmars.com> wrote in message news:a6jfpg$2vd$1@digitaldaemon.com...

> Actually, string literals are uncommitted by default. They then get converted to char[], wchar[], char, or wchar depending on the context.

So how is the context determined?

    void foo(char[] s)  { ... }
    void foo(wchar[] s) { ... }
    foo("Hello, world!");

My tests show that in the above snippet "Hello, world!" is passed to the function that takes char[] argument. If the whole program text would be in UNICODE, would the string be UNICODE as well? And what if I insert some UNICODE chars into the literal? Will the compiler complain about "invalid characters"?

"Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6jitg$118$1@digitaldaemon.com...

> You already have this problem in windows with linebreaks being two bytes. Just use custom iterators for your string class

There are no iterators in D, nor is there a string class.

"Pavel Minayev" <evilone@omen.ru> wrote in message news:a6l1sf$l7f$1@digitaldaemon.com...

> "Walter" <walter@digitalmars.com> wrote in message news:a6jfpg$2vd$1@digitaldaemon.com...
>
> > Actually, string literals are uncommitted by default. They then get converted to char[], wchar[], char, or wchar depending on the context.
>
> So how is the context determined?
>
>     void foo(char[] s)  { ... }
>     void foo(wchar[] s) { ... }
>     foo("Hello, world!");
>
> My tests show that in the above snippet "Hello, world!" is passed to the function that takes char[] argument.

That's a bug, it should give an ambiguity error.

> If the whole program text would
> be in UNICODE, would the string be UNICODE as well?

Yes, but if the string doesn't contain any characters with the high bits set, it can be implicitly converted to ascii.

> And what if I insert some UNICODE chars into the literal? Will the compiler complain about "invalid characters"?

It won't implicitly convert it to char[], then.
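The rule about "high bits" can be pictured with a short C sketch. This is not how the D compiler implements it; it only illustrates the stated condition, assuming 16-bit wide characters, and the names `fits_in_ascii` and `narrow_to_ascii` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if every 16-bit unit is below 0x80, i.e. no high bits are set. */
bool fits_in_ascii(const uint16_t *s, size_t n)
{
    assert(s != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (s[i] >= 0x80)
            return false;
    return true;
}

/* Copy a wide string into bytes only when that loses nothing.
   Returns false and leaves out untouched if any unit is non-ASCII. */
bool narrow_to_ascii(const uint16_t *s, size_t n, char *out)
{
    if (!fits_in_ascii(s, n))
        return false;
    for (size_t i = 0; i < n; i++)
        out[i] = (char)s[i];
    return true;
}
```

A literal such as "Hello, world!" passes the check and could be narrowed; one containing, say, U+4E2D does not, which matches the answer that it would then not convert to char[] implicitly.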

On Tue, 12 Mar 2002 20:06:34 +0100, Pavel Minayev wrote:

> "Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6jitg$118$1@digitaldaemon.com...
>
> > You already have this problem in windows with linebreaks being two bytes. Just use custom iterators for your string class
>
> There are no iterators in D, nor is there a string class.

I'm not talking about some STL iterators here, what I mean is that you just design your loops like this:

    for (char* s = string; get_char(s) != '\0'; s = next_char(s) )
    {
        ...
    }

Loops operating on strings are rare anyway; most string functions should be optimized library functions. get_char() & next_char() should be inlined and can use whatever syntactic sugar is applicable.

Jakob
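A possible shape for the two helpers named in that loop, as a minimal C sketch. The post gives no implementation, so everything below is an assumption: the input is taken to be well-formed, NUL-terminated UTF-8, no validation is done, and `seq_len` and `count_chars` are extra hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the sequence that starts with lead byte b.
   Falls back to 1 for ASCII and for any byte that is not a valid lead. */
static int seq_len(unsigned char b)
{
    if (b < 0x80)           return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Decode the code point that starts at s (no validation is performed). */
uint32_t get_char(const char *s)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    int n = seq_len(p[0]);
    if (n == 1)
        return p[0];
    uint32_t cp = p[0] & (0xFFu >> (n + 1));   /* payload bits of the lead byte */
    for (int i = 1; i < n; i++)
        cp = (cp << 6) | (p[i] & 0x3Fu);       /* 6 payload bits per continuation byte */
    return cp;
}

/* Advance to the start of the next code point. */
const char *next_char(const char *s)
{
    assert(s != NULL);
    return s + seq_len((unsigned char)*s);
}

/* The loop from the post, used here to count characters rather than bytes. */
size_t count_chars(const char *s)
{
    size_t n = 0;
    for (const char *p = s; get_char(p) != '\0'; p = next_char(p))
        n++;
    return n;
}
```

On malformed input (for example a multi-byte sequence cut off by the terminator) these helpers would read past the end, so real code would need to validate first.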

"Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6jitg$118$1@digitaldaemon.com...

> On Tue, 12 Mar 2002 01:00:49 +0100, Walter wrote:
>
> > "Jakob Kemi" <jakob.kemi@telia.com> wrote in message news:a6j4v5$1koq$1@digitaldaemon.com...
> >
> > > As I understand from the docs, D is supposed to use wchars (2 to 4 bytes) for representing non-ASCII strings. I think it would be better to let all string functions only handle UTF-8 (which is fully backwards compatible with ASCII). UTF-8 is slowly becoming standard in UNIX. (just look at X and Gtk+ 2.0)
> >
> > At one time I had written a lexer that handled utf-8 source. It turned out to cause a lot of problems because strings could no longer be simply indexed by character position, nor could pointers be arbitrarily incremented and decremented.
> >
> > It turned out to be a lot of trouble :-( and I finally converted it to wchar's.
>
> You already have this problem in windows with linebreaks being two bytes. Just use custom iterators for your string class implementation and if you need to set/get positions in streams you use tell and seek (you're not supposed to assume that 1 character == 1 byte anyway according to standards.) There should be no real _need_ to index characters in strings with pointers.

That's true, but I was never comfortable using such things, and hiding the performance hit of it behind syntactic sugar doesn't make the hit go away. When you're trying to compile 100,000 lines of code, every cycle in the lexer matters.
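The indexing cost being discussed can be made concrete with a C sketch. It is an illustration only, not the lexer code mentioned above; `utf8_at` is a hypothetical name and the input is assumed to be well-formed, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer to the start of the index-th character (0-based) in s,
   or NULL if the string has fewer characters than that. */
const char *utf8_at(const char *s, size_t index)
{
    assert(s != NULL);
    const unsigned char *p = (const unsigned char *)s;
    while (*p != '\0') {
        /* Any byte that is not a continuation byte (10xxxxxx) starts a character. */
        if ((*p & 0xC0) != 0x80) {
            if (index == 0)
                return (const char *)p;
            index--;
        }
        p++;
    }
    return NULL;
}
```

With fixed-width elements, `s[i]` is a single address computation. Here every lookup walks the string from the beginning, so the cost grows with the position, and no wrapper syntax changes that.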

March 16, 2002

Re: String type.

Posted by Juan Carlos Arevalo Baeza
in reply to Walter

"Walter" <walter@digitalmars.com> wrote in message news:a6lccr$psj$2@digitaldaemon.com...

> > So how is the context determined?
> >
> >     void foo(char[] s)  { ... }
> >     void foo(wchar[] s) { ... }
> >     foo("Hello, world!");
> >
> > My tests show that in the above snippet "Hello, world!" is passed to the function that takes char[] argument.
>
> That's a bug, it should give an ambiguity error.

   Hmmm... I'm thinking that flagging an ambiguity here would still be bad.
How about using attributes to add the ability to resolve ambiguities in a
user-defined manner? For example:

priority(9) void foo(char[] s)  { ... }
priority(5) void foo(wchar[] s) { ... }
foo("Hello, world!"); // Calls using char[], as it's higher priority.

   This way, ambiguities will only be flagged if multiple possibilities
exist that have the same priority. The default priority could be 5 for all
functions, and the range could be 0 to 9, so you can always define higher or
lower ones as needed.

   I admit that this might open a whole new can of worms, but I'd definitely
be willing to explore this if the language supported it.

Salutaciones,
                         JCAB


>
> > If the whole program text would
> > be in UNICODE, would the string be UNICODE as well?
>
> Yes, but if the string doesn't contain any characters with the high bits set, it can be implicitly converted to ascii.
>
> > And what if I insert some UNICODE chars into the literal? Will the compiler complain about "invalid characters"?
>
> It won't implicitly convert it to char[], then.
>
>

"Juan Carlos Arevalo Baeza" <jcab@roningames.com> wrote in message news:a6u823$bip$1@digitaldaemon.com...

> priority(9) void foo(char[] s)  { ... }
> priority(5) void foo(wchar[] s) { ... }
> foo("Hello, world!"); // Calls using char[], as it's higher priority.
>
>    This way, ambiguities will only be flagged if multiple possibilities
> exist that have the same priority. The default priority could be 5 for all
> functions, and the range could be 0 to 9, so you can always define higher or
> lower ones as needed.
>
>    I admit that this might open a whole new can of worms, but I'd definitely
> be willing to explore this if the language supported it.

D was trying to migrate to a simpler overloading scheme <g>. It can be less convenient at times, but I think it's more than made up for by having simple and obvious rules.
