understanding string suffixes
August 13, 2005
What are the default suffixes depending on the BOM of the source?

What is the meaning of a c-suffix in a UTF32 source?

-manfred
August 13, 2005
On Sat, 13 Aug 2005 07:20:01 +0000 (UTC), Manfred Nowak wrote:

> What are the default suffixes depending on the BOM of the source?
> 
> What is the meaning of a c-suffix in a UTF32 source?

I don't believe that the string literal suffixes are affected in any way by the source code encoding scheme.

I think that "qwerty"c is formed as a UTF8 string in RAM by the compiler, regardless of the UTF encoding of the source.
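
As a minimal sketch of this point, assuming a current D compiler (the `codeUnitCounts` helper is hypothetical and not from the thread): the suffix alone fixes the element type of the literal. The \u escape keeps the source file plain ASCII, so the encoding of the file plays no part in the result.

```d
import std.stdio : writeln;

// Hypothetical helper: counts the code units of the same text
// written with each of the three suffixes.
size_t[3] codeUnitCounts()
{
    auto narrow = "caf\u00E9"c; // UTF-8 code units
    auto wide   = "caf\u00E9"w; // UTF-16 code units
    auto full   = "caf\u00E9"d; // UTF-32 code units

    // The suffix determines the element type of the literal.
    static assert(is(typeof(narrow) == immutable(char)[]));
    static assert(is(typeof(wide) == immutable(wchar)[]));
    static assert(is(typeof(full) == immutable(dchar)[]));

    return [narrow.length, wide.length, full.length];
}

void main()
{
    // U+00E9 takes two code units in UTF-8, one in UTF-16 and UTF-32.
    writeln(codeUnitCounts()); // [5, 4, 4]
}
```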

-- 
Derek Parnell
Melbourne, Australia
13/08/2005 11:26:11 PM
August 14, 2005
Derek Parnell <derek@psych.ward> wrote:

>> What are the default suffixes depending on the BOM of the source?
>> 
>> What is the meaning of a c-suffix in a UTF32 source?
> 
> I don't believe that the string literal suffixes are affected in any way by the source code encoding scheme.

True, true. I never thought that the meaning of a supplied suffix might change depending on the source code encoding scheme. But the specs state:

| The optional Postfix character gives a specific type to the
| string, rather than it being inferred from the context. This is
| useful when the type cannot be unambiguously inferred, such as
| when overloading based on string type.

But when is the type of a string ambiguous? The BOM, whether present or absent, always supplies a context for every string in the source:

  missing BOM    c
  UTF8-BOM       ? (probably c)
  UTF16-BOM      w
  UTF32-BOM      d


Therefore a string literal s in a UTF32 source is equivalent to the same string literal followed by the d-suffix: sd.

I have done some tests and found that a valid UTF32 code in a string literal suffixed with w produces an error, because it is not a legal UTF16 code. Therefore at least the w-suffix denotes not only a type but also a check of the semantic correctness of the string literal's content.


> I think that "qwerty"c is formed as a UTF8 string in RAM by the compiler, regardless of the UTF encoding of the source.

UTF8? Because it is indistinguishable from ASCII in this case?

-manfred

August 14, 2005
On Sun, 14 Aug 2005 12:17:44 +0000 (UTC), Manfred Nowak wrote:

> Derek Parnell <derek@psych.ward> wrote:
> 

[snip]

> But when is the type of a string ambiguous?

The ambiguity is not in the encoding of the source text but in the way that a string literal is used when matching function signatures.

Given ...

 void func(char[] x) { . . . }
 func( "some string" );

There is no problem so far, as there is only one possible match, but add this ...

 void func(dchar[] x) { . . . }

And now there is an ambiguity. It is in this situation that string literal suffixes are useful. We need to do ...

 func( "some string"c );

or, before suffixes existed,

 func( cast(char[]) "some string" );
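
A self-contained sketch of the same situation, assuming a current D compiler in which `string` and `dstring` are the aliases for immutable char and dchar arrays (the `whichOverload` name is made up for illustration). Only suffixed calls are shown, since how an unsuffixed literal is resolved may differ between compiler versions.

```d
import std.stdio : writeln;

// Hypothetical overload set; each overload reports which one was chosen.
string whichOverload(string s)  { return "char"; }
string whichOverload(dstring s) { return "dchar"; }

void main()
{
    // The suffix gives the literal a definite type, so exactly one
    // overload matches each call.
    writeln(whichOverload("some string"c)); // char
    writeln(whichOverload("some string"d)); // dchar
}
```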

-- 
Derek Parnell
Melbourne, Australia
14/08/2005 10:54:46 PM
August 14, 2005
Derek Parnell <derek@psych.ward> wrote:

[...]
> but add this ...
> 
>  void func(dchar[] x) { . . . }
> 
> And now there is an ambiguity.
[...]

Ouch. Now I see that the old story on string literals has been covered with a fig-leaf excuse.

A source containing an overloaded function like

  void func( char[] s){}
  void func( wchar[] s){}
  void func( dchar[] s){}

and a call with an unsuffixed string literal like

  func( "SomeString");

is unambiguously solvable by looking at the BOM of the source file, as I have already mentioned in the foregoing post:

In an ASCII source the char[] overload has to be used, whereas in a UTF32 source the dchar[] overload has to be used. What else would be natural?

"Hey dear Chinese, you have written all your strings in this UTF32 source in Chinese letters, but please assure your D compiler that you really meant to write Chinese letters by appending the d-suffix to all your strings!"?

Nope. No Chinese should be forced to act this way. However, if a string in his source is not a UTF32 string he now can use the c- or d-suffix.

Of course this would also imply that a UTF32 source may have severe behaviour changes if the BOM is changed.

There is one more problem I do not understand:

what will now happen with the call:

  func( "\u00001111"d "qwerty"c);

Is this ambiguous?

-manfred
August 14, 2005
On Sun, 14 Aug 2005 14:12:21 +0000 (UTC), Manfred Nowak wrote:

> Derek Parnell <derek@psych.ward> wrote:
> 
> [...]
>> but add this ...
>> 
>>  void func(dchar[] x) { . . . }
>> 
>> And now there is an ambiguity.
> [...]
> 
> Ouch. Now I see that the old story on string literals has been covered with a fig-leaf excuse.
> 
> A source containing an overloaded function like
> 
>   void func( char[] s){}
>   void func( wchar[] s){}
>   void func( dchar[] s){}
> 
> and a call with an unsuffixed string literal like
> 
>   func( "SomeString");
> 
> is unambiguously solvable by looking at the BOM of the source file, as I have already mentioned in the foregoing post:
> 
> In an ASCII source the char[] overload has to be used, whereas in a UTF32 source the dchar[] overload has to be used. What else would be natural?

I can see where you are going with this, but the encoding of the source text should be independent of the interpretation of undecorated string literals. Just because a file is encoded as UTF8 there should be no restriction on me deciding to save the file as UTF16. The compiler should not go choosing which function to call based on how the file just happens to be encoded.

> "Hey dear Chinese, you have written all your strings in this UTF32 source in Chinese letters, but please assure your D compiler that you really meant to write Chinese letters by appending the d-suffix to all your strings!"?

This sounds more like we need to have a pragma that specifies which default encoding we mean to have on the undecorated literals in a specific source text.

> Nope. No Chinese should be forced to act this way. However, if a string in his source is not a UTF32 string he now can use the c- or d-suffix.
> 
> Of course this would also imply that a UTF32 source may have severe behaviour changes if the BOM is changed.

Exactly, so we should avoid this trap. Keep the default encoding as UTF8, but I still think that a pragma would be a good (and easy to implement) idea.


> There is one more problem I do not understand:
> 
> what will now happen with the call:
> 
>   func( "\u00001111"d "qwerty"c);
> 
> Is this ambiguous?

Not to the compiler. If you try this you get the error message

  " mismatched string literal postfixes 'd' and 'c' "

-- 
Derek Parnell
Melbourne, Australia
15/08/2005 12:17:33 AM
August 14, 2005
Manfred Nowak wrote:
> In an ASCII source the char[] overload has to be used, whereas in a UTF32 source the dchar[] overload has to be used. What else would be natural?

Are you sure that it's a good idea to change the behavior of code based on the encoding of the file? I sure am not...

That would be like "123.456" being interpreted as either 123456 or 123.456, depending on your regional settings. A definite disaster :)

> "Hey dear Chinese, you have written all your strings in this UTF32 source in Chinese letters, but please assure your D compiler that you really meant to write Chinese letters by appending the d-suffix to all your strings!"?

Aren't the characters the same in all cases, just the string type changes?
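
A small sketch of that question, assuming a current D compiler with Phobos (`sameCodePoints` is a hypothetical helper, not something from the thread): converting the c-literal with std.conv.to yields the same code points as the d-literal, while the number of code units differs.

```d
import std.conv : to;
import std.stdio : writeln;

// Hypothetical helper: transcodes the UTF-8 string to UTF-32 and
// compares it with the UTF-32 string, code point by code point.
bool sameCodePoints(string narrow, dstring full)
{
    return narrow.to!dstring == full;
}

void main()
{
    // Two CJK characters, written as escapes so the source stays ASCII.
    writeln(sameCodePoints("\u4E2D\u6587"c, "\u4E2D\u6587"d)); // true

    // Same characters, different storage: 3 UTF-8 code units each
    // versus 1 UTF-32 code unit each.
    writeln("\u4E2D\u6587"c.length); // 6
    writeln("\u4E2D\u6587"d.length); // 2
}
```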


xs0
August 14, 2005
Derek Parnell <derek@psych.ward> wrote:

[...]
> the encoding of the
> source text should be independent of the interpretation of
> undecorated string literals. Just because a file is encoded as
> UTF8 there should be no restriction on me deciding to save the
> file as UTF16. The compiler should not go choosing which
> function to call based on how the file just happens to be
> encoded.

Nice example, but does this argument hold in general? How will your embedded string literals be saved to the UTF16 source by your editor? And once you have changed some of them to real UTF16 codes, how will your editor save them if you decide to revert to UTF8?


> This sounds more like we need to have a pragma that specifies which default encoding we mean to have on the undecorated literals in a specific source text.

Agreed. That might be a solution.

[...]
>> what will now happen with the call:
>> 
>>   func( "\u00001111"d "qwerty"c);
>> 
>> Is this ambiguous?
> 
> Not to the compiler. If you try this you get the error message
> 
>   " mismatched string literal postfixes 'd' and 'c' "

Yes. But have you tried any further?

func( "" ""c); //mismatched string literal postfixes ' ' and 'c'
func( "" ""d); //mismatched string literal postfixes ' ' and 'd'

So an unsuffixed string literal is compatible with neither c nor d. Why, then, does the compiler complain that undecorated string literals match both char[] and dchar[]?

Are we chasing a phantom, because the overloading routine of dmd is broken?

vathix and some others have already reported similar problems:

http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D.bugs/3206

-manfred
August 14, 2005
xs0 <xs0@xs0.com> wrote:

[...]
> Are you sure that it's a good idea to change the behavior of code based on the encoding of the file? I sure am not...

What is the code embedded in a file if you do not know the encoding of the file? Please explain.


[...]
> That would be like "123.456" being interpreted as either 123456 or 123.456, depending on your regional settings. A definite disaster :)

.. or 123,456. A disaster the Germans are well aware of, because comma and point swap roles when changing from English to German conventions.


[...]
> Aren't the characters the same in all cases, just the string type changes?

I might be getting you wrong, but why should the string literal consisting of the one-letter d-string for "true" be the characters "true" as a 4-letter c-string?

-manfred
August 14, 2005
Manfred Nowak <svv1999@hotmail.com> wrote:

[...]
> Are we chasing a phantom, because the overloading routine of dmd is broken?

http://www.digitalmars.com/drn-bin/wwwnews?digitalmars.D/27069

is the other reference where a `dchar' was matched by a `char' and a `creal'.

-manfred