November 24, 2005
Unified String Theory..
With the recent Physics slant on some posts here I couldn't resist that  
subject. In actual fact this is an idea for string handling in D which I  
have cooked up recently.

I am going to paste the text here and attach my original document; the  
document may be easier to read than the NG.

I like this idea. It may however be too much of a change for D; I'm hoping  
the advantages outweigh this fact, but I'm not going to hold my breath.

It is possible I have missed something obvious and/or am talking out of a  
hole in my head; if that is the case I would appreciate being told so,  
politely ;)

Enough rambling, here it is, be nice!

-----

Proposal: A single unified string type.
Author  : Regan Heath
Version : 1.0a
Date    : 24 Nov 2005 +1300 (New Zealand DST)

[Preamble/Introduction]
After the recent discussion on Unicode, UTF encodings and the current D  
situation it occurred to me that many of the issues D has with strings  
could be side-stepped if there was a single string type.

In the past we have assumed that to obtain this we have to choose one of  
the 3 available types and encodings. This wasn't an attractive option  
because each type has different pros/cons and each application may prefer  
one type over another. Another suggested solution was a string class which  
hides the details; this solution suffers from being a class, with the  
limitations that imposes, and from not being tied directly into the language.

My proposal is a single "string" type built into the language, which can  
represent its string data in any given UTF encoding. It will allow  
slicing of "characters" as opposed to what is essentially bytes, shorts,  
and ints. Its default encoding can be selected at compile time, or  
specified at runtime. It will implicitly or explicitly transcode where  
required.
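
To give a flavour of what I mean, here is a tiny sketch. None of this  
compiles today, it is purely illustrative of the proposed semantics:

string s = "héllo";   // data held in the native encoding (UTF-8 by default)
string t = s[0..2];   // slicing is by character, giving "hé", not by byte
// if s is later passed to something expecting another encoding (say a
// UTF-16 OS call) the transcoding happens implicitly at that point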

There are some requirements for this to be possible, namely knowledge of  
the UTF encodings being built into D. These requirements may count against  
the proposal, as they increase the knowledge required to write a D  
compiler. However, it occurs to me that DMD (and thus D?) already  
requires a fair bit of UTF knowledge.


[Key]
First, let's start with some terminology. These are the terms I am going to  
be using and what they mean; if these are incorrect please correct me, but  
take them to have the stated meanings for this document.

code point      := the unicode value for a single and complete character.
code unit       := part of, or a complete character in one of the 3 UTF  
encodings UTF-8,16,32.
code value      := AKA code unit.
transcoding     := the process of converting from one encoding to another.
source          := a file, the keyboard, a tcp socket, a com port, an OS/C  
function call, a 3rd party library.
sink            := a file, the screen, a tcp socket, a com port, an OS/C  
function call, a 3rd party library.
native encoding := application specific "preferred" encoding (more on this  
later)
string          := a sequence of code points.

Anything I am unsure about will be suffixed with (x) where x is a letter  
of the alphabet, and my thoughts will be detailed in the [Questions]  
section.


[Assumptions]
These are what I base my argument/suggestion on; if you disagree with any  
of these you will likely disagree with the proposal. If that is the case  
please post your concerns with any given assumption in its own post (I  
would like to discuss each issue in its own thread and avoid mixing  
several issues).

#1: Any given string can be represented in any UTF encoding, it can be  
transcoded to/from any UTF encoding with no loss of data/meaning.

#2: Transcoding has a performance penalty at runtime. This proposal will  
mention the possible runtime penalty wherever appropriate.

#3: There are 2 places where transcoding cannot be avoided: input and  
output. Input is the process of obtaining data from a source. Output is the  
process of sending data to a sink. In either case the source or sink will  
have a fixed encoding, and if that encoding does not match the native  
encoding the application will need to transcode. (see definitions above  
for what classifies as a source or sink)

#4: String literals can be stored in the binary in any encoding (#1); the  
encoding chosen may have repercussions at runtime (#2 & #3).


[Details]
Many of the details are flexible, i.e. the names of the types etc.; the  
important/inflexible details are how it all fits together and achieves  
its results. I've chosen a bullet point format and tried to make each  
change/point as succinct and clear as possible. Feel free to ask for  
clarification on any point or points, to ask general questions, or to  
pose general problems. I will do my best to answer all questions.

* remove char[], wchar[] and dchar[].

* add a new type "string". "string" will store code points in the  
application specific native encoding and be implicitly or explicitly  
transcoded as required (more below).

* the application specific native encoding will default to UTF-8. An  
application can choose another with a compile option or pragma (a sketch  
of what this might look like follows below); this choice will have no  
effect on the behaviour of the program (as we only have 1 type and all  
transcoding is handled where required), it will only affect performance.

The performance cost cannot be avoided, presuming it is only being done at  
input and output (which is part of what this proposal aims to achieve).  
This cost is application specific and will depend on the tasks and data  
the application is designed to perform and use.

Given that, letting the programmer choose a native encoding will allow  
them to test different encodings for speed and/or provide different builds  
based on the target language, eg an application destined to be used with  
the Japanese language would likely benefit from using UTF-32  
internally/natively.
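
Purely for illustration, selecting the native encoding might look  
something like the following. Neither this compiler flag nor this pragma  
exists; the exact mechanism is an open detail of the proposal:

// hypothetical compile option:
//   dmd -native=utf32 myapp.d
// or a hypothetical pragma in the source:
pragma(native_encoding, UTF32);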

* keep char, wchar, and dchar but rename them utf8, utf16, utf32. These  
types represent code points (always, not code units/values) in each  
encoding. Only code points that fit in utf8 will ever be represented by  
utf8, and so on. Thus some code points will always be utf32 values and  
never utf8 or 16. (much like byte/short/int)

* add promotion/comparison rules for utf8, 16 and 32 (a short sketch  
follows the two rules below):

- any given code point represented as utf8 will compare equal to the same  
code point represented as a utf16 or utf32 and vice versa(a)

- any given code point represented as utf8 will be implicitly  
converted/promoted to the same code point represented as utf16 or utf32 as  
required and vice versa(a). If promotion from utf32 to utf16 or 8 causes  
loss in data it should be handled just like int to short or byte.
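
A short sketch of the intent (again, hypothetical types, nothing here  
compiles today):

utf8  a = 'A';            // code point 65 fits in the 1-byte type
utf16 b = a;              // implicit promotion; the value does not change
utf32 c = b;              // likewise up to the 4-byte type
assert(a == b && b == c); // the same code point compares equal across types
// utf8 bad = '\u20AC';   // error: this code point does not fit in utf8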

* add a new type/alias "utf", this would alias utf8, 16 or 32. It  
represents the application specific native encoding. This allows efficient  
code, like:

string s = "test";
foreach(utf c; s) {
}

regardless of the applications selected native encoding.

* slicing string gives another string

* indexing a string gives a utf8, 16, or 32 code point.

* string literals would be of type "string" encoded in the native  
encoding, or if another encoding can be determined at compile time, in  
that encoding (see ASCII example below).

* character literals would default to the native encoding, failing that  
the smallest possible type, and promoted/converted as required.

* there are occasions where you may want to use a specific encoding for a  
part of your application, perhaps you're loading a UTF-16 file and parsing  
it. If all the work is done in a small section of code and it doesn't  
interact with the bulk of your application data, which is all in UTF-8,  
your native encoding is likely to be UTF-8.

In this case, for performance reasons, you want to be able to specify the  
encoding to use for your "string" types at runtime, they are exceptions to  
the native encoding. To do this we specify the encoding at  
construction/declaration time, eg.

string s(UTF16);
s.utf16 = ..data read from UTF-16 source..

(or similar, the exact syntax is not important at this stage)

thus...

* the type of encoding used by "string" should be selectable at runtime;  
some sort of encoding type flag must exist for each string at runtime.  
This is starting to head into "implementation details" which I want to  
avoid at this point, however it is important to note the requirement.


[Output]
* the type "char" will still exist, it will now _only_ represent a C  
string, thus when a string is passed as a char it can be implicitly  
transcoded into ASCII(b) with a null terminator, eg.

int strcmp(const char *src, const char *dst);

string test = "this is a test";
if (strcmp(test,"this is a test")==0) { }

the above will implicitly transcode 'test' into ASCII and ensure there is  
a null terminator. The literal "this is a test" will likely be stored in  
the binary as ASCII with a null terminator.

* Native OS functions requiring "char" will use the rule above. eg.

CreateFileA(char *filename...

* Native OS functions requiring unicode will be defined as:

CreateFileW(utf16 *filename...

and "string" will be implicitly transcoded to utf16, with a null  
terminator added..

* When the required encoding is not apparent, eg.

void CreateFile(char *data) { }
void CreateFile(utf16 *data) { }

string test = "this is a test";
CreateFile(test);

an explicit property should be used, eg.

CreateFile(test.char);
CreateFile(test.utf16);

NOTE: this problem still exists! It should however now be relegated to  
interaction with C APIs as opposed to happening for native D methods.


[Input]
* Old encodings, Latin-1 etc. would be loaded into ubyte[] or byte[] and  
could be cast (painted) to char*, utf8*, 16 or 32, or converted to "string"  
using a routine, eg. string toStringFromXXX(ubyte[] raw). (a sketch follows  
this section)

* A stream class would have a selectable encoding and hide these details  
from us, handling the data and giving a natively encoded "string" instead.  
Meaning, transcoding will naturally occur on input or output where  
required.
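
As a rough sketch of the kind of usage I mean (toStringFromLatin1,  
readWholeFile and this Stream interface are hypothetical names, not an  
existing API):

ubyte[] raw = readWholeFile("legacy.txt");    // raw Latin-1 bytes from a source
string s = toStringFromLatin1(raw);           // converted into the native encoding

Stream f = new Stream("data.txt", UTF16);     // the stream knows the source encoding
string line = f.readLine();                   // arrives already transcoded to native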


[Example application types and the effect of this change]

* the quick and dirty console app which handles ASCII only. Its native  
encoding will be UTF-8, and no transcoding will ever need to occur  
(assuming none of its input or output is in another encoding)

* an app which loads files in different encodings and needs to process  
them efficiently. In this case the code can select the encoding of  
"string" at runtime and avoid transcoding the data until such time as it  
needs to interface with another part of the application in another  
encoding or it needs to output to a sink, also in another encoding.

* an international app which will handle many languages. this app can be  
custom built with the native string type selected to match each language.


[Advantages]
As I see it, this change would have the following advantages:

* "string" requires no knowledge of UTF encodings (and the associated  
problems) to use making it easy for begginners and for a quick and dirty  
program.

* "string" can be sliced/indexed by character regardless of the encoding  
used for the data.

* overload resolution has only 1 type, not 3 to choose from.

* code written in D would all use the same type "string". No more "this  
library uses char[], this one wchar[], and my app dchar[]" problems.


[Disadvantages]
* requirements listed below

* libraries built for a different native type will likely cause  
transcoding. This problem already exists; at least with this suggestion  
the library can be built 3 times, once for each native encoding, and the  
correct one linked to your app.

* possibility of implicit and silent transcoding. This can occur between  
libraries built with different native encodings and between "string" and  
char*, utf8*, utf16* and utf32*; the compiler _could_ identify all such  
locations if desired.


[Requirements]
In order to implement all this, "string" requires knowledge of all code  
points, how they are encoded in the 3 encodings, and how to compare and  
convert between them. So D, and thus any D compiler, eg DMD, requires this  
knowledge. I am not entirely sure just how big an "ask" this is. I believe  
DMD, and thus D, already has much of this capability built in.


[Questions]
(a) Is UTF-8 a subset of UTF-16 and so on? Does the codepoint for 'A' have  
the numerical value 65 decimal in UTF-8, UTF-16 _and_ UTF-32; in other  
words is it the same numerical value in all encodings? If so then  
comparing utf8, 16 and 32 is no different to comparing byte, short and int  
and all the same promotion and comparison rules can apply.

(b) Is this really ASCII or is it system dependent? i.e. Latin-1 or  
similar. Is it ASCII values 127 or less perhaps? To be honest I'm not sure.
November 24, 2005
Re: Unified String Theory..
On Thu, 24 Nov 2005 16:09:13 +1300, Regan Heath wrote:

> With the recent Physics slant on some posts here I couldn't resist that  
> subject

LOL


> Enough rambling, here is it, be nice!

Just some quick thoughts are recorded here. More will come later I suspect.

[snip]

> [Key]
> First, lets start with some terminology, these are the terms I am going to  
> be using and what they mean, if these are incorrect please correct me, but  
> take them to have the stated meanings for this document.
> 
> code point      := the unicode value for a single and complete character.
> code unit       := part of, or a complete character in one of the 3 UTF  
> encodings UTF-8,16,32.
> code value      := AKA code unit.

The Unicode Consortium defines code value as the smallest (in terms of
bits) value that will hold a character in the various encoding formats.
Thus for UTF8 it is 1 byte, UTF16 = 2 bytes, and UTF32 = 4 bytes.


[snip]


> * remove char[], wchar[] and dchar[].

Do we still have to cater for strings that were formatted in specific
encodings outside of our D applications? For example, a C library routine
might insist that a pointer to a UTF16 string be supplied, thus we would
have to force a specific encoding somehow.

> * add a new type "string". "string" will store code points in the  
> application specific native encoding and be implicitly or explicitly  
> transcoded as required (more below).
> 
> * the application specific native encoding will default to UTF-8. An  
> application can choose another with a compile option or pragma, this  
> choice will have no effect on the behaviour of the program (as we only  
> have 1 type and all transcoding is handled where required) it will only  
> affect performance.
> 
> The performance cost cannot be avoided, presuming it is only being done at  
> input and output (which is part of what this proposal aims to achieve).  
> This cost is application specific and will depend on the tasks and data  
> the application is designed to perform and use.
> 
> Given that, letting the programmer choose a native encoding will allow  
> them to test different encodings for speed and/or provide different builds  
> based on the target language, eg an application destined to be used with  
> the Japanese language would likely benefit from using UTF-32  
> internally/natively.
> 
> * keep char, wchar, and dchar but rename them utf8, utf16, utf32. These  
> types represent code points (always, not code units/values) in each  
> encoding. Only code points that fit in utf8 will ever be represented by  
> utf8, and so on. Thus some code points will always be utf32 values and  
> never utf8 or 16. (much like byte/short/int)

I think you've lost track of your 'code point' definition. A 'code point'
is a character. All encodings can hold all characters, every character will
fit into UTF8. Sure some might take 1, 2 or 4 'code values', but they are
still all code points. There are no exclusive code points in utf32. Every
UTF32 code point can also be expressed in UTF8.

> * add promotion/comparison rules for utf8, 16 and 32:
> 
> - any given code point represented as utf8 will compare equal to the same  
> code point represented as a utf16 or utf32 and vice versa(a)
> 
> - any given code point represented as utf8 will be implicitly  
> converted/promoted to the same code point represented as utf16 or utf32 as  
> required and vice versa(a). If promotion from utf32 to utf16 or 8 causes  
> loss in data it should be handled just like int to short or byte.

I assume by 'promotion' you really mean 'transcoding'. There is never any
data loss when converting between the different encodings. This is your #1
assumption.

> * add a new type/alias "utf", this would alias utf8, 16 or 32. It  
> represents the application specific native encoding. This allows efficient  
> code, like:
> 
> string s = "test";
> foreach(utf c; s) {
> }

But utf8, utf16, and utf32 are *strings* not characters, so 'utf' could not
be an *alias* for these in your example. I guess you mean it to be a term
for a character (code point) in a utf string.

> regardless of the applications selected native encoding.
> 
> * slicing string gives another string
> 
> * indexing a string gives a utf8, 16, or 32 code point.
> 
> * string literals would be of type "string" encoded in the native  
> encoding, or if another encoding can be determined at compile time, in  
> that encoding (see ASCII example below).
> 
> * character literals would default to the native encoding, failing that  
> the smallest possible type, and promoted/converted as required.

By 'smallest possible type' do you mean the smallest memory usage?

> * there are occasions where you may want to use a specific encoding for a  
> part of your application, perhaps you're loading a UTF-16 file and parsing  
> it. If all the work is done in a small section of code and it doesn't  
> interact with the bulk of your application data which is all in UTF-8 your  
> native encoding is likely to be UTF-8.
> 
> In this case, for performance reasons, you want to be able to specify the  
> encoding to use for your "string" types at runtime, they are exceptions to  
> the native encoding. To do this we specify the encoding at  
> construction/declaration time, eg.
> 
> string s(UTF16);
> s.utf16 = ..data read from UTF-16 source..
> 
> (or similar, the exact syntax is not important at this stage)

But the idea is that a string has the property of 'utf8', and 'utf16' and
'utf32' encoding at runtime?


-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
24/11/2005 2:34:13 PM
November 24, 2005
Re: Unified String Theory..
On Thu, 24 Nov 2005 15:04:08 +1100, Derek Parnell <derek@psych.ward> wrote:
>> [Key]
>> First, lets start with some terminology, these are the terms I am going  
>> to
>> be using and what they mean, if these are incorrect please correct me,  
>> but
>> take them to have the stated meanings for this document.
>>
>> code point      := the unicode value for a single and complete  
>> character.
>> code unit       := part of, or a complete character in one of the 3 UTF
>> encodings UTF-8,16,32.
>> code value      := AKA code unit.
>
> The Unicode Consortium defines code value as the smallest (in terms of
> bits) value that will hold a character in the various encoding formats.
> Thus for UTF8 it is 1 byte, UTF16 = 2 bytes, and UTF32 = 4 bytes.

Thanks for the detailed description. That is what I meant above.

>> * remove char[], wchar[] and dchar[].
>
> Do we still have to cater for strings that were formatted in specific
> encodings outside of our D applications? For example, a C library routine
> might insist that a pointer to a UTF16 string be supplied, thus we would
> have to force a specific encoding somehow.

Yes, that is the purpose of char*, utf16*, etc. eg.

int strlen(const char *string) {}
int CreateFileW(utf16 *filename, ...

>> * keep char, wchar, and dchar but rename them utf8, utf16, utf32. These
>> types represent code points (always, not code units/values) in each
>> encoding. Only code points that fit in utf8 will ever be represented by
>> utf8, and so on. Thus some code points will always be utf32 values and
>> never utf8 or 16. (much like byte/short/int)
>
> I think you've lost track of your 'code point' definition.

Not so. I've just failed to explain what I mean here, let me try some  
more...

> A 'code point' is a character.

Correct.

> All encodings can hold all characters, every character will
> fit into UTF8. Sure some might take 1, 2 or 4 'code values', but they  
> are still all code points. There are no exclusive code points in utf32.  
> Every
> UTF32 code point can also be expressed in UTF8.

I realise all this. It is not what I meant above.

Think of the type "utf8" as being identical to "byte", except that the  
values it stores are always complete code points, never fragments or code  
units/values. The type "utf8" will never have part of a complete character  
in it; it'll either have the whole character or it will be an error.

"utf8" can represent the range of code points which are between 0 and 255  
(or perhaps it's 127, not sure).

perhaps the name "utf8" is missleading, it's not in fact a UTF-8 code  
unit/value, it is a codepoint, that fits in a byte. The reason it's not  
called "byte" is because the seperate type is used to trigger transcoding,  
see my original utf16* example.
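
In other words, something like this (hypothetical types as before):

utf8  a = 'A';       // fine: code point 65 is in range for the 1-byte type
utf8  e = '\u20AC';  // error: the Euro sign's code point (0x20AC) does not fit
utf16 f = '\u20AC';  // fine: it fits in the 2-byte type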

>> * add promotion/comparison rules for utf8, 16 and 32:
>>
>> - any given code point represented as utf8 will compare equal to the  
>> same
>> code point represented as a utf16 or utf32 and vice versa(a)
>>
>> - any given code point represented as utf8 will be implicitly
>> converted/promoted to the same code point represented as utf16 or utf32  
>> as
>> required and vice versa(a). If promotion from utf32 to utf16 or 8 causes
>> loss in data it should be handled just like int to short or byte.
>
> I assume by 'promotion' you really mean 'transcoding'.

No, I think I mean promotion. This is one of the things I am not 100% sure  
of, bear with me.

The character 'A' has ASCII value 65 (decimal). Assuming its code point  
is 65 (decimal), then this code point will fit in my "utf8" type. Thus  
"utf8" can represent the code point 'A'. If you assign that "utf8" to a  
"utf16", eg.

utf8 a = 'A';
utf16 b = a;

The utf8 value will be promoted to a utf16 value. The value itself doesn't  
change (it's not transcoded). It happens in exactly the same way a byte is  
promoted to a short. Is promoted the right word?

That is, provided the value doesn't _need_ to change when going from utf8  
to utf16; I am not 100% sure of this. I don't think it does. I believe all  
the code points that fit in the 1 byte type have the same numerical value  
in the 2 byte type (UTF-16), and also in the 4 byte type (UTF-32).
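
For the ASCII range at least this seems to hold in current D; here is a  
quick check using the existing char/wchar/dchar types (not my proposed  
ones):

import std.stdio;

void main()
{
    char  c = 'A';
    wchar w = 'A';
    dchar d = 'A';
    // 'A' has the same numerical value (65) at all three widths
    assert(c == 65 && w == 65 && d == 65);
    assert(c == w && w == d);
    writefln("same value in all three");
}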

>> * add a new type/alias "utf", this would alias utf8, 16 or 32. It
>> represents the application specific native encoding. This allows  
>> efficient
>> code, like:
>>
>> string s = "test";
>> foreach(utf c; s) {
>> }
>
> But utf8, utf16, and utf32 are *strings* not characters

No, they're not, not in my proposal. I think I picked bad names.

> , so 'utf' could not be an *alias* for these in your example. I guess  
> you mean it to be a term
> for a character (code point) in a utf string.

utf, utf8, utf16, and utf32 are all types that store complete code points,  
never code units/values/fragments. Think of them as being identical to  
byte, short, and int.

>> regardless of the applications selected native encoding.
>>
>> * slicing string gives another string
>>
>> * indexing a string gives a utf8, 16, or 32 code point.
>>
>> * string literals would be of type "string" encoded in the native
>> encoding, or if another encoding can be determined at compile time, in
>> that encoding (see ASCII example below).
>>
>> * character literals would default to the native encoding, failing that
>> the smallest possible type, and promoted/converted as required.
>
> By 'smallest possible type' do you mean the smallest memory usage?

Yes. utf8 is smaller than utf16 is smaller than utf32.

>> * there are occasions where you may want to use a specific encoding for  
>> a
>> part of your application, perhaps you're loading a UTF-16 file and  
>> parsing
>> it. If all the work is done in a small section of code and it doesn't
>> interact with the bulk of your application data which is all in UTF-8  
>> your
>> native encoding is likely to be UTF-8.
>>
>> In this case, for performance reasons, you want to be able to specify  
>> the
>> encoding to use for your "string" types at runtime, they are exceptions  
>> to
>> the native encoding. To do this we specify the encoding at
>> construction/declaration time, eg.
>>
>> string s(UTF16);
>> s.utf16 = ..data read from UTF-16 source..
>>
>> (or similar, the exact syntax is not important at this stage)
>
> But the idea is that a string has the property of 'utf8', and 'utf16' and
> 'utf32' encoding at runtime?

Yes. But you will only need to use these properties when performing input  
or output (see my definitions of source and sink) and only when the type  
cannot be inferred by the context, i.e. it's not required here:

int CreateFile(utf16* filename) {}
string test = "test";
CreateFile(test);

Regan
November 24, 2005
Re: Unified String Theory..
Hi Regan,

Two small remarks:

* "wchar" might still be useful for those applications / libraries that 
support 16-bit unicode without aggregates like in Windows NT if I'm correct. 
It's not utf16 since it can't contain a big, >2-byte code point, ie. it's 
ushort.

* I don't see the point of the utf8, utf16 and utf32 types. They can all 
contain any code point, so they should all be just as big? Or do you mean 
that utf8 is like a ubyte[4], utf16 like ushort[2] and utf32 like uint? 
Actually pieces from the respective strings.

L.
November 24, 2005
Re: Unified String Theory..
By the way, I like the proposal! I prefer different compiled libraries to 
many runtime checks or version blocks. It's like the #define UNICODE in 
Windows.

L.
November 24, 2005
Re: Unified String Theory..
Regan Heath wrote:
[snip]
> [Questions]
> (a) Is UTF-8 a subset of UTF-16 and so on? does the codepoint for 'A' have the numerical value 65 decimal in UTF-8, UTF-16 _and_ UTF-32, in other words is it the same numerical value in all encodings? If so then comparing utf8, 16 and 32 is no different to comparing byte, short and int and all the same promotion and comparison rules can apply.

I think you are making this more complicated than it is by using the 
name UTF when you actually mean something like:

ascii_char (not utf8) (code point < 128)
ucs2_char (not utf16) (code point < 65536)
unicode_char (not utf32)

And yes: ascii is a subset of ucs2 is a subset of unicode.

> (b) Is this really ASCII or is it system dependant? i.e. Latin-1 or similar. Is it ASCII values 127 or less perhaps? To be honest I'm not sure.

ASCII is equal to the first 128 code points in Unicode.
Latin-1 is equal to the first 256 code points in Unicode.

Regards,

/Oskar
November 24, 2005
Re: Unified String Theory..
On Thu, 24 Nov 2005 09:56:51 +0200, Lionello Lunesu  
<lio@remove.lunesu.com> wrote:
> Two small remarks:
>
> * "wchar" might still be useful for those applications / libraries that
> support 16-bit unicode without aggregates like in Windows NT if I'm  
> correct.
> It's not utf16 since it can't contain a big, >2-byte code point, ie. it's
> ushort.
>
> * I don't see the point of the utf8, utf16 and utf32 types. They can all
> contain any code point, so they should all be just as big? Or do you mean
> that utf8 is like a ubyte[4], utf16 like ushort[2] and utf32 like uint?
> Actually pieces from the respective strings.

No. I seem to have done a bad job of explaining it _and_ picked terrible  
names.

The "utf8", "utf16" and "utf32" types I refer to are essentially byte,  
short and int. They cannot contain any code point, only those that fit (I  
thought I said that?)

We don't need wchar because utf16 replaces it.

Perhaps if I had kept the original names... doh!

Regan
November 24, 2005
Re: Unified String Theory [READ THIS FIRST]
Ok, it appears I picked some really bad type names in my proposal and it  
is causing some confusion.

The types "utf8" "utf16" and "utf32" do not in fact have anything to do  
with UTF. (Bad Regan).

They are in fact essentially byte, short and int with different names.  
Having different names is important because it triggers the transcoding of  
"string" to the required C, OS, or UTF type.

I could have left them called "char", "wchar" and "dchar", except that I  
wanted a 4th type to represent C's char as well. That type was called  
"char" in the proposal.

So, for the sake of our sanity can we all please assume I have used these  
type names instead:

"utf8"  == "cp1"
"utf16" == "cp2"
"utf32" == "cp4"
"utf"   == "cpn"

(the actual type names are unimportant at this stage, we can pick the best  
possible names later)

The idea behind these types is that they represent code points/characters,  
_never_ code units/values/fragments. Which means cp1 can only represent a  
small subset of Unicode code points, cp2 slightly more, and cp4 all of them  
(IIRC).

It means assigning anything outside their range to them is an error.

It means that you can assign a cp1 to a cp2 and it simply promotes it  
(like it would from byte to short).

"cpn" is simply and alias for the type that is best suited for the chosen  
native encoding. If the native encoding is UTF-8, cpn is an alias for cp1,  
if the native encoding is UTF-16, cpn is an alias for cp2, and so on.
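
Restated with the new names (still hypothetical, same caveats as the  
original proposal):

cp1 a = 'A';   // a complete code point that happens to fit in 1 byte
cp2 b = a;     // promotes like byte -> short; the value itself does not change
cp4 c = b;     // cp4 can hold any Unicode code point
cpn n = a;     // cpn is whichever of cp1/cp2/cp4 matches the native encoding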

Sorry for all the confusion.

Regan
November 24, 2005
Re: Unified String Theory..
On Thu, 24 Nov 2005 09:23:21 +0100, Oskar Linde  
<oskar.lindeREM@OVEgmail.com> wrote:
> Regan Heath wrote:
> [snip]
>> [Questions]
>> (a) Is UTF-8 a subset of UTF-16 and so on? does the codepoint for 'A'  
>> have the numerical value 65 decimal in UTF-8, UTF-16 _and_ UTF-32, in  
>> other words is it the same numerical value in all encodings? If so then  
>> comparing utf8, 16 and 32 is no different to comparing byte, short and  
>> int and all the same promotion and comparison rules can apply.
>
> I think you are making this more complicated than it is by using the  
> name UTF when you actually mean something like:
>
> ascii_char (not utf8) (code point < 128)
> ucs2_char (not utf16) (code point < 65536)
> unicode_char (not utf32)

I agree, it appears my choice of type names was really confusing. I have  
posted a change, but perhaps I should repost all over again; perhaps I  
should have bounced this off one person before posting.

> And yes: ascii is a subset of ucs2 is a subset of unicode.

Excellent. Thanks.

>> (b) Is this really ASCII or is it system dependant? i.e. Latin-1 or  
>> similar. Is it ASCII values 127 or less perhaps? To be honest I'm not  
>> sure.
>
> ASCII is equal to the first 128 code points in Unicode.
> Latin-1 is equal to the first 256 code points in Unicode.

And which does a C function expect? Or is that defined by the C function?  
Does strcmp care? Does strlen, strchr, ...?

Regan
November 24, 2005
Re: Unified String Theory [READ THIS FIRST]
Replying to myself now, in addition to bolloxing the initial proposal up  
with bad type names, I'm on a roll!

Here is version 1.1 of the proposal, with different type names and some  
changes to the other content. Hopefully this one will make more sense,  
fingers crossed.

Regan