November 24, 2005
With the recent Physics slant on some posts here I couldn't resist that subject. In actual fact this is an idea for string handling in D which I have cooked up recently.

I am going to paste the text here and attach my original document, the document may be easier to read than the NG.

I like this idea; it may however be too much of a change for D. I'm hoping the advantages outweigh this fact, but I'm not going to hold my breath.

It is possible I have missed something obvious and/or am talking out of a hole in my head, if that is the case I would appreciate being told so, politely ;)

Enough rambling, here it is, be nice!

-----

Proposal: A single unified string type.
Author  : Regan Heath
Version : 1.0a
Date    : 24 Nov 2005 +1300 (New Zealand DST)

[Preamble/Introduction]
After the recent discussion on Unicode, UTF encodings and the current D
situation it occurred to me that many of the issues D has with strings
could be side-stepped if there was a single string type.

In the past we have assumed that to obtain this we have to choose one of the 3 available types and encodings. This wasn't an attractive option because each type has different pros/cons and each application may prefer one type over another. Another suggested solution was a string class which hides the details; this solution suffers from being a class, with the limitations that imposes, and from not being tied directly into the language.

My proposal is a single "string" type built into the language, which can represent its string data in any given UTF encoding; which will allow slicing of "characters" as opposed to what is essentially bytes, shorts, and ints; whose default encoding can be selected at compile time, or specified at runtime; and which will implicitly or explicitly transcode where required.

There are some requirements for this to be possible, namely knowledge of the UTF encodings being built into D. These requirements may count against the proposal as they increase the knowledge required to write a D compiler. However, it occurs to me that DMD, and thus D, already requires a fair bit of UTF knowledge.


[Key]
First, let's start with some terminology. These are the terms I am going to
be using and what they mean; if these are incorrect please correct me, but
take them to have the stated meanings for this document.

code point      := the unicode value for a single and complete character.
code unit       := part of, or a complete character in one of the 3 UTF encodings UTF-8,16,32.
code value      := AKA code unit.
transcoding     := the process of converting from one encoding to another.
source          := a file, the keyboard, a tcp socket, a com port, an OS/C function call, a 3rd party library.
sink            := a file, the screen, a tcp socket, a com port, an OS/C function call, a 3rd party library.
native encoding := application specific "preferred" encoding (more on this later)
string          := a sequence of code points.

Anything I am unsure about will be suffixed with (x) where x is a letter of the alphabet, and my thoughts will be detailed in the [Questions] section.


[Assumptions]
These are what I base my argument/suggestion on; if you disagree with any
of these you will likely disagree with the proposal. If that is the case
please post your concerns with any given assumption in its own post (I
would like to discuss each issue in its own thread and avoid mixing
several issues).

#1: Any given string can be represented in any UTF encoding, it can be transcoded to/from any UTF encoding with no loss of data/meaning.

#2: Transcoding has a performance penalty at runtime. This proposal will mention the possible runtime penalty wherever appropriate.

#3: There are 2 places where transcoding cannot be avoided; input and output. Input is the process of obtaining data from a source. Output is the process of sending data to a sink. In either case the source or sink will have a fixed encoding and if that encoding does not match the native encoding the application will need to transcode. (see definitions above for what classifies as a source or sink)

#4: String literals can be stored in the binary in any encoding (#1); the encoding chosen may have repercussions at runtime (#2 & #3).
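
As a quick sanity check of assumption #1, a round trip through the three encodings can be done today with std.utf (assuming the toUTF8/toUTF16/toUTF32 routines of the current Phobos); this is just an illustration, not part of the proposal:

import std.utf;

char[]  u8  = "håll";        // UTF-8 source data
wchar[] u16 = toUTF16(u8);   // transcode to UTF-16
dchar[] u32 = toUTF32(u16);  // transcode to UTF-32
assert(toUTF8(u32) == u8);   // round trip: no loss of data/meaning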


[Details]
Many of the details are flexible, i.e. the names of the types etc, the
important/inflexible details are how it all fits together and achieves
its results. I've chosen a bullet point format and tried to make each
change/point as succinct and clear as possible. Feel free to ask for
clarification on any point or points. Or to ask general questions. Or to
pose general problems. I will do my best to answer all questions.

* remove char[], wchar[] and dchar[].

* add a new type "string". "string" will store code points in the application specific native encoding and be implicitly or explicitly transcoded as required (more below).

* the application specific native encoding will default to UTF-8. An application can choose another with a compile option or pragma; this choice will have no effect on the behaviour of the program (as we only have 1 type and all transcoding is handled where required), it will only affect performance.

The performance cost cannot be avoided, presuming it is only being done at input and output (which is part of what this proposal aims to achieve). This cost is application specific and will depend on the tasks and data the application is designed to perform and use.

Given that, letting the programmer choose a native encoding will allow them to test different encodings for speed and/or provide different builds based on the target language, eg an application destined to be used with the Japanese language would likely benefit from using UTF-32 internally/natively.

* keep char, wchar, and dchar but rename them utf8, utf16, utf32. These types represent code points (always, not code units/values) in each encoding. Only code points that fit in utf8 will ever be represented by utf8, and so on. Thus some code points will always be utf32 values and never utf8 or 16. (much like byte/short/int)

* add promotion/comparison rules for utf8, 16 and 32:

- any given code point represented as utf8 will compare equal to the same code point represented as a utf16 or utf32 and vice versa(a)

- any given code point represented as utf8 will be implicitly converted/promoted to the same code point represented as utf16 or utf32 as required and vice versa(a). If promotion from utf32 to utf16 or 8 causes loss in data it should be handled just like int to short or byte.
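
For example, under these rules (a sketch in the proposed syntax, not valid D today):

utf8  a = 'A';             // code point 65 fits in the 1-byte type
utf16 b = a;               // implicit promotion, the numeric value is unchanged
utf32 c = b;               // likewise up to the 4-byte type
assert(a == b && b == c);  // the same code point compares equal at any width
utf8  d = cast(utf8) c;    // narrowing back down is handled like int to byte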

* add a new type/alias "utf", this would alias utf8, 16 or 32. It represents the application specific native encoding. This allows efficient code, like:

string s = "test";
foreach(utf c; s) {
}

regardless of the applications selected native encoding.

* slicing string gives another string

* indexing a string gives a utf8, 16, or 32 code point.
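
For example (again a sketch of the proposed behaviour only):

string s = "héllo";   // 5 code points, whatever the native encoding is
string t = s[1..3];   // slicing is by character: "él"
utf    c = s[1];      // indexing yields one complete code point, 'é'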

* string literals would be of type "string" encoded in the native encoding, or if another encoding can be determined at compile time, in that encoding (see ASCII example below).

* character literals would default to the native encoding, failing that the smallest possible type, and promoted/converted as required.

* there are occasions where you may want to use a specific encoding for a part of your application, perhaps you're loading a UTF-16 file and parsing it. If all the work is done in a small section of code and it doesn't interact with the bulk of your application data, which is all in UTF-8, then your native encoding is likely to be UTF-8.

In this case, for performance reasons, you want to be able to specify the encoding to use for your "string" types at runtime, they are exceptions to the native encoding. To do this we specify the encoding at construction/declaration time, eg.

string s(UTF16);
s.utf16 = ..data read from UTF-16 source..

(or similar, the exact syntax is not important at this stage)

thus...

* the type of encoding used by "string" should be selectable at runtime, some sort of encoding type flag must exist for each string at runtime, this is starting to head into "implementation details" which I want to avoid at this point, however it is important to note the requirement.
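
To make the implication concrete (proposed syntax only):

string a(UTF16);     // this string is tagged as UTF-16 at runtime
string b = "hello";  // this one uses the native encoding (say UTF-8)
b = b ~ a;           // the encodings differ, so mixing them implies a transcode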


[Output]
* the type "char" will still exist, it will now _only_ represent a C
string, thus when a string is passed as a char it can be implicitly
transcoded into ASCII(b) with a null terminator, eg.

int strcmp(const char *src, const char *dst);

string test = "this is a test";
if (strcmp(test,"this is a test")==0) { }

the above will implicitly transcode 'test' into ASCII and ensure there is a null terminator. The literal "this is a test" will likely be stored in the binary as ASCII with a null terminator.

* Native OS functions requiring "char" will use the rule above. eg.

CreateFileA(char *filename...

* Native OS functions requiring unicode will be defined as:

CreateFileW(utf16 *filename...

and "string" will be implicitly transcoded to utf16, with a null terminator added..

* When the required encoding is not apparent, eg.

void CreateFile(char *data) { }
void CreateFile(utf16 *data) { }

string test = "this is a test";
CreateFile(test);

an explicit property should be used, eg.

CreateFile(test.char);
CreateFile(test.utf16);

NOTE: this problem still exists! It should however now be relegated to interaction with C API's as opposed to happening for native D methods.


[Input]
* Old encodings, Latin-1 etc. would be loaded into ubyte[] or byte[] and
could be cast (painted) to char*, utf8*, 16 or 32, or converted to "string"
using a routine, eg. string toStringFromXXX(ubyte[] raw).
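
For instance, loading a Latin-1 file might look like this (toStringFromLatin1 is just a hypothetical name following the toStringFromXXX pattern above):

ubyte[] raw = cast(ubyte[]) std.file.read("legacy.txt");  // Latin-1 bytes from disk
string  s   = toStringFromLatin1(raw);                    // converted into the native encoding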

* A stream class would have a selectable encoding and hide these details
from us, handling the data and giving us a natively encoded "string" instead.
Meaning, transcoding will naturally occur on input or output where required.
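
A sketch of how that might look (the class name, enum and methods here are hypothetical, purely to illustrate the idea):

TextStream f = new TextStream("data.txt", Encoding.UTF16); // the source is UTF-16
string line = f.readLine();  // arrives already transcoded to the native encoding
f.writeLine(line);           // transcoded back to UTF-16 on the way out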


[Example application types and the effect of this change]

* the quick and dirty console app which handles ASCII only. Its native encoding will be UTF-8, and no transcoding will ever need to occur (assuming none of its input or output is in another encoding)

* an app which loads files in different encodings and needs to process them efficiently. In this case the code can select the encoding of "string" at runtime and avoid transcoding the data until such time as it needs to interface with another part of the application in another encoding or it needs to output to a sink, also in another encoding.

* an international app which will handle many languages. this app can be custom built with the native string type selected to match each language.


[Advantages]
As I see it, this change would have the following advantages:

* "string" requires no knowledge of UTF encodings (and the associated problems) to use making it easy for begginners and for a quick and dirty program.

* "string" can be sliced/indexed by character regardless of the encoding used for the data.

* overload resolution has only 1 type, not 3 to choose from.

* code written in D would all use the same type "string". No more "this library uses char[], that one wchar[], and my app dchar[]" problems.


[Disadvantages]
* requirements listed below

* libraries built for a different native type will likely cause transcoding. This problem already exists; at least with this suggestion the library can be built 3 times, once for each native encoding, and the correct one linked to your app.

* possibility of implicit and silent transcoding. This can occur between libraries built with different native encodings and between "string" and char*, utf8*, utf16* and utf32*, the compiler _could_ identify all such locations if desired.


[Requirements]
In order to implement all this "string" requires knowledge of all code
points, how they are encoded in the 3 encodings and how to compare and
convert between them. So, D and thus any D compiler eg DMD, requires this
knowledge. I am not entirely sure just how big an "ask" this is. I believe
DMD and thus D already has much of this capability built in.


[Questions]
(a) Is UTF-8 a subset of UTF-16 and so on? does the codepoint for 'A' have
the numerical value 65 decimal in UTF-8, UTF-16 _and_ UTF-32, in other
words is it the same numerical value in all encodings? If so then
comparing utf8, 16 and 32 is no different to comparing byte, short and int
and all the same promotion and comparison rules can apply.

(b) Is this really ASCII or is it system dependent? i.e. Latin-1 or similar. Is it ASCII values 127 or less perhaps? To be honest I'm not sure.




November 24, 2005
On Thu, 24 Nov 2005 16:09:13 +1300, Regan Heath wrote:

> With the recent Physics slant on some posts here I couldn't resist that subject

LOL


> Enough rambling, here is it, be nice!

Just some quick thoughts are recorded here. More will come later I suspect.

[snip]

> [Key]
> First, lets start with some terminology, these are the terms I am going to
> be using and what they mean, if these are incorrect please correct me, but
> take them to have the stated meanings for this document.
> 
> code point      := the unicode value for a single and complete character.
> code unit       := part of, or a complete character in one of the 3 UTF
> encodings UTF-8,16,32.
> code value      := AKA code unit.

The Unicode Consortium defines code value as the smallest (in terms of bits) value that will hold a character in the various encoding formats. Thus for UTF8 it is 1 byte, UTF16 = 2 bytes, and UTF32 = 4 bytes.


[snip]


> * remove char[], wchar[] and dchar[].

Do we still have to cater for strings that were formatted in specific encodings outside of our D applications? For example, a C library routine might insist that a pointer to a UTF16 string be supplied, thus we would have to force a specific encoding somehow.

> * add a new type "string". "string" will store code points in the application specific native encoding and be implicitly or explicitly transcoded as required (more below).
> 
> * the application specific native encoding will default to UTF-8. An application can choose another with a compile option or pragma, this choice will have no effect on the behaviour of the program (as we only have 1 type and all transcoding is handled where required) it will only affect performance.
> 
> The performance cost cannot be avoided, presuming it is only being done at input and output (which is part of what this proposal aims to achieve). This cost is application specific and will depend on the tasks and data the application is designed to perform and use.
> 
> Given that, letting the programmer choose a native encoding will allow them to test different encodings for speed and/or provide different builds based on the target language, eg an application destined to be used with the Japanese language would likely benefit from using UTF-32 internally/natively.
> 
> * keep char, wchar, and dchar but rename them utf8, utf16, utf32. These types represent code points (always, not a code units/values) in each encoding. Only code points that fit in utf8 will ever be represented by utf8, and so on. Thus some code points will always be utf32 values and never utf8 or 16. (much like byte/short/int)

I think you've lost track of your 'code point' definition. A 'code point' is a character. All encodings can hold all characters; every character will fit into UTF8. Sure, some might take 1, 2 or 4 'code values', but they are still all code points. There are no exclusive code points in utf32. Every UTF32 code point can also be expressed in UTF8.

> * add promotion/comparrison rules for utf8, 16 and 32:
> 
> - any given code point represented as utf8 will compare equal to the same code point represented as a utf16 or utf32 and vice versa(a)
> 
> - any given code point represented as utf8 will be implicitly converted/promoted to the same code point represented as utf16 or utf32 as required and vice versa(a). If promotion from utf32 to utf16 or 8 causes loss in data it should be handled just like int to short or byte.

I assume by 'promotion' you really mean 'transcoding'. There is never any data loss when converting between the different encodings. This is your #1 assumption.

> * add a new type/alias "utf", this would alias utf8, 16 or 32. It represents the application specific native encoding. This allows efficient code, like:
> 
> string s = "test";
> foreach(utf c; s) {
> }

But utf8, utf16, and utf32 are *strings* not characters, so 'utf' could not be an *alias* for these in your example. I guess you mean it to be a term for a character (code point) in a utf string.

> regardless of the applications selected native encoding.
> 
> * slicing string gives another string
> 
> * indexing a string gives a utf8, 16, or 32 code point.
> 
> * string literals would be of type "string" encoded in the native encoding, or if another encoding can be determined at compile time, in that encoding (see ASCII example below).
> 
> * character literals would default to the native encoding, failing that the smallest possible type, and promoted/converted as required.

By 'smallest possible type' do you mean the smallest memory usage?

> * there are occasions where you may want to use a specific encoding for a part of your application, perhaps you're loading a UTF-16 file and parsing it. If all the work is done in a small section of code and it doesn't interact with the bulk of your application data which is all in UTF-8 your native encoding it likely to be UTF-8.
> 
> In this case, for performance reasons, you want to be able to specify the encoding to use for your "string" types at runtime, they are exceptions to the native encoding. To do this we specify the encoding at construction/declaration time, eg.
> 
> string s(UTF16);
> s.utf16 = ..data read from UTF-16 source..
> 
> (or similar, the exact syntax is not important at this stage)

But the idea is that a string has the property of 'utf8', and 'utf16' and 'utf32' encoding at runtime?


-- 
Derek
(skype: derek.j.parnell)
Melbourne, Australia
24/11/2005 2:34:13 PM
November 24, 2005
On Thu, 24 Nov 2005 15:04:08 +1100, Derek Parnell <derek@psych.ward> wrote:
>> [Key]
>> First, lets start with some terminology, these are the terms I am going to
>> be using and what they mean, if these are incorrect please correct me, but
>> take them to have the stated meanings for this document.
>>
>> code point      := the unicode value for a single and complete character.
>> code unit       := part of, or a complete character in one of the 3 UTF
>> encodings UTF-8,16,32.
>> code value      := AKA code unit.
>
> The Unicode Consortium defines code value as the smallest (in terms of
> bits) value that will hold a character in the various encoding formats.
> Thus for UTF8 it is 1 byte, UTF16 = 2 bytes, and UTF32 = 4 bytes.

Thanks for the detailed description. That is what I meant above.

>> * remove char[], wchar[] and dchar[].
>
> Do we still have to cater for strings that were formatted in specific
> encodings outside of our D applications? For example, a C library routine
> might insist that a pointer to a UTF16 string be supplied, thus we would
> have to force a specific encoding somehow.

Yes, that is the purpose of char*, utf16*, etc. eg.

int strlen(const char *string) {}
int CreateFileW(utf16 *filename, ...

>> * keep char, wchar, and dchar but rename them utf8, utf16, utf32. These
>> types represent code points (always, not a code units/values) in each
>> encoding. Only code points that fit in utf8 will ever be represented by
>> utf8, and so on. Thus some code points will always be utf32 values and
>> never utf8 or 16. (much like byte/short/int)
>
> I think you've lost track of your 'code point' definition.

Not so. I've just failed to explain what I mean here, let me try some more...

> A 'code point' is a character.

Correct.

> All encodings can hold all characters, every character will
> fit into UTF8. Sure some might take 1, 2 or 4 'code values', but there are still all code points. There are no exclusive code points in utf32. Every
> UTF32 code point can also be expressed in UTF8.

I realise all this. It is not what I meant above.

Think of the type "utf8" as being identical to "byte", except that the values it stores are always complete code points, never fragments or code units/values. The type "utf8" will never have part of a complete character in it; it'll either have the whole character or it will be an error.

"utf8" can represent the range of code points which are between 0 and 255 (or perhaps it's 127, not sure).

Perhaps the name "utf8" is misleading; it's not in fact a UTF-8 code unit/value, it is a code point that fits in a byte. The reason it's not called "byte" is because the separate type is used to trigger transcoding, see my original utf16* example.

>> * add promotion/comparrison rules for utf8, 16 and 32:
>>
>> - any given code point represented as utf8 will compare equal to the same
>> code point represented as a utf16 or utf32 and vice versa(a)
>>
>> - any given code point represented as utf8 will be implicitly
>> converted/promoted to the same code point represented as utf16 or utf32 as
>> required and vice versa(a). If promotion from utf32 to utf16 or 8 causes
>> loss in data it should be handled just like int to short or byte.
>
> I assume by 'promotion' you really mean 'transcoding'.

No, I think I mean promotion. This is one of the things I am not 100% sure of, bear with me.

The character 'A' has ASCII value 65 (decimal). Assuming its code point is 65 (decimal), then this code point will fit in my "utf8" type. Thus "utf8" can represent the code point 'A'. If you assign that "utf8" to a "utf16", eg.

utf8 a = 'A';
utf16 b = a;

The utf8 value will be promoted to a utf16 value. The value itself doesn't change (it's not transcoded). It happens in exactly the same way a byte is promoted to a short. Is promoted the right word?

That is, provided the value doesn't _need_ to change when going from utf8 to utf16, I am not 100% sure of this. I don't think it does. I believe all the code points that fit in the 1 byte type, have the same numerical value in the 2 byte type (UTF-16), and also the 4 byte type (UTF-32).
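
For what it's worth, this can be checked with std.utf as it stands (assuming the usual toUTF16/toUTF32 signatures): 'A' comes out as the numeric value 65 whichever encoding holds it.

import std.utf;

char[]  s8  = "A";
wchar[] s16 = toUTF16(s8);
dchar[] s32 = toUTF32(s8);
assert(s8[0] == 65 && s16[0] == 65 && s32[0] == 65);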

>> * add a new type/alias "utf", this would alias utf8, 16 or 32. It
>> represents the application specific native encoding. This allows efficient
>> code, like:
>>
>> string s = "test";
>> foreach(utf c; s) {
>> }
>
> But utf8, utf16, and utf32 are *strings* not characters

No, they're not, not in my proposal. I think I picked bad names.

> , so 'utf' could not be an *alias* for these in your example. I guess you mean it to be a term
> for a character (code point) in a utf string.

utf, utf8, utf16, and utf32 are all types that store complete code points, never code units/values/fragments. Think of them as being identical to byte, short, and int.

>> regardless of the applications selected native encoding.
>>
>> * slicing string gives another string
>>
>> * indexing a string gives a utf8, 16, or 32 code point.
>>
>> * string literals would be of type "string" encoded in the native
>> encoding, or if another encoding can be determined at compile time, in
>> that encoding (see ASCII example below).
>>
>> * character literals would default to the native encoding, failing that
>> the smallest possible type, and promoted/converted as required.
>
> By 'smallest possible type' do you mean the smallest memory usage?

Yes. utf8 is smaller than utf16 is smaller than utf32.

>> * there are occasions where you may want to use a specific encoding for a
>> part of your application, perhaps you're loading a UTF-16 file and parsing
>> it. If all the work is done in a small section of code and it doesn't
>> interact with the bulk of your application data which is all in UTF-8 your
>> native encoding it likely to be UTF-8.
>>
>> In this case, for performance reasons, you want to be able to specify the
>> encoding to use for your "string" types at runtime, they are exceptions to
>> the native encoding. To do this we specify the encoding at
>> construction/declaration time, eg.
>>
>> string s(UTF16);
>> s.utf16 = ..data read from UTF-16 source..
>>
>> (or similar, the exact syntax is not important at this stage)
>
> But the idea is that a string has the property of 'utf8', and 'utf16' and
> 'utf32' encoding at runtime?

Yes. But you will only need to use these properties when performing input or output (see my definitions of source and sink) and only when the type cannot be inferred by the context, i.e. it's not required here:

int CreateFile(utf16* filename) {}
string test = "test";
CreateFile(test);

Regan
November 24, 2005
Hi Regan,

Two small remarks:

* "wchar" might still be useful for those applications / libraries that support 16-bit unicode without aggregates like in Windows NT if I'm correct. It's not utf16 since it can't contain a big, >2-byte code point, ie. it's ushort.

* I don't see the point of the utf8, utf16 and utf32 types. They can all contain any code point, so they should all be just as big? Or do you mean that utf8 is like a ubyte[4], utf16 like ushort[2] and utf32 like uint? Actually pieces from the respective strings.

L.


November 24, 2005
By the way, I like the proposal! I prefer different compiled libraries to many runtime checks or version blocks. It's like the #define UNICODE in Windows.

L.


November 24, 2005
Regan Heath wrote:
[snip]
> [Questions]
> (a) Is UTF-8 a subset of UTF-16 and so on? does the codepoint for 'A' have the numerical value 65 decimal in UTF-8, UTF-16 _and_ UTF-32, in other words is it the same numerical value in all encodings? If so then comparing utf8, 16 and 32 is no different to comparing byte, short and int and all the same promotion and comparrison rules can apply.

I think you are making this more complicated than it is by using the name UTF when you actually mean something like:

ascii_char (not utf8) (code point < 128)
ucs2_char (not utf16) (code point < 65536)
unicode_char (not utf32)

And yes: ascii is a subset of ucs2 is a subset of unicode.

> (b) Is this really ASCII or is it system dependant? i.e. Latin-1 or similar. Is it ASCII values 127 or less perhaps? To be honest I'm not sure.

ASCII is equal to the first 128 code points in Unicode.
Latin-1 is equal to the first 256 code points in Unicode.
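
A concrete case, expressible in current D with dchar holding the code point: 'é' is 0xE9 (233) in Latin-1 and U+00E9 in Unicode, so widening a Latin-1 byte gives the code point directly.

ubyte latin1 = 0xE9;                // 'é' in Latin-1
dchar cp = cast(dchar) latin1;      // the same numeric value is the Unicode code point U+00E9
assert(cp == '\u00E9');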

Regards,

/Oskar
November 24, 2005
On Thu, 24 Nov 2005 09:56:51 +0200, Lionello Lunesu <lio@remove.lunesu.com> wrote:
> Two small remarks:
>
> * "wchar" might still be useful for those applications / libraries that
> support 16-bit unicode without aggregates like in Windows NT if I'm correct.
> It's not utf16 since it can't contain a big, >2-byte code point, ie. it's
> ushort.
>
> * I don't see the point of the utf8, utf16 and utf32 types. They can all
> contain any code point, so they should all be just as big? Or do you mean
> that utf8 is like a ubyte[4], utf16 like ushort[2] and utf32 like uint?
> Actually pieces from the respective strings.

No. I seem to have done a bad job of explaining it _and_ picked terrible names.

The "utf8", "utf16" and "utf32" types I refer to are essentially byte, short and int. They cannot contain any code point, only those that fit (I thought I said that?)

We don't need wchar because utf16 replaces it.

Perhaps if I had kept the original names... doh!

Regan
November 24, 2005
Ok, it appears I picked some really bad type names in my proposal and it is causing some confusion.

The types "utf8" "utf16" and "utf32" do not in fact have anything to do with UTF. (Bad Regan).

They are in fact essentially byte, short and int with different names. Having different names is important because it triggers the transcoding of "string" to the required C, OS, or UTF type.

I could have left them called "char", "wchar" and "dchar", except that I wanted a 4th type to represent C's char as well. That type was called "char" in the proposal.

So, for the sake of our sanity can we all please assume I have used these type names instead:

"utf8"  == "cp1"
"utf16" == "cp2"
"utf32" == "cp4"
"utf"   == "cpn"

(the actual type names are unimportant at this stage, we can pick the best possible names later)

The idea behind these types is that they represent code points/characters _never_ code units/values/fragments. Which means cp1 can only represent a small subset of unicode code points, cp2 slightly more and cp4 all of them (IIRC).

It means assigning anything outside their range to them is an error.

It means that you can assign a cp1 to a cp2 and it simply promotes it (like it would from byte to short).

"cpn" is simply and alias for the type that is best suited for the chosen native encoding. If the native encoding is UTF-8, cpn is an alias for cp1, if the native encoding is UTF-16, cpn is an alias for cp2, and so on.

Sorry for all the confusion.

Regan
November 24, 2005
On Thu, 24 Nov 2005 09:23:21 +0100, Oskar Linde <oskar.lindeREM@OVEgmail.com> wrote:
> Regan Heath wrote:
> [snip]
>> [Questions]
>> (a) Is UTF-8 a subset of UTF-16 and so on? does the codepoint for 'A' have the numerical value 65 decimal in UTF-8, UTF-16 _and_ UTF-32, in other words is it the same numerical value in all encodings? If so then comparing utf8, 16 and 32 is no different to comparing byte, short and int and all the same promotion and comparrison rules can apply.
>
> I think you are making this more complicated than it is by using the name UTF when you actually mean something like:
>
> ascii_char (not utf8) (code point < 128)
> ucs2_char (not utf16) (code point < 65536)
> unicode_char (not utf32)

I agree, it appears my choice of type names was really confusing. I have posted a change, but perhaps I should repost all over again, perhaps I should have bounced this off one person before posting.

> And yes: ascii is a subset of ucs2 is a subset of unicode.

Excellent. Thanks.

>> (b) Is this really ASCII or is it system dependant? i.e. Latin-1 or similar. Is it ASCII values 127 or less perhaps? To be honest I'm not sure.
>
> ASCII is equal to the first 128 code points in Unicode.
> Latin-1 is equal to the first 256 code points in Unicode.

And which does a C function expect? Or is that defined by the C function? Does strcmp care? Does strlen, strchr, ...?

Regan
November 24, 2005
Replying to myself now, in addition to bolloxing the initial proposal up with bad type names, I'm on a roll!

Here is version 1.1 of the proposal, with different type names and some changes to the other content. Hopefully this one will make more sense, fingers crossed.

Regan
