Unified String Theory.. (page 3) - D Programming Language Discussion Forum

November 24, 2005

Re: Unified String Theory..

Posted by Dawid Ciężarkiewicz
in reply to Jari-Matti Mäkelä

Jari-Matti Mäkelä wrote:

> Regan, your proposal is absolutely too complex. I don't get it and I
> really don't like it. D is supposed to be a _simple_ language. Here's an
> alternative proposal:
> [CUT]

+1

November 24, 2005

Re: Unified String Theory..

Posted by Regan Heath
in reply to Lionello Lunesu

On Thu, 24 Nov 2005 17:35:24 +0200, Lionello Lunesu <lio@remove.lunesu.com> wrote:
>> The "utf8", "utf16" and "utf32" types I refer to are essentially byte,
>> short and int. They cannot contain any code point, only those that fit (I
>> thought I said that?)
>
> In that case I don't like your idea : )
>
> It makes far more sense to have only 1 _character_ type, that holds any
> UNICODE character. Whether it comes from an utf8, utf16 or utf32 string
> shouldn't matter:

Yeah, I'm starting to think that is the only way it works. The 3 types were an attempt to avoid that for programs which do not need an int-sized type for a char, e.g. the quick and dirty ASCII program.

Interestingly it seems std.stdio and std.format are already involved in a conspiracy to convert all our char[] output to dchar and back again one character at a time before it eventually makes it to the screen.
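
The round trip described above can be sketched with present-day Phobos. This is only an illustration, assuming the current std.utf module; it is not the actual std.format code, and the function name roundTrip is made up for the example. It decodes a UTF-8 string into one dchar at a time and encodes each dchar straight back to UTF-8:

```d
import std.stdio : writeln;
import std.utf : decode, encode;

/// Decodes UTF-8 text one code point at a time and re-encodes each one.
/// (Hypothetical helper, only meant to mimic the char[] -> dchar -> char[] trip.)
string roundTrip(string text)
{
    char[] result;
    size_t i = 0;
    while (i < text.length)
    {
        // decode reads one code point and advances i past its code units
        dchar c = decode(text, i);
        char[4] buf;
        // encode writes c as UTF-8 into buf and returns the number of units used
        size_t n = encode(buf, c);
        result ~= buf[0 .. n];
    }
    return result.idup;
}

void main()
{
    writeln(roundTrip("smörgåsbord"));
}
```

The output should be identical to the input, since every valid code point survives the trip unchanged.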

> string s="Whatever";    //imagine it with a small circle on the a, comma
> under the t
> foreach(uchar u; s) {}
>
> Read "uchar" as "unicode char", essentially dchar, could in fact still be
> named dchar, I just didn't want to mix old/new terminology. The underlying
> type of "string" would be determined at compile time, but still convertible
> using properties (that part I liked very much).
>
> D's "char" should go back to C's char, signed even. Many decisions in D
> were made to ease the porting of C code, so why this "char" got overridden
> beats me. char[] should then behave no differently from byte[] (except maybe
> the element being signed).

I like "uchar". I agree "char" should go back to being C's char type. I don't think we need a char[]; all the C functions expect a null-terminated char*.

Regan

November 24, 2005

Re: Unified String Theory..

Posted by Regan Heath
in reply to Oskar Linde

On Thu, 24 Nov 2005 12:01:04 +0100, Oskar Linde <oskar.lindeREM@OVEgmail.com> wrote:
> Regan Heath wrote:
>> On Thu, 24 Nov 2005 10:18:20 +0100, Oskar Linde  <oskar.lindeREM@OVEgmail.com> wrote:
>>
>>> Regan Heath wrote:
>>>
>>>> * add a new type/alias "utf", this would alias utf8, 16 or 32. It represents the application specific native encoding. This allows efficient code, like:
>>>>  string s = "test";
>>>> foreach(utf c; s) {
>>>> }
>>>>  regardless of the applications selected native encoding.
>>>
>>>
>>> I will rewrite this with your changed names (cp*):
>>>
>>>  > * add a new type/alias "cpn", this would alias cp1, cp2 or cp4. It
>>>  > represents the application specific native encoding. This allows
>>>  > efficient  code, like:
>>>  >
>>>  > string s = "test";
>>>  > foreach(cpn c; s) {
>>>  > }
>>>  >
>>>  > regardless of the applications selected native encoding.
>>>
>>> Say you instead have:
>>>
>>> string s = "smörgåsbord";
>>> foreach(cpn c; s) {
>>> }
>>>
>>> This code would then work on Win32 (with UTF-16 being the native  encoding), but not on Linux (with UTF-8).
>>   No. "string" would be UTF-8 encoded internally on both platforms.
>
>> My proposal stated that "cpn" would thus be an alias for "cp1" but
>
> Ok. I assumed cpn would be the platform native (preferred) encoding.

Not platform native, application native. But it's not going to work anyway. It seems an int sized char type is required, I was trying to avoid that.

>> clearly that idea isn't going to work in this case as (I'm assuming) it's impossible to represent some of those characters using a single byte. Java uses an int, maybe we should just do the same?
>
> D uses dchar. Better would maybe be to rename it to char (or maybe character), giving:
>
> utf8  (today's char)
> utf16 (today's wchar)
> char  (today's dchar)
>
>>> As I see it, there are only two views you need on a unicode string:
>>> a) The code units
>>> b) The unicode characters
>>  (a) is seldom required. (b) is the common and thus goal view IMO.
>
> Actually, I think it is the other way around. (b) is seldom required.
> You can search, split, trim, parse, etc. D's char[] without any regard for encoding.

If you split it without regard for encoding you can get half a character, which is then an illegal UTF-8 sequence.
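
A small sketch of that failure, assuming present-day Phobos (std.utf.validate and UTFException) and assuming the literal is stored in precomposed form, so that "ö" is two UTF-8 code units. The helper name isValidUtf8 is made up for the example. Slicing at an arbitrary code unit index can cut a character in half:

```d
import std.utf : UTFException, validate;

/// Reports whether the given code units form valid UTF-8.
/// (Hypothetical helper wrapping std.utf.validate.)
bool isValidUtf8(const(char)[] units)
{
    try
    {
        validate(units);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    string s = "smörgåsbord";
    assert(isValidUtf8(s[0 .. 2]));   // "sm": ends on a character boundary
    assert(!isValidUtf8(s[0 .. 3]));  // ends after the first of ö's two code units
    assert(isValidUtf8(s[0 .. 4]));   // "smö": ends on a character boundary again
}
```

Splitting at positions found by searching for a valid UTF-8 substring does not have this problem, which is the case Oskar describes; the breakage only shows up when the index is picked without looking at the data.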

> This is the beauty of UTF-8 and the reason D strings all work on code units rather than characters.

But people don't care about code units, they care about characters. When do you want to inspect or modify a single code unit? I would say, just about never. On the other hand you might want to change the 4th character of "smörgåsbord" which may be the 4th, 5th, 6th, or 7th index in a char[] array. Ick.
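
To make the index problem concrete, here is a sketch of what "change the 4th character" takes when the storage is UTF-8. It assumes present-day Phobos (std.utf.toUTFindex, stride and encode); replaceCodePoint is a made-up helper name, and n counts code points from zero:

```d
import std.stdio : writeln;
import std.utf : encode, stride, toUTFindex;

/// Returns a copy of s with its n-th code point (0-based) replaced by c.
/// (Hypothetical helper; n is assumed to be within range.)
string replaceCodePoint(string s, size_t n, dchar c)
{
    // Code unit offset at which the n-th code point starts
    size_t start = toUTFindex(s, n);
    // Number of UTF-8 code units the old code point occupies
    size_t len = stride(s, start);
    // UTF-8 encoding of the replacement
    char[4] buf;
    size_t used = encode(buf, c);
    return s[0 .. start] ~ buf[0 .. used].idup ~ s[start + len .. $];
}

void main()
{
    // The 4th character is code point 3, which starts at code unit 4
    writeln(replaceCodePoint("smörgåsbord", 3, 'R'));
}
```

Note that the replacement may be shorter or longer than the original in code units, so the string generally has to be rebuilt rather than patched in place.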

> When would you actually need character based indexing?
> I believe the answer is less often than you think.

Really? How often do you care what UTF-8 fragments are used to represent your characters? The answer _should_ be never. However, D forces you to know: you cannot replace the 4th letter of "smörgåsbord" without knowing. This is the problem, IMO.

Regan

November 24, 2005

Re: Unified String Theory

Posted by Regan Heath
in reply to Georg Wrede

On Thu, 24 Nov 2005 13:29:10 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
> Congrats, Regan! Great job!
> And the thread subject is simply a Killer!
>
> If I understand you correctly, then the following would work:
>
> string st = "aaa\u0041bbb\u00C4ccc\u0107ddd";   // aaaAbbbÄcccćddd
> cp1 s3 = st[3];   // A
> cp1 s7 = st[7];   // Ä
> cp1 s11 = st[11]; // error, too narrow
> cp2 s11 = st[11]; // ć
>
> assert( s3 == 0x41 && s7 == 0xC4 && s11 == 0x107 );
>
> So, s3 would contain "A", which the old system would store as utf8 with no problem. s3 is 8 bits.
>
> s7 would contain "Ä", which the old system shouldn't have stored in 8-bit (char) because it is too big, but with your proposal it would be ok, since the _code_point_ (i.e. the "value" of the character in Unicode) does fit in 8 bits. And _we_are_storing_ the codepoint, not the UTF character here, right?

Yes. That's exactly what I was thinking. However, it appears that the idea does not hold together too well when it comes to "cpn" the alias, e.g.:

string s = "smörgåsbord";
foreach(cpn c; s) {
}

"cpn" would need to change size for each character. It would be more than a simple alias.

If it cannot change size, then it would need to be the largest size required.

If that was also too weird/difficult then it would need to be 32 bits in size all the time. I was trying to avoid this but it seems it may be required?
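
The size problem can be seen directly. The following is a sketch in present-day D, where asking foreach for a dchar makes it decode the string on the fly; codeLength is from std.utf, and utf8Widths is a made-up helper name. It reports how many UTF-8 code units each character needs, next to the fixed 4 bytes of a dchar:

```d
import std.stdio : writefln;
import std.utf : codeLength;

/// Returns the UTF-8 code unit count of each code point in s.
/// (Hypothetical helper for illustration.)
size_t[] utf8Widths(string s)
{
    size_t[] widths;
    // Iterating with dchar decodes one whole code point per step
    foreach (dchar c; s)
        widths ~= codeLength!char(c);
    return widths;
}

void main()
{
    foreach (dchar c; "smörgåsbord")
        writefln("U+%04X needs %s UTF-8 unit(s); dchar.sizeof is %s",
                 cast(uint) c, codeLength!char(c), dchar.sizeof);
}
```

A loop variable that has to hold any of these characters therefore needs the largest size, which is the 32 bit conclusion reached above.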

> s11 would error, since even the Unicode value is too big for 8 bits.
>
> The second s11 assignment would be ok, since the Unicode value of ć fits in 16 bits.
>
> And, st itself would be "regular" UTF-8 on a Linux, and (probably) UTF-16 on Windows.
>
> Yes?

My proposal didn't suggest different encodings based on the system. It was UTF-8 by default (all systems) and application specific otherwise. There is nothing stopping us making the Windows default UTF-16 if that makes sense, which it seems to.

Regan

November 24, 2005

Re: Unified String Theory [READ THIS FIRST]

Posted by Regan Heath
in reply to Georg Wrede

On Thu, 24 Nov 2005 14:33:53 +0200, Georg Wrede <georg.wrede@nospam.org> wrote:
> Regan Heath wrote:
>>  * add a new type/alias "cpn", this alias will be cp1, cp2 or cp4
>> depending on the native encoding chosen. This allows efficient
>> code, like:
>>  string s = "test";
>> foreach(cpn c; s) {
>> }
>>  * slicing string gives another string
>>  * indexing a string gives a cp1, cp2 or cp4
>
> I hope you are not implying that indexing would choose between cp1..4 based on content? And if not, then the cpX would be either some "default", or programmer chosen? Now, that leads to Americans choosing cp1 all over the place, right?

I didn't think this part through enough and Oskar gave me an example which broke my original idea. It seems that for this to work cpn would need to be a type which changed size for each character, or was always 32 bits large (as you suggest below). I was trying to avoid it being 32 bits large all the time, but it seems to be the only way it works.

> If this is true, then we might consider blatantly skipping cp1 and cp2, and only having cp4 (possibly also renaming it utfchar).
>
> Then it would be a lot simpler for the programmer, right? He'd have even less need to start researching in this UTF swamp. And everything would "just work".
>
> This would make it possible for us to fully automate the extraction and insertion of single "characters" into our new strings.
>
>      string foo = "gagaga";
>      utfchar bar = '\uFE9D'; // you don't want to know the name :-)
>      utfchar baf = 'a';
>      foo ~= bar ~ baf;
>
> (I admit the last line probably doesn't work currently, but it should, IMHO.) Anyhow, the point being that if the utfchar type is 32 bits, then it doesn't hurt anybody, and also doesn't lead to gratuitous incompatibility with foreign characters -- which is the D aim all along.

It seems this may be the best solution. Oskar had a good name for it "uchar". It means quick and dirty ASCII apps will have to use a 32 bit sized char type. I can hear people complain already.. but it's odd that no-one is complaining about writef doing this exact same thing!

> For completeness, we could have the painting casts (as opposed to converting casts). They'd be for the (seldom) situations where the programmer _does_ want to do serious tinkering on our strings.
>
>      ubyte[] myarr1 = cast(ubyte[])foo;
>      ushort[] myarr2 = cast(ushort[]) foo;
>      uint[] myarr3 = cast(uint[]) foo;
>
> These give raw arrays, like exact images of the string. The burden of COW would lie on the programmer.

I was thinking of using properties (Sean's idea) to access the data as a certain type, eg.

ubyte[]  b = foo.utf8;
ushort[] s = foo.utf16;
uint[]   i = foo.utf32;

these properties would return the string in the specified encoding using those array types.
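
Something close to this can be sketched in present-day D, assuming std.utf.toUTF16/toUTF32 and std.string.representation; the three function names below are made up to mirror the proposed properties, and thanks to uniform function call syntax they can be written as if they were properties of the string:

```d
import std.stdio : writeln;
import std.string : representation;
import std.utf : toUTF16, toUTF32;

// Hypothetical stand-ins for the proposed .utf8/.utf16/.utf32 properties.
// Each returns the raw code units of s in the named encoding.
immutable(ubyte)[]  utf8Units(string s)  { return s.representation; }
immutable(ushort)[] utf16Units(string s) { return s.toUTF16.representation; }
immutable(uint)[]   utf32Units(string s) { return s.toUTF32.representation; }

void main()
{
    string foo = "smörgåsbord";
    writeln(foo.utf8Units.length);  // number of UTF-8 code units
    writeln(foo.utf16Units.length); // number of UTF-16 code units
    writeln(foo.utf32Units.length); // number of UTF-32 code units
}
```

Unlike the painting casts, the UTF-16 and UTF-32 variants here transcode into a new array, so the caller never aliases the original string's storage.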

> ---
>
> The cpn remark: I think D programs should be (as much as possible) UTF clean, even if the programmer didn't come to think about it. This has the advantage that his programs won't break embarrassingly when a guy in China suddenly uses them.
>
> It would also be quite nice if the programmer didn't have to think about such issues at all. Just code his stuff.
>
> Having cpn as something else than 32 bits, will prevent this dream.
> (Heh, and only having single chars as 32 bits would make writing the libraries so much easier, too, I think.)

Sad but probably true. I was hoping to avoid using 32bits everywhere :(

Regan

November 24, 2005

Re: Unified String Theory [READ THIS FIRST]

Posted by Derek Parnell
in reply to Regan Heath

On Thu, 24 Nov 2005 22:49:36 +1300, Regan Heath wrote:

> Derek, I must have done a terrible job explaining this, because you've completely misunderstood me; in fact your counter proposal is essentially what my proposal was intended to be.

You seemed to be wanting to have data types that could only hold character fragments. I can't see the point of that.

If strings must be arrays, then let there be an atomic data type that represents a character and then strings can be arrays of characters. The UTF encoding of the string is just an implementation detail then. All indexing would be done on a character basis regardless of the underlying encoding.

In other words, if 'uchar' is the data type that holds a character then

  alias uchar[] string;

could be used. The main 'new thinking' is that uchar could actually have variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on the character it holds and the encoding used at the time.
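
One way to picture "indexing on a character basis regardless of the underlying encoding" is a thin wrapper over UTF-8 storage. This is only a sketch under stated assumptions: it uses present-day Phobos (std.utf.count, decode, toUTFindex), the type name UString is hypothetical, and it is not the variable-size uchar described above, since each element is handed out as a full 32 bit dchar:

```d
import std.stdio : writeln;
import std.utf : count, decode, toUTFindex;

/// Hypothetical wrapper: stores UTF-8, but indexes by code point.
struct UString
{
    string data; // underlying UTF-8 storage (an implementation detail)

    /// Number of code points. O(n): walks the whole string.
    size_t length()
    {
        return count(data);
    }

    /// Returns the n-th code point (0-based, assumed in range).
    /// O(n): scans from the start to find where it begins.
    dchar opIndex(size_t n)
    {
        size_t i = toUTFindex(data, n);
        return decode(data, i);
    }
}

void main()
{
    auto u = UString("smörgåsbord");
    writeln(u[2], " ", u.length, " ", u.data.length);
}
```

The trade-off is visible in the comments: character indexing over a variable-width encoding costs a scan, whereas a UTF-32 backing store would make it constant time at the price of memory.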

However I still prefer my earlier suggestion.

> On Thu, 24 Nov 2005 20:15:56 +1100, Derek Parnell <derek@psych.ward> wrote:
>> On Thu, 24 Nov 2005 21:46:50 +1300, Regan Heath wrote:
>>
>>> Ok, it appears I picked some really bad type names in my proposal and it is causing some confusion.
>>
>> Regan,
>> the idea stinks. Sorry, but that *is* the nice response.
>>
>> It is far more complicated than it needs to be. Maybe it's the name confusion, but I don't think so.
>>
>> When dealing with strings, almost nobody needs to deal with partial-characters.
> 
> I think you're confused. My proposal removes the need for dealing with partial characters completely, if you think otherwise then I've done a bad job explaining it.

Apparently, or I'm a bit thicker than suspected ;-)

>> So we don't need to deal with the individual bytes that make up the
>> characters in the various UTF encodings. Sure, we will need to know how
>> big a character is from time to time. For example, given a string
>> (regardless
>> of encoding format), we might need to know how many bytes the third
>> character uses. The answer will depend on the UTF encoding *and* the code
>> point value.
> 
> Exactly my point, and the reason for the "cpn" alias.

But why the need for cp1, cp2, cp4?

>> Mostly we won't even need to know the encoding format. We might, if that
>> is an interfacing requirement, and we might in some circumstances to
>> improve
>> performance. But generally, we shouldn't care.
> 
> Yes, exactly.
> 
>> So how about we just have a string datatype called 'string'. The default
>> encoding format in RAM is compiler dependent but we can on a declaration
>> basis, define specific internal encoding format for a string.
>> Furthermore, we can access any of the three UTF encoding formats for a
>> string as a
>> property of the string. The compiler would generate the call to transcode
>> if required to. The string could also have array properties such that
>> each element addressed an entire character.
> 
> That, is exactly what I proposed.
> 
>> We can then dismiss wchar, wchar[], dchar, dchar[] entirely. And make char and char[] array have the C/C++ semantics.
> 
> I proposed exactly that, except char[] should not exist either. char and char* are all that are required.

Are you saying that we can have arrays of everything except char? I don't think that'll fly.  And char* is a pointer to a single char.

-- 
Derek Parnell
Melbourne, Australia
25/11/2005 7:27:14 AM

November 24, 2005

Re: Unified String Theory [READ THIS FIRST]

Posted by Derek Parnell
in reply to Regan Heath

On Fri, 25 Nov 2005 09:34:30 +1300, Regan Heath wrote:


> Sad but probably true. I was hoping to avoid using 32bits everywhere :(

I also use the Euphoria programming language and this uses 32-bit characters exclusively. You do not notice any performance hit because of that. The only complaint that some people have is that strings use too much RAM (but these people also use Windows 95).

-- 
Derek Parnell
Melbourne, Australia
25/11/2005 7:39:28 AM

November 24, 2005

Re: Unified String Theory..

Posted by Regan Heath
in reply to Jari-Matti Mäkelä

On Thu, 24 Nov 2005 13:34:51 +0200, Jari-Matti Mäkelä <jmjmak@invalid_utu.fi> wrote:
> Regan, your proposal is absolutely too complex. I don't get it and I really don't like it. D is supposed to be a _simple_ language. Here's an alternative proposal:

<snip>

Thanks for your opinion. It appears some parts of my idea were badly thought out. I was trying to end up with something simple, it seems a few of my choices were bad ones and they simply complicated the idea.

I was trying to avoid picking any 1 type over the others (as you have suggested here).

It appears now that I should replace all my talk about cp1, cp2, cp4 and cpn with "all characters are stored in a 32 bit type called uchar". If anyone has a problem with that, I'd direct them to take a look at std.format.doFormat and std.stdio.writef which convert all char[] data into individual dchars before converting it back to UTF-8 for output to the screen.

> Please stop whining about the slowness of utf-conversions. If it's really so slow, I would certainly want to see some real world benchmarks.

I mention performance only because people have been concerned with it in the past. I too have no idea how much time it takes and would like to see a benchmark. The fact that D is already doing it with writef and no-one has complained...

Regan.

November 24, 2005

Re: Unified String Theory [READ THIS FIRST]

Posted by Regan Heath
in reply to Derek Parnell

On Fri, 25 Nov 2005 07:37:18 +1100, Derek Parnell <derek@psych.ward> wrote:
> On Thu, 24 Nov 2005 22:49:36 +1300, Regan Heath wrote:
>
>> Derek, I must have done a terrible job explaining this, because you've
> completely misunderstood me, in fact your counter proposal is essentially
>> what my proposal was intended to be.
>
> You seemed to be wanting to have data types that could only hold
> character fragments. I can't see the point of that.

No, never fragments, always complete code points. I tried to stress this point. The 8 bit type would hold all the code points with values that fit in 8 bits and never anything else; its value would always be a code point, not a fragment.

> If strings must be arrays, then let there be an atomic data type that
> represents a character and then strings can be arrays of characters. The
> UTF encoding of the string is just an implementation detail then. All
> indexing would be done on a character basis regardless of the underlying
> encoding.
>
> In other words, if 'uchar' is the data type that holds a character then
>
>   alias uchar[] string;
>
> could be used. The main 'new thinking' is that uchar could actually have
> variable size in terms of bytes. It could be 1, 2 or 4 bytes depending on
> the character it holds and the encoding used at the time.
>
> However I still prefer my earlier suggestion.

I suspect now that all individual characters will have to be represented by a 32 bit type, uchar is a good name for it. If you take my proposal, throw away all the garbage about cp1, cp2, cp4, and cpn then replace them with a new type "uchar" which is 32 bits large, and always use this to represent individual characters then it starts to work, I believe.

> Apparently, or I'm a bit thicker than suspected ;-)

I've just used confusing terms and done a bad job explaining I think.

> But why the need for cp1, cp2, cp4?

This was intended to avoid ASCII programs having to use a 32 bit type for all their characters, and so on.

>> I proposed exactly that, except char[] should not exist either.
>> char and char* are all that are required.
>
> Are you saying that we can have arrays of everything except char?

Yes, because we don't need an array of char. char is simply there for interfacing to C.

> I don't think that'll fly.  And char* is a pointer to a single char.

Technically true, but when you're talking about a C function it's a pointer to the start of a string which is null terminated. That's all we need it for in D.
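
For reference, this is roughly how the hand-off to C looks in present-day D, assuming std.string.toStringz/fromStringz and the C strlen binding in core.stdc.string; cLength and fromC are made-up helper names:

```d
import core.stdc.string : strlen;
import std.stdio : writeln;
import std.string : fromStringz, toStringz;

/// Length of s in bytes, as seen by C's strlen.
/// (Hypothetical helper; assumes s contains no embedded '\0'.)
size_t cLength(string s)
{
    // toStringz yields a pointer to a zero-terminated version of s
    const(char)* p = s.toStringz;
    return strlen(p);
}

/// Copies a zero-terminated C string back into a D string.
string fromC(const(char)* p)
{
    // fromStringz slices up to (not including) the terminator
    return fromStringz(p).idup;
}

void main()
{
    writeln(cLength("smörgåsbord"));
    writeln(fromC("abc".toStringz));
}
```

C sees bytes, not characters, so strlen reports the number of UTF-8 code units rather than the number of letters.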

Regan

November 24, 2005

Re: Unified String Theory [READ THIS FIRST]

Posted by Regan Heath
in reply to Derek Parnell

On Fri, 25 Nov 2005 07:41:15 +1100, Derek Parnell <derek@psych.ward> wrote:
> On Fri, 25 Nov 2005 09:34:30 +1300, Regan Heath wrote:
>
>
>> Sad but probably true. I was hoping to avoid using 32bits everywhere :(
>
> I also use the Euphoria programming language and this uses 32-bit
> characters exclusively. You do not notice any performance hit because of
> that. The only complaint that some people have is that strings use too much RAM (but these people also use Windows 95).

Interesting. In that case I think my "string" type has an advantage. The data could actually be stored in either UTF-8, UTF-16 or UTF-32 internally and only converted to/from the 32 bit char when required.
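
A rough sketch of the RAM side of that argument, assuming present-day Phobos (std.utf.toUTF16/toUTF32); storageBytes is a made-up helper name. It compares the bytes needed to hold the same text in each of the three encodings:

```d
import std.stdio : writeln;
import std.utf : toUTF16, toUTF32;

/// Bytes needed to store s as UTF-8, UTF-16 and UTF-32, in that order.
/// (Hypothetical helper; counts code units only, no array overhead.)
size_t[3] storageBytes(string s)
{
    return [
        s.length * char.sizeof,
        s.toUTF16.length * wchar.sizeof,
        s.toUTF32.length * dchar.sizeof,
    ];
}

void main()
{
    writeln(storageBytes("test"));
    writeln(storageBytes("smörgåsbord"));
}
```

For mostly-ASCII text the UTF-8 form is the smallest by a wide margin, which is the case where keeping the compact encoding internally and only widening single characters on access pays off.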

Regan.
