Unicode discussion (page 7) - D Programming Language Discussion Forum


December 20, 2003

Re: Unicode discussion

Posted by Hauke Duden
in reply to Hauke Duden


Hauke Duden wrote:
> Another solution would be if there was some way to write global conversion functions that are called to do implicit conversions between different types. Such functions could also be useful in many other circumstances, so that might be an idea to think about.

Just to clarify: I meant this in the context of creating a string interface instance from a string constant, not to convert between different string objects (which wouldn't make much sense).

E.g.

interface string
{
...
}

class MyString : string
{
...
}

void print(string msg)
{
...
}



Without an implicit conversion we'd have to write:

print(new MyString("Hello World"));



With an implicit conversion that'd look like this:

string opConvert(char[] s)
{
	return new MyString(s);
}

print("Hello World");

[The last line would translate to print(opConvert("Hello World")) ]


Hauke
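
For comparison, the effect of the proposed opConvert can be approximated with an ordinary overload that does the wrapping by hand. This is only a sketch under stated assumptions: opConvert is a suggestion made in this post, not a language feature, and the names IString and text, as well as the return value of print, are invented for illustration. The interface is renamed because current D compilers already define string as an alias for immutable(char)[].

```d
import std.stdio : writeln;

// Hypothetical string interface (called `string` in the post above).
interface IString
{
    string text();
}

// Hypothetical implementation that simply stores UTF-8 data.
class MyString : IString
{
    private string data;

    this(string s) { data = s; }

    string text() { return data; }
}

// Prints the message and returns the number of UTF-8 code units written.
size_t print(IString msg)
{
    writeln(msg.text());
    return msg.text().length;
}

// Overload that plays the role of the proposed implicit conversion:
// it wraps the literal and forwards to the interface version.
size_t print(string s)
{
    return print(new MyString(s));
}
```

With the second overload in place, print("Hello World") compiles without an explicit new MyString(...), at the cost of one forwarding function per entry point.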

December 20, 2003

Re: Unicode discussion

Posted by Walter
in reply to Rupert Millard


The problem with the operator* or operator~ syntax is that it is ambiguous. It's also not greppable.

"Rupert Millard" <rupertamillard@hotmail.DELETE.THIS.com> wrote in message news:brvr60$2il5$1@digitaldaemon.com...
> I agree with you, but we just have to grin and bear it, unless / until Walter changes his mind. I suppose I could have commented my code better though. Hopefully as I become more experienced, I will be a better judge of these things.
>
> "Sean L. Palmer" <palmer.sean@verizon.net> wrote in message news:brvlj9$29qh$1@digitaldaemon.com...
> > Cool beans!  Thanks, Rupert!
> >
> > This brings up a point.  The main reason that I do not like opAssign/opAdd syntax for operator overloading is that it is not self-documenting that opSlice corresponds to a[x..y] or that opAdd corresponds to a + b or that opCatAssign corresponds to a ~= b.  This information either has to be present in a comment or you have to go look it up.  Yeah, D gurus will have it memorized, but I'd rather there be just one "name" for the function, and it should be the same both in the definition and at the point of call.
> >
> > Sean
> >
> > "Rupert Millard" <rupertamillard@hotmail.DELETE.THIS.com> wrote in message news:brvghd$21n8$2@digitaldaemon.com...
> > > There has been a lot of talk about doing things, but very little has actually happened. Consequently, I have made a string interface and two rough and ready string classes for UTF-8 and UTF-32, which are attached to this message.
> > >
> > > Currently they only do a few things, one of which is to provide a consistent interface for character manipulation. The UTF-8 class also provides direct access to the bytes for when the user can do things more efficiently with these. They can also be appended to each other. In addition, each provides a constructor taking the other one as a parameter.
> > >
> > > Please bear in mind that I am only an amateur programmer, who knows very little about Unicode and has no experience of programming in the real world. Nevertheless, I can appreciate some of the issues here and I hope that these classes can be the foundation of something more useful.
> > >
> > > From,
> > >
> > > Rupert

December 20, 2003

Re: Unicode discussion

Posted by Rupert Millard
in reply to Walter


If you say it's ambiguous, I'll take your word for it and if you think being greppable is important, I'm also happy to accept that. My personal opinions are not all that strong - it's only a minor inconvenience to have to check the overload function names.

More importantly, what do you think of my request for more opSlice overloads?

From,

Rupert

"Walter" <walter@digitalmars.com> wrote in message news:bs08b8$527$2@digitaldaemon.com...
> The problem with the operator* or operator~ syntax is that it is ambiguous. It's also not greppable.
>
> "Rupert Millard" <rupertamillard@hotmail.DELETE.THIS.com> wrote in message news:brvr60$2il5$1@digitaldaemon.com...
> > I agree with you, but we just have to grin and bear it, unless / until Walter changes his mind. I suppose I could have commented my code better though. Hopefully as I become more experienced, I will be a better judge of these things.
> >
> > "Sean L. Palmer" <palmer.sean@verizon.net> wrote in message news:brvlj9$29qh$1@digitaldaemon.com...
> > > Cool beans!  Thanks, Rupert!
> > >
> > > This brings up a point.  The main reason that I do not like opAssign/opAdd syntax for operator overloading is that it is not self-documenting that opSlice corresponds to a[x..y] or that opAdd corresponds to a + b or that opCatAssign corresponds to a ~= b.  This information either has to be present in a comment or you have to go look it up.  Yeah, D gurus will have it memorized, but I'd rather there be just one "name" for the function, and it should be the same both in the definition and at the point of call.
> > >
> > > Sean
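
To make the name-to-syntax mapping under discussion concrete, here is a small sketch of a type that overloads concatenation and slicing. It assumes a current D compiler, where most of the named hooks mentioned in the thread (opAdd, opCatAssign and so on) are written as templates keyed on the operator string (opBinary, opOpAssign), while opSlice keeps its name. The Text type is hypothetical and is not one of the classes posted in this thread.

```d
// Hypothetical wrapper used only to show where each overload is invoked.
struct Text
{
    string data;

    // a ~ b   (binary operators are selected by the string "~")
    Text opBinary(string op : "~")(Text rhs)
    {
        return Text(data ~ rhs.data);
    }

    // a ~= b  (what the thread calls opCatAssign)
    ref Text opOpAssign(string op : "~")(Text rhs)
    {
        data ~= rhs.data;
        return this;
    }

    // a[x .. y]  (opSlice); the indices count UTF-8 code units
    Text opSlice(size_t x, size_t y)
    {
        return Text(data[x .. y]);
    }
}

unittest
{
    auto a = Text("foo");
    auto b = Text("bar");
    assert((a ~ b).data == "foobar");
    a ~= b;
    assert(a.data == "foobar");
    assert(a[1 .. 4].data == "oob");
}
```

In this form the operator symbol appears in the definition, which addresses part of Sean's complaint, and the fixed prefixes opBinary and opOpAssign remain easy to search for.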

December 20, 2003

Re: Unicode discussion

Posted by Karl Bochert
in reply to Walter


On Thu, 18 Dec 2003 16:05:47 -0800, "Walter" <walter@digitalmars.com> wrote:
> 
> "Sean L. Palmer" <palmer.sean@verizon.net> wrote in message news:brssrg$135p$1@digitaldaemon.com...
> > So you're saying that char[] means UTF-8, and wchar[] means UTF-16, and dchar[] means UTF-32?
> 
> Yes. Exactly.
> 
> > Unfortunately then a char won't hold a single Unicode character,
> 
> Correct. But a dchar will.
> 

A char is defined as a UTF-8 character but does not have enough storage to hold one!?

ubyte[4] declares storage for 4 bytes, but char[4]

The D manual describes a char as being a UTF-8 char AND being 8 bits?

Can't a single UTF-8 character require multiple bytes for representation?

A datatype is some storage and a set of operations that can be done on that storage. In what way are char and ubyte different datatypes?

An array of a datatype is an indexable set of elements of that type. (Isn't it?)
Given
    char foo[4];

does foo[2] not represent the third char in foo !!??

I would think that the datatype char would be a UTF-8 character, with no indication of the amount of storage it used. The compiler would be free to represent it internally however it chose. Indexing should work (perhaps inefficiently).

D's datatypes seem to be of two different varieties: names for units of memory and names for abstract types. Some (ubyte) describe a fixed amount of physical storage, while others (ifloat?) describe an abstract datatype whose physical structure is hidden (or at least irrelevant).
Which is char?

Karl Bochert

December 20, 2003

Re: Unicode discussion

Posted by Sean L. Palmer
in reply to Walter


It would be greppable if it were required that there be no space between the operator and the symbol.  (if you use regexp you can get around this)

There should be some other way to embed the symbol into the identifier, if it's causing too many lexer problems.

Sean

"Walter" <walter@digitalmars.com> wrote in message news:bs08b8$527$2@digitaldaemon.com...
> The problem with the operator* or operator~ syntax is that it is ambiguous. It's also not greppable.

December 21, 2003

Re: Unicode discussion

Posted by Elias Martenson
in reply to Karl Bochert


On Sat, 20 Dec 2003 19:33:59 +0000, Karl Bochert wrote:

> D's datatypes seem to be of two different varieties; names for units of memory and names for abstract types. Some (ubyte) describe a fixed amount of physical storage, while others (ifloat?) describe an abstract datatype whose physical structure is hidden (or at least irrelevant)
> Which is char?

It's a fixed memory type. Look at it as a ubyte, but with some special guarantees (upheld by convention).

By your own question you have pointed out that the name "char" is not very good. But I really should stop pointing this out, or I'll be banned before I even get started with providing any actual value to the project. :-)

Regards

Elias Mårtenson

December 21, 2003

Re: Unicode discussion

Posted by Walter
in reply to Rupert Millard


"Rupert Millard" <rupertamillard@hotmail.DELETE.THIS.com> wrote in message news:bs1d9b$2033$1@digitaldaemon.com...
> More importantly, what do you think of my request for more opSlice overloads?

I haven't got that far yet!

December 21, 2003

Re: Unicode discussion

Posted by Walter
in reply to Karl Bochert


"Karl Bochert" <kbochert@copper.net> wrote in message news:1103_1071948839@bose...
> A char is defined as a UTF-8 character but does not have enough storage to hold one!?

Right.

> The D manual describes a char as being a UTF-8 char AND being 8 bits?

Yes.

> Can't a single UTF-8 character require multiple bytes for representation?

No.

> A datatype is some storage and a set of operations that can be done on that storage. In what way are char and ubyte different datatypes?

Only how they are overloaded, and how string literals are handled.

> An array of a datatype is an indexable set of elements of that type. (Isn't it?)
> Given
>     char foo[4];
>
> does foo[2] not represent the third char in foo !!??

If it makes more sense, it is the third byte in foo.

> I would think that the datatype char would be a UTF-8 character, with no indication of the amount of storage it used. The compiler would be free to represent it internally however it chose. Indexing should work (perhaps inefficiently)

That would be a higher level view of it, and I suggest a wrapper class around it can provide this.

> D's datatypes seem to be of two different varieties; names for units of memory and names for abstract types. Some (ubyte) describe a fixed amount of physical storage, while others (ifloat?) describe an abstract datatype whose physical structure is hidden (or at least irrelevant)
> Which is char?

char is a fixed 8 bits of storage.
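
The answers above can be checked with a short sketch. It assumes a current D compiler, in which string is an alias for immutable(char)[]; the helper name codePoints is invented for illustration. Indexing a char array yields one 8-bit code unit, while iterating with a dchar loop variable decodes whole characters.

```d
// Counts characters by letting foreach decode the UTF-8 data into dchar.
size_t codePoints(string s)
{
    size_t n = 0;
    foreach (dchar c; s)
        ++n;
    return n;
}

unittest
{
    // The three character types have fixed sizes.
    static assert(char.sizeof == 1);
    static assert(wchar.sizeof == 2);
    static assert(dchar.sizeof == 4);

    string s = "a\u00E9\u20AC";   // 'a', 'é', '€'
    assert(s.length == 6);        // 1 + 2 + 3 code units
    assert(s[1] == 0xC3);         // first byte of 'é' (C3 A9)
    assert(s[2] == 0xA9);         // s[2] is the third byte, not the third character
    assert(codePoints(s) == 3);
}
```

So foo[2] is always the third code unit. It is the third character only when everything before it is ASCII.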

December 21, 2003

Re: Unicode discussion

Posted by Roald Ribe
in reply to Walter


"Walter" <walter@digitalmars.com> wrote in message news:bs3pmm$2m0v$2@digitaldaemon.com...
>
> "Karl Bochert" <kbochert@copper.net> wrote in message news:1103_1071948839@bose...
> > A char is defined as a UTF-8 character but does not have enough storage to hold one!?
>
> Right.
>
> > The D manual describes a char as being a UTF-8 char AND being 8 bits?
>
> Yes.
>
> > Can't a single UTF-8 character require multiple bytes for representation?
>
> No.

???
A Unicode character can take up to 6 bytes when encoded as UTF-8, which is what the poster meant to ask, I think.

Roald
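
The number of bytes per character can be measured directly. The sketch below assumes Phobos' std.utf.codeLength as shipped with current D compilers; utf8Length is a hypothetical helper name. The 6-byte figure matches the original UTF-8 design, but RFC 3629 (November 2003) and the Unicode Standard limit code points to U+10FFFF, so a valid sequence is at most 4 bytes long.

```d
import std.utf : codeLength;

// Number of UTF-8 code units (bytes) needed to encode one code point.
size_t utf8Length(dchar c)
{
    return codeLength!char(c);
}

unittest
{
    assert(utf8Length('a') == 1);           // U+0061
    assert(utf8Length('\u00E9') == 2);      // U+00E9
    assert(utf8Length('\u20AC') == 3);      // U+20AC
    assert(utf8Length('\U0001F600') == 4);  // U+1F600
    assert(utf8Length(dchar.max) == 4);     // U+10FFFF, the largest code point
}
```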

December 21, 2003

Re: Unicode discussion

Posted by Rupert Millard
in reply to Walter


> > I would think that the datatype char would be a UTF-8 character, with no indication of the amount of storage it used. The compiler would be free to represent it internally however it chose. Indexing should work (perhaps inefficiently)
>
> That would be a higher level view of it, and I suggest a wrapper class around it can provide this.

On Friday 19th, I posted a class that provides this functionality to this thread.

You can see the message here: http://www.digitalmars.com/drn-bin/wwwnews?D/20619

As for the attached file - it does not appear to be accessible to users of the webservice, so I have placed it on the wiki at: http://www.wikiservice.at/wiki4d/wiki.cgi?StringClasses

Rupert
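
For readers without access to the attachment, a wrapper of the kind discussed here might look roughly like the following. This is not the posted class: Utf8View and its members are hypothetical, and the sketch assumes the std.utf functions count, stride and decode from current Phobos. It keeps the UTF-8 bytes accessible and adds character-level indexing on top, which is linear in the index, as Karl anticipated.

```d
import std.utf : count, decode, stride;

// Hypothetical view over UTF-8 data that indexes by character (code point).
struct Utf8View
{
    string data;   // the underlying bytes stay directly accessible

    // Number of characters; walks the whole string, so O(n).
    size_t length()
    {
        return count(data);
    }

    // n-th character: skip n encoded sequences, then decode one.
    dchar opIndex(size_t n)
    {
        size_t i = 0;
        foreach (_; 0 .. n)
            i += stride(data, i);
        return decode(data, i);
    }
}

unittest
{
    auto v = Utf8View("a\u00E9\u20ACz");
    assert(v.data.length == 7);   // bytes
    assert(v.length == 4);        // characters
    assert(v[2] == '\u20AC');
}
```

A production version would probably cache positions or expose a range instead of repeated indexing, since each v[n] call rescans from the start.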


Copyright © 1999-2021 by the D Language Foundation