Unicode Character and String Intrinsics (page 3) - D Programming Language Discussion Forum

April 01, 2003

Re: Unicode Character and String Intrinsics

Posted by Mark Evans
in reply to Matthew Wilson

>This sounds like a nice idea - array of 1st-byte plus lookups.

Thanks.  Correction, "array of first code words." Only in UTF-8 are they byte-sized.

>I'm intrigued as to the nature of the lookup table. Is this a constant, process-wide, entity?

No. There is one table per string.

>I'd be keen to participate in the
>serialisation stuff

No need for serialization. Even the compiler can do serialization with no memory footprint. Only something like an explicit conversion to ubyte[] would mandate that.

>It's not clear now whether you've dropped the suggestion for a separate string class, or just that arrays of "char" types would be dealt with in the fashion that you've outlined.

I never suggested a string 'class,' just Unicode string and char intrinsic types. My list of proposed intrinsics has already been supplied. Think int, float, string8, string16, char8, etc.

C made a huge mistake in confusing arrays with strings. Strings deserve intrinsic status and a type all their own.  The ugly char/wchar gimmick has also seen its day and needs replacement.

Mark

The internal implementation might read like this in C++-ish (heavy on the "ish"); it is just a communication vehicle for the concept:

#include <cstdint>
#include <unordered_map>

// code word storage types
typedef std::uint8_t     UTF8_CODE;
typedef std::uint16_t    UTF16_CODE;
typedef std::uint32_t    UTF32_CODE;

// max code words per Unicode character
// (6 is the original UTF-8 upper bound; current UTF-8 uses at most 4)
const unsigned short     UTF8_CODE_MAX  = 6;
const unsigned short     UTF16_CODE_MAX = 2;
const unsigned short     UTF32_CODE_MAX = 1;

template <typename UTF_CODE, unsigned short UTF_CODE_MAX>
class ExtensionTableEntry
{
public:
    int       myStringPositionIndex;       // character index within the string
    UTF_CODE  myStorage[UTF_CODE_MAX+1];   // null terminated?
};

// a partially defined Unicode String class concept
template <typename UTF_CODE, unsigned short UTF_CODE_MAX>
class UnicodeString
{
public:
    long                    length;
    // declaration only: returns a pointer to the character's code words
    UTF_CODE*               operator[](long index);
private:
    UTF_CODE*               firstWordsArray;
    // keyed by character index (std::hash_map in pre-standard libraries)
    std::unordered_map<
        int,
        ExtensionTableEntry<UTF_CODE,UTF_CODE_MAX>
    >                       myLookup;
};

typedef UnicodeString<UTF8_CODE,UTF8_CODE_MAX>    String8;
typedef UnicodeString<UTF16_CODE,UTF16_CODE_MAX>  String16;
typedef UnicodeString<UTF32_CODE,UTF32_CODE_MAX>  String32;

/* Walter - each table entry should hold the full Unicode char not
just its extension codes. This tactic would create some redundancy,
but not much. Having the whole character in contiguous memory could be
advantageous for passing pointers around. So the C++ operator[] either
returns a pointer into the firstWordsArray, or a pointer to the table
entry's myStorage field. In all cases the firstWordsArray always holds
the first code word of the char, whether it's an extended one or not. */
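To make the indexing idea concrete, here is a minimal runnable sketch of the two-part layout for UTF-8 only. It is not the proposed intrinsic; the name String8Sketch, the use of std::string for table entries, and the 4-byte sequence limit are all assumptions made for illustration, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: an array of first code words plus a per-string table
// that stores the full byte sequence of every multi-byte character.
class String8Sketch {
public:
    // Builds both parts from well-formed UTF-8 (no validation is done here).
    explicit String8Sketch(const std::string& utf8) {
        std::size_t i = 0;
        while (i < utf8.size()) {
            const unsigned char lead = static_cast<unsigned char>(utf8[i]);
            const std::size_t n = sequenceLength(lead);
            if (n > 1) {
                // Extended character: the table keeps the whole sequence,
                // keyed by the character index it is about to occupy.
                extensions_[firstWords_.size()] = utf8.substr(i, n);
            }
            firstWords_.push_back(lead);  // always store the first code word
            i += n;
        }
    }

    // Length in characters, not bytes.
    std::size_t length() const { return firstWords_.size(); }

    // Character at `index` as its full byte sequence; average O(1) lookup.
    std::string at(std::size_t index) const {
        assert(index < firstWords_.size());
        const auto it = extensions_.find(index);
        if (it != extensions_.end()) return it->second;
        return std::string(1, static_cast<char>(firstWords_[index]));
    }

private:
    // Sequence length announced by a lead byte (assumes the 4-byte limit).
    static std::size_t sequenceLength(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        return 4;
    }

    std::vector<std::uint8_t> firstWords_;
    std::unordered_map<std::size_t, std::string> extensions_;
};
```

The point of the layout is that indexing by character never scans the string: a pure-ASCII string has an empty table, and only extended characters pay for a hash lookup.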

April 01, 2003

Re: Unicode Character and String Intrinsics

Posted by Bill Cox
in reply to Mark Evans

Hi, Mark.

Mark Evans wrote:
> Bill the point is that trying to paint me this or that color, instead of
> focusing on something specific, is ad hominem.  I find it patronizing.
> Especially since on this point you've already agreed with me explicitly.
> 
> We can quibble on specifics.  I want 3 char types, you want 2 (UTF8 + char) or
> maybe even 3 (UTF8 + char + wchar).
> 
> I have much to say about those bizarre meta programming concepts.  I have worked
> in EDA and know that domain - you can't blow smoke in my face, even if others
> are impressed.  All I would say here is that by your own admission, you're
> trying to write code for 'average' or 'dumb' programmers, so please focus on
> doing just that.

Ok, I'll bite... Why do you feel I'm blowing smoke in your face?

As for the meta-programming stuff, we use DataDraw today to do lots of it, and I find it very productive, particularly for our EDA work.  In particular, we added dynamic class extensions, recursive destructors, array bounds checking, and pointer indirection checking to C.  The code generators also give us much of the power of template frameworks.  We also use a memory mapping model that works great on 64-bit machines, where EDA is headed fast (we use the Sheesh Kabob code generator).  All of these have very specific benefits for EDA, which I've covered in previous posts.

Before calling it bizarre, why not look into it?  A fairly recent version of DataDraw is available at:

http://www.viasic.com/download/datadraw.tar.gz

Most GUI programmers use Class Wizard, which is much the same kind of thing.  Should that capability be in the language?  Possibly.  The concept has been researched by other groups, and one way to do it is to add "compile-time reflection classes" to the language.  OpenC++ is one example of this approach.  XL does it, too.

Also, we don't hire average or dumb programmers.  We hire brilliant programmers, and train them to code as if the target audience were stupid people.  This really helps them work together, and helps the code last over time.  It helps our business output a consistent product - the code looks much the same no matter who wrote it.  There are good business reasons for this.

Putting a restrictive coding methodology in place doesn't restrict how an algorithm works, just how the implementation looks.  So far, there have been exactly 0 algorithms that had to be changed in order to fit into our methodology.  We encourage our programmers to be as creative as possible in algorithm development, and to come up with brilliant solutions.  We enable them to implement those algorithms quickly and efficiently with a consistent, solid, and proven coding methodology. They spend less time thinking about how to write code, and more time writing it.  It's one of our competitive tools for success.

Bill

April 01, 2003

Re: Unicode Character and String Intrinsics

Posted by Bill Cox
in reply to Mark Evans

Hi, Mark.

Mark Evans wrote:
> Bill Cox wrote,
> 
> The compiler is open-source.  Contributions are welcome.  (Wasn't it you who
> said recently, 'I had a few days off and rewrote the D compiler' or words to
> that effect?  Forgive me if memory fails, I think it was you.)

I wrote a toy compiler to test out some ideas in a few days off, not a D compiler.  There's a huge difference between a week's effort, and what D has become.  In fact C++ is so complex, the compilers out there still aren't complete.  Keeping D simple is key to avoiding this fate.

The fact that D's front-end is open-source is an even greater reason for the language itself to be simple.  The author of Linux has a lot to say about keeping open-source code simple.  He blasted GNU's Hurd effort for its complexity.  I agree with him.  The fact that I'm writing this note using a Linux kernel instead of a GNU Hurd kernel supports his assertion.

Last I checked, the D front-end was 35K lines of handwritten code, which is impressively small given the functionality and commenting. That's still a lot to learn if you just want to contribute, but it's doable.  When it reaches 100K lines, the language is in real trouble.  Not many of us will be willing to work with a program that huge, unless we're getting paid.

Bill

April 01, 2003

Re: Unicode Character and String Intrinsics

Posted by Sean L. Palmer
in reply to Mark Evans

The only problem with this idea is that passing this dual structure to a piece of code that expects a linear string of data won't work.

Typecasting to ubyte[] or ushort[] should solve that, right?

You would probably need to know the length of such a string both in bytes and in chars.

Sean
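As a rough illustration of why both lengths matter, the sketch below measures a UTF-8 buffer in bytes and in characters (code points). The names Utf8Lengths and measureUtf8 are hypothetical, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical pair of lengths for one UTF-8 string.
struct Utf8Lengths {
    std::size_t bytes;  // storage size of the linear form
    std::size_t chars;  // number of code points
};

// Counts code points by skipping continuation bytes (bit pattern 10xxxxxx).
Utf8Lengths measureUtf8(const std::string& s) {
    Utf8Lengths result{s.size(), 0};
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) ++result.chars;
    }
    assert(result.chars <= result.bytes);  // a character needs at least 1 byte
    return result;
}
```

A caller that casts the string to a byte array needs `bytes` to size the buffer, while indexing and slicing by character need `chars`; the two agree only for pure ASCII.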


April 02, 2003

Re: Unicode Character and String Intrinsics

Posted by Mark Evans
in reply to Sean L. Palmer

Sean L. Palmer says...
>The only problem with this idea is that passing this dual structure to a piece of code that expects a linear string of data won't work.

Serialization at choke points has a cost of (a) zero, because the string has no
extended codes (say typ. 95%+ of UTF-16 and by definition 100% of UTF-32), or
(b) an alloc plus copy equivalent, which is acceptable for small to medium
strings (another statistically large class in software programs).

You run into problems only with large UTF-8 strings that are frequently passed to/from Unicode APIs.  Windows uses UTF-16 so it's no problem.  Where you find UTF-8 happening is on the web, but that has inherent delays of its own, so the cost might go unnoticed.  Consider for example that plenty of web sites are driven with UTF-8 by languages far slower than D.

Mark
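The two cost cases can be sketched for UTF-16 as follows. This is only an illustration under assumed names (CodeUnits, ExtensionTable, LinearView, linearize) and an assumed layout in which the table maps a character index to that character's full code sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using CodeUnits = std::vector<std::uint16_t>;
// Hypothetical layout: character index -> full code sequence of that character.
using ExtensionTable = std::map<std::size_t, CodeUnits>;

// Non-owning view of a linear buffer, as a C-style API would expect.
struct LinearView {
    const std::uint16_t* data;
    std::size_t size;
};

// Produces a linear UTF-16 buffer at a "choke point".
LinearView linearize(const CodeUnits& firstWords, const ExtensionTable& table,
                     CodeUnits& scratch) {
    if (table.empty()) {
        // Case (a): no extended characters, so the first-word array already
        // is the linear string and nothing is copied.
        return {firstWords.data(), firstWords.size()};
    }
    // Case (b): one allocation plus a copy, splicing in full sequences.
    scratch.clear();
    scratch.reserve(firstWords.size() + table.size());  // lower bound
    for (std::size_t i = 0; i < firstWords.size(); ++i) {
        const auto it = table.find(i);
        if (it == table.end()) {
            scratch.push_back(firstWords[i]);
        } else {
            // The table entry repeats the first code word by design.
            assert(!it->second.empty() && it->second.front() == firstWords[i]);
            scratch.insert(scratch.end(), it->second.begin(), it->second.end());
        }
    }
    return {scratch.data(), scratch.size()};
}
```

The view returned in case (a) aliases the string's own storage, so it is valid only while the string is unchanged; in case (b) it is valid only while `scratch` is.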

April 02, 2003

Re: Unicode Character and String Intrinsics

Posted by Mark Evans
in reply to Bill Cox

Please don't turn this into yet another thread about DataDraw or dubious management 'expertise.'  (Put up a wiki board somewhere, OK?  I could show you five different ways from Sunday to replace DataDraw with better code using standard languages/libraries/mixins/design patterns/tools of which you seem ignorant.  Sorry you'll have to pay me though.)

Thank you for supporting the idea that D needs some kind of native Unicode support.

Mark

April 02, 2003

Re: Unicode Character and String Intrinsics

Posted by Mark Evans
in reply to Bill Cox

>Keeping D simple is key to avoiding this fate.

Unicode intrinsics make D a simple language.  That is the point of having them. I assume you are still with me that D needs them.

The notion is to rid D of ugly 30-year-old C confusions about strings, and to bring their formats up to modern standards in the bargain.  We can't help the extra work of Unicode; that is what the world wants.

>The fact that D's front-end is open-source is an even greater reason for the language itself to be simple.

No one said otherwise.  You keep propping up straw-men to tear down.  They are purely your own creations.  It's amusing to watch you rip them down, but little else beyond that.  We all want the language to be as simple and orthogonal as possible.  That's why I worry about D's rigid adherence to C++ as a design baseline.

Look Bill - my design sense is as good as yours, maybe better, and definitely more informed.  You need not lecture me about simplicity.  To be frank, your work betrays complicated over-engineering and reinvented wheels. From my viewpoint you are the one who needs simplicity lessons.

Furthermore I do not 'advocate' everything that I post.  You halfway accused me of 'advocating' multimethods, and I don't recall once doing that.  I merely linked to a short article showing how multimethods simplify code.

I do advocate functional approaches, for this reason:  they allow me to simplify my code.  You see, I like simplicity.

There are software engineering concepts that C++ does not offer and it's important for a new language effort to know about them.  That way, even if rejected, a decision about the concepts was made on facts, not ignorance.

If you agree with me about Unicode intrinsics, to whatever degree, then bite the bullet and be done with it.  You really are going over the top on this.

Mark

April 02, 2003

DataDraw (was Re: Unicode Character and String Intrinsics)

Posted by Helmut Leitner
in reply to Bill Cox

Bill Cox wrote:
> Before calling it bizarre, why not look into it?  A fairly recent version of DataDraw is available at:
> 
> http://www.viasic.com/download/datadraw.tar.gz

When I read one of your postings a week ago, I googled for DataDraw and didn't find references or a download page, although you said it is open source. I found this very weird.

I also didn't get the impression that you were connected to the project. Now I see in the About box that you are the lead developer...

There is no LICENSE. The documentation is so incomplete that I wouldn't even start trying to use it (although its date says 1993).

There are surely better ways to advertise your project.
Why don't you set up an official OS project at SourceForge
and complete the documentation?

--
Helmut Leitner    leitner@hls.via.at Graz, Austria   www.hls-software.com

April 02, 2003

Re: Unicode Character and String Intrinsics

Posted by Matthew Wilson
in reply to Mark Evans

Mark

Not wishing to get in the middle of you two stags, but aren't you getting a bit over the top? I don't doubt that all your skills are as incomparable as you assert - though I note you did not add an entry to the "Introductions" thread, why was that? - but do we really need to be told all the time?

Frankly it's beginning to taste a little like Boost, not to mention a waste of time in the lives of lots of busy people in reading through them to get to the technical points (which are very interesting, I must say) that you're making.


April 02, 2003

Re: DataDraw (was Re: Unicode Character and String Intrinsics)

Posted by Bill Cox
in reply to Helmut Leitner

Hi, Helmut.

Helmut Leitner wrote:
> 
> Bill Cox wrote:
> 
>>Before calling it bizarre, why not look into it?  A fairly recent
>>version of DataDraw is available at:
>>
>>http://www.viasic.com/download/datadraw.tar.gz
> 
> 
> When I read one of your postings a week ago, I googled for DataDraw
> and didn't find references or a download page, although you said
> it is open source. I found this very weird.
> 
> I also didn't get the impression that you were connected to the project.
> Now I see in the About box that you are the lead developer...
> 
> There is no LICENSE. The documentation is so incomplete that I
> wouldn't even start trying to use it (although its date says 1993).
> 
> There are surely better ways to advertise your project.
> Why don't you set up an official OS project at SourceForge
> and complete the documentation?
> 
> --
> Helmut Leitner    leitner@hls.via.at   Graz, Austria   www.hls-software.com

I'm not trying to advertise DataDraw.  In fact, I'd love to see D incorporate features that would allow me to kill it.  I'd prefer that users didn't start adopting DataDraw, as I don't have the time to do free support.

It's open-source, as the copyright file describes.  It's a very weak copyright, meant to be weaker than the GNU GPL.  The documentation sucks, and I think it will probably stay that way.

I did write the first version, and placed it into the open-source domain.  The guys who wrote the second one kept me listed in the about box, but I didn't write the code.  So far as I know, DataDraw is only in use at ViASIC (my company), QuickLogic, and Synplicity.  None of these companies has any reason to promote it.

It's specific insights I've gained in working with DataDraw that I've been trying to describe in this group, rather than trying to promote DataDraw.  I only posted it because someone asked me to, and the license requires that I do.

Through using DataDraw for many years, however, I think I've had some fairly unique insights into language design.  Adding features to a target language is what DataDraw is for, and I've been able to try out several features not found in C++ in a real industrial coding environment.  Some of those features I've described in other posts.

As I said, I was hoping D could be extended to make DataDraw obsolete. That turns out not to be the case.  I'll describe some of my current thinking about this matter below.

DataDraw currently just models data structures, and allows me to write code generators.  This is much like the old OM tool for UML (which DataDraw predates).  It gives me the power of compile-time reflection classes, like those in OpenC++.  However, for each new language, or coding style, I have to write a new code generator, and these things get really complex.  DataDraw currently has 5.  That kind of sucks.

Instead, DataDraw should allow me to write one awesome code generator that targets an intermediate language.  Then, it should allow me to write simple translators for each target language and coding style.  The bulk of the work could then be shared.
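The split between one shared generator and thin per-target translators might look like the sketch below. Nothing here is DataDraw's actual design or API; ClassModel, Getter, generateGetters, toC, and toCpp are hypothetical names, and the "intermediate language" is reduced to a list of getter descriptions.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical input model: a class with typed fields.
struct Field { std::string type; std::string name; };
struct ClassModel { std::string name; std::vector<Field> fields; };

// Hypothetical intermediate form: one target-independent getter description.
struct Getter { std::string owner; std::string type; std::string field; };

// The shared generator: all the real decisions are made once, here.
std::vector<Getter> generateGetters(const ClassModel& model) {
    assert(!model.name.empty());
    std::vector<Getter> out;
    for (const Field& f : model.fields) {
        out.push_back(Getter{model.name, f.type, f.name});
    }
    return out;
}

// Thin translators: one small function per target language or coding style.
std::string toC(const Getter& g) {
    return g.type + " " + g.owner + "_get_" + g.field +
           "(const " + g.owner + " *self);";
}

std::string toCpp(const Getter& g) {
    return g.type + " " + g.owner + "::" + g.field + "() const;";
}
```

Adding a new target then means writing one more small translator over the same intermediate form, rather than a whole new generator.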

With a built-in language translator, DataDraw would be much simpler than it is now.  However, with a built-in language translator, DataDraw becomes a language in itself.  What's unique about it?  Simple.  It's extendable by me and others I work with who are familiar with the DataDraw code base.  I can generate code of any type, and add literally any feature I wish.  However, I do that by directly editing the code generators, which are written in C and which link into DataDraw's database.  That's not elegant, or usable by anyone not familiar with the DataDraw code base, although it does cover my needs.

So, I've been looking into what it takes to get the same power, but in a language that anyone could work with.  In particular, I've been examining what it would take for D to cover DataDraw's functionality. That, it turns out, is hard (which is one reason the XL compiler isn't done).  The more power you give the user, the more you open up the internals of the compiler, and the more complex you make the language.

For example, to do that in D, a natural way would be to make Walter's representation of D as data structures part of the language definition (thus greatly restricting how D compilers are built).  Then, you could offer access to reflection classes at compile time (as OpenC++ does).  A natural way to use these classes at compile time is to interpret D code.  Now, you have to write a D interpreter as well as a compiler.  This is the approach taken by VHDL for their generators, and it really complicated implementations of compilers.  An alternative is to re-compile the compiler instead.  This is a bit brain-bending, but I think getting rid of the interpreter is worth it.  Besides, I already recompile DataDraw every time I fix or add a feature, and that's never been much of a problem.

Even if we added compile-time reflection classes, I still don't get all the power of DataDraw, which I can extend in any way, because I directly edit the source.  What's still missing?

For one thing, reflection classes can't be used to add syntax to the language.  That's a serious limitation.  XL's approach allows some syntax extension.  Scheme also has a nice mechanism.  However, both systems are limited, and complex, and slow.  I'm toying with another approach that is easy if you already allow users to compile custom versions of the compiler (which you do to get rid of the interpreter).  Just provide a simple mechanism for generating a syntax description for use by bison. That nails the problem.  Any new syntax can then be added by a user, so long as it's compatible with what's already there.  A drawback is that bison now becomes part of the language, along with all its quirks and strong points.  At least bison is pretty much available everywhere.

Just adding new syntax to the language doesn't get you all the way there.  You still are stuck with those reflection classes used to model the language.  If you have a new construct to implement, you can add the syntax, but what objects do you build to represent it?  The reflection classes themselves need to be extendable.  Really.  At that point, nothing in the language is left as non-configurable.  You're stuck with LALR(1) parsers, but that's no big deal.

However, adding reflection classes is tricky.  Being C-derived, the language still needs to link with the C linker, including the compiler itself, especially if users are going to compile custom compilers for their applications.  That means that new types can't be added to the compiler's database, since C libraries are limited that way.  I'm currently toying with the age-old style of non-typed syntax trees rather than fully typed reflection classes.  It looks like it will work out, but in the end, all this has done is provide a compiler that's easy to extend.  It's easy to extend because its parser and internal data structures are simple and extendable.  Plug-ins should be easy to write.  However, it's not really a standard language any more.  It's just a customizable compiler that's fairly easy to work with.
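A non-typed syntax tree of the kind mentioned above can be sketched in a few lines. This is an illustration only, with hypothetical names (Node, dump); it assumes C++17, which permits std::vector of the incomplete type Node inside its own definition.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One node shape for every construct: the "type" is just a string tag, so a
// plug-in can introduce a new construct without adding a new node class.
struct Node {
    std::string kind;            // e.g. "call", "ident"; never checked statically
    std::string text;            // token text for leaves, empty otherwise
    std::vector<Node> children;  // operands in source order
};

// Renders the tree as an s-expression; works for kinds it has never seen.
std::string dump(const Node& n) {
    assert(!n.kind.empty());
    std::string out = "(" + n.kind;
    if (!n.text.empty()) out += " " + n.text;
    for (const Node& child : n.children) out += " " + dump(child);
    return out + ")";
}
```

The trade-off is the one the post describes: generic passes such as `dump` handle any new construct for free, but nothing stops a plug-in from building a malformed tree, because the compiler's own type system no longer describes the language.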

I'm left with the conclusion that D can't be enhanced to be extendable the way XL wants to be, or the way I'd like D to be.

I don't see how D can get there from here.

Bill
