April 27, 2012
Re: Notice/Warning on narrowStrings .length
On Fri, Apr 27, 2012 at 04:12:25AM +0200, Matt Peterson wrote:
> On Friday, 27 April 2012 at 01:35:26 UTC, H. S. Teoh wrote:
> >When I get the time? Hah... I really need to get my lazy bum back to
> >working on the new AA implementation first. I think that would
> >contribute greater value than optimizing Unicode algorithms. :-) I
> >was hoping *somebody* would be inspired by my idea and run with it...
> 
> I actually recently wrote a lexer generator for D that wouldn't be
> that hard to adapt to something like this.

That's awesome! Would you like to give it a shot? ;-)

Also, I'm in love with lexer generators... I'd love to make good use of
your lexer generator if the code is available somewhere.


T

-- 
Nothing in the world is more distasteful to a man than to take the path
that leads to himself. -- Hermann Hesse
April 27, 2012
Re: Notice/Warning on narrowStrings .length
On Thu, Apr 26, 2012 at 09:55:54PM -0400, Nick Sabalausky wrote:
[...]
> Crazy stuff! Some of them look rather similar to Arabic or Korean's
> Hangul (sp?), at least to my untrained eye. And then others are just
> *really* interesting-looking, like:
> 
> http://www.omniglot.com/writing/12480.htm
> http://www.omniglot.com/writing/ayeri.htm
> http://www.omniglot.com/writing/oxidilogi.htm
> 
> You're right though, if I were in charge of Unicode and tasked with
> handling some of those, I think I'd just say "Screw it. Unicode is now
> deprecated.  Use ASCII instead. Doesn't have the characters for your
> language? Tough! Fix your language!" :)

You think that's crazy, huh? Check this out:

	http://www.omniglot.com/writing/sumerian.htm

Now take a deep breath...

... this writing was *actually used* in ancient times. Yeah.

Which means it probably has a Unicode block assigned to it, right now.
:-)


> > When I get the time? Hah... I really need to get my lazy bum back to
> > working on the new AA implementation first. I think that would
> > contribute greater value than optimizing Unicode algorithms. :-) I
> > was hoping *somebody* would be inspired by my idea and run with
> > it...
> >
> 
> Heh, yea. It is a tempting project, but my plate's overflowing too.
> (Now if only I could make the same happen to my bank account...!)
[...]

On the other hand though, sometimes it's refreshing to take a break from
"serious" low-level core language D code, and just write plain ole
normal boring application code in D. It's good to be reminded just how
easy and pleasant it is to write application code in D.

For example, just today I was playing around with a regex-based version
of formattedRead: you pass in a regex and a bunch of pointers, and the
function uses compile-time introspection to convert regex matches into
the correct value types. So you could call it like this:

	int year;
	string month;
	int day;
	regexRead(input, `(\d{4})\s+(\w+)\s+(\d{2})`, &year, &month, &day);

Basically, each pair of parentheses corresponds with a pointer argument;
non-capturing parentheses (?:) can be used for grouping without
assigning to an item.
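
To make the idea concrete, here is a minimal sketch of how such a function
could be written on top of std.regex and std.conv. It is not the
implementation described above: the signature, the bool return value, and
the lack of any check that the number of capture groups matches the number
of pointers are all assumptions made for illustration.

```d
import std.conv : to;
import std.regex : matchFirst, regex;

/// Hypothetical sketch: converts capture group N into the type pointed to
/// by the Nth pointer argument. Returns false if the pattern does not match.
bool regexRead(Args...)(string input, string pattern, Args args)
{
    auto m = matchFirst(input, regex(pattern));
    if (m.empty)
        return false;

    // The loop is unrolled at compile time, once per pointer argument.
    foreach (i, arg; args)
    {
        // m[0] is the whole match, so capture groups start at index 1.
        // typeof(*arg) is the pointee type, e.g. int for an int*.
        *arg = m[i + 1].to!(typeof(*arg));
    }
    return true;
}

void main()
{
    int year;
    string month;
    int day;

    immutable ok = regexRead("2012 April 27",
                             `(\d{4})\s+(\w+)\s+(\d{2})`,
                             &year, &month, &day);
    assert(ok);
    assert(year == 2012 && month == "April" && day == 27);
}
```

Passing more pointers than the regex has capture groups would fail at run
time in this sketch; a fuller version would want to detect that up front.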

Its current implementation is still kinda crude, but it does support
assigning to user-defined types if you define a fromString() method that
does the requisite conversion from the matching substring.

The next step is to standardize on enums in user-defined types that
specify a regex substring to be used for matching items of that type, so
that the caller doesn't have to know what kind of string pattern is
expected by fromString(). I envision something like this:

	struct MyDate {
		enum stdFmt = `(\d{4}-\d{2}-\d{2})`;
		enum americanFmt = `(\d{2}-\d{2}-\d{4})`;
		static MyDate fromString(Char)(Char[] value) { ... }
	}
	...
	string label1, label2;
	MyDate dt1, dt2;
	regexRead(input, `\s+(\w+)\s*=\s*`~MyDate.stdFmt~`\s*$`,
			&label1, &dt1);
	regexRead(input, `\s+(\w+)\s*=\s*`~MyDate.americanFmt~`\s*$`,
			&label2, &dt2);

So the user can specify, in the regex, which date format to use in
parsing the dates.
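
A rough sketch of how the fromString() hook and the format enums might fit
together is below. Everything beyond the names MyDate, stdFmt, fromString
and regexRead is assumed for illustration: the year/month/day fields, the
fixed-offset parsing, and the is(typeof(...)) test used to detect the hook
are one possible design, not necessarily the one described above.

```d
import std.conv : to;
import std.regex : matchFirst, regex;

struct MyDate
{
    int year, month, day;

    // A single capture group, so the format consumes one pointer argument.
    enum stdFmt = `(\d{4}-\d{2}-\d{2})`;

    // Assumed to receive exactly the text matched by stdFmt (YYYY-MM-DD).
    static MyDate fromString(Char)(Char[] value)
    {
        return MyDate(value[0 .. 4].to!int,
                      value[5 .. 7].to!int,
                      value[8 .. 10].to!int);
    }
}

/// Hypothetical sketch, as before, extended with the fromString() hook.
bool regexRead(Args...)(string input, string pattern, Args args)
{
    auto m = matchFirst(input, regex(pattern));
    if (m.empty)
        return false;

    foreach (i, arg; args)
    {
        alias T = typeof(*arg);
        // Use the user-defined conversion when T provides one,
        // otherwise fall back to std.conv.to.
        static if (is(typeof(T.fromString(m[i + 1])) == T))
            *arg = T.fromString(m[i + 1]);
        else
            *arg = m[i + 1].to!T;
    }
    return true;
}

void main()
{
    string label;
    MyDate dt;

    immutable ok = regexRead(" release = 2012-04-27",
                             `\s+(\w+)\s*=\s*` ~ MyDate.stdFmt ~ `\s*$`,
                             &label, &dt);
    assert(ok);
    assert(label == "release");
    assert(dt == MyDate(2012, 4, 27));
}
```

An americanFmt variant would need some way to tell fromString() which field
order to expect; this sketch leaves that question open.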

I think this is a vast improvement over the current straitjacketed
formattedRead. ;-) And it's so much fun to code (and use).


T

-- 
Let X be the set not defined by this sentence...
April 27, 2012
Re: Notice/Warning on narrowStrings .length
On 27.04.2012 5:36, H. S. Teoh wrote:
> On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote:
> [...]
>> Heh, any usage of Notepad *needs* to be justified. For example, it has an
>> undo buffer of exactly ONE change.
>

Come on, Notepad is really nice at one job: getting rid of the style 
and fonts of a copied text fragment. I use it as a clean-up scratch pad 
daily. Would be a shame if they ever add fonts and layout to it ;)


-- 
Dmitry Olshansky
April 27, 2012
Re: Notice/Warning on narrowStrings .length
On 27.04.2012 1:23, H. S. Teoh wrote:
> On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
>> "James Miller"<james@aatch.net>  wrote in message
>> news:qdgacdzxkhmhojqcettj@forum.dlang.org...
>>> I'm writing an introduction/tutorial to using strings in D, paying
>>> particular attention to the complexities of UTF-8 and 16. I realised
>>> that when you want the number of characters, you normally actually
>>> want to use walkLength, not length. Is it reasonable for the
>>> compiler to pick this up during semantic analysis and point out this
>>> situation?
>>>
>>> It's just a thought because a lot of the time, using length will get
>>> the right answer, but for the wrong reasons, resulting in lurking
>>> bugs. You can always cast to immutable(ubyte)[] or
>>> immutable(short)[] if you want to work with the actual bytes anyway.
>>
>> I find that most of the time I actually *do* want to use length. Don't
>> know if that's common, though, or if it's just a reflection of my
>> particular use-cases.
>>
>> Also, keep in mind that (unless I'm mistaken) walkLength does *not*
>> return the number of "characters" (ie, graphemes), but merely the
>> number of code points - which is not the same thing (due to existence
>> of the [confusingly-named] "combining characters").
> [...]
>
> And don't forget that some code points (notably from the CJK block) are
> specified as "double-width", so if you're trying to do text layout,
> you'll want yet a different length (layoutLength?).
>
> So we really need all four lengths. Ain't unicode fun?! :-)
>
> Array length is simple.  Walklength is already implemented. Grapheme
> length requires recognition of 'combining characters' (or rather,
> ignoring said characters), and layout length requires recognizing
> widthless, single- and double-width characters.
>
> I've been thinking about unicode processing recently. Traditionally, we
> have to decode narrow strings into UTF-32 (aka dchar) then do table
> lookups and such. But unicode encoding and properties, etc., are static
> information (at least within a single unicode release). So why bother
> with hardcoding tables and stuff at all?

Of course they are generated.

>
> What we *really* should be doing, esp. for commonly-used functions like
> computing various lengths, is to automatically process said tables and
> encode the computation in finite-state machines that can then be
> optimized at the FSM level (there are known algos for generating optimal
> FSMs),

FSAs are based on tables, so it all runs in a circle. Only the layout 
changes. Yet the speed gains of non-decoding are huge.

> codegen'd, and then optimized again at the assembly level by the
> compiler. These FSMs will operate at the native narrow string char type
> level, so that there will be no need for explicit decoding.
>
> The generation algo can then be run just once per unicode release, and
> everything will Just Work.
>
This year Unicode in D will receive a nice upgrade.
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/dolsh/20002#

Anyway, keep me posted if these FSAs ever come to spoil your sleep ;)
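
For reference, three of the four lengths discussed in this thread can be
compared in a few lines. This is only a sketch, and it assumes a Phobos
recent enough to ship std.uni.byGrapheme, which as far as I know is newer
than this thread. The fourth one (layout width) has no Phobos function that
I am aware of, so it is left out.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";

    assert(s.length == 3);                 // UTF-8 code units (1 + 2 bytes)
    assert(s.walkLength == 2);             // code points
    assert(s.byGrapheme.walkLength == 1);  // grapheme clusters
}
```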

-- 
Dmitry Olshansky
April 27, 2012
Re: Notice/Warning on narrowStrings .length
"Dmitry Olshansky" <dmitry.olsh@gmail.com> wrote in message 
news:jndkji$23ni$2@digitalmars.com...
> On 27.04.2012 5:36, H. S. Teoh wrote:
>> On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote:
>> [...]
>>> Heh, any usage of Notepad *needs* to be justified. For example, it has 
>>> an
>>> undo buffer of exactly ONE change.
>>
>
> Come on, Notepad is really nice at one job: getting rid of the style and 
> fonts of a copied text fragment. I use it as a clean-up scratch pad daily. 
> Would be a shame if they ever add fonts and layout to it ;)
>

That's the #1 biggest thing I use it for!! :) And yes, daily.

I frequently wish I had a global setting for "Don't include style in the 
clipboard", and maybe a *separate* "Copy with style" command. Or at least a 
standard "copy without style", or "remove style from clipboard" command. 
*Something*. 99% of the times I copy/paste text I *don't* want to include 
style. Drives me crazy.
April 27, 2012
Re: Notice/Warning on narrowStrings .length
"H. S. Teoh" <hsteoh@quickfur.ath.cx> wrote in message 
news:mailman.1.1335507187.22023.digitalmars-d@puremagic.com...
> On Thu, Apr 26, 2012 at 09:55:54PM -0400, Nick Sabalausky wrote:
> [...]
>> Crazy stuff! Some of them look rather similar to Arabic or Korean's
>> Hangul (sp?), at least to my untrained eye. And then others are just
>> *really* interesting-looking, like:
>>
>> http://www.omniglot.com/writing/12480.htm
>> http://www.omniglot.com/writing/ayeri.htm
>> http://www.omniglot.com/writing/oxidilogi.htm
>>
>> You're right though, if I were in charge of Unicode and tasked with
>> handling some of those, I think I'd just say "Screw it. Unicode is now
>> deprecated.  Use ASCII instead. Doesn't have the characters for your
>> language? Tough! Fix your language!" :)
>
> You think that's crazy, huh? Check this out:
>
> http://www.omniglot.com/writing/sumerian.htm
>
> Now take a deep breath...
>
> ... this writing was *actually used* in ancient times. Yeah.
>

Jesus, I could *easily* mistake that for hardware schematics. That's wild.
April 27, 2012
Re: Notice/Warning on narrowStrings .length
On 27.04.2012 12:31, Nick Sabalausky wrote:
> "Dmitry Olshansky"<dmitry.olsh@gmail.com>  wrote in message
> news:jndkji$23ni$2@digitalmars.com...
>> On 27.04.2012 5:36, H. S. Teoh wrote:
>>> On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote:
>>> [...]
>>>> Heh, any usage of Notepad *needs* to be justified. For example, it has
>>>> an
>>>> undo buffer of exactly ONE change.
>>>
>>
>> Come on, Notepad is really nice at one job: getting rid of the style and
>> fonts of a copied text fragment. I use it as a clean-up scratch pad daily.
>> Would be a shame if they ever add fonts and layout to it ;)
>>
>
> That's the #1 biggest thing I use it for!! :) And yes, daily.
>
> I frequently wish I had a global setting for "Don't include style in the
> clipboard", and maybe a *separate* "Copy with style" command. Or at least a
> standard "copy without style", or "remove style from clipboard" command.
> *Something*. 99% of the times I copy/paste text I *don't* want to include
> style. Drives me crazy.
>
>
Yup I certainly wouldn't mind a separate "copy with my font settings" ;)

-- 
Dmitry Olshansky
April 27, 2012
Re: Notice/Warning on narrowStrings .length
On Fri, Apr 27, 2012 at 12:20:13PM +0400, Dmitry Olshansky wrote:
> On 27.04.2012 1:23, H. S. Teoh wrote:
[...]
> >What we *really* should be doing, esp. for commonly-used functions
> >like computing various lengths, is to automatically process said
> >tables and encode the computation in finite-state machines that can
> >then be optimized at the FSM level (there are known algos for
> >generating optimal FSMs),
> 
> FSAs are based on tables, so it all runs in a circle. Only the
> layout changes. Yet the speed gains of non-decoding are huge.

Yes, but hand-coded tables tend to go out of date, be prone to bugs, or
are missing optimizations done by an FSA generator (e.g. a lexer
generator). Collapsed FSA states, for example, can greatly reduce table
size and speed things up.
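
As a tiny illustration of working on the code units directly, here is the
simplest non-decoding case: counting code points in UTF-8 by skipping
continuation bytes. It is a hand-written sketch, not generated code, and
it assumes the input is valid UTF-8; the function name countCodePoints is
made up for this example.

```d
import std.range : walkLength;
import std.string : representation;

/// Counts code points without decoding: in valid UTF-8, every code point
/// contains exactly one byte that is not a continuation byte (10xxxxxx).
size_t countCodePoints(string s)
{
    size_t n = 0;
    foreach (ubyte b; s.representation)
    {
        if ((b & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

void main()
{
    // Mixes 1-byte, 2-byte and 3-byte sequences.
    string s = "na\u00EFve \u65E5\u672C\u8A9E";

    assert(countCodePoints(s) == s.walkLength);
    assert(countCodePoints(s) == 9);
}
```

Grapheme or layout lengths need real per-code-point properties, which is
where generated state machines would come in.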


> >codegen'd, and then optimized again at the assembly level by the
> >compiler. These FSMs will operate at the native narrow string char
> >type level, so that there will be no need for explicit decoding.
> >
> >The generation algo can then be run just once per unicode release,
> >and everything will Just Work.
> >
> This year Unicode in D will receive a nice upgrade.
> http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/dolsh/20002#
> 
> Anyway, keep me posted if these FSAs ever come to spoil your
> sleep ;)
[...]

One area where autogenerated Unicode algos will be very useful is in
normalization. Unicode normalization is non-trivial, to say the least;
it involves looking up various character properties and performing
mappings between them in a specified order.

If we can encode this process as FSA, then we can let an automated FSA
optimizer produce code that maps directly between the (non-decoded!)
source string and the target (non-decoded!) normalized string. Similar
things can be done for string concatenation (which requires
arbitrarily-distant scanning in either direction from the joining point,
though in normal use cases the distance should be very short).
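
To show what is at stake, here is a short sketch using std.uni.normalize
from Phobos. It assumes a Phobos version that provides normalize, and
it says nothing about how that function is implemented internally; it only
demonstrates the behaviour an FSA-based version would have to reproduce,
including the concatenation case.

```d
import std.uni : normalize, NFC, NFD;

void main()
{
    string decomposed = "e\u0301";  // e + U+0301 COMBINING ACUTE ACCENT
    string composed = "\u00E9";     // precomposed U+00E9

    // Same text to a reader, different code point sequences.
    assert(decomposed != composed);
    assert(normalize!NFC(decomposed) == composed);
    assert(normalize!NFD(composed) == decomposed);

    // Concatenation: "e" and the lone combining accent are the two pieces,
    // but the joined string is not in NFC, so the area around the joining
    // point has to be renormalized.
    string joined = "e" ~ "\u0301";
    assert(joined != composed);
    assert(normalize!NFC(joined) == composed);
}
```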


T

-- 
Error: Keyboard not attached. Press F1 to continue. -- Yoon Ha Lee, CONLANG
April 27, 2012
Re: Notice/Warning on narrowStrings .length
On Friday, 27 April 2012 at 06:12:01 UTC, H. S. Teoh wrote:
> On Thu, Apr 26, 2012 at 09:55:54PM -0400, Nick Sabalausky wrote:
> [...]
>> Crazy stuff! Some of them look rather similar to Arabic or 
>> Korean's
>> Hangul (sp?), at least to my untrained eye. And then others 
>> are just
>> *really* interesting-looking, like:
>> 
>> http://www.omniglot.com/writing/12480.htm
>> http://www.omniglot.com/writing/ayeri.htm
>> http://www.omniglot.com/writing/oxidilogi.htm
>> 
>> You're right though, if I were in charge of Unicode and tasked 
>> with
>> handling some of those, I think I'd just say "Screw it. 
>> Unicode is now
>> deprecated.  Use ASCII instead. Doesn't have the characters
>> for your
>> language? Tough! Fix your language!" :)
>
> You think that's crazy, huh? Check this out:
>
> 	http://www.omniglot.com/writing/sumerian.htm
>
> Now take a deep breath...
>
> ... this writing was *actually used* in ancient times. Yeah.
>
> Which means it probably has a Unicode block assigned to it, 
> right now.
> :-)

It was actually the first human writing ever. Which Phoenician 
scribe knew that his innovation of the alphabet would make 
programming easier thousands of years later?
April 28, 2012
Re: Notice/Warning on narrowStrings .length
"H. S. Teoh" <hsteoh@quickfur.ath.cx> wrote in message 
news:mailman.1.1335507187.22023.digitalmars-d@puremagic.com...
>
> For example, just today I was playing around with a regex-based version
> of formattedRead: you pass in a regex and a bunch of pointers, and the
> function uses compile-time introspection to convert regex matches into
> the correct value types. So you could call it like this:
>
> int year;
> string month;
> int day;
> regexRead(input, `(\d{4})\s+(\w+)\s+(\d{2})`, &year, &month, &day);
> [...]

That's pretty cool.