April 27, 2012
On Fri, Apr 27, 2012 at 04:12:25AM +0200, Matt Peterson wrote:
> On Friday, 27 April 2012 at 01:35:26 UTC, H. S. Teoh wrote:
> >When I get the time? Hah... I really need to get my lazy bum back to working on the new AA implementation first. I think that would contribute greater value than optimizing Unicode algorithms. :-) I was hoping *somebody* would be inspired by my idea and run with it...
> 
> I actually recently wrote a lexer generator for D that wouldn't be that hard to adapt to something like this.

That's awesome! Would you like to give it a shot? ;-)

Also, I'm in love with lexer generators... I'd love to make good use of your lexer generator if the code is available somewhere.


T

-- 
Nothing in the world is more distasteful to a man than to take the path that leads to himself. -- Hermann Hesse
April 27, 2012
On Thu, Apr 26, 2012 at 09:55:54PM -0400, Nick Sabalausky wrote: [...]
> Crazy stuff! Some of them look rather similar to Arabic or Korean's Hangul (sp?), at least to my untrained eye. And then others are just *really* interesting-looking, like:
> 
> http://www.omniglot.com/writing/12480.htm
> http://www.omniglot.com/writing/ayeri.htm
> http://www.omniglot.com/writing/oxidilogi.htm
> 
> You're right though, if I were in charge of Unicode and tasked with handling some of those, I think I'd just say "Screw it. Unicode is now deprecated. Use ASCII instead. Doesn't have the characters for your language? Tough! Fix your language!" :)

You think that's crazy, huh? Check this out:

	http://www.omniglot.com/writing/sumerian.htm

Now take a deep breath...

... this writing was *actually used* in ancient times. Yeah.

Which means it probably has a Unicode block assigned to it, right now. :-)


> > When I get the time? Hah... I really need to get my lazy bum back to working on the new AA implementation first. I think that would contribute greater value than optimizing Unicode algorithms. :-) I was hoping *somebody* would be inspired by my idea and run with it...
> >
> 
> Heh, yea. It is a tempting project, but my plate's overflowing too. (Now if only I could make the same happen to my bank account...!)
[...]

On the other hand though, sometimes it's refreshing to take a break from "serious" low-level core language D code, and just write plain ole normal boring application code in D. It's good to be reminded just how easy and pleasant it is to write application code in D.

For example, just today I was playing around with a regex-based version of formattedRead: you pass in a regex and a bunch of pointers, and the function uses compile-time introspection to convert regex matches into the correct value types. So you could call it like this:

	int year;
	string month;
	int day;
	regexRead(input, `(\d{4})\s+(\w+)\s+(\d{2})`, &year, &month, &day);

Basically, each pair of capturing parentheses corresponds to one pointer argument; non-capturing parentheses (?:) can be used for grouping without assigning to an item.

Its current implementation is still kinda crude, but it does support assigning to user-defined types if you define a fromString() method that does the requisite conversion from the matching substring.
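
For anyone who wants to play along, here is a minimal sketch of how such a function might look. This is not the actual implementation, just a guess at its shape built on std.regex and std.conv; the Percent type is made up purely to exercise the fromString() hook, and the sketch assumes the regex has at least as many capture groups as there are pointers.

```d
import std.conv : to;
import std.regex : matchFirst, regex;

/// Hypothetical sketch of the interface described above: match `pattern`
/// against `input` and store capture group i+1 through the i-th pointer.
/// Returns false (without touching the targets) when nothing matches.
bool regexRead(Ptrs...)(string input, string pattern, Ptrs ptrs)
{
    auto m = matchFirst(input, regex(pattern));
    if (m.empty)
        return false;

    foreach (i, ptr; ptrs)   // unrolled at compile time, one step per pointer
    {
        alias T = typeof(*ptr);
        static if (__traits(compiles, T.fromString(m[i + 1])))
            *ptr = T.fromString(m[i + 1]);  // user-defined conversion hook
        else
            *ptr = m[i + 1].to!T;           // built-in conversion via std.conv
    }
    return true;
}

/// Illustrative user-defined type; the name and layout are made up.
struct Percent
{
    double value;
    static Percent fromString(const(char)[] s) { return Percent(s.to!double); }
}

unittest
{
    int year, day;
    string month;
    Percent p;
    assert(regexRead("2012 April 27 at 85%",
        `(\d{4})\s+(\w+)\s+(\d{2})\s+at\s+(\d+)%`, &year, &month, &day, &p));
    assert(year == 2012 && month == "April" && day == 27 && p.value == 85);
}
```

The static if is where the compile-time introspection happens: types that provide fromString() get it called, everything else goes through std.conv.to.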

The next step is to standardize on enums in user-defined types that specify a regex substring to be used for matching items of that type, so that the caller doesn't have to know what kind of string pattern is expected by fromString(). I envision something like this:

	struct MyDate {
		enum stdFmt = `(\d{4}-\d{2}-\d{2})`;
		enum americanFmt = `(\d{2}-\d{2}-\d{4})`;
		static MyDate fromString(Char)(Char[] value) { ... }
	}
	...
	string label1, label2;
	MyDate dt1, dt2;
	regexRead(input, `\s+(\w+)\s*=\s*`~MyDate.stdFmt~`\s*$`,
			&label1, &dt1);
	regexRead(input, `\s+(\w+)\s*=\s*`~MyDate.americanFmt~`\s*$`,
			&label2, &dt2);

So the user can specify, in the regex, which date format to use in parsing the dates.
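
To make the idea concrete without depending on regexRead itself, here is a rough sketch that uses std.regex directly. The body of fromString() is only an assumption (the real one is elided above): it takes a plain const(char)[] rather than a templated Char[], expects ASCII digits, and tells the two layouts apart by where the first dash sits.

```d
import std.conv : to;
import std.regex : matchFirst, regex;

struct MyDate
{
    int year, month, day;

    // Regex fragments describing the textual forms this type accepts.
    enum stdFmt = `(\d{4}-\d{2}-\d{2})`;
    enum americanFmt = `(\d{2}-\d{2}-\d{4})`;

    // Assumed conversion: both fragments match exactly 10 characters,
    // so the position of the first dash identifies the layout.
    static MyDate fromString(const(char)[] s)
    {
        if (s[4] == '-')   // YYYY-MM-DD
            return MyDate(s[0 .. 4].to!int, s[5 .. 7].to!int, s[8 .. 10].to!int);
        // MM-DD-YYYY
        return MyDate(s[6 .. 10].to!int, s[0 .. 2].to!int, s[3 .. 5].to!int);
    }
}

unittest
{
    // The caller composes the pattern from the type's own fragment.
    auto iso = matchFirst("  due = 2012-04-27",
        regex(`\s+(\w+)\s*=\s*` ~ MyDate.stdFmt ~ `\s*$`));
    assert(!iso.empty && iso[1] == "due");
    assert(MyDate.fromString(iso[2]) == MyDate(2012, 4, 27));

    auto us = matchFirst("  paid = 04-27-2012",
        regex(`\s+(\w+)\s*=\s*` ~ MyDate.americanFmt ~ `\s*$`));
    assert(!us.empty && us[1] == "paid");
    assert(MyDate.fromString(us[2]) == MyDate(2012, 4, 27));
}
```

Since the fragments are enums, the concatenated pattern is an ordinary compile-time string constant.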

I think this is a vast improvement over the current straitjacketed
formattedRead. ;-) And it's so much fun to code (and use).


T

-- 
Let X be the set not defined by this sentence...
April 27, 2012
On 27.04.2012 5:36, H. S. Teoh wrote:
> On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote:
> [...]
>> Heh, any usage of Notepad *needs* to be justified. For example, it has an
>> undo buffer of exactly ONE change.
>

Come on, Notepad is really nice at one job: getting rid of the style and fonts of a copied text fragment. I use it as a clean-up scratch pad daily. Would be a shame if they ever added fonts and layout to it ;)


-- 
Dmitry Olshansky
April 27, 2012
On 27.04.2012 1:23, H. S. Teoh wrote:
> On Thu, Apr 26, 2012 at 01:51:17PM -0400, Nick Sabalausky wrote:
>> "James Miller"<james@aatch.net>  wrote in message
>> news:qdgacdzxkhmhojqcettj@forum.dlang.org...
>>> I'm writing an introduction/tutorial to using strings in D, paying
>>> particular attention to the complexities of UTF-8 and 16. I realised
>>> that when you want the number of characters, you normally actually
>>> want to use walkLength, not length. Is it reasonable for the
>>> compiler to pick this up during semantic analysis and point out this
>>> situation?
>>>
>>> It's just a thought because a lot of the time, using length will get
>>> the right answer, but for the wrong reasons, resulting in lurking
>>> bugs. You can always cast to immutable(ubyte)[] or
>>> immutable(short)[] if you want to work with the actual bytes anyway.
>>
>> I find that most of the time I actually *do* want to use length. Don't
>> know if that's common, though, or if it's just a reflection of my
>> particular use-cases.
>>
>> Also, keep in mind that (unless I'm mistaken) walkLength does *not*
>> return the number of "characters" (ie, graphemes), but merely the
>> number of code points - which is not the same thing (due to existence
>> of the [confusingly-named] "combining characters").
> [...]
>
> And don't forget that some code points (notably from the CJK block) are
> specified as "double-width", so if you're trying to do text layout,
> you'll want yet a different length (layoutLength?).
>
> So we really need all four lengths. Ain't unicode fun?! :-)
>
> Array length is simple. walkLength is already implemented. Grapheme
> length requires recognition of 'combining characters' (or rather,
> ignoring said characters), and layout length requires recognizing
> widthless, single- and double-width characters.
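
A small sketch of the first three lengths, for illustration only. It assumes a Phobos recent enough to ship std.uni.byGrapheme, which is newer than this thread; layout length is left out because it needs East Asian Width data that this sketch does not try to model.

```d
import std.range : walkLength;
import std.uni : byGrapheme;

unittest
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
    string s = "e\u0301";

    assert(s.length == 3);                 // array length: UTF-8 code units
    assert(s.walkLength == 2);             // code points
    assert(s.byGrapheme.walkLength == 1);  // grapheme clusters
}
```

Three different answers for one user-perceived character, which is why "the length of a string" has to say which length it means.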
>
> I've been thinking about unicode processing recently. Traditionally, we
> have to decode narrow strings into UTF-32 (aka dchar) then do table
> lookups and such. But unicode encoding and properties, etc., are static
> information (at least within a single unicode release). So why bother
> with hardcoding tables and stuff at all?

Of course they are generated.

>
> What we *really* should be doing, esp. for commonly-used functions like
> computing various lengths, is to automatically process said tables and
> encode the computation in finite-state machines that can then be
> optimized at the FSM level (there are known algos for generating optimal
> FSMs),

FSAs are based on tables too, so it all runs in a circle; only the layout changes. Yet the speed gains from not decoding are huge.

> codegen'd, and then optimized again at the assembly level by the
> compiler. These FSMs will operate at the native narrow string char type
> level, so that there will be no need for explicit decoding.
>
> The generation algo can then be run just once per unicode release, and
> everything will Just Work.
>
This year Unicode in D will receive a nice upgrade.
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/dolsh/20002#

Anyway, keep me posted if these FSAs ever come to spoil your sleep ;)

-- 
Dmitry Olshansky
April 27, 2012
"Dmitry Olshansky" <dmitry.olsh@gmail.com> wrote in message news:jndkji$23ni$2@digitalmars.com...
> On 27.04.2012 5:36, H. S. Teoh wrote:
>> On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote: [...]
>>> Heh, any usage of Notepad *needs* to be justified. For example, it has
>>> an
>>> undo buffer of exactly ONE change.
>>
>
> Come on, Notepad is really nice at one job: getting rid of the style and fonts of a copied text fragment. I use it as a clean-up scratch pad daily. Would be a shame if they ever added fonts and layout to it ;)
>

That's the #1 biggest thing I use it for!! :) And yes, daily.

I frequently wish I had a global setting for "Don't include style in the clipboard", and maybe a *separate* "Copy with style" command. Or at least a standard "copy without style", or "remove style from clipboard" command. *Something*. 99% of the time I copy/paste text, I *don't* want to include style. Drives me crazy.


April 27, 2012
"H. S. Teoh" <hsteoh@quickfur.ath.cx> wrote in message news:mailman.1.1335507187.22023.digitalmars-d@puremagic.com...
> On Thu, Apr 26, 2012 at 09:55:54PM -0400, Nick Sabalausky wrote: [...]
>> Crazy stuff! Some of them look rather similar to Arabic or Korean's Hangul (sp?), at least to my untrained eye. And then others are just *really* interesting-looking, like:
>>
>> http://www.omniglot.com/writing/12480.htm
>> http://www.omniglot.com/writing/ayeri.htm
>> http://www.omniglot.com/writing/oxidilogi.htm
>>
>> You're right though, if I were in charge of Unicode and tasked with handling some of those, I think I'd just say "Screw it. Unicode is now deprecated. Use ASCII instead. Doesn't have the characters for your language? Tough! Fix your language!" :)
>
> You think that's crazy, huh? Check this out:
>
> http://www.omniglot.com/writing/sumerian.htm
>
> Now take a deep breath...
>
> ... this writing was *actually used* in ancient times. Yeah.
>

Jesus, I could *easily* mistake that for hardware schematics. That's wild.


April 27, 2012
On 27.04.2012 12:31, Nick Sabalausky wrote:
> "Dmitry Olshansky"<dmitry.olsh@gmail.com>  wrote in message
> news:jndkji$23ni$2@digitalmars.com...
>> On 27.04.2012 5:36, H. S. Teoh wrote:
>>> On Thu, Apr 26, 2012 at 09:03:59PM -0400, Nick Sabalausky wrote:
>>> [...]
>>>> Heh, any usage of Notepad *needs* to be justified. For example, it has
>>>> an
>>>> undo buffer of exactly ONE change.
>>>
>>
>> Come on, Notepad is really nice at one job: getting rid of the style and
>> fonts of a copied text fragment. I use it as a clean-up scratch pad daily.
>> Would be a shame if they ever added fonts and layout to it ;)
>>
>
> That's the #1 biggest thing I use it for!! :) And yes, daily.
>
> I frequently wish I had a global setting for "Don't include style in the
> clipboard", and maybe a *separate* "Copy with style" command. Or at least a
> standard "copy without style", or "remove style from clipboard" command.
> *Something*. 99% of the time I copy/paste text, I *don't* want to include
> style. Drives me crazy.
>
>
Yup, I certainly wouldn't mind a separate "copy with my font settings" ;)

-- 
Dmitry Olshansky
April 27, 2012
On Fri, Apr 27, 2012 at 12:20:13PM +0400, Dmitry Olshansky wrote:
> On 27.04.2012 1:23, H. S. Teoh wrote:
[...]
> >What we *really* should be doing, esp. for commonly-used functions like computing various lengths, is to automatically process said tables and encode the computation in finite-state machines that can then be optimized at the FSM level (there are known algos for generating optimal FSMs),
> 
> FSAs are based on tables too, so it all runs in a circle; only the layout changes. Yet the speed gains from not decoding are huge.

Yes, but hand-coded tables tend to go out of date, be prone to bugs, or are missing optimizations done by an FSA generator (e.g. a lexer generator). Collapsed FSA states, for example, can greatly reduce table size and speed things up.
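
As a rough illustration of an automaton that runs directly on code units, here is a hand-written stand-in for what a generator might emit: a table-driven DFA that counts the code points in a UTF-8 string without decoding any of them. All the names (State, classify, transition, countCodePoints) are made up for this sketch, and the machine is deliberately simplified: it only checks sequence lengths and does not reject overlong encodings, surrogates, or values above U+10FFFF.

```d
enum State : ubyte { start, need1, need2, need3, error }

/// Byte classes: 0 ASCII, 1 continuation byte, 2/3/4 lead byte of a
/// 2/3/4-byte sequence, 5 byte that can never appear in UTF-8.
ubyte classify(char b) pure nothrow @nogc @safe
{
    if (b < 0x80) return 0;
    if (b < 0xC0) return 1;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    return 5;
}

private alias S = State;

/// transition[state][byte class] gives the next state.
immutable S[6][5] transition = [
    [S.start, S.error, S.need1, S.need2, S.need3, S.error], // start
    [S.error, S.start, S.error, S.error, S.error, S.error], // need1
    [S.error, S.need1, S.error, S.error, S.error, S.error], // need2
    [S.error, S.need2, S.error, S.error, S.error, S.error], // need3
    [S.error, S.error, S.error, S.error, S.error, S.error], // error
];

/// Returns the number of code points, or -1 if a sequence is malformed
/// or truncated. Works on code units only; nothing is decoded to dchar.
long countCodePoints(const(char)[] s) pure nothrow @nogc @safe
{
    auto state = State.start;
    long count = 0;
    foreach (char c; s)   // char iteration: one code unit per step
    {
        state = transition[state][classify(c)];
        if (state == State.error)
            return -1;
        if (state == State.start)
            ++count;      // back in the start state: one sequence completed
    }
    return state == State.start ? count : -1;
}

unittest
{
    assert(countCodePoints("abc") == 3);
    assert(countCodePoints("e\u0301") == 2);
    assert(countCodePoints("\U0001F600") == 1);
    assert(countCodePoints([cast(char) 0xC3]) == -1); // truncated sequence
}
```

The three "need" rows differ in a single cell, which is the kind of redundancy an FSA optimizer could exploit when compressing the table.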


> >codegen'd, and then optimized again at the assembly level by the compiler. These FSMs will operate at the native narrow string char type level, so that there will be no need for explicit decoding.
> >
> >The generation algo can then be run just once per unicode release, and everything will Just Work.
> >
> This year Unicode in D will receive a nice upgrade. http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/dolsh/20002#
> 
> Anyway, keep me posted if these FSAs ever come to spoil your sleep ;)
[...]

One area where autogenerated Unicode algos will be very useful is in normalization. Unicode normalization is non-trivial, to say the least; it involves looking up various character properties and performing mappings between them in a specified order.

If we can encode this process as FSA, then we can let an automated FSA optimizer produce code that maps directly between the (non-decoded!) source string and the target (non-decoded!) normalized string. Similar things can be done for string concatenation (which requires arbitrarily-distant scanning in either direction from the joining point, though in normal use cases the distance should be very short).
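
A tiny sketch of why concatenation is tricky. It assumes a Phobos that provides std.uni.normalize, which is newer than this thread, and it uses the ordinary decode-and-lookup route, not the FSA approach described above.

```d
import std.uni;

unittest
{
    string a = "e";
    string b = "\u0301";   // U+0301 COMBINING ACUTE ACCENT

    // Each piece is already in NFC on its own...
    assert(normalize!NFC(a) == a);
    assert(normalize!NFC(b) == b);

    // ...but their concatenation is not: NFC composes the pair into U+00E9.
    string joined = a ~ b;
    assert(normalize!NFC(joined) == "\u00E9");
    assert(normalize!NFC(joined) != joined);

    // NFD goes the other way and splits the precomposed character.
    assert(normalize!NFD("\u00E9") == "e\u0301");
}
```

So a normalization-preserving concatenation has to look at the characters on both sides of the joining point, which is the scanning mentioned above.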


T

-- 
Error: Keyboard not attached. Press F1 to continue. -- Yoon Ha Lee, CONLANG
April 27, 2012
On Friday, 27 April 2012 at 06:12:01 UTC, H. S. Teoh wrote:
> On Thu, Apr 26, 2012 at 09:55:54PM -0400, Nick Sabalausky wrote:
> [...]
>> Crazy stuff! Some of them look rather similar to Arabic or Korean's
>> Hangul (sp?), at least to my untrained eye. And then others are just
>> *really* interesting-looking, like:
>> 
>> http://www.omniglot.com/writing/12480.htm
>> http://www.omniglot.com/writing/ayeri.htm
>> http://www.omniglot.com/writing/oxidilogi.htm
>> 
>> You're right though, if I were in charge of Unicode and tasked with
>> handling some of those, I think I'd just say "Screw it. Unicode is now
>> deprecated. Use ASCII instead. Doesn't have the characters for your
>> language? Tough! Fix your language!" :)
>
> You think that's crazy, huh? Check this out:
>
> 	http://www.omniglot.com/writing/sumerian.htm
>
> Now take a deep breath...
>
> ... this writing was *actually used* in ancient times. Yeah.
>
> Which means it probably has a Unicode block assigned to it, right now.
> :-)

It was actually the first human writing ever. Which Phoenician scribe knew that his innovation of the alphabet would make programming easier thousands of years later?


April 28, 2012
"H. S. Teoh" <hsteoh@quickfur.ath.cx> wrote in message news:mailman.1.1335507187.22023.digitalmars-d@puremagic.com...
>
> For example, just today I was playing around with a regex-based version of formattedRead: you pass in a regex and a bunch of pointers, and the function uses compile-time introspection to convert regex matches into the correct value types. So you could call it like this:
>
> int year;
> string month;
> int day;
> regexRead(input, `(\d{4})\s+(\w+)\s+(\d{2})`, &year, &month, &day);
> [...]

That's pretty cool.

