Thread overview
Is º a Unicode alphabetic character?
Sep 12, 2014
AsmMan
Sep 12, 2014
Ali Çehreli
Sep 12, 2014
AsmMan
Sep 12, 2014
Ali Çehreli
Sep 12, 2014
AsmMan
Sep 12, 2014
AsmMan
September 12, 2014
What's a Unicode alphabetic character? I misunderstood isAlpha(): I used to think it validates letters like a, b, è, é .. z, etc., but isAlpha('º') from the std.uni module returns true. How can I validate only the letters of a Unicode alphabet in D, or should I write my own function?

I know I can do:

bool is_id(dchar c)
{
	return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c >= 0xc0;
}

but I'm looking for a native function, if there is one.
September 12, 2014
On 09/11/2014 08:04 PM, AsmMan wrote:

> what's an unicode alphabetic character?

Alphabetic is defined as Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic, all of which are explained here:

  http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values

> I misunderstood isAlpha(), I
> used to think it's to validate letters like a, b, è, é .. z etc but
> isAlpha('º') from std.uni module return true.

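º (U+00BA MASCULINE ORDINAL INDICATOR) is in a Letter category ("Letter, Lowercase" in Unicode 5.1; reclassified as "Letter, Other" in Unicode 6.1), so yes, isAlpha() returns true for it.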

> How can I validate only
> the letters of an unicode alphabet in D or should I write one?

There are so many alphabets in the world. It is likely that a Unicode character will be a part of one.

> I know I can do:
>
> bool is_id(dchar c)
> {
>      return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c >= 0xc0;
> }

There is a misunderstanding. There are so many Unicode characters that are >= 0xc0 but not a part of the Alphabetic category. For example: ← (U+2190 LEFTWARDS ARROW).
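
To make the difference concrete, here is a minimal sketch comparing the range check from the question with std.uni.isAlpha. It assumes the 'A'..'Z' range was what the question intended; the test characters are just illustrative picks.

```d
import std.uni : isAlpha;

// Range check from the question (assumption: 'A'..'Z' was intended).
// The final branch accepts every code point from U+00C0 upward.
bool is_id(dchar c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c >= 0xc0;
}

void main()
{
    // U+2190 LEFTWARDS ARROW passes the range check but is not Alphabetic.
    assert(is_id('←'));
    assert(!isAlpha('←'));

    // U+00D7 MULTIPLICATION SIGN sits in the middle of the accented Latin-1 letters.
    assert(is_id('×'));
    assert(!isAlpha('×'));

    // An actual letter is accepted by both.
    assert(is_id('é'));
    assert(isAlpha('é'));
}
```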

Ali

September 12, 2014
On Friday, 12 September 2014 at 04:04:22 UTC, Ali Çehreli wrote:
> On 09/11/2014 08:04 PM, AsmMan wrote:
>
> > what's an unicode alphabetic character?
>
> Alphabetic is defined as Lu + Ll + Lt + Lm + Lo + Nl + Other_Alphabetic, all of which are explained here:
>
>   http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values
>
> > I misunderstood isAlpha(), I
> > used to think it's to validate letters like a, b, è, é .. z etc but
> > isAlpha('º') from std.uni module return true.
>
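> º (U+00BA MASCULINE ORDINAL INDICATOR) is in a Letter category ("Letter, Lowercase" in Unicode 5.1; reclassified as "Letter, Other" in Unicode 6.1), so yes, isAlpha() returns true for it.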
>
> > How can I validate only
> > the letters of an unicode alphabet in D or should I write one?
>
> There are so many alphabets in the world. It is likely that a Unicode character will be a part of one.
>
> > I know I can do:
> >
> > bool is_id(dchar c)
> > {
> >      return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c >= 0xc0;
> > }
>
> There is a misunderstanding. There are so many Unicode characters that are >= 0xc0 but not a part of the Alphabetic category. For example: ← (U+2190 LEFTWARDS ARROW).
>
> Ali

If I want only the ASCII and Latin alphabets, which range should I use?
I.e., how should I rewrite the is_id() function?
September 12, 2014
On 09/11/2014 11:38 PM, AsmMan wrote:

> If I want ASCII and latin only alphabet which range should I use?
> ie, how should I rewrite is_id() function?

This seems to be it:

import std.stdio;
import std.uni;

void main()
{
    alias latin = unicode.script.latin;
    assert('ç' in latin);
    assert('7' !in latin);

    writeln(latin);
}

Ali

September 12, 2014
On Friday, 12 September 2014 at 07:57:43 UTC, Ali Çehreli wrote:
> On 09/11/2014 11:38 PM, AsmMan wrote:
>
> > If I want ASCII and latin only alphabet which range should I use?
> > ie, how should I rewrite is_id() function?
>
> This seems to be it:
>
> import std.stdio;
> import std.uni;
>
> void main()
> {
>     alias latin = unicode.script.latin;
>     assert('ç' in latin);
>     assert('7' !in latin);
>
>     writeln(latin);
> }
>
> Ali

Sorry, I shouldn't have asked for Latin but for an alphabet like the French one instead: http://www.importanceoflanguages.com/Images/French/FrenchAlphabet.jpg (including the diacritics, of course)

As you mentioned, º happens to be a letter, so it still passes: assert('º' in latin);

so it isn't different from isAlpha(). Is the Unicode table organized so that I can use a range (like we do for ASCII: ch >= 'a' && ch <= 'z' || ch >= 'A' && ch <= 'Z'), or should I put these alphabetic characters in a table myself and then do a lookup?
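
The table-lookup option could look roughly like this. It is only a minimal sketch: isFrenchLetter is a hypothetical name, and the list of accented letters is an assumption about what counts as "the French alphabet" and should be checked against the chart linked above.

```d
import std.algorithm : canFind;
import std.ascii : isAlpha; // true only for ASCII letters a-z and A-Z

// Hypothetical helper: ASCII letters plus an explicit table of accented letters.
// The table contents are an assumption, not an authoritative list.
bool isFrenchLetter(dchar c)
{
    static immutable dstring accented = "àâæçéèêëîïôœùûüÿÀÂÆÇÉÈÊËÎÏÔŒÙÛÜŸ"d;
    return isAlpha(c) || accented.canFind(c);
}

void main()
{
    assert(isFrenchLetter('e'));
    assert(isFrenchLetter('é'));
    assert(isFrenchLetter('Œ'));
    assert(!isFrenchLetter('º')); // ordinal indicator is not in the table
    assert(!isFrenchLetter('×')); // multiplication sign
    assert(!isFrenchLetter('ñ')); // a Latin letter, but not listed here
}
```

A linear search over about thirty entries is cheap; for a larger table a sorted array or a std.uni CodepointSet would be the more usual choice.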
September 12, 2014
Thanks Ali, I think I'm getting close:

bool is_id(dchar c)
{
	return c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z' || c >= 0xc0 && c <= 0xd6 || c >= 0xd8 && c <= 0xf6 || c >= 0xf8 && c <= 0xff;
}

This doesn't include the math symbols × (0xd7) and ÷ (0xf7), like c >= 0xc0 did.
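
The same ranges can probably also be written as a std.uni CodepointSet instead of a chain of comparisons. This is a minimal sketch, assuming the intended Latin-1 ranges are À..Ö (0xc0..0xd6), Ø..ö (0xd8..0xf6) and ø..ÿ (0xf8..0xff); latin1Letters is a hypothetical name. The constructor takes half-open intervals, so each upper bound is one past the last letter.

```d
import std.uni : CodepointSet;

// Hypothetical helper: ASCII letters plus the Latin-1 letters,
// given as sorted half-open intervals [start, end).
CodepointSet latin1Letters()
{
    return CodepointSet('A', 'Z' + 1, 'a', 'z' + 1,
                        0xC0, 0xD7,   // À..Ö
                        0xD8, 0xF7,   // Ø..ö
                        0xF8, 0x100); // ø..ÿ
}

void main()
{
    auto letters = latin1Letters();
    assert('ç' in letters);
    assert('Z' in letters);
    assert('×' !in letters); // U+00D7 falls between the ranges
    assert('÷' !in letters); // U+00F7 falls between the ranges
    assert('º' !in letters); // U+00BA is below 0xc0
}
```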