Thread overview
Converting a character to upper case in string
Sep 21, 2018
NX
Sep 21, 2018
Laurent Tréguier
Sep 21, 2018
NX
Sep 21, 2018
Laurent Tréguier
Sep 21, 2018
Laurent Tréguier
Sep 21, 2018
Laurent Tréguier
Sep 21, 2018
Gary Willoughby
Sep 22, 2018
Vladimir Panteleev
Sep 22, 2018
Patrick Schluter
Sep 22, 2018
bauss
September 21, 2018
How can I properly convert a character, say the first one, to upper case in a Unicode-correct manner?
At which level should I be working? Grapheme? Or is code point sufficient?

There are a few Phobos functions like asCapitalized(), but none of them do what I want.
September 21, 2018
On Friday, 21 September 2018 at 12:15:52 UTC, NX wrote:
> How can I properly convert a character, say the first one, to upper case in a Unicode-correct manner?
> At which level should I be working? Grapheme? Or is code point sufficient?
>
> There are a few Phobos functions like asCapitalized(), but none of them do what I want.

I would probably go for std.utf.decode [1] to get the character and its length in code units, capitalize it, and concatenate the result with the rest of the string.

[1] https://dlang.org/phobos/std_utf.html#.decode
September 21, 2018
On Friday, 21 September 2018 at 12:15:52 UTC, NX wrote:
> How can I properly convert a character, say the first one, to upper case in a Unicode-correct manner?
> At which level should I be working? Grapheme? Or is code point sufficient?
>
> There are a few Phobos functions like asCapitalized(), but none of them do what I want.

----------
import std.conv : to;
import std.stdio : writeln;
import std.string : capitalize;
import std.utf : decode;

void main()
{
    size_t index = 1;        // code unit offset of the character to capitalize
    size_t oldIndex = index; // decode() advances index past the decoded character
    auto theString = "hëllo, world";
    auto firstLetter = theString.decode(index);
    auto result = theString[0 .. oldIndex] ~ capitalize(firstLetter.to!string) ~ theString[index .. $];
    writeln(result);
}
----------

(This could be a lot prettier, but this seems to basically work)
September 21, 2018
On Friday, 21 September 2018 at 12:34:12 UTC, Laurent Tréguier wrote:
> I would probably go for std.utf.decode [1] to get the character and its length in code units, capitalize it, and concatenate the result with the rest of the string.
>
> [1] https://dlang.org/phobos/std_utf.html#.decode

So by this I assume it is sufficient to work with dchars rather than graphemes?
September 21, 2018
On Friday, 21 September 2018 at 13:32:54 UTC, NX wrote:
> On Friday, 21 September 2018 at 12:34:12 UTC, Laurent Tréguier wrote:
>> I would probably go for std.utf.decode [1] to get the character and its length in code units, capitalize it, and concatenate the result with the rest of the string.
>>
>> [1] https://dlang.org/phobos/std_utf.html#.decode
>
> So by this I assume it is sufficient to work with dchars rather than graphemes?

From what I've tested, it seems sufficient. I might be wrong though; I'm no Unicode expert. It might still be a good idea to have a look at the grapheme-related functions.
September 21, 2018
On Friday, 21 September 2018 at 13:32:54 UTC, NX wrote:
> On Friday, 21 September 2018 at 12:34:12 UTC, Laurent Tréguier wrote:
>> I would probably go for std.utf.decode [1] to get the character and its length in code units, capitalize it, and concatenate the result with the rest of the string.
>>
>> [1] https://dlang.org/phobos/std_utf.html#.decode
>
> So by this I assume it is sufficient to work with dchars rather than graphemes?

----------
import std.stdio;
import std.conv;
import std.string;
import std.uni;

void main()
{
    size_t index = 1; // code unit offset where the grapheme starts
    auto theString = "he\u0308llo, world";
    auto theStringPart = theString[index .. $];
    // decodeGrapheme pops one whole grapheme (base letter plus combining marks)
    auto firstLetter = theStringPart.decodeGrapheme;
    auto result = theString[0 .. index]
        ~ capitalize(firstLetter[].text)
        ~ theString[index + graphemeStride(theString, index) .. $];
    writeln(result);
}
----------

This will capitalize graphemes as a whole, and might be better than what I previously wrote.
September 21, 2018
On Friday, 21 September 2018 at 12:15:52 UTC, NX wrote:
> How can I properly convert a character, say the first one, to upper case in a Unicode-correct manner?
> At which level should I be working? Grapheme? Or is code point sufficient?
>
> There are a few Phobos functions like asCapitalized(), but none of them do what I want.

Use `asCapitalized` to capitalize the first letter or use something like this:


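----------
import std.conv;
import std.range;
import std.stdio;
import std.uni;

void main(string[] args)
{
	string input = "noe\u0308l";
	int index    = 2; // grapheme index, not a code unit offset

	// Split into graphemes so that "e" + U+0308 is handled as one unit.
	auto graphemes    = input.byGrapheme.array;
	string upperCased = [graphemes[index]].byCodePoint.text.toUpper;

	// Put the upper-cased grapheme back and flatten to a string again.
	graphemes[index] = upperCased.decodeGrapheme;
	string output    = graphemes.byCodePoint.text;

	writeln(output);
}
----------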
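
For comparison, here is a minimal sketch of what `asCapitalized` does on its own. It upper-cases the first code point and lower-cases everything after it, which may be why it is not what the original poster wants. The helper name `capitalizedCopy` is made up for this example and is not part of Phobos.

```d
import std.conv : to;
import std.stdio : writeln;
import std.uni : asCapitalized;

// Hypothetical helper, not part of Phobos: eagerly evaluates the lazy
// range returned by asCapitalized into a new string.
string capitalizedCopy(string s)
{
    return s.asCapitalized.to!string;
}

void main()
{
    // The first character is upper-cased, all the others are lower-cased.
    writeln(capitalizedCopy("hEllo wORLD")); // Hello world
}
```

Note that the rest of the string is modified too, so this only fits when lower-casing the remainder is acceptable.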

September 22, 2018
On Friday, 21 September 2018 at 12:15:52 UTC, NX wrote:
> How can I properly convert a character, say the first one, to upper case in a Unicode-correct manner?

That would depend on how you'd define correctness. If your application needs to support "all" languages, then (depending how you interpret it) the task may not be meaningful, as some languages don't have the notion of "upper-case" or even "character" (as an individual glyph). Some languages do have those notions, but they serve a specific purpose that doesn't align with the one in English (e.g. Lojban).

> At which level should I be working? Grapheme? Or is code point sufficient?

Using graphemes is necessary if you need to support e.g. combining marks (e.g. ̏◌ + S = ̏S).
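
As a rough illustration of the three levels involved, the sketch below counts the same text in code units, code points and graphemes. It uses "e" followed by U+0308 (the combining diaeresis from the earlier examples in this thread) rather than the mark shown above; the variable name is arbitrary.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // 'e' followed by U+0308 COMBINING DIAERESIS: one visible character.
    string s = "e\u0308";

    writeln(s.length);                // 3: UTF-8 code units
    writeln(s.walkLength);            // 2: code points (decoded dchars)
    writeln(s.byGrapheme.walkLength); // 1: grapheme
}
```

Upper-casing only the first code point happens to work here because the combining mark itself has no case, but slicing or replacing by code point can still split the base letter from its mark.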

September 22, 2018
On Saturday, 22 September 2018 at 06:01:20 UTC, Vladimir Panteleev wrote:
> On Friday, 21 September 2018 at 12:15:52 UTC, NX wrote:
>> How can I properly convert a character, say the first one, to upper case in a Unicode-correct manner?
>
> That would depend on how you'd define correctness. If your application needs to support "all" languages, then (depending how you interpret it) the task may not be meaningful, as some languages don't have the notion of "upper-case" or even "character" (as an individual glyph). Some languages do have those notions, but they serve a specific purpose that doesn't align with the one in English (e.g. Lojban).

There are other traps in the question of uppercase/lowercase, which make it indeed very difficult to handle correctly if we don't define what "correctly" means.
Examples:
- It may be necessary to know the locale, i.e. the language of the string to uppercase. In Turkish, the uppercase of i is not I but İ, and the lowercase of I is ı (that was a reason for the calamitously low performance of toUpper/toLower in Java, for example).
- Some uppercase forms depend on what they are used for. German ß should be uppercased as SS in normal text (note, by the way, that 1 code point becomes 2 in uppercase), but for calligraphic work, road signs and other usages it can be capital ẞ.
- Greek has a single uppercase form Σ but two lowercase forms, σ and ς, depending on the position in the word.
- While it is becoming less and less relevant, Serbo-Croatian may use digraphs when transcoding the script from Cyrillic (Serbian) to Latin (Croatian); these digraphs have 2 uppercase forms (all capital and title-case):
  - dž -> DŽ or Dž
  - lj -> LJ or Lj
  - nj -> NJ or Nj
Normalization would normally take care of that case.
- Some languages may modify or remove diacritical signs when uppercasing. It is quite usual in French to not put accents on capitals.

It is also clear that the operation of uppercasing is not symmetric with lowercasing.
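
A small sketch of two of the points above, using the `toUpper`/`toLower` overloads from std.uni. The code point overloads only apply one-to-one mappings, so the Greek round trip loses the final sigma. The string overload is documented as able to change the length of the text, which is what the ß case needs; the exact lengths printed by the last line are left unstated here because I have not checked them against a particular Phobos release. Variable names are mine.

```d
import std.stdio : writeln;
import std.uni : toLower, toUpper;

void main()
{
    // Code point overloads: simple one-to-one mappings only.
    dchar finalSigma = 'ς';
    dchar roundTrip = toLower(toUpper(finalSigma));
    writeln(roundTrip);               // σ, not the ς we started with
    writeln(roundTrip == finalSigma); // false

    // String overload: the result may have a different length than
    // the input, so compare the code unit counts before and after.
    string eszett = "ß";
    string upper = eszett.toUpper;
    writeln(eszett.length, " -> ", upper.length, " code units");
}
```

None of this is locale-aware: the Turkish i/İ case needs information that these functions do not take as a parameter.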

>
>> At which level should I be working? Grapheme? Or is code point sufficient?
>
> Using graphemes is necessary if you need to support e.g. combining marks (e.g. ̏◌ + S = ̏S).


September 22, 2018
On Saturday, 22 September 2018 at 06:01:20 UTC, Vladimir Panteleev wrote:
> On Friday, 21 September 2018 at 12:15:52 UTC, NX wrote:
>> How can I properly convert a character, say the first one, to upper case in a Unicode-correct manner?
>
> That would depend on how you'd define correctness. If your application needs to support "all" languages, then (depending how you interpret it) the task may not be meaningful, as some languages don't have the notion of "upper-case" or even "character" (as an individual glyph). Some languages do have those notions, but they serve a specific purpose that doesn't align with the one in English (e.g. Lojban).
>
>> At which level should I be working? Grapheme? Or is code point sufficient?
>
> Using graphemes is necessary if you need to support e.g. combining marks (e.g. ̏◌ + S = ̏S).

Uppercase and lowercase get even more funky with Turkish.