UTF-8 - D Programming Language Discussion Forum

Thread overview

UTF-8
Dec 25, 2003 ET_yoza
Dec 26, 2003 Walter
Dec 26, 2003 ET_yoza
Dec 26, 2003 ET_yoza
Dec 26, 2003 Hauke Duden
Dec 28, 2003 Ben Hinkle
Dec 28, 2003 Hauke Duden
Dec 28, 2003 Matthew
Dec 29, 2003 Hauke Duden
Jan 04, 2004 Walter
Jan 04, 2004 Matthew
Dec 29, 2003 Y.Tomino
Dec 29, 2003 Hauke Duden
Dec 29, 2003 Y.Tomino
Dec 26, 2003 Y.Tomino
Dec 26, 2003 ET_yoza
Dec 28, 2003 Matthew
Dec 29, 2003 Y.Tomino

December 25, 2003

Posted by ET_yoza

Dear Mr. Walter,

I am a D language user in Japan, and I am testing it. I have Windows 2000
Japanese Edition installed on my development PC, along with the D
compiler.
I found a problem while using D, concerning the encoding of multi-byte
character strings.
I know that D is a Unicode-oriented language, so I wrote my source code
in Unicode.
The API of my OS requires Shift_JIS as the string encoding.
(The MBCS encoding that the A versions of the Windows API require
differs from country to country and is not compatible with UTF-8.)
I expected D to convert the strings implicitly, but D does not seem to
convert the encoding properly when calling the API.
As a result, to display Japanese, the source code cannot be
written in Unicode.
If the source code is written in Shift_JIS, the short-term goal is
achieved. However, that is contrary to the specification. Moreover, it
brings back a vicious problem from C,
since encodings other than Unicode have characters that contain the
escape character:

"undefined escape sequence \?"

I'd like to write code in Unicode, as the specification says. Therefore, please solve this problem by converting the encoding when Phobos calls the API.
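
To illustrate the escape-character problem described above: in Shift_JIS, the second byte of some two-byte characters is 0x5C, the ASCII backslash. For example, 表 is encoded as 0x95 0x5C, so a compiler that reads the source byte by byte sees a backslash followed by whatever comes next. The following is a minimal sketch in present-day D, assuming the usual Shift_JIS lead-byte ranges 0x81-0x9F and 0xE0-0xFC; the function name is hypothetical.

```d
/// Returns true if any two-byte Shift_JIS character in `sjis` has 0x5C
/// (the ASCII backslash) as its second byte.
/// Assumption: lead bytes lie in 0x81-0x9F or 0xE0-0xFC.
bool hasBackslashTrailByte(const(ubyte)[] sjis)
{
    size_t i = 0;
    while (i < sjis.length)
    {
        immutable ubyte b = sjis[i];
        immutable bool isLead = (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
        if (isLead && i + 1 < sjis.length)
        {
            if (sjis[i + 1] == 0x5C)
                return true;
            i += 2; // skip both bytes of the two-byte character
        }
        else
        {
            i += 1; // single-byte character, including a genuine backslash
        }
    }
    return false;
}
```

A genuine single-byte backslash is not reported, because only the second byte of a two-byte character is examined.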

December 26, 2003

Posted by Walter
in reply to ET_yoza

D currently can handle UTF-8 (unicode) strings, but it does not handle shift-JIS strings. To make shift-JIS work in your programs, you'll need to write a filter to convert shift-JIS to UTF-8 on input, and convert UTF-8 to shift-JIS on output. I intend to do this for all code pages, but have not written those filters yet.

December 26, 2003

Posted by ET_yoza
in reply to Walter

Thank you. I understood.

December 26, 2003

Posted by Hauke Duden
in reply to Walter

Walter wrote:
> D currently can handle UTF-8 (unicode) strings, but it does not handle
> shift-JIS strings. To make shift-JIS work in your programs, you'll need to
> write a filter to convert shift-JIS to UTF-8 on input, and convert UTF-8 to
> shift-JIS on output. I intend to do this for all code pages, but have not
> written those filters yet.

I don't think you have to. The C runtime library provides functions to convert from wide char to the local code page and vice versa. We can use those for conversions of this kind.

I know I'm repeating the same stuff over and over, but maybe this real world example has shifted your position somewhat. I REALLY think UTF-8 strings should not use the "char" type. The CRT expects chars to be encoded in the local code page, so this will lead to all kinds of confusion when you mix C functions with D functions. The latter expects UTF-8, the former the local code page, but both use the same type. Actually, if you get right down to the definition, they use different types with the same name and none of the type-safety features you expect from a typed programming language!

It would be a lot easier if the types had different names.

Hauke
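
The type-safety point can be sketched in present-day D. This is only an illustration of giving the two encodings different type names, under the assumption that local-code-page text is kept as raw bytes; `LocalString` and `localByteCount` are hypothetical names, not Phobos types or functions.

```d
/// Hypothetical wrapper for bytes encoded in the local code page.
/// It is deliberately not implicitly convertible from a UTF-8 string.
struct LocalString
{
    immutable(ubyte)[] bytes;
}

/// Stand-in for a C-style function that expects local-code-page text.
/// It only reports the byte count here.
size_t localByteCount(LocalString s)
{
    return s.bytes.length;
}

// A UTF-8 `string` cannot be passed where `LocalString` is expected,
// so mixing the two encodings becomes a compile-time error.
static assert(!__traits(compiles, localByteCount("text in UTF-8")));
```

With distinct names, a conversion function becomes the only way to get from one type to the other, which is where the code-page conversion would live.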

December 26, 2003

Posted by Y.Tomino
in reply to Walter

A mechanism like a "filter" is unnecessary.
It's the same problem as the WriteString one I posted.
The A versions of the Windows API always require strings in the current code page.
** Don't call the A-version API with UTF-8!! **
Phobos simply needs to call the A-version API after converting with
WideCharToMultiByte.
WideCharToMultiByte converts Unicode to Shift_JIS in Japan,
and to whatever encoding is used in other countries.

For example...

//current
void mkdir(char[] pathname)
{
  // Passes UTF-8 bytes straight to the A-version API.
  if (!CreateDirectoryA(toStringz(pathname), null))
  {
    throw new FileException(pathname, GetLastError());
  }
}

//correct
void mkdir(char[] utf8_pathname)
{
  // Convert UTF-8 -> UTF-16 -> current code page before calling the A version.
  char[] codepage_pathname = toMBCS(toUTF16(utf8_pathname));
  if (!CreateDirectoryA(toStringz(codepage_pathname), null))
  {
    throw new FileException(utf8_pathname, GetLastError());
  }
}

char[] toMBCS(wchar[] s)
{
  char[] result;
  // The first call asks for the required buffer size; code page 0 is CP_ACP.
  result.length = WideCharToMultiByte(0, 0, s, s.length, null, 0, null, null);
  WideCharToMultiByte(0, 0, s, s.length, result, result.length, null, null);
  return result;
}

//ideal
//Unicode has many more letters than a code-page encoding.
//Please try the W-version API first.
void mkdir(char[] utf8_pathname)
{
  wchar[] utf16_pathname = toUTF16(utf8_pathname);
  if(!CreateDirectoryW(cast(wchar*)(utf16_pathname ~ "\0"), null))
  {
    // Fall back to the A version with the code-page string.
    char[] codepage_pathname = toMBCS(utf16_pathname);
    if (!CreateDirectoryA(toStringz(codepage_pathname), null))
    {
      throw new FileException(utf8_pathname, GetLastError());
    }
  }
}

Thanks.
YT

December 26, 2003

Posted by ET_yoza
in reply to Y.Tomino

Thank you. Thank you. Thank you!!!

December 28, 2003

Posted by Ben Hinkle
in reply to Hauke Duden

oh boy, more unicode! ;)

"Hauke Duden" <H.NS.Duden@gmx.net> wrote in message news:bshas9$1kgh$1@digitaldaemon.com...
> Walter wrote:
> > D currently can handle UTF-8 (unicode) strings, but it does not handle
> > shift-JIS strings. To make shift-JIS work in your programs, you'll need to
> > write a filter to convert shift-JIS to UTF-8 on input, and convert UTF-8 to
> > shift-JIS on output. I intend to do this for all code pages, but have not
> > written those filters yet.
>
> I don't think you have to. The C runtime library provides functions to convert from wide char to the local code page and vice versa. We can use those for conversions of this kind.
>
> I know I'm repeating the same stuff over and over, but maybe this real world example has shifted your position somewhat. I REALLY think UTF-8 strings should not use the "char" type. The CRT expects chars to be encoded in the local code page, so this will lead to all kinds of confusion when you mix C functions with D functions. The latter expects UTF-8, the former the local code page, but both use the same type. Actually, if you get right down to the definition, they use different types with the same name and none of the type-safety features you expect from a typed programming language!
>
> It would be a lot easier if the types had different names.
>
> Hauke

agreed.

I have a question about local code pages: do they all contain ASCII? I thought so but I don't know for sure. Googling around it seems like some old encodings were not actually compatible with ASCII but I don't know if those encodings are in use anymore (EBCDIC?). Remember ASCII is only the first 7 bits. That was why I proposed using the types "ascii","utf8","utf16" and "utf32". Any ascii[] that has non-ASCII bytes is assumed to be encoded in the local code page. The standard C functions all take local encodings, so they would be declared as taking ascii[] strings. If naming a type "ascii" is too offensive then maybe someone can come up with a word for "8-bit local encoding suitable for printf and friends". Hauke had suggested "charz", which doesn't sound too bad to me but then it also makes a statement about the trailing 0. Maybe something like "lchar" for local encoding, "uchar8" for unicode utf-8, "uchar16" for utf-16 and "uchar32" for utf-32.

-Ben

December 28, 2003

Posted by Hauke Duden
in reply to Ben Hinkle

Ben Hinkle wrote:
> I have a question about local code pages: do they all contain ASCII?

According to the C docs there are no guarantees about the way characters are encoded. However, I read some time ago that "most" code pages are downward compatible to ASCII. No clue whether that means "most" as in "all we care about" or "most" as in "except for that one important code page we need to support". Murphy tells me that we should assume the latter ;).

> The standard C functions all take local encodings, so
> they would be declared as taking ascii[] strings.

That seems like another hack to me. Even if the code pages are ASCII compatible, the strings you pass to the functions are not necessarily all ASCII, so I think the type should have a different name.

> If naming a type "ascii"
> is too offensive then maybe someone can come up with a word for "8-bit local
> encoding suitable for printf and friends". Hauke had suggested "charz",
> which doesn't sound too bad to me but then it also makes a statement about
> the trailing 0. Maybe something like "lchar" for local encoding, "uchar8"
> for unicode utf-8, "uchar16" for utf-16 and "uchar32" for utf-32.

I like the uchar8 and uchar16. There is a clash with the naming convention that uTYPE means the unsigned version of that type. But that seems ok, since UTF-8 and UTF-16 chars would be unsigned anyway.

How about the following:

cchar: for C-type strings, made up of 8 bit elements encoded in the local code page. Maybe also null-terminated by convention?

wchar: for C-type wide char strings, made up of 16 bit or 32 bit elements depending on the system (16 for windows, 32 for linux). The encoding is UTF-16 or UTF-32. We need a type like this to be able to write portable code that interoperates with C-functions that take the wchar_t type, which has the same properties. Also null-terminated by convention?

uchar8: UTF-8 encoded string

uchar16: UTF-16 encoded string

uchar32: UTF-32 encoded string

Additionally, I would advocate adding either a "char" or "dchar" alias for uchar32, to make the point that this should be the default character type. uchar32 just does not stand out among the other char types.

Hauke
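
For comparison, in present-day D the built-in `char`, `wchar` and `dchar` hold UTF-8, UTF-16 and UTF-32 code units, which roughly matches the roles proposed above for uchar8, uchar16 and uchar32. The sketch below shows how the same text occupies a different number of code units in each encoding; the function name is hypothetical, and `toUTF16` and `toUTF32` come from `std.utf`.

```d
import std.utf : toUTF16, toUTF32;

/// Returns the number of code units the text needs in
/// UTF-8, UTF-16 and UTF-32, in that order.
size_t[3] codeUnitCounts(string utf8)
{
    immutable wstring utf16 = toUTF16(utf8);
    immutable dstring utf32 = toUTF32(utf8);
    return [utf8.length, utf16.length, utf32.length];
}

// The three character types have fixed sizes of 1, 2 and 4 bytes.
static assert(char.sizeof == 1 && wchar.sizeof == 2 && dchar.sizeof == 4);
```

Pure ASCII text gives the same count in all three encodings, while Japanese text typically needs three UTF-8 code units per character.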

December 28, 2003

Posted by Matthew
in reply to Y.Tomino

> //ideal
> //Unicode has many letters more than code-page encoding.
> //Please try W-Version API first.
> void mkdir(char[] utf8_pathname)
> {
>   wchar[] utf16_pathname = toUTF16(utf8_pathname);
>   if(!CreateDirectoryW(cast(wchar*)(utf16_pathname ~ "\0"), null))
>   {
>     char[] codepage_pathname = toMBCS(utf16_pathname);
>     if (!CreateDirectoryA(toStringz(codepage_pathname), null))
>     {
>       throw new FileException(pathname, GetLastError());
>     }
>   }
> }

I don't have any criticisms to make of your internationalisation postulates, but the above code is *not* the right way to write A/W flexible functions in Win32.

For example, what do I get when I retrieve the win32 error code from the FileException?

I've not followed enough of these threads to know whether much, if any, library code is being written at the moment, but if it is, and if it is done like this, it is a bad thing.

Matthew
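
One way to address the error-code concern, sketched in present-day D: fall back to the A version only when the W version reports that it is not implemented (as the W stubs do on Windows 9x), and otherwise report the error from the call that actually failed. This is an assumption about what a more careful A/W wrapper could look like, not Matthew's stated design and not the Phobos implementation. `makeDirectory` is a hypothetical name; the module paths are those of current druntime and Phobos.

```d
version (Windows)
{
    import core.sys.windows.winbase : CreateDirectoryA, CreateDirectoryW, GetLastError;
    import core.sys.windows.windef : DWORD;
    import core.sys.windows.winerror : ERROR_CALL_NOT_IMPLEMENTED;
    import std.utf : toUTF16z;
    import std.windows.charset : toMBSz;
    import std.windows.syserror : WindowsException;

    /// Creates a directory from a UTF-8 path (hypothetical helper).
    void makeDirectory(string utf8Path)
    {
        // Try the Unicode entry point first.
        if (CreateDirectoryW(utf8Path.toUTF16z, null))
            return;

        // Capture the error immediately, before any other API call can overwrite it.
        immutable DWORD wideError = GetLastError();

        // A real failure (path not found, access denied, ...) is reported as is.
        if (wideError != ERROR_CALL_NOT_IMPLEMENTED)
            throw new WindowsException(wideError, utf8Path);

        // Only when the W version is unavailable, convert to the current
        // code page and retry with the A version.
        if (!CreateDirectoryA(toMBSz(utf8Path), null))
            throw new WindowsException(GetLastError(), utf8Path);
    }
}
```

With this shape, the error code carried by the exception always belongs to the API call that the caller would expect to have failed.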
