March 21, 2005
Regan Heath says...
>The simplest possible format...
>
>If you assume your values cannot contain \r\n and your labels/settings cannot contain spaces then you can simply use the following format:
>
>label<space>value<\r\n>
>
>and parse it by calling "find" on each line, looking for a space, and assuming the rest of the line (minus the \r\n) is the value.
>
>If you decide later on that you need \r\n in your values you can encode them as \, r, \, n eg.
>
>label<space>regan\r\nwas\r\nhere<\r\n>
>
>In general the fewer special characters you define, the fewer special cases you have to handle in values. Further if you can pick characters you will never want to use in values you don't have to handle any special cases at all.

Hmm... sure that would work, but space is definitely something that needs to be used. But your example would be a nightmare to read. The config file format is supposed to be read and changed not only by myself, but by any casual stats user.

A way to do it:

[variable]
values of the variable, taking the whole line, with no // comments allowed

Sure, that would be trivial to do, but it is also a bit less nice to read.

My stats will have quite extensive 4-5 columns of obituaries using " as a delimiter. So a good way to get rid of \" would help. In my ANSI C code, it simply was not possible, and I was lucky enough not to require it (or I would have had to hack my code to allow for it).

A quite elegant way to solve the problem would be to disallow tabs in values, and use the tab char to separate the columns. The problem with that is that the user might accidentally add a tab and never notice it.

AEon
March 21, 2005
On Mon, 21 Mar 2005 23:10:05 +0000 (UTC), AEon <AEon_member@pathlink.com> wrote:
> Regan Heath says...
>> The simplest possible format...
>>
>> If you assume your values cannot contain \r\n and your labels/settings
>> cannot contain spaces then you can simply use the following format:
>>
>> label<space>value<\r\n>
>>
>> and parse it by calling "find" on each line, looking for a space, and
>> assuming the rest of the line (minus the \r\n) is the value.
>>
>> If you decide later on that you need \r\n in your values you can encode
>> them as \, r, \, n eg.
>>
>> label<space>regan\r\nwas\r\nhere<\r\n>
>>
>> In general the fewer special characters you define, the fewer special
>> cases you have to handle in values. Further if you can pick characters you
>> will never want to use in values you don't have to handle any special
>> cases at all.
>
> Hmm... sure that would work, but space is definitely something that needs to be
> used.

In the label/setting name?

> But your example would be a nightmare to read.

I don't think so, but I guess this is personal preference.
If you like, replace <space> with <tab>, or allow both.
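
The label/value format with "space or tab" as the separator can be sketched roughly as below. This is only an illustration written against current D (string, std.string.indexOfAny) rather than the 2005 Phobos used elsewhere in this thread; the names Setting and parseSetting are made up for the example, and stripping the value is an assumption (it also removes leading/trailing spaces that might be wanted).

```d
import std.stdio : writeln;
import std.string : indexOfAny, strip;

// Hypothetical container for one parsed line.
struct Setting
{
    string label;
    string value;
}

// Split a line at the first space or tab. The label is everything before
// the separator; the value is the rest of the line with surrounding
// whitespace (including a trailing \r\n) stripped.
// Returns false when the line contains no separator at all.
bool parseSetting(string line, out Setting s)
{
    auto pos = line.indexOfAny(" \t");
    if (pos < 0)
        return false;
    s.label = line[0 .. pos];
    s.value = line[pos + 1 .. $].strip;
    return true;
}

void main()
{
    Setting s;
    if (parseSetting("game\tQuake III Arena\r\n", s))
        writeln(s.label, " = ", s.value); // prints: game = Quake III Arena
}
```

Because only the first separator is special, spaces inside the value need no escaping at all.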

> The config file format is
> supposed to be read and changed by not only myself, but any casual stats user.

KISS (Keep It Simple Stupid) - no insult intended. The more people who have to edit it, the simpler you should attempt to make it.

Alternatively, provide a simple program to read/write it and get them to use that.

> A quite elegant way to solve the problem would be to disallow tabs in values,
> and use the tab char to separate the columns. The problem with that is that the
> user might accidentally add a tab and never notice it.

- Treat consecutive tabs as 1 tab.
- Ignore trailing tabs.
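
Those two rules amount to dropping every empty column after splitting on tabs. A minimal sketch in current D (not the 2005 library), with splitColumns being a name invented for this example:

```d
import std.algorithm.iteration : filter;
import std.array : array, split;
import std.stdio : writeln;

// Split a line on tabs, treating runs of tabs as one separator and
// ignoring trailing tabs. split() yields an empty piece between adjacent
// tabs and after a trailing tab; filtering out empty pieces removes both.
string[] splitColumns(string line)
{
    return line.split("\t").filter!(c => c.length > 0).array;
}

void main()
{
    // An accidental double tab and a trailing tab do no harm.
    writeln(splitColumns("0\t\tShotgun\tSG\t")); // prints: ["0", "Shotgun", "SG"]
}
```

The price of this leniency is that a column can never be intentionally empty, and a leading tab is ignored as well.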

I can see one potential problem: depending on the text editor and the length of the values, the columns might not line up in the text file. However, Excel and, I imagine, other spreadsheet-style programs can load/save tab-separated value text files, and also comma-separated text files, e.g.

a,b,c
d,e,f
..etc..

Regan
March 22, 2005
On Mon, 21 Mar 2005 01:27:31 +0000 (UTC), AEon wrote:

> I started to code my parser, a *lot* easier with D; commands like std.string.split work miracles.
> 
> But I am still wondering how to optimize parsing, in this case of a configuration file:
> 
> <code>
> // comments
> [General]
> game		"Quake III Arena"
> gameInfo	"Retail, Rocket Arena III, Q3: Team Arena"
> gameOpt		"-q3a"			// *** comment
> gameMode	"16"
> // comments
> </code>
> 
> I do a
> 
>    std.string.find(line, "game")
> 
> to find out if the line contains my key-variable. And then a
> 
>   char[][] splitLine = std.string.split(line, "\"");
> 
> accessing the value of the var of interest via
> 
>  splitLine[1]
> 
> Now that is fine and dandy. But when I want to allow the user to use double quotes (") in the config file, this will turn ugly, since the above split does not distinguish between " and \".
> 
> Any ideas how to elegantly read the var/value pairs should the value contain a \"?
> 
> (In C I did some very evil manual hacking to make that work).
> 
> Thanx.
> 
I have a module that will 'tokenize' lines that will probably suit your needs. I've attached the code but if you can't fetch it that way, let me know and I'll make it available on the web.

-- 
Derek
Melbourne, Australia
22/03/2005 1:46:54 PM



March 22, 2005
Derek Parnell says...

>22/03/2005 1:46:54 PM
>begin 644 test.d

What kind of format is that? It looks like something similar to tar, shar or the like. Those I would copy/paste into a test file and unpack via TotalCommander. But your format?

AEon
March 22, 2005
On Tue, 22 Mar 2005 12:43:44 +0000 (UTC), AEon <AEon_member@pathlink.com> wrote:
> Derek Parnell says...
>
>> 22/03/2005 1:46:54 PM
>> begin 644 test.d
>
> What kind of format is that? It looks like something similar to tar, shar or
> the like. Those I would copy/paste into a test file and unpack via
> TotalCommander. But your format?

It's uuencoded. Search the web for a utility commonly called uudecode. Or, if you have WinACE, rename/save the data as a .uue file, right-click on it in Windows Explorer and use the WinACE "extract here" option.

Regan
March 22, 2005
On Tue, 22 Mar 2005 12:43:44 +0000 (UTC), AEon wrote:

> Derek Parnell says...
> 
>>22/03/2005 1:46:54 PM
>>begin 644 test.d
> 
> What kind of format is that? It looks like something similar to tar, shar or the like. Those I would copy/paste into a test file and unpack via TotalCommander. But your format?

It ain't "my" format but a very commonly used one - UUEncode. Most news readers can handle it but I'll make the file available on the web (for now).

  http://www.users.bigpond.com/ddparnell/linetoken.d


-- 
Derek Parnell
Melbourne, Australia
22/03/2005 11:49:15 PM
March 22, 2005
In article <18k6r1o1h3k7g.1478moloz1j32.dlg@40tude.net>, Derek Parnell says...
>
>On Tue, 22 Mar 2005 12:43:44 +0000 (UTC), AEon wrote:
>
>> Derek Parnell says...
>> 
>>>22/03/2005 1:46:54 PM
>>>begin 644 test.d
>> 
>> What kind of format is that? It looks like something similar to tar, shar or the like. Those I would copy/paste into a test file and unpack via TotalCommander. But your format?
>
>It ain't "my" format but a very commonly used one - UUEncode. Most news readers can handle it but I'll make the file available on the web (for now).
>
>  http://www.users.bigpond.com/ddparnell/linetoken.d

I had not been suggesting you invented it ;)...

Copy/paste into a file named .uue works just fine with TotalCommander. I had totally forgotten about uue.

AEon
March 22, 2005
Derek Parnell says...

>begin 644 linetoken.d

A few questions about your code:


  char[][] TokenizeLine(char[] pSource, char[] pDelim = ",", char[] pComment = "//")

As I understand this, pDelim and pComment can be set on calling via TokenizeLine(), but need not since both have "default" values? If so, another very useful code example.
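
For reference, default parameter values behave as described. The sketch below uses a made-up function, describe, purely to show the calling convention; it is not the TokenizeLine implementation and is written in current D (string instead of char[]).

```d
import std.stdio : writeln;

// Hypothetical function: the last two parameters have default values,
// so callers may leave them out from right to left.
string describe(string source, string delim = ",", string comment = "//")
{
    return "source=" ~ source ~ " delim=" ~ delim ~ " comment=" ~ comment;
}

void main()
{
    writeln(describe("a, b"));           // both defaults are used
    writeln(describe("a; b", ";"));      // delim overridden, comment defaulted
    writeln(describe("a; b", ";", "#")); // both overridden
}
```

With positional arguments, a later default can only be overridden if the earlier ones are supplied too.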


 int find(dchar[] pStringToScan, dchar pCharToFind)

I noted that you defined your own find() function. Generally, would that function conflict with those defined in the std lib? Or do user-defined functions automatically shadow lib functions?


Amazing piece of code (it will take me a while to read/understand). From your example
output test cases:

TokenizeLine(" abc, def , ghi, ") // default Delim ","  Comment is "//"
--> {"abc", "def", "ghi", ""}

split(" abc, def , ghi, ", ","), and then apply strip() on every element, would
do the same, but not as elegantly :)


TokenizeLine("character    or spaces to be \t inserted", "")
--> {"character", "or", "spaces", "to", "be", "inserted"}

An empty delimiter seems to be an alias for \t and " " (space)? Nice!
(Just noted from your info: However, if DelimChar is an empty string, then
tokens are delimited by any group of one or more white-space characters. By
default, DelimChar is ",".)
Duplicating that with split() would be tough.


TokenizeLine(" abc; def , ghi; ", ";")
--> {"abc", "def , ghi", "" }

Noting, you seem to be calling something like strip() though not exactly that
function.


TokenizeLine(" abc, [def , ghi]        ") // default Delim ","  Comment is "//"
--> {"abc", "[", "def , ghi"}

(Explanation: If a token begins with a bracket (parenthesis, square, or brace), then you will get back two tokens. The first is the opening bracket as a single character string, and the second is all the characters up to, but not including, the matching end bracket, taking nested brackets (of the same type) into consideration.)

Would not:

--> {"abc", "[", "def , ghi", "]"}

or even

--> {"abc", "def , ghi" }

be "neater"?


TokenizeLine(" abc, [def , [ghi, jkl] ]  ")
--> {"abc", "[", "def , [ghi, jkl] "}

Anything in brackets is treated literally (i.e. as is), so nested brackets are
not interpreted. OK.

So if you actually wanted to use [] or () in strings, and that may well happen often, one would actually need to "escape" those in some way? I am not sure that the special treatment brackets require will always be convenient.

TokenizeLine(` abc, "def , ghi" , jkl `)
--> {"abc", `"`, "def , ghi", "jkl"}


TokenizeLine(` "moo"  \t " oi\"nk\"  " \t "ladida " //Comment`, `"`, `//`)
0-->``
1-->`moo`
2-->`t`
3-->`oi"nk"`
4-->`t`
5-->`ladida`

Wishlist:
0: Would wish to not have element 0.
1: fine, but should be element token 0.
2: \t tab no longer recognized, tab should have been ignored
3: Perfect
4: same as 2
5: Perfect
"6" comment ignored, fine

I would hope to do this to a line:

`"moo" <whitespaces> " oi\"nk\"  " <whitespaces> "ladida "//Comment`

->0: `moo`
->1: `oi"nk"`
->2: `ladida`

Presently it would not be possible to rely on a specific column containing the contents of a specific double-quote pair.

Could that be made possible?
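
To make the wish concrete, here is one way such column extraction could look. It is a rough sketch in current D, independent of linetoken.d; quotedFields is a hypothetical name, and treating everything outside the quotes (whitespace, a literal \t, a trailing // comment) as ignorable is an assumption of this example.

```d
import std.stdio : writeln;
import std.string : strip;

// Collect the contents of each "..." field on a line, in order.
// Inside a field, \" yields a literal quote. Text between fields is
// skipped, and a // outside quotes ends the scan.
string[] quotedFields(string line)
{
    string[] fields;
    size_t i = 0;
    while (i < line.length)
    {
        if (line[i] == '/' && i + 1 < line.length && line[i + 1] == '/')
            break; // comment outside quotes: stop scanning
        if (line[i] != '"')
        {
            ++i; // skip anything between fields
            continue;
        }
        ++i; // step past the opening quote
        string field;
        while (i < line.length && line[i] != '"')
        {
            if (line[i] == '\\' && i + 1 < line.length)
                ++i; // drop the backslash, keep the escaped character
            field ~= line[i];
            ++i;
        }
        ++i; // step past the closing quote, if there was one
        fields ~= field.strip;
    }
    return fields;
}

void main()
{
    auto cols = quotedFields(`"moo"  \t " oi\"nk\"  " \t "ladida " //Comment`);
    foreach (n, col; cols)
        writeln("->", n, ": `", col, "`");
}
```

Each quote pair maps to exactly one array element, so column n always holds the n-th quoted value.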

Thanx for your work...

AEon
March 22, 2005
On Tue, 22 Mar 2005 14:14:12 +0000 (UTC), AEon wrote:

> Derek Parnell says...
> 
>>begin 644 linetoken.d
> 
> A few questions about your code:
> 
> 
>   char[][] TokenizeLine(char[] pSource, char[] pDelim = ",", char[] pComment = "//")
> 
> As I understand this, pDelim and pComment can be set on calling via TokenizeLine(), but need not since both have "default" values? If so, another very useful code example.
> 
> 
>  int find(dchar[] pStringToScan, dchar pCharToFind)
> 
> I noted that you defined your own find() function. Generally, would that function conflict with those defined in the std lib? Or do user-defined functions automatically shadow lib functions?

The D method to resolve such ambiguities is to fully qualify the reference with the package/module name, such as ...

    lPos = util.linetoken.find(lResult, lToken);

[snip]

> TokenizeLine("character    or spaces to be \t inserted", "") --> {"character", "or", "spaces", "to", "be", "inserted"}
> 
> An empty delimiter seems to be an alias for \t and " " (space)? Nice!
> (Just noted from your info: However, if DelimChar is an empty string, then
> tokens are delimited by any group of one or more white-space characters. By
> default, DelimChar is ",".)
> Duplicating that with split() would be tough.

Yes, the empty delimiter uses groups of one or more whitespace characters to act as a single delimiter. If you really need *just* the space character to be the delimiter then use that, same with tabs.

> 
> TokenizeLine(" abc; def , ghi; ", ";")
> --> {"abc", "def , ghi", "" }
> 
> Noting, you seem to be calling something like strip() though not exactly that
> function.
> 
> 
> TokenizeLine(" abc, [def , ghi]        ") // default Delim ","  Comment is "//" --> {"abc", "[", "def , ghi"}
> 
> (Explanation: If a token begins with a bracket (parenthesis, square, or brace), then you will get back two tokens. The first is the opening bracket as a single character string, and the second is all the characters up to, but not including, the matching end bracket, taking nested brackets (of the same type) into consideration.)
> 
> Would not:
> 
> --> {"abc", "[", "def , ghi", "]"}
> 
> or even
> 
> --> {"abc", "def , ghi" }
> 
> be "neater"?

Often, when parsing the tokens you need to know if a token was enclosed in
brackets or quotes. By supplying the opening bracket or quote in the
returned tokens, you can quickly see which were bracketed tokens. Also,
there is no need to supply the closing bracket or quote as you know what
that would have been by the opening bracket or quote. In other words, if
you come across a token of "{" you know the next token was enclosed in
braces, so you don't need to see the final brace.

> 
> TokenizeLine(" abc, [def , [ghi, jkl] ]  ")
> --> {"abc", "[", "def , [ghi, jkl] "}
> 
> Anything in brackets is treated literally (i.e. as is), so nested brackets are
> not interpreted. OK.
> 
> So if you actually wanted to use [] or () in strings, and that may well happen often, one would actually need to "escape" those in some way? I am not sure that the special treatment brackets require will always be convenient.

There are two ways (at least) to do that. First method is to use the Escape
character (the back-slash "\").

  TokenizeLine(` abc, \[def , [ghi, jkl] ]  `)
 --> { "abc", "[def", "ghi, jkl", "]" }

  TokenizeLine(`abc, def\, ghi, jkl`)
 --> { "abc", "def, ghi", "jkl"} Note only 3 tokens.

The other way is to enclose it inside a different sort of bracket/quote.

  TokenizeLine(`He said, '"Let's go down to the river".`,  ``)
 --> { `He`, `said,`,  `'`, `"Let's go down to the river".` }


> TokenizeLine(` "moo"  \t " oi\"nk\"  " \t "ladida " //Comment`, `"`, `//`)
> 0-->``
> 1-->`moo`
> 2-->`t`
> 3-->`oi"nk"`
> 4-->`t`
> 5-->`ladida`
> 
> Wishlist:
> 0: Would wish to not have element 0.
> 1: fine, but should be element token 0.
> 2: \t tab no longer recognized, tab should have been ignored
> 3: Perfect
> 4: same as 2
> 5: Perfect
> "6" comment ignored, fine

Well, you said that the token delimiter was the double-quote. Also, you
used the 'raw' string format so the sequence "\t" is not a tab but
literally a backslash-t combination.
So this would have been broken up like this ...

` `
`moo`
`  \t `
` oi\"nk\"  `
` \t `
`ladida `
` //Comment`

Then when leading and trailing spaces are removed you get ...
``
`moo`
`\t`
`oi\"nk\"`
`\t`
`ladida`
`//Comment`

Then applying escaped characters
``
`moo`
`t`
`oi"nk"`
`t`
`ladida`
`//Comment`

Then when removing comments ...
``
`moo`
`t`
`oi"nk"`
`t`
`ladida`



> I would hope to do this to a line:
> 
> `"moo" <whitespaces> " oi\"nk\"  " <whitespaces> "ladida "//Comment`
> 
> ->0: `moo`
> ->1: `oi"nk"`
> ->2: `ladida`
> 
char[][] Toks = TokenizeLine(`"moo" <whitespaces> " oi\"nk\"  " <whitespaces> "ladida "//Comment`, "");
// Toks --> { `"`, `moo`, `"`, ` oi"nk"  `, `"`, `ladida` }
int i = 0;
foreach(char[] aTok; Toks)
{
   if (aTok != `"`)
   {
       writefln("->%d: `%s`", i, std.string.strip(aTok));
       i++;
   }
}


> Presently it would not be possible to rely on a specific column containing the contents of a specific double-quote pair.
> 
> Could that be made possible?

I suppose so, but it is designed to handle free form text and not column-delimited stuff.

-- 
Derek Parnell
Melbourne, Australia
23/03/2005 1:40:47 AM
March 22, 2005
Derek Parnell,

At least in my case:

[Weapons]
"0"	" killed @ by MOD_SHOTGUN"	"Shotgun"	"SG"
"1"	" killed @ by MOD_GAUNTLET"	"Gauntlet"	"G"

something quite simple just occurred to me. When using std.string.splitlines() to read complete lines from a text file, you will *never* encounter a \n in the line, since that would have placed the content on another line.

So when you have something like this:

"1"	" killed \"@\" by MOD_GAUNTLET"	"Gauntlet"	"G"

You could replace every \" with \n and be sure that the line will not lose any information.

Then char[][] spline = split(line, "\""); And finally replace any \n back to \"
(or right back to " depending how you want to use the spline elements).
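
Spelled out, the replace/split/replace idea could look like the sketch below. It uses current D (std.array.replace and split on string) rather than the 2005 API, splitQuoted is a name invented here, and it assumes the line really contains no newline and no other escape sequences.

```d
import std.array : replace, split;
import std.stdio : writeln;

// Split a line on double quotes while keeping escaped quotes (\") intact.
// A line read from a file cannot contain '\n', so '\n' is used as a
// temporary stand-in for the escaped quote.
string[] splitQuoted(string line)
{
    auto masked = line.replace(`\"`, "\n"); // hide the escaped quotes
    auto pieces = masked.split(`"`);        // split on the real quotes
    foreach (ref p; pieces)
        p = p.replace("\n", `"`);           // restore them as plain quotes
    return pieces;
}

void main()
{
    string line = `"1"` ~ "\t" ~ `" killed \"@\" by MOD_GAUNTLET"` ~ "\t"
        ~ `"Gauntlet"` ~ "\t" ~ `"G"`;
    auto spline = splitQuoted(line);
    // The quoted values land at the odd indices; the even ones hold the
    // text between the quotes (here: tabs or nothing).
    writeln(spline[1]);
    writeln(spline[3]);
}
```

Restoring to a plain " rather than back to \" means the elements can be used directly as values.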

Obviously your code is a lot more flexible, but as we all strive for "KISS" ;) my idea should work quite well.

BTW: I have noticed that many feedback posts should really be archived, especially all the very useful code examples.

AEon