March 21, 2005
Re: String Parsing with \" in a ".." text line
Regan Heath says...
>The simplest possible format...
>
>If you assume your values cannot contain \r\n and your labels/settings  
>cannot contain spaces then you can simply use the following format:
>
>label<space>value<\r\n>
>
>and parse it by calling "find" on each line, looking for a space, and  
>assuming the rest of the line (minus the \r\n) is the value.
>
>If you decide later on that you need \r\n in your values you can encode  
>them as \, r, \, n eg.
>
>label<space>regan\r\nwas\r\nhere<\r\n>
>
>In general the fewer special characters you define, the fewer special  
>cases you have to handle in values. Further if you can pick characters you  
>will never want to use in values you don't have to handle any special  
>cases at all.
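
A minimal sketch of the quoted "label<space>value" idea, written for a current D2 compiler rather than the 2005-era dialect used in this thread. The name `parseSetting` is hypothetical, and decoding a literal `\r\n` inside the value is just one way to read the suggestion above:

```d
import std.array : replace;
import std.stdio : writeln;
import std.string : indexOf;

/// Splits "label value" at the first space.
/// Returns false when the line contains no space at all.
bool parseSetting(string line, out string label, out string value)
{
    auto pos = line.indexOf(' ');
    if (pos < 0)
        return false;
    label = line[0 .. pos];
    // Assumption: a literal backslash-r-backslash-n in the file stands
    // for a real CR/LF pair inside the value.
    value = line[pos + 1 .. $].replace(`\r\n`, "\r\n");
    return true;
}

void main()
{
    string label, value;
    if (parseSetting("game Quake III Arena", label, value))
        writeln(label, " = ", value);
}
```

Only the first space is special here, so the value itself may contain as many spaces as it likes.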

Hmm... sure that would work, but space is definitely something that needs to be
used. But your example would be a nightmare to read. The config file format is
supposed to be read and changed by not only myself, but any casual stats user.

A way to do it:

[variable]
values of the variable the whole line not allowing // comments

Sure, that would be trivial to do. But it is also a bit less nice to read.

My stats will have quite extensive 4-5 columns of obituaries using " as a
delimiter. So a good way to get rid of \" would help. In my ANSI C code, it
simply was not possible, and I was lucky enough not to require it (or I had to
hack my code to allow for it).

A quite elegant way to solve the problem would be to disallow tabs as values,
and use the tab char to separate the columns. Problem with that is, the user
might accidentally add a tab and never notice it.

AEon
March 21, 2005
Re: String Parsing with \" in a ".." text line
On Mon, 21 Mar 2005 23:10:05 +0000 (UTC), AEon <AEon_member@pathlink.com>  
wrote:
> Regan Heath says...
>> The simplest possible format...
>>
>> If you assume your values cannot contain \r\n and your labels/settings
>> cannot contain spaces then you can simply use the following format:
>>
>> label<space>value<\r\n>
>>
>> and parse it by calling "find" on each line, looking for a space, and
>> assuming the rest of the line (minus the \r\n) is the value.
>>
>> If you decide later on that you need \r\n in your values you can encode
>> them as \, r, \, n eg.
>>
>> label<space>regan\r\nwas\r\nhere<\r\n>
>>
>> In general the fewer special characters you define, the fewer special
>> cases you have to handle in values. Further if you can pick characters  
>> you
>> will never want to use in values you don't have to handle any special
>> cases at all.
>
> Hmm... sure that would work, but space is definately something that  
> needs to be
> used.

In the label/setting name?

> But your example would be a nightmare to read.

I don't think so, but I guess this is personal preference.
If you like, replace <space> with <tab>, or allow both.

> The config file format is
> supposed to be read and changed by not only myself, but any casual stats  
> user.

KISS (Keep It Simple Stupid) - no insult intended. The more people who  
have to edit it, the simpler you should attempt to make it.

Alternately provide a simple program to read/write it and get them to use  
that.

> A quite elegant way to solve the problem would be to disallow tabs as  
> values,
> and use the tab char to seperate the columns. Problem with that is, the  
> user
> might accidentally add a tab and never notice it.

- Treat consecutive tabs as 1 tab.
- Ignore trailing tabs.
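
The two rules above could be sketched roughly like this in current D2 (not the 2005 dialect of this thread). `splitColumns` is a hypothetical helper; note that dropping every empty field also swallows leading tabs, which may or may not be wanted:

```d
import std.algorithm : filter;
import std.array : array, split;
import std.stdio : writeln;

/// Splits a line on tabs and throws away empty fields, so that runs of
/// tabs count as one separator and trailing tabs are ignored.
string[] splitColumns(string line)
{
    return line.split("\t")
               .filter!(col => col.length > 0)
               .array;
}

void main()
{
    // Two tabs between a and b, one trailing tab.
    writeln(splitColumns("a\t\tb\tc\t"));
}
```

The price of this leniency is that a column can never be intentionally empty.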

I can see 1 potential problem. Depending on the text editor and the length of
the values in the file, the columns might not line up in the text file.
However, Excel and, I imagine, other spreadsheet-style programs can
load/save tab-separated value text files, and also comma-separated text files,
i.e.

a,b,c
d,e,f
..etc..

Regan
March 22, 2005
Re: String Parsing with \" in a ".." text line - "test.d" (1/1) uuEncoded 697 bytes - "linetoken.d" (1/1) uuEncoded 8058 bytes
On Mon, 21 Mar 2005 01:27:31 +0000 (UTC), AEon wrote:

> I started to code my parser, a *lot* easier with D, commands like
> std.string.split work miracles.
> 
> But I am still wondering how to optimize parsing, in this case of a
> configuration file:
> 
> <code>
> // comments
> [General]
> game		"Quake III Arena"
> gameInfo	"Retail, Rocket Arena III, Q3: Team Arena"
> gameOpt		"-q3a"			// *** comment
> gameMode	"16"
> // comments
> </code>
> 
> I do a 
> 
>    std.string.find(line, "game")
> 
> to find out if the line contains my key-variable. And then a
> 
>   char[][] splitLine = std.string.split(line, "\"");
> 
> accessing the value of the var of interest via
> 
>  splitLine[1]
> 
> Now that is fine and dandy. But when I want to allow the user to use double
> quotes (") in the config file, this will turn ugly, since the above split does
> not differ between " and \".
> 
> Any ideas how to elegantly read the var/value pairs should the value contain a
> \"?
> 
> (In C I did some very evil manual hacking to make that work).
> 
> Thanx.
> 
I have a module that will 'tokenize' lines that will probably suit your
needs. I've attached the code but if you can't fetch it that way, let me
know and I'll make it available on the web.

-- 
Derek
Melbourne, Australia
22/03/2005 1:46:54 PM
March 22, 2005
Re: String Parsing with \" in a ".." text line
Derek Parnell says...

>22/03/2005 1:46:54 PM
>begin 644 test.d

What kind of format is that? Looks like something similar to tar, shar or
something. Those I would copy/paste into a test file and unpack them via
TotalCommander. But your format?

AEon
March 22, 2005
Re: String Parsing with \" in a ".." text line
On Tue, 22 Mar 2005 12:43:44 +0000 (UTC), AEon <AEon_member@pathlink.com>  
wrote:
> Derek Parnell says...
>
>> 22/03/2005 1:46:54 PM
>> begin 644 test.d
>
> What kind of format it that. Looks like something similar to tar, shar or
> something. Those I would copy/paste into a test file and unpack them via
> TotalCommander. But your format?

It's uuencoded. Search the web for a utility commonly called uudecode. Or,
if you have WinACE, rename/save the data as a .uue file, right-click on it
in Windows Explorer and use the WinACE "extract here" option.

Regan
March 22, 2005
Re: String Parsing with \" in a ".." text line
On Tue, 22 Mar 2005 12:43:44 +0000 (UTC), AEon wrote:

> Derek Parnell says...
> 
>>22/03/2005 1:46:54 PM
>>begin 644 test.d
> 
> What kind of format it that. Looks like something similar to tar, shar or
> something. Those I would copy/paste into a test file and unpack them via
> TotalCommander. But your format?

It ain't "my" format but a very commonly used one - UUEncode. Most news
readers can handle it but I'll make the file available on the web (for
now).

 http://www.users.bigpond.com/ddparnell/linetoken.d


-- 
Derek Parnell
Melbourne, Australia
22/03/2005 11:49:15 PM
March 22, 2005
Re: String Parsing with \" in a ".." text line
In article <18k6r1o1h3k7g.1478moloz1j32.dlg@40tude.net>, Derek Parnell says...
>
>On Tue, 22 Mar 2005 12:43:44 +0000 (UTC), AEon wrote:
>
>> Derek Parnell says...
>> 
>>>22/03/2005 1:46:54 PM
>>>begin 644 test.d
>> 
>> What kind of format it that. Looks like something similar to tar, shar or
>> something. Those I would copy/paste into a test file and unpack them via
>> TotalCommander. But your format?
>
>It ain't "my" format but a very commonly used one - UUEncode. Most news
>readers can handle it but I'll make the file available on the web (for
>now).
>
>  http://www.users.bigpond.com/ddparnell/linetoken.d

I had not been suggesting you invented it ;)...

Copy/paste into a file named .uue works just fine with TotalCommander. I had
totally forgotten about uue.

AEon
March 22, 2005
TokenizeLine() (was String Parsing with \" in a ".." text line)
Derek Parnell says...

>begin 644 linetoken.d

A few questions about your code:


 char[][] TokenizeLine(char[] pSource, char[] pDelim = ",", char[] pComment = "//")

As I understand this, pDelim and pComment can be set on calling via
TokenizeLine(), but need not since both have "default" values? If so, another
very useful code example.
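
For illustration, default arguments behave like this in a small stand-alone sketch (current D2; `describe` is a hypothetical stand-in with the same parameter shape, not the real TokenizeLine):

```d
import std.stdio : writeln;

/// Hypothetical function: delim and comment fall back to their
/// defaults whenever the caller leaves them out.
string describe(string source, string delim = ",", string comment = "//")
{
    return "source=" ~ source ~ " delim=" ~ delim ~ " comment=" ~ comment;
}

void main()
{
    writeln(describe("a,b"));            // both defaults used
    writeln(describe("a;b", ";"));       // delim overridden, comment default
    writeln(describe("a;b", ";", "#"));  // both overridden
}
```

Arguments can only be left off from the right, so overriding the comment marker means spelling out the delimiter as well.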


int find(dchar[] pStringToScan, dchar pCharToFind)

I noted that you defined your own find() function. Generally, would that function
conflict with those defined in the std lib? Or do user-defined functions
automatically shadow lib functions?


Amazing piece of code (will take me a while to read/understand), from your example
output test cases:

TokenizeLine(" abc, def , ghi, ") // default Delim ","  Comment is "//"
--> {"abc", "def", "ghi", ""}

split(" abc, def , ghi, ", ","), and then apply strip() on every element, would
do the same, but not as elegantly :)
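
The split-then-strip variant can be sketched as follows in current D2 (`splitAndStrip` is a hypothetical name; the thread itself uses the older `char[]` API):

```d
import std.algorithm : map;
import std.array : array, split;
import std.stdio : writeln;
import std.string : strip;

/// Splits on the delimiter, then trims whitespace from every piece.
string[] splitAndStrip(string line, string delim = ",")
{
    return line.split(delim)
               .map!(tok => tok.strip)
               .array;
}

void main()
{
    writeln(splitAndStrip(" abc, def , ghi, "));
}
```

Unlike TokenizeLine, this knows nothing about quotes, brackets, escapes or comments; it only covers the plain delimiter case.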


TokenizeLine("character    or spaces to be \t inserted", "")
--> {"character", "or", "spaces", "to", "be", "inserted"}

An empty delimiter seems to be an alias for \t and " " (space)? Nice!
(Just noted from your info: However, if DelimChar is an empty string, then
tokens are delimited by any group of one or more white-space characters. By
default, DelimChar is ",".)
Duplicating that with split() would be tough.


TokenizeLine(" abc; def , ghi; ", ";")
--> {"abc", "def , ghi", "" }

Noting, you seem to be calling something like strip() though not exactly that
function.


TokenizeLine(" abc, [def , ghi]        ") // default Delim ","  Comment is "//"
--> {"abc", "[", "def , ghi"}

(Explanation: If a token begins with a bracket (parenthesis, square, or brace),
then you will get back two tokens. The first is the opening bracket as a single
character string, and the second is all the characters up to, but not including,
the matching end bracket, taking nested brackets (of the same type) into
consideration.)

Would not:

--> {"abc", "[", "def , ghi", "]"}

or even

--> {"abc", "def , ghi" }

be "neater"?


TokenizeLine(" abc, [def , [ghi, jkl] ]  ")
--> {"abc", "[", "def , [ghi, jkl] "}

Anything in brackets is treated literally (i.e. as is), so nested brackets are
not interpreted. OK.

So if you actually wanted to use [] or () in strings, and that may well happen
often, one would actually need to "escape" those in some way? I am not sure that
the special treatment brackets require will always be convenient.

TokenizeLine(` abc, "def , ghi" , jkl `)
--> {"abc", `"`, "def , ghi", "jkl"}


TokenizeLine(` "moo"  \t " oi\"nk\"  " \t "ladida " //Comment`, `"`, `//`)
0-->``
1-->`moo`
2-->`t`
3-->`oi"nk"`
4-->`t`
5-->`ladida`

Wishlist:
0: Would wish to not have element 0.
1: fine, but should be element token 0.
2: \t tab no longer recognized, tab should have been ignored
3: Perfect
4: same as 2
5: Perfect
"6" comment ignored, fine

I would hope to do this to a line:

`"moo" <whitespaces> " oi\"nk\"  " <whitespaces> "ladida "//Comment`

->0: `moo`
->1: `oi"nk"`
->2: `ladida`

Presently it would not be possible to rely on a specific column to contain the
info of a specific double-quote pair.

Could that be made possible?

Thanx for your work...

AEon
March 22, 2005
Re: TokenizeLine() (was String Parsing with \" in a ".." text line)
On Tue, 22 Mar 2005 14:14:12 +0000 (UTC), AEon wrote:

> Derek Parnell says...
> 
>>begin 644 linetoken.d
> 
> A few questions about your code:
> 
> 
>   char[][] TokenizeLine(char[] pSource, char[] pDelim = ",", char[] pComment =
> "//")
> 
> As I understand this, pDelim and pComment can be set on calling via
> TokenizeLine(), but need not since both have "default" values? If so, another
> very useful code example.
> 
> 
>  int find(dchar[] pStringToScan, dchar pCharToFind)
> 
> I noted that you defined you own find() function. Generally would that function
> conflict with those defined in the std lib? Or do user-defined functiontions
> automatically shadow lib functions?

The D method to resolve such ambiguities is to fully qualify the reference
with the package/module name, such as ...

   lPos = util.linetoken.find(lResult, lToken);
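
A small sketch of the same idea in current D2, using hypothetical names. To my understanding, a function declared in the current module is found before imported ones, so the unqualified call picks the local version and the library one stays reachable through its fully qualified name:

```d
import std.stdio : writeln;
static import std.string;   // library names reachable only when fully qualified

/// Hypothetical local helper that shares its name with a library function.
ptrdiff_t indexOf(const(char)[] haystack, char needle)
{
    foreach (i, c; haystack)
        if (c == needle)
            return cast(ptrdiff_t) i;
    return -1;
}

void main()
{
    auto local = indexOf("a,b", ',');             // this module's function
    auto lib   = std.string.indexOf("a,b", ',');  // library function, fully qualified
    writeln(local, " ", lib);
}
```

The `static import` is a deliberate choice here: it forces every use of the library module to be written out in full, so a reader can always tell which function is meant.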

[snip]

> TokenizeLine("character    or spaces to be \t inserted", "")
> --> {"character", "or", "spaces", "to", "be", "inserted"}
> 
> An empty delimiter seems to be an alias for \t and " " (space)? Nice!
> (Just noted from your info: However, if DelimChar is an empty string, then
> tokens are delimited by any group of one or more white-space characters. By
> default, DelimChar is ",".)
> Duplicating that with split() would be tough.

Yes, the empty delimiter uses groups of one or more whitespace characters
to act as a single delimiter. If you really need *just* the space character
to be the delimiter then use that, same with tabs. 

> 
> TokenizeLine(" abc; def , ghi; ", ";")
> --> {"abc", "def , ghi", "" }
> 
> Noting, you seem to be calling something like strip() though not exactly that
> function.
> 
> 
> TokenizeLine(" abc, [def , ghi]        ") // default Delim ","  Comment is "//"
> --> {"abc", "[", "def , ghi"}
> 
> (Explanation: If a token begins with a bracket (parenthesis, square, or brace),
> then you will get back two tokens. The first is the opening bracket as a single
> character string, and the second is all the characters up to, but not including,
> the matching end bracket, taking nested brackets (of the same type) into
> consideration.)
> 
> Would not:
> 
> --> {"abc", "[", "def , ghi", "]"}
> 
> or even
> 
> --> {"abc", "def , ghi" }
> 
> be "neater"?

Often, when parsing the tokens you need to know if a token was enclosed in
brackets or quotes. By supplying the opening bracket or quote in the
returned tokens, you can quickly see which were bracketed tokens. Also,
there is no need to supply the closing bracket or quote as you know what
that would have been by the opening bracket or quote. In other words, if
you come across a token of "{" you know the next token was enclosed in
braces, so you don't need to see the final brace.
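
One way a caller might consume such marker tokens, sketched in current D2. The `Token` struct and `pairMarkers` are hypothetical and work on a plain token array, not on the real TokenizeLine output type:

```d
import std.stdio : writeln;

/// A token plus the bracket/quote that enclosed it ('\0' if none).
struct Token
{
    string text;
    char opener;
}

/// True for the single-character marker tokens described above.
bool isOpener(string tok)
{
    if (tok.length != 1)
        return false;
    switch (tok[0])
    {
        case '[', '(', '{', '"', '\'':
            return true;
        default:
            return false;
    }
}

/// Folds each marker token into the token that follows it.
Token[] pairMarkers(string[] raw)
{
    Token[] result;
    for (size_t i = 0; i < raw.length; i++)
    {
        if (isOpener(raw[i]) && i + 1 < raw.length)
        {
            result ~= Token(raw[i + 1], raw[i][0]);
            i++;    // the enclosed token has been consumed too
        }
        else
            result ~= Token(raw[i], '\0');
    }
    return result;
}

void main()
{
    foreach (tok; pairMarkers(["abc", "[", "def , ghi"]))
        writeln(tok.opener == '\0' ? "bare   " : "quoted ", tok.text);
}
```

This sketch cannot tell a marker from an ordinary token that happens to consist of a single bracket character, so it is only a starting point.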

> 
> TokenizeLine(" abc, [def , [ghi, jkl] ]  ")
> --> {"abc", "[", "def , [ghi, jkl] "}
> 
> Anything in brackets is treated literally (i.e. as is), so nested brackets are
> not interpreted. OK.
> 
> So if you actually wanted to use [] or () in strings, and that may well happen
> often, one would actually need to "escape" those in some way? I am not sure that
> the special treatment brackets require will always be convenient.

There are two ways (at least) to do that. First method is to use the Escape
character (the back-slash "\").  

 TokenizeLine(` abc, \[def , [ghi, jkl] ]  `)
--> { "abc", "[def", "ghi, jkl", "]" }

 TokenizeLine(`abc, def\, ghi, jkl`)
--> { "abc", "def, ghi", "jkl"} Note only 3 tokens.

The other way is to enclose it inside a different sort of bracket/quote.

 TokenizeLine(`He said, '"Let's go down to the river".`,  ``)
--> { `He`, `said,`,  `'`, `"Let's go down to the river".` }


> TokenizeLine(` "moo"  \t " oi\"nk\"  " \t "ladida " //Comment`, `"`, `//`)
> 0-->``
> 1-->`moo`
> 2-->`t`
> 3-->`oi"nk"`
> 4-->`t`
> 5-->`ladida`
> 
> Wishlist:
> 0: Would wish to not have element 0.
> 1: fine, but should be element token 0.
> 2: \t tab no longer recognized, tab should have been ingnored
> 3: Perfect
> 4: same as 2
> 5: Perfect
> "6" comment ignored, fine

Well, you said that the token delimiter was the double-quote. Also, you
used the 'raw' string format so the sequence "\t" is not a tab but
literally a backslash-t combination. 
So this would have been broken up like this ...

` `
`moo`
`  \t `
` oi\"nk\"  `
` \t `
`ladida `
` //Comment`

Then when leading and trailing spaces are removed you get ...
``
`moo`
`\t`
`oi\"nk\"`
`\t`
`ladida`
`//Comment`

Then applying escaped characters
``
`moo`
`t`
`oi"nk"`
`t`
`ladida`
`//Comment`

Then when removing comments ...
``
`moo`
`t`
`oi"nk"`
`t`
`ladida`



> I would hope to do this to a line:
> 
> `"moo" <whitespaces> " oi\"nk\"  " <whitespaces> "ladida "//Comment`
> 
> ->0: `moo`
> ->1: `oi"nk"`
> ->2: `ladida`
> 
char[][] Toks = TokenizeLine(`"moo" <whitespaces> " oi\"nk\"  " <whitespaces> "ladida "//Comment`, "");
// <whitespaces> stands for actual spaces or tabs in the real line.
// Toks --> { `"`, `moo`, `"`, ` oi"nk"  `, `"`, `ladida` }
int i = 0;
foreach(char[] aTok; Toks)
{
  if (aTok != `"`) 
  {
      writefln("->%d: `%s`", i, std.string.strip(aTok));
      i++;
  }
}


> Presently it would not be possibly to rely on a specific column to contain the
> info a specific double quote pair.
> 
> Could that be made possible?

I suppose so, but it is designed to handle free form text and not
column-delimited stuff. 

-- 
Derek Parnell
Melbourne, Australia
23/03/2005 1:40:47 AM
March 22, 2005
Re: TokenizeLine()
Derek Parnell,

At least in my case:

[Weapons]
"0"	" killed @ by MOD_SHOTGUN"	"Shotgun"	"SG"
"1"	" killed @ by MOD_GAUNTLET"	"Gauntlet"	"G"

something quite simple just occurred to me. When using std.string.splitlines() to
read complete lines from a text file, you will *never* encounter a \n in the
line, since that would have placed the content on another line.

So when you have something like this:

"1"	" killed \"@\" by MOD_GAUNTLET"	"Gauntlet"	"G"

You could replace every \" with \n and be sure that the line will not lose any
information.

Then char[][] spline = split(line, "\""); And finally replace any \n back to \"
(or right back to " depending how you want to use the spline elements).

Obviously your code is a lot more flexible, but as we all strive for "KISS" ;)
my idea should work quite well.
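
A minimal sketch of this placeholder trick in current D2 (not the 2005 dialect of the thread). `quotedValues` is a hypothetical name, and the sample line uses spaces between the columns instead of tabs:

```d
import std.array : replace, split;
import std.stdio : writeln;

/// Returns the text found between each pair of unescaped double quotes.
string[] quotedValues(string line)
{
    // A line read from a file never contains '\n', so it can stand in
    // for the escaped quote while splitting.
    auto masked = line.replace(`\"`, "\n");

    string[] values;
    foreach (i, part; masked.split(`"`))
    {
        // Odd-numbered pieces sit inside a pair of quotes.
        if (i % 2 == 1)
            values ~= part.replace("\n", `"`);
    }
    return values;
}

void main()
{
    auto line = `"1" " killed \"@\" by MOD_GAUNTLET" "Gauntlet" "G"`;
    foreach (i, value; quotedValues(line))
        writeln(i, ": ", value);
}
```

Taking only the odd-numbered pieces also gives each quoted column a fixed index. One known gap: an escaped backslash directly before a closing quote (`\\"`) would be misread as an escaped quote.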

BTW: I have been noting, many feedback posts should really be archived,
especially all the very useful code-examples.

AEon