January 16, 2012 Limitation with current regex API
Hi all,

In general, I'm enjoying the regex respin. However, I ran into one issue that seems to have no clean workaround.

Generally, I want to be able to get the start and end indices of matches. With the complete match, this info can be pieced together with match.pre().length and match.hit.length(). However, I can't do this with captures.

For an example: I have a string and the regex .*(a).*(b).*(c).*. I want to find where a, b, and c are located when I match. As far as I can tell, the only way to do this would be to capture every chunk of text, then iterate to determine the offsets. That seems wasteful.

If you look at the ICU and Java regex APIs, you'll see that this information is retrievable. I believe it's available under the covers of the D regex library API too. Can this please be exposed? It's very helpful for doing text processing where you need to be able to align the results of multiple transformations to the input text.

Thanks
Jerry
|
January 17, 2012 Re: Limitation with current regex API
Posted in reply to Jerry | On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
> As far as I can tell, the only way to do this would be to capture every chunk of text, then iterate to determine the offsets.
Not sure if this is what you were referring to, but you can do...
m.pre.length + m.captures[1].ptr - m.hit.ptr
|
January 17, 2012 Re: Limitation with current regex API
Posted in reply to Vladimir Panteleev | On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev wrote:
> On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
>> As far as I can tell, the only way to do this would be to capture every chunk of text, then iterate to determine the offsets.
>
> Not sure if this is what you were referring to, but you can do...
Even simpler: m.captures[1].ptr - s.ptr
(s is the string being matched)
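
For readers following along, here is a minimal sketch of the pointer-difference approach described above. It assumes the input is a plain `string` and that each capture returned by `std.regex` is a slice of that same input, which is what the suggestion relies on. The helper name `offsetIn` is hypothetical and not part of `std.regex`, and the sketch uses `matchFirst`, which is the current spelling rather than the `match(...).captures` form used in this thread.

```d
import std.regex : matchFirst, regex;
import std.stdio : writefln;

/// Start offset of `part` within `whole`, in code units.
/// Hypothetical helper: only meaningful when `part` is a slice of `whole`.
size_t offsetIn(const(char)[] whole, const(char)[] part)
{
    // Guard the slice assumption; a group that did not participate in the
    // match would not point into `whole` and would trip this check.
    assert(part.ptr >= whole.ptr
        && part.ptr + part.length <= whole.ptr + whole.length);
    return cast(size_t)(part.ptr - whole.ptr);
}

void main()
{
    string s = "xxaxxbxxcxx";
    auto m = matchFirst(s, regex(`.*(a).*(b).*(c).*`));
    assert(!m.empty);

    // Group 0 is the whole match; groups 1..3 are (a), (b), (c).
    foreach (i; 1 .. m.length)
    {
        const start = offsetIn(s, m[i]);
        writefln("group %s: [%s, %s)", i, start, start + m[i].length);
    }
}
```

The end index falls out of the slice length, so no extra capture groups or re-scanning of the input are needed.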
|
January 17, 2012 Re: Limitation with current regex API
Posted in reply to Vladimir Panteleev | "Vladimir Panteleev" <vladimir@thecybershadow.net> writes:
> On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev wrote:
>> On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
>>> As far as I can tell, the only way to do this would be to capture every chunk of text, then iterate to determine the offsets.
>>
>> Not sure if this is what you were referring to, but you can do...
>
> Even simpler: m.captures[1].ptr - s.ptr
>
> (s is the string being matched)
Ah ok, that'll work.
|
January 17, 2012 Re: Limitation with current regex API
2012/1/17 Mail Mantis <mail.mantis.88@gmail.com>:
> Correct me if I'm wrong, but wouldn't this be better:
> (m.captures[1].ptr - s.ptr) / s[0].sizeof;
No, it wouldn't. Somehow, I forgot the rules for pointer arithmetic. Sorry.
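
The rule in question: in D, subtracting two pointers of the same type yields the distance in elements, not bytes, so dividing by `s[0].sizeof` would divide twice. A small sketch using a `wstring`, where each code unit is two bytes, makes the difference visible. The helper name `wcharOffset` is hypothetical.

```d
/// Index of `part` within `whole`, counted in wchar code units.
/// Hypothetical helper; assumes `part` is a slice of `whole`.
size_t wcharOffset(const(wchar)[] whole, const(wchar)[] part)
{
    // Pointer subtraction already divides by wchar.sizeof.
    return cast(size_t)(part.ptr - whole.ptr);
}

void main()
{
    wstring s = "hello world"w;
    auto sub = s[6 .. $];                 // slice starting at 'w'

    assert(wchar.sizeof == 2);
    assert(wcharOffset(s, sub) == 6);     // element distance
    assert(s[wcharOffset(s, sub)] == 'w');

    // The byte distance only appears after casting to a byte pointer.
    const bytes = cast(const(ubyte)*) sub.ptr - cast(const(ubyte)*) s.ptr;
    assert(bytes == 12);
}
```

So for a `string`, `wstring`, or `dstring` input, the plain pointer difference is already the index to use with the original array.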
|
January 17, 2012 Re: Limitation with current regex API
Posted in reply to Vladimir Panteleev | "Vladimir Panteleev" <vladimir@thecybershadow.net> wrote in message news:klzeekkilpzwmjmkudhh@dfeed.kimsufi.thecybershadow.net...
> On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev wrote:
>> On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
>>> As far as I can tell, the only way to do this would be to capture every chunk of text, then iterate to determine the offsets.
>>
>> Not sure if this is what you were referring to, but you can do...
>
> Even simpler: m.captures[1].ptr - s.ptr
>
> (s is the string being matched)
That wouldn't work in @safe mode, would it?
|
January 17, 2012 Re: Limitation with current regex API
Posted in reply to Vladimir Panteleev | 2012/1/17 Vladimir Panteleev <vladimir@thecybershadow.net>:
> On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev wrote:
>>
>> On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
>>>
>>> As far as I can tell, the only way to do this would be to capture every chunk of text, then iterate to determine the offsets.
>>
>>
>> Not sure if this is what you were referring to, but you can do...
>
>
> Even simpler: m.captures[1].ptr - s.ptr
>
> (s is the string being matched)
Correct me if I'm wrong, but wouldn't this be better:
(m.captures[1].ptr - s.ptr) / s[0].sizeof;
|
January 17, 2012 Re: Limitation with current regex API
Posted in reply to Nick Sabalausky | On 01/17/2012 04:03 AM, Nick Sabalausky wrote:
> "Vladimir Panteleev"<vladimir@thecybershadow.net> wrote in message
> news:klzeekkilpzwmjmkudhh@dfeed.kimsufi.thecybershadow.net...
>> On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev wrote:
>>> On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
>>>> As far as I can tell, the only way to do this would be to capture every
>>>> chunk of text, then iterate to determine the offsets.
>>>
>>> Not sure if this is what you were referring to, but you can do...
>>
>> Even simpler: m.captures[1].ptr - s.ptr
>>
>> (s is the string being matched)
>
> That wouldn't work in @safe mode, would it?
>
>
There is nothing unsafe about the operation, so I'd actually expect it to work.
|
January 17, 2012 Re: Limitation with current regex API
Posted in reply to Timon Gehr | "Timon Gehr" <timon.gehr@gmx.ch> wrote in message news:jf2p5d$2ria$1@digitalmars.com...
> On 01/17/2012 04:03 AM, Nick Sabalausky wrote:
>> "Vladimir Panteleev"<vladimir@thecybershadow.net> wrote in message news:klzeekkilpzwmjmkudhh@dfeed.kimsufi.thecybershadow.net...
>>> On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev wrote:
>>>> On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
>>>>> As far as I can tell, the only way to do this would be to capture every chunk of text, then iterate to determine the offsets.
>>>>
>>>> Not sure if this is what you were referring to, but you can do...
>>>
>>> Even simpler: m.captures[1].ptr - s.ptr
>>>
>>> (s is the string being matched)
>>
>> That wouldn't work in @safe mode, would it?
>>
>>
>
> There is nothing unsafe about the operation, so I'd actually expect it to work.
I thought pointer arithmetic was forbidden in @safe?
|
January 17, 2012 Re: Limitation with current regex API
Posted in reply to Nick Sabalausky | On 01/17/2012 05:00 AM, Nick Sabalausky wrote:
> "Timon Gehr"<timon.gehr@gmx.ch> wrote in message
> news:jf2p5d$2ria$1@digitalmars.com...
>> On 01/17/2012 04:03 AM, Nick Sabalausky wrote:
>>> "Vladimir Panteleev"<vladimir@thecybershadow.net> wrote in message
>>> news:klzeekkilpzwmjmkudhh@dfeed.kimsufi.thecybershadow.net...
>>>> On Tuesday, 17 January 2012 at 01:44:37 UTC, Vladimir Panteleev wrote:
>>>>> On Monday, 16 January 2012 at 19:28:42 UTC, Jerry wrote:
>>>>>> As far as I can tell, the only way to do this would be to capture
>>>>>> every
>>>>>> chunk of text, then iterate to determine the offsets.
>>>>>
>>>>> Not sure if this is what you were referring to, but you can do...
>>>>
>>>> Even simpler: m.captures[1].ptr - s.ptr
>>>>
>>>> (s is the string being matched)
>>>
>>> That wouldn't work in @safe mode, would it?
>>>
>>>
>>
>> There is nothing unsafe about the operation, so I'd actually expect it to
>> work.
>
> I thought pointer arithmetic was forbidden in @safe?
>
>
I don't know exactly, since @safe is neither fully specified nor implemented. In my understanding, operations that may lead to memory corruption are forbidden in @safe code. Subtracting one pointer from another cannot corrupt memory; other kinds of pointer arithmetic may.
|