Thread overview
IDEA: Text search engine tailored to a specific schema
Apr 17, 2015
Casey
Apr 17, 2015
Rikki Cattermole
Apr 17, 2015
Jacob Carlborg
Apr 17, 2015
Casey Sybrandy
Apr 17, 2015
Jacob Carlborg
Apr 18, 2015
Casey
April 17, 2015
O.K.  This is just an idea that's been running through my head, so I figured someone here may be interested.

Text search engines that I know of are meant to index unstructured data or apply a schema to data at runtime.  However, since D has the ability to do things at compile time, perhaps it would be an ideal solution for situations where a specific schema is used and must be searched on.  Instead of generic data structures used to represent the data, specialized data structures could be created at compile time to allow for better indexing and performance.
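A rough sketch of what that might look like, using D's compile-time introspection; the `Doc` schema and `Index` template are hypothetical names for illustration, not an existing library:

```d
import std.stdio : writeln;
import std.traits : FieldNameTuple;

// Hypothetical schema: in a real system this would come from the user.
struct Doc
{
    string title;
    int year;
}

// Generate one associative array per field at compile time, mapping
// each field value to the IDs of the documents holding that value.
struct Index(T)
{
    static foreach (name; FieldNameTuple!T)
        mixin("size_t[][typeof(__traits(getMember, T, `" ~ name ~ "`))] "
                ~ name ~ "Index;");

    void add(size_t id, T doc)
    {
        static foreach (name; FieldNameTuple!T)
            __traits(getMember, this, name ~ "Index")
                [__traits(getMember, doc, name)] ~= id;
    }
}

void main()
{
    Index!Doc idx;
    idx.add(0, Doc("Compile-time search", 2015));
    idx.add(1, Doc("Schema-aware indexing", 2015));
    writeln(idx.yearIndex[2015]); // prints [0, 1]
}
```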

That's about as far as I got with it.  To me, it seemed interesting enough to share.

Enjoy.
April 17, 2015
On 17/04/2015 2:26 p.m., Casey wrote:
> O.K.  This is just an idea that's been running through my head, so I
> figured someone here may be interested.
>
> Text search engines that I know of are meant to index unstructured data
> or apply a schema to data at runtime.  However, since D has the ability
> to do things at compile time, perhaps it would be an ideal solution for
> situations where a specific schema is used and must be searched on.
> Instead of generic data structures used to represent the data,
> specialized data structures could be created at compile time to allow
> for better indexing and performance.
>
> That's about as far as I got with it.  To me, it seemed interesting
> enough to share.
>
> Enjoy.

This sounds a lot like an ORM, only with the schema specified as a struct/class. In that case it wouldn't need to generate anything, as it has already been done.
April 17, 2015
On 2015-04-17 04:26, Casey wrote:
> O.K.  This is just an idea that's been running through my head, so I
> figured someone here may be interested.
>
> Text search engines that I know of are meant to index unstructured data
> or apply a schema to data at runtime.  However, since D has the ability
> to do things at compile time, perhaps it would be an ideal solution for
> situations where a specific schema is used and must be searched on.
> Instead of generic data structures used to represent the data,
> specialized data structures could be created at compile time to allow
> for better indexing and performance.
>
> That's about as far as I got with it.  To me, it seemed interesting
> enough to share.

Sounds a bit like the regular expression module. If you provide the regular expression at compile time, it will generate an engine specific to that regular expression.
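For reference, `std.regex` exposes this as `ctRegex`: when the pattern is a compile-time constant, a matcher specialized for that pattern is generated during compilation.

```d
import std.regex : ctRegex, matchFirst;

void main()
{
    // The pattern is known at compile time, so the engine is
    // generated and specialized for it during compilation.
    auto datePattern = ctRegex!`\d{4}-\d{2}-\d{2}`;
    auto m = matchFirst("posted 2015-04-17", datePattern);
    assert(m.hit == "2015-04-17");
}
```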

-- 
/Jacob Carlborg
April 17, 2015
I was thinking something a bit more specific without having to manually generate the structs.

For example, let's say I have a JSON document that has a number of fields in it.  Some are numbers, some are strings, etc.  What I'm thinking is that, either a) based on the JSON structure or b) based on a schema that describes the JSON, the objects and/or indices are defined at compile time, and done so in an optimal manner.  For example, if we know from the schema that a field is an enumeration, then instead of an inverted index, a simple associative array that maps each value to an array of matching document IDs is used.  This way, if I search on that specific field, it can be done in the most efficient way possible.  Also, the documents themselves would be stored more optimally.
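A sketch of that enumeration case; `Status` and `StatusIndex` are made-up names for illustration:

```d
// For a field the schema declares to be an enumeration, a plain
// associative array of document-ID lists can replace a general
// inverted index.
enum Status { open, closed }

struct StatusIndex
{
    size_t[][Status] byStatus;   // enum value -> matching document IDs

    void add(size_t docId, Status s) { byStatus[s] ~= docId; }

    // Lookup is a single AA probe: no tokenization, no scoring.
    size_t[] find(Status s) { return byStatus.get(s, null); }
}

void main()
{
    StatusIndex idx;
    idx.add(0, Status.open);
    idx.add(1, Status.closed);
    idx.add(2, Status.open);
    assert(idx.find(Status.open) == [0, 2]);
}
```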

So, no, this isn't an ORM, as I'm not mapping objects to an underlying data store.  I guess what I'm thinking of is the text search equivalent of the regular expression engine.  Thinking about it now, I should have mentioned that this would be like Sphinx/Lucene/ElasticSearch, except that it would be optimized for a specific document structure rather than being general purpose.  The optimizations would be generated at compile time based on a sample document structure or schema rather than coding everything manually.
April 17, 2015
On 2015-04-17 16:21, Casey Sybrandy wrote:
> I was thinking something a bit more specific without having to manually
> generate the structs.
>
> For example, let's say I have a JSON document that has a number of
> fields in it.  Some are numbers, some are strings, etc.  What I'm
> thinking either a) based on the JSON structure or b) based on a schema
> that describes the JSON, the objects and/or indices are defined at
> compile-time and done so in an optimal manner.  For example, if based on
> the schema we know that a field is an enumeration, instead of an inverted
> index a simple associative array that contains arrays of matching
> document IDs is used instead.  This way, if I search on that specific
> field, it can be done in the most efficient way possible.  Also, the
> documents themselves would be stored more optimally.

I think this is similar to how the D implementation of Thrift works.

-- 
/Jacob Carlborg
April 18, 2015
> I think this is similar to how the D implementation of Thrift works.

Yes, exactly!  Except that instead of generating code to send/receive messages, I'm thinking that indices are built that are specific to the data.  That, I think, is the harder part, as you have to know what is optimal for the data and what operations are expected on it, then put it all together.  However, I have to wonder whether making the indices this specific would improve query performance by a significant amount.
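One way the per-field choice could be expressed, as a sketch; `IndexFor` is a hypothetical template, and a real inverted index would of course need more than a plain associative array:

```d
import std.traits : isSomeString;

// Pick an index representation per field type, entirely at compile time.
template IndexFor(T)
{
    static if (is(T == enum))
        alias IndexFor = size_t[][T];      // direct AA: enum value -> doc IDs
    else static if (isSomeString!T)
        alias IndexFor = size_t[][string]; // term -> doc IDs (inverted index)
    else
        alias IndexFor = size_t[][T];      // exact-match AA for numbers etc.
}

enum Color { red, blue }

// The choices are resolved during compilation, so each field's index
// carries no runtime dispatch overhead.
static assert(is(IndexFor!Color == size_t[][Color]));
static assert(is(IndexFor!string == size_t[][string]));

void main() {}
```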