Fixing dub search (page 2)

On Tuesday, 29 December 2020 at 12:04:28 UTC, aberba wrote:
> On Tuesday, 29 December 2020 at 11:47:11 UTC, Ola Fosheim Grøstad wrote:
>> It is written in Rust...
>
> If anyone has one written in D too, we can use that as well. I just want to have the embarrassing search fixed.

Alright, if someone wants to start on it, then I'm willing to help out with suggestions and code reviews. This is a decent starting point:

https://nlp.stanford.edu/IR-book/information-retrieval-book.html

And also Wikipedia:

https://en.wikipedia.org/wiki/Inverted_index
https://en.wikipedia.org/wiki/Approximate_string_matching
https://en.wikipedia.org/wiki/Suffix_array
https://en.wikipedia.org/wiki/Trie
https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm

On Tuesday, 29 December 2020 at 08:45:02 UTC, aberba wrote:
> On Monday, 28 December 2020 at 20:49:12 UTC, sarn wrote:
>> The best place for discussion is here, though:
>> https://github.com/dlang/dub-registry/issues/93
>
> If you've looked at the very discussion you referenced, you'd realize they went around and still came back to using mongodb for search.

I read it. It's the bug report for this issue, and it's the discussion thread for this community project. There's no them vs you. There isn't even a final decision in that thread. MongoDB is being used now because that's what's implemented.

> And between elasticsearch and MeiliSearch, MeiliSearch is simpler, lightweight and easy to use.

Have you considered anything other than ElasticSearch, MeiliSearch and custom hacks? There are at least four other options mentioned in the thread I linked to. Maybe add MeiliSearch and your reasons for using it.

On Tuesday, 29 December 2020 at 22:27:17 UTC, sarn wrote:
> Have you considered anything other than ElasticSearch, MeiliSearch and custom hacks? There are at least four other options mentioned in the thread I linked to. Maybe add MeiliSearch and your reasons for using it.

The easiest option is to see if an indexing service (like the mentioned Algolia) is willing to sponsor Dub as an open source project, then they get some free advertising in return.

On Monday, 28 December 2020 at 20:49:12 UTC, sarn wrote:
> On Monday, 28 December 2020 at 10:49:58 UTC, aberba wrote:
> The best place for discussion is here, though:
> https://github.com/dlang/dub-registry/issues/93

In that thread you wrote

> The FTS features of DBs like Sqlite and Postgres are really nice if you're already using those DBs (otherwise other tools are more powerful). Moving all data to Sqlite or PG is obviously a whole bigger decision.

sqlite was the first thing I thought about when I saw this thread. How much data would have to be copied into an sqlite database for searching of packages? That has the advantage of more or less no dependencies, dead simple to add, claimed good results...

On Tuesday, 29 December 2020 at 22:53:55 UTC, bachmeier wrote:
> sqlite was the first thing I thought about when I saw this thread. How much data would have to be copied into an sqlite database for searching of packages? That has the advantage of more or less no dependencies, dead simple to add, claimed good results...

I'm not being dismissive (I also don't use Dub), but in general this would not scale very well. Unless you want to do all searches locally. Also, a high quality search engine requires custom ranking, so not really sure if it is overall less work than rolling your own if you want high quality search results. The text corpus is tiny, so there is really no point in using a generic on-disk solution?
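
For readers who want to picture what "rolling your own" over a tiny corpus could look like, here is a minimal sketch of an in-memory inverted index with a hand-tuned ranking. Nothing here comes from the dub registry: the `PackageIndex` class, the field weights, and the package names in the usage note are hypothetical, and the scoring rule (a name hit counts three times a description hit) is only an assumed example of "custom ranking".

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class PackageIndex:
    """Tiny in-memory inverted index: token -> {package name -> weight}."""

    # Assumed weights for illustration only: a hit in the package name
    # counts more than a hit in the description.
    NAME_WEIGHT = 3.0
    DESCRIPTION_WEIGHT = 1.0

    def __init__(self):
        self._postings = defaultdict(lambda: defaultdict(float))

    def add(self, name, description):
        """Index one package under every token of its name and description."""
        for token in tokenize(name):
            self._postings[token][name] += self.NAME_WEIGHT
        for token in tokenize(description):
            self._postings[token][name] += self.DESCRIPTION_WEIGHT

    def search(self, query):
        """Return package names ordered by descending score, then by name."""
        scores = defaultdict(float)
        for token in tokenize(query):
            for name, weight in self._postings.get(token, {}).items():
                scores[name] += weight
        return sorted(scores, key=lambda n: (-scores[n], n))
```

With three made-up packages, `json-tools` ("Fast parser"), `web-kit` ("HTTP server with json support") and `img` ("Image loading"), a search for `json` ranks `json-tools` ahead of `web-kit` because the name match outweighs the description match.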

December 29, 2020

Re: Fixing dub search

Posted by bachmeier
in reply to Ola Fosheim Grøstad


On Tuesday, 29 December 2020 at 23:17:33 UTC, Ola Fosheim Grøstad wrote:
> On Tuesday, 29 December 2020 at 22:53:55 UTC, bachmeier wrote:
>> sqlite was the first thing I thought about when I saw this thread. How much data would have to be copied into an sqlite database for searching of packages? That has the advantage of more or less no dependencies, dead simple to add, claimed good results...
>
> I'm not being dismissive (I also don't use Dub), but in general this would not scale very well. Unless you want to do all searches locally. Also, a high quality search engine requires custom ranking, so not really sure if it is overall less work than rolling your own if you want high quality search results. The text corpus is tiny, so there is really no point in using a generic on-disk solution?

Well, except that sqlite works now and has been extensively tested. I don't want to discourage anyone from rolling their own, but knowing how long things take around here, and using actuarial tables to compute my life expectancy, it's not obvious that it would impact me. That's also why adding another dependency concerns me.

On Tuesday, 29 December 2020 at 23:32:28 UTC, bachmeier wrote:
> On Tuesday, 29 December 2020 at 23:17:33 UTC, Ola Fosheim Grøstad wrote:
> Well, except that sqlite works now and has been extensively tested. I don't want to discourage anyone from rolling their own, but knowing how long things take around here, and using actuarial tables to compute my life expectancy, it's not obvious that it would impact me. That's also why adding another dependency concerns me.

I actually implemented Damerau–Levenshtein in Python the other day in order to validate an exam question... It takes <15 minutes from scratch. A faster version on a trie can be done in an evening, debugged and tested. A full system in a weekend.

But, the advantage in using an existing online service is that you get automatic scaling and better uptime: write once, run forever... I think such a service should be grateful if Dub provided them with:

1. cheap advertising
2. a maintained API to their service

Maybe they even will pay for the work, who knows?
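
As a rough illustration of the from-scratch approach mentioned above (not the poster's actual code, which was not shared), here is a sketch of the restricted Damerau–Levenshtein distance, also known as optimal string alignment. It uses the plain dynamic-programming table rather than a trie. The `fuzzy_matches` helper and its `max_distance` default are hypothetical additions for showing how the distance could drive a package-name search.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and transpositions of
    adjacent characters; no substring is edited more than once.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = number of edits needed to turn a[:i] into b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


def fuzzy_matches(query, names, max_distance=2):
    """Hypothetical helper: names within max_distance edits, closest first."""
    scored = sorted((osa_distance(query.lower(), n.lower()), n) for n in names)
    return [name for distance, name in scored if distance <= max_distance]
```

This compares the query against every name, which costs time proportional to the product of the string lengths per comparison. That is acceptable for a small list; the trie-based version mentioned in the post would avoid recomputing shared prefixes.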

On Tuesday, 29 December 2020 at 23:40:44 UTC, Ola Fosheim Grøstad wrote:
> On Tuesday, 29 December 2020 at 23:32:28 UTC, bachmeier wrote:
>> On Tuesday, 29 December 2020 at 23:17:33 UTC, Ola Fosheim Grøstad wrote:
>> Well, except that sqlite works now and has been extensively tested. I don't want to discourage anyone from rolling their own, but knowing how long things take around here, and using actuarial tables to compute my life expectancy, it's not obvious that it would impact me. That's also why adding another dependency concerns me.
>
> I actually implemented Damerau–Levenshtein in Python the other day in order to validate an exam question... It takes <15 minutes from scratch. A faster version on a trie can be done in an evening, debugged and tested. A full system in a weekend.

Also, keep in mind that the fuzzy search does not have to be crazy fast when people often search for the same stuff. Just log all search phrases and preload the caches with the most common ones. With some luck maybe 90% of all searches hit caches?
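
The log-and-preload idea above can be sketched in a few lines. This is only an illustration under assumptions: the `CachedSearch` class, its method names, and the normalisation rule (lowercase, collapsed whitespace) are all hypothetical, and `slow_search` stands in for whatever fuzzy search sits behind it.

```python
from collections import Counter
from typing import Callable, Dict, List


class CachedSearch:
    """Logs every search phrase and serves the most common ones from a cache."""

    def __init__(self, slow_search: Callable[[str], List[str]]):
        self._slow_search = slow_search          # the expensive fuzzy search
        self._log = Counter()                    # phrase -> times requested
        self._cache: Dict[str, List[str]] = {}   # phrase -> preloaded results
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _normalize(phrase: str) -> str:
        """Assumed normalisation: lowercase and collapse whitespace."""
        return " ".join(phrase.lower().split())

    def search(self, phrase: str) -> List[str]:
        key = self._normalize(phrase)
        self._log[key] += 1
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        return self._slow_search(key)

    def preload(self, top_n: int) -> None:
        """Rebuild the cache from the top_n most frequently logged phrases."""
        self._cache = {
            phrase: self._slow_search(phrase)
            for phrase, _count in self._log.most_common(top_n)
        }
```

Calling `preload` periodically (for example after each registry update) keeps the cached results fresh, and the `hits` and `misses` counters give a direct way to check whether the hoped-for hit rate is actually reached.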

On Tuesday, 29 December 2020 at 23:40:44 UTC, Ola Fosheim Grøstad wrote:
> But, the advantage in using an existing online service is that you get automatic scaling and better uptime: write once, run forever... I think such a service should be grateful if Dub provided them with:

Here is the SLA for Algolia. They offer 99.99% and 99.999%, which translates to 53 minutes and 5 minutes of downtime per year. It would be difficult (highly improbable) to compete with that for a self-hosted solution.

https://www.algolia.com/blog/for-slas-theres-no-such-thing-as-100-uptime-only-100-transparency/
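
The downtime figures follow from simple arithmetic, sketched here for anyone who wants to check other availability levels. The function name is made up for this example, and a 365-day year is assumed.

```python
def downtime_minutes_per_year(availability_percent: float,
                              days_per_year: float = 365.0) -> float:
    """Allowed downtime per year, in minutes, for a given availability."""
    minutes_per_year = days_per_year * 24 * 60  # 525,600 for a 365-day year
    return minutes_per_year * (1 - availability_percent / 100)


for sla in (99.99, 99.999):
    print(f"{sla}% -> {downtime_minutes_per_year(sla):.1f} minutes per year")
```

Under that assumption, 99.99% allows roughly 52.6 minutes per year and 99.999% roughly 5.3 minutes, which matches the rounded figures quoted in the post.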

On 12/29/20 5:34 PM, Ola Fosheim Grøstad wrote:
> The easiest option is to see if an indexing service (like the mentioned Algolia) is willing to sponsor Dub as an open source project, then they get some free advertising in return.

This is an excellent suggestion!
