Enhancing Search

I'm part of a team that's looking to help build out the search capabilities in Elgg 1.5, which are admittedly quite anemic. I've seen a few of the "full text" search plugins up here, but I don't think they go far enough. What I'd like to see available:

  • The main search callback implemented as an action or some other extendable hook
  • Plugins given control over the list of entities displayed for a query, with the ability to add and remove items in the list
  • The ability to have non-persisted entities in the results list (for things like external data sources) and have these integrated into the search display
  • Entities to know (for display purposes) *why* they were returned as part of the search query
  • Get rid of the generic search-on-metadata that leads to bizarre and unexpected behavior (search for "on" or "1", for example) and replace it entirely with plugin-based search functionality
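To make the wish list above concrete, here is a minimal sketch of a plugin-driven search pipeline. It is written in Python purely for illustration and is not Elgg's API: `SearchHit`, `SearchRegistry`, the handler functions and the sample data are all hypothetical names. Each handler receives the query and the current hit list and returns a new list, so a plugin can add or remove items; a hit with no `guid` stands in for a non-persisted result from an external source; and every hit carries the reasons it was returned.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SearchHit:
    """One search result. `guid` is None for non-persisted (external) items."""
    title: str
    guid: Optional[int] = None
    source: str = "core"
    reasons: List[str] = field(default_factory=list)  # why this hit was returned


# A handler takes the query and the current hits and returns the new hit list.
SearchHandler = Callable[[str, List[SearchHit]], List[SearchHit]]


class SearchRegistry:
    """Hypothetical hook registry: handlers run in registration order."""

    def __init__(self) -> None:
        self._handlers: List[SearchHandler] = []

    def register(self, handler: SearchHandler) -> SearchHandler:
        self._handlers.append(handler)
        return handler

    def search(self, query: str) -> List[SearchHit]:
        hits: List[SearchHit] = []
        for handler in self._handlers:
            hits = handler(query, hits)
        return hits


# Made-up sample data standing in for persisted entities.
BLOG_POSTS = {101: "Search plugin roadmap", 102: "Holiday photos"}
PRIVATE_GUIDS = {102}

registry = SearchRegistry()


@registry.register
def blog_titles(query: str, hits: List[SearchHit]) -> List[SearchHit]:
    """Adds persisted blog posts whose title contains the query."""
    for guid, title in BLOG_POSTS.items():
        if query.lower() in title.lower():
            hits.append(SearchHit(title, guid=guid, source="blog",
                                  reasons=[f"title contains '{query}'"]))
    return hits


@registry.register
def external_wiki(query: str, hits: List[SearchHit]) -> List[SearchHit]:
    """Adds a non-persisted hit from a pretend external data source."""
    if "search" in query.lower():
        hits.append(SearchHit("Wiki: Search FAQ", source="wiki",
                              reasons=["matched external wiki page"]))
    return hits


@registry.register
def hide_private(query: str, hits: List[SearchHit]) -> List[SearchHit]:
    """Removes hits that another plugin decides should not be shown."""
    return [h for h in hits if h.guid not in PRIVATE_GUIDS]
```

Calling `registry.search("search")` runs all three handlers in order. Because every handler sees the whole list, the performance and memory cost grows with the number of registered plugins.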

There are a lot of issues to work through here, many of which revolve around performance and memory use once a dozen plugins start feeding into the search results.

I have a small team of coders that I think I can throw at this, but I definitely want the support of both the community and, hopefully, Curverider, since this would necessitate changes being patched back into the core.

Who's with me?

  • @Cash - Yes, I was referring to the binary data in the session table. Not knowing what was stored in there, it concerned me a bit. I have seen some pretty poor usage of blobs in my day.

    Is the serialized data you are referring to located in the metastrings table? If so, then it's nice to see it all still in string form where it can be indexed, as you say. Is there any documentation out there on the best practices for these separate indexes and how Curverider uses them? It would be nice to see what the Elgg developers had in mind for this when they designed the architecture.

    Thanks again for all your input.

  • @Cash: the fulltext search piece (fts in the official plugins repository) actually just pulls a couple of main fields to search against, like name and description, and doesn't serialize the entire object. The fts system still ends up doing a parallel search on the metastrings in addition to the serialized, indexed objects.

    What I'm suggesting is one search system that all pieces could contribute results to.

  • @danielwells - I think the session data there is just the same binary data that is stored to disk by default by PHP. There shouldn't be binary data anywhere else in the Elgg tables.

    I've only seen the serialized data in the metastrings table and even that is rare.

  • @Justin - who from Curverider did you contact?

  • @Cash: I just used the "contact us" form on the website, haven't emailed or messaged anyone directly yet.

  • This is truly a critical piece for the website I'm currently designing. I have JimBob's Full Text Search, which seems to do a fairly good job (with the exception of displaying "ElggObjects"). I have tried to use a Custom Google Engine, but the "Friends" and "friendsof" pages overpopulate the results.

  • An external site-crawl won't work for us, as we'll be behind various firewalls for most of our installations. I looked a bit into the other full-text search, and it's a good start, but it still really only does things in the metastrings table for the entities that are already serialized in the DB. What I'm envisioning is a much more robust system.

  • Blobs are not all that scary. Take a good look at Swish-e, an open source indexing and search tool. It depends on blobs and it works just fine.

    I think the Curverider approach is probably the best. I think it should be a separate index. It should NOT be a direct query on the (My)SQL database. There's already enough demand on that. And does anyone really have the resources that Facebook has? Come on, people. Be realistic. Even with a dedicated server and a few thousand users, it's enough demand. All it takes is one creative genius to bring your site to a halt, if the query is directly tied to the DB.

    I think it should be sufficient if the site has its own process, run via cron, that rebuilds the index two or three times a day and serves queries from it. If any content is out-of-date, the system should just provide a page saying, "Sorry, this content is no longer available" yada yada yada.

    Glad to see everyone discussing and working on this. By the way, maybe just integrating Swish-e into Elgg would do the trick? Why reinvent the wheel?

  • And by the way, can we all avoid Google like the plague, please?! I am so tired of Google! I don't even want to use any of their APIs! All I want is good, clean, efficient open source software that I can have reside on my own sites! Thanks!!!

  • Maybe I missed something, but what in the world does Google have to do with anything?
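For reference, the separate-index idea raised earlier in the thread (rebuild the index on a schedule, answer queries from the index rather than the live database, and show a notice when a hit has since been deleted) might look roughly like the sketch below. It is plain Python with made-up data, not Elgg or Swish-e code: `CONTENT`, `build_index` and `search` are hypothetical names, and a real deployment would run the rebuild step from cron and persist the index to disk.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set

STALE_MESSAGE = "Sorry, this content is no longer available"

# Hypothetical stand-in for the live database: guid -> text.
CONTENT: Dict[int, str] = {
    1: "Elgg search plugin ideas",
    2: "Full text search with a separate index",
    3: "Group photo album",
}


def tokenize(text: str) -> List[str]:
    """Lowercases the text and splits it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(content: Dict[int, str]) -> Dict[str, Set[int]]:
    """Builds an inverted index (token -> guids). This is the scheduled step."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for guid, text in content.items():
        for token in tokenize(text):
            index[token].add(guid)
    return dict(index)


def search(index: Dict[str, Set[int]], live: Dict[int, str], query: str) -> List[str]:
    """Answers a query from the index; only matching guids touch live data."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # Every query token must match (AND semantics).
    guids = set.intersection(*(index.get(token, set()) for token in tokens))
    # A guid that is indexed but missing from live data was deleted after the
    # last rebuild, so a notice is shown in its place.
    return [live.get(guid, STALE_MESSAGE) for guid in sorted(guids)]
```

The trade-off is freshness: anything created or deleted between rebuilds is invisible or stale until the next run, which is why the lookup falls back to the "no longer available" notice.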