Enhancing Search

I'm part of a team that's looking to help build out the search capabilities in Elgg 1.5, which are admittedly quite anemic. I've seen a few of the "full text" search plugins up here, but I don't think they go far enough. What I'd like to see available:

  • The main search callback implemented as an action or otherwise extendable hook
  • Plugins be given control over the list of entities to display for a query, with the ability to add and remove items in the list
  • The ability to have non-persisted entities in the results list (for things like external data sources) and have these integrated into the search display
  • Entities to know (for display purposes) *why* they were returned as part of the search query
  • Get rid of the generic search-on-metadata that leads to bizarre and unexpected behavior (search for "on" or "1", for example) and replace it entirely with plugin-based search functionality

There are a lot of issues dealing with this, many of which could revolve around performance and memory considerations when a dozen plugins start feeding into the search results.

I have a small team of coders that I think I can throw at this, but I definitely want the support of both the community and, hopefully, Curverider, since this would necessitate changes being patched back into the core.

Who's with me?

  • I am with Cash in principle.. Binary data should not even be relevant in this..

    At the same time, if you are to implement the 'call & index' method for indexing, you would of course do it like Cash said, through a cron job that talks to the relevant plugin.. but my input would be to 'crawl & index'.. this can be applied to all plugins with an initial setting so that the indexing is initiated at the plugin level.. of course, this will minimise serialization.. and of course here we are as far away from binaries as possible.. the difficulty in this would be to integrate this, as a component, in plugins.. Having hooks to latch onto the indexing process is the cool idea here; that is a much more practical approach because we don't have to re-invent the wheel.. So I guess what I am saying is that the index would be dormant, and the code producing data to be indexed is integrated into the plugin itself.. I am not sure if I am making myself clear enough, but you need a whiteboard for this topic because you need to engineer the search from scratch.

    Using 3rd party search will not work.. and even if it does.. it will not bring the search function to the level Justin describes.

    @danielwells.. I can confirm what Cash said, no need to even remotely worry about session data (and all binary data).. as a matter of fact, one of the prerequisites for the search to work well here is that it's not only intelligent enough to discount that type of data, but is also told, in hard-coded terms, not to index it.. the best way you can achieve this is by isolating 'search indexes' from those binaries.

  • So I've gone and set up a Google Code Repository as a place to store things out in the open while we work on them. I'm still collecting people at work to put toward this project, but we're sketching out an architecture for it now. Any suggestions on a good place to do this? I want to engage with the community on this and work as openly as we can, to have not only the best code output but also the best chance of making something that can make it back into the core.

    I ping'd Curverider again for their input but heard no response. I'm planning on coming to the ElggCamp in Boston in a couple weeks (so long as personal plans don't get in the way) if folks want to sync up there as well, but I'd definitely like to start before then.

    And to a few of the guys in the thread: please keep the personal attacks away. You're not helping.

  • Also, to the binary data: if we let the plugins themselves respond to a search request, then we can let them be smart about searching the relevant data in whatever way best makes sense. I want to get away from a one-size centralized searching system that just does a string query against a single table. I would rather have a dynamic system that lets every component have its say in what comes back from a search, whether that be to add or remove from the display.

  • "I would rather have a dynamic system that lets every component have its say in what comes back from a search, whether that be to add or remove from the display." - Your concept, though interesting, sounds like way too much demand on any site, when you are talking about thousands of users on a site which is probably not all by itself on a dedicated server.

  • @ukr: It does sound like a lot of processing, but then again, computers are very fast. I don't think it will end up being all that much overhead in the end, and you'd be able to cache the snot out of it. I really think that discounting an approach for being too slow before it's been run at all is premature.

    @dave: Definitely, look forward to meeting you guys (and everyone else that's going). I've got a few folks from my organization that are planning to attend as well.

  • Some great discussions on this front at the ElggCamp yesterday -- more to come soon.

  • I'm interested and would like to know how we plan to start working on enhancing the search capability in Elgg.

  • Here's a quick rundown of what we have in mind:

    1. Entities are indexed periodically and this index is stored in its own table.
    2. Indexing will trigger a hook for plugins to add their own info.
    3. Live searching will pull from the index, but also emit a hook for plugins to add their own results.

    This is a super-simple version of what will need to happen.  I wouldn't expect this until at least 1.7 because it will require changes in core.
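
    The three steps above can be sketched roughly as follows. This is a language-neutral illustration in Python, not Elgg code (Elgg itself is PHP); the hook registry, the hook names "index" and "search", and the sample entities are all made up for the example.

```python
from collections import defaultdict

# Hypothetical hook registry; none of these names are Elgg APIs.
_hooks = defaultdict(list)

def register_hook(name, callback):
    _hooks[name].append(callback)

def trigger_hook(name, payload, value):
    """Pass `value` through every callback registered for `name`."""
    for callback in _hooks[name]:
        value = callback(payload, value)
    return value

# Stand-in for the dedicated index table: entity id -> indexed strings.
search_index = {}

def index_entity(entity):
    """Steps 1 and 2: index an entity and let plugins add their own text."""
    strings = [entity["title"]]
    strings = trigger_hook("index", entity, strings)
    search_index[entity["id"]] = strings

def live_search(query):
    """Step 3: search the index, then let plugins add or remove results."""
    q = query.lower()
    results = [eid for eid, strings in search_index.items()
               if any(q in s.lower() for s in strings)]
    return trigger_hook("search", query, results)

# A plugin that indexes an extra field at indexing time.
def index_tags(entity, strings):
    return strings + entity.get("tags", [])

# A plugin that contributes a non-persisted result at search time.
def external_results(query, results):
    if query.lower() == "elgg":
        return results + ["external:elgg-docs"]
    return results

register_hook("index", index_tags)
register_hook("search", external_results)

index_entity({"id": 1, "title": "Welcome post", "tags": ["elgg", "intro"]})
index_entity({"id": 2, "title": "Meeting notes"})
```

    Here a search for "elgg" finds entity 1 through its tags and also picks up the external result, while a search for "notes" only touches the index.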

  • Before going into too much detail, what types of searching are people talking about, functionally? Full-text searches? Straight value searches? More complex searches with multiple components combined with AND/OR? What is Elgg's role on the back end of all of this? Elgg is not only doing the search but also performing the "access control" checking on the entities being searched, to see if the current user has access to the data being requested. Oh yes, and it all needs to be fast...

    The structure of metadata in Elgg makes it very efficient to find one piece of information. The more information a person is looking for at one time (groups within a 15-mile radius of Boston with an interest in "social networking"), the slower the system will be.

    Maybe a developer could define the metadata associated with an entity type that will need to be queried (or searched) on, which could cause the data to be stored in a much more efficient structure for query purposes. It would also be helpful to define the data type of each piece of metadata for even better searching. (Dates are a great example of this.)

    Hooks into external systems would be great.  The question would be how would this get integrated into a single search format?  How would pagination be determined?

  • The index Brett is talking about would be filled in by each of the individual plugins that know about the specific parts of data that ought to be searchable. For example, the core profile plugin would index things like interests, about me, name, etc. This will also let the search system know *why* the entities in question matched (since the index will have to record its "source" or "type" as well as the entity and string information itself). The actual searching would be triggered by the callback system. The first callback would be "search the index", which would allow for a fast search across indexed entities in the database all in one go. If all you have in your system is indexed entities, it should work nicely and allow for a more on-target system than is there now.
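
    As a rough illustration of an index that records its "source", here is a minimal Python sketch. It is not Elgg code; the row layout, the source labels such as "profile:interests", and the sample data are assumptions made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexRow:
    entity_id: int
    source: str  # hypothetical label for the plugin/field that supplied the text
    text: str

# Stand-in for the index table, filled in by the individual plugins.
index_rows = [
    IndexRow(7, "profile:name", "Alex Example"),
    IndexRow(7, "profile:interests", "search engines, hiking"),
    IndexRow(9, "blog:body", "Notes on building a search index"),
]

def search_index(rows, keyword):
    """Return {entity_id: [sources that matched]} for one keyword."""
    keyword = keyword.lower()
    matches = defaultdict(list)
    for row in rows:
        if keyword in row.text.lower():
            matches[row.entity_id].append(row.source)
    return dict(matches)
```

    Searching for "search" returns entity 7 with the reason "profile:interests" and entity 9 with the reason "blog:body", which is the information a result view would need in order to say why each entity was returned.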

    Any plugins that add entities or information to entities that they want to be searchable will need to tie into the indexing system, and then they get the searching part for free. Plugins will also be able to hook into this same callback to extend the search results *at runtime*, say with external "entities" that are not actually saved into the database. This is a key feature for us.

    The type of search that we were looking at here would be a simple "keyword-OR" method, but having a full-text index on the index table will let us pick up substrings. At the moment we aren't looking into structured data queries (locations and date ranges and the like), but hopefully this new structure will leave enough room for an "Advanced search" method later on.
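
    A "keyword-OR" match can be sketched in a few lines of Python. This is only an illustration with made-up data: it uses a plain in-memory substring test as a stand-in, and a real database full-text index may tokenise and match differently.

```python
def keyword_or_search(documents, query):
    """Return {doc_id: [keywords that matched]} for any-keyword (OR) matching."""
    keywords = [k.lower() for k in query.split()]
    hits = {}
    for doc_id, text in documents.items():
        lowered = text.lower()
        matched = [k for k in keywords if k in lowered]
        if matched:  # OR semantics: one matching keyword is enough
            hits[doc_id] = matched
    return hits

# Hypothetical indexed strings, keyed by entity id.
documents = {
    1: "Social networking for educators",
    2: "Networked learning in Boston",
    3: "Gardening tips",
}
```

    With the query "network boston", document 1 matches on one keyword and document 2 on both; the number of matched keywords could later serve as a crude ranking signal.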

    Here's a random thought: should we make a callback to let things expand and alter the search *query*? Could lead to a cool stemming engine plugin if someone would want to write it.
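
    To show what such a query callback might look like, here is a small Python sketch. Everything in it is hypothetical: the `query_expanders` list stands in for a hook, and the "stemmer" is a deliberately naive suffix stripper, not a real stemming algorithm.

```python
# Hypothetical list of callbacks that may add terms to the query.
query_expanders = []

def expand_query(keywords):
    """Run every expander and append any new terms it suggests."""
    expanded = list(keywords)
    for expander in query_expanders:
        for term in expander(list(expanded)):
            if term not in expanded:
                expanded.append(term)
    return expanded

def naive_stemmer(keywords):
    """Strip a few common English suffixes; far cruder than a real stemmer."""
    suffixes = ("ing", "es", "s")
    stems = []
    for word in keywords:
        for suffix in suffixes:
            # Keep at least three characters so short words are left alone.
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                stems.append(word[: -len(suffix)])
                break
    return stems

query_expanders.append(naive_stemmer)
```

    With this in place, a query for "searching plugins" would be widened to also include "search" and "plugin" before it reaches the index.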

    Pagination and sorting, especially of mixed internal and external data, is a very interesting question, and not one we have completely figured out yet. Ideas on that would be greatly appreciated.