Enhancing Search

I'm part of a team that's looking to help build out the search capabilities in Elgg 1.5, which are admittedly quite anemic. I've seen a few of the "full text" search plugins up here, but I don't think they go far enough. What I'd like to see available:

  • The main search callback implemented as an action or otherwise extendable hook
  • Plugins given control over the list of entities to display for a query, with the ability to add and remove items in the list
  • The ability to have non-persisted entities in the results list (for things like external data sources) and have these integrated into the search display
  • Entities to know (for display purposes) *why* they were returned as part of the search query
  • The generic search-on-metadata, which leads to bizarre and unexpected behavior (search for "on" or "1", for example), removed and replaced entirely with plugin-based search functionality
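
To make the first few bullets concrete, here is a rough sketch of the shape I have in mind. It's written in Python rather than Elgg's PHP just to keep it short, and every name in it (SearchHit, register_search_handler, run_search, the sample handlers and their data) is made up for illustration; none of this exists in Elgg today. Each plugin registers a handler that receives the query plus the hits collected so far and returns the new list, so it can add persisted or external items, attach a reason, or remove things.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SearchHit:
    entity_id: str                 # may point at a non-persisted, external item
    title: str
    reasons: List[str] = field(default_factory=list)  # why this hit was returned


# A handler takes (query, hits so far) and returns the updated list of hits.
SearchHandler = Callable[[str, List[SearchHit]], List[SearchHit]]
_handlers: List[SearchHandler] = []


def register_search_handler(handler: SearchHandler) -> SearchHandler:
    """Hypothetical registration call; handlers run in registration order."""
    _handlers.append(handler)
    return handler


def run_search(query: str) -> List[SearchHit]:
    """Pass the result list through every registered handler in turn."""
    hits: List[SearchHit] = []
    for handler in _handlers:
        hits = handler(query, hits)
    return hits


@register_search_handler
def blog_search(query, hits):
    # Stand-in data; a real plugin would query its own storage.
    posts = {"blog:1": "Extending Elgg search", "blog:2": "Holiday photos"}
    for entity_id, title in posts.items():
        if query.lower() in title.lower():
            hits.append(SearchHit(entity_id, title, ["blog title matches"]))
    return hits


@register_search_handler
def external_catalog_search(query, hits):
    # A non-persisted result coming from an outside data source.
    title = "Search appliance manual"
    if query.lower() in title.lower():
        hits.append(SearchHit("external:manual", title, ["external catalog match"]))
    return hits


@register_search_handler
def hide_photos(query, hits):
    # A plugin can also remove items that earlier handlers added.
    return [hit for hit in hits if "photos" not in hit.title.lower()]
```

With these three handlers, run_search("search") returns the blog post and the external manual, each carrying its reason, while run_search("photos") comes back empty because the last handler filters the photo post out.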

There are a lot of open issues here, many of which revolve around performance and memory once a dozen plugins start feeding into the search results.

I have a small team of coders that I think I can throw at this, but I definitely want the support of both the community and, hopefully, Curverider, since this would necessitate changes being patched back into the core.

Who's with me?

  • I'm interested.

    For reference, here is Curverider's initial effort on this: http://community.elgg.org/pg/plugins/marcus/read/140017/full-text-search-experimental

    From your list, I don't understand what you mean by the fourth point: that entities should know why they were returned.

  • I would fully support any effort (for what my support is worth) to improve the search capabilities. I think its lack of "real" search is one of the last pieces keeping Elgg from moving from "an interesting project" to a full-blown enterprise/commercial solution.

    I would love to see these improvements.

    One of my biggest concerns (that I haven't had time to get in and dispel yet) is that Elgg's architecture/data model wouldn't lend itself well to an advanced full text search, especially after noticing a few "blobs" in the database.

  • Interesting idea for sure... any idea how you wanna handle the project?

  • @danielwells - there are serialized PHP objects in some tables (from both community plugins and core code). I think discouraging developers from creating their own database tables tends to encourage serialization for storage. Most likely, serialized objects will be ignored by any approach.

  • @Cash: For example, I have a person entity (ElggUser object) that gets flagged by the Profile plugin for having the "interests" field match, but also by the Blog plugin for authoring blog posts that have a title matching the search query. The search results should have those bits highlighted in the display. Do a simple search across Facebook for an example of what that might look like. Also, I've taken a look at the Curverider fulltext search plugin, and it doesn't address the fundamental problems with the search capability in Elgg.
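
    Roughly, I picture the search core collecting the reasons from every plugin and grouping them per entity before display. A minimal sketch, in Python rather than PHP for brevity; merge_matches and the (entity, plugin, reason) tuple format are made up for illustration and are not anything in Elgg:

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical shape of what each plugin reports: (entity id, plugin name, reason).
Match = Tuple[str, str, str]


def merge_matches(matches: Iterable[Match]) -> Dict[str, List[str]]:
    """Group match reasons by entity so each result is listed once,
    with every reason it was returned attached for display."""
    merged: Dict[str, List[str]] = {}
    for entity_id, plugin, reason in matches:
        merged.setdefault(entity_id, []).append(f"{plugin}: {reason}")
    return merged


# Sample data only: one user flagged by two plugins, another by one.
sample_matches = [
    ("user:42", "profile", "interests field matches"),
    ("user:42", "blog", "authored a post whose title matches"),
    ("user:7", "profile", "interests field matches"),
]
merged = merge_matches(sample_matches)
```

    Here user:42 shows up once in the results with both the profile and the blog reason, which is what the display layer would need in order to highlight those bits.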

  • @danielwells: One of the key ideas is that each plugin would know what "search" means across its own data artifacts, so it can do fulltext where appropriate. This doesn't necessarily give us the performance characteristics of a fully integrated search system, but I tend to follow this order when building software:

    1. Make it work
    2. Make it work right
    3. Make it work fast

    Taking those out of order tends to lead to pain and suffering. :)

  • @Tom: Do you mean in terms of managing the project with the community and my team, or how we would start the development itself (i.e., where to start hacking in the codebase)? In either arena, I'd really like to hear from Curverider about it, but I haven't had a reply to my email query yet.

  • @Cash - so what type of things are stored in these serialized objects? Is it content, something that we would want to search on? If so, search would now have a fairly significant architectural roadblock. The only way to get at that content would be to index it somewhere else and search on that index (not very clean, and it defeats the whole purpose of the blob to begin with) or to deserialize and look inside the blobs every time you run a search (not very efficient). Offhand, this architecture worries me quite a bit. Someone ease my mind.

  • @danielwells - well, the current Curverider approach is to create a separate set of indexes. The idea is that you update the indexes a few times a day using a cron job, because it is so expensive. The actual search should be fast on those indexes. If each plugin had search hooks, the indexing code could ask for data to be indexed, and so the plugins would be responsible for the deserialization. And again, it only happens occasionally. I don't think it defeats the purpose of serialization. That's usually done for fast access, and that won't matter for the indexing process. Also, when you say blob, you don't mean binary data, do you? The only binary data that I know of is in the session table, which would not be indexed.

    That being said, I don't think many of the serialized objects are things that need to be returned in a search. They are often settings and such.
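
    To illustrate the hook idea, here is a minimal sketch of a periodic index rebuild where each plugin hands over its own text. It's in Python rather than PHP for brevity, the names (register_index_provider, rebuild_index, search, wiki_provider) are invented for this example, and JSON stands in for PHP serialization. It is not how the Curverider plugin works, just the general shape:

```python
import json
import re
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set, Tuple

# A provider yields (entity id, searchable text) pairs for its plugin's data.
IndexProvider = Callable[[], Iterable[Tuple[str, str]]]
_providers: List[IndexProvider] = []


def register_index_provider(provider: IndexProvider) -> IndexProvider:
    """Hypothetical hook: plugins register how to expose their text."""
    _providers.append(provider)
    return provider


def _tokens(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def rebuild_index() -> Dict[str, Set[str]]:
    """Expensive step, meant to run from a scheduled job, not per search."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for provider in _providers:
        for entity_id, text in provider():
            for token in _tokens(text):
                index[token].add(entity_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Cheap step: return entities containing every word of the query."""
    tokens = _tokens(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


@register_index_provider
def wiki_provider():
    # Stand-in for a plugin that stores serialized content; the plugin,
    # not the search core, knows how to deserialize it.
    stored = {"wiki:1": json.dumps({"title": "Cron setup",
                                    "body": "Run the indexer nightly"})}
    for entity_id, blob in stored.items():
        doc = json.loads(blob)
        yield entity_id, f"{doc['title']} {doc['body']}"
```

    The deserialization cost is paid only inside rebuild_index, so a search never has to open a serialized object.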

  • @Justin, I haven't looked at it yet, but I'm assuming it just creates an index off of the object table and the metastring tables using the references from all entities and extenders.

    What do you think of the idea of creating a separate set of indexes?