Elgg Website Crashed - Database overload due to incorrect queries

Ok.. This client operates a sizeable website with approximately 40,000 members. About 36 hours ago the website crashed (could not connect to the database), despite the fact that the site is running on one of the most powerful dedicated servers you can get. We have had this trouble with the site on a number of occasions, and I am sick of performing a repair and prune on the database each time it happens; I am looking for a permanent fix. I contacted the host, who carried out a number of tests and analyses on the database and came back with the response below. Here's the question: given the Elgg 1.7 structure and the way in which the DB is queried, is this a problem that can be fixed? Or is it that Elgg cannot scale to high usage? I would appreciate it if anyone could shed some light on this. The response I received from the host:

"Here's the problem. There is one table which is showing up in all of the locked queries when I look: memmetadata. And the problem is that that table has over 8 million entries!


mysql> SELECT count('id') FROM memmetadata;
+-------------+
| count('id') |
+-------------+
|     8179286 |
+-------------+
1 row in set (4.32 sec)


If you look at the database information we've provided, you can see that this table keeps causing queries to become Locked, spend forever Sending data, or end up stuck on Copying to tmp table. So what happens is that somebody runs a SELECT DISTINCT query, which causes the ENTIRE memmetadata table to be read into memory and sorted; because your server can't use more than about 1.8GB of memory at once (as it's 32-bit), this ends up having to be written to disk, a very slow operation. While this is happening, all other queries on the memmetadata table are LOCKED, which means they can't update. These queries all stack up and hang.

Above, the query I ran simply COUNTED the number of entries in that table, and it took 4.32 seconds to run. 4.32 seconds, with Apache offline, so the only query running was that one! And that count didn't even require a temp table. Any more complex query will use a temp table and run even longer, which will cause the page to time out, as you just saw.
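You can watch this happening yourself from the MySQL console. As an illustration (the entity_guid column name assumes a standard Elgg 1.7 metadata schema, so treat this as a sketch and adjust it if your tables differ):

-- List every running query and its state; the stuck ones show up as
-- Locked, Sending data, or Copying to tmp table:
SHOW FULL PROCESSLIST;

-- Ask MySQL how it would execute a DISTINCT query against the metadata table;
-- "Using temporary" / "Using filesort" in the Extra column marks the slow
-- path described above:
EXPLAIN SELECT DISTINCT entity_guid FROM memmetadata;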

There are really only 3 things which can be done for this, and only one we can assist with:

1) You will need to clean out your tables. Your tables are so large that they can't fit into memory, and they lock up other queries whenever they are read. If you can remove any unnecessary data from the tables, you may be able to squeeze a little more performance out of your server. This is something we cannot assist with; you will need support from Elgg.

For example, you may have a whole bunch of junk users who are no longer active on the server, but still must be counted and sorted in these tables. Removing those users may clear up database space and allow your site to function better.

One thing we can do for you here is to add an .htaccess file to your community subdomain which rejects all traffic except from your IP address. This will keep the queries from stacking up and will allow you to access the admin panel. In order to do this we will need your IP address, which you can find in red on this page: http://www.hostgator.com/ip.shtml

2) The second option is to contact the Elgg team and ask them to provide more optimized queries. "SELECT DISTINCT" is a terrible way to query a database; as I said, it requires reading the ENTIRE table into memory in order to process. If they can optimize the queries for you, it should help with this load issue.

For $25, we can install Memcached, which caches completed query results in memory for you; this may help relieve the load, especially with all those SELECT DISTINCT queries. There's no guarantee that this will resolve your problem, but even once you move to your new 64-bit server, Memcached will be a great help.

3) At the end of the day, the best thing would be to buy a new Dedicated Server with the same configuration as you have now, but with a 64bit OS instead. Contact sales@hostgator.com for support with this. Then request that our Migrations team move your sites to the new server. Make sure you get Memcached on the new server as well! We would then configure MySQL to better utilize your expanded memory options under 64bit. This is the best option... but even with the extra power, those database tables are huge, and the queries are badly written, so you will still want to go through and clean out as much as you can.
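To give you a concrete idea of what "cleaning out" can mean, here is a rough sketch of how orphaned metadata (rows whose entity no longer exists) could be found and removed. The mementities table name assumes the same "mem" prefix and a standard Elgg schema, and you should take a full backup before running any DELETE:

-- Count metadata rows that point at entities which no longer exist:
SELECT COUNT(*)
FROM memmetadata md
LEFT JOIN mementities e ON e.guid = md.entity_guid
WHERE e.guid IS NULL;

-- If that count looks plausible, the same join can drive the cleanup
-- (again: back up the database before deleting anything):
DELETE md
FROM memmetadata md
LEFT JOIN mementities e ON e.guid = md.entity_guid
WHERE e.guid IS NULL;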

If you have any further questions or concerns, please let us know."

 

Thanks.

  • One other thing: when the developers of the script itself cannot even see that there is an issue, that tells me the issue is more serious.. There is a common pattern here on Elgg: everyone tells you that there is a problem and that the problem is not in Elgg.. but that's where it stops.

    If the metadata tables in Elgg were not a big problem, there would not be plugins out there designed specifically to deal with metadata management.

  • Is there any reference to the best options for storage engines for the Elgg tables?

    Thanks

  • 1) Our data is 100GB zipped ;-) The Kids are killing us ;o-X

    2) DB incidents? I have seen quite a few similar performance problems on client sites - but *always* after DB hiccups.

    3) An error_log in engine/handlers at 2GB? Why is that log there? You can rename it and restart the server to get a fresh log - that alone will improve your server's performance.

    4) There *may* be some issues with *some* instances of metadata access - but I've not seen any in 3+ years.

    5) Storage engines? MySQL, I think, offers only MyISAM and InnoDB. You could rewrite the core DB access routines to go for, say, DB2 (free, and from *the* IBM crowd).

  • Thanks Dhrup..

    I think the problem in this case may be in the Storage Engines..

    Here's just my assumption: after looking at the storage engines assigned to a number of tables, I found that some were InnoDB and others were MyISAM. Even in the case of the MyISAM tables the row format was incorrect, i.e. FIXED vs. DYNAMIC.. I know we do not have much choice with the row format, since it depends on the column types, i.e. CHAR, VARCHAR, etc.

    I went back and changed the storage engines to what I think is correct, which in this case seems to be MyISAM (roughly as sketched at the end of this post). The SQL processes seem to be free of locking and clutter now; queries run reasonably fast and copying to tmp tables is reduced.

    My presumption is that the previous developer changed some of the storage engines to deal with the full-text search issue. But I believe that should have been dealt with separately, rather than by messing with the tables.

    I am running with this and monitoring. I cannot tell for sure that this is the fix, but so far it seems to have restored the site to a robust condition.
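    For anyone who wants to try the same thing, this is roughly what the check and the change look like. It is only a sketch: the table names assume the "mem" prefix from the host's report, and ALTER TABLE rewrites and locks the whole table while it runs, so test it on a copy of the database first.

    -- Show which storage engine and row format each Elgg table currently uses:
    SHOW TABLE STATUS LIKE 'mem%';

    -- Convert a single table to MyISAM (rewrites and locks the table while running):
    ALTER TABLE memmetadata ENGINE = MyISAM;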

  • One other thing: when the developers of the script itself cannot even see that there is an issue, that tells me the issue is more serious..

    FYI: The only core developers who have commented on this thread are Cash and me.

    Cash wrote the DB validator plugin because many people use Elgg on cheap hosting that can't deal with the number of MySQL connections required, and so their databases end up corrupted. That is as much a problem with the hosting choice as it is with Elgg. Elgg can be more robust about this, but the fact is that if you're running an application on an underpowered machine you will have problems.

    It sounds like you've also inherited a project that a previous dev altered. It might be a good idea to start with a fresh Elgg install and compare it with the legacy one to see everything that has been changed.

  • Brett.. I think this client's server is a little better than the server that hosts the Elgg community itself. The server is neither cheap nor underpowered.

    And by the way, Cash's plugin wouldn't help here in this situation.

    And finally, the Elgg community website itself runs slowly with a high number of logged-in users.