Elgg Website Crashed - Database overload due to incorrect queries

OK, so this client operates a sizeable website with approximately 40,000 members. About 36 hours ago, the website crashed (could not connect to the database), despite the fact that the site is running on one of the most powerful dedicated servers you can get. We have had this trouble with the site on a number of occasions, but I am tired of performing repairs and pruning on the database each time it happens; I am looking for a permanent fix. I contacted the host, who carried out a number of tests and analyses on the database and came back with the response below. Here's the question: given the Elgg 1.7 structure and the way the DB is queried, is this a problem that can be fixed? Or is it that Elgg cannot scale to high usage? I would appreciate it if anyone could shed some light on this. The response I received from the host:

"Here's the problem. There is one table which is showing up in all of the locked queries when I look: memmetadata. And the problem is that that table has over 8 million entries!


mysql> SELECT count('id') FROM memmetadata;
+-------------+
| count('id') |
+-------------+
|     8179286 |
+-------------+
1 row in set (4.32 sec)


If you look at the database information we've provided, you can see that this table keeps causing queries to become Locked, spend forever Sending data, or end up stuck on Copying to tmp table. So what happens is that somebody runs a SELECT DISTINCT query, which causes the ENTIRE memmetadata table to be read into memory and sorted; because your server can't use more than about 1.8GB of memory at once (as it's 32-bit), this ends up having to be written to disk, which is a very slow operation. While this is happening, all other queries on the memmetadata table are LOCKED, which means they can't update. These queries all stack up and hang.

Above, the query I ran simply COUNTED the number of entries in that table, and it took 4.32 seconds to run. 4.32 seconds, with Apache offline, so the only query running was that one! And it didn't even require a temp table. Any more complex query will use a temp table and run even longer, which will cause the page to time out, as you just saw.

There are really only 3 things which can be done for this, and only one we can assist with:

1) You will need to clean out your tables. Your tables are so large that they can't fit into memory, and they lock up other queries whenever they are read. If you can remove any unnecessary data from the table, you may be able to squeeze a little more performance out of your server. This is something we cannot assist with; you will need support from Elgg.

For example, you may have a whole bunch of junk users who are no longer active on the server, but still must be counted and sorted in these tables. Removing those users may clear up database space and allow your site to function better.

One thing we can do for you here is to add an .htaccess file to your community subdomain which rejects all traffic except from your IP address. This will keep the queries from stacking up and will allow you to access the admin panel. In order to do this we will need your IP address, which you can find in red on this page: http://www.hostgator.com/ip.shtml

2) The second option is to contact the Elgg team and ask them to provide more optimized queries. "SELECT DISTINCT" is a very, very bad way to query a database; as I said, it requires reading the ENTIRE table into memory in order to process. If they can optimize the queries for you, it should help with this load issue.

For $25, we can install Memcached, which caches completed query results in memory for you; this may help relieve the load, especially with all those SELECT DISTINCT queries. There's no guarantee that this will resolve your problem, but even once you move to a new 64-bit server, Memcached will be a great help.

3) At the end of the day, the best thing would be to buy a new Dedicated Server with the same configuration as you have now, but with a 64-bit OS instead. Contact sales@hostgator.com for help with this, then request that our Migrations team move your sites to the new server. Make sure you get Memcached on the new server as well! We would then configure MySQL to better utilize your expanded memory options under 64-bit. This is the best option... but even with the extra power, those database tables are huge and the queries are badly written, so you will still want to go through and clean out as much as you can.

If you have any further questions or concerns, please let us know. "
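
For anyone else hitting this, here is roughly how I have been checking what the host describes, straight from the mysql console. The table name is just my install's "mem" prefix, and the DISTINCT statement below is only a simplified stand-in for the joins Elgg actually runs, so whether the plan shows a temp table will depend on your own queries and indexes; treat it as a sketch, not the exact culprit.

mysql> SHOW FULL PROCESSLIST;
  -- look for threads stuck in the Locked / Sending data / Copying to tmp table states,
  -- then EXPLAIN whichever statement is stuck there, for example:
mysql> EXPLAIN SELECT DISTINCT entity_guid FROM memmetadata;
  -- "Using temporary" / "Using filesort" in the Extra column means MySQL is building
  -- the temp table and doing the sort the host is talking about
mysql> SHOW VARIABLES LIKE '%buffer%';
  -- shows how much memory MySQL is actually allowed to use for sorts and caches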

 

Thanks.

  • Thanks Brett. When I took on this site, I noticed that neither the garbage collector nor the cron jobs were even enabled, so I enabled them; that was 9 days ago.

     

    In terms of pulling icontime out of metadata and into the users table: does that need to be done with a patch, or is there a query we can run?

  • What is the interval for the garbage collection?

    For the icontime metadata, that is something the core team will have to do.
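
    If you want to see how much of that metadata table icontime (or anything else) actually accounts for, something along these lines should do it. I'm assuming the "mem" table prefix from your host's output; in 1.7 the metadata names are stored in the metastrings table, so you join on name_id:

    SELECT ms.string AS name, COUNT(*) AS cnt
    FROM memmetadata md
    JOIN memmetastrings ms ON ms.id = md.name_id
    GROUP BY ms.string
    ORDER BY cnt DESC
    LIMIT 20;

    That will at least tell you which metadata names account for most of those 8 million rows, which should also help when deciding what can be cleaned out.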

  • Garbage collection is weekly.

    I also found this, which describes a similar situation:

    http://dba.stackexchange.com/questions/1391/mysql-innodb-select-query-hanging-on-copying-to-tmp-table

    Hope I can get somewhere with all of this.

     

    Here's something else I wanted to ask, just in case: is there a way to run garbage collection on demand, i.e. can we force it to run now?

     

    Thanks.

  • @Carlos:

    You can run garbagecollector (and logrotate) manually by triggering the corresponding cron jobs via your browser. If you have set the garbagecollector interval to "weekly", for example, you can trigger the weekly cron job by calling

    http://yoursite.domain/cron/weekly/

    in your browser (for Elgg 1.8) or

    http://yoursite.domain/pg/cron/weekly/

    in case of Elgg 1.7.
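
    If you want a rough idea beforehand of how much orphaned data there is for the collector to clean up, you could run something like this from the mysql console. I'm guessing your table prefix is "mem" based on your host's output, and I'm not certain the collector touches every table, so treat it purely as a before-and-after check:

    SELECT COUNT(*) AS orphaned_metadata
    FROM memmetadata
    WHERE entity_guid NOT IN (SELECT guid FROM mementities);

    Running the same count again after the cron job has finished will tell you whether anything was actually removed.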

  • Carlos, I think I have a solution to your problem. Please check PM.

    @iionly: thank you. I am having a hard time accessing the website at the moment, as I keep getting the 'Elgg couldn't connect to the database' error.

    @mike: thanks, Mike. I'll have a look.

  • Thanks to iionly, I triggered the cron manually. What can I make of this:

     

    DATABASEEXCEPTION

     

    Deadlock found when trying to get lock; try restarting transaction

    QUERY: DELETE from memobjects_entity
    where guid NOT IN (SELECT guid from mementities)
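
    One thing I am tempted to try, and this is just a guess on my part so please tell me if it is a bad idea, is running that same cleanup in small batches so the delete does not hold locks on the whole table long enough to deadlock, repeating it until no more rows are affected:

    DELETE FROM memobjects_entity
    WHERE guid NOT IN (SELECT guid FROM mementities)
    LIMIT 5000;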

     

    Any help would be appreciated.

     

    Thanks.

  • Well, it has hit the limit as far as the client is concerned, and I cannot blame them. I really think it is a scalability issue, combined with the way queries are written in Elgg.

    I'm hoping this is not the same with Elgg 1.8, as I am already working on a project for a client.

    This client now wants the site moved to BuddyPress. Great, another can of worms.

  • 'Scalability issues'? With only 40K users? ;-) So how has our 'sizeable' 220K-user site not hiccuped even once? If MySQL is not configured nicely, BuddyPress will chunder too, sooner or later ;-P You must have had some DB accidents earlier.

  • @Dhrup: I am not aware of any earlier DB incidents, and I cannot even confirm whether they occurred, as I was not on the scene. As for the number of users, I do not think that is an accurate measure on its own; your 220,000 users could each have less than 100 bits of data, whereas this website is active and being populated with data on a minute-by-minute basis.

     

    It is no secret that Elgg 1.7 performs very poorly in certain areas; you can see that in reviews all over the web. I guess that is why 1.8 is substantially different from 1.7.

     

    By the way, I just found out that the error_log in engine/handlers is 2GB, and when I looked at it, I thought: where on earth do you start, and when will you finish? I stopped counting at 93 different variations of errors.