Elgg Website Crashed - Database overload due to incorrect queries

OK... This client operates a sizeable website with approximately 40,000 members. About 36 hours ago, the website crashed (could not connect to database), despite the fact that the site is running on one of the most powerful dedicated servers you can get out there. We have had this trouble with the site on a number of occasions; however, I am sick of performing repair and pruning on the database each time this happens. I am looking for a permanent fix. I contacted the host, who carried out a number of tests and analyses on the database and came back with the response below. Here's the question: given the Elgg 1.7 structure and the way in which the DB is queried, is this a problem that can be fixed? Or is it that Elgg cannot scale to high usage? I would appreciate it if anyone could shed some light on this. The response I received from the host:"

Here's the problem. There is one table which is showing up in all of the locked queries when I look: memmetadata. And the problem is that that table has over 8 million entries!


mysql> SELECT count('id') FROM memmetadata;
+-------------+
| count('id') |
+-------------+
|     8179286 |
+-------------+
1 row in set (4.32 sec)


If you look into the database information we've provided you, you can see that this table keeps causing queries to become Locked, spend forever Sending data, or end up stuck on Copying to tmp table. So what happens is that somebody runs a SELECT DISTINCT query, which causes the ENTIRE memmetadata table to be read into memory and sorted; because your server can't use more than 1.8GB of memory or so at once (as it's 32-bit), this ends up having to be written to disk, a very slow operation. While this is happening, all other queries on the memmetadata table are LOCKED, which means they can't update. These queries all stack up and hang.

Above, the query I ran simply COUNTED the number of entries in that table, and it took 4.32 seconds to run. 4.32 seconds, with Apache offline, so the only query running was that one! And this didn't even require a temp table. Any more complex query will use a temp table and run even longer, which will cause the page to time out as you just saw.

There are really only 3 things which can be done for this, and only one we can assist with:

1) You will need to clean out your tables. Your tables are so large that they can't fit into memory and lock up other queries whenever requested. If you can remove any unnecessary data from the table, you may be able to squeeze a little more performance out of your server. This is something we cannot assist with; you will need support from Elgg.

For example, you may have a whole bunch of junk users who are no longer active on the server, but still must be counted and sorted in these tables. Removing those users may clear up database space and allow your site to function better.

One thing we can do for you here is to add an .htaccess file to your community subdomain which rejects all traffic except from your IP address. This will keep the queries from stacking up and will allow you to access the admin panel. In order to do this we will need your IP address, which you can find in red on this page: http://www.hostgator.com/ip.shtml

2) The second option is to contact the Elgg team and ask them to provide more optimized queries. "SELECT DISTINCT" is a very, very terrible way to query a database; as I said it requires reading the ENTIRE database into memory in order to process. If they can optimize the queries for you it should assist with this load issue.

For $25, we can install Memcached, which stores completed data in memory for you; this may help relieve the load, especially with all those SELECT DISTINCT issues. There's no guarantee that this will resolve your problem. Even once you move to your new 64bit server, Memcached will be a great help.

3) At the end of the day, the best thing would be to buy a new Dedicated Server with the same configuration as you have now, but with a 64bit OS instead. Contact sales@hostgator.com for support with this. Then request that our Migrations team move your sites to the new server. Make sure you get Memcached on the new server as well! We would then configure MySQL to better utilize your expanded memory options under 64bit. This is the best option... but even with the extra power, those database tables are huge, and the queries are badly written, so you will still want to go through and clean out as much as you can.

If you have any further questions or concerns, please let us know. "

 

Thanks.


  • Weellll... ? ;-) after being MIA for 2 years !! lolz & lolz tee hee !!

    Dedicated + 40K users + 32-bit + entire metadata tbl in memory + htaccess to reject all traffic except from your IP + stacking up + more optimized queries + SELECT DISTINCT.. terrible way to query a database + reading ENTIRE database into memory + For $25, we can install Memcached, which stores completed data in memory for you = no guarantee that this will resolve your problem + best thing would be to buy a new Dedicated.. 64bit OS instead ??

    Whatta Load of Horse Manure !! ;-) FBFK does 200+++ K users and never a 0.001% problem with MySQL performance. $25 for Memcached? LOLZ. It takes a one-line command + about 5 secs.

    Looks like your client is hosting with the wrong crowd ;-P

  • Need to see the actual queries. I really don't think Elgg has any queries like "SELECT count('id') FROM memmetadata;"

    Also, that support person doesn't understand DISTINCT. It operates on result sets, not the entire table. The entire table would only be loaded if the WHERE clauses are causing it to miss an index (see the EXPLAIN sketch below).

    Need Elgg version, too.
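
    If you can capture one of the slow queries, running it through EXPLAIN will show whether it is hitting an index or building a temp table. A rough sketch only -- the DISTINCT query and its WHERE clause here are made up, with the "mem" table prefix copied from the support response above:

    EXPLAIN SELECT DISTINCT e.*
    FROM mementities e
    JOIN memobjects_entity oe ON e.guid = oe.guid
    WHERE oe.guid > 0
    LIMIT 10;

    In the output, type = ALL with an empty key column means a full table scan, and 'Using temporary; Using filesort' in the Extra column means MySQL is building the temp table the host described (written to disk once it exceeds tmp_table_size / max_heap_table_size).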

  • Could you try the DBvalidator plugin from Cash? Also please post the Elgg version.

  • @Cash: Elgg 1.7.4. I don't think Elgg has that query either; I think that is just the query the support person ran to get a count of the entries in the metadata table.

    @Webgalli: Yes, but that didn't repair the issue.

  • I should also mention that the site was getting extremely slow just before it crashed. I did check the MySQL process list and it showed a whole load (maybe 50) of queries in the 'Locked' state.

  • Host       Command  Time  State   Info
    ---------  -------  ----  ------  ------------------------------------------------------------------------------------------------
    localhost  Query    473   Locked  DELETE from memobjects_entity where guid NOT IN (SELECT guid from mementities)
    localhost  Query    292   Locked  DELETE from memobjects_entity where guid NOT IN (SELECT guid from mementities)
    localhost  Query    292   Locked  DELETE from memobjects_entity where guid NOT IN (SELECT guid from mementities)
    localhost  Query    372   Locked  SELECT DISTINCT e.* FROM mementities e JOIN memobjects_entity oe ON e.guid = oe.guid WHERE (MATCH
    localhost  Query    313   Locked  SELECT DISTINCT e.* FROM mementities e JOIN memobjects_entity oe ON e.guid = oe.guid WHERE (MATCH
    localhost  Query    268   Locked  SELECT DISTINCT e.* FROM mementities e JOIN memobjects_entity oe ON e.guid = oe.guid WHERE (MATCH
    localhost  Query    230   Locked  SELECT DISTINCT e.* FROM mementities e JOIN memobjects_entity oe ON e.guid = oe.guid WHERE (MATCH
    localhost  Query    102   Locked  DELETE from memobjects_entity where guid NOT IN (SELECT guid from mementities)
    localhost  Query    198   Locked  SELECT DISTINCT e.* FROM mementities e JOIN memobjects_entity oe ON e.guid = oe.guid WHERE (MATCH
    localhost  Query    75    Locked  SELECT * from memobjects_entity where guid=1183372
    localhost  Query    102   Locked  DELETE from memobjects_entity where guid NOT IN (SELECT guid from mementities)
    localhost  Query    75    Locked  SELECT * from memobjects_entity where guid=334395
    localhost  Query    74    Locked  SELECT * from memobjects_entity where guid=1017666
    localhost  Query    43    Locked  SELECT * from memobjects_entity where guid=1183372
    localhost  Query    43    Locked  DELETE from memobjects_entity where guid NOT IN (SELECT guid from mementities)
    localhost  Query    28    Locked  DELETE from memobjects_entity where guid NOT IN (SELECT guid from mementities)
    localhost  Query    12    Locked  SELECT * from memobjects_entity where guid=2182874
    localhost  Query    0     NULL    SHOW PROCESSLIST

  • Queries will lock only if one query is holding the table.. your report above does not list that ;( Looks like your problem is really MySQL *performance* -- you will need to look at your MySQL config parameters to see where the performance hits exist and tweak those (a few checks are sketched below). A simple restart of the MySQL server will temporarily get the DB & speed back online.. for a while, maybe...
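
    A few stock MySQL commands are usually enough to see where the hit is; this is only a sketch of the counters and settings I would look at first (all are standard status/system variables):

    SHOW GLOBAL STATUS LIKE 'Table_locks_waited';
    SHOW GLOBAL STATUS LIKE 'Created_tmp_disk_tables';
    SHOW GLOBAL STATUS LIKE 'Slow_queries';
    SHOW VARIABLES LIKE 'key_buffer_size';
    SHOW VARIABLES LIKE 'tmp_table_size';
    SHOW VARIABLES LIKE 'max_heap_table_size';

    A steadily climbing Table_locks_waited points at the MyISAM table locking, and a high Created_tmp_disk_tables points at the on-disk temp tables the host described.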

  • Rows in the metadata table are kept current as far as I understand: any time an entity is deleted, its associated metadata is deleted from that table, so I don't think you can safely remove anything from it. The metadata table is really just a reference table into the metastrings table -- could the hang-up be caused by the join to the metastrings table? I had previously determined that the metastrings table does not stay current: when metadata is deleted, the strings remain in that table. With 8M metadata entries your metastrings table will probably be a whole lot bigger.

     

    I have had success cleaning out orphaned metastrings through a single query:

    DELETE FROM elgg_metastrings WHERE id NOT IN(SELECT DISTINCT(name_id) FROM elgg_metadata UNION SELECT DISTINCT(value_id) FROM elgg_metadata) AND id NOT IN(SELECT DISTINCT(name_id) FROM elgg_annotations UNION SELECT DISTINCT(value_id) FROM elgg_annotations)

    Use with caution, back up beforehand, etc. If you're managing a site that big you should know the drill. With that many entries you likely won't be able to run that via phpMyAdmin or any web interface; you'll have to do it through the mysql command line.
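
    For the backup, one option that stays inside MySQL is to copy the table first; a minimal sketch, assuming the default elgg_ prefix used in the query above (a mysqldump of the whole database works just as well):

    CREATE TABLE elgg_metastrings_backup LIKE elgg_metastrings;
    INSERT INTO elgg_metastrings_backup SELECT * FROM elgg_metastrings;

    If the cleanup removes something it shouldn't, the original rows can be restored from the copy.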

  • @Dhrup: True, queries will lock only if there is a hanging query.. that hanging query is always either a 'JOIN' or a 'Copying to tmp table'. But restarting MySQL does not, for some reason, terminate queries already running on the server (see the KILL sketch below for clearing them individually). However, if there were no glitch in the JOIN query, I do not think we would have the copy running to start with.

    @Matt: Spot on, I agree. And all along I had the same view, but I have to say I was not even sure I could take out stray metastrings without affecting other tables, let alone handle the command line itself. By the way, phpMyAdmin will allow you to run that command even on a large database; I think it is the backup feature that is capped.. but the latest version will allow you to break your backups into partial exports.

    Thanks to both.
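
    For the queries that stay stuck even after a restart attempt, they can be cleared one at a time from the mysql command line; a small sketch (the id is the number in the first column of SHOW FULL PROCESSLIST, and 473 is just an example):

    SHOW FULL PROCESSLIST;
    KILL QUERY 473;   -- aborts that statement but keeps the connection
    KILL 473;         -- or drop the whole connection

    This only clears the symptoms, of course; the locked-up DELETE / SELECT DISTINCT pattern will come back until the underlying table problem is sorted.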

     

     

  • If Matt's query fixes the issue - then goody! ;) A MySQL restart should be *stop*, then *start* - not a plain restart - I think those commands behave differently from *restart*.

    If you turn on the slow query log, it will log the slow queries ;-) Then you can see where the problem might be (a config sketch is below). It is MyISAM's table locks that are the culprit - InnoDB does row-level locks only, so it is better for locking - though we cannot blindly convert MyISAM to InnoDB for Elgg; the MySQL full-text search requirements on the metadata table need more careful study first.

    Try installing some MySQL monitoring utilities to check what's going on inside - if any particular queries are causing hiccups you can see it through those. FBFK does have a much larger DB (~8GB / 200K users / 17 million rows in the metadata table) [what MB/GB size is your DB?] - but we have never seen the types of 'lock' problems you're experiencing - there lies my weakness with MySQL ;-)
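
    For the slow query log, on MySQL 5.1+ it can be switched on at runtime without touching my.cnf; a sketch (the log file path is just an example, and on MySQL 5.0 you would set log-slow-queries in my.cnf and restart instead):

    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL slow_query_log_file = '/var/log/mysql-slow.log';
    SET GLOBAL long_query_time = 2;   -- log anything slower than 2 seconds

    Let it run through a busy period and the log will show exactly which queries are holding the memmetadata table.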