Elgg Website Crashed - Database overload due to incorrect queries

Ok.. This client operates a sizeable websites with approximately 40,000 members. About 36 hours ago, the website crashed (could not connect to database). This is despite the fact that the site is running on one of the most powerful dedicated servers you can get out there. We have had this trouble with the site on a number of occassions, however, I am sick of performing repair and pruning on the database each time this happens.. I am looking for a permanent fix. I contacted the host who carried out a number of tests and analysis on the databse and cam back with the response below.. Here's the question, given the Elgg 1.7 structure and the way in which DB is queried, is this a problem that can be fixed? Or, is it that Elgg cannot scale to high usage? I would appreciate it if anyone could shed some light on this. The response I received from the host:"

Here's the problem. There is one table which is showing up in all of the locked queries when I look: memmetadata. And the problem is that that table has over 8 million entries!


mysql> SELECT count('id') FROM memmetadata;
+-------------+
| count('id') |
+-------------+
| 8179286 |
+-------------+
1 row in set (4.32 sec)


If you look into the database information we've provided you, you can see that this database keeps causing queries to become Locked, spend forever Sending data, or end up stuck on Copying to tmp table. So what happens is you have somebody run a SELECT DISTINCT query, which causes the ENTIRE memmetadata table to be read into memory and be sorted; because your server can use more than 1.8GB of memory or so at once (as it's 32bit) this ends up having to be written to disk, a very slow operation. While this is happening, all other queries on the memmetadata table are LOCKED, which means they can't update. These queries all stack up and hang.

Above, the query I ran simply COUNTED the number of entries in that table, and it took 4.32 to run this. 4.32 seconds, with Apache offline, so the only query running was that one! And this didn't require a temp table. Any more complex queries will use a temp table and run even longer, which will cause the page to time out as you just saw.

There are really only 3 things which can be done for this, and only one we can assist with:

1) You will need to clean out your tables. Your tables are so large that they can't fit into memory and lock up other queries whenever requested. If you can remove any unnecessary data from the table, you may be able to squeeze a little more performance out of your server. This is something we cannot assist with; you will need support from Elgg.

For example, you may have a whole bunch of junk users who are no longer active on the server, but still must be counted and sorted in these tables. Removing those users may clear up database space and allow your site to function better.

One thing we can do for you here is to add an .htaccess file to your community subdomain which rejects all traffic except from your IP address. This will keep the queries from stacking up and will allow you to access the admin panel. In order to do this we will need your IP address, which you can find in red on this page: http://www.hostgator.com/ip.shtml

2) The second option is to contact the Elgg team and ask them to provide more optimized queries. "SELECT DISTINCT" is a very, very terrible way to query a database; as I said it requires reading the ENTIRE database into memory in order to process. If they can optimize the queries for you it should assist with this load issue.

For $25, we can install Memcached, which stores completed data in memory for you; this may help relieve the load, especially with all those SELECT DISTINCT issues. There's no guarantee that this will resolve your problem. Even once you move to your new 64bit server, Memcached will be a great help.

3) At the end of the day, the best thing would be to buy a new Dedicated Server with the same configuration as you have now, but with a 64bit OS instead. Contact sales@hostgator.com for support with this. Then request that our Migrations team move your sites to the new server. Make sure you get Memcached on the new server as well! We would then configure MySQL to better utilize your expanded memory options under 64bit. This is the best option... but even with the extra power, those database tables are huge, and the queries are badly written, so you will still want to go through and clean out as much as you can.

If you have any further questions or concerns, please let us know. "

 

Thanks.

  • Carlos, is it possible for you to post the queries that are causing it to lock? Slow query logging is definitely a good idea.

  • BTW -- For Slow Query Logging --> my.cnf :=

    [mysqld]
    skip-bdb
    log_slow_queries = /var/log/mysql_slow.log         Your choice
    long_query_time =         5                               OR whatever you want



  • Some comments:

    1. My database validator plugin looks for inconsistencies. It will not help with this.
    2. I think a lot of the queries showing are coming from the validator plugin.
    3. In your locked table query chart, looks like the object entities table is locked. Nothing jumps out as to the cause.
    4. I don't think we yet have the information we need to figure out your issue. A slow query log would be useful.
  • Hi again.. sorry about not keeping this alive for the last few days, but things were a bit hectic..

    Anyway, got onto the host and got them to enable slow query log.. Then, I limited the number of max-connections and query questions to 0.. this terminated all running queries and that brought back the site (which runs very slow and often gives a 500 error - presumably because we have some issues with the queries).. anyway, after bringing the site back up, I logged in as admin, and I could see 4 more users logged in.. so, I went to mysqlslow.log.. and this is all it had:

    ----------------------------------------------------

    /usr/sbin/mysqld, Version: 5.1.56-log (MySQL Community Server (GPL)). started with:

    Tcp port: 0  Unix socket: (null)

    Time                 Id Command    Argument

    /usr/sbin/mysqld, Version: 5.1.56-log (MySQL Community Server (GPL)). started with:

    Tcp port: 0  Unix socket: (null)

    Time                 Id Command    Argument

    ------------------------------------------------------------------
    funny thing is mysql processes show:

    Copying to tmp table

    SELECT DISTINCT e.* FROM mementities e JOIN memmetadata n_table on e.guid = n_table.entity_guid JO

    Copying to tmp table

    SELECT DISTINCT e.* FROM mementities e JOIN memmetadata n_table on e.guid = n_table.entity_guid JO

    Copying to tmp table

    SELECT DISTINCT e.* FROM mementities e JOIN memmetadata n_table on e.guid = n_table.entity_guid JO

    Copying to tmp table

    SELECT DISTINCT e.* FROM mementities e JOIN memmetadata n_table on e.guid = n_table.entity_guid JO

    Copying to tmp table

    SELECT DISTINCT e.* FROM mementities e JOIN memmetadata n_table on e.guid = n_table.entity_guid JO

    statistics

    SELECT * from mementities where guid=795150 and ( (access_id IN (2) OR (owner_guid = -1) OR (

    Copying to tmp table

    SELECT DISTINCT e.* FROM mementities e JOIN memmetadata n_table on e.guid = n_table.entity_guid JO

    Copying to tmp table

    SELECT DISTINCT e.* FROM mementities e JOIN memmetadata n_table on e.guid = n_table.entity_guid JO

    Copying to tmp table

    SELECT DISTINCT e.* FROM mementities e JOIN memmetadata n_table on e.guid = n_table.entity_guid JO

    freeing items

    REPLACE INTO memusers_sessions (session, ts, data) VALUES ('946ec2348cafed86f7b49c9d095fb05d',

    I am really lost here.. anyone can make sense out of this?
    Thanks
  • I just contacted the host again.. and argued slowlog looks like it is not even enabled as it is empty.. I asked for them to enable it and I will come back with a log.. hopefully.

  • Ok.. here's the log after the first 5 minutes of enabling slow log properly:

     

    Tcp port: 0  Unix socket: (null)

    Time                 Id Command    Argument

    # Time: 120216  7:17:37

    # User@Host: heartifb_snappy[XXX_snappy] @ localhost []

    # Query_time: 125.390622  Lock_time: 0.000105 Rows_sent: 0  Rows_examined: 0

    use XXX_cmmunity;

    SET timestamp=1329398257;

    SELECT DISTINCT e.* FROM mementities e  JOIN memmetadata n_table on e.guid = n_table.entity_guid  JOIN memmetastrings msn on n_table.name_id = msn.id  WHERE  (((msn.string IN ('icontime')) AND (  (n_table.access_id IN (2)

    OR (n_table.owner_guid = -1)

    OR (

    n_table.access_id = 0

    AND n_table.owner_guid = -1

    )

    ) and n_table.enabled='yes'))) AND  ((e.type = 'user')) AND  (e.site_guid IN (1)) AND (  (e.access_id IN (2)

    OR (e.owner_guid = -1)

    OR (

    e.access_id = 0

    AND e.owner_guid = -1

    )

    ) and e.enabled='yes') ORDER BY e.time_created desc LIMIT 0, 10;

    # User@Host: XXX_snappy[XXX_snappy] @ localhost []

    # Query_time: 76.473945  Lock_time: 0.000113 Rows_sent: 0  Rows_examined: 0

    SET timestamp=1329398257;

    SELECT DISTINCT e.* FROM mementities e  JOIN memmetadata n_table on e.guid = n_table.entity_guid  JOIN memmetastrings msn on n_table.name_id = msn.id  WHERE  (((msn.string IN ('icontime')) AND (  (n_table.access_id IN (2)

    OR (n_table.owner_guid = -1)

    OR (

    n_table.access_id = 0

    AND n_table.owner_guid = -1

    )

    ) and n_table.enabled='yes'))) AND  ((e.type = 'user')) AND  (e.site_guid IN (1)) AND (  (e.access_id IN (2)

    OR (e.owner_guid = -1)

    OR (

    e.access_id = 0

    AND e.owner_guid = -1

    )

    ) and e.enabled='yes') ORDER BY e.time_created desc LIMIT 0, 10;

    # User@Host: XXX_snappy[XXX_snappy] @ localhost []

    # Query_time: 125.414097  Lock_time: 0.000080 Rows_sent: 0  Rows_examined: 0

    SET timestamp=1329398257;

    SELECT DISTINCT e.* FROM mementities e  JOIN memmetadata n_table on e.guid = n_table.entity_guid  JOIN memmetastrings msn on n_table.name_id = msn.id  WHERE  (((msn.string IN ('icontime')) AND (  (n_table.access_id IN (2)

    OR (n_table.owner_guid = -1)

    OR (

    n_table.access_id = 0

    AND n_table.owner_guid = -1

    )

    ) and n_table.enabled='yes'))) AND  ((e.type = 'user')) AND  (e.site_guid IN (1)) AND (  (e.access_id IN (2)

    OR (e.owner_guid = -1)

    OR (

    e.access_id = 0

    AND e.owner_guid = -1

    )

    ) and e.enabled='yes') ORDER BY e.time_created desc LIMIT 0, 10;

    # User@Host: XXX_snappy[XXX_snappy] @ localhost []

    # Query_time: 91.456061  Lock_time: 0.000098 Rows_sent: 0  Rows_examined: 0

    SET timestamp=1329398257;

    SELECT DISTINCT e.* FROM mementities e  JOIN memmetadata n_table on e.guid = n_table.entity_guid  JOIN memmetastrings msn on n_table.name_id = msn.id  WHERE  (((msn.string IN ('icontime')) AND (  (n_table.access_id IN (2)

    OR (n_table.owner_guid = -1)

    OR (

    n_table.access_id = 0

    AND n_table.owner_guid = -1

    )

    ) and n_table.enabled='yes'))) AND  ((e.type = 'user')) AND  (e.site_guid IN (1)) AND (  (e.access_id IN (2)

    OR (e.owner_guid = -1)

    OR (

    e.access_id = 0

    AND e.owner_guid = -1

    )

    ) and e.enabled='yes') ORDER BY e.time_created desc LIMIT 0, 10;

    # User@Host: XXX_snappy[XXX_snappy] @ localhost []

    # Query_time: 125.339095  Lock_time: 0.000132 Rows_sent: 0  Rows_examined: 0

    SET timestamp=1329398257;

    SELECT DISTINCT e.* FROM mementities e  JOIN memmetadata n_table on e.guid = n_table.entity_guid  JOIN memmetastrings msn on n_table.name_id = msn.id  WHERE  (((msn.string IN ('icontime')) AND (  (n_table.access_id IN (2)

    OR (n_table.owner_guid = -1)

    OR (

    n_table.access_id = 0

    AND n_table.owner_guid = -1

    )

    ) and n_table.enabled='yes'))) AND  ((e.type = 'user')) AND  (e.site_guid IN (1)) AND (  (e.access_id IN (2)

    OR (e.owner_guid = -1)

    OR (

    e.access_id = 0

    AND e.owner_guid = -1

    )

    ) and e.enabled='yes') ORDER BY e.time_created desc LIMIT 0, 10;

    # User@Host: XXX_snappy[XXX_snappy] @ localhost []

    # Query_time: 85.149377  Lock_time: 0.000130 Rows_sent: 0  Rows_examined: 0

    SET timestamp=1329398257;

    SELECT DISTINCT e.* FROM mementities e  JOIN memmetadata n_table on e.guid = n_table.entity_guid  JOIN memmetastrings msn on n_table.name_id = msn.id  WHERE  (((msn.string IN ('icontime')) AND (  (n_table.access_id IN (2)

    OR (n_table.owner_guid = -1)

    OR (

    n_table.access_id = 0

    AND n_table.owner_guid = -1

    )

    ) and n_table.enabled='yes'))) AND  ((e.type = 'user')) AND  (e.site_guid IN (1)) AND (  (e.access_id IN (2)

    OR (e.owner_guid = -1)

    OR (

    e.access_id = 0

    AND e.owner_guid = -1

    )

    ) and e.enabled='yes') ORDER BY e.time_created desc LIMIT 0, 10;

    # User@Host: XXX_snappy[XXX_snappy] @ localhost []

    # Query_time: 64.039687  Lock_time: 0.000110 Rows_sent: 0  Rows_examined: 0

    SET timestamp=1329398257;

    SELECT DISTINCT e.* FROM mementities e  JOIN memmetadata n_table on e.guid = n_table.entity_guid  JOIN memmetastrings msn on n_table.name_id = msn.id  WHERE  (((msn.string IN ('icontime')) AND (  (n_table.access_id IN (2)

    OR (n_table.owner_guid = -1)

    OR (

    n_table.access_id = 0

    AND n_table.owner_guid = -1

    )

    ) and n_table.enabled='yes'))) AND  ((e.type = 'user')) AND  (e.site_guid IN (1)) AND (  (e.access_id IN (2)

    OR (e.owner_guid = -1)

    OR (

    e.access_id = 0

    AND e.owner_guid = -1

    )

    ) and e.enabled='yes') ORDER BY e.time_created desc LIMIT 0, 10;

  • @Matt I run your query and it deleted 28 rows from my database and after that everythig is normal no problem accur.

    DELETE FROM elgg_metastrings WHERE id NOT IN(SELECT DISTINCT(name_id) FROM elgg_metadata UNION SELECT DISTINCT(value_id) FROM elgg_metadata) AND id NOT IN(SELECT DISTINCT(name_id) FROM elgg_annotations UNION SELECT DISTINCT(value_id) FROM elgg_annotations)

    As I have lots of entries in my database from plugins which I am not using right now. Some entries are visible in site stats in admin area. for example I was using IZAP videos and those all entries are still in my database.

    Is there any sql query to delete those entries from database which are related to those plugins which are not active right now.

  • Any viewes on this issue?

  • Here's some more data.. the site seems to be fail when these queries are running:

     

              29 Copying to tmp table SELECT DISTINCT e.*, max(n_table.time_created) as maxtime FROM mementities e JOIN memannotations n_
              49 Copying to tmp table SELECT DISTINCT e.* FROM mementities e JOIN memmetadata n_table on e.guid = n_table.entity_guid JO
              37 Copying to tmp table SELECT DISTINCT e.* FROM mementities e JOIN memmetadata n_table on e.guid = n_table.entity_guid JO

     

    What does the 'copy to tmp table' do anyway?

    Thanks

  • The queries in the slow query log are all for user avatar icons. That's a fairly simple query compared to some of the metadata queries some plugins produce. This won't help you now, but the icon time might be something worth pulling out of metadata into the users table. I created a ticket for this: http://trac.elgg.org/ticket/4373

    Have you checked that the garbage collection plugin isn't running too often?