Elgg optimizations - report

Below is our report on Elgg optimizations. This is the version of the report that was sent to the Elgg core devs a few months ago, updated with some additional notes and remarks.

ELGG 1.8 CORE OPTIMIZATIONS

authors: Pawel Sroka, Michał Zacher
version 1.2
last modified: 2012-06-12

* Database IS Elgg's bottleneck *
- a benchmark of a clean installation shows that mysql_query takes ~3% of page load time for a single connection, but ~24% for 64 concurrent connections. So under higher load, MySQL starts to block
- checking the MySQL statistics shows an enormous number of filesorts in the database. It's really a huge issue, most likely caused by 3 factors:
    - DISTINCT modifier in EVERY QUERY. This data has to be generated anyway, so the situations where DISTINCT is really necessary should be kept to a minimum. By default we should use simple SELECTs
    - JOINs can be optimized, but in very specific situations the combination with DISTINCT is deadly: an almost certain filesort
    - CAST in the ORDER BY clause seems to force a filesort anyway. It would be better to allow different metadata types than to force text values only
- we lose already fetched data when selecting entities. We should optimize the selection and instantiation of objects, probably with some cache of already selected data.
    - If we join with the objects_entity table anyway, it's better to keep this data
    - If we select object entities only, we already know that we'll need data from the additional table to instantiate them in the callback function. It's MUCH faster to do it in a single query
    - The whole "callback" process of entity creation could be made "additional data aware", as could the construction of objects
    - We may consider a more specific entity management policy: treat entities with the same guid as singletons. We shouldn't need duplicate instances; that seems more like a hack than something worth encouraging. So when we call get_entity or elgg_get_entities multiple times, we want to create the object with a particular guid only once.
    - We might consider saving metadata and all changes automatically in the destructor. We tried this concept successfully. Note that some related bugs have also been fixed in PHP: https://bugs.php.net/bug.php?id=30210
    - Following on from the above idea, it makes sense to save only objects where something actually HAS changed
- the datalists and config tables work very similarly. We might want to merge them into a single table where site_guid=0 means a global setting
- we may consider fetching plugin settings and plugin user settings (for the logged-in user) in one go. It's faster than independent calls.
- The target SQL query count for a page load should be well below 20 if we are talking about making Elgg truly optimizable. A count of 10 seems achievable for the core itself as a near-term goal
- caching only the list of active plugins as serialized data in a file raised maximum performance from 7 to 20 requests per second, so reducing the number of SQL calls makes a really big impact.
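
To make the "one instance per guid" and "save only what changed" ideas more concrete, here is a minimal sketch. It is not Elgg code: the SketchEntity and SketchEntityRegistry classes, the loader callback and the writer callback are all hypothetical names made up for illustration, and the loader stands in for whatever query would fetch the entity row.

```php
<?php
// Hypothetical sketch, not part of Elgg core.

class SketchEntity {
    public $guid;
    private $attributes;
    private $dirty = false;

    public function __construct($guid, array $row) {
        $this->guid = $guid;
        $this->attributes = $row;
    }

    public function get($name) {
        return isset($this->attributes[$name]) ? $this->attributes[$name] : null;
    }

    public function set($name, $value) {
        // Mark the entity dirty only when the value really changes.
        if (!array_key_exists($name, $this->attributes) || $this->attributes[$name] !== $value) {
            $this->attributes[$name] = $value;
            $this->dirty = true;
        }
    }

    public function isDirty() {
        return $this->dirty;
    }

    // $writer stands in for the real UPDATE; returns true if a write happened.
    public function saveIfDirty(callable $writer) {
        if (!$this->dirty) {
            return false;
        }
        call_user_func($writer, $this->guid, $this->attributes);
        $this->dirty = false;
        return true;
    }
}

class SketchEntityRegistry {
    private $entities = [];
    private $loader;
    public $loads = 0; // number of times the loader was actually invoked

    public function __construct(callable $loader) {
        $this->loader = $loader;
    }

    // Returns the same object for the same guid, loading it at most once.
    public function get($guid) {
        if (!isset($this->entities[$guid])) {
            $row = call_user_func($this->loader, $guid);
            $this->loads++;
            if ($row === null) {
                return null;
            }
            $this->entities[$guid] = new SketchEntity($guid, $row);
        }
        return $this->entities[$guid];
    }
}
```

With something like this, repeated lookups of the same guid within one request would cost a single load, and a save call on an untouched entity would cost no query at all.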


* Caching of translations *
- there's already a ticket for it on Trac, but it's an example of a situation where some API for extending the cache revalidation action is useful, so that no core changes are needed
- it speeds up Elgg a lot
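
A rough sketch of what such a translation cache could look like, assuming a writable cache directory and a builder callback that produces the merged translation array for a language. All function names here are hypothetical and none of them exist in Elgg; the invalidation function is the piece that a cache revalidation action would have to call.

```php
<?php
// Hypothetical sketch, not part of Elgg core.

function sketch_translation_cache_file($cacheDir, $language) {
    // Keep only safe characters of the language code in the file name.
    $safe = preg_replace('/[^a-z0-9_]/i', '', $language);
    return rtrim($cacheDir, '/') . '/translations_' . $safe . '.ser';
}

// Returns the translations for $language, calling $builder only on a cache miss.
function sketch_load_translations($language, $cacheDir, callable $builder) {
    $file = sketch_translation_cache_file($cacheDir, $language);
    if (is_file($file)) {
        $cached = unserialize(file_get_contents($file));
        if (is_array($cached)) {
            return $cached;
        }
    }
    $translations = call_user_func($builder, $language);
    file_put_contents($file, serialize($translations), LOCK_EX);
    return $translations;
}

// To be called whenever caches are revalidated (e.g. a plugin is enabled).
function sketch_invalidate_translations($cacheDir) {
    $files = glob(rtrim($cacheDir, '/') . '/translations_*.ser') ?: [];
    foreach ($files as $file) {
        unlink($file);
    }
}
```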

* CDN-like feature (Content Delivery Network) *
- we'd like to serve as much cached data as possible from static files instead of proxying via PHP. Not involving PHP makes things several times faster.
- it seems we need a directory separate from the data root, to which we could attach a different domain (for example static.example.com). Serving from inside the code path is easier, but maybe we don't want to mix code with dynamically generated files. Anyway, the bare minimum would be some kind of /cache/ dir in the installation path.
- More or less obvious features:
    - serve user and group avatars as static files
    - combine files registered by elgg_register_js or elgg_register_css into single files and serve them statically.
        - Here we need to support different sets of requested files and name the combined file with some form of hash based on the input list
        - We may then easily support automatic minification of the JS and CSS code and serve it without affecting the source files
        - We reduce the number of file requests, which commonly becomes quite large with multiple plugins
        - We don't consider calls to truly external domains here
        - Obviously we'd have to create separate JS files for header and footer
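
The hash-based naming of combined files could be sketched roughly as below. The function names and the choice of md5 are assumptions for illustration only; the file list is hashed in the order given, because load order matters for both JS and CSS.

```php
<?php
// Hypothetical sketch, not part of Elgg core.

// Name of the combined file for a given ordered list of source files.
function sketch_combined_asset_name(array $files, $type) {
    return md5(implode("\n", $files)) . '.' . $type;
}

// Builds the combined file in $cacheDir if it does not exist yet and
// returns its path, so the web server can serve it statically.
function sketch_build_combined_asset(array $files, $type, $cacheDir) {
    $target = rtrim($cacheDir, '/') . '/' . sketch_combined_asset_name($files, $type);
    if (!is_file($target)) {
        $parts = [];
        foreach ($files as $file) {
            $parts[] = file_get_contents($file);
        }
        // A minifier could be applied to the joined content at this point.
        file_put_contents($target, implode("\n", $parts), LOCK_EX);
    }
    return $target;
}
```

Note that a name built only from the file list does not change when a source file is edited, so in this sketch the combined files would have to be deleted on cache revalidation (or the modification times mixed into the hash).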

* Distinguish loading core for different tasks *
- currently there's no easy way to tell whether the core is loaded for a page handler, an action, or even an API call. An option such as a simple define('ELGG_PAGE_HANDLER', 1) would help avoid potentially unnecessary work
- more specific system events such as ('init', 'page_handler') or ('init', 'action_handler') could help make the whole system more targeted
- an option to specify a "core loading mode" also gives more control. Sometimes we don't care about loading plugins, since we only want to use a very separate part of the core, or we don't need translations. Currently there's no way to pass particular flags to the core-loading sequence, but some features are already in place, like the flags for elgg_load_plugins(). Why not expose them somehow so they can be set externally?
- there was an idea somewhere to merge events and plugin hooks into a single mechanism. What if we go further and specify events by a stack of parameters? For example elgg_register_event_handler(array('system', 'pagesetup', 'context', 'additional_param'), callback); This way we avoid unnecessary call_user_func calls, which are reported to be 3 times slower than a direct call.
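
One possible reading of the "stack of parameters" idea is sketched below: handlers are registered under a path, and triggering a path only invokes handlers registered for that path or one of its prefixes. The SketchEventRegistry class and its semantics are our assumption for illustration, not an existing or planned Elgg API, and the sketch assumes path segments do not contain '/'.

```php
<?php
// Hypothetical sketch, not part of Elgg core.

class SketchEventRegistry {
    private $handlers = [];

    public function register(array $path, callable $handler) {
        $this->handlers[implode('/', $path)][] = $handler;
    }

    // Calls handlers registered for every prefix of $path, most general
    // first, and returns how many handlers were called.
    public function trigger(array $path, $payload = null) {
        $called = 0;
        $key = '';
        foreach (array_values($path) as $i => $segment) {
            $key = ($i === 0) ? $segment : $key . '/' . $segment;
            if (empty($this->handlers[$key])) {
                continue;
            }
            foreach ($this->handlers[$key] as $handler) {
                $handler($path, $payload);
                $called++;
            }
        }
        return $called;
    }
}
```

A handler registered for array('system', 'pagesetup', 'groups') is then never called while the 'blog' context is being set up, instead of being called and returning early after checking the context itself.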

* Store path and dataroot in settings.php *
- it makes life MUCH easier for any caching mechanism
- it seems that all actual configs that are not site_id-dependent could be stored in static files (this does not concern parts serving as actual variables)
- we wouldn't require a DB connection to serve cached content
- most frameworks have no problem with storing such data in a configuration file
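
As an illustration only, the relevant part of settings.php could then look something like the fragment below. The two paths are example values, and placing these two properties in settings.php is exactly the change being proposed here, not something the current file already supports.

```php
<?php
// Hypothetical fragment of engine/settings.php.
global $CONFIG;
if (!isset($CONFIG)) {
    $CONFIG = new stdClass;
}

// Proposed additions (example values): installation path and data root,
// available before any database connection is opened.
$CONFIG->path = '/var/www/elgg/';
$CONFIG->dataroot = '/var/elgg-data/';
```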

* Data root storage convention *
- the current convention requires resolving the creation time of the owner entity; this seems like very unnecessary work, and it even makes debugging complicated
- the storage could follow a more straightforward convention, with an option to configure how the base path is built
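
For example, a convention that depends only on the guid could look like the sketch below. The function name, the bucket size of 5000 and the resulting layout are assumptions chosen for illustration; the point is only that the directory can be computed without loading the owner entity.

```php
<?php
// Hypothetical sketch, not part of Elgg core.

// Directory for an entity's files, derived from the guid alone.
// Guids are grouped into buckets so no directory gets too many children.
function sketch_entity_dir($dataroot, $guid, $bucketSize = 5000) {
    $guid = (int) $guid;
    if ($guid < 1) {
        throw new InvalidArgumentException('guid must be a positive integer');
    }
    $bucket = $guid - ($guid % $bucketSize);
    return rtrim($dataroot, '/') . '/' . $bucket . '/' . $guid . '/';
}
```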

* Support for non-guid naming for some kinds of entities *
- this is a rather long-term feature request: allow the Elgg core to work with separate id namespaces, for example where entity data would be stored outside of the typical entities table and keeping track of the guid value would become tricky. It's especially important for high-volume sites, where locking the table when incrementing the counter becomes a problem.
- this modification would in particular require supporting guid values other than integers, for example xyz:123 or even UUID() (maybe in hashed form)
- if some kind of "namespace prefix" is introduced for the integer id value, we'd need mechanisms for developers to register functions that resolve such namespaces to objects or to particular functions serving particular purposes
- it's worth considering whether we'd like to support the metadata feature for such entities. It would require adding the prefix to the metadata table
- we experimented quite successfully with entities based on separate tables, but the problem was integration with Elgg core features. The prefix feature would make it much easier.
- of course full transparency is unlikely to be achieved, but supporting features like relationships, metadata and annotations for highly customized entity types would make Elgg a much more tempting solution for systems with high-capacity optimization plans
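
A minimal sketch of how prefixed ids such as xyz:123 could be parsed and dispatched to registered resolvers. The SketchIdResolver class is hypothetical; ids without a prefix are assumed here to fall into a default namespace named by the empty string, which would be served by the regular entities table.

```php
<?php
// Hypothetical sketch, not part of Elgg core.

class SketchIdResolver {
    private $resolvers = [];

    // $resolver receives the local part of the id and returns an object or null.
    public function registerNamespace($namespace, callable $resolver) {
        $this->resolvers[$namespace] = $resolver;
    }

    // 'xyz:123' gives ['xyz', '123']; a plain '123' gives ['', '123'].
    public static function parse($id) {
        $id = (string) $id;
        $pos = strpos($id, ':');
        if ($pos === false) {
            return ['', $id];
        }
        return [substr($id, 0, $pos), substr($id, $pos + 1)];
    }

    // Returns whatever the namespace resolver returns, or null if the
    // namespace is unknown.
    public function resolve($id) {
        list($namespace, $local) = self::parse($id);
        if (!isset($this->resolvers[$namespace])) {
            return null;
        }
        return call_user_func($this->resolvers[$namespace], $local);
    }
}
```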

* We DON'T support DB transactions *
- we really need to consider wrapping part of the core DB access in transactions. Currently there may be a problem when writing to metadata very often: the value tends to turn into an array (as multiple entries in the metadata table are spawned), which is often unacceptable. We also had this problem while integrating Elgg with external clients via the API. Very rarely, metadata used as a 0/1 flag turned into array('0', '1')
- we should consider moving the storage engine to InnoDB to support transactions and row-level locks. For the current EAV DB design these are really necessary.
- this is a good point to start implementing the DB layer as an object (which is planned for Elgg 2.0 AFAIR) (check the early draft at: https://github.com/Srokap/Elgg/tree/database_abstraction_layer)
- maybe we should start using mysqli, as it's claimed to be more optimized and supports transactions and prepared statements (prepared statements seem to be a really good option for the current design). Obviously it's OO as well. Anyway, I was always curious whether there was ever a particular reason to use the old library instead of the new one.
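
As a sketch of what a transactional write could look like with mysqli, the helper below wraps a callback in begin_transaction / commit / rollback. The helper name is hypothetical, it assumes PHP 7 or newer and InnoDB tables, and $db is expected to be a mysqli connection (or any object offering the same three methods). The callback must throw on failure for the rollback to happen.

```php
<?php
// Hypothetical sketch, not part of Elgg core.

// Runs $work inside a transaction: commits if it returns, rolls back and
// rethrows if it throws. $db is expected to be a mysqli connection.
function sketch_in_transaction($db, callable $work) {
    $db->begin_transaction();
    try {
        $result = call_user_func($work, $db);
        $db->commit();
        return $result;
    } catch (Throwable $e) {
        $db->rollback();
        throw $e;
    }
}
```

A metadata update could then delete the old row and insert the new one inside a single callback, so that a concurrent request would not be able to leave two rows behind for what is meant to be a single 0/1 flag.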
