Tags with non-ascii (international) characters don't work after upgrade to version 1.6

After upgrading from version 1.5 to 1.6, clicking on tags with non-ASCII characters yields no results (this worked in 1.5).

Right after the upgrade, tags didn't work at all. After applying the .htaccess fix from

http://community.elgg.org/mod/groups/topicposts.php?topic=220244&group_guid=12

tags containing only letters from the English alphabet work as expected, but tags with letters such as áéóúíðþæö still don't.

What might be causing this and how can I fix it?

  • @bthj I just disabled the customsearch to see what happens - my tag search is still working well.  However now looking at my version of elgg it seems I'm using 1.6 RC1 2009072201  which was available a few days before 1.6

  • Hi,

    Just checking in to let you know that I upgraded my site to v1.6.1 and the problem still exists.

    I created a ticket for this in Trac:   https://trac.elgg.org/elgg/ticket/1231

  • This could be related to the problem discussed at http://community.elgg.org/mod/groups/topicposts.php?topic=236843&group_guid=16

    Please try the solution posted there and let me know if it helps...

  • I've found the one thing that changed in v1.6 and caused international tag searches to stop working:

    In engine/lib/metastrings.php at line 51 there is this addition:

     

    // Case sensitive

    $cs = "";

    if ($case_sensitive) $cs = " BINARY ";

     

    ...and the usage of that $cs variable in the select statement on the next line.

    Commenting out that IF statement is enough to have the tag searches work like they used to in v1.5

     

    The solution Brett points to is an interesting one and may be The Right Thing to do?

    But... setting 

    mysql_query("SET NAMES utf8");

    in engine/lib/database.php causes the non-ASCII characters in the site's content to display garbled, _except_ in content that was entered into textareas: titles, group names, tags and titles of discussion threads are all garbled, but the contents of discussion threads are OK.

    If I take an entry, fix the garbled characters and save it back, it displays fine and tag searches work also without having to disable that new case-sensitivity feature mentioned above.
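
    One plausible reading of these symptoms (an assumption on my part, not something confirmed in this thread): while the connection charset was latin1, MySQL treated the UTF-8 bytes sent by PHP as latin1 and converted them once more for storage, so non-ASCII text ended up double-encoded. A minimal PHP sketch of that effect follows; `latin1_bytes_to_utf8` is a hypothetical helper written for illustration, not an Elgg or MySQL function:

    ```php
    <?php
    // Hypothetical helper: treat every byte of $bytes as an ISO-8859-1
    // character and encode it as UTF-8. This imitates what a latin1
    // connection may do to UTF-8 bytes on their way into a utf8 column.
    // (MySQL's latin1 is really cp1252, which differs for bytes 0x80-0x9F;
    // that range is not handled here.)
    function latin1_bytes_to_utf8(string $bytes): string
    {
        $out = '';
        foreach (str_split($bytes) as $ch) {
            $b = ord($ch);
            $out .= $b < 0x80
                ? $ch
                : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
        }
        return $out;
    }

    $tag    = "\xC3\xA1";                 // "á" as the UTF-8 bytes PHP sends
    $stored = latin1_bytes_to_utf8($tag); // what may end up in the column
    $ascii  = "elgg";

    var_dump($stored === "\xC3\x83\xC2\xA1");          // true: shows up as "Ã¡"
    var_dump($stored === $tag);                        // false: byte-exact match fails
    var_dump(latin1_bytes_to_utf8($ascii) === $ascii); // true: ASCII is unaffected
    ```

    If that is what happened, it would fit both observations: a byte-exact (BINARY) comparison misses the non-ASCII tags while ASCII-only tags still match, and re-saving an entry over a utf8 connection stores clean bytes again.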

     

    The site I'm working on is newly started, so there isn't much content and it would be manageable to deal with the garbled tags and titles (by hand I guess?).  Or I could just disable that new case-sensitive feature and have things work like they used to...

     

    What would be a better option in the long run?

     

    SET NAMES='UTF8' seems to be the recommended configuration so I guess I should choose that option?

  • Blarg.  I really, really hate PHP vs MySQL encoding problems.  I honestly don't know what the "best" solution is here, though I'm leaning toward SET NAMES='UTF8'.  It will take a bit of research on our part...

    Any encoding specialists in the group?

    "I really, really hate PHP vs MySQL encoding problems."

    Brett: there are no such problems at all in a properly coded application (one written with "there are a lot of different languages and alphabets in the Real World, not only English" in mind), take my word for it

    • Always create UTF8 databases and UTF8 tables
    • Never, never trust the MySQL default settings; explicitly set the charset everywhere this is possible
    • All string manipulation functions must be from the mb_ family

    and you'll save your time and health (and keep the good impression of us Russian, Ukrainian, Chinese, Indian, Japanese and many other users)
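
    To illustrate the mb_ point, here is a small sketch of how byte-oriented functions and their mb_ counterparts differ on a UTF-8 string. It assumes the mbstring extension is loaded; the sample tag is made up:

    ```php
    <?php
    // Assumes the mbstring extension is loaded and strings are UTF-8.
    $tag = "\xC3\x81\xC3\xB0"; // "Áð" as UTF-8 bytes

    echo strlen($tag), "\n";             // 4: counts bytes
    echo mb_strlen($tag, 'UTF-8'), "\n"; // 2: counts characters

    // substr() can cut a character in half; mb_substr() works on characters.
    $broken = substr($tag, 0, 1);             // "\xC3", not valid UTF-8
    $first  = mb_substr($tag, 0, 1, 'UTF-8'); // "Á"

    // Case folding, e.g. for a case-insensitive tag comparison.
    $lower = mb_strtolower($tag, 'UTF-8');    // "áð"

    var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
    var_dump($first === "\xC3\x81");               // bool(true)
    var_dump($lower === "\xC3\xA1\xC3\xB0");       // bool(true)
    ```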

  • @Alexander,

    I agree that Elgg needs to do this, but to be fair, MySQL should do this all by default.

    It seems that by "properly coded application" you mean one that anticipates and corrects misconfigured and/or buggy MySQL installations. Elgg developers can do this to some extent but I'm not sure that they can anticipate everything!

  • @Alexander -- I have to disagree.  There are numerous problems with encoding between the browser, PHP, and the database.  Most of this stems from the fact that browser support for UTF-8 is historically dodgy, the mb* functions are not enabled by default in PHP, and the MySQL defaults are not completely worthwhile for non-English alphabets.

    Regardless, Elgg does need to address this, and is doing so.  Saying that it's solely a problem of the application is inaccurate.

    I speak Japanese and have been slowly testing Elgg's Japanese support.  It's only me, though, so I need help testing and specific patches to files that are problematic in trac would go a long, long way to seeing this resolved.

  • @Kevin

    "MySQL should do this all by default"

    No... MySQL can do it, but "should" or "must" doesn't apply. A UTF8 MySQL server is just overhead, a less usable choice than other configs:

    * In the "Pax English" world, a customer will not see any difference in site functionality with any charset defined, or even without

    [mysql]
    ...
    default-character-set=utf8

    [mysqld]
    ...
    default-character-set=utf8

    * In some countries, using UTF8 in web apps is new and merely a fashion; most (bilingual) hosting clients are happy with the "default historical encoding", and hosters are well aware of this situation

    In short: we cannot require the correct (from our point of view) settings to be present, and we cannot even expect them to be present. We must create this environment ourselves

    @Brett: You can freely disagree with everybody and anybody, sure, but... Modern browsers + DB + PHP have next to no problems in this area. The mbstring PHP extension isn't a big problem either: you can just extend the current prerequisites list with an "mbstring must be supported in PHP" line or, as was done in Dokuwiki, have UTF-aware functions which work with or without mbstring (inc\utf8.php)

    /**
     * UTF8 helper functions
     *
     * @license    LGPL (http://www.gnu.org/copyleft/lesser.html)
     * @author     Andreas Gohr <andi@splitbrain.org>
     */
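
    A rough sketch of that "works with or without mbstring" idea. `utf8_char_count` is a hypothetical name picked for this example and is not DokuWiki's actual code; the fallback assumes PCRE was built with UTF-8 support:

    ```php
    <?php
    // Hypothetical helper (not DokuWiki's code): count the characters in a
    // UTF-8 string, using mbstring when it is loaded and PCRE otherwise.
    function utf8_char_count(string $s): int
    {
        if (function_exists('mb_strlen')) {
            return mb_strlen($s, 'UTF-8');
        }
        // With the /u modifier "." matches one whole UTF-8 code point;
        // /s lets it match newlines as well.
        $n = preg_match_all('/./us', $s, $m);
        return $n === false ? 0 : $n; // false means the input was not valid UTF-8
    }

    $tag = "\xC3\xBE\xC3\xA6\xC3\xB6"; // "þæö" as UTF-8 bytes
    echo strlen($tag), "\n";           // 6 (bytes)
    echo utf8_char_count($tag), "\n";  // 3 (characters)
    ```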

    "MySQL defaults are not completely worthwhile for non-English alphabets"

    Pls, RTFM :-), namely

    • set names
    • set character_set_client
    • set character_set_results
    • set collation_connection

    This way I can handle and fight (more or less successfully) not only a badly (from my POV) configured hoster's MySQL, but to some extent even badly created tables and their content. At least I know of a (polished) Elgg installation which works on Russian hosters' servers with Win-1251 as the MySQL default encoding, without problems for Russian UTF8 texts
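
    For illustration, a hedged sketch of preparing those statements so they can be run right after connecting. `charset_init_statements` is a hypothetical helper, and the charset and collation names are only examples. Note that SET NAMES by itself already sets character_set_client, character_set_connection and character_set_results, so the explicit statements after it just mirror the list above:

    ```php
    <?php
    // Hypothetical helper: build the per-connection statements listed above.
    function charset_init_statements(string $charset, string $collation): array
    {
        // Only accept plain identifiers, because the values are interpolated.
        foreach ([$charset, $collation] as $name) {
            if (!preg_match('/^[A-Za-z0-9_]+$/', $name)) {
                throw new InvalidArgumentException("Bad identifier: $name");
            }
        }
        return [
            "SET NAMES '$charset'",
            "SET character_set_client = $charset",
            "SET character_set_results = $charset",
            "SET collation_connection = $collation",
        ];
    }

    foreach (charset_init_statements('utf8', 'utf8_general_ci') as $sql) {
        echo $sql, ";\n"; // in a real application, send each one to the server
    }
    ```

    With the mysqli extension, each statement could be passed to `mysqli::query()`; `mysqli::set_charset()` covers the SET NAMES part and also keeps the client library's escaping in step with the connection charset.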

  • @Alexander 

    "MySQL defaults are not completely worthwhile for non-English alphabets

    Pls, RTFM :-), namely

    • set names
    • set character_set_client
    • set character_set_results
    • set collation_connection"

    You've misunderstood what I'm saying.  I know of these statements, but the default settings for MySQL are independent of them; these options override those defaults.

    Regardless, if you could post any patches to trac relating to encoding it would help this along...