Allowing crawler access to Elgg entities with private/friends/group/etc. access

I currently have Elgg v1.12.5 installed, with a robots.txt in Elgg's root directory containing the following:

User-agent: *
Disallow: 


Previously, I used the built-in robots.php with the same rules.

The crawler I have cannot crawl the blog that I've set to private. I need it to be able to crawl the whole site (excluding /admin, the registration page, etc.).

Advice is appreciated!

  • Does this site have other users who are creating content with the expectation of access control? Exposing their private content to search engines basically makes it public.

This site does have users who create content with access rules to their liking. We will filter search results depending on the access level of the logged-in user (or the public user).

    Our application is not hosted on the internet (rather, within our own network) and we manage our own search engine, so we're not talking about the public Google Search crawler or any other crawler on the internet. This setup is almost the same as the default search engine (or Sphinx Search, which we're currently using): it indexes the entire site and returns results depending on the access rules of the user.
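
    A minimal sketch of that query-time filtering, using the Elgg 1.x API ($result_guids is a hypothetical array of entity GUIDs returned by the external index; this assumes it runs inside an Elgg page handler, so the framework is already bootstrapped):

        <?php
        // Keep only the results the current session is allowed to see.
        $visible = array();
        foreach ($result_guids as $guid) {
            // get_entity() already applies Elgg's access control for the
            // current session, returning false for entities the logged-in
            // (or public) user cannot see.
            $entity = get_entity($guid);
            if ($entity) {
                $visible[] = $entity;
            }
        }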


  • @pandurx What does Google say about your robots.txt (via GWT)?

  • Does the crawler have a cookie store? E.g., could it be authenticated as a dedicated admin user?

    Or if you could reliably distinguish crawler requests, you could set an admin user as the session user (without a login()).
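
    A minimal sketch of that approach, as a plugin's start.php in Elgg 1.x (the 'gsa-crawler' user-agent check and the 'searchbot' username are assumptions for illustration; ideally also verify the source IP, since a user-agent header can be spoofed):

        <?php
        elgg_register_event_handler('init', 'system', function () {
            $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

            // Only treat requests from the internal appliance as the crawler.
            if (stripos($ua, 'gsa-crawler') !== false && !elgg_is_logged_in()) {
                $bot = get_user_by_username('searchbot'); // dedicated admin account
                if ($bot) {
                    // Set the session user without a full login(): no login
                    // event is fired and last-login data is not updated.
                    elgg_get_session()->setLoggedInUser($bot);
                }
            }
        });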

  • @rivervanrain
    I haven't used GWT, but I've run my robots.txt through various online robots.txt validators and it seems to be correct. I'm not sure how the robots.txt file works for Elgg, but my crawler just isn't able to get into the private/closed content.

    @steve_clay
    We're using the Google Search Appliance, and I've tried to authenticate through a form (the cookie-store method), but it doesn't pick up the login form...