URL Scraping / embedding (hypeScraper, oembed)

Sorry for coming back with yet another question. I do not want to give the impression that I post here at the first sign of trouble. I really do my own tests and research first, and only after trying everything I can do I come here with a question, as a last resort...

With Elgg 2.x I had been using hypeScraper. I tried version 6.2.1 for Elgg 3.x (from GitHub), but I could not get it working. In any case, hypeScraper does not seem to be available after Elgg 3.0, and since I intend to upgrade to Elgg 5.x, the hypeScraper route looks like a dead end. I deactivated hypeScraper.

So I started looking for a replacement for hypeScraper functionality. I checked all the Elgg major and minor release notes for any built-in support introduced or any alternatives.

Hence I am trying out the oembed plugin (v2.1). It works great with YouTube links (though I could not find a way to resize the embed, even after changing the setting "Height for embedded content" from 400 to 1000), but the main issue is that I could not get it working for most URLs (hypeScraper probably did not rely on the oembed standard, since it worked on the same websites). The good news is that when oembed works, it also renders the scraped information in the main body of a discussion (with hypeScraper I could manage this only in comments).

In my experimental Elgg 3.3.25 installation, URLs are scraped for some websites but not for most. From what I read on the web and what ChatGPT says, it should only work for websites which support the oembed standard. But for at least one website where oembed worked, when I checked with https://oembed.link/, that specific website is reported as NOT having oembed support, although, as I said, oembed shows the scraped information perfectly.

Clearly some of my assumptions are wrong. What is it that I am doing wrong? Thanks in advance for any help...

cheers.

  • hypeScraper uses a built-in parser that extracts 3rd-party resources and collects data from them to display in cards on your site.
    hypeScraper also checks sites for oEmbed support ("oEmbed domain whitelist" in the plugin settings).

    oEmbed plugin uses the Embed library.

    The reasons why some links are extracted and others are not can be different. For example, a 3rd-party resource is blocked by your IP range or your hosting provider (a very common situation). There may also be a blocking of content delivery by geography, etc.

    Proxies can help solve such problems.

    We have a demo project for which we created our own parsing tool. We also encounter problems with many resources; for example, X/Twitter, Reuters, New York Times, Washington Post, Wall Street Journal and others block parsing of their pages. Bloomberg does this on certain links ¯\_(ツ)_/¯

    You can test how your links work by creating a post on this project.
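
A side note on the question above of why a page can render as a card even when an oEmbed checker reports no oEmbed support: the Embed library can also read other page metadata such as Open Graph tags, not only oEmbed. A minimal Python sketch of that distinction follows. It is only an illustration, not how either plugin is implemented, and the `MetadataProbe` and `probe` names and the sample HTML are made up for this example.

```python
from html.parser import HTMLParser


class MetadataProbe(HTMLParser):
    """Collects oEmbed discovery links and Open Graph tags from an HTML page."""

    # MIME types used by oEmbed discovery <link> elements.
    OEMBED_TYPES = {"application/json+oembed", "text/xml+oembed"}

    def __init__(self):
        super().__init__()
        self.oembed_links = []  # href values of oEmbed discovery links
        self.open_graph = {}    # e.g. {"og:title": "..."}

    def handle_starttag(self, tag, attrs):
        # Attribute values can be None (bare attributes), so normalise them.
        a = {name: (value or "") for name, value in attrs}
        if tag == "link" and a.get("type", "").lower() in self.OEMBED_TYPES:
            if a.get("href"):
                self.oembed_links.append(a["href"])
        elif tag == "meta" and a.get("property", "").startswith("og:"):
            self.open_graph[a["property"]] = a.get("content", "")


def probe(html):
    """Return (oembed_links, open_graph) found in the given HTML string."""
    parser = MetadataProbe()
    parser.feed(html)
    return parser.oembed_links, parser.open_graph


# Hypothetical page: no oEmbed discovery link, but Open Graph tags are present.
SAMPLE = """<html><head>
<meta property="og:title" content="Example article">
<meta property="og:image" content="https://example.com/cover.jpg">
</head><body></body></html>"""

links, og = probe(SAMPLE)
print("oEmbed discovery links:", links)
print("Open Graph tags:", og)
```

If `links` comes back empty but `og` does not, a scraper that reads Open Graph data can still build a card for that page, which would be consistent with what https://oembed.link/ reported.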

  • Thank you for your reply! I'll be back!

  • Hi,

    Thanks again for your reply.

    - I already have an embed plugin activated. But it is the one delivered with the Core whereas your link points to oscarotero / Embed. Please note that all Elgg 3 tests below are with the oembed (Coldtrick) / embed (Core) plugins. Maybe I should also try the oembed (Coldtrick) / embed (oscarotero) combination.

    - Needless to say, all tests in my environments are specific to my environments and do not necessarily reflect on true oembed /embed plugin behaviors.

    - As for https://pw.wzm.me ones, sorry for polluting your environment with test data, I shall delete them after you read this post...

    - For the Elgg 3 site, all settings of the oembed plugin are blank except "Height for embedded content", which has the default value of 400.

    My tests:

    https://id.wikipedia.org/wiki/Algoritma_Dijkstra

    Elgg 2.3.12 site with hypeScraper: Scraped with small size image with title, source next to it.

    Elgg 3.3.25 with oembed: Did not scrape.

    https://pw.wzm.me : Scraped with small size image with title, source next to it.

    https://www.aljazeera.com/news/2024/8/30/israel-presses-on-with-assault-on-occupied-west-bank-for-third-day

    Elgg 2.3.12 site with hypeScraper: Scraped only in discussion comment. Small image size with title, source & the relevant text next to it.

    Elgg 3.3.25 with oembed: Did not scrape.

    https://pw.wzm.me : Scraped the right photo, medium size, with the title, source & relevant text underneath.

    https://www.flickr.com/photos/leightonian/35651650133/in/photolist-Wjq2uZ-xeqDmh-vgBJWV-wzaFRk

    Elgg 2.3.12 site with hypeScraper: Scraped only in discussion comment. Scraped the wrong image, in small size, with title, source & relevant text next to the image.

    Elgg 3.3.25 with oembed: Scraped in both the discussion body & comment. Scraped full size (good) image without any text. (Strangely, after several discussion posts, it stopped scraping in both the body and the comments. At first I thought Flickr might have noticed too many scrape requests and blocked them, but then I realized that the scraped photo is still displayed in the river.)

    https://pw.wzm.me : Did not scrape.

    https://euphoriayachts.com.tr/index.aspx

    Elgg 2.3.12 site with hypeScraper: Did not scrape (just empty placeholder cards in comments)

    Elgg 3.3.25 with oembed: Did not scrape.

    https://pw.wzm.me : Did not scrape.

    For all URLs, for Elgg 3.3.25, I checked the error.log & access.log both for Apache2 & vhost but could not find anything worth mentioning.

    https://www.cumhuriyet.com.tr/bilim-teknoloji/spacexin-polaris-dawn-gorevi-ertelendi-peki-neden-2241930

    Elgg 2.3.12 site with hypeScraper: Scraped only in discussion comment. Small image size with title, source & the relevant text next to it.

    Elgg 3.3.25 with oembed: Did not scrape.

    https://pw.wzm.me : Scraped the right photo, medium size, with the title, source & relevant text underneath

    I wonder if anybody has different experiences with Elgg 3.x.

    cheers.

  • I already have an embed plugin activated. But it is the one delivered with the Core whereas your link points to oscarotero / Embed. Please note that all Elgg 3 tests below are with the oembed (Coldtrick) / embed (Core) plugins. Maybe I should also try the oembed (Coldtrick) / embed (oscarotero) combination.

    Please check the links I mentioned.

    Embed bundled plugin has nothing to do with it (this plugin is not even about web scraping).

    I was talking about oEmbed plugin by ColdTrick, which ALREADY uses 3rd-party Embed library.

    You just need to look at the capabilities of this library to understand how oEmbed plugin works.

     

    Also my link to the hypeScraper plugin allows you to install this plugin for Elgg 3.

    Of course, for the next versions of Elgg this plugin needs to be updated.

    But we have successfully used this plugin in our projects on Elgg 3. There were no problems with its activation.

     

    As for the link tests, as I mentioned in the previous reply, site parsing depends on many factors (read timeout, user agents, proxies, captchas, IP blocking etc) and it's very difficult to find a perfect solution. 

    Thanks for the Flickr test, we will fix it.
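
For anyone who wants to narrow down which of the factors listed above (read timeout, user agent, IP blocking) applies to a given link, a small probe run from the server itself can help. The following Python sketch is only a rough illustration under stated assumptions: the user-agent string, the function names, and the choice of status codes treated as "blocked" are hypothetical, and none of this is taken from hypeScraper or the oEmbed plugin.

```python
import socket
import urllib.error
import urllib.request

# Hypothetical user-agent string; some sites answer differently per agent.
DEFAULT_UA = "Mozilla/5.0 (compatible; ExampleScraper/1.0)"


def build_request(url, user_agent=DEFAULT_UA):
    """Build a GET request that sends an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def classify_fetch(url, timeout=5.0, opener=urllib.request.urlopen):
    """Fetch a URL and return a short label describing what happened.

    `opener` is injectable so the classification can be tried offline.
    """
    try:
        with opener(build_request(url), timeout=timeout) as resp:
            return f"ok ({resp.status})"
    except urllib.error.HTTPError as exc:
        # HTTPError must be caught before URLError (it is a subclass of it).
        if exc.code in (401, 403, 429):
            return f"blocked or rate-limited (HTTP {exc.code})"
        return f"http error ({exc.code})"
    except urllib.error.URLError as exc:
        # Connection-phase timeouts arrive wrapped in URLError.
        if isinstance(exc.reason, socket.timeout):
            return "timeout"
        return f"network error ({exc.reason})"
    except socket.timeout:
        # Read-phase timeouts can be raised directly.
        return "timeout"


# Offline demonstration: only shows the header that would be sent.
req = build_request("https://example.com/")
print(req.get_header("User-agent"))
```

Calling `classify_fetch("https://example.com/some-article")` from the web server and again from a home connection, and comparing the two labels, gives a rough hint as to whether the hosting provider's IP range is being blocked.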

  • Embed bundled plugin has nothing to do with it (this plugin is not even about web scraping).

    I was talking about oEmbed plugin by ColdTrick, which ALREADY uses 3rd-party Embed library.

    Indeed. I had a quick look and I missed the whole point that embed(oscarotero) is a library.

    Also my link to the hypeScraper plugin allows you to install this plugin for Elgg 3.

    Of course, for the next versions of Elgg this plugin needs to be updated.

    But we have successfully used this plugin in our projects on Elgg 3. There were no problems with its activation.

    I had already fetched that hypeScraper version but could not get it scraping. To be honest, I have not tried hard, mainly because it only supports up to Elgg 3.x (hopefully also 3.3), and I would like to upgrade as quickly as possible to as high an Elgg version as possible to see all the new functionality, and then decide on an "optimum version" to settle on later. For now, I am doing everything offline in a virtual machine.

    it's very difficult to find a perfect solution. 

    Indeed it seems to be the case...

    THANK YOU VERY MUCH for your feedback. I really appreciate it!

    cheers!

  • Forgot to mention that hypeScraper plugin has 'Linkify longtext output' option.

    It allows you to display oEmbed links in the body of the output/longtext view, i.e. TheWire isn't available in this case.