Is there a plugin to open, extract and save content from a Web page? I've tried hypeScraper, but it doesn't seem to do what I need. I would like to extract and save images and page elements from a user-provided URL.
Thank you,
- Scott
What is wrong with this plugin?
We've improved parsing in our Elgg Theme plugin, but it's paid.
Demo Elgg 4 is here
Demo Elgg 3 is here
hypeScraper creates clickable previews of URLs, but I don't think it extracts and saves images or page elements from a user-provided URL. If I am wrong, and it does do this, please help me understand how. I've spent a lot of time trying on my own. Otherwise, is there another plugin option?
For example: If a user provides the URL:
https://www.menards.com/main/heating-cooling/indoor-air-quality/air-purifiers-accessories/pro-breeze-5-in-1-true-hepa-air-purifier/pb-p01-us/p-1642874271117117-c-5614.htm,
I want to extract and save:
$title = "Pro Breeze 5-in-1 True HEPA Air Purifier"
$image1 = "https://sp.menardc.com/main/items/media/ONERE001/ProductMedium/Image_1.jpg"
$image2 = "https://sp.menardc.com/main/items/media/ONERE001/ProductMedium/Image_2.jpg"
$image3 = "https://sp.menardc.com/main/items/media/ONERE001/ProductMedium/Image_3.jpg"
$image4 = "https://sp.menardc.com/main/items/media/ONERE001/ProductMedium/Image_4.jpg"
$image5 = "https://sp.menardc.com/main/items/media/ONERE001/ProductMedium/Image_5.jpg"
$description = "The Pro Breeze 5-in-1 Air Purifier ... for rooms up to 500 sq.ft - Touch-button controls"
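For reference, this is roughly the kind of thing I'm hoping the plugin does under the hood. Just a sketch in plain PHP DOM that reads Open Graph tags; I don't know how hypeScraper actually does it, and the og: tags are only where product pages usually put this data:
<?php
// Sketch only: fetch a page and pull title/description/images from its Open Graph tags.
$url = $user_provided_url; // e.g. the Menards product URL above

$html = file_get_contents($url); // a real fetch would use cURL with a proper user agent

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML rarely parses cleanly
$doc->loadHTML($html);

$title = '';
$description = '';
$images = [];

foreach ($doc->getElementsByTagName('meta') as $meta) {
    $property = $meta->getAttribute('property');
    $content  = $meta->getAttribute('content');
    if ($property === 'og:title') {
        $title = $content;
    } elseif ($property === 'og:description') {
        $description = $content;
    } elseif ($property === 'og:image') {
        $images[] = $content; // each of these could then be downloaded and saved
    }
}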
Hmm.
I'm not sure which part of what you described hypeScraper doesn't do.
Maybe you have a problem with cookies, user agents, etc. - all the things that make scraping painful.
This is what proxies are for.
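A quick standalone check like this usually shows whether the site is simply rejecting anonymous requests (the user agent string, cookie path and timeout here are arbitrary):
<?php
// Standalone check: does the page answer when the request looks like a browser?
$url = 'https://example.com/some-product-page'; // replace with a URL that fails through the plugin

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);           // follow redirects
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');       // many sites block the default cURL agent
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');  // keep cookies between requests
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');
curl_setopt($ch, CURLOPT_TIMEOUT, 30);
// curl_setopt($ch, CURLOPT_PROXY, 'http://proxy.example:8080'); // if your server's IP is blocked
$html = curl_exec($ch);
echo curl_getinfo($ch, CURLINFO_HTTP_CODE), ' ', curl_error($ch), PHP_EOL;
curl_close($ch);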
Thanks, Nikolai. I looked at hypeScraper further and found a sparsely documented 'cache' feature that might work. Unfortunately, I can't figure out how to get it to work. Can you, or anyone else, offer more guidance? The author, Ismayil Khayredinov, is not available to help.
Oh, look, I am still alive. Came over to check what 5.0 is all about and it's crazy that so many years later people still find my plugins useful.
As far as I remember, the plugin was based on https://github.com/hypeJunction/http-parser. Not sure if you want to download the images or just set them as metadata, but I am sure you can hook into some part of the process (sorry, no recollection of how the code works).
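From memory, hooking in would look something like the sketch below, but the hook name and type are pure guesses - grep the plugin for elgg_trigger_plugin_hook() to find the real ones:
<?php
// Guesswork sketch: handle whatever hook hypeScraper triggers after parsing a URL.
// 'parse' and 'scraper' below are invented names - check the plugin source for the real ones.
elgg_register_plugin_hook_handler('parse', 'scraper', function ($hook, $type, $return, $params) {
    // $return would carry the parsed data (title, description, image URLs, ...)
    // and this is where you could download the images or copy values into metadata
    return $return;
});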
@ihayredinov
Your plugins will always be useful ;)
@ihayredinov
I'm sure many here are happy to see you alive :)
Ismayil (@ihayredinov),
So good to hear you're still alive. I didn't think I'd ever have a chance to say how grateful I am for your work all those years ago. I'm still using several of your plugins, now including hypeScraper. I've convinced it to do what I needed, for the most part, but I'm still bedeviled by its difficulty scraping pages from several different domains. Here's an example: "Parser Error for HEAD request (https://www.walgreens.com/store/c/onetouch-ultra-2-blood-glucose-meter-kit/ID=300400575-product): cURL error 28: Operation timed out after 5000 milliseconds with 0 bytes received (see http://curl.haxx.se/libcurl/c/libcurl-errors.html)", because "filter_var($url, FILTER_VALIDATE_URL)" returns false. If you have any inspiration for where to look for a solution, I'd be incredibly grateful.
Sincerely,
- Scott (@C0Rrupter)
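P.S. For anyone else hitting this, a standalone test along these lines (longer timeout, then a plain GET on the same URL; the values are just guesses) should show whether the site is simply slow to answer or is refusing HEAD requests:
<?php
// Diagnostic sketch for the failing URL: validate it, then try HEAD and GET separately.
$url = 'https://www.walgreens.com/store/c/onetouch-ultra-2-blood-glucose-meter-kit/ID=300400575-product';
var_dump(filter_var($url, FILTER_VALIDATE_URL)); // the validation step quoted above

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_NOBODY, true);          // HEAD, like the parser's first request
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 30);           // well past the 5 seconds that is timing out
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
curl_exec($ch);
echo 'HEAD: ', curl_getinfo($ch, CURLINFO_HTTP_CODE), ' ', curl_error($ch), PHP_EOL;

curl_setopt($ch, CURLOPT_NOBODY, false);         // now a normal GET on the same handle
curl_setopt($ch, CURLOPT_HTTPGET, true);
curl_exec($ch);
echo 'GET:  ', curl_getinfo($ch, CURLINFO_HTTP_CODE), ' ', curl_error($ch), PHP_EOL;
curl_close($ch);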