Channel: WebHarvy Feature – SysNucleus : Blog

Web Scrape Anonymously


WebHarvy allows you to scrape websites anonymously via proxy servers. You can either configure WebHarvy to scrape through a single proxy server or to use a list of proxy server addresses which are cycled automatically after a specified time interval.

Scrape via Proxy Servers

You may download the 15-day evaluation copy of WebHarvy Web Scraper from http://www.webharvy.com/download.html.



WebHarvy V1.4.0.20 Released


The latest update of WebHarvy (version 1.4.0.20) has gone live and is available for download at www.webharvy.com/download.html.

Changes :

  • [New Feature] Keyword based Scraping : Allows you to run the same configuration for a set of input keywords (Read more : http://www.webharvy.com/tour71.html)
  • Edit Configuration : Allows you to edit an already saved WebHarvy configuration XML file (Read more : http://www.webharvy.com/tour41.html)
  • Option to contact us (WebHarvy Support) directly from the application (See Help menu)
  • Option to check for new updates directly from the application (See Help menu)
  • Miner performance improvement : Web mining performance while following links from the main page has been improved
  • Minor improvements and bug fixes
    • Miner window remembers its last position/size/state
    • Issue with Auto Scroll fixed
    • Issue with loading ‘Next Page’ and ‘Following Links’ in certain scenarios while mining has been fixed
    • Issue which resulted in application crash while parsing HTML of certain websites has been fixed

WebHarvy Web Scraper : Scrape data from sections and sub sections within webpages


The ‘category scraping’ feature of WebHarvy allows you to easily scrape a list of links which lead to similarly formatted pages within a website, using a single configuration. This helps to scrape data from sections and subsections listed under the main page of a website.

Please follow this link to know more about Category Scraping.

Category Scraping : Video demonstration 

You may download and try the free evaluation version of WebHarvy, the visual Web Scraper software, from http://www.webharvy.com/download.html.


How to scrape search results data for a list of input keywords?


In most cases, the data to be scraped is the result of performing a search from the main page of the website. Often you need to extract data from the search results for a list of input keywords.

The ‘Keyword Scraping’ feature of WebHarvy allows you to perform this task with ease. You can specify a list of input keywords and WebHarvy will automatically scrape data from the search results corresponding to each keyword in the specified list.

Please follow this link to know more about ‘Keyword based Scraping’.

Video Demonstration : Keyword based Scraping

We recommend that you download and try the evaluation version of our Web Scraper to know more about the features.

 


How to scrape data anonymously?


WebHarvy Web Scraper allows you to scrape data from remote websites anonymously with the help of proxy servers. This prevents remote web servers from blocking or blacklisting your computer’s IP address.

WebHarvy lets you specify either a single proxy server address or a list of proxy server addresses through which the remote website will be scraped. If you provide a list of proxy server addresses, WebHarvy will cycle through the list periodically.
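The cycling behaviour amounts to stepping through the list in round-robin fashion. Here is a minimal Python sketch of the idea (an illustration of the concept only, not WebHarvy’s internal code; the addresses are made up):

```python
import itertools

# Hypothetical proxy server addresses (host:port)
proxies = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]

def rotate(proxy_list):
    """Endlessly cycle through the proxy list; a scraper switches
    to the next address once its configured interval elapses."""
    return itertools.cycle(proxy_list)

rotator = rotate(proxies)
first_six = [next(rotator) for _ in range(6)]
# After the last address, rotation wraps back to the first.
```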

Please follow this link to know more about this feature.

Download WebHarvy Web Scraper FREE Trial!


WebHarvy Web Scraper V1.5.0.26 released


The latest version (V1.5.0.26) of WebHarvy Visual Web Scraper is available for download. The changes in this update are :

  • New option: ‘Capture following text’ added in capture form.
  • Web Miner has been improved to handle HTML errors in target websites.
  • Allows exporting scraped data while mining is paused.
  • For CSV, TSV exports, column names are added as the first row.
  • Option to input keywords in CSV format.
  • Option to manually set page load timeout value in application settings.

The ‘Capture following text’ feature helps to scrape text following a given heading within the page. This feature is useful when the data to be scraped does not occur at a fixed position within the page, but is guaranteed to follow a heading text (for example, ‘Product Details:’ or ‘Specification’).

The option to manually set the page load timeout value from settings window helps to scrape data from websites with slow response times or from those which employ AJAX.

We recommend that you download and try the 15-day free evaluation version.


How to scrape text following a heading using WebHarvy?


In the latest update of WebHarvy, the Visual Web Scraping Software, the newly introduced ‘capture following text’ option allows you to capture text/block/paragraph following a heading within a webpage.

Often the data to be scraped is not located at the same position within all pages, but is guaranteed to be found under a given heading (for example, “Technical Details” or “Product Specification”). Sometimes the text under a given heading cannot be selected as a single item during configuration. In such scenarios the ‘Capture following text’ option in the capture window will prove helpful.

How to?

While in configuration mode, click on the heading and select the ‘Capture following text’ option in the capture window. Provide a suitable name for the field and hit OK. In the preview pane you will see that the text following the heading has been captured.
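The same idea can be sketched in Python using the standard library’s HTML parser (an illustration of the technique, not WebHarvy’s implementation; the HTML fragment is made up):

```python
from html.parser import HTMLParser

class FollowingText(HTMLParser):
    """Collect the text that immediately follows a given heading
    string, wherever the heading occurs in the page."""
    def __init__(self, heading):
        super().__init__()
        self.heading = heading
        self.capture_next = False
        self.result = None

    def handle_data(self, data):
        text = data.strip()
        if self.capture_next and text and self.result is None:
            self.result = text        # first non-empty text after the heading
            self.capture_next = False
        elif text == self.heading:
            self.capture_next = True  # start capturing from here

html = """
<div><b>Shipping Weight:</b> 1.2 pounds</div>
<div><b>Product Details:</b> Stainless steel, 500 ml</div>
"""

parser = FollowingText("Shipping Weight:")
parser.feed(html)
print(parser.result)  # 1.2 pounds
```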

Refer http://www.webharvy.com/tour1.html#ScrapeFollowingText for more details.


Schedule scraping tasks


Web Scraping from Command Line


WebHarvy supports command line arguments so that you can run the software directly from the command line. This allows you to run WebHarvy from scripts or batch files, or to invoke it via code from your own applications.

To know more, read : Running WebHarvy Web Scraper from Command Line
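For example, a script could launch the scraper with a saved configuration via Python’s subprocess module. The executable path and argument below are hypothetical placeholders; refer to the linked documentation for the actual command line syntax:

```python
import subprocess
import sys

def run_command(command):
    """Launch a program, wait for it to finish, and return its exit code."""
    return subprocess.run(command).returncode

# Hypothetical invocation; the path and argument are placeholders only.
# exit_code = run_command([r"C:\Program Files\SysNucleus\WebHarvy\WebHarvy.exe",
#                          r"C:\scrape\config.xml"])

# The helper works for any executable; the Python interpreter itself
# exits with code 0 when asked for its version:
exit_code = run_command([sys.executable, "--version"])
```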


WebHarvy Version 3.0 Released!


We are happy to announce the release of WebHarvy 3.0. We have added a lot of new features in this major update; the list of features and changes is the longest of any product update we have released to date. Here we go:

  • Added the following options in the Capture Window (grouped under ‘More Options’)
    • Capture following text: Improved by using brute force search for all elements in the page
    • Capture HTML: Option to scrape HTML of selected element
    • Capture Text as File: Option to scrape text and save it as a local file (useful while scraping articles and blog posts)
    • Click: Ability to scrape hidden (partially displayed) fields in webpages which require a click from the user to be displayed in full. For example phone numbers or email addresses which are displayed completely only if you click them.
    • Apply Regular Expression: Option to apply Regular Expressions (RegEx) on captured text. RegEx can be applied even after applying ‘Capture following text’, ‘Capture HTML’ & ‘Capture More Content’ options.
    • Capture More Content: Option to capture more text than the selected text, captures parent element’s text. For example this would capture the entire article if you apply this option after having selected the first paragraph.
  • Option to individually select categories/links (one by one) for Category Scraping (Mine menu – Scrape a list of similar links)
  • Export captured data as JSON
  • Ability to mine data from tables (row-column / grid layout)
  • Ability to mine pages which have fewer (less than 10) data items
  • Option to test proxies before using them (Edit menu – Settings – Proxy Settings)
  • Non responsive proxies are skipped during mining. Mining would not stop because of a bad/non-responsive proxy in the list.
  • Option to manually add URLs to an existing configuration (Edit menu – Add URLs to configuration)
  • Option to remove duplicates while mining (Edit menu – Settings – Miner)
  • Added ‘Hourly’ frequency option in Scheduler (Mine menu – Scheduler)
  • Added option to export data directly to database for scheduled mining tasks & command line
  • Added ‘Clear’ option in Edit menu which will clear both the browser and data preview pane
  • Language encoding defaulted to ‘utf-8’ for file exports (XML, CSV etc)
  • CSV/Database export : handles delimiters (comma, quotes etc) in captured data
  • Keyword/Category scraping allowed for 2 entries in evaluation version
  • Rendering issues with in-built browser fixed – defaults to IE 9 rendering
  • New Installer built with InstallShield

Download the latest installation of WebHarvy Web Scraper from https://www.webharvy.com/download.html.


WebHarvy 3.1 (Minor Update)


The 3.1 update of WebHarvy which was released yesterday (July 24) has the following changes.

  • Added option to Tag captured data rows with corresponding Keyword/Category. (Applicable only for Keyword/Category based Scraping). See the new Miner Settings Window (Edit menu – Settings)
  • Option to separately set Page Load Timeout and AJAX Load Wait Time in Miner Settings.
  • Option to edit the start URL / Post Data / Headers for the configuration directly from the UI, without editing the XML configuration file. (under Edit menu – Edit Options)
  • Updates related to Category Scraping, Capture Text following a Heading, Mining multiple pages
  • Bug Fixes

Download and install the latest update from https://www.webharvy.com/download.html.


Scrape with Regular Expressions using WebHarvy


WebHarvy is designed as a ‘point and click’ visual Web Scraper. The design focuses on ease of use, so that you can start scraping data within a few minutes of downloading the software.

But in case you need more control over what is extracted, you can use Regular Expressions (RegEx) with WebHarvy. WebHarvy allows you to extract data by matching RegEx strings against the text content as well as the HTML source of the web page.

If you are new to Regular Expressions, see http://en.wikipedia.org/wiki/Regular_expression.

The following video shows how WebHarvy can be used to scrape the image URL from a web page by applying Regular Expression.

The ‘Capture More Content’ feature comes in handy here (as shown in the video) to make sure that the selected text contains the data (text or HTML code) of interest, before RegEx string is applied.
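The expression used in such cases is an ordinary RegEx with a capture group. A small Python sketch of the technique (the HTML fragment and URL are made up for illustration):

```python
import re

# A fragment of product-page HTML (made up for illustration)
html = '<div class="gallery"><img src="https://example.com/images/item42.jpg" alt="item"></div>'

# Match the value of the src attribute of an <img> tag; the capture
# group (the part in parentheses) is the text that gets extracted.
pattern = r'<img[^>]*\bsrc="([^"]+)"'

match = re.search(pattern, html)
image_url = match.group(1) if match else None
print(image_url)  # https://example.com/images/item42.jpg
```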

Regular Expressions can also be applied directly on the text content of the page as shown in the following video.

To explore further download the latest version of WebHarvy from https://www.webharvy.com/download.html.


Scraping hidden (click to display) fields using WebHarvy


Certain web pages require you to click on a link or button for the data to be displayed. On many websites, email addresses or phone numbers are partially displayed; they are shown in full only when you click on them.

The ‘Click’ option under the ‘More Options’ button in the Capture Window lets you scrape data in such scenarios. (See https://www.webharvy.com/tour1.html#ScrapeHidden).

The following video shows how this option can be used to scrape hidden fields.

Here the phone numbers are partially displayed. Using the Click option, they can be made fully visible and then scraped.

To know more about the features of WebHarvy, see the product feature tour at https://www.webharvy.com/tour.html.


Scrape HTML


WebHarvy allows you to scrape the HTML of page contents in addition to plain text. In the Capture window, click the ‘More Options’ button and select the ‘Capture HTML’ option to scrape the HTML of the selected content.

To capture only a portion of the displayed HTML, you may select and highlight the required portion before clicking the Capture button.

Usually Regular Expressions are applied over the HTML source of the content to extract the data of interest, such as an image URL, or hidden fields such as a phone number.

The following video shows how the ‘Capture HTML’ option is used along with Regular Expressions to correctly extract the product price.

Try out the free evaluation copy of WebHarvy from https://www.webharvy.com/download.html.


Use ‘Capture Following Text’ option to scrape data from details pages


While extracting data from details pages (pages reached by navigating a link from the start page), it is recommended that the ‘Capture Following Text’ option be used whenever possible to correctly and consistently scrape data.

This is because the layout and the amount of data displayed in details pages may not be consistent. For example, if you are scraping an Amazon product listing, the data displayed in the product details page (the page reached by clicking the product link from the search results) may vary slightly from product to product. Here, if you are trying to extract the Shipping Weight under Product Details, instead of clicking on the data (example: ‘1.2 pounds’), click on the heading ‘Shipping Weight’ and apply the ‘Capture following text’ option under the ‘More Options’ button.

Watch the demo :-

 

So in summary, if the data to be extracted comes under a heading, always click the heading and apply the ‘Capture following text’ option. This ensures that the data is scraped from all similar pages without missing any, even if the page contents vary slightly.

 



WebHarvy version 3.3 released!


Version 3.3 of WebHarvy was released on June 16, 2014. The major changes are :

  1. Fixed issues related to URL encoding in Category Scraping
  2. Added option to disable automatic pattern (data field repetition) detection in start page (more details)
  3. Option to follow links (URLs) obtained by applying Regular Expression on HTML – handles both absolute and relative URLs (more details)
  4. Option to capture images whose URL is obtained by applying Regular Expression on HTML – handles both absolute and relative URLs – works even when the image URL does not contain image file extension (more details)
  5. Separate options to download image and to capture image URL (more details)
  6. Fixed issue due to which downloaded image files did not have the correct file extension
  7. Added Multiline mode in RegEx processing
  8. Faster mining ‘restart’ from where it stopped (aborted) previously – remembers last mined URL and its PostData.
  9. Context menu options (copy/cut/paste) added for ‘Additional URLs in Configuration‘ window

Download the latest version of WebHarvy


WebHarvy : 2 new methods of handling pagination


The latest version of WebHarvy Web Scraper supports two new pagination styles for scraping data from multiple pages of websites.

Pages where pagination links are shown in sets

In these pages, pagination links are provided in sets. For example, the first 5 pages have direct links at the bottom of the page to load each of them. To load pages 6 to 10, an additional link must be clicked. Each of pages 6 to 10 then has direct links to load any page in that set, as well as a link to load the next set of 5 pages.

WebHarvy Online Help : Scraping pages where pagination links are displayed in sets

The following video demonstrates how these types of pages can be configured and mined using WebHarvy.

When each page URL contains the page number

Suppose the pages from which you need to scrape multiple listings of data have the following format.

http://www.example.com/search/listing?keywords&pageNumber=1
http://www.example.com/search/listing?keywords&pageNumber=2
http://www.example.com/search/listing?keywords&pageNumber=3
http://www.example.com/search/listing?keywords&pageNumber=4
etc..

Pagination in this case can be handled easily as follows:

1. Open WebHarvy and load http://www.example.com/search/listing?keywords&pageNumber=1.
2. Start Config
3. Select the required data from the page; follow links and select data if required.
4. Select Edit menu > Edit Options > Add/Remove URLs from Configuration
5. Paste the following URL and Apply.

http://www.example.com/search/listing?keywords&pageNumber=%%pagenumber%%

Note that the actual page number is replaced by %%pagenumber%% in the above string.

6. Stop Config
7. Start Mine. You should specify the number of pages to mine since ‘Mine all pages’ option will be disabled. WebHarvy will automatically find and load the next pages and extract data.
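The placeholder substitution behind this method can be illustrated in a few lines of Python (using the example.com stand-in URL from above; this shows the concept, not WebHarvy’s internal code):

```python
# The %%pagenumber%% token in the URL template is replaced with
# successive page numbers, yielding one URL per page.
template = "http://www.example.com/search/listing?keywords&pageNumber=%%pagenumber%%"

def page_urls(url_template, page_count):
    """Expand the template into a list of page URLs."""
    return [url_template.replace("%%pagenumber%%", str(n))
            for n in range(1, page_count + 1)]

urls = page_urls(template, 3)
# urls[0] ends with pageNumber=1, urls[2] with pageNumber=3
```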

WebHarvy Online Help : URL page-number based auto pagination

The latest version of WebHarvy Visual Web Scraper can be downloaded from https://www.webharvy.com/download.html. Try it, and if you need any assistance please do not hesitate to contact our support team.


WebHarvy 4.1.5.141 released


The main changes in this release are :-

  1. Pagination via JavaScript – see https://www.webharvy.com/tour3.html#JS

    This powerful feature is the main highlight of this release. When all other pagination methods fail, you can use this method: directly provide JavaScript code which, when run, loads the next page.

  2. Increased size of virtual browser used by miner

    The dimensions of the miner’s virtual browser have been increased. This solves issues with websites whose layout changes when the browser window is small (mobile layout). It also helps the miner load more items in a single page and scroll, on websites which display data based on the size of the browser window.

  3. Support for ‘Load more content’ and ‘Scroll to load next page’ type pagination, even when the real listing page is reached by clicking links/buttons from the start page.

    In earlier versions, if the listing page loaded more data in the same page via a button/link click or scroll, and if initial navigation (click, JavaScript etc.) was required in the configuration itself to load the listing page from another start page, pagination would fail. This release removes that limitation.

  4. More support for extracting data from popups.

    Popups now handle clicks and JavaScript. This can be used to close the popup window in cases where the currently opened popup must be closed to open the next one.

  5. SQL data export encoding issue related to foreign languages fixed.

    Exported text in non-English languages such as Chinese is now encoded correctly.

  6. Other minor bug fixes

As always you may download and install the latest version from https://www.webharvy.com/download.html.


WebHarvy 5.2 | UI revamp + Oracle db support


Changes in 5.2 are mainly related to user interface and experience. The most visible change is the introduction of the ribbon menu system for providing easy access to most software features.


In addition to the main interface, other windows such as Scheduler and Export have also been updated. The export functionality (to file or database) is now cancellable: you can cancel an ongoing export midway.

As with every release, the Chrome browser has been updated as well. Issues with the URL not updating (in the address bar) while navigating links on some websites have been fixed in this update.

An important non-UI addition in this release is support for exporting data to an Oracle database. The default file export format has also been changed from CSV to Excel.

All main settings are now displayed in snippet format in browser view’s status bar.


Help (videos, articles) related to the website loaded in the configuration browser is automatically loaded and displayed as a smart tip.

Miner Settings can now be opened and changed directly from the Miner window.


JavaScript can now be typed in multi-line code format.


Browser settings now include a new option to share the user’s location with the loaded page.


In addition to the above, this release also contains minor bug fixes and improvements as always. You may download and try the latest version from https://www.webharvy.com/download.html.


WebHarvy’s new user interface


We have significantly updated the user interface of WebHarvy in the latest version available on our website. The following video explains how the features and options are laid out in the new UI. Existing users of older versions will find this video useful for locating specific features and options.
