You can set up different types of extraction with this unique web crawler spider.

Keywords

"Win Web Crawler" spiders top search engines for matching web sites and extracts data from them.

Quick Start

Select "Search Engines" source - Enter keyword - Click OK

What "Win Web Crawler" Does: "Win Web Crawler" will query all popular search engines, extract all matching URLs from the search results, remove duplicate URLs, and finally visit those websites and extract data from them.

You can tell "Win Web Crawler" which search engines to use. Click the "Engines" button and uncheck any listing that you do not want to use. You can add other engine sources as well.

"Win Web Crawler" sends queries to search engines to get matching website URLs. Next, it visits those matching websites for data extraction. How deep it spiders into the matching websites depends on the "Depth" setting of the "External Site" tab.

Depth

Here you tell "Win Web Crawler" how many levels to dig down within the specified website. If you want "Win Web Crawler" to stay within the first page, just select "Process First Page Only". A setting of "0" will process and look for data in the whole website. A setting of "1" will process the index or home page plus associated files under the root dir only.
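The depth rule can be sketched in Python. This is an illustrative helper, assuming depth counts directory levels below the site root (with 0 meaning unlimited); the exact counting inside "Win Web Crawler" may differ.

```python
from urllib.parse import urlparse

def within_depth(url, depth):
    """Check whether a URL falls within the given crawl depth.

    Hypothetical rule: depth 0 = whole website, depth 1 = files
    directly under the root dir, depth 2 = one subdirectory down, etc.
    """
    if depth == 0:  # 0 = process and look for data in the whole website
        return True
    path = urlparse(url).path
    # Count directory levels above the file itself
    levels = path.strip("/").count("/")
    return levels < depth

print(within_depth("http://www.xyz.com/index.html", 1))          # root-dir file
print(within_depth("http://www.xyz.com/product/milk/a.html", 1)) # too deep
```

With depth 1 only root-dir files pass; with depth 0 everything passes.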

For example, suppose "Win Web Crawler" is going to visit the URL http://www.xyz.com/product/milk/ for data extraction.

"Win Web Crawler" is a powerful, fully featured spider! You need to decide how deep you want "Win Web Crawler" to look for data.

"Win Web Crawler" can retrieve:                      Set options:
Only the matching URL page of the search (URL #6)    Select "Process First Page Only"
The entire milk dir (URLs #6 - 10)                   Select "Depth=0" and check "Stay within Full URL"
The entire www.xyz.com site                          Select "Depth=0"
Only the www.xyz.com page                            Select "Process First Page Only" and check "Spider Base URL Only"
Only root dir files (URLs #1 - 3)                    Select "Depth=1"
Only URLs #1 - 5                                     Select "Depth=2"

Spider Base URL Only:
With this option you can tell "Win Web Crawler" to always process only the base URLs of external sites. For example, in the above case, if an external site such as http://www.xyz.com/product/milk/ is found, "Win Web Crawler" will grab only the base www.xyz.com. It will not visit http://www.xyz.com/product/milk/ unless you set a depth that also covers the milk dir.
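Reducing a full URL to its base can be sketched as follows; `base_url` is a hypothetical helper name, not part of the program.

```python
from urllib.parse import urlparse

def base_url(url):
    """Reduce a full URL to its base (scheme + host), as the
    "Spider Base URL Only" option does."""
    p = urlparse(url)
    return f"{p.scheme}://{p.netloc}/"

print(base_url("http://www.xyz.com/product/milk/"))  # http://www.xyz.com/
```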

Ignore Case of URLs: Set this option to avoid duplicate URLs such as
http://www.xyz.com/product/milk/
http://www.xyz.com/Product/Milk/
These two URLs are the same. When you set "Win Web Crawler" to ignore URL case, it converts all URLs to lowercase and can then remove duplicates like the above. However, some servers are case-sensitive, and you should not use this option on those sites.
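The deduplication described above can be sketched in Python, assuming simple lowercase normalization (an illustrative sketch, not the program's actual code):

```python
def dedupe_urls(urls, ignore_case=True):
    """Remove duplicate URLs, optionally ignoring case.

    Note: paths on case-sensitive servers may break when lowercased,
    which is why the option should be left off for such sites.
    """
    seen, unique = set(), []
    for url in urls:
        key = url.lower() if ignore_case else url
        if key not in seen:
            seen.add(key)
            unique.append(key if ignore_case else url)
    return unique

urls = ["http://www.xyz.com/product/milk/",
        "http://www.xyz.com/Product/Milk/"]
print(dedupe_urls(urls))  # one lowercase URL survives
```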

WebSites

Enter a website URL and extract all data found on that site.

Quick Start

Select 2nd option "WebSite/Dir" - Enter website URL - Select Depth - Click OK

What "Win Web Crawler" Does: "Win Web Crawler" will retrieve the html/text pages of the website according to the Depth you specified and extract all data found in those pages.

# By default, "Win Web Crawler" will stay within the current domain only.

# "Win Web Crawler" can also follow external sites!

If you want "Win Web Crawler" to retrieve files from external sites that are linked from the starting site specified in the "General" tab, you need to set "Follow External URLs" in the "External Site" tab. In this case, by default, "Win Web Crawler" will follow external sites only once; that is, (1) "Win Web Crawler" will process the starting address and (2) all external sites found at the starting address. It will not follow the external sites found in (2), and so on.

"Win Web Crawler" is powerful: if you want it to follow external sites in an unlimited loop, select "Unlimited" in the "Spider External URLs Loop" combo box. Remember that you will need to stop the "Win Web Crawler" session manually, because this way "Win Web Crawler" can try to travel the entire internet.

Directories:

Choose Yahoo, DMOZ, or another directory and get all data from there.

Quick Start & What "Win Web Crawler" Does:

Let's say you want to extract data for all companies listed at
http://directory.google.com/Top/Computers/Software/Freeware/

Action #1A:
Select 2nd option "Web Site/Dir/Groups" - enter this URL in "Starting Address" box - select "Process First Page Only"

Or, let's say you want to extract data for all companies listed at
http://directory.google.com/Top/Computers/Software/Freeware/
plus all lower-level folders such as
http://directory.google.com/Top/Computers/Software/Freeware/windows
http://directory.google.com/Top/Computers/Software/Freeware/windows/browser
http://directory.google.com/Top/Computers/Software/Freeware/linux
etc....

Action #1B: Select the 2nd option "WebSite/Dir/Groups" - enter the URL http://directory.google.com/Top/Computers/Software/Freeware/ in the "Starting Address" box - select Depth=0 and the "Stay within Full URL" option.

With these actions, "Win Web Crawler" will download the http://directory.google.com/Top/Computers/Software/Freeware/ page (and optionally all lower-level pages) and build a URL list of the companies listed there.

Now you want "Win Web Crawler" to visit all those URLs and extract all data found in those sites.

Action #2: After either action above, you must move to the "External Site" tab and check the "Follow External URLs" option. (Remember: this setting tells "Win Web Crawler" to process/follow/visit all URLs found while processing the "Starting Address" of the "General" tab.)

List of URLs:

Quick Start:
Select the 3rd option "URLs from File" - Enter the file name that contains the URL list - Select Depth - Click OK

What "Win Web Crawler" Does: "Win Web Crawler" will scan the contents of the specified file. The file must contain one URL per line; no other format is supported, and "Win Web Crawler" will accept only lines that start with http://. It will also skip URLs that point to image/binary files, because those files will not contain any data.

After building a unique URL list from the above file, "Win Web Crawler" will process the websites one by one according to the depth you specify.
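The file filtering described above can be sketched as follows. `load_url_list` is a hypothetical helper, and the extension list is illustrative, not the program's actual list.

```python
# Illustrative image/binary extensions that carry no text to extract
SKIP_EXTS = (".jpg", ".jpeg", ".gif", ".png", ".zip", ".exe", ".pdf")

def load_url_list(path):
    """Read a URL-per-line file the way the "URLs from File" source does:
    keep only lines starting with http://, drop image/binary URLs,
    and remove duplicates while preserving order."""
    seen, urls = set(), []
    with open(path) as f:
        for line in f:
            url = line.strip()
            if not url.startswith("http://"):
                continue  # only http:// lines are accepted
            if url.lower().endswith(SKIP_EXTS):
                continue  # image/binary files have no text data
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```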

Frequently Asked Questions

Q:

Does this extractor require 'Internet Explorer'?

Q:

I set up a project with "URLs from File" extraction and entered the filename, but "Win Web Crawler" cannot find any links in the file?

Q:

When I aim this extractor at http://dmoz.org/Kids_and_Teens/Computers/Internet/ I would expect to see all links listed there with descriptions, but I don't. How come?

Q:

When I run the "Win Web Crawler" link extractor, it consumes all my computer's power and the screen hardly refreshes?

Q:

Can I resume an interrupted session in "Win Web Crawler"?

Q:

How can I add a search engine listing other than those specified in the Engine Listing dialog?

Q:

Why does the extractor slow down after running all day?

Q:

How do I get more data in "Win Web Crawler"? When I query a search engine I see millions of matches.

Q:

What are the inactive sites shown in the data tab?

Q:

Should I use more threads to complete the session quickly?

Q:

Does this extractor require 'Internet Explorer'?

A:

No. It doesn't require any third party software/library.

Q:

I set up a project with "URLs from File" extraction and entered the filename, but "Win Web Crawler" cannot find any links in the file?

A:

Make sure the file exists on disk. The file must contain one URL per line; no other format is supported, and "Win Web Crawler" will accept only lines that start with http://. It will also skip URLs that point to image/binary files, because those files will not contain any text data to extract.

Q:

When I aim this extractor at http://dmoz.org/Kids_and_Teens/Computers/Internet/ I would expect to see all links listed there with descriptions, but I don't. How come?

A:

After entering http://dmoz.org/Kids_and_Teens/Computers/Internet/ in the starting address box, move to the "External Site" tab and check the "Follow External URLs" option. This option tells "Win Web Crawler" to visit all linked sites and extract titles and other info.

Q:

When I run the "Win Web Crawler" link extractor, it consumes all my computer's power and the screen hardly refreshes?

A:

It seems you are using a high number of threads. Decrease the thread value to "5" in the "New Session - Other" tab. "Win Web Crawler" can launch multiple threads simultaneously, but too high a thread setting may be too much for your computer and/or internet connection to handle, and it also puts an unfair load on the host server, which may slow the process down.

Q:

Can I resume an interrupted session in "Win Web Crawler"?

A:

Yes. Use 'File - Open' command to open a previously stopped session.

Q:

How can I add a search engine listing other than those specified in the Engine Listing dialog?

A:

It is easy. In the "URL" field, type the search query URL, replacing the search keyword part with the "Win Web Crawler" syntax {SEARCH_KEYWORD}.

For example, an AOL query URL for a "Flower Shop" search is:
http://search.aol.com/dirsearch.adp?query=Flower+Shop

Just replace the Flower+Shop part with {SEARCH_KEYWORD}, as follows:

http://search.aol.com/dirsearch.adp?query={SEARCH_KEYWORD}

After adding the new engine listing, click the "Save" button.
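The placeholder substitution above can be sketched in Python; `build_query_url` is a hypothetical helper showing the equivalent of what the program does with the template.

```python
from urllib.parse import quote_plus

def build_query_url(template, keyword):
    """Fill a search-engine URL template by substituting the
    {SEARCH_KEYWORD} placeholder with the URL-encoded keyword."""
    return template.replace("{SEARCH_KEYWORD}", quote_plus(keyword))

template = "http://search.aol.com/dirsearch.adp?query={SEARCH_KEYWORD}"
print(build_query_url(template, "Flower Shop"))
# http://search.aol.com/dirsearch.adp?query=Flower+Shop
```

quote_plus encodes the space as "+", matching the AOL example above.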

Q:

Why does the extractor slow down after running all day?

A:

Do not use many threads in the New Session Dialog - Other tab. Use only 5 or fewer.

Also, do not use it for very broad searches. The program uses RAM to store extracted URLs so it can avoid duplicate data and avoid revisiting already-visited sites, so a broad search uses a lot of RAM and may slow it down.

If you do run a broad search, uncheck the 'View - Display data in data tab' menu item so no data is shown in the data tab; performance will increase.

Do not use 'Follow External Sites - Spider Unlimited Loop' in the New Session Dialog. That way the crawler can try to travel the entire internet and crash easily.

Q:

How do I get more data in "Win Web Crawler"? When I query a search engine I see millions of matches.

A:

To get more results:

(1) Select all search engines and click Save in the New Session Dialog -> Engine Listing Dialog.

Note: although you see millions of matches in the search results, search engines do not deliver more than 1000 results. For example, try to view the 1001st result in any search engine.

Q:

What are the inactive sites shown in the data tab?

A:

"Win Web Crawler" could not connect to these sites. A site could be down temporarily, or its domain may have expired. If you want to try these sites later, save the list using the "Save" button and use the "New Session Dialog - URLs from File" option to process them later.

Q:

Should I use more threads to complete the session quickly?

A:

That is fine for a smaller session that will complete within a few hours. But for large-scale sessions that take many hours, use a low thread count (say 5).

Threads are used to download data simultaneously. It is not true that more threads always mean faster extraction: after each download, the program still needs to analyze and parse the data, extract it, collect inside links for further crawling, and so on. The more threads you use, the busier the program and CPU become. Use about 10 for smaller sessions and 5 for large sessions.
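The tradeoff above can be sketched with a small worker pool. This is an illustrative Python sketch, not the program's actual implementation; the `fetch` function stands in for the real download-and-parse step.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Placeholder for a download + parse step. In the real crawler,
    the parse/extract phase keeps the CPU busy after each download,
    which is why adding threads eventually stops helping."""
    return f"data from {url}"

urls = [f"http://example.com/page{i}" for i in range(20)]

# A small pool (5 workers) suits long sessions; ~10 suits short ones.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, urls))

print(len(results))  # 20
```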