Data mining for a better database
Posted by Bruceper (Nicecoder Team), 11-20-2008, 04:00 AM

Below is an article I wrote for IndexuSupport.com; so far it's article 4 in the series. I posted all 4 tonight, and you can read them on my site for free.

Users often ask me how they can get a database for their website. There are several answers to this question.

1) You can buy databases from multiple sources. Some of them are good, but a lot of them are bad. Remember to ask questions and get a sample before you buy.
2) You can use a DMOZ slice. This is a really good way to go; in fact, it's how a lot of my larger sites started. If you don't want to buy the DMOZ Extractor, you can buy slices from me for as low as $5.00.
3) You can scrape it!

First, let's get this out of the way: scraping a database off the net is probably as close as you can come to stealing. Scraping involves running one or more sites through a few programs that save the pages and extract the data you want.

There are millions of websites on the internet, all with a lot of links, content and data. Where you go looking for your data will depend on what kind of database you want.

For this example, let's just assume you have a CD-ROM with the Yellow Pages on it in HTML format. For simplicity, let's also assume that there are 27 pages (a-z plus numbered companies). Yes, they would be large pages, but that's not the point.

The first thing you would do is copy those pages to your hard drive so they can be processed faster, since reading from a CD-ROM is slower than reading from a hard drive. Similarly, if you were scraping web pages, you would want to copy them to your hard drive first.

For copying a website to your hard drive you would probably use a program like Offline Explorer. There are three versions: Offline Explorer, Offline Explorer Pro and Offline Explorer Enterprise. At a minimum you want the Pro version. It costs around $90 and is essential to your data mining/scraping endeavour.

So now you have the web pages on your hard drive in a specific directory, and you want to pull the data out of them. For our purposes we only want the URLs, but you can mine any data you want.

For the actual data mining you need a program called TextPipe. TextPipe comes in three versions: TextPipe Lite, TextPipe Standard and TextPipe Pro. At a minimum you need TextPipe Standard, which costs $199.

I won't get into how to work TextPipe here, but once you start it up you simply add some filters, point it at a directory, file or set of files, and click go. From there TextPipe does all the work and mines the data from the pages. It will output the data wherever you tell it to.

TextPipe Standard can strip HTML tags for you, which saves you a TON of time messing around with the saved data. What you are left with is simply a list of URLs.
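
For anyone who wants to see what this step boils down to, here is a rough sketch in Python. To be clear, this is not how TextPipe works internally and it is no replacement for its filters; it just uses Python's standard html.parser module to pull the absolute http/https links out of every saved page in a directory. The names LinkCollector, extract_urls and extract_urls_from_dir are made up for this example.

```python
from html.parser import HTMLParser
from pathlib import Path


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag that points to an http(s) URL."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def extract_urls(html_text):
    """Return the absolute http(s) links found in one HTML document."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


def extract_urls_from_dir(directory):
    """Scan every .html/.htm file in a directory; return unique URLs in first-seen order."""
    seen = {}
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or path.suffix.lower() not in (".html", ".htm"):
            continue
        # errors="replace" keeps going if a saved page has odd encoding.
        text = path.read_text(encoding="utf-8", errors="replace")
        for url in extract_urls(text):
            seen.setdefault(url, None)
    return list(seen)
```

Calling extract_urls_from_dir on the directory holding the saved pages would give you one de-duplicated list of URLs. Relative links such as /contact.html are skipped on purpose in this sketch, since they are no use for a directory import without the original site address.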

With a list of URLs, you can then import them into IndexU and use IndexU's Fetch Meta function to download the title, description and keywords for all the URLs in the database.
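
If you are curious what "fetching meta" amounts to, the sketch below shows the general idea in Python. This is not IndexU's Fetch Meta code, just an illustration under my own assumptions: it takes the HTML of a page as a string and reads the title tag plus the description and keywords meta tags. MetaCollector and extract_meta are hypothetical names invented for this example.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects the <title> text and the description/keywords <meta> tags."""

    def __init__(self):
        super().__init__()
        self.meta = {"title": "", "description": "", "keywords": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            # Meta names are matched case-insensitively ("Description" == "description").
            name = (attr.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = (attr.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.meta["title"] += data


def extract_meta(html_text):
    """Return a dict with the title, description and keywords of one HTML page."""
    collector = MetaCollector()
    collector.feed(html_text)
    collector.meta["title"] = collector.meta["title"].strip()
    return collector.meta
```

The HTML string could come from a saved file or from a download (for example with urllib.request.urlopen from the standard library). Pages with no meta tags simply come back with empty strings.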

But don't get me wrong, TextPipe can do a lot more than just URLs. It can save almost any formatted data off a webpage, including addresses, phone numbers, descriptions and much, much more. I just used URLs as an example.
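
As a small illustration of mining something other than URLs, here is a Python sketch that pulls phone numbers out of text with a regular expression. It is not a TextPipe filter, and it assumes 10-digit North American numbers written with separators, like (204) 555-0142 or 204.555.0199; other formats would need a different pattern. PHONE_RE and extract_phone_numbers are names I made up for the example.

```python
import re

# Assumed format: optional parentheses around a 3-digit area code, then
# 3 digits and 4 digits, separated by a dash, dot or space.
PHONE_RE = re.compile(r"\(?\b(\d{3})\)?[-. ]?(\d{3})[-. ](\d{4})\b")


def extract_phone_numbers(text):
    """Return every phone number found in the text, normalised to 204-555-0142 style."""
    return [f"{area}-{prefix}-{line}" for area, prefix, line in PHONE_RE.findall(text)]
```

You would normally run this over text that has already had its HTML tags stripped, so that numbers inside tag attributes do not show up as false hits.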

Sure, I made it all sound simple, but once you're familiar with how these programs and TextPipe's filters work, there is really no end to the kinds of databases you can create.

In all seriousness, if you have a single website running IndexU, then this method is not for you. You're looking at a cost of around $300 US just for the software; add your time and the learning curve, and it's not worth the effort. However, you could pay someone to do the scraping/data mining for you. It may sound expensive, but it sure saves learning how to write filters!

--------------------------------------------------------------------

If you are interested in any data mining, please let me know and I can try to get you a quote. I would need the following information:

1) The approximate number of pages to mine, and their URLs
2) The topic you are after for your database (I may have URL suggestions)
3) The exact data you want from each page
4) What format you want the data in (completed database, raw)
5) Your expected time frame


Please understand that data mining is a service that I am offering personally, not Nicecoder.