
Make a Search Engine in PHP and MySQL



Why would you want to make a search engine anyway? There is already a search engine to rule them all. You can use Google to find just about anything on the Internet, and I doubt you will ever have the same computing and storage capabilities as the big G.

So why then make your own search engine?

To make money of course!

… and to become famous as the creator of the next big search engine, or because as a programmer or engineer you like challenges. Making a search engine for the public Internet is tricky, and if you’re like me, you like to solve tricky problems.

The third application is a customized, high-speed site search for your large website with thousands of pages. An indexed search engine will be a lot faster than a full-text search function, and if Google’s site search isn’t flexible enough for your site, you can make your own search functionality.

THE BASICS OF SEARCH

The basis of any BIG search engine is a word-to-web-page index: basically a long list of words and how well they relate to different web pages.
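As a rough illustration of such an index, here is a minimal in-memory sketch. It is written in Python for brevity even though the article targets PHP and MySQL, and the helper names `add_posting` and `lookup` are made up for this example. A real index would live in database tables rather than in a dictionary.

```python
from collections import defaultdict

# word -> {url: score}; a higher score means the word relates better to the page
index = defaultdict(dict)

def add_posting(index, word, url, score):
    """Record how well `word` relates to `url` (hypothetical helper)."""
    index[word.lower()][url] = score

def lookup(index, word):
    """Return (url, score) pairs for `word`, best match first."""
    postings = index.get(word.lower(), {})
    return sorted(postings.items(), key=lambda item: item[1], reverse=True)

# Example data; the URLs and scores are invented for illustration.
add_posting(index, "search", "http://example.com/a", 3)
add_posting(index, "search", "http://example.com/b", 7)
add_posting(index, "engine", "http://example.com/a", 1)
```

Calling `lookup(index, "search")` returns the two example pages ordered by score, which is the core of answering a one-word query.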

To make a search engine you have to do four things:

  • Decide what pages to fetch and fetch them
  • Parse out words, phrases and links from the page
  • Give a score to every keyword or key phrase indicating how well the phrase relates to that page and store the scores in the search engine index
  • Provide a way for users to query the index and get a list of matching web pages

This is not hard for a seasoned programmer. It can be done in a day if you know regular expressions and have some experience with HTML and databases.
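To give a feel for the parsing step, here is a small Python sketch that pulls words and links out of a page with regular expressions. The function name `parse_page` and the sample page are assumptions for illustration, and plain regexes like these only cope with reasonably tidy HTML.

```python
import re
from collections import Counter

def parse_page(html):
    """Return (word counts, links) for one HTML page using plain regexes."""
    # Collect the target of every href attribute, quoted with " or '.
    links = re.findall(r'href\s*=\s*["\']([^"\']+)["\']', html, flags=re.IGNORECASE)
    # Replace tags with spaces so that words on either side stay separated.
    text = re.sub(r"<[^>]+>", " ", html)
    # Lowercase the text and keep runs of letters and digits as words.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words), links

# Invented sample page.
page = ('<html><head><title>Cats</title></head><body>'
        '<p>Cats and dogs.</p><a href="/dogs.html">dogs</a></body></html>')
counts, links = parse_page(page)
```

The raw word counts can serve as a first, very crude score for each keyword before any weighting is applied.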

Now you have a working search engine. Just add a lot of computers and hard drives and you’ll soon index all of the Internet. If you’re not prepared to go that far, a one-terabyte disk will hold an index of about 50 million pages.
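A quick back-of-the-envelope check shows what that figure implies per page. The only assumption here is that a terabyte means 10^12 bytes.

```python
DISK_BYTES = 10**12     # one terabyte, decimal definition (assumption)
PAGES = 50_000_000      # page count quoted in the article

# Index space available per page if the whole disk is used for the index.
bytes_per_page = DISK_BYTES // PAGES
```

That works out to roughly 20 KB of index data per page, so the index has to store scored keywords rather than full copies of every page.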

HOW TO SCORE PAGES

After completing basic search functionality there’s a lot of work before anyone will want to use your new machine.

An index is not enough. The challenge is to score pages so that the end user gets the search results most relevant to his idea of what he is searching for.

You’ll need to decide how much weight to put on keywords in the title tag, the description and the main web page contents. To make good scoring you will also want to boost keywords found in the URL of the page and check the anchor text of inbound links.
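One simple way to express such weighting is a weighted sum of keyword hits per field. The Python sketch below uses invented weights and a hypothetical `score_page` function; the numbers are placeholders to tune, not recommended values, and anchor text of inbound links is left out to keep it short.

```python
import re

# Illustrative weights only (assumption); tune them against real queries.
WEIGHTS = {"title": 5, "url": 4, "description": 3, "body": 1}

def count_hits(keyword, text):
    """Count whole-word occurrences of `keyword` in `text`, ignoring case."""
    return re.findall(r"[a-z0-9]+", text.lower()).count(keyword.lower())

def score_page(keyword, title, description, body, url):
    """Weighted sum of keyword hits in each field, plus a flat URL boost."""
    score = WEIGHTS["title"] * count_hits(keyword, title)
    score += WEIGHTS["description"] * count_hits(keyword, description)
    score += WEIGHTS["body"] * count_hits(keyword, body)
    if keyword.lower() in url.lower():
        score += WEIGHTS["url"]
    return score
```

With these weights a single hit in the title counts as much as five hits in the body text, which is the kind of trade-off you will have to experiment with.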

Keeping track of inbound links is the most useful and most challenging of the above. You’ll need to keep a separate database table with info on all links between the pages you index.
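A possible shape for that link table is sketched below in Python, with SQLite standing in for MySQL so the example runs on its own. The table and column names are assumptions made for this sketch, not a schema from the article.

```python
import sqlite3

def open_link_db():
    """Create an in-memory database with one row per link between pages."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE links (
               source_url  TEXT NOT NULL,
               target_url  TEXT NOT NULL,
               anchor_text TEXT NOT NULL,
               PRIMARY KEY (source_url, target_url)
           )"""
    )
    # Inbound-link lookups filter on target_url, so index that column.
    conn.execute("CREATE INDEX idx_links_target ON links (target_url)")
    return conn

def record_link(conn, source_url, target_url, anchor_text):
    # INSERT OR REPLACE is SQLite syntax; MySQL would use REPLACE INTO.
    conn.execute(
        "INSERT OR REPLACE INTO links VALUES (?, ?, ?)",
        (source_url, target_url, anchor_text),
    )

def inbound_links(conn, target_url):
    """Return (source_url, anchor_text) for every page linking to target_url."""
    rows = conn.execute(
        "SELECT source_url, anchor_text FROM links "
        "WHERE target_url = ? ORDER BY source_url",
        (target_url,),
    )
    return rows.fetchall()
```

The number of rows returned by `inbound_links` gives a simple popularity signal, and the anchor texts can be fed into keyword scoring for the target page.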

WHAT TO INDEX AND NOT TO INDEX

Another obstacle you will find when you start indexing real Internet content is that there are vast amounts of useless junk floating around everywhere. Eventually your index will become full of spam, affiliate pages, parked domains, work-in-progress homepages without content, link farms used by search engine optimizers, mirror sites using data feeds to create thousands of pages with product listings or other reproduced content, etc.

When indexing from the Internet you will have to find ways to filter out the junk content from what people are actually reading and searching for.

To start with, you could limit how deep into subdirectories you crawl, how many link hops from a domain index page you crawl, and how many links per web page to allow.
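Those three limits could be checked along the following lines. This is a Python sketch with made-up limit values and function names; suitable numbers depend entirely on the sites you crawl.

```python
from urllib.parse import urlsplit

# Example limits (assumptions); adjust them for your own crawl.
MAX_DIR_DEPTH = 3
MAX_LINK_HOPS = 4
MAX_LINKS_PER_PAGE = 100

def directory_depth(url):
    """Count the directories in the URL path, ignoring a trailing file name."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    if segments and not path.endswith("/"):
        segments = segments[:-1]  # the last segment is treated as a file
    return len(segments)

def should_crawl(url, hops):
    """`hops` is the number of links followed from the domain index page."""
    return directory_depth(url) <= MAX_DIR_DEPTH and hops <= MAX_LINK_HOPS

def links_to_follow(links):
    """Keep only the first MAX_LINKS_PER_PAGE links found on a page."""
    return links[:MAX_LINKS_PER_PAGE]
```

The crawler would call `should_crawl` before fetching a URL and `links_to_follow` on the links parsed from each fetched page.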

PARSING WEBSITES

There are a million ways, both right and wrong, to write HTML, and when you index from the Internet you will need to handle all of them.

When parsing keywords from pages you need to handle not only the complete HTML standard but also all the non-standard ways that are unofficially supported by Internet browsers.

To be able to read all pages you will also need to parse client-side JavaScript and handle frames, CSS and iframes.

Being able to read all sorts of content is a large part of the work on a general search engine.
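A tolerant parser helps with the sloppy-HTML part of the problem. The Python sketch below uses the standard library’s `html.parser`, which does not insist on closed tags, to collect the visible words of a page. The class and function names are invented for this example, and it simply skips script and style content instead of executing JavaScript.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text outside of <script> and <style> elements."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.chunks.append(data)

def visible_words(html):
    """Return the words a reader would see, even if tags are left unclosed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return " ".join(parser.chunks).split()
```

Pages that build their content with client-side scripts would still come out empty here, which is why handling JavaScript is listed as separate work.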

WHY SO MANY URLS?

Finally, you’ll need to deal with the fact that many websites have many URLs pointing to the same web page. Just look at this example:

dmoz.org
www.dmoz.org
dmoz.org/index.html
www.dmoz.org/index.html

All those URLs point to the same web page. If you don’t write special code to handle that, you’ll soon have 4 results in your search engine (one for every URL), all going to the same page. Users will not like you.
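One way to handle this is to reduce every URL to a canonical form before storing it. The Python sketch below makes two assumptions that do not hold for every site: that the `www.` host and the bare host serve the same pages, and that a file named like `index.html` is the directory’s default document. The name `canonical_url` and the list of default documents are invented for this example.

```python
from urllib.parse import urlsplit, urlunsplit

# File names assumed to be a directory's default document.
DEFAULT_DOCS = ("index.html", "index.htm", "index.php")

def canonical_url(url):
    """Map equivalent spellings of a URL to a single form."""
    if "://" not in url:
        url = "http://" + url  # bare host names get a default scheme
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    last = path.rsplit("/", 1)[-1]
    if last.lower() in DEFAULT_DOCS:
        path = path[: len(path) - len(last)]  # keep the trailing slash
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))

urls = ["dmoz.org", "www.dmoz.org", "dmoz.org/index.html", "www.dmoz.org/index.html"]
unique = {canonical_url(u) for u in urls}
```

All four example URLs collapse to the single entry `http://dmoz.org/`, so the page is indexed once.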

There is also the problem of query strings, where a session ID after the question mark in the URL will create an almost unlimited number of URLs for the same web page.

google.com?SID=4434324325325
google.com?SID=4387483748377
google.com?SID=7654565644466

To the search engine, this looks like a very large number of pages all containing the same content.

The quick fix, of course, is to not index pages that include a query string, or to strip the query string from URLs. This works but will also remove a lot of legitimate content (think forums) from your index.
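A possible middle ground, which goes a little beyond the quick fix, is to drop only the parameters that look like session IDs and keep the rest of the query string. The Python sketch below relies on a hand-picked list of parameter names, which is an assumption; real sites use many other names, and `strip_session_params` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter names assumed to carry session IDs (compared in lowercase).
SESSION_PARAMS = {"sid", "phpsessid", "jsessionid", "sessionid"}

def strip_session_params(url):
    """Remove session ID parameters but keep all other query parameters."""
    if "://" not in url:
        url = "http://" + url  # bare host names get a default scheme
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

With this approach the three `google.com?SID=...` examples all become the same URL, while a forum link such as `view.php?topic=42&SID=99` keeps its `topic=42` parameter.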

You now have all the information you need to make a site search engine. If you’re going for a general Internet search engine, there are a lot more details you need to handle, such as robots.txt, site maps, redirects, proxies, recognizing content types, advanced ranking algorithms and handling terabytes of data.

I’ll cover more detail in a future article. Good luck with your next search engine project.

Simon Byholm is building a new search engine where he will test and describe new and old search algorithms. Simon is a software engineer living on the west coast of Finland and has a B.Sc. degree in telecommunications and a burning interest in search engine algorithms.

Article Source: http://www.articlesbase.com/programming-articles/make-a-search-engine-in-php-and-mysql-1266281.html


Written by: OSAblogger / Bill Wardell




© 2006-2009 Online Security Authority & Bill Wardell - All Rights Reserved -- Copyright notice by Blog Copyright