Search engines are special sites on the World Wide Web designed to help people find information stored on other sites. They are similar to a telephone directory, where the names appear in a predictable order. Search engines differ in how they work, but they all perform three basic tasks:
- They search the World Wide Web based on keywords.
- They keep an index of the words they find, mapped to the websites where they found them.
- They allow users to look for words or combinations of words in that index, and display the matching pages in ranked order.
Early search engines held an index of a few hundred thousand pages and documents, and received maybe a few thousand searches per day. Today, a top search engine will index hundreds of millions of pages and respond to millions of requests every day.
Meta tags allow the creator of a page to specify the keywords that represent the content of the page. These are the keywords under which the page will be indexed. Meta tags guide search engines in choosing where to index your page within a hierarchy of indexed keys, much like telephone cards arranged in alphabetical order. However, as the creator of a page, you cannot rely completely on these meta tags.
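To make the meta-tag idea concrete, here is a minimal sketch of how an indexer might pull keywords out of a page's meta tags. It uses Python's standard `html.parser` module; the sample page and the `MetaKeywordParser` class are illustrative, not any real search engine's code.

```python
from html.parser import HTMLParser

class MetaKeywordParser(HTMLParser):
    """Collects the comma-separated content of <meta name="keywords"> tags."""
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "keywords":
            content = attrs.get("content", "")
            self.keywords.extend(k.strip() for k in content.split(",") if k.strip())

page = '<html><head><meta name="keywords" content="search, index, crawler"></head></html>'
parser = MetaKeywordParser()
parser.feed(page)
print(parser.keywords)  # ['search', 'index', 'crawler']
```

An indexer could record the page under each of these keywords, though, as noted above, it would not trust them blindly.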
As internet usage grew, search engines became more sophisticated around the early 2000s, in response to phony sites that ranked well on keywords while offering poor-quality information. Google in particular penalizes sites that are keyword-rich but content-poor, using something called a web crawler.
A search engine needs to find your site before it can tell anyone where a file is. To find information on the millions of Web pages that exist, a search engine employs special software robots, called spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is called Web crawling. Building and maintaining a useful list of words is not an easy task: a search engine's bots have to look at millions of pages. Site owners who want to rank on the first page can also help the crawlers by providing a site map, an XML file that is fairly easy to generate. Learn more about the Google crawler on Google Site Admin Tools.
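The crawling process above can be sketched as a breadth-first walk over pages, following links and collecting words. To keep the example self-contained and runnable, the "web" here is a hypothetical in-memory dictionary (`FAKE_WEB`); a real spider would fetch pages over HTTP and parse them far more carefully.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real spider would fetch over HTTP.
FAKE_WEB = {
    "/": '<a href="/a">A</a> <a href="/b">B</a> welcome',
    "/a": '<a href="/b">B</a> apples',
    "/b": 'bananas',
}

class LinkParser(HTMLParser):
    """Collects the href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(start):
    """Breadth-first crawl: visit each page once, collect its words."""
    seen, queue, word_lists = set(), deque([start]), {}
    while queue:
        url = queue.popleft()
        if url in seen or url not in FAKE_WEB:
            continue
        seen.add(url)
        parser = LinkParser()
        parser.feed(FAKE_WEB[url])
        queue.extend(parser.links)          # follow links to new pages
        word_lists[url] = FAKE_WEB[url].split()  # crude word list; real spiders strip markup
    return word_lists

pages = crawl("/")
print(sorted(pages))  # ['/', '/a', '/b']
```

The `seen` set is what keeps the spider from crawling the same page twice, and the queue is exactly where a site map helps: it hands the crawler a ready-made list of URLs to enqueue.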
Words placed in the title, subtitles, meta tags, and other positions of relative importance are noted for special consideration during a subsequent user search. The Google spider was built to index every significant word on a page, leaving out the articles "a," "an," and "the." Other spiders may take different approaches. These approaches are defined by algorithms, the mathematical formulas and processes at work behind all the searching.
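Filtering out the articles mentioned above is simple to sketch. This is only an illustration of the idea of a stop-word list, using the three articles the text names; real engines use larger, language-specific lists.

```python
# Stop words the article says the Google spider leaves out.
STOP_WORDS = {"a", "an", "the"}

def significant_words(text):
    """Lowercase the text, split it into words, and drop the stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(significant_words("The spider indexes a page"))  # ['spider', 'indexes', 'page']
```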
Building the index
Once the bots have completed their task of scanning a website, the search engine stores this information in a meaningful way, so it can find the information whenever an internet user searches for it.
Any index requires two pieces to be accessible:
- The information stored with the data
- The method by which the search engine can retrieve the data when presented with a keyword
A hashing technique is used to hash and store the data with some encoding to save storage space. The hash table holds the hashed number along with a pointer to the actual data, sorted in a way that allows it to be stored most efficiently. Combining efficient indexing with effective storage makes it possible to get results quickly, even when the user creates a complicated search. These search engine algorithms also change very often to improve efficiency.
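A minimal sketch of such an index follows. Python's `dict` is itself a hash table: each word is hashed to a bucket that holds, in place of raw data, a set of "pointers" (here, page IDs) back to the pages containing that word. This is an illustration of the hashing idea, not any engine's actual storage format.

```python
from collections import defaultdict

# Inverted index: word -> set of page IDs. The dict hashes each word key.
index = defaultdict(set)

def add_page(page_id, text):
    """Record every word of the page under its page ID."""
    for word in text.lower().split():
        index[word].add(page_id)

add_page("page1", "search engines build an index")
add_page("page2", "engines crawl the web")

print(sorted(index["engines"]))  # ['page1', 'page2']
```

A lookup is then a single hash-table probe, which is why results come back quickly no matter how large the index grows.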
This is how Google is able to retrieve millions of results in a few seconds.
The Query for Search
The user builds a search query to query these hash tables, mostly using Boolean operations to retrieve the results. The most commonly used Boolean operations are AND, OR, NOT, quoted text, and wildcards (something followed by anything).
A few years ago Google implemented a nice feature that guesses the next word as you type your search query. They accomplished this by offering the top 10 searches, in index ranking order, as a suggestion list.
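One way to picture that suggestion feature: keep a ranked list of popular queries and return the ones matching what the user has typed so far. This is only a guess at the mechanism, with a made-up query list; Google's actual system is far more elaborate.

```python
# Hypothetical ranked list of popular queries, most popular first.
TOP_QUERIES = ["search engine", "search console", "seattle weather",
               "sea turtles", "spiders"]

def suggest(prefix, limit=10):
    """Return up to `limit` popular queries starting with the typed prefix."""
    return [q for q in TOP_QUERIES if q.startswith(prefix)][:limit]

print(suggest("sea"))  # ['search engine', 'search console', 'seattle weather', 'sea turtles']
```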
Future of Search Engines
Google has been personalizing search results for years now using our search history and social activity. This is, however, limited to activity within Google products. And it works both ways: Google is already encroaching into Chrome and deciding things for me as soon as I sign in.
Here is an example. Imagine that you are on a strict workout plan, tracking the calories you eat in a phone app. Google thinks it can use this information when you perform a search for recipes. The results could be shaped by the calorie limit you set, or, even better, the search engine could decide what you are going to eat.
When we talk about context, limiting it to search and apps alone is a mistake. It needs to incorporate other parts of our daily activities for it to truly work. Just look at your Amazon recommendations after you search for "bachelor party favors," or your Netflix history when there is trouble in paradise. The results can be terrifying, inaccurate, and not a true reflection of your interests. There needs to be more context behind them.
Google and other search engines are becoming more context-aware, but they are not really integrated, because they are not completely connected.