Case Study on Google Search Engine


Introduction: why Google was chosen:

Google is recognized as the world’s largest search engine company, with a huge number of users around the world. It operates more than one million servers in data centers worldwide, integrates global information, and processes hundreds of millions of search requests every day, automatically ‘browsing’ each web page and scoring them one by one. Users only need to enter keywords on the search home page, and the Google search engine will find the highest-scoring relevant pages among those it has visited and display them in less than a second, so that everyone can access the information they want.

Google has been able to grow into a company with a dominant share of the Internet search market thanks to the effectiveness of the ranking algorithms underlying its search engine. The underlying search system handles more than 88 billion searches per month. During this time, the main search engine has never experienced an outage, and users can expect query results in about 0.2 seconds. [googleblog.blogspot.com]

Main:

Design architecture:

Google’s search engine is implemented in C or C++, which is efficient and can run on Solaris or Linux. In this section, we give a high-level overview of how the whole system is designed, as pictured in Fig. 1.

In Google, web crawling is done by several distributed crawlers. A URL server sends lists of URLs to the crawlers, which send all the fetched web pages to the store server; the repository then compresses the web pages and stores them in the database. Every web page has an ID number (called a docID) that is assigned whenever a new URL is parsed out of a page. The indexer performs many functions: it reads the repository, uncompresses the documents, and parses them. Every document is converted into a set of word occurrences called hits. A hit records the word, its position in the text, an approximation of its font size, and its capitalization. The indexer distributes these hits into a set of ‘barrels’, creating a partially sorted forward index. The indexer also has another important function: it parses all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, as well as the text of the link.
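To make these data structures concrete, here is a minimal C++ sketch of a hit record and a forward-index barrel. The bit widths and the map-based layout are illustrative assumptions, not Google’s actual encoding, which the text does not specify.

    #include <cstdint>
    #include <map>
    #include <vector>

    // Hypothetical layout of a hit: a compact record of one word occurrence.
    // The widths (12-bit position, 3-bit font size, 1-bit capitalization)
    // are assumptions for illustration.
    struct Hit {
        std::uint16_t position : 12;    // word position within the document
        std::uint16_t font_size : 3;    // coarsely quantized relative font size
        std::uint16_t capitalized : 1;  // capitalization flag
    };

    using DocID = std::uint64_t;
    using WordID = std::uint32_t;

    // A forward-index "barrel": for each document, the hit list of every word
    // occurring in it. Keying by docID keeps the barrel sorted by docID.
    using ForwardBarrel = std::map<DocID, std::map<WordID, std::vector<Hit>>>;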

The URL resolver reads the anchors file, converts the relative URLs into absolute URLs and in turn into docIDs, and puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links, each of which is a pair of docIDs. The links database is used to compute the PageRanks of all the documents.
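A much-simplified sketch of the URL resolver’s two jobs follows: turning relative URLs into absolute ones and mapping each URL to a docID. The names ResolveUrl and UrlTable are hypothetical, and real resolution (per RFC 3986) handles many more cases than this.

    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <string>

    using DocID = std::uint64_t;

    // Resolve only the easy cases: already-absolute URLs, site-relative
    // paths ("/about.html"), and document-relative names ("img.png").
    std::string ResolveUrl(const std::string& base, const std::string& relative) {
        if (relative.rfind("http://", 0) == 0 || relative.rfind("https://", 0) == 0)
            return relative;  // already absolute
        std::size_t scheme = base.find("//");
        std::size_t path = base.find('/', scheme + 2);  // start of the path
        if (relative.rfind("/", 0) == 0)
            return base.substr(0, path) + relative;             // site-relative
        return base.substr(0, base.rfind('/') + 1) + relative;  // document-relative
    }

    class UrlTable {
    public:
        // Assign a fresh docID the first time a URL is seen.
        DocID ToDocID(const std::string& url) {
            auto [it, inserted] = ids_.try_emplace(url, next_);
            if (inserted) ++next_;
            return it->second;
        }

    private:
        std::map<std::string, DocID> ids_;
        DocID next_ = 0;
    };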

The sorter takes the barrels, which are sorted by docID, and resorts them by wordID to generate the inverted index. This is done in place, so little temporary space is needed. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list, together with the lexicon produced by the indexer, and generates a new lexicon to be used by the searcher. The searcher is run by a web server and answers queries using the lexicon built by DumpLexicon, the inverted index, and the PageRanks.
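The inversion step can be pictured as a re-sort of postings, as in the sketch below; the Posting struct and the in-memory vector are simplifying assumptions, since the real barrels are large on-disk structures.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    using DocID = std::uint64_t;
    using WordID = std::uint32_t;

    struct Posting {
        DocID doc;
        WordID word;
        // hit payload omitted for brevity
    };

    // Re-sort a barrel by wordID (then docID): afterwards all documents
    // containing a given word are contiguous, so a per-wordID offset list
    // over this array is exactly an inverted index.
    void Invert(std::vector<Posting>& barrel) {
        std::sort(barrel.begin(), barrel.end(),
                  [](const Posting& a, const Posting& b) {
                      return a.word != b.word ? a.word < b.word : a.doc < b.doc;
                  });
    }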

Scalability, availability, and security:

From the perspective of a distributed system, Google’s search engine is a fascinating case study: it handles extremely demanding workloads, especially with respect to scalability, reliability, availability, and security.

Scalability:

Scalability refers to a distributed system’s ability to operate effectively and efficiently at many different scales, from a small enterprise intranet to the Internet. Even if the number of resources and users surges, the system should maintain its effectiveness. There are three challenges to achieving scalability.

(1) Control the cost of physical resources

When the demand for resources increases, we should spend only reasonable costs to expand the system to meet the requirements. For example, if a single search-engine server cannot handle all the incoming requests, the number of servers must be increased to avoid performance bottlenecks (see the sketch after this subsection).

In this respect, Google considers scalability in three dimensions:

    1. being able to process more data (x)
    2. being able to process more queries (y)
    3. seeking better results (z)

From the data in the Introduction, Google’s search engine undoubtedly does very well in these dimensions. However, to remain scalable, other functions, including indexing, ranking, and searching, require highly distributed solutions.
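As a toy illustration of the ‘add more servers’ idea from challenge (1), the following sketch spreads queries over a pool of replicas with round-robin dispatch; the QueryDispatcher class and its interface are hypothetical, not anything Google has published.

    #include <cstddef>
    #include <string>
    #include <utility>
    #include <vector>

    class QueryDispatcher {
    public:
        explicit QueryDispatcher(std::vector<std::string> servers)
            : servers_(std::move(servers)) {}

        // Each incoming query goes to the next server in the pool; capacity
        // grows at reasonable cost simply by adding servers to the pool.
        const std::string& PickServer() {
            const std::string& s = servers_[next_];
            next_ = (next_ + 1) % servers_.size();
            return s;
        }

    private:
        std::vector<std::string> servers_;
        std::size_t next_ = 0;
    };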

(2) Control the loss of performance

When a distributed system deals with a large number of users or resources, it produces very large data sets, and managing them places great demands on the system’s performance. In this respect, hierarchic algorithms scale much better than linear ones, though some loss of performance cannot be completely avoided. The sketch below illustrates the difference.
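The contrast can be made concrete with a simple membership test: a linear scan touches all n entries, while a hierarchic (divide-and-conquer) search touches about log2(n) of them, roughly 30 probes instead of a billion when n is 10^9.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // O(n): the cost grows in proportion to the data set.
    bool LinearContains(const std::vector<std::uint64_t>& data, std::uint64_t key) {
        for (auto v : data)
            if (v == key) return true;
        return false;
    }

    // O(log n): each step halves the remaining search space, so the cost
    // grows only logarithmically; this is the hierarchic behaviour.
    bool HierarchicContains(const std::vector<std::uint64_t>& sorted,
                            std::uint64_t key) {
        return std::binary_search(sorted.begin(), sorted.end(), key);
    }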

Because Google’s search engine is highly interactive, it must achieve latency as low as possible; performance must be good enough that a web search completes within about 0.2 seconds. Only then can Google profit from the sale of advertisements. Its annual advertising revenue is as high as US $32 billion, which suggests that Google is superior to other search engines in managing the performance of the underlying resources involved, including network, storage, and computing resources.

(3) Prevent the exhaustion of software resources

The Internet currently uses 32-bit network addresses. If there are too many Internet hosts, the supply of addresses will be exhausted.

Google does not have a good solution for this at present, because moving to 128-bit Internet addresses would undoubtedly require many software components to be modified.
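To put the two address spaces in perspective, a quick back-of-the-envelope computation (purely illustrative, not tied to any Google code):

    #include <cmath>
    #include <cstdio>

    int main() {
        // 32-bit addresses give 2^32 values, about 4.3 billion.
        unsigned long long ipv4 = 1ULL << 32;
        // 128-bit addresses give 2^128 values, far beyond any 64-bit
        // integer, so we approximate in floating point (about 3.4e38).
        double ipv6 = std::pow(2.0, 128);
        std::printf("32-bit address space:  %llu\n", ipv4);
        std::printf("128-bit address space: ~%.2e\n", ipv6);
        return 0;
    }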

Availability:

The availability of a distributed system mainly depends on the extent to which new resource-sharing services can be added and used by multiple clients. Because Google’s search engine must satisfy the most demanding requirements in the shortest time across web crawling, indexing, and sorting, availability is also a strong requirement. To meet these needs, Google has developed a physical architecture (Fig. 3).

The middle layer defines a common distributed-system infrastructure that not only lets newly developed applications and services reuse the underlying system services but also provides integrity for Google’s huge code base.

Security:

There are many information resources of high value to users in distributed systems, so it is very important to protect the security of these resources. The security of information resources has three parts: confidentiality (preventing disclosure to unauthorized individuals), integrity (preventing alteration or damage), and availability (preventing interference with the means of accessing the resources).

When investigating the security of Google’s search engine, we found that Google has not been very successful: it has even publicly admitted to disclosing user information for profit, which means that the information security of users of Google’s software cannot be guaranteed.

Google distributed file system (GFS):

The Google File System was implemented to meet the rapidly growing demands of Google’s big-data processing and management. Beyond this demand, GFS faces the challenge of managing distribution and the risk of increased hardware failure. Keeping data safe while scaling up to thousands of machines managing multiple terabytes of data can thus be considered the key challenge GFS faces. So Google made an important decision: not to use any existing distributed file system but to develop a new one. The biggest difference from other file systems is that GFS is optimized for large files (gigabytes to multiple terabytes); most files are treated as effectively immutable, written once and read many times.

A GFS cluster consists of a single master and multiple chunk servers and is accessed by multiple clients, as shown in Fig. 4 (a summary figure of GFS from [Vijayakumari, 2014]).

These machines are commodity Linux machines running user-level server processes; a chunk server and a client can run on the same machine as long as its resources allow. Stored files are divided into fixed-size chunks, each identified by a globally unique 64-bit chunk handle. Chunk servers store chunks on local disks as Linux files and read and write chunk data specified by a chunk handle and byte range. For reliability, every chunk is replicated on at least three chunk servers. The master maintains the metadata of the whole GFS and periodically polls every chunk server for its state through HeartBeat messages. Data-bearing communication goes directly to the chunk servers, without being linked into the Linux vnode layer. Neither the client nor the chunk server caches file data: forgoing caches avoids the problem that working sets are too large to cache, and it keeps the client and the whole system simple and consistent. Chunk servers do not need to cache file data anyway, because Linux’s buffer cache already keeps frequently accessed data in memory, which greatly helps the performance of GFS.
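The division of labour between master and chunk servers can be sketched as follows. The class and member names are hypothetical; the 64-bit handles and three-way replication are from the design described above, and the 64 MB chunk size is GFS’s published default.

    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <optional>
    #include <string>
    #include <utility>
    #include <vector>

    using ChunkHandle = std::uint64_t;            // globally unique 64-bit handle
    constexpr std::size_t kChunkSize = 64 << 20;  // 64 MB fixed-size chunks
    constexpr int kMinReplicas = 3;               // replicas per chunk

    struct ChunkInfo {
        std::vector<std::string> replicas;  // chunk servers holding a copy
    };

    class Master {
    public:
        // A client turns (file, byte offset) into a chunk index, asks the
        // master for the handle and replica locations, then exchanges all
        // file data directly with a chunk server, never with the master.
        std::optional<std::pair<ChunkHandle, ChunkInfo>>
        Lookup(const std::string& file, std::uint64_t offset) const {
            auto it = file_to_chunks_.find(file);
            if (it == file_to_chunks_.end()) return std::nullopt;
            std::size_t index = offset / kChunkSize;
            if (index >= it->second.size()) return std::nullopt;
            ChunkHandle handle = it->second[index];
            return std::make_pair(handle, chunks_.at(handle));
        }

    private:
        std::map<std::string, std::vector<ChunkHandle>> file_to_chunks_;  // metadata
        std::map<ChunkHandle, ChunkInfo> chunks_;                         // locations
    };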

Communication protocols:

The choice of communication protocols is very important to the overall design of a system. Google adopts a simple, minimal, and efficient remote invocation protocol. Remote invocation requires a serialization component to transform the procedure-call data, so Google developed protocol buffers, a simplified, high-performance serialization component. Google also uses a separate publish-subscribe protocol.

Protocol buffers:

Protocol buffers focus on data description and subsequent data serialization. The goal is to provide a simple, efficient, extensible way to specify and serialize data independently of language and platform. The serialized data can be stored, transferred, or used in any scenario that needs a serialized data format. Fig. 5 shows the three reasons why Google chose protocol buffers. The disadvantage of the design is that it is not as expressive as XML.
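The heart of that compactness is the varint encoding, which the protocol buffers documentation describes publicly: each byte carries seven bits of payload plus a continuation bit, so small numbers occupy a single byte. The sketch below encodes one integer field in this style; the helper names are ours, and the surrounding message handling is omitted.

    #include <cstdint>
    #include <vector>

    // Append a base-128 varint: 7 payload bits per byte, high bit set on
    // every byte except the last.
    void AppendVarint(std::vector<std::uint8_t>& out, std::uint64_t value) {
        while (value >= 0x80) {
            out.push_back(static_cast<std::uint8_t>(value) | 0x80);
            value >>= 7;
        }
        out.push_back(static_cast<std::uint8_t>(value));
    }

    // A field is a key (field number and wire type, 0 = varint) followed by
    // the encoded value.
    void AppendVarintField(std::vector<std::uint8_t>& out,
                           std::uint32_t field_number, std::uint64_t value) {
        AppendVarint(out, (static_cast<std::uint64_t>(field_number) << 3) | 0);
        AppendVarint(out, value);
    }

    int main() {
        std::vector<std::uint8_t> buf;
        AppendVarintField(buf, 1, 300);
        // buf is now {0x08, 0xAC, 0x02}: three bytes for a whole field,
        // compact but not self-describing in the way XML is.
        return 0;
    }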

Publish-subscribe:

Because protocol buffers alone cannot fully meet Google’s communication requirements, the designers also use publish-subscribe, which ensures that distributed events can be sent reliably and in real time to large numbers of potential subscribers. The main reason it is used is to support Google’s advertising system. Google’s publish-subscribe takes a topic-based approach that emphasizes reliable and timely delivery. Although this implements the communication effectively, it incurs additional overhead.
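A minimal sketch of topic-based publish-subscribe, assuming an in-process broker: the Broker class is hypothetical, and a real system would add persistence and acknowledgements to get the reliable delivery described above.

    #include <functional>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    class Broker {
    public:
        using Callback = std::function<void(const std::string& event)>;

        // Register interest in every event published on `topic`.
        void Subscribe(const std::string& topic, Callback cb) {
            subscribers_[topic].push_back(std::move(cb));
        }

        // Deliver `event` to all subscribers of `topic`.
        void Publish(const std::string& topic, const std::string& event) {
            for (auto& cb : subscribers_[topic]) cb(event);
        }

    private:
        std::map<std::string, std::vector<Callback>> subscribers_;
    };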

Key findings:

Whether one looks at the distributed system architecture, the way scalability, availability, and security are managed, or the communication mechanisms, the Google search engine achieves very fast, efficient retrieval without consuming excessive resources. I think the core technology that lets the Google search engine complete a whole retrieval within 0.2 seconds is Google’s unique distributed file system.

When Google designed the Google File System, the goal was to provide redundancy for the storage of massive data on cheap but low-reliability computers. Because the distributed file system Google wanted had to match Google’s applications and workloads, the designers built GFS on the assumptions of high component failure rates, high throughput, and low latency. The framework and basic operation of GFS were introduced in 3.1. The biggest difference between GFS and other distributed file systems is the use of a single master, a point at which traditional distributed file systems suffer a single point of failure and a throughput bottleneck. To avoid these failures, GFS weakens the master’s role: data (as opposed to metadata) never moves through it, and caching is left to the servers. Only when data changes does the master coordinate its replication. Although the design is simple, it is good enough.

At the same time, the system has high fault tolerance. In the event of a system error or failure, the master and chunk servers can be restarted in a few seconds, and every chunk has at least three replicas. In addition, the master itself is shadowed by replicas.

GFS also has some efficiency problems. At present, Google has more than 450,000 machines, but only about a third of them are effectively utilized. This brings Google a great deal of extra cost, energy, and space. Since GFS can already achieve high performance at low cost, I think the next problem to solve is reducing this unnecessary overhead.
