Clustering of Large Databases

Authors:
8. Vedang Bhange
7. Rahul Baviskar
17. Umesh Gaikwad
23. Akash Hedau

WHAT IS CLUSTERING?
The classification of objects into groups is called clustering. Database clustering refers to the ability of several servers or instances to connect to a single database; an instance is the collection of memory and processes that interacts with a database, while the database itself is the set of physical files that actually store the data. Clustering is a method of dividing or grouping the objects in an unstructured database into clusters on the basis of comparisons made between them. Clustering can also be done using advanced technologies such as machine learning, where the machine learns the patterns that exist in its dataset and classifies the content in the database according to the differences it finds relevant. Large datasets run to hundreds of terabytes or even petabytes of data, and storage at this scale is hard even for a cluster. Work with such data is typically a scientific investigation comprising two phases: the data generation phase and the data analysis phase.
Clustering has especially important applications in data science, since it handles unstructured data, and it has significant uses in personalized advertising, city planning, individual assessment in the teaching-learning process, news classification, and more.
Clustering large, high-dimensional databases is a significant problem with challenging performance and system-resource requirements. There are a number of algorithms that can cluster large databases, and a few that address high-dimensional data. Although most of the data in data warehouses is nominal, it also contains smaller amounts of mixed data generated by the system.

WHY IS IT NEEDED?
Providing continuous application service to users is essential for organizations to achieve their business objectives. However, with the continuous increase in data, many organizations now face difficulties in providing high performance and uninterrupted service to their clients. These problems generally fall into the following categories:
Load Balancing Issues: If servers are interacting with many users through a website, the number of queries coming from those users may overload a single database server, and the website becomes unresponsive. Much also depends on the system setup. Load balancing allocates the workload among the different computers that are part of the cluster, so when traffic is heavy there is a higher assurance that the system will be able to support it, and capacity can scale seamlessly as required. This links directly to high availability: without load balancing, a system could get overworked, traffic would slow down, and eventually requests would stop being served altogether. A minimal sketch of one balancing strategy follows below.
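The sketch below shows round-robin dispatch, one simple load-balancing strategy, in Python. The node addresses and the dispatch_query helper are purely illustrative and not tied to any particular clustering product.

```python
from itertools import cycle

# Illustrative list of database nodes in the cluster (hypothetical addresses).
DB_NODES = ["db-node-1:5432", "db-node-2:5432", "db-node-3:5432"]

# Round-robin iterator: each call to next() yields the next node in turn.
_node_cycle = cycle(DB_NODES)

def dispatch_query(query: str) -> str:
    """Pick the next node in round-robin order and 'send' the query to it.

    In a real cluster this step would open a connection through a driver or
    proxy; here we only return the chosen node to show the balancing idea.
    """
    node = next(_node_cycle)
    print(f"Routing query to {node}: {query}")
    return node

# Example: ten identical read queries are spread evenly over three nodes.
for _ in range(10):
    dispatch_query("SELECT * FROM orders WHERE status = 'open'")
```

In real deployments this logic usually lives in a proxy or connection pooler rather than in application code, but the distribution principle is the same.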
Server Crash/Failure Issues: The database server on which the whole system depends for data retrieval may be compromised by an internal system software (OS) exception, which can result in application failure.
System administrators manage such failures and fix issues, and this is where clusters come to the rescue. High-availability clusters preserve service availability by replicating servers and by using redundant software and hardware configurations, so that each node monitors the others and takes over requests if any one node fails. These cluster types are highly suitable for clients whose businesses rely completely on their systems being available, for example e-commerce websites.
The system must, however, be capable of knowing which servers are running, on which IP each is running, which requests are in progress, and what the course of action would be in case of a crash. The most crucial factor is that the servers should not stop serving in any scenario.
Performance Issues: A large number of sleeping queries (database connections waiting to be closed) and complex read and write queries in a job queue can cause the target database and application to run slowly.
Complex queries that perform reads and writes, including insert, update, and delete commands, keep executing and degrade system performance through high CPU and memory utilization. This can result in a very slow response from the web server to the user.
In such cases, an organization that develops or hosts a business website may use a database cluster to avoid the problem of low response times. High-performance clustering plays an important role in serving e-commerce applications: requests are scattered over the group of database nodes and the results are sent back to the client efficiently via an HTTP server. One important feature in such a cluster is the DBMS layer, which gathers results from many nodes in the distribution in response to a single request from the user. This tackles complex query problems and optimizes distributed query responses.
Data Redundancy: In many large databases, continuous insertion of records leads to duplicated records that occupy unnecessary memory space, and repetition of that kind leads to data ambiguity and needs to be avoided. Database clustering instead provides controlled redundancy: multiple computers work together to store the data among themselves, and all the systems are kept in synchronization, which means each node holds exactly the same data as every other node. Because this redundancy comes from synchronization, it does not introduce ambiguity, and if one system malfunctions, all of the data is still available as a backup.

HOW IT WORKS
Some important clustering techniques can be classified into five different types: Hierarchical, Partitioning, Density-based, Grid-based, and Model-based methods.
1. Hierarchical Clustering: Hierarchical clustering does not split the data directly into particular clusters in a single step. Instead, there is a series of partitionings, which may run from a single cluster containing all objects down to n clusters each containing a single object. Hierarchical clustering is further divided into two categories: agglomerative and divisive.
a. Agglomerative Algorithms- Bottom-up Approach
i. ROCK [GRS99] is an agglomerative algorithm which uses a similarity metric to find neighbouring data points. It then defines links between two data points based on the number of neighbours they share. The algorithm attempts to maximize a goodness measure that favours merging pairs with a large number of common neighbours. In order to manage large databases, ROCK algorithm requires sampling of the data.
ii. CACTUS [GGR99] is an agglomerative algorithm that uses data summarization to achieve linear scaling in the number of rows. It requires just two scans of the data. However, it scales exponentially in the number of attributes, which limits the algorithm's usefulness for databases with many attributes.
b. Divisive Algorithms-Top-down Approach
DIANA, also known as the Divisive Analysis clustering algorithm, is based on the top-down approach to hierarchical clustering: all data points initially belong to a single cluster, which is then divided into the two least similar clusters, and this is applied recursively until groups of clusters are formed that are distinct from each other. (A small agglomerative example appears after the figure below.)

Figure: Hierarchical clustering algorithm types
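As a small illustration of the bottom-up idea, the sketch below runs scikit-learn's AgglomerativeClustering on synthetic data. ROCK and CACTUS themselves are not part of scikit-learn, so this only demonstrates the general agglomerative approach; the blob parameters and the choice of average linkage are assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic 2-D data: two well-separated blobs of points.
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=(0, 0), scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=(5, 5), scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

# Bottom-up (agglomerative) clustering: start with every point in its own
# cluster and repeatedly merge the two closest clusters until 2 remain.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)

print("Cluster sizes:", np.bincount(labels))  # roughly 50 and 50
```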

2. Partitioning: The data is partitioned into clusters by considering various measures of how similar the objects are. The following are some of the widely used partitioning algorithms:
Prototype-based clustering algorithms can be further divided into two types: crisp clustering, where each data point belongs to exactly one cluster, and fuzzy clustering, where every data point belongs to each cluster to a certain degree. Partitional algorithms are dynamic: points can move from one cluster to another. k-Prototypes [Hua97a] is a partition-based algorithm that applies a heterogeneous distance function to compute distances over mixed data; it requires weighting the contribution of the numeric attributes against the nominal ones.
The k-means algorithm: This is one of the most widely used algorithms and is based on a centroid-based partitioning technique. The k-means algorithm attempts to classify the given data set or observations into k clusters. The k-medoids algorithm is a representative-object-based technique. Both partitioning methods work on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. (A short k-means sketch follows the figure below.)

Figure: k-means partitioning
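The sketch below is a minimal scikit-learn example of centroid-based partitioning with k-means; the synthetic data and the choice of k = 3 are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic dataset with 3 underlying groups.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# k-means: pick k centroids, assign each point to the nearest centroid,
# recompute the centroids, and repeat until the assignments stabilise.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print("Centroids:\n", kmeans.cluster_centers_)
print("Points per cluster:", np.bincount(labels))
```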

3. Density-based Clustering: The DBSCAN algorithm is one of the most significant density-based clustering algorithms. It is built on an intuitive notion of "clusters" and "noise": the key idea is that for each point of a cluster, the neighbourhood of a given radius must contain at least a minimum number of points. The algorithm checks the distances between a point and its nearest neighbours, and thereby considers the concentration of points at a particular location. HDBSCAN is a density-based clustering method that extends the DBSCAN methodology by converting it into a hierarchical clustering algorithm. (A short DBSCAN sketch follows the figure below.)

Figure: DBSCAN algorithm outcome
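Here is a short scikit-learn sketch of DBSCAN on a synthetic "two moons" data set. The eps and min_samples values correspond to the radius and minimum point count described above and are illustrative choices, not recommended defaults.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons plus a little noise: a shape k-means handles
# poorly but density-based clustering captures well.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps = neighbourhood radius, min_samples = minimum points required in that
# radius for a point to be a "core" point of a cluster.
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)

# DBSCAN marks noise points with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters, "| noise points:", list(labels).count(-1))
```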

4. Grid-Based Clustering: It lays a grid of cells over the data in the database and forms clusters from those cells. A widely used algorithm is:
STING (Statistical Information Grid Approach): STING is a prominently used grid-based algorithm in which the data set is divided recursively in a hierarchical manner. Each cell is further sub-divided into a number of smaller cells. STING captures statistical measures of the cells, which helps it answer queries in a small amount of time.
It gives the additional advantage of fast processing time, which is typically independent of the number of objects (it scales well). The spatial area is split into rectangular cells, and there are usually several levels of such cells, especially in 3D, which form a hierarchical structure. (A simplified grid-based sketch appears after the figure below.)

Figure: STING algorithm
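STING itself is not available in common Python libraries, so the sketch below only illustrates the core grid-based idea: divide the space into cells, keep per-cell point counts, and treat sufficiently dense cells as cluster regions. The cell size and density threshold are arbitrary values chosen for the example.

```python
import numpy as np
from collections import Counter

# Synthetic 2-D points: two dense regions plus scattered background noise.
rng = np.random.default_rng(1)
dense_a = rng.normal((2, 2), 0.3, size=(200, 2))
dense_b = rng.normal((7, 7), 0.3, size=(200, 2))
noise = rng.uniform(0, 10, size=(50, 2))
X = np.vstack([dense_a, dense_b, noise])

CELL_SIZE = 1.0          # side length of each grid cell
DENSITY_THRESHOLD = 20   # minimum points for a cell to count as "dense"

# Step 1: assign each point to a grid cell and count points per cell.
cells = Counter(tuple(c) for c in np.floor(X / CELL_SIZE).astype(int))

# Step 2: keep only the dense cells; in a grid-based method these (and
# their dense neighbours) form the cluster regions.
dense_cells = {cell for cell, count in cells.items() if count >= DENSITY_THRESHOLD}

print("Total occupied cells:", len(cells))
print("Dense cells (cluster regions):", sorted(dense_cells))
```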

BENEFITS AND CHALLENGES
Benefits
• Data redundancy: With database clustering, multiple systems work together to store data among one another, so the same data is available on more than one node.
• Load distribution: The load is not concentrated on one particular node; load balancing allocates the workload among the different computers that are part of the cluster.
• High availability: Clustering overcomes the risk of application failure caused by a single server going down.
• Despite all of this distribution at the back end, the cluster still appears to the user as a single system.
• Clustering helps identify patterns in the clustered data and supports the use of unsupervised machine learning and data mining algorithms on large databases, which is done extensively by large data organizations like Google and Amazon.

Challenges :
• Complexity: A clustered database creates a complex system in which many resources and tables are interdependent. In the case of a relational, structured database, the whole database structure is bound up in these dependencies.
• Inability to recover from database corruption: With the increased complexity and heavy volume of data in a large database, any corruption of the data or its structure can become irreversible and very hard to repair.
• Large database clustering also involves high maintenance costs, especially when the data is ever increasing.

CONCLUSION AND FUTURE SCOPE
Large data organizations like Google and Amazon also use clustering mechanisms to identify patterns, especially in the domains of data mining and data organization. Database clustering ensures high availability of organized, interconnected data that is easy to feed into machine learning processes to find trends. It is also widely used to meet domain-specific requirements, especially in system-automation projects where a machine must understand a process. The variety of clustering algorithms and the varied architectures of clustered database systems increase flexibility and adaptability, leading to diverse, multi-domain use cases.
The ever-increasing data held by data organizations also sustains consistent research in this field, with adequate funding, which helps keep systems around the world up to date even as data grows beyond the capabilities of existing databases.
