Big Data is unquestionably one of the most talked about technologies in the
IT world today. While the majority of customers are convinced of its
significance, the focus is primarily on applications rather than on
infrastructure. Application feasibility of Big Data is an important aspect, but
it is the infrastructure setup that is of paramount importance. The way the
infrastructure architecture is worked out affects the performance of any Big
Data cluster. The ultimate goal is to achieve the proper balance between cost
and efficiency in working out the infrastructure architecture for Big Data.
This paper takes a closer look at the concept behind Big Data. To ensure
clarity for readers, the Hadoop software framework is taken as a representative
Big Data product. The paper covers the architecture and methods of a Hadoop
cluster and how they relate to the server and network infrastructure, as well as
the typical storage requirements of a Big Data cluster. Information security
with Big Data is discussed at a high level.
The content presented here is largely based on academic work, some
experimentation done within Tata Consultancy Services (TCS) labs and experience
derived from some of the implementation activities done for TCS customers.
Big Data is unquestionably one of the most talked about technologies in the
IT world today. The volume of data generated globally by mobile phones and
social networks is growing at a phenomenal scale and the variety of data
generated only adds to its complexity. The data generated has opened up a
completely new avenue for organizations to leverage these growing information
assets to better understand and compete in the market.
The current focus of the global community is on driving the creation of best
practices and key learnings for Big Data adoption. Within TCS, we see the more
pressing issue to be the creation of a Big Data architecture framework.
Based on our experience, we understand the present status of the Big Data
architecture process and its limitations. In our recommended architecture
process, we suggest a robust mechanism to address these limitations. The
ultimate goal is to achieve the proper balance between cost and efficiency in
developing infrastructure architecture for Big Data. This process is meant to be
adaptive and will accumulate best practices as it evolves.
Big Data Adoption process – Present Status
The adoption of Big Data may be driven by two perspectives: the application
perspective of analytics or the infrastructure perspective of storage.
The application requirements are analyzed and feasibility of Big Data is
ascertained by means of requirement mapping as well as through POCs. Once this
activity is successfully completed, cluster design and sizing starts. Various
parameters specified as application requirements need to be captured, before
coming up with the cluster design.
In the present state, requirements are well articulated, and they drive the
cluster design and sizing. To some extent, the underlying hardware architecture
for the Name Node, Job Tracker, Data Nodes, etc., is decided based on the
cluster design. In most cases, the underlying network architecture for Big Data
is given standard treatment, without a detailed look at Big Data aspects.
Storage architecture for Big Data is assumed to be DAS, with each node having
its own HDD. There is little clarity on storage optimization techniques.
Information security is barely addressed at this point in time, primarily
because of the lack of available mechanisms, utilities, etc.
Below are the various design aspects presently considered while designing Big
Data architecture:
1. Cluster Design
Application requirements are analyzed in terms of workload, volume, etc.
Based on this, the cluster is designed. Cluster design is an iterative
process: the initial setup needs to be verified and validated with a sample
application and sample data. Big Data cluster design offers considerable
flexibility in terms of fine-tuning the configuration parameters, but this
flexibility also makes it complex.
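As a rough illustration of the sizing step, the sketch below estimates a Data
Node count from the data volume. It is not the sizing method used in this
paper. The replication factor of 3 is the HDFS default; the function name, the
25% allowance for intermediate data, and the 8 TB of usable disk per node are
hypothetical values chosen only for the example.

```python
import math


def estimate_data_nodes(raw_data_tb, replication_factor=3,
                        temp_overhead=0.25, usable_disk_per_node_tb=8.0):
    """Estimate how many Data Nodes are needed to hold raw_data_tb of data.

    All defaults except the replication factor are illustrative assumptions.
    """
    if raw_data_tb < 0:
        raise ValueError("raw_data_tb must be non-negative")
    if usable_disk_per_node_tb <= 0:
        raise ValueError("usable_disk_per_node_tb must be positive")
    # Every block is stored replication_factor times, plus scratch space
    # for intermediate job output (assumed fraction of the replicated size).
    required_tb = raw_data_tb * replication_factor * (1 + temp_overhead)
    # Round up: a partially filled node is still a whole node.
    return math.ceil(required_tb / usable_disk_per_node_tb)


# 100 TB raw -> 375 TB required -> 46.875 nodes, rounded up.
print(estimate_data_nodes(100))  # prints 47
```

An estimate like this is only a starting point. The result still has to be
validated against a sample application and sample data, and the parameters
revised accordingly.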
2. Hardware Architecture
Use of good quality commodity equipment is key for Hadoop clusters. As the
clusters grow, the cumulative hardware cost can be significant. In the present
scenario, the hardware requirements for the Name Node are higher RAM and
moderate HDD capacity. The Job Tracker, if it runs on a physically separate
server, will have higher RAM and CPU speed. Data Nodes are standard low-end
server-class machines.
3. Network Architecture
Presently, the network architecture is not specifically designed for Big Data.
The standard network setup within a data center is taken as the backbone. In
most cases, this may result in an over-estimated network deployment and, at
times, may have a negative effect on MapReduce performance. Inputs from cluster
design and application requirements are not really mapped to the network
architecture.
Hence there is significant scope for creating concrete guidelines related to
Network Architecture for Big Data.
4. Storage Architecture
In the present scenario, most enterprises have huge investments in NAS and
SAN devices. These storage devices need to be accommodated into the overall Big
Data architecture, although DAS is the recommended storage for Big Data. The
type of disk and the choice between shared-nothing and shared-something storage
are not given due consideration.
5. Information Security Architecture
A general examination of different Big Data implementations shows that
security features are sparse, and that aftermarket offerings are not fully
tailored to these clusters. Findings show these deployments to be largely
insecure and wholly reliant on network and perimeter security support.
Big Data Adoption Process - Recommended
While designing Big Data Architecture for an Enterprise setup, it is
necessary to take a comprehensive approach.
Application requirements drive the overall Cluster Design activity. Cluster
Design comprises Cluster Sizing, Hardware Architecture, Network Architecture,
Storage Architecture and Information Security Architecture.
Hardware Architecture should be based on application requirements and Cluster
Sizing.
Network Architecture should also take inputs from application requirements and
Cluster Sizing. This can be worked out in parallel with Hardware Architecture.
Storage Architecture would depend upon Cluster Sizing, Hardware Architecture
and Network Architecture. Application requirements will help fine-tune the
Storage Architecture.
The Information Security Architecture would depend upon Hardware Architecture,
Network Architecture and Storage Architecture. Application requirements will
validate the Security Architecture.
A view of the recommended adoption process is depicted in the figure below,
with the sequence of steps identified and the complexity indicated by the
darkness of the shading of the respective boxes.
Figure 1: Big Data adoption – Recommended process
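The dependencies between the design activities can also be written down as a
small graph and ordered mechanically. The sketch below is only an illustration
of that idea, not part of the recommended process itself. It uses Python's
standard graphlib module (Python 3.9 or later). It assumes that application
requirements are the starting input and that Cluster Sizing depends on them.
The fine-tuning and validation roles of application requirements are not
modelled as hard dependencies. The function name planning_stages is
hypothetical.

```python
from graphlib import TopologicalSorter

# Each activity maps to the set of activities it takes inputs from.
dependencies = {
    "Cluster Sizing": {"Application Requirements"},  # assumption, see lead-in
    "Hardware Architecture": {"Application Requirements", "Cluster Sizing"},
    "Network Architecture": {"Application Requirements", "Cluster Sizing"},
    "Storage Architecture": {
        "Cluster Sizing", "Hardware Architecture", "Network Architecture",
    },
    "Information Security Architecture": {
        "Hardware Architecture", "Network Architecture", "Storage Architecture",
    },
}


def planning_stages(deps):
    """Group activities into stages; activities in one stage can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose inputs are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = planning_stages(dependencies)
for number, stage in enumerate(stages, start=1):
    print(f"Stage {number}: {', '.join(stage)}")
```

Under these assumptions, Hardware Architecture and Network Architecture fall
into the same stage, which matches the statement that they can be worked out in
parallel, and Information Security Architecture comes last.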
Looking at Big Data architecture from the application perspective gives an
isolated view. Similarly, taking the infrastructure perspective alone may negate
the advantages of Big Data in general. A comprehensive strategy covering
application and infrastructure aspects while architecting Big Data is the most
desirable. Unfortunately, it is currently difficult to find any enterprise
following such a strategy rigorously, but we expect the situation to improve in
the future with the growth of Big Data adoption.
Big Data implementation is a multi-skilled discipline. It has a major
dependency on the underlying infrastructure and the deployment architecture.
Once an organization has decided to use Big Data, the deployment architecture
and the associated network, storage and security architecture must be worked
out in an iterative manner before they are finalized. The rule of thumb is to
exploit the benefit of low TCO by using commodity hardware while achieving
almost real-time application performance.
Nandkishor Mardikar has over twenty years of experience in IT, and heads the
Big Data CoE as part of Infrastructure Services within TCS. An alumnus of VJTI,
Mumbai, Nandkishor started his career as a developer. He has played various
roles such as Program Manager, Technology Lead, Chief Architect, etc. His wide
range of industry domain experience includes BFS, Insurance, Telecom, Media and
Manufacturing. With strong experience in ADM, Information Security and
Operations Management, his present responsibilities include providing technology
support for middleware & Big Data infrastructure for TCS customers across the