Governing Big Data Platform for Industrialization

Distributed, horizontally scalable platforms for storing and processing data, such as Hadoop, have developed very rapidly over the past six to seven years and have become an obvious option for mainstream data applications. But there is still a lot to be done to mature these technologies on enterprise-class features such as security, monitoring, and usability. In this blog, I will discuss different aspects and ideas that can help in building a secured big data solution for industrialization.

Please note that this blog specifically references features of the Hadoop platform, but it is conceptually applicable to most data technologies.

Building Secured Big Data Platform

Security is a basic need for any data application because a lot of information is at stake, and its exposure can cause monetary as well as personal harm to individuals and organizations. For a big data application, it matters even more, for the following reasons:

  • Big data includes a lot of data, which means a large volume of “sensitive information”
  • Big data involves combining data and audiences that used to be in secure, siloed systems
  • Moving data to a big data lake means losing the built-in compliance controls that legacy systems were tried and tested with
  • Security breaches by internal and external attackers are at an all-time high because now everybody understands that “Data is Gold”
  • Systems with poor access control to their data sets are facing lawsuits, negative publicity, and huge regulatory fines

Without ensuring proper security controls, “Big Data can easily become a Big Problem with a Big Price Tag”.

What Is Needed?

There are many challenges to building a secured platform. A few of the many questions that need to be answered are:

  • How to enforce authentication for users and applications
  • How to control who is authorized to access, modify, and stop data and processes
  • How can Attribute-Based Access Control (ABAC) or Role-Based Access Control (RBAC) be implemented
  • How to encrypt data in transit
  • How to encrypt data at rest
  • How to integrate with existing enterprise security services
  • How can you keep track of data provenance and audit different events in the platform
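
To make the RBAC and ABAC question above more concrete, here is a minimal sketch of the difference between the two models. It is an illustration only, not how any Hadoop component implements authorization: the `Principal` class, the `ROLE_PERMISSIONS` table, the attribute names (`region`, `clearance`, `sensitivity`) and the policy rules are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-permission table; a real platform would load this
# from its central policy store instead of hard-coding it.
ROLE_PERMISSIONS = {
    "analyst": {("sales_db", "read")},
    "etl_service": {("sales_db", "read"), ("sales_db", "write")},
}


@dataclass
class Principal:
    """A user or application, with roles (for RBAC) and attributes (for ABAC)."""
    name: str
    roles: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)


def rbac_allows(principal, resource, action):
    """RBAC: the decision depends only on the roles the principal holds."""
    return any(
        (resource, action) in ROLE_PERMISSIONS.get(role, set())
        for role in principal.roles
    )


def abac_allows(principal, resource_attrs, action):
    """ABAC: the decision depends on attributes of the user, resource and request.

    Assumed example policy: reads are allowed only within the same region and
    only when the user's clearance covers the resource's sensitivity level.
    """
    if action != "read":
        return False
    same_region = principal.attributes.get("region") == resource_attrs.get("region")
    cleared = principal.attributes.get("clearance", 0) >= resource_attrs.get("sensitivity", 0)
    return same_region and cleared
```

With RBAC, granting access means assigning a role; with ABAC, the same user may be allowed or denied depending on the attributes of the data being requested, which is more flexible but also harder to audit.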

Added Complexity Due To Big Data

Each of these questions becomes more complex on a big data platform because:

  • Data is partitioned and replicated across multiple machines, racks and datacenters
  • There are diverse data structures used in a big data platform which may need customized security algorithms
  • Heterogeneous storage technologies are used in large clusters
  • The system may not have a centralized control layer. Having multiple independent services means implementing security for each of them.

Big Data Security Aspects

The overall security requirements of a big data system can be divided into multiple layers:

  1. Perimeter Level Security
    • Guarding access to the platform itself
    • Allowing only authenticated access to the platform
    • Network isolation of the platform
  2. Access Level Security
    • Defining what users and applications can do with data
    • Defining multiple levels of access permissions
    • Authorization for different actions on the platform
  3. Platform State Visibility
    • Reporting on where data came from and how it’s being used
    • Auditing platform access and state-change events
    • Building data lineage to track history
  4. Data Protection
    • Protecting data in the cluster from unauthorized visibility
    • Encryption and Tokenization
    • Data masking

So, let's discuss what needs to be done to take care of these different security layers.

Perimeter – Guarding Access to Platform

  • Preserve user choice of the right service
  • Implement with existing standard systems: Active Directory and Kerberos
  • Conform to centrally managed authentication policies

Access – Who Can Do What

  • Provide users access to only the resources needed to do their job
  • Leverage a role-based access control model built on Active Directory
  • Conform to centrally managed authorization policies

Visibility – Who Did What

  • Understand where report data came from and discover more data like it
  • Comply with policies for audit, data classification, and lineage
  • Centralize the audit repository; perform discovery; automate lineage
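
The visibility goals above (a central audit record plus automated lineage) can be sketched in a few lines. This is a toy, in-memory illustration under stated assumptions: the `LineageGraph` class, its method names and the dataset names in the usage note are hypothetical, and a real platform would persist events in a dedicated governance service.

```python
from collections import defaultdict


class LineageGraph:
    """Keeps an append-only audit log and derives dataset lineage from it."""

    def __init__(self):
        self._parents = defaultdict(set)  # dataset -> datasets it was built from
        self.audit_log = []               # every recorded event, in order

    def record(self, user, action, output, inputs=()):
        """Append an audit event and remember which inputs produced the output."""
        self.audit_log.append(
            {"user": user, "action": action, "output": output, "inputs": tuple(inputs)}
        )
        self._parents[output].update(inputs)

    def upstream(self, dataset):
        """Return every dataset the given dataset was directly or indirectly derived from."""
        seen = set()
        stack = [dataset]
        while stack:
            current = stack.pop()
            for parent in self._parents.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen
```

For example, after recording an ingest of `raw_orders`, a cleaning step that produces `clean_orders` from it, and a report built from `clean_orders` and `customers`, calling `upstream("sales_report")` returns all three source datasets, which answers the question of where the report data came from.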

Data Protection – Prevent Unauthorized Access at Rest and in Motion

  • Encrypt data, conform to key management policies, protect from root
  • Integrate with the existing HSM (Hardware Security Module) as part of the key management infrastructure
  • Perform analytics on regulated data
  • Encrypt data moving over the network using TLS/SSL
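
Masking and tokenization are two common ways to run analytics on regulated data without exposing the raw values. The sketch below is a minimal illustration using only the Python standard library; the function names `mask` and `tokenize` are hypothetical, and the keyed hash shown here is a one-way, deterministic form of tokenization, not a reversible vault-based scheme. In practice the key would come from the key management infrastructure, not from source code.

```python
import hashlib
import hmac


def mask(value, visible=4, fill="*"):
    """Hide all but the last `visible` characters; short values are fully hidden."""
    if visible <= 0 or len(value) <= visible:
        return fill * len(value)
    return fill * (len(value) - visible) + value[-visible:]


def tokenize(value, key):
    """Replace a value with a deterministic keyed token (HMAC-SHA256, hex encoded).

    The same value and key always give the same token, so joins and group-bys
    still work on tokenized columns, while the raw value is not stored.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
```

Masking suits display and reporting, where a partial value is enough for a human reader. Deterministic tokens suit analytics, where records need to be matched across data sets without revealing the underlying identifier.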

HDP and CDH Platform Security Services

The common Hadoop distributions, Hortonworks Data Platform (HDP) and Cloudera's Distribution Including Apache Hadoop (CDH), are actively maturing their platforms by integrating different frameworks to provide the requisite security features:

  • HDP: Apache Knox, Apache Ranger, Apache Atlas, Hadoop Distributed File System Encryption
  • CDH: Cloudera Manager, Apache Sentry, Cloudera Navigator, Rhino


A lot of work is in progress to bring enterprise-level security features to big data systems, but there is still a lot to be done. Before taking any big data system to production, the four pillars of security (perimeter, access, visibility, and data protection) should be thought through and designed upfront; otherwise, the big data journey may become a herculean task to execute.
