I’ve been thinking about writing a series of posts about Big Data for months now… which is entirely too much thinking and not enough doing, so here we go.
And, by Big Data we mean…
Wikipedia offers a very computer-science-oriented explanation of Big Data – but while the size of the dataset is a factor in Big Data, there are several others worth considering:
- The arrival pattern (how data is created in the large dataset)
- Dimensions (how the data is grouped, sub-grouped and organized for analysis)
- Transformation (how the data is re-dimensioned or transformed to meet new analysis needs)
- Consumption (how the data will be consumed by client applications and ultimately humans)
Classically, Big Data was regarded as extremely large datasets, usually acquired over a long period of time, involving historical data leveraged for predictive analysis of future events. In the simplest terms, we use hundreds of years of historical weather data to predict future weather. Big Data isn’t new. My first exposure to these concepts was in the early 1990s, dealing with contact center telecom and CTI (computer telephony integration) analytics and predictive analysis of future staffing needs in large-footprint (3 to 5 thousand agent) contact centers.
Another example of the classic Big Data problem is the traditional large operational dataset. In the 90s a 2 terabyte dataset – like the one I worked with at Bank of America – was so massive that it created the need for a special computing solution. Today large operational datasets are measured in petabytes and are becoming fairly common. The primary challenge with these types of datasets is maintaining acceptable operational performance for the standard CRUD (create, read, update, delete) operations.
There are Big Data storage models emerging and maturing today, with the two default options being Hadoop – which relies on a distributed file system – and distributed key-value stores such as Cassandra and CouchDB. These systems (referred to as NoSQL solutions) differ from standard RDBMS systems in two important ways:
- The underlying organization of data is amorphous and does not implement relationships.
- There is no support for transactional integrity (also known as ACID).
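To make the second point concrete, here is a minimal sketch of what transactional integrity provides in a relational store. It uses Python's built-in sqlite3 module; the accounts table and the transfer, make_db and balances helpers are made up for illustration and are not taken from any system described in this post.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical two-row accounts table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
    conn.commit()
    return conn


def balances(conn):
    """Return {account id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one unit; return True if it committed."""
    try:
        # The connection context manager commits if the block succeeds and
        # rolls back every statement in it if an exception escapes.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            (remaining,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (src,)
            ).fetchone()
            if remaining < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False
```

A failed transfer leaves both balances exactly as they were, even though both UPDATE statements ran before the check. Without transactions, the application has to provide that guarantee itself.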
While all of these are interesting engineering problems, they still lack a crucial component. As a matter of fact, most conversations about Big Data fail to adequately address what is, perhaps, the most important problem with Big Data systems today.
The Intersection of Real Time and Big Data
Today’s big datasets are manageable in RDBMS systems. That being said, a significant amount of complexity is introduced into the management process, most notably:
- Database Sharding and the complexity of managing multiple databases.
- Downtime associated with schema changes, or the complexity of CLOB fields used in an amorphous manner.
Given that, large datasets that change slowly over time – or, more accurately, those that have a relatively low volume of creates (including those that occur as a result of large transformations) compared to reads, updates and deletes – can be managed using RDBMS systems.
Where Real Time – specifically as related to social media, user-generated content and other high-create applications (such as the Large Hadron Collider) – intersects with Big Data is, to me, the most interesting Big Data topic.
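The sharding point can be sketched in a few lines. This assumes the simplest possible scheme, a hash of the record key taken modulo the number of databases; real deployments often use range-based or consistent-hash schemes instead, and the shard_for and moved_keys names are hypothetical.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a record key to a shard index using a stable hash (assumed scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def moved_keys(keys, old_count: int, new_count: int):
    """Return the keys that land on a different shard after resizing."""
    return [
        k for k in keys
        if shard_for(k, old_count) != shard_for(k, new_count)
    ]


if __name__ == "__main__":
    keys = [f"customer-{i}" for i in range(1000)]
    relocated = moved_keys(keys, old_count=4, new_count=5)
    print(f"{len(relocated)} of {len(keys)} keys change shards when going from 4 to 5")
```

Every query now has to be routed through something like shard_for, and with plain modulo hashing, adding one database relocates a large share of existing rows. That is part of the management complexity noted above.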
This model presents three distinct challenges:
- High data arrival – potentially hundreds of thousands of new objects “found” per second.
- High transformation rate.
- Real Time Client updates.
The intersection of all three of these challenges was exactly what we dealt with at justSignal. We needed to consistently collect hundreds of thousands of social media objects per second, generate hundreds of metadata elements (transformations) for each object, and make all of that data available in real time to our client applications. This view of Big Data is slightly different. In this view the size of the dataset – while still significant – isn’t as important as the challenges presented by a very high volume of CRUD operations over very short time slices.
The most important thing I’ve learned is that there is no silver bullet. While the traditional relational database isn’t effective in real time big data scenarios, neither is a standalone Hadoop or distributed key-value store. Essentially, you must evaluate each use case against the suite of technologies available and select the one best suited to that particular use case. Selecting a NoSQL solution for order processing – which has heavy ACID/transactional requirements – isn’t a good idea. However, staying with your RDBMS for high insert/transform data processing isn’t going to work either.
The approach we took at justSignal – which I will go into in more detail in a future post – was to create a unified data persistence layer designed to leverage the right long/short term data store (and sometimes more than one) based on the requirements of the application. This data persistence layer is made up of:
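A rough back-of-the-envelope calculation shows why the write rate matters more than the raw size here. The figures below are illustrative assumptions only (200,000 objects per second and 200 metadata elements per object, each stored as its own write), not measured justSignal numbers; a real system may batch writes and see far fewer.

```python
def writes_per_second(objects_per_second: int, metadata_per_object: int) -> int:
    """One create for the object itself plus one per derived metadata element."""
    return objects_per_second * (1 + metadata_per_object)


def writes_in_slice(rate_per_second: int, slice_ms: int) -> int:
    """Number of writes that fall inside a time slice of `slice_ms` milliseconds."""
    return rate_per_second * slice_ms // 1000


if __name__ == "__main__":
    # Assumed, illustrative inputs.
    rate = writes_per_second(objects_per_second=200_000, metadata_per_object=200)
    print(rate)                       # 40200000
    print(writes_in_slice(rate, 10))  # 402000
```

Under those assumptions the store has to absorb roughly 40 million creates per second, or about 400 thousand in a 10 millisecond slice, before a single read is served.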
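The "evaluate each use case" idea can be sketched as a small decision function. This is only an illustration of the reasoning above, not how any particular product chooses its stores; the Workload fields, the store labels and the 10,000 creates-per-second threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    needs_transactions: bool   # heavy ACID requirements, e.g. order processing
    creates_per_second: int    # sustained rate of new objects


def choose_stores(workload: Workload, high_create_threshold: int = 10_000) -> list:
    """Return the store types a workload points toward (possibly more than one)."""
    stores = []
    if workload.needs_transactions:
        stores.append("rdbms")
    if workload.creates_per_second >= high_create_threshold:
        stores.append("distributed key-value store")
    # Low-volume, non-transactional data is comfortable in a relational database.
    return stores or ["rdbms"]


if __name__ == "__main__":
    print(choose_stores(Workload("order processing", True, 50)))
    print(choose_stores(Workload("social media ingest", False, 200_000)))
```

A workload that is both transactional and high-create comes back with two stores. That fits the "no silver bullet" point: some applications need more than one kind of persistence.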
Each plays a critical role in our ability to collect, transform, process and serve hundreds of thousands of social media mentions per minute.
More to follow…