The time for consolidation in the DB market has come; RethinkDB to shut down.

Too many databases with too little differentiation. The time has come for consolidation – expect more news like RethinkDB shutting down:

[Screenshot: RethinkDB shutdown announcement, via Hacker News]

Document oriented databases will be first – but expect additional consolidation in other segments.

To be clear – we are not heading back to the bad old days of one database (vendor) to rule them all – we will just see definitive winners in the key segments.

Upping your data game with Graph Databases

Photo Credit: hjl on Flickr

Since the late 2000s there has been an explosion of non-relational (NoSQL if you must) data persistence technology. The industry buzz has focused on the derivatives of the seminal work done at Google – i.e., BigTable – and Amazon – i.e., Dynamo.

We’ve seen massive adoption of simple document stores and key-value stores – which focus on availability and partition tolerance, and thereby enable storage and processing of schema-less (or semi-structured) data at velocities and volumes previously considered entirely impractical.

There were – however – compromises. These systems are abysmal at dealing with connections between the data – or more precisely connecting the entities in the data sets with one another in a variety of contexts.

Many of our platforms, systems, applications and services are intended to deal with these types of connections – and unfortunately most engineering teams fall back to relational databases to solve these problems. The problem with this approach, however, is that relational databases are inherently inefficient when performing complex set operations:

The true value of the graph approach becomes evident when one performs searches that are more than one level deep. For instance, consider a search for users who have “subscribers” (a table linking users to other users) in the “311” area code. In this case a relational database has to first look for all the users with an area code in “311”, then look in the subscribers table for any of those users, and then finally look in the users table to retrieve the matching users. In comparison, a graph database would look for all the users in “311”, then follow the back-links through the subscriber relationship to find the subscriber users. This avoids several searches, lookups and the memory involved in holding all of the temporary data from multiple records needed to construct the output. Technically, this sort of lookup is completed in O(log(n)) + O(1) time, that is, roughly relative to the logarithm of the size of the data. In comparison, the relational version would be multiple O(log(n)) lookups plus additional time to join all the data.[3]

Via Wikipedia
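To make the quoted example concrete, here's a minimal Python sketch – plain in-memory dictionaries standing in for the two approaches, with purely illustrative names and data – contrasting the relational-style multi-step lookup with a direct traversal of an adjacency list:

```python
# Toy illustration of the lookup described in the quote above.
# Names, tables and data shapes are illustrative, not tied to any database.

# "Relational" representation: flat tables linked by id.
users = {
    1: {"name": "Ada",   "area_code": "311"},
    2: {"name": "Grace", "area_code": "415"},
    3: {"name": "Alan",  "area_code": "311"},
}
subscribers = [  # (user_id, subscriber_id) rows in a link table
    (1, 2),
    (1, 3),
    (3, 2),
]

def subscribers_relational(area_code):
    # 1) scan users for the area code, 2) scan the link table,
    # 3) look up each matching subscriber row again in users.
    in_area = {uid for uid, u in users.items() if u["area_code"] == area_code}
    sub_ids = {sid for uid, sid in subscribers if uid in in_area}
    return sorted(users[sid]["name"] for sid in sub_ids)

# "Graph" representation: each user node holds direct references
# (an adjacency list) to its subscriber nodes.
adjacency = {1: [2, 3], 3: [2]}

def subscribers_graph(area_code):
    # Find the starting nodes, then follow their edges directly --
    # no link-table scan or second round of lookups required.
    found = set()
    for uid, u in users.items():
        if u["area_code"] == area_code:
            found.update(adjacency.get(uid, []))
    return sorted(users[sid]["name"] for sid in found)

print(subscribers_relational("311"))  # ['Alan', 'Grace']
print(subscribers_graph("311"))       # ['Alan', 'Grace']
```

With toy in-memory data both functions are trivially fast; the point is structural – the graph version never touches a link table or re-fetches rows just to follow a relationship.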

The opportunity to build highly efficient systems by leveraging natural graph models (those that exist in the real world) is massive and dramatically underutilized.

Imagine a CRM system which wasn’t tightly coupled to a rigid account, contact, entitlement hierarchy model. Imagine a student roster system which directly modeled the complexity of sections for teachers, schools, districts and states without a myriad of join tables.

Imagine an entity-attribute-value implementation that allows for performant queries over arbitrary attribute types and combinations.

The only barrier is investing in the education required to understand the efficiencies of graphs and the graph databases available today. I highly recommend you up your game by downloading Neo4j today and beginning to learn the advantages graph databases can provide in your polyglot persistence architectures. If you need any help, let me know – I’m happy to help you and your team model your data and leverage a graph to simplify your system and increase your feature velocity.
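If you want to try it hands-on, here's a minimal sketch using the official neo4j Python driver (pip install neo4j). It assumes a local Neo4j instance reachable at bolt://localhost:7687; the credentials, the User label and the SUBSCRIBES_TO relationship are illustrative assumptions, not a prescribed schema:

```python
# Minimal getting-started sketch with the neo4j Python driver.
# URI, credentials, labels and relationship type below are assumptions
# for illustration only.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two users and a subscription relationship between them.
    session.run(
        "MERGE (a:User {name: $a}) "
        "MERGE (b:User {name: $b}) "
        "MERGE (b)-[:SUBSCRIBES_TO]->(a)",
        a="Ada", b="Grace",
    )

    # Traverse the relationship directly -- no join tables involved.
    result = session.run(
        "MATCH (a:User {name: $a})<-[:SUBSCRIBES_TO]-(s:User) "
        "RETURN s.name AS subscriber",
        a="Ada",
    )
    for record in result:
        print(record["subscriber"])

driver.close()
```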

 

Key to Big Data Success – Data Driven Culture

I’m not sure people always know what they mean when they talk about Big Data – and even when they do, I’m not sure they can contrast this new Big Data thing with Data’s previous incarnation.

So let’s see if we can clear it up.

Prior to big data, the amount and content of the data you had access to were limited – in technical terms you had to deal with a limited information domain. Why? Because obtaining and storing data was expensive and, more importantly, most data was locked up in the real world and never entered the digital (binary data living in computational systems) realm. That obviously has changed.

Photo Credit: Janet McKnight @ Flickr
This flip – from only generating and storing data directly relevant to operating a business to having access to, collecting and storing massive amounts of data which may or may not be relevant to operating a business – is the state change.

The first big problem was tooling. The systems and technologies to collect and store data were designed for the relatively small amounts of strictly modeled data relevant to running a business. Moreover, they were designed to strictly control additions to that data, because additions were expensive. This was the problem we needed to address first – which is why, when we talk about Big Data, we invariably talk about technologies: Hadoop, MongoDB, Spark, Kafka, Storm, Cassandra…

But, for business leaders this is misleading, because implementing any (or all) of those technologies will not make the business effective in a Big Data context. These technologies will not provide you magical data which supercharges your business. You will not suddenly have insights your competitors do not; you will not – overnight – find the clarity required to dominate your market.

The key is to combine those tools and capabilities with data driven practices and culture.

Let’s start by avoiding the mistake made with Big Data – let’s clearly talk about what has changed and why data driven is different from what came before.

I’ve worked with organizations – from startups to enterprises – that have robust reporting and systems of operational metrics they use to run the business. They review reports and dashboards regularly, perform regular operational reviews focused on those metrics and target resources and budget toward those that are underperforming. Invariably they suggest they are already data driven – because they leverage data to run their business.

They are not. They are optimally operating in the pre-Big Data model – where the universe of data was fixed, the metrics long-lived and stable, and information outside that realm unobtainable – those insights beyond reach.

A Data Driven organization still does those things – metrics, operational reviews, targeted investments based on underperforming metrics. But they also leverage the larger universe of data to openly question the validity of those metrics; they develop processes to evaluate that universe for new metrics and insights; they allow the data to lead them to opportunities and the identification of threats.

This practice almost always feels like a radical shift – and it is. Organizations must shift from the practice of only focusing on the known knowns and embrace this new ability to examine and gain insight from the known unknowns and unknown unknowns.

Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.[1]

Rumsfeld’s observation applies equally to businesses.

When these Data Driven processes and practices – extending and augmenting your metrics driven operational practices – become part of the culture, the real value of all that data and all those tools can be realized.

 

Polyglot Persistence – Benefits and Barriers

Photo Credit: Christophe BENOIT

Polyglot persistence is simply the notion that one should leverage multiple data storage technologies chosen based upon the way the data will be used by the application.

In short, use the best tool for the job.

Benefits

  1. Attempting to make a single data store (or database if you prefer) encapsulate all your application contexts breeds complexity. When each context, entity or value object can tune the data store leveraged to the unique requirements of that domain, complexity is reduced and feature velocity is increased.
  2. Polyglot enables in-data-store transformation, materialized views and projections of the data into alternate stores for the purpose of enabling specific application features. Simply put, you can have multiple representations of the same data where and when it is convenient in your application context.
  3. Data store spend is targeted toward the features and contexts in the application which actually require the investment.

Barriers

  1. Joins – perceived complexity due to the inability to create a single “query” joining multiple contexts, entities or value objects.
    1. Understanding the benefits of composition allows us to see this as a false barrier – it is simply an issue of changing from the old way of doing things.
  2. Maintenance cost – expertise and management of multiple data stores adds to the overall cost of operating the application.
    1. In a monolithic data store system extensive effort is put into the “tuning” of the data store. This is always due to either the massive complexity of data stores that try to do everything or the need to make a single data store solve too many disparate persistence models. When we use data stores which are “natural” to the domain, context or entity this overhead is massively reduced.
  3. Developer Complexity – finding and staffing developers who can work with multiple data stores is impossible.
    1. When transforming from a monolithic data store architecture this will absolutely be problematic. However, as your polyglot practice matures this issue will diminish with time.

All of the above relies on having a solid domain driven design and flexible, adaptable architecture for your application.
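To ground the idea, here's a small, hypothetical Python sketch of the pattern: each context gets the store that fits it (simulated below with in-memory stand-ins for a key-value store and a document store), and the application composes a view across stores instead of relying on a single cross-store query – the composition approach noted under barrier 1. The store classes and entities are illustrative assumptions, not a prescribed design:

```python
# Hypothetical polyglot-persistence sketch. The two "stores" below are
# in-memory stand-ins; in a real system they might be Redis and MongoDB,
# chosen per context.

class KeyValueStore:
    """Fits small, hot, lookup-by-key data such as sessions."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class DocumentStore:
    """Fits richer, semi-structured aggregates such as customer profiles."""
    def __init__(self):
        self._docs = {}
    def save(self, doc_id, doc):
        self._docs[doc_id] = doc
    def find(self, doc_id):
        return self._docs.get(doc_id)

sessions = KeyValueStore()
customers = DocumentStore()

sessions.put("session:42", {"customer_id": "c1", "cart": ["sku-7"]})
customers.save("c1", {"name": "Ada", "tier": "gold"})

def checkout_view(session_key):
    # The "join" happens in application code by composition:
    # read the session from one store, then the customer from another.
    session = sessions.get(session_key)
    customer = customers.find(session["customer_id"])
    return {"customer": customer["name"], "tier": customer["tier"], "cart": session["cart"]}

print(checkout_view("session:42"))
# {'customer': 'Ada', 'tier': 'gold', 'cart': ['sku-7']}
```

The repository-style boundary around each store is what keeps each choice cheap to revisit as the domain model evolves.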

Emotions & Facts – Becoming data driven by overcoming bias

Humans are very, very good at rapid pattern recognition. It is the basis of the fight or flight response and of our ability to see past events reflected in current and future situations.

… humans are amazing pattern-recognition machines. They have the ability to recognize many different types of patterns – and then transform these “recursive probabilistic fractals” into concrete, actionable steps …

From: Humans Are the World’s Best Pattern-Recognition Machines, But for How Long?

This fact is leading to a number of advances in AI leveraging big data approaches. It enables us to understand what is happening right now or what might happen in the future based on recognizing patterns found in historical data. And this is good – and bad.

In stable systems – businesses that dominate their markets in particular, but also in political parties, social groups and non-competitive systems – cognitive biases can make your pattern recognition superpower your kryptonite. How? By convincing you that new data – competitors, market behaviors, demographic shifts, and disruptions – are false.

Too often the reaction to these leading indicators is disbelief or even retrenchment. In institutions that lack high quality data driven practices, confirmation and conservatism biases often become the norm, furthering the notion that the old patterns still apply. All too often this results in what appears to be a sudden collapse.

The key to avoiding this fate is to consistently apply solid data driven approaches which negate the biases and our very human tendency to dismiss data that doesn’t conform to our known patterns. Acknowledging the reality of the data “as it is” and attempting to validate the data via consistent, unbiased best practices enables us to recognize changes in the underlying patterns more rapidly.

That ability – to be open to questioning your pattern recognition and the biases inherent in it – can become your real superpower. That ability to be data driven – to continuously evaluate the data to understand reality in an objective way and apply what is learned – is the superpower of enduring, innovative organizations.