Series – Part 2: Serverless Architecture – a practical implementation: IoT Device data collection, processing and user interface.

In part one of this series I briefly discussed the purpose of the application to be built and reviewed the IoT local controller & gateway pattern I’ve deployed. To recap, I have a series of IP cameras deployed and configured to send (via FTP) images and videos to a central controller (Raspberry Pi 3 Model B). The controller processes those files as they arrive and pushes them to Amazon S3. The code for the controller process can be found on GitHub.
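The real controller code is in that GitHub repository; purely as a hedged sketch of the push-to-S3 step (the bucket name, directory layout and naive polling loop below are my assumptions, not the actual implementation), it amounts to something like this:

```python
# Sketch only: watch a local FTP drop directory and push new files to S3.
# Bucket name, paths and the polling approach are hypothetical; the real
# controller code lives in the GitHub repository referenced above.
import os
import time
import boto3

BUCKET = "my-camera-archive"      # hypothetical bucket
DROP_DIR = "/var/ftp/cameras"     # hypothetical FTP landing directory

s3 = boto3.client("s3")

def push_new_files(seen):
    for root, _dirs, files in os.walk(DROP_DIR):
        for name in files:
            path = os.path.join(root, name)
            if path in seen:
                continue
            key = os.path.relpath(path, DROP_DIR)   # keep the camera/file layout as the S3 key
            s3.upload_file(path, BUCKET, key)
            seen.add(path)

if __name__ == "__main__":
    seen = set()
    while True:
        push_new_files(seen)
        time.sleep(5)   # simple polling; inotify would be the less naive choice
```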

In this post we will move on to the serverless processing of the videos when they arrive in S3.
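As a preview of the shape that takes – the handler below is a hypothetical sketch, not the implementation we’ll build later in the series – an AWS Lambda function subscribed to the bucket’s ObjectCreated events receives records like this:

```python
# Sketch only: an AWS Lambda handler invoked by an S3 ObjectCreated event.
# The actual processing pipeline is covered later in the series.
import urllib.parse

def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Placeholder for the real work: transcode, thumbnail, index metadata, etc.
        print(f"New object arrived: s3://{bucket}/{key}")
```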


The future of data is still Polyglot…

… but vendors will fight you every step of the way.

Remember – every vendor wants to get as much of your data into their database as possible. And every data approach (Relational, Document, Key/Value Store and Graph) can be forced to do just about anything; given the opportunity, the vendor will tell you their solution is the right choice.

The challenge for the modern Enterprise Data Architect is maintaining a cogent point of view about assembling a polyglot solution to make each use case (or microservice) less complex, more scalable and easier to improve/enrich over time.

Series – Part 1: Serverless Architecture – a practical implementation: IoT Device data collection, processing and user interface.

 

AWS Lambda

Serverless architectures are getting a lot of attention lately – and for good reason. I won’t rehash the definition of the architecture because Mike Roberts did a fine (and exhaustive) job over at MartinFowler.com.

However, practical illustrations of patterns and implementations are exceptionally hard to find. This series of posts will attempt to close that gap by providing the purpose, design and implementation of a complete serverless application on Amazon Web Services.

Part 1 – The setup…

Every application needs a reason to exist – so before we dive into the patterns and implementation we should first discuss the purpose of the application.

Nest wants how much for cloud storage and playback?

I have 14 security cameras deployed, each of which captures video and still images when motion is detected. These videos and images are stored on premises – but getting them to “the cloud” is a must-have; after all, if someone breaks in and takes the drive they are stored on, all the evidence is gone.

If I were to swap all of the cameras out for Nest cameras, cloud storage and playback would cost $2,250/year – clearly this can be done more cheaply… so…


Key to Big Data Success – Data Driven Culture

I’m not always sure people know what they mean when they talk about Big Data – and even when they do, I’m not sure they can contrast this new Big Data thing with Data’s previous incarnation.

So let’s see if we can clear it up.

Prior to big data, the amount and content of the data you had access to were limited – in technical terms you had to deal with a limited information domain. Why? Because obtaining and storing data was expensive and, more importantly, most data was locked up in the real world and never entered the digital (binary data living in computational systems) realm. That obviously has changed.

Photo Credit: Janet McKnight @ Flickr
This flip – from only generating and storing data directly relevant to operating a business to having access to, collecting and storing massive amounts of data which may or may not be relevant to operating a business – is the state change.

The first big problem was tooling. The systems and technologies to collect and store data were designed for the relatively small amounts of strictly modeled data relevant to running the business. Moreover, they were designed to strictly control additions to that data, because adding to it was expensive. This was the problem we needed to address first – which is why when we talk about Big Data we invariably talk about technologies – Hadoop, MongoDB, Spark, Kafka, Storm, Cassandra…

But, for business leaders this is misleading, because implementing any (or all) of those technologies will not make the business effective in a Big Data context. These technologies will not provide you magical data which supercharges your business. You will not suddenly have insights your competitors do not; you will not – overnight – find the clarity required to dominate your market.

The key is to combine those tools and capabilities with data driven practices and culture.

Let’s start by avoiding the mistake made with Big Data – let’s clearly talk about what has changed and why data driven is different from what came before.

I’ve worked with organizations – from startups to enterprises – that have robust reporting and systems of operational metrics they use to run the business. They review reports and dashboards regularly, perform regular operational reviews focused on those metrics and target resources and budget toward those that are underperforming. Invariably they suggest they are already data driven – because they leverage data to run their business.

They are not. They are optimally operating in the pre-Big Data model – where the universe of data was fixed, the metrics long-lived and stable, and information outside that realm unobtainable – those insights beyond reach.

A Data Driven organization still does those things – metrics, operational reviews, targeted investments based on underperforming metrics. But they also leverage the larger universe of data to openly question the validity of those metrics; they develop processes to evaluate that universe for new metrics and insights; they allow the data to lead them to opportunities and the identification of threats.

This practice almost always feels like a radical shift – and it is. Organizations must shift from the practice of only focusing on the known knowns and embrace this new ability to examine and gain insight from the known unknowns and unknown unknowns.

Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.[1]

Rumsfeld’s observation applies equally to businesses.

When these Data Driven processes and practices – extending and augmenting your metrics-driven operational practices – become part of the culture, the real value of all that data and all those tools can be realized.

 

Big Data – Storage Isn’t Enough

We should have seen it coming. When we stopped even thinking about how we store data for our applications, when we just assumed some DBA would give us a database – and some SysAdmin would give us a file system. Sure, we can talk about W-SAN (what WLAN was to the LAN, but for storage) solutions like Amazon S3 and Rackspace Cloud – but they didn’t fundamentally change anything.

Big Data forces us to re-think storage completely. Not just structured/unstructured, relational/non-relational, ACID compliance or not. It forces us – at the application level – to rethink the current model exemplified by

I’m storing this because I may need it again in the future.

Where storage means physical, state aware object persistence and future means anywhere between now and the end of time.

Data Persistence – A Systemic Approach to Big Data for Applications

What Big Data applications require is a systemic approach to data. Instead of applications approaching data as only a set of if/then operations designed to determine what (if any) CRUD operations to perform, it demands that applications (or supporting Data Persistence Layers) understand the nature of the data persistence required.

This is a level of complexity developers have been trained to ignore. The CRUD model itself explicitly excludes any dimensionality – or meta information – about the persistence. It is all or nothing.

Data Persistence is primarily the idea that data isn’t just stored – it is stored for a specific purpose which is relevant within a specific time slice. These time slices are entirely analogous to those discussed in Preemption. Essentially, any sufficiently large real time Big Data system is simply a loosely aggregated computer system in which any data object may generate multiple tasks, each of which has a specific priority.

For example, in a geo-location game (Foursquare) the appearance of a new checkin generates multiple tasks, prioritized by purpose:

  1. Store the checkin to distribute to “friends” (real-time)
  2. Store the checkin’s association with the venue (real-time)
  3. Analyze nearby “friends” (real-time)
  4. Determine any game mechanics, badges, awards, etc
  5. Store the checkin on the user’s activity
  6. Store the checkin object

NOTE: Many developers will look at this list above and ask: “Why not a database?” While a traditional database may suffice for a relatively low volume system (5k users, 20k checkins per day) it would not be sufficient at Big Data scale (as discussed here).
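To make the prioritization concrete, here is a rough sketch of fanning a checkin out into the tasks in the list above – the priorities, task names and in-process priority queue are hypothetical stand-ins for real workers and brokers, not Foursquare’s (or justSignal’s) actual implementation:

```python
# Sketch only: fan a single checkin out into prioritized tasks.
# Priorities and task names are hypothetical illustrations of the list above.
import heapq

REALTIME, NEAR_REALTIME, BATCH = 0, 1, 2   # lower number = higher priority

def tasks_for_checkin(checkin):
    return [
        (REALTIME,      "notify_friends",         checkin),
        (REALTIME,      "associate_with_venue",   checkin),
        (REALTIME,      "analyze_nearby_friends", checkin),
        (NEAR_REALTIME, "apply_game_mechanics",   checkin),
        (BATCH,         "append_user_activity",   checkin),
        (BATCH,         "store_checkin_object",   checkin),
    ]

def drain(queue):
    while queue:
        priority, task, payload = heapq.heappop(queue)
        print(priority, task, payload["user"])   # a real system hands these to workers

queue = []
for task in tasks_for_checkin({"user": "alice", "venue": "home"}):
    heapq.heappush(queue, task)
drain(queue)
```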

This Data Persistence solution comprises four vertical persistence types:

Big Data, Real Time Data Persistence

Transitory

Transitory persistence is for data persisted only long enough to perform some specific unit of work. Once the unit of work is completed the data is no longer required and can be expunged. For example: notifying my friends (who want to be notified) that I’m at home.

Generally speaking (and this can vary widely by use case) Transitory persistence must be atomic, extremely fast and fault tolerant.
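Purely as a hedged sketch of the idea – in production the queue would be a broker like beanstalkd, not an in-process queue, and the names below are hypothetical – transitory persistence looks like a work-queue entry that disappears once the notification is sent:

```python
# Sketch only: transitory persistence as a work-queue entry that is
# discarded once the unit of work (notifying friends) completes.
from queue import Queue

notifications = Queue()

def enqueue_checkin_notification(user, friends):
    notifications.put({"user": user, "friends": friends})   # persisted only until handled

def worker():
    while not notifications.empty():
        job = notifications.get()
        for friend in job["friends"]:
            print(f"notify {friend}: {job['user']} checked in")
        notifications.task_done()   # after this, the data is gone by design

enqueue_checkin_notification("alice", ["bob", "carol"])
worker()
```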

Volatile

Volatile persistence is for data that is useful but can be lost and rebuilt at any time. Object caching (how memcached is predominantly used) is one type of Volatile persistence, but does not describe the entire domain. Other examples of volatile data include process orchestration data, data used to calculate decay for API Rate Limits, data arrival patterns (x/second over the last 30 seconds), etc.

The most important factor for Volatile data persistence is that the data can be rebuilt from normal operations or from long term data storage if it is not found in the Volatile dataset.

Generally speaking, data is stored in Volatile persistence because it offers superior performance, but limited dataset size.
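A minimal cache-aside sketch of that rebuild rule – a plain dict stands in for memcached, the long-term store is a stub, and the names are hypothetical:

```python
# Sketch only: volatile persistence with the cache-aside rebuild rule above.
cache = {}   # volatile: may be emptied at any time without data loss

def load_from_long_term_store(key):
    # Stub for the durable system of record (MySQL, Cassandra, etc.).
    return {"id": key, "rebuilt": True}

def get(key):
    value = cache.get(key)
    if value is None:                              # not in the volatile dataset...
        value = load_from_long_term_store(key)     # ...so rebuild it
        cache[key] = value
    return value

print(get("venue:42"))   # first call rebuilds, later calls hit the cache
```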

ACID

Relational databases and atomicity, consistency, isolation and durability (ACID) are not obsolete. For specific types of operations – done for specific purposes – it is important to maintain transactional compliance and ensure the entire transaction either succeeds in an atomic way or fails. Examples of this include eCommerce transactions, Account Signup, Billing Information updates, etc.

Generally speaking, this data complies with the old rules of data. It is created/updated slowly over any given time slice, it is read periodically, there is little need to publish the information across a large group of subscribers, etc.
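As a hedged illustration of that all-or-nothing behavior – using Python’s built-in sqlite3 purely for convenience; the table, amounts and account names are hypothetical – a billing-style update looks like this:

```python
# Sketch only: an atomic billing-style update; either both statements
# commit or neither does. sqlite3 is used purely for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('buyer', 100), ('merchant', 0)")

try:
    with conn:   # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 25 WHERE id = 'buyer'")
        conn.execute("UPDATE accounts SET balance = balance + 25 WHERE id = 'merchant'")
except sqlite3.Error:
    pass   # the whole transaction failed atomically

print(dict(conn.execute("SELECT id, balance FROM accounts")))
```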

Amorphous

Amorphous persistence is the new kid on the block. NoSQL solutions fit nicely here. This non-volatile storage is amorphous in that the content (think property, not property value) of any object can change at any time. There is no schema, table structure or enforced relationship model. I think of this data persistence model as raw object storage, derived object storage and the transformed data that forms the basis of what Jeff Jonas refers to as Counting Systems. Additionally, these systems store data in application-consumable objects – with those objects being created on the way in.

Systems in this layer are generally highly scalable, fault tolerant, distributed systems with enhanced write efficiency. They offer the ability to perform the high volume writes required in real time Big Data systems without significant loss of read performance.
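A loose sketch of what application-consumable objects with no enforced schema mean in practice – a dict stands in for the actual document/wide-column store, and the fields are hypothetical:

```python
# Sketch only: amorphous persistence - objects with no fixed schema,
# stored in the shape the application consumes them.
store = {}

def put(object_id, obj):
    store[object_id] = obj   # any properties, added or dropped at any time

put("checkin:1", {"user": "alice", "venue": "home", "ts": 1697000000})
put("checkin:2", {"user": "bob", "venue": "park", "ts": 1697000100,
                  "badges": ["early-bird"]})   # new property, no schema change

# A derived, "counting system"-style object, written in consumable form:
put("venue:home:counts", {"checkins": 1, "unique_users": 1})
```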

What Does All This Mean?

Most notably it means that, after years of obfuscating the underlying data storage from developers, we now need to re-engage application developers in the data storage conversation. No longer can a DBA define the most elegant data model based on the “I’m storing this because I may need it again in the future.” model and expect it to function in the context of a real time Big Data application.

We will hear a chorus of voices attacking these disaggregated data persistence models based on complexity, or the CAP Theorem, or the standard “the old way is the best way” defense of ACID and the RDBMS for everything. But all of this strikes me as a perfect illustration of what Henry Ford said:

If I had asked customers what they wanted, they would have told me they wanted a faster horse

Talking Big Data with IBM’s Jeff Jonas

I saw this on TechCrunch earlier today and thought it was an awesome addition to my Big Data series. Jeff Jonas is clearly a big thinker and I agree with almost everything he says. The only thing I take issue with is the recurring theme in this interview that Big Data is primarily about commerce – and specifically ad targeting.

In my next post I’ll be talking about Big Data for One – which I think Jeff hints at but never fully develops.

Part 1: What is Data?

http://player.ooyala.com/player.js?deepLinkEmbedCode=JnZXZyMTpY5njnZyFbtL6owNeHSZaStK&width=630&height=354&embedCode=JnZXZyMTpY5njnZyFbtL6owNeHSZaStK

Part 2: Why data makes us more ignorant.

Data is actually evidence of something you already knew but failed to act on. Amnesia.

http://player.ooyala.com/player.js?deepLinkEmbedCode=ZzZXZyMTrsb27oUZZDutt7A-TqGdPlXo&width=630&height=354&embedCode=ZzZXZyMTrsb27oUZZDutt7A-TqGdPlXo

Part 3: Why Big Data is the next big thing.

From pixels to pictures… This agrees with my idea of Big Data – it isn’t about the size of the dataset, but about using pieces of data in context, by understanding that context.

Why he goes to ad-based examples is beyond me, however…

http://player.ooyala.com/player.js?deepLinkEmbedCode=s4ZnZyMTrtWTaKSxWF2WEPPXkBtMjZc3&width=630&height=354&embedCode=s4ZnZyMTrtWTaKSxWF2WEPPXkBtMjZc3

Part 4: How data makes us average.

 

Very similar to the points I made here: Living In Public – Facebook, Privacy and Frictionless Distribution
http://player.ooyala.com/player.js?deepLinkEmbedCode=05ZnZyMToC-qRTChMHxO9jsDjOcJFdjo&width=630&height=354&embedCode=05ZnZyMToC-qRTChMHxO9jsDjOcJFdjo

Why the future is irresistible.

http://player.ooyala.com/player.js?deepLinkEmbedCode=JnZnZyMTpU1XeIWbGdtOrD96fWvhbDX6&width=630&height=354&embedCode=JnZnZyMTpU1XeIWbGdtOrD96fWvhbDX6

Big Data – Spike Volume

One aspect of Big Data is the arrival pattern of the data being stored/analyzed.

Yesterday’s Rangers win in the ALCS is an excellent example of this. At justSignal we monitor all of the MLB teams, and when the Rangers won we experienced a phenomenal spike in data arrival. Right now – which is fairly representative of normal volumes – we are receiving 7.87 Tweets per Second from Twitter. Last night we peaked out at 43.5 Tweets per Second.

Here is what that looks like graphically:

Yankees Loss ALCS Tweets

Texas Rangers ALCS Win Tweets

Managing that kind of data arrival while continuing to generate metadata dimensions (transformations) and deliver data to clients in real time is no small task.
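Purely as an illustration of how a rolling tweets-per-second figure like the ones above might be tracked (the 30-second window and names are my assumptions, not justSignal’s implementation), a sliding-window counter is enough:

```python
# Sketch only: a rolling tweets-per-second counter over a 30-second window.
import time
from collections import deque

WINDOW_SECONDS = 30
arrivals = deque()   # timestamps of recently arrived objects

def record_arrival(ts=None):
    arrivals.append(ts if ts is not None else time.time())

def rate_per_second(now=None):
    now = now if now is not None else time.time()
    while arrivals and arrivals[0] < now - WINDOW_SECONDS:
        arrivals.popleft()               # drop timestamps outside the window
    return len(arrivals) / WINDOW_SECONDS

for _ in range(236):                     # 236 arrivals over a 30s window ~ 7.87/s
    record_arrival()
print(round(rate_per_second(), 2))
```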

Big Data and Real Time – What I’ve Learned

I’ve been thinking about writing a series of posts about Big Data for months now… which is entirely too much thinking and not enough doing, so here we go.

And, by Big Data we mean…

Wikipedia offers a very computer-science-oriented explanation of Big Data – but while the size of the dataset is a factor in Big Data, there are several others worth considering:

  • The arrival pattern (how data is created in the large dataset)
  • Dimensions (how the data is grouped, sub-grouped and organized for analysis)
  • Transformation (how the data is re-dimensioned or transformed to meet new analysis needs)
  • Consumption (how will the data be consumed by client applications and ultimately humans)

Classically, Big Data was regarded as extremely large datasets, usually acquired over a long period of time, involving historical data leveraged for predictive analysis of future events. In the simplest terms, we use hundreds of years of historical weather data to predict future weather. Big Data isn’t new; my first exposure to these concepts was in the early 1990s, dealing with Contact Center telecom and CTI (computer telephony integration) analytics and predictive analysis of future staffing needs in large-footprint (3,000 to 5,000 agent) contact centers.

Another example of the classic Big Data problem is the traditional large operational dataset. In the 90s a 2 Terabyte dataset – like the one I worked with at Bank of America – was so massive that it created the need for a special computing solution. Today large operational datasets are measured in Petabytes and are becoming fairly common. The primary challenge with these types of datasets is maintaining acceptable operational performance for the standard CRUD operations.

There are Big Data storage models emerging and maturing today, with the two default options being Hadoop – which relies on a distributed file system – and distributed key/value and document store systems such as Cassandra and CouchDB. These systems (referred to as NoSQL solutions) differ from standard RDBMS systems in two important ways:

  • The underlying organization of data is amorphous and does not implement relationships.
  • There is no support for Transactional Integrity (also known as ACID).

While all of these are interesting engineering problems, they still lack a crucial component. As a matter of fact, most conversations about Big Data fail to adequately address what is, perhaps, the most important problem with Big Data systems today.

The Intersection of Real Time and Big Data

Today’s big datasets are manageable in RDBMS systems. That being said, a significant amount of complexity is inserted into the management process, most notably:

  • Database Sharding and the complexity of managing multiple databases.
  • Downtime associated with Schema changes or the complexity of CLOB fields used in an amorphous manner.

Given that, large datasets that change slowly over time – or, more accurately, those that have a relatively low volume of creates (including those that occur as a result of large transformations) as compared to read, update and delete operations – can be managed using RDBMS systems.
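To illustrate the sharding complexity mentioned in the first bullet above: the routing itself can be a one-line hash, but every query, backup and schema change then has to be applied per shard. This is a hypothetical sketch (shard count and connection strings invented), not a recommendation:

```python
# Sketch only: hash-based shard routing. The routing is trivial; the ongoing
# complexity is that every query, backup and schema change must be applied
# per shard. Shard count and connection strings are hypothetical.
import hashlib

SHARDS = [
    "mysql://db0.internal/app",
    "mysql://db1.internal/app",
    "mysql://db2.internal/app",
    "mysql://db3.internal/app",
]

def shard_for(user_id: str) -> str:
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("alice"))   # every read/write for alice must target this shard
```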

Where Real Time – specifically as related to Social Media, user-generated content and other high-create applications (such as the Large Hadron Collider) – intersects with Big Data is, to me, the most interesting Big Data topic.

This model presents three distinct challenges:

  • High data arrival – potentially hundreds of thousands of new objects “found” per second.
  • High transformation rate.
  • Real Time Client updates.

The intersection of all three of these challenges was exactly what we dealt with at justSignal. We needed to consistently collect hundreds of thousands of Social Media objects per second, generate hundreds of metadata elements (transformations) for each object, and make all of that data available in real time to our client applications. This view of Big Data is slightly different. In this view the size of the dataset – while still significant – isn’t as important as the challenges presented by a very high volume of CRUD operations over very short time slices.

The most important thing I’ve learned is that there is no silver bullet. While the traditional relational database isn’t effective in real time Big Data scenarios, neither is a standalone Hadoop or Distributed Key Value Store. Essentially, you must evaluate each use case against the suite of technologies available and select the one best suited to that particular use case. Selecting a NoSQL solution for order processing – which has heavy ACID/transactional requirements – isn’t a good idea. However, staying with your RDBMS for high insert/transform data processing isn’t going to work either.

The approach we took at justSignal – which I will go into in more detail in a future post – was to create a unified data persistence layer designed to leverage the right long/short term data store (and sometimes more than one) based on the requirements of the application. This data persistence layer is made up of:

  • MySQL
  • memcached
  • Cassandra
  • Beanstalkd

Each plays a critical role in our ability to collect, transform, process and serve hundreds of thousands of social media mentions per minute.
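I’ll describe the real layer in that future post; purely as a hypothetical sketch of the idea, the layer routes each operation to one or more of those stores based on the persistence type the caller declares (the store clients below are stubs, and the routing table is an illustration, not our actual mapping):

```python
# Sketch only: a unified persistence layer that routes writes by declared
# persistence type. Store clients are stubbed; the real layer is described
# in a future post.
TRANSITORY, VOLATILE, ACID, AMORPHOUS = "transitory", "volatile", "acid", "amorphous"

class PersistenceLayer:
    def __init__(self, beanstalk, memcache, mysql, cassandra):
        self.routes = {
            TRANSITORY: [beanstalk],            # work queues
            VOLATILE:   [memcache],             # rebuildable cache data
            ACID:       [mysql],                # transactional records
            AMORPHOUS:  [cassandra, memcache],  # raw/derived objects, plus cache
        }

    def store(self, persistence_type, key, value):
        for backend in self.routes[persistence_type]:
            backend.put(key, value)             # each stub exposes a put()

class StubStore:
    def __init__(self, name):
        self.name = name
    def put(self, key, value):
        print(f"{self.name} <- {key}")

layer = PersistenceLayer(StubStore("beanstalkd"), StubStore("memcached"),
                         StubStore("mysql"), StubStore("cassandra"))
layer.store(AMORPHOUS, "mention:123", {"text": "..."})
```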

More to follow…