Person Recognition in Images with OpenCV & Neo4j

Time for an update on my ongoing person identification in images project; for all the background you can check out these previous posts:

Analyzing AWS Rekognition Accuracy with Neo4j

AWS Rekognition Graph Analysis – Person Label Accuracy

Person Recognition: OpenCV vs. AWS Rekognition

In my earlier serverless series I discussed and provided code for getting images into S3 and processed by AWS Rekognition – including storing the Rekognition label data in DynamoDB.

This post builds on all of those concepts.

In short – I’ve been collecting comparative data on person recognition using AWS Rekognition and OpenCV and storing that data in Neo4j for analysis.
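To make the comparison concrete, here is a minimal sketch of how results from both engines might be recorded in Neo4j using the official Python driver. The node labels, relationship type and property names (Image, Engine, DETECTED, s3Key and so on) are my own illustrative assumptions, not the exact graph model used in the project.

```python
# Minimal sketch: record one detection result per (image, engine) pair so the
# two engines can be compared later. Labels and properties are assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

RECORD_RESULT = """
MERGE (i:Image {s3Key: $s3_key})
MERGE (e:Engine {name: $engine})
MERGE (i)-[d:DETECTED]->(e)
SET d.personFound = $person_found,
    d.confidence  = $confidence,
    d.processedAt = timestamp()
"""

def record_result(s3_key, engine, person_found, confidence):
    with driver.session() as session:
        session.run(RECORD_RESULT, s3_key=s3_key, engine=engine,
                    person_found=person_found, confidence=confidence)

# Hypothetical results for the same image from both engines.
record_result("alarms/cam1/0001.jpg", "rekognition", True, 98.7)
record_result("alarms/cam1/0001.jpg", "opencv", True, 0.9)
```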

Continue reading “Person Recognition in Images with OpenCV & Neo4j”

AWS Rekognition Graph Analysis – Person Label Accuracy

Last week I wrote a post evaluating AWS Rekognition accuracy in finding people in images. The analysis was performed using the Neo4j graph database.

As I noted in the original post – Rekognition is either very confident it has identified a person or not confident at all. This leads to an enormous number of false negatives. Today I looked at the distribution of confidence for the Person label over the last 48 hours.

You be the judge:

[Chart: Rekognition Person label confidence distribution]

Check out the original post to see how the graph is created and constantly updated as images are created in the serverless IoT processing system.
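For a sense of how that distribution can be pulled out of the graph, here is a sketch of a bucketed Cypher aggregation run from Python. The Image and Label node labels, the HAS_LABEL relationship and its confidence/createdAt properties are assumptions for illustration; the real model is described in the original post.

```python
# Sketch: bucket Person-label confidence into 10%-wide bins for the last 48 hours.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

DISTRIBUTION = """
MATCH (:Image)-[h:HAS_LABEL]->(:Label {name: 'Person'})
WHERE h.createdAt > timestamp() - (48 * 60 * 60 * 1000)
RETURN toInteger(h.confidence / 10) * 10 AS bucket, count(*) AS images
ORDER BY bucket
"""

with driver.session() as session:
    for record in session.run(DISTRIBUTION):
        print(f"{record['bucket']}-{record['bucket'] + 9}%: {record['images']} images")
```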

Analyzing AWS Rekognition Accuracy with Neo4j

As an extension of my series of posts on handling IoT security camera images with a Serverless architecture, I’ve extended the capability to integrate AWS Rekognition:

Amazon Rekognition is a service that makes it easy to add image analysis to your applications. With Rekognition, you can detect objects, scenes, and faces in images. You can also search and compare faces. Rekognition’s API enables you to quickly add sophisticated deep learning-based visual search and image classification to your applications.

My goal is to identify images that have a person in them to limit the number of images someone has to browse when reviewing the security camera alarms (security cameras detect motion – so often you get images that are just wind motion in bushes, or headlights on a wall).
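A rough sketch of that filter using boto3 is below; it assumes the images already live in S3, the bucket and key are placeholders, and the 50% floor is an arbitrary starting point.

```python
# Sketch: keep an alarm image for review only if Rekognition reports a Person label.
import boto3

rekognition = boto3.client("rekognition")

def image_has_person(bucket, key, min_confidence=50.0):
    """Return (found, confidence) for the Rekognition 'Person' label."""
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MinConfidence=min_confidence,
    )
    for label in response["Labels"]:
        if label["Name"] == "Person":
            return True, label["Confidence"]
    return False, 0.0

found, confidence = image_has_person("security-cam-images", "cam1/0001.jpg")
if found:
    print(f"Person detected ({confidence:.1f}%) - keep this image for review")
```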

Continue reading “Analyzing AWS Rekognition Accuracy with Neo4j”

The time for consolidation in the DB market has come; RethinkDB to shut down.

Too many databases with too little differentiation. The time has come for consolidation – including more news like RethinkDB shutting down:

[Screenshot of the RethinkDB shutdown announcement, via HackerNews]

Document oriented databases will be first – but expect additional consolidation in other segments.

To be clear – we are not heading back to the bad old days of one database (vendor) to rule them all – we will just see definitive winners in the key segments.

Key to Big Data Success – Data Driven Culture

I’m not sure people always know what they mean when they talk about Big Data – and even when they do know, I’m not sure they can contrast this new Big Data thing with Data’s previous incarnation.

So let’s see if we can clear it up.

Prior to big data, the amount and content of the data you had access to were limited – in technical terms, you had to deal with a limited information domain. Why? Because obtaining and storing data was expensive and, more importantly, most data was locked up in the real world and never entered the digital (binary data living in computational systems) realm. That obviously has changed.

Photo Credit: Janet McKnight @ Flickr
This flip – from only generating and storing data directly relevant to operating a business to having access to, collecting and storing massive amounts of data which may or may not be relevant to operating a business – is the state change.

The first big problem was tooling. The systems and technologies to collect and store data were designed for the relatively small amounts of strictly modeled data relevant to running our business. Moreover, they were designed to strictly control adding to it, because that was expensive. This was the problem we needed to address first – which is why when we talk about Big Data we invariably talk about technologies – Hadoop, MongoDB, Spark, Kafka, Storm, Cassandra…

But, for business leaders this is misleading, because implementing any (or all) of those technologies will not make the business effective in a Big Data context. These technologies will not provide you magical data which supercharges your business. You will not suddenly have insights your competitors do not; you will not – overnight – find the clarity required to dominate your market.

The key is to combine those tools and capabilities with data driven practices and culture.

Let’s start by avoiding the mistake made with Big Data – let’s clearly talk about what has changed and why data driven is different than what came before.

I’ve worked with organizations – from startups to enterprises – that have robust reporting and systems of operational metrics they use to run the business. They review reports and dashboards regularly, perform regular operational reviews focused on those metrics and target resources and budget toward those that are underperforming. Invariably they suggest they are already data driven – because they leverage data to run their business.

They are not. They are optimally operating in the pre-Big Data model – where the universe of data was fixed, the metrics long lived and stable and information outside that realm unobtainable – those insights beyond reach.

A Data Driven organization still does those things – metrics, operational reviews, targeted investments based on underperforming metrics. But they also leverage the larger universe of data to openly question the validity of those metrics; they develop processes to evaluate that universe for new metrics and insights; they allow the data to lead them to opportunities and the identification of threats.

This practice almost always feels like a radical shift – and it is. Organizations must shift from the practice of only focusing on the known knowns and embrace this new ability to examine and gain insight from the known unknowns and unknown unknowns.

Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.[1]

Rumsfeld’s observation applies equally to businesses.

When these Data Driven processes and practices, extending and augmenting your metrics driven operational practices, become part of the culture, the real value of all that data and all those tools can be realized.

 

Polyglot Persistence – Benefits and Barriers

Photo Credit: Christophe BENOIT

Polyglot persistence is simply the notion that one should leverage multiple data storage technologies chosen based upon the way the data will be used by the application.

In short, use the best tool for the job.

Benefits

  1. Attempting to make a single data store (or database if you prefer) encapsulate all your application contexts breeds complexity. When each context, entity or value object can tune the data store leveraged to the unique requirements of that domain, complexity is reduced and feature velocity is increased.
  2. Polyglot enables in-data-store transformation, materialized views and projections of the data into alternate stores for the purpose of enabling specific application features. Simply put, you can have multiple representations of the same data where and when it is convenient in your application context.
  3. Data store spend is targeted toward the features and contexts in the application which actually require the investment.

Barriers

  1. Joins – perceived complexity due to the inability to create a single “query” joining multiple contexts, entities or value objects.
    1. Understanding the benefits of composition allows us to see this as a false barrier – it is simply an issue of changing from the old way of doing things.
  2. Maintenance cost – expertise and management of multiple data stores adds to the overall cost of operating the application.
    1. In a monolithic data store system extensive effort is put into the “tuning” of the data store. This is always due to either the massive complexity of data stores that try to do everything or the need to make a single data store solve too many disparate persistence models. When we use data stores which are “natural” to the domain, context or entity this overhead is massively reduced.
  3. Developer Complexity – finding and staffing developers that can work with multiple data stores is impossible.
    1. When transforming from a monolithic data store architecture this will absolutely be problematic. However, as your polyglot practice matures this issue will diminish with time.

All of the above relies on having a solid domain driven design and flexible, adaptable architecture for your application.
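As a small illustration of the composition point in the Barriers list, the sketch below "joins" two contexts in the application layer rather than in a single query. SQLite stands in for the relational store and a plain dict stands in for a key-value store such as Redis; the customer and activity entities are made up for the example.

```python
# Illustrative sketch of "joins as composition" across two stores.
import sqlite3

relational = sqlite3.connect(":memory:")
relational.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
relational.execute("INSERT INTO customers VALUES (1, 'Ada', 'ada@example.com')")

# Key-value store holding a denormalized projection of recent activity.
activity_store = {"customer:1:recent_orders": ["order-1001", "order-1002"]}

def customer_summary(customer_id):
    # "Join" the two contexts in the application layer instead of in a single query.
    row = relational.execute(
        "SELECT name, email FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    recent = activity_store.get(f"customer:{customer_id}:recent_orders", [])
    return {"name": row[0], "email": row[1], "recent_orders": recent}

print(customer_summary(1))
```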

Emotions & Facts – Becoming data driven by overcoming bias

Humans are very, very good at rapid pattern recognition. It is the basis of the fight or flight response and is rooted in our ability to see past events in current and future situations.

… humans are amazing pattern-recognition machines. They have the ability to recognize many different types of patterns – and then transform these “recursive probabilistic fractals” into concrete, actionable steps …

From: Humans Are the World’s Best Pattern-Recognition Machines, But for How Long?

This fact is leading to a number of advances in AI leveraging big data approaches. It enables us to understand what is happening right now or what might happen in the future based on recognizing patterns found in historical data. And this is good – and bad.

In stable systems – businesses that dominate their markets in particular, but also in political parties, social groups and non-competitive systems – cognitive biases can make your pattern recognition superpower your kryptonite. How? By convincing you that new data – competitors, market behaviors, demographic shifts, and disruptions – are false.

Too often the reaction to these leading indicators is disbelief or even retrenchment. In institutions that lack high quality data driven practices, confirmation and conservatism biases often become the norm, furthering the notion that the old patterns still apply. All too often this results in what appears to be a sudden collapse.

The key to avoiding this fate is to consistently apply solid data driven approaches which negate the biases and our very human tendency to dismiss data that doesn’t conform to our known patterns. Acknowledging the reality of the data “as it is” and attempting to validate the data via consistent, unbiased best practices enables us to recognize changes in the underlying patterns more rapidly.

That ability – to be open to questioning your pattern recognition and the biases inherent in it – can become your real superpower. That ability to be data driven – to continuously evaluate the data, understand reality in an objective way and apply what is learned – is the superpower of enduring, innovative organizations.

 

 

Big Data – Empowering the Age of Agile Analytics

Big data is a buzzword, no question. Given that, it is incumbent on practitioners – in particular architects – to tie the new patterns available in a “big data” enabled infrastructure to practical business benefits.

While there are a variety of business benefits enabled by big data infrastructure, the single most tangible is Agile Analytics (also known as self-service BI and data discovery and exploration). Here’s why:

1) Your business users never wanted reports.

What they really wanted was to be able to leverage data to answer questions. Traditional BI infrastructure did that well, provided you knew what questions you wanted answered in advance.

The problem is, you don’t. The world moves too fast to create a set of KPIs and instrument your business by those alone for 10 years.

2) Data Driven decision making requires empowered business users.

Business users must be empowered to use the data directly – without intermediation by technical staff – in order to realize the benefit of data driven decision making.

This isn’t to say the technical staff doesn’t have a role – they do. They provide the platform and advanced support enabling business users to use data directly.

3) The prepared data can only answer known questions.

Business users need to follow the data to the important insights. They have the business knowledge to derive insights that matter, but they can only base those insights in data if they are empowered to explore the data and follow the information to the value.

This means all the data – from the raw data, through each transformation or aggregation to the KPIs and rolled up analytics.

Big data infrastructure – properly deployed and governed – can provide a platform which solves the key problems preventing business users from engaging directly with the data and discovering valuable insights.

How to spot a fake Big Data Engineer

Photo Credit: counterfitsoner on Flickr

I interview a lot of candidates… I mean a LOT.

And every resume I get these days has a “Big Data Project” listed.

So, naturally, my first question is – what is it that made it a “Big Data” project?

Top five immediate disqualification answers are:

  1. We were using Hadoop
  2. We had TONS of data
  3. We were running map reduce jobs
  4. The data was unstructured
  5. It wasn’t in our data warehouse

The truth is, no one knows what you mean when you say it was a “Big Data” project – and we all know it is on your resume as keyword search fodder – but if you are going to have it on there you had better come up with a better answer than one of the five above.

Big Data – Storage Isn’t Enough

We should have seen it coming. When we stopped even thinking about how we store data for our applications, when we just assumed some DBA would give us a database – and some SysAdmin would give us a file system. Sure, we can talk about W-SAN (what WLAN was to the LAN, but for storage) solutions like Amazon S3 and Rackspace Cloud – but they didn’t fundamentally change anything.

Big Data forces us to re-think storage completely. Not just structured/unstructured, relational/non-relational, ACID compliance or not. It forces us – at the application level – to rethink the current model exemplified by:

I’m storing this because I may need it again in the future.

Where storage means physical, state aware object persistence and future means anywhere between now and the end of time.

Data Persistence – A Systemic Approach to Big Data for Applications

What Big Data applications require is a systemic approach to data. Instead of applications approaching data as only a set of if/then operations designed to determine what (if any) CRUD operations to perform, it demands that applications (or supporting Data Persistence Layers) understand the nature of the data persistence required.

This is a level of complexity developers have been trained to ignore. The CRUD model itself explicitly excludes any dimensionality – or meta information – about the persistence. It is all or nothing.

Data Persistence is primarily the idea that data isn’t just stored – it is stored for a specific purpose which is relevant within a specific time slice. These time slices are entirely analogous to those discussed in Preemption. Essentially, any sufficiently large real time Big Data system is simply a loosely aggregated computer system in which any data object may generate multiple tasks, each of which has a specific priority.

For example, in a geolocation game (Foursquare) the appearance of a new checkin requires multiple tasks, each prioritized based on its purpose:

  1. Store the checkin to distribute to “friends” (real-time)
  2. Store the checkin association with the venue (real-time)
  3. Analyze nearby “friends” (real-time)
  4. Determine any game mechanics, badges, awards, etc
  5. Store the checkin on the user’s activity
  6. Store the checkin object

NOTE: Many developers will look at this list above and ask: “Why not a database?” While a traditional database may suffice for a relatively low volume system (5k users, 20k checkins per day) it would not be sufficient at Big Data scale (as discussed here).
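To make the fan-out above concrete, here is a minimal sketch of prioritizing those checkin tasks in application code. The task names and the real-time/background split are illustrative placeholders rather than a real Foursquare design; a production system would hand each task to the appropriate persistence layer or worker pool instead of a local heap.

```python
# Sketch: fan a checkin out into prioritized tasks, per the list above.
import heapq

REAL_TIME, BACKGROUND = 0, 1  # lower number = higher priority

def fan_out(checkin):
    tasks = [
        (REAL_TIME, "notify_friends", checkin),
        (REAL_TIME, "associate_with_venue", checkin),
        (REAL_TIME, "analyze_nearby_friends", checkin),
        (BACKGROUND, "apply_game_mechanics", checkin),
        (BACKGROUND, "append_to_user_activity", checkin),
        (BACKGROUND, "persist_checkin_object", checkin),
    ]
    heapq.heapify(tasks)
    # Real-time tasks drain first; in practice each would go to its own store/worker.
    while tasks:
        priority, task, payload = heapq.heappop(tasks)
        print(f"priority={priority} task={task} user={payload['user']}")

fan_out({"user": "dave", "venue": "coffee-shop", "ts": 1351700038})
```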

This Data Persistence solution is composed of four vertical persistence types:

[Diagram: Big Data, Real Time Data Persistence]

Transitory

Transitory persistence is for data persisted only long enough to perform some specific unit of work. Once the unit of work is completed, the data is no longer required and can be expunged. For example: notifying my friends (who want to be notified) that I’m at home.

Generally speaking (and this can vary widely by use case) Transitory persistence must be atomic, extremely fast and fault tolerant.
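As a sketch of the transitory pattern, the notification payload below is persisted only for the duration of the unit of work, with a short TTL as a safety net. Redis is just one possible backing store here, and the key naming, 60 second TTL and send_to_friends helper are all made up for the example.

```python
# Sketch: transitory persistence for a single notification unit of work.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def send_to_friends(payload):
    # Stand-in for the real-time delivery step.
    print("notifying friends:", payload)

def queue_notification(checkin_id, payload):
    # Persist the payload only long enough for the unit of work; the 60 second
    # TTL expunges it even if the worker never gets to it.
    r.setex(f"notify:{checkin_id}", 60, json.dumps(payload))

def deliver_notification(checkin_id):
    raw = r.get(f"notify:{checkin_id}")
    if raw is None:
        return  # already delivered or expired
    send_to_friends(json.loads(raw))
    r.delete(f"notify:{checkin_id}")  # expunge as soon as the work is done
```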

Volatile

Volatile persistence is for data that is useful but can be lost and rebuilt at any time. Object caching (how memcached is predominantly used) is one type of Volatile persistence, but does not describe the entire domain. Other examples of volatile data include process orchestration data, data used to calculate decay for API Rate Limits, data arrival patterns (x/second over the last 30 seconds), etc.

The most important factor for Volatile data persistence is that the data can be rebuilt from normal operations or from long term data storage if it is not found in the Volatile dataset.

Generally speaking, data is stored in Volatile persistence because it offers superior performance, but limited dataset size.
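Here is a minimal cache-aside sketch of that rebuild-on-miss behavior. A plain dict stands in for memcached so the example is self-contained, and the long-term-store lookup is a placeholder.

```python
# Cache-aside sketch: try the volatile store first, rebuild from durable storage on a miss.
volatile_cache = {}

def load_user_profile_from_long_term_store(user_id):
    # Placeholder for a read against the durable store.
    return {"id": user_id, "name": "Ada", "badges": 12}

def get_user_profile(user_id):
    key = f"user:{user_id}:profile"
    profile = volatile_cache.get(key)
    if profile is None:
        # Miss: the volatile data was lost or evicted, so rebuild it.
        profile = load_user_profile_from_long_term_store(user_id)
        volatile_cache[key] = profile
    return profile

print(get_user_profile(42))
```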

ACID

Relational databases and atomicity, consistency, isolation and durability (ACID) are not obsolete. For specific types of operations – done for specific purposes – it is important to maintain transactional compliance and ensure the entire transaction either succeeds in an atomic way or fails. Examples of this include eCommerce transactions, Account Signup, Billing Information updates, etc.

Generally speaking, this data complies with the old rules of data. It is created/updated slowly over any given time slice, it is read periodically, there is little need to publish the information across a large group of subscribers, etc.

Amorphous

Amorphous persistence is the new kid on the block. NoSQL solutions fit nicely here. This non-volatile storage is amorphous in that the content (think property, not property value) of any object can change at any time. There is no schema, table structure or enforced relationship model. I think of this data persistence model as raw object storage, derived object storage and the transformed data that forms the basis of what Jeff Jones refers to as Counting Systems. Additionally, these systems store data in application consumable objects – with those objects being created on the way in.

Systems in this layer are generally highly scalable, fault tolerant, distributed systems with enhanced write efficiency. They offer the ability to perform the high volume writes required in real time Big Data systems without significant loss of read performance.
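As an illustration, the sketch below writes both the raw checkin and a derived, read-shaped view as schemaless documents. MongoDB via pymongo is one example of such a store; the collection names and document shapes are assumptions for the example.

```python
# Sketch: amorphous persistence of a raw checkin plus a derived, consumable view.
from datetime import datetime
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["checkins_example"]

def store_checkin(checkin):
    # Raw object, stored as-is; new properties can appear at any time.
    db.raw_checkins.insert_one(dict(checkin))

    # Derived object shaped for reads ("created on the way in").
    db.venue_activity.insert_one({
        "venue_id": checkin["venue_id"],
        "user_id": checkin["user_id"],
        "day": datetime.utcfromtimestamp(checkin["ts"]).strftime("%Y-%m-%d"),
    })

store_checkin({"user_id": "u1", "venue_id": "v42", "ts": 1351700038, "comment": "hi"})
```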

What Does All This Mean?

Most notably it means that, after years of obfuscating the underlying data storage from developers, we now need to re-engage application developers in the data storage conversation. No longer can a DBA define the most elegant data model based on the “I’m storing this because I may need it again in the future” model and expect it to function in the context of a real time Big Data application.

We will hear a chorus of voices who will attack these dis-aggregated data persistence models based on complexity or the CAP Theorem or on the standard “the old way is the best way” defense of ACID and the RDBMS for everything. But all of this strikes me as a perfect illustration of what Henry Ford said:

If I had asked customers what they wanted, they would have told me they wanted a faster horse