Big Data and Real Time – What I’ve Learned

I’ve been thinking about writing a series of posts about Big Data for months now… which is entirely too much thinking and not enough doing, so here we go.

And, by Big Data we mean…

Wikipedia offers a very computer-science-oriented explanation of Big Data – but while the size of the dataset is a factor in Big Data, there are several others worth considering:

  • The arrival pattern (how data is created in the large dataset)
  • Dimensions (how the data is grouped, sub-grouped and organized for analysis)
  • Transformation (how the data is re-dimensioned or transformed to meet new analysis needs)
  • Consumption (how will the data be consumed by client applications and ultimately humans)

Classically, Big Data was regarded as extremely large datasets, usually acquired over a long period of time, involving historical data leveraged for predictive analysis of future events. In the simplest terms, we use hundreds of years of historical weather data to predict future weather. Big Data isn’t new – my first exposure to these concepts was in the early 1990s, dealing with contact center telecom and CTI (computer telephony integration) analytics and predictive analysis of future staffing needs in large-footprint (3,000 to 5,000 agent) contact centers.

Another example of the classic Big Data problem is the traditional large operational dataset. In the 90’s a 2 Terabyte dataset – like the one I worked with at Bank of America – was so massive that it created the need for a special computing solution. Today large operational datasets are measured in petabytes and are becoming fairly common. The primary challenge with these types of datasets is maintaining acceptable operational performance for the standard CRUD (create, read, update, delete) operations.

There are Big Data storage models emerging and maturing today, with the two default options being Hadoop – which relies on a distributed file system – and distributed key-value and document stores such as Cassandra and CouchDB. These systems (referred to as NoSQL solutions) differ from standard RDBMS systems in two important ways:

  • The underlying organization of data is amorphous and does not implement relationships.
  • There is no support for Transactional Integrity (also known as ACID).
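Both differences can be sketched in a few lines. The following is a toy illustration – an in-memory dict standing in for a distributed store like Cassandra or CouchDB, with hypothetical keys and fields – showing that nothing enforces a shared schema or referential integrity:

```python
# Toy stand-in for a distributed key-value store: no schema,
# no relationships, no transactions. Keys and fields are illustrative.
store = {}

def put(key, value):
    store[key] = value

def get(key):
    return store.get(key)

# Two "rows" with completely different shapes coexist happily:
put("user:1", {"name": "Ada", "followers": 120})
put("tweet:9", {"text": "hello", "author": "user:1", "tags": ["intro"]})

# "author" looks like a foreign key, but the store neither knows nor
# cares -- referential integrity is the application's problem.
assert get("tweet:9")["author"] == "user:1"
assert get("user:99") is None  # no row, no error, no constraint
```

In an RDBMS, the second insert would require a table with matching columns and a foreign-key constraint; here the amorphous organization is the point.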

While all of these are interesting engineering problems, they still lack a crucial component. As a matter of fact, most conversations about Big Data fail to adequately address what is, perhaps, the most important problem with Big Data systems today.

The Intersection of Real Time and Big Data

Today’s big datasets are manageable in RDBMS systems. That being said, a significant amount of complexity is inserted into the management process, most notably:

  • Database Sharding and the complexity of managing multiple databases.
  • Downtime associated with schema changes, or the complexity of CLOB fields used in an amorphous manner.

Given that, large datasets that change slowly over time – or, more accurately, those that have a relatively low volume of creates (including those that occur as a result of large transformations) compared to reads, updates and deletes – can be managed using RDBMS systems.
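The sharding complexity mentioned above boils down to the application routing every key to one of several databases. A minimal sketch of hash-based routing, with illustrative shard names:

```python
import hashlib

# Hash-based sharding: the application layer picks a database per key.
SHARDS = ["db0", "db1", "db2", "db3"]

def shard_for(key: str) -> str:
    # A stable hash so the same key always lands on the same shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same key always routes to the same shard...
assert shard_for("user:42") == shard_for("user:42")
# ...but cross-shard queries and re-sharding are now your problem --
# exactly the added complexity of managing multiple databases.
```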

Where Real Time – specifically as related to Social Media, user-generated content and other high-create applications (such as the Large Hadron Collider) – intersects with Big Data is – to me – the most interesting Big Data topic.

This model presents three distinct challenges:

  • High data arrival rate – potentially hundreds of thousands of new objects “found” per second.
  • High transformation rate.
  • Real Time Client updates.

The intersection of all three of these challenges was exactly what we dealt with at justSignal. We needed to consistently collect hundreds of thousands of Social Media objects per second, generate hundreds of metadata elements (transformations) for each object, and make all of that data available in real time to our client applications. This view of Big Data is slightly different. In this view the size of the dataset – while still significant – isn’t as important as the challenges presented by a very high volume of CRUD operations over very short time slices.
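The transformation step can be pictured as each incoming object fanning out to many small metadata generators. This is a toy sketch – the function names, fields, and the trivial “sentiment model” are all illustrative, not justSignal’s actual pipeline:

```python
# Each incoming social media object fans out to many small metadata
# "transformations". All names here are illustrative.
def extract_hashtags(obj):
    return [w.strip("#") for w in obj["text"].split() if w.startswith("#")]

def sentiment(obj):
    # Trivial stand-in for a real sentiment model.
    return "positive" if "love" in obj["text"].lower() else "neutral"

TRANSFORMS = {"hashtags": extract_hashtags, "sentiment": sentiment}

def enrich(obj):
    obj["meta"] = {name: fn(obj) for name, fn in TRANSFORMS.items()}
    return obj

mention = enrich({"id": 1, "text": "I love #bigdata and #realtime"})
assert mention["meta"]["hashtags"] == ["bigdata", "realtime"]
assert mention["meta"]["sentiment"] == "positive"
```

At hundreds of thousands of objects per second, the hard part is not any single transform but running hundreds of them per object without falling behind the arrival rate.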

The most important thing I’ve learned is that there is no silver bullet. While the traditional relational database isn’t effective in real time big data scenarios, neither is a standalone Hadoop or distributed key-value store. Essentially, you must evaluate each use case against the suite of technologies available and select the one best suited to that particular use case. Selecting a NoSQL solution for order processing – which has heavy ACID/transactional requirements – isn’t a good idea. However, staying with your RDBMS for high insert/transform data processing isn’t going to work either.

The approach we took at justSignal – which I will go into in more detail in a future post – was to create a unified data persistence layer designed to leverage the right long/short term data store (and sometimes more than one) based on the requirements of the application. This data persistence layer is made up of:

  • MySQL
  • memcached
  • Cassandra
  • Beanstalkd

Each plays a critical role in our ability to collect, transform, process and serve hundreds of thousands of social media mentions per minute.
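To make the “right store for the right job” idea concrete, here is a minimal sketch of such a facade. The backends are in-memory stand-ins for MySQL, memcached, Cassandra and beanstalkd, and the routing rules are my illustration of the pattern, not justSignal’s actual implementation:

```python
# A unified persistence facade that routes each operation to the store
# best suited to it. Backends are in-memory stand-ins; rules illustrative.
class KV:                       # stand-in for memcached / Cassandra / MySQL
    def __init__(self):
        self.d = {}
    def get(self, k):
        return self.d.get(k)
    def set(self, k, v):
        self.d[k] = v

class Queue:                    # stand-in for beanstalkd
    def __init__(self):
        self.jobs = []
    def put(self, job):
        self.jobs.append(job)

class PersistenceLayer:
    def __init__(self):
        self.mysql = KV()       # relational, transactional data
        self.cache = KV()       # hot reads
        self.cassandra = KV()   # high-volume writes
        self.queue = Queue()    # deferred transformations

    def ingest(self, mention):
        # High-arrival writes hit the key-value store directly; the
        # expensive transforms are queued to stay off the hot path.
        self.cassandra.set(mention["id"], mention)
        self.queue.put({"job": "transform", "id": mention["id"]})

layer = PersistenceLayer()
layer.ingest({"id": "m1", "text": "hello"})
assert layer.cassandra.get("m1")["text"] == "hello"
assert layer.queue.jobs[0]["id"] == "m1"
```

The design choice worth noting: the application code talks only to the facade, so a store can be swapped (or a second one added for a given read path) without touching callers.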

More to follow…

Let’s Get Serious – Social Media ROI

I’m honestly heartened by the sudden rash of efforts to create a methodology to determine ROI (return on investment) for Social Media efforts. It signals something very important for Social Media – the return of rationality to the debate.

When you consider that a few short months ago the prevailing meme was that creating a basis for your Social Media efforts in terms of ROI was “doing it wrong” – it is impressive how far we’ve come. The realization that moral arguments and scare tactics will only get you so far – and in many cases backfire – has led to an overwhelming need to create an ROI model.

Unfortunately many of these efforts are not really after ROI – they are seeking to justify an already formed point of view.

The reality is we simply don’t know if Social Media has an analytical, fact-based ROI. That may sound odd coming from a guy who has bet his personal savings starting a Social Media Engagement and Analytics company – so let me explain both why the ROI hasn’t been proven and why I’m betting it will be.

Social Media is a Niche Opportunity – Today

If you want to know why there is no fact based proven ROI for Social Media investments today, all you need to understand is that Social Media has been adopted in niches. It may be in the Marketing department, or used by your Digital Agency, or perhaps in your Customer Service department. Each of these adoptions was driven out of fear (we have to monitor this and deal with the negative) or the moral (we love our customers – so we are going to do this). The investment was negligible – and in most cases I’d bet it was funded right out of the operating budget of the organization where it was used.

These organizations are beginning to declare victory and are being challenged to prove it. This presents unique challenges, because Social Media runs on anecdotes, not analysis. Dell sells $3 million in product from Dell Outlet after offering those products on Twitter. That is a great anecdote – but it isn’t analysis. When you ask the critical questions:

  • What would you have sold without Twitter?
  • Was that a 3MM increase in sales – or just 3MM net sales from those links?
  • How much did it cost to generate the 3MM in sales and how does that compare to email?
  • Is this repeatable – can it be replicated in other parts of the business – and how do you know?

you quickly find that the anecdote doesn’t equate to ROI. It might… but it isn’t there yet.
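The gap between the anecdote and ROI is just arithmetic once you have a baseline. A toy calculation – every number here is hypothetical – of the analysis those questions demand:

```python
# Toy calculation with entirely hypothetical numbers: gross channel
# sales are not the same thing as incremental return.
gross_sales = 3_000_000     # sales attributed to the Twitter links
baseline_sales = 2_600_000  # estimate of what would have sold anyway
cost = 150_000              # staff + tooling for the campaign

incremental = gross_sales - baseline_sales
roi = (incremental - cost) / cost

assert incremental == 400_000
assert round(roi, 2) == 1.67  # 167% return on the incremental view
```

The hard part, of course, is the baseline – which is exactly what the anecdote never supplies.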

These types of anecdotes are justifications. They are about proving the correctness of an already made assumption.

I’ve seen this movie before – it exactly parallels the pattern for CRM in the late 1990’s.

[Image: technology adoption in the enterprise]

NOTE: For simplicity I’ve omitted the case where a technology/methodology has a niche ROI without broader adoption.

We are squarely in the middle of the justification phase for Social Media. This roughly corresponds to the Peak of Inflated Expectations (the big peak on the Gartner Hype Cycle) and always directly precedes the Trough of Disillusionment. This is a recognizable and predictable pattern for adoption of new technologies and methodologies – and here is why.

The initial opportunity is too good to stay on the sidelines for some early adopter group. They – almost always within existing operating budgets and using the promise as a bulwark defense – adopt the technology/methodology. Once they believe they have seen tangible results they attempt to socialize the “win” outside the organization by creating justifications for what they’ve already done. These justifications bring broader scrutiny.

That scrutiny happens in two phases:

  1. Was it worth it?
  2. Can it be done systemically – can I forecast an x% increase in metric z if I do this again?

The second is ROI. A systemic way of proving that adoption generates a return. If, and only if, that can be proven will the technology escape the niche application and be applied on a broad scale.

Why does it work this way? Because enterprises are first and foremost risk management systems. They systemically avoid large risks.

Why Will Social Media Attain Broad Adoption

The primary reasons I believe Social Media will in fact generate a valid ROI and attain broad adoption:

  1. It is measurable.
  2. The unrecognized value far exceeds the recognized value.

Measurability

As you might imagine, it is very difficult to justify and create a systemic ROI for something that is exceptionally difficult to measure. Social Media is – in contrast – eminently measurable. Rational decisions must be made about what to measure – and we need more focus on connecting those measures to the core business metrics – but there is no fundamental barrier to creating valuable measures.

The Value Proposition

Today, we’ve put all our Social Media eggs in the PR/Marketing basket. Even the small amount of credibility given to customer service via Social Media has been driven by the (C-Level Down) idea that customer service should “avert disasters” by monitoring Social Media and addressing customer issues. Make no mistake, this is customer service acting in a PR role – the goal isn’t to provide service so much as to avoid negative perceptions.

However, if you take one large step back and think about the opportunity Social Media presents – you can quickly see that the value proposition is in having a huge, open back channel to your market. We’ve had channels to our customers, and sometimes even our prospects – but this is bigger. It is the entire market for your product or service. You get to listen in on what they have to say about what they want and need. You can engage them to better understand their motivations. You can apply what you learn to create incremental improvements in every phase of your business.

Yes, you can send out special offers. Yes, you can address customer concerns. But the real return will come from having a robust back channel with your entire market; and the resulting market intelligence can – if you apply it – help you make every part of your business more appealing to your target market.

So let’s get serious about ROI. Let’s talk about how companies operate and win by continually tuning their processes to better address the needs of their target market. Let’s talk about how Social Media provides them a back channel to that market, a back channel that is an invaluable source of intelligence about the market.

Let’s talk about how a business that applies the intelligence gained via Social Media to all of their decision making processes is faster and more agile in addressing the needs of their market – and thereby wins market share.


The myth of “everything” – Responses to my view of Track

My post yesterday – specifically regarding the Great Track Debate – received several responses on Twitter, identi.ca and friendfeed. Many of these – really, all of them – were positive, but the debate is far from settled.

I learned something very important from one discussion in particular – that there is a myth that permeates the conversation. The myth of “everything”.

This myth is the adherence to two ideas:

  1. That somehow the “fire-hose” (as discussed here) represents the complete information about any given topic – that somehow the “fire-hose” is everything.
    1. At the root of it, this is the idea that everything is attainable.
  2. That – in order to monetize track – having everything is essential. For example, if you are trying to manage your brand you need every reference to it in real-time.
    1. At the root of it, this is the idea that everything is required and valuable.

The reason I refer to these two ideas collectively as the “myth of everything” is because when they are clearly stated and examined they are immediately recognizable as inconsistent with reality.

So let’s take the two ideas one at a time and examine how tenuous their attachment to reality is.

First, that the “fire-hose” represents everything about any given topic. The fire-hose is the sum total of what is said on a given service. In order for that to be everything, everyone would need to participate in that service. That is a hard enough hurdle to overcome, but there is more – not only would everyone need to participate, that service would need to be the only method by which they communicate their thoughts, ideas and feelings.

Even if you were able to combine the fire-hose of every service available – every social networking site, every blog, every micro-blog, every IM service, every news site – you would still be far short of everything.

So the idea that track only becomes valuable when it can capture “everything” is a myth. Everything is unattainable.

Second, that having everything is essential. Let’s suppose we were to allow that somehow everything was achievable. Even if it were would it be required? Would it provide value commensurate with the effort required to collect it?

Let’s evaluate this in terms of brand management. The assumption which underlies this is that it is required to respond in real time to every post which is misleading, false, or damaging. This assumption is flawed – the reality is that it is required to respond to a statistically relevant sample of those posts. You aren’t trying to refute every post – you are trying to move (or keep from moving) the average (or perhaps the median) opinion.

If any company were forced to staff enough positions to actively monitor and respond to every post made about them – they would immediately cease to be profitable. It isn’t scalable, and more importantly it isn’t required.

This holds true equally for politics.

So the idea that track delivers value because having access to everything is required and the primary driver of value is a myth. Everything is neither required nor valuable in real terms.

What the track community should be focused on – again IMHO – is not the fire-hose and the attendant myth of everything, but creating systems which can attain enough trackable scope to provide a statistically relevant sample of the posts in the social media universe.
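“Statistically relevant sample” has a concrete, and reassuringly small, size. The standard sample-size formula for estimating a proportion makes the point – the numbers below are illustrative:

```python
import math

# n = z^2 * p * (1 - p) / e^2 -- the standard sample size needed to
# estimate a proportion within margin of error e at confidence z.
def sample_size(margin_of_error, z=1.96, p=0.5):
    # p = 0.5 is the worst case (maximum variance).
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

# Roughly 1,068 posts give a +/-3% estimate at 95% confidence --
# whether the universe holds a million posts or a billion.
n = sample_size(0.03)
assert n == 1068
```

Note that the required sample size does not grow with the size of the universe – which is precisely why “everything” is not required to measure opinion.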

I understand that, emotionally, it feels good to tap into some perceived “everything” and refute any and all posts that you think are misleading, false, biased, or offensive. But this isn’t about what feels good – at the end of the day it will be about what is effective. And to be effective everything is neither required nor valuable.

These two conclusions – that everything is unattainable and that, even if attained, it is neither required nor valuable – should allow us to dispense with the “myth of everything” and return to the point of track:

  1. Real-Time Information Discovery
  2. Real-Time Participation

The evolution of FFStream and the Great Track Debate

As those of you who follow me on twitter or friendfeed know, FFStream – which I began discussing in this post – has evolved into FF-Filtered. The changes are not dramatic, but they are significant.

FF-Filtered is now focused on providing “your friendfeed – filtered” – and as that implies, what it does is filter your friendfeed (your home feed, to be specific) by a list of keywords. If these keywords match the post title, a comment, or the user, you receive the update in real time – in the browser or via IM, including GTalk, Jabber, AIM and Yahoo.
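The matching rule just described is simple to sketch. This is my illustration of the idea, not FF-Filtered’s actual code – the update fields are hypothetical, not friendfeed’s real API shape:

```python
# An update passes the filter if any keyword appears in its title,
# any comment, or the user name. Field names are illustrative.
def matches(update, keywords):
    haystacks = [update.get("title", ""), update.get("user", "")]
    haystacks += update.get("comments", [])
    text = " ".join(haystacks).lower()
    return any(kw.lower() in text for kw in keywords)

update = {"title": "Track debate heats up",
          "user": "jdoe",
          "comments": ["agreed", "what about FriendFeed?"]}

assert matches(update, ["friendfeed"])     # matched in a comment
assert matches(update, ["track", "nope"])  # any one keyword suffices
assert not matches(update, ["cassandra"])
```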

It isn’t track – as I’ve been repeatedly and vehemently told by the “community” over the last 5 days – more on that later.

Additionally – for the mobility set – we’ve added like, comment, post and filter updates via a mobile web page. If you click on the link in the IM from a mobile platform you get the following mobile web page:

[Image: screenshot of the mobile web page]

I’ve got a few more tricks up my sleeve – so look for more changes this week including a name change.

Now for the second half of the post… I’m going to talk about Track again… so sharpen your knives (or tongues) and get ready to revel in your abject disdain for my refusal to “go along” or shut up.

Let me say one thing first – if you want to attack my positions and opinions go for it. If you are here to attack my motives or me personally – GO AWAY NOW.

The last 5 days have been very illuminating for me – both in terms of the fervor of the “track community” and in terms of their point of view – which at times verges on dogma.

Let me attempt to “play back” what I’ve heard and then explain – as clearly as possible given my limited skills – my point of view.

Track – by the definition of the “track community”, as led today by Steve Gillmor of Gillmor Gang/News Gang Live – is defined by the “fire-hose”. This fire-hose is the complete, unabridged stream of posts occurring on any social media site. In the case of Twitter, this is the entire public timeline published in real time.

This is an important distinction – because it states that “track” can not be achieved without the “fire-hose”. More on this later.

The second component of “track” is the ability to keyword filter the real time stream and deliver the filtered content in real (or near real) time.

The third component of “track” is the ability to insert posts into the public timeline from the same user interface you are viewing the stream in.

Those three things collectively comprise the “holy grail” of track.

I’m sure you will all let me know exactly how wrong I am… but I think those three capture the broad brushstrokes. Ok?

Before I attempt to explain my point of view – let me clarify one point. Regardless of my agreement or disagreement with the track community on any given point, there is one thing we vehemently agree on:

There is massive value in the ability to discover and participate in the social media stream in real (or near) real time. Our objective where that is concerned is the same.

When I consider track – I consider it in terms of the problems it attempts to solve. To me, track is an attempt to solve 2 very important problems:

  1. Real-Time Information Discovery
  2. Real-Time Participation

Any solution which solves those two problems would – by my definition – fall within the scope of a “track” service.

Now let me explain why (take a deep breath… you can throw something at me later). Where I differ with the “track community” on this issue is on the scope of the track-able data, not what happens after the “track service” receives it. Just as importantly, I fundamentally agree that the wider the scope of the data being tracked, the more effective the track solution will be.

But, consider this – not every user wants to track the entire social media universe. To the contrary – most IMHO simply want to track their friends, family, co-workers, brand, market makers, influencers, power users, etc.

For those users a limited scope is a good thing. Beyond the scope of the data being tracked, such a service solves the exact same problems:

  1. Real-Time Information Discovery
  2. Real-Time Participation

So apply the duck test. It walks like a duck… it quacks like a duck… why isn’t it a duck?

It is my opinion – and you can feel free to take issue with it – that the track community’s obsession with the “fire-hose” has actually retarded the growth of alternative track services. The obsession with scope has prevented the creation of useful (if limited in trackable scope) solutions under the banner of track – and that is a shame.

Every developer that seeks to solve the two problems should be embraced, encouraged and supported.

The real battle here is one of leverage. And the way to get the social media services to both open up their data and participate in the creation of a standard for doing so is to create a win-win. I believe track services that are useful and solve real problems (e.g., real-time brand monitoring) can and will provide the leverage that causes the change the community has been seeking.

If Twitter wants to pretend they ARE the social media universe – let them. It is abundantly clear from the success of friendfeed that no single service is or will be the social media universe – any service that ignores this will fail.

When compelling and broadly adopted services exist, which demand real-time un-scoped access to multiple underlying services, the individual services will have no choice but to “open their kimono” or face massive user defection.

So stop complaining about the lack of a “fire-hose” and figure out what those services are, who needs them the most, and how to drive that value to as many users as fast as possible. If you do that – you’ll get what you want… not today, not even tomorrow… but relatively soon.

I had intended to discuss standards, what I believe the high level components of an open track environment might look like, and why friendfeed is in the best position to lead standard development… but this has already gotten too long. I’ll set those subjects aside for another day.

If you’ve disagreed with everything else I’ve said – please remember – I share your goal. I’m not saying the outcome you seek isn’t valuable – I am just proposing a different course of action. I hope I’ve done so respectfully and without denigrating anyone or their point of view.