The Big Rules of Big Data

Shahin Khan / AI, Big Data, Data Economy, Machine Learning

December 18, 2015

1- Data is cheaper to keep than to delete. Multiple copies, in fact. #NoDelete

In a way, Big Data is enabled by the economics of keeping it around. Nobody dares delete anything because it’s cheap enough to keep, you never know if you’ll need it later, and there may be legal consequences for deleting it.

2- Whatever caused you to collect data will cause you to collect a lot more data. #PlanForScale

Most data collection is focused on ongoing activities, so the data keeps streaming in. Furthermore, as you learn what to do with data, the appetite for even more data usually grows.

3- Big Data systems start small, show promise, go big. #NoMiddle

There are few mid-size Big Data deployments. Once the proof of concept for a project looks promising, the project goes big and then grows incrementally from there, while spawning new projects.

4- Data must flow to be valuable. Just how valuable is a function of context. #Workflow

Sitting data is an idle asset that is likely depreciating in value. And some contexts are more valuable than others. Think of Big Data as workflow and consider that if content is king, then context is kingdom.

5- Never assume that you know what is cause and what is effect. #ConfirmationBias

In most cases where using Big Data is worth the effort, cause and effect relationships are complex, the data is incomplete, and the users’ biases get in the way.

6- The ratio of relevant data to irrelevant data will asymptotically approach zero. #Haystack

One way to say this is: there’s only one needle, and lots of haystack. The more data you collect, the more haystack you’re adding. But the real point here is that for a given context, irrelevant data accumulates faster than relevant data.

7- The ultimate purpose of analysis is synthesis. #Synthetics

When you’re done with analytics, you’re going to want “synthetics”! This is where Machine Learning and Cognitive Computing come in, but also the kind of lateral thinking and connecting-the-dots that only humans seem able to do.

8- Time = Money = Data. There is always a context in which a piece of data is valuable. #ReturnOnData

How valuable is your data and how rapidly does it lose its value? Data is an asset and while it can appreciate in value, it usually depreciates as new data displaces old data and as historical data becomes less likely to be relevant. What is the “interest rate” for your data?
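One way to make the “interest rate” question concrete is to treat the value of a data set like a continuously compounding asset. The sketch below is only an illustration under that assumption: the function names, the starting value of 100, and the rate of -0.5 per year are all hypothetical, and real data rarely depreciates this smoothly.

```python
import math


def data_value(initial_value: float, annual_rate: float, years: float) -> float:
    """Value after `years` under continuous compounding at `annual_rate`.

    A negative rate models depreciation, a positive rate appreciation.
    The exponential form is an assumption made for illustration only.
    """
    return initial_value * math.exp(annual_rate * years)


def half_life(annual_rate: float) -> float:
    """Years until depreciating data loses half its value."""
    if annual_rate >= 0:
        raise ValueError("half-life is only defined for a negative rate")
    return math.log(2) / -annual_rate


if __name__ == "__main__":
    # Hypothetical numbers: value of 100 today, losing value at 50% per year.
    for years in (0, 1, 2, 5):
        print(years, round(data_value(100.0, -0.5, years), 2))
    print("half-life in years:", round(half_life(-0.5), 2))
```

Framed this way, the practical question becomes how quickly the data has to flow into a decision before most of its value is gone.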

9- Volume-Velocity-Variety-Value, meet Irreproducible-Irrelevant-Incomplete-Incorrect. #4Vs4Is

The quality of the insight is a direct function of the quality of data (and the interpretation of that data).

10- Given enough data, you can simultaneously “prove” opposites. #BeautifulMind #Multiverse

The amount of evidence that appears to support any given hypothesis grows with the size of the data, and the odds of finding at least some such evidence asymptotically approach 100%.

A fully scientific methodology can guard against wrong conclusions, but complexity, (im)proper motivation, malice, or ignorance can lead to invalid conclusions. The more data, the better the odds that one can get confused and make an innocent mistake, cherry-pick to advance a desired belief, or twist the facts to achieve sinister ends. It reminds me of an epigraph I read years ago: “If there is a will to condemn, the evidence will be found.”

In addition, since correlation is not causation, totally wrong but interesting correlations are plentiful, and they should be treated as a warning sign!
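A small experiment shows why wrong-but-interesting correlations are so plentiful. The sketch below generates series of pure random noise that are independent of one another, then counts how many pairs still look correlated. With n variables there are n(n-1)/2 pairs to compare, so the number of chances to find a coincidence grows roughly with the square of the number of variables. The function names, the 20 observations per series, and the 0.5 threshold are hypothetical choices made for illustration.

```python
import itertools
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spurious_pairs(n_vars, n_obs=20, threshold=0.5, seed=0):
    """Count pairs of independent random series with |r| above threshold.

    Returns (hits, total_pairs). Every hit is spurious by construction,
    because the series are generated independently of one another.
    """
    rng = random.Random(seed)
    series = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)]
    pairs = list(itertools.combinations(series, 2))
    hits = sum(1 for x, y in pairs if abs(pearson(x, y)) > threshold)
    return hits, len(pairs)


if __name__ == "__main__":
    for n_vars in (5, 20, 80):
        hits, total = spurious_pairs(n_vars)
        print(f"{n_vars:3d} variables: {hits} of {total} pairs have |r| > 0.5")
```

None of the series has anything to do with any other, so every pair that clears the threshold is a coincidence. Adding more variables adds more pairs, and with them more opportunities to “discover” a relationship that is not there.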

11- Most conclusions will be either uninteresting or invalid. Big Data starts with interesting-but-useless and graduates to valid-and-useful. #InsightWins

We live in a world of new media and viral memes where the interesting-but-shallow can trump the insightful-but-boring. Occasionally, something is both interesting and insightful, but long-term, viral witticism will saturate its space and we’ll hopefully get too used to linkbait patterns to be moved. Big Data is about deeper understandings that can improve things beyond one’s immediate mood.

12- Big Data and HPC converge as data volume grows. #Analytics

If you have 200 rows of data, you have a spreadsheet; if you have 2 billion rows, you have HPC! As the size of data grows, you need math and science to make sense of it. Value is increasingly in analytics (and “synthetics” as in item 7 above), which in turn is about math and scientific models. Check out what my colleague Stephen Perrenod wrote in a 2-part series on this topic here and here.

Are these consistent with what you see? Share your insights please.

Shahin Khan

Shahin is a technology analyst and an active CxO, board member, and advisor. He serves on the board of directors of Wizmo (SaaS) and Massively Parallel Technologies (code modernization) and is an advisor to CollabWorks (future of work). He is co-host of the @HPCpodcast, Mktg_Podcast, and OrionX Download podcast.