The Big Rules of Big Data
Data is good. Big Data is big, but what does it take to make it good?
All else being equal, it’s better to have data and dismiss it than not have it and miss out. With no data, you’re driving blind. But with too much data, you might as well be driving blind: sensory overload!
The digitization of human life is just beginning. The Information Age is arriving and changing everything, so the best days of data are ahead of it. We’d better get used to data, lots of it.
Here at OrionX.net, we’ve done several projects to help define business and solution strategies in the Big Data, IoT, and Cloud markets. Along the way, some rules and truisms have emerged.
Here is a snapshot of my Big Rules of Big Data (after Big Rules of Marketing, and Big Rules of Competitive Intelligence).
1- Data is cheaper to keep than to delete. Multiple copies, in fact. #NoDelete
In a way, Big Data is enabled by the economics of keeping it around. Nobody dares delete anything because it’s cheap enough to keep, you never know if you’ll need it later, and there may be legal consequences in deleting it.
2- Whatever caused you to collect data will cause you to collect a lot more data. #PlanForScale
Most data collection is focused on ongoing activities so it’s streaming in. Furthermore, as you learn what to do with data, the appetite for even more data usually grows.
3- Big Data systems start small, show promise, go big. #NoMiddle
There are few mid-size Big Data deployments. Once a proof of concept looks promising, projects go big, then grow incrementally from there while spawning new projects.
4- Data must flow to be valuable. Just how valuable is a function of context. #Workflow
Sitting data is an idle asset that is likely depreciating in value. And some contexts are more valuable than others. Think of Big Data as workflow and consider that if content is king, then context is kingdom.
5- Never assume that you know what is cause and what is effect. #ConfirmationBias
In most cases where using Big Data is worth the effort, cause and effect relationships are complex, the data is incomplete, and the users’ biases get in the way.
Reminds me of an epigraph I read years ago:
“If there is a will to condemn, the evidence will be found.”
6- The ratio of relevant data to irrelevant data will asymptotically approach zero. #Haystack
One way to say this is: there’s only one needle, and lots of haystack. The more data you collect, the more haystack you’re adding. But the real point here is that for a given context, irrelevant data accumulates faster than relevant data.
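To make the haystack point concrete, here is a minimal sketch. The growth rates are made-up assumptions, not measurements: total data is assumed to grow linearly with time, while data relevant to any one fixed context is assumed to grow much more slowly (square root, purely for illustration). Under those assumptions, the relevant fraction tends toward zero as collection continues.

```python
# Toy model (an illustration, not a law): total data grows linearly,
# relevant-to-one-context data grows sublinearly. Both rates are
# assumptions chosen only to show the shape of the curve.
import math

def relevant_ratio(t: float) -> float:
    total = 1000.0 * t              # assumed linear growth of all data
    relevant = 50.0 * math.sqrt(t)  # assumed sublinear growth of relevant data
    return relevant / total

for t in (1, 100, 10_000):
    print(f"t={t:>6}: relevant fraction = {relevant_ratio(t):.4f}")
```

Any sublinear numerator over a linear denominator gives the same qualitative result: more haystack per needle as time goes on.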
7- The ultimate purpose of analysis is synthesis. #Synthetics
When you’re done with analytics, you’re going to want “synthetics”! This is where Machine Learning and Cognitive Computing come in, but also the kind of lateral thinking and connecting-the-dots that only humans seem able to do.
8- Time = Money = Data. There is always a context in which a piece of data is valuable. #ReturnOnData
How valuable is your data and how rapidly does it lose its value? Data is an asset and while it can appreciate in value, it usually depreciates as new data displaces old data and as historical data becomes less likely to be relevant. What is the “interest rate” for your data?
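One way to reason about that “interest rate” is to model data value as a depreciating asset with a context-dependent half-life. This is a sketch of the idea only; the half-lives below are invented numbers, not measurements.

```python
# A minimal sketch of the "interest rate for data" idea: value decays
# exponentially with age, at a rate set by the data's context.
# The half-life figures are hypothetical, chosen only for contrast.

def data_value(initial_value: float, age_days: float, half_life_days: float) -> float:
    """Value of a piece of data after age_days, given its half-life."""
    return initial_value * 0.5 ** (age_days / half_life_days)

# Clickstream data might decay in days; contract records might hold value for years.
print(data_value(100.0, age_days=30, half_life_days=7))     # fast-decaying context
print(data_value(100.0, age_days=30, half_life_days=3650))  # slow-decaying context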
9- Volume-Velocity-Variety-Value, meet Irreproducible-Irrelevant-Incomplete-Incorrect. #4Vs4Is
The quality of the insight is a direct function of the quality of data (and the interpretation of that data).
10- Given enough data, you can simultaneously “prove” opposites. #BeautifulMind #Multiverse
The apparent evidence supporting any hypothesis will grow with the size of the data, asymptotically approaching 100%.
A fully scientific methodology can guard against wrong conclusions, but complexity, improper motivation, malice, or ignorance can still lead to invalid ones. The more data, the better the odds that one can get confused and make an innocent mistake, cherry-pick to advance a desired belief, or twist the facts to achieve sinister ends.
In addition, since correlation is not causation, totally wrong but interesting correlations are plentiful, and they should be treated as a warning sign!
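This can be simulated directly. In the sketch below (my illustration, not from the original), every series is pure noise, so any correlation “discovered” with the target is spurious; yet the best correlation found grows steadily as more unrelated variables are screened.

```python
# Rule 10 in miniature: with enough unrelated variables, some will
# correlate strongly with anything, purely by chance.
import random

def corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(0)
n = 30  # short series, as in many real dashboards
target = [random.gauss(0, 1) for _ in range(n)]

for n_vars in (10, 100, 1000):
    best = max(
        abs(corr(target, [random.gauss(0, 1) for _ in range(n)]))
        for _ in range(n_vars)
    )
    print(f"{n_vars:>5} random variables: best |correlation| = {best:.2f}")
```

The more columns you screen against a short series, the more impressive the best spurious match looks: exactly why “given enough data, you can prove opposites.”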
11- Most conclusions will be either uninteresting or invalid. Big Data starts with interesting-but-useless and graduates to valid-and-useful. #InsightWins
We live in a world of new media and viral memes, where the interesting-but-shallow can trump the insightful-but-boring. Occasionally something is both interesting and insightful, but over the long term viral witticisms will saturate their space, and we will hopefully grow too used to linkbait patterns to be moved by them. Big Data is about deeper understandings that can improve things beyond one’s immediate mood.
12- Big Data and HPC converge as data volume grows. #Analytics
If you have 200 rows of data, you have a spreadsheet; if you have 2 billion rows, you have HPC! As the size of data grows, you need math and science to make sense of it. Value is increasingly in analytics (and “synthetics” as in item 7 above), which in turn is about math and scientific models. Check out what my colleague Stephen Perrenod wrote in a 2-part series on this topic here and here.
Are these consistent with what you see? Share your insights please.
Shahin is a technology analyst and an active CxO, board member, and advisor. He serves on the board of directors of Wizmo (SaaS) and Massively Parallel Technologies (code modernization) and is an advisor to CollabWorks (future of work). He is co-host of the @HPCpodcast, Mktg_Podcast, and OrionX Download podcast.