By Patrick Condren, Ph.D., Chief Information Officer
The interesting thing about data is that as we gather huge amounts of information (aka Big Data), the precision of the data declines, but the confidence we can place in probabilities drawn from it increases. An example of Big Data would be analysing tweets, Facebook data and other sources relating to the NY Yankees to determine fan satisfaction with the team.
This kind of data is ‘Big’. If it involves scores of data sources and nine-figure record counts, it is ‘Big’. But is there a different kind of data that goes beyond what we refer to as ‘Big’ and would be better called ‘Span Data’?
Here is why precision declines with more data, while the probability increases:
If you had a data set of 100 people where 100% of those people bought a blue sports shirt, you could conclude that 100% of people buy blue sports shirts. You would be wrong.
The only conclusion that you can draw is that 100% of the 100 people in your data set bought a blue sports shirt. Your sample is very small. If you knew that your data set was entirely made up of males from a ‘Blues’ fan club (say, the Manchester City fan club), then a result of 100% is not surprising and probably not very useful. That extra ‘data point’ effectively nullifies the relevance of the sample population.
If your data covered 100,000 people, spread equally across all genders and ages, and you noticed that 100% of them bought a blue sports shirt, could you then conclude that 100% of people buy blue sports shirts? You should probably look at the other factors that might influence your findings:
- Timing (was there a big ‘Blue’ event on at the time?)
- Location (was the population taken from the ‘Blue team fan shop network’?)
- Influences (was this concentrated in an area where there may be undue influence, say next to the ‘Blues’ sport stadium or in Athens on an international football day?)
You get the point: even big data sets can be skewed.
Now, what if you have a sample population that is massive and hugely diverse, say at a national level in the USA? You have all genders, ages and ethnicities, local and regional factors, and a host of other factors. You then notice that 65% bought blue shirts in the last 6 months. What if you could also eliminate any special event that may have occurred during the period? What can you conclude then? Probably that around 65% will buy blue shirts again in the same period of the following year.
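To put rough numbers on that intuition, the sketch below compares the 95% margin of error around a 65% result for a 100-person sample and a 100,000-person sample. It assumes a simple random sample and the usual normal approximation for a proportion. The function name `margin_of_error` is illustrative only, and the calculation covers sampling noise alone, not the kinds of skew listed earlier.

```python
import math

def margin_of_error(share: float, sample_size: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an observed share (normal approximation)."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0 and 1")
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    standard_error = math.sqrt(share * (1.0 - share) / sample_size)
    return z * standard_error

# Assumed inputs: the 65% figure from the text, at two sample sizes.
for n in (100, 100_000):
    moe = margin_of_error(0.65, n)
    print(f"n={n:>7,}: 65% +/- {moe * 100:.1f} percentage points")
```

Under these assumptions the small sample leaves roughly nine percentage points of uncertainty either way, while the large one leaves about a third of a point, which is why the larger population supports a firmer forecast.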
The benefit of the difference between the 100-person sample population and the USA-level population is that you can draw a more accurate probability from the data. Critically, the data can now be viewed across a number of criteria. For example, if you sell blue shirts you may want to know which regions or ethnic populations don’t currently buy many blue shirts. This would potentially lead you to new markets.
With that insight, would you make a strategic decision? Perhaps after some investigation you would decide to provide red shirts to that target population (or whichever shirt would most likely penetrate that market) or perhaps you would even decide to use targeted advertising to force blue shirt penetration into the targeted market.
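Viewing the data by one criterion can be as simple as grouping and counting. The following is a minimal sketch using made-up purchase records; the region names, the `purchases` list and the 50% threshold are all assumptions for illustration, not figures from any real data set.

```python
from collections import defaultdict

# Hypothetical purchase records: (region, bought_blue_shirt). Made-up data.
purchases = [
    ("Northeast", True), ("Northeast", True), ("Northeast", False),
    ("South", True), ("South", False), ("South", False), ("South", False),
    ("West", True), ("West", True), ("West", True), ("West", False),
]

def blue_share_by_region(records):
    """Return {region: share of buyers who bought a blue shirt}."""
    totals = defaultdict(int)
    blue = defaultdict(int)
    for region, bought_blue in records:
        totals[region] += 1
        if bought_blue:
            blue[region] += 1
    return {region: blue[region] / totals[region] for region in totals}

def underpenetrated(shares, threshold=0.5):
    """Regions whose blue-shirt share falls below an assumed threshold."""
    return sorted(region for region, share in shares.items() if share < threshold)

shares = blue_share_by_region(purchases)
print(underpenetrated(shares))
```

With this toy data only the South falls below the threshold, so it would be the candidate market to investigate further.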
You may well be thinking that this is ‘Big Data’ in action. I would argue not specifically, even if it is similar.
Big Data is defined as extremely large data sets that may be analysed computationally to reveal patterns, trends and associations, especially relating to human behaviour and interactions.
It tends to be a whole load of data pulled from a variety of sources and then analysed to find ‘interesting patterns’ that emerge.
There is not always complete serendipity, but there is a large element of serendipity in most Big Data activity. Some operations have much better control, but I am suggesting that there is a new evolution possible: the child of ‘Big Data’ called ‘Span Data’. ‘Wide Data’ might be an even more accurate name, but a ‘Wide Data’ definition already exists at the record/table level.
Span Data would be defined as the targeted search for specific insights, using Big Data sets across ‘broad and diverse systems’ and data populations (or silos), with a clearly predefined, specific goal.
Span Data differs from Big Data for the following reasons:
- Social media may not be a factor (not a prime factor, but a simple data point)
- The data is not necessarily about human interactions
- The data is not necessarily relating to human behaviour
- Beyond ‘Constrained Programming Languages’ additional ‘User Context’ can be applied and may even be a critical factor
- Non-system data can be added for illumination (e.g. there was a big sports event at that time, or a power blackout, or a war, or protests nearby, or a large disaster in a foreign place that impacted the supply chain, etc.).
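The last point, adding non-system data for illumination, might look something like the sketch below. It flags unusually high sales weeks and attaches any known outside event to each one. The weekly figures, the event note and the "1.5 times the median" spike rule are all hypothetical choices made for the example.

```python
from statistics import median

# Hypothetical weekly unit sales pulled from a sales system (made-up numbers).
weekly_sales = {"W01": 100, "W02": 104, "W03": 310, "W04": 98, "W05": 240, "W06": 101}

# Hypothetical non-system context, entered by someone who knows the background.
external_events = {"W03": "out-of-season blizzard"}

def annotate_spikes(sales, events, factor=1.5):
    """Flag weeks above factor x median and attach any known external event."""
    baseline = median(sales.values())
    spikes = {}
    for week, units in sales.items():
        if units > factor * baseline:
            spikes[week] = events.get(week, "unexplained")
    return spikes

print(annotate_spikes(weekly_sales, external_events))
```

Here week W03 is explained by the blizzard and can be discounted, while W05 has no known cause and may point to a genuine change in demand worth investigating.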
An example would be a tyre manufacturer who decides to penetrate a new market. In this case, Big Data will be all about the impact on the Internet and social media, which is useful, but not the full requirement. The traditional approach is based on market size and potential revenue projections, set against setup, launch, growth and running costs over a period. But what if the following were conducted on a suitably large market, say the USA, for all tyre sectors?
Non ‘human data’
- A market size analysis (how many units sold in each category or vehicle, each tyre size)
- A growth analysis (which lines would likely grow over time, projected from history)
- Vehicle manufacturing trend data (which tyre types are likely to grow going forward? Ex: don’t build solid tyres for forklifts if the trend is air-filled)
- Cost per unit of each potential product type (including design, manufacture, eco, distribution to storage, storage costs, admin overhead, etc.)
- Cost of distribution of each product type (including sales, marketing, shipping, fitting, etc.)
- Regional trends (where does most sales activity happen? Ex: there may be massive truck depots in Boston, Lincoln MA, Orlando FL and Dallas TX, which account for 80% of national sales. That would be good to know! Or maybe 40% of all US vehicles have their tyres fitted in Detroit.)
- Timing (when are the seasonal peaks and troughs in sales?)
- Is there a natural inflection point in the market due? (New eco-friendly legislation or whatever, locally or internationally)
- What is the ‘value’ of sales to your company in the different aspects of the markets (Ex: you may sell 1 million units to Ford at $2 profit per unit or 100,000 units direct to the public at $40 per unit. Which would you choose? The direct to public comes with unspecified overheads, but you could build repeatable customer loyalty…)
- Customer satisfaction with current brands in those lines (to generate insight into ability and effort required to dislodge incumbents)
- Trending analysis on the current market leaders (what caused ups and downs in terms of popularity? Is there an opportunity to capitalize on a weakness?)
- Live analysis of social media to launch and tune (to understand and examine marketing effectiveness)
- Context around customer satisfaction
- Context around sales spikes (was the big spike just because of a flood, an out-of-season blizzard or some other unpredictable event?)
- Oral and some text-based research, where conversations are had about direction and influence, but not in social media. (However, a single interview on TV, Radio or other spoken word may be crucial)
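The Ford-versus-direct question in the list above comes down to simple arithmetic, sketched here. It assumes the $40 figure is profit per unit before the unspecified overheads of selling direct, and it treats those overheads as a single lump sum; `channel_profit` is a hypothetical helper written for this example.

```python
def channel_profit(units: int, profit_per_unit: float, overhead: float = 0.0) -> float:
    """Total profit for a sales channel after subtracting a lump-sum overhead."""
    return units * profit_per_unit - overhead

# Figures from the example above; the $40 is assumed to be profit per unit
# before the unspecified overheads of selling direct.
ford = channel_profit(1_000_000, 2.0)
direct = channel_profit(100_000, 40.0)

# Overhead at which the direct channel only matches the Ford contract.
break_even_overhead = direct - ford

print(f"Ford contract:    ${ford:,.0f}")
print(f"Direct to public: ${direct:,.0f} before overheads")
print(f"Direct stays ahead while overheads are below ${break_even_overhead:,.0f}")
```

On these assumptions the direct channel earns twice as much before overheads and stays ahead until those overheads reach $2 million, or $20 per unit, which is the number the unspecified costs would need to be tested against.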
The key gain from the ‘Span Data’ set is that greater benefit and insight may be derived. It aids in making key strategic decisions, where the probability of success can be better anticipated or, at the very least, efforts can be focused more surgically.
Big Data is an important part of this, but it is not the only data ‘type’ that needs to be considered. Even though the data set is ‘Big’, it more importantly ‘Spans’ a series of critical pillars to provide even more actionable insight.
Article written by Patrick Condren for icrunchdata News Dublin, Ireland