Finding Hidden Treasure: Self-serve Analytics at The “Speed of Thought”

Key features of a modern agile analytics platform, and how to truly democratize data so that business users can create analytics at the “speed of thought”

Irzana Golding
Apr 6, 2021 · 10 min read

Hint: No IT-dependent centralized data-swamps

No one in their right mind would think of filling out an IT ticket with requirements to make an Excel spreadsheet, right? Yet this is how analytics is done in many organizations, via centralized analytics teams. If this is your organization, you are not alone, and it means that you are not really data-driven and almost certainly not agile in practice, even though you might have agile methods in place.

In this article I will describe why self-serve data analytics is essential to sustainable business growth. I will also define what it means in simple enough terms that you can go demand it from your sponsors and stakeholders! But first, I need to make my case.

I have worked extensively in analytics for large corporations. The reality is that analytics is a messy world. Orchestrating data is like herding cats — probably worse. It’s not only the data that needs herding, with its tendencies to leak, break, change and fail, but the people around it with their tribal knowledge. And, with the increasing amounts of data, the problem is getting worse. This should give us a clue that any attempt to “centralize” analytics is folly.

The truth is that much of the data we instrument is a by-product or side-effect of processes and products that were not designed with analytics in mind. The mindset of “build now, measure later” still prevails. It is reminiscent of the failed approaches to quality back in the days when folks thought that quality was something that could be “inspected in” to a product. Six Sigma came along and killed such follies.

Nonetheless, the ability to instrument, collect and query data at scale is generating increasing swathes of data in every corporation, sloshing around like a giant ocean surrounding the islands that are its products. Much of that ocean remains dark and uncharted. IBM claims that as much as 80% of corporate data is so-called “Dark Data” of this kind. But this is where sunken treasure lies.

In case it isn’t obvious, hidden treasure is, by definition, hidden. It can only be exposed by a treasure hunt. But unlike the fables of pirates, this treasure lacks a treasure map. In other words, we don’t know where it lies. This means we need two things:

  1. The tools to shine a light into the dark ocean.
  2. A kind of “beachcomber’s” instinct to go find the treasure.

This latter part is more important and often overlooked. I take for granted that many great tools now exist to allow us to shine the light, but without the seeker’s instincts, these tools will often provide only limited insights.

And this is the main reason that analytics needs to be self-serve, meaning that the folks with the instincts can go treasure hunting all by themselves as fast as their instincts surface, or, to use the cliche: “at the speed of thought”.

The only way forward is to use a self-serve architecture that enables business folks to help themselves to data in order to reveal insights as easily and readily as nearly anyone today can use a spreadsheet.

I use the spreadsheet analogy as it’s a necessary awakening for many managers who still labor under the delusion that analytics is a specialized task done by specialized folks. This is no longer true for many cases. And when I say many cases, I mean many cases that can deliver significant and potentially immediate value without the need for specialized analytics skills (e.g. those of a data scientist). Much of this value can be gained via what I call “back of the data envelope” probing, which I shall explain shortly.

Without defining it fully, why is self-serve possible?

The trend in data analytics and data tooling is clear — advances in software are putting what was yesterday a specialist’s job into the hands of anyone who cares to take an interest in data (which should be all of us). This is a necessary and essential development because when it comes to data analysis, the best folks to do it are the ones with business-domain knowledge.

Let me expand upon this point. No amount of knowledge transfer and carefully written requirements is going to allow centralized IT folks to understand the nuances of a particular business activity outside of their IT domain. In other words, those folks in centralized-IT who don’t know the nuances of the business domain cannot form instincts of the kind needed to go find the hidden treasure in the data. Contrariwise, with powerful analytics tools that do not need IT skill levels, business folks can go find that treasure themselves.

Due to advances in modern tools, it is possible for a business person to perform relatively complex data operations that were previously done by IT experts. So the trend is a shift of analytics from the technically literate to the business literate. Self-serve is recognizing this trend and then accelerating it.

Two Essential Parts…

First, we need a self-serve environment where business folks can help themselves to datasets, in a variety of states from raw to curated, in order to explore the data. Second, we need the business literate to become more data literate (which is not the same as IT literate).

Basic Data Literacy + Up-skilling Employees

I will begin with data literacy. This has a number of parts, but the first part is the biggest step towards becoming a truly data-driven organization. What I have seen over and over in corporations is a tendency to discuss and debate subjects, and even run entire projects, including the assignment of additional resources, without much sense-checking of the underlying assumptions against data. In the “show me the evidence” world of Jeff Bezos, this would be a sackable offense!

A classic situation, and one that I have lost count of, is when a well-intended person instigates a piece of work to “fix” something that is apparently broken, at least according to anecdote. For example, Bob hears, perhaps from a few of his customers, that customers don’t fill out a form properly, so he sets about fixing the form by assigning project resources to it. However, he doesn’t first check what the data says about “the problem”. How many forms are missed? How often? What’s the measurable impact? Is it a certain type of customer? Does it happen under particular circumstances?

As you might have guessed, answers to these questions can probably be found in the data, but Bob has no dashboard to tell him, nor a clue where to look. But looking at the data should be Bob’s first instinct. I don’t mean some highfalutin data science analysis. I mean a simple sense check of the data, or what I call “back of the data envelope” checking.

What is back-of-the-data-envelope?

This back-of-the-envelope approach is something that experienced data scientists know well (although newbies often skip it). It typically involves visualizing data to get a feel for its nature. Simple checks like averages, distributions, anomalies, obvious patterns, and the like. These types of analysis are trivial with tools like Tableau and should be no more feared than building a simple spreadsheet. Incidentally, even as a data scientist, I will often use Tableau for a quick initial feel for the data (provided the data has been consumed properly so that meaningful slice-and-dice is readily accessible — more on that in another post, perhaps).
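
To make this concrete, here is a minimal sketch of the kind of “back of the data envelope” probing I mean, written in pandas. The file name and columns (status, customer_segment, submitted_at, and so on) are purely hypothetical stand-ins for Bob’s form data.

```python
import pandas as pd

# Illustrative only: the file and column names are hypothetical.
df = pd.read_csv("form_submissions.csv", parse_dates=["submitted_at"])

# How big is the problem, really?
print(df.shape)                                   # row and column counts
print(df["status"].value_counts(normalize=True))  # share of complete vs. incomplete forms

# Simple averages and distributions
print(df.describe(include="all"))

# Is it a certain type of customer, or a particular time of year?
print(df.groupby("customer_segment")["status"].value_counts(normalize=True))
print(df.set_index("submitted_at").resample("W")["status"]
        .apply(lambda s: (s == "incomplete").mean()))

# Obvious anomalies: missing values and duplicate submissions
print(df.isna().mean().sort_values(ascending=False).head())
print(df.duplicated(subset=["customer_id", "form_id"]).sum())
```

None of this is data science; it is the analytics equivalent of a quick spreadsheet, and most of it can be done just as easily with drag-and-drop in a tool like Tableau.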

This kind of skill, and instinct, to query datasets and sift around using first-order visual analysis is what I mean by data literacy. It really is a must-have for any enterprise serious about its business, and up-skilling employees to become data literate should be a core part of any “data-driven transformation”.

But there is a deeper aspect to data literacy which takes time to develop. This is the ability to trust data over biases. Many of these “we need to fix it” problems, such as Bob’s form, easily become pet projects that business owners fixate upon. And, as I have written before, the Dunning-Kruger effect of over-estimating one’s own opinion comes into play and causes strong biases towards personal project interests, like fixing a form. Data literacy includes learning to make data-based arguments alongside anecdotal ones.

The greatest enemy to this process is the fear that many managers have: the fear that the data will reveal a reality about their business which contradicts agendas or the way that managers account for their performance. Let’s face it, Bob’s excitement to fix the form might be that he can add more headcount to his team. Or perhaps he has convinced a higher-up, without evidence, that this form’s performance is a critical deliverable. Perhaps Bob has constructed a vanity metric around it that has appeared in his PowerPoint slides for the last year or so. This is part of a wider issue of not having a systems approach to work, or not organizing work around competence versus hierarchy (which is often a sign of a missing “Agile Mindset”).

It is my observation and contention that this anecdotal approach, often reinforced by vanity metrics, aggregates across an enterprise to produce sizable excesses in resources spent fixing problems that don’t really need fixing. If the culture isn’t particularly data-driven, then the problem is compounded by obfuscation of agendas that is often easy to get away with in the absence of hard data-driven evidence for why things need fixing.

In a nutshell, self-serve is the ability for folks like Bob, or his colleagues, to readily make the kind of data envelope calculations that provide a truly data-driven sense check. Of course, self-serve could be something much bigger, but this definition and starting point alone will propel many organizations towards developing a data-driven culture.

But what does a self-serve environment look like?

At its core is a “data lake” that is accessible to everyone by default, not by filing requests with IT. I will avoid technical definitions of data lakes and the like, and just describe some key attributes that make it self-serve. Namely, it should contain reliable data that is accessible by anyone at any scale.

Reliable does not mean accurate, precise or what you might think it means. It simply means transparently trustworthy, even if a particular dataset contains “dirty” data. What I mean here is that the consumer of the data is able to tell what kind of data she is dealing with. A dataset might be deemed “dirty” because it might contain conflicting duplicates. So long as the consumer is able to discern this fact and proceed “at her own risk”, then this is fine. In fact, this approach is unavoidable because the work required to sanitize every dataset prior to usage would become an intractable bottleneck and is often not necessary for first-order “back of the envelope” analytics.

However, in a self-serve world, an expedient solution is for a particular consumer to curate a sanitized dataset by transforming it. For example, someone in Bob’s team might go about curating a cleaned-up dataset of customer form submissions that is easy to visualize and already contains a number of joined-up data sources. This dataset can now be curated and documented (via a number of tools).
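
As an illustration of what that curation step might look like, here is a small, hypothetical sketch in pandas. The source files, column names and clean-up rules are invented for the example; the point is simply that the consumer transforms and joins the raw data once, then publishes the result for the whole team.

```python
import pandas as pd

# Hypothetical inputs: raw form submissions plus a customer reference table.
forms = pd.read_csv("raw/form_submissions.csv", parse_dates=["submitted_at"])
customers = pd.read_csv("raw/customers.csv")

# Resolve conflicting duplicates by keeping the most recent submission per form.
deduped = (forms.sort_values("submitted_at")
                .drop_duplicates(subset=["customer_id", "form_id"], keep="last"))

# Join in customer attributes so the dataset is easy to slice in a BI tool.
curated = deduped.merge(customers, on="customer_id", how="left")

# Light clean-up: consistent casing and an explicit completeness flag.
curated["status"] = curated["status"].str.lower().str.strip()
curated["is_complete"] = curated["status"].eq("complete")

# Publish to a "curated" zone for Bob's team to explore.
curated.to_parquet("curated/form_submissions_clean.parquet", index=False)
```

In practice the same transform could live in whatever curation tool the team uses; the important part is that it is written down once and shared rather than repeated ad hoc by every consumer.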

From here on, Bob and his colleagues have a trusted dataset to do back-of-the-envelope analysis with. Not that long ago, this step was difficult, or often impossible, because the compute power required to transform large datasets “on demand” wasn’t available. However, with modern cloud solutions like Snowflake, users can spin up large amounts of data processing power on demand and only pay for what they use.
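
For instance, assuming the snowflake-connector-python package and placeholder account details, a sketch of spinning up an auto-suspending warehouse just for this kind of on-demand transform might look like the following (the warehouse, schema and table names are hypothetical):

```python
import snowflake.connector  # pip install snowflake-connector-python

# Placeholder credentials: substitute your own account, user and authentication.
conn = snowflake.connector.connect(
    account="my_account", user="bob", password="***",
    role="ANALYST", database="ANALYTICS",
)
cur = conn.cursor()

# Create an isolated warehouse that suspends itself after 60 seconds of idle time,
# so the team only pays while the transform is actually running.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS ADHOC_WH
    WITH WAREHOUSE_SIZE = 'LARGE'
         AUTO_SUSPEND = 60
         AUTO_RESUME = TRUE
""")
cur.execute("USE WAREHOUSE ADHOC_WH")

# Run the heavy transform on demand, e.g. keep only the latest submission per form.
cur.execute("""
    CREATE OR REPLACE TABLE CURATED.FORM_SUBMISSIONS_CLEAN AS
    SELECT *
    FROM RAW.FORM_SUBMISSIONS
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY customer_id, form_id
        ORDER BY submitted_at DESC
    ) = 1
""")

cur.close()
conn.close()
```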

A key aspect of self-serve is that such curated datasets can be properly curated, which means that lineage (where the source data comes from) and clean-up work (transforms) are readily accessible and documented, ideally within the same tool. Ideally, datasets of varying levels of curation and treatment are assigned to zones, like “raw” (for unclean), “experimental”, “development”, “production” and the like, all of which are under the control of each team doing the curation. Yes — there are tools for doing work like this. Contrary to what many IT folks will tell you about their allergic reaction to self-serve, it is easily possible to allow users to manage their own zones with sufficient isolation not to affect source or mission-critical data.
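
There are dedicated tools for this (dbt, for example, documents transforms and lineage), but even a lightweight convention goes a long way. As a purely illustrative sketch, a team could write a small metadata “sidecar” next to each curated dataset, recording its zone, its sources (lineage) and the transforms applied:

```python
import json
from datetime import datetime, timezone

# Illustrative convention only: a metadata "sidecar" written next to each
# curated dataset, recording its zone, sources (lineage) and transforms.
metadata = {
    "dataset": "curated/form_submissions_clean.parquet",
    "zone": "curated",            # e.g. raw -> experimental -> development -> production
    "owner": "bobs-team",
    "sources": [                  # lineage: where the source data comes from
        "raw/form_submissions.csv",
        "raw/customers.csv",
    ],
    "transforms": [               # what clean-up was applied
        "deduplicated on (customer_id, form_id), keeping latest submission",
        "joined customer attributes",
        "normalized status casing; added is_complete flag",
    ],
    "created_at": datetime.now(timezone.utc).isoformat(),
}

with open("curated/form_submissions_clean.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```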

The old model of centralized data ownership is being replaced by decentralized data stewardship. The attitudes, methods and tools of stewarding data in this way are part of the data literacy that I have been explaining.

Conclusion

Data is increasingly easy to instrument and generate. On the one hand, finding insights is getting easier because of advances in tools, but on the other hand it is getting harder because of the vastness of the oceans of data we have to search. Any attempt to centralize “insights generation” is a throwback to the old way of doing things when analytics required IT-level skills. The new tools have moved on: Tableau (and similar) are the new Excel.

Without insights, businesses ossify and decisions are made in vacuums with consequent inefficiencies. Insights require tools (“the light”) plus, most important of all, the business intuitions that help us know where to shine the light. Only the business folks have those intuitions. Only by decentralizing the task via self-serve environments and methodologies can it be made scalable and productive. The temptation to centralize using “cost center” metrics (and an “economies of scale” mindset) should be avoided at all cost (forgive the pun). The goal should be to move analytics as close as possible to the edge of real-time business decision-making. That is the true goal of self-serve.
