Here's the truth - the most successful datasets I've built, the ones that stuck around and didn't need constant remodeling, all started with dragging out the necessary context and defining the ideal state. Sounds simple, right? But whether I'm sculpting a grand warehouse domain model, crafting a dataset for operational analytics, or simply answering a business intelligence question, nailing down the real goal is often the trickiest part. More often than not, we breeze past this crucial step. Why? Well, it's awkward, uncomfortable, and who's got time with all those looming deadlines?
I'm not talking about the technicalities of constructing queries. It's more like a small-scale data modeling exercise - and trust me, it doesn't have to be a time-consuming nightmare.
Too often, we dive headfirst into coding with a vague idea, like playing a game of broken telephone with the business side.
Now, I'm not saying we should slam the brakes on every request because someone doesn't "get it," or drag out every data request for weeks. Business needs change, and one answer usually spawns more questions. So, how do we ride the wave rather than fight against it?
When a new request hits your desk, or you're launching a project, here's my playbook:
Multiple requests from different groups often aim to solve the same problem. Understanding the context of the task has helped me anticipate the data requests that are likely to follow.
EXAMPLE ANSWER:
"We need to understand how long each step of our service delivery process takes so we can identify which parts we need to optimize"
Hearing stakeholders describe where and how the data will be used has helped me plan the required refresh interval and imagine the future decisions that will be based on the information I deliver.
EXAMPLE ANSWER:
"We want to run email marketing campaigns to a segment of our self-signup customers who are at the beginning of the onboarding process AND are not on the hit list of our sales reps"
This question fights the never-ending wishlist of data columns. Force the person to prioritize so that you can deliver the most crucial data points faster. Find out which columns are non-negotiable, but exclude standard columns like ID or 'created at' from these business discussions. If they can't decide on just 5, increase the number - but make it a firm limit. You need to understand the burning needs, not what they could potentially use in the future (which, let's face it, will never get used anyway).
EXAMPLE ANSWER:
"Without onboarding status, revenue range, marketing consent, email address and links to Salesforce we won't be able to segment the users"
This question helps to describe the perfect outcome together. Keep the focus on explaining the meaning of each attribute in an ideal state; it is your job to sort out the construction details and handle limitations one way or another.
EXAMPLE ANSWER:
"Ideally onboarding status would tell me whether this user has even gone through the steps, where they are right now, and how long they have been in each stage"
The reason for asking this is to distinguish crucial information from nice-to-haves. Highly skilled stakeholders find ways to get their most critical metrics even without a proper dataset. The more painful or time-consuming that workaround is, the more crucial the metric we are talking about.
EXAMPLE ANSWER:
"Until now, I have had this spreadsheet that I populate with Zapier and then manually import it to my email marketing tool"
Once you've nailed these down, start digging. Iterate through sources, fields, and columns, and build bit by bit. People usually figure out what they really need once they see a quick example. Get that initial dataset out fast for a reality check, and once they're on board, expand the model or dataset.
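To make that "quick example" concrete, here is a rough sketch of what a first cut for the email campaign request above could look like. Everything in it is an assumption for illustration: the column names, the status values that count as "the beginning of onboarding", and the idea that the sales hit list arrives as a set of email addresses. It also assumes the rows are already limited to self-signup customers.

```python
# Hypothetical column names for the five must-haves from the example answer.
MUST_HAVE_COLUMNS = (
    "email",
    "onboarding_status",
    "revenue_range",
    "marketing_consent",
    "salesforce_link",
)

# Assumption: these status values mean "at the beginning of onboarding".
EARLY_STAGES = {"signed_up", "profile_created"}


def missing_columns(row):
    """Return the must-have columns that are absent or empty in a row."""
    return [col for col in MUST_HAVE_COLUMNS if row.get(col) in (None, "")]


def campaign_segment(rows, sales_hitlist):
    """Return emails of customers who are early in onboarding, have given
    marketing consent, and are not on the sales reps' hit list.

    Assumes `rows` holds only self-signup customers and `sales_hitlist`
    is a set of email addresses.
    """
    segment = []
    for row in rows:
        if missing_columns(row):
            continue  # incomplete rows go back into the source digging
        if (
            row["onboarding_status"] in EARLY_STAGES
            and row["marketing_consent"]
            and row["email"] not in sales_hitlist
        ):
            segment.append(row["email"])
    return segment


# Made-up sample rows, just enough to show a stakeholder something real.
sample_rows = [
    {"email": "a@example.com", "onboarding_status": "signed_up",
     "revenue_range": "0-10k", "marketing_consent": True,
     "salesforce_link": "https://example.com/sf/1"},
    {"email": "b@example.com", "onboarding_status": "profile_created",
     "revenue_range": "10-50k", "marketing_consent": True,
     "salesforce_link": "https://example.com/sf/2"},
    {"email": "c@example.com", "onboarding_status": "completed",
     "revenue_range": "0-10k", "marketing_consent": True,
     "salesforce_link": "https://example.com/sf/3"},
    {"email": "d@example.com", "onboarding_status": "signed_up",
     "revenue_range": "0-10k", "marketing_consent": False,
     "salesforce_link": "https://example.com/sf/4"},
]

print(campaign_segment(sample_rows, sales_hitlist={"b@example.com"}))
```

The point is not the code itself but how little of it you need before you can put a list of real customers in front of the requester and ask, "Is this the segment you had in mind?"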
In a nutshell, the success of any data endeavor lies in understanding the real questions before jumping into the tech stuff. And here's a pro tip: automate the building process so you can focus on getting that all-important context (use reconfigured, for example ;-).
Happy data wrangling! 🚀