Scoping Data Science Projects in a Smarter Way

October 08, 2019
By Vera Shao,
Data Scientist

No matter what industry we work in, we have all heard the buzz around Data Science. So, what is Data Science and why is it getting so much attention? Coming up with a perfect definition is tricky because the field and its techniques continue to evolve quickly, and the term means something slightly different in each setting. In the simplest terms, Data Science is the process of extracting information and insights from data. In practice, Data Science techniques allow us to wrangle, clean, validate, manipulate, and analyze massive amounts of data in order to reveal meaningful and actionable insights.

We can all agree that data is important in our rapidly changing business world. Every organization should have at least some knowledge of how to use massive amounts of data in a meaningful way. Data Science changes how decisions are made, and companies need to adopt a data-driven approach on a huge scale if they want to compete successfully.

Whether your company already has a Data Science team, is just starting to build one, or plans to hire an agency to start a Data Science project, here are some pointers to help your company scope a data science project effectively. 

Cyclical vs. Linear: The Process of a Data Science Project 

If the chart below looks familiar to you, you have experienced a project process we call the linear project management life cycle. In this process, each step is clearly defined with detailed information, including task title, current status, percentage of time tracked against deadlines, resource allocation, and even daily breakdowns. You may also have had clear visibility into, and control over, every step of the project the whole time.

[Figure: Linear project plan example]

The process of a Data Science project, however, does not look like a linear project plan. Rather than linear, it is cyclical. This diagram shows how the Bounteous team organizes the circular steps in the process:

[Figure: Example of the cycle of a Data Science project]

Many people may expect the same rigid timelines they are used to in a linear process, but differences in a business's technology, data maturity, data quality, and business objectives make it challenging for the data scientist to give an exact deadline for the completion of each step. Until the data scientist is extremely familiar with the data, it's not possible to predict the best data features to use for modeling, the performance of the models, or how well the exercise will fulfill the business needs.

We can see from the arrows in our diagram that the relationships between Business Understanding, Data Understanding, Data Preparation, Modeling, and Evaluation are reciprocal and happen through a fluid, back-and-forth process. It is key that every member of the Data Science project team understands the full process so everyone knows what to expect. Knowing something about each step helps clarify the project scope. For example, set a single date range for all three analysis steps (Data Preparation, Modeling, and Evaluation) with frequent check-in points, instead of breaking them down with specific deadlines for each step.
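As a rough illustration of that scoping tip, the sketch below generates evenly spaced check-in dates across one shared window for the three analysis steps. The function name `check_in_dates`, the weekly cadence, and the window dates are all hypothetical choices made for this example, not part of any particular project plan.

```python
from datetime import date, timedelta


def check_in_dates(start, end, every_days=7):
    """Return check-in dates spaced every_days apart, after start and up to end."""
    if every_days <= 0:
        raise ValueError("every_days must be positive")
    dates = []
    current = start + timedelta(days=every_days)
    while current <= end:
        dates.append(current)
        current += timedelta(days=every_days)
    return dates


# Hypothetical window shared by Data Preparation, Modeling, and Evaluation.
window_start = date(2019, 10, 14)
window_end = date(2019, 11, 22)

check_ins = check_in_dates(window_start, window_end)
for check_in in check_ins:
    print(check_in.isoformat())
```

The point of the design is that only the window and the check-in rhythm are fixed up front; what gets worked on between check-ins can move back and forth between the three steps.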

Understand the TRUE Need of the Project 

Business Understanding is the critical first step of any Data Science project, and the Data Science team needs to help clarify what the TRUE problem is. Asking many “Why” and “So what?” questions during the initial meetings helps the team better define the most important questions at the heart of the project.

For example, if the ask is “We want to know more about our customers,” the Data Science team should ask questions like: “Why do you want to know more about your customers? What will you do once you know more? Do you want to remarket to them, or make them more engaged, or do you just want to know their different characteristics?”

If the answer is “We want to make them more engaged,” the Data Science team should keep asking questions like: “What are some behaviors that show the customers are engaged?” By doing so, we can identify the root business problem and the key pain points, and design an actionable solution with measurable impact.

Data, Data, It’s All About Data 

If it’s a Data Science project, then data is the key to the project. My colleague Eugene Catrambone has a great blog post about asking the right questions to collect the right data to fully leverage the power of Data Science.

Ask any Data Scientist and they will tell you that the process of ‘wrangling’ (loading, understanding, and preparing) data represents the lion’s share of their workload — often up to as much as 80%.

-Ian Thomas
Chief Data Officer, Publicis Spine

After collecting the right data, one of the important tasks is for the Data Scientist to explore it: “What’s the size of the dataset? Do we have labeled columns? Is there a data sparsity problem? How might I visualize the datasets to get a better understanding of them?” These and similar questions need to be asked and answered before beginning the data preparation and modeling stages.
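A first pass at some of these questions can be quite small. The sketch below is only an illustration using Python's standard library: the sample data, the column names, the `profile` helper, and the choice of `churned` as the label column are all hypothetical, and the share of empty cells per column is used as a simple stand-in for a fuller sparsity check.

```python
import csv
import io

# Hypothetical sample; in practice this would be the project's own dataset.
SAMPLE_CSV = """customer_id,visits,last_purchase,churned
1,12,2019-09-01,0
2,,2019-08-15,1
3,3,,0
4,,,
"""


def profile(rows, label_column):
    """Summarize size, labeling, and missing values for a list of dict rows."""
    columns = list(rows[0].keys()) if rows else []
    n_rows = len(rows)
    # Share of empty cells per column: a simple proxy for data sparsity.
    missing_share = {
        col: sum(1 for row in rows if row[col] in ("", None)) / n_rows
        for col in columns
    } if n_rows else {}
    # Rows that actually carry a value in the label column.
    labeled_rows = sum(
        1 for row in rows if row.get(label_column) not in ("", None)
    )
    return {
        "n_rows": n_rows,
        "n_columns": len(columns),
        "has_label_column": label_column in columns,
        "labeled_rows": labeled_rows,
        "missing_share": missing_share,
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
summary = profile(rows, label_column="churned")
print(summary)
```

A summary like this will not answer the visualization question, but it gives the team early numbers on size, labels, and gaps to discuss before data preparation starts.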

In addition, understanding the various interpretations of the data, along with its limitations in quality and quantity, can help us identify risks in the project further down the road. As many of us may remember from statistics class, we must check our assumptions before running a test. 

Great Data Science is More Than Modeling

A great data scientist does more than code. While every estimate is different, more than one-third of a data scientist's time can be spent on tasks that are not directly related to coding and modeling. Examples include attending project management meetings and giving status updates, communicating progress and delivering findings to stakeholders, translating scientific results into business language, and building and sharing great slide decks.

Documentation is also key. A great data scientist takes the time to comment their code so that other analysts can easily reproduce the work. If the work results in a deployed model, a strong user manual should be created and shared with all relevant teams.

Many Data Science projects also require cross-functional teamwork. For example, when it comes time to deploy and activate a model, we may have an engineering team that can help build the data pipeline and automate the process. We might also have a Marketing Service team and/or Experience Design team to help execute A/B testing and collect results for later evaluation. A Data Science project is often the bridge that connects these functions across a company.

We need to make sure all of these activities are included when scoping the project and providing time estimates.

Iterative Data Science 

The purpose of this blog post was to introduce you to the iterative Data Science process and to provide helpful tips for scoping a great Data Science project. My advice is: Do not be afraid of failure. We just need to keep trying and keep learning from each Data Science project we do! As Aristotle said, “For the things we have to learn before we can do them, we learn by doing them.”