Data wrangling is what I would call the most important step in the data science process. The foundation of good data science is good data.
To get good data, we follow a five-step process: gather, filter, convert, explore, and integrate. We call this data wrangling, and these five steps make up our data wrangling pipeline.
The first step, gathering data, is about collecting raw data from its sources. This data can come from sensors, downloaded videos and images, crawled websites, fetched log files, poll or survey results, and so on. The data may arrive in small pieces that eventually get appended together into what could be called big data, or it may already sit under a mountain of data that we need to extract bits and pieces from.
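As a rough sketch of gathering, here is what collecting small pieces from several log files into one collection might look like. The file names and contents are made up for illustration:

```python
import pathlib
import tempfile

def gather(paths):
    """Collect raw lines from a list of log files into one list."""
    records = []
    for p in paths:
        records.extend(pathlib.Path(p).read_text().splitlines())
    return records

# Stand-in "log files" written to a temp directory for the demo.
tmp = pathlib.Path(tempfile.mkdtemp())
(tmp / "a.log").write_text("2024-01-01 ok\n2024-01-02 ok\n")
(tmp / "b.log").write_text("2024-01-03 ok\n")

lines = gather(sorted(tmp.glob("*.log")))
print(len(lines))  # 3
```

In practice the sources would be crawlers, sensor feeds, or APIs rather than local files, but the pattern is the same: many small pieces appended into one growing collection.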
Filtering, or scrubbing, takes the raw data and removes the parts we don’t need: corrupt data, invalid data, anything irrelevant. It’s the first step in going from unstructured or raw data to something usable. In the end, we decide what we’re going to keep, and in many cases we keep only what we need now or may need in the future.
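A minimal sketch of this scrubbing step, using a made-up list of name/age records where some rows are corrupt or invalid:

```python
raw = [
    "alice,34",
    "bob,-1",        # invalid: negative age
    "carol,twenty",  # corrupt: non-numeric age
    "dave,28",
]

def is_valid(row):
    """Keep only well-formed rows with a non-negative integer age."""
    parts = row.split(",")
    return (
        len(parts) == 2
        and parts[1].lstrip("-").isdigit()
        and int(parts[1]) >= 0
    )

clean = [r for r in raw if is_valid(r)]
print(clean)  # ['alice,34', 'dave,28']
```

The validity rules here are assumptions for the example; the real rules always depend on what your data is supposed to look like.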
In this step we get the data into a particular format. Once we’ve gathered data and done some preliminary filtering and scrubbing, we need to convert it to a usable format that can be processed by our analysis software or code. This includes common formats such as CSV, JSON, XML, or SQL, to name a few. So, we have the format of the entire structure of the data. But we also need to look at the format of the individual items inside this data: the number formats, the string formats, the date formats. We need to think about formats at this conversion step so we can decide how we’d ultimately like to store the data.
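Here is a sketch of both levels of conversion at once, on a made-up CSV snippet: the overall structure is converted from CSV to JSON, while the individual items are normalized (US-style dates to ISO dates, amount strings to numbers):

```python
import csv
import io
import json
from datetime import datetime

raw_csv = "date,amount\n03/15/2024,19.99\n04/01/2024,5.00\n"

rows = []
for rec in csv.DictReader(io.StringIO(raw_csv)):
    rows.append({
        # normalize the item formats: MM/DD/YYYY -> ISO date, string -> float
        "date": datetime.strptime(rec["date"], "%m/%d/%Y").date().isoformat(),
        "amount": float(rec["amount"]),
    })

converted = json.dumps(rows)
print(converted)
```

The source date format is an assumption for the example; the point is that deciding on one canonical format for dates and numbers happens here, before storage.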
Once we have the data in a usable format, we need to do some exploratory analysis. This is a pre-analysis step to find out what we actually have in the data.
We run some preliminary queries to get a sense of the data we have, and this will inform our statistical analysis and hypotheses. So, we get a sense of: okay, what do we actually have in our data? What’s our next step? What sort of hypotheses can we form from this data? And once we’ve done some basic exploration, we get into our data integration.
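A preliminary query can be as simple as a quick summary of one column. As a sketch, with a made-up list of temperature readings:

```python
import statistics

# Hypothetical sample: a handful of temperature readings.
readings = [21.5, 22.0, 19.8, 30.2, 21.1, 20.7]

summary = {
    "count": len(readings),
    "min": min(readings),
    "max": max(readings),
    "mean": round(statistics.mean(readings), 2),
}
print(summary)
```

Even this tiny summary does the job exploration is for: the max of 30.2 stands well apart from the rest, which is exactly the kind of observation that shapes a hypothesis or sends you back to the filtering step.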
So, our previous data wrangling steps involve working on possibly small pieces of data, and possibly schema-less data that isn’t in any structured format. We often process many smaller pieces that get concatenated, appended, joined, and aggregated into larger data stores. When these pieces add up, they sometimes add up to something considerable in size that requires alternate storage and processing methods, which we call big data. But the integration itself is still about how we bring the small pieces together. So, we take data in a very raw format and we go through the gathering, the filtering, the conversion, and the exploration.
And then we ultimately decide how we want to store it or assemble it so that we can perform our analysis on it. And then if we have a very large amount of data at the end, we call it big data, because we need different processing techniques or different storage techniques, maybe distributed storage.
Our analysis will have to go across a very large data store. And those are the five steps of the data wrangling process summarized. So, you gather, filter, convert, explore, and finally integrate your data.
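To close the loop on integration, here is a small sketch of the joining and aggregating it involves: two separately gathered pieces (both made up), keyed by a shared user id, get aggregated and joined into one record set ready for analysis:

```python
from collections import defaultdict

# Two small, separately gathered pieces sharing a user id (hypothetical data).
profiles = {1: "alice", 2: "bob"}
scores = [(1, 90), (2, 75), (1, 85)]

# Aggregate the scores per user...
totals = defaultdict(list)
for uid, s in scores:
    totals[uid].append(s)

# ...then join them with the profiles into one integrated record set.
integrated = [
    {"id": uid, "name": profiles[uid], "avg_score": sum(v) / len(v)}
    for uid, v in sorted(totals.items())
]
print(integrated)
```

At big data scale the same concatenate/join/aggregate operations would run in a database or a distributed framework rather than in memory, but the shape of the integration step is the same.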