Before data science/machine learning/data mining/predictive analytics can be done, you need to have the data you are going to use. This may seem obvious, but in many cases there is more to this step than might first be assumed, and the whole process is what I will call “data wrangling”, although it has other names like “data munging”.
So what is involved in data wrangling? First, it is about extracting data from where it is currently stored. If the data is in a traditional database, this will often involve writing SQL queries, or finding another way to export the data if the system doesn’t give direct access to it.
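As a minimal sketch of this extraction step, here is a SQL query run from Python against an in-memory SQLite database. The table and column names are made up for illustration; in practice you would connect to whatever database system your data actually lives in.

```python
import sqlite3

# Set up a tiny stand-in database (hypothetical table and columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, birth_date TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', '1990-03-15')")

# The extraction step itself: a plain SQL query pulling the columns we need.
rows = conn.execute("SELECT id, name, birth_date FROM customers").fetchall()
print(rows)  # [(1, 'Ada', '1990-03-15')]
```

With a real database the query is the same idea, just pointed at the production system (or at an export of it, if direct access isn’t available).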
Data wrangling can also involve data transformation. For instance, let’s say you have a database that stores people’s dates of birth, but in the end you want to work with their ages; you will need to do something to change those birth dates into ages. (And as I once learned, to do it in a way that accounts for their age at the time of whatever you are comparing, but I will talk more about that in a later post.)
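A sketch of that transformation, including the “as of a given date” wrinkle: compute age in whole years relative to a reference date rather than today, and subtract a year if the birthday hasn’t come around yet that year.

```python
from datetime import date

def age_on(birth_date: date, as_of: date) -> int:
    """Age in whole years on the given reference date."""
    # True (i.e. 1) if the birthday hasn't occurred yet in the as_of year.
    before_birthday = (as_of.month, as_of.day) < (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - before_birthday

print(age_on(date(1990, 3, 15), date(2020, 3, 14)))  # 29
print(age_on(date(1990, 3, 15), date(2020, 3, 15)))  # 30
```

The reference date would be whatever point in time you are comparing against, not necessarily the date you run the analysis.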
Data wrangling also often involves data cleanup. For instance, suppose your database has information about phone numbers: some of the entries are stored with symbols and some aren’t, some have area codes and others don’t, and some entries have text like “ext” or “none”. If you wanted to use the area code as part of your ultimate data mining/machine learning/predictive analytics, you would need to find a way to clear out the stuff you don’t want and determine which part of the phone number is the area code.
Area codes are also a good example of something that needs to be considered while data wrangling: how good are the data you want to use? There is an old adage in computer science: Garbage In, Garbage Out (GIGO). If the area code has been typed in wrong, or if the data has been gathered inconsistently so that some people have not included an area code, then you are not going to get good results.
Further, a lot of the time you will be using data as a proxy for something else. In the past, an area code identified a geographic region, and thus might be used to determine information about people in different regions. But that was more true before the widespread use of mobile phones. Nowadays, if someone moves from one geographic region to another, it is not uncommon for them to keep the same mobile phone number, so area code is no longer as good a proxy for where people live.
How much data wrangling is involved depends greatly on the end goal and on the quality and form of the data you currently have. Applications that have been designed from the ground up for data mining (such as LinkedIn) probably don’t require a lot of manual data wrangling. But for a business with a large existing database that has been used for a long time, and that may not have had the best design when it was first created, data wrangling might be a huge part of the overall process.
So what happens after the data is gathered in a form that can be used? The next stage is what traditionally has been called “data mining”.