Once the data has been gathered and put into a usable form, an appropriate algorithm can be applied to accomplish the data mining/machine learning/predictive analytics. This is the stage that has traditionally been called “data mining,” because it is the part that extracts additional value from the data in the form of some type of knowledge (this is why, early on, the process was sometimes called “knowledge discovery in databases,” or KDD).
Other texts often refer to the algorithm as a “model,” because statistical models are commonly used, and in all cases our understanding of the underlying data is really just a model, and that model may or may not produce the answer we are looking for as well as other models would. But I will use the word “algorithm,” because some techniques involve complex algorithms where we don’t necessarily end up with a clear mental model of why the answer came out as it did. Examples where the data scientist might not have a complete understanding of the full underlying model include neural networks (which use a system that loosely mimics how neurons act in the brain) and ensemble methods (which combine multiple algorithms/models).
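To make the ensemble idea concrete, here is a minimal sketch in Python. The scikit-learn library, the built-in iris dataset, and the three component algorithms are my illustrative choices, not anything prescribed above; the point is only that several different algorithms can be combined into one predictor whose individual answers have no single, simple explanation.

```python
# A minimal sketch of an ensemble method (illustrative choices of
# library, dataset, and component algorithms).
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # a small built-in example dataset

# Three different algorithms, each producing its own model of the data...
ensemble = VotingClassifier(estimators=[
    ("logreg", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier()),
    ("knn", KNeighborsClassifier()),
])

# ...combined by majority vote, so no one clear mental model explains
# why any particular prediction came out as it did.
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```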
The data mining stage requires enough knowledge of math/statistics and computer science to choose an algorithm that will get good results from the data you have, and often to then program that algorithm for use. (In practice, data mining software packages, or libraries in a programming language, often spare the data scientist from creating the full algorithm from scratch, as in the sketch below.)
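Here is a hedged sketch of that point, again using scikit-learn as an assumed tool and k-means clustering as an assumed algorithm: the data scientist configures and applies the algorithm, while the library supplies its full implementation.

```python
# A sketch of applying a library-supplied algorithm rather than
# coding it from scratch (scikit-learn and k-means are my
# illustrative choices).
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]])

# One line applies k-means; the math/stat work of the algorithm
# itself is already implemented inside the library.
model = KMeans(n_clusters=2, n_init=10).fit(data)
print(model.labels_)  # which cluster each row was assigned to
```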
So how does a data scientist know whether they have a good algorithm? Usually the data scientist will test it against past data: if the algorithm had been used in the past, would it have made enough correct predictions? Of course, once the algorithm is implemented, it will continue to be monitored to see whether it is still making enough correct predictions. Further, it is common for one algorithm to be used at first and a better one adopted later. For example, Netflix has improved its algorithm over time for guessing what other movies you might like to see. (Interestingly, Netflix ran a huge contest to see who could come up with the best algorithm, and in the end the winner was an ensemble of many other algorithms. But that method was so complex that Netflix decided not to use it.)
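A minimal sketch of “testing with past data” follows: hold out part of the historical data, train on the rest, and count how many predictions the algorithm would have gotten right. The library, dataset, algorithm, and split are all my illustrative assumptions, not Netflix’s actual method.

```python
# Hold out part of the data as "the past we didn't peek at",
# then measure how often the algorithm's predictions are correct.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Reserve 30% of the data purely for checking predictions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)

# Would the algorithm have made enough correct predictions?
print(accuracy_score(y_test, model.predict(X_test)))
```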
Once the software is making predictions/learning/categorizing/etc., something must be done with this knowledge, and that is the stage I will call “data artistry.”