Excellent sleuthing. Whether you are using Squark's tool or another automated data science platform that uses machine learning, you may notice that running the same data set twice sometimes produces different results. This isn't an error. It is due to the learning part of machine learning, and it's a good thing.
What you are noticing is the stochastic nature of machine learning (ML), in contrast with traditional statistics. A stochastic process has a random probability distribution or pattern that can be analyzed statistically but cannot be predicted precisely. Rest assured, nothing is wrong or broken with your project, and here's why.
A deterministic algorithm always produces the same output from the same input, with the underlying machine passing through the same sequence of steps every time, as in traditional statistics. Given the same dataset, it learns the same model on every run; linear regression is a good example. Run a linear regression twice on the same data and the results are identical. But once we get into the world of non-linear models and algorithms, things change because of the underlying stochastic processes.
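To see this concretely, here is a quick sketch in Python using scikit-learn (not Squark itself): fitting ordinary least squares twice on the same data yields identical coefficients, because the algorithm has a deterministic, closed-form solution.

```python
# A minimal sketch: a deterministic algorithm (ordinary least squares)
# learns the same model every time it sees the same data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

model_a = LinearRegression().fit(X, y)
model_b = LinearRegression().fit(X, y)

# Identical coefficients on every run.
print(np.allclose(model_a.coef_, model_b.coef_))  # True
```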
ML algorithms tend not to be deterministic; they are stochastic, incorporating elements of randomness. Stochastic does not mean random, though. A stochastic machine learning algorithm is not learning a random model; it is still learning a model from the data provided. What varies randomly are the small decisions the algorithm makes along the way, such as how weights are initialized or which rows are sampled first.
The impact is that each time a stochastic machine learning algorithm is run on the same data, it can learn a slightly different model. In turn, that model may produce slightly different predictions and, when evaluated for error or accuracy, show slightly different performance.
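Here is a hypothetical illustration of that, again with scikit-learn rather than Squark: training a random forest several times on the same data without fixing a seed yields slightly different models and slightly different accuracy scores.

```python
# A minimal sketch of stochastic training: without a fixed seed, each run
# makes slightly different random decisions (bootstrap samples, feature
# subsets) and so learns a slightly different model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0  # hold the split fixed; vary only training
)

for run in range(3):
    model = RandomForestClassifier(n_estimators=50)  # no random_state: stochastic
    model.fit(X_train, y_train)
    print(f"run {run}: accuracy = {model.score(X_test, y_test):.4f}")
# Scores are similar but usually not identical; pass random_state=... to pin them.
```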
Model evaluation and validation techniques, such as train-test splits and k-fold cross-validation, are also stochastic: small decisions in the process involve randomness. In model evaluation, randomness is used to choose which rows are assigned to each subset of the data, and resampling those subsets approximates an estimate of model performance that is independent of any single sample. Those estimates tell you how the model is likely to perform on new data.
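A small sketch of where that randomness lives, using scikit-learn's split utilities for illustration (Squark handles this internally):

```python
# Which rows land in the held-out set is a random choice unless you fix a seed.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

row_ids = np.arange(10)

# Two unseeded splits generally assign different rows to the test set.
_, test_a = train_test_split(row_ids, test_size=0.3)
_, test_b = train_test_split(row_ids, test_size=0.3)
print(sorted(test_a), sorted(test_b))

# k-fold cross-validation resamples the data so every row is held out once;
# the per-fold scores combine into an estimate of performance on new data.
kfold = KFold(n_splits=5, shuffle=True)  # shuffle=True randomizes fold membership
for fold, (train_idx, test_idx) in enumerate(kfold.split(row_ids)):
    print(f"fold {fold}: held-out rows {test_idx}")
```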
Randomness in ML is a feature, not a bug.
The answer is that in ML, there is not a single model for your data set. Instead, a stochastic process generates models that make similar but different decisions when re-run. In Squark, you can choose to apply the best model to the data for scoring, apply another model from the leaderboard, or rebuild from scratch with a deterministic algorithm you've previously run.
The model selected can be a single model or a combination of multiple models. An ensemble is a meta-learning process that takes the other algorithms and combines them to create an entirely new algorithm. When a stacked ensemble is used, the variables of importance listed within Squark's platform are the weighted importances of the other models applied. Shapley values, for example, provide a way to look at each variable's positive and negative contributions in a stacked ensemble (and in the other algorithms).
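For illustration, here is a minimal stacked ensemble sketched with scikit-learn's StackingClassifier; the base models chosen here are assumptions for the example, not Squark's actual leaderboard.

```python
# A minimal sketch of a stacked ensemble: a meta-learner combines the
# predictions of several base models into an entirely new model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# The final_estimator learns how to weight the base models' predictions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X, y)
print(stack.score(X, y))
```

Model-agnostic Shapley tools, such as the shap library, can then attribute each variable's positive and negative contributions to the ensemble's predictions.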
Want to learn more or see this in action? Contact us here to review Squark’s platform.
Judah Phillips