It is good practice to gather a population of results when comparing two different machine learning algorithms or when comparing the same algorithm with different configurations.
Repeating each experimental run 30 or more times gives you a population of results from which you can calculate the mean expected performance, given the stochastic nature of most machine learning algorithms.
If the mean expected performance from two algorithms or configurations are different, how do you know that the difference is significant, and how significant?
Statistical significance tests are an important tool to help to interpret the results from machine learning experiments. Additionally, the findings from these tools can help you better and more confidently present your experimental results and choose the right algorithms and configurations for your predictive modeling problem.