Pretty sure the only useful thing I got out of that test is a sense of how much random variance to expect when I test the same dataset four times. The only major conclusion I can draw from it is that I should up the tests to 25 casts or more once I have a dataset worth testing.
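For anyone wondering why more casts helps: if the spread between casts is just independent random noise (that is an assumption, and a bug could break it), the uncertainty in the average shrinks with the square root of the number of casts. Here is a rough sketch of that arithmetic. The function names and the numbers in the usage lines are made up for illustration and are not results from my tests.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


def casts_needed(per_cast_stdev, target_error):
    """Smallest cast count whose standard error is at or below target_error.

    Assumes each cast is an independent draw with the same spread.
    """
    return math.ceil((per_cast_stdev / target_error) ** 2)


# Hypothetical numbers, for illustration only.
example_casts = [90, 100, 110, 100]
print(standard_error(example_casts))
print(casts_needed(per_cast_stdev=10, target_error=2))
```

Put differently, under that assumption halving the noise in the average takes four times as many casts, so small runs will always look jumpy.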
We know it's broken for mana. We now know it doesn't work for damage either. At least not properly or consistently.