How would you architect a scalable approach to test and oversee AI systems whose outputs are non-deterministic?