Before we entrust our care to AI systems and “doctor robots”, we must first commit to identifying biases in datasets and correcting them as far as possible. AI systems also need to be evaluated not only on the accuracy of their recommendations but also on whether they perpetuate or mitigate disparities in care and outcomes. One approach could be to create national test datasets, some with and some without known biases, to measure how well models avoid recommending unethical care or producing nonsensical clinical recommendations. We could go one step further and use peer review to evaluate the results of such testing and to suggest improvements to the AI systems. This is similar to the highly effective approach the National Institutes of Health uses to evaluate grant applications and that journals use to evaluate research findings. These interventions could go a long way toward improving public trust in AI and perhaps, someday, enabling patients to receive the kind of unbiased care that human doctors should have been providing all along.