As artificial intelligence (AI) makes inroads into our day-to-day lives, it is fascinating to learn how machine learning systems arrive at their decisions. Scientists trying to build less biased systems often argue that the key to success is better algorithms. But algorithms are only as good as the data they use, finds a new study by a team of scientists at the Massachusetts Institute of Technology (MIT).
“We view this as a toolbox for helping machine learning engineers figure out what questions to ask of their data in order to diagnose why their systems may be making unfair predictions,” says MIT professor David Sontag.
“One of the biggest misconceptions is that more data is always better. Getting more participants doesn’t necessarily help, since drawing from the exact same population often leads to the same subgroups being under-represented. Even the popular image database ImageNet, with its many millions of images, has been shown to be biased towards the Northern Hemisphere,” adds study lead author Irene Chen, who, along with Professor Sontag and postdoctoral associate Fredrik D. Johansson, presented the findings at the annual Conference on Neural Information Processing Systems.
Sontag says the key to solving this issue is to go out and collect more data from under-represented groups. The scientists found, for instance, that a system’s ability to predict intensive care unit mortality was less accurate for Asian patients, and that an income-prediction system was twice as likely to misclassify female employees as low-income and male employees as high-income.
The scientists found that it helps considerably to examine a dataset and estimate how many more participants from each population are needed to improve accuracy for the group with lower accuracy, while still preserving accuracy for the group with higher accuracy.
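For readers who want a feel for what such an estimate could look like in practice, here is a minimal Python sketch. It is not the study's code. It assumes that a group's error rate shrinks roughly as a power law of that group's sample size (error ≈ a · n^(-b)), which is a common simplification for learning curves, and the function names (error_rate_by_group, fit_power_law, samples_needed) and all numbers are hypothetical illustrations.

```python
import math
from collections import defaultdict


def error_rate_by_group(y_true, y_pred, groups):
    """Return the fraction of misclassified examples for each group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        if truth != pred:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


def fit_power_law(sample_sizes, errors):
    """Fit error = a * n**(-b) by least squares in log-log space.

    Assumption: the learning curve follows a power law. Real curves
    may flatten out at a nonzero error floor, which this ignores.
    """
    xs = [math.log(n) for n in sample_sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def samples_needed(a, b, target_error):
    """Solve a * n**(-b) = target_error for n, rounded up."""
    return math.ceil((a / target_error) ** (1.0 / b))


if __name__ == "__main__":
    # Hypothetical error rates measured for one under-represented
    # group after training on 100, 400 and 1600 of its examples.
    sizes = [100, 400, 1600]
    observed = [0.20, 0.11, 0.055]
    a, b = fit_power_law(sizes, observed)
    print("fitted a, b:", a, b)
    print("examples needed for 4% error:", samples_needed(a, b, 0.04))
```

Comparing the projected sample size with the cost of recruiting that many additional participants is one rough way to frame the kind of cost-benefit question the researchers describe.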
“With a more nuanced approach like this, hospitals and other institutions would be better equipped to do cost-benefit analyses to see if it would be useful to get more data,” says Chen.