I'm amazed that in this lengthy article there isn't a single mention of the largest potential source of error in the data - data bias due to either undetected biases in the population, or due to inappropriate selective sampling of the data. Data teams need to check continually to ensure that the data that is being used is truly representative of the population being modeled and to take care to evaluate whether population biases are creeping in - not to mention their own biases.
Hey Mike. That's a great point; let me address it. Simply put: bias, by itself, is not a problem. From an information-theory perspective, biases are simply shortcuts that a model (natural or artificial) takes to arrive at decisions quickly. A bias can be positive, negative, or neutral, depending on what the bias is.
Biases become a problem in two distinct cases:
1. We pick a fundamentally bad problem. That's a very important topic, but it's out of scope for this specific article.
2. We have biases in our models that act in ways we don't want them to (or, worse, don't expect). That's why both this article and part 1 place such an emphasis on transparency.
To address your example: if your sampling is off, that's a very classic example of dev-prod mismatch. Data quality checks and validation would address this directly.
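To make that concrete, here's a minimal sketch of the kind of data-quality check I have in mind: compare a feature's distribution in the reference (training) sample against fresh production data and flag a mismatch. The dataset, feature name, and alert threshold are all hypothetical, just to illustrate the idea.

```python
# Minimal sketch: detect dev-prod mismatch by comparing a feature's
# training distribution against production traffic.
# Names, data, and the threshold below are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp


def distribution_shifted(reference: np.ndarray,
                         production: np.ndarray,
                         alpha: float = 0.01) -> bool:
    """Return True if the two samples look drawn from different distributions."""
    _, p_value = ks_2samp(reference, production)
    return p_value < alpha  # small p-value -> distributions likely differ


# Example: training data sampled from a narrow sub-population,
# production traffic coming from a broader real population.
rng = np.random.default_rng(42)
train_ages = rng.normal(loc=35, scale=5, size=5_000)   # narrow slice
prod_ages = rng.normal(loc=45, scale=15, size=5_000)   # wider population

if distribution_shifted(train_ages, prod_ages):
    print("Alert: 'age' drifted -- training sample may not be representative")
```

This is just one possible check (a two-sample KS test per feature); the point is that routine validation like this surfaces unrepresentative sampling before it silently skews the model.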
Undetected biases (of any kind) are a symptom of both low transparency and insufficient domain knowledge.
Monitoring and understanding bias in AI systems matters, no doubt, but that's not different from thorough domain analysis + pipeline monitoring + transparency.
Hopefully that makes sense.