Data is Not the new Oil, Data is the new Diamonds (maybe)
Over the past decade I have heard this sentence more times than I can count: "Data is the new oil". At the time it sounded right; now I see it as misguided.
That simple sentence took off when people realized that big tech (mostly Facebook and Google) was collecting huge amounts of data on its users. Although this was, in hindsight, before AI blew up into the massive thing it is now, it had a profound effect on people's minds. The competitive advantages that companies with data were able to achieve inspired a new industry and a new speciality in computer science, Big Data, and fostered the creation of many new technologies that have become essential to the modern internet.
"Data is the new oil" means two things:
1- Every drop is valuable.
2- The more you have, the better.
And it seemed true, but it was an artifact of a Big Tech use case. What Big Tech was doing at the time was selling ads with AI. To sell ads to people, you need to model their behaviour and psychology. To achieve that you need behavioural data, and that is exactly what Google and Facebook had. It is a perfect use case, where the data collected is very clean and tightly fits the application. In other words, the noise-to-signal ratio is low, and in this case, the more data you can collect the better.
This early success, however, hid a major truth for years: for AI to work well, the quality of the dataset matters a great deal. Unlike oil, when it comes to data, some drops are more valuable than others.
In other words, data, like a diamond, needs to be cut and polished before it can be presented. Depending on the application, we need people who understand the type of data, the meanings associated with it, the issues associated with its collection and, most importantly, how to clean and normalize it.
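To make "clean and normalize" a little more concrete, here is a minimal sketch of what a curation pass might look like. Everything in it is an assumption made for illustration: the record layout (a "user" and a "value" field), the curate function name, and the choice of z-score normalization. Real curation depends heavily on the application.

```python
from statistics import mean, pstdev

def curate(records):
    """Drop incomplete rows and exact duplicates, then z-score the 'value' field.

    'user' and 'value' are hypothetical field names chosen for this sketch.
    """
    seen = set()
    cleaned = []
    for rec in records:
        user, value = rec.get("user"), rec.get("value")
        if user is None or value is None:
            continue  # incomplete record: not useful, skip it
        key = (user, value)
        if key in seen:
            continue  # exact duplicate: adds volume, not information
        seen.add(key)
        cleaned.append({"user": user, "value": float(value)})

    if not cleaned:
        return []

    # Normalize so that values are on a comparable scale (mean 0, std 1).
    values = [r["value"] for r in cleaned]
    mu, sigma = mean(values), pstdev(values)
    for r in cleaned:
        r["value"] = (r["value"] - mu) / sigma if sigma else 0.0
    return cleaned

raw = [
    {"user": "a", "value": 10},
    {"user": "a", "value": 10},    # duplicate
    {"user": "b", "value": None},  # missing value
    {"user": "c", "value": 30},
]
curated = curate(raw)
```

Of the four raw records, only two survive. The point is not the specific rules but that the smaller, cleaner set is what a model should actually see.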
In my opinion, data curation is a major factor in what differentiates a great AI from a below-average AI. Those who misunderstood this concept ended up significantly increasing their costs with complex Big Data infrastructures, only to drown themselves in heaps of data that they don't need and that hinder the training of their models.
When it comes to data, hoarding and greed are not the way to go. We should keep in mind that data has no intrinsic value; the universe keeps generating infinite amounts of it. What we need is useful data.