Hopara Founder, CTO: Data Scientists Should Be Better at Data Integration, Management and Visualization
(US and Canada) Dr. Michael Stonebraker, Founder, CTO and Visionary, Hopara, and Ken Smith, Co-CEO, Founder, Hopara, speak with Ricardo Crepaldi, Director Business Intelligence, Big Data and Integration, BASF-Germany, about upcoming trends in data and the problems with the academic side of data science.
Regarding the role of data and how it is likely to change over the next decade, Stonebraker says a lot of universities started data science programs in the last decade, and they mostly teach tools to analyze data, machine learning, matrix operations, etc. What they are completely missing, he notes, is twofold. One, if 90% of the time is spent in data preparation, it’s not data science. Instead, it is data integration. And so, data scientists should train to improve at data integration.
The second problem, according to Stonebraker, is that data scientists have to get better at data management and visualization. “Let's say I write a machine learning algorithm to predict the price of oil. I get some training data, I run the model and it doesn't work. So then I say, my training data is probably wrong. So, I go get some more training data — still doesn't work. Then I tune the hyper-parameters — still doesn't work. Then I realize the data is dirty, so I go and I clean it. A typical model goes through a hundred or so iterations to get something that you can put in production,” he says.
This is trial and error through a lot of iterations, he explains, and every iteration needs training data, output data, models, and hyperparameters, and it’s a large amount of data that goes into every iteration. “Fundamentally, data modeling/data analysis is a data problem, just keeping track of all of the stuff so that I can move backward if I have to. And it's a visualization problem because I want to figure out what went wrong.
“Data scientists have to become better at data integration. They have to become better at data management and they have to become better at visualization. Those are the three key things that you need to be more successful than you are now,” Stonebraker adds.
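The bookkeeping Stonebraker describes, recording what went into each iteration so that one can move backward and see what changed, can be illustrated with a minimal Python sketch. It is not taken from the interview or from Hopara: the names `Iteration`, `IterationLog`, `record`, `best`, and `diff` are hypothetical, the log lives only in memory, and the prices, hyperparameters, and error values are made-up placeholders.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Iteration:
    """One trial: what went in (data, hyperparameters) and what came out (error metric)."""
    number: int
    data_fingerprint: str
    hyperparameters: dict
    metric: float
    note: str = ""


@dataclass
class IterationLog:
    """Hypothetical in-memory log; a real system would persist this somewhere durable."""
    iterations: list = field(default_factory=list)

    def record(self, training_rows, hyperparameters, metric, note=""):
        # Fingerprint the training data so a later change (such as cleaning) is detectable.
        payload = json.dumps(training_rows, sort_keys=True).encode("utf-8")
        fingerprint = hashlib.sha256(payload).hexdigest()[:12]
        entry = Iteration(len(self.iterations) + 1, fingerprint,
                          dict(hyperparameters), metric, note)
        self.iterations.append(entry)
        return entry

    def best(self):
        # "Moving backward" here means returning to the lowest-error iteration so far.
        return min(self.iterations, key=lambda it: it.metric)

    def diff(self, a, b):
        """Report what changed between iterations a and b (1-based numbers)."""
        first, second = self.iterations[a - 1], self.iterations[b - 1]
        keys = set(first.hyperparameters) | set(second.hyperparameters)
        changed = {k: (first.hyperparameters.get(k), second.hyperparameters.get(k))
                   for k in keys
                   if first.hyperparameters.get(k) != second.hyperparameters.get(k)}
        return {"data_changed": first.data_fingerprint != second.data_fingerprint,
                "hyperparameters_changed": changed}


# Placeholder values only: three iterations mirroring the oil-price story above.
log = IterationLog()
raw = [{"day": 1, "price": 70.0}, {"day": 2, "price": -1.0}]  # second row is "dirty"
clean = [{"day": 1, "price": 70.0}]
log.record(raw, {"learning_rate": 0.1}, metric=9.5, note="first attempt")
log.record(raw, {"learning_rate": 0.01}, metric=9.1, note="tuned hyperparameters")
log.record(clean, {"learning_rate": 0.01}, metric=2.3, note="cleaned the data")

best_run = log.best()
change = log.diff(2, 3)
```

In this toy run, `diff(2, 3)` reports that only the data changed between the second and third iterations, which is the kind of question a visualization of the iteration history would be meant to answer at a glance.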
Sharing his viewpoint, Smith says that the supply of data analysts is still not keeping up with the demand for data analysis. “Given the demand, anybody with that title is going to stay well employed for a very long time.”
Smith advises that it’s imperative to think about the data end users, who are not data scientists but are the frontline folks trying to empower the organization. “When you're trying to drive data-driven decision making, etc., think about them first and then build backward,” he says.
“Don't be afraid of disruptive innovation,” Stonebraker interjects. “You embrace innovation, embrace disruption.”
Sharing briefly on how Hopara came about, he says that it emerged from the idea of applying Google Maps-like services to data. “If you have floor plans, you're out of luck. If you need other visualizations that are two-dimensional, like scatter plots or heat maps, Google Maps doesn't do that. The original idea was why can't we do Google Maps for all of your data? And that turned into a research project at MIT where we built a prototype of what is now Hopara,” he concludes.