In our previous article, “Data Science Industry Perspectives in the Cloud,” we discussed that evolution is key if you plan to grow your business. You can start with ready-made solutions, then, in time, you can switch to in-house solutions created with the help of a group of data scientists.
This time around, we’ll talk about cases where machine learning as a service (MLaaS) doesn’t work, and what to do instead. Your company’s first move should not be to hire a data scientist to solve urgent business needs; rather, it should invest in a custom-made solution. Only when you have a workable solution should you dig deeper and build a proper team that can create an in-house data science solution.
Custom-Made End-to-End Solution as a Start
Data science in the classical sense involves data, goals and models to solve a pressing issue. The best way to jump-start a data science process is to combine various ready-made services into a single workable product that shows your customers clear-cut results in short order. This can be done without complex or wide-ranging research, and customer feedback on that product lets you formulate specifications for a higher-level data science offering later.
One of the biggest issues for any data science project is formulating its specifications. The usual request is, “Create something for my company using data science. Analyze data for me.” This type of job involves a lot of trial and error. Having a custom end-to-end solution from the start lets you add extra services as needed and gain greater insight into, and better predictions about, a specific business workflow.
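One way to picture an end-to-end solution assembled from ready-made services, with room to add more later, is a pipeline of interchangeable steps. The sketch below is illustrative only: the step names (`ingest`, `score`, `flag_top`), the record fields and the scoring rule are hypothetical stand-ins for whatever hosted services a company actually uses.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]
Step = Callable[[List[Record]], List[Record]]


def ingest(records: List[Record]) -> List[Record]:
    """Hypothetical stand-in for a data-collection service: drop incomplete rows."""
    return [r for r in records if "visits" in r and "purchases" in r]


def score(records: List[Record]) -> List[Record]:
    """Hypothetical stand-in for a hosted prediction service (purchases per visit)."""
    return [
        {**r, "score": r["purchases"] / r["visits"] if r["visits"] else 0.0}
        for r in records
    ]


def flag_top(records: List[Record]) -> List[Record]:
    """An extra service added later: mark customers with a high score."""
    return [{**r, "top": r["score"] >= 0.5} for r in records]


def run_pipeline(records: List[Record], steps: List[Step]) -> List[Record]:
    """Apply each step in order; adding a service means appending a step."""
    for step in steps:
        records = step(records)
    return records


customers = [
    {"visits": 10, "purchases": 5},
    {"visits": 4},                    # incomplete row, removed by ingest
    {"visits": 0, "purchases": 0},
]

first_version = run_pipeline(customers, [ingest, score])
extended = run_pipeline(customers, [ingest, score, flag_top])
```

The first version already produces a result that can be shown to customers; the extended version adds a step without touching the existing ones, which is the kind of incremental adjustment described above.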
I believe companies should build their data science system starting not from data science research but from an end-to-end solution. Then, bit by bit, they can adjust their services along the way.
Moving on to Data Science Research
Companies start data science research by employing a data scientist who matches their needs. You can use the standard data scientist qualifications: This employee should have a firm footing in the application domain, mathematics and programming. Their business skills are also key to success. I believe an understanding of the subject area matters most here because we are solving business issues. A data scientist who prefers more academic problems will be more focused on winning a Kaggle competition than on addressing the business needs.
It is important that this person understands the product development cycle. This way, the models and data analyses will be built in such a way that they can be deployed in production. For example, if a person is using R (a language and environment popular among data scientists), we should note that it is more research-oriented than production-oriented. Correspondingly, the results of such research cannot be deployed directly in an end-to-end solution. Therefore, the data scientist should work with a programmer.
Although the above diagram shows that a data scientist should have hacking skills, in reality this is rarely the case. The vast majority of data scientists are not able to write production-quality code. If you need not just research but a solution, then process-wise you need a data scientist working together with a data engineer.
CAP Theorem as an Analogy
According to the CAP theorem, a distributed data store cannot guarantee all three of consistency, availability and partition tolerance at once. This basic rule is known to most developers. Something similar is true for a data scientist: People envision a data scientist who combines all three skill sets (business insight, mathematics and programming), but in real life, you rarely find such an ideal candidate. Usually, people lean one way or another in their work, and keeping a balance is not always a priority.
In principle, a data scientist should have business insight, understand the math behind the problem and work in tandem with a skilled developer. Of course, a proper data scientist should be able to write some semblance of code, but a data scientist should work with a data engineer to write and ship that code, with both being responsible for the quality and sustainability of the solution.
The classic tragedy of a company that decides to initiate any data science research is when a data scientist says, “I’ve got 40,000 lines of Python code on my PC. Can you make it work in production?” And, of course, this is virtually impossible to do. When that happens, all the research is simply wasted.
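A handoff of this kind tends to go more smoothly when the research result is exposed through a small, tested interface rather than a large script. The following is a minimal sketch under stated assumptions: `ChurnModel`, its feature names and its weights are hypothetical, the logistic form is chosen only for illustration, and numerical edge cases are not handled.

```python
import json
import math
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class ChurnModel:
    """Hypothetical logistic model; real weights would come from the research phase."""

    weights: Dict[str, float]
    bias: float

    def predict(self, features: Dict[str, float]) -> float:
        """Return a probability between 0 and 1; missing features count as 0."""
        z = self.bias + sum(
            w * features.get(name, 0.0) for name, w in self.weights.items()
        )
        return 1.0 / (1.0 + math.exp(-z))

    def to_json(self) -> str:
        """Serialize parameters so an engineer can load them in another service."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, payload: str) -> "ChurnModel":
        data = json.loads(payload)
        return cls(weights=data["weights"], bias=data["bias"])


# The data scientist exports the model; the data engineer reloads it elsewhere.
model = ChurnModel(weights={"days_inactive": 0.1, "support_tickets": 0.5}, bias=-2.0)
restored = ChurnModel.from_json(model.to_json())
probability = restored.predict({"days_inactive": 20})
```

The point of the sketch is the contract, not the model: a single `predict` method, explicit parameters and a round-trip through a plain-text format give the data engineer something that can be tested and deployed independently of the research environment.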
Cross-Functional Teams
Any team that deals with data science has to be cross-functional. In other words, it has to cover the whole stack of the solutions it builds. A normal setup should include a DevOps engineer, a data scientist, a data engineer and a product developer who writes the web app and/or mobile app. This single team is responsible for the result. The team members should work together on tasks that are closely interconnected.
All of this means that the whole team is responsible for the business result. This is also true for intermediate research done by a data scientist that cannot be used in production on its own.
Old-School vs. Vertical Teams
To dig deeper, let’s take a classic old-school layered company structure in which you have a department of data scientists, operations, UI developers, a big data department, QA engineers and so on. In this case, every project passes through most of these teams. The classic problem is that tickets and tasks are thrown from one team to another, and the real business goals are watered down along the way and not met in the end. So, instead of this horizontal division, we have divided the teams vertically. This allows us to create teams that see a clear-cut goal they need to achieve. At the same time, team members can improve their cross-functional skills and take on more responsibility.
As a result, such teams began to deliver and to practice Scrum and Agile properly. This is not specific to data science, but companies need to recognize the difference between a data scientist who works in academia and a production data scientist. Companies should aim to employ the latter within their teams, and should not let a data scientist work alone and remotely.
About the Author / Stepan Pushkarev
Stepan Pushkarev is the Head of DevOps Practice at Squadex.com and CTO of Hydrosphere.io. He has co-founded and managed engineering teams for eCommerce, IoT and Ad-tech companies. He has been responsible for the full product stack: math models, infrastructure and operations, and enterprise applications, as well as hiring and establishing engineering culture and delivery processes. Stepan combines strong technology and management skills with an entrepreneurial spirit. Connect with him on LinkedIn.