BlueData Brings DevOps Agility to Data Science Operations with Spark, R, and Python

Santa Clara, Calif.—BlueData, provider of the leading Big-Data-as-a-Service (BDaaS) software platform, today announced the new winter release for the BlueData EPIC software platform. This new release delivers several new enhancements for data science operations, bringing DevOps agility and collaboration to data science teams as well as support for new machine learning use cases.

More organizations are now building data science teams and the role of the data scientist is the #1 job in the U.S. for the second year in a row. Data scientists are highly skilled at developing advanced analytical models and prototypes; their data-driven innovations can be game-changing. But the siloed efforts and custom-crafted prototypes of individual data scientists can be difficult to scale, reproduce, and share across multiple users. What works for an ad-hoc model in development may not necessarily work in production; what works as a one-off prototype on a laptop might not work as a consistent and repeatable process in a distributed computing environment.

Increasingly, data science is becoming a team sport – often involving multiple data scientists, data engineers, data analysts, and developers who have different skill sets and different specialized tools. What’s needed is an approach that brings the agility, automation, and collaboration of DevOps to these data science and engineering teams. They need to operationalize the data science lifecycle in a streamlined and repeatable way. They require an agile and lean process that enables them to iterate quickly and fail fast. They need the ability to easily share data, models, and code in a secure distributed environment. And they need the flexibility to use their own preferred tools and try out new technologies in the rapidly changing field of data science.

The BlueData EPIC software platform delivers this agility and flexibility, providing an easy-to-use self-service interface that allows data science teams to quickly spin up Docker-based environments for their preferred tools – running on shared infrastructure either on-premises or in the public cloud, with secure access to common data (e.g. in an HDFS data lake or Amazon S3). And the new winter release of the EPIC platform addresses many of the challenges outlined above. Some of the highlights of this new release include:

Collaboration and productivity, with a choice of web-based notebooks: Provides the option for data science teams to use JupyterHub, RStudio Server, and/or Zeppelin notebooks. Each of these notebooks is pre-configured and pre-tested as a Docker image in the BlueData EPIC App Store and can be installed via automated one-click deployment. BlueData EPIC ensures governance, security, and authentication while providing the ability for users to share their data, models, and code in a multi-tenant environment on common infrastructure.

Flexibility to support the full spectrum of data science tools: Data science environments in BlueData EPIC are pre-configured for R and Python support, with or without Spark. This enables data science teams to use their preferred languages, packages, and tools – without the operational challenges of testing and validating configurations or version dependencies. For example, data scientists can easily start with R standalone (e.g. RStudio, Shiny Server) and then later opt to use R with Spark (e.g. SparkR and sparklyr) for other use cases; the same applies to Python (e.g. with JupyterHub and PySpark). Other tools can also be added to the BlueData EPIC App Store, using the App Workbench to create new Docker-based images.

Simplified job submission, either with a few mouse clicks or programmatically: Allows users to easily submit R, Python, Spark, Hadoop, or SQL jobs – for persistent or transient clusters – from either the BlueData EPIC web-based UI or REST API. This helps data science teams respond quickly to dynamic business requirements by running a variety of jobs, ranging from analytical SQL to Spark machine learning scripts, against their data with just a few clicks or simple code.
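
As a rough illustration of what programmatic submission might look like, the Python sketch below builds (but does not send) an HTTP request describing a job. This release does not describe the EPIC REST API's endpoints or payload schema, so the host, the `/jobs` path, the JSON field names, the bearer-token header, and the `build_job_request` helper are all hypothetical placeholders rather than the actual BlueData API.

```python
import json
import urllib.request


def build_job_request(base_url, token, job_name, job_type, script_path, cluster_id):
    """Build (but do not send) an HTTP POST request describing a job.

    Every field name and the '/jobs' path are hypothetical placeholders;
    they are NOT taken from the BlueData EPIC REST API.
    """
    payload = {
        "name": job_name,        # human-readable job label (assumed field)
        "type": job_type,        # e.g. "pyspark", "r", "sql" (assumed values)
        "script": script_path,   # location of the script to run (assumed field)
        "cluster": cluster_id,   # target persistent cluster (assumed field)
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/jobs",  # hypothetical endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,  # assumed auth scheme
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_job_request(
        base_url="https://epic.example.com/api",  # placeholder host
        token="EXAMPLE_TOKEN",
        job_name="churn-model-training",
        job_type="pyspark",
        script_path="/scripts/train_model.py",
        cluster_id="spark-cluster-1",
    )
    # Inspect the request instead of sending it to a placeholder host.
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the resulting object to `urllib.request.urlopen` would send it; the sketch stops short of that because the target URL is only a placeholder.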

Bootstrap and other action scripts to automate data science operations: Offers the ability to easily patch and update some or all of the nodes in a running environment with a single click. Bootstrap action scripts within BlueData EPIC can be executed both during and after creating the environments – e.g. to install additional software or to change the configuration. This helps remove the operational overhead of setting up, configuring, and managing the end-to-end lifecycle of data science environments.

Support for machine learning and a wide range of other data science use cases: BlueData continues to add new Big Data frameworks and tools to its platform – including pre-integrated H2O as well as Spark MLlib and other machine learning tools. With the pre-configured Docker image for H2O in the BlueData EPIC App Store, customers can now quickly deploy the H2O set of machine learning libraries – including H2O with R, H2O with Spark (i.e. “Sparkling Water”), as well as the H2O Flow user interface.

Data science rarely works well with a “one size fits all” cookie-cutter solution. With the BlueData EPIC platform, data science teams can analyze data using Scala, Python, R, or SQL; build models using R, Python, Spark MLlib, or H2O; and run and visualize their analysis using RStudio, JupyterHub, or Zeppelin notebooks – all on the same Spark cluster, using shared data. Data scientists and other users have the flexibility to take advantage of a wide variety of tools and algorithms to address an increasingly complex set of use cases.

This new release builds upon BlueData’s continued product innovation over the past year. Most recently, BlueData announced BlueData EPIC on AWS to provide flexibility and choice for Big Data deployments on the Amazon cloud, including the ability to tap into both Amazon S3 and on-premises storage. BlueData provides the only Big-Data-as-a-Service solution that can be deployed on-premises, in the public cloud, or in a hybrid architecture.

“It’s time for enterprises to extend the benefits of DevOps to their data science and engineering teams, whether for real-time analytics and machine learning or other use cases,” said Kumar Sreekanti, co-founder and CEO at BlueData. “BlueData customers can bring this agility and speed to their data science operations, with the ability to create fully integrated data science environments in just a few mouse clicks – both on-premises and in the public cloud.”

BlueData will demonstrate this new functionality at the Spark Summit East event in Boston, February 8th and 9th, in booth K8. BlueData’s co-founder and chief architect, Thomas Phelan, will present “Lessons Learned from Dockerizing Spark Workloads” at Spark Summit on Wednesday, February 8th, at 12:20pm.

— Jules Louis