Unified Analytics Leader Supports New Apache Spark 2.4; Introduces New Feature to Simplify Distributed Deep Learning
SAN FRANCISCO–(BUSINESS WIRE)–Databricks, the leader in unified analytics founded by the original creators of Apache Spark™, today announced support for the newly released Apache Spark 2.4.0 within Databricks’ Unified Analytics Platform. Databricks is the first unified analytics vendor to support Apache Spark 2.4, which ships as part of the now generally available Databricks Runtime 5.0. Databricks also introduced a key feature within Runtime 5.0, HorovodRunner, to further simplify distributed deep learning.
The Apache Spark community made multiple valuable contributions to the Spark 2.4 release, which was introduced on November 8, 2018. In this release, Project Hydrogen substantially improves the performance and fault recovery of distributed deep learning and machine learning frameworks on Spark. Project Hydrogen directly addresses a challenge data teams face: big data jobs and deep learning jobs are executed in significantly different ways. Whereas Spark excels at data processing at massive scale, deep learning assumes complete coordination and dependency among tasks, an execution model optimized for constant communication rather than for scalability and fault tolerance.
“Innovation continues to thrive within the Apache Spark community. Project Hydrogen is the most recent major initiative with an aim to provide first-class support for popular distributed machine learning frameworks on Apache Spark,” said Reynold Xin, co-founder at Databricks, Apache Spark PMC member and the top contributor to the project.
Within Apache Spark 2.4, Project Hydrogen introduces Barrier Execution, a new scheduling mode that allows practitioners to properly embed distributed deep learning training as an Apache Spark workload. Added Xin, “This is the largest change to Spark’s scheduler since the inception of the project. At Databricks, we also found additional opportunities to simplify the complexity of machine learning workloads. Within Databricks’ Unified Analytics Platform, which is powered by Spark 2.4, we created further optimizations to simplify distributed deep learning.”
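For readers unfamiliar with barrier-style scheduling, the idea can be illustrated with a toy sketch. The example below does not use Spark's API; it relies only on Python's standard-library threading.Barrier, and the function name run_barrier_stage and its parameters are hypothetical. It assumes the all-or-nothing behavior described above: every task in a stage must reach a shared synchronization point, and if any one task fails, none of the others completes.

```python
import threading


def run_barrier_stage(num_tasks, fail_rank=None, timeout=5.0):
    """Toy model of a barrier stage (illustrative only, not Spark's API).

    Returns True only if every task passed the shared sync point.
    """
    barrier = threading.Barrier(num_tasks)
    completed = [False] * num_tasks

    def task(rank):
        if rank == fail_rank:
            # Simulate a task failure: break the barrier for all peers.
            barrier.abort()
            return
        try:
            # No task proceeds until all tasks in the stage have arrived.
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            # A peer failed (or timed out), so this task gives up as well.
            return
        # Past this point, every peer is known to be running.
        completed[rank] = True

    threads = [threading.Thread(target=task, args=(r,)) for r in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return all(completed)


if __name__ == "__main__":
    print(run_barrier_stage(4))               # all tasks reach the barrier
    print(run_barrier_stage(4, fail_rank=2))  # one failure stops the stage
```

In this sketch a single failed task prevents the whole stage from completing, which contrasts with the independent, individually retried tasks typical of big data jobs.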
Databricks Simplifies Distributed Deep Learning in Runtime 5.0
Model experimentation usually takes place on a single-node machine, locally or in the cloud, before computation is scaled out as needed. Migrating from single-node workloads to distributed training on CPU or GPU clusters can often require a full code rewrite, which makes the move to distributed training more complex. To accelerate migration to distributed deep learning, Databricks has released HorovodRunner. The new feature provides a simple way to scale deep learning training workloads from a single machine to large clusters, reducing overall programming and training time from hours to minutes.
To help simplify deep learning further, Databricks also provides native integration with the most popular frameworks, including TensorFlow, Keras, and Horovod, along with improved performance for the most popular machine learning algorithms from MLlib and GraphFrames. This gives practitioners a convenient way to start machine learning clusters in seconds, pre-configured with the latest machine learning frameworks, libraries, and their dependencies.
Additional Resources
- Databricks Blog: Announcing Databricks Runtime 5.0
- Databricks Blog: Introducing Apache Spark 2.4
About Databricks
Databricks’ mission is to accelerate innovation for its customers by unifying Data Science, Engineering and Business. Founded by the original creators of Apache Spark, Databricks provides a Unified Analytics Platform for data science teams to collaborate with data engineering and lines of business to build data products. Users achieve faster time-to-value with Databricks by creating analytic workflows that go from ETL and interactive exploration to production. The company also makes it easier for its users to focus on their data by providing a fully managed, scalable, and secure cloud infrastructure that reduces operational complexity and total cost of ownership. Databricks, venture-backed by Andreessen Horowitz, NEA and Battery Ventures, among others, has a global customer base that includes Viacom, Shell and HP.
Apache, Apache Spark and Spark are trademarks of the Apache Software Foundation.