If you love writing code in Python, SQL, Scala, or R and want a single platform for everything from data analysis to data engineering to data science, Databricks can save you a lot of effort.
Here is why I love using it for any kind of work on a cloud data platform.
1. Easy to learn:
The platform has it all: whether you are a data scientist, data engineer, developer, or data analyst, it offers scalable services to build enterprise data pipelines.
It is also versatile and easy to pick up in a week or so.
2. Well recognized and trusted by big companies:
Who doesn't want to learn a framework or platform that is industry recognized and used by big companies? Here is Gartner's Magic Quadrant for data science and machine learning platforms.
3. Delta Lake and Parquet file support:
Thanks to the Apache Spark framework, you can easily process billions of records, build scalable pipelines, and query large amounts of data stored as Parquet files.
Databricks also supports Hive tables, but the recent support for the Delta Lake framework is just amazing.
It brings ACID transactions along with easy merging of data from a staging layer. It has other features too, but we will cover those in another post. You can refer to the documentation to read more about Delta Lake.
And yes, Databricks Delta tables support streaming data too.
4. A data scientist's toolbox:
The ease of using Python, R, Scala, and SQL from Databricks notebooks, along with MLflow support to track experiments, package model flavors, schedule flows, and create dashboards, makes the platform an ideal choice for building industry-scale data science pipelines.
5. Easy GitHub integration and job scheduling:
You can integrate your code with GitHub in a few clicks and schedule jobs to run your flows and notebooks.
6. CLI integration and dashboarding:
You can access the workspace and automate manual activities using the Databricks CLI.
You can easily build dashboards to share your insights with the team.
7. Access control over the workspace and clusters:
Databricks provides the ability to control access to the workspace, notebooks, and dashboards, along with limited token access to the underlying Hive tables.
You can also manage clusters manually, which is handy if you know your way around cluster configuration.
I will talk more about PySpark and how to use Databricks for building pipelines and machine learning in upcoming posts.
So, if you ask my opinion, Databricks is a wonderful, easy-to-learn platform to hone your data skills.
If you want to give it a try, sign up for the community edition.
If you liked this post, don't forget to cheer me up with a clap. The next post will be on something different.