What is Databricks
“Databricks is an open and unified data analytics platform for data engineering, data science, machine learning and analytics.”
Databricks is an enterprise software company founded by the creators of Apache Spark. It is an industry-leading, cloud-based data engineering tool used for processing and transforming huge amounts of data and for exploring that data through machine learning models. Databricks developed a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks. Recently added to Azure, it is the latest big data tool for the Microsoft cloud. It is available to all organizations and lets them easily realize the full potential of combining their data, machine learning, and ETL processes.
Databricks is available in two editions: Community and paid. The paid edition runs on a cloud backend such as Azure or AWS, and it also supports MLflow for tracking machine learning experiments.
When you start with the Community Edition, the interface will look as shown in the screenshot. You can see a few things there: explore the quickstart tutorials, import and export data, and create a blank notebook. The tasks available in the Community Edition are: create a new notebook, create a table, create a new cluster, create a new MLflow experiment, and import a library.
How to create a cluster?
To create a cluster, just click on New Cluster, give the cluster any name you want, and select a Databricks Runtime version. The default version is 8.2 (Scala 2.12, Spark 3.1.1). You are given 15 GB of free memory, and the cluster will automatically terminate after 2 hours.
After clicking Create Cluster, it will take a few minutes for the new cluster to start. Once it has started successfully, it will look as shown in the screenshot below. As you can see, it provides a lot of options. If you click on Libraries > Install New, it offers several installation sources, such as Upload and PyPI.
Import the dataset
To get data into Databricks, it provides many options, such as Upload, S3, DBFS, and other data sources. To upload a dataset, just click on Upload and drag in the dataset file you want; after that, your data will be available in Databricks.
Now you can use this data for analysis with PySpark and to build a machine learning model.