
How to Write Streaming Data into a Data Table in Databricks

Azure Databricks is the latest way of running Data Engineering and Data Science workloads in the Microsoft ecosystem. If you are new to Azure Databricks and are wondering what it is and how to get started with it, I would like to refer you to my Jump Start series of articles. 😊

If someone asked me why Databricks or Apache Spark is so special, my first answer would be that it handles real-time stream processing just as well as batch processing. Also, it doesn't matter whether you are a good Java programmer, a very handy Python developer, an experienced Data Scientist with expertise in R, or a Data Engineer who was born with SQL: Databricks is designed for all of you. Irrespective of the programming language, you can run your workload without hassle.

In this blog post, I'm going to explain how to write a streaming DataFrame into a Databricks data table.

You may be processing data in Azure Databricks using DataFrames either as batches or as streams, and the read and write methods are different for each. You can get a detailed understanding of this from the Azure Databricks documentation. In this post, I'm focusing only on streaming writes to data tables in Databricks. There are two steps you need to perform in order to write streaming data successfully (each step is sketched in code below).

  1. Write the streaming DataFrame into Parquet file format
  2. Read the Parquet files and write them into a Databricks table
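
To make step 1 concrete, here is a minimal PySpark sketch. It assumes a Databricks notebook where spark (the SparkSession) is already defined, and it uses an illustrative streaming source and illustrative /mnt paths rather than the exact code from my demo; substitute your own streaming DataFrame and storage locations.

    # Illustrative streaming DataFrame. In the demo this would be the Twitter
    # sentiment scores; here the built-in "rate" test source stands in for it.
    tweetScoreDf = (
        spark.readStream
            .format("rate")
            .option("rowsPerSecond", 1)
            .load()
    )

    # Step 1: write the streaming DataFrame out as Parquet files.
    # Both paths below are illustrative placeholders.
    query = (
        tweetScoreDf.writeStream
            .format("parquet")
            .option("path", "/mnt/TwitterSentiment")                           # Parquet output folder
            .option("checkpointLocation", "/mnt/TwitterSentiment/checkpoint")  # fault-tolerance state
            .start()
    )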

Parquet is a columnar file format that is supported by many data processing systems, including Apache Spark. You can read more about it here.

In my demo example, I've written the sentiment scores of Twitter feeds into a Databricks table.

In this code, instead of displaying the result in the console, I'm writing it to a Databricks table. Here you don't need to manually create a table to store the data; the code creates the table for you.
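
As a rough sketch of what that can look like (not my exact demo code), step 2 can be done with a CREATE TABLE statement over the Parquet folder from step 1; the table name twitter_sentiment and the path are illustrative placeholders.

    # Step 2: expose the Parquet files as a Databricks table.
    # Table name and location are illustrative placeholders.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS twitter_sentiment
        USING PARQUET
        LOCATION '/mnt/TwitterSentiment'
    """)

    # Spark may cache the file listing for an external table like this, so if rows
    # from newly written Parquet files don't show up, refresh the table metadata:
    spark.sql("REFRESH TABLE twitter_sentiment")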

Once this has executed, you can see that the table has been created in the Data tab.



You can click the table and view the data that was actually written.



Further learning:
The Parquet files and the checkpoints are stored under the mnt directory in DBFS (Databricks File System). You can view DBFS by going to the Databricks home page -> Upload data -> DBFS.
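
You can also list these folders straight from a notebook; a small sketch, using the same illustrative /mnt/TwitterSentiment path as above:

    # dbutils and display are available by default in Databricks notebooks.
    # The paths are illustrative placeholders.
    display(dbutils.fs.ls("/mnt/TwitterSentiment"))             # Parquet output files
    display(dbutils.fs.ls("/mnt/TwitterSentiment/checkpoint"))  # checkpoint files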




You can see a set of Parquet files written inside the TwitterSentiment folder. A set of files has also been created in the checkpoint location; see the image below. Checkpointing is what makes a streaming implementation production quality and fault tolerant. You can read more on checkpointing here.
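
To illustrate why the checkpoint matters, here is a sketch of stopping and restarting the query from step 1; because the checkpointLocation is unchanged, Spark can resume from its last committed offsets instead of reprocessing the whole stream. The names continue the illustrative ones used above.

    # "query" is the handle returned by writeStream.start() in step 1.
    query.stop()

    # Restarting with the SAME checkpointLocation resumes the stream where it left off.
    query = (
        tweetScoreDf.writeStream
            .format("parquet")
            .option("path", "/mnt/TwitterSentiment")
            .option("checkpointLocation", "/mnt/TwitterSentiment/checkpoint")
            .start()
    )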



I hope you learned how to write streaming data into Azure Databricks tables on a Spark cluster. If you found this article interesting, subscribe to my blog to get more articles like this. Also, if you have any issues or would like to give any feedback, please leave a comment below. Cheers! 😀

Comments

  1. Hey Nisal, good day. A very nice blog to get started with Spark Streaming. I want to know one thing: when we are writing Parquet files and putting them in a table, could we instead insert the tweets directly into the table as they arrive, without using Parquet files as a medium?
    Another thing: if I follow this method of writing them into Parquet files and then inserting them into the table, I noticed that the table was only updated the first time; when new tweets came in, the files were getting created but they were not getting inserted into the table. Please share your thoughts.

