
Value Encoding in the VertiPaq Engine

If you remember my last blog post about the VertiPaq engine inside SSAS Tabular, I discussed three algorithms that are used when processing the model. Processing here means loading data from the relational source into the tabular structure. During this step, data compression takes place to reduce the memory footprint. This matters because, once the model is processed, all the data from the data warehouse or source relational database is loaded into memory. Compression therefore saves a huge amount of memory and makes optimal use of your hardware, while scans become faster because the compressed data model is smaller than the original.

These are the steps that take place when we process a tabular model from SSDT or SQL Server Management Studio:
  1. Read the data from the source database and transform it into a columnar (VertiPaq) structure, during which data encoding and compression occur.
  2. Create dictionaries and indexes for each column.
  3. Create relationships between the data structures.
  4. Compute and compress all the calculated columns.
Steps 3 and 4 are interchangeable, meaning it is also possible to create relationships based on the calculated columns you created.

Let's focus on data encoding and compression in this post. As I mentioned, there are three types of encoding happening behind the scenes: hash encoding, value encoding, and run-length encoding (RLE).

Value Encoding 

We are playing with numbers. In data warehousing, we model dimensions and facts. While the relational entities are converted into dimensions, the numeric measures related to business processes go into fact tables.
Fact tables mostly contain integers, which represent the surrogate keys and some of the measures, plus floating-point values for other numeric columns. Value encoding applies to integer columns only.

Let's have a look at how value encoding works. As an example, consider the CityKey column in the Order fact table of the WideWorldImporters data warehouse database. The VertiPaq engine performs a simple mathematical operation to do the compression.



Here the engine subtracts the minimum value in the column (41165) from each original value. If the original values needed a 32-bit integer to store, after the compression they fit in a 16-bit integer, reducing the memory footprint by almost 50%. Fact tables typically contain millions, or sometimes even billions, of rows. Can you imagine how much space value encoding saves?
When this column is queried, VertiPaq re-applies the same function to get the original values back. In reality, the engine performs even more advanced calculations to reduce the memory footprint.
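The subtract-the-minimum idea can be sketched in a few lines of Python. This is only an illustration of the principle, not the actual VertiPaq implementation; the CityKey values below are hypothetical, with the minimum 41165 taken from the example above.

```python
# Illustrative sketch of value encoding (NOT the real VertiPaq code).
# Store each value as an offset from the column minimum, so the
# encoded values fit in a narrower integer type.

def bits_needed(max_encoded: int) -> int:
    """Smallest common integer width (8/16/32/64 bits) that holds the value."""
    for width in (8, 16, 32, 64):
        if max_encoded < 2 ** width:
            return width
    raise ValueError("value too large for 64 bits")

def value_encode(values):
    base = min(values)                      # 41165 in the CityKey example
    return base, [v - base for v in values]

def value_decode(base, encoded):
    return [e + base for e in encoded]      # re-apply the offset at query time

city_keys = [41165, 41502, 42999, 50000]    # hypothetical CityKey values
base, encoded = value_encode(city_keys)
print(base, encoded)                        # 41165 [0, 337, 1834, 8835]
print(bits_needed(max(encoded)))            # 16 -> fits a 16-bit integer instead of 32
assert value_decode(base, encoded) == city_keys
```

Because decoding is just an addition, the engine can answer queries directly over the compressed column without materializing the original 32-bit values first.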

We will discuss dictionary encoding (hash encoding) in the next post!

Reference: SQL BI
