Perform Analytics over Large Data Volumes: Aggregations in Power BI

"Aggregations" is one of the biggest features introduced to the Power BI in terms of enriching the interactive analytics over Big Data. Before Power BI team introduce this feature, It was not possible to analyze and visualize data where the large tables contains millions or even billions of records. I’m writing this post in a way you can use as a tutorial to read, understand and tryout by yourself -- How to implement Aggregations in Power BI.

Before starting to read this, please download the Demo content from here. This will contain three files.

AggsDemo.bak: Demo database backup build on SQL Server 2017
Create Aggs Tables.sql : SQL Query to create aggregation tables
AggsDemo.pbix: Power BI demo file which implemented aggregations in this post

Connection Types - Direct Query vs Import

Even before we move to Aggregations, I think it is important to have an understanding of connection types in Power BI. Predominantly there are two connection types in Power BI. Import and Direct Query.

Import

In Import mode, by name it self you import the data into your Power BI file. There are many advantages in this mode like, you can analyze data much faster. Because, when you import the data, data will be compress and store in in-memory Vertipaq engine, which is the database engine behind the Power BI, Analysis Services Tabular and Power Pivot. However you cannot exceed the upper limit of the import data which is 1GB.

Direct Query

Oppose to the Import mode, in DirectQuery Mode you just made the connection with your source. You are not importing data. So, your Power BI file only hold the metadata of your source tables. Which means in another way, when you draw a visual to canvas by consuming your data, in real-time, a DAX query will generate by Power BI engine and translate to relevant source query and retrieve the data. For an instance, if you connect to SQL Server Data source by DirectQuery, when you consume the data, it will translate your DAX query into T-SQL code and query the data from SQL Server database and load into the visuals.

In both types there are Pros and Cons. Ex: Import mode data loading to visual is really fast than DirectQuery. But you have to refresh the data in the import mode. Because, the data in the mode and the source may different over the time. But in DirectQuery mode you don't want to refresh data. Because, in that mode, query the source data in live.

Demo Database

Hope by now you have downloaded the Demo Content. In the demo folder you will find AggsDemo.bak SQL Server Database backup. You need to restore it in your SQL Server instance. In the demo database, only included four tables. Product, Date, Customer and Sales. This database I've made by derived from ConsotoDW database. The reason just to used four tables to make the Demo simple and just allow you to learn the concept. But, keep in mind the Aggregation concept is Powerful and once you learned it you can apply for Big Data Sources like data in databricks cluster.

This structure is a sample for Star schema, Sales is a Fact table contain ~12M records.

Let's Consume the Demo Database using Power BI

Open the Power BI desktop, click Get Data to connect to Data source, select SQL Server as source type, select the tables, and make Direct Query to Source AggsDemo database.

In the next window select the tables in the database.

Hence we chose Direct Query mode, Click Load Data, it just made the connection between the Power BI file and the database.

You can ensure that your dataset in DirectQuery storage mode, its there in the bottom right corner.

Other indication of you are in DirectQuery mode, in this mode you won't see a Data tab. Only Model tab and Report tab will be there.

Switch to the Model tab, and I'm going to create relationship and build a data model. At the beginning there are no relationships and you will see just isolated four tables like in below.

I'm going to connect ProductKey in Sales connect to ProductKey in Product table and create the relationship. DateKey in Sales table connect to DateKey in Date table, also CustomerKey in Sales table connect to CustomerKey in Customer table.

After you built the relationships you can see your data model like this.

You can check the relationships you made by clicking Manage Relationships button.

Let's draw a visual Sales by Year. When you draw you would notice, there is small progress icon, to indicate the data loading is in progress. In here we are scanning whole 12 M Sales table and joining with Date Table.

Let's Build Aggregations

As I mentioned earlier in this post, we can use Aggregations to improve the Query performance. Though, we used relatively small table with 12 M records, when it comes to billion records, your model defensively takes more time to load data into visuals.

Aggregation is a concept of store pre-calculated data in tables specially when you have large tables in your model like Billions of records. This helps you to reduce the table scans and improve the query performance.

Ex: in order to serve the above query, we could have build a table with yearly sales.

-- Aggregated by CalendarYear

SELECT D.CalendarYear, SUM(S.SalesAmount) AS SalesAmount
INTO dbo.SalesByYear
FROM dbo.Sales S
JOIN dbo.Date D ON D.Datekey = S.DateKey
GROUP BY D.CalendarYear

Once you run this query, it will create another table called, SalesByYear. This table just contain 3 records. Then we can assume, it will take lesser time to load data than earlier.

Add Newly Created Aggregated Table into Model

In this step, I'm going to use my newly created aggregated table into the model. I'm select the Recent Queries button and connect to my AggsDemo database and added my SalesByYear table into the Power BI file.

In this time also I'm using DirectQuery mode and just made the connection.

How to Determine Whether Aggregation Table used by the Power BI Engine or Not?

We need to ensure Vertipaq engine use our aggregation tables. Otherwise we may just waste our time by creating aggregation tables if engine down not pick it. Just like we do build Indexes in SQL Server we need to ensure our effort. So, how to detect whether Engine hit the aggregation table or not?

Power BI Performance Analyzer

One way of determine that is using Power BI Performance Analyzer. You can enable the Performance analyzer by go to the view tab and check Performance Analyzer . Performance Analyzer helps you to analyze the query performance.

You can start recording, then it will trace all the queries and execution times in canvas. In here since visual already loaded into canvas you won't see any query information at the beginning. Then what you have to do Refresh Visuals. Then it will refresh the visuals again and trace the query information.

You can see it has taken 1516 milliseconds to load the data into visual. You can copy the query and paste into a notepad and see which tables actually has been queried. If you look at the below snapshot of my query, you can see there are two query sections. DAX query, which is generate by Vertipaq engine, the second one is the T-SQL query which is translated and executed against my SQL Server AggsDemo database.

If you pay attention to closer on SQL Query, you can see the engine has not consumed my Aggregated table. 😓

The reason for that, just adding your aggregation table is not enough!.. You need to configure the aggregation in order to work.

How to Configure Aggregations?

Go to the Model tab, right click on SalesByYear my aggregation table, select Manage Aggregations.

In your manage aggregations window, Select SUMMARIZATION of CalanderYear is Group By, DETAIL TABLE is Date table. DETAIL COLUMN is Calendar Year.

In Sales Amount column SUMMARIZATION is Sum, (If you have used different aggregation like Count, Min, you can use that as Summarization). DETAIL TABLE is Sales. DETAIL COLUMN is Sales Amount.

Once you click Apply All button, This configuration will be applied to your model. As soon as you applied aggregations, the Aggregation table will go to the Hidden mode and it is not visible for the end user.

The idea in here as I understood, you may have many aggregations in your model. After configure it goes to Hidden mode automatically. End user does not know there aggregation tables with duplicate data. (If they do, they might confuse). So, the engine is smart enough to switch between your detail table and smaller aggregated table based on the queries you are using. 😎

Now once again I'm Clear the Performance Analyzer stats and refresh visuals.

This time it just took 132 milliseconds to load. Woohoo !.. Seems like Power BI engine has pick the aggregation table. Lets see the query.

You can see this time, it has used the aggregation table SalesByYear.

Another Way of Detect Aggregations

Another way of detect aggregations is using DAX Studio. DAX Studio is a free tool you can download from daxstudio.org which allows you to perform various tasks like analyze query performance, browse model objects, write DAX queries, trace and identify bottlenecks etc. Using DAX Studio you can connect Power BI models, Power Pivot and Tabular Models.

In order to get server execution times enable Server Timing in the menu bar.

EVALUATE
SUMMARIZECOLUMNS('Date'[CalendarYear],
"Sales", SUM(Sales[SalesAmount])
)

Let's write a DAX query like above, this query will return the same information we used in Power BI.

If you go to the Server Timing tab at below, you can see lot of information like, FE- Formula Engine timing, SE- Storage Engine timing. In the result table there are two records.

RewriteA: Aggregate Rewrite Attempts - This indicate whether engine hits the aggregation table or not. A smart way to identify that. 😎

SQL: The SQL Query it generate to query the data.

You can see it says matchFound in Aggregate Rewrite Attempt section.

Let's Try Some Advanced Aggregations

So far we learned the Aggregation concept and how to use simple aggregations in your model. Now let's try to build some advanced aggregations.

I'm change my DAX query to return ProductCategory instead of CalendarYear. You can see the Aggregation Attempts failed.

Not a surprise. Because, I haven;t build any aggregations with Product. Let's create another aggregation table.

-- Aggregated by Year and ProductKey

SELECT P.ProductKey,D.CalendarYear, SUM(S.SalesAmount) AS SalesAmount

INTO dbo.SalesByProductYear

FROM dbo.Sales S

JOIN dbo.[Date] D ON S.DateKey = D.Datekey

JOIN dbo.Product P ON S.ProductKey = P.ProductKey

GROUP BY D.CalendarYear, P.ProductKey

Whenever I use Product wise Yearly Sales, to hit this smaller table rather scan huge Sales table. That is my objective by creating this aggregation table.

I'm here following the same process to add this SalesByProductYear table into my Power BI file. Go to Recent Sources, select the table, and connection type as DirectQuery.

Once again I need to configure aggregations, by select Manage Aggregations.

By now I hope you have understanding how to configure this window. So, I'm not going to repeat the process. Leave to you 😊

Now our data model looks like this.

Let's try to run the DAX query again and see whether it picks the Aggs table by the engine. It tooks 1,842 ms and aggregate attempt failed.

The reason for this, in our SalesByProductYear aggregation table there is NO product category column. So how to fix this?. Should we add another column ProductCategory to Aggregation table? We can't build aggregations like that. In that case we may need to add many columns to aggs table and end up with useless.

So till the engine doesn't know that Product Table Product Category has a connection with Aggregation table ProductKey. In order to tell the engine, we need to build a relationship.

So, I'm going to build a relationship between SalesByProductYear aggregation table and Product Table by connecting using ProductKeys. In that way we don't require to bring ProductCategory column into Aggregate table. Not only that, we can analyze any attribute in the Product table along with Aggregate table.

Just like normal way create a relationship with SalesByProductYear table and Product table.

After creating the relationship the model will appear like this.

Lets run the query using DAX studio and see the result. Now it should use aggregated table.

Now it says match found it has used the aggregations properly. If you look at the query execute against the SQL Server, you can see it JOIN the SalesByProductYear table with Product table to produce the result.

Summary

In this post, you learned the connection types in Power BI, the concept of Aggregations and how to use aggregations over data mode. Even though I used a relatively smaller data set, yet the concept and the way of implementing is valid for the Big Data like data in a Apache spark cluster or a Data Lake. Latter part of this post we walked through on How to detect aggregations using Performance Analyzer and DAX Studio. Hope you could learn the concept, Aggregations in Power BI. Please feel free add a comment and give a feedback.

References:

Vertipaq Engine

Aggregations in Power BI

DAX Studio

Power BI Performance Analyzer

How to use DAX Studio

DirectQuery vs Import in Tabular

Step-by-Step Twitter Sentiment Analysis Using Power BI Streaming Dataset, Microsoft Flow and Azure Text API

Sentiment Analysis is known as Opinion mining or emotion AI which is a branch of Natural Language Processing and text analytics where systematically identify, extract, quantify, and study affective states and subjective information. This kind a analysis widely apply to analyse the product or service reviews, voice of the customer, survey responses from online and social media feeds to analyze the attitude of the customer. Basically from the sentiment analysis the output would be either Positive, Negative or Neutral. There are various algorithms and methods to do a sentiment analysis out there. In this post here I'm doing a sentiment analysis for iPhone 8 product by analyzing twitter feeds. Because, I wanted to know what others are thinking about the latest phone released by Apple. In order to do this task I'm using, Microsoft Azure cognitive services : Text Analytics (to run the sentiment analysis algorithms to get out the results) Microsoft Flow ...

James ZicrovNovember 15, 2019 at 2:09 AM
I think Power BI is the best tool to create reports for end users.Apart from this it is also useful for business analytic tools.

Powerbi Read Rest
Krishna OptifJanuary 27, 2020 at 9:58 PM
very nice and provide me informative content thanks for sharing for more information Looking for the best Power BI Business Intelligence Services
Digital ChanduMarch 12, 2020 at 7:45 AM
very nice blogs!!! i have to learning for lot of information for this sites...Sharing for wonderful information. Thanks for sharing this valuable information to our vision. You have posted a trust worthy blog keep sharing

Digital Marketing In Telugu
Digital Marketing In Hyderabad
internet marketing
Digital marketing
Digital Marketing Strategy

Nisal Mihiranga's Blog

Search This Blog