"Aggregations" is one of the biggest features introduced to
the Power BI in terms of enriching the interactive analytics over Big Data. Before Power BI team introduce this feature, It was not possible to analyze and visualize data where the large tables
contains millions or even billions of records. I’m writing this post in a way
you can use as a tutorial to read, understand and tryout by yourself -- How
to implement Aggregations in Power BI.
Before starting to read this, please download the Demo
content from here. This will contain three files.
- AggsDemo.bak: Demo database backup build on SQL Server 2017
- Create Aggs Tables.sql : SQL Query to create aggregation tables
- AggsDemo.pbix: Power BI demo file which implemented aggregations in this post
Connection Types - Direct Query vs Import
Even before we move to Aggregations, I think it is important to have an understanding of connection types in Power BI. Predominantly there are two connection types in Power BI. Import and Direct Query.
Import
In Import mode, by name it self you import the data into your Power BI file. There are many advantages in this mode like, you can analyze data much faster. Because, when you import the data, data will be compress and store in in-memory Vertipaq engine, which is the database engine behind the Power BI, Analysis Services Tabular and Power Pivot. However you cannot exceed the upper limit of the import data which is 1GB.
Direct Query
Oppose to the Import mode, in DirectQuery Mode you just made the connection with your source. You are not importing data. So, your Power BI file only hold the metadata of your source tables. Which means in another way, when you draw a visual to canvas by consuming your data, in real-time, a DAX query will generate by Power BI engine and translate to relevant source query and retrieve the data. For an instance, if you connect to SQL Server Data source by DirectQuery, when you consume the data, it will translate your DAX query into T-SQL code and query the data from SQL Server database and load into the visuals.In both types there are Pros and Cons. Ex: Import mode data loading to visual is really fast than DirectQuery. But you have to refresh the data in the import mode. Because, the data in the mode and the source may different over the time. But in DirectQuery mode you don't want to refresh data. Because, in that mode, query the source data in live.
Demo Database
Hope by now you have downloaded the Demo Content. In the demo folder you will find AggsDemo.bak SQL Server Database backup. You need to restore it in your SQL Server instance. In the demo database, only included four tables. Product, Date, Customer and Sales. This database I've made by derived from ConsotoDW database. The reason just to used four tables to make the Demo simple and just allow you to learn the concept. But, keep in mind the Aggregation concept is Powerful and once you learned it you can apply for Big Data Sources like data in databricks cluster.This structure is a sample for Star schema, Sales is a Fact table contain ~12M records.
Let's Consume the Demo Database using Power BI
Open the Power BI desktop, click Get Data to connect to Data source, select SQL Server as source type, select the tables, and make Direct Query to Source AggsDemo database.
In the next window select the tables in the database.
Hence we chose Direct Query mode, Click Load Data, it just made the connection between the Power BI file and the database.
Other indication of you are in DirectQuery mode, in this mode you won't see a Data tab. Only Model tab and Report tab will be there.
Switch to the Model tab, and I'm going to create relationship and build a data model. At the beginning there are no relationships and you will see just isolated four tables like in below.
I'm going to connect ProductKey in Sales connect to ProductKey in Product table and create the relationship. DateKey in Sales table connect to DateKey in Date table, also CustomerKey in Sales table connect to CustomerKey in Customer table.
After you built the relationships you can see your data model like this.
You can check the relationships you made by clicking Manage Relationships button.
Let's draw a visual Sales by Year. When you draw you would notice, there is small progress icon, to indicate the data loading is in progress. In here we are scanning whole 12 M Sales table and joining with Date Table.
Let's Build Aggregations
As I mentioned earlier in this post, we can use Aggregations to improve the Query performance. Though, we used relatively small table with 12 M records, when it comes to billion records, your model defensively takes more time to load data into visuals.
Aggregation is a concept of store pre-calculated data in tables specially when you have large tables in your model like Billions of records. This helps you to reduce the table scans and improve the query performance.
Ex: in order to serve the above query, we could have build a table with yearly sales.
-- Aggregated by CalendarYear
SELECT D.CalendarYear, SUM(S.SalesAmount) AS SalesAmount
INTO dbo.SalesByYear
FROM dbo.Sales S
JOIN dbo.Date D ON D.Datekey = S.DateKey
GROUP BY D.CalendarYear
Once you run this query, it will create another table called, SalesByYear. This table just contain 3 records. Then we can assume, it will take lesser time to load data than earlier.
Add Newly Created Aggregated Table into Model
In this step, I'm going to use my newly created aggregated table into the model. I'm select the Recent Queries button and connect to my AggsDemo database and added my SalesByYear table into the Power BI file.
In this time also I'm using DirectQuery mode and just made the connection.
How to Determine Whether Aggregation Table used by the Power BI Engine or Not?
We need to ensure Vertipaq engine use our aggregation tables. Otherwise we may just waste our time by creating aggregation tables if engine down not pick it. Just like we do build Indexes in SQL Server we need to ensure our effort. So, how to detect whether Engine hit the aggregation table or not?
Power BI Performance Analyzer
One way of determine that is using Power BI Performance Analyzer. You can enable the Performance analyzer by go to the view tab and check Performance Analyzer . Performance Analyzer helps you to analyze the query performance.
You can start recording, then it will trace all the queries and execution times in canvas. In here since visual already loaded into canvas you won't see any query information at the beginning. Then what you have to do Refresh Visuals. Then it will refresh the visuals again and trace the query information.
You can see it has taken 1516 milliseconds to load the data into visual. You can copy the query and paste into a notepad and see which tables actually has been queried. If you look at the below snapshot of my query, you can see there are two query sections. DAX query, which is generate by Vertipaq engine, the second one is the T-SQL query which is translated and executed against my SQL Server AggsDemo database.
If you pay attention to closer on SQL Query, you can see the engine has not consumed my Aggregated table. 😓
The reason for that, just adding your aggregation table is not enough!.. You need to configure the aggregation in order to work.
How to Configure Aggregations?
Go to the Model tab, right click on SalesByYear my aggregation table, select Manage Aggregations.
In your manage aggregations window, Select SUMMARIZATION of CalanderYear is Group By, DETAIL TABLE is Date table. DETAIL COLUMN is Calendar Year.
In Sales Amount column SUMMARIZATION is Sum, (If you have used different aggregation like Count, Min, you can use that as Summarization). DETAIL TABLE is Sales. DETAIL COLUMN is Sales Amount.
Once you click Apply All button, This configuration will be applied to your model. As soon as you applied aggregations, the Aggregation table will go to the Hidden mode and it is not visible for the end user.
The idea in here as I understood, you may have many aggregations in your model. After configure it goes to Hidden mode automatically. End user does not know there aggregation tables with duplicate data. (If they do, they might confuse). So, the engine is smart enough to switch between your detail table and smaller aggregated table based on the queries you are using. 😎
Now once again I'm Clear the Performance Analyzer stats and refresh visuals.
This time it just took 132 milliseconds to load. Woohoo !.. Seems like Power BI engine has pick the aggregation table. Lets see the query.
You can see this time, it has used the aggregation table SalesByYear.
Another Way of Detect Aggregations
Another way of detect aggregations is using DAX Studio. DAX Studio is a free tool you can download from daxstudio.org which allows you to perform various tasks like analyze query performance, browse model objects, write DAX queries, trace and identify bottlenecks etc. Using DAX Studio you can connect Power BI models, Power Pivot and Tabular Models.
In order to get server execution times enable Server Timing in the menu bar.
EVALUATE
SUMMARIZECOLUMNS('Date'[CalendarYear],
"Sales", SUM(Sales[SalesAmount])
)
SUMMARIZECOLUMNS('Date'[CalendarYear],
"Sales", SUM(Sales[SalesAmount])
)
Let's write a DAX query like above, this query will return the same information we used in Power BI.
If you go to the Server Timing tab at below, you can see lot of information like, FE- Formula Engine timing, SE- Storage Engine timing. In the result table there are two records.
RewriteA: Aggregate Rewrite Attempts - This indicate whether engine hits the aggregation table or not. A smart way to identify that. 😎
SQL: The SQL Query it generate to query the data.
You can see it says matchFound in Aggregate Rewrite Attempt section.
Let's Try Some Advanced Aggregations
So far we learned the Aggregation concept and how to use simple aggregations in your model. Now let's try to build some advanced aggregations.
I'm change my DAX query to return ProductCategory instead of CalendarYear. You can see the Aggregation Attempts failed.
Not a surprise. Because, I haven;t build any aggregations with Product. Let's create another aggregation table.
-- Aggregated by Year and ProductKey
SELECT P.ProductKey,D.CalendarYear, SUM(S.SalesAmount) AS SalesAmount
INTO dbo.SalesByProductYear
FROM dbo.Sales S
JOIN dbo.[Date] D ON S.DateKey = D.Datekey
JOIN dbo.Product P ON S.ProductKey = P.ProductKey
GROUP BY D.CalendarYear, P.ProductKey
Whenever I use Product wise Yearly Sales, to hit this smaller table rather scan huge Sales table. That is my objective by creating this aggregation table.
I'm here following the same process to add this SalesByProductYear table into my Power BI file. Go to Recent Sources, select the table, and connection type as DirectQuery.
Once again I need to configure aggregations, by select Manage Aggregations.
By now I hope you have understanding how to configure this window. So, I'm not going to repeat the process. Leave to you 😊
Now our data model looks like this.
Let's try to run the DAX query again and see whether it picks the Aggs table by the engine. It tooks 1,842 ms and aggregate attempt failed.
The reason for this, in our SalesByProductYear aggregation table there is NO product category column. So how to fix this?. Should we add another column ProductCategory to Aggregation table? We can't build aggregations like that. In that case we may need to add many columns to aggs table and end up with useless.
So till the engine doesn't know that Product Table Product Category has a connection with Aggregation table ProductKey. In order to tell the engine, we need to build a relationship.
So, I'm going to build a relationship between SalesByProductYear aggregation table and Product Table by connecting using ProductKeys. In that way we don't require to bring ProductCategory column into Aggregate table. Not only that, we can analyze any attribute in the Product table along with Aggregate table.
Just like normal way create a relationship with SalesByProductYear table and Product table.
After creating the relationship the model will appear like this.
Lets run the query using DAX studio and see the result. Now it should use aggregated table.
Now it says match found it has used the aggregations properly. If you look at the query execute against the SQL Server, you can see it JOIN the SalesByProductYear table with Product table to produce the result.
Summary
In this post, you learned the connection types in Power BI, the concept of Aggregations and how to use aggregations over data mode. Even though I used a relatively smaller data set, yet the concept and the way of implementing is valid for the Big Data like data in a Apache spark cluster or a Data Lake. Latter part of this post we walked through on How to detect aggregations using Performance Analyzer and DAX Studio. Hope you could learn the concept, Aggregations in Power BI. Please feel free add a comment and give a feedback.
I think Power BI is the best tool to create reports for end users.Apart from this it is also useful for business analytic tools.
ReplyDeletePowerbi Read Rest
very nice and provide me informative content thanks for sharing for more information Looking for the best Power BI Business Intelligence Services
ReplyDeletevery nice blogs!!! i have to learning for lot of information for this sites...Sharing for wonderful information. Thanks for sharing this valuable information to our vision. You have posted a trust worthy blog keep sharing
ReplyDeleteDigital Marketing In Telugu
Digital Marketing In Hyderabad
internet marketing
Digital marketing
Digital Marketing Strategy