Fabric Dataflow Gen2 Partitioned Compute: Setup and Benchmark
- Vojtěch Šíma
tl;dr Dataflow Gen2 Partitioned Compute is, in short, the ability to process multiple files at the same time, resulting in significantly shorter refresh times. It requires a partition key setup and works only on supported file systems (ADLS Gen2). From my benchmarks, it does actually work, but read the blog to learn the other crucial details.
Disclaimer: In general, Dataflow Gen2 is neither the fastest nor the cheapest Data Factory solution in Microsoft Fabric; if you have the know-how or coding skills, you're better off with a Data pipeline activity or a non-Spark notebook. Dataflow Gen2, on the other hand, is extremely user-friendly and has hundreds of built-in sources.
What is Dataflow Gen2's Partitioned Compute
Partitioned compute is a capability of the Dataflow Gen2 engine that allows parts of your dataflow logic to run in parallel, reducing the time to complete its evaluations.
Microsoft documentation
If I were to describe it in my own words, it is the ability of the Dataflow Gen2 engine to process (supported) files in parallel, handling multiple files at the same time, which should reduce the time needed for overall evaluation and therefore also lower costs (in theory).
Now, this feature is in preview, and currently it supports only files on a proper file system, which really just means Azure Data Lake Storage Gen2. So, no SharePoint (even though you can technically try to set it up this way, it won't work).
You can produce the same partitioning in Power BI Desktop, but it won't do anything there, as this engine is exclusive to Dataflow Gen2.
There is no real documentation on how it works underneath, but my interpretation is that the engine streams the whole process with a certain number of concurrent requests, processing multiple files at the same time. The partition key you create is tied to the file name being processed, and with the right file system it can easily be translated into the full path of that file.
There are a couple of options you can pass to the AzureStorage.DataLake function, but these won't necessarily improve speed (though they are worth tinkering with when querying large files).
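Since the internals aren't documented, the following is only a conceptual sketch of my interpretation, not the actual engine: each file is an independent partition identified by its relative path, and a bounded pool of concurrent requests fetches and transforms them in parallel. The worker function and concurrency limit here are entirely my invention for illustration.

```python
# Conceptual sketch (NOT the real Dataflow Gen2 engine): process files
# concurrently, keyed by their path relative to the storage root, with
# a bounded number of simultaneous requests.
from concurrent.futures import ThreadPoolExecutor


def process_partition(relative_path: str) -> str:
    # Placeholder for "fetch this one file and run the query logic on it".
    return f"processed {relative_path}"


def run_partitioned(relative_paths, max_concurrency=8):
    # Each file is an independent partition, so files can run in parallel
    # up to the concurrency limit; results come back in input order.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(process_partition, relative_paths))
```

The key point the sketch tries to capture: total wall-clock time drops because slow per-file requests overlap instead of running one after another.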
Set up Dataflow Gen2's Partitioned Compute
There are two main things in Options that you need to select first.
Go to options:

Then, in Privacy, enable "Allow combining data from multiple sources. This could expose sensitive or confidential data to an unauthorized person":

Then, in Scale, enable "Allow use of partitioned compute":

Then at the Query level, set your query as staging by right-clicking your query and selecting Enable staging.

Once this is done, you need to enter the M code for the partition key. If you use the built-in combine-files experience, this happens automatically. If you want to write it yourself, you can start from Microsoft's sample code, but feel free to adjust it to your liking.
let
    rootPath = Text.TrimEnd(Value.Metadata(Value.Type(#"Filtered hidden files"))[FileSystemTable.RootPath]?, "\"),
    combinePaths = (path1, path2) => Text.Combine({Text.TrimEnd(path1, "\"), path2}, "\"),
    getRelativePath = (path, relativeTo) => Text.Middle(path, Text.Length(relativeTo) + 1),
    withRelativePath = Table.AddColumn(#"Filtered hidden files", "Relative Path", each getRelativePath(combinePaths([Folder Path], [Name]), rootPath), type text),
    withPartitionKey = Table.ReplacePartitionKey(withRelativePath, {"Relative Path"})
in
    withPartitionKey

In some cases, this code actually produced an invalid partition for me.
The important part is that the partition column's value mimics the path after the RootPath of the chosen storage. Lastly, make sure the partition column stays present in the file step of your query.
Alternative code could look like this:
addPartitionColumn = (tbl as table) =>
    let
        rootPath = Text.TrimEnd(Value.Metadata(Value.Type(tbl))[FileSystemTable.RootPath]?),
        getRelativePath = (root, relative) => Text.AfterDelimiter(relative, root),
        addPartitionColumn = Table.AddColumn(tbl, "PartitionColumn", each getRelativePath(rootPath, [Folder Path]) & [Name], type text),
        setPartitionColumn = Table.ReplacePartitionKey(addPartitionColumn, {"PartitionColumn"}),
        rootPathCheck = if rootPath = null then error "FileSystemTable.RootPath not found; it's either missing or has a different name. Partitioned Compute may not be supported." else setPartitionColumn
    in
        rootPathCheck

Invoking it together with a transformation function for your files:
addTransformationForFiles = Table.AddColumn(addPartitionColumn(source), "Transform file", each fx_single_file_transformation([Content]))

Example of an Azure Data Lake Storage Gen2 implementation:
let
    source = AzureStorage.DataLake("https://<blob_storage>.dfs.core.windows.net/"),
    addPartitionColumn = (tbl as table) =>
        let
            rootPath = Text.TrimEnd(Value.Metadata(Value.Type(tbl))[FileSystemTable.RootPath]?),
            getRelativePath = (root, relative) => Text.AfterDelimiter(relative, root),
            addPartitionColumn = Table.AddColumn(tbl, "PartitionColumn", each getRelativePath(rootPath, [Folder Path]) & [Name], type text),
            setPartitionColumn = Table.ReplacePartitionKey(addPartitionColumn, {"PartitionColumn"}),
            rootPathCheck = if rootPath = null then error "FileSystemTable.RootPath not found; it's either missing or has a different name. Partitioned Compute may not be supported." else setPartitionColumn
        in
            rootPathCheck,
    addTransformationForFiles = Table.AddColumn(addPartitionColumn(source), "Transform file", each fx_single_file_transformation([Content])),
    selectColumns = Table.SelectColumns(addTransformationForFiles, {"PartitionColumn", "Transform file"}),
    expandFiles = Table.ExpandTableColumn(selectColumns, "Transform file", {"friendlyId", "somethingFriendly", "anotherFriendly", "friendlyDate"})
in
    expandFiles

Once you run your dataflow with the right settings, go to Recent runs and open the individual run details to see whether it actually ran through Partitioned Compute. If you see Engine: PartitionedCompute, it used that engine.

If you didn't disable the Fast Copy option, which is enabled by default, your query may use that engine instead. That engine is dramatically faster, and that's a good thing. This blog doesn't focus on it, but if you have a supported data source like a Lakehouse or ADLS Gen2 and you do close to no additional transformations, it can speed up your evaluation significantly.
Dataflow Gen2's Partitioned Compute Benchmark
Now, let's look at whether we get some extra evaluation speed. As I noted in an earlier disclaimer, there are plenty of better options to process the files faster or cheaper. However, if you must use Dataflow Gen2 for any reason, this should at least help you use it better.
I ran various tests and programmatically gathered the refresh times and the consumption units tied to each refresh. All benchmarks were done in a single workspace on a Fabric Trial, which should be more or less equivalent to an F64 capacity.
Please note that the precision of Average CU/sec may be off by a few decimals.
Also note that the Recent runs UI shows slightly different start/end times for each refresh (sometimes up to 30 seconds off for these runs), so take the durations with a grain of salt.
Notebook
First, as pointed out in the disclaimer, the most efficient (or more precisely, the cheapest) way would be to use native Python (no PySpark). With some basic code, processing 401 small CSV files and writing them as a Delta table took about one minute. Because one second of a running native Python notebook costs one consumption unit, the math is rather simple.
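The "basic code" in my notebook boiled down to reading every CSV and concatenating them into one table. A minimal sketch of that idea follows; the folder path and use of pandas are illustrative (in a Fabric notebook you would point it at the Lakehouse Files mount and write the result with the deltalake package rather than keep it in memory):

```python
# Minimal sketch of the native-Python-notebook approach: read many small
# CSV files and combine them into a single table. Paths are illustrative.
import glob
import pandas as pd


def combine_csv_files(folder: str) -> pd.DataFrame:
    """Read every CSV in `folder` and concatenate into one DataFrame."""
    frames = [pd.read_csv(path) for path in sorted(glob.glob(f"{folder}/*.csv"))]
    return pd.concat(frames, ignore_index=True)


# In a Fabric notebook, the last step would be roughly (hypothetical path):
# from deltalake import write_deltalake
# write_deltalake("/lakehouse/default/Tables/my_table",
#                 combine_csv_files("/lakehouse/default/Files/csvs"))
```

Nothing clever is happening here, which is exactly the point: a sequential read of 401 small files is already fast enough to finish in about a minute.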
I just want to point out how much you have to pay for an easy (optional) clicking experience in Dataflow Gen2.
Item Type | Avg Duration (min) | Min Duration (min) | Max Duration (min) | Average CU (s) | Average CU/sec |
--- | --- | --- | --- | --- | --- |
Notebook | 1.0 | 0.9 | 1.3 | 60.1 | 0.98 |
If your pipeline also requires consuming the Delta table through a Semantic Model built via Power BI Desktop, here are the refresh times and consumption of the Semantic Model as well.
Item Type | Avg Duration (min) | Min Duration (min) | Max Duration (min) | Average CU (s) | Average CU/sec |
--- | --- | --- | --- | --- | --- |
Semantic Model | 1.2 | 1.2 | 1.3 | 2000.0 | 27.03 |
The background operations of the Lakehouse itself during the refreshes and interactions are very low in terms of CUs, around 40 CUs during the refresh of the semantic model.
Legend:
- Part. Comp. - Partitioned Compute option on/off
- Modern Engine - New modern evaluation engine option on/off
- Dur. - refresh duration in minutes
- CU - capacity unit (consumption unit)
- Semantic Model - the same query copied to Power BI Desktop, published, and then measured during refresh
Benchmark settings
- Files storage: Fabric Lakehouse Files
- File type: CSV
- File size: ~2 MB
- File count: 401
- File rows: 58k per file, roughly 23 million in total
I chose this setup because processing a large number of small files should surface the biggest speed differences and optimizations, whereas a smaller number of bigger files would probably show no difference (which is also important to know).
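For anyone who wants to reproduce a test set like the one described above, a dataset of that shape can be generated with a short script. The file names and column headers below are my invention (the columns borrow the friendly names from the M example earlier); shrink the defaults for a quick local dry run.

```python
# Hypothetical generator for a benchmark dataset like the one above:
# many small CSV files with identical schemas. Defaults approximate the
# post's setup (401 files, ~58k rows each); column names are invented.
import csv
import os


def generate_test_files(folder: str, files: int = 401, rows: int = 58_000) -> None:
    """Write `files` CSV files of `rows` data rows each into `folder`."""
    os.makedirs(folder, exist_ok=True)
    for i in range(files):
        path = os.path.join(folder, f"file_{i:03d}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["friendlyId", "somethingFriendly", "friendlyDate"])
            for r in range(rows):
                writer.writerow([r, f"value_{r}", "2024-01-01"])
```

Upload the resulting folder to Lakehouse Files (or ADLS Gen2) and point the dataflow's source at it.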
Item Type | Avg Dur. (min) | Min Dur. (min) | Max Dur. (min) | Average CU (s) | Average CU/sec | Part. Comp. | Modern Engine |
--- | --- | --- | --- | --- | --- | --- | --- |
Semantic Model | 2.7 | 2.1 | 3.9 | 5019.3 | 31.7 | Off | Off |
Dataflow Gen1 | 4.8 | 4.3 | 5.6 | 3775.8 | 13.17 | Off | Off |
Dataflow Gen2 | 10.4 | 9.4 | 11.9 | 6402.4 | 10.27 | On | On |
Dataflow Gen2 | 10.4 | 9.9 | 26.4 | 6899.9 | 8.54 | On | Off |
Dataflow Gen2 | 20.5 | 19.9 | 21.4 | 8094.2 | 6.58 | Off | On |
Dataflow Gen2 | 22.9 | 21.9 | 23.9 | 8278.3 | 6.04 | Off | Off |
The Result
From the benchmark results, we can see that, for some reason, Dataflow Gen1 is still the king for this kind of operation. Naturally, it has its own disadvantages, as you can't load the output to a destination, etc., but if those constraints are okay, it still performs wonderfully. Running the whole operation inside the Semantic Model alone proves to be the fastest, although Dataflow Gen1 consumed fewer CUs in total.
On the Dataflow Gen2 side, we can clearly see that Partitioned Compute does work; whether the Modern Engine is on or off doesn't really matter. You might point to the max duration for the combination of Partitioned Compute on and Modern Engine off, but that was an anomaly: runs of around 10 minutes were the majority, and the only time it went higher was when I ran multiple dataflows together. When isolated, it achieved the better durations.
Now, the combination of Partitioned Compute off and Modern Engine on shows that the Modern Engine may do something, but it is still really slow. Finally, having everything off is rather a disaster. I apologize for the strong words, but I think there is still work to do on the standard Dataflow engine before it performs well.
The good thing is that after I posted these results on the Microsoft Fabric subreddit, Microsoft did reach out to me. They are investigating, so I believe there may be some fixes soon.