Dataflows Gen1 and Gen2: Where is my data stored?
- Vojtěch Šíma
- Oct 26
- 5 min read
tl;dr Gen1 stores your data as CSVs in a CDM folder in Data Lake storage. To get to it, you need to link your workspace to your own ADLS Gen2 account; otherwise you won't be able to access it. If the Enhanced Compute Engine is on, refresh also loads a SQL cache you can use. Gen2 with no destination saves output in the semi-hidden DataflowsStagingLakehouse and exposes it via DataflowsStagingWarehouse. The data is stored as Delta tables backed by Parquet files.
Before you read
My background spans the entire Power BI platform, so I like to touch on different aspects of PBI development. Now that Microsoft Fabric is here, along with an upgraded version of Dataflows, I naturally got curious. Hence, this blog post.
That said, I’m not about to recommend Dataflows for data engineering. Even though Microsoft’s own survey (shared at FabCon EU 25) claims that Dataflows Gen2 is the most-used Data Factory tool in Fabric, it's also the priciest and arguably the slowest. We might get a peek into why that is in this post, but I’ll probably dive deeper into the performance and cost of Dataflows Gen2 in a separate blog.
What are Dataflows Gen1?
Dataflows Gen1, originally just called Dataflows, is now a legacy, user-friendly data 'engineering' tool, though you can still create new items. Dataflows are closely tied to Power BI because, at first glance, they share the same data transformation methods. In other words, they both use Power Query to transform data. Behind the scenes, the processing might differ, but the GUI and the language are basically the same.
You’d use Dataflows to, for example, load or stage files into a table as-is, offload heavier transformations outside the semantic model, or separate historical data that doesn’t need to be refreshed that often. It was primarily designed for people who prefer clicking their way through things, or who came from Excel and wanted to use Power Query in the cloud.
Since it still allowed writing M code, you could build everything manually. That meant you could connect to pretty much anything, just like in a more traditional data engineering tool.
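If you've never written M by hand, here's a minimal sketch of the sort of query you could paste into a blank dataflow query. The URL and field names are made up; the point is just that any source Power Query can reach, a dataflow can too.
let
    // Hypothetical REST endpoint, purely to show that arbitrary sources are reachable from M
    Source = Json.Document(Web.Contents("https://api.example.com/items")),
    // Wrap the JSON array in a single-column table, then expand the records into columns
    AsTable = Table.FromList(Source, Splitter.SplitByNothing(), {"Item"}),
    Expanded = Table.ExpandRecordColumn(AsTable, "Item", {"id", "name"})
in
    Expanded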
That’s probably enough intro to Dataflows Gen1 or Dataflows in general, for that matter. Now let’s take a look at how they actually store data.
Dataflows Gen1 storage
From now on, I’ll use D1 for Dataflows Gen1.
Whenever you build anything in D1, it tries to produce a table as the output. For example, if your last step returns a JSON document rather than a table, D1 will auto-add steps to turn it into one. It’s just enough to get a table, so don’t rely on it to properly parse your files.
Same story with nested fields. If you have a table with a column that contains a nested record, it’s still a valid table, but the nested value can’t be written to the output, so it will most likely be flagged as an error and replaced with null. That’s what Dataflows does for all tables, by the way: it replaces cell-level errors with null. If an error prevents a step from running at all, the refresh just fails.
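To make that concrete, here's a minimal M sketch of both behaviours. This is not the exact code D1 injects (that varies by source), just the same idea: coerce a non-table value into a table, then swap cell-level errors for null.
let
    // A record rather than a table - D1 would auto-add a conversion step along these lines
    Source = [a = 1, b = 2, broken = Number.FromText("not a number")],
    AsTable = Record.ToTable(Source),
    // Cell-level errors become null in the stored output instead of failing the refresh
    Cleaned = Table.ReplaceErrorValues(AsTable, {{"Value", null}})
in
    Cleaned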
With successfully loaded tables, D1 generates a CSV snapshot of the run (not just metadata, the actual data) and dumps it into Microsoft’s internal Data Lake storage under a CDM (Common Data Model) folder, where a model.json describes the entities and the CSVs hold the data. You can’t normally access it. If you really want those outputs, link the workspace to your own ADLS Gen2 account; then you can get to them.
Since it’s CSV, there’s no query folding or other fancy features when you consume the data. It can also be slow to read, because you’re pulling from a CSV, not a nicely indexed and normalized database.
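If you do link your own ADLS Gen2 account, a rough M sketch of pulling one of those CSV snapshots back out could look like the following. The storage account is a placeholder, and the exact CDM folder layout (workspace/dataflow/entity, snapshot file names) will differ in your tenant.
let
    // Placeholder endpoint - dataflow exports land in a "powerbi" filesystem in the linked account
    Lake = AzureStorage.DataLake("https://mystorageaccount.dfs.core.windows.net/powerbi"),
    // Grab the first CSV snapshot we find; in reality you'd navigate to a specific entity folder
    Snapshot = Table.SelectRows(Lake, each Text.EndsWith([Name], ".csv")){0}[Content],
    // Column names aren't in the CSV itself - they're described by the CDM folder's model.json
    Data = Csv.Document(Snapshot, [Delimiter = ",", Encoding = 65001])
in
    Data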
Dataflows Gen1 with Enhanced Compute Engine storage
From now on, I’ll use D1 ECE for Dataflows Gen1 with Enhanced Compute Engine.
Many of you probably haven’t heard of this feature. If you have Premium capacity or Premium Per User, you can set your dataflows to use the Enhanced Compute Engine. It’s a regular dataflow setting, sitting right next to Scheduled Refresh.
In terms of storage, it functions similarly to D1. On top of that, in Microsoft’s own words: "The enhanced compute engine is an improvement over the standard engine, and works by loading data to a SQL Cache and uses SQL to accelerate table transformation, refresh operations, and enables DirectQuery connectivity."
Essentially, you can low-key treat your dataflows like 'a database' and even use DirectQuery as the connection method. In DQ mode there’s no separate semantic model refresh to wait for: once the dataflow refreshes, the report shows the new data the next time its visuals query it, whether that’s an automatic page refresh or you interacting with the report.
This SQL cache isn’t accessible outside of Dataflows, and tracing the SQL behind it is a struggle too. About the best you can do is check the Native Query to see what SQL gets sent, and even that is quite generic, generated stuff:
select [$Table].[1] as [1]
from [schema$1d6a371e-056f-4526-87eb-20110455b4a8].[Table] as [$Table]
Dataflows Gen2 storage
From now on, I’ll use D2 for Dataflows Gen2.
D2 has a very similar interface to D1. Under the hood, it spins up a different version of the mashup engine that is arguably faster for some operations (and/or slower for others). The benefit of D2 is the option to move the data to a destination. That can be a Lakehouse, Warehouse, SharePoint, etc.
Just bear in mind, as I mentioned earlier, D2 is way more expensive than a Notebook or a Data pipeline in terms of CUs (capacity units): https://learn.microsoft.com/en-us/fabric/data-factory/pricing-dataflows-gen2
You can also skip choosing a destination and set your queries as Staging. This creates a staging Lakehouse and Warehouse that you then use to consume your dataflows when you want to connect through Power BI, for example.
However, you’re still meant to go through the Dataflows connector, even though you can technically reach the staging Warehouse through the OneLake Catalog or the Power Query wizard. The Warehouse is only intermediate, temporary storage, and connecting to it directly isn’t advised, as it may not hold the right data at any given moment.
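For completeness, here's a sketch of the supported route, the Dataflows connector. This is roughly the navigation the Power Query wizard generates for you; the IDs and names below are placeholders you'd replace with your own workspace, dataflow and table.
let
    // Dataflows connector - the supported way in, rather than querying the staging Warehouse directly
    Source = PowerPlatform.Dataflows(null),
    Workspaces = Source{[Id = "Workspaces"]}[Data],
    Workspace = Workspaces{[workspaceId = "00000000-0000-0000-0000-000000000000"]}[Data],
    Dataflow = Workspace{[dataflowId = "00000000-0000-0000-0000-000000000000"]}[Data],
    MyTable = Dataflow{[entity = "MyTable", version = ""]}[Data]
in
    MyTable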
As for the hard files, they live in the staging Lakehouse, namely StagingLakehouseForDataflows. You can easily view the files via OneLake Explorer for Windows, for example. Inside the Tables subfolder, you will find the Parquet files and the _delta_log (the Delta transaction log). You can also see a Files subfolder, which mostly holds metadata for the Parquet files in Tables.
When connecting to the dataflows, you’re going through a SQL endpoint, so reads will be drastically faster than with D1. Your downstream transformations can also fold, and you can view the native query.
The native query sent back to the Lakehouse for a few clicked Power Query steps (a sketch of those steps follows the query):
select [_].[Column1] as [Column1],
    'is' as [Column2],
    'gen2' as [Column3]
from
(
    select [_].[Column1]
    from [StagingLakehouseForDataflows_20251026173245].[dbo].[a9b2c3d465e147f2b8812e3a92b4c8d1_d18263f1_002Df9a2_002D47aa_002D93b1_002D8cf3b2a10c77] as [_]
    where [_].[Column1] = 'this' and [_].[Column1] is not null
) as [_]
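For context, these are roughly the clicked steps behind a query like that: one filter plus two literal custom columns. The step and query names are illustrative, and MyTable stands in for the staging table reached via the Dataflows connector.
let
    // Stand-in for the table reached through the Dataflows connector navigation
    Source = MyTable,
    // Row filter - folds into the WHERE clause of the native query above
    Filtered = Table.SelectRows(Source, each [Column1] = "this"),
    // Literal custom columns - fold into the outer SELECT as constants
    WithIs = Table.AddColumn(Filtered, "Column2", each "is"),
    WithGen2 = Table.AddColumn(WithIs, "Column3", each "gen2")
in
    WithGen2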


