Automated Delta Table Maintenance in Fabric Lakehouse (Without PySpark)
- Vojtěch Šíma
- Feb 13
- 13 min read
Updated: Feb 15
tl;dr Delta tables work, in short, by generating new files (JSONs, Parquets) upon transactions (writes, updates, etc.). This generally produces lots of files, so you need to clean them up. If you want to save CU by eliminating Spark jobs, you can use the delta-rs library. It offers standard methods like optimize (compact, Z-order), vacuum, etc. For V-ordering, you currently need a Spark job. For more details, read the blog post.
How do delta tables work
Delta log (_delta_log)
To understand the maintenance needs for delta tables, we need to understand how they work. Delta tables have a core component called the transaction log (also known as the delta log, I will use these terms interchangeably). Delta log is a folder that contains ordered individual logs as JSON files that describe commits. Every single transaction (commit) of your delta table creates a new JSON file and contributes to the next table version. A single transaction usually does not describe the full table by itself, it only records what changed.
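To make the version bookkeeping concrete, here is a minimal Python sketch (the folder path and its contents are hypothetical, not a delta-rs API) that derives the latest commit version from the zero-padded commit file names in a _delta_log folder:

```python
from pathlib import Path

def latest_commit_version(delta_log_dir: str) -> int:
    """Return the highest commit version found in a _delta_log folder.

    Commit files are named as 20-digit zero-padded versions, e.g.
    00000000000000000600.json corresponds to version 600.
    """
    versions = [
        int(p.stem)
        for p in Path(delta_log_dir).glob("*.json")
        if p.stem.isdigit()
    ]
    return max(versions) if versions else -1

# The naming scheme itself is easy to reproduce:
name = f"{600:020d}.json"  # '00000000000000000600.json'
```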
Individual files are zero-padded JSON files; this can look like the following.
00000000000000000600.json
Checkpoints (*.checkpoint.parquet)
To handle these commit logs better, delta tables have their own mechanism to help readers catch up faster through checkpoints. A checkpoint is a Parquet file in the delta log folder that consolidates the state of the transaction log up to a specific version, so the reader does not have to replay the whole log from the beginning. The reference documentation suggests a checkpoint every 10 commits by default, but you should not rely on that as a hard rule. You can also create a checkpoint on demand.
00000000000000000599.checkpoint.parquet
Latest checkpoint pointer (_last_checkpoint)
Because you can have many checkpoints, there is also a pointer file called _last_checkpoint. That is a small JSON file pointing at the latest checkpoint, so readers can jump to it instead of scanning a huge _delta_log directory.
Note that this file technically has no file extension/type, but you can read it as json.
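A minimal sketch of reading that pointer (the JSON contents below are illustrative; real _last_checkpoint files may carry more fields):

```python
import json

# Hypothetical contents of a _last_checkpoint file; writers typically
# record at least the checkpoint "version" and "size".
raw = '{"version": 599, "size": 42}'

last_checkpoint = json.loads(raw)
checkpoint_version = last_checkpoint["version"]  # 599
```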
Think of the last checkpoint literally like a game save. When you read the table, you 'spawn' at the latest checkpoint. Any progress made after that (newer than the checkpoint) is effectively 'unsaved progress' that the engine must read and replay to catch up. In short, if your checkpoint is version 850, you spawn there. But to see the current state, the engine must also read every single JSON above 850 (851, 852, etc.) and apply those changes.
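The replay step can be sketched in a few lines (a toy helper, not a delta-rs API):

```python
def versions_to_replay(checkpoint_version: int, latest_version: int) -> list:
    """Commit versions the reader must apply on top of the checkpoint."""
    return list(range(checkpoint_version + 1, latest_version + 1))

# 'Spawn' at checkpoint 850, then catch up to current version 853:
print(versions_to_replay(850, 853))  # [851, 852, 853]
```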
_last_checkpoint
Parquet data files
The second component of a Delta table is the actual data. The data is written as Parquet files. Parquet is immutable, so you do not “patch” a file. You write new ones. Appends add more files, and updates usually rebuild the file that had the row you changed, pulling along the other rows from that file too.
7246655d-957d-4c83-aab5-cb9a2a21ce1a.parquet
part-00000-9800ad29-a568-4a9c-9273-df62e566c7b4-c000.snappy.parquet
Partitions
If you have partitions on your data, the delta table folder structure expands so that each distinct value for the partition column gets its own folder in the form partition_column=column_value, and the Parquet files for that value live inside that folder. If you have more partition columns, the folders become nested, for example, year=2026/month=02/day=12, and each leaf folder contains its own set of Parquet files. This is just the physical layout on storage, and it can help queries skip unrelated data when they filter by the partition column, but it also means lots of distinct values can create a lot of folders and files.
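A small sketch of how those Hive-style folder names are built (the year/month/day layout mirrors the example above):

```python
from datetime import date

def partition_path(d: date) -> str:
    """Nested partition folders in the partition_column=value form."""
    return f"year={d.year}/month={d.month:02d}/day={d.day:02d}"

print(partition_path(date(2026, 2, 12)))  # year=2026/month=02/day=12
```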
Even though maintenance technically should also validate your partitions, this blog post doesn't cover that. Generally (and very simply), if each potential partition will hold at least 1 GB of data, go ahead; otherwise, don't. (This is a simplification, though.)
date=2025-03-05
How Delta tables resolve (the latest) versions
The logic of producing the actual queryable version of your delta table is theoretically simple. Here's a scenario: When you have a transaction that adds three rows, you write a Parquet file with those rows. When you add another row later, you write a new Parquet file again. Later, if you have a transaction that modifies existing rows, you take the file(s) that contain those rows, write new file(s) with the updated rows, and the older rows from those same files come along for the ride.
If we visualize it, it can look like this:
| Transaction | Operation | Delta Log Action | Physical File Content | Current Active Files (Snapshot) |
| --- | --- | --- | --- | --- |
| 1 | Initial Write | add(File 1) | File 1: {row1, row2, row3} | File 1 |
| 2 | Append Data | add(File 2) | File 2: {row4} | File 1, File 2 |
| 3 | Update row1 | remove(File 1), add(File 3) | File 3: {row1*, row2, row3} (row2 & row3 "tag along") | File 2, File 3 |
| 4 | Update row3 | remove(File 3), add(File 4) | File 4: {row1*, row2, row3*} (row1 & row2 "tag along") | File 2, File 4 |
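The snapshot column above is just the result of replaying add and remove actions in order; a toy simulation (not the real log format) makes that explicit:

```python
def active_files(actions):
    """Replay (action, file) pairs from the log to get the current snapshot."""
    snapshot = set()
    for action, file in actions:
        if action == "add":
            snapshot.add(file)
        elif action == "remove":
            snapshot.discard(file)
    return snapshot

log = [
    ("add", "File 1"),                        # 1: initial write
    ("add", "File 2"),                        # 2: append
    ("remove", "File 1"), ("add", "File 3"),  # 3: update row1
    ("remove", "File 3"), ("add", "File 4"),  # 4: update row3
]
print(sorted(active_files(log)))  # ['File 2', 'File 4']
```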
Some Delta implementations support deletion vectors, which can record deletes and some updates without immediately rewriting the full Parquet file. The table still behaves the same from a query point of view, but the physical rewrite can be deferred until later compaction or maintenance.
Quick signals before maintenance
Before you do any optimization or maintenance, you may want to know how healthy your Delta table looks from a file perspective: how many active Parquet files make up the latest version, whether those files are tiny or reasonably sized, and whether you have old unreferenced files waiting for cleanup. Even though you will probably do maintenance anyway, it is still useful to check the table health once in a while and see how your transactions affect it over time. Automating a health check is not part of this blog post, but I am giving you a few tools to check it, and I trust you can wire it into automation later yourself.
Visual look
If you first just want to see what the folder structure looks like, go to the Lakehouse, expand Tables, click a single table, then open the Files view for that table. This is handy if you do not yet know what the thing looks like. I recommend doing that.
If you use OneLake file explorer in Windows, you can also navigate through the same files there.
Delta-rs
Since this blog focuses on non-Spark operations, I am using Delta-rs via the deltalake Python package. It is an excellent tool for a whole range of Delta table operations; for this specific scenario, we will be using the DeltaTable class.
Get add actions
This is a good place to start checking the health. This method gives you a list of all files with an add action that contribute to the latest version. With small data, it can be as simple as one consolidated Parquet file. If you have hundreds or thousands of entries here, you likely need maintenance.
In Python - run with Lakehouse attached:
from deltalake import DeltaTable
workspace_id = 'workspace_guid'
lakehouse_id = 'lakehouse_guid'
table_name = 'your_delta_table_name'
dt = DeltaTable(f'abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}/Tables/{table_name}')
display(dt.get_add_actions())
For a nicer result, add the built-in to_pandas:
# previous config
display(dt.get_add_actions().to_pandas())
History
To see the whole history of operations, you can use the history() method. This will list operations like write, optimize, and vacuum with tags and operation parameters. As a rule of thumb, if you see many writes but no vacuum or optimize, do some maintenance.
In Python - run with Lakehouse attached:
from deltalake import DeltaTable
workspace_id = 'workspace_guid'
lakehouse_id = 'lakehouse_guid'
table_name = 'your_delta_table_name'
dt = DeltaTable(f'abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}/Tables/{table_name}')
display(dt.history())
Dry vacuum
Vacuum is an operation that removes unreferenced data files older than the retention threshold. In Fabric, the default file retention threshold is 7 days for Delta tables in OneLake, so files must be older than that to be eligible.
If you want to first see what is vacuumable, you can run the vacuum method with dry_run=True to get a glimpse of what would be deleted.
In Python - run with Lakehouse attached:
from deltalake import DeltaTable
workspace_id = 'workspace_guid'
lakehouse_id = 'lakehouse_guid'
table_name = 'your_delta_table_name'
dt = DeltaTable(f'abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}/Tables/{table_name}')
display(dt.vacuum(dry_run=True))
The code above uses the default retention period settings. If you want full control, write something like this:
# previous config
display(dt.vacuum(dry_run=True, retention_hours=0, enforce_retention_duration=False))
Always double-check that dry_run=True; otherwise, you will actually vacuum the files.
Note that VACUUM does not remove transaction logs; therefore, a dry run will not list them. To identify which logs are eligible for deletion, you must filter the table history() for entries older than the log retention period. Note that logs are deleted automatically after at least one checkpoint and after exceeding the retention period.
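A sketch of that history filter (assuming, as delta-rs history() entries typically do, that each entry carries a 'timestamp' in milliseconds since epoch):

```python
from datetime import datetime, timedelta, timezone

def log_entries_older_than(history, retention_days=30):
    """Filter history() entries older than the log retention period.

    `history` is the list of dicts returned by DeltaTable.history();
    entries missing a timestamp are treated as older than the cutoff.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    cutoff_ms = int(cutoff.timestamp() * 1000)
    return [e for e in history if e.get("timestamp", 0) < cutoff_ms]
```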
To view the table configuration and retention settings, you can use:
# previous config
display(dt.metadata().configuration)
If the result is empty, the table is using the defaults: 30 days for logs and 7 days for Parquet files.
Alternative: Physical storage audit
Iterating over folders to count files is an anti-pattern for health checks. A raw count is misleading because it mixes active data with leftovers (tombstoned files), and it is often slow on partitioned tables. Delta’s state is defined by the transaction log anyway, so folder counts can be noisy.
Physical iteration is useful for storage audits because it shows the full storage footprint of the table folder, including logs and deleted files that still exist until VACUUM cleans them up. It still does not tell you what is safe to remove. For actionable signals, rely on the transaction log methods (history(), get_add_actions()) and vacuum(dry_run=True) instead.
Maintenance tools for Delta tables in Fabric Lakehouse
As I mentioned earlier, I am focusing on Python notebooks outside of PySpark or SparkSQL. Therefore, we will again focus on the Delta-rs options, as we can do all the mandatory maintenance operations with it, without touching Spark.
You can't do V-Order with this API; that can currently be done only through Spark. I'll cover what V-Order is, whether you need it, and how you can technically run it through the Fabric REST API.
Before we jump into the implementation itself (which will be really quick), I want to explain the tools we will be using.
OPTIMIZE
Optimize is a core Delta table feature that allows you to compact small files into larger ones (bin-packing). If you have a lot of data, OPTIMIZE will efficiently distribute your files into properly sized Parquet files. This significantly benefits read performance later on. The bin maximum size can be configured, and the default for Spark Delta Lake is 1GB. For delta-rs, the default target file size is 256MB.
In Fabric, the optimization process can be slightly adjusted with a couple of variants:
Z-Order
Z-ordering is a technique to colocate related information in the same set of files, so the engine can skip more files during reads. Delta collects data skipping stats (min and max values), and Z-order improves how useful those stats are by shaping the file layout around columns you filter on often.
This is great when you know which columns users query a lot. Adding too many columns usually means more work during optimization and diminishing returns, so keep it meaningful.
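The data-skipping idea behind Z-order can be shown with a toy min/max check (the stats values here are made up for illustration):

```python
def can_skip_file(stats, column, value):
    """A file can be skipped for an equality filter when the value falls
    outside the file's recorded [min, max] range for that column."""
    lo, hi = stats[column]
    return value < lo or value > hi

# Two files with min/max stats on 'timestamp' (illustrative numbers).
# Good clustering keeps these ranges narrow and non-overlapping.
file_a = {"timestamp": (100, 200)}
file_b = {"timestamp": (201, 300)}

print(can_skip_file(file_a, "timestamp", 250))  # True: 250 not in [100, 200]
print(can_skip_file(file_b, "timestamp", 250))  # False: this file must be read
```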
V-Order
V-Order is a write-time optimization to Parquet used in Microsoft Fabric. Microsoft describes it as sorting, row group distribution, encoding, and compression that improve read performance across Fabric engines.
Microsoft’s claim is: around ~15% write-time overhead on average, up to 50% better compression, and read-time improvements (Spark and other non-Verti-Scan engines average ~10% faster reads, with some scenarios up to 50%)
It is disabled by default in new Fabric workspaces, and it is mainly recommended for read-heavy scenarios. Also, you can apply it retroactively by rewriting files during optimization.
As I said above, V-Order can't currently be done via delta-rs; I'll cover how you can do it later, should you need it.
VACUUM
VACUUM cleans up files that are no longer referenced by the current Delta table log. This matters because OPTIMIZE rewrites files and leaves old ones around until VACUUM removes them after the retention threshold.
In Fabric, the default file retention threshold is seven days. Setting a shorter retention impacts time travel and can be risky with concurrent readers and writers. Fabric also says the UI and public APIs fail by default for retention intervals under 7 days unless you disable the retention duration check.
Implementation of maintenance tools with Delta-rs
Let’s look at how to implement the maintenance tools in a Python notebook. Quick note before you do anything: run these commands outside of any changes that could happen to your Delta tables. Some of these methods can fail if other operations conflict with the maintenance run (especially anything that removes or rewrites files).
You can also treat the order of these steps as the order you may want to replay in your pipeline.
If you skipped directly here, I recommend reading about the individual parts first (look above)
For each method, find an in-depth explanation and parameter details here.
OPTIMIZE & Z-ORDER
To optimize (bin-pack) files, we can use the compact() method. Optimize itself does not delete files; it only creates new ones and marks old ones as removed, so this step alone is not enough (see Vacuum later).
If this operation happens concurrently with any operations other than append, it will fail.
If you decide that your table benefits from Z-ordering, do not run optimize twice. Z-order is a superset of compact, so either run compact alone or, if you choose Z-order, run only Z-order.
compact(partition_filters=None, target_size=None, max_concurrent_tasks=None, min_commit_interval=None)
Quick tips:
When partition_filters is provided, you will optimize only that partition or partitions.
target_size is in bytes. 256 MB would be 256*1024*1024 (also equal to 2**8 * 1024**2).
max_concurrent_tasks defaults to the number of CPUs.
Running compact() twice (without new changes in between) does nothing.
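A sketch of a targeted compact() call using those tips; the 'date' partition column is hypothetical, and the tuple form of partition_filters follows delta-rs conventions:

```python
# target_size is in bytes: 256 MB.
TARGET_SIZE = 256 * 1024 * 1024

def compact_partition(dt, partition_value: str):
    """Sketch of a targeted compaction; `dt` is an opened DeltaTable.

    partition_filters takes (column, operator, value) tuples, so this
    compacts only the files under date=<partition_value>.
    """
    return dt.optimize.compact(
        partition_filters=[("date", "=", partition_value)],
        target_size=TARGET_SIZE,
    )
```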
Run with defaults:
from deltalake import DeltaTable
workspace_id = 'workspace_guid'
lakehouse_id = 'lakehouse_guid'
table_name = 'your_delta_table_name'
dt = DeltaTable(f'abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}/Tables/{table_name}')
dt.optimize.compact()
If you want to tag along Z-order, define the columns you want to run it with:
z_order(columns, partition_filters=None, target_size=None, max_concurrent_tasks=None, max_spill_size=21474836480, min_commit_interval=None)
Run with defaults and the column 'timestamp':
from deltalake import DeltaTable
workspace_id = 'workspace_guid'
lakehouse_id = 'lakehouse_guid'
table_name = 'your_delta_table_name'
dt = DeltaTable(f'abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}/Tables/{table_name}')
dt.optimize.z_order(["timestamp"])
Practical note: once you have used Z-order, you cannot easily go back. You either stop running Z-order for later runs, or, if you want to get completely rid of the current layout, you have to rewrite the whole table or choose a different Z-order (because Z-order physically rewrites data files).
VACUUM
Once you compact your files, you should remove the old “garbage” or “ghost” files. As we know, optimize itself does not delete anything from storage, it only writes new compact files and marks old ones as removed, so the old ones are still there until Vacuum removes them.
Important note: if you compact your data today, the retention window for those removed files starts when they were logically removed (tombstoned) by that optimize commit. Therefore, if you run Vacuum immediately after, you likely will not delete the current run’s demoted files unless you override retention enforcement.
However, since you are doing this periodically, it is still very important to run Vacuum, as you will remove files from the previous run, for example, if it was one week ago.
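The timing logic can be sketched with a small helper (the dates are illustrative):

```python
from datetime import datetime, timedelta

def vacuum_eligible(tombstoned_at: datetime, now: datetime,
                    retention_hours: int = 7 * 24) -> bool:
    """A removed file becomes deletable once the retention window,
    counted from the moment it was tombstoned, has fully elapsed."""
    return now - tombstoned_at >= timedelta(hours=retention_hours)

optimize_run = datetime(2026, 2, 6, 12, 0)
# Immediately after the optimize commit: not eligible yet.
print(vacuum_eligible(optimize_run, optimize_run))                       # False
# One week later, the next maintenance run can sweep them.
print(vacuum_eligible(optimize_run, optimize_run + timedelta(days=7)))   # True
```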
Naturally, if you do not need time travel (going back to an older version of the Delta table), you can set your retention period to 0 and sweep right away, but it is not recommended because it breaks time travel and can also break long-running readers. Use with extreme care.
Run vacuum, with other parameters as default:
from deltalake import DeltaTable
workspace_id = 'workspace_guid'
lakehouse_id = 'lakehouse_guid'
table_name = 'your_delta_table_name'
dt = DeltaTable(f'abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}/Tables/{table_name}')
dt.vacuum(dry_run=False)
Not-recommended total sweep:
# previous config
dt.vacuum(retention_hours=0, enforce_retention_duration=False, dry_run=False)
Logs cleanup
Technically, if you want, you can also clean up your logs. This is done automatically after checkpoint operations, but checkpoint timing can be inconsistent across environments, so you can handle it yourself as well.
The default retention period for logs is 30 days. After checkpointing, older log entries are eligible to be cleaned up based on delta.logRetentionDuration. If you lower this retention, you also reduce how far back you can time travel using the log history.
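If you want to reason about that interval string in Python, a small parser sketch (a hypothetical helper handling only the day-based form) could look like this:

```python
from datetime import timedelta

def parse_interval_days(interval: str) -> timedelta:
    """Parse the 'interval N days' form used by delta.logRetentionDuration.

    The property also accepts other units (hours, weeks, ...); this sketch
    deliberately handles only days.
    """
    parts = interval.split()
    if len(parts) != 3 or parts[0] != "interval" or parts[2] != "days":
        raise ValueError(f"unsupported interval: {interval!r}")
    return timedelta(days=int(parts[1]))

print(parse_interval_days("interval 14 days"))  # 14 days, 0:00:00
```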
The workflow to clean up logs can look like this:
Optional: set up a new delta.logRetentionDuration
Create a checkpoint
Clean up metadata
Here is the run with a new retention period set to 14 days:
from deltalake import DeltaTable
workspace_id = 'workspace_guid'
lakehouse_id = 'lakehouse_guid'
table_name = 'your_delta_table_name'
dt = DeltaTable(f'abfss://{workspace_id}@onelake.dfs.fabric.microsoft.com/{lakehouse_id}/Tables/{table_name}')
dt.alter.set_table_properties({"delta.logRetentionDuration": "interval 14 days"})
dt.create_checkpoint()
dt.cleanup_metadata()
To check custom configurations, like altered retention periods, run this (if blank, you use defaults):
# previous config
dt.metadata().configuration
One more practical note: there have been reports in some versions or mixed writer scenarios where custom delta.logRetentionDuration was not respected exactly as expected, so validate this on a non-critical copy of a table first.
V-Order maintenance for Delta tables in Fabric Lakehouse
As I mentioned earlier, V-Order is not present in the delta-rs maintenance tools at the time of writing. Microsoft also doesn't give you a way to apply it without Spark.
You do have a Fabric REST API that you can technically call without Spark, but it's a Spark job anyway.
I have not personally benchmarked V-Order yet, so I cannot speak about it with proper confidence. However, others who have benchmarked it point out that V-Order is definitely not an always-on feature. V-Order has write overhead, and therefore cost overhead, with the benefit of faster and, theoretically, less expensive reads.
For scenarios when you require Direct Lake connection, or when you have a partitioned Warehouse, you may benefit from V-Order.
Spark SQL control for OPTIMIZE with V-Order (and Z-Order)
In relation to delta table maintenance, you can write the following Spark SQL to control the optimization with V-Order and Z-Order (they can work together). These command shapes are documented by Microsoft Fabric:
%%sql
OPTIMIZE <table|fileOrFolderPath> VORDER;
OPTIMIZE <table|fileOrFolderPath> WHERE <predicate> VORDER;
OPTIMIZE <table|fileOrFolderPath> WHERE <predicate> [ZORDER BY (col_name1, col_name2, ...)] VORDER;
When ZORDER and VORDER are used together, Apache Spark performs bin-compaction, ZORDER, and VORDER sequentially.
On partitioned tables, you may also check out Optimize Write. That should help you write optimally sized partition files. Read more about it here.
Alternative in preview maintenance tool
If you want to try something new and still in preview, you can try Fabric’s REST API for Table Maintenance that helps you do everything in a single call.
This API is explicitly marked as preview and not recommended for production use. But it's still worth knowing that it exists.
This operation runs a background job and enables you to optimize with V-Order, Z-Order, and vacuum with an optional interval (retention period).
You can run this in a Python notebook; basic polling code with simple error handling can look like this:
import requests
import time
from datetime import datetime

# notebookutils is available inside Fabric notebooks
token = notebookutils.credentials.getToken("pbi")

workspace_id = "<workspace_id>"
lakehouse_id = "<lakehouse_id>"
table_name = "<table_name>"
schema = "dbo"  # dbo is the default

shared_headers = {
    "Authorization": f"Bearer {token}",
    "Content-Type": "application/json"
}

payload = {
    "executionData": {
        "tableName": table_name,
        "schemaName": schema,
        "optimizeSettings": {
            "vOrder": True
        },
        "vacuumSettings": {
            "retentionPeriod": "7:01:00:00"  # d:hh:mm:ss
        }
    }
}

def run_maintenance_request():
    return requests.post(
        f"https://api.fabric.microsoft.com/v1/workspaces/{workspace_id}/lakehouses/{lakehouse_id}/jobs/TableMaintenance/instances",
        headers=shared_headers,
        json=payload
    )

while True:
    response = run_maintenance_request()

    if response.status_code == 429:
        print("Too Many Requests")
        retry_after = int(response.headers.get("Retry-After", 60))
        time.sleep(retry_after)
        continue

    if response.status_code == 202:
        print("Submitted. Looping for updates.")
        status_url = response.headers.get("Location")
        retry_after = int(response.headers.get("Retry-After", 60))
        while True:
            time.sleep(retry_after)
            status_resp = requests.get(status_url, headers=shared_headers)
            status_data = status_resp.json()
            state = status_data.get("status")
            print(f"Current Status: {state} - {datetime.utcnow()}")
            if state in ["Completed", "Failed", "Cancelled", "Deduped"]:
                if state == "Failed":
                    print("Error Details:", status_data.get("failureReason", "No details provided"))
                break
        break

    # Any other status: report and stop instead of retrying forever
    print(f"Failed to submit: {response.text}")
    break


