Being able to update and delete records is a feature that was lost in the transition from traditional data warehouses to data lakes. While data lakes were great at solving scale and cost issues, they sacrificed the ability to update and delete records. A data lake is made up of many, many files that can quickly turn into a data swamp, and that is where things went wrong and the lakehouse architecture came to the rescue.
The lakehouse architecture is a hybrid of data warehouses and data lakes that combines the strengths of both to address their weaknesses. One of those weaknesses is the lack of ACID transactions with DML support, a much-loved feature of data warehouses, and that is where Delta Lake shines. However, because of Delta's ACID guarantees, in-place file changes are out of the question.
In this post, I will address how Delta supports DML commands, how deletion vectors work, and why they are important as a performance improvement.
With the help of the transaction log, Delta Lake supports DML commands like UPDATE, DELETE, and MERGE. For simplicity, we'll focus on UPDATE and DELETE, since they are simpler and work the same way under the hood.
So, what happens when the following query is executed?
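The query in question is a bulk update matching the predicate described in the steps below; a representative version might look like this (the table and column names are illustrative assumptions):

```sql
UPDATE people
SET salutation = 'Mr.'
WHERE gender = 'M';
```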
There are three steps involved:
- Find the files where the predicate gender = ‘M’ matches
- For each of the files found, rewrite the file with the records updated
- Remove File 2 and File 4 from the transaction log and add File 5 and File 6
The same logic is used for a Delete command. The process of rewriting the files with the required updates is called copy-on-write.
By copying the entire file with the updated records, we avoid in-place modifications and can concurrently navigate the transaction log and the several versions that make up the latest state of a Delta table (time travel).
For the above query, this strategy does not seem to be a problem since there is a high percentage of the records that will be updated, but what about this one?
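The contrasting query is a single-row update; a representative version might be (table, column, and id are illustrative assumptions):

```sql
UPDATE people
SET salutation = 'Dr.'
WHERE id = 42;
```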
In this case, the file that contains this specific id has to be fully replaced by a new one without the record. For small files this is not an issue, but with files in the range of hundreds of MB it is highly inefficient.
In sum, copy-on-write is great for fast reads, since all the data is always present in the same file, and works well when DML operations are rare. On the other hand, writes are expensive, making it a poor strategy when there are many small update operations like the one we've just seen.
Delta Deletion Vectors (DVs) are a mechanism for improving write performance when an update or delete changes only a very small percentage of the records in a file, which copy-on-write would otherwise handle by rewriting the whole file.
Put in simple terms, they are an array of RoaringBitmaps that maps directly onto rows of Parquet files (the default 32-bit implementation is not enough to cover the possible number of rows). DVs are created by commands that delete or update existing data and act as markers used to filter out rows when scanning data. In the case of a delete, a DV acts as a deletion marker, as the name indicates; in the case of an update, it acts as a row invalidator, which I'll detail further below.
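To make the filtering idea concrete, here is a minimal sketch of how a scan applies a deletion vector. A plain Python set stands in for the RoaringBitmap; the real implementation stores 64-bit row indices in a compressed bitmap, not a set.

```python
# A deletion vector marks row positions that are logically deleted.
# At scan time, rows whose position appears in the DV are filtered out.

def scan_with_dv(rows, deletion_vector):
    """Return only the rows whose position is not marked in the DV."""
    return [row for i, row in enumerate(rows) if i not in deletion_vector]

rows = ["alice", "bob", "carol", "dave"]
dv = {1, 3}  # positions logically deleted (stand-in for a RoaringBitmap)

print(scan_with_dv(rows, dv))  # ['alice', 'carol']
```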
Let’s use the people10m public dataset that we’ve been using in the previous Delta Lake posts and analyze how DVs actually work.
We should have a table like this:
The first thing we need to do is enable the feature by running:
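DVs are enabled per table via a table property (assuming the table is named `people10m` as in the dataset above):

```sql
ALTER TABLE people10m
SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true');
```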
Note: Be aware that turning this feature on upgrades the table protocol to readerVersion=3 and writerVersion=7, which might break compatibility with older readers/writers.
Let’s force the creation of a deletion vector by deleting an id from the table:
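Deleting a single id is enough to trigger a DV instead of a file rewrite:

```sql
DELETE FROM people10m WHERE id = 1;
```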
Without DVs, we would expect a new file in the transaction log containing all the ids from the original file except id 1. Let's see what happened by inspecting the transaction log.
Unpacking the transaction log
The commitInfo entry contains the information about the DELETE operation.
Here we see that numDeletionVectorsAdded:"1" and numAddedFiles:"0", which means that we avoided rewriting a file.
You may find it odd that the log includes a remove entry, similar to standard copy-on-write behavior, even though the metrics above indicate that no file was removed (numRemovedFiles:"0").
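An abbreviated sketch of the commitInfo entry, showing only the metrics discussed here (field order and surrounding fields omitted):

```json
{
  "commitInfo": {
    "operation": "DELETE",
    "operationMetrics": {
      "numDeletionVectorsAdded": "1",
      "numAddedFiles": "0",
      "numRemovedFiles": "0"
    }
  }
}
```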
The add entry explains everything that is happening with DVs enabled.
Here we see that the path of the added file is the same as the removed file but it now contains a deletion vector. Add and remove entries can now optionally include a deletionVector field containing information about the DVs associated with a file.
In versions that support DVs, files are uniquely identified by both the file path and the unique deletion vector id which defaults to NULL when DVs are not used.
How is this id defined? What information is provided by the new DV structure?
It mostly depends on the storage type. In our case, the storage type is "u". For this storage type, pathOrInlineDv is <random prefix (optional)><base85-encoded UUID>, used for DVs stored in a path relative to the Delta table path. The DV file name can be derived from the UUID.
In our example, we can see the deletion vector stored alongside the table files. Be aware that if the table is partitioned, DV files are not kept in the partition directories but in the Delta table root directory, since a DV can span multiple partitions.
The storage type can take other values, such as "i", where the vector is inlined and pathOrInlineDv is <base85-encoded bytes>, and "p", where the file is stored at the absolute path given by pathOrInlineDv.
offset is an optional field that states the position where the DV data starts in storage types backed by files. It is not present when the DV is inlined (storage type "i").
sizeInBytes is the size of the DV in bytes.
cardinality is the number of rows that are logically deleted by the DV.
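Putting the fields together, the deletionVector structure inside an add entry looks roughly like this (the offset value is illustrative; sizeInBytes and cardinality match the single-row delete above):

```json
"deletionVector": {
  "storageType": "u",
  "pathOrInlineDv": "<random prefix (optional)><base85-encoded UUID>",
  "offset": 1,
  "sizeInBytes": 34,
  "cardinality": 1
}
```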
With DVs enabled, the DML commands work differently to make use of the new information.
Update with no existing DVs
With DVs enabled, the update command logic is as follows:
- Scan all the files that satisfy the predicate and need to be updated (File 2, File 4)
- Write DVs for the rows that need to be invalidated (DV 2.1, DV 4.1)
- Write a new file with the updated rows (File 5)
- Add (File 5, NULL) to the transaction log, remove (File 2, NULL) and (File 4, NULL), and add (File 2, DV 2.1) and (File 4, DV 4.1) files
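The steps above can be sketched at the log level by modeling the table state as a set of (file, DV id) pairs; the file and DV names follow the steps and are illustrative.

```python
# Log-level effect of an update with DVs enabled: matched files keep
# their data but gain a DV; updated rows land in one new file.

state = {("File 1", None), ("File 2", None), ("File 3", None), ("File 4", None)}

state -= {("File 2", None), ("File 4", None)}          # remove old entries
state |= {("File 2", "DV 2.1"), ("File 4", "DV 4.1")}  # re-add with DVs
state |= {("File 5", None)}                            # new file with updated rows

print(sorted(state, key=str))
```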
Note: At the time of writing, DVs are supported only in the DELETE command, not in UPDATE and MERGE. However, support for those is planned.
Reading tables with DVs
When files have associated DVs, the scan implicitly reads the DV and filters out the rows it marks. Without DVs we read a single file to get a set of results; with DVs we have to read both the file and its DV to find the correct set of results, which may impact read performance.
Delete with existing DVs
In this case, we will execute a delete that will affect a file that already has an associated DV file. Since we are only deleting we won’t need to write any new files with updated rows but only update the existing DV.
- Scan (File 2, DV 2.1) to find the latest set of rows
- Write a new DV containing the rows deleted by the old DV plus the newly deleted one
- Remove (File 2, DV 2.1) from the transaction log and add (File 2, DV 2.2)
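The new DV is simply the union of the old DV with the newly deleted row positions; a minimal sketch (row positions are illustrative, and a set again stands in for the bitmap):

```python
# Deleting from a file that already has a DV merges the old DV with
# the new deletions instead of rewriting the file.
old_dv = {7, 19}       # positions already deleted by DV 2.1
new_deletes = {42}     # position removed by the new DELETE
dv_2_2 = old_dv | new_deletes  # DV 2.2 supersedes DV 2.1

print(sorted(dv_2_2))  # [7, 19, 42]
```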
With constant updates, DVs are repeatedly replaced, leaving many stale files behind. VACUUM will also target DVs for deletion, just like regular files.
Before DVs, OPTIMIZE used the minFileSize property to choose which files should be selected for compaction. Since some files might have a large share of their rows invalidated, it makes sense to also select those that fall beyond a certain threshold. The maxDeletedRowsRatio property defines the maximum allowed ratio of deleted rows for a file to be selected for compaction and have its DV removed.
This OPTIMIZE strategy is not idempotent: each time it runs, it tries to generate a new set of organized files, which means that no DVs result from the operation and all previous DVs and files are marked for deletion.
The theory is all covered but let’s see the improvements in practice by running a simple DELETE query.
In order to simulate a real-life scenario of a write-intensive application with an average target file size, I compacted the dataset into a single file of approximately 236 MB and 10 million records, and compared the performance of the query in both scenarios, with and without DVs enabled.
The first thing to notice on the left is the number of bytes added, which equals the original file minus the row that was just deleted. Two files were added, totaling 9,999,999 rows, and the entire operation took 27.1s, which includes scanning the files and rewriting the updated ones.
On the right, we can see the DVs in action. We have 0 bytes added, as there are 0 new files. A new DV with one row (34 bytes) was added to an existing file that still contains 9,999,999 valid rows. Overall, the operation took 2.7s, roughly a tenfold write improvement over the previous approach! By writing a DV that took 117ms to write, we avoided rewriting an entire file just to delete a single row from a 10-million-row dataset.
In this post, we've seen how well DVs serve write-intensive use cases with many small updates or deletes. While read times might be impacted by having to read extra files, that cost can be largely amortized by the optimizations that happen either as part of an auto-compaction job or in scheduled OPTIMIZE jobs. All in all, it is a trade-off between write and read performance, so analyzing the kind of workload you are running is key before choosing to enable this feature.