Data Version Control For the Modern Data Scientist: 7 DVC Concepts You Can’t Ignore


2. .dvc files

.dvc files use the YAML 1.2 file format, which is a human-friendly data serialization format for all programming languages.

As I mentioned above, DVC creates one lightweight .dvc file for each file or folder tracked with DVC.

When you take a peek inside the contents of images.dvc, you will see the following entries:

Image by me

The most interesting part is md5. MD5 is a popular hashing function. It takes a file of arbitrary size and uses its contents to produce a fixed-length string: 32 hexadecimal characters, which encode a 128-bit digest.

These characters can seem random, but they will always be the same no matter how many times you rerun the hashing function on the same file. Change even a single bit in the file, though, and the resulting hash will be completely different.
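If you want to check these two properties yourself, here is a minimal sketch using Python's built-in hashlib module. The md5_hex helper and the byte strings are made up for illustration; this is not how DVC calls the hash internally.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# A made-up stand-in for the contents of an image file.
original = b"a tiny stand-in for an image file"

# Flip the lowest bit of the first byte to simulate a one-bit change.
modified = bytes([original[0] ^ 0b00000001]) + original[1:]

print(md5_hex(original))                        # identical on every run
print(md5_hex(original) == md5_hex(original))   # True
print(md5_hex(modified))                        # a different digest
print(len(md5_hex(original)))                   # 32
```

Rerunning the script prints the same first digest every time, while the one-bit change produces a digest with no visible relation to it.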

DVC uses these hashes (also called checksums) to tell whether two files have identical contents. Combined with the path recorded in the .dvc file, a changed hash tells DVC that a tracked file or folder has moved to a new version.

For example, if I add a new fake image to the images folder, the resulting MD5 hash inside images.dvc will be different:

Image by me
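To get a feel for why one extra file changes the hash recorded for the whole folder, here is a simplified sketch. It assumes the folder hash is computed from a sorted listing of every file's relative path and MD5; DVC's real procedure may differ in its details, and the dir_hash function and the file names are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def dir_hash(root: Path) -> str:
    """Hash a folder via a sorted manifest of (relative path, file MD5).

    Simplified illustration only, not DVC's actual implementation.
    """
    manifest = [
        {
            "relpath": p.relative_to(root).as_posix(),
            "md5": hashlib.md5(p.read_bytes()).hexdigest(),
        }
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "images"
    images.mkdir()
    (images / "cat.png").write_bytes(b"fake cat pixels")
    before = dir_hash(images)

    # Add a new fake image, as in the example above.
    (images / "fake.png").write_bytes(b"fake image")
    after = dir_hash(images)

print(before != after)  # True: the manifest gained an entry
```

Because the manifest itself is hashed, adding, removing, or editing any file in the folder changes the folder-level hash.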

As mentioned earlier, you should track all .dvc files with Git so that modifications to large assets become a part of your Git commits and history.

$ git add images.dvc

Find out more about how .dvc files work from this page of the DVC user guide.

3. DVC cache

When you call dvc add on a large asset, it gets copied into a special directory called DVC cache, located under .dvc/cache.

The cache is the place where DVC keeps a pristine record of your data and models at different versions. The .dvc files in your working directory may point to the latest or some other version of the large assets, but the cache holds every state the assets have been in since you started tracking them with DVC.

For example, let’s say you added a 1 GB data.csv file to DVC. By default, the file will be both in your workspace and inside the .dvc/cache folder, taking up twice as much space (2 GB).

Image by me

Any subsequent changes tracked with dvc add data.csv will create a new version of data.csv with a new hash inside .dvc/cache, taking up another gigabyte of disk space.
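The storage cost is easier to see with a toy content-addressed cache. The sketch below is an illustration under my own assumptions, not DVC's code: the cache_file and cache_size helpers are hypothetical, the cache lives in a temporary folder, and the "two-character prefix folder plus the rest of the hash" layout is just one common way to organize such a store.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def cache_file(src: Path, cache_dir: Path) -> Path:
    """Copy `src` into `cache_dir` under a path derived from its MD5."""
    digest = hashlib.md5(src.read_bytes()).hexdigest()
    dest = cache_dir / digest[:2] / digest[2:]
    if not dest.exists():  # identical content is stored only once
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(src, dest)
    return dest


def cache_size(cache_dir: Path) -> int:
    """Total size in bytes of all files stored in the cache."""
    return sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    cache_dir = workspace / "cache"
    data = workspace / "data.csv"

    data.write_bytes(b"a,b\n1,2\n" * 1000)  # version 1: 8,000 bytes
    cache_file(data, cache_dir)
    size_v1 = cache_size(cache_dir)

    data.write_bytes(b"a,b\n1,3\n" * 1000)  # version 2: same size, new content
    cache_file(data, cache_dir)
    size_v2 = cache_size(cache_dir)

print(size_v1, size_v2)  # 8000 16000
```

Each version of the file gets its own entry, so the cache grows by the full file size every time the contents change.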

So, you might already be asking: isn’t this highly inefficient? And the answer is yes, at least for single files. We will see methods to mitigate this problem in the next section.

As for folders, it is a bit different.

When you track different versions of folders with dvc add dirname, DVC is smart enough to detect only the files that changed within that directory. This means that unless you update every single file in the directory, DVC will cache only the versions of the changed files, which won’t take up much space.
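Here is a rough sketch of that idea, again with a toy cache rather than DVC itself. The cache_folder function and the file names are hypothetical, and the sketch ignores any folder-level bookkeeping the real tool does. It only shows that files with unchanged contents map to hashes that are already stored.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_folder(folder: Path, cache_dir: Path) -> int:
    """Cache every file in `folder` by content hash.

    Returns the number of new cache entries that had to be written.
    """
    written = 0
    for path in sorted(folder.rglob("*")):
        if not path.is_file():
            continue
        content = path.read_bytes()
        digest = hashlib.md5(content).hexdigest()
        dest = cache_dir / digest[:2] / digest[2:]
        if dest.exists():
            continue  # identical content is already cached
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(content)
        written += 1
    return written


with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "images"
    cache_dir = Path(tmp) / "cache"
    images.mkdir()
    for i in range(5):
        (images / f"img_{i}.png").write_bytes(f"pixels {i}".encode())

    first_pass = cache_folder(images, cache_dir)   # all five files are new

    (images / "img_0.png").write_bytes(b"edited pixels")
    second_pass = cache_folder(images, cache_dir)  # only the edited file is new

print(first_pass, second_pass)  # 5 1
```

The second pass stores a single new entry, which is why versioning a large folder with a few changed files costs far less than versioning one large file.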

In summary, think of DVC cache as a counterpart to Git’s staging area.

Learn more about the cache and internal DVC files from this user guide section.


