.dvc files use the YAML 1.2 file format, which is a human-friendly data serialization format for all programming languages.
As I mentioned above, DVC creates one lightweight .dvc file for each file or folder it tracks.
If you peek inside images.dvc, you will see the following entries:
The most interesting part is md5. MD5 is a popular hashing function. It takes a file of arbitrary size and uses its contents to produce a fixed-length string (32 hexadecimal characters for MD5).
These characters may look random, but the function is deterministic: hashing the same file always produces the same string, no matter how many times you rerun it. Change even a single bit in the file, however, and the resulting hash will be completely different.
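To see both properties in action, here is a minimal sketch using Python's standard hashlib module. The helper name md5_hex and the sample bytes are made up for illustration; this shows how MD5 behaves in general, not how DVC calls it internally.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical stand-in for the contents of a file.
original = b"pretend these bytes are an image file"

# Flip the lowest bit of the first byte to simulate a one-bit change.
modified = bytes([original[0] ^ 0b00000001]) + original[1:]

print(len(md5_hex(original)))                  # 32
print(md5_hex(original) == md5_hex(original))  # True: same input, same hash
print(md5_hex(original) == md5_hex(modified))  # False: one bit changed
```

Printing the two digests side by side makes the point visually: they share no obvious pattern even though the inputs differ by a single bit.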
DVC uses these hashes (also called checksums) to tell whether two files are identical or different. Combined with the path recorded in the .dvc file, a changed hash signals a new version of the same file.
For example, if I add a new fake image to the images folder and rerun dvc add on it, the resulting MD5 hash inside images.dvc will be different:
As mentioned earlier, you should track all .dvc files with Git so that modifications to large assets become a part of your Git commits and history.
$ git add images.dvc
Find out more about how
.dvc files work from this page of the DVC user guide.
3. DVC cache
When you call
dvc add on a large asset, it gets copied into a special directory called DVC cache, located under
The cache is where DVC keeps a pristine record of your data and models at different versions. The .dvc files in your working directory may point to the latest or some other version of the large assets, but the cache holds every state the assets have been in since you started tracking them with DVC.
For example, let’s say you added a 1 GB data.csv file to DVC. By default, the file will be both in your workspace and inside the .dvc/cache folder, taking up twice as much space (2 GB).
Any subsequent change tracked with dvc add data.csv will create a new version of data.csv with a new hash inside .dvc/cache, taking up another gigabyte of disk space.
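The idea of storing each version under its hash can be sketched in a few lines of Python. This is a toy model, not DVC's implementation: the function toy_add, the toy_cache folder, and its flat layout are all assumptions made for illustration, and a tiny CSV stands in for the 1 GB file.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def toy_add(workspace_file: Path, cache_dir: Path) -> str:
    """Copy a file into cache_dir, named by its MD5 hash, and return the hash."""
    digest = hashlib.md5(workspace_file.read_bytes()).hexdigest()
    cached = cache_dir / digest
    if not cached.exists():  # identical content is stored only once
        shutil.copyfile(workspace_file, cached)
    return digest


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    cache = workspace / "toy_cache"  # hypothetical stand-in for the real cache
    cache.mkdir()
    data = workspace / "data.csv"

    data.write_text("id,value\n1,10\n")
    first = toy_add(data, cache)

    data.write_text("id,value\n1,10\n2,20\n")  # edit the file, then add again
    second = toy_add(data, cache)

    cache_files = sorted(p.name for p in cache.iterdir())

print(first != second)   # True: the edit produced a new hash
print(len(cache_files))  # 2: both full versions now live in the cache
```

Each call stores a complete copy of the file, which is why every tracked change to a single large file costs its full size again.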
So you might already be asking: isn’t this highly inefficient? The answer is yes, at least for single files, but we will see ways to mitigate this problem in the next section.
Folders work a bit differently.
When you track different versions of a folder with dvc add dirname, DVC is smart enough to detect which files changed within that directory. This means that unless you update every single file in the directory, DVC will cache new copies of only the changed files, which won’t take up much space.
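Here is a rough sketch of why hashing each file separately gives you this behavior almost for free. Again, this is a toy model under stated assumptions: toy_add_dir is a hypothetical helper, and a plain dictionary stands in for the cache directory.

```python
import hashlib
import tempfile
from pathlib import Path


def toy_add_dir(directory: Path, cache: dict[str, bytes]) -> int:
    """Cache every file in `directory` by MD5; return how many entries were new."""
    new_entries = 0
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        content = path.read_bytes()
        digest = hashlib.md5(content).hexdigest()
        if digest not in cache:  # unchanged files map to an existing key
            cache[digest] = content
            new_entries += 1
    return new_entries


with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "images"
    images.mkdir()
    for i in range(5):
        (images / f"img_{i}.bin").write_bytes(f"fake image {i}".encode())

    cache: dict[str, bytes] = {}  # in-memory stand-in for a cache folder
    first_run = toy_add_dir(images, cache)   # all five files are new

    (images / "img_2.bin").write_bytes(b"fake image 2, edited")
    second_run = toy_add_dir(images, cache)  # only the edited file is new

print(first_run, second_run, len(cache))  # 5 1 6
```

The second run adds a single entry, so the cost of a new folder version is proportional to what changed rather than to the size of the whole folder.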
In summary, think of DVC cache as a counterpart to Git’s staging area.
Learn more about the cache and internal DVC files from this user guide section.