Python Dependency Management: Which Tool Should You Choose? | by Khuyen Tran | Jun, 2023


An in-depth comparison between Poetry, Pip, and Conda


Originally published at https://mathdatasimplified.com on June 13, 2023.

As your data science project expands, the number of dependencies also increases. To keep the project’s environment reproducible and maintainable, it’s important to use an efficient dependency management tool.

Thus, I decided to compare three popular tools for dependency management: Pip, Conda, and Poetry. After careful evaluation, I’m convinced that Poetry surpasses the other two options in terms of effectiveness and performance.

In this article, we will delve into the advantages of Poetry and highlight its key distinctions from Pip and Conda.

Having a broad selection of packages makes it easier for developers to find the specific package and version that best suits their needs.

Conda

Some packages, like “snscrape,” cannot be installed with conda. Additionally, certain versions, such as Pandas 2.0, might not be available for installation through Conda.

While you can use pip inside a conda virtual environment to address package limitations, conda cannot track dependencies installed with pip, making dependency management challenging.

$ conda list
# packages in environment at /Users/khuyentran/miniconda3/envs/test-conda:
#
# Name Version Build Channel

Pip

Pip can install any packages from the Python Package Index (PyPI) and other repositories.

Poetry

Poetry also allows the installation of packages from the Python Package Index (PyPI) and other repositories.

Reducing the number of dependencies in an environment simplifies the development process.

Conda

Conda provides full environment isolation, managing both Python packages and system-level dependencies. This can result in larger package sizes compared to other package managers, potentially consuming more storage space during installation and distribution.

$ conda install pandas

$ conda list

# packages in environment at /Users/khuyentran/miniconda3/envs/test-conda:
#
# Name Version Build Channel
blas 1.0 openblas
bottleneck 1.3.5 py311ha0d4635_0
bzip2 1.0.8 h620ffc9_4
ca-certificates 2023.05.30 hca03da5_0
libcxx 14.0.6 h848a8c0_0
libffi 3.4.4 hca03da5_0
libgfortran 5.0.0 11_3_0_hca03da5_28
libgfortran5 11.3.0 h009349e_28
libopenblas 0.3.21 h269037a_0
llvm-openmp 14.0.6 hc6e5704_0
ncurses 6.4 h313beb8_0
numexpr 2.8.4 py311h6dc990b_1
numpy 1.24.3 py311hb57d4eb_0
numpy-base 1.24.3 py311h1d85a46_0
openssl 3.0.8 h1a28f6b_0
pandas 1.5.3 py311h6956b77_0
pip 23.0.1 py311hca03da5_0
python 3.11.3 hb885b13_1
python-dateutil 2.8.2 pyhd3eb1b0_0
pytz 2022.7 py311hca03da5_0
readline 8.2 h1a28f6b_0
setuptools 67.8.0 py311hca03da5_0
six 1.16.0 pyhd3eb1b0_1
sqlite 3.41.2 h80987f9_0
tk 8.6.12 hb8d0fd4_0
tzdata 2023c h04d1e81_0
wheel 0.38.4 py311hca03da5_0
xz 5.4.2 h80987f9_0
zlib 1.2.13 h5a0b063_0

Pip

Pip installs only the dependencies required by a package.

$ pip install pandas

$ pip list

Package Version
--------------- -------
numpy 1.24.3
pandas 2.0.2
pip 22.3.1
python-dateutil 2.8.2
pytz 2023.3
setuptools 65.5.0
six 1.16.0
tzdata 2023.3

Poetry

Poetry also installs only the dependencies required by a package.

$ poetry add pandas

$ poetry show

numpy 1.24.3 Fundamental package for array computing in Python
pandas 2.0.2 Powerful data structures for data analysis, time...
python-dateutil 2.8.2 Extensions to the standard Python datetime module
pytz 2023.3 World timezone definitions, modern and historical
six 1.16.0 Python 2 and 3 compatibility utilities
tzdata 2023.3 Provider of IANA time zone data
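To compare how many packages each tool leaves behind, you can also count the installed distributions from inside Python. The following is a minimal sketch that uses the standard-library importlib.metadata module; the helper name installed_packages is my own and is not part of Pip, Conda, or Poetry.

```python
from importlib.metadata import distributions


def installed_packages():
    """Return sorted (name, version) pairs for the active environment."""
    found = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that lack a name
            found.add((name.lower(), dist.metadata.get("Version", "")))
    return sorted(found)


packages = installed_packages()
for name, version in packages:
    print(f"{name} {version}")
print(f"{len(packages)} packages installed")
```

Running this inside each environment gives a quick, tool-independent count. Note that it only sees Python distributions, so the system-level libraries that Conda installs (such as openssl or zlib) will not appear.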

Uninstalling packages and their dependencies frees up disk space, prevents unnecessary clutter, and optimizes storage resource usage.

Pip

Pip removes only the specified package, not its dependencies, potentially leading to the accumulation of unused dependencies over time. This can result in increased storage space usage and potential conflicts.

$ pip install pandas

$ pip uninstall pandas

$ pip list

Package Version
--------------- -------
numpy 1.24.3
pip 22.0.4
python-dateutil 2.8.2
pytz 2023.3
setuptools 56.0.0
six 1.16.0
tzdata 2023.3
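The leftover packages above can be found with a simple reachability check. This is only a sketch: the function name find_orphans is hypothetical, and the dependency edges are typed in by hand from the listings in this article rather than read from the environment.

```python
def find_orphans(installed, requires, wanted):
    """Return installed packages that no wanted package needs, directly or transitively."""
    needed = set()
    stack = list(wanted)
    while stack:
        pkg = stack.pop()
        if pkg in needed:
            continue
        needed.add(pkg)
        stack.extend(requires.get(pkg, []))
    return sorted(set(installed) - needed)


# Hand-written dependency edges (pip and setuptools are left out).
requires = {
    "pandas": ["numpy", "python-dateutil", "pytz", "tzdata"],
    "python-dateutil": ["six"],
}
installed = ["numpy", "python-dateutil", "pytz", "six", "tzdata"]

# pandas was uninstalled, so nothing is explicitly wanted any more.
print(find_orphans(installed, requires, wanted=[]))
# ['numpy', 'python-dateutil', 'pytz', 'six', 'tzdata']
```

Every package that pandas pulled in is now an orphan, which is exactly what the pip list output shows.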

Conda

Conda removes the package and its dependencies.

$ conda install -c conda pandas

$ conda uninstall -c conda pandas

Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

environment location: /Users/khuyentran/miniconda3/envs/test-conda

removed specs:
- pandas

The following packages will be REMOVED:

blas-1.0-openblas
bottleneck-1.3.5-py311ha0d4635_0
libcxx-14.0.6-h848a8c0_0
libgfortran-5.0.0-11_3_0_hca03da5_28
libgfortran5-11.3.0-h009349e_28
libopenblas-0.3.21-h269037a_0
llvm-openmp-14.0.6-hc6e5704_0
numexpr-2.8.4-py311h6dc990b_1
numpy-1.24.3-py311hb57d4eb_0
numpy-base-1.24.3-py311h1d85a46_0
pandas-1.5.3-py311h6956b77_0
python-dateutil-2.8.2-pyhd3eb1b0_0
pytz-2022.7-py311hca03da5_0
six-1.16.0-pyhd3eb1b0_1

Proceed ([y]/n)?

Preparing transaction: done
Verifying transaction: done
Executing transaction: done

Poetry

Poetry also removes the package and its dependencies.

$ poetry add pandas

$ poetry remove pandas

• Removing numpy (1.24.3)
• Removing pandas (2.0.2)
• Removing python-dateutil (2.8.2)
• Removing pytz (2023.3)
• Removing six (1.16.0)
• Removing tzdata (2023.3)

Dependency files ensure the reproducibility of a software project’s environment by specifying the exact versions or version ranges of required packages.

This helps recreate the same environment across different systems or at different points in time, so that all developers work with the same set of dependencies.

Conda

To save dependencies in a Conda environment, you need to manually write them to a file. Version ranges specified in an environment.yml file can result in different versions being installed, potentially introducing compatibility issues when reproducing the environment.

Let’s assume that we have installed pandas version 1.5.3. Here is an environment.yml file that specifies the dependencies:

# environment.yml
name: test-conda
channels:
- defaults
dependencies:
- python=3.8
- pandas>=1.5

If a new user tries to reproduce the environment when the latest version of pandas is 2.0, pandas 2.0 will be installed instead.

# Create and activate a virtual environment
$ conda env create -n env
$ conda activate env

# List packages in the current environment
$ conda list
...
pandas 2.0

If the codebase relies on syntax or behavior specific to pandas version 1.5.3 and the syntax has changed in version 2.0, running the code with pandas 2.0 could introduce bugs.
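To see why an open-ended range drifts over time, here is a rough sketch of how a resolver might pick the newest version that satisfies ">=1.5". The helper names and the two version lists are assumptions for illustration; real resolvers handle far more version formats than plain dotted numbers.

```python
def parse(version):
    """Turn '1.5.3' into (1, 5, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def newest_at_least(minimum, available):
    """Return the newest available version that satisfies '>=minimum'."""
    candidates = [v for v in available if parse(v) >= parse(minimum)]
    return max(candidates, key=parse) if candidates else None


# Hypothetical snapshots of the package index at two points in time.
before_2_0 = ["1.4.4", "1.5.0", "1.5.3"]
after_2_0 = before_2_0 + ["2.0.0"]

print(newest_at_least("1.5", before_2_0))  # 1.5.3
print(newest_at_least("1.5", after_2_0))   # 2.0.0
```

The specification never changed, yet the result did, because the set of available versions grew.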

Pip

The same problem can occur with pip.

# requirements.txt
pandas>=1.5

# Create and activate a virtual environment
$ python3 -m venv venv
$ source venv/bin/activate

# Install dependencies
$ pip install -r requirements.txt

# List packages
$ pip list
Package Version
---------- -------
pandas 2.0
...

You can pin the exact versions by freezing them in a requirements.txt file:

$ pip freeze > requirements.txt

# requirements.txt
numpy==1.24.3
pandas==1.5.3
python-dateutil==2.8.2
pytz==2023.3
six==1.16.0

However, this makes the code environment less flexible and potentially harder to maintain in the long run. Any changes to the dependencies would require manual modifications to the requirements.txt file, which can be time-consuming and error-prone.

Poetry

Poetry automatically updates the pyproject.toml file when installing a package.

In the following example, the “pandas” package is added with the version constraint ^1.5, which allows any release from 1.5.0 up to, but not including, 2.0.0. This range lets your project pick up newer compatible releases without manual adjustments.

$ poetry add 'pandas=^1.5'

# pyproject.toml
[tool.poetry.dependencies]
python = "^3.8"
pandas = "^1.5"
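The caret constraint can be read as a pair of bounds. Below is a minimal sketch of that rule as I understand it from Poetry's documentation (bump the leftmost non-zero component to get the upper bound). The function names are my own, and the code assumes at most three purely numeric components.

```python
def caret_bounds(spec):
    """Return (lower, upper) tuples for a caret constraint such as '^1.5'."""
    parts = [int(p) for p in spec.lstrip("^").split(".")]
    lower = tuple(parts + [0] * (3 - len(parts)))
    # Bump the leftmost non-zero component; if all are zero, bump the last one given.
    for i, value in enumerate(parts):
        if value != 0 or i == len(parts) - 1:
            upper = tuple(parts[:i] + [value + 1] + [0] * (2 - i))
            break
    return lower, upper


def satisfies(version, spec):
    """Check whether a dotted version falls inside the caret range."""
    lower, upper = caret_bounds(spec)
    current = tuple(int(p) for p in version.split("."))
    return lower <= current < upper


print(caret_bounds("^1.5"))        # ((1, 5, 0), (2, 0, 0))
print(satisfies("1.5.3", "^1.5"))  # True
print(satisfies("2.0.2", "^1.5"))  # False
```

So ^1.5 accepts pandas 1.5.3 but rejects any 2.x release, which is why a breaking major version is not picked up automatically.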

The poetry.lock file stores the precise version numbers for each package and its dependencies.

# poetry.lock
...
[[package]]
name = "pandas"
version = "1.5.3"
description = "Powerful data structures for data analysis, time series, and statistics"
category = "main"
optional = false
python-versions = ">=3.8"

[package.dependencies]
numpy = [
{version = ">=1.20.3", markers = "python_version < \"3.10\""},
{version = ">=1.21.0", markers = "python_version >= \"3.10\""},
{version = ">=1.23.2", markers = "python_version >= \"3.11\""},
]
python-dateutil = ">=2.8.2"
pytz = ">=2020.1"
tzdata = ">=2022.1"
...

This guarantees that the same package versions are installed everywhere, even if a package has a version range specified in the pyproject.toml file. Here, we can see that pandas 1.5.3 is installed instead of pandas 2.0.

$ poetry install

$ poetry show pandas

name : pandas
version : 1.5.3
description : Powerful data structures for data analysis, time series, and statistics

dependencies
- numpy >=1.20.3
- numpy >=1.21.0
- numpy >=1.23.2
- python-dateutil >=2.8.1
- pytz >=2020.1

Separating dependencies lets you clearly distinguish the packages required only for development, such as testing frameworks and code quality tools, from the packages needed in production, which typically include only the core dependencies.

Conda

Conda doesn’t inherently support separate dependencies for different environments, but a workaround involves creating two environment files: one for the development environment and one for production. The development file contains both production and development dependencies.

# environment.yml
name: test-conda
channels:
- defaults
dependencies:
# Production packages
- numpy
- pandas

# environment-dev.yml
name: test-conda-dev
channels:
- defaults
dependencies:
# Production packages
- numpy
- pandas
# Development packages
- pytest
- pre-commit

Pip

Pip also doesn’t directly support separate dependencies, but a similar approach can be used with separate requirement files.

# requirements.txt
numpy
pandas

# requirements-dev.txt
-r requirements.txt
pytest
pre-commit

# Install prod
$ pip install -r requirements.txt

# Install both dev and prod
$ pip install -r requirements-dev.txt
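To make the "-r requirements.txt" include concrete, here is a simplified sketch of how such a file could be expanded into a flat list. The function read_requirements is hypothetical and much cruder than pip's own parser: it treats everything after "#" as a comment and only understands the "-r" option.

```python
import tempfile
from pathlib import Path


def read_requirements(path):
    """Return requirement lines from a file, following '-r other.txt' includes."""
    path = Path(path)
    requirements = []
    for raw in path.read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if line.startswith("-r "):
            # Resolve the included file relative to the including file.
            included = path.parent / line[3:].strip()
            requirements.extend(read_requirements(included))
        else:
            requirements.append(line)
    return requirements


# Recreate the two files from this section in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "requirements.txt").write_text("numpy\npandas\n")
    Path(tmp, "requirements-dev.txt").write_text(
        "-r requirements.txt\npytest\npre-commit\n"
    )
    print(read_requirements(Path(tmp, "requirements-dev.txt")))
    # ['numpy', 'pandas', 'pytest', 'pre-commit']
```

The development file ends up as a superset of the production file, which is the whole point of the include.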

Poetry

Poetry simplifies managing dependencies by supporting groups within one file. This allows you to keep track of all dependencies in a single place.

$ poetry add numpy pandas
$ poetry add --group dev pytest pre-commit

# pyproject.toml
[tool.poetry.dependencies]
python = "^3.8"
pandas = "^2.0"
numpy = "^1.24.3"

[tool.poetry.group.dev.dependencies]
pytest = "^7.3.2"
pre-commit = "^3.3.2"

To install only production dependencies:

$ poetry install --only main

To install both development and production dependencies:

$ poetry install

Updating dependencies is essential to benefit from bug fixes, performance improvements, and new features introduced in newer package versions.

Conda

Conda allows you to update only a specified package.

$ conda install -c conda pandas
$ conda install -c anaconda scikit-learn
# New versions available
$ conda update pandas
$ conda update scikit-learn

Afterward, you need to manually update the environment.yml file to keep it in sync with the updated dependencies.

$ conda env export > environment.yml

Pip

Pip also only allows you to update a specified package and requires you to manually update the requirements.txt file.

$ pip install -U pandas
$ pip freeze > requirements.txt

Poetry

Using Poetry, you can use the update command to upgrade all packages specified in the pyproject.toml file. This action automatically updates the poetry.lock file, ensuring consistency between the package specifications and the lock file.

$ poetry add pandas scikit-learn

# New versions available
$ poetry update

Updating dependencies
Resolving dependencies... (0.3s)

Writing lock file

Package operations: 0 installs, 2 updates, 0 removals

• Updating pandas (2.0.0 -> 2.0.2)
• Updating scikit-learn (1.2.0 -> 1.2.2)
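The "Package operations" summary above is essentially a diff between the old and the new set of pinned versions. The sketch below reproduces that idea with plain dictionaries; plan_operations is a hypothetical helper, not Poetry's implementation, and the snapshots list only the two top-level packages.

```python
def plan_operations(old, new):
    """Compare two {name: version} snapshots and classify the changes."""
    installs = sorted(name for name in new if name not in old)
    removals = sorted(name for name in old if name not in new)
    updates = sorted(
        (name, old[name], new[name])
        for name in new
        if name in old and old[name] != new[name]
    )
    return installs, updates, removals


old_lock = {"pandas": "2.0.0", "scikit-learn": "1.2.0"}
new_lock = {"pandas": "2.0.2", "scikit-learn": "1.2.2"}

installs, updates, removals = plan_operations(old_lock, new_lock)
print(f"{len(installs)} installs, {len(updates)} updates, {len(removals)} removals")
for name, before, after in updates:
    print(f"Updating {name} ({before} -> {after})")
```

Because the lock file is rewritten in the same step, the recorded versions and the installed versions cannot drift apart.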

Dependency conflicts occur when packages or libraries required by a project have conflicting versions or incompatible dependencies. Properly resolving conflicts is crucial to avoid errors, runtime issues, or project failures.

Pip

Pip installs packages sequentially, which means it installs each package one by one, following the specified order. This sequential approach can sometimes lead to conflicts when packages have incompatible dependencies or version requirements.

For example, suppose you install pandas==2.0.2 first, which requires numpy>=1.20.3. Later, you install numpy==1.20.2 using pip. Even though this creates a dependency conflict, pip still proceeds to downgrade numpy.

$ pip install pandas==2.0.2

$ pip install numpy==1.20.2
Collecting numpy==1.20.2
Attempting uninstall: numpy
Found existing installation: numpy 1.24.3
Uninstalling numpy-1.24.3:
Successfully uninstalled numpy-1.24.3
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pandas 2.0.2 requires numpy>=1.20.3; python_version < "3.10", but you have numpy 1.20.2 which is incompatible.
Successfully installed numpy-1.20.2
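The conflict reported above can be detected after the fact by checking every installed package's requirements against what is actually installed, which is roughly what the pip check command does. The following is a simplified sketch with hand-written requirement data; find_conflicts and the OPS table are my own names, and only three comparison operators are supported.

```python
import operator

OPS = {">=": operator.ge, "<": operator.lt, "==": operator.eq}


def parse(version):
    """Turn '1.20.3' into (1, 20, 3) so versions compare numerically."""
    return tuple(int(p) for p in version.split("."))


def find_conflicts(installed, requirements):
    """List (package, dependency, rule) triples that the installed set violates."""
    conflicts = []
    for package, rules in requirements.items():
        if package not in installed:
            continue
        for dependency, op, bound in rules:
            have = installed.get(dependency)
            if have is None or not OPS[op](parse(have), parse(bound)):
                conflicts.append((package, dependency, f"{op}{bound}"))
    return conflicts


# Requirement taken from the error message above.
requirements = {"pandas": [("numpy", ">=", "1.20.3")]}

print(find_conflicts({"pandas": "2.0.2", "numpy": "1.24.3"}, requirements))
# []
print(find_conflicts({"pandas": "2.0.2", "numpy": "1.20.2"}, requirements))
# [('pandas', 'numpy', '>=1.20.3')]
```

The key difference from Conda and Poetry is timing: this check runs after the environment is already broken, not before the install is allowed.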

Conda

Conda uses a SAT solver to explore all combinations of package versions and dependencies to find a compatible set.

For instance, if an existing package has a specific constraint for its dependency (e.g., statsmodels==0.13.2 requires numpy>=1.21.2,<2.0a0), and the package you want to install doesn’t meet that requirement (e.g., numpy<1.21.2), conda won’t immediately raise an error. Instead, it will diligently search for compatible versions of all the required packages and their dependencies, only reporting an error if no suitable solution is found.

$ conda install 'statsmodels==0.13.2'

$ conda search 'statsmodels==0.13.2' --info
dependencies:
- numpy >=1.21.2,<2.0a0
- packaging >=21.3
- pandas >=1.0
- patsy >=0.5.2
- python >=3.9,<3.10.0a0
- scipy >=1.3

$ conda install 'numpy<1.21.2'

...
Package ca-certificates conflicts for:
python=3.8 -> openssl[version='>=1.1.1t,<1.1.2a'] -> ca-certificates
openssl -> ca-certificates
ca-certificates
cryptography -> openssl[version='>1.1.0,<3.1.0'] -> ca-certificates

Package idna conflicts for:
requests -> urllib3[version='>=1.21.1,<1.27'] -> idna[version='>=2.0.0']
requests -> idna[version='>=2.5,<3|>=2.5,<4']
idna
pooch -> requests -> idna[version='>=2.5,<3|>=2.5,<4']
urllib3 -> idna[version='>=2.0.0']

Package numexpr conflicts for:
statsmodels==0.13.2 -> pandas[version='>=1.0'] -> numexpr[version='>=2.7.0|>=2.7.1|>=2.7.3']
numexpr
pandas==1.5.3 -> numexpr[version='>=2.7.3']

Package patsy conflicts for:
statsmodels==0.13.2 -> patsy[version='>=0.5.2']
patsy

Package chardet conflicts for:
requests -> chardet[version='>=3.0.2,<4|>=3.0.2,<5']
pooch -> requests -> chardet[version='>=3.0.2,<4|>=3.0.2,<5']

Package python-dateutil conflicts for:
statsmodels==0.13.2 -> pandas[version='>=1.0'] -> python-dateutil[version='>=2.7.3|>=2.8.1']
python-dateutil
pandas==1.5.3 -> python-dateutil[version='>=2.8.1']

Package setuptools conflicts for:
numexpr -> setuptools
pip -> setuptools
wheel -> setuptools
setuptools
python=3.8 -> pip -> setuptools
pandas==1.5.3 -> numexpr[version='>=2.7.3'] -> setuptools

Package brotlipy conflicts for:
urllib3 -> brotlipy[version='>=0.6.0']
brotlipy
requests -> urllib3[version='>=1.21.1,<1.27'] -> brotlipy[version='>=0.6.0']

Package pytz conflicts for:
pytz
pandas==1.5.3 -> pytz[version='>=2020.1']
statsmodels==0.13.2 -> pandas[version='>=1.0'] -> pytz[version='>=2017.3|>=2020.1']

While this approach enhances the chances of finding a resolution, it can be computationally intensive, particularly when dealing with extensive environments.
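To give a feel for why exhaustive search gets expensive, here is a toy brute-force resolver. It is not a SAT solver and not how Conda is implemented: it simply tries every combination of candidate versions, which is the worst case a real solver works hard to avoid. The candidate lists and function names are assumptions for illustration.

```python
from itertools import product


def parse(version):
    """Turn '1.21.2' into (1, 21, 2) so versions compare numerically."""
    return tuple(int(p) for p in version.split("."))


def solve(candidates, constraints):
    """Try every combination of versions and return the first compatible one."""
    names = sorted(candidates)
    for combo in product(*(candidates[name] for name in names)):
        chosen = dict(zip(names, combo))
        if all(check(chosen) for check in constraints):
            return chosen
    return None  # no compatible set exists


# Hypothetical candidate versions; a real channel offers many more.
candidates = {
    "numpy": ["1.24.3", "1.21.2", "1.20.3"],
    "statsmodels": ["0.13.2"],
}


def statsmodels_needs_numpy(chosen):
    # statsmodels 0.13.2 requires numpy >=1.21.2,<2.0 (see the conda search output).
    return parse("1.21.2") <= parse(chosen["numpy"]) < parse("2.0")


def user_wants_old_numpy(chosen):
    # The request from `conda install 'numpy<1.21.2'`.
    return parse(chosen["numpy"]) < parse("1.21.2")


print(solve(candidates, [statsmodels_needs_numpy]))
# {'numpy': '1.24.3', 'statsmodels': '0.13.2'}
print(solve(candidates, [statsmodels_needs_numpy, user_wants_old_numpy]))
# None
```

With two packages this is instant, but the number of combinations is the product of all candidate list lengths, so it grows very quickly as the environment gets larger.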

Poetry

By focusing on the direct dependencies of the project, Poetry’s deterministic resolver narrows down the search space, making the resolution process more efficient. It evaluates the specified constraints, such as version ranges or specific versions, and immediately identifies any conflicts.

$ poetry add 'seaborn==0.12.2'
$ poetry add 'matplotlib<3.1'

Because poetry shell depends on seaborn (0.12.2) which depends on matplotlib (>=3.1,<3.6.1 || >3.6.1), matplotlib is required.
So, because poetry shell depends on matplotlib (<3.1), version solving failed.

This immediate feedback helps prevent potential issues from escalating and allows developers to address the problem early in the development process. For example, in the following code, we can relax the requirements for seaborn to enable the installation of a specific version of matplotlib:

$ poetry add 'seaborn<=0.12.2' 'matplotlib<3.1'

Package operations: 1 install, 2 updates, 4 removals

• Removing contourpy (1.0.7)
• Removing fonttools (4.40.0)
• Removing packaging (23.1)
• Removing pillow (9.5.0)
• Updating matplotlib (3.7.1 -> 3.0.3)
• Installing scipy (1.9.3)
• Updating seaborn (0.12.2 -> 0.11.2)

In summary, Poetry provides several advantages over pip and conda:

  1. Broad Package Selection: Poetry provides access to a wide range of packages available on PyPI, allowing you to leverage a diverse ecosystem for your project.
  2. Efficient Dependency Management: Poetry installs only the necessary dependencies for a specified package, reducing the number of extraneous packages in your environment.
  3. Streamlined Package Removal: Poetry simplifies the removal of packages and their associated dependencies, making it easy to maintain a clean and efficient project environment.
  4. Dependency Resolution: Poetry’s deterministic resolver efficiently resolves dependencies, identifying and addressing any inconsistencies or conflicts promptly.

Although your teammates may need some time to learn and adapt to Poetry, using it can save you time and effort in the long run.


