Indexing

Packages uploaded to packagecloud are indexed and made available in the package manager's metadata by a set of background jobs. How far this process can be parallelized depends on the package type being indexed: certain package metadata formats cannot be generated concurrently because of technical restrictions of the format.

To adjust the number of background workers that handle indexing, edit your /etc/packagecloud/packagecloud.rb file and set the resque['index_worker_count'] option to the desired number of index worker processes. Read the following sections to understand how parallelization affects the different package types.
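As a minimal sketch, the setting might look like this in the configuration file. The value 4 is an arbitrary placeholder for illustration, not a recommendation; choose a number based on the package types you host and the sections that follow.

```ruby
# /etc/packagecloud/packagecloud.rb
# Number of background worker processes that handle index jobs.
# The value 4 is a placeholder chosen for illustration only.
resque['index_worker_count'] = 4
```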

Fast indexing requires:

  • Sufficient worker processes for the package type (see below)
  • Enough I/O capacity on your database
  • Sufficient CPU time for reindex jobs to execute and generate metadata
  • A good network connection so that metadata which has been generated can be quickly uploaded to AWS S3

In our internal tests, using a c5.9xlarge AWS EC2 instance with a db.r4.xlarge RDS instance running AWS Aurora, we achieved the following reindexing times for APT and YUM repositories:

  • An APT repository with 50,000 Debian packages reindexed in about 1 min 35 s.
  • A YUM repository with 50,000 RPM packages reindexed in about 1 min 13 s.

Debian packages and APT repositories

The APT metadata for a single repository and a single version of Debian or Ubuntu cannot be generated by multiple processes in parallel. If you upload packages to different repositories, or to different Ubuntu/Debian versions within the same repository, those reindex operations can be handled in parallel.

Examples to help illustrate what can and cannot be parallelized:

  • The repository "user/example" can only be reindexed serially for ubuntu/trusty. Multiple reindexes of "user/example" for the same version of Ubuntu, Debian, or any other APT-based system run one at a time. Adding worker processes will not speed up reindexing of "user/example" for a single version of Debian, Ubuntu, or any other APT-based system.
  • The repository "user/example" can be reindexed for ubuntu/trusty, ubuntu/xenial, debian/jessie, and any other distribution at the same time. Adding workers allows reindexes for different distributions of Ubuntu and Debian to run in parallel.
  • Uploading an amd64 package and an i386 package to "user/example" creates two separate reindex jobs that must be processed serially. Adding worker processes will not allow reindex jobs for different CPU architectures in the same repository to run in parallel.
  • Uploading an amd64 package to "user/example" and an i386 package to "user/another-example" creates two reindex jobs for different repositories. If you have sufficient worker processes, these reindexes can happen simultaneously.
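The examples above can be read as a grouping rule: APT reindex jobs that share a repository and a distribution version form one serial queue, regardless of CPU architecture. The following is a hypothetical model of that rule, written only to make the examples concrete. AptJob, apt_serial_key, and the job list are invented for illustration and are not part of packagecloud.

```ruby
# Hypothetical model of the APT rule described above; not packagecloud's code.
# Jobs that share a key run one at a time; jobs with different keys can run
# on separate workers.
AptJob = Struct.new(:repo, :distro, :arch)

# Architecture is deliberately left out of the key: amd64 and i386 uploads
# to the same repository and distribution still serialize.
def apt_serial_key(job)
  [job.repo, job.distro]
end

apt_jobs = [
  AptJob.new("user/example", "ubuntu/trusty", "amd64"),
  AptJob.new("user/example", "ubuntu/trusty", "i386"),
  AptJob.new("user/example", "ubuntu/xenial", "amd64"),
  AptJob.new("user/another-example", "ubuntu/trusty", "i386"),
]

# Each group is one serial queue. The number of groups is the largest number
# of workers that could be busy at once for this batch of jobs.
apt_queues = apt_jobs.group_by { |job| apt_serial_key(job) }

apt_queues.each do |key, queued|
  puts "#{key.join(' ')}: #{queued.size} job(s), processed serially"
end
puts "workers that can be busy at once: #{apt_queues.size}"
```

In this batch the two ubuntu/trusty jobs for "user/example" share a queue, so a worker count above the number of queues would leave workers idle.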

You can increase reindex speed by disabling file list metadata for your repository. This metadata is very large, takes a long time to generate, and is generally used by only a small percentage of users. Disabling it for APT repositories means that users can no longer run "apt-file" to determine which package in the repository provides a given file. For extremely large packages (e.g., Chef Omnibus style packages) file list metadata may not be particularly useful: a user probably knows that all files in /opt/example are from the Example package.

You'll need to decide whether you want to support file list metadata, but for large repositories index time can be reduced significantly if it is disabled.

RPM packages and YUM repositories

The YUM metadata for a single repository, a single version of CentOS, Enterprise Linux, or any other YUM-based system, and a particular CPU architecture cannot be generated by multiple processes in parallel. If you upload multiple packages to the same repository for the same version of Enterprise Linux or CentOS and the same CPU architecture, the resulting reindexes cannot be parallelized.

Examples to help illustrate what can and cannot be parallelized:

  • The repository "user/example" can only be reindexed serially for el/6 x86_64 packages. Multiple reindexes of "user/example" for the same version of CentOS, Enterprise Linux, etc. and the same CPU architecture are processed serially. Adding workers will not allow these jobs to be processed faster.
  • The repository "user/example" can be reindexed for both x86_64 and i386 packages uploaded to a single version of CentOS, Enterprise Linux, or other YUM-based system (e.g. el/6) at the same time, if sufficient worker processes are available.
  • The repository "user/example" can be reindexed for x86_64 packages uploaded to both el/6 and el/7 at the same time. Adding workers allows these index jobs to run in parallel.
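For YUM the serial unit is narrower than for APT: the CPU architecture is part of it. A hypothetical model of the examples above is sketched here. YumJob, yum_serial_key, and the job list are invented for illustration and do not reflect packagecloud's internals.

```ruby
# Hypothetical model of the YUM rule described above; not packagecloud's code.
YumJob = Struct.new(:repo, :distro, :arch)

# Unlike APT, the CPU architecture is part of the key, so x86_64 and i386
# reindexes of the same repository and distribution can run in parallel.
def yum_serial_key(job)
  [job.repo, job.distro, job.arch]
end

yum_jobs = [
  YumJob.new("user/example", "el/6", "x86_64"),
  YumJob.new("user/example", "el/6", "x86_64"), # same key: waits for the first
  YumJob.new("user/example", "el/6", "i386"),
  YumJob.new("user/example", "el/7", "x86_64"),
]

# Each group is one serial queue; different groups can use different workers.
yum_queues = yum_jobs.group_by { |job| yum_serial_key(job) }

yum_queues.each do |key, queued|
  puts "#{key.join(' ')}: #{queued.size} job(s), processed serially"
end
puts "workers that can be busy at once: #{yum_queues.size}"
```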

You can increase reindex speed by disabling file list metadata for your repository. This metadata is very large, takes a long time to generate, and is generally used by only a small percentage of users. Disabling it for YUM repositories means that users can no longer run "yum whatprovides" to determine which package in the repository provides a given file. For extremely large packages (e.g., Chef Omnibus style packages) file list metadata may not be particularly useful: a user probably knows that all files in /opt/example are from the Example package.

You'll need to decide whether you want to support file list metadata, but for large repositories index time can be reduced significantly if it is disabled.

Python packages and PyPI repositories

PyPI metadata can only be generated serially for a particular repository. Adding additional worker processes will not enable faster reindexing of PyPI repositories.

Multiple repositories can be reindexed in parallel. Adding additional worker processes will allow you to reindex two separate repositories concurrently.

Node.js and NPM registries

Multiple Node.js uploads to the same repository will cause multiple reindex jobs to be queued. These jobs can be processed concurrently for distinct packages within the same repository. Multiple uploads of the same package (e.g. example-1.0, example-1.1, example-1.5, ...) in the same repository are processed serially.

Adding additional worker processes will allow multiple distinct packages in the same repository to be reindexed concurrently.
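For Node.js the serial unit is the package name within a repository, not the repository as a whole. The sketch below is a hypothetical model of that rule. NpmJob, npm_serial_key, and the package name "other-package" are invented for illustration only.

```ruby
# Hypothetical model of the Node.js rule described above; not packagecloud's code.
NpmJob = Struct.new(:repo, :package, :version)

# Versions of the same package in the same repository share a key and are
# processed serially; distinct packages get distinct keys.
def npm_serial_key(job)
  [job.repo, job.package]
end

npm_jobs = [
  NpmJob.new("user/example", "example", "1.0"),
  NpmJob.new("user/example", "example", "1.1"),
  NpmJob.new("user/example", "example", "1.5"),
  NpmJob.new("user/example", "other-package", "2.0"), # hypothetical package
]

# Each group is one serial queue; different groups can use different workers.
npm_queues = npm_jobs.group_by { |job| npm_serial_key(job) }

npm_queues.each do |key, queued|
  puts "#{key.join(' ')}: #{queued.size} job(s), processed serially"
end
puts "workers that can be busy at once: #{npm_queues.size}"
```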

RubyGem packages and repositories

Multiple RubyGem uploads to different repositories will cause multiple reindex jobs to be queued, which can be processed concurrently for each of the repositories. Reindex jobs for the same RubyGem repository will be processed serially.

Adding additional worker processes will allow multiple distinct RubyGem repositories to be reindexed concurrently.

Java packages and Maven repositories

Currently, no background processes are used for generating Maven repositories, so no worker jobs are necessary for this repository type.