Improving NuGet package extraction performance

This is an old post and doesn't necessarily reflect my current thinking on a topic, and some links or images may not work. The text is preserved here for posterity.

A few weeks ago I received the following bug report for Octopus from a customer:

We have recently rolled out Octopus deploy [..] and we are having severe memory leak problems on the tentacles. After only 1 deploy one of the tentacles had used 1.2GB ram (Yes gigabytes). The package was quite large - circa - 180MB - but there must be a serious leak somewhere. Other projects are having similar issues - up-to about 600 MB memory usage.

I put together a test harness and created a NuGet package containing a pair of 90MB files. The harness simply used the NuGet.Core library's PackageManager class to install the package to a local folder. 55 seconds later, and having used 1.17GB of memory (as measured by GC.GetTotalMemory(false)), NuGet had finished extracting the package.
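For reference, a harness along these lines is enough to reproduce the measurement, assuming the NuGet.Core 1.x API of the day; the package ID, version and paths here are made up:

using System;
using System.Diagnostics;
using NuGet;

class Harness
{
    static void Main()
    {
        // Install from a local folder feed into a target directory (illustrative paths)
        var repository = PackageRepositoryFactory.Default.CreateRepository(@"C:\PackageSource");
        var packageManager = new PackageManager(repository, @"C:\InstallTarget");

        var stopwatch = Stopwatch.StartNew();
        packageManager.InstallPackage("MyApp", new SemanticVersion("1.0.0"),
            ignoreDependencies: true, allowPrereleaseVersions: false);

        Console.WriteLine("Took {0:n0} ms, GC heap now {1:n0} bytes",
            stopwatch.ElapsedMilliseconds, GC.GetTotalMemory(false));
    }
}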

The good news is that given time, the memory usage reduced to normal, so the GC was able to free the memory (though much of it stayed allocated to the process just in case). The memory wasn't being leaked, it was just being wasted.

Weirdly, the NuGet API is designed around streams, and the System.IO.Packaging classes that NuGet depends on are also designed around streams. Looking into the implementation, the problem seemed to come down to NuGet.Core's ZipPackage class.

When you ask a NuGet.Core ZipPackage to list the files in the package (GetFiles()), the implementation looks like this:

private List<IPackageFile> GetFilesNoCache()
{
    using (Stream stream = _streamFactory())
    {
        Package package = Package.Open(stream);

        return (from part in package.GetParts()
                where IsPackageFile(part)
                select (IPackageFile)new ZipPackageFile(part)).ToList();
    }
}

The ZipPackageFile constructor is being passed a PackagePart, which exposes a stream. But what happens with that stream?

public ZipPackageFile(IPackageFile file)
    : this(file.Path, file.GetStream().ToStreamFactory())
{
}

The ToStreamFactory call looks innocuous, but here's the implementation:

public static Func<Stream> ToStreamFactory(this Stream stream)
{
    byte[] buffer;

    using (var ms = new MemoryStream())
    {
        try
        {
            stream.CopyTo(ms);
            buffer = ms.ToArray();
        }
        finally 
        {
            stream.Close();
        }
    }

    return () => new MemoryStream(buffer);
}

That's right - it reads the entire stream into an array, then returns a new MemoryStream over that array every time someone requests the contents of the file.

The reason for this appears to be that while System.IO.Packaging is designed around streams that need to be disposed, the NuGet API and classes like ZipPackage are intended to be passed around without needing to be disposed. So instead of opening and closing the .nupkg file to read file contents when required, NuGet copies everything to memory up front.
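To make the trade-off concrete, here's a rough sketch of the alternative: a factory that reopens the .nupkg on demand and ties the Package's lifetime to the stream it hands out, so disposing the stream closes both. The names CreateDeferredStreamFactory and DeferredPartStream are mine, not NuGet's:

public static Func<Stream> CreateDeferredStreamFactory(string nupkgPath, Uri partUri)
{
    return () =>
    {
        // Open the package each time the content is requested, instead of buffering it
        var package = Package.Open(nupkgPath, FileMode.Open, FileAccess.Read, FileShare.Read);
        return new DeferredPartStream(package, package.GetPart(partUri).GetStream());
    };
}

// Read-only wrapper that closes the underlying Package when the stream is disposed
sealed class DeferredPartStream : Stream
{
    private readonly Package _package;
    private readonly Stream _inner;

    public DeferredPartStream(Package package, Stream inner)
    {
        _package = package;
        _inner = inner;
    }

    public override bool CanRead { get { return _inner.CanRead; } }
    public override bool CanSeek { get { return _inner.CanSeek; } }
    public override bool CanWrite { get { return false; } }
    public override long Length { get { return _inner.Length; } }
    public override long Position
    {
        get { return _inner.Position; }
        set { _inner.Position = value; }
    }

    public override void Flush() { _inner.Flush(); }
    public override int Read(byte[] buffer, int offset, int count) { return _inner.Read(buffer, offset, count); }
    public override long Seek(long offset, SeekOrigin origin) { return _inner.Seek(offset, origin); }
    public override void SetLength(long value) { throw new NotSupportedException(); }
    public override void Write(byte[] buffer, int offset, int count) { throw new NotSupportedException(); }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            _inner.Dispose();
            _package.Close();
        }
        base.Dispose(disposing);
    }
}

That pushes the disposal question back onto callers, which is presumably why NuGet chose buffering in the first place. Either way, what ships in NuGet.Core today buffers everything.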

This isn't a problem when your packages are less than a few MB, but it's pretty harmful when you're distributing applications.

After spending half a day trying to patch NuGet.Core to avoid reading the files into memory, in hopes of sending a pull request, I found that other people had also tried and been rejected - it seems like this is a problem the NuGet team plan to solve in an upcoming release.

Instead, I gave up and decided to write a package extraction function to suit my needs. This gist extracts the same package in 10 seconds using only 6MB of memory:

public class LightweightPackageInstaller
{
    private static readonly string[] ExcludePaths = new[] { "_rels", "package\\services\\metadata" };

    public void Install(string packageFile, string directory)
    {
        using (var package = Package.Open(packageFile, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            var files = (from part in package.GetParts()
                         where IsPackageFile(part)
                         select part).ToList();

            foreach (var part in files)
            {
                Console.WriteLine(" " + part.Uri);

                var path = UriUtility.GetPath(part.Uri);
                path = Path.Combine(directory, path);

                var parent = Path.GetDirectoryName(path);
                if (parent != null && !Directory.Exists(parent))
                {
                    Directory.CreateDirectory(parent);
                }

                using (var fileStream = new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.Read))
                using (var stream = part.GetStream())
                {
                    stream.CopyTo(fileStream);
                    fileStream.Flush();
                }
            }
        }
    }

    internal static bool IsPackageFile(PackagePart part)
    {
        var path = UriUtility.GetPath(part.Uri);
        return !ExcludePaths.Any(p => path.StartsWith(p, StringComparison.OrdinalIgnoreCase)) &&
               !PackageUtility.IsManifest(path);
    }
}
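Using it is a one-liner (the paths here are illustrative):

var installer = new LightweightPackageInstaller();
installer.Install(@"C:\Packages\MyApp.1.0.0.nupkg", @"C:\Apps\MyApp");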

There are other parts of NuGet.Core that break when large files are used. For example, some NuGet repositories use SHA hashes so that consumers can verify a package download. This is implemented in NuGet.Core's CryptoHashProvider. The method signatures?

public interface IHashProvider
{
    byte[] CalculateHash(Stream stream);

    byte[] CalculateHash(byte[] data);

    bool VerifyHash(byte[] data, byte[] hash);
}

Again, instead of passing the stream (even though the underlying crypto classes accept streams), it reads the entire file (180MB in this case) into an array just to hash it.
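Hashing the stream directly keeps memory flat no matter how big the package gets. A minimal sketch, assuming SHA-512 and with packageFile standing in for the path to the .nupkg on disk:

using (var sha = SHA512.Create())
using (var stream = File.OpenRead(packageFile))
{
    // ComputeHash reads the stream in chunks, so memory use doesn't grow with the package
    byte[] hash = sha.ComputeHash(stream);
    Console.WriteLine(Convert.ToBase64String(hash));
}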

Here's hoping these problems are fixed soon. For now, at least, Octopus has a workaround that's only a few lines of code and performs much faster. As of Octopus 1.3.5, you'll be able to install large packages without dedicating most of your memory to it.
