What is a Container?
Overview
Teaching: 15 min
Exercises: 0 min
Questions
How do containers contribute to scientific reproducibility?
Objectives
Understand how containers contribute to scientific reproducibility.
What is a Container?
Slides available (download)
Scientific Reproducibility
There is rising concern about a lack of repeatability, replicability, and reproducibility in science and engineering, with a large number of public failings across domains ranging from algorithm development and design to cancer genomics, medicine, and economics, where the results of several high-profile publications could not be replicated when reproducibility studies were conducted. Variation in data collection methodologies, experimental environments, and computational configurations, together with a lack of detailed documentation, has led to calls for enhanced peer review and validation of experimentally produced artifacts. These artifacts are the digital objects produced during the course of research, either inputs or outputs of the study, and can take the form of input data sets, raw data, software systems, and scripts used to run experiments or analyze results.
As part of this push to better validate experimental work, a number of publishers are engaged in efforts to encourage and promote scientific reproducibility. The Association for Computing Machinery (ACM), for example, has embraced a campaign to award digital badges, accompanying publications, that represent various levels of digital artifact validation. These badges indicate whether the digital artifacts have been found to be functional, reusable, and available, and whether the results of the underlying studies have been validated and reproduced.
Containerization
In recent years, containers and containerization technologies have emerged and gained significant adoption among developers in industry and across the software development landscape. Packaging an application as a container carries the same installation overhead as a normal installation of that application, plus a slight additional overhead: knowledge of the containerization framework itself. However, once an application is containerized and the container is made available to others, the full burden of application installation and configuration is abstracted away from the end users of that container, so long as they can run the underlying containerization framework on their systems. Users can simply download and run pre-built containers without the often lengthy and detailed application installation procedures.

Containers operate as a virtualization layer within the operating system of the host machine, separating the containerized application environment from the physical resources of the host. Previous work has shown that applications bundled and distributed as containers and executed through containerization frameworks incur very little execution overhead, with almost no difference in performance from the same application running non-containerized on a "bare-metal" system. This separation and abstraction allows containers to be easily migrated from system to system, provided the underlying containerization framework can run there, and has driven their adoption, particularly in cloud computing ecosystems. Containers typically include all the software and dependencies needed to run a single application, which allows them to support DevOps principles and microservices (an environment dedicated to a single task).
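As a concrete sketch of what this workflow looks like with Docker (the image name python:3.12 and the version printed are illustrative examples, not part of this lesson):

$ docker pull python:3.12                       # download a pre-built image and all of its dependencies
$ docker run --rm python:3.12 python --version  # run the containerized application, then clean up
Python 3.12.4

No Python installation, compiler toolchain, or library configuration is needed on the host; only the containerization framework itself must be installed.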
The microservice development strategy, encapsulating all the necessary dependencies for a single application in a lightweight virtualization environment, makes containers a natural fit for the software environments, processing scripts, and applications produced as digital artifacts in the course of scientific and academic research. A large portion of the artifacts generated by research groups are produced by graduate students and postdoctoral associates, are developed in unique or isolated environments, would not be considered production-level in terms of support and documentation, and, after publication or when the student or postdoctoral associate moves on to another position, are often at best poorly updated and maintained. This leads to situations where researchers working to reproduce or build on previous work are forced into almost archaeological expeditions to identify and configure the various undocumented dependencies of an application.
Docker has emerged as the clear leader among containerization frameworks, particularly in cloud computing ecosystems, but information security concerns about possible root privilege escalation exploits have slowed its adoption within on-premises data centers and academic high performance computing (HPC) centers. To alleviate these security concerns, an alternative containerization framework, Singularity, was developed at Lawrence Berkeley National Laboratory and has begun to make significant inroads as the leading containerization framework for non-cloud deployments. Singularity differs from Docker in that, by design, it provides no mechanism for privilege escalation within the runtime environment: a user inside the container is the same user as outside the container. This makes it an ideal candidate for execution on large multi-tenant HPC systems, such as those found in academic and research institutions.
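A sketch of this behavior (the user name alice and the image are hypothetical; singularity pull can build its .sif image files directly from existing Docker images):

$ whoami                                     # the user on the host system
alice
$ singularity pull docker://python:3.12      # convert a Docker image into a Singularity image file
$ singularity exec python_3.12.sif whoami    # the same user, now inside the container
alice

Because no step in this session requires root access, the same commands can run on a shared HPC login or compute node.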
Key Points
Containers bundle an application together with all of its dependencies, so end users can download and run it without lengthy installation and configuration.
Containerized applications run with very little overhead compared to the same application on a "bare-metal" system.
Containers make the digital artifacts of research (software environments, scripts, and applications) easier to share, reproduce, and build upon.
Docker leads in cloud ecosystems, while Singularity, which forbids privilege escalation inside the container, is better suited to multi-tenant HPC systems.