jupyter-enabled astrophysical analysis using data

13
CiSE Special Issue (accepted) Publisher: IEEE Jupyter-enabled astrophysical analysis using data-proximate computing platforms S. Juneau NSF’s NOIRLab K. Olsen NSF’s NOIRLab R. Nikutta NSF’s NOIRLab A. Jacques NSF’s NOIRLab S. Bailey Lawrence Berkeley National Lab Abstract—The advent of increasingly large and complex datasets has fundamentally altered the way that scientists conduct astronomy research. The need to work closely to the data has motivated the creation of online science platforms, which include a suite of software tools and services, therefore going beyond data storage and data access. We present two example applications of Jupyter as part of astrophysical science platforms for professional researchers and students. First, the Astro Data Lab is developed and operated by NOIRLab with a mission to serve the astronomy community with now over 1,400 registered users. Second, the Dark Energy Spectroscopic Instrument (DESI) science platform serves its geographically-distributed team comprising about 900 collaborators from over 90 institutions. We describe the main uses of Jupyter and the interfaces that needed to be created to embed it within science platform ecosystems. We use these examples to illustrate the broader concept of empowering researchers and providing them with access to not only large datasets but also cutting-edge software, tools, and data services without requiring any local installation, which can be relevant for a wide range of disciplines. Future advances may involve science platform networks, and tools for simultaneously developing Jupyter notebooks to facilitate collaborations. March/April 2021 1 arXiv:2104.06527v1 [astro-ph.IM] 13 Apr 2021

Upload: others

Post on 15-Apr-2022

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Jupyter-enabled astrophysical analysis using data

CiSE Special Issue (accepted)Publisher: IEEE

Jupyter-enabled astrophysicalanalysis using data-proximatecomputing platforms

S. JuneauNSF’s NOIRLab

K. OlsenNSF’s NOIRLab

R. NikuttaNSF’s NOIRLab

A. JacquesNSF’s NOIRLab

S. BaileyLawrence Berkeley National Lab

Abstract—The advent of increasingly large and complex datasets has fundamentally altered theway that scientists conduct astronomy research. The need to work closely to the data hasmotivated the creation of online science platforms, which include a suite of software tools andservices, therefore going beyond data storage and data access. We present two exampleapplications of Jupyter as part of astrophysical science platforms for professional researchersand students. First, the Astro Data Lab is developed and operated by NOIRLab with a mission toserve the astronomy community with now over 1,400 registered users. Second, the Dark EnergySpectroscopic Instrument (DESI) science platform serves its geographically-distributed teamcomprising about 900 collaborators from over 90 institutions. We describe the main uses ofJupyter and the interfaces that needed to be created to embed it within science platformecosystems. We use these examples to illustrate the broader concept of empoweringresearchers and providing them with access to not only large datasets but also cutting-edgesoftware, tools, and data services without requiring any local installation, which can be relevantfor a wide range of disciplines. Future advances may involve science platform networks, andtools for simultaneously developing Jupyter notebooks to facilitate collaborations.

March/April 2021 1

arX

iv:2

104.

0652

7v1

[as

tro-

ph.I

M]

13

Apr

202

1

Page 2: Jupyter-enabled astrophysical analysis using data

IN ASTRONOMY, and other disciplines, theneed for online science platforms is driven byboth rapidly increasing data volume and by thecomplexity of datasets, which require highly spe-cialized and diverse software libraries to be co-located with the data. In terms of data rates, 1TB/night is typical of current data-intensive fa-cilities, while the incipient Rubin Legacy Surveyof Space and Time (Rubin LSST) will collect20 TB of raw data and send a stream of 10million alerts1 of brightness changes each night.The size of teams working collaboratively on dataanalysis is also growing, routinely comprisingtens to several hundreds of researchers and stu-dents. Clear advantages arise from being able toconnect Jupyter Notebooks or JupyterLab to largedatasets to perform analysis and data visualizationefficiently both to support collaborative work andto empower a wide array of researchers includingfrom institutions that may lack local data storageor compute resources.

Historically, observational research was pri-marily conducted by individuals and small teamsable to use a telescope or satellite to make obser-vations. They would process and analyze the datausing software packages on their local computers,before publishing results in astronomical journals.Larger teams and collaborations started changingthe landscape by adding a need to share dataamong several people. Another big change camewith publicly accessible data archives, which canbe used to address new science questions withoutnew observations. Researchers can download theneeded data, and process and/or analyze them.For instance, statistics from the Hubble SpaceTelescope show that discoveries from archivaldata outpace those made for the original intendedpurpose of the observations2. Even more recently,another major change took place as some datasets are becoming prohibitively large or complexsuch that researchers no longer have the meansto efficiently download the data onto their localcomputers. Instead, one needs to bring the com-puting workflows to the data.

Below, we showcase two astronomy projects

1Alerts are sent when a change in brightness is detected bythe software comparing the new observation of a given objectto previous measurements so that researchers can investigate thecause (e.g., new supernovae, variable stars).

2https://archive.stsci.edu/hst/bibliography/pubstat.html

as ongoing successful example applications. First,the Astro Data Lab ([1], [2]) at NSF’s NationalOptical-Infrared Astronomy Research Laboratory(NOIRLab) is an online astronomy science plat-form currently serving large public astronomicaldatasets including databases with tables rangingfrom 500,000 to 65 billion rows. Data-proximateanalysis and visualization is key to use databasescontaining over 100 billion rows; transferringeven 150,000,000 rows with a small subset ofthe columns would take ∌2 hours over a 14Mb/s connection, compared to ∌10 seconds overNOIRLab’s internal 10 Gb/s connection. It is evenmore relevant when considering that the databasesare supplemented with Petabytes of astronomicalimages. While this calculus will change withavailability of 5G networks, internal network up-grade opportunities will favor the code-to-data ap-proach for the foreseeable future. Since openingits doors in 2017, the Data Lab now has over1,500 registered users from different countries,ranging from students to faculty/researchers, andeducators.

Second, we describe how the Dark EnergySpectroscopic Instrument (DESI; [3], [4]) teamemploys Jupyter as the primary way to en-able their geographically distributed collaboration(nearly 900 researchers from more than 90 in-stitutions) to access petabytes of data with pre-installed software libraries. Jupyter Notebooks arealso used to train new team members, and asa quick way to perform integration testing ofsoftware releases. For each science platform, wegive an overview before presenting use cases forresearch, training, and education. We will thenconsider common elements that may apply toscience platforms in general, and possible im-provements from enhanced Jupyter capabilities.

ASTRO DATA LAB

Overview and implementationThe goal of the Astro Data Lab is to en-

able efficient archival use of massive astronom-ical survey data, particularly those produced onthe facilities of NSF’s National Optical-InfraredAstronomy Research Laboratory (NOIRLab). Asshown in Figure 1, the data volume from NOIR-Lab facilities has grown explosively over thelast decade, with much of that growth owed to

2 © 2021 IEEE Published by the IEEE Computer Society CiSE (accepted)

Page 3: Jupyter-enabled astrophysical analysis using data

wide-angle images taken with the Dark EnergyCamera3. These surveys of the sky result in largemaps with a multitude of stars, as well as galaxiesthat are billions of light-years from Earth. Thesemaps are typically processed through automatedmeasurement pipelines, yielding tables of billionsof astronomical objects. The value of these pro-cessed data is enormous, not just for the teamstaking the data for their own purposes; muchof the value lies in the potential for the broadastronomical community, and the public, to makeunanticipated discoveries.

Astro Data Lab enables several kinds ofarchival science cases, including 1) discoveryof unusual, rare, or otherwise interesting objectsselected through queries to databases; 2) classi-fication of objects in large samples, potentiallyincluding image cutouts and aided by machinelearning techniques; and 3) discovery in the timedomain, where time series data of selected objectsare used to create new understanding or revealnew phenomena. All of these science cases aresupported by services provided by Astro DataLab, with Jupyter Notebooks as the key technol-ogy that brings their power to users. For example,users can query databases directly within a Note-book (via a QueryClient), and save their output tostorage associated with their Data Lab accounts,which is currently one Terabyte per user (via aStoreClient). Figure 2 illustrates how these piecesconnect together to make Jupyter an integral partof the Astro Data Lab science platform, and weinclude examples of the code syntax to use theClients within a Notebook in the Appendix.

The Astro Data Lab science platform runson premises using commodity hardware (1U and2U general-purpose servers with typically twoXeon CPUs and 20 cores each). The variousdata services and the Jupyter notebook serverare currently deployed as virtual machines, withthe available resources shared by all users. Tofacilitate resource allocation, and permit usersto maintain their own software stack, we arecurrently working to implement containerizationbased on Docker and Kubernetes. Specializedhardware consisting of three machines (produc-tion, hot standby, staging) is deployed for the cat-

3https://www.darkenergysurvey.org/the-des-project/instrument/

alog databases, and is optimized for CPU/RAMthroughout and backed by fast SSD storage.

Jupyter for ResearchThe majority of Astro Data Lab users are

professional astronomers and graduate students.They tend to have topical expertise in their subjectof research but varied levels of coding experienceand knowledge. To remove the need for localinstallation, the Jupyter Notebook server includespre-installed software packages and Python li-braries (e.g., numpy, matplotlib, pandas), andis populated with example Jupyter notebooksdirectly within the user accounts. The defaultinstallation includes astronomy-specific packagessuch as Astropy4 ([5]) and the datalab clients5.

The Astro Data Lab Jupyter notebook serverprovides a convenient way to quickly querydatabase tables, visualize their content, and per-form detailed analyses. As an example scien-tific use case, we consider the Survey of theMagellanic Stellar History (SMASH6 [9]), whichmapped a wide area in and around the peripheryof the Magellanic Clouds, two of the nearest com-panion galaxies to the Milky Way. The SMASHdatabase table contains millions of stars formingthe Magellanic Cloud galaxies as well as millionsof distant, background galaxies, totaling nearly360 million unique objects. Discoveries fromSMASH include a ripple in the disk plane of theLarge Magellanic Cloud (LMC; [10])− likely theresult of an interaction with its neighbor galaxythe Small Magellanic Cloud − and a previouslyunknown faint dwarf galaxy companion to theMilky Way, named Hydra II ([11]). The studyof such faint companion galaxies may hold cluesto the nature of the unknown “dark matter”.

The Hydra II dwarf galaxy finding is fea-tured in a science example Jupyter Notebookprovided to users, giving them a demonstrationof a discovery from beginning to end, and astarting point for their own explorations. It showsthe chain of steps for research from how thedatabase query is constructed and issued to theserver, how to display the measurements that arereturned, and how to process them to highlight

4https://www.astropy.org/5https://github.com/noaodatalab/datalab6https://datalab.noirlab.edu/smash/smash.php

March/April 2021 3

Page 4: Jupyter-enabled astrophysical analysis using data

Department Head

Figure 1. Evolution of the sky coverage from NOIRLab’s observatories with telescope diameter of 2-4 metersduring the years 2012, 2015 and 2018. The color scheme indicates the exposure time on a logarithmic scale,mostly ranging from 900 seconds (15 minutes; green) up to nearly 1 million seconds (277 hours; dark red). Therapid increase of coverage starting around 2013 coincides with astronomical surveys conducted with wide-fieldcameras such as the Dark Energy Camera (DECam). These full-sky maps were generated with an equal-areaMollweide projection, and the Milky Way Galactic Plane is shown with a solid red line. The bar chart illustratesthe growth in data volume stored at NOIRLab’s Astro Data Archive, split between unprocessed (raw) images,processed images, and high-level data products such as image mosaics.

Figure 2. Illustration of some of the key Clients and Managers that are part of the Data Lab platform, andwhich can be used with Jupyter to connect the users to the data, and enable them to conduct their analysisusing Notebooks. User authentication is only required for users to access their MyDB or virtual storage space.The queryClient allows users to send a database query, and retrieve the results directly within a Notebook. Theoverall goal is for users to be able to conduct their entire analysis up to generating publication-quality figureson the Jupyter server.

4 CiSE (accepted)

Page 5: Jupyter-enabled astrophysical analysis using data

the newly discovered galaxy. Researchers and stu-dents can also browse other example notebooksfrom within their Astro Data Lab account, or onthe “03 ScienceExamples” GitHub repository7.

Researchers who conduct their analysis onthe Astro Data Lab platform are asked to in-clude a formal acknowledgement in all result-ing journal publications, and to use the facili-ties command “Astro Data Lab” when available,such as for the American Astronomical Societyjournals. They may further propose to add theirJupyter Notebook to our contributed Notebookcollection8. User-contributed notebooks that showapplications not yet covered in our collectionare especially welcome. The “05 Contrib” folderis divided into astronomy topics, and includesstep-by-step instructions as well as an AstroData Lab Notebook template for guidance. Theteam then reviews the proposed notebook, andmay iterate with the contributor until the Note-book is approved to be added to the notebookcollection, which will consist in merging thepull-request (PR) on GitHub. The vetting pro-cess requires manual interventions and topicalknowledge. Once the GitHub PR is merged, thedistribution and testing of the new contributednotebook is automated as described below.

Jupyter for TrainingTraining is crucial at all career levels. This

is partly due to the important changes on thesoftware, tools, and methods being developedand utilized. For example, several experiencedresearchers are newly learning the Python pro-gramming language. More fundamentally, this isalso due to the increasing prominence of data-driven modes of conducting science. Trainingpresents new challenges as we need to conveyand teach new, data-intensive methods, but it alsopresents new opportunities. Data Lab in combi-nation with Jupyter helps make research fairerand more equitable. Improving access to qualitydata is an important part of building an inclusivecommunity by lowering the barrier of entry, thusleveling the playing field for all ([6]).

The Astro Data Lab recognizes the importance

7https://github.com/noaodatalab/notebooks-latest/tree/master/03 ScienceExamples

8https://github.com/noaodatalab/notebooks-latest/tree/master/05 Contrib

of providing researchers, students, and professorswith adequate technical training and support indata science to empower them to take advan-tage of the increasing volume and complexity ofdatasets. In terms of documentation and tutorials,the team produced a User Manual9 using SphinxDocumentation10 as well as numerous exampleJupyter Notebooks, which are available as a refer-ence at any time. In addition, team members givelive presentations and webinars, conduct onlineand/or in-person demos as well as hands-on tuto-rials, during which participants can interactivelyfollow along. Most of the demos and tutorialsinvolve going through Jupyter Notebooks, whichrange from entry-level to advanced science ex-amples to technical ”How-To” notebooks, andwhich are described in a README file. Theprocedure to maintain, and automatically updatethe collection of Notebooks is described below.

Jupyter for EducationJupyter Notebooks facilitate education about

topics in astronomy and/or data science, includingcomputer coding, data manipulation and analysis.We currently support the Python language, andhave two primary sets of notebooks under theEducation and Public Outreach (EPO) umbrella.

The Teen Astronomy Cafe Program11 consistsof hands-on computer activities and discussionsled by volunteer scientists in addition to theprogram director. The target audience is 12 to18 year-old students. Jupyter Notebooks are pro-duced ahead of time, and may include interactivewidgets (sliders, and buttons) and/or instructionsto teach students to modify code cells. Eachactivity is designed to cover a different astronomytopic for which data science techniques can beemployed. As of 2020, these EPO Notebooks canbe run either on Google Colab or directly on theAstro Data Lab platform.

At the level of undergraduate and graduatestudents, the Astro Data Lab is adding a collec-tion of notebooks adapted from the La SerenaSchool for Data Science12. These notebooks wereoriginally written by the guest lecturers, and covermore advanced topics involving, e.g., statistical

9https://datalab.noirlab.edu/docs/manual/index.html10https://www.sphinx-doc.org11http://www.teenastronomycafe.org/12http://www.aura-o.aura-astronomy.org/winter school/

March/April 2021 5

Page 6: Jupyter-enabled astrophysical analysis using data

Department Head

tools applied to large datasets, high performancecomputing, image analysis, and machine-learningtechniques. The goal is for students to learnhands-on skills that they can then apply to theirown research.

Lastly, Data Lab notebooks have been used inthe classroom upon requests from professors andteachers, including semester-long graduate-levelastronomy courses. In some instances, we createtemporary demo accounts for students. These in-house demo accounts are generated via a script,and allow the users to save their work in progressduring the duration of the fixed-term account.After they expire, they are wiped clean, anddeactivated using an automated script.

Notebook Deployment and TestingThe Data Lab team develops a suite of note-

books for all users. These range from notebooksintroducing Python, SQL, and Data Lab, overtechnical How-Tos, to complete science casesthat reproduce interesting results from the astro-nomical literature. Astronomy and data scienceeducation notebooks complete the picture. DataLab also assists users who wish to contribute anotebook to the collection by providing them witha template and step-by-step instructions.

Notebooks are developed either on the DataLab notebook server13, or locally (with thedatalab package14 installed). All version con-trol occurs through GitHub15. Pull requests arereviewed by the Data Lab team. When merged tothe master branch, a webhook is triggered whichsends a beacon to the Data Lab servers. If thebeacon is successfully verified, the server pullsthe latest version from GitHub and re-populatesa specific directory with the updated notebooks.This notebooks-latest/ directory is read-only mounted in all users’ notebook spaces, andit can be browsed both using a browser and aterminal. Users can also pull a whole new read& write copy of it using a custom shell command(getlatest <targetdir>).

The notebook suite is now around 40 note-books strong and growing. We must ensure thatall notebooks continue to execute correctly at

13https://datalab.noirlab.edu/devbooks/14https://github.com/noaodatalab/datalab/15https://github.com/noaodatalab/notebooks-latest/

all times, despite the backend datasets and mid-dleware services being updated often. To fa-cilitate automated pass/fail testing we employnbconvert wrapped in a custom Python pack-age. Testing can be performed through a dedi-cated notebook (part of the notebook suite), orin command line mode. The testing supportsglobbing for .ipynb files, exclusion patterns,dynamic notebook kernel switching, and sum-mary reporting. Test failures do not interruptthe testing suite, and the tracelog of all testednotebooks is available for inspection. Currentlythe test suite completes in 35 minutes on the DataLab notebook server.

DESIOverview and implementation

The DESI instrument ([4]) will be used to sur-vey one third of the sky (14,000 square degrees),and obtain spectra for 35 million galaxies andquasars in order to construct the largest three-dimensional map of the universe to date [3]. Bycomparing this map to predictions from numericalmodels, the DESI team will constrain the natureof dark energy, the dominant yet mysterious forcedriving the expansion of the universe. In addi-tion to its main cosmology goals, which rely onspectroscopy of very distant galaxies and quasars,DESI will obtain spectra for 10 million starsfrom the Milky Way, and an additional 10 millionbrighter, less distant galaxies. These added-valuedata will allow astronomers to answer a slew ofastrophysical questions about stars, galaxies, andtheir evolution.

Currently installed on the four-meter Mayalltelescope at Kitt Peak National Observatory, Ari-zona, DESI can obtain spectra of up to 5000targets in a single observation, by allocating asingle optical fiber per target. The DESI team firstneeds to measure the sky positions (coordinates)of all the galaxies, quasars, and stars to observewith DESI, select which ones can be observedsimultaneously with 5000 fibers, and computehow to align each remotely-controlled fiber withits target (fiber assignment). The targets are se-lected from the pre-imaging Legacy Surveys16,which are now complete, and consist of opticaland infrared images which were used to measure

16https://www.legacysurvey.org/

6 CiSE (accepted)

Page 7: Jupyter-enabled astrophysical analysis using data

the position and fluxes of over 1.6 billion uniqueobjects.

DESI generates O(1 PB) of data per year,including raw telescope data, processed results,and computer simulations needed for final sci-ence analyses. This amount of data together withthe sophisticated and specialized software weremotivators to use a Jupyter server running closeto the data at the National Energy Research Sci-entific Computing Center (NERSC), where teammembers can sign in and launch an instanceof JupyterLab with pre-installed software. TheJupyterHub servers are maintained by NERSC asa service to all users, not specific to DESI, butthey are significantly used by DESI collabora-tors using DESI-maintained custom kernels withDESI-specific software.

NERSC deploys its JupyterHub service us-ing a container-as-a-service platform basedon Kubernetes, and uses wrapspawner17 andbatchspawner18 to enable JupyterLab sessions tospawn on shared or exclusive (batch) nodes onCPU or GPU resources. Most DESI collaboratorsusing Jupyter access it via a node shared withother NERSC users for non-CPU-intensive ex-ploratory analysis and visualization. More inten-sive Jupyter-based computing is available throughdistributed analytics platforms like Spark or Dask,or MPI-enabled applications leveraging IPyPar-allel, managed from notebooks through script-based adaptations to high-performance computingenvironments.

Below, we present example use cases ofJupyter for the DESI collaboration of nearly900 members including research staff, postdoc-toral researchers, graduate students and engineers.These include examples for the purpose of re-search, training, education, and software testing.The DESI software is maintained on GitHub(desihub19), and includes 59 repositories with107 members.

Jupyter for ResearchIn the case of DESI, research includes survey

planning, and data processing, exploration, andanalysis. The research applications themselves are

17https://github.com/jupyterhub/wrapspawner18https://github.com/jupyterhub/batchspawner19https://github.com/desihub

fairly broad as they include both the primary cos-mology goals centered around dark energy as wellas a slew of possible other projects on galaxies,stars, black holes, etc. Therefore, having a suiteof tools and analysis workflows will maximizethe outcome from the DESI observations. Jupyterenables interactive exploration and analysis ofpetabytes of data by remote collaborators withoutrequiring them to download the data locally. Italso provides kernels with pre-installed softwarereleases for DESI-specific data analysis.

Jupyter is frequently used by DESI collabo-rators for initial data exploration and algorithmdevelopment (Figure 3), before deploying thosealgorithms as Python packages to be used in non-Jupyter batch jobs using hundreds to thousands ofnodes. Jupyter notebooks are also used for collat-ing the results of those batch jobs and visualizingthe final results.

A specific example of data explorationis the interactive spectral visualization toolprospect20, which is developed on GitHub,and used by the DESI team at NERSC to visuallyinspect DESI spectra to debug data processingproblems. It is based on Bokeh, and works inthe JupyterLab environment. Figure 4 gives anoverview of a version of prospect that isbeing implemented at the Data Lab, showing anexample application with a Sloan Digital SkySurvey (SDSS) spectrum of a galaxy.

Jupyter for TrainingJupyter Notebooks are used to create DESI

Tutorials21 to train new DESI collaboration mem-bers, and to teach existing members new tasksor functionalities. They are maintained and canbe browsed on GitHub. However, DESI teammembers must run them at NERSC after authen-ticating, cloning the notebooks from GitHub, andlaunching a JupyterLab instance on the Jupyterserver. The underlying DESI software is writtenin Python, and is pre-installed. The user has theability to select DESI kernels to test different soft-ware versions if needed, or simply run the defaultkernel. Tutorials range from basic functionality togenerating mock data to planning observations viafiber assignment calculations, and investigating

20https://github.com/desihub/prospect21https://github.com/desihub/tutorials

March/April 2021 7

Page 8: Jupyter-enabled astrophysical analysis using data

Department Head

Figure 3. (Left) Overlay of the DESI focal plane on the sky with 5000 fiber patrol regions. The example testspectrum at the bottom was obtained from a fiber positioned on a part of the Triangulum Galaxy (red dot) aspart of early commissioning of the instrument in October 2019, and shows spectral features as labeled (Credit:DESI Collaboration; Legacy Surveys; NASA/JPL-Caltech/UCLA); (Middle) Section of a Jupyter Notebook usingthe DESI software to plot the fiber location to visualize how overlapping DESI pointings can result in repeatedobservations, which can be used to obtain more signal. (Right) Resulting plot shown inline in the Notebook.The black circles illustrate the full size of a given DESI pointing (focal plane), while each colored dot marksthe location of a fiber. The color scheme is mapped to HealPix indices, which are defined to tile the sky withnside=64 in this example. The color saturation level is keyed to the number of repeated fiber assignments(darker colors signify more overlap).

available spectra from early observations. Theyare meant to train users in a modular fashion,and to enable them to re-use and combine partsfrom different tutorials in order to create theirown workflow.

Jupyter for EducationOne of the educational efforts using Jupyter

is a program called DESI High22, which aims toteach 14 to 18 year-old students to understandDark Energy with actual data from the DESIproject and Jupyter notebooks. The latter aremade available on GitHub and can be run withBinder. The DESI High program also includesa lot of links to useful material and referencessuch as videos, websites, tutorials. An underlyingmotivation of this program, which is shared withsome of the initiatives mentioned previously, isto inspire and teach the next generation of (data)scientists.

22https://github.com/michaelJwilson/DESI-HighSchool

Jupyter for Software Testing

Integration testing of DESI software releasesis performed by a Jupyter Notebook on theJupyter server at NERSC. The “minitest.ipynb”notebook in the GitHub desitest23 repositoryruns an end-to-end integration test from simu-lating input data to processing it with the datapipeline to performing final quality assurance andanalysis. This notebook includes several steps,some of which spawn batch jobs and maintainlogfiles outside of the notebook itself. Undernormal activity levels on the NERSC servers, theminitest notebook runs end-to-end in about twohours and uses approximately 500 core-hours ofcomputing resources, a scale that would not beviable for typical continuous integration services.The output from this notebook serves as the “ref-erence run” for the software release, providingspecific examples of the data outputs, formats,and performance for that release.

23https://github.com/desihub/desitest

8 CiSE (accepted)

Page 9: Jupyter-enabled astrophysical analysis using data

Figure 4. Interactive spectral visualization tool prospect, which is developed on GitHub, and used by theDESI team at NERSC. It is based on Bokeh, and works in the JupyterLab environment. Above is a version ofprospect that is being implemented at the Data Lab, showing an example application with a SDSS spectrumof a galaxy (in red), with the best-fit model overlaid in black, and the error spectrum in green. Vertical dashedlines mark the location of expected spectral features such as emission lines from ionized gas (better seen inthe bottom right zoom box inset, which zooms around the location of the cursor by capturing the mouse-overinformation). The target galaxy can be seen in the top right image along with its celestial coordinates in RightAscension (RA) and Declination (Dec). The interactive tool allows users to zoom, pan, and reset the plot byusing the Bokeh buttons at the top of the viewer, and displays tabulated information about the object beingplotted (6th object of 50 in the example above as indicated by the slider). Clicking on the image thumbnail leadsto the location of the galaxy in the interactive Sky Viewer (https://www.legacysurvey.org/viewer; screenshot onthe right; with the North and East direction indicated with arrows).

JUPYTER IN SCIENCE PLATFORMSWe now consider similarities and differences

between the Astro Data Lab and DESI examples,as well as other astrophysical science platforms.Both the Data Lab team and the DESI teamtake advantage of the user-friendly format ofJupyter Notebooks to create examples and tu-torials to train new users, and to illustrate keyfunctionalities. They both maintain their tutorialsand examples on GitHub for version control andcontributions. In the case of Data Lab, eachuser account comes with the latest collection ofexample Notebooks which range from “GettingStarted” novice level, to more technical “HowTo” notebooks, complete scientific use cases, aswell as educational notebooks and contributednotebooks from the astronomy community. Inthe case of DESI, team members can followinstructions to clone the notebooks from GitHuband start using them.

The main difference may be the target au-dience, in the sense that the Astro Data Lab isprimarily serving publicly available data to the

general astronomy community, while the DESIteam has restricted access to proprietary data.Furthermore, the DESI effort is centered aroundthe instrument’s primary cosmology mission. Incontrast, the Astro Data Lab aims to maximizethe scientific output of the community includingall topics in astronomy and/or cosmology.

Toward a Network of Science PlatformsOther astronomy science platforms are ei-

ther active or being developed, and all rely onProject Jupyter to enable their users to performdata proximate analysis. For instance, the SpaceTelescope Science Institute (STScI) is utilizingProject Jupyter by 1) curating Jupyter notebooksas training materials24 for astronomy missiondata services; 2) building services for scienceusers to discover notebooks and utilize Jupyter-based tools such as the Jdaviz25 package ofastronomical data analysis visualization tools, and3) deploying science platforms in the Amazon

24https://github.com/spacetelescope/notebooks25https://jdaviz.readthedocs.io/en/latest/

March/April 2021 9

Page 10: Jupyter-enabled astrophysical analysis using data

Department Head

Web Service (AWS) that provide these notebooksand compute resources to users in proximity topublic mission data. STScI develops these plat-forms to support community engagement throughworkshops as well as through archive services,for example as an entry point for users to inter-act with data from the upcoming Roman SpaceTelescope mission.

Another example is the Canadian AstronomyData Center (CADC26), which is currently de-veloping a Jupyter Notebook server to be de-ployed on the Canadian Advanced Network forAstronomical Research (CANFAR27) in 2021,and which will utilize authentication and stor-age (VOSpace28) protocols and group permissionalready implemented at CADC, in addition toproviding users with access to data archives, andsoftware.

Also in development, the Vera C. Rubin Ob-servatory uses JupyterLab as the core technologyof the Notebook Aspect of its Rubin SciencePlatform29 (RSP). The Rubin Observatory, cur-rently under construction, will feature a wide-field 8.4-meter telescope, a gigapixel camera, anda data management system capable of handlingthe hundreds of petabytes that its 10-year LegacySurvey of Space and Time (LSST) will collect.The Notebook Aspect will be the key interfaceallowing the Rubin user community to accessthe massive dataset and run the LSST sciencepipelines. Other major projects in developmentplanning their own science platforms include theMaunakea Spectroscopic Explorer30, an 11-meterspectroscopic survey telescope, and the U.S. Ex-tremely Large Telescope Project, which will havea share of both the Thirty Meter Telescope31 andthe 25-m Giant Magellan Telescope32.

Challenges that may limit the adoption ofscience platforms include keeping the barrier ofentry as low as possible for onboarding new userswith no prior experience, the potential cost oragency limitations if platforms rely on commer-cial cloud solutions, and the complexity or het-erogeneity of datasets and services. The latter can

26http://www.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/en/27https://www.canfar.net/en/28https://www.ivoa.net/documents/VOSpace/29https://nb.lsst.io30https://mse.cfht.hawaii.edu31https://www.tmt.org32https://www.gmto.org

be exacerbated if each platform deploys different,non-standard solutions. A tighter communicationacross platforms and disciplines including sharingexperiences, designing for inter-operability, anddiscussing and adopting common core technolo-gies (e.g., Jupyter, Docker, Kubernetes) wouldalleviate some of these expected challenges.

While the pace of technological upheaval insoftware and data science is rapid, Jupyter hasclearly discovered a need in the realm of astro-nomical research for a flexible, extensible, andeasily deployable interactive software environ-ment, and thus is well-positioned to remain a coretechnology. This need is likely to expand as astro-nomical data volumes grow further. Indeed, it isnot difficult to imagine a future in which the bulkof astronomical data analysis moves to the Cloud,in part powered by Jupyter, perhaps through asystem as that envisioned by the Science PlatformNetwork ([8]).

CONCLUSIONIn this paper, we considered how Jupyter

can play a central role in astrophysical scienceplatforms by connecting large volumes of datawith a local analysis host. Namely, we presentedtwo example cases that employ a Jupyter serverto provide researchers and students with accessto Petabytes of data, with pre-installed softwarefor advanced data analysis, and with tutorials. Wefirst described NOIRLab’s Astro Data Lab, whichis operating based on a mission to maximize thescientific outcome of the astronomy communityfrom publicly-available wide-field astronomicalsurveys.

Beyond performing research, Astro DataLab’s efforts extend to training users and con-tributing to education in astronomy and datascience. We then presented how the DESI teamutilizes Jupyter for a range of activities suchas survey preparation, software development, re-search, training, and education. Online scienceplatforms share the common goal of co-locatingsoftware, tools, and tutorials with the data. Thisis increasingly needed as datasets grow both involume and in complexity.

Science platforms can be leveraged to improveon the robustness and on the reproducibility ofresearch results by making entire analysis work-flows available. For instance, Jupyter Notebooks

10 CiSE (accepted)

Page 11: Jupyter-enabled astrophysical analysis using data

could be added to (or cited by) peer-reviewedjournal publications. In addition to posting Note-books on an open source platform such as ina GitHub repository, one can create a Releasefor the repository, and register a persistent digitalobject identifier (DOI) with the open repositoryZenodo33. This approach ensures the longevity ofpublished Notebooks. Additional coordination be-tween science platforms and professional journalsmay further strengthen this good practice.

Another interesting future outlook is the op-portunity to facilitate collaboration between re-searchers. In particular, users have already oftenexpressed their interest in being able to collabora-tively write Jupyter notebooks in real-time as partof online platforms. This may be achieved withrecent, ongoing developments, such as JupyterReal-Time Collaboration (Jupyter-RTC). Such en-hanced capability would further strengthen theimpact of Jupyter in accelerating the progress ofastrophysical discoveries.

Taking an even broader perspective, there aremany parallels to draw with other scientific dis-ciplines where similar data proximate comput-ing platforms are developed and operated. Forinstance, CyVerse34 is an interdisciplinary plat-form heavily used for biology and bioinformatics.While the type of data differs (e.g., genomesequence rather than stars and galaxies), thereare similar needs in terms of filtering, analyz-ing, and visualizing data. Other examples includePangeo35, a platform for open, reproducible, andscalable geoscience. Many of the challenges facedand solutions adopted by these projects are trans-ferable to astrophysical data science, creating awealth of opportunity for future cross-disciplinarycollaboration.

APPENDIXWe introduce how to use the Data Lab Clientswithin a Jupyter Notebook, which were illustratedin Figure 2. First, the authClient works by creat-ing a token to recognize that a user is properlyauthenticated, which is required to access virtualstorage (VOSpace) and personal database space(MyDB). This step can be skipped if a user doesnot need to access those in a given Notebook, or

33https://zenodo.org/34https://www.cyverse.org/35https://pangeo.io/

if the user has recently authenticated as the tokenpersists in the account.from dl import authClient as acfrom getpass import getpass

token = ac.login(input("Enter user name:"),getpass("Enter password:"))

if not ac.isValidToken(token):raise Exception("Token is not valid.Please check your usename/password andexecute this cell again.")

Secondly, we show how the queryClient canbe used to send a SQL (and ADQL,36 Astro-nomical Data Query Language) query and re-trieve the results in a Jupyter Notebook, formattedas a pandas DataFrame in the example below,and run synchronously (the optional keywordasync_ can be set to True to run the queryasynchronously, in which case a jobId is returned,which can be used to check on the status of thequery, and retrieve the results after the query hascompleted).from dl import queryClient as qc

query = "SELECT * FROM ls_dr8.tractorLIMIT 1000"

df = qc.query(sql=query, fmt=’pandas’)

Thirdly, we show how the storeClient canbe used to store files to and retrieve fromVOSpace. VOSpace is an implementation ofnetwork-attached storage, and each user accountcomes with a dedicated storage area. The result ofa query can be written out directly to VOSpace:res = qc.query(’SELECT * FROM

ls_dr8.tractor LIMIT 1000’,out=’vos://lsdr8_sample.csv’)

The file can be later retrieved:from dl import storeClient as scfrom dl.helpers.utils import convert

res = sc.get(’vos://lsdr8_sample.csv’)df = convert(res,’pandas’)

To share files with others, users can placefiles within the public/ subdirectory in theirVOSpace, e.g.sc.mv(fr=’vos://lsd8_sample.csv’,

to=’vos://public/lsdr8_sample.csv’)

Another user can then access it, and convert thecontent to, e.g., a pandas DataFrame:res = sc.get(

’otheruser://public/lsdr8_sample.csv’)df = convert(res,’pandas’)

36https://www.ivoa.net/documents/ADQL/

March/April 2021 11

Page 12: Jupyter-enabled astrophysical analysis using data

Department Head

ACKNOWLEDGMENTWe thank the remainder of the Astro Data

Lab team members, including A. Bolton, M.Fitzpatrick, D. Herrera, L. Huang, D. Nidever,R. Pucha, and A. Scott. In particular, we thankB. Weaver for producing Figure 4, and for hiswork adapting the prospect spectral viewer towork at the Astro Data Lab. We also thank thethree reviewers for their useful comments thathelped us to improve this manuscript, as well asS. Gaudet (National Research Council Canada),and G. Snyder (STSci) for sharing informationabout astronomical science platforms in whichthey are directly involved, R. Thomas (NERSC)and G. Jacoby for comments on the manuscript.Astro Data Lab is part of NSF’s National Optical-Infrared Astronomy Research Laboratory (NOIR-Lab), which is operated by the Association ofUniversities for Research in Astronomy (AURA),Inc. under a cooperative agreement with the Na-tional Science Foundation. DESI is supported bythe Director, Office of Science, Office of HighEnergy Physics of the U.S. Department of Energyunder Contract No.DE–AC02–05CH1123, and bythe National Energy Research Scientific Comput-ing Center, a DOE Office of Science User Facilityunder the same contract; additional support forDESI is provided by the U.S. National ScienceFoundation, Division of Astronomical Sciencesunder Contract No. AST-0950945 to the NSF’sNOIRLab; the Science and Technologies Facili-ties Council of the United Kingdom; the Gordonand Betty Moore Foundation; the Heising-SimonsFoundation; the French Alternative Energies andAtomic Energy Commission (CEA); the NationalCouncil of Science and Technology of Mexico;the Ministry of Economy of Spain, and by theDESI Member Institutions. The DESI team ishonored to be permitted to conduct astronomicalresearch on Iolkam Du’ag (Kitt Peak), a moun-tain with particular significance to the TohonoO’odham Nation.

REFERENCES

1. M.J. Fitzpatrick, K. Olsen, F. Economou, et al., “The

NOAO Data Laboratory: a conceptual overview,” Society

of Photo-Optical Instrumentation Engineers (SPIE), vol.

9149, Proc. SPIE id. 91491T, 2014.

2. R. Nikutta, M. Fitzpatrick, A. Scott, and B.A. Weaver,

“Data Lab: A community science platform,” Astronomy

and Computing, vol. 33, p. 100411, 2020.

3. DESI Collaboration et al., “The DESI Experiment

Part I: Science, Targeting, and Survey Design,”

arXiv:1611.00036, 2016.

4. DESI Collaboration et al., “The DESI Experiment Part II:

Instrument Design,” arXiv:1611.00037, 2016.

5. Astropy Collaboration et al., “The Astropy Project: Build-

ing an Open-science Project and Status of the v2.0 Core

Package,” The Astronomical Journal, Volume 156, Issue

3, article id. 123, 19 pp., 2018. (journal)

6. D. Norman, “Can Big Data Lead an Inclusion Revolu-

tion?,” AstroBeat; Astronomical Society of the Pacific, No.

162, 2018.

7. K. Olsen, A. Bolton, A., S. Juneau, S., et al., “The Data

Lab: A science platform for the analysis of ground-based

astronomical survey data,” Bulletin of the American As-

tronomical Society, Vol. 51, p. 61, 2019.

8. V. Desai, M. Allen, c. Arviset, et al., “A Science Platform

Network to Facilitate Astrophysics in the 2020s,” Bulletin

of the American Astronomical Society, Vol. 51, p. 146,

2019.

9. D. L. Nidever, K. Olsen, A. R. Walker, et al. 2017, AJ, 154,

199. doi:10.3847/1538-3881/aa8d1c

10. Y. Choi, D. L. Nidever, K. Olsen, et al. 2018, ApJ, 869,

125. doi:10.3847/1538-4357/aaed1f

11. N. F. Martin, D. L. Nidever, G. Besla, et al. 2015, ApJL,

804, L5. doi:10.1088/2041-8205/804/1/L5

Stephanie Juneau, Associate Astronomer at NSF’sNOIRLab and Data Scientist at the Astro Data Lab(in Tucson, AZ). Dr. Juneau received a Ph.D. in As-tronomy from the University of Arizona. Her scientificresearch is focused on galaxies and supermassiveblack holes. She is interested in applying statisticalmethods, advanced visualization techniques, and inimproving astrophysical science platforms. She is amember of the DESI and Euclid collaborations, andof the American Astronomical Society (AAS). Contacther at [email protected].

Knut Olsen, CSDC Principal Data Scientist at NSF’sNOIRLab (in Tucson, AZ). Dr. Olsen received hisPh.D. in Astronomy from the University of Wash-ington. He works on stellar populations in nearbygalaxies. Contact him at [email protected].

Robert Nikutta, Project Scientist for Astro DataLab at NSF’s NOIRLab (in Tucson, AZ). Dr. Nikuttareceived his Ph.D. from the University of Kentucky. Hisscientific interests range from the nuclear regions ofActive Galactic Nuclei to astroinformatics and big data

12 CiSE (accepted)

Page 13: Jupyter-enabled astrophysical analysis using data

research. Contact him at [email protected]

Alice Jacques, Data Analyst at NSF’s NOIRLab andAstro Data Lab (in Tucson, AZ). Alice received herB.S. degree in Physics and Computer Science fromGeorgia College and State University and her M.S.degree in Physics from the University of Louisville.Contact her at [email protected].

Stephen Bailey, Senior Software Developer in thePhysics Division at Lawrence Berkeley National Lab(in Berkeley, CA). Dr. Bailey received his Ph.D. inPhysics from Harvard University and currently leadsthe DESI Data Management team. He enjoys convert-ing raw data into useful data for the benefit of largescience collaborations, and has been doing so forover 25 years. Contact him at [email protected].

March/April 2021 13