EPFL & ETHZ

Summer School on Reproducibility in Computational Sciences 2018

09.09.18 - 13.09.18, Magliaso, Ticino, Switzerland

Three days full of workshops and talks on how to increase the impact and outreach of your research.

The summer school will introduce best practices and tools for reproducible research in computational sciences. Speakers from all over the world with outstanding contributions to reproducible science will give lectures and workshops on recommended tools and platforms specifically designed to promote reproducibility and openness. Furthermore, they will address the fallacies and pitfalls in computational data sciences and provide workflow guidelines in scientific programming, data processing and visualization for a variety of applications. The target audience is graduate and postgraduate researchers working in any field of computational and data science.

Improve your coding skills

How to make your code readable, maintainable, and scalable?

Increase the lifetime of your research

How to share code and data in a sustainable way?

Hear about large, successful scientific projects

How to create software that will be useful to others?

The event is jointly organized by EPFL and ETHZ.

Program

Sunday, 09.09.2018
16:30 - 17:30 Welcome and keynote: The motivation behind reproducible science. Why does it matter? — Luc Henry (EPFL) (Video, Slides)
Monday, 10.09.2018
09:00 - 10:00 A software development workflow for academic research — Wenzel Jakob (EPFL) (Slides)
10:00 - 11:00 The philosophy of reproducible quantitative methods — Christie Bahlai (Kent State University) (Video, Slides)
11:00 - 11:30 Break
11:30 - 12:30 Research from an engineering viewpoint: Software development process and best practices — Filip Pavetić (Google)
12:30 - 13:30 Lunch
13:30 - 14:30 Advanced git workshop — Sourabh Lal (EPFL, GitHub Campus Expert) (Video, Slides)
14:30 - 18:00 Tools for reproducible research (Workshop) — Tim Head (Wild Tree Tech) (Video, Slides)
Tuesday, 11.09.2018
09:00 - 10:00 Introduction to the Renku platform — Eric Bouillet (SDSC) (Slides)
10:00 - 10:45 The journal as a medium for publishing software — Kate Keahey (University of Chicago)
10:45 - 11:00 Break
11:00 - 11:30 Automatic tools for reproducible research — Kate Keahey (University of Chicago)
11:30 - 12:30 Techniques and guidelines for reporting reproducible and statistically sound results — Torsten Hoefler (ETH Zürich) (Video, Slides)
12:30 - 13:30 Lunch
13:30 - 15:00 Discussion Panel
15:00 - 15:30 Break
15:30 - 17:30 Data management in research — Ana Sesartic Petrus & Malin Michelle Ziehmer (Digital Curation Office, ETH Zürich) (Video, Slides, Workshop notes 1, 2)
17:30 - 18:00 Practical data management — Anna Krystalli (University of Sheffield) (dataspice, tutorial)
Wednesday, 12.09.2018
09:00 - 10:00 Publishing and maintaining open data — Bastian Greshake Tzovaras (Video, Slides)
10:00 - 11:00 Tools for reproducibility in Statistics and Machine Learning — Heidi Seibold (LMU Munich) (Video, Slides)
11:00 - 11:30 Break
11:30 - 12:30 Project work
12:30 - 13:30 Lunch
13:30 - 18:00 How to write the perfect reproducible paper (Workshop)
Christie Bahlai (Kent State University) & Anna Krystalli (University of Sheffield)
Thursday, 13.09.2018
09:00 - 11:00 Presentations
11:00 - 13:00 Good-bye apero and lunch

Speakers

Christie Bahlai

Christie is an Assistant Professor at Kent State University and head of the Bahlai Lab. She is an applied quantitative ecologist and population ecologist who uses approaches from data science to help solve problems in conservation, sustainability, and ecosystem management. She combines a background in physics and organismal ecology with influences from the tech sector and conservation NGOs to ask questions and build tools addressing problems in population ecology.

Christie is a strong advocate of open science and has taught workshops and classes on open data management and data analysis, such as this amazing workshop, to promote higher reproducibility standards in science. During this summer school, she will lead a workshop exploring the reproducibility of papers prepared using open science methods and talk about some of the philosophies and motivations for these approaches.

Eric Bouillet

Eric Bouillet received his PhD degree in Electrical Engineering from Columbia University, New York, NY in June 1999. He joined the IBM T.J. Watson Research Center, Hawthorne, NY in June 2004 and worked at the IBM Smarter City Technical Centre, Dublin from October 2010 to August 2016. While at IBM he worked on scalable data stream analytics applied to a number of fields, including finance, law enforcement, telecommunications, environmental monitoring, intelligent transport systems, and aircraft reliability control systems. He is currently Head of Engineering at the Swiss Data Science Center (SDSC).

Eric will share with us his experience in building large-scale analytics systems and the challenges that arise from an engineering point of view. He will also introduce the Renku platform, an SDSC-based platform which facilitates collaboration and reproducibility for data scientists.

Bastian Greshake Tzovaras

Bastian is a biologist-turned-bioinformatician. In 2011 he co-founded openSNP, an open data repository that allows individuals to donate their personal genomes into the public domain. Since then over 4,100 datasets have been donated through openSNP, making it the world's largest database of its kind. After he finished his PhD in Bioinformatics he joined the Open Humans Foundation. As its Director of Research, Bastian facilitates participatory research projects and empowers individuals to take control over sharing their personal data. In addition, Bastian mentors the next generation of Open Leaders with Mozilla and serves on the Board of the Open Bioinformatics Foundation.

While our algorithms become more and more powerful they also become hungrier and hungrier - for data. The large-scale success of machine learning these days is thus not only a function of better algorithms and open source software packages, but also of ever-growing training data sets. It is for this reason that reproducibility not only depends on good software engineering practices, but also on open data. Bastian will share his experiences in making data open and maintaining it.

Tim Head

Tim is an independent software developer and teacher. He works primarily on data science tools (such as binder) and teaches machine-learning courses for programmers, scientists and engineers. He has worked with International Organisations in Geneva, small startups in Zurich and academics around the world. Tim trained as an experimental physicist and worked at CERN and EPFL for several years.

Tim will introduce the Binder project to us. Binder drastically lowers the bar to sharing and re-using software. For a user, trying out someone else’s work requires only clicking a single link. For an author, preparing a Binder-ready project is much easier than supporting many different platforms and, for many projects, involves little additional work.

Luc Henry

Luc Henry spent around ten years exploring science in various ways and places in Europe. In 2015, he was the Managing Editor of European science magazine Technologist. He then worked at the Swiss National Science Foundation before taking his current position as an advisor to the President of EPFL. Luc holds a PhD in chemical biology from the University of Oxford.

Luc has been interested in all aspects of open science since his early days as a researcher. In his presentation, he will give an overview of how science is changing, and why the transparency and reproducibility of research results are embedded in a broader movement to make knowledge more accessible and useful.

Torsten Hoefler

Torsten is an Associate Professor of Computer Science at ETH Zürich, Switzerland. Before joining ETH, he led the performance modeling and simulation efforts of parallel petascale applications for the NSF-funded Blue Waters project at NCSA/UIUC. He is also a key member of the Message Passing Interface (MPI) Forum where he chairs the "Collective Operations and Topologies" working group. His research interests revolve around the central topic of "Performance-centric System Design" and include scalable networks, parallel programming techniques, and performance modeling.

Torsten has been pushing reproducibility in the field of high performance computing for years. During the summer school he will teach techniques and guidelines for reporting reproducible and statistically sound results.

Wenzel Jakob

Wenzel is an assistant professor leading the Realistic Graphics Lab at EPFL's School of Computer and Communication Sciences. His research mostly revolves around light transport simulations and material appearance models. The overarching goal of this work is to improve the accuracy and fidelity of light transport simulations in computer graphics, and to make the underlying models accurate and fast enough for predictive design and manufacturing applications.

Wenzel will introduce us to the software development workflow used by his group, which enforces maintainable and reproducible development practices without losing the agility that is needed to make progress on a challenging research problem. He will step through all parts of the stack including Jupyter notebooks, C++14, pybind11, continuous integration and Git submodules in a hands-on manner. He will also share his experiences in creating projects that ultimately became relatively widely used and some of the positive and unexpected fallout that this has created.

Kate Keahey

Kate is a Scientist at Argonne National Laboratory and a Senior Fellow at the Computation Institute at the University of Chicago. She is one of the pioneers of infrastructure cloud computing. She created the Nimbus project, recognized as the first open source Infrastructure-as-a-Service implementation, and continues to work on research aligning cloud computing concepts with the needs of scientific datacenters and applications.

During the summer school, Kate will share her view on challenges arising for researchers practicing data-intensive science, as well as her experience and motivation in creating and maintaining large-scale open-source projects supporting such endeavors.

Anna Krystalli

Anna is a Research Software Engineer at the University of Sheffield, helping researchers do more with their code and data. With a background in computational ecology, Anna is interested in open source methodological innovation in science, and community and capacity around modern open science tools and practices. She is a member of vibrant open science communities such as rOpenSci and the Mozilla Open Leaders network, where she was a member of the inaugural cohort and a mentor in subsequent training rounds, and she is a co-organizer of the Sheffield R users group.

During the summer school, Anna will teach principles and best practice for conducting reproducible science, focusing on data and workflow management.

Sourabh Lal

Sourabh is a master's student at EPFL studying Computer Science and specializing in Information Systems. He is also the founder of LauzHack - EPFL's student hackathon. Prior to coming to Switzerland he studied at Jacobs University in Germany and Carnegie Mellon University in the USA. Alongside his studies, Sourabh has gained experience working at several companies including Logitech, CERN and Fraunhofer. Sourabh is currently a GitHub Campus Expert - a role that enables him to help empower student developers at EPFL.

During his workshop, Sourabh will introduce some of Git/GitHub's slightly more advanced features that make it easier to collaborate and contribute towards open source development.

Filip Pavetić

Filip is a software engineer at Google Zürich working on video analysis at YouTube. His most recent work is oriented around developing machine learning models and integrating them into complex systems. Before that he did two internships at Facebook, where he worked on spam detection and infrastructure, and a research internship at EPFL.

Filip will describe the experience and difficulties of applying research from an engineering viewpoint. Additionally, he will talk about the day-to-day software development process, covering practices in coding style, code review and testing.

Heidi Seibold

Heidi is a statistician at LMU Munich, where she works on statistical methods for personalized medicine. She is a core member of OpenML, assistant editor at the Journal of Statistical Software (responsible for reproducibility checks) and a member of the LMU Open Science Center. Heidi is passionate about open science, open source and reproducible research.

Heidi's talk will be about reproducibility in Statistics and Machine Learning. She will cover tools and workflows to tackle the challenges in this field based on examples of her own research.

Ana Sesartic Petrus & Malin Michelle Ziehmer

Ana and Malin are responsible for data management planning and teaching at the Digital Curation Office, ETH Library, ETH Zurich. Before joining the library, they both worked in research and received their PhDs in Climate Sciences - Ana from ETH Zurich and Malin from the University of Bern. During their research careers, they were confronted with the daily challenges of data management in modelling (Ana) and experimental (Malin) research. This experience strongly motivated them to train researchers at any career stage in data management and data management planning.

Ana and Malin will give an interactive introduction to what data management is and why it concerns all of us. As the research data lifecycle is likely to outlive a researcher's project at a particular institution, a data management plan is essential: it describes measures for active research data management, data sharing, and the long-term preservation of research data for potential future re-use.

Location


Centro Magliaso

Organizers

Athena Economides

Computational Science & Engineering Lab
ETHZ

Frederike Dümbgen

Audiovisual Communications Lab
EPFL

Ivica Kičić

Computational Science & Engineering Lab
ETHZ

Martin Müller

Digital Epidemiology Lab
EPFL