[Usecase] Notes & Actions: DesignSafe Team Meeting 11/18/24
Natalie Henriques
natalie at tacc.utexas.edu
Tue Nov 19 09:24:04 CST 2024
DesignSafe Team Meeting
November 18, 2024
Attendees: Ellen, Fred, Silvia, Gilbert, Kayla, Raul, Tim, Jamie, Ahsan, Jean-Paul, Laura, Gilberto, Krishna, Natalie
Action Items
* Silvia to schedule a meeting with Kayla, Yang, Carlos, and Tim to discuss Kayla's machine learning workflow and potential optimizations.
* Kayla to send her presentation slides to Dr. Kumar.
* Tim to explore the possibility of implementing a containerized Jupyter environment similar to Chishiki on DesignSafe.
* Ellen to plan the next team meeting focusing on AI and machine learning resources, including inviting Clint's team.
* Silvia to follow up with Kayla on submitting jobs to multiple nodes on Stampede 3 to potentially speed up her machine learning tasks.
* Team to review Silvia's use-case spreadsheet and get back to her within two weeks to discuss: https://docs.google.com/spreadsheets/d/1mZB91I7j4UW1rkbQTEOo6QkMA7jGMYKdkDqIlcUrfho/edit?usp=sharing
Use Case Write Ups
Jupyter HPC Use Case Progress
Ellen initiated a meeting to discuss progress on the use cases and the use of the Jupyter HPC. Silvia was tasked with checking in on use cases, while Kayla was to provide an update on her successful use of the Jupyter HPC. Silvia then shared her screen and walked through her notes on the use cases, highlighting some errors and suggesting updates to the content in the notebooks and the documentation. She also mentioned the need to switch over to Tapis v3 and the creation of template notebooks for this purpose. The team was encouraged to familiarize themselves with the new setup.
Jupyter Notebooks Accessibility and Organization
Silvia, Ellen, Fred, Tim, and Gilbert discussed the organization and accessibility of their use cases and Jupyter notebooks. They agreed to move certain user guides to the top of the visualization apps section and to add a link to a demonstration of combining data sets into a workflow on Taggit and HazMapper. They also discussed the issue of users needing to sign in with Jupyter Hub to access the notebooks, and the recommendation to automatically open a JupyterLab 2024 session for users coming from the use cases. However, they encountered a problem with the order of the options in the JupyterLab interface, which they plan to address. The team also discussed the long-term viability of their notebooks as they migrate to Python 3.9.
Notebook Image Rendering and Viewing
Silvia discussed the technical aspects of notebooks, including the issue of images not always rendering. She proposed three options for users: viewing the notebook, copying the content to their local drive, or opening it in Jupyter Hub. Scott suggested that users should run the notebook directly for a full experience, while Silvia emphasized the need for a previewer to view the content before deciding to run it. Ellen agreed with Scott's point about potential confusion if the images don't load correctly. The team agreed to explore ways to make the images work and to consider the option of saving the notebook as an HTML file.
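The HTML-export fallback mentioned above can be scripted with the standard `jupyter nbconvert` CLI. This is a generic sketch (the helper names and the `demo.ipynb` path are illustrative, not from the meeting):

```python
import subprocess

def nbconvert_html_cmd(notebook_path):
    """Build the nbconvert command that renders a notebook to a
    standalone HTML file with outputs and images embedded."""
    return ["jupyter", "nbconvert", "--to", "html", notebook_path]

def export_html(notebook_path):
    """Run the export (requires jupyter/nbconvert on PATH)."""
    subprocess.run(nbconvert_html_cmd(notebook_path), check=True)
```

The resulting .html file could be linked from the use-case pages so readers see rendered images without first signing in to Jupyter Hub.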
ML use case with Jupyter HPC
Optimizing Neural Network for GPU
Kayla is working on optimizing a neural network for GPU usage. Silvia suggested submitting the job directly to the HPC system instead of running it on a single node through the Jupyter Hub, since this would allow parallelization across multiple nodes and faster training at lower cost. Ellen clarified that using the Jupyter Hub limits resources to a single node. Silvia and Kayla plan to follow up to understand Kayla's current setup and explore options for distributed training on the HPC system.
Stampede 3 CPU Limitations Discussed
In the meeting, Scott, Kayla, and Silvia discussed the limitations of the CPU on Stampede 3, which only allows for 4 cores to be used for parallel computing. Kayla mentioned that she could use up to 12 cores on her own CPU. The team also discussed the potential for running multiple jobs simultaneously on multiple nodes, which could significantly speed up processing time. Tim suggested that all three apps folks, including Silvia, Yang, and Carlos, should get hands-on experience with Kayla's problem-solving approach. The team agreed to schedule a meeting to discuss this further. Ellen mentioned that she had invited Carlos to discuss how his research group is using Jupyter Hub on TACC resources, which could be relevant for their work on DesignSafe.
Complexity of Running Scripts on Stampede 3
Scott raised a concern about the complexity of running scripts on Stampede 3, particularly the need to specify an app to run them on. Silvia suggested that Kayla could set everything up and then just SSH in to call the job directly on Stampede 3, bypassing the need for an app. However, Scott pointed out that Stampede 3 requires specifying how to run the job, which could involve specifying a certain version of PyTorch or other complications. Kayla agreed, noting that testing on different platforms can be challenging. Silvia then explained a process she had gone through with Wang Yang and a student from Clemson, which involved installing Conda, setting up a virtual environment with a specific version of Python, and adding necessary packages. She suggested that Kayla could build her own virtual environment with the required packages and submit it to Slurm. Ellen expressed concern about the complexity of the process, while Raul asked Kayla if she was running the examples directly from Jupyter.
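The workflow Silvia describes, building a Conda environment and then handing the job to Slurm, can be sketched as a batch script submitted with `sbatch`. The job name, environment name, node count, and `train.py` below are placeholders, not the actual Stampede 3 configuration:

```python
import subprocess
import textwrap

# All names here (job name, env name, train.py) are hypothetical placeholders.
BATCH_SCRIPT = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH -J train-model
    #SBATCH -N 1                # single node; raise for multi-node runs
    #SBATCH -t 02:00:00
    source $HOME/miniconda3/etc/profile.d/conda.sh
    conda activate mlenv        # env built with the pinned PyTorch version
    python train.py
    """)

def submit(batch_script=BATCH_SCRIPT):
    """Pipe the script to sbatch on a login node and return Slurm's
    reply (e.g. 'Submitted batch job <id>')."""
    result = subprocess.run(["sbatch"], input=batch_script,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

Once the environment is built, resubmitting variants of the job is a one-line change to the script, which is the payoff for the setup cost Ellen was concerned about.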
Optimizing Processing Power and Parallelization
The team discussed strategies for optimizing processing power and parallelization. Raul suggested using multi-processing to exploit the number of cores available, while Kayla mentioned setting the number of workers to negative one to grab as many workers as are available. Silvia recommended looking into concurrent.futures, a newer interface for multi-processing, and suggested considering parallelization at different levels, including within scripts or at the Slurm level. Tim and Silvia also discussed the challenges of working in a GPU environment. Kayla mentioned a warning about saving data in a less accessible location, which could impact processing speed. The team also briefly touched on the topic of AI and machine learning, with Krishna expressing interest in the slides from the meeting.
Chishiki Jupyter features
Custom Python Environments on Supercomputing
Krishna discussed the goal of running custom Python environments on supercomputing systems like Lonestar or Frontera. He explained the issue of needing a system-level package for installation and proposed a solution of developing a containerized environment with its own Jupyter. This would link to the Jupyter kernels and allow users to access their custom packages. Krishna demonstrated how this was done in a course, emphasizing the ease of creating a custom container and linking it to the Jupyter GPU environment. He also mentioned the possibility of doing this on the DesignSafe platform. The team discussed the benefits of this approach, including the ability to use multiple nodes and the ease of installing custom packages. Scott suggested that these containers could also be used to submit jobs without Jupyter, which Krishna confirmed. The team agreed to focus on AI and machine learning resources in their next meeting.
---
Natalie Henriques, PMP
Project Manager
Texas Advanced Computing Center (TACC)
The University of Texas at Austin
Email: natalie at tacc.utexas.edu