[Usecase] Notes & Actions: DesignSafe Team Meeting 12/16/24

Natalie Henriques natalie at tacc.utexas.edu
Tue Dec 17 16:12:44 CST 2024


DesignSafe Team Meetings
December 16, 2024


Quick recap

The team discussed the use of DesignSafe resources for workflow management, specifically for ADCIRC, and the process of generating machine learning-ready data from high-fidelity simulation outputs. They also explored the advantages and challenges of using HPC Jupyter for their work, and the potential of customizing the environment for users when connecting to the system. Lastly, they discussed the use of different file systems for data storage and access, and the need for a more streamlined process for users.

Next steps

* Carlos to ping Sal about fixing the user ID issue in the Jupyter container.
* Silvia to submit a ticket regarding the SSH access issue from regular Jupyter hub.
* Ellen and Krishna to review and streamline the Jupyter/container setup in January, potentially involving Carlos.
* Carlos to connect with SimCenter about standardizing data formats for surrogate modeling outputs.
* Natalie to schedule meetings between Ellen and use case PIs in mid-to-late January to discuss roles going forward.


Summary

DesignSafe for ADCIRC Workflow Management

  *   Ellen introduced the topic of using DesignSafe resources for workflow management, specifically for ADCIRC. Carlos then took over, discussing the use of surrogate modeling for ADCIRC on DesignSafe.


  *   He highlighted the three main areas of focus: collecting a dataset of high-fidelity model outputs, fitting the machine learning model to the high-fidelity data, and validating the model on new inputs.


  *   Carlos also emphasized the importance of clearly defining inputs and outputs for the machine learning model. He mentioned that they have already published a couple of models and are actively working on two more. The discussion ended with Carlos expressing his desire for the conversation to focus on the current capabilities of DesignSafe.

Generating Machine Learning-Ready Simulation Data

  *   Carlos discussed the process of generating machine learning-ready data from high-fidelity simulation outputs, emphasizing the importance of understanding the input-output relationship and the need for high-performance computing (HPC) resources. He mentioned the use of the parametric job launcher, Pylauncher, and a custom Tapis application for coordinating ensembles of simulations (a rough sketch of the ensemble pattern appears at the end of this section).


  *   Ellen suggested that the parametric job launcher could be more widely useful and asked about integration with DAPI. Carlos agreed and mentioned the need to update the application to the Tapis V3 platform. He also discussed the challenges of going from raw simulation data to a machine learning problem, including the need to account for physical constraints and the risk of producing poor features for the model to learn from.


  *   The team also discussed the importance of documenting the process well and the potential for publishing the datasets and trained models in the same DesignSafe dataset for reproducibility. Pedro asked where the data is stored, and Carlos explained that they use scratch for active simulations and move data to Corral when needed. Ellen pointed out that HPC Jupyter currently does not allow access to data in Corral.
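
As a rough illustration of the ensemble pattern described above (not the team's actual Tapis application), TACC's Pylauncher can sweep a file of independent simulation commands across a single batch allocation. The file name and run script below are placeholders, and the import follows the documented ClassicLauncher example (older releases imported the module as pylauncher3):

    # launcher.py -- minimal Pylauncher sketch; intended to run inside a
    # single multi-core batch allocation. "commandlines" holds one
    # independent shell command per line, e.g.:
    #   ./run_adcirc.sh case_001
    #   ./run_adcirc.sh case_002
    import pylauncher

    # Dispatch each line of "commandlines" onto a free block of `cores`
    # cores; Pylauncher starts new tasks as earlier ones finish.
    pylauncher.ClassicLauncher("commandlines", cores=4)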

HPC Jupyter Challenges and Solutions

  *   Carlos discussed the advantages and challenges of using HPC Jupyter for their work. He highlighted that HPC Jupyter provides easier access to HPC work and scratch directories, facilitating workflows. However, he noted that the containerization of the job in HPC Jupyter instances is problematic.


  *   He also expressed a desire for multi-node HPC Jupyter and for displaying queue wait times and job statuses, similar to the TACC Analysis Portal (TAP) system. Pedro clarified that HPC Jupyter provides access to HPC work and scratch directories. Ellen questioned why HPC Jupyter doesn't act like TAP, to which Dan responded that it's due to the containerization. Silvia suggested that the containerization could be a way of prepackaging an environment ahead of time.


  *   The team agreed that while HPC Jupyter is more usable than before, there are still some issues with the containerization and the divergence of the two environments.

Customizing User Environment and Containers

  *   The team discussed the potential of customizing the environment for users when connecting to the system. They considered pre-installing packages and setting up the environment to make it more user-friendly (see the sketch at the end of this section). The idea of containerization was also discussed, with the team acknowledging its limitations, such as the inability to run containers within containers.


  *   They also discussed the benefits of using containers to mimic the DesignSafe data structure and the ability to access the scratch directory through DesignSafe. The team agreed that these features and options could be added as the system is configured, allowing users to customize their environment and install software as needed.
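
One lightweight way to approximate the pre-installed environment discussed above, assuming only a standard Jupyter and pip setup (this is a sketch, not a documented DesignSafe mechanism): install packages into the user site, which lives in $HOME and therefore persists across container sessions that remount $HOME.

    # Minimal sketch: install a package with the same interpreter the
    # notebook kernel is running, into ~/.local (--user), so it survives
    # container restarts that preserve $HOME.
    import subprocess
    import sys

    def user_install(package):
        subprocess.check_call(
            [sys.executable, "-m", "pip", "install", "--user", package]
        )

    user_install("netCDF4")  # hypothetical example package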

File System Differences and Jupyter Hub

  *   The team discussed the differences between the various file systems, including work, scratch, and My Data. They clarified that work is a single global file system shared across systems, with less bandwidth to each system, while scratch is separate on each machine and has more bandwidth.


  *   They also discussed the confusion between the uppercase and lowercase 'work' file systems, which are the same file system mounted differently. Carlos shared his method of creating symbolic links to scratch for easier access (sketched at the end of this section).


  *   The team also discussed the differences between the regular Jupyter hub and the HPC Jupyter hub, with the latter not mounting project (My Projects) directories.
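
A minimal sketch of the symbolic-link trick Carlos described, assuming the standard TACC $SCRATCH environment variable is set; the link location is illustrative:

    import os

    scratch = os.environ["SCRATCH"]          # scratch root on the current system
    link = os.path.expanduser("~/scratch")   # hypothetical link name in $HOME

    # Point ~/scratch at the scratch filesystem so it is reachable from
    # environments that mount $HOME but hide the real scratch path.
    if not os.path.islink(link):
        os.symlink(scratch, link)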

File Systems and Machine Learning

  *   The team discussed the use of different file systems for data storage and access. Ellen and Dan clarified that 'work' is a faster and more accessible file system compared to 'My Data'. Silvia added that she uses 'My Data' for storing her data and 'work' for faster access and flexibility.


  *   The team also discussed the challenges of moving data between file systems and the need for increased access to resources. Carlos mentioned that they are moving towards multi-node training for larger datasets, which is not currently configured.


  *   The team also discussed their use of PyTorch and TensorFlow for machine learning, with Carlos mentioning that they are updating their architecture to use PyTorch. The team agreed on running scripts through the Jupyter environment for initial testing and debugging, while production runs will go through batch job submissions (sketched below).
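
A hedged sketch of that Jupyter-first, batch-second pattern in PyTorch; the network, tensor shapes, and file names are placeholders rather than the team's actual surrogate architecture:

    import torch
    import torch.nn as nn

    # Toy surrogate: map simulation input parameters to output quantities.
    # Random data stands in for the real high-fidelity ensemble.
    X = torch.randn(512, 8)    # 512 runs x 8 input parameters
    Y = torch.randn(512, 32)   # 512 runs x 32 output quantities

    model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 32))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    # Debug this loop interactively in Jupyter ...
    for epoch in range(200):
        optimizer.zero_grad()
        loss = loss_fn(model(X), Y)
        loss.backward()
        optimizer.step()

    # ... then submit the same script as a batch job for production runs.
    torch.save(model.state_dict(), "surrogate.pt")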

SSH and Data Sharing Challenges

  *   In the meeting, Silvia raised an issue about logging in and not being able to SSH from the regular Jupyter hub. Carlos suggested specifying the username explicitly in the SSH command (e.g., via ssh's -l <username> option), which Dan confirmed would work. However, Silvia mentioned that this didn't work for her on the OpenSees VM. Carlos suspected a bug and recommended reporting it.


  *   Ellen and Carlos discussed the need for a more streamlined process for users, especially novice ones, and the importance of standardizing data formats for better data sharing. Laura suggested connecting with the SimCenter to establish a format for incorporating data into R2D simulations.


  *   Ellen announced that the next meeting would be the last before the holidays and that she would schedule individual meetings with each PI to discuss roles and opportunities going forward.




---
Natalie Henriques, PMP
Project Manager
Texas Advanced Computing Center (TACC)
The University of Texas at Austin
Email: natalie at tacc.utexas.edu
