add submit scripts and job outputs

Commit a71dd9df authored 3 months ago by Marie Weiel
--- a/3_ensembles/solutions/sh/slurm_dataloaders.out
+++ b/3_ensembles/solutions/sh/slurm_dataloaders.out
+[1/2]: Loading data using truly parallel dataloader...
+[0/2]: Loading data using truly parallel dataloader...
+File size is 2390277560 bytes.
+After Allgatherv: All line starts: [         0        479        957 ... 2390276125 2390276602 2390277082]
+[0/2]: Construct array with line starts and lengths in bytes.
+[1/2]: Construct array with line starts and lengths in bytes.
+[0/2]: Make global train-test split.
+[1/2]: Make global train-test split.
+[0/2]: Decode 1250000 test samples from file.
+[1/2]: Decode 1250000 test samples from file.
+[0/2]: Draw local 1875000 train indices.
+[0/2]: Decode train lines from file.
+[1/2]: Draw local 1875000 train indices.
+[1/2]: Decode train lines from file.
+Elapsed time truly parallel data loading: global average 2.5e+02s, local 2.5e+02s
+[0/2]: Loading data using root-based dataloader...
+[1/2]: Loading data using root-based dataloader...
+There are 3750000 train and 1250000 test samples.
+Local train samples: [1875000 1875000]
+train_indices have shape (3750000,).
+Elapsed time root-based data loading: global average 36s, local 36s
+[0/2]: DONE.
+Parallel: Local train samples / targets have shapes (1875000, 18) / (1875000,).
+Parallel: Global test samples / targets have shapes (1250000, 18) / (1250000,).
+Root: Local train samples / targets have shapes (1875000, 18) / (1875000,).
+Root: Global test samples / targets have shapes (1250000, 18) / (1250000,).
+[1/2]: DONE.
+Parallel: Local train samples / targets have shapes (1875000, 18) / (1875000,).
+Parallel: Global test samples / targets have shapes (1250000, 18) / (1250000,).
+Root: Local train samples / targets have shapes (1875000, 18) / (1875000,).
+Root: Global test samples / targets have shapes (1250000, 18) / (1250000,).
+
+============================= JOB FEEDBACK =============================
+
+NodeName=uc2n[113,116]
+Job ID: 22915408
+Cluster: uc2
+User/Group: ku4408/scc
+State: COMPLETED (exit code 0)
+Nodes: 2
+Cores per node: 80
+CPU Utilized: 00:09:44
+CPU Efficiency: 1.12% of 14:26:40 core-walltime
+Job Wall-clock time: 00:05:25
+Memory Utilized: 2.58 GB
+Memory Efficiency: 1.47% of 175.78 GB
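Note on the log above: both ranks report "Construct array with line starts and lengths in bytes", and the gathered line starts (0, 479, 957, ...) are byte offsets into the 2390277560-byte file. The dataloader's implementation is not part of this commit, so the following is only a minimal single-process sketch of that one step, assuming newline-terminated records. The function name `line_starts_and_lengths` is hypothetical; in the MPI version each rank would presumably scan only its own chunk of the file before the offsets are combined with Allgatherv.

```python
import numpy as np


def line_starts_and_lengths(buffer: bytes) -> np.ndarray:
    """Return an (n_lines, 2) array of [start, length] in bytes for each line.

    Hypothetical helper for illustration; lengths include the trailing newline.
    """
    data = np.frombuffer(buffer, dtype=np.uint8)
    newlines = np.flatnonzero(data == ord("\n"))
    # A line starts at offset 0 and directly after every newline.
    starts = np.concatenate(([0], newlines + 1))
    # Drop the empty trailing "line" if the buffer ends with a newline.
    if starts[-1] == len(data):
        starts = starts[:-1]
    # Each line ends where the next one starts; the last one ends at the buffer end.
    ends = np.concatenate((starts[1:], [len(data)]))
    return np.stack((starts, ends - starts), axis=1)


print(line_starts_and_lengths(b"ab\ncde\nf"))
```

With start and length known for every line, a rank can seek directly to the samples it was assigned and decode only those, which is what the "Decode ... from file" messages suggest.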
--- a/3_ensembles/solutions/sh/slurm_parallel.out
+++ b/3_ensembles/solutions/sh/slurm_parallel.out
+[1/2]: Loading data...
+######################################################
+# Distributed Random Forest in Scikit-Learn with MPI #
+######################################################
+
+[0/2]: Loading data...
+Using truly parallel dataloader...
+File size is 2390277560 bytes.
+[1/2]: Construct array with line starts and lengths in bytes.
+After Allgatherv: All line starts: [         0        479        957 ... 2390276125 2390276602 2390277082]
+[0/2]: Construct array with line starts and lengths in bytes.
+[1/2]: Make global train-test split.
+[0/2]: Make global train-test split.
+[1/2]: Decode 1250000 test samples from file.
+[0/2]: Decode 1250000 test samples from file.
+[1/2]: Draw local 1875000 train indices.
+[1/2]: Decode train lines from file.
+[0/2]: Draw local 1875000 train indices.
+[0/2]: Decode train lines from file.
+Elapsed time data loading: global average 2.5e+02s, local 2.5e+02s
+[0/2]: DONE.
+Local train samples and targets have shapes (1875000, 18) and (1875000,).
+Global test samples and targets have shapes (1250000, 18) and (1250000,).
+Labels are [0. 0. 0. ... 0. 0. 1.]
+Elapsed time forest creation: global average 2.8e-05s, local 3.8e-05s
+[0/2]: Set up and train local random forest with 50 trees and random state 2.
+[1/2]: DONE.
+Local train samples and targets have shapes (1875000, 18) and (1875000,).
+Global test samples and targets have shapes (1250000, 18) and (1250000,).
+Labels are [1. 0. 0. ... 0. 0. 1.]
+[1/2]: Set up and train local random forest with 50 trees and random state 3.
+Elapsed time training: global average 8.8e+02s, local 8.8e+02s
+[0/2]: Evaluate random forest.
+[0/2]: Get predictions of individual sub estimators.
+[1/2]: Evaluate random forest.
+[1/2]: Get predictions of individual sub estimators.
+[1/2]: Calculate majority vote via histograms.
+[0/2]: Calculate majority vote via histograms.
+[1/2]: Local accuracy is 0.7971136, global accuracy is 0.7998568.
+[0/2]: Local accuracy is 0.79728, global accuracy is 0.7998568.
+Elapsed time test: global average 56s, local 56s
+[1/2]: Loading data...
+######################################################
+# Distributed Random Forest in Scikit-Learn with MPI #
+######################################################
+
+[0/2]: Loading data...
+Using root-based dataloader with Scatterv...
+There are 3750000 train and 1250000 test samples.
+Local train samples: [1875000 1875000]
+train_indices have shape (3750000,).
+Elapsed time data loading: global average 35s, local 35s
+[0/2]: DONE.
+Local train samples and targets have shapes (1875000, 18) and (1875000,).
+Global test samples and targets have shapes (1250000, 18) and (1250000,).
+Labels are [0. 0. 0. ... 1. 0. 0.]
+Elapsed time forest creation: global average 2e-05s, local 1.9e-05s
+[0/2]: Set up and train local random forest with 50 trees and random state 2.
+[1/2]: DONE.
+Local train samples and targets have shapes (1875000, 18) and (1875000,).
+Global test samples and targets have shapes (1250000, 18) and (1250000,).
+Labels are [0. 1. 1. ... 0. 1. 0.]
+[1/2]: Set up and train local random forest with 50 trees and random state 3.
+Elapsed time training: global average 8.8e+02s, local 8.8e+02s
+[0/2]: Evaluate random forest.
+[0/2]: Get predictions of individual sub estimators.
+[1/2]: Evaluate random forest.
+[1/2]: Get predictions of individual sub estimators.
+[0/2]: Calculate majority vote via histograms.
+[1/2]: Calculate majority vote via histograms.
+[0/2]: Local accuracy is 0.7977704, global accuracy is 0.800064.
+[1/2]: Local accuracy is 0.7974024, global accuracy is 0.800064.
+Elapsed time test: global average 57s, local 57s
+
+============================= JOB FEEDBACK =============================
+
+NodeName=uc2n[222,237]
+Job ID: 22915390
+Cluster: uc2
+User/Group: ku4408/scc
+State: COMPLETED (exit code 0)
+Nodes: 2
+Cores per node: 80
+CPU Utilized: 01:11:56
+CPU Efficiency: 1.23% of 4-01:36:00 core-walltime
+Job Wall-clock time: 00:36:36
+Memory Utilized: 3.12 GB
+Memory Efficiency: 1.77% of 175.78 GB
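Note on the log above: each rank reports a local accuracy (from its own 50 trees) and a global accuracy after "Calculate majority vote via histograms". The script itself is not shown in this commit, so the following is a hedged sketch of how such a vote can be formed, not the author's code. It assumes integer class labels, uses a hypothetical helper `class_histogram`, and replaces the cross-rank reduction (for example an MPI Allreduce with a sum) by a plain NumPy sum over two simulated ranks.

```python
import numpy as np


def class_histogram(tree_predictions: np.ndarray, n_classes: int) -> np.ndarray:
    """Count votes per class for each sample.

    tree_predictions has shape (n_trees, n_samples) and holds integer labels.
    Returns an array of shape (n_samples, n_classes).
    """
    n_samples = tree_predictions.shape[1]
    histogram = np.zeros((n_samples, n_classes), dtype=np.int64)
    for votes in tree_predictions:
        # Add one vote for the class this tree predicts for every sample.
        histogram[np.arange(n_samples), votes] += 1
    return histogram


# Two simulated ranks with three trees each, four test samples, two classes.
rank0 = np.array([[0, 1, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]])
rank1 = np.array([[0, 0, 1, 1], [0, 1, 1, 1], [0, 1, 1, 0]])

local_histograms = [class_histogram(p, n_classes=2) for p in (rank0, rank1)]
# Stand-in for an element-wise sum of the histograms across all ranks.
global_histogram = np.sum(local_histograms, axis=0)
global_vote = global_histogram.argmax(axis=1)

print([h.argmax(axis=1).tolist() for h in local_histograms])  # local votes
print(global_vote.tolist())                                   # global vote
```

Summing histograms keeps the communicated data at one small integer array per rank, independent of the number of local trees, and the summed counts yield the same vote as pooling all trees into one forest.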
--- a/3_ensembles/solutions/sh/slurm_serial.out
+++ b/3_ensembles/solutions/sh/slurm_serial.out
+########################################
+# Serial Random Forest in Scikit-Learn #
+########################################
+
+Loading data...
+DONE.
+Train samples and targets have shapes (3750000, 18) and (3750000,).
+First ten elements are: [[ 1.43997777e+00  1.63248479e+00 -9.67991173e-01  5.68317890e-01
+   1.46224272e+00  1.36972353e-01  8.19436967e-01  9.11546707e-01
+   1.21039116e+00 -5.06843209e-01  9.31933105e-01  1.27073431e+00
+   1.21000016e+00  1.47810447e+00  8.52468789e-01  1.11379075e+00
+   2.47141235e-02  6.13238990e-01]
+ [ 3.43844175e-01 -7.04108357e-01 -1.51597571e+00  5.51238716e-01
+   5.92378020e-01  1.29997504e+00  6.11836970e-01  2.91105419e-01
+   9.18442786e-01 -1.56051174e-01  4.62793738e-01  6.52638137e-01
+   1.25141346e+00  1.32741857e+00  4.72825408e-01  1.00110984e+00
+   9.48828042e-01  1.48614004e-01]
+ [ 5.07223248e-01  3.59210372e-01  4.22794193e-01  7.39285111e-01
+   8.21336746e-01 -2.84344971e-01  6.44737303e-01 -1.26930571e+00
+   8.19674611e-01 -1.98983908e-01  5.15161872e-01  7.95682371e-01
+   1.37060583e+00  1.57789564e+00  4.93557423e-01  1.17942798e+00
+   7.52048492e-01  2.89388001e-01]
+ [ 5.56809664e-01 -1.57331979e+00  1.35683000e+00  7.13156343e-01
+   7.61644185e-01  1.53109705e+00  7.97575951e-01 -2.99961656e-01
+   1.19725823e+00 -4.84998196e-01  1.09257638e+00  9.92057800e-01
+   8.05751562e-01  2.16137958e+00  1.08722878e+00  1.62078691e+00
+   4.51993123e-02  3.25273015e-02]
+ [ 3.59211493e+00  2.13520462e-03 -6.44264281e-01  1.90623689e+00
+   3.20609003e-01  3.08485538e-01  3.18263960e+00 -2.74604857e-01
+   2.91751528e+00  3.33075738e+00  2.50526094e+00  1.63983834e+00
+   5.80852270e-01  0.00000000e+00  1.69048905e+00  5.32282293e-01
+   7.83510923e-01  1.63441002e-01]
+ [ 7.58597434e-01 -1.89242971e+00 -1.67973864e+00  6.62953973e-01
+  -1.17093217e+00 -1.02506316e+00  1.43694293e+00  3.01254123e-01
+   2.15702868e+00 -1.12557161e+00  6.50241256e-01  1.39164054e+00
+   1.89918971e+00  2.74402332e+00  7.75830925e-01  2.08014536e+00
+   1.54032695e+00  4.80755001e-01]
+ [ 9.87198830e-01  1.21461833e+00  1.33140802e-01  1.17224276e+00
+  -9.93977427e-01 -9.41504121e-01  6.46450996e-01  1.56007898e+00
+   9.70402718e-01 -3.25230628e-01  1.75528979e+00  1.04119718e+00
+   5.26380122e-01  1.83662927e+00  1.74301934e+00  1.45607185e+00
+   2.37897053e-01  4.61494997e-02]
+ [ 1.70994639e+00 -8.22762012e-01 -1.13427246e+00  9.53481674e-01
+  -1.90318573e+00  4.29294139e-01  1.42728555e+00  3.82643938e-01
+   2.05661044e-01 -9.03856218e-01  1.40107632e+00  1.72245789e+00
+   1.09095061e+00  0.00000000e+00  1.44269097e+00  1.23851955e+00
+   1.05762923e+00  4.25258994e-01]
+ [ 9.43593442e-01  1.99409112e-01  8.14792871e-01  7.85433173e-01
+  -4.54714209e-01 -1.14372945e+00  5.78886461e+00  6.90133393e-01
+   2.04307199e+00  5.17880249e+00  7.89597631e-01  1.86898780e+00
+   2.10046601e+00  1.05997956e+00  1.11436117e+00  1.30555189e+00
+   1.50189817e+00  7.06036985e-01]
+ [ 7.64213026e-01  1.81087300e-01 -1.32228279e+00  6.47381306e-01
+  -7.69020736e-01 -2.73123175e-01  1.21376109e+00  7.54100859e-01
+   1.82200515e+00 -8.98288131e-01  6.92662835e-01  1.20266879e+00
+   1.54078197e+00  2.04965854e+00  7.89022744e-01  1.62554586e+00
+   1.52046657e+00  3.61533999e-01]] and [0. 0. 1. 0. 1. 0. 0. 1. 1. 0.]
+Test samples and targets have shapes (1250000, 18) and (1250000,).
+First ten elements are: [[ 6.91279709e-01  1.76920056e+00 -1.71181107e+00  4.76238310e-01
+   1.54056251e+00 -2.67858446e-01  8.72983932e-01  2.78744638e-01
+   1.08713210e+00 -5.51878333e-01  5.12543857e-01  8.91626298e-01
+   1.54371738e+00  8.60364318e-01  5.83059669e-01  6.91559553e-01
+   1.50008941e+00  5.59077024e-01]
+ [ 4.17667389e-01 -1.88861191e+00 -2.25399002e-01  5.72387338e-01
+  -1.19409788e+00  7.04542026e-02  6.95429265e-01  1.66337061e+00
+   1.04392505e+00 -3.82097363e-01  4.27827954e-01  8.09540808e-01
+   1.67913723e+00  1.76087189e+00  4.39944327e-01  1.31074250e+00
+   1.36796749e+00  1.91391006e-01]
+ [ 2.74744093e-01 -1.38396299e+00 -1.03303587e+00  4.39079940e-01
+   1.67037070e+00 -6.75462365e-01  7.48247623e-01  9.07953739e-01
+   1.12321198e+00 -4.34616148e-01  9.00804639e-01  7.02887774e-01
+   6.92424834e-01  1.52986670e+00  9.32529151e-01  1.16853487e+00
+   1.54546940e+00  2.30877008e-03]
+ [ 1.16500580e+00  1.00420511e+00 -1.22337127e+00  1.62781668e+00
+  -3.76755238e-01  8.39448214e-01  2.00318694e+00  1.60608697e+00
+   2.76874542e+00  2.29274392e+00  1.51524031e+00  1.37557471e+00
+   8.05598021e-01  0.00000000e+00  1.52915168e+00  1.25636113e+00
+   1.52380025e+00  4.66813985e-03]
+ [ 1.40449703e+00 -1.45557001e-01  5.21603674e-02  6.55963719e-01
+   1.24605799e+00  1.40364683e+00  6.71340346e-01 -1.68536985e+00
+   6.29944921e-01 -3.38737279e-01  1.19020391e+00  1.10914612e+00
+   8.26955855e-01  7.59544134e-01  1.14556456e+00  1.11023533e+00
+   2.60205507e-01  3.26079011e-01]
+ [ 2.63595748e+00  6.21905744e-01  1.27720702e+00  2.45305467e+00
+   1.10806942e+00 -5.16619682e-01  7.82781899e-01  1.04785419e+00
+   4.86200094e-01  9.83858526e-01  2.23186541e+00  1.23496139e+00
+   4.91022170e-01  6.09043658e-01  2.15673685e+00  4.74038422e-01
+   5.26653826e-01  4.24163006e-02]
+ [ 8.90232325e-01 -9.47344065e-01  6.88447535e-01  7.76006341e-01
+  -1.44117546e+00 -4.05649900e-01  4.55961943e-01  3.85926753e-01
+   3.63408327e-01  7.74472058e-01  7.38129973e-01  3.90631944e-01
+   4.69624609e-01  0.00000000e+00  6.06057763e-01  2.01749176e-01
+   2.38361638e-02  6.27153963e-02]
+ [ 8.09924126e-01  1.76636660e+00  1.50027168e+00  8.24404478e-01
+   9.41239297e-01 -9.60501552e-01  1.00532985e+00 -1.67402878e-01
+   1.49302173e+00 -5.72425246e-01  7.54992843e-01  1.13767517e+00
+   1.33718264e+00  1.82880926e+00  7.92772830e-01  1.42058229e+00
+   1.07851815e+00  3.78154993e-01]
+ [ 2.71482444e+00 -1.04809391e+00 -9.96524235e-04  2.07914996e+00
+  -7.31364310e-01 -1.69832718e+00  1.54993641e+00  1.68483472e+00
+   3.84886861e-01 -1.23952270e+00  2.09488463e+00  2.30906534e+00
+   9.78124380e-01  3.79600048e-01  2.12716794e+00  5.43927491e-01
+   1.45123398e+00  3.90201986e-01]
+ [ 6.09283328e-01 -8.08349609e-01  2.94039518e-01  9.17876959e-01
+  -3.87946656e-03  1.57775986e+00  9.13081244e-02  1.04072893e+00
+   1.14134960e-01  4.03893739e-01  6.74048424e-01  2.03140706e-01
+   2.67436147e-01  0.00000000e+00  6.38434172e-01  3.05932641e-01
+   7.21062347e-02  2.93387994e-02]] and [0. 0. 1. 1. 1. 1. 1. 1. 1. 0.]
+Time for data loading is 39.535260654985905 s.
+Set up classifier.
+Train.
+Time for training is 4739.061137255281 s.
+ Accuracy is 0.8003664.
+
+============================= JOB FEEDBACK =============================
+
+NodeName=uc2n378
+Job ID: 22915357
+Cluster: uc2
+User/Group: ku4408/scc
+State: COMPLETED (exit code 0)
+Nodes: 1
+Cores per node: 40
+CPU Utilized: 01:19:24
+CPU Efficiency: 2.47% of 2-05:34:40 core-walltime
+Job Wall-clock time: 01:20:22
+Memory Utilized: 7.58 GB
+Memory Efficiency: 17.25% of 43.95 GB
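Note on the three JOB FEEDBACK blocks above: the reported CPU efficiency is the utilized CPU time divided by the core-walltime, where the core-walltime is the wall-clock time multiplied by the number of nodes and cores per node. The sketch below reproduces that arithmetic from the values in the logs. The helper names `to_seconds` and `cpu_efficiency` are hypothetical and only serve this illustration.

```python
def to_seconds(slurm_time: str) -> int:
    """Convert a duration such as '4-01:36:00' or '00:09:44' to seconds."""
    days, _, clock = slurm_time.rpartition("-")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_utilized: str, wall_clock: str, nodes: int, cores_per_node: int) -> float:
    """CPU time used, as a percentage of the allocated core-walltime."""
    core_walltime = to_seconds(wall_clock) * nodes * cores_per_node
    return 100 * to_seconds(cpu_utilized) / core_walltime


# Values copied from the job feedback of the three runs.
jobs = {
    "dataloaders": cpu_efficiency("00:09:44", "00:05:25", nodes=2, cores_per_node=80),
    "parallel": cpu_efficiency("01:11:56", "00:36:36", nodes=2, cores_per_node=80),
    "serial": cpu_efficiency("01:19:24", "01:20:22", nodes=1, cores_per_node=40),
}
for name, efficiency in jobs.items():
    print(f"{name}: {efficiency:.2f}%")
```

The results match the reported 1.12%, 1.23% and 2.47%. In the serial job the utilized CPU time (01:19:24) is close to the wall-clock time (01:20:22), which corresponds to roughly one busy core out of the 40 allocated.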
--- a/3_ensembles/solutions/sh/submit_parallel.sh
+++ b/3_ensembles/solutions/sh/submit_parallel.sh
+#!/bin/bash
+
+#SBATCH --job-name=RF2                     # Job name
+#SBATCH --partition=multiple               # Queue for the resource allocation
+#SBATCH --nodes=2                          # Number of nodes
+#SBATCH --time=70:00                       # Wall-clock time limit
+#SBATCH --ntasks-per-node=1                # Maximum count of tasks per node
+#SBATCH --cpus-per-task=40                 # Number of CPUs per task
+#SBATCH --mail-type=ALL                    # Notify user by email when certain event types occur.
+
+export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
+export VENVDIR=<path/to/your/venv>         # Export path to your virtual environment.
+export PYDIR=<path/to/your/python/script>  # Export path to directory containing Python script.
+
+# Set up modules.
+module purge                               # Unload all currently loaded modules.
+module load compiler/gnu/13.3              # Load required modules.
+module load mpi/openmpi/4.1
+module load devel/cuda/12.4
+module load lib/hdf5/1.14.4-gnu-13.3-openmpi-4.1
+
+source ${VENVDIR}/bin/activate             # Activate your virtual environment.
+
+mpirun python ${PYDIR}/distributed_forest.py --dataloader parallel  # Use truly parallel dataloader.
+mpirun python ${PYDIR}/distributed_forest.py --dataloader root      # Use root-based dataloader.
--- a/3_ensembles/solutions/sh/submit_serial.sh
+++ b/3_ensembles/solutions/sh/submit_serial.sh
+#!/bin/bash
+
+#SBATCH --job-name=RF1                     # Job name
+#SBATCH --partition=single                 # Queue for the resource allocation
+#SBATCH --time=24:00:00                    # Wall-clock time limit
+#SBATCH --cpus-per-task=40                 # Number of CPUs per task
+#SBATCH --mail-type=ALL                    # Notify user by email when certain event types occur.
+
+export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
+export VENVDIR=<path/to/your/venv>         # Export path to your virtual environment.
+export PYDIR=<path/to/your/python/script>  # Export path to directory containing Python script.
+
+# Set up modules.
+module purge                               # Unload all currently loaded modules.
+module load compiler/gnu/13.3              # Load required modules.
+module load mpi/openmpi/4.1
+module load devel/cuda/12.4
+module load lib/hdf5/1.14.4-gnu-13.3-openmpi-4.1
+
+source ${VENVDIR}/bin/activate             # Activate your virtual environment.
+
+python -u ${PYDIR}/serial_forest.py        # Run your Python script.
--- a/3_ensembles/solutions/sh/submit_test_dl_2.sh
+++ b/3_ensembles/solutions/sh/submit_test_dl_2.sh
+#!/bin/bash
+
+#SBATCH --job-name=dataloader_test         # Job name
+#SBATCH --partition=dev_multiple           # Queue for the resource allocation
+#SBATCH --nodes=2                          # Number of nodes
+#SBATCH --time=30:00                       # Wall-clock time limit
+#SBATCH --ntasks-per-node=1                # Maximum count of tasks per node
+#SBATCH --cpus-per-task=40                 # CPUs per task
+#SBATCH --mail-type=ALL                    # Notify user by email when certain event types occur.
+
+export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
+export VENVDIR=<path/to/your/venv>         # Export path to your virtual environment.
+export PYDIR=<path/to/your/python/script>  # Export path to directory containing Python script.
+
+# Set up modules.
+module purge                               # Unload all currently loaded modules.
+module load compiler/gnu/13.3              # Load required modules.
+module load mpi/openmpi/4.1
+module load devel/cuda/12.4
+module load lib/hdf5/1.14.4-gnu-13.3-openmpi-4.1
+
+source ${VENVDIR}/bin/activate             # Activate your virtual environment.
+
+mpirun python ${PYDIR}/test_dataloaders.py