Marie Weiel · Marie Weiel · 24ceb630
--- a/4_logit/sheet_4.ipynb
+++ b/4_logit/sheet_4.ipynb
+%% Cell type:markdown id:7cfce73b tags:
+
+# Scalable Methods of Artificial Intelligence
+Dr. Charlotte Debus (<charlotte.debus@kit.edu>)
+Dr. Markus Götz (<markus.goetz@kit.edu>)
+Dr. Marie Weiel (<marie.weiel@kit.edu>)
+Dr. Kaleb Phipps (<kaleb.phipps@kit.edu>)
+
+## Exercise 4 on January 21, 2025: Logistic Regression
+In the fourth exercise, we look at logistic regression and its parallelization. Logistic regression is a simple machine-learning algorithm for binary classification. It computes the weighted sum of its inputs and outputs an activation that maps this weighted sum into the fixed interval $\left(0,1\right)$. This lets us interpret the output as the probability of belonging to a class. For more than two classes, one simply trains several models. In this exercise, we will use logistic regression to decide which of two distributions a data point belongs to.
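
As a minimal sketch of the description above, the following snippet passes a weighted sum through the sigmoid and thresholds the result at 0.5. The weights and the sample are made-up values for illustration only; the leading 1 in the sample lets the first weight act as the bias.

```python
import numpy as np


def sigmoid(z):
    """Map real-valued inputs elementwise into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights (bias first) and one sample with a leading 1 for the bias.
w = np.array([-3.0, 1.0, 1.0])
x = np.array([1.0, 2.0, 2.0])

z = x @ w              # Weighted sum of the inputs: -3 + 2 + 2 = 1.
p = sigmoid(z)         # Interpreted as the probability of belonging to class 1.
label = int(p >= 0.5)  # Threshold at 0.5 to obtain a hard class decision.
print(f"z = {z}, p = {p:.3f}, label = {label}")
```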
+
+### Task 1
+As discussed in the lecture on December 5, 2024, logistic regression typically uses the iterative gradient descent method to numerically minimize the cost function $L$ and thereby determine the model parameters $W$:
+
+$$W_{i+1}=W_i-\eta\nabla_W L$$
+
+In this update rule, $\eta$ denotes the so-called learning rate.
+This method requires computing the gradient of the cost function with respect to the model parameters. The gradient $\nabla_W L$ is the vector with components $\frac{\partial L}{\partial w_j}$, $j=0,\dots,D$.
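
The update rule can be sketched on a toy problem. The snippet below is only an illustration under assumed values: the one-dimensional cost $L(w)=(w-3)^2$, the learning rate, and the number of steps are not part of the exercise.

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """Apply the update rule w <- w - eta * grad(w) for a fixed number of steps."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


# Toy cost L(w) = (w - 3)^2 with gradient dL/dw = 2 * (w - 3); its minimum is at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

Each step shrinks the distance to the minimum by the factor $1-2\eta$, so the iterate approaches $w=3$ geometrically for this choice of $\eta$.
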
+Compute the gradient of the mean-squared-error cost function
+
+$$L\left(Y,\hat{Y}\right)=MSE\left(Y,\hat{Y}\right)=\frac{1}{N}\left(Y-\hat{Y}\right)^T\left(Y-\hat{Y}\right)=\frac{1}{N}\sum_{i=1}^N\left(y_i-\hat{y}_i\right)^2$$
+
+for logistic regression with
+
+$$\hat{Y}=sig\left(XW\right)=\frac{1}{1+e^{-XW}}$$
+
+by hand. To this end, consider a dataset $\lbrace\text{samples, labels}\rbrace=\lbrace X, Y\rbrace$ consisting of $N$ samples with $D$ features each. That is, after applying the bias trick, the data $X\in\mathbb{R}^{N\times\left(D+1\right)}$, the labels $Y\in\mathbb{R}^N$, the model parameters $W\in\mathbb{R}^{D+1}$, and the model prediction $\hat{Y}\in\mathbb{R}^N$ are given by:
+
+$$X=\begin{pmatrix}
+1 & x_{11} & x_{12} & \dots & x_{1D}\\
+1 & x_{21} & x_{22} & \dots & x_{2D}\\
+\vdots & \vdots & \vdots & \ddots & \vdots \\
+1 & x_{N1} & x_{N2} & \dots & x_{ND}
+\end{pmatrix},\quad
+Y=\left(y_1, y_2, \dots, y_N\right)^T,\quad
+W=\left(w_0, w_1,\dots,w_D\right)^T,\quad
+\hat{Y}=sig\left(XW\right)=\left(\hat{y}_1, \hat{y}_2,\dots, \hat{y}_N\right)^T$$
+
+Here, $w_0$ is the bias.
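
A hand-derived gradient can be checked numerically with central finite differences. The sketch below is one possible way to do this; the helper names `numerical_gradient` and `check_gradient` are hypothetical, and the check is demonstrated on a simple quadratic whose gradient is known rather than on the MSE loss itself.

```python
import numpy as np


def numerical_gradient(f, w, eps=1e-6):
    """Approximate the gradient of the scalar function f at w by central differences."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return grad


def check_gradient(f, grad_f, w, atol=1e-6):
    """Return True if the analytic gradient grad_f matches the numerical one at w."""
    return bool(np.allclose(grad_f(w), numerical_gradient(f, w), atol=atol))


# Demonstration on f(w) = sum(w_j^2), whose gradient is 2 * w.
w_test = np.array([1.0, -2.0, 0.5])
print(check_gradient(lambda w: np.sum(w**2), lambda w: 2.0 * w, w_test))
```

To test your own result, pass the MSE loss as `f` and your hand-derived expression as `grad_f`.
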
+### Task 2
+#### Part a
+We first consider a serial implementation of logistic regression for the binary classification of an artificially generated dataset. The dataset consists in equal parts of samples drawn from two different Gaussian distributions and then shuffled. Below you will find an example function for generating such data, an illustrative plotting function, and the serial code for local execution in the notebook.
+
+As you can see from the function signature, the gradient-descent training algorithm `lr_train` has three hyperparameters:
+
+- The learning rate `eta` sets the step size of the gradient descent.
+- The number of epochs `epochs` sets the number of complete passes through the training dataset.
+- The batch size `b` sets the number of training samples that are processed before the model parameters are updated within an epoch.
+
+Analyze the training behavior of the algorithm for data consisting of 10,000 samples with 2 features each for different combinations of `epochs` and `b`. What do you notice?
+
+|`b`|  1 | 10 |  10 | 100 |  100 |  2000 | 10,000 |  10,000 |
+|-----|----|----|-----|-----|------|-------|-------|--------|
+| `epochs` | 20 | 20 | 100 | 100 | 1000 | 10,000 | 10,000 | 100,000 |
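
When comparing these combinations, it can help to count how many weight updates each one performs. The following sketch does this for the table above, assuming 10,000 samples and full batches only, which mirrors how `lr_train` drops an incomplete last batch.

```python
n_samples = 10_000

# (b, epochs) combinations from the table above.
combinations = [
    (1, 20), (10, 20), (10, 100), (100, 100),
    (100, 1000), (2000, 10_000), (10_000, 10_000), (10_000, 100_000),
]

totals = []
for b, epochs in combinations:
    updates_per_epoch = n_samples // b  # Full batches only.
    total_updates = epochs * updates_per_epoch
    totals.append(total_updates)
    print(f"b={b:>6}, epochs={epochs:>7}: {total_updates:>7} weight updates in total")
```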
+
+%% Cell type:code id:8abea2bb063ee34b tags:
+
+``` python
+from typing import Union
+
+import numpy as np
+import matplotlib.pyplot as plt
+
+
+np.random.seed(842424)  # Fix random seed for reproducibility.
+
+
+def generate_data(
+    n_samples: int = 10000, input_dim: int = 2
+) -> tuple[np.ndarray, np.ndarray]:
+    """
+    Generate artificial data, i.e., coordinates distributed in 2D space by two different Gaussian distributions
+    and their corresponding labels.
+
+    Parameters
+    ----------
+    n_samples : int
+        The overall number of samples in the dataset.
+    input_dim : int
+        The number of features, i.e., dimension, of input samples.
+
+    Returns
+    -------
+    np.ndarray[float]
+        The samples.
+    np.ndarray[float]
+        The corresponding labels.
+    """
+    half_samples = n_samples // 2  # Generate two equally balanced classes.
+
+    # Generate the blobs.
+    x1 = np.random.normal(1.0, 0.25, size=(half_samples, input_dim))
+    x2 = np.random.normal(2.0, 0.30, size=(half_samples, input_dim))
+
+    # Create matching labels.
+    y1 = np.zeros(half_samples)
+    y2 = np.ones(half_samples)
+
+    data = np.concatenate((x1, x2))
+    labels = np.concatenate((y1, y2))
+
+    # Shuffle data to improve convergence behavior and more closely emulate real data.
+    shuffled_indices = np.arange(n_samples)
+    np.random.shuffle(shuffled_indices)
+
+    return data[shuffled_indices], labels[shuffled_indices]
+
+
+data, labels = generate_data()
+print(
+    f"We have {data.shape[0]} samples with {data.shape[1]} features and {labels.shape[0]} labels."
+)
+```
+
+%% Cell type:code id:1d71c18b tags:
+
+``` python
+def plot_data(data: np.ndarray, labels: np.ndarray, title: str) -> None:
+    """
+    Plot data colored by labels.
+
+    Parameters
+    ----------
+    data : np.ndarray[float]
+        The data to plot.
+    labels : np.ndarray[float]
+        The corresponding labels.
+    title : str
+        The plot title.
+    """
+    plt.xlabel("Feature 1", fontweight="bold")
+    plt.ylabel("Feature 2", fontweight="bold")
+    plt.grid()
+    plt.scatter(data[:, 0], data[:, 1], c=labels, vmin=0.0, vmax=1.0)
+    plt.title(title, fontweight="bold")
+    plt.show()
+
+
+plot_data(data, labels, "Labeled training data")
+```
+
+%% Cell type:code id:d9549e82 tags:
+
+``` python
+def sigmoid(z: Union[float, np.ndarray]) -> Union[float, np.ndarray]:
+    """
+    Compute sigmoid.
+
+    Parameters
+    ----------
+    z : float or np.ndarray
+        The input for the sigmoid function.
+
+    Returns
+    -------
+    float or np.ndarray
+        The input's sigmoid function value (computed elementwise for arrays).
+    """
+    return 1.0 / (1.0 + np.exp(-z))
+
+
+def lr_predict(w: np.ndarray, x: np.ndarray) -> np.ndarray:
+    """
+    Return prediction of logit model for data x using model weights w.
+
+    Parameters
+    ----------
+    w : np.ndarray[float]
+        The parameters, i.e., weights to be learned (after bias trick), shape = [n_features + 1, ].
+        There is one weight for every input dimension plus a bias.
+    x : np.ndarray[float]
+        The dataset (after bias trick), shape = [n_samples, n_features +1].
+        The 0th input should be 1.0 to take the bias into account in a simple dot product.
+
+    Returns
+    -------
+    np.ndarray[float]
+        The predicted activations of the logit model for the input dataset, shape = [n_samples, ],
+        i.e., the sigmoid of the dot product of the weights and the input data.
+    """
+    return sigmoid(x @ w)
+
+
+def mse(y_est: np.ndarray, y: np.ndarray) -> np.ndarray:
+    """
+    Compute mean-square-error loss.
+
+    Parameters
+    ----------
+    y_est : np.ndarray[float]
+        The predictions, shape = [n_samples, ].
+    y : np.ndarray[float]
+        The ground-truth labels, shape = [n_samples, ].
+
+    Returns
+    -------
+    np.ndarray[float]
+        The scalar MSE loss for the considered batch.
+    """
+    return (
+        (1.0 / y.shape[0]) * (y - y_est).T @ (y - y_est)
+    )  # Return MSE loss for considered batch.
+
+
+def lr_loss(
+    w: np.ndarray, x: np.ndarray, y: np.ndarray
+) -> tuple[np.ndarray, np.ndarray]:
+    """
+    Return the loss and the gradient with respect to the weights.
+
+    Parameters
+    ----------
+    w : np.ndarray[float]
+        The model's weights to be learned, where weights[0] is the bias.
+    x : np.ndarray[float]
+        The input data of shape [N x D+1], 0th element of each sample is assumed to be 1 (bias trick).
+    y : np.ndarray[float]
+        The ground-truth labels of shape [N,].
+
+    Returns
+    -------
+    np.ndarray[float]
+        The scalar mean-square-error loss for the input batch of samples.
+    np.ndarray[float]
+        The gradient of the loss with respect to the weights for the batch.
+    """
+    y_est = lr_predict(w, x)  # Compute logit prediction for all samples in batch.
+    loss = mse(y_est, y)  # Compute MSE loss over all samples in batch.
+    # Compute gradient vector of loss w.r.t. weights.
+    gradient = (
+        (-2.0 / y.shape[0]) * ((y - y_est) * y_est * y_est * np.exp(-x @ w)).T @ x
+    )
+    return loss, gradient
+
+
+def lr_train(
+    w: np.ndarray,
+    x: np.ndarray,
+    y: np.ndarray,
+    epochs: int = 100,
+    eta: float = 0.001,
+    b: int = 10,
+) -> np.ndarray:
+    """
+    Train the model, i.e., update the weights following the negative gradient until the model converges.
+
+    Parameters
+    ----------
+    w : np.ndarray[float]
+        The model weights to be learned, where weights[0] is the bias.
+    x : np.ndarray[float]
+        The input data of shape [N x D+1], where each sample's 0th element is assumed to be 1 for bias trick.
+    y : np.ndarray[float]
+        The ground-truth labels of shape [N,].
+    epochs : int
+        The number of epochs to be trained.
+    eta : float
+        The learning rate.
+    b : int
+        The batch size.
+
+    Returns
+    -------
+    np.ndarray[float]
+        The trained weights.
+    """
+    n_samples = y.shape[0]  # Determine number of samples.
+    n_batches = n_samples // b  # Determine number of full batches in data (drop last).
+    print(f"Data is divided into {n_batches} batches.")
+
+    for epoch in range(epochs):  # Loop over epochs.
+        # The number of epochs is a gradient-descent hyperparameter
+        # that controls the number of complete passes through the train set.
+        # The batch size is a gradient-descent hyperparameter
+        # that controls the number of training samples to work through before the
+        # model's internal parameters are updated.
+
+        loss_sum = 0.0  # Initiate loss for each epoch.
+        accuracy = 0.0  # Initiate accuracy for each epoch.
+
+        for nb in range(n_batches):
+            x_ = x[nb * b : (nb + 1) * b]
+            y_ = y[nb * b : (nb + 1) * b]
+            loss, gradient = lr_loss(w, x_, y_)
+            loss_sum += loss
+
+            corr = np.sum((lr_predict(w, x_) + 0.5).astype(int) == y_)
+            accuracy += corr
+            w -= eta * gradient
+
+        # Average loss over batches and compute accuracy (in percent) for this epoch.
+        loss_sum /= n_batches
+        accuracy /= n_batches * b  # Normalize by the number of samples actually processed.
+        accuracy *= 100
+
+        # Print the training status every tenth epoch.
+        if epoch % 10 == 0:
+            print(f"Epoch: {epoch}, Loss: {loss_sum}, Accuracy: {accuracy}")
+    return w
+```
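
The gradient in `lr_loss` contains the factor `y_est * y_est * np.exp(-x @ w)`, which is the derivative of the sigmoid. The short check below illustrates that this factor equals the more familiar form $\hat{y}\left(1-\hat{y}\right)$; the input values are arbitrary and chosen only for the comparison.

```python
import numpy as np

z = np.linspace(-5.0, 5.0, 11)
s = 1.0 / (1.0 + np.exp(-z))

form_in_code = s * s * np.exp(-z)  # Form of the sigmoid derivative used in lr_loss.
textbook_form = s * (1.0 - s)      # Equivalent form without a second exponential.

print(np.allclose(form_in_code, textbook_form))
```

The second form does not evaluate `np.exp` again, which also sidesteps overflow of the exponential for strongly negative inputs.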
+
+%% Cell type:code id:75702c70 tags:
+
+``` python
+# Bias trick: Prepend data with 1's for additional bias dimension.
+ones = np.ones(
+    shape=(
+        data.shape[0],
+        1,
+    )
+)
+data_bt = np.hstack([ones, data])
+
+weights = np.random.rand(data_bt.shape[1])  # Initialize model parameters randomly.
+
+# Before training.
+plot_data(data, lr_predict(weights, data_bt), "Continuous predictions of untrained model")
+plot_data(data, np.around(lr_predict(weights, data_bt)), "Mapped predictions of untrained model")
+
+# Train model.
+weights = lr_train(weights, data_bt, labels, b=10, epochs=100)
+
+# After training.
+plot_data(data, lr_predict(weights, data_bt), "Continuous predictions of trained model")
+plot_data(data, np.around(lr_predict(weights, data_bt)), "Mapped predictions of trained model")
+```
+
+%% Cell type:markdown id:c8d616f4 tags:
+
+#### Part b
+Test the serial implementation on a CPU-based node of bwUniCluster.
+
+- To do this, create a Python script based on the code below and the functions from the implementation above, and launch it on the cluster using a submit script.
+- Use the data and labels in the HDF5 file `/pfs/work7/workspace/scratch/ku4408-VL-ScalableAI/data/logit_data_n10000_d2.h5`, stored in the datasets `data` and `labels`, respectively. The dataset contains 10,000 samples with 2 features each.
+- You can pass the number of epochs and the batch size as command-line arguments to the Python script.
+- As on the previous exercise sheets, load the required modules and activate your virtual Python environment before running the actual script (see the submit script below).
+
+%% Cell type:code id:80c0733c tags:
+
+``` python
+import argparse
+
+import h5py
+import numpy as np
+
+
+np.random.seed(842424)  # Fix random seed for reproducibility.
+
+##################################
+# PUT FUNCTION DEFINITIONS HERE! #
+##################################
+
+if __name__ == "__main__":
+    data_path = (
+        "/pfs/work7/workspace/scratch/ku4408-VL-ScalableAI/data/logit_data_n10000_d2.h5"
+    )
+    parser = argparse.ArgumentParser(prog="Logit")
+    parser.add_argument(
+        "--epochs",
+        type=int,
+        default=100,
+        help="The number of epochs to train.",
+    )
+
+    parser.add_argument(
+        "--batch_size",
+        type=int,
+        default=10,
+        help="The batch size.",
+    )
+
+    args = parser.parse_args()
+
+    with h5py.File(data_path, "r") as f:
+        data = np.array(f["data"])
+        labels = np.array(f["labels"])
+
+    print(
+        f"We have {data.shape[0]} samples with {data.shape[1]} features and {labels.shape[0]} labels."
+    )
+
+    # Bias trick: Prepend data with 1's for additional bias dimension.
+    ones = np.ones(
+        (
+            data.shape[0],
+            1,
+        )
+    )
+    data_bt = np.hstack([ones, data])
+    weights = np.random.rand(data_bt.shape[1])  # Initialize model parameters randomly.
+    weights = lr_train(weights, data_bt, labels, b=args.batch_size, epochs=args.epochs)
+```
+
+%% Cell type:code id:3a1abbe8 tags:
+
+``` bash
+#!/bin/bash
+
+#SBATCH --job-name=logit_serial            # Job name
+#SBATCH --partition=dev_single             # Queue for the resource allocation.
+#SBATCH --time=5:00                        # Wall-clock time limit
+#SBATCH --cpus-per-task=40                 # Number of CPUs required per MPI task
+#SBATCH --ntasks-per-node=1                # Maximum count of tasks per node
+#SBATCH --mail-type=ALL                    # Notify user by email when certain event types occur.
+
+export OMP_NUM_THREADS=40
+export VENVDIR=<path/to/your/venv>         # Export path to your virtual environment.
+export PYDIR=<path/to/your/python/script>  # Export path to directory containing Python script.
+
+# Set up modules.
+module purge                               # Unload all currently loaded modules.
+module load compiler/gnu/13.3              # Load required modules.
+module load mpi/openmpi/4.1
+module load devel/cuda/12.4
+module load lib/hdf5/1.14.4-gnu-13.3-openmpi-4.1
+
+source ${VENVDIR}/bin/activate # Activate your virtual environment.
+
+python ${PYDIR}/logit_serial.py --epochs 100 --batch_size 10
+```
+
+%% Cell type:markdown id:83c5b31f tags:
+
+### Task 3
+Starting from the serial code above, implement a data-parallel version of logistic regression. Data parallelism is a form of parallelization in which training is parallelized over the samples in a batch of effective batch size $b_\text{eff}$.
+The data is partitioned and distributed across the processors, which work on their partitions in parallel.
+Each processor holds its own copy of the model and works locally on its samples, communicating with the other processors as needed to keep the copies consistent. For gradient descent, this means that in addition to the global effective batch size $b_\text{eff}$ mentioned above, each processor $p$ has a local batch size $b_p$ ("mini-mini-batch") such that:
+
+$$b_\text{eff}=\sum_{p}b_p$$
+
+Each processor computes the cost function and its gradient with respect to the weights for its local batches of size $b_p$. After processing a local batch, all processors must exchange their locally computed gradients and average them, so that each processor can then update the weights of its local model copy according to the effective batch size, redundantly and consistently with all other copies.
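
The following sketch illustrates why averaging local gradients is consistent with the effective batch size. It simulates the processors inside a single process with NumPy only, so no MPI is involved; in the actual exercise, a collective MPI operation would take the place of `np.mean`. The data, the number of simulated ranks, and the helper `mse_gradient` are assumptions made for this illustration.

```python
import numpy as np


def mse_gradient(w, x, y):
    """Gradient of the MSE loss of a logit model w.r.t. w for the batch (x, y)."""
    y_est = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (-2.0 / y.shape[0]) * ((y - y_est) * y_est * (1.0 - y_est)) @ x


rng = np.random.default_rng(0)
n_ranks, b_local = 4, 25   # Hypothetical setup: 4 processors with b_p = 25 each.
b_eff = n_ranks * b_local  # Effective batch size: sum of the local batch sizes.

x = np.hstack([np.ones((b_eff, 1)), rng.normal(size=(b_eff, 2))])  # Bias trick.
y = rng.integers(0, 2, size=b_eff).astype(float)
w = rng.normal(size=3)

# Each simulated rank computes the gradient on its own shard of the effective batch.
local_grads = [
    mse_gradient(w, x[p * b_local:(p + 1) * b_local], y[p * b_local:(p + 1) * b_local])
    for p in range(n_ranks)
]
averaged = np.mean(local_grads, axis=0)  # Stand-in for the exchange-and-average step.
full_batch = mse_gradient(w, x, y)       # Gradient a single process would compute.

print(np.allclose(averaged, full_batch))
```

The two results agree because all local batches have the same size; with unequal $b_p$, a plain average would weight the shards differently than the full-batch gradient does.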
+
+- As in the previous exercises, we load the data distributed along the sample axis across the available processors. A corresponding dataloader is provided below.
+- Starting from the serial functions, implement a data-parallel version of gradient descent for logistic regression and test your code on four CPU-based nodes of bwUniCluster.
+- To do so, create a Python script and a submit bash script analogous to Task 2b (see also the previous exercise sheets).
+- Use the data and labels in the HDF5 file `/pfs/work7/workspace/scratch/ku4408-VL-ScalableAI/data/logit_data_n100000_d2.h5`, stored in the datasets `data` and `labels`, respectively. The dataset contains 100,000 samples with 2 features each.
+- Train for `epochs = 100` epochs and start with an effective batch size of $b_\mathrm{eff}=100$.
+- You can also classify this dataset with your serial variant of logistic regression on the cluster. Compare the quality of the trained model for the same number of epochs, `epochs = 100`, and the same (effective) batch size $b_\mathrm{(eff)}=100$. What do you notice? If necessary, vary the hyperparameters of your parallel version so that you obtain a comparable quality of the trained model.
+
+%% Cell type:code id:5789926d tags:
+
+``` python
+import numpy as np
+from mpi4py import MPI
+import h5py
+
+
+if __name__ == "__main__":
+    data_path = "/pfs/work7/workspace/scratch/ku4408-VL-ScalableAI/data/logit_data_n100000_d2.h5"
+    comm = MPI.COMM_WORLD  # Set up communicator.
+    rank, size = comm.rank, comm.size
+
+    with h5py.File(data_path, "r") as f:  # Load data in sample-parallel fashion.
+        chunk = f["data"].shape[0] // size  # Samples per rank (integer division).
+        if rank == size - 1:
+            data = np.array(f["data"][rank * chunk :])
+            labels = np.array(f["labels"][rank * chunk :])
+        else:
+            data = np.array(f["data"][rank * chunk : (rank + 1) * chunk])
+            labels = np.array(f["labels"][rank * chunk : (rank + 1) * chunk])
+
+    print(
+        f"Rank {rank}/{size}: Local data has {data.shape[0]} samples with {data.shape[1]} features and "
+        f"{labels.shape[0]} labels. 0th elements are: {data[0]}, {labels[0]}"
+    )
+```
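
To see how the dataloader above distributes samples, the sketch below reproduces its index arithmetic without MPI or HDF5. The helper `local_range` is hypothetical and the rank counts are example values; the last rank takes whatever remains after integer division.

```python
def local_range(n_samples, rank, size):
    """Return the [start, stop) sample range a rank loads, mirroring the dataloader."""
    chunk = n_samples // size
    start = rank * chunk
    stop = n_samples if rank == size - 1 else (rank + 1) * chunk
    return start, stop


# Example: 100,000 samples distributed over 4 and over 7 ranks.
counts_per_size = {}
for size in (4, 7):
    ranges = [local_range(100_000, rank, size) for rank in range(size)]
    counts_per_size[size] = [stop - start for start, stop in ranges]
    print(size, counts_per_size[size])
```

With a rank count that does not divide the number of samples, the last rank holds slightly more data than the others, which matters when choosing equal local batch sizes.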
+%% Cell type:markdown id:7cfce73b tags:
+
+# Skalierbare Methoden der Künstlichen Intelligenz
+Dr. Charlotte Debus (<charlotte.debus@kit.edu>)
+Dr. Markus Götz (<markus.goetz@kit.edu>)
+Dr. Marie Weiel (<marie.weiel@kit.edu>)
+Dr. Kaleb Phipps (<kaleb.phipps@kit.edu>)
+
+## Übung 4 am 21.01.25: Logistische Regression
+In der vierten Übung beschäftigen wir uns mit der logistischen Regression und ihrer Parallelisierung. Die logistische Regression ist ein einfacher Machine-Learning-Algorithmus zur binären Klassifizierung. Er berechnet die gewichtete Summe seiner Eingaben und gibt eine Aktivierung aus, die die gewichtete Summe in das feste Intervall $\left(0,1\right)$ abbildet. Auf diese Weise können wir die Ausgabe als Wahrscheinlichkeit für die Zugehörigkeit zu einer Klasse interpretieren. Im Falle mehrerer Klassen trainiert man einfach mehrere Modelle. In dieser Übung werden wir die logistische Regression verwenden, um zu entscheiden, zu welcher von zwei Verteilungen ein Datenpunkt gehört.
+
+### Aufgabe 1
+Wie in der Vorlesung vom 05.12.24 besprochen kommt in der logistischen Regression zur numerischen Minimierung der Kostenfunktion $L$ und damit zur Bestimmung der Modellparameter $W$ typischerweise das iterative Gradientenabstiegsverfahren zum Einsatz:
+
+$$W_{i+1}=W_i-\eta\nabla_W L$$
+
+In dieser Update-Regel bezeichnet $\eta$ die sogenannte Lernrate.
+Dieses Verfahren erfordert die Berechnung des Gradienten der Kostenfunktion bezüglich der Modellparameter. Der Gradient $\nabla_W L$ entspricht gerade dem Vektor mit den Komponenten $\frac{\partial L}{\partial W_j}$.
+Berechnen Sie den Gradienten der Mean-Square-Error-Kostenfunktion
+
+$$L\left(Y,\hat{Y}\right)=MSE\left(Y,\hat{Y}\right)=\frac{1}{N}\left(Y-\hat{Y}\right)^T\left(Y-\hat{Y}\right)=\frac{1}{N}\sum_{i=1}^N\left(y_i-\hat{y}_i\right)^2$$
+
+für die logistische Regression mit
+
+$$\hat{Y}=sig\left(XW\right)=\frac{1}{1+e^{-XW}}$$
+
+per Hand. Betrachten Sie hierzu einen Datensatz $\lbrace\text{Samples, Labels}\rbrace=\lbrace X, Y\rbrace$ bestehend aus $N$ Samples mit jeweils $D$ Features, d.h. für die Daten $X\in\mathbb{R}^{N\times\left(D+1\right)}$, die Labels $Y\in\mathbb{R}^N$, die Modellparameter $W\in\mathbb{R}^{D+1}$ sowie die Modellvorhersage $\hat{Y}\in\mathbb{R}^N$ gilt (nach Anwendung des Bias Tricks):
+
+$$X=\begin{pmatrix}
+1 & x_{11} & x_{12} & \dots & x_{1D}\\
+1 & x_{21} & x_{22} & \dots & x_{2D}\\
+\vdots & \vdots & \vdots & \ddots & \vdots \\
+1 & x_{N1} & x_{N2} & \dots & x_{ND}
+\end{pmatrix},\quad
+Y=\left(y_1, y_2, \dots, y_N\right)^T,\quad
+W=\left(w_0, w_1,\dots,w_D\right)^T,\quad
+\hat{Y}=sig\left(XW\right)=\left(\hat{y}_1, \hat{y}_2,\dots, \hat{y}_N\right)^T$$
+
+Hierbei entspricht $w_0$ gerade dem Bias.
+### Aufgabe 2
+#### Teil a
+Wir betrachten zunächst eine serielle Implementierung der logistischen Regression zur binären Klassifikation eines künstlich erstellten Datensatzes. Der Datensatz besteht zu gleichen Teilen aus Samples, die aus zwei verschiedenen Gaußverteilungen gezogen und anschließend durchmischt wurden. Untenstehend finden Sie eine beispielhafte Funktion zur Generierung solcher Daten, eine veranschaulichende Plotfunktion sowie den seriellen Code zur lokalen Ausführung im Notebook.
+
+Wie Sie der Funktionssignatur entnehmen können, hat der Gradient-Descent-Trainingsalgorithmus `lr_train` drei Hyperparameter:
+
+- Die Lernrate `eta` legt die Schrittgröße des Gradientenabstiegs fest.
+- Die Anzahl der Epochen `epochs` legt die Anzahl der vollständigen Durchläufe durch den Trainingsdatensatz fest.
+- Die Batchgröße `b` legt die Anzahl der Samples des Trainingsdatensatzes fest, die durchlaufen werden, bevor die Parameter des Modells innerhalb einer Epoche aktualisiert werden.
+
+Analysieren Sie das Trainingsverhalten des Algorithmus für Daten bestehend aus 10 000 Samples mit je 2 Features für verschiedene Kombinationen von `epochs` und `b`. Was fällt Ihnen auf?
+
+|`b`|  1 | 10 |  10 | 100 |  100 |  2000 | 10,000 |  10,000 |
+|-----|----|----|-----|-----|------|-------|-------|--------|
+| `epochs` | 20 | 20 | 100 | 100 | 1000 | 10,000 | 10,000 | 100,000 |
+
+%% Cell type:code id:8abea2bb063ee34b tags:
+
+``` python
+from typing import Union
+
+import numpy as np
+import matplotlib.pyplot as plt
+
+
+np.random.seed(842424)  # Fix random seed for reproducibility.
+
+
+def generate_data(
+    n_samples: int = 10000, input_dim: int = 2
+) -> tuple[np.ndarray, np.ndarray]:
+    """
+    Generate artificial data, i.e., coordinates distributed in 2D space by two different Gaussian distributions
+    and their corresponding labels.
+
+    Parameters
+    ----------
+    n_samples : int
+        The overall number of samples in the dataset.
+    input_dim : int
+        The number of features, i.e., dimension, of input samples.
+
+    Returns
+    -------
+    np.ndarray[float]
+        The samples.
+    np.ndarray[float]
+        The corresponding labels.
+    """
+    half_samples = n_samples // 2  # Generate two equally balanced classes.
+
+    # Generate the blobs.
+    x1 = np.random.normal(1.0, 0.25, size=(half_samples, input_dim))
+    x2 = np.random.normal(2.0, 0.30, size=(half_samples, input_dim))
+
+    # Create matching labels.
+    y1 = np.zeros(half_samples)
+    y2 = np.ones(half_samples)
+
+    data = np.concatenate((x1, x2))
+    labels = np.concatenate((y1, y2))
+
+    # Shuffle data to improve convergence behavior and more closely emulate real data.
+    shuffled_indices = np.arange(n_samples)
+    np.random.shuffle(shuffled_indices)
+
+    return data[shuffled_indices], labels[shuffled_indices]
+
+
+data, labels = generate_data()
+print(
+    f"We have {data.shape[0]} samples with {data.shape[1]} features and {labels.shape[0]} labels."
+)
+```
+
+%% Cell type:code id:1d71c18b tags:
+
+``` python
+def plot_data(data: np.ndarray, labels: np.ndarray, title: str) -> None:
+    """
+    Plot data colored by labels.
+
+    Parameters
+    ----------
+    data : np.ndarray[float]
+        The data to plot.
+    labels : np.ndarray[float]
+        The corresponding labels.
+    title : str
+        The plot title.
+    """
+    plt.xlabel("Feature 1", fontweight="bold")
+    plt.ylabel("Feature 2", fontweight="bold")
+    plt.grid()
+    plt.scatter(data[:, 0], data[:, 1], c=labels, vmin=0.0, vmax=1.0)
+    plt.title(title, fontweight="bold")
+    plt.show()
+
+
+plot_data(data, labels, "Labeled training data")
+```
+
+%% Cell type:code id:d9549e82 tags:
+
+``` python
+def sigmoid(z: Union[float, np.ndarray]) -> Union[float, np.ndarray]:
+    """
+    Compute sigmoid.
+
+    Parameters
+    ----------
+    z : float
+        The input for the sigmoid function.
+
+    Returns
+    ----------
+    float
+        The input's sigmoid function value.
+    """
+    return 1.0 / (1.0 + np.exp(-z))
+
+
+def lr_predict(w: np.ndarray, x: np.ndarray) -> np.ndarray:
+    """
+    Return prediction of logit model for data x using model weights w.
+
+    Parameters
+    ----------
+    w : np.ndarray[float]
+        The parameters, i.e., weights to be learned (after bias trick), shape = [n_features + 1, ].
+        There is one weight for every input dimension plus a bias.
+    x : np.ndarray[float]
+        The dataset (after bias trick), shape = [n_samples, n_features +1].
+        The 0th input should be 1.0 to take the bias into account in a simple dot product.
+
+    Returns
+    -------
+    np.ndarray[float]
+        The predicted activations of the logit model for the input dataset, shape = [n_samples, ],
+        i.e., the sigmoid of the dot product of the weights and the input data.
+    """
+    return sigmoid(x @ w)
+
+
+def mse(y_est: np.ndarray, y: np.ndarray) -> np.ndarray:
+    """
+    Compute mean-square-error loss.
+
+    Parameters
+    ----------
+    y_est : np.ndarray[float]
+        The predictions, shape = [n_samples, ].
+    y : np.ndarray[float]
+        The ground-truth labels, shape = [n_samples, ].
+
+    Returns
+    ----------
+    np.ndarray[float]
+        MSE loss
+    """
+    return (
+        (1.0 / y.shape[0]) * (y - y_est).T @ (y - y_est)
+    )  # Return MSE loss for considered batch.
+
+
+def lr_loss(
+    w: np.ndarray, x: np.ndarray, y: np.ndarray
+) -> tuple[np.ndarray, np.ndarray]:
+    """
+    Return the loss and the gradient with respect to the weights.
+
+    Parameters
+    ----------
+    w : np.ndarray[float]
+        The model's weights to be learned, where weights[0] is the bias.
+    x : np.ndarray[float]
+        The input data of shape [N x D+1], 0th element of each sample is assumed to be 1 (bias trick).
+    y : np.ndarray[float]
+        The ground-truth labels of shape [N,].
+
+    Returns
+    -------
+    np.ndarray[float]
+        The scalar mean-square-error loss for the input batch of samples.
+    np.ndarray[float]
+        The gradient of the loss with respect to the weights for the batch.
+    """
+    y_est = lr_predict(w, x)  # Compute logit prediction for all samples in batch.
+    loss = mse(y_est, y)  # Compute MSE loss over all samples in batch.
+    # Compute gradient vector of loss w.r.t. weights.
+    gradient = (
+        (-2.0 / y.shape[0]) * ((y - y_est) * y_est * y_est * np.exp(-x @ w)).T @ x
+    )
+    return loss, gradient
+
+
+def lr_train(
+    w: np.ndarray,
+    x: np.ndarray,
+    y: np.ndarray,
+    epochs: int = 100,
+    eta: float = 0.001,
+    b: int = 10,
+) -> np.ndarray:
+    """
+    Train the model, i.e., update the weights following the negative gradient until the model converges.
+
+    Parameters
+    ----------
+    w : np.ndarray[float]
+        The model weights to be learned, where weights[0] is the bias.
+    x : np.ndarray[float]
+        The input data of shape [N x D+1], where each sample's 0th element is assumed to be 1 for bias trick.
+    y : np.ndarray[float]
+        The ground-truth labels of shape [N,].
+    epochs : int
+        The number of epochs to be trained.
+    eta : float
+        The learning rate.
+    b : int
+        The batch size.
+
+    Returns
+    -------
+    np.ndarray[float]
+        The trained weights.
+    """
+    n_samples = y.shape[0]  # Determine number of samples.
+    n_batches = n_samples // b  # Determine number of full batches in data (drop last).
+    print(f"Data is divided into {n_batches} batches.")
+
+    for epoch in range(epochs):  # Loop over epochs.
+        # The number of epochs is a gradient-descent hyperparameter
+        # that controls the number of complete passes through the train set.
+        # The batch size is a gradient-descent hyperparameter
+        # that controls the number of training samples to work through before the
+        # model's internal parameters are updated.
+
+        loss_sum = 0.0  # Initiate loss for each epoch.
+        accuracy = 0.0  # Initiate accuracy for each epoch.
+
+        for nb in range(n_batches):
+            x_ = x[nb * b : (nb + 1) * b]
+            y_ = y[nb * b : (nb + 1) * b]
+            loss, gradient = lr_loss(w, x_, y_)
+            loss_sum += loss
+
+            corr = np.sum((lr_predict(w, x_) + 0.5).astype(int) == y_)
+            accuracy += corr
+            w -= eta * gradient
+
+        # Calculate loss + accuracy after each epoch.
+        loss_sum /= n_batches
+        accuracy /= n_samples
+        accuracy *= 100
+
+        # Print every tenth epoch the training status.
+        if epoch % 10 == 0:
+            print(f"Epoch: {epoch}, Loss: {loss_sum}, Accuracy: {accuracy}")
+    return w
+```
+
+%% Cell type:code id:75702c70 tags:
+
+``` python
+# Bias trick: Prepend data with 1's for additional bias dimension.
+ones = np.ones(
+    shape=(
+        data.shape[0],
+        1,
+    )
+)
+data_bt = np.hstack([ones, data])
+
+weights = np.random.rand(data_bt.shape[1])  # Initialize model parameters randomly.
+
+# Before training.
+plot_data(data, lr_predict(weights, data_bt), "Continuous predictions of untrained model")
+plot_data(data, np.around(lr_predict(weights, data_bt)), "Mapped predictions of untrained model")
+
+# Train model.
+weights = lr_train(weights, data_bt, labels, b=10, epochs=100)
+
+# After training.
+plot_data(data, lr_predict(weights, data_bt), "Continuous predictions of trained model")
+plot_data(data, np.around(lr_predict(weights, data_bt)), "Mapped predictions of trained model")
+```
+
+%% Cell type:markdown id:c8d616f4 tags:
+
+#### Part b
+Test the serial implementation on a CPU-based node of the bwUniCluster.
+
+- To do so, create a Python script based on the code below and the functions from the implementation above, and launch it on the cluster with a submit script.
+- Use the data and labels in the HDF5 file `/pfs/work7/workspace/scratch/ku4408-VL-ScalableAI/data/logit_data_n10000_d2.h5`, stored in the datasets `data` and `labels`, respectively. The dataset contains 10,000 samples with 2 features each.
+- You can pass the number of epochs and the batch size as command-line arguments to the Python script.
+- As on the previous exercise sheets, load the required modules and activate your virtual Python environment before running the actual script (see the submit script below).
+
+%% Cell type:code id:80c0733c tags:
+
+``` python
+import argparse
+
+import h5py
+import numpy as np
+
+
+np.random.seed(842424)  # Fix random seed for reproducibility.
+
+##################################
+# PUT FUNCTION DEFINITIONS HERE! #
+##################################
+
+if __name__ == "__main__":
+    data_path = (
+        "/pfs/work7/workspace/scratch/ku4408-VL-ScalableAI/data/logit_data_n10000_d2.h5"
+    )
+    parser = argparse.ArgumentParser(prog="Logit")
+    parser.add_argument(
+        "--epochs",
+        type=int,
+        default=100,
+        help="The number of epochs to train.",
+    )
+
+    parser.add_argument(
+        "--batch_size",
+        type=int,
+        default=10,
+        help="The batch size.",
+    )
+
+    args = parser.parse_args()
+
+    with h5py.File(data_path, "r") as f:
+        data = np.array(f["data"])
+        labels = np.array(f["labels"])
+
+    print(
+        f"We have {data.shape[0]} samples with {data.shape[1]} features and {labels.shape[0]} labels."
+    )
+
+    # Bias trick: Prepend data with 1's for additional bias dimension.
+    ones = np.ones(
+        (
+            data.shape[0],
+            1,
+        )
+    )
+    data_bt = np.hstack([ones, data])
+    weights = np.random.rand(data_bt.shape[1])  # Initialize model parameters randomly.
+    weights = lr_train(weights, data_bt, labels, b=args.batch_size, epochs=args.epochs)
+```
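The command-line handling in the script above can be tried out without the cluster or the HDF5 file. The following minimal sketch rebuilds the same two options in a hypothetical helper `build_parser` (a name introduced here for illustration) and passes an explicit argument list to `parse_args` instead of reading `sys.argv`.

```python
import argparse


def build_parser():
    """Build a parser with the same two options as the script above."""
    parser = argparse.ArgumentParser(prog="Logit")
    parser.add_argument("--epochs", type=int, default=100, help="The number of epochs to train.")
    parser.add_argument("--batch_size", type=int, default=10, help="The batch size.")
    return parser


# An explicit list emulates `python logit_serial.py --epochs 50 --batch_size 20`.
args = build_parser().parse_args(["--epochs", "50", "--batch_size", "20"])
# An empty list emulates calling the script without any arguments.
defaults = build_parser().parse_args([])
print(args.epochs, args.batch_size)
print(defaults.epochs, defaults.batch_size)
```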
+
+%% Cell type:code id:3a1abbe8 tags:
+
+``` python
+#!/bin/bash
+
+#SBATCH --job-name=logit_serial            # Job name
+#SBATCH --partition=dev_single             # Queue for the resource allocation.
+#SBATCH --time=5:00                        # Wall-clock time limit
+#SBATCH --cpus-per-task=40                 # Number of CPUs required per MPI task
+#SBATCH --ntasks-per-node=1                # Maximum count of tasks per node
+#SBATCH --mail-type=ALL                    # Notify user by email when certain event types occur.
+
+export OMP_NUM_THREADS=40
+export VENVDIR=<path/to/your/venv>         # Export path to your virtual environment.
+export PYDIR=<path/to/your/python/script>  # Export path to directory containing Python script.
+
+# Set up modules.
+module purge                               # Unload all currently loaded modules.
+module load compiler/gnu/13.3              # Load required modules.
+module load mpi/openmpi/4.1
+module load devel/cuda/12.4
+module load lib/hdf5/1.14.4-gnu-13.3-openmpi-4.1
+
+source ${VENVDIR}/bin/activate # Activate your virtual environment.
+
+python ${PYDIR}/logit_serial.py --epochs 100 --batch_size 10
+```
+
+%% Cell type:markdown id:83c5b31f tags:
+
+### Task 3
+Starting from the serial code above, implement a data-parallel version of logistic regression. Data parallelism is a form of parallelization in which training is parallelized over the samples within a batch of effective batch size $b_\text{eff}$.
+The data are partitioned and distributed across the processors, which work on them in parallel.
+Each processor holds its own copy of the model and works locally on its own samples, communicating with the other processors as needed to keep the copies consistent. For gradient descent, this means that in addition to the global effective batch size $b_\text{eff}$ mentioned above, each processor $p$ has a local batch size $b_p$ ("mini-mini-batch") such that:
+
+$$b_\text{eff}=\sum_{p}b_p$$
+
+Each processor computes the loss function and its gradient with respect to the weights for its local batches of size $b_p$. After processing a local batch, all processors must exchange their locally computed gradients and average them, so that each processor can then update the weights of its local model copy according to the effective batch size, redundantly and consistently with all other copies.
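
The averaging step can be checked numerically without MPI. The sketch below simulates several processors by splitting one effective batch into equally sized shards with NumPy. It assumes the MSE loss from Task 1 with $\hat{Y}=sig\left(XW\right)$; the helper names `sigmoid` and `mse_gradient` are illustrative and are not the notebook's `lr_predict` or `lr_loss`. Under these assumptions, the mean of the local gradients should match the gradient over the whole effective batch.

```python
import numpy as np


def sigmoid(z):
    """Logistic function mapping real inputs to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def mse_gradient(w, x, y):
    """Gradient of MSE(y, sigmoid(x @ w)) with respect to w (assumed loss)."""
    y_hat = sigmoid(x @ w)
    return -2.0 / x.shape[0] * x.T @ ((y - y_hat) * y_hat * (1.0 - y_hat))


rng = np.random.default_rng(0)
n_procs, b_local = 4, 25              # Simulated processors and local batch size b_p.
b_eff = n_procs * b_local             # Effective batch size: sum of all b_p.
x = rng.normal(size=(b_eff, 3))       # One effective batch of synthetic samples.
y = rng.integers(0, 2, size=b_eff).astype(float)
w = rng.random(3)

# Each simulated processor computes the gradient on its own shard.
local_grads = [
    mse_gradient(w, x[p * b_local : (p + 1) * b_local], y[p * b_local : (p + 1) * b_local])
    for p in range(n_procs)
]
averaged = np.mean(local_grads, axis=0)  # Result of exchanging and averaging gradients.
full = mse_gradient(w, x, y)             # Serial gradient over the whole effective batch.
print(np.allclose(averaged, full))
```

In an MPI program, the averaging would typically be a sum-allreduce of the local gradients followed by a division by the number of ranks. The equality above relies on equal local batch sizes; with unequal $b_p$, a weighted mean would be needed instead.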
+
+- As in the previous exercises, we load the data distributed along the sample axis across the available processors. A corresponding dataloader is provided below.
+- Starting from the serial functions, implement a data-parallel version of gradient descent for logistic regression and test your code on four CPU-based nodes of the bwUniCluster.
+- To do so, create a Python script and a submit bash script analogous to Task 2b (see also the previous exercise sheets).
+- Use the data and labels in the HDF5 file `/pfs/work7/workspace/scratch/ku4408-VL-ScalableAI/data/logit_data_n100000_d2.h5`, stored in the datasets `data` and `labels`, respectively. The dataset contains 100,000 samples with 2 features each.
+- Train for $epochs = 100$ epochs and start with an effective batch size of $b_\mathrm{eff}=100$.
+- You can also classify this dataset with your serial version of logistic regression on the cluster. Compare the quality of the trained model for the same number of epochs $epochs = 100$ and the same (effective) batch size $b_\mathrm{(eff)}=100$. What do you notice? If necessary, vary the hyperparameters of your parallel version so that you obtain a comparable quality of the trained model.
+
+%% Cell type:code id:5789926d tags:
+
+``` python
+import numpy as np
+from mpi4py import MPI
+import h5py
+
+
+if __name__ == "__main__":
+    data_path = "/pfs/work7/workspace/scratch/ku4408-VL-ScalableAI/data/logit_data_n100000_d2.h5"
+    comm = MPI.COMM_WORLD  # Set up communicator.
+    rank, size = comm.rank, comm.size
+
+    with h5py.File(data_path, "r") as f:  # Load data in sample-parallel fashion.
+        chunk = f["data"].shape[0] // size  # Samples per rank; the last rank also takes the remainder.
+        if rank == size - 1:
+            data = np.array(f["data"][rank * chunk :])
+            labels = np.array(f["labels"][rank * chunk :])
+        else:
+            data = np.array(f["data"][rank * chunk : (rank + 1) * chunk])
+            labels = np.array(f["labels"][rank * chunk : (rank + 1) * chunk])
+
+    print(
+        f"Rank {rank}/{size}: Local data has {data.shape[0]} samples with {data.shape[1]} features and "
+        f"{labels.shape[0]} labels. 0th elements are: {data[0]}, {labels[0]}"
+    )
+```
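The partitioning logic of the dataloader above can be checked without MPI or the HDF5 file. The sketch below mirrors its index arithmetic in a hypothetical pure-Python helper `local_bounds` (a name introduced here for illustration) and lists the sample range each rank would load.

```python
def local_bounds(n_samples, rank, size):
    """Return the (start, stop) sample range a rank loads; the last rank takes the remainder."""
    chunk = n_samples // size
    start = rank * chunk
    stop = n_samples if rank == size - 1 else (rank + 1) * chunk
    return start, stop


# Example: 10 samples distributed over 4 ranks.
bounds = [local_bounds(10, rank, 4) for rank in range(4)]
print(bounds)
```

With this scheme the last rank can hold up to `size - 1` more samples than the others, so the number of local batches per epoch may differ between ranks and has to be handled, for example by dropping the surplus samples.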