From ea0f238ecc7742c46467190ca0b798987cfe9a3f Mon Sep 17 00:00:00 2001
From: Marie Weiel <marie.weiel@kit.edu>
Date: Mon, 16 Dec 2024 09:57:36 +0100
Subject: [PATCH] add sheet 4

---
 4_logit/sheet_4.ipynb | 547 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 547 insertions(+)
 create mode 100644 4_logit/sheet_4.ipynb

diff --git a/4_logit/sheet_4.ipynb b/4_logit/sheet_4.ipynb
new file mode 100644
index 0000000..7d66111
--- /dev/null
+++ b/4_logit/sheet_4.ipynb
@@ -0,0 +1,547 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "7cfce73b",
+   "metadata": {},
+   "source": [
+    "# Scalable Methods of Artificial Intelligence\n",
+    "Dr. Charlotte Debus (<charlotte.debus@kit.edu>)  \n",
+    "Dr. Markus Götz (<markus.goetz@kit.edu>)  \n",
+    "Dr. Marie Weiel (<marie.weiel@kit.edu>)  \n",
+    "Dr. Kaleb Phipps (<kaleb.phipps@kit.edu>)  \n",
+    "\n",
+    "## Exercise 4 on 21.01.25: Logistic Regression\n",
+    "In this fourth exercise, we deal with logistic regression and its parallelization. Logistic regression is a simple machine-learning algorithm for binary classification. It computes the weighted sum of its inputs and outputs an activation that maps this weighted sum into the fixed interval $\\left(0,1\\right)$. This way, we can interpret the output as the probability of belonging to one of the two classes. For more than two classes, one simply trains several such models. In this exercise, we will use logistic regression to decide which of two distributions a data point belongs to.\n",
+    "\n",
+    "### Task 1\n",
+    "As discussed in the lecture of 05.12.24, logistic regression typically uses the iterative gradient-descent method to numerically minimize the loss function $L$ and thus determine the model parameters $W$:\n",
+    "\n",
+    "$$W_{i+1}=W_i-\\eta\\nabla_W L$$\n",
+    "\n",
+    "In this update rule, $\\eta$ denotes the so-called learning rate.\n",
+    "This method requires the gradient of the loss function with respect to the model parameters. The gradient $\\nabla_W L$ is simply the vector with the components $\\frac{\\partial L}{\\partial W_j}$. \n",
+    "Compute the gradient of the mean-square-error loss function\n",
+    "\n",
+    "$$L\\left(Y,\\hat{Y}\\right)=MSE\\left(Y,\\hat{Y}\\right)=\\frac{1}{N}\\left(Y-\\hat{Y}\\right)^T\\left(Y-\\hat{Y}\\right)=\\frac{1}{N}\\sum_{i=1}^N\\left(y_i-\\hat{y}_i\\right)^2$$\n",
+    "\n",
+    "for logistic regression with\n",
+    "\n",
+    "$$\\hat{Y}=sig\\left(XW\\right)=\\frac{1}{1+e^{-XW}}$$\n",
+    "\n",
+    "by hand. To do so, consider a dataset $\\lbrace\\text{Samples, Labels}\\rbrace=\\lbrace X, Y\\rbrace$ consisting of $N$ samples with $D$ features each, i.e., for the data $X\\in\\mathbb{R}^{N\\times\\left(D+1\\right)}$, the labels $Y\\in\\mathbb{R}^N$, the model parameters $W\\in\\mathbb{R}^{D+1}$, and the model prediction $\\hat{Y}\\in\\mathbb{R}^N$ we have (after applying the bias trick):\n",
+    "\n",
+    "$$X=\\begin{pmatrix}\n",
+    "1 & x_{11} & x_{12} & \\dots & x_{1D}\\\\\n",
+    "1 & x_{21} & x_{22} & \\dots & x_{2D}\\\\\n",
+    "\\vdots & \\vdots & \\vdots & \\ddots & \\vdots \\\\\n",
+    "1 & x_{N1} & x_{N2} & \\dots & x_{ND}\n",
+    "\\end{pmatrix},\\quad\n",
+    "Y=\\left(y_1, y_2, \\dots, y_N\\right)^T,\\quad\n",
+    "W=\\left(w_0, w_1,\\dots,w_D\\right)^T,\\quad\n",
+    "\\hat{Y}=sig\\left(XW\\right)=\\left(\\hat{y}_1, \\hat{y}_2,\\dots, \\hat{y}_N\\right)^T$$\n",
+    "\n",
+    "Here, $w_0$ is precisely the bias.\n",
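+    "\n",
+    "*Hint:* The sigmoid satisfies $sig'\\left(z\\right)=sig\\left(z\\right)\\left(1-sig\\left(z\\right)\\right)$, which keeps the intermediate expressions compact. As a sanity check for your hand-derived gradient, you can compare it against a finite-difference approximation of $\\nabla_W L$. A minimal sketch, assuming NumPy and some implementation `loss(w)` of the MSE loss above as a function of the weights (the helper name is purely illustrative, not part of the exercise code):\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "\n",
+    "\n",
+    "def numerical_gradient(loss, w, eps=1e-6):\n",
+    "    # Central differences: approximate dL/dw_j component by component.\n",
+    "    grad = np.zeros_like(w)\n",
+    "    for j in range(w.shape[0]):\n",
+    "        e = np.zeros_like(w)\n",
+    "        e[j] = eps\n",
+    "        grad[j] = (loss(w + e) - loss(w - e)) / (2.0 * eps)\n",
+    "    return grad\n",
+    "```\n",
+    "\n",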
    "### Task 2\n",
    "#### Part a\n",
    "We first consider a serial implementation of logistic regression for the binary classification of an artificially generated dataset. The dataset consists in equal parts of samples drawn from two different Gaussian distributions and shuffled afterwards. Below you find an exemplary function for generating such data, an illustrative plotting function, and the serial code for local execution in the notebook.\n",
    "\n",
    "As you can see from its function signature, the gradient-descent training algorithm `lr_train` has three hyperparameters:\n",
    "\n",
    "- The learning rate `eta` sets the step size of the gradient descent.\n",
    "- The number of epochs `epochs` sets the number of complete passes through the training dataset.\n",
    "- The batch size `b` sets the number of training samples processed before the model's parameters are updated within an epoch.\n",
    "\n",
    "Analyze the training behavior of the algorithm for data consisting of 10,000 samples with 2 features each for different combinations of `epochs` and `b`. What do you notice?\n",
    "\n",
    "| `b`      | 1  | 10 | 10  | 100 | 100   | 2,000  | 10,000 | 10,000  |\n",
    "|----------|----|----|-----|-----|-------|--------|--------|---------|\n",
    "| `epochs` | 20 | 20 | 100 | 100 | 1,000 | 10,000 | 10,000 | 100,000 |"
   ]
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "from typing import Union\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "\n",
    "np.random.seed(842424)  # Fix random seed for reproducibility.\n",
    "\n",
    "\n",
    "def generate_data(\n",
    "    n_samples: int = 10000, input_dim: int = 2\n",
    ") -> tuple[np.ndarray, np.ndarray]:\n",
    "    \"\"\"\n",
    "    Generate artificial data, i.e., coordinates distributed in 2D space by two different Gaussian distributions\n",
    "    and their corresponding labels.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    n_samples : int\n",
    "        The overall number of samples in the dataset.\n",
    "    input_dim : int\n",
    "        The number of features, i.e., dimension, of input samples.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    np.ndarray[float]\n",
    "        The samples.\n",
    "    np.ndarray[float]\n",
    "        The corresponding labels.\n",
    "    \"\"\"\n",
    "    half_samples = n_samples // 2  # Generate two equally balanced classes.\n",
    "\n",
    "    # Generate the blobs.\n",
    "    x1 = np.random.normal(1.0, 0.25, size=(half_samples, input_dim))\n",
    "    x2 = np.random.normal(2.0, 0.30, size=(half_samples, input_dim))\n",
    "\n",
    "    # Create matching labels.\n",
    "    y1 = np.zeros(half_samples)\n",
    "    y2 = np.ones(half_samples)\n",
    "\n",
    "    data = np.concatenate((x1, x2))\n",
    "    labels = np.concatenate((y1, y2))\n",
    "\n",
    "    # Shuffle data to improve convergence behavior and more closely emulate real data.\n",
    "    shuffled_indices = np.arange(n_samples)\n",
    "    np.random.shuffle(shuffled_indices)\n",
    "\n",
    "    return data[shuffled_indices], labels[shuffled_indices]\n",
    "\n",
    "\n",
    "data, labels = generate_data()\n",
    "print(\n",
    "    f\"We have {data.shape[0]} samples with {data.shape[1]} features and {labels.shape[0]} labels.\"\n",
    ")"
   ],
   "id": "8abea2bb063ee34b"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1d71c18b",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-12-04T10:16:06.936764Z",
     "start_time": "2023-12-04T10:16:06.760413Z"
    }
   },
   "outputs": [],
   "source": [
    "def plot_data(data: np.ndarray, labels: np.ndarray, title: str) -> None:\n",
    "    \"\"\"\n",
    "    Plot data colored by labels.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    data : np.ndarray[float]\n",
    "        The data to plot.\n",
    "    labels : np.ndarray[float]\n",
    "        The corresponding labels.\n",
    "    title : str\n",
    "        The plot title.\n",
    "    \"\"\"\n",
    "    plt.xlabel(\"Feature 1\", fontweight=\"bold\")\n",
    "    plt.ylabel(\"Feature 2\", fontweight=\"bold\")\n",
    "    plt.grid()\n",
    "    plt.scatter(data[:, 0], data[:, 1], c=labels, vmin=0.0, vmax=1.0)\n",
    "    plt.title(title, fontweight=\"bold\")\n",
    "    plt.show()\n",
    "\n",
    "\n",
    "plot_data(data, labels, \"Labeled training data\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9549e82",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2023-12-04T10:24:01.368983Z",
     "start_time": "2023-12-04T10:24:01.359768Z"
    }
   },
   "outputs": [],
   "source": [
    "def sigmoid(z: Union[float, np.ndarray]) -> Union[float, np.ndarray]:\n",
    "    \"\"\"\n",
    "    Compute sigmoid.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    z : float or np.ndarray\n",
    "        The 
input for the sigmoid function.\n", + "\n", + " Returns\n", + " ----------\n", + " float\n", + " The input's sigmoid function value.\n", + " \"\"\"\n", + " return 1.0 / (1.0 + np.exp(-z))\n", + "\n", + "\n", + "def lr_predict(w: np.ndarray, x: np.ndarray) -> np.ndarray:\n", + " \"\"\"\n", + " Return prediction of logit model for data x using model weights w.\n", + "\n", + " Parameters\n", + " ----------\n", + " w : np.ndarray[float]\n", + " The parameters, i.e., weights to be learned (after bias trick), shape = [n_features + 1, ].\n", + " There is one weight for every input dimension plus a bias.\n", + " x : np.ndarray[float]\n", + " The dataset (after bias trick), shape = [n_samples, n_features +1].\n", + " The 0th input should be 1.0 to take the bias into account in a simple dot product.\n", + "\n", + " Returns\n", + " -------\n", + " np.ndarray[float]\n", + " The predicted activations of the logit model for the input dataset, shape = [n_samples, ],\n", + " i.e., the sigmoid of the dot product of the weights and the input data.\n", + " \"\"\"\n", + " return sigmoid(x @ w)\n", + "\n", + "\n", + "def mse(y_est: np.ndarray, y: np.ndarray) -> np.ndarray:\n", + " \"\"\"\n", + " Compute mean-square-error loss.\n", + "\n", + " Parameters\n", + " ----------\n", + " y_est : np.ndarray[float]\n", + " The predictions, shape = [n_samples, ].\n", + " y : np.ndarray[float]\n", + " The ground-truth labels, shape = [n_samples, ].\n", + "\n", + " Returns\n", + " ----------\n", + " np.ndarray[float]\n", + " MSE loss\n", + " \"\"\"\n", + " return (\n", + " (1.0 / y.shape[0]) * (y - y_est).T @ (y - y_est)\n", + " ) # Return MSE loss for considered batch.\n", + "\n", + "\n", + "def lr_loss(\n", + " w: np.ndarray, x: np.ndarray, y: np.ndarray\n", + ") -> tuple[np.ndarray, np.ndarray]:\n", + " \"\"\"\n", + " Return the loss and the gradient with respect to the weights.\n", + "\n", + " Parameters\n", + " ----------\n", + " w : np.ndarray[float]\n", + " The model's weights to be learned, where weights[0] is the bias.\n", + " x : np.ndarray[float]\n", + " The input data of shape [N x D+1], 0th element of each sample is assumed to be 1 (bias trick).\n", + " y : np.ndarray[float]\n", + " The ground-truth labels of shape [N,].\n", + "\n", + " Returns\n", + " -------\n", + " np.ndarray[float]\n", + " The scalar mean-square-error loss for the input batch of samples.\n", + " np.ndarray[float]\n", + " The gradient of the loss with respect to the weights for the batch.\n", + " \"\"\"\n", + " y_est = lr_predict(w, x) # Compute logit prediction for all samples in batch.\n", + " loss = mse(y_est, y) # Compute MSE loss over all samples in batch.\n", + " # Compute gradient vector of loss w.r.t. 
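weights: dL/dW = (-2 / N) * x.T @ ((y - y_est) * y_est * (1 - y_est)).\n",
    "    # Since 1 - sig(z) = sig(z) * exp(-z), the factor y_est * (1 - y_est) is equivalently written\n",
    "    # below as y_est * y_est * np.exp(-x @ w).\n",
    "    # Gradient w.r.t. 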
weights.\n", + " gradient = (\n", + " (-2.0 / y.shape[0]) * ((y - y_est) * y_est * y_est * np.exp(-x @ w)).T @ x\n", + " )\n", + " return loss, gradient\n", + "\n", + "\n", + "def lr_train(\n", + " w: np.ndarray,\n", + " x: np.ndarray,\n", + " y: np.ndarray,\n", + " epochs: int = 100,\n", + " eta: float = 0.001,\n", + " b: int = 10,\n", + ") -> np.ndarray:\n", + " \"\"\"\n", + " Train the model, i.e., update the weights following the negative gradient until the model converges.\n", + "\n", + " Parameters\n", + " ----------\n", + " w : np.ndarray[float]\n", + " The model weights to be learned, where weights[0] is the bias.\n", + " x : np.ndarray[float]\n", + " The input data of shape [N x D+1], where each sample's 0th element is assumed to be 1 for bias trick.\n", + " y : np.ndarray[float]\n", + " The ground-truth labels of shape [N,].\n", + " epochs : int\n", + " The number of epochs to be trained.\n", + " eta : float\n", + " The learning rate.\n", + " b : int\n", + " The batch size.\n", + "\n", + " Returns\n", + " -------\n", + " np.ndarray[float]\n", + " The trained weights.\n", + " \"\"\"\n", + " n_samples = y.shape[0] # Determine number of samples.\n", + " n_batches = n_samples // b # Determine number of full batches in data (drop last).\n", + " print(f\"Data is divided into {n_batches} batches.\")\n", + "\n", + " for epoch in range(epochs): # Loop over epochs.\n", + " # The number of epochs is a gradient-descent hyperparameter\n", + " # that controls the number of complete passes through the train set.\n", + " # The batch size is a gradient-descent hyperparameter\n", + " # that controls the number of training samples to work through before the\n", + " # model's internal parameters are updated.\n", + "\n", + " loss_sum = 0.0 # Initiate loss for each epoch.\n", + " accuracy = 0.0 # Initiate accuracy for each epoch.\n", + "\n", + " for nb in range(n_batches):\n", + " x_ = x[nb * b : (nb + 1) * b]\n", + " y_ = y[nb * b : (nb + 1) * b]\n", + " loss, gradient = lr_loss(w, x_, y_)\n", + " loss_sum += loss\n", + "\n", + " corr = np.sum((lr_predict(w, x_) + 0.5).astype(int) == y_)\n", + " accuracy += corr\n", + " w -= eta * gradient\n", + "\n", + " # Calculate loss + accuracy after each epoch.\n", + " loss_sum /= n_batches\n", + " accuracy /= n_samples\n", + " accuracy *= 100\n", + "\n", + " # Print every tenth epoch the training status.\n", + " if epoch % 10 == 0:\n", + " print(f\"Epoch: {epoch}, Loss: {loss_sum}, Accuracy: {accuracy}\")\n", + " return w" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "75702c70", + "metadata": { + "ExecuteTime": { + "end_time": "2023-12-04T10:24:18.647301Z", + "start_time": "2023-12-04T10:24:16.254189Z" + } + }, + "outputs": [], + "source": [ + "# Bias trick: Prepend data with 1's for additional bias dimension.\n", + "ones = np.ones(\n", + " shape=(\n", + " data.shape[0],\n", + " 1,\n", + " )\n", + ")\n", + "data_bt = np.hstack([ones, data])\n", + "\n", + "weights = np.random.rand(data_bt.shape[1]) # Initialize model parameters randomly.\n", + "\n", + "# Before training.\n", + "plot_data(data, lr_predict(weights, data_bt), \"Continuous predictions of untrained model\")\n", + "plot_data(data, np.around(lr_predict(weights, data_bt)), \"Mapped predictions of untrained model\")\n", + "\n", + "# Train model.\n", + "weights = lr_train(weights, data_bt, labels, b=10, epochs=100)\n", + "\n", + "# After training.\n", + "plot_data(data, lr_predict(weights, data_bt), \"Continuous predictions of trained model\")\n", + "plot_data(data, 
np.around(lr_predict(weights, data_bt)), \"Mapped predictions of trained model\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8d616f4",
   "metadata": {},
   "source": [
    "#### Part b\n",
    "Test the serial implementation on a CPU-based node of the bwUniCluster. \n",
    "\n",
    "- To do so, create a Python script based on the code below and the functions from the implementation above, and launch it on the cluster with a submit script. \n",
    "- Use the data and labels in the HDF5 file `/pfs/work7/workspace/scratch/ku4408-VL-ScalableAI/data/logit_data_n10000_d2.h5`, which you find in the datasets `data` and `labels`, respectively. The dataset contains 10,000 samples with 2 features each.\n",
    "- You can pass the number of epochs and the batch size to use as command-line arguments of the Python script. \n",
    "- As on the previous exercise sheets, load the required modules and activate your virtual Python environment before running the actual script (see the submit script below). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80c0733c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import argparse\n",
    "\n",
    "import h5py\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "np.random.seed(842424)  # Fix random seed for reproducibility.\n",
    "\n",
    "##################################\n",
    "# PUT FUNCTION DEFINITIONS HERE! #\n",
    "##################################\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    data_path = (\n",
    "        \"/pfs/work7/workspace/scratch/ku4408-VL-ScalableAI/data/logit_data_n10000_d2.h5\"\n",
    "    )\n",
    "    parser = argparse.ArgumentParser(prog=\"Logit\")\n",
    "    parser.add_argument(\n",
    "        \"--epochs\",\n",
    "        type=int,\n",
    "        default=100,\n",
    "        help=\"The number of epochs to train.\",\n",
    "    )\n",
    "\n",
    "    parser.add_argument(\n",
    "        \"--batch_size\",\n",
    "        type=int,\n",
    "        default=10,\n",
    "        help=\"The batch size.\",\n",
    "    )\n",
    "\n",
    "    args = parser.parse_args()\n",
    "\n",
    "    with h5py.File(data_path, \"r\") as f:\n",
    "        data = np.array(f[\"data\"])\n",
    "        labels = np.array(f[\"labels\"])\n",
    "\n",
    "    print(\n",
    "        f\"We have {data.shape[0]} samples with {data.shape[1]} features and {labels.shape[0]} labels.\"\n",
    "    )\n",
    "\n",
    "    # Bias trick: Prepend data with 1's for additional bias dimension.\n",
    "    ones = np.ones(\n",
    "        (\n",
    "            data.shape[0],\n",
    "            1,\n",
    "        )\n",
    "    )\n",
    "    data_bt = np.hstack([ones, data])\n",
    "    weights = np.random.rand(data_bt.shape[1])  # Initialize model parameters randomly.\n",
    "    weights = lr_train(weights, data_bt, labels, b=args.batch_size, epochs=args.epochs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a1abbe8",
   "metadata": {},
   "outputs": [],
   "source": [
    "#!/bin/bash\n",
    "\n",
    "#SBATCH --job-name=logit_serial      # Job name\n",
    "#SBATCH --partition=dev_single       # Queue for the resource allocation.\n",
    "#SBATCH --time=5:00                  # Wall-clock time limit\n",
    "#SBATCH --cpus-per-task=40           # Number of CPUs required per MPI task\n",
    "#SBATCH --ntasks-per-node=1          # Maximum count of tasks per node\n",
    "#SBATCH --mail-type=ALL              # Notify user by email when certain event types occur.\n",
    "\n",
    "export OMP_NUM_THREADS=40\n",
    "export VENVDIR=<path/to/your/venv>  # Export path to your virtual environment.\n",
    "export PYDIR=<path/to/your/python/script>  # Export path to directory containing Python script.\n",
    "\n",
+ "# Set up modules.\n", + "module purge # Unload all currently loaded modules.\n", + "module load compiler/gnu/13.3 # Load required modules.\n", + "module load mpi/openmpi/4.1\n", + "module load devel/cuda/12.4\n", + "module load lib/hdf5/1.14.4-gnu-13.3-openmpi-4.1\n", + "\n", + "source ${VENVDIR}/bin/activate # Activate your virtual environment.\n", + "\n", + "python ${PYDIR}/logit_serial.py --epochs 100 --batch_size 10" + ] + }, + { + "cell_type": "markdown", + "id": "83c5b31f", + "metadata": {}, + "source": [ + "### Aufgabe 3\n", + "Implementieren Sie ausgehend von obigem seriellen Code eine daten-parallele Version der logistischen Regression. Datenparallelismus bezeichnet eine Form der Parallelisierung, bei der das Training über die vorliegenden Samples in einem Batch der effektiven Batchgröße $b_\\text{eff}$ parallelisiert wird.\n", + "Die Daten werden partitioniert und auf die verschiedenen Prozessoren verteilt, die diese parallel bearbeiten.\n", + "Jeder Prozessor verfügt über eine eigene Kopie des Modells und arbeitet lokal mit den jeweils vorliegenden Samples, wobei bei Bedarf mit den anderen Prozessoren kommuniziert wird, um die Kopien konsistent zu halten. Für den Gradientenabstieg bedeutet dies, dass es neben der oben erwähnten globalen effektiven Batchgröße $b_\\text{eff}$ auf jedem Prozessor $p$ eine lokale Batchgröße $b_p$ (\"Mini-Mini-Batch\") gibt, für die gilt:\n", + "\n", + "$$b_\\text{eff}=\\sum_{p}b_p$$\n", + "\n", + "Jeder Prozessor berechnet für seine lokal vorliegenden Batches der Größe $b_p$ die Kostenfunktion sowie deren Gradient bezüglich der Gewichte. Nach Abarbeitung eines lokalen Batches müssen nun alle Prozessoren die jeweils lokal berechneten Gradienten austauschen und über diese mitteln, sodass jeder Prozessor anschließend die Gewichte seiner lokalen Modellkopie entsprechend der effektiven Batchgröße redundant und mit allen anderen Kopien konsistent aktualisieren kann. \n", + "\n", + "- Wie in den vorherigen Übungen auch laden wir dazu die Daten entlang der Sample-Achse verteilt auf die vorliegenden Prozessoren. Untenstehend finden Sie einen entsprechenden Dataloader. \n", + "- Implementieren Sie ausgehend von den seriellen Funktionen eine daten-parallele Version des Gradientenabstiegs für die logistische Regression und testen Sie Ihren Code auf vier CPU-basierten Knoten des bwUniClusters. \n", + "- Erstellen Sie dazu analog zu Aufgabenteil 2b ein Python-Skript sowie ein Submit-Bash-Skript (siehe auch vorherige Übungsblätter). \n", + "- Nutzen Sie die Daten und Labels in der HDF5-Datei `/pfs/work7/workspace/scratch/ku4408-VL-ScalableAI/data/logit_data_n100000_d2.h5`, die Sie in den Datensätzen `data` bzw. `labels` vorfinden. Der Datensatz enthält 100 000 Samples mit je 2 Features.\n", + "- Trainieren Sie für $epochs = 100$ Epochen und nutzen Sie zunächst eine effektive Batchgröße von $b_\\mathrm{eff}=100$. \n", + "- Sie können diesen Datensatz ebenfalls mit Ihrer seriellen Variante der logistischen Regression auf dem Cluster klassifizieren. Vergleichen Sie die Güte des trainierten Modells für die gleiche Anzahl an Epochen $epochs = 100$ und die gleiche (effektive) Batchgröße $b_\\mathrm{(eff)}=100$. Was fällt Ihnen auf? Variieren Sie gegebenenfalls die Hyperparameter Ihrer parallelen Version, sodass Sie eine vergleichbare Qualität des trainierten Modells erhalten. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5789926d", + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "from mpi4py import MPI\n", + "import h5py\n", + "\n", + "\n", + "if __name__ == \"__main__\":\n", + " data_path = \"/pfs/work7/workspace/scratch/ku4408-VL-ScalableAI/data/logit_data_n100000_d2.h5\"\n", + " comm = MPI.COMM_WORLD # Set up communicator.\n", + " rank, size = comm.rank, comm.size\n", + "\n", + " with h5py.File(data_path, \"r\") as f: # Load data in sample-parallel fashion.\n", + " chunk = int(f[\"data\"].shape[0] / size)\n", + " if rank == size - 1:\n", + " data = np.array(f[\"data\"][rank * chunk :])\n", + " labels = np.array(f[\"labels\"][rank * chunk :])\n", + " else:\n", + " data = np.array(f[\"data\"][rank * chunk : (rank + 1) * chunk])\n", + " labels = np.array(f[\"labels\"][rank * chunk : (rank + 1) * chunk])\n", + "\n", + " print(\n", + " f\"Rank {rank}/{size}: Local data has {data.shape[0]} samples with {data.shape[1]} features and \"\n", + " f\"{labels.shape[0]} labels. 0th elements are: {data[0]}, {labels[0]}\"\n", + " )\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} -- GitLab