C. Ates
On the nature of learning
--to be cont'd--
Jupyter Notebooks
We will be using Jupyter notebooks (formerly IPython notebooks), which enable us to share with you the details of the Python code of the week. The notebook environment provides the full story of a machine learning project. As we see in the lecture, an ML project consists of multiple phases, starting from the analysis of the physical problem and ending with the deployment of the model. The notebooks provide the means to explain all these steps with visuals and to share them with you in a convenient way.
Jupyter notebooks are interactive documents that run Python code in a web browser. Although built around Python, the notebook environment now supports multiple languages, including Java, R, Julia, Matlab, Octave, Scheme, Processing and Scala. The notebook itself is organized in cells, which can be executed individually; the output of the code in a cell is displayed right below that cell. Because cells can be run independently, we can re-run them, play with their order and explore options separately. All outputs are embedded in the notebook as well, so it can be shared directly without even running the script. Another advantage is publishing: Git hosting platforms render Jupyter notebooks, so all the work, including the output, can be shared with third parties directly. Combined with cloud computing, we can store the notebooks on the cloud, run the script remotely, and save and share everything online, from beginning to end.
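As a minimal illustration of this cell-based workflow, consider the following snippet, which could live in a single notebook cell (the data is made up purely for demonstration). When the cell is executed, the printed text and the plot appear directly below it and are stored in the notebook itself:

    # Contents of a single notebook cell; outputs are rendered below the cell.
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0.0, 2.0 * np.pi, 100)   # illustrative sample points
    y = np.sin(x)                             # toy signal, for demonstration only

    print("first five samples:", y[:5])       # text output, embedded in the notebook
    plt.plot(x, y)                             # the figure is embedded as well
    plt.xlabel("x")
    plt.ylabel("sin(x)")
    plt.show()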
Why are we using NumPy and Pandas?
Data is inseparable from machine learning applications and constitutes one of the pillars of data-driven engineering. When we start working with different models, you will see that we rely mainly on NumPy and Pandas for data management. Why is this the case? Why do we not simply use lists, or lists of lists?
NumPy stands for "Numerical Python" and is one of the oldest Python libraries. Its core purpose is to manage multidimensional arrays (tensors := n-dimensional arrays). In ML we deal with very high-dimensional data, hence we need a means to operate on n-dimensional arrays. What sets NumPy apart from traditional Python lists is that NumPy operates on homogeneous data, so the type of the data stored in the array is known (e.g., int32). The metadata about the array includes its shape, size, data type, and other attributes. This enables NumPy to allocate space in RAM very efficiently, with a minimum memory footprint. This strategy also allows operating on whole memory blocks, known as vectorised operations. In short, we can perform an operation on every element at once (e.g., data normalization), with no iterative loops. Furthermore, these calculations are carried out by lower-level libraries written in C or Fortran, taking advantage of the speed of compiled code.
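To make the idea concrete, here is a small sketch of a vectorised operation (data normalization) next to the equivalent plain-list version; the numbers are invented purely for illustration:

    import numpy as np

    # Hypothetical sensor readings: homogeneous float32 data in one memory block.
    data = np.array([12.0, 15.5, 9.3, 20.1, 17.8], dtype=np.float32)
    print(data.shape, data.dtype, data.size)     # metadata: (5,) float32 5

    # Vectorised normalization: one expression touches every element at once,
    # executed by compiled code under the hood, with no explicit Python loop.
    normalized = (data - data.mean()) / data.std()

    # The same with a plain Python list requires iterating element by element.
    values = [12.0, 15.5, 9.3, 20.1, 17.8]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    normalized_list = [(v - mean) / std for v in values]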
The Pandas library builds on top of NumPy, so everything we do via Pandas goes back to NumPy, which in turn pushes the work down to low-level code written in C/Fortran. You can think of them as Russian dolls. What Pandas adds is the ability to navigate and operate in any direction over the ndarrays: you can pick any combination of columns and rows, fetch them, operate on them and update them. With this high-level management, data processing becomes extremely convenient, meaning that with a small amount of code you can achieve a lot. We will use it to make our lives easier. Another important aspect of Pandas is indexing. Pandas objects (Series, DataFrames) can be indexed in various ways; for instance, date and time information can be used as the index of our observations rather than simple integers. By doing so, we can combine different databases of various sizes without losing the sequential information, so columns of different lengths can coexist naturally, and the resulting empty slots can be handled by Pandas in several ways. ML libraries use NumPy to manage the data flow, but popular libraries such as TensorFlow can also be fed with Pandas Series / DataFrames. Hence, managing your dataset with Pandas is usually the easiest way and is compatible with the ML model implementations we will use in the lecture. We will learn more about its capabilities in due time while working on different datasets.
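As a brief sketch of these ideas (the column names and values are hypothetical), the following example builds two series of different lengths on a shared time index, selects rows and columns, fills the resulting empty slot, and hands the data over as a NumPy array:

    import pandas as pd

    # Two hypothetical measurement series of different lengths,
    # indexed by timestamps instead of plain integers.
    times = pd.to_datetime(["2023-01-01 00:00", "2023-01-01 01:00", "2023-01-01 02:00"])
    temperature = pd.Series([21.3, 21.8, 22.4], index=times)
    pressure = pd.Series([1.01, 1.02], index=times[:2])

    # Combining them aligns on the shared time index; the missing slot becomes NaN.
    df = pd.DataFrame({"temperature": temperature, "pressure": pressure})

    # Flexible selection: any combination of rows and columns.
    subset = df.loc[df["temperature"] > 21.5, ["pressure"]]

    # Empty slots can be handled in several ways, e.g. filled with a mean value.
    df_filled = df.fillna(df["pressure"].mean())

    # The underlying NumPy array can be passed to an ML library directly.
    features = df_filled.to_numpy()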