|
|
|
...
|
|
|
|
Ode to Learning
|
|
|
|
... |
|
|
\ No newline at end of file |
|
|
|
## Ode to Learning
|
|
|
|
|
|
|
|
_C. Ates_
|
|
|
|
|
|
|
|
[[_TOC_]]
|
|
|
|
|
|
|
|
#### On the nature of learning
|
|
|
|
|
|
|
|
\--to be cont'd--
|
|
|
|
|
|
|
|
#### <span dir="">Why are we using NumPy and Pandas?</span>
|
|
|
|
|
|
|
|
Data is inseparable from machine learning applications and constitutes one of the pillars of data driven engineering. When we start working with different models, you will see that we will be mainly relying on NumPy and Pandas for data management. Why is this the case? Why do we not simply use lists, or list of lists?
|
|
|
|
|
|
|
|
NumPy stands for “Numerical Python”, one of the oldest Python libraries. Its core purpose is to manage multidimensional arrays (Tensors := n-dimensional arrays). In ML, we deal with very high dimensional data, hence we need a mean to operate with n dimensional arrays. What is different about NumPy compared to traditional lists of python is that in NumPy operates on homogenous data, hence the type of the data stored in the array is known (e.g., int32). The metadata about the array includes its shape, size, data type, and other existing attributes. This enables to allocate space in RAM in a very efficient way with minimum memory footprint. This strategy also allows to operate with memory blocks, known as vectorised operation. In short, we can perform an operation to every element at once (e.g., data normalization) –no iterative loops. Furthermore, these calculations are done by lower level libraries written in C or Fortran, taking advantage of the fast computation of the compiled codes. <span dir=""> </span>
|
|
|
|
|
|
|
|
The Pandas library builds on top of NumPy, so everything we do via Pandas goes back to NumPy, which is then pushed to a low level code written in C/Fortran. You can consider them as Russian dolls. What Pandas enables is to navigate and operate in any direction over the ndarrays. You can pick any combinations of column and row selections, fetch them and operate on them and update them. With this high level management, data-processing becomes extremely convenient –meaning with little amount of code, you can achieve a lot. We will use it to make our lives easier. Another important aspect of Pandas is indexing. Pandas objects (series, data frames) can be indexed in various ways, for instance date and time information can be used as index for our observations, rather than simple integers. By doing so, we can combine different databases of various sizes without losing the sequential information. So, the multidimensional arrays will have columns with varying sizes in a natural way. Herein, the empty slots can be managed via Pandas in various ways. ML libraries utilize NumPy to manage the data flow but popular libraries such as TensorFlow can be fed with Pandas series / data frames as well. Hence, managing your dataset with Pandas is usually the easiest way and compatible with the ML model implementations we will use in the lecture. We will learn more about its capabilities in due time while working on different datasets.
|
|
|
|
|
|
|
|
* <https://numpy.org/>
|
|
|
|
* <https://pandas.pydata.org/> |
|
|
\ No newline at end of file |