GitBook: [#2880] update

CPol 2021-11-30 23:18:19 +00:00 committed by gitbook-bot
parent 5d0d76bd31
commit 1b45fddbff

Data can be **continuous** (**infinite** values) or **categorical** (nominal), where the amount of possible values is **limited**.
### Categorical types
#### Binary
Just **2 possible values**: 1 or 0. If a dataset stores these values as strings (e.g. "True" and "False"), you can assign numbers to them with:
```python
dataset["column2"] = dataset.column2.map({"T": 1, "F": 0})
```
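Note that `Series.map` turns any value missing from the dictionary into `NaN`, so a quick sanity check afterwards can help:
```python
# Values absent from the mapping dict become NaN after .map()
print(dataset.column2.isna().sum())
```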
#### **Ordinal**
The **values follow an order**, as in: 1st place, 2nd place... If the categories are strings (like "starter", "amateur", "professional", "expert") you can map them to numbers as in the binary case:
```python
column2_mapping = {'starter':0,'amateur':1,'professional':2,'expert':3}
dataset['column2'] = dataset.column2.map(column2_mapping)
```
* For **alphabetic columns** you can generate the mapping more easily:
```python
# First get all the unique values, alphabetically sorted
possible_values_sorted = dataset.column2.sort_values().unique().tolist()
# Assign each one a value
possible_values_mapping = {value:idx for idx,value in enumerate(possible_values_sorted)}
dataset['column2'] = dataset.column2.map(possible_values_mapping)
```
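A shorter equivalent (a sketch relying on pandas assigning category codes in sorted order for string columns):
```python
# Category codes follow the sorted (alphabetical) order of the values
dataset['column2'] = dataset.column2.astype('category').cat.codes
```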
#### **Cyclical**
Looks **like an ordinal value** because there is an order, but coming later in the order doesn't mean one value is bigger than another. Also, the **distance between values depends on the direction** in which you count. Example: the days of the week; Sunday isn't "bigger" than Monday.
* There are **different ways** to encode cyclical features; some may only work with certain algorithms. **In general, dummy encoding can be used**:
```python
column2_dummies = pd.get_dummies(dataset.column2, drop_first=True)
dataset_joined = pd.concat([dataset[['column2']], column2_dummies], axis=1)
```
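Another common option for cyclical features (a sketch assuming a hypothetical numeric `weekday` column with values 0-6) is to project the cycle onto a circle with sine/cosine, so the last and first values of the cycle end up close together:
```python
import numpy as np

# Encode the 7-value cycle as coordinates on a circle:
# Sunday (6) and Monday (0) become neighbours instead of extremes
dataset['weekday_sin'] = np.sin(2 * np.pi * dataset.weekday / 7)
dataset['weekday_cos'] = np.cos(2 * np.pi * dataset.weekday / 7)
```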
#### **Dates**
Dates are **continuous variables**. They can be seen as **cyclical** (because they repeat) **or** as **ordinal** variables (because one time is bigger than a previous one).
* Usually dates are used as the **index**:
```python
# Transform dates to datetime
# [...]
dataset['weekday'] = dataset.transaction_date.dt.weekday
dataset['day_name'] = dataset.transaction_date.apply(lambda x: x.day_name())
```
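As a self-contained sketch of that workflow (the `transaction_date` column name is taken from the snippet above; the sample values are made up):
```python
import pandas as pd

df = pd.DataFrame({'transaction_date': ['2021-11-01', '2021-11-02', '2021-11-08']})
df['transaction_date'] = pd.to_datetime(df.transaction_date)
df = df.set_index('transaction_date')  # use the date as the index
df['month'] = df.index.month           # ordinal view of the date
df['weekday'] = df.index.weekday       # cyclical view of the date
```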
#### Multi-category/nominal
**More than 2 categories** with no related order. Use `dataset.describe(include='all')` to get information about the categories of each feature.
* A **referring string** is a **column that identifies an example** (like a name of a person). This can be duplicated (because 2 people may have the same name) but most will be unique. This data is **useless and should be removed**.
* A **key column** is used to **link data between tables**. In this case the elements are unique. This data is **useless and should be removed**.
To **encode multi-category columns into numbers** (so the ML algorithm understands them), **dummy encoding is used** (and **not one-hot encoding**, because it **doesn't avoid perfect multicollinearity**).
You can get a **multi-category column one-hot encoded** with `pd.get_dummies(dataset.column1)`. This will transform all the classes into binary features, creating **one new column per possible class**: the column of the matching class is set to **1** and the rest to 0.
You can get a **multi-category column dummy encoded** with `pd.get_dummies(dataset.column1, drop_first=True)`. This will transform all the classes into binary features, creating **one new column per possible class minus one**; the **dropped class is represented as all 0s in the created columns**. This avoids perfect multicollinearity, reducing the relations between the columns.
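A tiny sketch (with a made-up `colors` column) makes the difference visible:
```python
import pandas as pd

colors = pd.Series(['red', 'green', 'blue', 'green'])
print(pd.get_dummies(colors))                   # 3 columns: blue, green, red
print(pd.get_dummies(colors, drop_first=True))  # 2 columns: 'blue' rows become all 0s
```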
## Collinear/Multicollinearity
Collinearity appears when **2 features are related to each other**; multicollinearity appears when more than 2 are.
In ML **you want your features to be related to the possible results, but not to each other**. That's why **dummy encoding drops one of the columns** and **is better than one-hot encoding**, which creates a clear relation between all the new features built from the multi-category column.
VIF is the **Variance Inflation Factor** which **measures the multicollinearity of the features**. A value **above 5 means that one of the two or more collinear features should be removed**.
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
#dummies_encoded = pd.get_dummies(dataset.column1, drop_first=True)
onehot_encoded = pd.get_dummies(dataset.column1)
X = add_constant(onehot_encoded) # Add a constant column to the one-hot encoded data
print(pd.Series([variance_inflation_factor(X.values,i) for i in range(X.shape[1])], index=X.columns))
```
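With plain one-hot encoding the dummy columns sum to the constant column, so the regression behind the VIF is perfectly collinear and the reported values blow up; rerunning the snippet with the commented `drop_first=True` line instead should bring them back down.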
## Categorical Imbalance
This occurs when there is **not the same amount of each category** in the training data.
```python
# Get statistics of the features
print(dataset.describe(include='all'))
# Get an overview of the features
print(dataset.info())
# Get imbalance information of the target column
print(dataset.target_column.value_counts())
```
In an imbalance there is always a **majority class or classes** and a **minority class or classes**.
There are 2 main ways to fix this problem:
* **Undersampling**: Removing randomly selected data from the majority class so it has the same number of samples as the minority class.
* **Oversampling**: Randomly duplicating samples of the minority class until it has as many as the majority class.
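A minimal undersampling sketch in plain pandas (`target_column` as in the snippet above; this simply shrinks every class to the size of the smallest one):
```python
# Find the size of the smallest class
min_size = dataset.target_column.value_counts().min()
# Keep `min_size` randomly selected rows from each class
balanced = (dataset.groupby('target_column', group_keys=False)
                   .apply(lambda g: g.sample(min_size, random_state=0)))
print(balanced.target_column.value_counts())
```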