GitBook: [#2881] update

2021-11-30 23:36:04 +00:00 · 2021-11-30 23:36:04 +00:00 · 7a93916e07
commit 7a93916e07
parent 1b45fddbff
1 changed files with 46 additions and 1 deletions
--- a/a.i.-exploiting/bra.i.nsmasher-presentation/ml-basics/feature-engineering.md
+++ b/a.i.-exploiting/bra.i.nsmasher-presentation/ml-basics/feature-engineering.md
@@ -124,4 +124,49 @@ print(dataset.target_column.value_counts())

 In an imbalance there is always a **majority class or classes** and a **minority class or classes**.

-There are 2 main ways to fix this problem. Using undersampling: REmoving randomly selected data fom the majority class so it has the same numbe
+There are 2 main ways to fix this problem:
+
+* **Undersampling**: Removing randomly selected data from the majority class so it has the same number of samples as the minority class.
+
+```python
+from imblearn.under_sampling import RandomUnderSampler
+rus = RandomUnderSampler(random_state=1337)
+
+X = dataset[['column1', 'column2', 'column3']].copy()
+y = dataset.target_column
+
+X_under, y_under = rus.fit_resample(X,y)
+print(y_under.value_counts()) #Confirm data isn't imbalanced anymore
+```
+
+* **Oversampling**: Generating more data for the minority class until it has as many samples as the majority class.
+
+```python
+from imblearn.over_sampling import RandomOverSampler
+ros = RandomOverSampler(random_state=1337)
+
+X = dataset[['column1', 'column2', 'column3']].copy()
+y = dataset.target_column
+
+X_over, y_over = ros.fit_resample(X,y)
+print(y_over.value_counts()) #Confirm data isn't imbalanced anymore
+```
+
+{% hint style="info" %}
+Undersampling and oversampling aren't perfect: if you compute statistics (with `.describe()`) on the over/under-sampled data and compare them to the original, you will see **that they changed**. Therefore, oversampling and undersampling modify the training data.
+{% endhint %}
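
The effect described in the hint can be reproduced without any library. The following is a minimal sketch in plain Python, not the imblearn implementation: `random_undersample` is a hypothetical helper and the toy data is made up for illustration. It balances the classes by randomly dropping majority rows, then compares the feature mean before and after.

```python
import random
import statistics

def random_undersample(rows, label_of, seed=1337):
    """Randomly drop rows from larger classes until every class matches the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        # Sample without replacement so no row is duplicated
        balanced.extend(rng.sample(members, target))
    return balanced

# Hypothetical toy data: (feature, label) pairs, 90 majority rows vs 10 minority rows
toy = [(float(i), 0) for i in range(90)] + [(1000.0 + i, 1) for i in range(10)]
under = random_undersample(toy, label_of=lambda row: row[1])

orig_mean = statistics.mean(feature for feature, _ in toy)
under_mean = statistics.mean(feature for feature, _ in under)
print(len(toy), orig_mean)     # original size and feature mean
print(len(under), under_mean)  # balanced size and a shifted feature mean
```

Because the minority rows now make up half of the data instead of a tenth, the feature mean moves toward the minority class, which is the kind of change `.describe()` would reveal on a real dataset.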
+
+### SMOTE oversampling
+
+**SMOTE** is usually a **more reliable way to oversample the data**, since it generates new synthetic minority samples instead of duplicating existing ones.
+
+```python
+import pandas as pd
+from imblearn.over_sampling import SMOTE
+
+# For SMOTE, the target_column needs to be numeric; map it if necessary
+smote = SMOTE(random_state=1337)
+X_smote, y_smote = smote.fit_resample(dataset[['column1', 'column2', 'column3']], dataset.target_column)
+dataset_smote = pd.DataFrame(X_smote, columns=['column1', 'column2', 'column3'])
+# Attach the resampled labels to the resampled frame (not the original dataset, whose length differs)
+dataset_smote['target_column'] = y_smote
+print(y_smote.value_counts()) #Confirm data isn't imbalanced anymore
+```
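
To give an intuition for what SMOTE does internally, here is a rough sketch of its core idea under simplifying assumptions: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. `smote_like` is a hypothetical helper written for this illustration; it is not the imblearn code and skips details such as deciding how many samples to generate per class.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=1337):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding the base itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority class: four points on the corners of a unit square
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Every synthetic point lies between two real minority samples, so the new data stays inside the region the minority class already occupies rather than repeating identical rows.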