GitBook: [#2882] update

2021-12-01 00:10:15 +00:00 · 2021-12-01 00:10:15 +00:00 · 5308f1b4d0
commit 5308f1b4d0
parent 7a93916e07
1 changed files with 25 additions and 0 deletions
--- a/a.i.-exploiting/bra.i.nsmasher-presentation/ml-basics/feature-engineering.md
+++ b/a.i.-exploiting/bra.i.nsmasher-presentation/ml-basics/feature-engineering.md
@@ -152,6 +152,8 @@ X_over, y_over = ros.fit_resample(X,y)
 print(y_over.value_counts()) #Confirm data isn't imbalanced anymore
 ```

+You can use the **`sampling_strategy`** argument to control **how much** you want to **undersample or oversample**. In a binary problem, a float sets the desired **ratio of minority to majority samples after resampling**. The default, `'auto'`, balances the classes, which is equivalent to a ratio of **1 (100%)**.
+
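The arithmetic behind a float `sampling_strategy` can be sketched in plain Python. This is only an illustration of the ratio, not imbalanced-learn's code: `resampled_counts` is a hypothetical helper, and it assumes a binary dataset where `ratio` means minority samples divided by majority samples after resampling.

```python
def resampled_counts(n_minority, n_majority, ratio, mode):
    """Return (minority, majority) sizes after resampling a binary dataset.

    Hypothetical helper: `ratio` plays the role of a float `sampling_strategy`,
    i.e. minority samples / majority samples after resampling.
    """
    if mode == "over":
        # Oversampling grows the minority class up to ratio * majority
        return int(ratio * n_majority), n_majority
    if mode == "under":
        # Undersampling shrinks the majority class down to minority / ratio
        return n_minority, int(n_minority / ratio)
    raise ValueError("mode must be 'over' or 'under'")


print(resampled_counts(100, 1000, 0.5, "over"))   # (500, 1000)
print(resampled_counts(100, 1000, 0.5, "under"))  # (100, 200)
print(resampled_counts(100, 1000, 1.0, "over"))   # (1000, 1000): fully balanced
```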
 {% hint style="info" %}
 Undersampling and oversampling aren't perfect: if you compute statistics (with `.describe()`) on the over/under-sampled data and compare them with the original, you will see **that they changed.** Both techniques therefore modify the training data.
 {% endhint %}
@@ -170,3 +172,26 @@ dataset_smote = pd.DataFrame(X_smote, columns=['column1', 'column2', 'column3'])
 dataset['target_column'] = y_smote
 print(y_smote.value_counts()) #Confirm data isn't imbalanced anymore
 ```
+
+## Rarely Occurring Categories
+
+Imagine a dataset where one of the target classes **occurs very rarely**.
+
+This is like the class imbalance from the previous section, but the rare category occurs even less often than the "minority class" did there. **Raw oversampling** and **undersampling** can also be used here, but they generally **won't give very good results**.
+
+### Weights
+
+Some algorithms let you **assign a weight to each target class**, so that some classes count more than others when the model is fitted.
+
+```python
+from sklearn.linear_model import LogisticRegression
+weights = {0: 10, 1: 1} # Class 0 (False) gets weight 10, class 1 (True) gets weight 1
+model = LogisticRegression(class_weight=weights)
+```
+
+You can **mix the weights with over/under-sampling techniques** to try to improve the results.
+
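If you don't want to pick the weights by hand, one common heuristic is to make them inversely proportional to the class frequencies. The sketch below computes them in plain Python; `balanced_class_weights` is a hypothetical helper name and the label counts are made up for illustration. scikit-learn's `class_weight="balanced"` option uses the same formula, `n_samples / (n_classes * count)`.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Illustrative labels: 990 samples of class 0 and only 10 of class 1
y = [0] * 990 + [1] * 10
weights = balanced_class_weights(y)
print(weights)  # the rare class 1 gets weight 50.0, class 0 about 0.505
```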
+### PCA - Principal Component Analysis
+
+PCA is a method that reduces the dimensionality of the data. It **combines the original features** into a **smaller number** of new ones that keep most of the variance, so **less computation is needed**.
+
+The resulting features aren't directly interpretable by humans, so it also helps to **anonymize the data**.
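
To make the idea concrete, here is a minimal sketch that reduces two features to one using only the standard library. It is not how you would do it in practice (a library such as `sklearn.decomposition.PCA` handles any number of features); `pca_2d_to_1d` is a hypothetical helper, the sample points are made up, and the closed-form angle only works for the two-feature case.

```python
import math


def pca_2d_to_1d(points):
    """Project 2-feature samples onto their first principal component."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # PCA works on centered data
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Angle of the direction of maximum variance for a symmetric 2x2 matrix
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))

    # The new single feature is the coordinate along that direction
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    var_pc1 = sum(v * v for v in projected) / (n - 1)
    explained_ratio = var_pc1 / (sxx + syy)  # assumes the data is not constant
    return projected, direction, explained_ratio


# Illustrative data where the second feature is exactly twice the first
points = [(1, 2), (2, 4), (3, 6), (4, 8)]
projected, direction, explained_ratio = pca_2d_to_1d(points)
print(direction)        # points along the line y = 2x
print(explained_ratio)  # close to 1: one feature keeps all the variance
```

The values in `projected` no longer correspond to either original column, which is why the transformed data is harder for a human to read.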