GitBook: [#2882] update

This commit is contained in:
CPol 2021-12-01 00:10:15 +00:00 committed by gitbook-bot
parent 7a93916e07
commit 5308f1b4d0
No known key found for this signature in database
GPG Key ID: 07D2180C7B12D0FF

View File

@ -152,6 +152,8 @@ X_over, y_over = ros.fit_resample(X,y)
print(y_over.value_counts()) #Confirm data isn't imbalanced anymore
```
You can use the argument **`sampling_strategy`** to indicate the **percentage** you want to **undersample or oversample** (**by default it's 1 (100%)** which means to equal the number of minority classes with majority classes)
{% hint style="info" %}
Undersamplig or Oversampling aren't perfect if you get statistics (with `.describe()`) of the over/under-sampled data and compare them to the original you will see **that they changed.** Therefore oversampling and undersampling are modifying the training data.
{% endhint %}
@ -170,3 +172,26 @@ dataset_smote = pd.DataFrame(X_smote, columns=['column1', 'column2', 'column3'])
dataset['target_column'] = y_smote
print(y_smote.value_counts()) #Confirm data isn't imbalanced anymore
```
## Rarely Occurring Categories
Imagine a dataset where one of the target classes **occur very little times**.
This is like the category imbalance from the previous section, but the rarely occurring category is occurring even less than "minority class" in that case. The **raw** **oversampling** and **undersampling** methods could be also used here, but generally those techniques **won't give really good results**.
### Weights
In some algorithms it's possible to **modify the weights of the targeted data** so some of them get by default more importance when generating the model.
```python
weights = {0: 10 1:1} #Assign weight 10 to False and 1 to True
model = LogisticRegression(class_weight=weights)
```
You can **mix the weights with over/under-sampling techniques** to try to improve the results.
### PCA - Principal Component Analysis
Is a method that helps to reduce the dimensionality of the data. It's going to **combine different features** to **reduce the amount** of them generating **more useful features (**less computation is needed).
The resulting features aren't understandable by humans, so it also **anonymize the data**.