GitBook: [#2881] update
parent 1b45fddbff
commit 7a93916e07
@ -124,4 +124,49 @@ print(dataset.target_column.value_counts())
In an imbalanced dataset there is always a **majority class or classes** and a **minority class or classes**.

There are 2 main ways to fix this problem:

* **Undersampling**: Removing randomly selected samples from the majority class so it has the same number of samples as the minority class.

```python
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=1337)

X = dataset[['column1', 'column2', 'column3']].copy()
y = dataset.target_column

X_under, y_under = rus.fit_resample(X, y)
print(y_under.value_counts())  # Confirm data isn't imbalanced anymore
```

* **Oversampling**: Generating more data for the minority class until it has as many samples as the majority class. Random oversampling does this by duplicating randomly selected minority samples.

```python
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=1337)

X = dataset[['column1', 'column2', 'column3']].copy()
y = dataset.target_column

X_over, y_over = ros.fit_resample(X, y)
print(y_over.value_counts())  # Confirm data isn't imbalanced anymore
```

{% hint style="info" %}
Undersampling and oversampling aren't perfect: if you compute statistics (with `.describe()`) for the over/under-sampled data and compare them to the original, you will see **that they changed**. Oversampling and undersampling therefore modify the training data.
{% endhint %}
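
The change in statistics can be checked directly. The snippet below is only a minimal sketch: the toy `dataset` is made up for illustration (its values and its 90/10 class split are assumptions), and the resampling is done by hand with pandas so that it runs without `imblearn`. It places the `.describe()` output of the original, undersampled and oversampled feature side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset: 90 majority rows (class 0) and 10 minority rows (class 1)
rng = np.random.default_rng(1337)
dataset = pd.DataFrame({
    "column1": np.concatenate([rng.normal(0, 1, 90), rng.normal(3, 1, 10)]),
    "target_column": [0] * 90 + [1] * 10,
})

counts = dataset.target_column.value_counts()
n_min, n_max = int(counts.min()), int(counts.max())

# Random undersampling by hand: keep n_min randomly chosen rows of every class
under = dataset.groupby("target_column").sample(n=n_min, random_state=1337)

# Random oversampling by hand: duplicate random minority rows until classes match
minority = dataset[dataset.target_column == counts.idxmin()]
extra = minority.sample(n=n_max - n_min, replace=True, random_state=1337)
over = pd.concat([dataset, extra], ignore_index=True)

# Summary statistics of the same feature before and after resampling
comparison = pd.DataFrame({
    "original": dataset.column1.describe(),
    "undersampled": under.column1.describe(),
    "oversampled": over.column1.describe(),
})
print(comparison)
```

With a real dataset, the same comparison can be made on `X_under` or `X_over` against the original `X`.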
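
### SMOTE oversampling

**SMOTE** (Synthetic Minority Over-sampling Technique) is usually a **more reliable way to oversample the data**: instead of duplicating existing rows, it creates synthetic minority samples by interpolating between a minority sample and its nearest minority-class neighbors.
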
```python
import pandas as pd
from imblearn.over_sampling import SMOTE

# SMOTE interpolates between samples, so the feature columns must be numeric;
# map target_column to numbers as well if necessary
smote = SMOTE(random_state=1337)
X_smote, y_smote = smote.fit_resample(dataset[['column1', 'column2', 'column3']], dataset.target_column)
dataset_smote = pd.DataFrame(X_smote, columns=['column1', 'column2', 'column3'])
dataset_smote['target_column'] = y_smote  # Attach the labels to the resampled frame
print(y_smote.value_counts())  # Confirm data isn't imbalanced anymore
```
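
To give an intuition for what SMOTE does to the minority class, here is a rough NumPy sketch of its core idea: each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbors. The helper `smote_like_samples` and the toy `X_min` array are hypothetical names made up for this illustration. This is not how `imblearn` implements it internally, and it ignores details such as choosing how many samples to generate per class.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=1337):
    """Illustrative SMOTE-style interpolation (hypothetical helper, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)  # cannot have more neighbors than other samples

    # Pairwise Euclidean distances between minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per sample

    # Pick a random base sample and one of its neighbors for every new point
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]

    # Place the new point at a random position on the segment base -> neighbor
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Hypothetical minority-class feature matrix (4 samples, 2 features)
X_min = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]])
synthetic = smote_like_samples(X_min, n_new=6, k=2)
print(synthetic)
```

Because the new points are interpolated rather than copied, they are not exact duplicates of existing rows, but they are still artificial data, so the caveat about modified statistics applies here too.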