GitBook: [#2880] update

CPol 2021-11-30 23:18:19 +00:00 committed by gitbook-bot
parent 5d0d76bd31
commit 1b45fddbff

Data can be **continuous** (**infinite** values) or **categorical** (nominal), where the amount of possible values is **limited**.
### Categorical types
#### Binary
Just **2 possible values**: 1 or 0. If a dataset stores these values as strings (e.g. "True" and "False"), you can assign numbers to them with:
```python
dataset["column2"] = dataset.column2.map({"T": 1, "F": 0})
```
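Note that `Series.map` turns any value missing from the dictionary into `NaN`, so a quick sanity check afterwards can help:
```python
# Values absent from the mapping dict become NaN after .map()
print(dataset.column2.isna().sum())
```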
#### **Ordinal**
The **values follow an order**, as in: 1st place, 2nd place... If the categories are strings (like "starter", "amateur", "professional", "expert") you can map them to numbers as in the binary case:
```python
column2_mapping = {'starter':0,'amateur':1,'professional':2,'expert':3}
dataset['column2'] = dataset.column2.map(column2_mapping)
```
* For **alphabetic columns** you can generate the mapping more easily:
```python
# First get all the unique values, alphabetically sorted
possible_values_sorted = dataset.column2.sort_values().unique().tolist()
# Assign each one a value
possible_values_mapping = {value:idx for idx,value in enumerate(possible_values_sorted)}
dataset['column2'] = dataset.column2.map(possible_values_mapping)
```
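A shorter equivalent (a sketch relying on pandas assigning category codes in sorted order for string columns):
```python
# Category codes follow the sorted (alphabetical) order of the values
dataset['column2'] = dataset.column2.astype('category').cat.codes
```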
#### **Cyclical**
Looks **like an ordinal value** because there is an order, but coming later in the order doesn't mean one value is bigger than another. Also, the **distance between values depends on the direction** in which you count. Example: the days of the week; Sunday isn't "bigger" than Monday.
* There are **different ways** to encode cyclical features; some may only work with certain algorithms. **In general, dummy encoding can be used**:
```python
column2_dummies = pd.get_dummies(dataset.column2, drop_first=True)
dataset_joined = pd.concat([dataset[['column2']], column2_dummies], axis=1)
```
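Another common option for cyclical features (a sketch assuming a hypothetical numeric `weekday` column with values 0-6) is to project the cycle onto a circle with sine/cosine, so the last and first values of the cycle end up close together:
```python
import numpy as np

# Encode the 7-value cycle as coordinates on a circle:
# Sunday (6) and Monday (0) become neighbours instead of extremes
dataset['weekday_sin'] = np.sin(2 * np.pi * dataset.weekday / 7)
dataset['weekday_cos'] = np.cos(2 * np.pi * dataset.weekday / 7)
```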
#### **Dates**
Dates are **continuous variables**. They can be seen as **cyclical** (because they repeat) **or** as **ordinal** variables (because one time is bigger than a previous one).
* Usually dates are used as the **index**:
```python
# Transform dates to datetime
# [...]
dataset['weekday'] = dataset.transaction_date.dt.weekday
dataset['day_name'] = dataset.transaction_date.apply(lambda x: x.day_name())
```
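As a self-contained sketch of that workflow (the `transaction_date` column name is taken from the snippet above; the sample values are made up):
```python
import pandas as pd

df = pd.DataFrame({'transaction_date': ['2021-11-01', '2021-11-02', '2021-11-08']})
df['transaction_date'] = pd.to_datetime(df.transaction_date)
df = df.set_index('transaction_date')  # use the date as the index
df['month'] = df.index.month           # ordinal view of the date
df['weekday'] = df.index.weekday       # cyclical view of the date
```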
#### Multi-category/nominal
**More than 2 categories** with no related order. Use `dataset.describe(include='all')` to get information about the categories of each feature.
* A **referring string** is a **column that identifies an example** (like a name of a person). This can be duplicated (because 2 people may have the same name) but most will be unique. This data is **useless and should be removed**.
* A **key column** is used to **link data between tables**. In this case the elements are unique. This data is **useless and should be removed**.
To **encode multi-category columns into numbers** (so the ML algorithm understands them), **dummy encoding is used** (and **not one-hot encoding**, because it **doesn't avoid perfect multicollinearity**).
You can get a **multi-category column one-hot encoded** with `pd.get_dummies(dataset.column1)`. This will transform all the classes into binary features, creating **one new column per possible class**: the column of the matching class is set to **1** and the rest to 0.
You can get a **multi-category column dummy encoded** with `pd.get_dummies(dataset.column1, drop_first=True)`. This will transform all the classes into binary features, creating **one new column per possible class minus one**; the **dropped class is represented as all 0s in the created columns**. This avoids perfect multicollinearity, reducing the relations between the columns.
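A tiny sketch (with a made-up `colors` column) makes the difference visible:
```python
import pandas as pd

colors = pd.Series(['red', 'green', 'blue', 'green'])
print(pd.get_dummies(colors))                   # 3 columns: blue, green, red
print(pd.get_dummies(colors, drop_first=True))  # 2 columns: 'blue' rows become all 0s
```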
## Collinear/Multicollinearity
Collinearity appears when **2 features are related to each other**; multicollinearity appears when more than 2 are.
In ML **you want your features to be related to the possible results, but not to each other**. That's why **dummy encoding drops one of the columns** and **is better than one-hot encoding**, which creates a clear relation between all the new features built from the multi-category column.
VIF is the **Variance Inflation Factor** which **measures the multicollinearity of the features**. A value **above 5 means that one of the two or more collinear features should be removed**.
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
#dummies_encoded = pd.get_dummies(dataset.column1, drop_first=True)
onehot_encoded = pd.get_dummies(dataset.column1)
X = add_constant(onehot_encoded) # Add a constant column to the one-hot encoded data
print(pd.Series([variance_inflation_factor(X.values,i) for i in range(X.shape[1])], index=X.columns))
```
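With plain one-hot encoding the dummy columns sum to the constant column, so the regression behind the VIF is perfectly collinear and the reported values blow up; rerunning the snippet with the commented `drop_first=True` line instead should bring them back down.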
## Categorical Imbalance
This occurs when there is **not the same amount of each category** in the training data.
```python
# Get statistics of the features
print(dataset.describe(include='all'))
# Get an overview of the features
print(dataset.info())
# Get imbalance information of the target column
print(dataset.target_column.value_counts())
```
In an imbalance there is always a **majority class or classes** and a **minority class or classes**.
There are 2 main ways to fix this problem:
* **Undersampling**: Removing randomly selected data from the majority class so it has the same number of samples as the minority class.
* **Oversampling**: Randomly duplicating samples of the minority class until it has as many as the majority class.
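A minimal undersampling sketch in plain pandas (`target_column` as in the snippet above; this simply shrinks every class to the size of the smallest one):
```python
# Find the size of the smallest class
min_size = dataset.target_column.value_counts().min()
# Keep `min_size` randomly selected rows from each class
balanced = (dataset.groupby('target_column', group_keys=False)
                   .apply(lambda g: g.sample(min_size, random_state=0)))
print(balanced.target_column.value_counts())
```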