# Feature Engineering

## Basic types of possible data

Data can be **continuous** (**infinite** possible values) or **categorical** (nominal), where the amount of possible values is **limited**.

### Categorical types

#### Binary

Just **2 possible values**: 1 or 0. In case the values in a dataset are in string format (e.g. "T" and "F"), you assign numbers to those values with:

```python
dataset["column2"] = dataset.column2.map({"T": 1, "F": 0})
```

#### **Ordinal**

The **values follow an order**, like in: 1st place, 2nd place... If the categories are strings (like: "starter", "amateur", "professional", "expert") you can map them to numbers as we saw in the binary case:

```python
column2_mapping = {'starter':0,'amateur':1,'professional':2,'expert':3}
dataset['column2'] = dataset.column2.map(column2_mapping)
```

* For **alphabetic columns** you can order them more easily:

```python
# First get all the unique values alphabetically sorted
possible_values_sorted = dataset.column2.sort_values().unique().tolist()
# Assign each one a value
possible_values_mapping = {value:idx for idx,value in enumerate(possible_values_sorted)}
dataset['column2'] = dataset.column2.map(possible_values_mapping)
```

#### **Cyclical**

Looks **like an ordinal value** because there is an order, but that doesn't mean one value is bigger than another. Also, the **distance between them depends on the direction** you are counting. Example: the days of the week; Sunday isn't "bigger" than Monday.

* There are **different ways** to encode cyclical features; some may work with only certain algorithms. **In general, dummy encoding can be used**:

```python
column2_dummies = pd.get_dummies(dataset.column2, drop_first=True)
dataset_joined = pd.concat([dataset[['column2']], column2_dummies], axis=1)
```

#### **Dates**

Dates are **continuous variables**. They can be seen as **cyclical** (because they repeat) **or** as **ordinal** variables (because a time is bigger than a previous one).

* Usually dates are used as the **index**:

```python
# Transform dates to datetime
dataset["column_date"] = pd.to_datetime(dataset.column_date)
# Make the date feature the index
dataset.set_index('column_date', inplace=True)
print(dataset.head())

# Sum usage column per day
daily_sum = dataset.groupby(dataset.index.date).agg({'usage':['sum']})
# Flatten and rename usage column
daily_sum.columns = daily_sum.columns.get_level_values(0)
daily_sum.columns = ['daily_usage']
print(daily_sum.head())

# Fill days with 0 usage
idx = pd.date_range('2020-01-01', '2020-12-31')
daily_sum.index = pd.DatetimeIndex(daily_sum.index)
df_filled = daily_sum.reindex(idx, fill_value=0) # Fill missing values

# Get day of the week, Monday=0, Sunday=6, and week day names
dataset['DoW'] = dataset.index.dayofweek
## do the same in a different way
dataset['weekday'] = dataset.index.weekday
# get day names
dataset['day_name'] = dataset.index.day_name()
```

#### Multi-category/nominal

**More than 2 categories** with no related order. Use `dataset.describe(include='all')` to get information about the categories of each feature.

* A **referring string** is a **column that identifies an example** (like a name of a person). It can contain duplicates (because 2 people may have the same name), but most values will be unique. This data is **useless and should be removed**.
* A **key column** is used to **link data between tables**. In this case the elements are unique. This data is **useless and should be removed** (see the sketch below).
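As a quick illustration of the previous two bullets, here is a minimal sketch for spotting and dropping identifier-like columns. The column names (`name`, `customer_id`) and the 90% uniqueness threshold are assumptions for the example, not part of any particular dataset:

```python
# Columns where almost every value is unique are likely identifiers
for col in dataset.select_dtypes(include='object').columns:
    if dataset[col].nunique() / len(dataset) > 0.9:  # assumed threshold
        print(f"{col} looks like an identifier column")

# Drop the hypothetical referring string and key columns
dataset = dataset.drop(columns=['name', 'customer_id'])
```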
To **encode multi-category columns into numbers** (so the ML algorithm understands them), **dummy encoding is used** (and **not one-hot encoding**, because it **doesn't avoid perfect multicollinearity**).

You can get a **multi-category column one-hot encoded** with `pd.get_dummies(dataset.column1)`. This will transform all the classes into binary features, so it creates **one new column per possible class** and assigns **a True value to one column** while the rest are false.

You can get a **multi-category column dummy encoded** with `pd.get_dummies(dataset.column1, drop_first=True)`. This also transforms all the classes into binary features, but it creates **one new column per possible class minus one**: the **dropped class is represented implicitly by a 0 in all the remaining binary columns**. This avoids perfect multicollinearity, reducing the relations between columns.

## Collinear/Multicollinearity

Collinearity appears when **2 features are related to each other**. Multicollinearity appears when more than 2 features are related. In ML **you want your features to be related to the possible results, but you don't want them to be related to each other**. That's why **dummy encoding drops one of the columns** and **is better than one-hot encoding**, which creates a clear relation between all the new features derived from the multi-category column.

VIF is the **Variance Inflation Factor**, which **measures the multicollinearity of the features**. A value **above 5 means that one of the two or more collinear features should be removed**.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

#dummies_encoded = pd.get_dummies(dataset.column1, drop_first=True)
onehot_encoded = pd.get_dummies(dataset.column1).astype(int) # Cast bool dummies to int for VIF
X = add_constant(onehot_encoded) # Add previously one-hot encoded data
print(pd.Series([variance_inflation_factor(X.values,i) for i in range(X.shape[1])], index=X.columns))
```

## Categorical Imbalance

This occurs when there is **not the same amount of each category** in the training data.

```python
# Get statistics of the features
print(dataset.describe(include='all'))
# Get an overview of the features
print(dataset.info())
# Get imbalance information of the target column
print(dataset.target_column.value_counts())
```

In an imbalance there is always a **majority class or classes** and a **minority class or classes**.

There are 2 main ways to fix this problem:

* **Undersampling**: Removing randomly selected data from the majority class so it has the same number of samples as the minority class (see the sketch below).
* **Oversampling**: Generating more data for the minority class until it has as many samples as the majority class.
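To make the undersampling idea concrete, here is a minimal pandas sketch. It assumes the `target_column` used above; the `random_state` is only there to make the example reproducible:

```python
# Undersample: keep as many rows of each class as the minority class has
minority_size = dataset.target_column.value_counts().min()

balanced = (dataset.groupby('target_column', group_keys=False)
                   .apply(lambda g: g.sample(n=minority_size, random_state=42)))

# Every class now has the same number of samples
print(balanced.target_column.value_counts())
```

Libraries such as imbalanced-learn also ship ready-made samplers (e.g. `RandomUnderSampler` for undersampling and `SMOTE` for oversampling), which may be preferable to hand-rolled sampling.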