# Feature Engineering

## Basic types of possible data

Data can be **continuous** (**infinite** possible values) or **categorical** (nominal), where the amount of possible values is **limited**.

### Categorical types

#### Binary

Just **2 possible values**: 1 or 0. In case the values in a dataset are in string format (e.g. "T" and "F"), you assign numbers to those values with:

```python
dataset["column2"] = dataset.column2.map({"T": 1, "F": 0})
```

#### **Ordinal**

The **values follow an order**, like in: 1st place, 2nd place... If the categories are strings (like: "starter", "amateur", "professional", "expert") you can map them to numbers as we saw in the binary case:

```python
column2_mapping = {'starter':0,'amateur':1,'professional':2,'expert':3}
dataset['column2'] = dataset.column2.map(column2_mapping)
```

* For **alphabetic columns** you can order them more easily:

```python
# First get all the unique values alphabetically sorted
possible_values_sorted = dataset.column2.sort_values().unique().tolist()
# Assign each one a value
possible_values_mapping = {value:idx for idx,value in enumerate(possible_values_sorted)}
dataset['column2'] = dataset.column2.map(possible_values_mapping)
```

#### **Cyclical**

Looks **like an ordinal value** because there is an order, but that doesn't mean one value is bigger than another. Also, the **distance between them depends on the direction** you are counting. Example: the days of the week; Sunday isn't "bigger" than Monday.

* There are **different ways** to encode cyclical features; some may work with only certain algorithms. **In general, dummy encoding can be used**:

```python
column2_dummies = pd.get_dummies(dataset.column2, drop_first=True)
dataset_joined = pd.concat([dataset[['column2']], column2_dummies], axis=1)
```

#### **Dates**

Dates are **continuous variables**. They can be seen as **cyclical** (because they repeat) **or** as **ordinal** variables (because a time is bigger than a previous one).

* Usually dates are used as the **index**:

```python
# Transform dates to datetime
dataset["column_date"] = pd.to_datetime(dataset.column_date)
# Make the date feature the index
dataset.set_index('column_date', inplace=True)
print(dataset.head())

# Sum usage column per day
daily_sum = dataset.groupby(dataset.index.date).agg({'usage':['sum']})
# Flatten and rename usage column
daily_sum.columns = daily_sum.columns.get_level_values(0)
daily_sum.columns = ['daily_usage']
print(daily_sum.head())

# Fill days with 0 usage
idx = pd.date_range('2020-01-01', '2020-12-31')
daily_sum.index = pd.DatetimeIndex(daily_sum.index)
df_filled = daily_sum.reindex(idx, fill_value=0) # Fill missing values

# Get day of the week, Monday=0, Sunday=6, and week day names
dataset['DoW'] = dataset.index.dayofweek
## do the same in a different way
dataset['weekday'] = dataset.index.weekday
# get day names
dataset['day_name'] = dataset.index.day_name()
```

#### Multi-category/nominal

**More than 2 categories** with no related order. Use `dataset.describe(include='all')` to get information about the categories of each feature.

* A **referring string** is a **column that identifies an example** (like a name of a person). It can contain duplicates (because 2 people may have the same name), but most values will be unique. This data is **useless and should be removed**.
* A **key column** is used to **link data between tables**. In this case the elements are unique. This data is **useless and should be removed** (see the sketch below).
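As a quick illustration of the previous two bullets, here is a minimal sketch for spotting and dropping identifier-like columns. The column names (`name`, `customer_id`) and the 90% uniqueness threshold are assumptions for the example, not part of any particular dataset:

```python
# Columns where almost every value is unique are likely identifiers
for col in dataset.select_dtypes(include='object').columns:
    if dataset[col].nunique() / len(dataset) > 0.9:  # assumed threshold
        print(f"{col} looks like an identifier column")

# Drop the hypothetical referring string and key columns
dataset = dataset.drop(columns=['name', 'customer_id'])
```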
To **encode multi-category columns into numbers** (so the ML algorithm understands them), **dummy encoding is used** (and **not one-hot encoding**, because it **doesn't avoid perfect multicollinearity**).

You can get a **multi-category column one-hot encoded** with `pd.get_dummies(dataset.column1)`. This will transform all the classes into binary features, so it creates **one new column per possible class** and assigns **a True value to one column** while the rest are false.

You can get a **multi-category column dummy encoded** with `pd.get_dummies(dataset.column1, drop_first=True)`. This also transforms all the classes into binary features, but it creates **one new column per possible class minus one**: the **dropped class is represented implicitly by a 0 in all the remaining binary columns**. This avoids perfect multicollinearity, reducing the relations between columns.

## Collinear/Multicollinearity

Collinearity appears when **2 features are related to each other**. Multicollinearity appears when more than 2 features are related. In ML **you want your features to be related to the possible results, but you don't want them to be related to each other**. That's why **dummy encoding drops one of the columns** and **is better than one-hot encoding**, which creates a clear relation between all the new features derived from the multi-category column.

VIF is the **Variance Inflation Factor**, which **measures the multicollinearity of the features**. A value **above 5 means that one of the two or more collinear features should be removed**.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

#dummies_encoded = pd.get_dummies(dataset.column1, drop_first=True)
onehot_encoded = pd.get_dummies(dataset.column1).astype(int) # Cast bool dummies to int for VIF
X = add_constant(onehot_encoded) # Add previously one-hot encoded data
print(pd.Series([variance_inflation_factor(X.values,i) for i in range(X.shape[1])], index=X.columns))
```

## Categorical Imbalance

This occurs when there is **not the same amount of each category** in the training data.

```python
# Get statistics of the features
print(dataset.describe(include='all'))
# Get an overview of the features
print(dataset.info())
# Get imbalance information of the target column
print(dataset.target_column.value_counts())
```

In an imbalance there is always a **majority class or classes** and a **minority class or classes**.

There are 2 main ways to fix this problem:

* **Undersampling**: Removing randomly selected data from the majority class so it has the same number of samples as the minority class (see the sketch below).
* **Oversampling**: Generating more data for the minority class until it has as many samples as the majority class.
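To make the undersampling idea concrete, here is a minimal pandas sketch. It assumes the `target_column` used above; the `random_state` is only there to make the example reproducible:

```python
# Undersample: keep as many rows of each class as the minority class has
minority_size = dataset.target_column.value_counts().min()

balanced = (dataset.groupby('target_column', group_keys=False)
                   .apply(lambda g: g.sample(n=minority_size, random_state=42)))

# Every class now has the same number of samples
print(balanced.target_column.value_counts())
```

Libraries such as imbalanced-learn also ship ready-made samplers (e.g. `RandomUnderSampler` for undersampling and `SMOTE` for oversampling), which may be preferable to hand-rolled sampling.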