Exploratory Data Analysis

Since data cleaning and EDA go hand in hand, this page picks up from where the Data Preparation page left off. The target variable contained string values that we cleaned. These were the values:
These values are present in all the features in roughly uniform proportion, with no evident pattern in how they occur relative to the target variable. Let us visualize the number of unique values in each feature:
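The unique-value counts behind such a plot can be computed directly with pandas. A minimal sketch, using a hypothetical frame (the column names and values here are stand-ins, not the actual dataset):

```python
import pandas as pd

# Hypothetical frame standing in for the dataset; columns are illustrative.
df = pd.DataFrame({
    "B2":     [1.2, 3.4, "unknown", 5.6],
    "B3":     ["a", "b", "a", "c"],
    "target": [0, 1, "n/a", 1],
})

# Number of distinct values per feature -- what the bar chart visualizes.
# Noisy strings count as values of their own at this stage.
unique_counts = df.nunique().sort_values(ascending=False)
print(unique_counts)
```

Plotting `unique_counts` as a bar chart reproduces the kind of figure described above.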
Segregating the features on the basis of unique values will help in performing fine-grained analysis. First, let's jump into the exploration of numerical features.
Analysis of numerical features
Like the rest of the features, the numerical features contain the eight string values mentioned above, and these need to be imputed before building a prediction model. Before performing imputation, let's look at the numerical features with respect to time:
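One common way to handle such entries in a numeric time series is to coerce the noisy strings to missing values and interpolate over time. A sketch under assumed noisy labels (the real eight values come from the Data Preparation step and are not reproduced here):

```python
import pandas as pd

# Assumed placeholders for the eight noisy string values.
NOISY = {"unknown", "n/a"}

# Hypothetical numeric feature with noisy strings mixed in, in time order.
s = pd.Series([1.0, "unknown", 3.0, "n/a", 5.0], name="B2")

# Mask noisy strings to NaN, cast to float, then interpolate linearly.
cleaned = s.mask(s.isin(NOISY)).astype(float).interpolate()
print(cleaned.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Interpolation respects the temporal ordering, which matters here since the next step analyzes these features as time series.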
To the naked eye, it is difficult to detect a pattern. Decomposing each time series into its trend and seasonality will help us understand more about it. Let's first look at the seasonalities for each feature:
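The classical additive decomposition behind such plots can be sketched by hand: a centered moving average estimates the trend, and averaging the detrended values by position in the cycle estimates the seasonal component. A self-contained example on synthetic data (the period and series are assumptions, not the dataset's):

```python
import numpy as np
import pandas as pd

# Synthetic series: linear trend plus a period-12 seasonal component.
n, period = 48, 12
t = np.arange(n)
series = pd.Series(0.5 * t + 3 * np.sin(2 * np.pi * t / period))

# Trend: centered moving average over one full seasonal period.
trend = series.rolling(period, center=True).mean()

# Seasonality: average the detrended values by position in the cycle.
detrended = series - trend
seasonal = detrended.groupby(t % period).mean()
print(seasonal.round(2))
```

Libraries such as statsmodels (`seasonal_decompose`) package the same idea, adding edge handling and multiplicative variants.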
Next are the trends:
All the features display seasonality to some extent. B19, B17, and B15 display similar trend and seasonality; likewise, B24, B25, and B16 are similar in trend and seasonality. B2 and B18 do not seem to match any of the other features.
The spread of each feature can be found here:
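Summary statistics quantify the spread shown in such a plot. A sketch on hypothetical features (names follow the post's B-prefixed convention; the distributions are invented to mirror one skewed feature among otherwise symmetric ones):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "B17": rng.normal(50, 20, 500),    # roughly symmetric, large variance
    "B18": rng.exponential(10, 500),   # a skewed feature, like B_18 in the text
})

# Variance and skew summarize the spread and asymmetry of each feature.
summary = df.agg(["mean", "std", "var", "skew"]).T
print(summary.round(2))
```

Box plots or `df.describe()` present the same quartile-based view of spread.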
Apart from B_18, the features are not highly skewed; however, they exhibit large variances. Let's explore the relationships between features using correlation values:
Visualizing a couple of relationships between features that exhibit a strong correlation:
Analysis of categorical features
As discussed, the 8 noisy string values are present in each feature, but how many of them are there?
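The share of noisy entries per feature is a one-liner once the noisy labels are known. A sketch with assumed placeholders for those labels:

```python
import pandas as pd

# Stand-ins for the eight noisy string values from Data Preparation.
NOISY = {"unknown", "n/a"}

# Hypothetical categorical features.
df = pd.DataFrame({
    "B3":  ["a", "unknown", "b", "a", "c"],
    "B21": ["x", "y", "n/a", "x", "x"],
})

# Fraction of noisy entries per feature, expressed as a percentage.
noisy_pct = df.isin(NOISY).mean() * 100
print(noisy_pct)
```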
Since they occur in such a low percentage, we shall either replace those values or create a separate aggregated label. Moving forward, let us visualize the number of unique values in every categorical feature:
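Both options mentioned above can be sketched in a few lines; the noisy labels, the `"noise"` label name, and the mode-based replacement are all assumptions for illustration:

```python
import pandas as pd

# Placeholders for the eight noisy string values.
NOISY = {"unknown", "n/a"}

s = pd.Series(["a", "unknown", "b", "a", "n/a"])
valid = ~s.isin(NOISY)

# Option 1: fold every noisy string into one aggregated label.
aggregated = s.where(valid, "noise")

# Option 2: replace noisy strings with the feature's most frequent valid value.
mode_filled = s.where(valid, s[valid].mode()[0])
print(aggregated.tolist(), mode_filled.tolist())
```

Option 1 preserves the information that a value was noisy; option 2 keeps the original label set intact.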
From the above plot we can see that features B_4, B_5, B_9, B_14, B_20, B_22, and B_23 have only one unique value. Since constant features carry no information for a predictive model, we shall drop them while performing feature selection.
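Dropping constant features can be done generically rather than by listing names. A sketch on a hypothetical frame:

```python
import pandas as pd

# Hypothetical frame; B_4 and B_5 are constant, like the single-valued features.
df = pd.DataFrame({
    "B_3": ["a", "b", "c"],
    "B_4": ["x", "x", "x"],
    "B_5": [1, 1, 1],
})

# Any feature with a single unique value is uninformative for prediction.
constant = df.columns[df.nunique() <= 1]
df = df.drop(columns=constant)
print(df.columns.tolist())  # ['B_3']
```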
Here is the distribution of features B_3 and B_21, which have more than 3 unique categories:
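The category distribution behind such a plot comes straight from `value_counts`. A sketch on a hypothetical feature standing in for B_3 or B_21:

```python
import pandas as pd

# Hypothetical categorical feature; values are illustrative.
s = pd.Series(["a", "b", "a", "c", "a", "b"], name="B_3")

# Normalized counts give each category's share of the feature.
dist = s.value_counts(normalize=True)
print(dist)
```

Plotting `dist` as a bar chart reproduces the distribution figure.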