SALE ANALYSIS FOR SUPERSTORE

Using SSIS (ETL), SQL (Analysis), Power BI (Visualization), Tableau (Forecasting) & Python (Cleaning and Build Predictive Model)

CONTENTS

  1. The Data Source
  2. Business Requirements
  3. Data Cleaning with Python in Google Colab
  4. Design Data Warehouse and ETL Process
  5. Fact 1: Analysis of the store's overall YTD business situation and sale metrics ( Using Power BI )
  6. Fact 2: RFM Customer Segmentation Analysis ( Using Power BI )
  7. Fact 3: Customer Cohort Analysis( Using SQL & Tableau )
  8. Fact 4: Forecasting and Predictive Returned Goods ( Using Tableau & Python )

The Data Source

This is a data set about a global store, recording customer transactions from 2014 to 2017 . This is an extremely complete sample data set for learning and practicing Business Intelligence and Analysis processes. This data set is in .xlsx file format and has 3 sheets: "Orders", "Returns" and "People", but in this project, I will only use 2 sheets, "Orders", "Returns".

Ảnh 1

Sheet: Orders

Ảnh 1

Sheet: Orders

Sheet: Returns

The Orders sheet has: 51290 rows and 24 columns
With columns: CustomerID, OrderDate, Market, Region, ProductID...

The Returned sheet has: 2034 rows and 3 columns
Returned, OrderID, Region...

Download dataset in my repository: Here

Business Requirements

Analysis of the store's overall YTD business situation and sale metrics

Ảnh 1

  • Analysis of the general business situation of the store with some sale metrics such as YTD Total Sales, YTD Total Profit, YTD Quantity Sold, Best Seller, ....
  • When a business conducts YTD analysis, they are reviewing and evaluating their performance, financial status, or other data from the beginning of the year up to the current point in the same fiscal year or statistical period.

RFM Customer Segmentation Analysis

  • Recency (R): How recently did the customer make a purchase?
  • Frequency (F): How often does the customer make purchases within a specific timeframe?
  • Monetary (M): How much money has the customer spent within a specific timeframe?
  • RFM analysis aims to identify the most valuable customer segments from which to have effective marketing campaigns for each segment.

Ảnh 1

Customer Cohort Analysis

Ảnh 1

  • Cohort analysis is a method that categorizes users based on common attributes and studies their behavior trends over time, revealing insights into group-specific patterns and performance.
  • Cohort analysis is used to study how distinct user groups behave over time, providing insights into their behavior and performance to inform strategy optimization and data-driven decisions.

Forecasting and Predictive model

  • Forecasting in businesses involves predicting and estimating the future using historical and current data to support strategic decision-making, financial planning, and resource management.
  • By examining past sales data, the company can identify patterns and trends that provide insights into future sales performance.
Ảnh 1

Data Cleaning with Python in Google Colab

Import neccessary libraries and read file:

dash1

Data overview:

dash1

df1.info()

df2.isnull().sum()

Ảnh 1
Ảnh 1

df1.duplicated().sum()

Unique Value

Ảnh 1
Ảnh 1

My Exploration

  • Total Records: 51290 records.
  • Total Columns: 24 columns.
  • All columns with correct data type.
  • There are 1 columns with null values: Postal Code with 41296 null values, so I will delete this field.
  • All columns with correct data type.
  • Some field like: Sales, Discounts,... need to be round

The Scripts of Cleaning

  • Delete the Postal Code columns
  • Rename all columns to lowercase and replace the space by underscore if exist
  • Round number into 2 decimal
  • Join sheet 2 "Return" to get the data about return goods
  • Label for each Order ID: Yes by 1 and fill No by 0
  • Sort the data by "Row_ID"
  • Drop duplicated values if exist

Here is my code for cleaning in Python:

# Join sheet 2 to get the data about return goods (left-join)
df = pd.merge(df1,df2, on='Order ID', how='left')
# Fill the Nan in Returned by No
df['Returned'] = df['Returned'].fillna('No')
# Delete PostalCode columns
df =df.drop('Postal Code', axis =1)
df = df.drop('Region_y', axis =1)
# Rename all columns to lowercase
df.rename(columns=lambda x: x.lower().replace(" ", "_"), inplace=True)
# Change the datatype of (OrderDate and ShipDate) to datetime
df['order_date'] = pd.to_datetime(df['order_date'])
df['ship_date'] = pd.to_datetime(df['ship_date'])
# Round number into 2 decimal
df['profit'] = df['profit'].round(2)
df['shipping_cost'] = df['shipping_cost'].round(2)
# Sort the data by "Row_ID
df = df.sort_values(by='row_id')
# Drop duplicated value
df = df.drop_duplicates(keep='first')
# reset row_id
df['row_id'] = range(1, len(df) + 1)
הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX


Data After Clean (with 51290 Records)
(Save it to new files)

Ảnh 1

Download the Data After Cleaning: Here

Design Data Warehouse

Bus Matrix of Sales

dash1

Data Warehouse

dash1

ETL Process

Dim Customer

Dim Product

Ảnh 1
Ảnh 1

Dim Time

dash1

Dim Location

dash1

Fact Sales

dash1

Download my ETL project in SSIS: Here

Fact 1: Analysis of the store's overall YTD business situation and sale metrics
( Using Power BI )

dash1

All Segment Overall YTD Business Situation

  • Total YTD Sales: $4.30M and 26.25% increase over the previous year
  • Total YTD Profits: $504.17K and 23.89% increase over the previous year
  • Total YTD Quantity: 60.62K and 25.94% increase over the previous year
  • Total YTD Profit Margin: 11.73% and 1.87% decrease over the previous year
  • Technology products brought in the highest revenue last year (2017), followed by Furniture and finally Office Supplies
  • Top 5 and Bottom 5 Product in the Dashboard
  • Asia Pacific market brings the most revenue to the company and Africa market brings the least revenue for the company.

Conclusion and Recommend Strategies:

1. Profit Optimization Strategy

  • Assessment: The profit performance of all segments has increased this year compared to the previous year, except for a slight decrease in the overall profit margin. This might result from increased sales volume, but attention is needed to control pricing and costs to ensure sustainable profitability.
  • Strategy: Optimize profit margins by adjusting pricing and controlling production costs. Ensure that the profit growth is not solely reliant on increased sales but also on improving price and cost management.

2. Focus on High-Growth Segments:

  • Assessment: Both the "Consumer", "Corporate" and "Home Office" segments have shown significant growth in key metrics such as sales, profit, and quantity.
  • Strategy: Concentrate on developing and providing products and services that cater to the needs of these segments. Build targeted marketing campaigns aimed at customers within these segments to increase market share and sales.

3. Product Portfolio Optimization:

  • Assessment: Products categorized under "Technology" generated the highest revenue in the previous year, followed by "Furniture" and "Office Supplies" in 2017.
  • Strategy: Adjust the product portfolio to optimize revenue. Consider investing in the development and promotion of high-profit-margin products, as well as assessing the optimization of underperforming products

4. Market Expansion and Risk Management:

  • Assessment: "Asia Pacific" brought in the highest revenue, while "Africa" contributed the least.
    Besides, the density of customers buying goods in Africa is high, but the revenue is small, this may come from the fact that in the African market, only low-value products can be sold.
    The USCA market has a low density of buyers but contributes a large amount of revenue, proving that it can be sold here.
  • Strategy: Seize market expansion opportunities in regions like "Asia Pacific" to continue capitalizing on growth potential.
    Simultaneously, consider ways to optimize business operations in markets contributing lower revenue to minimize risks.
    Besides, The company should have marketing strategies in the USCA market to promote sales from this potential market

Fact 2: RFM Customer Segmentation Analysis
( Using Power BI )

dash1

Conclusion and Recommend Strategies:

1. Promising (16.77%):

  • Evaluation: Customers in this group show positive signs with relatively high potential.
  • Strategy: Create special promotions or offers to encourage them to continue shopping and increase engagement with your brand.

2. New Customers (14.92%):

  • Evaluation: This group consists of new customers, so it's important to retain them.
  • Strategy: Send welcome promotions or special deals for their next purchase. Create a positive experience to encourage them to come back.

3. Cannot Lose Them (13.71%):

  • Evaluation: This group has potential brand loyalty, but maintaining positive interaction is key.
  • Strategy: Provide added value through new product information and loyalty programs. Ensure they always feel interested in your brand.

4. Champions (11.27%):

  • Evaluation: These are exceptional customers who generate high profits.
  • Strategy: Maintain a strong relationship by offering excellent customer service and exclusive benefits. Encourage them to refer others.

5. Loyal (7.21%):

  • Evaluation: This group shows loyalty, but may need additional motivation to sustain it.
  • Strategy: Offer special benefits and rewards to show appreciation for their loyalty.

6. About to Sleep (6.55%), Hibernating (7.96%), At Risk (5.24%)

  • Evaluation: These groups are related to loss or potential loss.
  • Strategy: Send promotional messages, update products, and services to rekindle their interest. Actively monitor and engage to retain them.

7. Lost Customers (9.73%):

  • Evaluation: This group has the potential to develop into more positive customer segments.
  • Strategy: Understand their needs and desires better to offer appropriate products and services.

8. Need Attention (2.07%):

  • Evaluation: This group needs attention to prevent further loss.
  • Strategy: Contact them directly to understand their issues and concerns, then provide suitable solutions.

Conclusion:

  • Promising, New Customers, Cannot Lose Them: The total percentage of these groups is nearly 45%, indicating a potential and important customer set. The key strategy is to create and maintain positive relationships, create value, and make sure they don't lose points.
  • Champions, Loyal: At about 18%, these are the groups that bring in stable profits. Strategy should focus on maintaining and driving positive interactions to maintain loyalty and increase revenue.
  • About to Sleep, Hibernating, At Risk: These groups together make up about 20%, indicating a risk of losing points or stopping interaction. Strategy needs to focus on re-igniting interest and engagement so as not to lose customer points.
  • Lost Customers: With more than 9%, this is a lost group, but there is still a chance to recover. Strategy should focus on restoring the relationship and motivating them to return.
  • Potential, Need Attention: This group together makes up about 6%, representing growth potential and issues that need attention. Custom strategy is to dig deeper and provide solutions to exploit potential or solve problems.

In conclusion, based on the RFM analysis, the business should concentrate on customer retention, reactivating potential customers, strengthening interactions with valuable and potential segments, and developing engagement strategies to optimize business effectiveness and increase profitability.

Download my Dashboard about YTD Sales and Customer Segmentation: Here

Fact 3: Customer Cohort Analysis
( Using SQL & Tableau )

SQL Analysis Scripts

הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Tableau Dashboard

dash1

View my Tableau Dashboard about Cohort Analysis: Here

Analysis:

1. Retention Rate Over Time:

  • All customer groups from 2016 exhibited a consistent trend of decreasing numbers over the course of several months.
  • Initial cohorts began with larger customer counts, but these figures gradually declined over time. This emphasizes the necessity for robust retention and customer acquisition strategies.

2. Engagement Patterns:

  • There is significant variation in customer numbers across months and cohorts.
  • Certain cohorts experienced an increase during specific periods followed by a decrease. This might reflect marketing campaigns or special events that generated temporary interest.

3. Churn Analysis:

  • Early cohorts generally maintained lower churn rates compared to their later counterparts.
  • As time progressed, churn rates tended to increase in subsequent cohorts, underlining the importance of strategies to prevent customer attrition.

4. Impact of Marketing Campaigns:

  • Several cohorts experienced substantial spikes in customer counts after specific months, which could be linked to strategic marketing campaigns or captivating events.
  • This underlines the efficacy of well-timed and thoughtfully executed marketing endeavors, showcasing their potential to generate transient enthusiasm and engagement among customers.

Strategies Recommend:

  • Segmented Retention Campaigns: Tailor retention campaigns based on customer behavior and engagement patterns observed within different cohorts. Offer personalized incentives and rewards to re-engage customers who show signs of waning interest.
  • Loyalty Programs: Introduce or enhance loyalty programs that offer exclusive benefits to long-standing customers. This could help counteract the increasing churn rates observed in later cohorts.
  • Seasonal Promotions: Leverage the identified seasonal trends by launching targeted promotional campaigns during off-peak months. Encourage customers' participation with limited-time offers or themed events.
  • Reactivation Campaigns: Develop strategies to win back customers who have churned, especially focusing on the early cohorts with lower churn rates. Offer enticing incentives or personalized communications to reignite their interest.
  • Continuous Engagement Events: Organize periodic engagement events or product launches to sustain customer excitement throughout the year. This can help counteract the cyclical decline in engagement.

Fact 4: Forecasting and Predictive model
( Using Tableau & Python )

Forecasting in Tableau

dash1

View my Tableau Dashboard about Forecasting Sales: Here

dash1 dash1

Forecast Model with Trend is Additive and Season is Multiplicative, I got the good result that MASE = 0.26.
That proves the model is well chosen with the data.

Predictive Model for Returned Good

Set up and Read file

dash1

EDA Data

In my dataset, there are 3 types of data:

  • Object Data type
  • Numeric Data type
  • Datetime Data type
  • # objectcolumns
    object_cols = [f for f in data.columns if data[f].dtype =="O"]
    print(object_cols)
    # numerical columns
    numeric_cols = [f for f in data.columns if data[f].dtype != "O" and data[f].dtype != "datetime64[ns]"]
    print(numeric_cols)
    # datetime columns
    date_cols = [f for f in data.columns if data[f].dtype == "datetime64[ns]"]
    print(date_cols)
    הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    dash1

    Feature Selection

    Categorical Feature

    Categorical Features : will be selected from the Object data type columns.

  • First, delete unused fields such as: order_id, customer_id, customer_name, product_id.
  • In 5 features: city, state, country, region_x, market, select market field as market and drop the remaining fields
  • In 3 fields: product_name, sub_category, category, select category field as category and drop the remaining fields.
  • # categorical feature
    categorical_features = [ 'ship_mode', 'segment', 'market','category', 'order_priority']
    # Visualization for categorical feature
    ROWS, COLS = 2,3
    fig, ax = plt.subplots(ROWS,COLS, figsize=(19,19))
    row, col = 0, 0,
    for i, categorical_feature in enumerate(categorical_features):
    if col == COLS - 1:
    row += 1
    col = i % COLS
    data[categorical_feature].value_counts().plot(kind='bar', ax=ax[row, col]).set_title(categorical_feature)
    הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    dash1

    Numerical Feature

    In this project, the datetime feature is order_date and ship_date, corresponding to purchase date and delivery date.
    It can be assumed that items with a large difference between the order date and the ship date will result in returns.
    Therefore, from order_date and ship_date will generate a new feature called date_diff: the difference between order_date and ship_date

    Numerical Features : will be selected from the Numeric data type columns.

  • Delete unused fields such as: row_id
  • date_diff is a numerical feature
  • # numerical feature
    numerical_features= [ 'sales', 'quantity', 'discount', 'profit', 'shipping_cost','date_diff']
    # Distribution Visualization for numerical features
    fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))
    axes = axes.flatten()
    # Plot histograms for numerical features
    for i, col in enumerate(numerical_features):
    ax = axes[i]
    data[col].hist(ax=ax)
    ax.set_title(col)
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
    plt.tight_layout()
     
    הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    dash1

    Target Feature

    The target feature in this project is returned

    dash1

    Outlier Detection for numerical feature

    # Function
    def count_outliers(data, col):
    q1 = data[col].quantile(0.25, interpolation='nearest')
    q3 = data[col].quantile(0.75, interpolation='nearest')
    IQR = q3 - q1
    LLP = q1 - 1.5 * IQR
    ULP = q3 + 1.5 * IQR
    lower_outliers = data[data[col] < LLP][col].size
    upper_outliers = data[data[col] > ULP][col].size
    if lower_outliers == 0 and upper_outliers == 0:
    print("No outliers in", col)
    else:
    print("There are outliers in", col)
    print('Count of lower outliers:', lower_outliers)
    print('Count of upper outliers:', upper_outliers)
    # The outlier of numerical features
    numerical_features = ['sales', 'quantity', 'discount', 'profit', 'shipping_cost', 'date_diff']
    for feature in numerical_features:
    count_outliers(data, feature)
    # Display the outliers by box - plot
    fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(18, 10))
    axes = axes.flatten()
    # Plot box plots for numerical features
    for i, col in enumerate(numerical_features):
    ax = axes[i]
    data.boxplot(column=[col], ax=ax)
    ax.set_title(col)
    ax.set_ylabel('Value')
     
    הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    dash1
    • Count of lower outliers in sales: 0
    • Count of upper outliers in sales: 5655
    • Count of upper outliers in quantity: 877
    • Count of upper outliers in discount: 4172
    • Count of lower outliers in profit: 3526
    • Count of upper outliers in profit: 6229
    • Count of upper outliers in shipping_cost: 5909
    • No outliers in date_diff

    There are lots of outlier in numerical feature. So it is necessary to tranform!

    Transofmation with Log Transofmation

    data_transformed = data[numerical_features]
    # Log-Transform
    log_transform_features = ['sales', 'quantity', 'discount', 'profit', 'shipping_cost','date_diff']
    for feature in log_transform_features:
    data_transformed[feature] = np.log1p(data_transformed[feature])
    # check oulier
    for feature in log_transform_features:
    count_outliers(data_transformed, feature)
    # fit data
    log_transform_features = ['sales', 'quantity', 'discount', 'profit', 'shipping_cost','date_diff']
    for feature in log_transform_features:
    data_transformed[f'{feature}_log'] = np.log1p(data_transformed[feature])
    data_transformed.drop(log_transform_features, axis=1, inplace=True)
    הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    dash1

    Encoder for Categorical feature

    # check unique value of categorical features
    for feature in categorical_features:
    print(feature,data[feature].nunique())
    # One hot Encoding for categorical and target feature
    categorical_dummies = pd.get_dummies(data[categorical_features], drop_first=True)
    target_dummy = pd.get_dummies(data['returned'], drop_first=True)
    הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    Final Dataset

    # concatenate
    datafinal = pd.concat([categorical_dummies , data_transformed,target_dummy], axis=1)
    # View correlation
    corr = datafinal.corr()
    fig, ax = plt.subplots(figsize=(10, 10))
    cax = ax.matshow(corr, cmap='coolwarm')
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.colorbar(cax)
    # Visualization all features
    num_rows = 5
    num_cols = 5
    fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 15))
    features = datafinal.columns
    counter = 0
    for row in range(num_rows):
    for col in range(num_cols):
    if counter < len(features):
    axes[row, col].hist(datafinal[features[counter]])
    axes[row, col].set_title(features[counter])
    counter += 1
    plt.tight_layout()
    plt.show()
     
    הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    dash1 dash1

    Final Dataset

    Split to Train and Test

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler
    from imblearn.over_sampling import SMOTE
    # Feature selection
    X = datafinal.drop(["Y",'profit_log'], axis=1)
    y = datafinal['Y']
    # Split train ,test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    Upsampling with SMOTE and Min max Scaler

    # Upsampling with SMOTE
    sm = SMOTE(k_neighbors=5)
    X_train_resample, y_train_resample = sm.fit_resample(X_train, y_train)
    from sklearn.preprocessing import MinMaxScaler
    # Scaling with Min-Max Scaler
    numeric_columns = ['sales_log', 'quantity_log', 'discount_log',
    'shipping_cost_log', 'date_diff_log']
    scaler = MinMaxScaler()
    scaler.fit(X_train_resample[numeric_columns])
    X_train_resample[numeric_columns] = scaler.transform(X_train_resample[numeric_columns])
    X_test[numeric_columns] = scaler.transform(X_test[numeric_columns])
    הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    XGBoost Model Training

    import xgboost as xgb
    from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
    model_xgb = xgb.XGBClassifier(random_state=42, n_estimators=200)
    model_xgb.fit(X_train_resample, y_train_resample)
    y_pred_xgb = model_xgb.predict(X_test)
    # Result
    print("XGBoost Report:")
    print(classification_report(y_test, y_pred_xgb))
    # Accuracy on Train and Test
    accuracy_train_xgb = model_xgb.score(X_train_resample, y_train_resample)
    accuracy_test_xgb = accuracy_score(y_test, y_pred_xgb)
    print(f"Accuracy on Train: {accuracy_train_xgb:.2f}")
    print(f"Accuracy on Test: {accuracy_test_xgb:.2f}")
    # Confusion matrix
    conf_matrix_xgb = confusion_matrix(y_test, y_pred_xgb)
    plt.figure(figsize=(8, 6))
    sns.heatmap(conf_matrix_xgb, annot=True, cmap='Blues', fmt='g')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix - XGBoost')
     
    הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    Ảnh 1
    Ảnh 1

    XGBoost Mectrics

    XGBoost confusion matrix

    Random Forest Classifier Training

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
    model_rfc = RandomForestClassifier(random_state=42, n_estimators=200)
    model_rfc.fit(X_train_resample, y_train_resample)
    y_pred_rfc = model_rfc.predict(X_test)
    # Result
    print("Random Forest Classifier Report:")
    print(classification_report(y_test, y_pred_rfc))
    # Accuracy on Train and Test
    accuracy_train_rfc = model_rfc.score(X_train_resample, y_train_resample)
    accuracy_test_rfc = accuracy_score(y_test, y_pred_rfc)
    print(f"Accuracy on Train: {accuracy_train_rfc:.2f}")
    print(f"Accuracy on Test: {accuracy_test_rfc:.2f}")
    # Confusion matrix
    conf_matrix_rfc = confusion_matrix(y_test, y_pred_rfc)
    plt.figure(figsize=(8, 6))
    sns.heatmap(conf_matrix_rfc, annot=True, cmap='Blues', fmt='g')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix - Random Forest Classifier')
     
    הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    Ảnh 1
    Ảnh 1

    Random Forest Classifier Mectrics

    Random Forest Classifier confusion matrix

    Support Vector Classifier

    from sklearn.svm import SVC
    # Create and train Support Vector Classifier model
    model_svc = SVC(random_state=42)
    model_svc.fit(X_train_resample, y_train_resample)
    # Predict on test set
    y_pred_svc = model_svc.predict(X_test)
    # Print classification report
    print("Classification Report for Support Vector Classifier:")
    print(classification_report(y_test, y_pred_svc))
    # Compute and print accuracy for train and test
    accuracy_train_svc = model_svc.score(X_train_resample, y_train_resample)
    accuracy_test_svc = accuracy_score(y_test, y_pred_svc)
    print(f"Accuracy on train set: {accuracy_train_svc:.4f}")
    print(f"Accuracy on test set: {accuracy_test_svc:.4f}")
    # Compute and plot confusion matrix
    conf_matrix_svc = confusion_matrix(y_test, y_pred_svc)
    plt.figure(figsize=(8, 6))
    sns.heatmap(conf_matrix_svc, annot=True, cmap='Blues', fmt='g')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix - Support Vector Classifier')
    plt.show()
     
    הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    Ảnh 1
    Ảnh 1

    Support Vector Classifier Mectrics

    Support Vector Classifier confusion matrix

    Model Evalutation

    from sklearn.metrics import f1_score
    import pandas as pd
    # List of models
    models = [model_xgb, model_rfc, model_svc]
    # List of model names
    model_names = ['XGBoost', 'Random Forest', 'Support Vector Classifier']
    # Calculate evaluation metrics for each model
    accuracy = [model.score(X_test, y_test) for model in models]
    train_accuracy = [model.score(X_train, y_train) for model in models]
    f1_scores = [f1_score(y_test, model.predict(X_test)) for model in models]
    # Create a dictionary to store evaluation metrics
    evaluation_metrics = {
    'Model': model_names,
    'Accuracy': accuracy,
    'Train Accuracy': train_accuracy,
    'F1 Score': f1_scores
    }
    # Convert the dictionary to a DataFrame
    metrics_df = pd.DataFrame(evaluation_metrics)
    # Print the DataFrame
    print(metrics_df)
    # Find the model with the highest accuracy and F1-score
    best_accuracy_model = metrics_df.loc[metrics_df['Accuracy'].idxmax(), 'Model']
    best_f1_model = metrics_df.loc[metrics_df['F1 Score'].idxmax(), 'Model']
    print(f"Best model based on Accuracy: {best_accuracy_model}")
    print(f"Best model based on F1 Score: {best_f1_model}")
     
    הההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההההה
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    dash1

    The "Random Forest Classifier Model" has the highest accuracy on the test dataset (Accuracy = 85.19%) and the F1-score is relatively better than the other models.
    So I will choose this model!

    Save Model

    dash1
    View my full project: Here
    Previous Post Next Post
    Visits my social profile and get connected: