
Garments Workers Productivity Predictor

Overview

This project builds a model for predicting garment workers' productivity. The results showed that the productivity targets set manually for workers were overestimated; the model might help to determine better productivity targets.

Introduction

The garment industry is one of the most dominant industries in this era of industrial globalization. It is a highly labour-intensive industry that requires a large number of human resources to produce its goods and meet the global demand for garment products. Because of this dependency on human labour, the production of a garment company relies comprehensively on the productivity of the employees working in its different departments. A common problem in this industry is that the actual productivity of garment employees sometimes does not meet the targeted productivity set for them by the authorities to meet production goals in due time. When this productivity gap occurs, the company faces a huge loss in production. This study aims to solve this problem by predicting the actual productivity of the employees.

Dataset exploration
  • 1197 rows and 15 columns

  • Column ‘department’ has 2 classes: Sewing and Finishing

  • Column ‘wip’ has missing values, but only in the Finishing department; thus, all missing values were imputed to 0

  • ‘idle_men’, ‘idle_time’ & ‘no_of_style_change’ are dominated by the value 0

  • The ‘quarter’ column has 5 categories: each working month is split into 5 quarters

  • Column ‘day’ has all days except Friday

  • ‘targeted_productivity’ ranges from 7% to 80%, reflecting the expectations set by the company

  • ‘actual_productivity’, the target variable for this problem, ranges from 23% to 112%
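
A minimal pandas sketch of these checks (the CSV filename is an assumption; the column names follow the dataset as described above):

import pandas as pd

# Load the dataset (hypothetical filename for the garment productivity CSV)
df = pd.read_csv("garments_worker_productivity.csv")

print(df.shape)                   # (1197, 15)
print(df["department"].unique())  # sewing / finishing variants

# 'wip' is missing only in the finishing department, so impute with 0
print(df.loc[df["wip"].isna(), "department"].value_counts())
df["wip"] = df["wip"].fillna(0)

print(df["targeted_productivity"].describe())  # roughly 0.07 to 0.80
print(df["actual_productivity"].describe())    # roughly 0.23 to 1.12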

Data patterns

This visualization shows the distribution of actual productivity across the days of the week. It is easy to observe that on almost all days a good share of teams reach productivity over 60%, while from Sunday to Wednesday a considerable number of teams show productivity below 40%.

The plot below represents the correlation between all the features in the dataset.

The plot above compares the distribution of productivity targets set by the company with the actual observed productivity of employees.

Data preprocessing

The ‘date’ and ‘targeted_productivity’ features were dropped from the dataset, keeping ‘actual_productivity’ as the label. Then, missing values were imputed to zero. Two types of data were left: categorical and numerical.

To handle the categorical data, label encoding and get-dummies (one-hot) encoding were tried separately to compare the two methods; both resulted in a fully numerical dataset.
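
A sketch of the two encodings, assuming the df from the exploration sketch above (the exact column handling in the original code is not shown):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Drop 'date' and 'targeted_productivity'; 'actual_productivity' is the label
features = df.drop(columns=["date", "targeted_productivity"])
cat_cols = ["quarter", "department", "day"]

# Option A: label encoding -- each category becomes an integer code
df_label = features.copy()
for col in cat_cols:
    df_label[col] = LabelEncoder().fit_transform(df_label[col])

# Option B: get_dummies (one-hot) -- each category becomes a 0/1 column
df_dummies = pd.get_dummies(features, columns=cat_cols)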

Model

A Random Forest Regressor was picked to solve the problem, as the target label takes continuous values rather than discrete ones; the objective is therefore to minimize the MSE (Mean Squared Error).

Simple Random Forest Regressor

Random Forest model with the following hyperparameters:

n_estimators: 100, criterion: "mse", max_depth: None, max_features: "auto", bootstrap: True

This gives an MSE of 0.018206618907488367.
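
A sketch of this baseline with an assumed train/test split (the post does not state the exact evaluation scheme; newer scikit-learn spells criterion "mse" as "squared_error" and max_features "auto" as 1.0 for regressors):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical split on the dummy-encoded data from the previous sketch
X = df_dummies.drop(columns=["actual_productivity"])
y = df_dummies["actual_productivity"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(
    n_estimators=100,
    criterion="squared_error",  # "mse" in older scikit-learn
    max_depth=None,
    max_features=1.0,           # "auto" in older scikit-learn
    bootstrap=True,
)
model.fit(X_train, y_train)
print(mean_squared_error(y_test, model.predict(X_test)))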

Scaling

Then, two pipeline variants were tried with the Random Forest Regressor above: one with MinMaxScaler to normalize the dataset, and another with StandardScaler to standardize it.


A) MinMaxScaler:

An MSE of 0.018002495997471855 was obtained.


B) StandardScaler:

An MSE of 0.01809908623249897 was obtained.


As seen, all these models yield approximately the same results. This is expected: tree-based models split on feature thresholds and are largely insensitive to monotonic scaling.
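
A sketch of the two pipelines, reusing the split from the baseline sketch:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# The two pipelines differ only in their scaling step
for scaler in (MinMaxScaler(), StandardScaler()):
    pipe = Pipeline([
        ("scale", scaler),
        ("model", RandomForestRegressor(n_estimators=100)),
    ])
    pipe.fit(X_train, y_train)
    print(type(scaler).__name__,
          mean_squared_error(y_test, pipe.predict(X_test)))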

Combined feature selection

Different feature selection methods were also combined to identify the set of features that impact the accuracy most. The following methods were utilized: Pearson correlation coefficients, mutual information, recursive feature elimination, and embedded meta-transformers of Linear Regressor, Random Forest Regressor, and LGBM Regressor.


Since there were 23 features after encoding, subsets of 1 to 22 features were requested iteratively from the methods mentioned above. With each feature set, a Random Forest Regressor model was built and evaluated. The best accuracy came from 19 features: 'smv', 'over_time', 'no_of_workers', 'incentive', 'wip', 'team', 'no_of_style_change', 'idle_men', 'quarter_Quarter5', 'quarter_Quarter4', 'quarter_Quarter3', 'quarter_Quarter2', 'quarter_Quarter1', 'idle_time', 'department_sweing', 'department_finishing ', 'department_finishing', 'day_Monday', 'day_Wednesday'. A mean squared error of 0.016633314717583395 was reached, an improvement of 0.2064% compared to using all the features.


Table 1 below shows the feature support of each method. When iterating over the number of features, the top features according to the 'total' column, which counts the methods' votes, were used for evaluation.
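
One way to implement this voting scheme, sketched with scikit-learn and LightGBM meta-transformers (the exact selector settings used in the study are not given, so these are assumptions):

import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import (RFE, SelectFromModel, SelectKBest,
                                       f_regression, mutual_info_regression)
from sklearn.linear_model import LinearRegression

def feature_votes(X, y, k):
    # Count, per feature, how many selection methods keep it in their top k
    selectors = {
        "pearson": SelectKBest(f_regression, k=k),
        "mutual_info": SelectKBest(mutual_info_regression, k=k),
        "rfe": RFE(LinearRegression(), n_features_to_select=k),
        "embedded_lr": SelectFromModel(LinearRegression(),
                                       max_features=k, threshold=-np.inf),
        "embedded_rf": SelectFromModel(RandomForestRegressor(n_estimators=100),
                                       max_features=k, threshold=-np.inf),
        "embedded_lgbm": SelectFromModel(LGBMRegressor(),
                                         max_features=k, threshold=-np.inf),
    }
    votes = pd.DataFrame(index=X.columns)
    for name, selector in selectors.items():
        votes[name] = selector.fit(X, y).get_support()
    votes["total"] = votes.sum(axis=1)
    return votes.sort_values("total", ascending=False)

# For each k in 1..22, evaluate a Random Forest on the k most-voted
# features; k = 19 scored best in this study.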

Grid search

GridSearchCV was utilized to tune the Random Forest Regressor's hyper-parameters with 3-fold cross-validation. Table 2 shows the grid of hyper-parameters used and, in the right-hand columns, the best parameters returned.

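A sketch of the search, with an illustrative grid (the actual grid is the one in Table 2):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Illustrative hyper-parameter grid; Table 2 holds the real one
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}
search = GridSearchCV(
    RandomForestRegressor(),
    param_grid,
    cv=3,                              # 3-fold cross-validation, as in the post
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)           # ideally on the 19 selected features
print(search.best_params_)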

RandomizedSearchCV

RandomizedSearchCV was also used to find the best parameters with only 10 iterations; the same grid was passed into it. As a result, the following parameters were returned as the best:


{
    'n_estimators': 500,
    'min_samples_split': 10,
    'min_samples_leaf': 4,
    'max_features': 'log2',
    'max_depth': 11.666666666666668,
    'bootstrap': False
}


The RandomizedSearchCV's best estimator achieved an MSE of 0.01623657551618213, a 0.2064% improvement compared to the previous best result.
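
A sketch of the randomized search over the same (illustrative) grid; the fractional max_depth in the reported result suggests the real grid was generated numerically, e.g. with np.linspace:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Samples only 10 candidate settings from the grid
search = RandomizedSearchCV(
    RandomForestRegressor(),
    param_distributions=param_grid,
    n_iter=10,
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)  # e.g. the dictionary shown above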

Results

Different models were built using a range of approaches to feature selection, normalization, value encoding, and hyper-parameter choice. Table 4 displays the top models and shows how the results improved as the different approaches were applied. The improvements achieved are not huge, but still important.

To conclude the model building, the approach with the best accuracy combines dummy encoding of categorical features, standard scaling of values for normalization, the 19 features obtained from the combined feature selection methods, and a RandomForestRegressor with 500 trees and the other parameters found by RandomizedSearchCV, which gave an MSE of 0.016237.

Conclusion
  • In the above graph, the blue line represents targeted productivity and the red line represents the corresponding predicted productivity


  • Since there is a huge gap between predicted productivity and the target set by the company, the data can further be divided into 2 categories by the rule (see the sketch after this list):

    • [less productive if Pred - Targeted < 0 & more productive if Pred - Targeted > 0]

    • Members in the More productive category can be incentivized and used for overtime work, as they can be a bit more reliable

    • Members in the Less productive category can be put under a performance improvement program or moved to teams with lower WIP or fewer idle workers


  • Furthermore, the continuous value of predicted productivity can also be compared with teams'/workers' existing productivity to benchmark what a realistic productivity target could be.


  • Teams with a higher productivity prediction can also be checked for idle workers (‘idle_men’), who can be delegated to a team with lower productivity
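
A sketch of the proposed categorization rule, reusing names from the earlier sketches (model, X_test, and the raw df with its 'targeted_productivity' column):

import numpy as np

preds = model.predict(X_test)
targets = df.loc[X_test.index, "targeted_productivity"]

# Less productive if the prediction falls short of the manual target,
# more productive otherwise
category = np.where(preds - targets < 0, "less productive", "more productive")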
