Data sampling approaches. In random oversampling (ROS), samples from the minority class are repeatedly copied back into the training set by some selection algorithm [62]. Random undersampling (RUS), by contrast, constructs various training subsets by randomly removing majority-class samples. Unlike ROS, SMOTE does not replicate existing samples but creates synthetic ones in the feature domain: it considers the k nearest neighbors of a minority sample in feature space and generates a new sample by interpolation (a minimal sketch of this step is given at the end of this section). Feature selection methods, which have so far received little attention from scholars, are also adopted to enhance classification performance [63] by extracting features that discriminate between classes.

Algorithm-level methods address imbalanced classification at the algorithm level and can be divided into cost-sensitive methods and ensemble methods. Cost-sensitive methods assign larger weights to misclassified instances during training and thereby concentrate learning on the most informative samples. Ensemble methods consist of several parallel or serial classifiers whose outputs are combined into a final result. They are presented as bagging and boosting [64] types or as hybrid methods, each with its own advantages. Bagging builds subclassifiers on subsamples extracted from the whole dataset and combines the individual models into the final classification. Boosting follows the same procedure of training models on several individual subsamples, but additionally weights each classifier adaptively according to its misclassification rate. Hybrid methods combine multiple data sampling methods with basic learning algorithms, e.g., naïve Bayes [65], to address known weaknesses of the individual approaches.

We believe that no single class-balancing algorithm can achieve the best performance on datasets from all domains, because different datasets have different characteristics. Fernandez et al. [66] examined data balancing in big data experimentally. Their results show that RUS, ROS and SMOTE differ significantly in performance on certain datasets, that RUS and ROS each have advantages under different configurations, and that SMOTE performs the worst of the three.

Deep neural networks are trained with big data, usually much larger than traditional datasets, and achieve better performance because of their strong feature extraction capability. Although the methodologies for class balancing are similar between traditional datasets and big data, addressing data imbalance in big data requires specially designed algorithms. Data imbalance problems for deep learning methods in the mining domain can be classified into class imbalance and scale imbalance [67].

Class imbalance means that a certain class in the dataset is over-represented. For belt tear datasets, it refers to foreground-to-background imbalance, where the foreground is the tear region in a belt image and the background is everything else; background objects usually far outnumber foreground objects. Two typical families of sampling methods are frequently used to address class imbalance: hard sampling and soft sampling. In general, the difference between hard and soft sampling is that hard sampling selects a discrete subset of samples for training, whereas soft sampling assigns each sample a continuous weight that scales its contribution to the loss.
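To make the SMOTE interpolation described at the start of this section concrete, here is a minimal NumPy sketch; the function name and parameters are our own illustrative choices, not part of any cited implementation.

```python
import numpy as np

def smote_sample(X_minority, k=5, n_new=100, rng=None):
    """Generate synthetic minority samples by interpolating between a
    minority sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_minority)                              # assumes k < n
    # Pairwise distances within the minority class; exclude self-matches.
    d = np.linalg.norm(X_minority[:, None] - X_minority[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]               # k nearest neighbors
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                          # a random minority sample
        j = knn[i, rng.integers(k)]                  # one of its k neighbors
        lam = rng.random()                           # interpolation factor
        synthetic.append(X_minority[i] + lam * (X_minority[j] - X_minority[i]))
    return np.asarray(synthetic)
```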
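Similarly, the cost-sensitive idea of penalizing errors on rare classes more heavily is often realized with inverse-frequency class weights; the sketch below uses the common "balanced" heuristic (naming is ours, not from the cited works).

```python
import numpy as np

def balanced_class_weights(y):
    """Inverse-frequency weights: the rarer a class, the larger the cost
    of misclassifying one of its samples during training."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes, weights))

y = np.array([0] * 950 + [1] * 50)        # a 19:1 imbalanced label vector
print(balanced_class_weights(y))          # {0: ~0.53, 1: 10.0}
```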
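The bagging scheme (bootstrap resampling plus majority vote) can be sketched as follows, assuming integer class labels and, purely for illustration, scikit-learn decision trees as the base learners.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, n_estimators=25, rng=None):
    """Fit each base tree on a bootstrap resample of the training set,
    then combine the individual predictions by majority vote."""
    rng = np.random.default_rng(rng)
    votes = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # bootstrap
        tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
        votes.append(tree.predict(X_test))
    votes = np.stack(votes).astype(int)    # shape: (n_estimators, n_test)
    # Majority vote across the ensemble for every test sample.
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```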
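Finally, the soft-sampling idea can be illustrated with focal loss, a well-known example from the object detection literature that down-weights easy samples rather than discarding any; the text above does not prescribe this particular loss, so it is shown only as one representative instance.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss: (1 - p_t)**gamma shrinks the contribution of
    easy (well-classified) samples, so the abundant background class
    does not dominate the training signal."""
    p = np.clip(p, eps, 1 - eps)                  # predicted foreground prob.
    p_t = np.where(y == 1, p, 1 - p)              # prob. of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # foreground/background weight
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```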