Prediction of Cervical Cancer (Risk Factors) Data Set
Cervical cancer is the second most common cancer affecting women, ranked after breast cancer [1]. Both the risk of cervical cancer and the number of cancer deaths it causes are expected to increase. Cervical screening programs have reduced the death rate in developed countries. Although cervical cancer is one of the deadliest diseases, it can be cured if it is detected at an early stage [2]. Common risk factors associated with cervical cancer include first intercourse at an early age, pregnancy at an early age, improper menstrual hygiene, having multiple sexual partners, a weak immune system, smoking, use of oral contraceptives, and others [1]. Common symptoms include vaginal discharge, abnormal vaginal bleeding, and moderate pain during sexual intercourse [3].
Cervical cancer is rare in women younger than 20. However, many young women become infected with multiple types of human papillomavirus, which can increase their risk of developing cervical cancer later in life [2]. Manual diagnosis of cancer involves screening, which depends on many factors, including the knowledge of the health practitioner, the availability of a screening setup, and the availability of a medical facility within reach. Manual diagnosis by health practitioners is also tedious, subjective and prone to error [4]. Because manual processing is limited in accuracy, machine learning techniques have become prominent in medicine and healthcare, providing an alternative method for early diagnosis [1].
Machine learning tools based on computational methods can therefore serve as a second eye in diagnosis, providing a higher confidence level and reducing the chance of error. The most important problems during diagnosis are determining the best screening plan and estimating the individual risk of each patient [5]. The data used to classify the diagnosis and risk factors of cervical cancer were collected at Hospital Universitario de Caracas in Caracas, Venezuela, and were obtained from the UCI database. In this context, this paper first considers which feature selection approach suits this dataset in terms of its effect on accuracy. It then identifies a suitable classifier for predicting the risk factors in the dataset, and lastly explains the choice of test evaluation option used to evaluate the machine learning model. The aim of this paper is to predict the risk factors of cervical cancer in order to support early detection and prevention.
2.0 Data set
To address this problem, the Cervical Cancer (Risk Factors) dataset is taken from the UCI database. It contains demographic information, behaviours and historical medical records of 858 patients, giving 858 instances and 36 attributes. The distribution and types of the attributes are presented in Table 1.
The training data must be chosen carefully to minimize the impact that the limited training-set size has on classifier performance when predicting the risk factors of cervical cancer. One important choice is the class distribution of the training set. The class distribution of this dataset is unbalanced, and certain attributes also contain missing values. As the biopsy serves as the benchmark for diagnosing cervical cancer, the classification task in this paper uses the biopsy outcome as the target. Several patients decided not to answer some of the questions because of privacy concerns, so the dataset contains missing values. These missing (null) values affect how well a classifier trained on the dataset can identify the risk factors of cervical cancer. As a solution, the missing values in each column were imputed with the mode (most frequent value) of that column. One caveat is that the value of an answer might be correlated with the probability of it being missing.
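The mode imputation described above can be sketched as follows. This is a minimal illustration in plain Python, not the exact procedure used for this paper: the "?" marker for a missing answer, the function name impute_mode and the toy records are all assumptions made for the example.

```python
from collections import Counter

MISSING = "?"  # assumed marker for a missing answer; adjust to match the actual file


def impute_mode(rows):
    """Return a copy of rows with missing cells replaced by each column's mode."""
    n_cols = len(rows[0])
    modes = []
    for col in range(n_cols):
        observed = [row[col] for row in rows if row[col] != MISSING]
        # most_common(1) returns [(value, count)]; a fully missing column is left as is
        modes.append(Counter(observed).most_common(1)[0][0] if observed else MISSING)
    return [
        [modes[c] if cell == MISSING else cell for c, cell in enumerate(row)]
        for row in rows
    ]


# Hypothetical toy records with two columns: (smokes, hormonal contraceptives)
records = [
    ["0", "1"],
    ["0", "?"],
    ["?", "1"],
    ["1", "0"],
]
imputed = impute_mode(records)
```

Each column is imputed independently, so a missing answer in one attribute never borrows information from another attribute.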
3.0 Feature selection
Feature selection is also known as attribute selection or variable selection. It plays an important role in machine learning because it enables the learning algorithm to train faster, reduces the complexity of the model and makes the model easier to interpret. Feature selection also reduces overfitting and, if the right subset is chosen, improves the accuracy of the model. Feature selection methods can be used to identify and remove unneeded, irrelevant and redundant attributes that do not contribute to the accuracy of a predictive model or may even decrease it.
The accuracy of the prediction model therefore depends on the number of features in the dataset. Finding and selecting the most useful features in a dataset is a crucial step in machine learning. To classify and predict the risk factors of cervical cancer, the number of attributes in the dataset has to be reduced so that the prediction becomes more accurate. The most valuable attributes have to be chosen for the prediction to be accurate and to perform well. For this dataset, one straightforward method is to remove features with a high percentage of missing values: find the features whose fraction of missing values is above a specified threshold (60% missing values) and remove them.
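The missing-value filter with a 60% threshold can be sketched as below. This is an illustrative sketch only: the "?" missing marker, the function names and the toy column names (including std_time) are assumptions for the example, not the real attribute names or values.

```python
MISSING = "?"  # assumed marker for a missing answer


def missing_fraction(values):
    """Fraction of entries in a column that are missing."""
    return sum(1 for v in values if v == MISSING) / len(values)


def drop_sparse_features(columns, threshold=0.6):
    """Keep only the features whose missing fraction is not above the threshold.

    columns maps a feature name to its list of values.
    """
    return {
        name: values
        for name, values in columns.items()
        if missing_fraction(values) <= threshold
    }


# Hypothetical toy columns; names and values are illustrative only
toy = {
    "age": ["18", "25", "34", "52", "46"],
    "smokes": ["0", "?", "1", "0", "0"],
    "std_time": ["?", "?", "?", "?", "3"],
}
kept = drop_sparse_features(toy)
```

In this toy table only std_time exceeds the threshold (4 of 5 values missing), so it is the only feature removed.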
Removing collinear features is another method for choosing the features used for prediction. Collinear features are features that are highly correlated with one another, which leads to decreased generalization performance on the test set due to high variance, and to lower model interpretability. Collinear features are found using a specified correlation coefficient threshold: for each pair of features whose correlation exceeds the threshold, one of the two is marked for removal. A possible feature to remove from this dataset is STD-Time, because it has the highest number of missing values and its removal should not affect the performance of the classifier.
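The collinearity check can be sketched as follows, using the Pearson correlation coefficient. The 0.9 threshold, the function names and the toy feature names are assumptions chosen for illustration; the paper does not fix a particular coefficient value.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # constant feature: correlation is undefined, treat as uncorrelated
    return cov / (spread_x * spread_y)


def collinear_to_drop(columns, threshold=0.9):
    """For each highly correlated pair, mark the later feature for removal."""
    names = list(columns)
    to_drop = set()
    for i, first in enumerate(names):
        if first in to_drop:
            continue
        for second in names[i + 1:]:
            if second in to_drop:
                continue
            if abs(pearson(columns[first], columns[second])) > threshold:
                to_drop.add(second)
    return sorted(to_drop)


# Hypothetical toy features; smokes_packs is exactly half of smokes_years
toy = {
    "age": [18, 25, 34, 52, 46],
    "smokes_years": [0, 2, 5, 10, 1],
    "smokes_packs": [0, 1, 2.5, 5, 0.5],
}
redundant = collinear_to_drop(toy)
```

Keeping the first feature of each pair is an arbitrary choice; in practice the feature with fewer missing values or clearer clinical meaning may be the better one to keep.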
Biopsy is the target variable for this cervical cancer dataset. The features that could potentially be selected are age, first sexual intercourse, number of pregnancies, smokes, hormonal contraceptives and STDs: genital herpes. These selected features are important for predicting the class of the main risk factors for cervical cancer patients and will affect the performance of the machine learning classifiers. If an important feature is not selected, the accuracy of the classifier may be low.
4.0 Choice of the classifiers
Machine learning classifiers fall into two broad groups, parametric and non-parametric, both of which are widely used in real-world applications. A parametric classifier assumes a pre-defined form for the function that models the data, with a set of parameters of fixed size. An example of a parametric method is linear regression. Algorithms that do not make strong assumptions about the form of the mapping function are known as non-parametric classifiers. By not making such assumptions, they are free to learn any functional form from the training data, with a number of parameters that is not fixed in advance. Examples of non-parametric classifiers are the decision tree and k-nearest neighbour.
As mentioned above, a parametric classifier is an algorithm that simplifies the function to a known form with a fixed number of parameters, as in linear regression. The cervical cancer risk factor dataset could be modelled with a parametric classifier that learns its coefficients from quantities such as the standard deviation, mean, mode and median of the data. Simple linear regression is a type of regression analysis in which there is one independent variable and a linear relationship between the independent and dependent variables [7]. However, good generalization requires matching the complexity of the model to the complexity of the data so that errors are minimized. The cervical cancer risk factor dataset is complex, with 36 attributes, and needs a different classifier to predict the main risk factors. Therefore, this parametric classifier is not a suitable choice for this dataset, although it is a simple and useful algorithm for those who want to learn and understand more about machine learning.
Non-parametric classifiers for machine learning include the decision tree, support vector machine (SVM) and k-nearest neighbours (KNN). These classifiers are free to learn any functional form from the training data. KNN in particular assumes nothing about the form of the mapping function other than that patterns that are close together are likely to have a similar output variable (biopsy). The non-parametric classifier identified as suitable for classifying and predicting the risk factors of cervical cancer is K-Nearest Neighbour (KNN). The classifier assigns an input to the class with the most examples among the k nearest neighbours of that input. All neighbours have an equal vote, and the class with the largest number of votes among the k neighbours is chosen.
The KNN classifier needs no explicit training phase. The data is divided into a training set (60%) and a testing set (40%). For every row of the test set, the k nearest points in the training set are found using Euclidean distance, and the row is classified by a majority vote among them. Cross-validation will be used on the dataset to make the performance measure unbiased. Cross-validation repeatedly reuses the same data, split differently each time [7].
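The KNN procedure with a 60/40 split can be sketched in plain Python as below. This is a minimal sketch under stated assumptions: the value k=3, the random seed, the helper names and the toy rows are illustrative and are not taken from the dataset.

```python
import math
import random
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3):
    """Classify query by an equal-weight majority vote of its k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: euclidean(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def split_60_40(X, y, seed=0):
    """Shuffle the rows, then use the first 60% for training and the rest for testing."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = (len(idx) * 6) // 10
    train, test = idx[:cut], idx[cut:]
    return (
        [X[i] for i in train], [y[i] for i in train],
        [X[i] for i in test], [y[i] for i in test],
    )


# Hypothetical toy rows: (age, number of pregnancies) with a 0/1 biopsy label
X = [[20, 1], [22, 1], [21, 2], [23, 2], [24, 1],
     [50, 6], [52, 7], [51, 5], [53, 6], [54, 7]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

train_X, train_y, test_X, test_y = split_60_40(X, y)
predictions = [knn_predict(train_X, train_y, row, k=3) for row in test_X]
```

Because Euclidean distance is sensitive to scale, attributes measured in large units can dominate the vote, so rescaling the features before computing distances is usually worth considering.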
K-Fold cross-validation is a validation process that suits this dataset. It repeats the random splitting of the training data K times and averages the results into a single performance measure. K is typically 10 or 30. As K increases, the percentage of training instances increases and the estimators become more robust, but the validation set becomes smaller. Given the size and characteristics of the cervical cancer dataset, K-Fold cross-validation is suitable and preferable for evaluating the machine learning algorithms. This is because the dataset is large enough to produce a number of training and validation set pairs, where in each pair one part of the data is used for training and the other part for validation. The classification of risk factors for cervical cancer can be evaluated through this cross-validation process to reduce bias in the performance estimate and in the classification error percentages.
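The K-Fold procedure can be sketched as follows. The function names, the random seed and the majority_baseline placeholder model are assumptions made for illustration; any classifier that accepts training rows, training labels and validation rows could be passed in its place.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, validation) index lists for each of the k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        validation = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, validation


def cross_validate(X, y, k, fit_predict):
    """Average validation accuracy of fit_predict over the k folds."""
    scores = []
    for train, validation in k_fold_indices(len(X), k):
        preds = fit_predict(
            [X[i] for i in train], [y[i] for i in train], [X[i] for i in validation]
        )
        correct = sum(p == y[i] for p, i in zip(preds, validation))
        scores.append(correct / len(validation))
    return sum(scores) / len(scores)


def majority_baseline(train_X, train_y, val_X):
    """Placeholder model: always predicts the most frequent training label."""
    top = max(set(train_y), key=train_y.count)
    return [top] * len(val_X)


# Hypothetical toy data: 20 rows, 4 of them positive
toy_X = [[i] for i in range(20)]
toy_y = [0] * 16 + [1] * 4
mean_accuracy = cross_validate(toy_X, toy_y, 5, majority_baseline)
```

With 858 instances and K = 10, each validation fold in this sketch holds 85 or 86 instances, and every instance is used for validation exactly once.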
5.0 Test evaluation option
Test evaluation plays a vital role in building a well-performing machine learning model. Assessing the performance of a classifier consists of two steps: using the model to predict the class of risk factors for cervical cancer, and using an indicator to compare the predicted values with the real values. Training the model and testing its performance on the same data would give an overly optimistic evaluation of the prediction model. The classifier is therefore validated by splitting the data into training data and testing data.
In addition, K-Fold cross-validation will be used to reduce the risk of over-fitting. All of the evaluation metrics will be calculated from the confusion matrix shown in Figure 1. A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known [4]. The most basic terms in a confusion matrix are True Positive (TP), True Negative (TN), False Positive (FP) and False Negative (FN).
Figure 1: Confusion matrix
A test set makes it possible to test the model on unseen data. The performance of the chosen classifier will be measured using the evaluation metrics Precision, Recall, F1-score, Specificity and Accuracy, as shown in Table 3. Evaluating the classifier with these metrics shows where its performance can be improved.
Table 3: Evaluation metrics
Metric: Description
Accuracy: Rate of correct predictions for both healthy and not healthy patients.
Sensitivity (true positive rate): The percentage of sick people who are correctly identified as having the disease.
Specificity (true negative rate): The percentage of healthy people who are correctly diagnosed as healthy.
Recall: The ratio of correctly predicted positive observations to all actual positive observations (equivalent to sensitivity).
Precision: Positive predictive value, the ratio of correctly predicted positive observations to all predicted positive observations.
F1-score: Harmonic mean of precision and recall.
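The metrics in Table 3 follow directly from the four confusion matrix counts, as the sketch below illustrates. The function names, the choice of 1 as the positive label and the toy labels are assumptions made for the example; returning 0.0 when a denominator is zero is one possible convention, not the only one.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary classification task."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def safe_div(num, den):
    """Return 0.0 instead of failing when a metric's denominator is zero."""
    return num / den if den else 0.0


def evaluation_metrics(tp, tn, fp, fn):
    """Compute the Table 3 metrics from the confusion matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # identical to sensitivity
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "sensitivity": recall,
        "specificity": safe_div(tn, tn + fp),
        "recall": recall,
        "precision": precision,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical toy labels: 1 = positive biopsy, 0 = negative biopsy
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = evaluation_metrics(*confusion_counts(y_true, y_pred))
```

On an unbalanced dataset such as this one, accuracy alone can look high even when few positive cases are found, which is why sensitivity, specificity, precision and F1-score are reported alongside it.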