Several data mining programs do not accept non-numerical values within data
tables.
If, for some reason, non-numerical values (completely missing values or error
indicators such as "nan" or "error") are found in your dataset, those missing values have to be replaced by numbers.
TableButler offers four methods to replace missing values by more or less useful numbers.
Assume a data row like:
My_gene | 2 | -2 | nan | 1 | -1 | 3 | -3 | 4 | -10 |
Replace constant value:
If you define a constant, e.g. 0, every missing value is replaced by that constant (0 here, or any other value you consider useful).
Our sample data set would change to:
My_gene | 2 | -2 | 0 | 1 | -1 | 3 | -3 | 4 | -10 |
A fast and easy-to-follow, but not very intelligent,
way to impute missing values.
It adds a bias to your data set, because every missing cell receives the same value regardless of the rest of its row.
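The constant replacement can be sketched in a few lines of Python. This is only an illustration, not TableButler's own code: the function names `parse_cell` and `replace_constant` and the set of missing-value markers are assumptions made for this example.

```python
import math

# Assumed markers for missing cells; the real list used by TableButler may differ.
MISSING_MARKERS = {"", "nan", "error"}


def parse_cell(cell):
    """Return the cell as a float, or None if it is missing or not numerical."""
    text = str(cell).strip()
    if text.lower() in MISSING_MARKERS:
        return None
    try:
        value = float(text)
    except ValueError:
        return None
    return None if math.isnan(value) else value


def replace_constant(row, constant=0.0):
    """Replace every missing cell of a row by the given constant."""
    parsed = [parse_cell(cell) for cell in row]
    return [constant if value is None else value for value in parsed]


row = ["2", "-2", "nan", "1", "-1", "3", "-3", "4", "-10"]
print(replace_constant(row, constant=0.0))
```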
Row average:
The average of all numerical values in the respective row is calculated and used as the replacement.
In the example: Average = -6 / 8 = -0.75
Our sample data set would change to:
My_gene | 2 | -2 | -0.75 | 1 | -1 | 3 | -3 | 4 | -10 |
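As a small sketch of the row average method, the following Python function fills missing cells with the mean of the remaining values of the same row. The function name `impute_row_average` and the use of `None` for missing cells are assumptions made for this example. For the sample row, the eight present values sum to -6, so the mean is -6 / 8 = -0.75.

```python
def impute_row_average(row):
    """Replace missing cells (None) by the mean of the row's numerical values."""
    present = [value for value in row if value is not None]
    if not present:
        # No numerical value at all: nothing to average, return the row unchanged.
        return list(row)
    mean = sum(present) / len(present)
    return [mean if value is None else value for value in row]


my_gene = [2, -2, None, 1, -1, 3, -3, 4, -10]
print(impute_row_average(my_gene))
```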
Hot deck imputation:
A randomly selected numerical value from the original column is used to
replace the missing value.
The imputed data set may look different after different imputation runs (because
the replacement value is selected at random).
Our sample data set might change to any of the following:
My_gene | 2 | -2 | 3 | 1 | -1 | 3 | -3 | 4 | -10 |
My_gene | 2 | -2 | -10 | 1 | -1 | 3 | -3 | 4 | -10 |
My_gene | 2 | -2 | 2 | 1 | -1 | 3 | -3 | 4 | -10 |
If multiple values are missing, imputation is done non-recursively (i.e. imputed values are never used as candidates for random selection).
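A minimal Python sketch of hot deck imputation follows. It is an illustration under stated assumptions, not TableButler's implementation: the table is a list of rows, missing cells are `None`, and the function name `impute_hot_deck` and the optional `rng` parameter are made up for this example. The donor pools are built from the original table before anything is replaced, which gives the non-recursive behaviour described above.

```python
import random


def impute_hot_deck(table, rng=None):
    """Fill missing cells (None) with a random original value of the same column."""
    rng = rng or random.Random()
    n_cols = len(table[0])
    # Donor pools contain original values only, so imputed values are never reused.
    donors = [
        [row[col] for row in table if row[col] is not None]
        for col in range(n_cols)
    ]
    result = []
    for row in table:
        new_row = []
        for col, value in enumerate(row):
            if value is None and donors[col]:
                new_row.append(rng.choice(donors[col]))
            else:
                new_row.append(value)
        result.append(new_row)
    return result


table = [
    [2, None, 1],
    [3, 5, None],
    [None, 7, 4],
]
print(impute_hot_deck(table))
```

Passing a seeded generator, e.g. `rng=random.Random(42)`, makes a run repeatable; without it, each run may produce a different imputed table.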
Most similar:
The average over the most similar genes is calculated and imputed for the missing value.
In a first step, the most similar genes are searched for.
To do this,
TableButler calculates the Euclidean distances between the row to be imputed
and ALL other rows in the data set.
Because positions where a value is missing in one or the other row are skipped,
rows with many missing values may be assigned overly optimistic (too small) distances.
The best few rows (their number is defined as P1; in our
example let's take 5) are used:
15_gene | 2 | -1 | 2 | 1 | -1 | 3 | -3 | 4 | -10 |
1532_gene | 2 | -2 | 1 | 1 | -1 | 2 | -3 | 4 | -10 |
3048_gene | 2 | -2 | 3 | 1 | -1 | 3 | -3 | 3 | -10 |
7032_gene | 2 | -2 | 2 | 1 | -2 | 3 | -3 | 4 | -10 |
13854_gene | 2 | -1 | 1 | 1 | -1 | 3 | -3 | 4 | -10 |
Now the average for the missing column is calculated: (2+1+3+2+1) / 5 = 9 / 5 = 1.8
Our sample data set would change to:
My_gene | 2 | -2 | 1.8 | 1 | -1 | 3 | -3 | 4 | -10 |
At present TableButler does NOT create a copy of the imputed dataset. Thus Most-Similar imputation might use already imputed values in early rows for imputation of missing values in later rows (recursive processing).
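The most similar method can be sketched in Python as below. This is a hedged illustration only: the names `distance` and `most_similar_value`, the parameter `k` (standing in for P1), the use of `None` for missing cells, and the rule that only rows with a value in the target column are candidates are all assumptions made for this example. The function just returns the value to impute; whether the caller writes it back into the same table (recursive, as described above) or into a copy is left open.

```python
import math


def distance(a, b):
    """Euclidean distance over the positions where both rows have a value."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return math.inf  # no shared position: treat the rows as maximally distant
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs))


def most_similar_value(table, row_idx, col_idx, k=5):
    """Average the target column over the k rows closest to table[row_idx]."""
    target = table[row_idx]
    candidates = [
        (distance(target, row), row[col_idx])
        for i, row in enumerate(table)
        if i != row_idx and row[col_idx] is not None
    ]
    candidates.sort(key=lambda pair: pair[0])
    nearest = [value for _, value in candidates[:k]]
    return sum(nearest) / len(nearest) if nearest else None


table = [
    [2, -2, None, 1, -1, 3, -3, 4, -10],  # My_gene
    [2, -1, 2, 1, -1, 3, -3, 4, -10],     # 15_gene
    [2, -2, 1, 1, -1, 2, -3, 4, -10],     # 1532_gene
    [2, -2, 3, 1, -1, 3, -3, 3, -10],     # 3048_gene
    [2, -2, 2, 1, -2, 3, -3, 4, -10],     # 7032_gene
    [2, -1, 1, 1, -1, 3, -3, 4, -10],     # 13854_gene
    [-5, 5, 9, -4, 6, -3, 2, -1, 8],      # made-up distant row, not from the example
]
print(most_similar_value(table, row_idx=0, col_idx=2, k=5))
```

With the five example genes as nearest neighbours, the function averages 2, 1, 3, 2 and 1, matching the 9 / 5 = 1.8 computed above.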
Multiple Imputations:
If more than one value is missing, the process is repeated for each missing value.