TableButler - Data imputation

Several data mining programs do not accept non-numerical values within data tables.
If, for some reason, non-numerical values (completely missing values or error indicators such as "nan" or "error") are found in your dataset, those values have to be replaced by numbers.
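How a cell is recognised as non-numerical can be sketched as follows. This is only an illustration, not TableButler's actual code: the helper name parse_cell is hypothetical, and treating infinite values as missing is an assumption made here.

```python
import math

def parse_cell(cell):
    """Return the cell as a float, or None if it is not a usable number."""
    try:
        value = float(cell)
    except (TypeError, ValueError):
        return None  # empty cells, "error" and any other text
    # float("nan") and float("inf") parse without an exception,
    # so they have to be rejected explicitly (assumption: inf counts as missing)
    return value if math.isfinite(value) else None

row = ["-2", "nan", "-1", "", "error", "-10"]
parsed = [parse_cell(c) for c in row]
```

Note that a plain float() conversion is not enough on its own, because the text "nan" converts successfully to a not-a-number value.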

TableButler offers four methods to replace missing values by more or less useful numbers:

replace non-numerical cells with a constant value
replace non-numerical cells with the row average (arithmetic mean)
replace non-numerical cells with any value from the row (hot deck imputation)
replace non-numerical cells with the average from the xx most similar gene vectors (Euclidean distance)

Assume a data row like:

My_gene	-2	nan	-1	-3	-10

Replace constant value:

If you define the constant as e.g. 0, the missing value is replaced by 0 (or by any other value you like or think is useful).

Our sample data set would change to:

My_gene	-2	0	-1	-3	-10

This is fast and easy to understand and follow, but it is not a very intelligent way to impute missing values.
Every missing cell receives the same value regardless of the other values in its row, which adds bias to your data set.
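A minimal sketch of constant replacement, assuming the row has already been parsed into floats with missing cells represented as NaN. The function name impute_constant is hypothetical and not part of TableButler.

```python
import math

def impute_constant(row, constant=0.0):
    """Replace every missing (NaN) cell with the same constant."""
    return [constant if math.isnan(v) else v for v in row]

row = [-2.0, float("nan"), -1.0, -3.0, -10.0]
imputed = impute_constant(row, 0.0)
```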

Row average:

Calculate the average of all numerical values in the respective row.
In the example: Average = (-2 - 1 - 3 - 10) / 4 = -16 / 4 = -4

Our sample data set would change to:

My_gene	-2	-4	-1	-3	-10
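Row-average imputation could look roughly like this. Again a sketch under assumptions: missing cells are NaN floats, the function name impute_row_mean is hypothetical, and leaving a row without any numerical value unchanged is a choice made here, not documented behaviour.

```python
import math

def impute_row_mean(row):
    """Replace missing (NaN) cells with the mean of the row's numerical cells."""
    present = [v for v in row if not math.isnan(v)]
    if not present:
        return list(row)  # nothing to average over; leave the row as it is
    mean = sum(present) / len(present)
    return [mean if math.isnan(v) else v for v in row]

row = [-2.0, float("nan"), -1.0, -3.0, -10.0]
imputed = impute_row_mean(row)  # mean of -2, -1, -3, -10 is -16 / 4 = -4.0
```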

Hot deck imputation:

A randomly selected numerical value from the original row is used to replace the missing value.
The imputed data set may look different after different imputation runs (because the value is selected at random).

Our sample data set might change to any of the following:

My_gene	-2	-2	-1	-3	-10

My_gene	-2	-10	-1	-3	-10

My_gene	-2	-3	-1	-3	-10

or ...

If multiple values are missing, imputation is done non-recursively (i.e. imputed values are not used for random selection).
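The non-recursive behaviour can be illustrated with a short sketch: the pool of donor values is built once from the original row, so a value imputed for one cell can never be drawn for another. The function name impute_hot_deck and the optional rng parameter are hypothetical and only serve the example.

```python
import math
import random

def impute_hot_deck(row, rng=None):
    """Replace each missing (NaN) cell with a random original value of the row."""
    if rng is None:
        rng = random.Random()
    # Donor pool is taken from the ORIGINAL row only (non-recursive)
    donors = [v for v in row if not math.isnan(v)]
    if not donors:
        return list(row)  # no numerical value to draw from
    return [rng.choice(donors) if math.isnan(v) else v for v in row]

row = [-2.0, float("nan"), -1.0, float("nan"), -10.0]
imputed = impute_hot_deck(row, random.Random(42))
```

Passing a seeded random.Random makes a run repeatable; without it, each run may give a different result, as described above.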

Most similar:

The average over the most similar genes is calculated and imputed for the missing value.

In a first step, the most similar genes are searched for.
To do this, TableButler calculates the Euclidean distances between the row to be imputed and ALL other rows in the data set.
Positions where a value is missing in one or the other row are not used; as a consequence, rows with many missing values may be assigned overly optimistic (too small) distances.
The best few (to be defined as P1; in our example let's take 5) are used:

15_gene	2	-1	2	1	-1	3	-3	4	-10
1532_gene	2	-2	1	1	-1	2	-3	4	-10
3048_gene	2	-2	3	1	-1	3	-3	3	-10
7032_gene	2	-2	2	1	-2	3	-3	4	-10
13854_gene	2	-1	1	1	-1	3	-3	4	-10

Now the average for the missing column is calculated: (2+1+3+2+1) / 5 = 9 / 5 = 1.8

Our sample data set would change to:

My_gene	-2	1.8	-1	-3	-10

At present TableButler does NOT create a copy of the dataset before imputation. Thus Most-Similar imputation might use already imputed values in early rows for imputation of missing values in later rows (recursive processing).
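The two steps (distance search, then averaging) can be sketched as below. This is an illustration under stated assumptions, not TableButler's implementation: the function names are hypothetical, missing cells are NaN floats, neighbours that lack a value in the column to be imputed are skipped, and the sketch works on a copy of the row, so it does not show the recursive effect described above. The target row and the outlier row are invented; the five neighbours are the first five columns of the rows listed earlier.

```python
import math

def distance(a, b):
    """Euclidean distance over positions where both rows hold a number."""
    shared = [(x, y) for x, y in zip(a, b)
              if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        return math.inf  # nothing to compare; treat as maximally distant
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))

def impute_most_similar(target, others, k=5):
    """Fill NaN cells of target with the mean of the k nearest rows' values."""
    ranked = sorted(others, key=lambda r: distance(target, r))
    imputed = list(target)  # work on a copy
    for i, v in enumerate(target):
        if math.isnan(v):
            # Take the k nearest rows that have a number in column i
            donors = [r[i] for r in ranked if not math.isnan(r[i])][:k]
            if donors:
                imputed[i] = sum(donors) / len(donors)
    return imputed

nan = float("nan")
target = [2.0, -2.0, nan, 1.0, -1.0]          # invented row to be imputed
others = [
    [2.0, -1.0, 2.0, 1.0, -1.0],
    [2.0, -2.0, 1.0, 1.0, -1.0],
    [2.0, -2.0, 3.0, 1.0, -1.0],
    [2.0, -2.0, 2.0, 1.0, -2.0],
    [2.0, -1.0, 1.0, 1.0, -1.0],
    [10.0, 10.0, 10.0, 10.0, 10.0],           # invented outlier, never selected
]
imputed = impute_most_similar(target, others, k=5)
```

Because only shared positions enter the sum, a row with few shared positions accumulates few squared differences, which is the reason such rows can look closer than they really are.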

Multiple Imputations:

If more than one value is missing, the process is repeated for each single value.

Last edited 07.10.2005