Are deep learning models superior for missing data imputation in surveys? Evidence from an empirical comparison
Section 2. Missing data imputation methods

We first introduce notation. Consider a sample with $n$ units, each of which is associated with $p$ variables. Let $Y_{ij}$ be the value of variable $j$ for individual $i$, where $j = 1, \ldots, p$ and $i = 1, \ldots, n$. Here, $Y$ can be continuous, binary, categorical or mixed binary-continuous. For each individual $i$, let $Y_i = (Y_{i1}, \ldots, Y_{ip})$. For each variable $j$, let $Y_j = (Y_{1j}, \ldots, Y_{nj})$. Let $Y = (Y_1, \ldots, Y_n)$ be the $n \times p$ matrix comprising the data for all records included in the sample. We write $Y = (Y_{\text{obs}}, Y_{\text{mis}})$, where $Y_{\text{obs}}$ and $Y_{\text{mis}}$ are respectively the observed and missing parts of $Y$. We write $Y_{\text{mis}} = (Y_{\text{mis},1}, \ldots, Y_{\text{mis},p})$, where $Y_{\text{mis},j}$ represents all missing values for variable $j$, with $j = 1, \ldots, p$. Similarly, we write $Y_{\text{obs}} = (Y_{\text{obs},1}, \ldots, Y_{\text{obs},p})$ for the corresponding observed data.

In MI, the analyst generates values of the missing data $Y_{\text{mis}}$ using pre-specified models estimated with $Y_{\text{obs}}$, resulting in a completed dataset. The analyst then repeats the process to generate $L$ completed datasets, $\{Y^{(l)} : l = 1, \ldots, L\}$, that are available for inference or dissemination. For inference, the analyst can compute sample estimates for population estimands in each completed dataset $Y^{(l)}$, and combine them using MI inference rules developed by Rubin (1987), which will be reviewed in Section 3.

2.1  MICE with classification tree models

Under MICE, the analyst begins by specifying a separate univariate conditional model for each variable with missing values. The analyst then specifies an order in which to iterate through the sequence of conditional models when imputing. We write the ordered list of the variables as $(Y_{(1)}, \ldots, Y_{(p)})$. Next, the analyst initializes each $Y_{\text{mis},(j)}$. The most popular options are to sample from (i) the marginal distribution of the corresponding $Y_{\text{obs},(j)}$, or (ii) the conditional distribution of $Y_{(j)}$ given all the other variables, constructed using only available cases.

After initialization, the MICE algorithm follows an iterative process that cycles through the sequence of univariate models. For each variable $j$ at each iteration $t$, one fits the conditional model $(Y_{(j)} \mid Y_{\text{obs},(j)}, \{Y_{(k)}^{(t)} : k < j\}, \{Y_{(k)}^{(t-1)} : k > j\})$. Next, one replaces $Y_{\text{mis},(j)}^{(t)}$ with draws from the implied model $(Y_{\text{mis},(j)}^{(t)} \mid Y_{\text{obs},(j)}, \{Y_{(k)}^{(t)} : k < j\}, \{Y_{(k)}^{(t-1)} : k > j\})$. The iterative process continues for $T$ total iterations until convergence, and the values at the final iteration make up a completed dataset $Y^{(l)} = (Y_{\text{obs}}, Y_{\text{mis}}^{(T)})$. The entire process is then repeated $L$ times to create the $L$ completed datasets. We provide pseudocode detailing each step of the MICE algorithm in the supplementary material.
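
For concreteness, the following minimal Python sketch outlines one MICE run under the option (i) initialization; `mice_run` and `fit_and_draw` are our illustrative names, with `fit_and_draw` standing in for any univariate conditional model, such as the CART and random forests models described next.

```python
import numpy as np

def mice_run(Y, fit_and_draw, T=10, seed=0):
    """One MICE run. Y is an (n, p) float array with np.nan for missing
    entries; fit_and_draw(X_obs, y_obs, X_mis) fits a univariate model on
    the available cases and returns draws for the incomplete cases."""
    rng = np.random.default_rng(seed)
    Y = Y.copy()
    mis = np.isnan(Y)
    # Initialize each Y_mis,(j) by sampling from the marginal of Y_obs,(j).
    for j in range(Y.shape[1]):
        if mis[:, j].any():
            Y[mis[:, j], j] = rng.choice(Y[~mis[:, j], j], size=mis[:, j].sum())
    # Cycle through the sequence of univariate conditional models T times.
    for t in range(T):
        for j in range(Y.shape[1]):
            if not mis[:, j].any():
                continue
            X = np.delete(Y, j, axis=1)   # all other variables, current values
            obs = ~mis[:, j]
            Y[~obs, j] = fit_and_draw(X[obs], Y[obs, j], X[~obs])
    return Y                              # one completed dataset Y^(l)
```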

Under MICE-CART, the analyst uses CART (Breiman et al., 1984) for the univariate conditional models in the MICE algorithm. CART follows a decision tree structure that uses recursive binary splits to partition the predictor space into distinct non-overlapping regions. The top of the tree represents its root, and each successive binary split divides the predictor space into two new branches as one moves down the tree. The splitting criterion at each node is usually chosen to minimize an information theoretic entropy measure. Splits that do not decrease the lack of fit by a reasonable amount, based on a set threshold, are pruned off. The tree is grown until a stopping criterion is met, e.g., a minimum number of observations in each leaf.

Once the tree has been fully constructed, one generates $Y_{\text{mis},(j)}^{(t)}$ by traversing down the tree to the appropriate leaf using the combinations in $(\{Y_k^{(t)} : k < j\}, \{Y_k^{(t-1)} : k > j\})$, and then sampling from the $Y_{(j)}^{\text{obs}}$ values in that leaf. That is, given any combination in $(\{Y_k^{(t)} : k < j\}, \{Y_k^{(t-1)} : k > j\})$, one uses the proportion of values of $Y_j^{\text{obs}}$ in the corresponding leaf to approximate the conditional distribution $(Y_{(j)} \mid Y_{\text{obs},(j)}, \{Y_{(k)}^{(t)} : k < j\}, \{Y_{(k)}^{(t-1)} : k > j\})$. The iterative process again continues for $T$ total iterations, and the values at the final iteration make up a completed dataset.
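
The sampling step can be illustrated with a short sketch (our construction using scikit-learn, not the internals of the mice package): fit a tree on the available cases, route each incomplete case to its leaf, and draw uniformly from the observed values in that leaf.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def cart_draw(X_obs, y_obs, X_mis, seed=0):
    """Draw imputations for a continuous Y_(j) from CART leaves."""
    rng = np.random.default_rng(seed)
    y_obs = np.asarray(y_obs)
    tree = DecisionTreeRegressor(min_samples_leaf=5).fit(X_obs, y_obs)
    leaf_obs = tree.apply(X_obs)          # leaf index of each observed case
    leaf_mis = tree.apply(X_mis)          # leaf index of each incomplete case
    draws = np.empty(len(X_mis))
    for i, leaf in enumerate(leaf_mis):
        donors = y_obs[leaf_obs == leaf]  # Y_j^obs values in that leaf
        draws[i] = rng.choice(donors)     # sample one donor value
    return draws
```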

MICE-RF instead uses random forests for the univariate conditional models in MICE (e.g., Stekhoven and Bühlmann, 2012; Shah, Bartlett, Carpenter, Nicholas and Hemingway, 2014). Random forests (Ho, 1995; Breiman, 2001) is an ensemble tree method that fits multiple decision trees to the data, instead of a single tree as in CART. Specifically, random forests constructs multiple decision trees using bootstrapped samples of the original data, and uses only a random sample of the predictors for the recursive partitions in each tree. This approach can significantly reduce the prevalence of unstable trees as well as the correlation among individual trees, since it prevents the same variables from dominating the partitioning process across all trees. Theoretically, this decorrelation should result in predictions with lower variance (Hastie, Tibshirani and Friedman, 2009).

For imputation, the analyst first trains a random forests model for each $Y_{(j)}$ using available cases, given all other variables. Next, the analyst generates predictions for $Y_{\text{mis},j}$ under that model. Specifically, for any categorical $Y_{(j)}$, and given any particular combination in $(\{Y_k^{(t)} : k < j\}, \{Y_k^{(t-1)} : k > j\})$, the analyst first generates predictions for each tree based on the values of $Y_j^{\text{obs}}$ in the corresponding leaf for that tree, and then uses the most commonly occurring level among the predictions from all the trees. For a continuous $Y_{(j)}$, the analyst instead uses the average of all the predictions from all the trees. The iterative process again cycles through all the variables for $T$ total iterations, and the values at the final iteration make up a completed dataset. A particularly important hyperparameter in random forests is the maximum number of trees $d$.
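
A minimal sketch of this aggregation step, assuming scikit-learn forests (`rf_impute` is our illustrative name; `predict` performs exactly the vote/average aggregation across trees described above):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

def rf_impute(X_obs, y_obs, X_mis, categorical, d=10, seed=0):
    """Predict Y_mis,j as described above with a forest of d trees."""
    Model = RandomForestClassifier if categorical else RandomForestRegressor
    forest = Model(n_estimators=d, random_state=seed).fit(X_obs, y_obs)
    # predict() aggregates over the d trees: the majority level for a
    # classifier, the average prediction for a regressor.
    return forest.predict(X_mis)
```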

For our evaluations, we use the mice R package to implement both MICE-CART and MICE-RF, and retain the default hyperparameter settings in the package to mimic common practice in real-world applications. Specifically, we set the minimum number of observations in each terminal leaf to 5 and the pruning threshold to 0.0001 in MICE-CART. In MICE-RF, the maximum number of trees $d$ is set to 10.

2.2  Generative Adversarial Imputation Network (GAIN)

GAIN (Yoon, Jordon and Schaar, 2018) is an imputation method based on GANs (Goodfellow et al., 2014), which consist of a generator function $G$ and a discriminator function $D$. For any data matrix $Y = (Y_{\text{obs}}, Y_{\text{mis}})$, we replace $Y_{\text{mis}}$ with random noise, $Z_{ij}$, sampled from a uniform distribution. The generator $G$ inputs this initialized data and a mask matrix $M$, with $M_{ij} \in \{0, 1\}$ indicating observed values of $Y$, and outputs predicted values for both the observed data and missing data, $\hat{Y}$. The discriminator $D$ utilizes $\hat{Y} = (Y_{\text{obs}}, \hat{Y}_{\text{mis}})$ and a hint matrix $H$ of the same dimension to identify which values are observed or imputed by $G$, which results in a predicted mask matrix $\hat{M}$. The hint matrix, sampled from the Bernoulli distribution with $p$ equal to a "hint rate" hyperparameter, reveals to $D$ partial information about $M$ in order to help guide $G$ to learn the underlying distribution of $Y$.
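
One concrete construction of the hint matrix consistent with this description, following Yoon, Jordon and Schaar (2018), draws a Bernoulli indicator $B_{ij}$ with success probability equal to the hint rate and reveals $M_{ij}$ only where $B_{ij} = 1$; the sketch below (our illustration) sets unrevealed entries to 0.5.

```python
import numpy as np

def make_hint(M, hint_rate, seed=0):
    """Hint matrix H: reveal M_ij to D with probability hint_rate;
    entries equal to 0.5 carry no information about M_ij."""
    rng = np.random.default_rng(seed)
    B = rng.binomial(1, hint_rate, size=M.shape)  # Bernoulli(hint rate)
    return B * M + 0.5 * (1 - B)
```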

We first train $D$ to minimize the loss function, $L_D(M, \hat{M})$, for each mini-batch of size $n_i$:

$$L_D(M, \hat{M}) = \sum_{i=1}^{n_i} \sum_{j=1}^{J} M_{ij} \log(\hat{M}_{ij}) + (1 - M_{ij}) \log(1 - \hat{M}_{ij}). \tag{2.1}$$

Next, $G$ is trained to minimize the loss function (2.2), which is composed of a generator loss, $L_G(M, \hat{M})$, and a reconstruction loss, $L_M(Y, \hat{Y}, M)$. The generator loss (2.3) is minimized when $D$ incorrectly identifies imputed values as being observed. The reconstruction loss (2.4) is minimized when the predicted values are similar to the observed values, and is weighted by the hyperparameter $\beta$:

$$L(Y, \hat{Y}, M, \hat{M}) = L_G(M, \hat{M}) + \beta L_M(Y, \hat{Y}, M), \tag{2.2}$$

$$L_G(M, \hat{M}) = \sum_{i=1}^{n_i} \sum_{j=1}^{J} M_{ij} \log(1 - \hat{M}_{ij}), \tag{2.3}$$

$$L_M(Y, \hat{Y}, M) = \sum_{i=1}^{n_i} \sum_{j=1}^{J} (1 - M_{ij}) \, L_{\text{rec}}(Y_{ij}, \hat{Y}_{ij}), \tag{2.4}$$

where

$$L_{\text{rec}}(Y_{ij}, \hat{Y}_{ij}) = \begin{cases} (\hat{Y}_{ij} - Y_{ij})^2 & \text{if } Y_{ij} \text{ is continuous} \\ -Y_{ij} \log \hat{Y}_{ij} & \text{if } Y_{ij} \text{ is categorical.} \end{cases} \tag{2.5}$$
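
For concreteness, the losses (2.1)-(2.5) can be transcribed directly in NumPy for a single mini-batch; this is a sketch of the formulas as written, and the small epsilon inside the logarithms is our numerical safeguard, not part of the formulas.

```python
import numpy as np

EPS = 1e-8  # numerical safeguard inside the logs (our addition)

def L_D(M, M_hat):  # discriminator loss, equation (2.1)
    return np.sum(M * np.log(M_hat + EPS) + (1 - M) * np.log(1 - M_hat + EPS))

def L_G(M, M_hat):  # generator loss, equation (2.3)
    return np.sum(M * np.log(1 - M_hat + EPS))

def L_M(Y, Y_hat, M, categorical):  # reconstruction loss, (2.4)-(2.5)
    rec = -Y * np.log(Y_hat + EPS) if categorical else (Y_hat - Y) ** 2
    return np.sum((1 - M) * rec)

def total_G_loss(Y, Y_hat, M, M_hat, beta, categorical):  # equation (2.2)
    return L_G(M, M_hat) + beta * L_M(Y, Y_hat, M, categorical)
```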

In our experiments, we model both $G$ and $D$ as fully-connected neural networks, each with three hidden layers and $\theta$ hidden units per hidden layer. The hidden layer weights are initialized uniformly at random with the Xavier initialization method (Glorot and Bengio, 2010). We use the leaky ReLU activation function (Maas, Hannun and Ng, 2013) for each hidden layer, and a softmax activation function for the output layer of $G$ in the case of categorical variables, or a sigmoid activation function in the case of numerical variables and for the output of $D$. We facilitate this choice of output layer for numerical variables by transforming all continuous variables to be within the range (0, 1) using the MinMax normalization: $Y_{ij}^* = \{Y_{ij} - \min(Y_{\cdot j})\} / \{\max(Y_{\cdot j}) - \min(Y_{\cdot j})\}$, where $\min(Y_{\cdot j})$ and $\max(Y_{\cdot j})$ are the minimum and maximum of variable $j$, respectively. After imputation, we transform each value back to its original scale. We generate multiple imputations using several runs of the model with varying initial imputations of the missing values.
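
A short sketch of the normalization and its inverse (our illustrative helper names):

```python
import numpy as np

def minmax(col):
    """Scale variable j to (0, 1); keep (min, max) to undo the transform."""
    lo, hi = np.nanmin(col), np.nanmax(col)
    return (col - lo) / (hi - lo), (lo, hi)  # assumes hi > lo

def minmax_inverse(scaled, lo, hi):
    """Map imputed values back to the original scale of variable j."""
    return scaled * (hi - lo) + lo
```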

To implement GAIN in our evaluations, we use the same architecture as the one in Yoon, Jordon and Schaar (2018). We set $\beta = 100$, $\theta$ equal to the number of features of the input data, and tune the hint rate on a single simulation. Following common practice in the GAN literature (Berthelot, Schumm and Metz, 2017; Ham, Jun and Kim, 2020), we track the evolution of GAIN's generator and discriminator losses, and manually tune the hint rate so that the two losses are qualitatively similar. Specifically, we first coarsely select the hint rate among {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9}. We then determine the final value by an additional fine-tuning step. In the MAR scenario, for example, after observing that the optimal value is in the range (0.1, 0.2), we perform a search among {0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19}. Finally, we set the optimal hint rate for the MCAR and MAR scenarios to 0.3 and 0.13, respectively. We train the networks for 200 epochs using stochastic gradient descent (SGD) and mini-batches of size 512 to learn the parameter weights. We use the Adam optimizer to adapt the learning rate, with an initial rate of 0.001 (Kingma and Ba, 2014).

2.3  Multiple Imputation using Denoising Autoencoders (MIDA)

MIDA (Gondara and Wang, 2018; Lu et al., 2020) extends a class of neural networks, denoising autoencoders, for MI. An autoencoder is a neural network model trained to learn the identity function of the input data. Denoising autoencoders intentionally corrupt the input data in order to prevent the network from learning the identity function exactly, so that it instead learns a useful low-dimensional representation of the input data. The MIDA architecture consists of an encoder and a decoder, each modeled as a fully-connected neural network with three hidden layers, with $\theta$ hidden units per hidden layer. We first perform an initial imputation on missing values using the mean for continuous variables and the most frequent label for categorical variables, which results in a completed dataset $Y_0$. The encoder inputs $Y_0$, and corrupts the input data by randomly dropping out half of the variables. The corrupted input data is mapped to a higher-dimensional representation by adding $\Theta$ hidden units to each successive hidden layer of the encoder. The decoder receives the output from the encoder, and symmetrically scales the encoding back to the original input dimension. All hidden layers use a hyperbolic tangent (tanh) activation function, while the output layer of the decoder uses a softmax (sigmoid) activation function in the case of categorical (numerical) variables. Multiple imputations are generated by using multiple runs with the hidden layer weights initialized as Gaussian random variables.
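
Under one reading of this dimension scheme, the encoder layer widths are $\theta + k\Theta$ for $k = 1, 2, 3$ and the decoder mirrors them back to the input dimension; the sketch below (our illustration, with hypothetical sizes in the comment) makes this explicit.

```python
def mida_layer_sizes(p, theta, Theta, depth=3):
    """Hidden-layer widths for the encoder/decoder described above.

    p: number of input features; theta: base hidden width; Theta: extra
    units added at each successive encoder layer; decoder mirrors encoder."""
    encoder = [theta + k * Theta for k in range(1, depth + 1)]
    decoder = encoder[-2::-1] + [p]   # symmetric scaling back to input dim
    return encoder, decoder

# e.g., p = 20 features, theta = 20, Theta = 7 as in our experiments:
# encoder widths [27, 34, 41], decoder widths [34, 27, 20]
```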

Following Lu et al. (2020), we train MIDA in two phases: a primary phase and a fine-tuning phase. In the primary phase, we feed the initially imputed data to MIDA and train for $N_{\text{prime}}$ epochs. In the fine-tuning phase, MIDA is trained for $N_{\text{tune}}$ epochs on the output of the primary phase, and produces the final imputations. The same loss function is used in both phases and closely resembles the reconstruction loss in GAIN:

$$L(Y_{0,ij}, \hat{Y}_{ij}, M_{ij}) = \begin{cases} (1 - M_{ij}) \, (Y_{0,ij} - \hat{Y}_{ij})^2 & \text{if } Y_{ij} \text{ is continuous} \\ -(1 - M_{ij}) \, Y_{0,ij} \log \hat{Y}_{ij} & \text{if } Y_{ij} \text{ is categorical.} \end{cases} \tag{2.6}$$
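
As with the GAIN losses, (2.6) transcribes directly into NumPy; the epsilon is again our numerical safeguard, not part of the formula.

```python
import numpy as np

EPS = 1e-8  # numerical safeguard inside the log (our addition)

def mida_loss(Y0, Y_hat, M, categorical):
    """Equation (2.6): reconstruction loss against the initially imputed
    data Y_0, with entries weighted by (1 - M_ij) as written above."""
    if categorical:
        return np.sum(-(1 - M) * Y0 * np.log(Y_hat + EPS))
    return np.sum((1 - M) * (Y0 - Y_hat) ** 2)
```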

To implement MIDA in our evaluations, we use the same architecture and tune the hyperparameters in a single simulation as in Lu et al. (2020). We plot the evolution of the loss function $L$, and select the number of additional units $\Theta$ among {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} to reduce the loss. In our experiments, we set $\theta$ equal to the number of features of the input data and add $\Theta = 7$ hidden units to each of the three hidden layers of the encoder. We train the model for $N_{\text{prime}} = 100$ epochs in the primary phase and $N_{\text{tune}} = 2$ epochs in the fine-tuning phase. As in GAIN, we learn the model parameters using SGD with mini-batches of size 512, and use the Adam optimizer to adapt the learning rate with an initial rate of 0.001.

