在統計應用領域,我們經常會遇到所謂的"計數資料"(count data),比方說就診人數,死亡人數,顧客人數等等。在分析這些計數資料時,最常使用的模型莫過於卜瓦松模型(Poisson model)。但在實際的應用上,使用者經常需要面對兩大問題:一是過度離散(overdispersion)問題,二是觀測值有太多零(excess zeros)的問題。這些問題通常都會導致卜瓦松模型估計的結果有偏誤。因此 WenSui Liu 和 Jimmy Cela 在 2008 年 SAS Global Forum 發表了一篇技術文件,介紹一些其他比較適合處理計數資料的模型。
。資料
該文使用的資料是 Deb and Trivedi (1997) 所發表的一篇關於醫療保健在老人上的利用研究所使用的資料。裡面包含了:
應變數 -
1. HOSP = 住院次數
因變數 -
2. EXCLHLTH = 自我健康評量(1=極佳,0=其他)
3. POORHLTH = 自我健康評量(1=極差,0=其他)
4. NUMCHRON = 慢性病數目
5. AGE = 年齡
6. MALE = 性別(1=男性,0=女性)
7. SCHOOL = 教育程度(以年為單位)
8. PRIVINS = 保險(1=有健保,0=沒有健保)
由於應變數 HOSP 的變異數是平均值的兩倍,這表示該筆資料有過度離散的問題。此外,4406 份樣本中,有 3541 人沒有任何住院記錄,所以這筆資料也有太多零的問題。我們從下面這張圖也可以看出一些端倪:
。卜瓦松模型
程式:
/* METHOD 1: PROC NLMIXED */
proc nlmixed data = tblNMES;
parms b0 = 0 b1 = 0 b2 = 0 b3 = 0 b4 = 0 b5 = 0 b6 = 0 b7 = 0;
mu = exp(b0 + b1 * EXCLHLTH + b2 * POORHLTH + b3 * NUMCHRON + b4 * AGE +
b5 * MALE + b6 * SCHOOL + b7 * PRIVINS);
ll = -mu + HOSP * log(mu) - log(fact(HOSP));
model HOSP ~ general(ll);
predict mu out = poi_out (rename = (pred = Yhat));
run;
/* METHOD 2: PROC COUNTREG */
proc countreg data = tblNMES type = poisson;
model HOSP = EXCLHLTH POORHLTH NUMCHRON AGE MALE SCHOOL PRIVINS;
run;
原作者使用比較複雜的 PROC NLMIXED 和 PROC GOUNTREG 程序來完成模型配飾。這兩種方法與 PROC GENMOD 做出來的結果應該會一致。簡單來講,在 PROC NLMIXED 程序裡面,要先設定每個參數的初始值(params),然後把還沒加上 link function 前的公式打出來(mu),然後把 log-likelihood 寫好(ll),最後才是配飾模型(model)。而 PROC COUNTREG 程序則比較單純,只要把後面的 type= 設定成 poisson 即可。
報表:
Model Fit Summary
Log Likelihood -3046
AIC 6108
SBC 6159
Parameter Estimates
Standard Approx
Parameter Estimate Error t Value Pr > |t|
Intercept -3.329044 0.339728 -9.80 <.0001
exclhlth -0.723412 0.175644 -4.12 <.0001
poorhlth 0.626157 0.067858 9.23 <.0001
numchron 0.264462 0.018277 14.47 <.0001
age 0.186406 0.042014 4.44 <.0001
male 0.103186 0.056274 1.83 0.0667
school -0.000206 0.007871 -0.03 0.9791
privins 0.108652 0.069251 1.57 0.1167
觀測機率與預測機率比較圖:
(1) Cameron and Trivedi (1996)
程式:
data ols_tmp;
set poi_out;
dep = ((HOSP - Yhat) ** 2 - HOSP) / Yhat;
run;
proc reg data = ols_tmp;
model dep = Yhat / noint; /* FIT A OLS REGRESSION WITHOUT INTERCEPT */
run; quit;
報表:
Parameter Estimates
Parameter Standard
Variable Label DF Estimate Error t Value Pr > |t|
Yhat Predicted Value 1 1.63419 0.22609 7.23 <.0001
用此法檢定出來的 p-value < .0001,所以表示過度離散的現象存在於這筆資料中。
(2) Greene (2002)
程式:
proc iml;
use poi_out;
read all var {HOSP} into y;
read all var {Yhat} into yhat;
close poisson_out;
e = (y - yhat);
n = nrow(y);
ybar = y`[, :];
LM = (e` * e - n * ybar) ** 2 / (2 * yhat` * yhat);
Pvalue = 1 - probchi(LM, 1);
print LM Pvalue;
quit;
報表:
LM PVALUE
794.14707 0
同樣地,用此法算出來的 p-value 逼近於零,所以最後預測機率值和觀測機率值出現落差是完全不意外。
。負二項模型
負二項模型的 PROC NLMIXED 程序內的寫法跟卜瓦松模型很像,唯一要注意的就是他的 log-likelihood 長得不一樣。看起來是複雜些。所以還是用 PROC COUNTREG 程序來做比較簡單。 程式如下:
/* METHOD 1: PROC NLMIXED */
proc nlmixed data = tblNMES;
parms b0 = 0 b1 = 0 b2 = 0 b3 = 0 b4 = 0 b5 = 0 b6 = 0 b7 = 0;
mu = exp(b0 + b1 * EXCLHLTH + b2 * POORHLTH + b3 * NUMCHRON + b4 * AGE +
b5 * MALE + b6 * SCHOOL + b7 * PRIVINS);
ll = lgamma(HOSP + 1 / alpha) - lgamma(HOSP + 1) - lgamma(1 / alpha) +
HOSP * log(alpha * mu) -
(HOSP + 1 / alpha) * log(1 + alpha * mu);
model HOSP ~ general(ll);
predict mu out = nb_out (rename = (pred = Yhat));
run;
/* METHOD 2: PROC COUNTREG */
proc countreg data = tblNMES type = negativebinom method = qn;
model HOSP = EXCLHLTH POORHLTH NUMCHRON AGE MALE SCHOOL PRIVINS;
run;
報表:
Model Fit Summary
Log Likelihood -2857
AIC 5731
SBC 5789
Parameter Estimates
Standard Approx
Parameter Estimate Error t Value Pr > |t|
Intercept -3.752640 0.446835 -8.40 <.0001
exclhlth -0.697875 0.193318 -3.61 0.0003
poorhlth 0.613926 0.095392 6.44 <.0001
numchron 0.289418 0.026470 10.93 <.0001
age 0.238444 0.055265 4.31 <.0001
male 0.153862 0.073033 2.11 0.0351
school -0.002271 0.010203 -0.22 0.8238
privins 0.093922 0.090494 1.04 0.2993
_Alpha 1.766727 0.160492 11.01 <.0001
觀測機率與預測機率比較圖:
。跨欄(Hurdle)模型
跨欄模型是 Dr. Mullahy 於 1986 年研發,他是假設當應變數是零,則服從二項分配,當大於零時,則服從另一個機率密度函數長得很像卜瓦松分配的分配(該分配沒有實際名字)。其機率密度函數可以用下面公式來表示:
因此其 log-likelihood 就變得非常複雜:
所以這個模型只能用 PROC NLMIXED 來處理。程式如下:
/* METHOD 1: PROC NLMIXED */
proc nlmixed data = tblNMES tech = dbldog;
parms a0 = 0 a1 = 0 a2 = 0 a3 = 0 a4 = 0 a5 = 0 a6 = 0 a7 = 0
b0 = 0 b1 = 0 b2 = 0 b3 = 0 b4 = 0 b5 = 0 b6 = 0 b7 = 0;
eta0 = a0 + a1 * EXCLHLTH + a2 * POORHLTH + a3 * NUMCHRON + a4 * AGE +
a5 * MALE + a6 * SCHOOL + a7 * PRIVINS;
exp_eta0 = exp(eta0);
p0 = exp_eta0 / (1 + exp_eta0);
etap = b0 + b1 * EXCLHLTH + b2 * POORHLTH + b3 * NUMCHRON + b4 * AGE +
b5 * MALE + b6 * SCHOOL + b7 * PRIVINS;
exp_etap = exp(etap);
if HOSP = 0 then ll = log(p0);
else ll = log(1 - p0) - exp_etap + HOSP * etap - lgamma(HOSP + 1)
- log(1 - exp(-exp_etap));
model HOSP ~ general(ll);
predict exp_etap out = hdl_out1 (keep = pred HOSP rename = (pred = Yhat));
predict p0 out = hdl_out2 (keep = pred rename = (pred = p0));
run;
報表:
Fit Statistics
-2 Log Likelihood 5758.4
AIC (smaller is better) 5790.4
AICC (smaller is better) 5790.6
BIC (smaller is better) 5892.7
Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t| Alpha
a0 4.2311 0.4889 4406 8.65 <.0001 0.05
a1 0.5826 0.1991 4406 2.93 0.0035 0.05
a2 -0.6953 0.1073 4406 -6.48 <.0001 0.05
a3 -0.3078 0.02890 4406 -10.65 <.0001 0.05
a4 -0.2752 0.06061 4406 -4.54 <.0001 0.05
a5 -0.1948 0.08008 4406 -2.43 0.0151 0.05
a6 -0.00593 0.01126 4406 -0.53 0.5982 0.05
a7 -0.01924 0.09944 4406 -0.19 0.8466 0.05
b0 -0.4693 0.5627 4406 -0.83 0.4043 0.05
b1 -0.9422 0.4949 4406 -1.90 0.0570 0.05
b2 0.3373 0.1008 4406 3.35 0.0008 0.05
b3 0.1426 0.02967 4406 4.81 <.0001 0.05
b4 -0.01229 0.06834 4406 -0.18 0.8573 0.05
b5 -0.03854 0.09227 4406 -0.42 0.6762 0.05
b6 -0.01815 0.01290 4406 -1.41 0.1597 0.05
b7 0.2589 0.1139 4406 2.27 0.0231 0.05
觀測機率與預測機率比較圖:
。零膨脹(Zero-inflated: ZIP)模型
以下皆簡稱 ZIP 模型。此模型是專門處理觀測值裡面有太多零的問題。在 PROC NLMIXED 和 PROC COUNTREG 程序裡面都可以處理。特別注意的是,在 PROC COUNTREG 程序裡面,除了要把 type 設定成 zip 外,還要在 model 後面加上額外的指令。如下面紅色程式碼所示。
程式:
/* METHOD 1: PROC COUNTREG */
proc countreg data = tblNMES type = zip;
model HOSP = EXCLHLTH POORHLTH NUMCHRON AGE MALE SCHOOL PRIVINS
/ zi(link = logistic, var = EXCLHLTH POORHLTH NUMCHRON AGE MALE SCHOOL PRIVINS);
run;
/* METHOD 2: PROC NLMIXED */
proc nlmixed data = tblNMES tech = dbldog;
parms a0 = 0 a1 = 0 a2 = 0 a3 = 0 a4 = 0 a5 = 0 a6 = 0 a7 = 0
b0 = 0 b1 = 0 b2 = 0 b3 = 0 b4 = 0 b5 = 0 b6 = 0 b7 = 0;
eta0 = a0 + a1 * EXCLHLTH + a2 * POORHLTH + a3 * NUMCHRON + a4 * AGE +
a5 * MALE + a6 * SCHOOL + a7 * PRIVINS;
exp_eta0 = exp(eta0);
p0 = exp_eta0 / (1 + exp_eta0);
etap = b0 + b1 * EXCLHLTH + b2 * POORHLTH + b3 * NUMCHRON + b4 * AGE +
b5 * MALE + b6 * SCHOOL + b7 * PRIVINS;
exp_etap = exp(etap);
if HOSP = 0 then ll = log(p0 + (1 - p0) * exp(-exp_etap));
else ll = log(1 - p0) + HOSP * etap - exp_etap - lgamma(HOSP + 1);
model HOSP ~ general(ll);
predict exp_etap out = zip_out1 (keep = pred HOSP rename = (pred = Yhat));
predict p0 out = zip_out2 (keep = pred rename = (pred = p0));
run;
報表:
Model Fit Summary
Log Likelihood -2878
AIC 5788
SBC 5890
Parameter Estimates
Standard Approx
Parameter Estimate Error t Value Pr > |t|
Intercept -0.366506 0.572032 -0.64 0.5217
exclhlth -0.919990 0.458460 -2.01 0.0448
poorhlth 0.324926 0.101157 3.21 0.0013
numchron 0.127746 0.033867 3.77 0.0002
age -0.024359 0.068806 -0.35 0.7233
male -0.059629 0.099133 -0.60 0.5475
school -0.012473 0.013520 -0.92 0.3562
privins 0.229208 0.114004 2.01 0.0444
Inf_Intercept 4.265976 0.971218 4.39 <.0001
Inf_exclhlth -0.369944 0.717395 -0.52 0.6061
Inf_poorhlth -0.589745 0.195174 -3.02 0.0025
Inf_numchron -0.280116 0.062396 -4.49 <.0001
Inf_age -0.405962 0.119765 -3.39 0.0007
Inf_male -0.334773 0.162429 -2.06 0.0393
Inf_school -0.019390 0.022126 -0.88 0.3808
Inf_privins 0.224859 0.196133 1.15 0.2516
觀測機率與預測機率比較圖:
。ZIP-tau 模型
這是上面 ZIP 模型的延伸。他的表現比 ZIP 模型好一點點,因為 AIC 比較小,但也沒有差太多。程式如下:
/* METHOD 1: PROC NLMIXED */
proc nlmixed data = tblNMES;
parms b0 = 0 b1 = 0 b2 = 0 b3 = 0 b4 = 0 b5 = 0 b6 = 0 b7 = 0 tau = 1;
eta0 = tau * (b0 + b1 * EXCLHLTH + b2 * POORHLTH + b3 * NUMCHRON + b4 * AGE +
b5 * MALE + b6 * SCHOOL + b7 * PRIVINS);
exp_eta0 = exp(eta0);
p0 = exp_eta0 / (1 + exp_eta0);
etap = b0 + b1 * EXCLHLTH + b2 * POORHLTH + b3 * NUMCHRON + b4 * AGE +
b5 * MALE + b6 * SCHOOL + b7 * PRIVINS;
exp_etap = exp(etap);
if HOSP = 0 then ll = log(p0 + (1 - p0) * exp(-exp_etap));
else ll = log(1 - p0) + HOSP * etap - exp_etap - lgamma(HOSP + 1);
model HOSP ~ general(ll);
predict exp_etap out = zip_out1 (keep = pred HOSP rename = (pred = Yhat));
predict p0 out = zip_out2 (keep = pred rename = (pred = p0));
run;
報表:
Fit Statistics
-2 Log Likelihood 5768.7
AIC (smaller is better) 5786.7
AICC (smaller is better) 5786.7
BIC (smaller is better) 5844.2
Parameter Estimates
Standard
Parameter Estimate Error DF t Value Pr > |t| Alpha
b0 -1.3944 0.2698 4406 -5.17 <.0001 0.05
b1 -0.2685 0.09606 4406 -2.80 0.0052 0.05
b2 0.3223 0.05980 4406 5.39 <.0001 0.05
b3 0.1391 0.02195 4406 6.34 <.0001 0.05
b4 0.1040 0.02789 4406 3.73 0.0002 0.05
b5 0.07254 0.03383 4406 2.14 0.0321 0.05
b6 -0.00039 0.004641 4406 -0.08 0.9331 0.05
b7 0.04292 0.04216 4406 1.02 0.3087 0.05
tau -1.8406 0.4585 4406 -4.01 <.0001 0.05
觀測機率與預測機率比較圖:
上式是把每個觀測值用兩個模型都算出預測值出來,然後相除取對數。之後再用這些m值去計算一個V統計量:
當 V > 1.96 時,就表示第一個模型比較好,當 V < -1.96 時,就表示第二個模型比較好。若介於 -1.96 和 1.96 之間,就表示兩模型沒有顯著差異。以卜瓦松模型和ZIP模型比較為例,其SAS程式如下:
data poi_pred (keep = poi_prob);
set poi_out; /* OUTPUT FROM POISSON REGRESSION */
do i = 0 to 8;
poi_prob = pdf('poisson', i , Yhat);
if hosp = i then output;
end;
run;
data zip_pred (keep = zip_prob);
merge zip_out1 zip_out2; /* OUTPUT FROM ZIP REGRESSION */
do i = 0 to 8;
if i = 0 then zip_prob = p0 + (1 - p0) * pdf('poisson', i, Yhat);
else zip_prob = (1 - p0) * pdf('poisson', i, Yhat);
if hosp = i then output;
end;
run;
data compare;
merge poi_pred zip_pred;
m = log(zip_prob / poi_prob);
run;
proc sql;
select
mean(m) as mbar,
std(m) as s,
sqrt(count(*)) * mean(m) / std(m) as v
from
compare;
quit;
報表:
mbar s v
0.038138 0.375956 6.733444
結果顯示 V 統計量為 6.73 大於 1.96,所以表示 ZIP 模型比卜瓦松模型要來的好。
。比較
把所有估計參數值並排在一起比較:
CONTACT INFORMATION
Wensui Liu, Statistical Project Manager
ChoicePoint Precision Marketing, Alpharetta, GA
Email: wensui.liu@choicepoint.com
Jimmy Cela, AVP
ChoicePoint Precision Marketing, Alpharetta, GA
Email: jimmy.cela@choicepoint.com
沒有留言:
張貼留言
要問問題的人請在文章下方的intensedebate欄位留言,請勿使用blogger預設的意見表單。今後用blogger意見表單留言的人我就不回應了。