サムネがコーヒーの記事は書きかけです。

GDPと出生率の多項式回帰とBICによるモデル評価【データ分析練習】

この記事では、GDP(per capita)と出生率についてその相関を多項式回帰する練習を行います。

データの取得

以下のサイトから、国別のGDPと出生率をダウンロードします。

https://data.worldbank.org/indicator/SP.DYN.TFRT.IN

データの加工

今回は、過去20ヵ年について、欠損データを含む国を含めないようにします。また、過去20ヵ年分のデータが追跡できていない国も除外します。

with open("data.csv","r") as fp:
    lines = [i.replace("\n","") for i in fp.readlines()][4:]

years = [int(j.replace("\"","")) for j in lines[0].split(",")[4:-2]][40:]
print(years)

def h(data:list[str]) -> list[str]:
    for i in range(len(data)):
        if data[i].replace("\"","") == "SP.DYN.TFRT.IN":
            return data[i+1:-2]
    return ["empty"]

def isEmpty(data:list[str]) -> bool:
    for i in data:
        if i.replace("\"","") == '':
            return True
    return False
countries_for_br = [i.split(",")[0].replace("\"","") for i in lines[1:]]
dic_birthrate = {i.split(",")[0].replace("\"",""):[float(j.replace("\"","")) for j in h(i.split(","))][40:] for i in lines[1:] if not isEmpty(h(i.split(",")))}



with open("gdp.csv","r") as fp2:
    data = [i.replace("\n","") for i in fp2.readlines()][1:-1]
countries = sorted(set([i.split(",")[0] for i in data]))
c = 0
i = 0
dic2 = {} 
l = []
while i<len(data):
    if data[i].split(',')[0] == countries[c]:
        l.append(data[i])
        i += 1
    else:
        c += 1
        l = []
    dic2[countries[c]] = l
year_range = {i.split(',')[0]:[int(dic2[i.split(',')[0]][0].split(",")[2]),int(dic2[i.split(',')[0]][-1].split(",")[2])] for i in data}
dic_gdp = {i:[float(j.split(",")[3]) for j in dic2[i]] for i in dic2 if 2000>=year_range[i][0] >= 1990 and year_range[i][1] == 2020}
countries_final = [i for i in countries_for_br if i in dic_gdp.keys() and i in dic_birthrate.keys()]

取得結果(リンク先のデータを参照)

Country,GDPpercapita,birthRate
aruba,24487.8635601966,1.325
africa eastern and southern,1363.54074100996,4.41690156838726
afghanistan,516.866552182696,4.75
africa western and central,1683.43639098491,5.0493293781153
angola,1603.99347745326,5.371
albania,5332.16047456847,1.4
arab world,5545.45524618957,3.18600277860242
united arab emirates,37629.1741687956,1.46
argentina,8496.42414176374,1.911
armenia,4505.86736397124,1.575
antigua and barbuda,14787.6357752901,1.569
australia,51720.3707634401,1.581
austria,48809.2268762243,1.44
azerbaijan,4229.91064904503,1.7
burundi,216.826741341377,5.177
belgium,45517.7949301847,1.55
benin,1237.94929535039,5.048
burkina faso,833.244342636324,4.869
bangladesh,2233.30552399269,2.003
bulgaria,10129.8129597638,1.56
bahrain,23501.9194628168,1.832
bahamas, the,23862.7109929122,1.394
bosnia and herzegovina,6012.06276705855,1.359
belarus,6542.79749130553,1.382
belize,5266.87616004213,1.999
bermuda,107791.886435134,1.3
bolivia,3068.81255507604,2.651
brazil,6794.48915877166,1.649
barbados,16643.8065787177,1.628
brunei darussalam,27179.4119853371,1.796
bhutan,3009.92417072191,1.433
botswana,5863.20324072216,2.836
central african republic,435.469251582856,5.985
canada,43258.2638715601,1.4
central europe and the baltics,16293.8000324588,1.51005854820682
switzerland,85656.322666307,1.46
chile,13094.4595313609,1.537
china,10408.6697561349,1.281
cote divoire,2288.11949762033,4.472
cameroon,1539.13055856256,4.543
congo, dem. rep.,524.666675441091,6.206
congo, rep.,1838.44813852491,4.234
colombia,5307.21522798919,1.737
comoros,1519.58682298786,4.052
cabo verde,2924.10180677162,1.908
costa rica,12132.8768848243,1.555
caribbean small states,8851.42495915636,1.67280611122737
cuba,9499.59020230432,1.5
curacao,16109.8617501685,1.6
cyprus,28036.189453125,1.328
czechia,22992.8793833348,1.71
germany,46772.8253507545,1.53
djibouti,2917.9962809563,2.848
denmark,60915.4243995462,1.67
dominican republic,7167.91915908327,2.303
algeria,3337.25251157504,2.942
east asia & pacific (excluding high income),8262.4792534288,1.59724733309313
early-demographic dividend,3215.09292533334,2.37759450882849
east asia & pacific,11475.4625106863,1.56002449062815
europe & central asia (excluding high income),7441.46530668546,1.86448059623822
europe & central asia,23984.2787277322,1.67177215294223
ecuador,5645.19928965332,2.051
egypt, arab rep.,3398.80143153083,2.96
euro area,38159.7299117801,1.48531254883392
spain,26959.6754367326,1.23
estonia,23595.2436836441,1.58
ethiopia,918.652594077418,4.243
european union,34330.3665256275,1.49554835807721
fragile and conflict affected situations,1716.21638516599,4.41223768054089
finland,49170.7521512441,1.37
fiji,4864.11704680273,2.495
france,39055.2829280754,1.83
faroe islands,61980.2900239725,2.3
micronesia, fed. sts.,3639.41269869588,2.754
gabon,6680.08267035383,3.548
united kingdom,40318.5575660493,1.56
georgia,4255.74299321254,1.971
ghana,2176.57945945697,3.623
guinea,1073.65933709951,4.489
gambia, the,704.029916035493,4.777
guinea-bissau,710.258138543444,4.094
equatorial guinea,6327.59901151363,4.346
greece,17658.9473011192,1.34
grenada,8437.56673228706,2.023
greenland,54571.2091605745,2.019
guatemala,4604.57667897864,2.484
guam,34780.8616624614,2.593
guyana,6863.07434593017,2.418
high income,43415.6766809454,1.52963754127926
hong kong sar, china,46107.7652757721,0.868
honduras,2354.11961431042,2.394
heavily indebted poor countries (hipc),958.439861836993,4.69154037648335
croatia,14198.7539594744,1.48
haiti,1283.14082978936,2.869
hungary,16120.9890495243,1.56
ibrd only,6149.37060118069,1.86670366012729
ida & ibrd total,4888.96620866939,2.41816612865103
ida total,1343.28785148776,3.99593674640546
ida blend,1719.89506474412,4.13253213343056
indonesia,3894.27220196922,2.194
ida only,1156.36967071505,3.92880962670282
india,1910.42147282449,2.051
ireland,85420.1908556097,1.63
iran, islamic rep.,2746.42044957439,1.708
iraq,4332.30412463995,3.551
iceland,59200.1779441092,1.72
israel,44846.7915954816,2.9
italy,31911.0357890017,1.24
jamaica,4897.26589668631,1.358
jordan,4042.7693498462,2.873
japan,39918.1675583443,1.34
kazakhstan,9121.63713797145,3.13
kenya,1936.42458891104,3.397
kyrgyz republic,1182.52170043018,3.0
cambodia,1577.91174726099,2.381
kiribati,1430.55157354026,3.333
korea, rep.,31721.2989141857,0.837
kuwait,24300.32943621,2.14
latin america & caribbean (excluding high income),6789.39264970385,1.88457086158023
lao pdr,2593.35509719847,2.541
lebanon,5599.95752260733,2.103
liberia,597.529691893048,4.174
libya,7568.04201096325,2.507
st. lucia,8458.1627606739,1.411
latin america & caribbean,7289.95703684922,1.88409615586538
least developed countries: un classification,1095.77064832518,4.03388237834914
low income,727.617214412844,4.68298017205308
liechtenstein,157754.954373893,1.46
sri lanka,3893.84151516764,2.0
lower middle income,2268.98274570582,2.60118784635833
low & middle income,4723.79886732848,2.42761060955685
lesotho,989.847170700549,3.049
late-demographic dividend,9706.8431033298,1.46576420412029
lithuania,20339.5212699052,1.48
luxembourg,117370.496900162,1.37
latvia,18207.1396408631,1.55
macao sar, china,37646.3161952623,1.072
morocco,3258.12133789063,2.353
moldova,4500.62446389838,1.77
madagascar,462.404223694817,3.918
maldives,7282.35848902548,1.712
middle east & north africa,6491.49912763085,2.66051651348392
mexico,8655.00068206803,1.905
middle income,5206.16829542532,2.17362857537939
north macedonia,5965.45023195365,1.3
mali,822.906143972366,6.035
malta,28977.5655673636,1.13
myanmar,1477.45287032594,2.174
middle east & north africa (excluding high income),3093.87847282169,2.70572710770652
montenegro,7677.15222569996,1.75
mongolia,4041.17419587656,2.9
mozambique,449.95520984244,4.713
mauritania,1868.4665735741,4.455
mauritius,9007.41910231356,1.44
malawi,628.699481965162,3.995
malaysia,10160.7832470071,1.818
north america,61451.9761542648,1.61321126898951
namibia,4251.17276622988,3.349
new caledonia,34800.7595147977,2.037
niger,564.822001755189,6.892
nigeria,2074.61392802438,5.309
nicaragua,1863.10925664373,2.349
netherlands,52162.5701150406,1.55
norway,67329.6777910967,1.48
nepal,1139.19027666733,2.055
new zealand,41596.5055023403,1.61
oecd members,38326.8576017416,1.58869257852731
oman,16707.6230063214,2.687
other small states,11683.1860582817,2.88770283086551
pakistan,1322.31503608163,3.555
panama,12569.1715435651,2.344
peru,6056.34390291445,2.216
philippines,3224.42255131214,2.777
papua new guinea,2446.06630562295,3.274
poland,15816.989398397,1.38
pre-demographic dividend,1337.46415333973,4.91627620452391
puerto rico,31393.9073690446,0.9
portugal,22242.406417972,1.4
paraguay,5353.34806456279,2.497
west bank and gaza,3233.56863835858,3.57
pacific island small states,3707.31058055127,3.28141280185239
post-demographic dividend,44474.5497503182,1.47514396782951
french polynesia,18910.3801296387,1.705
qatar,52315.6600783116,1.816
romania,13047.4320617591,1.6
russian federation,10169.0869140625,1.505
rwanda,774.689259444614,3.873
south asia,1852.30008753936,2.26709778262632
saudi arabia,19539.5658107346,2.465
sudan,608.33251953125,4.542
senegal,1490.20313694645,4.454
singapore,60729.4503486794,1.1
solomon islands,2222.4656361989,4.038
sierra leone,493.478777501977,4.08
el salvador,3903.39583878605,1.819
somalia,416.217774718244,6.417
serbia,7733.80286920471,1.48
sub-saharan africa (excluding high income),1492.13401817781,4.66702430877934
sub-saharan africa,1493.10069058974,4.66682509450994
small states,10702.9616881658,2.68167304375299
sao tome and principe,2161.31019577016,3.893
suriname,4796.53331389928,2.371
slovak republic,19545.7428171932,1.57
slovenia,25545.2410027136,1.6
sweden,52837.9039778149,1.66
eswatini,3372.89560064925,2.89
seychelles,12808.9874683728,2.29
syrian arab republic,533.3852317094,2.798
turks and caicos islands,20882.2612702141,1.679
chad,643.772221286329,6.346
east asia & pacific (ida & ibrd countries),8354.06518215147,1.59448078675562
europe & central asia (ida & ibrd countries),8423.19600153735,1.81347154476807
togo,897.194575433309,4.323
thailand,6990.93550262041,1.341
tajikistan,852.333670212118,3.237
latin america & the caribbean (ida & ibrd countries),7081.44704554668,1.89480015841851
timor-leste,1660.30833959898,3.247
middle east & north africa (ida & ibrd countries),3092.20555999448,2.69579740081228
tonga,4606.06455169288,3.267
south asia (ida & ibrd),1852.30008753936,2.26709778262632
sub-saharan africa (ida & ibrd countries),1493.10069058974,4.66682509450994
trinidad and tobago,13871.7982194664,1.631
tunisia,3497.68141046011,2.114
turkiye,8561.07094781987,1.917
tuvalu,4973.77456126816,3.188
tanzania,1042.0966796875,4.795
uganda,846.767201292225,4.693
ukraine,3751.74072265625,1.217
upper middle income,9157.82001579761,1.56760391795853
uruguay,15619.5426555695,1.477
united states,63530.633483909,1.6375
uzbekistan,1749.65581532206,2.9
st. vincent and the grenadines,8335.25647560551,1.814
virgin islands (u.s.),39552.1685953523,2.03
vietnam,3586.34730171843,1.955
vanuatu,2877.52017323582,3.778
world,10883.0760474916,2.29902994764743
samoa,4042.75122353237,3.997
kosovo,4310.81118337317,1.529
south africa,5741.64312911877,2.401
zambia,956.831363865706,4.379
zimbabwe,1372.69667433317,3.545

プロット結果


import matplotlib.pyplot as plt 
import numpy as np 
fig = plt.figure()
X = [sum(dic_gdp[i][-21:])/21 for i in countries_final]
Y = [sum(dic_birthrate[i])/21 for i in countries_final]
plt.scatter(X,Y,s = 3,color = "black")

plt.grid()
plt.ylabel("Birth rate")
plt.xlabel("GDP per Capita")

fig.savefig("FA2.png",dpi = 500)

多項式回帰

k=9までの多項式回帰を行い、BICで最適化します。

まずはpolyfitを行います。

n = 10
for i in range(n):
    z_i = [0 for i in range(n-1-i)]+[i for i in np.polyfit(X,Y,i)]
    plt.scatter(X,[pol_k(i,z_i) for i in X],s = 3,label = f"k={i}")
>>>
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         2.93906838]
[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -4.60757523e-05  3.77399533e+00]
[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  1.49900449e-09
 -1.39435138e-04  4.46454323e+00]
[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -4.36553298e-14  6.59662903e-09
 -2.84873654e-04  5.14680822e+00]
[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  1.29770815e-18 -2.55648281e-13  1.70562322e-08
 -4.49054114e-04  5.66621803e+00]
[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
 -3.00267234e-23  7.67072032e-18 -7.21315131e-13  3.08015357e-08
 -5.93011079e-04  6.00149770e+00]
[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.20843502e-27
 -3.39004914e-22  3.69816159e-17 -1.99521819e-12  5.60178675e-08
 -7.85855086e-04  6.35884784e+00]
[ 0.00000000e+00  0.00000000e+00 -1.69736097e-32  6.22987778e-27
 -9.17752217e-22  6.98748060e-17 -2.95599835e-12  6.96459438e-08
 -8.65049884e-04  6.47983603e+00]
[ 0.00000000e+00  6.51199488e-37 -2.33573705e-31  3.52911446e-26
 -2.93803756e-21  1.47617600e-16 -4.59413550e-12  8.71552794e-08
 -9.45045693e-04  6.58311717e+00]
[-1.96600717e-41  7.90527317e-36 -1.34246642e-30  1.26339059e-25
 -7.29007194e-21  2.70322934e-16 -6.56784440e-12  1.03782234e-07
 -1.00687284e-03  6.65202906e+00]

BICによるモデルの評価

$$BIC =n\ln (\hat{\sigma}^2)+k\ln (n) $$

を使用して、BICを最小化するkを求めます。

min = np.inf
n = 10
for i in range(n):
    z_i = [0 for i in range(n-1-i)]+[i for i in np.polyfit(X,Y,i)]
    y_pk =  [pol_k(i,z_i) for i in X]
    sigma_2 = sum([(y_pk[i]-Y[i])**2 for i in range(len(X))])/len(X)
    #plt.scatter(X,[pol_k(i,z_i) for i in X],s = 3,label = f"k={i}")
    if len(X)*np.log(sigma_2)+i*np.log(len(X))<min:
        min = len(X)*np.log(sigma_2)+i*np.log(len(X))
        print(i)
        print(min)
    plt.scatter(i,len(X)*np.log(sigma_2)+i*np.log(len(X)),color = 'black')
>>>
0
129.90981458752884
1
67.86542994279054
2
19.525891609777148
3
-18.51446366315489
4
-38.59681305745106
5
-40.51390283509447

よって、k=5の時が一番良い多項式モデルが得られることがわかりました。

コメントを残す

メールアドレスが公開されることはありません。 が付いている欄は必須項目です