Applying Regression Models with Python: House Price Prediction

Welcome to my new Medium post!

Dec 5, 2020

In these final days of 2020, I'm sharing a post built around plenty of charts that I hope you'll enjoy exploring. In it, I try to predict house prices with Python using several regression methods. There are many ways to approach this problem, and each has its own advantages. In my view, regression is one of the most important of them because it tells us a great deal about the data and makes the relationships between variables easier to interpret. I have also included the code that produces each chart. Let's walk through the steps I took to find the best regression for this dataset.

Loading the Required Libraries

📈 Reading the Data
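The loading code is not shown in the text either. The snippets work with two frames, `Train` (which contains `SalePrice`) and `features` (which is used for the preparation steps). The sketch below is one way they could be built. The helper name `load_housing`, the file names, and the idea that `features` stacks the training and test rows without the target are all assumptions.

```python
import pandas as pd

def load_housing(train_path, test_path=None):
    """Hypothetical helper: read the training CSV and build a feature frame.

    Train keeps the SalePrice target; features drops it and, if a test file
    is given, appends the test rows so both sets are prepared together.
    """
    Train = pd.read_csv(train_path)
    frames = [Train.drop(columns=["SalePrice"])]
    if test_path is not None:
        frames.append(pd.read_csv(test_path))
    features = pd.concat(frames, ignore_index=True)
    return Train, features
```

With files named, say, `train.csv` and `test.csv` (hypothetical names), the call would be `Train, features = load_housing("train.csv", "test.csv")`.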

# Box plot of the target, with every individual sale drawn next to the box
Data = Train[['SalePrice']]
fig = px.box(Data, x='SalePrice', points='all', title='Sale Price Box Plot')
fig.show()

Correlation, my must-have

# numeric_only=True (pandas >= 1.5) keeps text columns out of the correlation
corr = Train.corr(numeric_only=True)
corr = corr.sort_values(by='SalePrice')
colorscale = [[0, '#edf8fb'], [.3, '#b3cde3'], [.6, '#8856a7'], [1, '#810f7c']]
fig = go.Figure(data=go.Heatmap(z=corr.values, y=corr.index, x=corr.columns, colorscale=colorscale))
fig.update_layout(autosize=False, width=850, height=850)
fig.show()

How do the top 10 correlations look?

# Correlation matrix of the 10 columns most correlated with SalePrice
top = 10
top_corr = corr.nlargest(top, 'SalePrice')['SalePrice'].index
top_corr_values = np.corrcoef(Train[top_corr.values].values.T)
colorscale = [[0, '#edf8fb'], [.3, '#b3cde3'], [.6, '#8856a7'], [1, '#810f7c']]
fig = go.Figure(data=go.Heatmap(z=top_corr_values, y=top_corr, x=top_corr, colorscale=colorscale))
fig.update_layout(autosize=False, width=500, height=500)
fig.show()
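To see what the `nlargest` plus `np.corrcoef` combination does without the plotting layer, here is a small sketch on synthetic data. The column names `Quality`, `Area`, and `Noise` and all the numbers are made up for illustration; only the ranking logic mirrors the snippet above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
quality = rng.integers(1, 11, size=n)     # strong driver of the toy price
area = rng.normal(1500, 300, size=n)      # weaker driver
noise = rng.normal(0, 1, size=n)          # unrelated column
price = 20_000 * quality + 100 * area + rng.normal(0, 10_000, size=n)

toy = pd.DataFrame({"SalePrice": price, "Quality": quality,
                    "Area": area, "Noise": noise})

corr = toy.corr()
# Column names ranked by their correlation with SalePrice (SalePrice itself comes first)
top_cols = corr.nlargest(3, "SalePrice")["SalePrice"].index
# Pairwise correlations restricted to those columns, as fed to the heatmap
top_matrix = np.corrcoef(toy[top_cols].values.T)
print(list(top_cols))
print(top_matrix.round(2))
```

The target always ranks first because its correlation with itself is 1, so a "top 10" heatmap really shows the target plus its nine strongest companions.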

# Number of houses sold per year
data = pd.DataFrame()
data['Year Sold'] = Train.groupby(['YrSold']).size().index.astype(str)
data['Houses'] = Train.groupby(['YrSold']).size().values
fig = px.bar(data, y='Year Sold', x='Houses', color='Year Sold', title='Top Year Sold')
fig.show()

We can see that the most houses were sold in 2009 and the fewest in 2010.

# Sale price distribution for each year of sale
data = pd.concat([Train['SalePrice'], Train['YrSold']], axis=1)
fig = px.box(data, y='SalePrice', x='YrSold', color='YrSold', title='BoxPlot Year Sold')
fig.show()

# Houses per neighborhood, stacked by zoning class
data = Train.groupby(['Neighborhood', 'MSZoning'])['SalePrice'].count().unstack()
x = data.index
fig = go.Figure(go.Bar(x=x, y=data['C (all)'], name='C (all)'))
fig.add_trace(go.Bar(x=x, y=data['FV'], name='FV'))
fig.add_trace(go.Bar(x=x, y=data['RH'], name='RH'))
fig.add_trace(go.Bar(x=x, y=data['RL'], name='RL'))
fig.add_trace(go.Bar(x=x, y=data['RM'], name='RM'))
fig.update_layout(barmode='stack', xaxis={'categoryorder': 'category ascending'}, title='BarPlot Distribution Neighborhood by MsZoning')
fig.show()

# Sale price distribution for each neighborhood
data = pd.concat([Train['SalePrice'], Train['Neighborhood']], axis=1)
fig = px.box(data, y='SalePrice', x='Neighborhood', color='Neighborhood', title='BoxPlot Neighborhood by SalePrice')
fig.show()

# Share of houses in each building sub-class
data = pd.DataFrame()
data['MSSubClass'] = Train['MSSubClass'].value_counts().index
data['Houses'] = Train['MSSubClass'].value_counts().values
fig = px.pie(data, values='Houses', names='MSSubClass', color='MSSubClass', color_discrete_sequence=px.colors.sequential.RdBu)
fig.update_traces(textposition='inside', textinfo='percent+label', title='Distribution of Sub Class')
fig.show()

# 'seaborn-white' was renamed to 'seaborn-v0_8-white' in matplotlib 3.6
plt.style.use('seaborn-v0_8-white')
fig, ax = plt.subplots(figsize=(14, 8))
palette = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71", "#FF8000", "#AEB404", "#FE2EF7", "#64FE2E"]
sns.swarmplot(x="OverallQual", y="SalePrice", data=Train, ax=ax, palette=palette, linewidth=1)
plt.show()

# Sale price distribution for each overall quality rating
Data = pd.concat([Train['OverallQual'], Train['SalePrice']], axis=1)
fig = px.box(Data, x='OverallQual', y='SalePrice', color='OverallQual')
fig.show()

🎀 Data Preparation

Outliers don't get away without a red card :)

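# distplot is deprecated since seaborn 0.11; it is kept here because it supports fit=norm
sns.distplot(Train['SalePrice'], fit=norm, color="red")
# Q-Q plot of SalePrice against a normal distribution, drawn on a new figure
fig = plt.figure()
res = stats.probplot(Train['SalePrice'], plot=plt)
plt.show()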
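The plots above show how far `SalePrice` departs from a normal shape, but the text does not show which rows, if any, were actually dropped. As a hedged illustration only, one common rule of thumb flags values further than 1.5 times the interquartile range from the quartiles. The helper name `iqr_outlier_mask` and the sample prices are invented for this sketch and are not taken from the post.

```python
import pandas as pd

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean Series that is True where a value lies outside
    [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up prices: six ordinary sales and one extreme one
prices = pd.Series([120_000, 135_000, 150_000, 160_000, 175_000, 190_000, 755_000])
mask = iqr_outlier_mask(prices)
cleaned = prices[~mask]   # rows that survive the "red card"
print(prices[mask].tolist())
```

On a real dataset the same mask could be applied to a column such as `Train['SalePrice']`; whether to drop, cap, or keep the flagged rows is a judgment call.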

⚡️ Missing Data

# Count and percentage of missing values per column, largest first
total = features.isnull().sum().sort_values(ascending=False)
percent = (features.isnull().sum() / features.isnull().count() * 100).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
# Keep only columns that have gaps (drop() takes axis as a keyword only since pandas 2.0)
missing_data = missing_data.drop(missing_data[missing_data['Total'] == 0].index, axis=0)
display(missing_data.head(20).style.background_gradient(cmap='Reds'))
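The table tells us where the gaps are; how they were filled is not shown in the text. The sketch below is one simple possibility, not the method used in the post: numeric columns get their median and text columns get the placeholder string `"None"`. The helper name `simple_impute` and the three-row frame are hypothetical.

```python
import numpy as np
import pandas as pd

def simple_impute(df):
    """Fill numeric gaps with the column median and text gaps with 'None'."""
    out = df.copy()
    for col in out.columns:
        if out[col].isnull().sum() == 0:
            continue                      # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna("None")
    return out

# Tiny made-up frame with one numeric and one text column containing gaps
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "PoolQC": [np.nan, "Gd", np.nan],
    "YrSold": [2008, 2009, 2010],
})
filled = simple_impute(toy)
print(filled)
```

A per-column strategy is usually better than a blanket rule, since a gap can mean "feature absent" in one column and "value not recorded" in another.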

🔨 Creating New Features

# Age of the house at the time of sale
features['age_houses'] = features['YrSold'] - features['YearBuilt']
features['age_houses'].describe()

# Rows where the house was sold before it was built
features[features['age_houses'] < 0]

# Correct the sale year of those rows, then recompute the age
features.loc[features['YrSold'] < features['YearBuilt'], 'YrSold'] = 2009
features['age_houses'] = features['YrSold'] - features['YearBuilt']
features['age_houses'].describe()

✈️ Transforming the Features

features['OverallCond'] = features['OverallCond'].astype(str)
features['MSSubClass'] = features['MSSubClass'].astype(str)
features['YrSold'] = features['YrSold'].astype(str)
features['MoSold'] = features['MoSold'].astype(str)
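Casting these columns to strings makes pandas treat the codes as categories rather than quantities. The sketch below shows the practical effect with `pd.get_dummies`; the use of one-hot encoding afterwards is an assumption about the later steps, and the three-row frame is made up.

```python
import pandas as pd

toy = pd.DataFrame({"MSSubClass": [20, 60, 20], "LotArea": [8450, 9600, 11250]})

# While MSSubClass is numeric, get_dummies leaves it alone,
# so a model would read class 60 as "three times" class 20.
as_numbers = pd.get_dummies(toy)

# After the cast, each code becomes its own indicator column.
toy["MSSubClass"] = toy["MSSubClass"].astype(str)
as_categories = pd.get_dummies(toy)

print(list(as_numbers.columns))
print(list(as_categories.columns))
```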

A resource worth reading on skewed data and statistical models: https://towardsdatascience.com/skewed-data-a-problem-to-your-statistical-model-9a6b5bb74e37

numerical=features.select_dtypes(exclude=object).columns
skewness=features[numerical].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewness=skewness[abs(skewness)>0.75]
skewness.head(10)

from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for cols in skewed_features:
    features[cols] = boxcox1p(features[cols], lam)
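For reference, `boxcox1p(x, lam)` computes ((1 + x)^lam - 1) / lam for a non-zero `lam`. The sketch below checks that formula and the effect on skewness using synthetic, right-skewed data; the lognormal sample is an assumption chosen for illustration and is not the housing data.

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import skew

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)   # strongly right-skewed sample

lam = 0.15
transformed = boxcox1p(x, lam)
manual = ((1.0 + x) ** lam - 1.0) / lam             # the same transform written out

skew_before = skew(x)
skew_after = skew(transformed)
print(np.allclose(transformed, manual))
print(round(float(skew_before), 2), round(float(skew_after), 2))
```

The `1 + x` shift is what lets the transform accept zeros, which matters for columns such as areas that are 0 when a house lacks the feature.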

🏆 Final
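The modelling code for this last step is not included in the text. As a rough illustration of comparing regression variants, the sketch below fits ordinary least squares (`alpha = 0`) and ridge regression at a few penalty strengths with plain NumPy, and scores each with 5-fold cross-validated RMSE. Everything here is an assumption: the helper names `ridge_fit` and `cv_rmse`, the synthetic data, and the choice of ridge itself, which may differ from the models used in the post.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression; the intercept is not penalised."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def cv_rmse(X, y, alpha, k=5, seed=0):
    """Mean RMSE over k shuffled folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    scores = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)            # every row not in this fold
        coef, intercept = ridge_fit(X[train], y[train], alpha)
        pred = X[fold] @ coef + intercept
        scores.append(np.sqrt(np.mean((y[fold] - pred) ** 2)))
    return float(np.mean(scores))

# Synthetic stand-in for the prepared feature matrix and target
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.5])
y = X @ true_coef + rng.normal(scale=0.5, size=300)

results = {alpha: cv_rmse(X, y, alpha) for alpha in (0.0, 1.0, 10.0, 100.0)}
best_alpha = min(results, key=results.get)
print(results, best_alpha)
```

Using the same folds for every candidate keeps the comparison fair, and the same loop could score any other regressor that exposes a fit and predict step.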