
Analysis of Airbnb Listing Performance in Beijing

Contents

Importing the data
Step 1: Inspect the data structure and handle missing values
Step 2: Distribution of listings by district
Step 3: Remove outliers and run a correlation analysis
Step 4: Examine the price distribution
Step 5: Identify well-performing listings and analyse why
Word cloud analysis
The best-performing listings
Distribution map
A simple SVM classifier to predict listing performance (good = 1, worse = 0)
Summary

Importing the data

import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.special import jn
from IPython.display import display, clear_output
from sklearn.preprocessing import StandardScaler

plt.rcParams['font.sans-serif'] = [u'SimHei']
plt.rcParams['axes.unicode_minus'] = False

listing = r'C:\Users\Administrator\Desktop\数据分析\listings.csv'
neighbour = r'C:\Users\Administrator\Desktop\数据分析\neighbourhoods.csv'
review = r'C:\Users\Administrator\Desktop\数据分析\reviews_detail.csv'
neigh_geo = r'C:\Users\Administrator\Desktop\数据分析\neighbourhoods.geojson'

reviews = pd.read_csv(review)
listing = pd.read_csv(listing)
neighbour = pd.read_csv(neighbour)

Step 1: Inspect the data structure and handle missing values

# Check how many values are missing in each column
# listing.isnull().sum()

# neighbourhood_group is missing for every row, so drop the column
# listing = listing.drop('neighbourhood_group', axis=1)

# Drop the single record whose name is missing
# listing = listing[~listing['name'].isnull()]

print('listing -> rows: %d, columns: %d' % (listing.shape[0], listing.shape[1]))
print('reviews -> rows: %d, columns: %d' % (reviews.shape[0], reviews.shape[1]))
print('neighbour -> rows: %d, columns: %d' % (neighbour.shape[0], neighbour.shape[1]))

listing -> rows: 28452, columns: 16
reviews -> rows: 99, columns: 6
neighbour -> rows: 16, columns: 2

# listing.info()   # inspect the details; the neighbourhood_group column is entirely missing and can be dropped
# neighbour        # the neighbour table holds the names of Beijing's districts
# reviews.info()   # reviews holds 200k+ review texts, in both Chinese and English; sentiment analysis on them
#                  # could later be used to gauge guest satisfaction (a sketch follows the drop below)

listing = listing.drop(['neighbourhood_group'],axis=1)
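
The sentiment-analysis idea is not pursued further in this post; below is a minimal, hypothetical sketch of how it could start, assuming the reviews table has a 'comments' text column (as Inside Airbnb exports usually do) and that the snownlp package is installed for scoring the Chinese text.

# Hypothetical sketch, not part of the original analysis: rough 0-1 positivity score per review
from snownlp import SnowNLP          # assumed available: pip install snownlp

def review_sentiment(text):
    """Return a 0-1 positivity score; falls back to neutral for empty or unparseable comments."""
    try:
        return SnowNLP(str(text)).sentiments
    except Exception:
        return 0.5

# Assumes a 'comments' column exists in the reviews table:
# reviews['sentiment'] = reviews['comments'].apply(review_sentiment)
# reviews.groupby('listing_id')['sentiment'].mean().head()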

Step 2: Distribution of listings by district

res = listing.groupby('neighbourhood')            # group listings by district
area_list = neighbour['neighbourhood'].to_list()
neighb_counts = {}
for area in area_list:
    counts = res.get_group(area).shape[0]
    neighb_counts[area] = counts

# Dongcheng + Haidian + Chaoyang together account for more than 60% of all listings

plt.figure(figsize=(11, 11))
plt.title('Listings by district')
label = [i[0] for i in neighb_counts.items()]
data = [i[1] for i in neighb_counts.items()]
max_area = sorted(neighb_counts.items(), key=lambda x: x[1], reverse=True)[:3]   # highlight the three districts with the most listings
max_area = [i[0] for i in max_area]
explode = [0.1 if i in max_area else 0 for i in label]
color = sns.color_palette("RdBu", len(label))
plt.pie(data, labels=label, explode=explode, autopct='%.2f%%', colors=color)
plt.legend(loc='best', fontsize=6.5)
plt.show()

More than 60% of the listings are concentrated in the three core districts of Chaoyang, Haidian and Dongcheng.
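
As a quick numeric check on that figure, the combined share of the three largest districts can be computed directly from the neighb_counts dictionary built above:

# Combined share of the three districts with the most listings
total = sum(neighb_counts.values())
top3 = sorted(neighb_counts.items(), key=lambda x: x[1], reverse=True)[:3]
print('Top 3 districts:', [name for name, _ in top3])
print('Combined share: %.1f%%' % (100 * sum(cnt for _, cnt in top3) / total))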

Step 3: Remove outliers and run a correlation analysis

# The first few rows show repeated room_type values, so it is a categorical text field that should later be mapped to numeric codes

listing["room_type"].value_counts()

# Before the correlation analysis, check the data distribution for outliers; the focus here is on price

listing['price'].describe()
plt.figure(figsize=(7, 1.5))
plt.title('Price distribution')
sns.boxplot(listing['price'], whis=0.5, palette="Blues")

# listing[listing['price'] > 20000].shape   # 21 listings have price > 20000
# listing[listing['price'] == 0].shape      # 3 listings have price == 0; drop these outliers
listing = listing[listing['price'] != 0]
listing = listing[listing['price'] <= 20000]

copy_listing = listing.copy()                # work on a copy of the cleaned table
zscore = StandardScaler()
scale_features = ['price', 'reviews_per_month', 'number_of_reviews']
copy_listing[scale_features] = zscore.fit_transform(copy_listing[scale_features])

# Use a heat map to analyse correlations, focusing on price versus the other variables;
# the listing id and host id clearly have no bearing, so drop those two features first.
# room_type and neighbourhood are text fields and need to be mapped to numeric codes.

no_corr = ['id', 'host_id']
columns = [feature for feature in copy_listing.columns if feature not in no_corr]
sub_df = copy_listing[columns].copy()

room_type = sub_df['room_type'].unique()
type_label = {k: v for v, k in enumerate(room_type)}   # 0: entire home/apt, 1: private room, 2: shared room

# For the districts: sort them by listing count, then map each district name to its rank

sorted_area = sorted(neighb_counts.items(), key=lambda x: x[1])
sorted_area = [i[0] for i in sorted_area]                        # district names ordered by listing count
neighbour_label = {k: v for v, k in enumerate(sorted_area)}
sub_df['neighbour_label'] = sub_df['neighbourhood'].map(neighbour_label)
sub_df['room_label'] = sub_df['room_type'].map(type_label)

# Correlation matrix

corr_matrix = sub_df.corr()    # non-numeric columns are ignored (pass numeric_only=True on newer pandas)
plt.figure(figsize=(10, 5))
sns.set(font_scale=1.5)
sns.heatmap(corr_matrix, vmin=-1, vmax=1, cmap=sns.color_palette('RdBu', n_colors=corr_matrix.size))
plt.title('Correlation heat map')

The heat map shows a clear positive correlation between price and latitude: in Beijing, the further north a listing is, the more expensive it tends to be.

Step 4: Examine the price distribution

# Build a district-price pivot table; Haidian and Chaoyang turn out to offer the best value for money,
# with low average prices despite sitting in the city centre

A natural follow-up would be to analyse how the room types are distributed across districts (see the sketch after the pivot-table chart below).

neighbourhood_price = pd.pivot_table(listing, index="neighbourhood", values="price", aggfunc=np.mean)
neighbourhood_price.head()

plt.figure(figsize=(30, 10), dpi=80)
plt.bar(neighbourhood_price.index, neighbourhood_price.price, color="b")
plt.xticks(rotation=45)    # rotate the x-axis labels

The pivot table shows that Haidian and Chaoyang offer the best value for money: their average prices are low even though both are central districts.
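
The room-type breakdown suggested above is not carried out in the original analysis; a minimal sketch with a crosstab on the cleaned listing table might look like this:

# Sketch: how room types are distributed across districts (counts per district and type)
room_by_area = pd.crosstab(listing['neighbourhood'], listing['room_type'])
print(room_by_area.head())

# Share of each room type within a district, e.g. to compare Chaoyang with Haidian
print(room_by_area.div(room_by_area.sum(axis=1), axis=0).round(2).head())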

Step 5: Identify well-performing listings and analyse why

Review volume is used here as a proxy for how well a listing is operating: listings whose review count is above the Q3 (75th percentile) of the overall distribution are treated as performing well, and those at or below the Q1 (25th percentile) as performing poorly.

To characterise the well-performing listings we need a single indicator. The heat map above showed that the total review count and the average monthly review count are strongly positively correlated, and reviews_per_month contains missing values, so the total review count (number_of_reviews) is used as the indicator.
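
Both observations are easy to verify on the listing table before committing to the indicator; a quick optional check:

# Correlation between the two review metrics, and how many monthly values are missing
print(listing[['number_of_reviews', 'reviews_per_month']].corr())
print('Missing reviews_per_month values:', listing['reviews_per_month'].isnull().sum())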

q3 = listing['number_of_reviews'].quantile(0.75)
q1 = listing['number_of_reviews'].quantile(0.25)
good_house = listing[listing['number_of_reviews'] > q3]
worse_house = listing[listing['number_of_reviews'] <= q1]

print(good_house['price'].mean())
print(worse_house['price'].mean())

446

698

from folium.plugins import HeatMap
import folium

world_map = folium.Map()
lati = 39.9
longti = 116.3
beijing_map = folium.Map(location=[lati, longti], zoom_start=12)

sub_data = good_house.sort_values(by="reviews_per_month", ascending=False)
heat_data = sub_data.iloc[:200][['latitude', 'longitude']].values.tolist()
HeatMap(heat_data).add_to(beijing_map)
beijing_map

Well-performing listings have an average price of about 446 yuan, versus about 698 yuan for poorly performing ones.

The well-performing listings are concentrated in the central districts of the city.
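
The bare beijing_map expression on the last line only renders inline inside a Jupyter notebook; when running as a plain script, the map can instead be written to an HTML file and opened in a browser:

# Persist the interactive heat map as a standalone HTML file
beijing_map.save('beijing_heatmap.html')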

# Relate the length of the name description to operating performance:
# check whether the name contains Chinese; if not, split it into words, otherwise use the string length directly.
# Well-performing listings turn out to have longer name descriptions on average,
# so it pays to write the listing name carefully.

import jieba
import numpy as np

def is_contain_chinese(check_str):
    for ch in check_str:
        if u'\u4e00' <= ch <= u'\u9fff':
            return True
    return False

def count_words(data):
    word_list = data['name'].values.tolist()
    counts = []
    for sent in word_list:
        try:
            if is_contain_chinese(sent):
                counts.append(len(sent))
            else:
                seg = sent.split()
                counts.append(len(seg))
        except Exception as e:
            counts.append(0)
    return np.mean(counts)

print('good_house_words_count: ', count_words(good_house))
print('bad_house_words_count: ', count_words(worse_house))

30.18

22.8

The name field holds the listing's descriptive title: well-performing listings use about 30 words (or characters, for Chinese names) to describe themselves, while poorly performing ones use about 23.

Word cloud analysis

from collections import Counter

stop_words = ['着', '了', '过', '的', '得', '一']

def words2dict(data):
    words = []
    res = data['name'].values.tolist()
    for i in res:
        try:
            if is_contain_chinese(i):
                seg_list = jieba.cut(i, cut_all=False, HMM=True)
                for word in seg_list:
                    if is_contain_chinese(word) and word not in stop_words:
                        words.append(word)
        except Exception as e:
            pass
    c = Counter(words)
    return c

bad_c = words2dict(worse_house)
c = words2dict(good_house)
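
Before drawing the word clouds, the most frequent tokens in each group can be inspected directly from the two Counter objects:

# Top tokens for well-performing and poorly performing listings
print('good listings :', c.most_common(10))
print('worse listings:', bad_c.most_common(10))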

The word clouds show that well-performing listings emphasise landmark buildings and business districts such as Tiananmen, Sanlitun, the Forbidden City and Guomao.

from PIL import Image
from wordcloud import WordCloud

font = r'C:\Windows\Fonts\simfang.ttf'
png = 'twitt.png'                        # shape image the word cloud fills (a Twitter bird here)

fig = plt.gcf()
fig.set_size_inches(16, 14)

fig1 = plt.subplot(121)                  # 1 row, 2 columns, first plot
mask = np.array(Image.open(png))         # image -> array used as the word-cloud mask
wc1 = WordCloud(background_color='white', mask=mask, font_path=font).generate_from_frequencies(c)
plt.imshow(wc1)
plt.title('good house chinese words')
plt.axis('off')                          # hide the axes

fig2 = plt.subplot(122)                  # 1 row, 2 columns, second plot
mask = np.array(Image.open(png))
wc2 = WordCloud(background_color='white', mask=mask, font_path=font).generate_from_frequencies(bad_c)
plt.imshow(wc2)
plt.title('bad house chinese words')
plt.axis('off')

The word clouds confirm that well-performing listings are more likely to highlight landmark place names such as Guomao, Tiananmen, the Forbidden City and Sanlitun.

The best-performing listings

median_best = good_house['number_of_reviews'].quantile(0.75)
best_house = good_house[good_house['number_of_reviews'] > median_best]
best_house = best_house.sort_values(by='number_of_reviews', ascending=False)   # sort by review count

counts = {}
room_typ_good = best_house.groupby('neighbourhood')
for typ in area_list:
    try:
        res = room_typ_good.get_group(typ)
        counts[typ] = res.shape[0]
    except Exception as e:
        counts[typ] = 0

best_house['price'].describe()

count    1669.000000
mean      410.194727
std       396.428221
min        60.000000
25%       208.000000
50%       349.000000
75%       497.000000
max      9998.000000
Name: price, dtype: float64

Distribution map

plt.figure(figsize=(11, 11))
plt.title('Districts of the best-performing listings')
label = [i[0] for i in counts.items()]
data = [i[1] for i in counts.items()]
max_area = sorted(counts.items(), key=lambda x: x[1], reverse=True)[:3]   # highlight the three districts with the most best-performing listings
max_area = [i[0] for i in max_area]
explode = [0.1 if i in max_area else 0 for i in label]
color = sns.color_palette("RdBu", len(label))
plt.pie(data, labels=label, explode=explode, autopct='%.2f%%', colors=color)
plt.legend(loc='best', fontsize=6.5)
plt.show()

Income

best_house['income'] = best_house['price'] * best_house['reviews_per_month']
best_house['income'].describe()

count     1669.000000
mean      1286.337837
std       1274.554455
min         68.580000
25%        517.000000
50%        916.500000
75%       1683.400000
max      22995.400000
Name: income, dtype: float64

A simple SVM classifier to predict listing performance (label column: good = 1, worse = 0)

number_of_reviews and reviews_per_month are the features the labels were derived from, so they would leak the target and must not be used.

# Build a simple SVM classifier to predict which listings will do well

area_map = {key: value for value, key in enumerate(area_list)}
features_no = ['id', 'name', 'host_id', 'host_name', 'last_review',
               'number_of_reviews', 'reviews_per_month']
good_house_sub = good_house.drop(columns=features_no)
worse_house_sub = worse_house.drop(columns=features_no)

# Add a tag column (1 = good, 0 = worse) and map room_type and neighbourhood to numeric codes

good_house_sub['tag'] = 1
good_house_sub['room_type'] = good_house_sub['room_type'].map(type_label)
good_house_sub['neighbourhood'] = good_house_sub['neighbourhood'].map(area_map)

worse_house_sub['tag'] = 0
worse_house_sub['room_type'] = worse_house_sub['room_type'].map(type_label)
worse_house_sub['neighbourhood'] = worse_house_sub['neighbourhood'].map(area_map)
worse_house_sub.fillna(0, inplace=True)

df_data = pd.concat([good_house_sub, worse_house_sub], axis=0)

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

df_data = shuffle(df_data)

y = df_data['tag']
X = df_data.drop(columns=['tag'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

0.7152971865101546

The accuracy on the held-out test set is about 0.715.
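
Accuracy alone says little about how the errors split between the two classes, which need not be equally sized here (the 'worse' group is defined by number_of_reviews <= Q1 and so absorbs every tie at the lower quartile); an optional follow-up using scikit-learn's standard metrics on the same test split:

from sklearn.metrics import classification_report, confusion_matrix

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))            # rows = true class, columns = predicted class
print(classification_report(y_test, y_pred, target_names=['worse', 'good']))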

Summary

In this analysis of Airbnb listings in Beijing, the district breakdown shows that listings are concentrated in three districts: Chaoyang, Dongcheng and Haidian.

Measuring operating performance by review count, the best-performing listings are concentrated in Chaoyang and Dongcheng.

Well-performing listings generally have more detailed name descriptions.

Listing prices in Beijing are positively correlated with latitude: the further north, the more expensive the listing tends to be.

The best-performing listings are mostly priced between 208 and 497 yuan.

Descriptions of well-performing listings often mention Beijing landmarks such as Tiananmen, Guomao and Sanlitun.

Treating price × average monthly review count as income, the best-performing listings mostly earn between 517 and 1683 yuan per month.

An SVM classifier trained on latitude/longitude, price, room type and other features can predict how a listing will perform.
