Happy Data Year was data analysis championship from Rosbank held from November 22 to January 10.
Participants should have predicted the geolocation-based popularity index for placing an ATM.
The training sample contains adresses of six thousand ATMs of Rosbank and its partners, as well as the target variable - the ATM popularity index. In the test sample, there are another two and a half thousand ATMs, equally divided into public and private parts.
The main feature of this contest was that participants have to collect geodata by their own. Hence the success depended not so much from models as from feature engineering.
Lets look at the data first
import pickle import math from math import sin, cos, sqrt, atan2, radians import time import collections import osmread from tqdm import tqdm_notebook import wget from collections import Counter import numpy as np import pandas as pd from scipy import stats from sklearn.neighbors import NearestNeighbors from sklearn.metrics import mean_squared_error, r2_score from sklearn.model_selection import train_test_split import xgboost from sklearn.decomposition import PCA from sklearn import preprocessing as pr from scipy import stats import lightgbm as lgb from matplotlib import pyplot as plt %matplotlib inline %config InlineBackend.figure_format = 'svg' SEED = 12
train = pd.read_csv('http://nagornyy.me/datasets/rosbank/train.csv', index_col=0) test = pd.read_csv('http://nagornyy.me/datasets/rosbank/test.csv', index_col=0) train['isTrain'] = True test['isTrain'] = False full = train.append(test, sort=False)
full.shape
As you can see there are not so many features in the dataset:
id
is an unique indentifier of the ATM;atm_group
is the id of the bank ATM belongs to.address
is the only genuine featue which can tell us something about the location of the ATM;address_rus
is the geocoded address
, which was generated by organisers using Google Maps and Yandex Maps API.lat
and long
— latitude and longitude, genereted in the same way as the previous feature.target
— popularity index.full.target.hist();
Distribution of target
is quite normal, may it needs one simple transformation.
Lets look at the spatial distribution of ATMs using Folium library.
import folium import branca fmap = folium.Map(location=[55.805827, 70.515146], max_zoom=7) colorscale = branca.colormap.linear.RdBu_04.scale(full.target.min(), full.target.max()) for index, row in full[["lat", "long", "isTrain", "target"]].dropna(subset=["lat", "long"]).sample(frac=1).iterrows(): if row.isTrain: folium.CircleMarker(location=[row["lat"], row["long"]], radius=10, opacity=1, weight=3, color=colorscale(row.target), tooltip=f"Index:{index},\nTarget: {row.target}").add_to(fmap) else: folium.RegularPolygonMarker(location=[row["lat"], row["long"]], radius=10, fill_opacity=0, opacity=0.7, weight=3, color="#999999", number_of_sides=3, tooltip=index).add_to(fmap) fmap.save('../../../static/pages/rosbank.html')
The map shows us the coordinates of ATMs. Points from the test set have the shape of a circle and from the test set have a shaoe of triangle. Color of ATMs from the train set demonstrates its popularity index — blue circles are more popular than red.
The distribution of the points is quite natural. We don't see any congestion of the points from the test or train set on the map.
Load all the data with the create_full_df
function. Then extract the name of the city from the addresses and save it to the city
column. Mark cities with less than 20 ATMs eith the name RARE
, and then apply label encoding.
def create_full_df(city_label_enk=True): train = pd.read_csv('http://nagornyy.me/datasets/rosbank/train.csv', index_col=0) test = pd.read_csv('http://nagornyy.me/datasets/rosbank/test.csv', index_col=0) train['isTrain'] = True test['isTrain'] = False full = train.append(test, sort=False) full.id = full.id.astype("int") full.atm_group = full.atm_group.astype("category") counts = full.groupby('address_rus')['id'].count().reset_index().rename(columns={'id':'count'}) full = pd.merge(full, counts, how='left', on='address_rus') full['city'] = full[~full.address_rus.isnull()].address_rus.apply(lambda x: x.split(',')[2]) rare_cities = full.city.value_counts()[(full.city.value_counts() < 20) ==True].index if city_label_enk: full.city = full.city.apply(lambda x: 'RARE' if x in rare_cities else x) full.city = full.city.rank().astype("category") print("full_df created", time.ctime()) return full
Implement Haversine Formula to calculate the distance
between geographical points.
def distance(x,y): """ Params ---------- x : tuple, lat, long of the first point y : tuple, lat, long of the second point 6373.0 : radius of the Earth Result ---------- result : distanse between two points """ lat_a, long_a, lat_b, long_b = map(radians, [*x,*y]) dlon = long_b - long_a dlat = lat_b - lat_a a = sin(dlat/2)**2 + cos(lat_a) * cos(lat_b) * sin(dlon/2)**2 c = 2 * atan2(sqrt(a), sqrt(1 - a)) return 6373.0 * c
Then we define the function add_kneighbors_atm
to calculate the distance between the nearest ATMs. The function takes a dataframe with the fields lat
,long
and adds to it the distance between each point and n_neighbors
of other points closest to it, their indices, as well as the average distance to these points.
def add_kneighbors_atm(df, n_neighbors=7): df = df.copy() knc = NearestNeighbors(metric=distance) dots = df[['lat','long']].dropna() knc.fit(X=dots) distances, indexes = knc.kneighbors(X=dots, n_neighbors=n_neighbors) for i in range(1,6): dots['distance_%s'%i] = distances[:,i] dots['indexes_%s'%i] = indexes[:,i] dots['kurtosis_%s'%i] = stats.kurtosis(distances[:,i]) dots['skew_%s'%i] = stats.skew(distances[:,i]) dots['mean'] = dots.iloc[:,dots.columns.str.contains('distance')].mean(axis=1) df = pd.concat([df, dots.drop(columns=['lat', 'long'])], axis=1) print("neighbors added", time.ctime()) return df
Define loss finction RMSE at last.
def rmse(y_true, y_pred): return sqrt(mean_squared_error(y_true, y_pred))
full = create_full_df() full = add_kneighbors_atm(full)
And get very preliminary results with LightGBM:
def simple_classifier(df): full = df.copy() full.isTrain = full.isTrain.fillna(False) col_drop = ['id', 'address', 'address_rus', 'target', 'isTrain'] + \ [f'{n}_{i}' for i in range(1, 6) for n in ['indexes', 'kurtosis', 'skew']] X_ = full[full.isTrain].drop(columns=col_drop, errors='ignore') Y_ = full.loc[full.isTrain, 'target'] X_train, X_valid, Y_train, Y_valid = train_test_split( X_, Y_, test_size=0.2, random_state=SEED ) gbm = lgb.LGBMRegressor(objective = 'regression', max_depth = 3, colsample_bytre = 0.8, subsample = 0.8, learning_rate = 0.1, n_estimators = 300, random_state=SEED ) gbm.fit(X_train, Y_train, eval_set=[(X_valid, Y_valid)], eval_metric='rmse', early_stopping_rounds=5, verbose=False); print("RMSE: {rmse:.4}\nR^2: {r2:.3}".format( rmse=rmse(Y_valid, gbm.predict(X_valid)), r2=r2_score(Y_valid, gbm.predict(X_valid)) )) return gbm gbm = simple_classifier(full)
Initially, we see a pretty good quality of predictions.
lgb.plot_importance(gbm);
The most important feature is the longitude, the second important featue is the city. Apparently, this is a consequence of the uneven development of the regions. In some cities, mainly in the west, ATMs are often used, in the east - less.
One of the main parts of data analysis is the generation of features. If we have the coordinates of a point, we can recognize its environment, which almost always affects the performance of our model. Information about the environment can be gathered from the Open Street Maps. Russian maps in POIs can be downloaded at https://download.geofabrik.de/russia-latest.osm.pbf.
Lets investigate how far away from ATMs placed important objects that, at our opinion, will affect their popularity - metro stations, shops, etc. Something similar was done in the article “[How to predict the number of emergency calls in different parts of the city?] (https://habr.com/company/ru_mts/blog/353334/) ".
So, first we need to parse the source file with all the objects that are marked on the map of Russia. In the structure of the OSM file there are [three kinds of objects] (https://wiki.openstreetmap.org/wiki/Elements) - Node
,Way
and Relation
. We are most likely to be interested in the objects of the first type - these are any points on the map from benches to cities. It is more difficult to count the distance from ATMs to the ways, and it’s not at all clear what to do with relationships.
Also, any type of object can have a property tags
- this is a dictionary with very dofferent attributes — type, name, working hours, etc.
Let's select objects of type node
, which are described by at least one attribute. Since the data file took several gigabytes, the filtering process took several hours.
def extract_osm_tags(): # wget.download("https://download.geofabrik.de/russia-latest.osm.pbf") osm_file = osmread.parse_file('russia-latest.osm.pbf') tagged_nodes = [ entry for entry in tqdm_notebook(osm_file) if isinstance(entry, osmread.Node) if len(entry.tags) > 0 ] with open('tagged_nodes.pickle', 'wb') as fout: pickle.dump(tagged_nodes, fout, protocol=pickle.HIGHEST_PROTOCOL)
It turned out to find more than 4.6 million points that satisfy our conditions. We pickled them in a file for the fearther use.
with open('/datasets/rosbank/tagged_nodes.pickle.zip', 'rb') as fin: tagged_nodes = pickle.load(fin)
all_tags = [k for node in tagged_nodes for k in node.tags.keys()] len(all_tags)
tags_counter = Counter(all_tags) len(tags_counter)
We see enormous amount of unique tags — 9528332, and 4313 if unique tags. What is the distridution of the tags?
tags_counter.most_common(30)
Very uneven, as wee see. The difference detween the most prevalent tag and top-20 tag is large.
So in 4601457 objects we found 4313 unique tags, which is too much, especially considering their uneven distribution - even the third most popular tag is found in less than 10% of observations (most of all, by the way, electrical infrastructure ([power] (https://wiki.openstreetmap.org/wiki/Key:power)) - supports of electric transmission lines and everything related).
The dimension can be reduced by removing not very useful tags. For example, in many of the name of the object is written in different languages and in different styles: name
,name: en
, official_name
, etc. We could to get rid of them, but lets make features that will show the amount of different names. Probably, these features will talk about the popularity of the object.
The same for tags starting with wikipedia
is [links to Wikipedia articles in different languages] (https://wiki.openstreetmap.org/wiki/Key:wikipedia).
I did not decide do I need to remove the color
tag, so leave it.
tagged_nodes_list = [] for node in tqdm_notebook(tagged_nodes): d = {k: v for k, v in node.tags.items() if not (k.startswith("name:") or k.startswith("old_name") or k.startswith("alt_name") or k.startswith("wikipedia") #or k.startswith("color:") or k.startswith("official_name:") ) } num_names = len([k for k in node.tags.keys() if k.startswith("name")]) num_official_names = len([k for k in node.tags.keys() if k.startswith("official_name")]) num_alt_names = len([k for k in node.tags.keys() if k.startswith("alt_name")]) num_old_names = len([k for k in node.tags.keys() if k.startswith("old_name")]) num_all_names = num_names + num_official_names + num_alt_names + num_old_names num_wiki = len([k for k in node.tags.keys() if k.startswith("wikipedia")]) d.update({ "lat": node.lat, "lon": node.lon, "id": node.id, "timestamp": node.timestamp, "num_names": num_names, "num_official_names": num_official_names, "num_alt_names": num_alt_names, "num_old_names": num_old_names, "num_all_names": num_all_names, "num_wiki": num_wiki }) tagged_nodes_list.append(d)
The number of tags was redused from 4313 to 3755 (13% less):
c = Counter([tag for node in tagged_nodes_list for tag in node]) len(c)
And the tags that are found in more than 500 objects are only 337 (their total number, let me remind you, is more than 4.6 million):
frq_titles = [title for title, count in c.most_common() if count > 500] len(frq_titles)
Now we can select all the tags to calculate 1) the distance between ATMs and N, where nearby points and 2) nummber of different points in a certain radius.
df_features = collections.OrderedDict([]) # Set of points we are interested in POINT_FEATURE_FILTERS = [ ('tagged', lambda node: len(node.tags) > 0), ('railway=station', lambda node: node.tags.get('railway') == 'station'), ] + [(i, lambda node: i in node.tags) for i in frq_titles if i is not 'id'] X_zone_centers = full[['lat', 'long']].dropna().values for prefix, point_filter in tqdm_notebook(POINT_FEATURE_FILTERS, desc='Фильтр'): coords = np.array([ [node.lat, node.lon] for node in tagged_nodes if point_filter(node) ]) neighbors = NearestNeighbors().fit(coords) # feature "number of points in radius R" for radius in [0.001, 0.003, 0.005, 0.007, 0.01]: dists, inds = neighbors.radius_neighbors(X=X_zone_centers, radius=radius) df_features['{}_points_in_{}'.format(prefix, radius)] = \ np.array([len(x) for x in inds]) # feature "distance to the nearest K points" for n_neighbors in [1, 3, 5, 10]: dists, inds = neighbors.kneighbors(X=X_zone_centers, n_neighbors=n_neighbors) df_features['{}_mean_dist_k_{}'.format(prefix, n_neighbors)] = \ dists.mean(axis=1) df_features['{}_max_dist_k_{}'.format(prefix, n_neighbors)] = \ dists.max(axis=1) df_features['{}_std_dist_k_{}'.format(prefix, n_neighbors)] = \ dists.std(axis=1) # featue "distance to the any nearest point" df_features['{}_min'.format(prefix)] = dists.min(axis=1) # save new data df_features = pd.DataFrame(df_features, index=full[['lat', 'long']].dropna().index) df_features.to_csv('/datasets/rosbank/features_500.csv.zip', compression='zip')
df_features = pd.read_csv('../../../static/datasets/rosbank/features_500.csv.zip', compression='zip', index_col=0) df_features.head()
tagged_points_in_0.001 | tagged_points_in_0.003 | tagged_points_in_0.005 | tagged_points_in_0.007 | tagged_points_in_0.01 | tagged_mean_dist_k_1 | tagged_max_dist_k_1 | tagged_std_dist_k_1 | tagged_mean_dist_k_3 | tagged_max_dist_k_3 | ... | ruins_mean_dist_k_3 | ruins_max_dist_k_3 | ruins_std_dist_k_3 | ruins_mean_dist_k_5 | ruins_max_dist_k_5 | ruins_std_dist_k_5 | ruins_mean_dist_k_10 | ruins_max_dist_k_10 | ruins_std_dist_k_10 | ruins_min | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||
8526 | 0 | 1 | 9 | 54 | 127 | 0.002476 | 0.002476 | 0.0 | 0.003067 | 0.003431 | ... | 5.291163 | 5.291224 | 0.000050 | 5.291307 | 5.291765 | 0.000237 | 5.291705 | 5.292427 | 0.000463 | 5.291101 |
8532 | 4 | 16 | 36 | 64 | 127 | 0.000393 | 0.000393 | 0.0 | 0.000615 | 0.000731 | ... | 5.275803 | 5.275864 | 0.000050 | 5.275947 | 5.276405 | 0.000237 | 5.276346 | 5.277068 | 0.000463 | 5.275742 |
8533 | 11 | 133 | 272 | 447 | 843 | 0.000348 | 0.000348 | 0.0 | 0.000372 | 0.000395 | ... | 5.294956 | 5.295017 | 0.000050 | 5.295101 | 5.295560 | 0.000237 | 5.295501 | 5.296225 | 0.000465 | 5.294894 |
8684 | 79 | 467 | 955 | 1409 | 2054 | 0.000052 | 0.000052 | 0.0 | 0.000183 | 0.000260 | ... | 0.352660 | 0.495956 | 0.175553 | 0.446310 | 0.589103 | 0.177902 | 0.682405 | 1.099252 | 0.282489 | 0.105433 |
37 | 8 | 69 | 141 | 218 | 286 | 0.000168 | 0.000168 | 0.0 | 0.000339 | 0.000513 | ... | 15.634626 | 18.100765 | 1.757582 | 18.533833 | 25.578784 | 4.167641 | 24.116806 | 31.336085 | 6.483454 | 14.132749 |
5 rows × 6084 columns
So, we generated a bunch of features showing the distance between ATMs, from ATMs to all possible points from among the top 500, and the number of these points in different radii. It turned out 6087 features, most of which, of course, are irrelevant. What is next? Of course ...
For further work, it is necessary to select the most important features, for example, 300 of them. Let us turn to the methods of feature extraction called Wrapper methods with the random forest model. I wrote about feature selection methods in the article Feature engineering.
mega_full = pd.concat([full.set_index("id"), df_features], sort=False, axis=1)
col_drop = ['address', 'address_rus', 'target', 'isTrain']
def select_features(df): from sklearn.ensemble import RandomForestRegressor from sklearn.feature_selection import RFE X = df.query('isTrain == True').drop(columns=col_drop) y = df.query('isTrain == True')['target'] rfr = RandomForestRegressor(n_jobs=4, random_state=SEED, n_estimators=10, verbose=0) rfr_selector = RFE(rfr, n_features_to_select=10, step=10, verbose=0) rfr_selector = rfr_selector.fit(X, y) features_ranks = {i: X.columns[rfr_selector.ranking_ <= i] for i in set(rfr_selector.ranking_)} # download at http://nagornyy.me/datasets/rosbank/features_ranks.pickle.zip pickle.dump(features_ranks, open("../../../static/datasets/rosbank/features_ranks.pickle", "wb")) return features_ranks
best_features = pickle.load(open('../../../static/datasets/rosbank/features_ranks.pickle', 'rb'))
Here we see the names of the top-45 features.
for tier in list(best_features)[:10]: print(tier, len(best_features[tier])) print(' 💥 '.join(best_features[5]))
mega_full.atm_group = mega_full.atm_group.astype("category")#cat.add_categories(-1).fillna(-1) mega_full.city = mega_full.city.astype("category")#cat.add_categories(-1).fillna(-1) lgb.plot_importance( simple_classifier(mega_full[best_features[4].tolist() + ['target', 'isTrain']]), figsize=(7, 9) );
Я собрал базу банкоматов с banki.ru, чтобы улучшить результаты классификатора. К сожалению, я не успел применить их в модели. Ваша задача — закончить построение классификатора используюя эти данные.
Далее идёт часть ноутбука с кодом загрузки данных и ссылкой на скачивания готового файла.
База банкоматов, которую выдали организаторы, очевидно неполная. Здравый смысл подсказывает, что в стране установлено намного больше ≈9000 банкоматов. Скорее, это число исчисляется стонями тысяч. Следовательно, один из путей повышения качества классификатора — расширение базы данных.
Также хорошо было бы увеличить количество информации о каждом банкомате. Если мы хотим выявить предикторы популярности, то, кажется, одинм из самых важных из них будет график работы. Однако это информации нет ни в исходных данных, ни в OSM.
Но где достать наиболее полную базу банкоматов России? Тут на помощь приходит сайт banki.ru, где, помимо всего прочего собрана огромная база отделений банков и банкоматов. Кстати говоря, очень рекомендую это сайт — так куча полезных статей, информации о вкладах, кретитных тарифах и т.п.
Итак, нам необходимо собрать этот набора данных сайта. О веб-скраппинге и парсинге я подробно писал в соответсвующей статье.
Загрузка данных будет происходить в три этапа:
К счастью, города можно загрузить в формате JSON по прямому URL.
Всего в базе 9276 городов. Ожидаемо, больше всего банкоматов в Москве, затем идут Санкт-Петербург и Новосибирск.
import requests import pandas as pd import json from tqdm import tqdm_notebook import time import pickle import math
url_cities = ('https://www.banki.ru/bitrix/components/banks/universal.select.region/ajax.php' + '?bankid=0&baseUrl=%2Fbanks%2Fmap%2F&appendUrl=&type=city') res_cities = requests.get(url_cities)
sorted(res_cities.json()['data'], key=lambda x: x['count'], reverse=True)[:3]
Для загрузки банкоматов из города необходимо делать POST-запросы с определёнными заголовками на адрес https://www.banki.ru/api/. Заголовки можно скопировать из Chrome Developer Tools в формате cURL. Тут я хочу пожелать здоровья человеку, который сдел сервис для перевода формата cURL в пригодный для использования Питоновским моделуем requests.
NB: если запросы будут выполняться слишком часто, сайт попросит вести каптчу, что нам совершенно не нужно. Чтобы избежать этого, добавим пазузу в 0.3 секунды между запросами.
headers = { 'cookie': 'PHPSESSID=93bfdc45fd9cc9a8a8a3d50c2447639e; BANKI_RU_USER_IDENTITY_UID=5549793729193099991; aff_sub3=main; uid=uQo9b1wiJOM7IVHaBCLlAg==; __utmc=241422353; _ga=GA1.2.199415927.1545741541; _ym_uid=1545741541860272573; _ym_d=1545741541; ga_client_id=199415927.1545741541; scs=%7B%22t%22%3A1%7D; flocktory-uuid=721c443d-8411-4607-8c77-9eaa3490ad79-2; sbjs_migrations=1418474375998%3D1; sbjs_first_add=fd%3D2018-12-25%2015%3A39%3A03%7C%7C%7Cep%3Dhttps%3A%2F%2Fwww.banki.ru%2F%7C%7C%7Crf%3Dhttps%3A%2F%2Fwww.google.com%2F; sbjs_first=typ%3Dorganic%7C%7C%7Csrc%3Dgoogle%7C%7C%7Cmdm%3Dorganic%7C%7C%7Ccmp%3D%28none%29%7C%7C%7Ccnt%3D%28none%29%7C%7C%7Ctrm%3D%28none%29; ins-mig-done=1; lp_vid=c522904b-b4ae-467f-5a09-f28119eb8f57; spUID=1545741543376ba14961ee0.919b1d92; __io_lv=1545741551218; __io_uid_test=8; __io=5e24aa2a9.d4efafdc0_1545741551223; _io_un=; _io_un=; _io_un=25; __utmv=241422353.|1=siteDesign=new=1; lp_abtests=[{"SessionId":"300285","WidgetId":4568,"TestId":37}]; views_counter=%7B%22news%22%3A%5B10802672%5D%7D; _gid=GA1.2.1726664245.1546467651; __utmz=241422353.1546467651.2.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); _ym_isad=1; BANKI_RU_GUEST_ID=606957941; user-region-id=1; _fbp=fb.1.1546526535416.1197015680; decid_cache=1261000702; decid_png=1261000702; decid_etag=1261000702; DEC_ID=1261000702; BANKI_RU_DLDB=1265655189; __utma=241422353.199415927.1545741541.1546533080.1546536112.6; __utmt=1; _ic_c=6..google_organic; sbjs_current_add=fd%3D2019-01-03%2020%3A21%3A52%7C%7C%7Cep%3Dhttps%3A%2F%2Fwww.banki.ru%2Fbanks%2Fmap%2Fbarnaul%2F%23%2F%21b1%3Aall%21s3%3Abankomaty%21s4%3Alist%21m4%3A1%21p1%3A1%7C%7C%7Crf%3Dhttps%3A%2F%2Fwww.banki.ru%2Fbanks%2Fmap%2Fbarnaul%2F; sbjs_current=typ%3Dreferral%7C%7C%7Csrc%3Dbanki.ru%7C%7C%7Cmdm%3Dreferral%7C%7C%7Ccmp%3D%28none%29%7C%7C%7Ccnt%3D%2Fbanks%2Fmap%2Fbarnaul%2F%7C%7C%7Ctrm%3D%28none%29; sbjs_udata=vst%3D6%7C%7C%7Cuip%3D%28none%29%7C%7C%7Cuag%3DMozilla%2F5.0%20%28Macintosh%3B%20Intel%20Mac%20OS%20X%2010_14_2%29%20AppleWebKit%2F537.36%20%28KHTML%2C%20like%20Gecko%29%20Chrome%2F71.0.3578.98%20Safari%2F537.36; ins-gaSSId=a8a876f6-e72b-5ea9-6348-60de0df03b4b_1546536114; lp_session=247585; _ym_visorc_502212=b; NonRobot=15465365840c4f6fe5021cfba98ee584c8ed0c080b729+acfc037aae7e3cd2095d5a5c38fcbc1d; __utmb=241422353.5.10.1546536112; _gat=1; _gat_bankiru_test=1; sbjs_session=pgs%3D5%7C%7C%7Ccpg%3Dhttps%3A%2F%2Fwww.banki.ru%2Fbanks%2Fmap%2Fbarnaul%2F%3F%23%2F%21b1%3Aall%21s3%3Abankomaty%21s4%3Alist%21m1%3A12%21m2%3A53.356132%21m3%3A83.74962%21p1%3A1; tmr_detect=1%7C1546536586502; BANKI_RU_LAST_VISIT=03.01.2019+20%3A29%3A46; BANKI_RU_BANNERS=452_20055_1_04012019%2C452_10427_1_04012019%2C106_10808_1_04012019%2C106_797_15_04012019%2C106_809_11_04012019; insdrSV=32; lp_pageview=5; _gat_UA-38591118-1=1; bank_geo_link=#/!b1:all!s3:bankomaty!s4:list!m1:12!m2:53.356132!m3:83.74962!p1:1', 'origin': 'https://www.banki.ru', 'accept-encoding': 'gzip, deflate, br', 'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8,ru;q=0.7', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36', 'content-type': 'application/x-www-form-urlencoded; charset=UTF-8', 'accept': 'application/json, text/javascript, */*; q=0.01', 'referer': 'https://www.banki.ru/banks/map/barnaul/?', 'authority': 'www.banki.ru', 'x-requested-with': 'XMLHttpRequest', 'save-data': 'on', } data = { "jsonrpc":"2.0", "method":"bankGeo/getObjectsByFilter", "params":{ "with_empty_coordinates":True, "limit":10000, "type":["atm","self_office"], "region_id":[] }, "id":"2" }
atms_all = [] for city in tqdm_notebook(res_cities.json()['data']): data['params']['region_id'] = [city['region_id']] response_atms_city = requests.post('https://www.banki.ru/api/', headers=headers, data=json.dumps(data)) time.sleep(0.3) # иначе просят ввести каптчу atms_city = response_atms_city.json()['result']['data'].copy() for atm in atms_city: atm.update({ "region": city["region_name"] }) atms_all.append(atm)
Неплохой услов! Мы собрали информацию о 86021 банкомате за полтора часа, что гораздо больше, чем дали нам организаторы.
atms_all_df = pd.DataFrame(atms_all) atms_all_df.head()
active | address | bank_id | icon_url | id | is_main | latitude | longitude | name | region | region_id | sort | type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | г. Москва, пер. Малый Сухаревский, д. 7 | 9259 | /upload/iblock/a0d/9259.gif | 6588499 | True | 55.770445 | 37.623941 | Банкомат | Москва | 4 | 5 | atm |
1 | True | г. Москва, просп. Кутузовский, д. 24 | 9259 | /upload/iblock/a0d/9259.gif | 6572679 | True | 55.744597 | 37.545693 | Банкомат | Москва | 4 | 5 | atm |
2 | True | г. Москва, ул. Малая Дмитровка, д. 10 | 9259 | /upload/iblock/a0d/9259.gif | 7122179 | True | 55.768714 | 37.606467 | Банкомат | Москва | 4 | 5 | atm |
3 | True | г. Москва, ул. Садовая-Черногрязская, д. 3 Б, ... | 9259 | /upload/iblock/a0d/9259.gif | 8493289 | True | 55.767398 | 37.652819 | Банкомат | Москва | 4 | 5 | atm |
4 | True | г. Москва, ул. Бакунинская, д. 23/41 | 9259 | /upload/iblock/a0d/9259.gif | 6602441 | True | 55.775659 | 37.683584 | Банкомат | Москва | 4 | 5 | atm |
Здесь почти всё так же, как и в предыдущем шаге, только посылаем немного другие данные — в params
> id_list
нужно поместить список id
нужных нам банкоматов. Если список будет слишком большой, получится ошибка, поэтому разобьем ранее собранный список банкоматов на пакеты по 500 штук, а затем будем запрашивать информацию о каждом пакете по очереди.
headers = { 'cookie': 'PHPSESSID=93bfdc45fd9cc9a8a8a3d50c2447639e; BANKI_RU_USER_IDENTITY_UID=5549793729193099991; aff_sub3=main; uid=uQo9b1wiJOM7IVHaBCLlAg==; __utmc=241422353; _ga=GA1.2.199415927.1545741541; _ym_uid=1545741541860272573; _ym_d=1545741541; ga_client_id=199415927.1545741541; scs=%7B%22t%22%3A1%7D; flocktory-uuid=721c443d-8411-4607-8c77-9eaa3490ad79-2; sbjs_migrations=1418474375998%3D1; sbjs_first_add=fd%3D2018-12-25%2015%3A39%3A03%7C%7C%7Cep%3Dhttps%3A%2F%2Fwww.banki.ru%2F%7C%7C%7Crf%3Dhttps%3A%2F%2Fwww.google.com%2F; sbjs_first=typ%3Dorganic%7C%7C%7Csrc%3Dgoogle%7C%7C%7Cmdm%3Dorganic%7C%7C%7Ccmp%3D%28none%29%7C%7C%7Ccnt%3D%28none%29%7C%7C%7Ctrm%3D%28none%29; ins-mig-done=1; lp_vid=c522904b-b4ae-467f-5a09-f28119eb8f57; spUID=1545741543376ba14961ee0.919b1d92; __io_lv=1545741551218; __io_uid_test=8; __io=5e24aa2a9.d4efafdc0_1545741551223; _io_un=; _io_un=; _io_un=25; __utmv=241422353.|1=siteDesign=new=1; lp_abtests=[{"SessionId":"300285","WidgetId":4568,"TestId":37}]; views_counter=%7B%22news%22%3A%5B10802672%5D%7D; _gid=GA1.2.1726664245.1546467651; __utmz=241422353.1546467651.2.2.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); BANKI_RU_GUEST_ID=606957941; user-region-id=1; _fbp=fb.1.1546526535416.1197015680; decid_cache=1261000702; decid_png=1261000702; decid_etag=1261000702; DEC_ID=1261000702; BANKI_RU_DLDB=1265655189; sbjs_current=typ%3Dreferral%7C%7C%7Csrc%3Dbanki.ru%7C%7C%7Cmdm%3Dreferral%7C%7C%7Ccmp%3D%28none%29%7C%7C%7Ccnt%3D%2Fbanks%2Fmap%2Fbarnaul%2F%7C%7C%7Ctrm%3D%28none%29; NonRobot=15465365840c4f6fe5021cfba98ee584c8ed0c080b729+acfc037aae7e3cd2095d5a5c38fcbc1d; __utma=241422353.199415927.1545741541.1546536112.1546536737.7; _ic_c=7..google_organic; sbjs_current_add=fd%3D2019-01-03%2021%3A28%3A04%7C%7C%7Cep%3Dhttps%3A%2F%2Fwww.banki.ru%2Fbanks%2Fmap%2Fmoskva%2F%7C%7C%7Crf%3Dhttps%3A%2F%2Fwww.banki.ru%2Fbanks%2Fmap%2Fbarnaul%2F%3F; sbjs_udata=vst%3D7%7C%7C%7Cuip%3D%28none%29%7C%7C%7Cuag%3DMozilla%2F5.0%20%28Macintosh%3B%20Intel%20Mac%20OS%20X%2010_14_2%29%20AppleWebKit%2F537.36%20%28KHTML%2C%20like%20Gecko%29%20Chrome%2F71.0.3578.98%20Safari%2F537.36; _ym_isad=1; ins-gaSSId=1b8edf15-7c15-36e1-ee0e-84d01dea36eb_1546540085; lp_session=271976; __utmt=1; __utmb=241422353.5.9.1546540079434; _gat=1; tmr_detect=1%7C1546543574331; _gat_bankiru_test=1; _ym_visorc_502212=b; BANKI_RU_LAST_VISIT=03.01.2019+22%3A26%3A14; insdrSV=38; BANKI_RU_BANNERS=452_20055_1_04012019%2C452_10427_1_04012019%2C106_10808_1_04012019%2C106_797_17_04012019%2C106_809_15_04012019; sbjs_session=pgs%3D4%7C%7C%7Ccpg%3Dhttps%3A%2F%2Fwww.banki.ru%2Fbanks%2Fmap%2Fmoskva%2F%23%2F%21b1%3Aall%21s3%3Abankomaty%21s4%3Alist%21m1%3A10%21m2%3A55.755773%21m3%3A37.617761%21p1%3A1; lp_pageview=3; bank_geo_link=#/!b1:all!s3:bankomaty!s4:list!m1:10!m2:55.755773!m3:37.617761!p1:2', 'origin': 'https://www.banki.ru', 'accept-encoding': 'gzip, deflate, br', 'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8,ru;q=0.7', 'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36', 'content-type': 'application/x-www-form-urlencoded; charset=UTF-8', 'accept': 'application/json, text/javascript, */*; q=0.01', 'referer': 'https://www.banki.ru/banks/map/moskva/', 'authority': 'www.banki.ru', 'x-requested-with': 'XMLHttpRequest', 'save-data': 'on', } data = { "jsonrpc":"2.0", "method":"bank/getBankObjectsData", "params":{ "id_list": [] }, "id":"6" } #response = requests.post('https://www.banki.ru/api/', headers=headers, data=json.dumps(data))
num_batches = math.ceil(len(atms_all)/500) allAtms_fullInfo = [] for batch in tqdm_notebook(range(0, num_batches)): data['params']['id_list'] = [atm['id'] for atm in atms_all[batch * 500: (batch + 1) * 500]] response = requests.post('https://www.banki.ru/api/', headers=headers, data=json.dumps(data)) time.sleep(.3) try: allAtms_fullInfo.extend(response.json()['result']['data']) except KeyError as err: print(err, response.json()) pickle.dump(allAtms_fullInfo, open('output/banki.ru_atms_fullInfo.pickle', 'wb')) len(allAtms_fullInfo)
atms_allInfo_df = pd.DataFrame(allAtms_fullInfo) atms_allInfo_df.head()
active | additional | address | bank_code | bank_icon_url | bank_id | bank_logo2_url | bank_logo_url | bank_name | bank_site | ... | metro_name | name | phone | region_id | schedule_entities | schedule_general | schedule_private_person | schedule_vip | type | without_weekend | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Валюта: рубли | г. Москва, пер. Малый Сухаревский, д. 7 | absolutbank | /upload/iblock/a0d/9259.gif | 9259 | https://www.banki.ru/upload/iblock/3d8/absolut... | https://www.banki.ru/upload/iblock/f71/absolut... | Абсолют Банк | www.absolutbank.ru | ... | Цветной бульвар, Сухаревская, Трубная | Банкомат | 4 | atm | 0 | |||||
1 | 1 | С функцией приема наличных.<br />\nВалюта: руб... | г. Москва, просп. Кутузовский, д. 24 | absolutbank | /upload/iblock/a0d/9259.gif | 9259 | https://www.banki.ru/upload/iblock/3d8/absolut... | https://www.banki.ru/upload/iblock/f71/absolut... | Абсолют Банк | www.absolutbank.ru | ... | Выставочная, Кутузовская, Деловой центр | Банкомат | 4 | atm | 1 | |||||
2 | 1 | С функцией приема наличных.<br />\nВалюта: рубли. | г. Москва, ул. Малая Дмитровка, д. 10 | absolutbank | /upload/iblock/a0d/9259.gif | 9259 | https://www.banki.ru/upload/iblock/3d8/absolut... | https://www.banki.ru/upload/iblock/f71/absolut... | Абсолют Банк | www.absolutbank.ru | ... | Чеховская, Пушкинская, Тверская | Банкомат | 4 | atm | 1 | |||||
3 | 1 | С функцией приема наличных.<br />\nВалюта: руб... | г. Москва, ул. Садовая-Черногрязская, д. 3 Б, ... | absolutbank | /upload/iblock/a0d/9259.gif | 9259 | https://www.banki.ru/upload/iblock/3d8/absolut... | https://www.banki.ru/upload/iblock/f71/absolut... | Абсолют Банк | www.absolutbank.ru | ... | Красные Ворота, Комсомольская | Банкомат | 4 | atm | 1 | |||||
4 | 1 | С функцией приема наличных.<br />\nВалюта: руб... | г. Москва, ул. Бакунинская, д. 23/41 | absolutbank | /upload/iblock/a0d/9259.gif | 9259 | https://www.banki.ru/upload/iblock/3d8/absolut... | https://www.banki.ru/upload/iblock/f71/absolut... | Абсолют Банк | www.absolutbank.ru | ... | Бауманская | Банкомат | 4 | atm | 1 |
5 rows × 28 columns
atms_allInfo_df.isna().sum()
atms_allInfo_df.columns
atms_allInfo_df["accept_cash"] = atms_allInfo_df.additional.str.contains("приёма наличных")
atms_allInfo_df["accept_cash"].value_counts()
atms_allInfo_df.is_at_closed_place.value_counts()
atms_allInfo_df.is_at_closed_place.value_counts()
atms_allInfo_df.is_main_office.value_counts()
atms_allInfo_df.is_round_the_clock.value_counts()
atms_allInfo_df.is_works_as_shop.value_counts()
list(set(atms_all_df.columns) & set(atms_allInfo_df.columns))
atms_all_df.dtypes
atms_allInfo_df.dtypes
atms_allInfo_df[["latitude", "longitude", 'region_id']] = atms_allInfo_df[["latitude", "longitude", 'region_id']].apply(pd.to_numeric) atms_allInfo_df.active = atms_allInfo_df.active.astype(bool)
atms_allInfo_df[["latitude", "longitude"]] = atms_allInfo_df[["latitude", "longitude"]].round(7) atms_all_df[["latitude", "longitude"]] = atms_all_df[["latitude", "longitude"]].round(7)
atms_full_df = pd.merge( atms_all_df, atms_allInfo_df, on=['id', "latitude", "longitude"], how="outer", )
atms_full_df.shape, atms_all_df.shape, atms_allInfo_df.shape
atms_full_df.columns
Комментарии