Основными типами данных в Pandas являются Series и DataFrame.
Series – это проиндексированный одномерный массив значений. Он похож на простой словарь типа dict, где имя элемента будет соответствовать индексу, а значение – значению записи.
DataFrame — это проиндексированный многомерный массив значений, соответственно каждый столбец DataFrame, является структурой Series.
import pandas as pd import numpy as np %config InlineBackend.figure_format = 'svg'
labels = ['a', 'b', 'c'] my_list = [10, 20, 30] arr = np.array([10, 20, 30]) d = {'a': 10, 'b': 20, 'c': 30}
pd.Series(data=my_list)
pd.Series(data=my_list, index=labels)
pd.Series(arr,labels)
pd.Series(d)
pd.Series([sum,print,len])
ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan']) ser1
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan']) ser2
ser1['USA']
ser1 + ser2
ser1[:2]
pd.DataFrame?
df = pd.DataFrame(np.random.randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df
W | X | Y | Z | |
---|---|---|---|---|
A | -0.648461 | -0.161792 | -0.097218 | 0.265221 |
B | -0.022018 | 1.174444 | -3.313616 | -0.651594 |
C | -1.391369 | 0.946182 | -0.323140 | -1.084017 |
D | 1.030518 | -0.208792 | 1.549492 | 0.061084 |
E | -1.740542 | -0.411707 | -0.531161 | 0.432765 |
Pandas может считывать и записывать данные в различных форматах: http://pandas.pydata.org/pandas-docs/version/0.20/io.html
Лично я предпочитаю хранить данные в бинарном формате Msgpack, если имеется достаточно ОЗУ.
Если размер данных значительно превышает объём ОЗУ, необходимо использоавть БД. Pandas поддерживает нативную работу со многими реляционными БД.
DF можно конвертировать в другие форматы, например MongoDB, при помощи odo
(часть интерфейса blaze).
Можно не только загружать данные из локальных файлов, но и из Интернета — достаточно вместо адреса на локальном компьютере указать прямю ссылку на файл.
Также существует дополнительный пакет, который называется pandas_datareader
. Он загружает данные из следующих источников:
- Yahoo! Finance<remote_data.yahoo>
- Google Finance<remote_data.google>
- Enigma<remote_data.enigma>
- Quandl<remote_data.quandl>
- St.Louis FED (FRED)<remote_data.fred>
- Kenneth French's data library<remote_data.ff>
- World Bank<remote_data.wb>
- OECD<remote_data.oecd>
- Eurostat<remote_data.eurostat>
- Thrift Savings Plan<remote_data.tsp>
- Nasdaq Trader symbol definitions<remote_data.nasdaq_symbols>
Ну и наконец существует социальная сеть data.world, где можно обмениваться данными и загружать их в pandas.
!pip3 install --upgrade beautifulsoup4
import bs4
Скачайте файл с данными о пассажирах титаника https://github.com/agconti/kaggle-titanic/raw/master/data/train.csv. Загрузите его в pandas dataframe.
titanic_full_df = pd.read_csv("https://nagornyy.me/datasets/titanic.csv", sep=",")
titanic_full_df.shape
titanic_full_df.info()
titanic_full_df.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
titanic_full_df.columns
titanic_full_df.dtypes
titanic_full_df.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
titanic_full_df.tail()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S |
887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S |
888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C |
890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
titanic_full_df.isnull().sum()
titanic_full_df.sample(5)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
93 | 94 | 0 | 3 | Dean, Mr. Bertram Frank | male | 26.0 | 1 | 2 | C.A. 2315 | 20.5750 | NaN | S |
341 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
309 | 310 | 1 | 1 | Francatelli, Miss. Laura Mabel | female | 30.0 | 0 | 0 | PC 17485 | 56.9292 | E36 | C |
295 | 296 | 0 | 1 | Lewy, Mr. Ervin G | male | NaN | 0 | 0 | PC 17612 | 27.7208 | NaN | C |
854 | 855 | 0 | 2 | Carter, Mrs. Ernest Courtenay (Lilian Hughes) | female | 44.0 | 1 | 0 | 244252 | 26.0000 | NaN | S |
titanic_full_df.sample(5)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
236 | 237 | 0 | 2 | Hold, Mr. Stephen | male | 44.0 | 1 | 0 | 26707 | 26.0000 | NaN | S |
465 | 466 | 0 | 3 | Goncalves, Mr. Manuel Estanslas | male | 38.0 | 0 | 0 | SOTON/O.Q. 3101306 | 7.0500 | NaN | S |
664 | 665 | 1 | 3 | Lindqvist, Mr. Eino William | male | 20.0 | 1 | 0 | STON/O 2. 3101285 | 7.9250 | NaN | S |
696 | 697 | 0 | 3 | Kelly, Mr. James | male | 44.0 | 0 | 0 | 363592 | 8.0500 | NaN | S |
881 | 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
titanic_full_df["Age"].head()
type(titanic_df["Age"])
# Pass a list of column names titanic_full_df[["Age", "Sex"]].head()
Age | Sex | |
---|---|---|
0 | 22.0 | male |
1 | 38.0 | female |
2 | 26.0 | female |
3 | 35.0 | female |
4 | 35.0 | male |
titanic_full_df["Relatives"] = titanic_full_df["SibSp"] + titanic_full_df["Parch"]
titanic_full_df[["SibSp", "Parch", "Relatives"]].head()
SibSp | Parch | Relatives | |
---|---|---|---|
0 | 1 | 0 | 1 |
1 | 1 | 0 | 1 |
2 | 0 | 0 | 0 |
3 | 1 | 0 | 1 |
4 | 0 | 0 | 0 |
titanic_full_df.drop("Relatives", axis=1).head() # inplace=True
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
titanic_full_df.index.tolist()[:10]
Выберем ряд по индексу
titanic_full_df.loc[442 : 450 : 2, ["Age", "Sex"]]
Age | Sex | |
---|---|---|
442 | 25.0 | male |
444 | NaN | male |
446 | 13.0 | female |
448 | 5.0 | female |
450 | 36.0 | male |
titanic_full_df.set_index(["Embarked"]).loc["S"].head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Relatives | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Embarked | ||||||||||||
S | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 1 |
S | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 0 |
S | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 1 |
S | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 0 |
S | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | 0 |
titanic_full_df.iloc[0]
titanic_full_df.iloc[[564, 442]]
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Relatives | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
564 | 565 | 0 | 3 | Meanwell, Miss. (Marion Ogden) | female | NaN | 0 | 0 | SOTON/O.Q. 392087 | 8.050 | NaN | S | 0 |
442 | 443 | 0 | 3 | Petterson, Mr. Johan Emil | male | 25.0 | 1 | 0 | 347076 | 7.775 | NaN | S | 1 |
titanic_full_df.loc[[564, 442], ["Name", "Sex"]]
Name | Sex | |
---|---|---|
564 | Meanwell, Miss. (Marion Ogden) | female |
442 | Petterson, Mr. Johan Emil | male |
titanic_full_df == 1
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Relatives | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | True | False | False | False | False | False | True | False | False | False | False | False | True |
1 | False | True | True | False | False | False | True | False | False | False | False | False | True |
2 | False | True | False | False | False | False | False | False | False | False | False | False | False |
3 | False | True | True | False | False | False | True | False | False | False | False | False | True |
4 | False | False | False | False | False | False | False | False | False | False | False | False | False |
5 | False | False | False | False | False | False | False | False | False | False | False | False | False |
6 | False | False | True | False | False | False | False | False | False | False | False | False | False |
7 | False | False | False | False | False | False | False | True | False | False | False | False | False |
8 | False | True | False | False | False | False | False | False | False | False | False | False | False |
9 | False | True | False | False | False | False | True | False | False | False | False | False | True |
10 | False | True | False | False | False | False | True | True | False | False | False | False | False |
11 | False | True | True | False | False | False | False | False | False | False | False | False | False |
12 | False | False | False | False | False | False | False | False | False | False | False | False | False |
13 | False | False | False | False | False | False | True | False | False | False | False | False | False |
14 | False | False | False | False | False | False | False | False | False | False | False | False | False |
15 | False | True | False | False | False | False | False | False | False | False | False | False | False |
16 | False | False | False | False | False | False | False | True | False | False | False | False | False |
17 | False | True | False | False | False | False | False | False | False | False | False | False | False |
18 | False | False | False | False | False | False | True | False | False | False | False | False | True |
19 | False | True | False | False | False | False | False | False | False | False | False | False | False |
20 | False | False | False | False | False | False | False | False | False | False | False | False | False |
21 | False | True | False | False | False | False | False | False | False | False | False | False | False |
22 | False | True | False | False | False | False | False | False | False | False | False | False | False |
23 | False | True | True | False | False | False | False | False | False | False | False | False | False |
24 | False | False | False | False | False | False | False | True | False | False | False | False | False |
25 | False | True | False | False | False | False | True | False | False | False | False | False | False |
26 | False | False | False | False | False | False | False | False | False | False | False | False | False |
27 | False | False | True | False | False | False | False | False | False | False | False | False | False |
28 | False | True | False | False | False | False | False | False | False | False | False | False | False |
29 | False | False | False | False | False | False | False | False | False | False | False | False | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
861 | False | False | False | False | False | False | True | False | False | False | False | False | True |
862 | False | True | True | False | False | False | False | False | False | False | False | False | False |
863 | False | False | False | False | False | False | False | False | False | False | False | False | False |
864 | False | False | False | False | False | False | False | False | False | False | False | False | False |
865 | False | True | False | False | False | False | False | False | False | False | False | False | False |
866 | False | True | False | False | False | False | True | False | False | False | False | False | True |
867 | False | False | True | False | False | False | False | False | False | False | False | False | False |
868 | False | False | False | False | False | False | False | False | False | False | False | False | False |
869 | False | True | False | False | False | False | True | True | False | False | False | False | False |
870 | False | False | False | False | False | False | False | False | False | False | False | False | False |
871 | False | True | True | False | False | False | True | True | False | False | False | False | False |
872 | False | False | True | False | False | False | False | False | False | False | False | False | False |
873 | False | False | False | False | False | False | False | False | False | False | False | False | False |
874 | False | True | False | False | False | False | True | False | False | False | False | False | True |
875 | False | True | False | False | False | False | False | False | False | False | False | False | False |
876 | False | False | False | False | False | False | False | False | False | False | False | False | False |
877 | False | False | False | False | False | False | False | False | False | False | False | False | False |
878 | False | False | False | False | False | False | False | False | False | False | False | False | False |
879 | False | True | True | False | False | False | False | True | False | False | False | False | True |
880 | False | True | False | False | False | False | False | True | False | False | False | False | True |
881 | False | False | False | False | False | False | False | False | False | False | False | False | False |
882 | False | False | False | False | False | False | False | False | False | False | False | False | False |
883 | False | False | False | False | False | False | False | False | False | False | False | False | False |
884 | False | False | False | False | False | False | False | False | False | False | False | False | False |
885 | False | False | False | False | False | False | False | False | False | False | False | False | False |
886 | False | False | False | False | False | False | False | False | False | False | False | False | False |
887 | False | True | True | False | False | False | False | False | False | False | False | False | False |
888 | False | False | False | False | False | False | True | False | False | False | False | False | False |
889 | False | True | True | False | False | False | False | False | False | False | False | False | False |
890 | False | False | False | False | False | False | False | False | False | False | False | False | False |
891 rows × 13 columns
titanic_full_df.Survived == 1
titanic_full_df[titanic_full_df["Survived"] == 0].head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Relatives | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q | 0 |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | 0 |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S | 4 |
titanic_full_df[titanic_full_df["Survived"] == 1]["Sex"].value_counts()
titanic_full_df[(titanic_full_df["Fare"] > 100) | (titanic_full_df["Name"].str.find("Master") != -1)].head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Relatives | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S | 4 |
16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q | 5 |
27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S | 5 |
31 | 32 | 1 | 1 | Spencer, Mrs. William Augustus (Marie Eugenie) | female | NaN | 1 | 0 | PC 17569 | 146.5208 | B78 | C | 1 |
50 | 51 | 0 | 3 | Panula, Master. Juha Niilo | male | 7.0 | 4 | 1 | 3101295 | 39.6875 | NaN | S | 5 |
titanic_full_df["Embarked"].unique()
titanic_full_df["Embarked"].nunique()
titanic_full_df["Survived"].value_counts()
titanic_full_df["Pclass"].value_counts()
titanic_full_df["Pclass"].replace({1: "Элита", 2: "Средний класс", 3: "Работяги"}, inplace=True)
titanic_full_df["Pclass"].value_counts()
titanic_full_df["Fare"].apply(lambda x: "Дёшево" if x < 20 else "Дорого")
titanic_full_df["Fare_Bin"] = titanic_full_df["Fare"].apply(lambda x: "Дёшево" if x < 20 else "Дорого")
titanic_full_df.sort_values(by="Fare", ascending=False)
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Relatives | Fare_Bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
258 | 259 | 1 | Элита | Ward, Miss. Anna | female | 35.00 | 0 | 0 | PC 17755 | 512.3292 | NaN | C | 0 | Дорого |
737 | 738 | 1 | Элита | Lesurer, Mr. Gustave J | male | 35.00 | 0 | 0 | PC 17755 | 512.3292 | B101 | C | 0 | Дорого |
679 | 680 | 1 | Элита | Cardeza, Mr. Thomas Drake Martinez | male | 36.00 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C | 1 | Дорого |
88 | 89 | 1 | Элита | Fortune, Miss. Mabel Helen | female | 23.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S | 5 | Дорого |
27 | 28 | 0 | Элита | Fortune, Mr. Charles Alexander | male | 19.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S | 5 | Дорого |
341 | 342 | 1 | Элита | Fortune, Miss. Alice Elizabeth | female | 24.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S | 5 | Дорого |
438 | 439 | 0 | Элита | Fortune, Mr. Mark | male | 64.00 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S | 5 | Дорого |
311 | 312 | 1 | Элита | Ryerson, Miss. Emily Borie | female | 18.00 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C | 4 | Дорого |
742 | 743 | 1 | Элита | Ryerson, Miss. Susan Parker "Suzette" | female | 21.00 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C | 4 | Дорого |
118 | 119 | 0 | Элита | Baxter, Mr. Quigg Edmond | male | 24.00 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C | 1 | Дорого |
299 | 300 | 1 | Элита | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.00 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C | 1 | Дорого |
557 | 558 | 0 | Элита | Robbins, Mr. Victor | male | NaN | 0 | 0 | PC 17757 | 227.5250 | NaN | C | 0 | Дорого |
700 | 701 | 1 | Элита | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | female | 18.00 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C | 1 | Дорого |
380 | 381 | 1 | Элита | Bidois, Miss. Rosalie | female | 42.00 | 0 | 0 | PC 17757 | 227.5250 | NaN | C | 0 | Дорого |
716 | 717 | 1 | Элита | Endres, Miss. Caroline Louise | female | 38.00 | 0 | 0 | PC 17757 | 227.5250 | C45 | C | 0 | Дорого |
527 | 528 | 0 | Элита | Farthing, Mr. John | male | NaN | 0 | 0 | PC 17483 | 221.7792 | C95 | S | 0 | Дорого |
377 | 378 | 0 | Элита | Widener, Mr. Harry Elkins | male | 27.00 | 0 | 2 | 113503 | 211.5000 | C82 | C | 2 | Дорого |
730 | 731 | 1 | Элита | Allen, Miss. Elisabeth Walton | female | 29.00 | 0 | 0 | 24160 | 211.3375 | B5 | S | 0 | Дорого |
779 | 780 | 1 | Элита | Robert, Mrs. Edward Scott (Elisabeth Walton Mc... | female | 43.00 | 0 | 1 | 24160 | 211.3375 | B3 | S | 1 | Дорого |
689 | 690 | 1 | Элита | Madill, Miss. Georgette Alexandra | female | 15.00 | 0 | 1 | 24160 | 211.3375 | B5 | S | 1 | Дорого |
318 | 319 | 1 | Элита | Wick, Miss. Mary Natalie | female | 31.00 | 0 | 2 | 36928 | 164.8667 | C7 | S | 2 | Дорого |
856 | 857 | 1 | Элита | Wick, Mrs. George Dennick (Mary Hitchcock) | female | 45.00 | 1 | 1 | 36928 | 164.8667 | NaN | S | 2 | Дорого |
268 | 269 | 1 | Элита | Graham, Mrs. William Thompson (Edith Junkins) | female | 58.00 | 0 | 1 | PC 17582 | 153.4625 | C125 | S | 1 | Дорого |
332 | 333 | 0 | Элита | Graham, Mr. George Edward | male | 38.00 | 0 | 1 | PC 17582 | 153.4625 | C91 | S | 1 | Дорого |
609 | 610 | 1 | Элита | Shutes, Miss. Elizabeth W | female | 40.00 | 0 | 0 | PC 17582 | 153.4625 | C125 | S | 0 | Дорого |
498 | 499 | 0 | Элита | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 3 | Дорого |
305 | 306 | 1 | Элита | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 3 | Дорого |
297 | 298 | 0 | Элита | Allison, Miss. Helen Loraine | female | 2.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 3 | Дорого |
708 | 709 | 1 | Элита | Cleaver, Miss. Alice | female | 22.00 | 0 | 0 | 113781 | 151.5500 | NaN | S | 0 | Дорого |
195 | 196 | 1 | Элита | Lurette, Miss. Elise | female | 58.00 | 0 | 0 | PC 17569 | 146.5208 | B80 | C | 0 | Дорого |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
363 | 364 | 0 | Работяги | Asim, Mr. Adola | male | 35.00 | 0 | 0 | SOTON/O.Q. 3101310 | 7.0500 | NaN | S | 0 | Дёшево |
477 | 478 | 0 | Работяги | Braund, Mr. Lewis Richard | male | 29.00 | 1 | 0 | 3460 | 7.0458 | NaN | S | 1 | Дёшево |
129 | 130 | 0 | Работяги | Ekstrom, Mr. Johan | male | 45.00 | 0 | 0 | 347061 | 6.9750 | NaN | S | 0 | Дёшево |
804 | 805 | 1 | Работяги | Hedman, Mr. Oskar Arvid | male | 27.00 | 0 | 0 | 347089 | 6.9750 | NaN | S | 0 | Дёшево |
825 | 826 | 0 | Работяги | Flynn, Mr. John | male | NaN | 0 | 0 | 368323 | 6.9500 | NaN | Q | 0 | Дёшево |
411 | 412 | 0 | Работяги | Hart, Mr. Henry | male | NaN | 0 | 0 | 394140 | 6.8583 | NaN | Q | 0 | Дёшево |
654 | 655 | 0 | Работяги | Hegarty, Miss. Hanora "Nora" | female | 18.00 | 0 | 0 | 365226 | 6.7500 | NaN | Q | 0 | Дёшево |
143 | 144 | 0 | Работяги | Burke, Mr. Jeremiah | male | 19.00 | 0 | 0 | 365222 | 6.7500 | NaN | Q | 0 | Дёшево |
202 | 203 | 0 | Работяги | Johanson, Mr. Jakob Alfred | male | 34.00 | 0 | 0 | 3101264 | 6.4958 | NaN | S | 0 | Дёшево |
371 | 372 | 0 | Работяги | Wiklund, Mr. Jakob Alfred | male | 18.00 | 1 | 0 | 3101267 | 6.4958 | NaN | S | 1 | Дёшево |
818 | 819 | 0 | Работяги | Holm, Mr. John Fredrik Alexander | male | 43.00 | 0 | 0 | C 7075 | 6.4500 | NaN | S | 0 | Дёшево |
843 | 844 | 0 | Работяги | Lemberopolous, Mr. Peter L | male | 34.50 | 0 | 0 | 2683 | 6.4375 | NaN | C | 0 | Дёшево |
326 | 327 | 0 | Работяги | Nysveen, Mr. Johan Hansen | male | 61.00 | 0 | 0 | 345364 | 6.2375 | NaN | S | 0 | Дёшево |
872 | 873 | 0 | Элита | Carlsson, Mr. Frans Olof | male | 33.00 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S | 0 | Дёшево |
378 | 379 | 0 | Работяги | Betros, Mr. Tannous | male | 20.00 | 0 | 0 | 2648 | 4.0125 | NaN | C | 0 | Дёшево |
466 | 467 | 0 | Средний класс | Campbell, Mr. William | male | NaN | 0 | 0 | 239853 | 0.0000 | NaN | S | 0 | Дёшево |
597 | 598 | 0 | Работяги | Johnson, Mr. Alfred | male | 49.00 | 0 | 0 | LINE | 0.0000 | NaN | S | 0 | Дёшево |
271 | 272 | 1 | Работяги | Tornquist, Mr. William Henry | male | 25.00 | 0 | 0 | LINE | 0.0000 | NaN | S | 0 | Дёшево |
277 | 278 | 0 | Средний класс | Parkes, Mr. Francis "Frank" | male | NaN | 0 | 0 | 239853 | 0.0000 | NaN | S | 0 | Дёшево |
806 | 807 | 0 | Элита | Andrews, Mr. Thomas Jr | male | 39.00 | 0 | 0 | 112050 | 0.0000 | A36 | S | 0 | Дёшево |
481 | 482 | 0 | Средний класс | Frost, Mr. Anthony Wood "Archie" | male | NaN | 0 | 0 | 239854 | 0.0000 | NaN | S | 0 | Дёшево |
302 | 303 | 0 | Работяги | Johnson, Mr. William Cahoone Jr | male | 19.00 | 0 | 0 | LINE | 0.0000 | NaN | S | 0 | Дёшево |
179 | 180 | 0 | Работяги | Leonard, Mr. Lionel | male | 36.00 | 0 | 0 | LINE | 0.0000 | NaN | S | 0 | Дёшево |
263 | 264 | 0 | Элита | Harrison, Mr. William | male | 40.00 | 0 | 0 | 112059 | 0.0000 | B94 | S | 0 | Дёшево |
815 | 816 | 0 | Элита | Fry, Mr. Richard | male | NaN | 0 | 0 | 112058 | 0.0000 | B102 | S | 0 | Дёшево |
633 | 634 | 0 | Элита | Parr, Mr. William Henry Marsh | male | NaN | 0 | 0 | 112052 | 0.0000 | NaN | S | 0 | Дёшево |
413 | 414 | 0 | Средний класс | Cunningham, Mr. Alfred Fleming | male | NaN | 0 | 0 | 239853 | 0.0000 | NaN | S | 0 | Дёшево |
822 | 823 | 0 | Элита | Reuchlin, Jonkheer. John George | male | 38.00 | 0 | 0 | 19972 | 0.0000 | NaN | S | 0 | Дёшево |
732 | 733 | 0 | Средний класс | Knight, Mr. Robert J | male | NaN | 0 | 0 | 239855 | 0.0000 | NaN | S | 0 | Дёшево |
674 | 675 | 0 | Средний класс | Watson, Mr. Ennis Hastings | male | NaN | 0 | 0 | 239856 | 0.0000 | NaN | S | 0 | Дёшево |
891 rows × 14 columns
titanic_full_df.isnull().any()
titanic_full_df.dropna().head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Relatives | Fare_Bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2 | 1 | Элита | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | Дорого |
3 | 4 | 1 | Элита | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 | Дорого |
6 | 7 | 0 | Элита | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | 0 | Дорого |
10 | 11 | 1 | Работяги | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S | 2 | Дёшево |
11 | 12 | 1 | Элита | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S | 0 | Дорого |
titanic_full_df.dropna(subset=["Age", "Sex"]).head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Relatives | Fare_Bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | Работяги | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 | Дёшево |
1 | 2 | 1 | Элита | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | Дорого |
2 | 3 | 1 | Работяги | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | Дёшево |
3 | 4 | 1 | Элита | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 | Дорого |
4 | 5 | 0 | Работяги | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | Дёшево |
titanic_full_df.dropna(thresh=12).head() # не менее 12 заполненных колонок
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Relatives | Fare_Bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | Работяги | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 | Дёшево |
1 | 2 | 1 | Элита | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | Дорого |
2 | 3 | 1 | Работяги | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 | Дёшево |
3 | 4 | 1 | Элита | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 | Дорого |
4 | 5 | 0 | Работяги | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 | Дёшево |
titanic_full_df.fillna("ПРОПУСК").head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Relatives | Fare_Bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | Работяги | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | ПРОПУСК | S | 1 | Дёшево |
1 | 2 | 1 | Элита | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 | Дорого |
2 | 3 | 1 | Работяги | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | ПРОПУСК | S | 0 | Дёшево |
3 | 4 | 1 | Элита | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 | Дорого |
4 | 5 | 0 | Работяги | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | ПРОПУСК | S | 0 | Дёшево |
titanic_full_df["Age"].mean()
titanic_full_df["Age"].fillna(value=titanic_df["Age"].mean()).head()
titanic_full_df[["Sex", "Survived"]].pivot_table(index=["Sex"], columns=["Survived"], aggfunc=len)
Survived | 0 | 1 |
---|---|---|
Sex | ||
female | 81 | 233 |
male | 468 | 109 |
titanic_full_df[["Sex", "Survived", "Age"]].pivot_table( values=["Age"], index=["Sex"], columns=["Survived"], aggfunc="mean")
Age | ||
---|---|---|
Survived | 0 | 1 |
Sex | ||
female | 25.046875 | 28.847716 |
male | 31.618056 | 27.276022 |
titanic_full_df.groupby("Pclass")
titanic_full_df.groupby("Pclass").max()
PassengerId | Survived | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Relatives | Fare_Bin | |
---|---|---|---|---|---|---|---|---|---|---|---|
Pclass | |||||||||||
Работяги | 891 | 1 | van Melkebeke, Mr. Philemon | male | 74.0 | 8 | 6 | W./C. 6609 | 69.5500 | 10 | Дёшево |
Средний класс | 887 | 1 | del Carlo, Mr. Sebastiano | male | 70.0 | 3 | 3 | W/C 14208 | 73.5000 | 5 | Дёшево |
Элита | 890 | 1 | Young, Miss. Marie Grice | male | 80.0 | 3 | 4 | WE/P 5735 | 512.3292 | 5 | Дёшево |
titanic_full_df.groupby("Pclass").mean()["Age"]
titanic_full_df.groupby("Pclass").mean().loc["Работяги"]
Другие функции: count, min/max, describe(), first, std ...
titanic_full_df.groupby("Pclass").describe()["Age"]
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Pclass | ||||||||
Работяги | 355.0 | 25.140620 | 12.495398 | 0.42 | 18.0 | 24.0 | 32.0 | 74.0 |
Средний класс | 173.0 | 29.877630 | 14.001077 | 0.67 | 23.0 | 29.0 | 36.0 | 70.0 |
Элита | 186.0 | 38.233441 | 14.802856 | 0.92 | 27.0 | 37.0 | 49.0 | 80.0 |
titanic_full_df.groupby("Pclass").describe()["Age"].transpose()
Pclass | Работяги | Средний класс | Элита |
---|---|---|---|
count | 355.000000 | 173.000000 | 186.000000 |
mean | 25.140620 | 29.877630 | 38.233441 |
std | 12.495398 | 14.001077 | 14.802856 |
min | 0.420000 | 0.670000 | 0.920000 |
25% | 18.000000 | 23.000000 | 27.000000 |
50% | 24.000000 | 29.000000 | 37.000000 |
75% | 32.000000 | 36.000000 | 49.000000 |
max | 74.000000 | 70.000000 | 80.000000 |
titanic_full_df.groupby("Pclass")["Age"].agg(["min", "max", "std"])
min | max | std | |
---|---|---|---|
Pclass | |||
Работяги | 0.42 | 74.0 | 12.495398 |
Средний класс | 0.67 | 70.0 | 14.001077 |
Элита | 0.92 | 80.0 | 14.802856 |
titanic_full_df.groupby("Pclass").agg({"Age": np.mean, "PassengerId": "count"})
Age | PassengerId | |
---|---|---|
Pclass | ||
Работяги | 25.140620 | 491 |
Средний класс | 29.877630 | 184 |
Элита | 38.233441 | 216 |
titanic_full_df.groupby(["Pclass", "Sex"]).mean()["Fare"]
Если вы исползуете циклы, возможно вы что-то делаете не так. Иногда, однако, это необходимо.
ser1
for (index, value) in ser1.iteritems(): print("Страна {}, место {}.".format(index, value))
for index, row in titanic_df.iterrows(): print(index, row["Name"])
for group_name, group in titanic_full_df.groupby("Pclass"): print(group_name, group["Age"].mean())
df1 = pd.DataFrame( { 'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3'] }, index=[0, 1, 2, 3]) df2 = pd.DataFrame( { 'A': ['A4', 'A5', 'A6', 'A7'], 'B': ['B4', 'B5', 'B6', 'B7'], 'C': ['C4', 'C5', 'C6', 'C7'], 'D': ['D4', 'D5', 'D6', 'D7'] }, index=[4, 5, 6, 7]) df3 = pd.DataFrame( { 'A': ['A8', 'A9', 'A10', 'A11'], 'B': ['B8', 'B9', 'B10', 'B11'], 'C': ['C8', 'C9', 'C10', 'C11'], 'D': ['D8', 'D9', 'D10', 'D11'] }, index=[8, 9, 10, 11])
pd.concat([df1,df2,df3])
A | B | C | D | |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
8 | A8 | B8 | C8 | D8 |
9 | A9 | B9 | C9 | D9 |
10 | A10 | B10 | C10 | D10 |
11 | A11 | B11 | C11 | D11 |
pd.concat([df1,df2,df3],axis=1)
A | B | C | D | A | B | C | D | A | B | C | D | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | A2 | B2 | C2 | D2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | A3 | B3 | C3 | D3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | A4 | B4 | C4 | D4 | NaN | NaN | NaN | NaN |
5 | NaN | NaN | NaN | NaN | A5 | B5 | C5 | D5 | NaN | NaN | NaN | NaN |
6 | NaN | NaN | NaN | NaN | A6 | B6 | C6 | D6 | NaN | NaN | NaN | NaN |
7 | NaN | NaN | NaN | NaN | A7 | B7 | C7 | D7 | NaN | NaN | NaN | NaN |
8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A8 | B8 | C8 | D8 |
9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A9 | B9 | C9 | D9 |
10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A10 | B10 | C10 | D10 |
11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A11 | B11 | C11 | D11 |
left = pd.DataFrame({ 'key': ['K0', 'K1', 'K2', 'K3'], 'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3'] }) right = pd.DataFrame({ 'key': ['K0', 'K1', 'K2', 'K3'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3'] })
pd.merge(left, right, how='inner', on='key')
key | A | B | C | D | |
---|---|---|---|---|---|
0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | A1 | B1 | C1 | D1 |
2 | K2 | A2 | B2 | C2 | D2 |
3 | K3 | A3 | B3 | C3 | D3 |
import matplotlib.pyplot as plt import seaborn as sns from pylab import rcParams %matplotlib inline sns.set_style("ticks") rcParams['figure.figsize'] = 12, 6
ser2
ser2.plot();
titanic_full_df.hist();
titanic_full_df["Age"].plot("kde", xlim=(titanic_full_df["Age"].min(), titanic_full_df["Age"].max()));
ser2.plot(kind="bar", fontsize=20);
titanic_full_df["Sex"].value_counts().plot(kind="pie", figsize=(7, 7), fontsize=20);
titanic_full_df["Pclass"].value_counts().plot(kind="pie", figsize=(7, 7), fontsize=20);
titanic_full_df[["Sex", "Survived"]].pivot_table(index=["Sex"], columns=["Survived"], aggfunc=len).plot(kind="bar");
titanic_full_df[["Age", "Survived"]].pivot_table(index=["Age"], columns=["Survived"], aggfunc=len).plot();
Вам предлагается поработать с набором данных о преступности на улицах LA.
Начнём анализ жертв. В наборе данных имеется информация о Возрасте, Поле, и Происхождении каждоый жертвы. Есть ли связь между этими признаками?
Люди какого происхождения чаще всего подвергаются преступлениям?
Комментарии