Reading Notes on 《Python 数据科学实践指南》 (A Practical Guide to Python Data Science)

ZhuYuanxiang 2019-03-04 00:00:00

Overall Review of the Book

C01. Introduction to Python

Python Versions

The Python Interpreter

The Zen of Python

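C02. Python Basics

Basics

Control Flow

Functions and Exceptions

Functions

Exceptions

Strings

Getting Keyboard Input

String Processing

String Operations

Regular Expressions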

C05. Containers and Collections

Tuple

List

Dictionary

Collections

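C06. The Python Standard Library

Math module: math

Time modules: time, datetime, calendar

Random number module: random

Sampling

File handling: glob and fileinput

Compression: bz2 and gzip

Pretty printing: the pprint module

Logging exception tracebacks: the traceback module

Network data exchange: JSON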

C07. Reading and Writing External Data with Python

CSV: the csv module

Excel: the pandas module (see C10)

MySQL: the MySQLdb and torndb modules

PostgreSQL: the psycopg2 module

MongoDB: the pymongo module

ElasticSearch: the elasticsearch module

C08. Solving Statistics Problems with Python

Descriptive Statistics

Data Visualization

C09. Introduction to Web Crawling

request module

Xpath module

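C10. Third-Party Libraries for Data Science

Numpy module

I stopped reading here for now. This book is better suited to readers who already know the material and want to come back later to fill gaps from the author's practical perspective.

Pandas module

Sometimes a DataFrame has too many rows or columns, and print truncates the output.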

import pandas as pd

# Show all columns
pd.set_option("display.max_columns", None)
# Show all rows
pd.set_option("display.max_rows", None)
# Set the maximum displayed width of a cell value to 100 (the default is 50)
pd.set_option("display.max_colwidth", 100)
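The set_option() calls above change the display settings for the rest of the session. As a minimal sketch of a scoped alternative, pd.option_context applies options only inside a with block and restores the previous values afterwards. The DataFrame `df` and its size are made up for illustration.

```python
import pandas as pd

# Hypothetical frame, wide and long enough to be truncated by default.
df = pd.DataFrame({f"col{i}": range(100) for i in range(30)})

before = pd.get_option("display.max_rows")

# The options apply only inside the with block.
with pd.option_context(
    "display.max_rows", None,
    "display.max_columns", None,
):
    full_text = repr(df)

# Outside the block the previous settings are back in effect.
after = pd.get_option("display.max_rows")
default_text = repr(df)
```

Printing `full_text` shows every row and column, while `default_text` is the usual truncated view. This is handy when only one large frame needs to be inspected in full.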

The parameters and available options of set_option(), as given in its docstring:

Parameters

pat : str
Regexp which should match a single option.
Note: partial matches are supported for convenience, but unless you use the
full option name (e.g. x.y.z.option_name), your code may break in future
versions if new options with similar names are introduced.
value :
new value of option.

Returns

None

Raises

OptionError if no such option exists
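The parameters above can be tried out with a small sketch like the following. It uses full option names, as the note recommends, and relies on OptionError being a subclass of KeyError, so catching KeyError is enough here. The invalid option name is deliberately made up.

```python
import pandas as pd

# Set a new value using the full dotted option name, then read it back.
pd.set_option("display.max_colwidth", 100)
changed = pd.get_option("display.max_colwidth")

# Restore the library default for this option.
pd.reset_option("display.max_colwidth")
restored = pd.get_option("display.max_colwidth")

# A name that matches no option raises OptionError (a KeyError subclass).
try:
    pd.set_option("display.no_such_option_xyz", 1)  # deliberately invalid name
    error_raised = False
except KeyError:
    error_raised = True
```

pd.reset_option() is a convenient way to undo experiments with display settings without restarting the interpreter.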

Notes

The available options with its descriptions:

display.chop_threshold : float or None
if set to a float value, all float values smaller than the given threshold
will be displayed as exactly 0 by repr and friends.
[default: None] [currently: None]

display.colheader_justify : ‘left’/‘right’
Controls the justification of column headers. used by DataFrameFormatter.
[default: right] [currently: right]

display.column_space No description available.
[default: 12] [currently: 12]

display.date_dayfirst : boolean
When True, prints and parses dates with the day first, eg 20/01/2005
[default: False] [currently: False]

display.date_yearfirst : boolean
When True, prints and parses dates with the year first, eg 2005/01/20
[default: False] [currently: False]

display.encoding : str/unicode
Defaults to the detected encoding of the console.
Specifies the encoding to be used for strings returned by to_string,
these are generally strings meant to be displayed on the console.
[default: UTF-8] [currently: UTF-8]

display.expand_frame_repr : boolean
Whether to print out the full DataFrame repr for wide DataFrames across
multiple lines, max_columns is still respected, but the output will
wrap-around across multiple “pages” if its width exceeds display.width.
[default: True] [currently: True]

display.float_format : callable
The callable should accept a floating point number and return
a string with the desired format of the number. This is used
in some places like SeriesFormatter.
See formats.format.EngFormatter for an example.
[default: None] [currently: None]

display.height : int
Deprecated.
[default: 60] [currently: 60]
(Deprecated, use display.max_rows instead.)

display.large_repr : ‘truncate’/‘info’
For DataFrames exceeding max_rows/max_cols, the repr (and HTML repr) can
show a truncated table (the default from 0.13), or switch to the view from
df.info() (the behaviour in earlier versions of pandas).
[default: truncate] [currently: truncate]

display.latex.escape : bool
This specifies whether the to_latex method of a DataFrame escapes special
characters.
Valid values: False, True
[default: True] [currently: True]

display.latex.longtable :bool
This specifies whether the to_latex method of a DataFrame uses the longtable
format.
Valid values: False, True
[default: False] [currently: False]

display.latex.repr : boolean
Whether to produce a latex DataFrame representation for jupyter
environments that support it.
(default: False)
[default: False] [currently: False]

display.line_width : int
Deprecated.
[default: 80] [currently: 80]
(Deprecated, use display.width instead.)

display.max_categories : int
This sets the maximum number of categories pandas should output when
printing out a Categorical or a Series of dtype “category”.
[default: 8] [currently: 8]

display.max_columns : int
If max_cols is exceeded, switch to truncate view. Depending on
large_repr, objects are either centrally truncated or printed as
a summary view. ‘None’ value means unlimited.

In case python/IPython is running in a terminal and `large_repr`
equals 'truncate' this can be set to 0 and pandas will auto-detect
the width of the terminal and print a truncated object which fits
the screen width. The IPython notebook, IPython qtconsole, or IDLE
do not run in a terminal and hence it is not possible to do
correct auto-detection.
[default: 20] [currently: 20]

display.max_colwidth : int
The maximum width in characters of a column in the repr of
a pandas data structure. When the column overflows, a “…”
placeholder is embedded in the output.
[default: 50] [currently: 200]

display.max_info_columns : int
max_info_columns is used in DataFrame.info method to decide if
per column information will be printed.
[default: 100] [currently: 100]

display.max_info_rows : int or None
df.info() will usually show null-counts for each column.
For large frames this can be quite slow. max_info_rows and max_info_cols
limit this null check only to frames with smaller dimensions than
specified.
[default: 1690785] [currently: 1690785]

display.max_rows : int
If max_rows is exceeded, switch to truncate view. Depending on
large_repr, objects are either centrally truncated or printed as
a summary view. ‘None’ value means unlimited.

In case python/IPython is running in a terminal and `large_repr`
equals 'truncate' this can be set to 0 and pandas will auto-detect
the height of the terminal and print a truncated object which fits
the screen height. The IPython notebook, IPython qtconsole, or
IDLE do not run in a terminal and hence it is not possible to do
correct auto-detection.
[default: 60] [currently: 60]

display.max_seq_items : int or None
when pretty-printing a long sequence, no more than max_seq_items
will be printed. If items are omitted, they will be denoted by the
addition of “…” to the resulting string.

If set to None, the number of items to be printed is unlimited.
[default: 100] [currently: 100]

display.memory_usage : bool, string or None
This specifies if the memory usage of a DataFrame should be displayed when
df.info() is called. Valid values True,False,’deep’
[default: True] [currently: True]

display.mpl_style : bool
Setting this to ‘default’ will modify the rcParams used by matplotlib
to give plots a more pleasing visual style by default.
Setting this to None/False restores the values to their initial value.
[default: None] [currently: None]

display.multi_sparse : boolean
“sparsify” MultiIndex display (don’t display repeated
elements in outer levels within groups)
[default: True] [currently: True]

display.notebook_repr_html : boolean
When True, IPython notebook will use html representation for
pandas objects (if it is available).
[default: True] [currently: True]

display.pprint_nest_depth : int
Controls the number of nested levels to process when pretty-printing
[default: 3] [currently: 3]

display.precision : int
Floating point output precision (number of significant digits). This is
only a suggestion
[default: 6] [currently: 6]

display.show_dimensions : boolean or ‘truncate’
Whether to print out dimensions at the end of DataFrame repr.
If ‘truncate’ is specified, only print out the dimensions if the
frame is truncated (e.g. not display all rows and/or columns)
[default: truncate] [currently: truncate]

display.unicode.ambiguous_as_wide : boolean
Whether to use the Unicode East Asian Width to calculate the display text
width.
Enabling this may affect performance (default: False)
[default: False] [currently: False]

display.unicode.east_asian_width : boolean
Whether to use the Unicode East Asian Width to calculate the display text
width.
Enabling this may affect performance (default: False)
[default: False] [currently: False]

display.width : int
Width of the display in characters. In case python/IPython is running in
a terminal this can be set to None and pandas will correctly auto-detect
the width.
Note that the IPython notebook, IPython qtconsole, or IDLE do not run in a
terminal and hence it is not possible to correctly detect the width.
[default: 80] [currently: 80]

io.excel.xls.writer : string
The default Excel writer engine for ‘xls’ files. Available options:
‘xlwt’ (the default).
[default: xlwt] [currently: xlwt]

io.excel.xlsm.writer : string
The default Excel writer engine for ‘xlsm’ files. Available options:
‘openpyxl’ (the default).
[default: openpyxl] [currently: openpyxl]

io.excel.xlsx.writer : string
The default Excel writer engine for ‘xlsx’ files. Available options:
‘xlsxwriter’ (the default), ‘openpyxl’.
[default: xlsxwriter] [currently: xlsxwriter]

io.hdf.default_format : format
default format writing format, if None, then
put will default to ‘fixed’ and append will default to ‘table’
[default: None] [currently: None]

io.hdf.dropna_table : boolean
drop ALL nan rows when appending to a table
[default: False] [currently: False]

mode.chained_assignment : string
Raise an exception, warn, or no action if trying to use chained assignment,
The default is warn
[default: warn] [currently: warn]

mode.sim_interactive : boolean
Whether to simulate interactive mode for purposes of testing
[default: False] [currently: False]

mode.use_inf_as_null : boolean
True means treat None, NaN, INF, -INF as null (old way),
False means None and NaN are null, but INF, -INF are not null
(new way).
[default: False] [currently: False]
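Two of the display options listed above are easy to try in isolation. The sketch below uses made-up data: display.float_format takes a callable that turns a float into a string, and display.max_colwidth caps the displayed width of each cell.

```python
import pandas as pd

# Made-up data for illustration only.
prices = pd.Series([3.14159, 2.71828])
notes = pd.DataFrame({"text": ["x" * 60]})

# display.float_format: a callable that maps a float to its string form.
with pd.option_context("display.float_format", "{:.2f}".format):
    rounded_text = repr(prices)

# display.max_colwidth: cells longer than the limit are cut and end in "...".
with pd.option_context("display.max_colwidth", 20):
    short_text = repr(notes)

# A limit larger than the cell keeps the whole string.
with pd.option_context("display.max_colwidth", 100):
    long_text = repr(notes)
```

Both options affect only how values are displayed. The underlying data in `prices` and `notes` is unchanged.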

Scikit-Learn module

C11. Graph Data Analysis

Graph Theory Basics

NetworkX module

Graph Analysis with NetworkX

C12. Big Data Tools

Hadoop

Spark