How to apply for a Taobao consumer loan

TAG Everbright Trust Qianxin No. 1 Collective Fund Trust Plan (Suzhou Jinchang/Huqiu wealth-management listing, Suzhou Lieju.com)
Location: Wuhan, Hubei
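Trust and asset-management products: contact wealth manager Mr. Liu (WeChat same as phone number), QQ: . Honest communication, efficient service, counterparties selected from across the market.

Product: [TAG] Everbright Trust Qianxin No. 1 Collective Fund Trust Plan
Size: RMB 400 million
Term: 24 months
Expected return by subscription amount:
  RMB 1 million ≤ amount < RMB 3 million: 8.5%
  RMB 3 million ≤ amount < RMB 10 million: 8.8%
  amount ≥ RMB 10 million: 9%
Use of funds: construction of the "Guangzhou Kaisa City Plaza" (广州佳兆业城市广场) project and the acquisition for the "Jingyue project" (京粤项目).
Guarantor: Shuicheng Shuitou (水城水投). At the end of 2016 the company had total assets of RMB 18.377 billion and net assets of RMB 15.584 billion, a debt ratio of only 15.20%; the listing describes its guarantee capacity as strong, and both its issuer rating and bond rating are AA.
Region: Shuicheng, Liupanshui, ranks in the top ten among Guizhou's districts and counties and is home to Liupanshui airport, with convenient transport. In 2016 its GDP was RMB 23.3 billion, general budget revenue RMB 2.1 billion, and disposable fiscal revenue RMB 7.0 billion, which the listing says exceeds areas such as Jianhu, Jiangsu.

For enquiries about Everbright Trust Qianxin No. 1 Collective Fund Trust Plan, contact Mr. Liu (WeChat same as phone number), QQ: . Implicit guarantees on trust and asset-management products are gradually being broken, so how should counterparties be chosen in this new environment? A wealth manager with seven years of experience analyzes new trends in the trust market, screens projects for risk points, and helps protect your principal.
Updated: 15:00:25. Link: http://su.lieju.com/jinrong_danbaodaikuan/.htm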
I have been working on a related problem recently, so here are some of my thoughts.

Figure 1. This figure clearly shows how image classification, object detection, semantic segmentation, and instance segmentation relate to one another. Taken from the COCO dataset paper (https://arxiv.org/pdf/.pdf).

The goal of semantic segmentation is to group the pixels of an image by the object category they belong to. Today's mainstream frameworks are all based on Fully Convolutional Networks (FCN; see https://people.eecs.berkeley.edu/~jonlong/long_shelhamer_fcn.pdf). The key difference between FCN and object-recognition networks such as AlexNet is pixel-wise prediction: FCN outputs a probability for every pixel, whereas AlexNet outputs one prediction per image. AlexNet or VGG can be turned into an FCN with a small trick (caffe/net_surgery.ipynb at master · BVLC/caffe · GitHub, https://github.com/BVLC/caffe/blob/master/examples/net_surgery.ipynb). A bit of gossip: when FCN received the CVPR'15 best paper honorable mention, Yann LeCun and others pointed out that the "FCN" idea had been around for a long time, and that the term "fully connected (FC) layer" in AlexNet is itself misleading, because an FC layer can be viewed as a 1x1 convolution and can therefore accept input images of any size.

Other typical semantic segmentation models include SegNet (http://mi.eng.cam.ac.uk/projects/segnet/), Dilated Convolution Net (https://github.com/fyu/dilation), and DeconvolutionNet (http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Noh_Learning_Deconvolution_Network_ICCV_2015_paper.pdf). Two more bits of gossip: the papers related to SegNet had been submitted for more than two years and, at the time of writing, still had not been accepted (the authors must be in tears), and there is an ongoing debate over the terms deconvolution, dilated convolution, and atrous convolution (a good analysis: Dilated Convolutions and Kronecker Factored Convolutions, http://www.inference.vc/dilated-convolutions-and-kronecker-factorisation/). In my own experience, compared with FCN-style networks that have skip connections, I prefer straight, barrel-shaped networks such as Dilated Net, because networks with skip connections need to normalize activations across layers and are therefore harder to train. Liu Wei has a paper that analyzes this cross-layer normalization trick (http://www.cs.unc.edu/~wliu/papers/parsenet.pdf).

Now to the main point. Semantic segmentation separates out the region of an image occupied by people, but it does not tell you how many people there are or which region belongs to each person. This is where instance segmentation comes in: segmenting each person's region separately is considerably harder than semantic segmentation. For instance segmentation built on top of semantic segmentation, see Jifeng Dai's recent papers: https://arxiv.org/pdf/v1.pdf, https://arxiv.org/pdf/v1.pdf. The general approach is to integrate instance region proposals, score maps, or RoIs on top of a dense feature map and then segment.

Instance segmentation is in turn closely tied to object detection. DeepMask and SharpMask, recently released by Facebook (GitHub - facebookresearch/deepmask: Torch implementation of DeepMask and SharpMask, https://github.com/facebookresearch/deepmask), make the relationship between the two very explicit. I once discussed this with Piotr Dollar, whose own view was: "semantic segmentation is a bad direction, we should focus on object detection." I don't agree with him, but he has a point :) One can imagine that if object proposal and object detection worked very well, instance segmentation would largely be solved as well. One track of the COCO detection challenge (COCO - Common Objects in Context, http://mscoco.org/dataset/#detections-eval) requires predicting segmentation masks rather than bounding boxes; unfortunately only two teams entered this year (if you enter, you are third at worst :p).

To sum up, instance segmentation is the point where semantic segmentation and object detection converge, and it is an important research problem. I am looking forward to instance segmentation algorithms and network architectures that combine the strengths of both.

Figure 2. Scene Parsing (MIT Scene Parsing Challenge 2016, http://sceneparsing.csail.mit.edu/) from the ADE20K dataset (http://groups.csail.mit.edu/vision/datasets/ADE20K/). Every object and every object part in each image is clearly annotated.

Finally, I think one reason people work so hard on semantic segmentation while neglecting instance segmentation is the lack of a good dataset. Images in the PASCAL dataset contain very few instances, and there are only 20 object categories. Let me plug my own work here: our group recently created a scene parsing dataset and challenge (MIT Scene Parsing Challenge 2016, http://sceneparsing.csail.mit.edu/). The biggest difference between scene parsing and semantic segmentation here is that we cover 150 concept categories (discrete object categories such as person, car, and table, as well as many stuff categories such as floor, ceiling, and wall), and every pixel in the image must be predicted. Segmenting classes such as floor, ceiling, and wall matters a great deal for applications like robot navigation, but these classes have no notion of instances. This year's scene parsing challenge used the semantic segmentation framework; participants proposed quite a few novel models, and it was well received. Next year's scene parsing challenge (ICCV'17) will add an instance segmentation track, which we hope will push instance segmentation forward.

Beyond that, semantic segmentation is useful in many places. For example, a former PhD student in our lab applied it to detecting and segmenting cancer cells in medical imaging (https://people.csail.mit.edu/khosla/papers/arxiv2016_Wang.pdf), won an award, and went on to found a startup :)
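The remark above that a fully connected layer can be viewed as a convolution can be checked numerically. What follows is a minimal NumPy sketch, not the Caffe net-surgery code: the shapes, the random weights, and the helper `conv2d_valid` are illustrative assumptions. It reinterprets an FC weight matrix as convolution kernels that cover the FC layer's original input window (any later FC layers would become 1x1 convolutions), then slides the same kernels over a larger input to get a map of scores instead of a single prediction.

```python
import numpy as np

def conv2d_valid(x, kernels):
    """Valid cross-correlation. x: (C, H, W); kernels: (K, C, kh, kw)."""
    K, C, kh, kw = kernels.shape
    _, H, W = x.shape
    out = np.empty((K, H - kh + 1, W - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw]
            # Dot product of every kernel with the current window
            out[:, i, j] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
C, H, W, K = 3, 4, 4, 5                       # hypothetical feature shape and class count
fc_weight = rng.normal(size=(K, C * H * W))   # FC layer on a flattened C x H x W input

# Reinterpret the FC weights as K convolution kernels of size C x H x W
conv_kernels = fc_weight.reshape(K, C, H, W)

x = rng.normal(size=(C, H, W))
fc_out = fc_weight @ x.ravel()                # one prediction vector per image
conv_out = conv2d_valid(x, conv_kernels)      # shape (K, 1, 1)
same = np.allclose(fc_out, conv_out[:, 0, 0])

# On a larger input the same kernels produce a spatial map of predictions
big = rng.normal(size=(C, 6, 7))
score_map = conv2d_valid(big, conv_kernels)   # shape (K, 3, 4)
```

The score map is coarse because no padding or upsampling is applied; real FCNs add upsampling layers to recover per-pixel resolution.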
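Thanks for the invite. Of course it is a record-setting achievement; congratulations to Kaiming. I have some connection with Kaiming: we have known each other for six or seven years. We overlapped for a year while I was doing my master's at CUHK, we kept in touch on and off afterwards, and this year we co-organized a CVPR'17 tutorial, Deep Learning for Objects and Scenes (http://deeplearning.csail.mit.edu/) (plug: the lecture videos are already online). I have not written on Zhihu for a long time and I cannot make it to ICCV, so today let me write a few words about Kaiming as I see him.

I have followed his research ever since he published the Dark Channel paper. That was computer vision in the pre-deep-learning era: nothing really worked yet, and LDA and all kinds of graphical models were in fashion. His low-level vision papers were eye-opening. They often start from a phenomenon, trace it back to the underlying principle, and then propose a simple and effective solution. I do not work on low-level vision myself, but reading these papers gave me a delightful "aha" feeling and a lot of inspiration. This way of thinking, starting from phenomena and problems and tracing them back to their essence, laid the groundwork for his later, more widely known work.

After finishing his PhD at CUHK, Kaiming became a researcher at MSRA and began leading teams in the ImageNet competition, working on image classification. Moving from low-level vision to high-level vision is a big shift for most researchers. But it coincided with the deep learning wave, and how to train better classification networks is a very empirical research question. Neural networks are too complex for theory to offer much guidance, so they behave more like a phenomenon to be studied.

Kaiming's phenomenon-to-essence approach therefore let him pinpoint many problems in neural networks and propose effective solutions. For example, working on how to prevent vanishing gradients, he derived Parametric ReLU and went on to propose ResNet, which became legendary (on ResNet, see another answer of mine, "Zhou Bolei: Why are today's CNN models all tweaks of GoogleNet, VGGNet, or AlexNet?", https://www.zhihu.com/question//answer/). As another example, on the question of how to use CNN feature maps more effectively in object detection, he proposed spatial pyramid pooling networks (SPP-net), followed by Fast R-CNN, Faster R-CNN together with Ross, and now the award-winning Mask R-CNN. You can clearly see the continuity in this line of work. Grinding away at one research problem for five or six years is rare in this age of distractions. Image classification and object detection are core problems in computer vision; that Kaiming solved them so elegantly is truly admirable, and the impact on the field is huge (even AlphaGo Zero uses residual blocks, so you can imagine how widely ResNet is used in computer vision research and products). Of course, many top collaborators took part in this work, such as Boss Sun at MSRA (now at Face++), two outstanding interns, Xiangyu Zhang and Shaoqing Ren, and FAIR's Ross and Piotr; I will not go into detail here.

Kaiming only moved from MSRA to the United States to join Facebook AI Research late last summer, and in under a year he produced Mask R-CNN, a secret weapon. Mask R-CNN combines semantic segmentation and object detection and has become a powerful tool for instance segmentation (last year I wrote an answer about instance segmentation, "Zhou Bolei: Is instance segmentation much harder than semantic segmentation?", https://www.zhihu.com/question//answer/; I did not expect Mask R-CNN to appear so soon). ResNet and Mask R-CNN are widely deployed across Facebook's engineering pipelines, so his contribution to the company must be very large. Three weeks ago I attended a workshop at FB headquarters and chatted with him about recent developments and new research directions. Your Kaiming is still fighting on the front line of coding, haha. I will not reveal what big move he is brewing next; just look forward to it :).

Finally, keep an eye on the workshop on the last day of ICCV'17 (this weekend), COCO + Places 2017 (https://places-coco2017.github.io/). I helped organize this joint challenge; the tasks include object detection, keypoint detection, scene parsing, and instance segmentation. One highlight is how the FAIR team of Kaiming and Ross fares against Chinese vision companies such as Face++ and SenseTime. The results will be announced that day. They are quite interesting and worth thinking about.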
1 Project Overview

Alan Mathison Turing, a British mathematician and logician, is regarded as the father of the computer. He entered King's College, Cambridge in 1931 and after graduating went to Princeton University in the United States for his PhD. He returned to Cambridge after the outbreak of the Second World War and later helped the military break Germany's famous Enigma cipher system, helping the Allies win the war.
In 1936 Turing submitted a paper to an authoritative London mathematics journal titled "On Computable Numbers, with an Application to the Entscheidungsproblem". In this groundbreaking paper he gave a rigorous mathematical definition of "computability" and proposed the famous "Turing machine". A Turing machine is not a concrete machine but a conceptual model: a very simple yet extremely powerful computing device that can compute every computable function imaginable. The Turing machine stands alongside the von Neumann machine in the history of computing. In October 1950 Turing published another paper, "Can Machines Think?", an epoch-making work that earned him the title "father of artificial intelligence".

- Problem this project addresses

This project applies machine learning to loan data from the P2P platform Lending Club to build a loan default prediction model. The model predicts whether a new loan applicant will default, which in turn informs the decision to lend.

- Modeling approach

The figure below shows the machine learning workflow for this project. In practice, every step is an iterative process.

[Figure: machine learning workflow]

2 Project Background

Lending Club, a peer-to-peer lending company based in San Francisco, was founded in 2006. It was the first peer-to-peer lender to register its offerings as securities with the U.S. Securities and Exchange Commission (SEC). Its biggest difference from traditional lenders is that its online marketplace connects individual investors directly with individual borrowers. This shortens the path that money travels and, in particular, bypasses large banks and other traditional financial institutions, so both investors and borrowers get a better and faster deal: investors can earn higher returns, and borrowers can obtain relatively low interest rates.

For more about Lending Club, see the Huxiu article 《来认识一下即将上市的全球最大P2P网贷公司Lending Club》 (https://www.huxiu.com/article/41472/1.html). For Lending Club's business in Q2 2017, see my previous project report, "A CPA Takes You Through Risk Analysis (EDA)" (https://zhuanlan.zhihu.com/p/).

3 Scenario Analysis (Algorithm Selection)

When an applicant applies for a loan on Lending Club, the platform has the customer fill in an application form, online or offline, to collect basic information: age, gender, marital status, education, loan amount, assets, and so on. It usually also draws on third parties such as credit bureaus or FICO. These attributes are used to fit a predictive model, which lets Lending Club predict whether a loan application will default and decide whether to lend to the applicant.

Based on this business logic and the data exploration in the previous report, "A CPA Takes You Through Risk Analysis (EDA)" (https://zhuanlan.zhihu.com/p/), the scenario breaks down as follows.

1) We train a model on users' historical behavior (multi-dimensional features from historical data, plus whether the loan defaulted) and use it to assess whether a new borrower is able and willing to repay, that is, to predict whether the applicant will default. This is a supervised learning scenario, because both the features and the loan status (the target column) are known. Deciding whether an applicant will default is a binary classification problem and can be handled with a classification algorithm; here we choose logistic regression.

2) The previous report, "A CPA Takes You Through Risk Analysis (EDA)" (https://zhuanlan.zhihu.com/p/), shows that part of the data is semi-structured and needs feature abstraction.

In summary:

- We learn from historical records to predict whether a loan will default; this is a supervised learning scenario, and we choose the logistic regression algorithm.
- The data is semi-structured and requires feature abstraction.

[Figure: summary of the scenario analysis]

In this report I record both how I use Python to manipulate the data and my thinking about the machine learning process.

4 Data Preprocessing (Pre-Processing Data)

- Preparation

    # Imports
    # NumPy, pandas
    import numpy as np
    import pandas as pd
    # matplotlib, seaborn, pyecharts
    import matplotlib.pyplot as plt
    plt.style.use('ggplot')
    # Use a plot style similar to R's ggplot library
    import seaborn as sns
    sns.set_style('whitegrid')
    %matplotlib inline
    import missingno as msno
    # Ignore warnings that pop up
    import warnings
    warnings.filterwarnings('ignore')
    pd.set_option('display.float_format', lambda x: '%.5f' % x)

- Loading and parsing the data

    data = pd.read_csv('LoanStats_2017Q2.csv', encoding='latin-1', skiprows=1)  # read the data
    data.head()  # show the first 5 rows (the default)

Compute the share of missing values in each column.

    check_null = data.isnull().sum(axis=0).sort_values(ascending=False) / float(len(data))  # fraction of missing values per column
    print(check_null[check_null > 0.2])  # columns with more than 20% missing

Missing values and garbled characters are usually the main indicators of data quality.

When handling missing values, first decide whether the missing data is meaningful. The output above shows that the attributes with many missing values in this dataset, such as id, member_id, and url, contribute little to the prediction, so we simply drop these uninformative, mostly empty attributes. If a missing value is meaningful for an attribute, we additionally need to distinguish whether that attribute is a numeric or a categorical variable.

    thresh_count = len(data) * 0.4  # threshold: minimum number of non-null values a column must have
    data = data.dropna(thresh=thresh_count, axis=1)  # drop columns with fewer than thresh_count non-null values (more than 60% missing)
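The meaning of `thresh` in `dropna` is easy to get backwards: it is the minimum number of non-null values required to keep a column, not the number of missing values tolerated. Here is a small sketch on a toy frame; the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with 5 rows (hypothetical column names)
df = pd.DataFrame({
    "full":   [1, 2, 3, 4, 5],                        # 5 non-null values
    "half":   [1, 2, np.nan, np.nan, 5],              # 3 non-null values
    "sparse": [np.nan, np.nan, np.nan, np.nan, 1.0],  # 1 non-null value
})

thresh_count = int(len(df) * 0.4)   # 2: a column needs at least 2 non-null values
kept = df.dropna(thresh=thresh_count, axis=1)
kept_columns = list(kept.columns)   # "sparse" is dropped, "half" survives
```

With a factor of 0.4, a column is removed only when more than 60% of its values are missing.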

Checking the missing values again shows that the columns with many missing values have been removed.

    data.isnull().sum(axis=0).sort_values(ascending=False) / float(len(data))

    data.to_csv('loans_2017q2_ml.csv', index=False)  # save the initially preprocessed data as CSV

Parse the data again with pandas.

    loans = pd.read_csv('loans_2017q2_ml.csv', encoding='gb2312')
    loans.dtypes.value_counts()  # count columns by data type

We use pandas' nunique method to find the columns that hold a single distinct value and drop them. nunique() (http://linkis.com/readthedocs.io/CpX1C) returns the number of distinct values in a column, excluding missing values.

    loans = loans.loc[:, loans.apply(pd.Series.nunique) != 1]
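Because `nunique` ignores NaN by default, the filter above also removes columns that contain one value plus missing entries, while a column that is entirely NaN (zero distinct values) passes through. A toy illustration with made-up column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "constant":          ["a", "a", "a", "a"],         # one distinct value
    "constant_with_nan": ["b", np.nan, "b", np.nan],   # one distinct value once NaN is ignored
    "all_nan":           [np.nan] * 4,                 # zero distinct values
    "varied":            [1, 2, 2, 3],                 # three distinct values
})

counts = df.apply(pd.Series.nunique)   # NaN is excluded by default (dropna=True)
filtered = df.loc[:, counts != 1]      # same filter as in the text
```

If all-NaN columns should go as well, a stricter condition such as `counts > 1` would be one option.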

Checking the shape shows that the data now has 3 fewer columns than before.

    loans.shape
    out: (105455, 98)

- Missing values: categorical variables

First, look at the missing values in the categorical variables.

    objectColumns = loans.select_dtypes(include=["object"]).columns
    loans[objectColumns].isnull().sum().sort_values(ascending=False)

Among the categorical variables, "int_rate", "revol_util", and "annual_inc" are really numeric, but pandas read them as strings because they contain a "%" sign or commas between digits. To simplify later processing, we convert their data types first.

    loans['int_rate'] = loans['int_rate'].str.rstrip('%').astype('float')
    loans['revol_util'] = loans['revol_util'].str.rstrip('%').astype('float')
    loans['annual_inc'] = loans['annual_inc'].str.replace(",", "").astype('float')
    objectColumns = loans.select_dtypes(include=["object"]).columns  # re-assign objectColumns after the conversion
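A slightly more defensive variant of this conversion is sketched below. It swaps `astype('float')` for `pd.to_numeric(..., errors="coerce")`, which is a substitution and not the article's method, so that malformed entries become NaN instead of raising. The helper `to_number` and the sample values are hypothetical.

```python
import pandas as pd

raw = pd.DataFrame({
    "int_rate":   ["13.99%", " 7.21%", "n/a"],
    "annual_inc": ["65,000", "120,500.50", "48000"],
})

def to_number(col, strip_chars):
    """Remove the given characters, then convert; unparseable entries become NaN."""
    cleaned = col.str.strip()
    for ch in strip_chars:
        cleaned = cleaned.str.replace(ch, "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

int_rate = to_number(raw["int_rate"], "%")      # 13.99, 7.21, NaN
annual_inc = to_number(raw["annual_inc"], ",")  # 65000.0, 120500.5, 48000.0
```

Passing `regex=False` keeps the replacement literal, which matters for characters that have a special meaning in regular expressions.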

Let's get an intuitive feel for the missing values in the categorical variables.

    msno.matrix(loans[objectColumns])  # visualize missing values

[Figure: missingno matrix of the categorical columns]

The figure shows at a glance that "emp_title" and "next_pymnt_d" have many missing values. The sparkline on the right of the figure marks the most complete and least complete rows; its labels, 25 and 0, are the numbers of non-missing values in those rows.

    msno.heatmap(loans[objectColumns])  # correlation between missing values

[Figure: missingno heatmap of the categorical columns]

The figure above shows how missingness is correlated between columns. A correlation of 0 means that whether one variable is missing has no bearing on whether another is missing. A correlation close to 1 or -1 means the two are positively or negatively related: the values tend to be missing together, or one tends to be missing when the other is present. The figure does not entirely match expectations, though. zip_code is fairly strongly correlated with other variables, which is surprising because zip_code normally has nothing to do with them; this may indicate that some rows are incomplete records.

The heatmap is very useful for picking out data-completeness relationships between pairs of variables, but it conveys less when larger groups of variables are involved, and it has no particular support for very large datasets.

We use pandas' fillna() to handle missing values in text variables, creating a category "Unknown" for the missing values of categorical variables.

    objectColumns = loans.select_dtypes(include=["object"]).columns  # select columns whose dtype is object
    loans[objectColumns] = loans[objectColumns].fillna("Unknown")  # fill missing values with the category "Unknown"
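Restricting `fillna("Unknown")` to the object columns matters: numeric columns keep their NaN values for separate treatment. A minimal sketch on a toy frame follows; the column names echo the dataset but the values are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "emp_title": pd.Series(["teacher", None, "nurse"], dtype="object"),
    "grade":     pd.Series(["A", "B", None], dtype="object"),
    "loan_amnt": [1000.0, np.nan, 3000.0],
})

object_cols = df.select_dtypes(include=["object"]).columns
df[object_cols] = df[object_cols].fillna("Unknown")  # numeric columns are untouched

remaining_nan = int(df["loan_amnt"].isna().sum())    # the numeric gap is still there
```

Treating "missing" as its own category lets the model learn whether the absence of a value is itself informative.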
&/code&&/pre&&/div&&p&再次查看分类变量缺失值的情况,发现缺失值已被清洗干净。&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&msno&/span&&span class=&o&&.&/span&&span class=&n&&bar&/span&&span class=&p&&(&/span&&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&n&&objectColumns&/span&&span class=&p&&])&/span& &span class=&c1&&#可视化&/span&
&/code&&/pre&&/div&&figure&&img src=&http://pic4.zhimg.com/v2-0d182c2b020fd7214a3bb_b.jpg& data-caption=&& data-rawwidth=&1315& data-rawheight=&648& class=&origin_image zh-lightbox-thumb& width=&1315& data-original=&http://pic4.zhimg.com/v2-0d182c2b020fd7214a3bb_r.jpg&&&/figure&&ul&&li&&b&缺失值处理——数值型变量&/b&&/li&&/ul&&p&查看数值型变量的缺失值情况。&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&loans&/span&&span class=&o&&.&/span&&span class=&n&&select_dtypes&/span&&span class=&p&&(&/span&&span class=&n&&include&/span&&span class=&o&&=&/span&&span class=&p&&[&/span&&span class=&n&&np&/span&&span class=&o&&.&/span&&span class=&n&&number&/span&&span class=&p&&])&/span&&span class=&o&&.&/span&&span class=&n&&isnull&/span&&span class=&p&&()&/span&&span class=&o&&.&/span&&span class=&n&&sum&/span&&span class=&p&&()&/span&&span class=&o&&.&/span&&span class=&n&&sort_values&/span&&span class=&p&&(&/span&&span class=&n&&ascending&/span&&span class=&o&&=&/span&&span class=&kc&&False&/span&&span class=&p&&)&/span&
&/code&&/pre&&/div&&figure&&img src=&http://pic2.zhimg.com/v2-e31edced3ce6f99db8f1_b.jpg& data-caption=&& data-rawwidth=&1263& data-rawheight=&762& class=&origin_image zh-lightbox-thumb& width=&1263& data-original=&http://pic2.zhimg.com/v2-e31edced3ce6f99db8f1_r.jpg&&&/figure&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&numColumns&/span& &span class=&o&&=&/span& &span class=&n&&loans&/span&&span class=&o&&.&/span&&span class=&n&&select_dtypes&/span&&span class=&p&&(&/span&&span class=&n&&include&/span&&span class=&o&&=&/span&&span class=&p&&[&/span&&span class=&n&&np&/span&&span class=&o&&.&/span&&span class=&n&&number&/span&&span class=&p&&])&/span&&span class=&o&&.&/span&&span class=&n&&columns&/span&
&span class=&n&&msno&/span&&span class=&o&&.&/span&&span class=&n&&matrix&/span&&span class=&p&&(&/span&&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&n&&numColumns&/span&&span class=&p&&])&/span& &span class=&c1&&#缺失值可视化&/span&
&/code&&/pre&&/div&&figure&&img src=&http://pic2.zhimg.com/v2-5b4afa0a8a1352417aaf2c3f6d26c4d5_b.jpg& data-caption=&& data-rawwidth=&1231& data-rawheight=&472& class=&origin_image zh-lightbox-thumb& width=&1231& data-original=&http://pic2.zhimg.com/v2-5b4afa0a8a1352417aaf2c3f6d26c4d5_r.jpg&&&/figure&&div class=&highlight&&&pre&&code class=&language-text&&&span&&/span&pd.set_option('display.max_columns', len(loans.columns))
loans[numColumns]
&/code&&/pre&&/div&&figure&&img src=&http://pic2.zhimg.com/v2-2bd0fc536b0c4a61eb945_b.jpg& data-caption=&& data-rawwidth=&1359& data-rawheight=&400& class=&origin_image zh-lightbox-thumb& width=&1359& data-original=&http://pic2.zhimg.com/v2-2bd0fc536b0c4a61eb945_r.jpg&&&/figure&&p&从表格发现,第105,451行至105,454行的属性值&b&全为NaN&/b&,这些空行对我们预测模型的构建没有任何意义,在此先单独删除这些行。&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&loans&/span&&span class=&o&&.&/span&&span class=&n&&drop&/span&&span class=&p&&([&/span&&span class=&mi&&105451&/span&&span class=&p&&,&/span&&span class=&mi&&105452&/span&&span class=&p&&,&/span&&span class=&mi&&105453&/span&&span class=&p&&,&/span&&span class=&mi&&105454&/span&&span class=&p&&],&/span& &span class=&n&&inplace&/span& &span class=&o&&=&/span& &span class=&kc&&True&/span&&span class=&p&&)&/span&
&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&n&&numColumns&/span&&span class=&p&&]&/span&&span class=&o&&.&/span&&span class=&n&&tail&/span&&span class=&p&&()&/span& &span class=&c1&&# tail() shows the last 5 rows by default&/span&
&/code&&/pre&&/div&&figure&&img src=&http://pic4.zhimg.com/v2-991871fbc2c86c46991fdbefb8d41533_b.jpg& data-caption=&& data-rawwidth=&1250& data-rawheight=&257& class=&origin_image zh-lightbox-thumb& width=&1250& data-original=&http://pic4.zhimg.com/v2-991871fbc2c86c46991fdbefb8d41533_r.jpg&&&/figure&&p&For missing values in the numeric variables we use mean imputation, via the Imputer class in sklearn's preprocessing module. The strategy parameter accepts mean, median, or most_frequent; see the official documentation for &u&&a href=&http://link.zhihu.com/?target=http%3A//scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&sklearn.preprocessing.Imputer&/a&&/u&.&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&kn&&from&/span& &span class=&nn&&sklearn.preprocessing&/span& &span class=&k&&import&/span& &span class=&n&&Imputer&/span&
&span class=&n&&imr&/span& &span class=&o&&=&/span& &span class=&n&&Imputer&/span&&span class=&p&&(&/span&&span class=&n&&missing_values&/span&&span class=&o&&=&/span&&span class=&s1&&'NaN'&/span&&span class=&p&&,&/span& &span class=&n&&strategy&/span&&span class=&o&&=&/span&&span class=&s1&&'mean'&/span&&span class=&p&&,&/span& &span class=&n&&axis&/span&&span class=&o&&=&/span&&span class=&mi&&0&/span&&span class=&p&&)&/span&
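&span class=&c1&&# axis=0: impute column-wise, so each NaN gets its own column's mean&/span&
&span class=&c1&&# (Imputer was removed in scikit-learn 0.22; newer releases provide sklearn.impute.SimpleImputer)&/span&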
&span class=&n&&imr&/span& &span class=&o&&=&/span& &span class=&n&&imr&/span&&span class=&o&&.&/span&&span class=&n&&fit&/span&&span class=&p&&(&/span&&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&n&&numColumns&/span&&span class=&p&&])&/span&
&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&n&&numColumns&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&n&&imr&/span&&span class=&o&&.&/span&&span class=&n&&transform&/span&&span class=&p&&(&/span&&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&n&&numColumns&/span&&span class=&p&&])&/span&
&/code&&/pre&&/div&&p&&br&&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&msno&/span&&span class=&o&&.&/span&&span class=&n&&matrix&/span&&span class=&p&&(&/span&&span class=&n&&loans&/span&&span class=&p&&)&/span& &span class=&c1&&# check the missing values again&/span&
&/code&&/pre&&/div&&figure&&img src=&http://pic2.zhimg.com/v2-f7b9cdc32f01cf18b45b21d_b.jpg& data-caption=&& data-rawwidth=&1274& data-rawheight=&485& class=&origin_image zh-lightbox-thumb& width=&1274& data-original=&http://pic2.zhimg.com/v2-f7b9cdc32f01cf18b45b21d_r.jpg&&&/figure&&ul&&li&&b&Data filtering&/b&&/li&&/ul&&p&Depending on the data-mining goal, the same dataset often does not need every column for training. A redundant feature duplicates much or all of the information already contained in one or more other attributes; grade and sub_grade, for example, carry overlapping information. Other features are simply irrelevant: zip_code says nothing about a borrower's ability to repay. Next, we filter the data.&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&objectColumns&/span& &span class=&o&&=&/span& &span class=&n&&loans&/span&&span class=&o&&.&/span&&span class=&n&&select_dtypes&/span&&span class=&p&&(&/span&&span class=&n&&include&/span&&span class=&o&&=&/span&&span class=&p&&[&/span&&span class=&s2&&&object&&/span&&span class=&p&&])&/span&&span class=&o&&.&/span&&span class=&n&&columns&/span&
&span class=&n&&var&/span& &span class=&o&&=&/span& &span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&n&&objectColumns&/span&&span class=&p&&]&/span&&span class=&o&&.&/span&&span class=&n&&columns&/span&
&span class=&k&&for&/span& &span class=&n&&v&/span& &span class=&ow&&in&/span& &span class=&n&&var&/span&&span class=&p&&:&/span&
    &span class=&nb&&print&/span&&span class=&p&&(&/span&&span class=&s1&&'&/span&&span class=&se&&\n&/span&&span class=&s1&&Frequency count for variable &/span&&span class=&si&&{0}&/span&&span class=&s1&&'&/span&&span class=&o&&.&/span&&span class=&n&&format&/span&&span class=&p&&(&/span&&span class=&n&&v&/span&&span class=&p&&))&/span&
    &span class=&nb&&print&/span&&span class=&p&&(&/span&&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&n&&v&/span&&span class=&p&&]&/span&&span class=&o&&.&/span&&span class=&n&&value_counts&/span&&span class=&p&&())&/span&
&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&n&&objectColumns&/span&&span class=&p&&]&/span&&span class=&o&&.&/span&&span class=&n&&shape&/span&
&/code&&/pre&&/div&&figure&&img src=&http://pic3.zhimg.com/v2-68d42c8b8561bcbf4a7bbfc6_b.jpg& data-caption=&& data-rawwidth=&1213& data-rawheight=&771& class=&origin_image zh-lightbox-thumb& width=&1213& data-original=&http://pic3.zhimg.com/v2-68d42c8b8561bcbf4a7bbfc6_r.jpg&&&/figure&&ul&&li&sub_grade: duplicates the information in grade&/li&
&li&emp_title: many missing values, and it does not reflect the borrower's actual income or assets&/li&
&li&zip_code: postal code of the address; only partially shown, so not meaningful&/li&
&li&addr_state: state of the applicant's address; does not reflect the borrower's ability to repay&/li&
&li&last_credit_pull_d: the most recent month in which LendingClub pulled the borrower's credit; not meaningful here&/li&
&li&policy_code: always 1, so it carries no information&/li&
&li&pymnt_plan: almost always n&/li&
&li&title: duplicates the information in purpose, and its categories are more fragmented&/li&
&li&next_pymnt_d: date of the next payment; not meaningful&/li&
&li&earliest_cr_line: records when the borrower's earliest credit line was opened&/li&
&li&issue_d: the loan issue date; it leaks information to the model ahead of time&/li&
&li&last_pymnt_d, collection_recovery_fee (also all 0), last_pymnt_amnt: a default-prediction model is a pre-loan risk control, and this post-loan information would distort training, so we remove it &br&&/li&&/ul&&p&We drop the attributes above, which are either duplicated or of no use for building the predictive model.&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&drop_list&/span& &span class=&o&&=&/span& &span class=&p&&[&/span&&span class=&s1&&'sub_grade'&/span&&span class=&p&&,&/span& &span class=&s1&&'emp_title'&/span&&span class=&p&&,&/span&
&span class=&s1&&'title'&/span&&span class=&p&&,&/span& &span class=&s1&&'zip_code'&/span&&span class=&p&&,&/span& &span class=&s1&&'addr_state'&/span&&span class=&p&&,&/span&
&span class=&s1&&'mths_since_last_delinq'&/span& &span class=&p&&,&/span&&span class=&s1&&'initial_list_status'&/span&&span class=&p&&,&/span&&span class=&s1&&'issue_d'&/span&&span class=&p&&,&/span&&span class=&s1&&'last_pymnt_d'&/span&&span class=&p&&,&/span&&span class=&s1&&'last_pymnt_amnt'&/span&&span class=&p&&,&/span&
&span class=&s1&&'next_pymnt_d'&/span&&span class=&p&&,&/span&&span class=&s1&&'last_credit_pull_d'&/span&&span class=&p&&,&/span&&span class=&s1&&'policy_code'&/span&&span class=&p&&,&/span&&span class=&s1&&'collection_recovery_fee'&/span&&span class=&p&&,&/span& &span class=&s1&&'earliest_cr_line'&/span&&span class=&p&&]&/span&
&span class=&n&&loans&/span&&span class=&o&&.&/span&&span class=&n&&drop&/span&&span class=&p&&(&/span&&span class=&n&&drop_list&/span&&span class=&p&&,&/span& &span class=&n&&axis&/span&&span class=&o&&=&/span&&span class=&mi&&1&/span&&span class=&p&&,&/span& &span class=&n&&inplace&/span& &span class=&o&&=&/span& &span class=&kc&&True&/span&&span class=&p&&)&/span&
&/code&&/pre&&/div&&p&The categorical variables are reduced from 28 columns to &b&11 columns&/b&.&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&loans&/span&&span class=&o&&.&/span&&span class=&n&&select_dtypes&/span&&span class=&p&&(&/span&&span class=&n&&include&/span& &span class=&o&&=&/span& &span class=&p&&[&/span&&span class=&s1&&'object'&/span&&span class=&p&&])&/span&&span class=&o&&.&/span&&span class=&n&&shape&/span&
&span class=&c1&&# out: (105448, 11)&/span&
&span class=&n&&loans&/span&&span class=&o&&.&/span&&span class=&n&&select_dtypes&/span&&span class=&p&&(&/span&&span class=&n&&include&/span& &span class=&o&&=&/span& &span class=&p&&[&/span&&span class=&s1&&'object'&/span&&span class=&p&&])&/span&&span class=&o&&.&/span&&span class=&n&&head&/span&&span class=&p&&()&/span& &span class=&c1&&# preview the data again&/span&
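&/code&&/pre&&/div&&figure&&img src=&http://pic2.zhimg.com/v2-a92ef2bf2fde1ece01f51b5c7d70fb95_b.jpg& data-caption=&& data-rawwidth=&1105& data-rawheight=&187& class=&origin_image zh-lightbox-thumb& width=&1105& data-original=&http://pic2.zhimg.com/v2-a92ef2bf2fde1ece01f51b5c7d70fb95_r.jpg&&&/figure&&p&Different algorithms need different data types. Logistic regression, for example, only accepts numeric data, whereas random forests can usually work with both string and numeric data. In the scenario analysis we framed loan-default prediction as a binary classification problem and chose logistic regression as the model. Preprocessing also showed that the data is semi-structured, so the features need further transformation.&/p&&h2&&b&5 Feature Engineering&/b&&/h2&&blockquote&&Coming up with features is difficult, time-consuming, requires expert knowledge. “Applied machine learning” is basically feature engineering.& - Andrew Ng&/blockquote&&p&&br&&/p&&blockquote&&feature engineering is manually designing what the input x's should be&&br& - Tomasz Malisiewicz, answer to &b&&u&&a href=&http://link.zhihu.com/?target=https%3A//www.quora.com/What-is-feature-engineering& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&“What is feature engineering?”&/a&&/u&&/b&&/blockquote&&p&Feature engineering is the most important step in machine learning. In practice it is an iterative process, and most of the time goes into analyzing the business, analyzing cases, and hunting for features. Better features mean a simpler model is enough, and they give the model a better basis for its predictions. In the 2014 Tianchi competition, the winning team's model was not complex, but their features tracked the business closely and reflected a good understanding of the scenario; their results were 16% more accurate than those of Taobao's own recommendation team.&/p&&p&Feature engineering in this project has four main parts: 1. feature derivation, 2. feature abstraction, 3. feature scaling, 4. feature selection.&/p&&ul&&li&&b&5.1 Feature derivation&/b&&/li&&/ul&&p&&b&Feature derivation&/b& means combining existing features to generate new ones. In risk control, a traditional bank obtains a company's basic financial statements (balance sheet, income statement, and cash flow statement). Drawing on mature financial management practice, and depending on the business scenario, it combines line items from those statements to derive new features that reflect different aspects of the company's &b&financial condition&/b&. For example, combining asset and liability items yields features that reflect the &b&company's debt position&/b&, and combining revenue with accounts receivable yields the &b&accounts receivable turnover ratio (capital efficiency)&/b&. The &b&cross-checking relationships between the financial statements&/b& can also be used to generate features that corroborate the &b&quality of the company's reporting&/b&. Doing this well in financial risk control requires familiarity with the business scenarios and a solid command of accounting.&/p&&p&On the Lending Club platform, &installment& is the monthly installment amount of the loan. Dividing 'annual_inc' by 12 gives the applicant's monthly income. Dividing &installment& (monthly debt payment) by ('annual_inc'/12) (monthly income) then gives a new feature, 'installment_feat': the share of monthly income that goes to loan repayment. The larger 'installment_feat' is, the heavier the borrower's repayment burden and the more likely a default.&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&s1&&'installment_feat'&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&s1&&'installment'&/span&&span class=&p&&]&/span& &span class=&o&&/&/span& &span class=&p&&(&/span&&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&s1&&'annual_inc'&/span&&span class=&p&&]&/span& &span class=&o&&/&/span& &span class=&mi&&12&/span&&span class=&p&&)&/span&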
&/code&&/pre&&/div&&ul&&li&&b&5.2 Feature abstraction&/b&&/li&&/ul&&p&Feature abstraction means converting the data into a form the algorithm can work with.&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&c1&&# Define a helper built on pandas' replace():&/span&
&span class=&k&&def&/span& &span class=&nf&&coding&/span&&span class=&p&&(&/span&&span class=&n&&col&/span&&span class=&p&&,&/span& &span class=&n&&codeDict&/span&&span class=&p&&):&/span&
    &span class=&n&&colCoded&/span& &span class=&o&&=&/span& &span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&Series&/span&&span class=&p&&(&/span&&span class=&n&&col&/span&&span class=&p&&,&/span& &span class=&n&&copy&/span&&span class=&o&&=&/span&&span class=&kc&&True&/span&&span class=&p&&)&/span&
    &span class=&k&&for&/span& &span class=&n&&key&/span&&span class=&p&&,&/span& &span class=&n&&value&/span& &span class=&ow&&in&/span& &span class=&n&&codeDict&/span&&span class=&o&&.&/span&&span class=&n&&items&/span&&span class=&p&&():&/span&
        &span class=&n&&colCoded&/span&&span class=&o&&.&/span&&span class=&n&&replace&/span&&span class=&p&&(&/span&&span class=&n&&key&/span&&span class=&p&&,&/span& &span class=&n&&value&/span&&span class=&p&&,&/span& &span class=&n&&inplace&/span&&span class=&o&&=&/span&&span class=&kc&&True&/span&&span class=&p&&)&/span&
    &span class=&k&&return&/span& &span class=&n&&colCoded&/span&
&span class=&c1&&# Encode loan_status: default = 1, normal = 0&/span&
&span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&value_counts&/span&&span class=&p&&(&/span&&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&s2&&&loan_status&&/span&&span class=&p&&])&/span&
&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&s2&&&loan_status&&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&n&&coding&/span&&span class=&p&&(&/span&&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&s2&&&loan_status&&/span&&span class=&p&&],&/span& &span class=&p&&{&/span&&span class=&s1&&'Current'&/span&&span class=&p&&:&/span&&span class=&mi&&0&/span&&span class=&p&&,&/span&&span class=&s1&&'Fully Paid'&/span&&span class=&p&&:&/span&&span class=&mi&&0&/span&&span class=&p&&,&/span&&span class=&s1&&'In Grace Period'&/span&&span class=&p&&:&/span&&span class=&mi&&1&/span&&span class=&p&&,&/span&&span class=&s1&&'Late (31-120 days)'&/span&&span class=&p&&:&/span&&span class=&mi&&1&/span&&span class=&p&&,&/span&&span class=&s1&&'Late (16-30 days)'&/span&&span class=&p&&:&/span&&span class=&mi&&1&/span&&span class=&p&&,&/span&&span class=&s1&&'Charged Off'&/span&&span class=&p&&:&/span&&span class=&mi&&1&/span&&span class=&p&&})&/span&
&span class=&nb&&print&/span&&span class=&p&&(&/span& &span class=&s1&&'&/span&&span class=&se&&\n&/span&&span class=&s1&&After Coding:'&/span&&span class=&p&&)&/span&
&span class=&n&&pd&/span&&span class=&o&&.&/span&&span class=&n&&value_counts&/span&&span class=&p&&(&/span&&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&s2&&&loan_status&&/span&&span class=&p&&])&/span&
&/code&&/pre&&/div&&figure&&img src=&http://pic4.zhimg.com/v2-55e12d58f292bd7ea15fbf_b.jpg& data-caption=&& data-rawwidth=&1108& data-rawheight=&111& class=&origin_image zh-lightbox-thumb& width=&1108& data-original=&http://pic4.zhimg.com/v2-55e12d58f292bd7ea15fbf_r.jpg&&&/figure&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&c1&&# Visualize the loan status distribution&/span&
&span class=&n&&fig&/span&&span class=&p&&,&/span& &span class=&n&&axs&/span& &span class=&o&&=&/span& &span class=&n&&plt&/span&&span class=&o&&.&/span&&span class=&n&&subplots&/span&&span class=&p&&(&/span&&span class=&mi&&1&/span&&span class=&p&&,&/span&&span class=&mi&&2&/span&&span class=&p&&,&/span&&span class=&n&&figsize&/span&&span class=&o&&=&/span&&span class=&p&&(&/span&&span class=&mi&&14&/span&&span class=&p&&,&/span&&span class=&mi&&7&/span&&span class=&p&&))&/span&
&span class=&n&&sns&/span&&span class=&o&&.&/span&&span class=&n&&countplot&/span&&span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&o&&=&/span&&span class=&s1&&'loan_status'&/span&&span class=&p&&,&/span&&span class=&n&&data&/span&&span class=&o&&=&/span&&span class=&n&&loans&/span&&span class=&p&&,&/span&&span class=&n&&ax&/span&&span class=&o&&=&/span&&span class=&n&&axs&/span&&span class=&p&&[&/span&&span class=&mi&&0&/span&&span class=&p&&])&/span&
&span class=&n&&axs&/span&&span class=&p&&[&/span&&span class=&mi&&0&/span&&span class=&p&&]&/span&&span class=&o&&.&/span&&span class=&n&&set_title&/span&&span class=&p&&(&/span&&span class=&s2&&&Frequency of each Loan Status&&/span&&span class=&p&&)&/span&
&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&s1&&'loan_status'&/span&&span class=&p&&]&/span&&span class=&o&&.&/span&&span class=&n&&value_counts&/span&&span class=&p&&()&/span&&span class=&o&&.&/span&&span class=&n&&plot&/span&&span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&o&&=&/span&&span class=&kc&&None&/span&&span class=&p&&,&/span&&span class=&n&&y&/span&&span class=&o&&=&/span&&span class=&kc&&None&/span&&span class=&p&&,&/span& &span class=&n&&kind&/span&&span class=&o&&=&/span&&span class=&s1&&'pie'&/span&&span class=&p&&,&/span& &span class=&n&&ax&/span&&span class=&o&&=&/span&&span class=&n&&axs&/span&&span class=&p&&[&/span&&span class=&mi&&1&/span&&span class=&p&&],&/span&&span class=&n&&autopct&/span&&span class=&o&&=&/span&&span class=&s1&&'&/span&&span class=&si&&%1.2f%%&/span&&span class=&s1&&'&/span&&span class=&p&&)&/span&
&span class=&n&&axs&/span&&span class=&p&&[&/span&&span class=&mi&&1&/span&&span class=&p&&]&/span&&span class=&o&&.&/span&&span class=&n&&set_title&/span&&span class=&p&&(&/span&&span class=&s2&&&Percentage of each Loan status&&/span&&span class=&p&&)&/span&
&span class=&n&&plt&/span&&span class=&o&&.&/span&&span class=&n&&show&/span&&span class=&p&&()&/span&
&/code&&/pre&&/div&&figure&&img src=&http://pic4.zhimg.com/v2-f9cb9c4a2b_b.jpg& data-caption=&& data-rawwidth=&1008& data-rawheight=&504& class=&origin_image zh-lightbox-thumb& width=&1008& data-original=&http://pic4.zhimg.com/v2-f9cb9c4a2b_r.jpg&&&/figure&&p&As noted in the previous report, &b&&u&&a href=&https://zhuanlan.zhihu.com/p/& class=&internal&&A CPA Walks You Through Risk Analysis (EDA)&/a&&/u&&/b&, defaults make up only a small share of the platform's loans. There are 103,743 loans in normal status, or 98.38% of the total. Loan status is the label for our model, and the normal and default classes are heavily imbalanced. Most common machine learning algorithms do not perform well on imbalanced datasets; we address the class imbalance later.&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&object_columns_df&/span& &span class=&o&&=&/span&&span class=&n&&loans&/span&&span class=&o&&.&/span&&span class=&n&&select_dtypes&/span&&span class=&p&&(&/span&&span class=&n&&include&/span&&span class=&o&&=&/span&&span class=&p&&[&/span&&span class=&s2&&&object&&/span&&span class=&p&&])&/span& &span class=&c1&&# select the variables whose dtype is object&/span&
&span class=&nb&&print&/span&&span class=&p&&(&/span&&span class=&n&&object_columns_df&/span&&span class=&o&&.&/span&&span class=&n&&iloc&/span&&span class=&p&&[&/span&&span class=&mi&&0&/span&&span class=&p&&])&/span&
&/code&&/pre&&/div&&figure&&img src=&http://pic4.zhimg.com/v2-18b49a717b5fc11c05a95abeb400a5af_b.jpg& data-caption=&& data-rawwidth=&1257& data-rawheight=&243& class=&origin_image zh-lightbox-thumb& width=&1257& data-original=&http://pic4.zhimg.com/v2-18b49a717b5fc11c05a95abeb400a5af_r.jpg&&&/figure&&p&We convert the variables “delinq_2yrs”, “total_acc”, and “revol_bal” to float (“last_pymnt_amnt” was already removed by drop_list above).&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&s1&&'delinq_2yrs'&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&s1&&'delinq_2yrs'&/span&&span class=&p&&]&/span&&span class=&o&&.&/span&&span class=&n&&apply&/span&&span class=&p&&(&/span&&span class=&k&&lambda&/span& &span class=&n&&x&/span&&span class=&p&&:&/span& &span class=&nb&&float&/span&&span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&p&&))&/span&
&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&s1&&'total_acc'&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&s1&&'total_acc'&/span&&span class=&p&&]&/span&&span class=&o&&.&/span&&span class=&n&&apply&/span&&span class=&p&&(&/span&&span class=&k&&lambda&/span& &span class=&n&&x&/span&&span class=&p&&:&/span& &span class=&nb&&float&/span&&span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&p&&))&/span&
&span class=&n&&loans&/span&&span class=&p&&[&/span&&span class=&s1&&'revol_bal'&/span&&span class=&p&&]&/span& &span class=&o&&=&/span& &span class=&n&&loans&/span& &span class=&p&&[&/span&&span class=&s1&&'revol_bal'&/span&&span class=&p&&]&/span&&span class=&o&&.&/span&&span class=&n&&apply&/span&&span class=&p&&(&/span&&span class=&k&&lambda&/span& &span class=&n&&x&/span&&span class=&p&&:&/span& &span class=&nb&&float&/span&&span class=&p&&(&/span&&span class=&n&&x&/span&&span class=&p&&))&/span&
&span class=&n&&loans&/span&&span class=&o&&.&/span&&span class=&n&&select_dtypes&/span&&span class=&p&&(&/span&&span class=&n&&include&/span&&span class=&o&&=&/span&&span class=&p&&[&/span&&span class=&s2&&&object&&/span&&span class=&p&&])&/span&&span class=&o&&.&/span&&span class=&n&&describe&/span&&span class=&p&&()&/span&&span class=&o&&.&/span&&span class=&n&&T&/span& &span class=&c1&&# check the data again&/span&
&/code&&/pre&&/div&&figure&&img src=&http://pic4.zhimg.com/v2-6b44f5bcac5a0d95c86d60ac_b.jpg& data-caption=&& data-rawwidth=&1251& data-rawheight=&276& class=&origin_image zh-lightbox-thumb& width=&1251& data-original=&http://pic4.zhimg.com/v2-6b44f5bcac5a0d95c86d60ac_r.jpg&&&/figure&&p&The number of variables of type &object& has been reduced from 30 to 7.&/p&&p&&b&Ordinal values&/b&&/p&&p&Ordinal variables, also called rank data, are non-numeric data that fall into ordered categories. The values are categories, but the categories have a natural order: products graded as first-class, second-class, third-class, and reject, for example. From the previous report, &b&&u&&a href=&https://zhuanlan.zhihu.com/p/& class=&internal&&A CPA Walks You Through Risk Analysis (EDA)&/a&&/u&&/b&, we know that Lending Club grades applicants' credit from A to G and sets the loan interest rate by grade; a grade-A customer has a better credit score than a grade-B customer.&/p&&ul&&li&A &lt; B &lt; C &lt; D &lt; E &lt; F &lt; G; credit risk ordered from low to high&/li&&/ul&&p&&b&Nominal values&/b&&/p&&p&Nominal variables, also called categorical data, are non-numeric data that fall into categories. They result from classifying things and are expressed as labels in words; borrowers split by sex into male and female, for example. The categories have no order, so a nominal variable such as “purpose” cannot be ranked the way an ordinal variable can.&/p&&ul&&li&car &lt; wedding &lt; education &lt; moving &lt; house; an ordering like this defies common sense and means nothing&/li&&/ul&&p&We now split the categorical variables further.&/p&&ul&&li&&b&Ordinal variables&/b&&/li&&ul&&li&grade&/li&&li&emp_length&/li&&/ul&&li&&b&Nominal variables&/b&&/li&&ul&&li&term&/li&&li&home_ownership&/li&&li&verification_status&/li&&li&purpose&/li&&li&application_type&/li&&/ul&&/ul&&p&Each kind of categorical variable needs its own conversion method.&/p&&ul&&li&&b&Mapping ordinal features&/b&&/li&&/ul&&p&First, we abstract the variables “emp_length” and &grade&. The approach is to build a mapping and then apply it with pandas' replace( ); for the details of DataFrame.replace, see the &u&&a href=&http://link.zhihu.com/?target=http%3A//pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html%3Fhighlight%3Dreplace%23pandas.DataFrame.replace& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&official documentation&/a&&/u&.&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&c1&&# Build the mapping used to convert the ordinal variables emp_length and grade&/span&
&span class=&n&&mapping_dict&/span& &span class=&o&&=&/span& &span class=&p&&{&/span&
&span class=&s2&&&emp_length&&/span&&span class=&p&&:&/span& &span class=&p&&{&/span&
&span class=&s2&&&10+ years&&/span&&span class=&p&&:&/span& &span class=&mi&&10&/span&&span class=&p&&,&/span&
&span class=&s2&&&9 years&&/span&&span class=&p&&:&/span& &span class=&mi&&9&/span&&span class=&p&&,&/span&
&span class=&s2&&&8 years&&/span&&span class=&p&&:&/span& &span class=&mi&&8&/span&&span class=&p&&,&/span&
&span class=&s2&&&7 years&&/span&&span class=&p&&:&/span& &span class=&mi&&7&/span&&span class=&p&&,&/span&
&span class=&s2&&&6 years&&/span&&span class=&p&&:&/span& &span class=&mi&&6&/span&&span class=&p&&,&/span&
&span class=&s2&&&5 years&&/span&&span class=&p&&:&/span& &span class=&mi&&5&/span&&span class=&p&&,&/span&
&span class=&s2&&&4 years&&/span&&span class=&p&&:&/span& &span class=&mi&&4&/span&&span class=&p&&,&/span&
&span class=&s2&&&3 years&&/span&&span class=&p&&:&/span& &span class=&mi&&3&/span&&span class=&p&&,&/span&
&span class=&s2&&&2 years&&/span&&span class=&p&&:&/span& &span class=&mi&&2&/span&&span class=&p&&,&/span&
&span class=&s2&&&1 year&&/span&&span class=&p&&:&/span& &span class=&mi&&1&/span&&span class=&p&&,&/span&
&span class=&s2&&&& 1 year&&/span&&span class=&p&&:&/span& &span class=&mi&&0&/span&&span class=&p&&,&/span&
&span class=&s2&&&n/a&&/span&&span class=&p&&:&/span& &span class=&mi&&0&/span&
&span class=&p&&},&/span&
&span class=&s2&&&grade&&/span&&span class=&p&&:{&/span&
&span class=&s2&&&A&&/span&&span class=&p&&:&/span& &span class=&mi&&1&/span&&span class=&p&&,&/span&
&span class=&s2&&&B&&/span&&span class=&p&&:&/span& &span class=&mi&&2&/span&&span class=&p&&,&/span&
&span class=&s2&&&C&&/span&&span class=&p&&:&/span& &span class=&mi&&3&/span&&span class=&p&&,&/span&
&span class=&s2&&&D&&/span&&span class=&p&&:&/span& &span class=&mi&&4&/span&&span class=&p&&,&/span&
&span class=&s2&&&E&&/span&&span class=&p&&:&/span& &span class=&mi&&5&/span&&span class=&p&&,&/span&
&span class=&s2&&&F&&/span&&span class=&p&&:&/span& &span class=&mi&&6&/span&&span class=&p&&,&/span&
&span class=&s2&&&G&&/span&&span class=&p&&:&/span& &span class=&mi&&7&/span&
&span class=&p&&}&/span&
&span class=&p&&}&/span&
&span class=&n&&loans&/span& &span class=&o&&=&/span& &span class=&n&&loans&/span&&span class=&o&&.&/span&&span class=&n&&replace&/span&&span class=&p&&(&/span&&span class=&n&&mapping_dict&/span&&span class=&p&&)&/span& &span class=&c1&&# apply the mapping&/span&
&span class=&n&&loans&/span&&span class=&p&&[[&/span&&span class=&s1&&'emp_length'&/span&&span class=&p&&,&/span&&span class=&s1&&'grade'&/span&&span class=&p&&]]&/span&&span class=&o&&.&/span&&span class=&n&&head&/span&&span class=&p&&()&/span& &span class=&c1&&# inspect the result&/span&
&/code&&/pre&&/div&&figure&&img src=&http://pic2.zhimg.com/v2-0b399f25ddda57ad1e6ef79_b.jpg& data-caption=&& data-rawwidth=&1251& data-rawheight=&213& class=&origin_image zh-lightbox-thumb& width=&1251& data-original=&http://pic2.zhimg.com/v2-0b399f25ddda57ad1e6ef79_r.jpg&&&/figure&&p&The table above shows the ordinal variables after processing: “emp_length” and “grade” are now in a data type the algorithm can work with.&/p&&ul&&li&&b&One-hot encoding&/b&&/li&&/ul&&p&Next, we one-hot encode the nominal variables.&br&We use pandas' &u&&a href=&http://link.zhihu.com/?target=http%3A//pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&get_dummies( )&/a&&/u& to create dummy features, where each dummy column stands for one category of the original variable. We then use pandas' &u&&a href=&http://link.zhihu.com/?target=http%3A//pandas.pydata.org/pandas-docs/stable/generated/pandas.concat.html& class=& wrap external& target=&_blank& rel=&nofollow noreferrer&&concat()&/a&&/u& to join the new dummy features to the original data.&/p&&p&The table that get_dummies returns is sparse in the sense that most of its entries are 0, but it can already be passed to the algorithm for computation.&/p&&div class=&highlight&&&pre&&code class=&language-python3&&&span&&/span&&span class=&n&&n_columns&/span& &span class=&o&&=&/span& &span class=&p&&[&/span&&span class=&s2&&&home_ownership&&/span&&span class=&p&&,&/span& &span class=&s2&&&verification_status&&/span&&span class=&p&&,&/span& &span class=&s2&&&application_type&&/span&&span class=&p&&,&/span&&span class=&s2&&&purpose&&/span&&span class=&p&&,&/span& &span class=&s2&&&term&&/span&&span class=&p&&]&/span&
&span class=&n&&dummy_df&/span& &span class=&o&&=&/span& &span class=&n&&pd&/span&&span class=&o&&.&/span&&spa
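
The listing above breaks off in the source partway through the dummy_df assignment. As a minimal sketch of the get_dummies() plus concat() step described in the text, and not a reconstruction of the author's exact code, the following runs on a small made-up table. The names `loans_demo`, `nominal_cols`, `dummies`, and `encoded`, along with every value in the table, are assumptions chosen purely for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the `loans` table; all values are invented.
loans_demo = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000],
    "home_ownership": ["RENT", "OWN", "RENT"],
    "term": ["36 months", "60 months", "36 months"],
})

# Nominal columns to encode (an assumed subset of the article's n_columns).
nominal_cols = ["home_ownership", "term"]

# get_dummies creates one indicator column per category,
# named "<column>_<category>".
dummies = pd.get_dummies(loans_demo[nominal_cols])

# Join the indicator columns to the original data, then drop the
# original text columns so that only model-ready columns remain.
encoded = pd.concat([loans_demo, dummies], axis=1).drop(columns=nominal_cols)

print(encoded.columns.tolist())
```

Dropping the original text columns after the concat is one possible design choice; it keeps only numeric and indicator columns, which is the form logistic regression requires.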
