Storing Data in Hive: A Practical Walkthrough

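The storage requirement is as follows: a large volume of article data is generated every day, each article record contains multiple fields such as title, content, URL, and publication time, and the data is never updated after it is written. Hive was therefore chosen as the data warehouse for this data. The steps and caveats of storing the data in Hive are described below.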

Creating the Table

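Create the external table toutiao_category with the statement shown below. An external table is used for flexibility in how the data is stored: if an external table is later dropped, only its metadata is deleted, while the data files remain and can still be read for analysis.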

CREATE EXTERNAL TABLE `toutiao_category`(                               
  `toutiao_has_mp4_video` int,                                           
  `toutiao_repin_count` int,                                             
  `abstract` string,                                                     
  `article_ptime` int,                                                   
  `toutiao_recommend` int,                                               
  `toutiao_article_type` int,                                           
  `category` string,                                                     
  `docid` bigint,                                                       
  `bury_count` int,                                                     
  `title` string,                                                       
  `content` string,                                                     
  `source` string,                                                       
  `comment_count` int,                                                   
  `article_url` string,                                                 
  `toutiao_middle_mode` string,                                         
  `toutiao_datetime` string,                                             
  `toutiao_aggr_type` int,                                               
  `toutiao_article_sub_type` int,                                       
  `toutiao_external_visit_count` int,                                   
  `ctime` int,                                                           
  `toutiao_favorite_count` int,                                         
  `toutiao_impression_count` int,                                       
  `toutiao_keywords` string,                                             
  `digg_count` int,                                                     
  `toutiao_more_mode` string,                                           
  `toutiao_go_detail_count` int,                                         
  `origin_url` string)                                                   
PARTITIONED BY (                                                         
  `date` string)                                                         
ROW FORMAT DELIMITED                                                     
  FIELDS TERMINATED BY '\t'                                             
  LINES TERMINATED BY '\n'                                               
STORED AS INPUTFORMAT                                                   
  'com.hadoop.mapred.DeprecatedLzoTextInputFormat'                       
OUTPUTFORMAT                                                             
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'           
LOCATION                                                                 
  'hdfs://heracles/user/mediadata/hive/warehouse/news/toutiao_category';

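In the CREATE TABLE statement:

"PARTITIONED BY (`date` string)" partitions the table by day, which is similar to splitting a table in a relational database. Each day's data is stored in its own directory, so a query that filters on the day reads only the directories of the matching days, which speeds up the query;
"ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'" means that the stored data is split into rows on "\n" and each row is split into fields on "\t". The number, order, and types of the resulting fields should match the column definitions in the CREATE TABLE statement;
"STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'" is specified because the data files are compressed with the LZO algorithm, so the input format is set to "com.hadoop.mapred.DeprecatedLzoTextInputFormat";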
"LOCATION 'hdfs://heracles/user/mediadata/hive/warehouse/news/toutiao_category'" is the root directory of the table data in HDFS.
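Because rows and fields are delimited only by "\n" and "\t", a tab or line break inside a title or article body would shift or split the columns. The following is a minimal sketch of how an export step might guard against that. The helper names, the five-column subset in FIELDS, and the choice to replace the offending characters with spaces are all assumptions made for illustration and are not part of the original pipeline. "\N" is the default NULL marker of Hive's delimited text format.

```python
# Illustrative subset of the table's columns; a real export would list
# every column in exactly the order used in the CREATE TABLE statement.
FIELDS = ["docid", "title", "content", "article_url", "ctime"]


def clean(value):
    """Render one field so it cannot break the row/field layout."""
    if value is None:
        return r"\N"  # default NULL marker in Hive's delimited text format
    text = str(value)
    # Tabs and line breaks inside a value would be read as delimiters.
    for ch in ("\t", "\r", "\n"):
        text = text.replace(ch, " ")
    return text


def to_row(article):
    """Serialize a dict into one tab-delimited line (without the trailing newline)."""
    return "\t".join(clean(article.get(name)) for name in FIELDS)


def from_row(line):
    """Split a line back into named fields, the way the table definition implies."""
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))
```

With this sketch, every serialized row contains exactly len(FIELDS) - 1 tab characters, so the number and order of fields stay consistent with the column list.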

Compressing the Data

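Export one day of article data as one record per line, with the fields of each line separated by "\t", then compress the file with the lzop command: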

lzop toutiao_category_20160101.txt -odata.lzo

The compressed file is about 50% smaller than the original, which saves storage space.

Loading the Data

On the company cluster, Hive is accessed by connecting to HiveServer2 with BeeLine. Use BeeLine to load the data into the toutiao_category table:

/usr/lib/hive/bin/beeline -u "jdbc:hive2://xxx.xxx.xxx.xxx:xxx/mediadata_news;principal=xxx" -e "load data local inpath '/data/rsync_dir/news/data.lzo' overwrite into table toutiao_category partition (date='20160101');"

After the command runs, the local file "/data/rsync_dir/news/data.lzo" is loaded into the "20160101" partition of toutiao_category:

/user/mediadata/hive/warehouse/news/toutiao_category/date=20160101/data.lzo
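When the load is repeated every day, it can help to generate the statement and the expected target path from the date instead of typing them by hand. The sketch below only builds strings and does not talk to Hive. The function names load_statement and partition_file are hypothetical, and the table location passed in is assumed to be the path part of the LOCATION in the CREATE TABLE statement.

```python
from datetime import datetime
import posixpath


def load_statement(local_path, table, day):
    """Build the LOAD DATA statement for one daily partition."""
    # Reject a malformed day early; strptime raises ValueError on bad input.
    datetime.strptime(day, "%Y%m%d")
    return (
        f"load data local inpath '{local_path}' overwrite into table {table} "
        f"partition (date='{day}');"
    )


def partition_file(table_location, day, local_path):
    """Return the HDFS path where the loaded file is expected to appear."""
    return posixpath.join(
        table_location, f"date={day}", posixpath.basename(local_path)
    )
```

The string returned by load_statement can be passed to BeeLine with -e, and partition_file gives the path to check afterwards.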

Indexing the Compressed Data

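When a Hive table is queried, Hive translates the SQL into MapReduce jobs. A plain lzo file cannot be split, so a MapReduce job that reads the data.lzo file loaded into toutiao_category assigns only one Mapper task to it. To speed up queries, build an index for the lzo file so that a MapReduce job can assign multiple Mapper tasks to a single lzo file: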

hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar com.hadoop.compression.lzo.LzoIndexer /user/mediadata/hive/warehouse/news/toutiao_category/date=20160101/data.lzo

After it runs, an index file is added:

/user/mediadata/hive/warehouse/news/toutiao_category/date=20160101/data.lzo.index
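To see why an index makes the file splittable, it helps to think of an lzo file as a sequence of independently compressed blocks, with the index recording the byte offset at which each block starts. A reader can only begin decompressing at a block boundary, so splits have to be aligned to those offsets. The following is a simplified model of that idea, not the actual hadoop-lzo implementation. The function name, the offsets, and the split size are made up for illustration.

```python
import bisect


def aligned_splits(block_offsets, file_size, target_split_size):
    """Return (start, end) byte ranges whose boundaries fall on block offsets.

    block_offsets must be sorted and smaller than file_size. An empty list
    models a file without an index.
    """
    if target_split_size <= 0:
        raise ValueError("target_split_size must be positive")
    if not block_offsets:
        # No index: nothing is known about block boundaries,
        # so the whole file has to be read by a single task.
        return [(0, file_size)]

    splits = []
    start = 0
    while start < file_size:
        wanted_end = start + target_split_size
        # Move the boundary forward to the next known block start.
        i = bisect.bisect_left(block_offsets, wanted_end)
        end = block_offsets[i] if i < len(block_offsets) else file_size
        splits.append((start, end))
        start = end
    return splits
```

In this model a file without an index always yields one range, which corresponds to the single Mapper described above, while an indexed file yields several ranges that can be processed in parallel.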

Querying the Data

Run a SQL query with BeeLine:

0: jdbc:hive2://xxx.xxx.xxx.xxx:xxx/mediadata_n> select ctime from toutiao_category where date=20160101 limit 10;
+-------------+
|    ctime    |
+-------------+
| 1451663999  |
| 1451663998  |
| 1451663980  |
| 1451663967  |
| 1451663956  |
| 1451663956  |
| 1451663955  |
| 1451663945  |
| 1451663933  |
| 1451663933  |
+-------------+
10 rows selected (19.087 seconds)
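The ctime values in this result look like Unix timestamps in seconds. Assuming that is the case, and assuming the daily partitions follow China Standard Time (UTC+8), the sketch below converts a ctime value into a readable time and into the partition value it would belong to. Both assumptions and the function names are illustrative and are not stated in the original setup.

```python
from datetime import datetime, timedelta, timezone

# Assumption: partition days are cut in China Standard Time (UTC+8).
CST = timezone(timedelta(hours=8))


def readable_time(ctime, tz=CST):
    """Format a Unix timestamp (seconds) as a human-readable local time."""
    return datetime.fromtimestamp(ctime, tz).strftime("%Y-%m-%d %H:%M:%S")


def partition_day(ctime, tz=CST):
    """Return the yyyymmdd partition value a timestamp would fall into."""
    return datetime.fromtimestamp(ctime, tz).strftime("%Y%m%d")
```

Under these assumptions, the first row, 1451663999, corresponds to 2016-01-01 23:59:59, which is the last second of the 20160101 partition.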