```
[user00@master ~]$ mkdir baseball
[user00@master ~]$ cd baseball/
[user00@master baseball]$ wget http://seanlahman.com/files/database/lahman2012-csv.zip
[user00@master baseball]$ unzip lahman2012-csv.zip
```
```
[user00@master baseball]$ hadoop fs -mkdir baseball
[user00@master baseball]$ hadoop fs -put *.csv baseball
[user00@master baseball]$ hadoop fs -ls baseball
Found 24 items
-rw-r--r--   1 user00 user00   198529 2015-08-05 11:50 baseball/AllstarFull.csv
-rw-r--r--   1 user00 user00  5730747 2015-08-05 11:50 baseball/Appearances.csv
-rw-r--r--   1 user00 user00     7304 2015-08-05 11:50 baseball/AwardsManagers.csv
-rw-r--r--   1 user00 user00   240867 2015-08-05 11:50 baseball/AwardsPlayers.csv
-rw-r--r--   1 user00 user00    16719 2015-08-05 11:50 baseball/AwardsShareManagers.csv
-rw-r--r--   1 user00 user00   220135 2015-08-05 11:50 baseball/AwardsSharePlayers.csv
-rw-r--r--   1 user00 user00  6488747 2015-08-05 11:50 baseball/Batting.csv
-rw-r--r--   1 user00 user00   644669 2015-08-05 11:50 baseball/BattingPost.csv
-rw-r--r--   1 user00 user00  8171830 2015-08-05 11:50 baseball/Fielding.csv
-rw-r--r--   1 user00 user00   298470 2015-08-05 11:50 baseball/FieldingOF.csv
-rw-r--r--   1 user00 user00   573945 2015-08-05 11:50 baseball/FieldingPost.csv
-rw-r--r--   1 user00 user00   175990 2015-08-05 11:50 baseball/HallOfFame.csv
-rw-r--r--   1 user00 user00   130719 2015-08-05 11:50 baseball/Managers.csv
-rw-r--r--   1 user00 user00     3662 2015-08-05 11:50 baseball/ManagersHalf.csv
-rw-r--r--   1 user00 user00  3049250 2015-08-05 11:50 baseball/Master.csv
-rw-r--r--   1 user00 user00  3602473 2015-08-05 11:50 baseball/Pitching.csv
-rw-r--r--   1 user00 user00   381812 2015-08-05 11:50 baseball/PitchingPost.csv
-rw-r--r--   1 user00 user00   700024 2015-08-05 11:50 baseball/Salaries.csv
-rw-r--r--   1 user00 user00    42933 2015-08-05 11:50 baseball/Schools.csv
-rw-r--r--   1 user00 user00   180758 2015-08-05 11:50 baseball/SchoolsPlayers.csv
-rw-r--r--   1 user00 user00     8369 2015-08-05 11:50 baseball/SeriesPost.csv
-rw-r--r--   1 user00 user00   550032 2015-08-05 11:50 baseball/Teams.csv
-rw-r--r--   1 user00 user00     3238 2015-08-05 11:50 baseball/TeamsFranchises.csv
-rw-r--r--   1 user00 user00     1609 2015-08-05 11:50 baseball/TeamsHalf.csv
```
Running the `hive` command opens Hive's interactive query shell:
```
[user00@master baseball]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.7.1.jar!/hive-log4j.properties
Hive history file=/tmp/user00/hive_job_log_f6a9efc9-aaad-4d2a-9c6e-469664367645_315092396.txt
hive>
```
Any line shown with the `hive>` prompt is a command executed inside this interactive shell. `show databases` lists the databases that currently exist:
```
hive> show databases;
OK
default
user00
Time taken: 0.698 seconds
```
To leave the shell, enter `quit;` or `exit;`:
```
hive> quit;
```
Queries can also be run non-interactively with `hive -e "<HiveQL statement>"`. To create a database, use `create database <database name>;`
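The `create database` step itself is not shown above; a minimal sketch, assuming the database is named after your login (userXX is a placeholder for your own user ID):

```sql
-- Create a per-user database; IF NOT EXISTS makes the statement safe to re-run.
CREATE DATABASE IF NOT EXISTS userXX;
```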
The MASTER table holds player names, dates of birth, and biographical info. From the Lahman readme (section 2.1, MASTER table):

```
lahmanID      Unique number assigned to each player
playerID      A unique code assigned to each player; links the data in this
              file with records in the other files
managerID     An ID for individuals who served as managers
hofID         An ID for individuals who are in the baseball Hall of Fame
birthYear     Year player was born
birthMonth    Month player was born
birthDay      Day player was born
birthCountry  Country where player was born
birthState    State where player was born
birthCity     City where player was born
deathYear     Year player died
deathMonth    Month player died
deathDay      Day player died
deathCountry  Country where player died
deathState    State where player died
deathCity     City where player died
nameFirst     Player's first name
nameLast      Player's last name
nameNote      Note about player's name (usually signifying that they changed
              their name or played under two different names)
nameGiven     Player's given name (typically first and middle)
nameNick      Player's nickname
weight        Player's weight in pounds
height        Player's height in inches
bats          Player's batting hand (left, right, or both)
throws        Player's throwing hand (left or right)
debut         Date that player made first major league appearance
finalGame     Date that player made last major league appearance (blank if
              still active)
college       College attended
lahman40ID    ID used in Lahman Database version 4.0
lahman45ID    ID used in Lahman database version 4.5
retroID       ID used by retrosheet
holtzID       ID used by Sean Holtz's Baseball Almanac
bbrefID       ID used by Baseball Reference website
```
Field | Example | Data type |
---|---|---|
lahmanID | 1 | INT |
playerID | aaronha01 | STRING |
managerID | NULL | INT |
hofID | aaronha01h | STRING |
birthYear | 1934 | INT |
birthMonth | 2 | INT |
birthDay | 5 | INT |
birthCountry | USA | STRING |
birthState | AL | STRING |
birthCity | Mobile | STRING |
deathYear | NULL | INT |
deathMonth | NULL | INT |
deathDay | NULL | INT |
deathCountry | NULL | STRING |
deathState | NULL | STRING |
deathCity | NULL | STRING |
nameFirst | Hank | STRING |
nameLast | Aaron | STRING |
nameNote | NULL | STRING |
nameGiven | Henry Louis | STRING |
nameNick | “Hammer, Hammerin’ Hank, Bad Henry” | STRING |
weight | 180 | INT |
height | 72 | INT |
bats | R | STRING |
throws | R | STRING |
debut | 4/13/1954 | STRING |
finalGame | 10/3/1976 | STRING |
college | NULL | STRING |
lahman40ID | aaronha01 | STRING |
lahman45ID | aaronha01 | STRING |
retroID | aaroh101 | STRING |
holtzID | aaronha01 | STRING |
bbrefID | aaronha01 | STRING |
The general syntax for creating a table is:

```
create table <database name>.<table name> (
    column1 type1,
    ....,
    columnN typeN
)
```
```
[user00@master baseball]$ hive -e "create table $DBID.Master (
    lahmanID INT, playerID STRING, managerID INT, hofID STRING,
    birthYear INT, birthMonth INT, birthDay INT,
    birthCountry STRING, birthState STRING, birthCity STRING,
    deathYear INT, deathMonth INT, deathDay INT,
    deathCountry STRING, deathState STRING, deathCity STRING,
    nameFirst STRING, nameLast STRING, nameNote STRING,
    nameGiven STRING, nameNick STRING,
    weight INT, height INT, bats STRING, throws STRING,
    debut STRING, finalGame STRING, college STRING,
    lahman40ID STRING, lahman45ID STRING,
    retroID STRING, holtzID STRING, bbrefID STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;"
```
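To confirm the table was created with the expected columns, the schema can be inspected; this check is not part of the original walkthrough, and userXX stands in for your own database name:

```sql
-- List the column names and types of the new table.
DESCRIBE userXX.master;
```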
Decimal datatype was introduced in Hive 0.11.0 (HIVE-2693) and revised in Hive 0.13.0 (HIVE-3976).
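Because type support is limited in this Hive version, the date columns debut and finalGame are declared as STRING above. A hedged sketch of parsing those M/d/yyyy strings at query time with the built-in `unix_timestamp` and `from_unixtime` functions (the column names come from the table definition above):

```sql
-- Convert the debut string (e.g. 4/13/1954) into a sortable yyyy-MM-dd form.
SELECT nameFirst, nameLast,
       from_unixtime(unix_timestamp(debut, 'M/d/yyyy'), 'yyyy-MM-dd') AS debut_date
FROM master
LIMIT 10;
```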
Data is loaded into the table with `LOAD DATA LOCAL INPATH '<file path>' [OVERWRITE] INTO TABLE <table name>`:

```
[user00@master baseball]$ hive -e "LOAD DATA LOCAL INPATH \"Master.csv\" OVERWRITE INTO TABLE $DBID.master;"
```

If it runs successfully, you will see output like the following:
```
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.7.1.jar!/hive-log4j.properties
Hive history file=/tmp/user00/hive_job_log_a399a8dd-f313-4b63-929a-f545bfbdd559_1877336452.txt
Copying data from file:/home/user00/baseball/Master.csv
Copying file: file:/home/user00/baseball/Master.csv
Loading data to table user00.master
rmr: DEPRECATED: Please use 'rm -r' instead.
Moved: 'hdfs://master:8020/user/hive/warehouse/user00.db/master' to trash at: hdfs://master:8020/user/user00/.Trash/Current
chgrp: changing ownership of '/user/hive/warehouse/user00.db/master': User does not belong to hive
Table user00.master stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 3049250, raw_data_size: 0]
OK
Time taken: 1.558 seconds
```
Alternatively, run the statement from the interactive shell, manually replacing userXX in the HiveQL (XX is a number) with your own ID:

```
LOAD DATA LOCAL INPATH "Master.csv" OVERWRITE INTO TABLE userXX.master;
```

This statement uses the LOCAL keyword, so it reads Master.csv from the current working directory. Without LOCAL, the path refers to a file on HDFS, in which case an absolute path is recommended, e.g. `LOAD DATA INPATH "/user/userXX/baseball/Master.csv" OVERWRITE INTO TABLE userXX.master;`. The result looks like this:
```
hive> LOAD DATA INPATH "/user/user00/baseball/Master.csv" OVERWRITE INTO TABLE user00.master;
Loading data to table user00.master
rmr: DEPRECATED: Please use 'rm -r' instead.
Moved: 'hdfs://master:8020/user/hive/warehouse/user00.db/master' to trash at: hdfs://master:8020/user/user00/.Trash/Current
chgrp: changing ownership of '/user/hive/warehouse/user00.db/master/Master.csv': User does not belong to hive
Table user00.master stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 3049250, raw_data_size: 0]
OK
Time taken: 1.309 seconds
```
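Note that a LOAD from HDFS (without LOCAL) moves the source file into the Hive warehouse directory rather than copying it, so Master.csv disappears from its original HDFS location. This can be verified with a listing (illustrative commands; the warehouse path matches the log above, and userXX is a placeholder):

```shell
hadoop fs -ls baseball                                # Master.csv is no longer listed here
hadoop fs -ls /user/hive/warehouse/userXX.db/master   # ...it now lives here
```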
`SHOW DATABASES;` lists the available databases:
```
[user00@master baseball]$ hive
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-0.10.0-cdh4.7.1.jar!/hive-log4j.properties
Hive history file=/tmp/user00/hive_job_log_33cf301e-65b1-4965-9ea6-af0895bd05e6_1775863642.txt
hive> SHOW DATABASES;
OK
default
user00
Time taken: 0.665 seconds
```
Switch to your database with `USE <database name>;`:
```
hive> USE userXX;
OK
Time taken: 0.014 seconds
```
`SHOW TABLES;` lists the tables in the current database:
```
hive> SHOW TABLES;
OK
master
mysql_data
Time taken: 0.086 seconds
```
Use `SELECT * FROM <table name> WHERE <condition>;` to query the data:
```
hive> SELECT * FROM Master LIMIT 10;
hive> SELECT lahmanID FROM Master LIMIT 10;
hive> SELECT lahmanID FROM Master WHERE birthyear > 1900 LIMIT 10;
hive> SELECT COUNT(*) FROM Master;
```
```
hive> SELECT COUNT(*) FROM Master;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
Starting Job = job_201508050909_0008, Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201508050909_0008
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_201508050909_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-08-05 12:23:57,169 Stage-1 map = 0%, reduce = 0%
2015-08-05 12:24:02,201 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.74 sec
2015-08-05 12:24:03,212 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.74 sec
2015-08-05 12:24:04,220 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.74 sec
2015-08-05 12:24:05,229 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.74 sec
2015-08-05 12:24:06,237 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.74 sec
2015-08-05 12:24:07,251 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.37 sec
2015-08-05 12:24:08,261 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.37 sec
2015-08-05 12:24:09,271 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.37 sec
MapReduce Total cumulative CPU time: 3 seconds 370 msec
Ended Job = job_201508050909_0008
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1  Cumulative CPU: 3.37 sec  HDFS Read: 3049479  HDFS Write: 6  SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 370 msec
OK
18126
Time taken: 16.922 seconds
hive> quit;
```
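Aggregations other than a plain count compile to MapReduce jobs in the same way. One more illustration, not from the original walkthrough (the column names are taken from the table definition above):

```sql
-- Count players per birth country, largest first.
SELECT birthCountry, COUNT(*) AS players
FROM master
GROUP BY birthCountry
ORDER BY players DESC
LIMIT 5;
```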
Adding the `-S` (silent) flag suppresses the MapReduce progress messages that Hive prints to standard error (STDERR), leaving only the query results:
```
[user00@master baseball]$ hive -S -e "use $DBID; SELECT birthyear, lahmanID, nameFirst FROM master WHERE birthyear > 1900 limit 10;"
1934    1     Hank
1939    2     Tommie
1954    3     Don
1972    4     Andy
1951    11    Glenn
1972    12    Jeff
1967    13    Jim
1969    14    Kurt
1968    15    Kyle
1967    17    Paul
```
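Because `-S` leaves only the result rows on STDOUT, it is convenient for scripting; a sketch, reusing the $DBID variable from the examples above:

```shell
# Save the row count to a file; progress messages go to STDERR and are discarded.
hive -S -e "SELECT COUNT(*) FROM $DBID.master;" > master_count.txt 2>/dev/null
```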