Speaker: Chen Junjie (陈俊杰), Senior R&D Engineer, Tencent
Contents
Part 01
Introduction to Advanced Features from the Iceberg Community
Branch and Tag
-- Create a branch/tag for a table
ALTER TABLE table CREATE TAG/BRANCH tagName [AS OF {VERSION snapshotId}] [RETAIN interval {DAYS | HOURS | MINUTES}]
-- Read from a branch
SELECT * FROM table BRANCH/TAG branch_name
-- Insert into a branch
INSERT INTO table BRANCH branch_name SELECT ...
New Table API
createBranch(String name, long snapshotId);
createTag(String name, long snapshotId);
A -> B -> C (master)
\ (tag1)
D -> E (archive branch)
\
F -> G (test branch)
spark().read().format("iceberg").option("branch", branchName).load(table)
df.write().format("iceberg").option("branch", branchName).mode(SaveMode.Append).save(table)  // df: the Dataset<Row> to append
Puffin format
A file format designed to store information, such as indexes and statistics, about data managed in an Iceberg table that cannot be stored directly within the Iceberg manifest.
public interface UpdateStatistics extends PendingUpdate<List<StatisticsFile>> {
  // ...

  /**
   * Remove the table's statistics file for given snapshot.
   *
   * @return this for method chaining
   */
  UpdateStatistics removeStatistics(long snapshotId);
}
Statistics
● Table statistics
  ● Number of rows
  ● Number of distinct values in a column
  ● The fraction of NULL values in a column
  ● Min/max value in a column
  ● The average data size of a column
● How do statistics help the CBO?
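As a rough illustration of how the statistics listed above can feed a cost-based optimizer (CBO), the sketch below estimates the output cardinality of an equality filter from the row count, the number of distinct values, and the NULL fraction, then picks the smaller join input as the build side. The class and method names are hypothetical, and the uniform-distribution assumption is a textbook simplification, not a description of any particular engine.

```java
/** Hypothetical per-column statistics, mirroring the list on the slide. */
final class ColumnStats {
    final long rowCount;        // number of rows in the table
    final long distinctValues;  // number of distinct values (NDV) in the column
    final double nullFraction;  // fraction of NULL values in the column

    ColumnStats(long rowCount, long distinctValues, double nullFraction) {
        this.rowCount = rowCount;
        this.distinctValues = distinctValues;
        this.nullFraction = nullFraction;
    }
}

final class CboSketch {
    /** Estimated rows matching `col = literal`, assuming non-NULL values are uniformly distributed. */
    static double estimateEqualityRows(ColumnStats s) {
        if (s.distinctValues == 0) {
            return 0.0;
        }
        return s.rowCount * (1.0 - s.nullFraction) / s.distinctValues;
    }

    /** Picks the input with fewer estimated rows as the hash-join build side. */
    static String chooseBuildSide(double leftRows, double rightRows) {
        return leftRows <= rightRows ? "left" : "right";
    }

    public static void main(String[] args) {
        ColumnStats orders = new ColumnStats(1_000_000L, 1_000L, 0.25);
        double filtered = estimateEqualityRows(orders);
        System.out.println("estimated rows after filter: " + filtered);
        System.out.println("build side: " + chooseBuildSide(filtered, 50_000.0));
    }
}
```

Without such statistics the optimizer has to fall back on file sizes or fixed guesses, which is why the later slides store them alongside the table.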
View
A view is a logical table that can be referenced by future queries. The Iceberg view definition standardizes the view metadata for ease of sharing views across engines.
Part 02
New Scenarios Unlocked by Iceberg's Advanced Features
Scenario 1 unlocked by BRANCH: CDC ingestion into the lake
Write raw CDC events to the change branch, and produce a changelog feed from the branch.
-- Create a snapshot view for users
CREATE VIEW users AS
SELECT
  user_cols.*, -- the columns of the original table
  txId         -- the incremental transaction id, or timestamp
FROM (
  SELECT
    ROW_NUMBER() OVER (PARTITION BY row.id ORDER BY txId DESC) AS row_number,
    operation,
    txId,
    row AS user_cols
  FROM users BRANCH changes
)
WHERE row_number = 1 AND operation != 'delete'
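The snapshot view keeps, for each id, only the change with the highest txId and hides ids whose latest change is a delete. A minimal in-memory sketch of that rule follows; the `ChangeEvent` and `SnapshotViewSketch` classes are hypothetical stand-ins for rows on the changes branch, not Iceberg or Spark APIs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical CDC event: one row of the changes branch. */
final class ChangeEvent {
    final long id;
    final long txId;
    final String operation; // "insert", "update" or "delete"
    final String row;       // payload, simplified to a single string

    ChangeEvent(long id, long txId, String operation, String row) {
        this.id = id;
        this.txId = txId;
        this.operation = operation;
        this.row = row;
    }
}

final class SnapshotViewSketch {
    /** Equivalent of: row_number() over (partition by id order by txId desc) = 1 and operation != 'delete'. */
    static Map<Long, String> snapshot(List<ChangeEvent> changes) {
        // Keep the event with the highest txId for each id.
        Map<Long, ChangeEvent> latest = new HashMap<>();
        for (ChangeEvent e : changes) {
            ChangeEvent seen = latest.get(e.id);
            if (seen == null || e.txId > seen.txId) {
                latest.put(e.id, e);
            }
        }
        // Drop ids whose most recent event is a delete.
        Map<Long, String> result = new TreeMap<>();
        for (ChangeEvent e : latest.values()) {
            if (!"delete".equals(e.operation)) {
                result.put(e.id, e.row);
            }
        }
        return result;
    }
}
```

Because the rule only needs the latest event per key, the view can be recomputed at any time from the raw change branch without mutating it.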
MERGE INTO Users BRANCH optimized AS t
USING incr_changes AS s
ON s.id = t.id
WHEN MATCHED [AND (time cond)] THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
Scenario 2 unlocked by BRANCH: multi-stream stitching
Write partial inserts to one branch, then merge them incrementally into the merged branch.
// step 1: define the window
WindowSpec windowSpec = Window.partitionBy(primaryKey)
    .orderBy(functions.desc(orderColumn))
    .rangeBetween(Window.unboundedPreceding(), Window.unboundedFollowing());
// step 2: compact via window aggregations
Primary key  -> col(key column)
Order column -> max(order column)
Data column  -> first(data column, true)  // true = ignore nulls
// step 3: merge into the target branch
MERGE INTO table BRANCH optimized AS t
USING aggDf AS s
ON t.key = s.key
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
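The window aggregation (max of the order column, first non-null value of each data column in descending order) can be sketched in plain Java as follows. `PartialRow` and `StitchSketch` are hypothetical names, each stream is assumed to fill only some data columns and leave the rest null, and all rows are assumed to carry the same number of data columns.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical partial insert: a key, an ordering value, and data columns that may be null. */
final class PartialRow {
    final String key;
    final long order;
    final String[] data;

    PartialRow(String key, long order, String... data) {
        this.key = key;
        this.order = order;
        this.data = data;
    }
}

final class StitchSketch {
    /** Per key: order = max(order); each data column = first non-null value, newest row first. */
    static Map<String, PartialRow> stitch(List<PartialRow> rows) {
        // Newest first, like orderBy(desc(orderColumn)) in the window definition.
        List<PartialRow> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingLong((PartialRow r) -> r.order).reversed());

        Map<String, PartialRow> merged = new LinkedHashMap<>();
        for (PartialRow r : sorted) {
            PartialRow acc = merged.get(r.key);
            if (acc == null) {
                // The first row seen for a key carries the maximum order value.
                merged.put(r.key, new PartialRow(r.key, r.order, r.data.clone()));
                continue;
            }
            // Fill only the columns that newer rows left null (first(col, ignoreNulls = true)).
            for (int i = 0; i < acc.data.length; i++) {
                if (acc.data[i] == null) {
                    acc.data[i] = r.data[i];
                }
            }
        }
        return merged;
    }
}
```

The stitched rows correspond to `aggDf`, which is then merged into the target branch.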
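Scenario 1 unlocked by Puffin: asynchronous statistics building
Store table stats
Scenario unlocked by Puffin: index building
Scenario unlocked by View: materialized views (MV)
A materialized view is a pre-computed data set derived from a query specification (the SELECT in the view definition) and stored for later use.
Part 03
Iceberg's New Features in Practice at Tencent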
CBO
● Build table statistics asynchronously, and update partition-level statistics incrementally via theta sketch.
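Sketches suit incremental updates because per-partition sketches can be merged without rescanning the data. The code below is a simplified K-Minimum-Values sketch, the idea that theta sketches build on. It is not the Apache DataSketches API and not the implementation described in the talk; the class name and the SplitMix64-style hash are assumptions made for illustration.

```java
import java.util.Comparator;
import java.util.TreeSet;

/** Simplified K-Minimum-Values distinct-count sketch (illustrative only, assumes k >= 2). */
final class KmvSketch {
    private static final Comparator<Long> UNSIGNED = Long::compareUnsigned;

    private final int k;
    // The k smallest 64-bit hashes seen so far, ordered as unsigned values.
    private final TreeSet<Long> minHashes = new TreeSet<>(UNSIGNED);

    KmvSketch(int k) {
        this.k = k;
    }

    /** SplitMix64 finalizer: a bijective mixer used here as a stand-in hash for long keys. */
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    void add(long value) {
        offer(mix(value));
    }

    private void offer(long hash) {
        minHashes.add(hash);
        if (minHashes.size() > k) {
            minHashes.pollLast(); // evict the largest retained hash
        }
    }

    /** Union of two sketches: keep the k smallest hashes from both inputs. */
    KmvSketch merge(KmvSketch other) {
        KmvSketch out = new KmvSketch(k);
        for (long h : minHashes) {
            out.offer(h);
        }
        for (long h : other.minHashes) {
            out.offer(h);
        }
        return out;
    }

    /** Exact while fewer than k hashes are retained, otherwise (k - 1) / theta. */
    double estimate() {
        if (minHashes.size() < k) {
            return minHashes.size();
        }
        // Map the k-th smallest hash to a fraction of the hash space in [0, 1).
        double theta = (minHashes.last() >>> 11) * 0x1.0p-53;
        return (k - 1) / theta;
    }
}
```

When a new partition arrives, only that partition needs a fresh sketch; merging it with the stored sketches yields the same table-level estimate as rebuilding from all rows.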
Indexing
● Async indexing, supporting Bloom filter and bitmap indexes
CREATE INDEX index_name ON [TABLE] table_name
USING BLOOMFILTER ( { colName1 [ options ] } [, ...] ) [ options ]
-- where options is:
OPTIONS ( { key1 [ = ] val1 } [, ...] )
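To show why a Bloom filter index helps, here is a minimal sketch of the data structure itself: a reader can test a lookup value against a per-file filter and skip the file when the filter reports the value is definitely absent. The class below is hypothetical and written for illustration; it does not reflect the on-disk layout or hashing used by the index in the talk.

```java
import java.util.BitSet;

/** Minimal Bloom filter over long keys (illustrative only). */
final class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** SplitMix64 finalizer, used here as a stand-in 64-bit hash. */
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    /** Double hashing: derive the i-th bit position from two 32-bit halves of one hash. */
    private int index(long value, int i) {
        long h = mix(value);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(long value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    /** false means definitely absent; true means possibly present (false positives can occur). */
    boolean mightContain(long value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }
}
```

A filter never reports a stored value as absent, so skipping a file on a `false` answer is always safe; the cost of a false positive is only an unnecessary file read.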
Authorization
● Thousands of columns in a table
● Different departments focus on separate columns
● Use authorized views instead of tables
A/B testing
● Async indexing upon query analysis
● Async z-order clustering upon query analysis
● Effect validation on the branch
Thanks for watching!