
Zhenzhong Xu: Want a simple, easy-to-use real-time ML platform? Some complex details you need to know

2023-08-02 · ArchSummit Shenzhen 2023 | Global Architect Summit

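AI Summary

Invisible Interfaces: Report Summary

Core argument

This report examines how to abstract away the complexity of a real-time machine learning platform, aiming to simplify real-time decision workflows and improve business efficiency through an "invisible interface". Its central argument is that real-time decisions are valuable in many domains (fraud prevention, personalized recommendation, dynamic pricing, and others), but that implementing them runs into data silos, collaboration overhead, and technical complexity.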

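Why real-time decisions matter

Real-time decisions are critical to modern businesses. Case studies from Instacart, LinkedIn, WhatsApp, Pinterest, and Airbnb show substantial gains from real-time machine learning:

Instacart: directly reduces millions in fraud-related costs annually
LinkedIn: moving anti-abuse from an offline pipeline (hours) to a real-time pipeline (minutes) brought a 30% increase in bad actors caught online and a 21% improvement in fake account detection
WhatsApp: a delay of a few hundred milliseconds can increase spam by 20-30%
Pinterest: using real-time user actions in recommendations increased Home feed engagement by 11% and reduced Pinner hide volume by 10%
Airbnb: moving search ranking from offline scoring to online scoring grew bookings by 5.1%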

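Challenges of real-time decisions

Building a real-time decision platform faces three core challenges:

From experimentation to production: slow prototyping, differences between local and remote execution, divergent languages and runtimes
Streaming and batch divided: an evolving architecture, difficult backfills, train-predict inconsistencies
System complexity: surprises in cost, latency, and correctness, and a lack of optimization knobs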

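Solutions

Three solutions are proposed for these challenges:

Solution 1: Relational-expression-based compilation

A unified yet familiar API, pluggable into many compute engines
The same code runs on different engines through a relational expression (RE) intermediate layer
The RE is compiled and optimized for the compute engine best suited to the job (e.g., pandas, DuckDB, Flink, Spark)
Prototyping time drops to minutes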
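
The idea of one relational expression running on several engines can be sketched in a few lines. This is a minimal illustration and not claypot.ai's implementation: the `GroupMean` class, the `to_sql`, `run_sqlite`, and `run_python` helpers, and the sample rows are all hypothetical, and SQLite plus plain Python stand in for engines such as DuckDB or Spark.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupMean:
    """Hypothetical relational expression: mean of `value` grouped by `keys`."""
    table: str
    keys: tuple
    value: str


def to_sql(expr: GroupMean) -> str:
    """Backend 1: compile the expression into an equivalent SQL string."""
    keys = ", ".join(expr.keys)
    return (
        f"SELECT {keys}, AVG({expr.value}) FROM {expr.table} "
        f"GROUP BY {keys} ORDER BY {keys}"
    )


def run_sqlite(expr: GroupMean, rows: list) -> list:
    """Load the rows into an in-memory SQLite table and run the compiled SQL."""
    columns = list(rows[0].keys())
    con = sqlite3.connect(":memory:")
    con.execute(f"CREATE TABLE {expr.table} ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    con.executemany(
        f"INSERT INTO {expr.table} VALUES ({placeholders})",
        [tuple(r[c] for c in columns) for r in rows],
    )
    result = con.execute(to_sql(expr)).fetchall()
    con.close()
    return result


def run_python(expr: GroupMean, rows: list) -> list:
    """Backend 2: interpret the same expression with plain Python."""
    groups = {}
    for r in rows:
        key = tuple(r[k] for k in expr.keys)
        groups.setdefault(key, []).append(r[expr.value])
    return [key + (sum(v) / len(v),) for key, v in sorted(groups.items())]


# Illustrative data only.
rows = [
    {"cc_num": "c1", "merchant": "m1", "amt": 10.0},
    {"cc_num": "c1", "merchant": "m1", "amt": 30.0},
    {"cc_num": "c2", "merchant": "m1", "amt": 5.0},
]
expr = GroupMean(table="tx", keys=("cc_num", "merchant"), value="amt")
sqlite_result = run_sqlite(expr, rows)
python_result = run_python(expr, rows)
```

Both backends return the same rows for the same expression, which is the property that lets a platform choose an engine per job without changing the user's feature code.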

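Solution 2: Abstract streaming and batch data storage

Unified streaming and batch sources, which simplifies backfill
Unified management of online and offline feature stores
Support for multiple storage technologies (data warehouses, logs, real-time stores)
Point-in-time-correct joins keyed on entity keys and timestamps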
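
A point-in-time join can be sketched as follows. This is a minimal version under stated assumptions: the `pitc_join` function, the row layout, and the sample data are hypothetical and are not the `pitc_join_features` API from the slides. For each spine row it picks the latest feature value observed at or before the row's timestamp, so no future information leaks into the training data.

```python
from bisect import bisect_right


def pitc_join(spine, feature_log, feature_name):
    """Attach the latest feature value at or before each spine timestamp.

    spine: list of dicts with "entity", "ts", and "label" keys.
    feature_log: list of (entity, ts, value) tuples in any order.
    """
    # Index feature observations per entity, sorted by timestamp.
    by_entity = {}
    for entity, ts, value in sorted(feature_log, key=lambda r: (r[0], r[1])):
        times, values = by_entity.setdefault(entity, ([], []))
        times.append(ts)
        values.append(value)

    joined = []
    for row in spine:
        times, values = by_entity.get(row["entity"], ([], []))
        # Count of observations whose timestamp is <= the spine timestamp.
        i = bisect_right(times, row["ts"])
        joined.append({**row, feature_name: values[i - 1] if i else None})
    return joined


# Illustrative data only: timestamps are plain integers.
feature_log = [("u1", 100, 1), ("u1", 200, 5), ("u2", 150, 2)]
spine = [
    {"entity": "u1", "ts": 150, "label": 0},
    {"entity": "u1", "ts": 250, "label": 1},
    {"entity": "u2", "ts": 100, "label": 0},
]
train_rows = pitc_join(spine, feature_log, "tx_max_1h")
```

The last spine row gets `None` because the only observation for `u2` arrives after its timestamp; a naive join on entity key alone would have used that later value.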

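Solution 3: Optimization knobs

Abstract away optimization complexity behind high-level control parameters
Balance cost, latency, and correctness
Customizable workload optimization strategies
Cloud resource management boundaries that protect data security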
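
One way to picture high-level knobs is as constraints that the platform uses to pick an execution plan. The sketch below is purely illustrative: the `Knobs` and `Plan` classes, the `choose_plan` function, the plan names, and every number are assumptions made for the example, and the source does not describe how such a selection is actually done.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Knobs:
    """Hypothetical user-facing knobs."""
    max_latency_ms: float   # upper bound on serving latency
    max_staleness_s: float  # how old a served feature value may be


@dataclass(frozen=True)
class Plan:
    """Hypothetical candidate plan with estimated characteristics."""
    name: str
    latency_ms: float
    staleness_s: float
    cost_per_hour: float


def choose_plan(knobs: Knobs, plans: list) -> Optional[Plan]:
    """Return the cheapest plan that satisfies both knobs, or None."""
    feasible = [
        p for p in plans
        if p.latency_ms <= knobs.max_latency_ms
        and p.staleness_s <= knobs.max_staleness_s
    ]
    return min(feasible, key=lambda p: p.cost_per_hour) if feasible else None


# Made-up estimates for three ways of computing the same feature.
plans = [
    Plan("batch_precompute", latency_ms=5, staleness_s=3600, cost_per_hour=1.0),
    Plan("streaming_precompute", latency_ms=5, staleness_s=1, cost_per_hour=8.0),
    Plan("on_demand", latency_ms=80, staleness_s=0, cost_per_hour=3.0),
]
strict = choose_plan(Knobs(max_latency_ms=10, max_staleness_s=60), plans)
relaxed = choose_plan(Knobs(max_latency_ms=100, max_staleness_s=60), plans)
impossible = choose_plan(Knobs(max_latency_ms=1, max_staleness_s=0), plans)
```

Returning `None` when no plan fits, instead of silently picking the closest one, matches the "no surprises" goal: the user learns up front that the requested trade-off is not achievable.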

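Technical architecture

The proposed data architecture comprises:

Data layer: unified streaming and batch sources, feature stores, online and offline storage
Compute layer: relational expression compilation, multi-engine adaptation, intelligent optimization
Service layer: feature serving, model serving, monitoring, and related services
Workflow layer: end-to-end orchestration of training, evaluation, and prediction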

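Conclusion

The "invisible interface" approach makes the following possible:

A ubiquitous, easy-to-use API
A responsive system
A platform that "just works" with no configuration
User control over optimization parameters, avoiding cost and latency surprises
Higher data science productivity and shorter development cycles

The end goal is a simple, efficient, and controllable way for companies to build a real-time machine learning platform.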

Considerations for Abstracting Complexities of a Real-time ML Platform
Zhenzhong Xu, Cofounder & CTO @ claypot.ai
July, 2023

The discovery of something invisible

The Invisible Interface
● Ubiquitous
● Easy and responsive
● Just works!

The endeavor to make things useful

Real-time decisions that power your business
● Customer support
● Dynamic pricing/discounting
● Risk assessment
● Account takeover
● Ads
● Sentiment analysis
● Object detection
● …

The world is moving towards real-time
● Instacart: The Journey to Real-Time Machine Learning (2022)
  ○ Directly reduces millions of fraud-related costs annually.
● LinkedIn's Real-time Anti-abuse (2022)
  ○ LinkedIn moved from an offline pipeline (hours) to a real-time pipeline (minutes), and saw a 30% increase in bad actors caught online and a 21% improvement in fake account detection.
● How WhatsApp catches and fights abuse (2022)
  ○ A delay of a few hundred milliseconds can increase spam by 20-30%.
● How Pinterest Leverages Realtime User Actions in Recommendation to Boost Engagement (2022)
  ○ According to Pinterest, this "has been one of our most impactful innovations recently, increasing Home feed engagement by 11% while reducing Pinner hide volume by 10%".
● Airbnb: Real-time Personalization using Embeddings for Search Ranking (2018)
  ○ Moving from offline scoring to online scoring grows bookings by +5.1%.

The hard things towards real-time decisions
● Data silos and staleness
● Collaboration overhead
● Tech complexity

Challenge 1: From Experimentation to Production
● Slow prototyping
● Local vs. remote execution
● Divergent languages & runtimes

Local Experimentation with Traditional Models

Local Experimentation with LLMs

Need an invisible interface to plug into compute ecosystems

Declare features with familiar APIs

    @transformation
    def average_transaction_amount_by_merchant(tx: Transactions, wspec: WindowSpec):
        return tx.groupby(["cc_num", "merchant"])["amt"].window(wspec).mean()

● Data science friendly: Python <> SQL
● Same code can run on different computation engines
● Compile into a relational expression (RE), which is SQL equivalent
● Compile & optimize the RE into the computation engine (e.g., pandas, DuckDB, Flink, Spark) best suited for the job
● Spin up and manage computation jobs

Solution 1: Relational Expression based Compilation
● Unified yet familiar API
● Pluggable to many compute engines
● Minimize human error
● Prototype in minutes

Challenge 2: Streaming and Batch Divided
● Evolving architecture
● Difficult to backfill
● Train-predict inconsistencies

Kappa (Streaming) Architecture

Batch and streaming sources unified to simplify backfill

Need an invisible interface to plug into storage ecosystems

Data Fabric for a Streaming Pipeline

Training dataset backfill requires point-in-time correctness

Point-in-time joins to generate training data
Given a spine (entity keys + timestamp + label), join features to generate training data:

    train_df = pitc_join_features(
        spine_df,
        features=["tx_max_1h", "user_unique_ip_30d"],
    )

Solution 2: Abstract streaming and batch data storage
● Unified streaming & batch source
● Unified online & offline feature stores
● Pluggable to most storage technologies

Challenge 3: It should just work!
● Cost, latency, correctness surprises!
● Lack of optimization knobs

Optimization

    @transformation
    def transaction_count(tx: Transactions, wspec: WindowSpec):
        return tx[tx.status == "failed"].groupby("account_id").window(wspec).count()

Various intelligent optimizations can be done to make appropriate tradeoffs across storage and compute systems.

Solution 3: Optimization knobs
● Abstract optimization complexity
● User controls with high-level knobs
● Trust, no surprises!

Make the invisible interface possible!
● Ubiquitous
● Easy and responsive
● Just works!

https://zhenzhongxu.com/
zhenzhong@claypot.ai
the invisible interface
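
To make the `transaction_count` transformation from the slides more concrete, here is a minimal plain-Python sketch of what a windowed count of failed transactions per account could compute. The window semantics are an assumption (a trailing window covering `(ts - window_s, ts]`), and the `failed_transaction_counts` function and the sample events are hypothetical; the slides do not specify how `WindowSpec` behaves.

```python
from collections import defaultdict, deque


def failed_transaction_counts(events, window_s):
    """Emit (account_id, ts, count) for every failed event.

    count is the number of failed events for the same account with a
    timestamp in the trailing window (ts - window_s, ts] (an assumption).
    """
    recent = defaultdict(deque)  # account_id -> failed timestamps in window
    out = []
    for e in sorted(events, key=lambda e: e["ts"]):
        if e["status"] != "failed":
            continue  # mirrors the tx[tx.status == "failed"] filter
        q = recent[e["account_id"]]
        q.append(e["ts"])
        # Evict timestamps that have fallen out of the trailing window.
        while q and q[0] <= e["ts"] - window_s:
            q.popleft()
        out.append((e["account_id"], e["ts"], len(q)))
    return out


# Illustrative events only: timestamps are seconds.
events = [
    {"account_id": "a", "ts": 0, "status": "failed"},
    {"account_id": "a", "ts": 10, "status": "ok"},
    {"account_id": "a", "ts": 30, "status": "failed"},
    {"account_id": "b", "ts": 40, "status": "failed"},
    {"account_id": "a", "ts": 70, "status": "failed"},
]
counts = failed_transaction_counts(events, window_s=60)
```

Keeping only in-window timestamps per account is one of the storage versus compute trade-offs an optimizer could make: it bounds memory at the cost of recomputing if the window definition changes.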
