codecamp

Samza 写入HDFS

samza-hdfs 模块实现了一个 Samza Producer 来写入 HDFS。当前的实现包括一个现成的 HdfsSystemProducer,和三个 HdfsWriterS:一个写入原始字节的消息到SequenceFile 的 BytesWritable 键和值;另一个写入 UTF-8 Strings 到一个 SequenceFile 与 LongWritable 键和 Text 值;最后一个写出 Avro 数据文件,包括自动反映的 POJO 对象的模式。

配置 HdfsSystemProducer

您可以像任何其他 Samza 系统一样配置 HdfsSystemProducer:使用 job.properties 文件中设置的配置键和值。您可以配置系统生产者以供您使用,StreamTasks 如下所示:

# set the SystemFactory implementation to instantiate HdfsSystemProducer aliased to 'hdfs-clickstream'
systems.hdfs-clickstream.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory

# define a serializer/deserializer for the hdfs-clickstream system
# DO NOT define (i.e. comment out) a SerDe when using the AvroDataFileHdfsWriter so it can reflect the schema
systems.hdfs-clickstream.samza.msg.serde=some-serde-impl

# consumer configs not needed for HDFS system, reader is not implemented yet

# Assign a Metrics implementation via a label we defined earlier in the props file
systems.hdfs-clickstream.streams.metrics.samza.msg.serde=some-metrics-impl

# Assign the implementation class for this system's HdfsWriter
systems.hdfs-clickstream.producer.hdfs.writer.class=org.apache.samza.system.hdfs.writer.TextSequenceFileHdfsWriter
#systems.hdfs-clickstream.producer.hdfs.writer.class=org.apache.samza.system.hdfs.writer.AvroDataFileHdfsWriter

# Set compression type supported by chosen Writer. Only BLOCK compression is supported currently
# AvroDataFileHdfsWriter supports snappy, bzip2, deflate or none (null, anything other than the first three)
systems.hdfs-clickstream.producer.hdfs.compression.type=snappy

# The base dir for HDFS output. The default Bucketer for SequenceFile HdfsWriters
# is currently /BASE/JOB_NAME/DATE_PATH/FILES, where BASE is set below
systems.hdfs-clickstream.producer.hdfs.base.output.dir=/user/me/analytics/clickstream_data

# Assign the implementation class for the HdfsWriter's Bucketer
systems.hdfs-clickstream.producer.hdfs.bucketer.class=org.apache.samza.system.hdfs.writer.JobNameDateTimeBucketer

# Configure the DATE_PATH the Bucketer will set to bucket output files by day for this job run.
systems.hdfs-clickstream.producer.hdfs.bucketer.date.path.format=yyyy_MM_dd

# Optionally set the max output bytes (records for AvroDataFileHdfsWriter) per file.
# A new file will be cut and output continued on the next write call each time this many bytes
# (records for AvroDataFileHdfsWriter) are written.
systems.hdfs-clickstream.producer.hdfs.write.batch.size.bytes=134217728
#systems.hdfs-clickstream.producer.hdfs.write.batch.size.records=10000 

假设上述配置已经针对同一文件中的其他位置的标签 some-serde-impl 和 some-metrics-impl 标签正确配置了度量标准和序列实现 job.properties。这些属性中的每一个都具有合理的默认值,因此您可以省略不需要为您的工作运行定制的属性。

从 HDFS 文件读取  »

Samza YARN安全
Samza 从HDFS文件读取
温馨提示
下载编程狮App,免费阅读超1000+编程语言教程
取消
确定
目录

Samza API

关闭

MIP.setData({ 'pageTheme' : getCookie('pageTheme') || {'day':true, 'night':false}, 'pageFontSize' : getCookie('pageFontSize') || 20 }); MIP.watch('pageTheme', function(newValue){ setCookie('pageTheme', JSON.stringify(newValue)) }); MIP.watch('pageFontSize', function(newValue){ setCookie('pageFontSize', newValue) }); function setCookie(name, value){ var days = 1; var exp = new Date(); exp.setTime(exp.getTime() + days*24*60*60*1000); document.cookie = name + '=' + value + ';expires=' + exp.toUTCString(); } function getCookie(name){ var reg = new RegExp('(^| )' + name + '=([^;]*)(;|$)'); return document.cookie.match(reg) ? JSON.parse(document.cookie.match(reg)[2]) : null; }