なんでもブログ

Creating a DataFrame by reading a CSV with PySpark

2020-09-13 02:54:21

【CSV to read】

Name,Age,TestDate
tanaka,10,2020-08-01 01:00:00
suzuki,13,2020-08-02 02:00:00
satou,15,2020-08-03 03:00:00
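To reproduce the examples in this post, the sample file has to exist on disk first. The following is a minimal sketch for creating it using only the Python standard library; the helper name `write_sample_csv` is hypothetical and not part of the original post, while `/tmp/test.csv` is the path the PySpark code reads from.

```python
# Sample rows copied from the CSV shown in this post.
SAMPLE_CSV = (
    "Name,Age,TestDate\n"
    "tanaka,10,2020-08-01 01:00:00\n"
    "suzuki,13,2020-08-02 02:00:00\n"
    "satou,15,2020-08-03 03:00:00\n"
)


def write_sample_csv(path):
    """Write the sample rows to `path` and return the path (hypothetical helper)."""
    with open(path, "w", newline="") as f:
        f.write(SAMPLE_CSV)
    return path


if __name__ == "__main__":
    # /tmp/test.csv is the location used by the PySpark code in this post.
    print(write_sample_csv("/tmp/test.csv"))
```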

【Python module】

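from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType, StringType, TimestampType

# Explicit schema: column names and types, in the same order as the CSV columns.
schema = StructType(
    [
        StructField('Name', StringType(), False),
        StructField('Age', IntegerType(), False),
        StructField('TestDate', TimestampType(), False)
    ]
)

spark = SparkSession.builder.getOrCreate()

# 'header' makes Spark skip the first line (Name,Age,TestDate) instead of
# parsing it as a data row; 'timestampFormat' matches the TestDate values.
df = spark.read.format(
    'csv'
).option(
    'header', 'true'
).option(
    'timestampFormat', 'yyyy-MM-dd HH:mm:ss'
).load(
    '/tmp/test.csv', schema=schema
)

df.show()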

+------+---+-------------------+
|  Name|Age|           TestDate|
+------+---+-------------------+
|tanaka| 10|2020-08-01 01:00:00|
|suzuki| 13|2020-08-02 02:00:00|
| satou| 15|2020-08-03 03:00:00|
+------+---+-------------------+

df.schema

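StructType(List(StructField(Name,StringType,true),StructField(Age,IntegerType,true),StructField(TestDate,TimestampType,true)))

Note that every field is reported as nullable (true) even though the schema was defined with False: Spark treats columns read from file sources such as CSV as nullable.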

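【When no schema is specified】

The CSV can also be read without specifying a schema.
In fact, any value that does not match the schema definition ends up as null, so leaving the schema out is the easier option.
However, when no schema is specified, every column is treated as a String.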

df = spark.read.format(
    'csv'
).load(
    '/tmp/test.csv'
)

df.schema

StructType(List(StructField(_c0,StringType,true),StructField(_c1,StringType,true),StructField(_c2,StringType,true)))
