Pipeline generated features¶
Transformer types¶
Each header below represents a feature created using a transformation available through the Spark Pipeline API. These transforms are used in the context of pipeline_features.
[[pipeline_features]]
input_column = "immyear_diff"
output_column = "immyear_caution"
transformer_type = "bucketizer"
categorical = true
splits = [-1,0,6,11,9999]
[[pipeline_features]]
input_columns = ["race","srace"]
output_column = "race_interacted_srace"
transformer_type = "interaction"
interaction¶
Interact two or more features, creating a vectorized result.
[[pipeline_features]]
# interact the categorical features for mother caution flag, mother present flag, and mother jaro-winkler score
input_columns = ["m_caution", "m_pres", "jw_m"]
output_column = "m_interacted_jw_m"
transformer_type = "interaction"
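As a rough illustration of what "interacting" features means, the sketch below mimics the behavior of Spark's Interaction transformer in plain Python: the result holds the product of every combination of one value from each input. The function name interact is hypothetical, and the sketch assumes each input is either a single number or an already-encoded list of numbers; it is not the code that the pipeline runs.

```python
from itertools import product
from math import prod


def interact(*features):
    """Multiply every combination of one value taken from each input feature.

    Scalars are treated as length-1 vectors. This is a plain-Python sketch,
    not the Spark implementation.
    """
    vectors = [f if isinstance(f, (list, tuple)) else [f] for f in features]
    return [float(prod(combo)) for combo in product(*vectors)]


# A scalar interacted with two 3-element vectors yields 1 * 3 * 3 = 9 values.
interacted = interact(1.0, [1.0, 2.0, 3.0], [8.0, 4.0, 5.0])
print(interacted)
```

The length of the result is the product of the input lengths, so interacting several wide categorical encodings can produce a large vector.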
bucketizer¶
From the pyspark.ml.feature.Bucketizer() docs: “Maps a column of continuous features to a column of feature buckets.”
Attributes:
splits – Type: Array of integers. Required for this transformer_type. Per the pyspark.ml.feature.Bucketizer() docs: “Split points for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. The splits should be of length >= 3 and strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors.”
[[pipeline_features]]
input_column = "relate_a"
output_column = "relatetype"
transformer_type = "bucketizer"
categorical = true
splits = [1,3,5,9999]
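The split rules quoted above can be sketched in plain Python to show which bucket a value lands in. This is an illustration of the documented semantics only, not the Spark implementation; the function name bucket_index is hypothetical, and out-of-range values are assumed to raise an error as the quoted docs describe.

```python
from bisect import bisect_right


def bucket_index(value, splits):
    """Return the 0-based bucket for value, following the quoted split rules."""
    if len(splits) < 3:
        raise ValueError("splits should have length >= 3")
    if any(a >= b for a, b in zip(splits, splits[1:])):
        raise ValueError("splits should be strictly increasing")
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} falls outside the splits")
    if value == splits[-1]:
        # The last bucket also includes its upper bound.
        return len(splits) - 2
    # Every other bucket is [x, y): lower bound included, upper bound excluded.
    return bisect_right(splits, value) - 1


# The splits from the relate_a example: 4 splits define 3 buckets,
# [1,3), [3,5) and [5,9999].
relate_splits = [1, 3, 5, 9999]
relatetype = {v: bucket_index(v, relate_splits) for v in (1, 2, 3, 4, 5, 9999)}
print(relatetype)
```

With these splits, a relate_a value below 1 or above 9999 falls outside every bucket, which is why the lowest and highest splits should cover the full range of values expected in the data.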