# Pipeline generated features ## Transformer types Each header below represents a feature created using a transformation available through the Spark Pipeline API. These transforms are used in the context of `pipeline_features`. ``` [[pipeline_features]] input_column = "immyear_diff" output_column = "immyear_caution" transformer_type = "bucketizer" categorical = true splits = [-1,0,6,11,9999] [[pipeline_features]] input_columns = ["race","srace"] output_column = "race_interacted_srace" transformer_type = "interaction" ``` ### interaction Interact two or more features, creating a vectorized result. ``` [[pipeline_features]] # interact the categorical features for mother caution flag, mother present flag, and mother jaro-winkler score input_columns = ["m_caution", "m_pres", "jw_m"] output_column = "m_interacted_jw_m" transformer_type = "interaction" ``` ### bucketizer From the `pyspark.ml.feature.Bucketizer()` docs: "Maps a column of continuous features to a column of feature buckets." * Attributes: * `splits` -- Type: Array of integers. Required for this transformer_type. Per the `pyspark.ml.feature.Bucketizer()` docs: "Split points for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. The splits should be of length >= 3 and strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors." ``` [[pipeline_features]] input_column = "relate_a" output_column = "relatetype" transformer_type = "bucketizer" categorical = true splits = [1,3,5,9999] ```