ML 1: Data Wrangling

Packages Used: No new packages are used in this notebook.

It is often said that preparing data for subsequent processing (a.k.a. data wrangling) represents 80% of the effort in most data analysis efforts. A variety of tools are available in the DataFrames package to simplify the effort.

1. Tabular Data

Most data wrangling tools are designed to operate on tabular data:

Ex 1: Array vs. Tabular Representations

Three variables: var1 with possible values {A,B}; var2 with values {a,b}; var3 with values {1,2,3,4}:

Array:

a b
A 1 2
B 3 4

Table:

var1 var2 var3
A a 1
A b 2
B a 3
B b 4

Array representation is very concise and often used to efficiently represent data for up to three variables, but it can be hard to manipulate for more than three variables. For example, the three variables needed to represent a network (i, j, w) can be represented as an adjacency matrix array (using an efficient sparse matrix) as compared to using a tabular arc list representation. For more than three variables, it is easier to manipulate tabular data; for example, if a new variable, var4, needs to be added in Ex 1, above, only another column needs to be added to the table, while the array would need to become 3-D.

2. Table Joins and Split-Apply-Combine Strategy

Ex 2: Activity Profiling

Activity profiling is the systematic analysis of the items and orders handled in a warehouse in order to improve its design and operation. In the design of an order picking system, a representative set of customer orders are used together with the item master file to generate parameters that are used for a variety of different warehousing decisions, including equipment and method selection and slotting. If available, the previous three months to one year of customer orders provide a reasonably representative set of orders (a.k.a. an order set).

image.png

An example of activity profiling is presented in the figure above. Each customer order is composed of one or more lines, where each line represents an item–quantity pair (e.g., order 1 has five lines, with the first line indicating five units of item A have been ordered). In this example, the five customer orders along with the item master file are used to create the following warehouse design parameters:

  1. Total Lines—the total number of lines for all items in all orders over some period of time (representative of total picking activity); used to select piece-picking methods.
  2. Lines per Order—the average number of different items (i.e., lines or SKUs) in an order; used to select piece-picking method.
  3. Cube per Order—the average total cubic volume of all of the units (i.e., pieces) in an order; used to select piece-picking methods.
  4. Flow per Item—the total number of storage and retrieval operations performed for the item over some period of time; used to select pallet-picking equipment.
  5. Lines per Item (a.k.a. popularity)—the total number of lines for the item in all orders over some period of time (representative of picking activity for an item); used to select case- and piece-picking equipment and for slotting.
  6. Cube Movement—the total unit demand of the item over some period of time times the cubic volume of each unit (representative of the cube in storage for the item); used to select pallet-, case-, and piece-picking equipment.
  7. Demand Correlation—the percent (or just the number) of orders in which both items appear; used for zoning and slotting.

In the figure above, the item master file includes the dimensions, cubic volume, and weight of each item. Note that an item's cube can be less than the product of its dimensions (e.g., items C and D). In addition, the unit of measure (UOM) is typically included. The UOM is a description of whether the quantity of inventory for an item refers to individual units (eaches or pieces, EA), cases (CA), or pallets (PA). A conversion ratio is used whenever multiple units of measure are used for the same item.

Total Lines (11)

The total number of lines for all items in all orders over some period of time (representative of total picking activity); used to select piece-picking methods.

Lines per Order (2.2)

The average number of different items (i.e., lines or SKUs) in an order; used to select the piece-picking method.

Cube per Order (493.2)

The average total cubic volume of all of the units (i.e., pieces) in an order; used to select piece-picking methods. The unit of measure (UOM) is typically included. The UOM is a description of whether the quantity of inventory for an item refers to individual units (eaches or pieces, EA), cases (CA), or pallets (PA). A conversion ratio is used whenever multiple units of measure are used for the same item.

First Approach: Regular Julia

Total cubic volume for Order 1:

Second Approach: Split-Assign-Combine Strategy

Split: DataFrame grouped by ORDER

Apply: Total cubic volume of each order

Combine: Mean of total cubic volume per order

Flow per Item (11 for SKU A,5 for SKU B,...)

The total number of storage and retrieval operations performed for the item over some period of time; used to select pallet-picking equipment.

Split:

Apply then Combine: Apply sum function to total quantities for each item and then combine into a new dataframe skudf.

Lines per Item (3 for SKU A,2 for SKU B,...)

The total number of lines for the item in all orders over some period of time (representative of picking activity for an item); used to select case- and piece-picking equipment and for slotting.

Cube Movement (330 for SKU A,120 for SKU B,...)

The total unit demand of the item over some period of time times the cubic volume of each unit (representative of the cube in storage for the item); used to select pallet-,case-,and piece-picking equipment.

Demand Correlation

The percent of orders in which both items appear. It can be used for zoning and slotting, which are warehouse-related layout problems. Demand correlation is used as the machine-to-machine flow matrix in the QAP-based layout procedures.

image.png

Note: In the following, indexin is used to assure that the number of orders and items are indexed from 1 to m and 1 to n, respectively, in the calculations. The rows and columns of the calculated W corresponed to the items in the sequence returned by unique(item).

3. Missing Data: Dropping vs. Imputation

For a variety of different reasons, some of the cells in a table may be missing data. In Julia, missing data is identified with the type missing. All subsequent results that use the missing data will themselves be identified as missing; as a result, it is necessary to either remove or replace the missing data:

For large data sets with few missing values, any missing data is typically dropped; for smaller data sets, where the missing data occur in columns containing relatively unimportant variables, it may be possible to provide a reasonable value with which to replace the missing cell. For example, for numerical values, the replacement value could be the average of the non-missing values for the same variable; for categorical values, the replacement value could be the most likely or mode value of the non-missing values.

Ex 3: Add Orders with Missing Data to Ex 2

Note: We need to use the string data to include missing data because creating the table directly using DataFrame, as was done in Ex 2, will generate an error.

For this particular example involving activity profiling, the number of different items (or SKUs) is an important factor indicating the difficulty of order picking individual pieces (EA) or cartons (CA), while the quantity of each item is less important. As a result, any observation with a missing SKU will be dropped, while any observation that is only missing the quantity (QTY) can be replaced with the average quantity ordered for that SKU in the other observations without missing values.