Data

To simplify handling data when estimating different models, the package implements dedicated data types. The main one (for now) is the DataSD struct, a subtype of Data. It is a mutable struct that gathers the data for all search sessions. To estimate a model, data held in other types (e.g., a DataFrame) first needs to be gathered into this type. The Tutorials provide examples of how to construct it.

Most fields are required, but some default to nothing. For example, the search order or the point where consumers stopped scrolling may not be observed, in which case the corresponding fields remain nothing.

StructuralSearchModels.DataSD — Type

DataSD{T} <: Data

Data type for the Search and Discovery model. Indexing is session-based. See the tutorials for examples on how to simulate or construct this data from real datasets.

Fields

consumer_ids::Vector{Int}: Consumer ID for each session.
product_ids::Vector{Vector{Int}}: Product IDs available in each session.
product_characteristics::Vector{Matrix{T}}: Product characteristics matrix for each session.
positions::Vector{Vector{Int}}: Position of each product in each session.
consideration_sets::Vector{Vector{Bool}}: Whether each product was searched (clicked) in each session.
purchase_indices::Vector{Int}: Index of the purchased product within each session.
min_discover_indices::Union{Vector{Int}, Nothing}: Index of the last product that must have been discovered, as implied by the deepest clicked position. Used during estimation; populate with fill_indices_min_discover!. Defaults to nothing.
stop_indices::Union{Vector{Int}, Nothing}: Index of the last product discovered before stopping. nothing if scrolling is not observed.
session_characteristics::Union{Vector{Vector{T}}, Nothing}: Session-level characteristics. nothing if not available (default).
search_paths::Union{Vector{Vector{Int}}, Nothing}: Search order within each session. nothing if search order is not observed.

The data tracks everything at the session level. The consumer_ids field allows tracking consumers across multiple sessions, although this is not yet used.

Data can be accessed with d.fieldname, where d is a DataSD object. Indexing is then based on the different fields. For example, the following accesses the product characteristics of the first session in the data: d.product_characteristics[1].

For product characteristics, note that the last column is always a dummy indicator for the outside option.
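As a rough sketch of the session-based layout, the following gathers long-format rows (one row per session and product) into per-session vectors that mirror the required fields. It uses only Base Julia and does not call the DataSD constructor. The input rows, the column names, and the choice to place the outside option at position 0 are assumptions made for illustration.

```julia
# Hypothetical long-format input: one row per (session, product) pair.
# Product ID 0 marks the outside option; x is a single product characteristic.
rows = [
    (session = 1, consumer = 10, product = 0,   x = 0.0,  position = 0, clicked = false, purchased = false),
    (session = 1, consumer = 10, product = 101, x = 1.2,  position = 0, clicked = true,  purchased = true),
    (session = 1, consumer = 10, product = 102, x = 0.4,  position = 1, clicked = false, purchased = false),
    (session = 2, consumer = 11, product = 0,   x = 0.0,  position = 0, clicked = false, purchased = true),
    (session = 2, consumer = 11, product = 103, x = -0.3, position = 0, clicked = false, purchased = false),
]

# Group the rows by session, preserving the order of appearance.
sessions = unique(r.session for r in rows)
by_session = [filter(r -> r.session == s, rows) for s in sessions]

# Build one entry per session for each field.
consumer_ids = [first(g).consumer for g in by_session]
product_ids = [[r.product for r in g] for g in by_session]
# One matrix per session; the last column is the outside-option dummy.
product_characteristics = [hcat([r.x for r in g], [r.product == 0 ? 1.0 : 0.0 for r in g])
                           for g in by_session]
positions = [[r.position for r in g] for g in by_session]
consideration_sets = [[r.clicked for r in g] for g in by_session]
# Index of the purchased product within each session.
purchase_indices = [findfirst(r -> r.purchased, g) for g in by_session]
```

These vectors follow the field types listed above; how they are passed to DataSD is shown in the tutorials.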

Data Generation

Data from a search model can be generated with the generate_data function. There are two versions.

The first generates data from scratch, which requires specifying how many consumers and how many sessions per consumer to simulate (multiple sessions per consumer are supported, but the link between them is not yet used). By default, it generates generic products using generate_products.

The second takes products, sessions, etc. as given and only simulates new search paths from the model.

StructuralSearchModels.generate_data — Function

generate_data(model::Model, n_consumers, n_sessions_per_consumer;
                    n_A0 = 1, n_d = 1,
                    indices_list_characteristics = 1:length(model.β),
                    products = generate_products(n_consumers * n_sessions_per_consumer, MvNormal(I(length(model.β)-1))),
                    drop_undiscovered_products = false,
                    kwargs_path_generation...)

Generate and return a DataSD object for the model model with n_consumers consumers and n_sessions_per_consumer sessions per consumer. By default, this assumes that there is one alternative in the initial awareness set (n_A0=1), one alternative per position (n_d=1), and generates generic products using generate_products. Undiscovered products are not dropped by default. kwargs_path_generation are passed to the function generating the search paths.

Returns

A DataSD object with all fields populated, including consumer_ids, product_ids, product_characteristics, positions, consideration_sets, purchase_indices, min_discover_indices, search_paths, and stop_indices.

Example

using Distributions, StructuralSearchModels
m = SD(β = [-0.05, 3.0], Ξ = 3.5, ρ = [-0.1], ξ = 2.5,
       dE = Normal(), dV = Normal(), dU0 = Uniform(), zdfun = "log")
d = generate_data(m, 100, 1; seed = 1)
# DataSD with 100 sessions

generate_data(model::Model, data::Data;
                products = generate_products(data::Data),
                kwargs_path_generation...)

Generate and return a new DataSD object for the model model using the existing data object data. This allows simulating new search paths for the same consumers and products. If undiscovered products have been dropped, new products are sampled from data using generate_products. kwargs_path_generation are passed to the function generating the search paths.

Returns

A DataSD object with the same consumers and products as data but with freshly simulated search paths.

Example

# Re-simulate search paths for the same consumers and products.
# Here, m_hat is a previously defined model and d an existing DataSD object.
d_sim = generate_data(m_hat, d; seed = 2)

StructuralSearchModels.generate_products — Function

generate_products(n_sessions, distribution::Distribution;
    n_products = 1_000_000,
    n_products_per_session = 30,
    outside_option = true,
    kwargs...)

Generate product IDs and their characteristics for n_sessions sessions. Each session has n_products_per_session products randomly sampled from n_products available in total. Product characteristics are sampled from distribution, which must be multivariate (e.g., MvNormal) when using multiple characteristics. If outside_option = true, an outside option is prepended to each session with product ID 0, last characteristic set to 1.0, and all others set to 0.0.

Returns

A tuple (product_ids, product_characteristics) where each element is a vector over sessions: product_ids[i] is a Vector{Int} and product_characteristics[i] is a Matrix of characteristics for session i.

Example

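using Distributions, LinearAlgebra, StructuralSearchModels  # LinearAlgebra provides I
# Characteristics drawn from a univariate distribution
product_ids, product_characteristics = generate_products(100, Normal())
# Characteristics drawn from a three-dimensional MvNormal, with a smaller catalogue and fewer products per session
product_ids, product_characteristics = generate_products(100, MvNormal(I(3));
    n_products = 500, n_products_per_session = 10)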
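To illustrate the layout of the returned tuple, here is a minimal sketch of the sampling described above, written with the Random standard library only. The function sketch_generate_products is a hypothetical stand-in, not the package's implementation. It assumes a single standard-normal characteristic and sampling without replacement within a session.

```julia
using Random

# Hypothetical stand-in for the sampling described above (not the package's code).
function sketch_generate_products(rng, n_sessions; n_products = 1_000, n_products_per_session = 5)
    # One scalar characteristic per product in the catalogue.
    catalogue = randn(rng, n_products)
    product_ids = Vector{Vector{Int}}(undef, n_sessions)
    product_characteristics = Vector{Matrix{Float64}}(undef, n_sessions)
    for s in 1:n_sessions
        # Assumption: products are sampled without replacement within a session.
        ids = randperm(rng, n_products)[1:n_products_per_session]
        # Outside option first: ID 0, dummy (last column) 1.0, other characteristics 0.0.
        product_ids[s] = vcat(0, ids)
        x = vcat(0.0, catalogue[ids])
        dummy = vcat(1.0, zeros(n_products_per_session))
        product_characteristics[s] = hcat(x, dummy)
    end
    return product_ids, product_characteristics
end

ids, chars = sketch_generate_products(MersenneTwister(1), 3)
```

Each chars[s] has one row per product, including the outside option in the first row, and the outside-option dummy in the last column.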

Keyword options

The generate_data function passes kwargs_path_generation on to the path generation. The following options are currently available:

seed: the seed used to generate the data.
conditional_on_search=false: whether to generate only search paths with at least one search.
conditional_on_search_iter=100: the maximum number of attempts, each using new draws for the shocks, to generate a search path with at least one search for a given session (with its products etc.).
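The interaction between conditional_on_search and conditional_on_search_iter can be pictured as a bounded retry loop. The sketch below uses a toy click model in place of the package's path generation. Both simulate_clicks and draw_with_search are hypothetical names, and returning the last draw when every attempt fails is an assumption.

```julia
using Random

# Hypothetical toy session: each of n products is searched with probability p.
simulate_clicks(rng, n, p) = rand(rng, n) .< p

function draw_with_search(rng, n, p; conditional_on_search = false,
                          conditional_on_search_iter = 100)
    clicks = simulate_clicks(rng, n, p)          # first attempt
    conditional_on_search || return clicks
    for _ in 2:conditional_on_search_iter
        any(clicks) && return clicks             # at least one search: accept the draw
        clicks = simulate_clicks(rng, n, p)      # otherwise redraw the shocks
    end
    return clicks  # assumption: the last draw is returned even without a search
end

rng = MersenneTwister(1)
clicks = draw_with_search(rng, 5, 0.5; conditional_on_search = true)
```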

Convenience functions

The package exports some convenience functions for the DataSD type. These include:

d == d1 checks whether two DataSD objects are the same.
d[1:5] selects the first five search sessions from the data.
fill_indices_min_discover!(data::DataSD) fills in min_discover_indices based on the consideration sets. This function is called automatically when estimating a model if the min_discover_indices field is not yet set (i.e., it is nothing).

The following are other functions that are available to manipulate the data and can be helpful for estimation.

StructuralSearchModels.update_positions! — Function

update_positions!(data::DataSD, nA0, nd)

Update data.positions in-place to reflect nA0 products in the initial awareness set (position 0) and nd products per subsequent position. Modifies data directly and returns nothing.
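The position pattern described above can be sketched for a single session as follows. The helpers sketch_position and sketch_positions are hypothetical and ignore any special handling of the outside option that the package may apply.

```julia
# Hypothetical helper: position of the i-th listed product (1-based index).
# The first nA0 products sit at position 0; afterwards, nd products share each position.
sketch_position(i, nA0, nd) = i <= nA0 ? 0 : cld(i - nA0, nd)

sketch_positions(n_products, nA0, nd) = [sketch_position(i, nA0, nd) for i in 1:n_products]

sketch_positions(7, 1, 2)  # returns [0, 1, 1, 2, 2, 3, 3]
```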

StructuralSearchModels.add_product_fe! — Function

add_product_fe!(model::SDModel, data::DataSD, n_min::Int, location::String)

Add product fixed effects to model and data for all products observed at least n_min times. location controls where the fixed effects enter: "search" shifts the search value, "hidden" shifts the hidden part of utility, or "both" shifts both. Product indicator columns are added to data.product_characteristics in-place, and the model's information structure is updated accordingly.
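The data-side step (adding product indicator columns) can be sketched in Base Julia as below. This does not touch a model and is not the package's implementation. In particular, appending the indicators as trailing columns and excluding the outside option (ID 0) are assumptions, and the package may order columns differently.

```julia
# Toy data: three sessions, each with the outside option (ID 0) listed first.
product_ids = [[0, 11, 12], [0, 11, 13], [0, 12, 11]]
product_characteristics = [hcat(randn(3), [1.0, 0.0, 0.0]) for _ in product_ids]
n_min = 2

# Count how often each product appears across sessions, skipping the outside option.
counts = Dict{Int, Int}()
for ids in product_ids, id in ids
    id == 0 && continue
    counts[id] = get(counts, id, 0) + 1
end

# Products observed at least n_min times receive a fixed effect.
fe_products = sort([id for (id, c) in counts if c >= n_min])

# Append one indicator column per selected product to each session's matrix.
with_fe = [hcat(X, [Float64(id == p) for id in ids, p in fe_products])
           for (X, ids) in zip(product_characteristics, product_ids)]
```

Here product 13 appears only once, so it receives no indicator column.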

StructuralSearchModels.fill_indices_min_discover! — Function

fill_indices_min_discover!(data::DataSD)

Populate data.min_discover_indices in-place from data.consideration_sets and data.positions. For each session, computes the array index of the last product that must have been discovered based on the deepest click position. Returns nothing.
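For a single session, the computation can be sketched as follows. The helper sketch_min_discover_index is hypothetical. It assumes that all products sharing the deepest clicked position are discovered together, and that a session without clicks implies only the initial awareness set (position 0) was discovered.

```julia
# Hypothetical per-session helper (not the package's implementation).
function sketch_min_discover_index(positions, consideration_set)
    # Assumption: without clicks, only position 0 must have been discovered.
    any(consideration_set) || return findlast(==(0), positions)
    # Deepest position among clicked products.
    deepest = maximum(positions[consideration_set])
    # Last array index whose position is at most the deepest clicked position.
    return findlast(<=(deepest), positions)
end

positions = [0, 0, 1, 1, 2, 2, 3]
clicked = [false, true, false, true, false, false, false]
sketch_min_discover_index(positions, clicked)  # returns 4
```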

StructuralSearchModels.merge_data — Function

merge_data(data1::DataSD, data2::DataSD)

Merge two DataSD objects into one by concatenating all fields. Duplicate consumer_ids are replaced with consecutive integers.

Returns

A new DataSD with all sessions from data1 followed by those from data2.
