Title: | Imputation of Missing Data in Sequence Analysis |
---|---|
Description: | Multiple imputation of missing data in a dataset using MICT or MICT-timing methods. The core idea of the algorithms is to fill gaps of missing data, which is the typical form of missing data in a longitudinal setting, recursively from their edges. Prediction is based on either a multinomial or random forest regression model. Covariates and time-dependent covariates can be included in the model. |
Authors: | Kevin Emery [aut, cre], Anthony Guinchard [aut], Andre Berchtold [aut], Kamyar Taher [aut] |
Maintainer: | Kevin Emery <[email protected]> |
License: | GPL-2 |
Version: | 2.2.0 |
Built: | 2025-02-14 06:14:38 UTC |
Source: | https://github.com/emerykevin/seqimpute |
Function that adds the clustering result to a seqimp object obtained with the seqimpute function.
addcluster(impdata, clustering)
impdata |
An object of class |
clustering |
Clustering made on the multiple imputed datasets. Can be either a data frame or a matrix, where each row corresponds to an observation and each column to one imputed dataset. |
Returns a seqimp object containing the cluster to which each sequence in each imputed dataset belongs. Specifically, a column named cluster is added to the imputed datasets.
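To make the expected layout of clustering concrete, here is a minimal base R sketch. It does not call addcluster; the toy list imputed and the matrix clustering are hypothetical stand-ins, and the loop only illustrates how column i of the clustering could end up as the cluster column of imputed dataset i.

```r
# Two toy completed datasets (hypothetical stand-ins for the imputed datasets)
imputed <- list(
  data.frame(T1 = c("no", "yes", "no"), T2 = c("no", "yes", "yes")),
  data.frame(T1 = c("no", "yes", "no"), T2 = c("no", "yes", "no"))
)

# One row per observation, one column per imputed dataset
clustering <- cbind(c(1, 2, 1), c(1, 2, 2))

# Attach column i of the clustering to imputed dataset i
with_cluster <- lapply(seq_along(imputed), function(i) {
  d <- imputed[[i]]
  d$cluster <- clustering[, i]
  d
})

with_cluster[[2]]
```

The same observation can fall in different clusters across imputed datasets, which is why the clustering needs one column per imputation.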
The function converts a seqimp object into a specified format: a data frame or a mids object.
fromseqimp(data, format = "long", include = FALSE)
data |
An object of class seqimp as created by the function seqimpute |
format |
The format in which the seqimp object should be returned. It
could be: |
include |
logical that indicates if the original dataset with missing
value should be included or not. This parameter does not apply
if |
The argument format specifies the object that should be returned by the function. It can take the following values:
"long": produces a data set in which the imputed data sets are stacked vertically. The following columns are added: 1) .imp, referring to the imputation number, and 2) .id, the row names of the original dataset.
"stacked": the same as "long", but without the two columns .imp and .id.
"mids": produces an object of class mids, which is the format used by the mice package.
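The difference between the "long" and "stacked" layouts can be sketched in base R. This is an illustration under assumptions, not the code used by fromseqimp: the list imputed and the helper stack_long() are hypothetical, and only the column names .imp and .id come from the description above.

```r
# Two toy completed datasets sharing the same row names (hypothetical data)
imputed <- list(
  data.frame(T1 = c("no", "yes"), T2 = c("yes", "yes"), row.names = c("a", "b")),
  data.frame(T1 = c("no", "yes"), T2 = c("no", "yes"), row.names = c("a", "b"))
)

# Stack the datasets vertically, recording the imputation number (.imp)
# and the original row name (.id)
stack_long <- function(imputed) {
  parts <- lapply(seq_along(imputed), function(i) {
    d <- imputed[[i]]
    data.frame(.imp = i, .id = rownames(d), d,
               row.names = NULL, check.names = FALSE)
  })
  do.call(rbind, parts)
}

long <- stack_long(imputed)

# The "stacked" layout is the same table without the two bookkeeping columns
stacked <- long[, setdiff(names(long), c(".imp", ".id"))]
```

Keeping .imp and .id makes it possible to split the table back into one dataset per imputation, which the "stacked" layout no longer allows.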
Transform a seqimp object into the desired format.
Kevin Emery
## Not run:
# Imputation with the MICT algorithm
imp <- seqimpute(data = gameadd, var = 1:4)

# The object imp is transformed to a dataframe, where completed datasets are
# stacked vertically
imp.stacked <- fromseqimp(
  data = imp,
  format = "stacked",
  include = FALSE
)
## End(Not run)
Dataset containing variables on the gaming addiction of young people. The data consist of gaming addiction, coded as either 'no' or 'yes', measured over four consecutive years for 500 individuals, three covariates and one time-dependent covariate. The yearly states are recorded in columns 1 (T1_abuse) to 4 (T4_abuse).
The three covariates are Gender (female or male), Age (measured at time 1), and Track (school or apprenticeship).
The time-varying covariate consists of the individual's relationship to gambling at each of the four time points, appearing in columns T1_gambling, T2_gambling, T3_gambling, and T4_gambling. The states are either no, gambler or problematic gambler.
data(gameadd)
A data frame containing 500 rows, 4 state variables, 3 covariates and a time-dependent covariate.
Plot a seqimp object. The state distribution plot of the first m completed datasets is shown, possibly alongside the original dataset with missing data.
## S3 method for class 'seqimp'
plot(x, m = 5, include = TRUE, ...)
x |
Object of class |
m |
Number of completed datasets to show |
include |
Logical that indicates whether the original dataset with missing values should be plotted or not |
... |
Arguments to be passed to the seqdplot function |
Kevin Emery
Print a seqimp object.
## S3 method for class 'seqimp'
print(x, ...)
x |
Object of class |
... |
additional arguments passed to other functions |
Kevin Emery
Generation of missing data in sequence based on a Markovian approach.
seqaddNA(
  data,
  var = NULL,
  states.high = NULL,
  propdata = 1,
  pstart.high = 0.1,
  pstart.low = 0.005,
  pcont = 0.66,
  maxgap = 3,
  maxprop = 0.75,
  only.traj = FALSE
)
data |
A data frame containing sequences of a categorical (multinomial)
variable, where missing data are coded as |
var |
A vector specifying the columns of the dataset
that contain the trajectories. Default is |
states.high |
A list of states with a higher probability of initiating a subsequent missing data gap. |
propdata |
Proportion of trajectories for which missing data is simulated, as a decimal between 0 and 1. |
pstart.high |
Probability of starting a missing data gap for the
states specified in the |
pstart.low |
Probability of starting a missing data gap for all other states. |
pcont |
Probability of a missing data gap to continue. |
maxgap |
Maximum length of a missing data gap. |
maxprop |
Maximum proportion of missing data allowed in a sequence, as a decimal between 0 and 1. |
only.traj |
Logical, if |
The first time point of a trajectory is missing with probability pstart.low. For each subsequent time point, the probability of being missing depends on the previous time point. There are four cases:
1. If the previous time point is missing and the maximum length of a missing gap, specified by the argument maxgap, has been reached, the time point is set as observed.
2. If the previous time point is missing but the maximum length of a gap has not been reached, the time point is missing with probability pcont.
3. If the previous time point is observed and its state belongs to the list of states specified by states.high, the time point is missing with probability pstart.high.
4. If the previous time point is observed but its state does not belong to the list of states specified by states.high, the time point is missing with probability pstart.low.
If the proportion of missing data in a given trajectory exceeds the proportion specified by maxprop, the missing data simulation is repeated for the sequence.
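The four cases and the maxprop rule can be sketched for a single sequence in base R. This is a minimal illustration of the process as described above, not the seqaddNA source: simulate_missing() is a hypothetical helper, it returns only a missingness indicator, and it reuses the argument names and defaults from the usage for readability.

```r
# Simulate a missingness indicator for one sequence (TRUE = missing).
# simulate_missing() is a hypothetical helper, not part of seqimpute.
simulate_missing <- function(states, states.high = NULL,
                             pstart.high = 0.1, pstart.low = 0.005,
                             pcont = 0.66, maxgap = 3, maxprop = 0.75) {
  n <- length(states)
  repeat {
    miss <- logical(n)
    gap <- 0  # length of the current run of missing values
    for (t in seq_len(n)) {
      if (t == 1) {
        p <- pstart.low     # first time point
      } else if (miss[t - 1] && gap >= maxgap) {
        p <- 0              # case 1: maximum gap length reached
      } else if (miss[t - 1]) {
        p <- pcont          # case 2: the gap may continue
      } else if (states[t - 1] %in% states.high) {
        p <- pstart.high    # case 3: previous state is in states.high
      } else {
        p <- pstart.low     # case 4: any other observed state
      }
      miss[t] <- runif(1) < p
      gap <- if (miss[t]) gap + 1 else 0
    }
    # Redraw the whole sequence when too much of it is missing
    if (mean(miss) <= maxprop) return(miss)
  }
}

set.seed(1)
states <- rep(c("school", "joblessness", "employment"), times = c(4, 3, 5))
miss <- simulate_missing(states, states.high = "joblessness")
```

With states.high left at NULL every observed state uses pstart.low, so the chance of starting a gap does not depend on the previous state.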
A data frame with simulated missing data.
Kevin Emery
# Generate MCAR missing data on the mvad dataset
# from the TraMineR package
## Not run:
data(mvad, package = "TraMineR")
mvad.miss <- seqaddNA(mvad, var = 17:86)

# Generate missing data on mvad where joblessness is more likely to trigger
# a missing data gap
mvad.miss2 <- seqaddNA(mvad, var = 17:86, states.high = "joblessness")
## End(Not run)
Extract all the trajectories without any missing value.
seqcomplete(data, var = NULL)
data |
either a data frame containing sequences of a multinomial
variable with missing data (coded as |
var |
the list of columns containing the trajectories. Default is NULL, i.e. all the columns. |
Returns either a data frame or a state sequence object, depending on the type of data that was provided to the function.
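For a plain data frame, the selection can be pictured with base R alone. The sketch below is an assumption about the behaviour, not the package code: it supposes that only the trajectory columns (the role played by var) are checked for missing values, and it uses a hypothetical toy data frame.

```r
# Toy data: two trajectory columns and one covariate (hypothetical)
toy <- data.frame(
  T1 = c("no", NA, "yes"),
  T2 = c("no", "yes", NA),
  Age = c(15, NA, 16)
)
traj <- 1:2  # columns holding the trajectories, playing the role of var

# A row is complete when none of its trajectory columns is missing
complete_rows <- stats::complete.cases(toy[, traj])

toy_complete <- toy[complete_rows, ]   # rows without any missing state
toy_withmiss <- toy[!complete_rows, ]  # rows with at least one missing state
```

Under this assumption a missing covariate such as Age does not affect the split, since only the trajectory columns are inspected.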
Kevin Emery
# Game addiction dataset
data(gameadd)

# Extract the trajectories without any missing data
gameadd.complete <- seqcomplete(gameadd, var = 1:4)
The seqimpute package implements the MICT and MICT-timing methods. These are multiple imputation methods for longitudinal data. The core idea of the algorithms is to fill gaps of missing data, which is the typical form of missing data in a longitudinal setting, recursively from their edges. The prediction is based on either a multinomial or a random forest regression model. Covariates and time-dependent covariates can be included in the model.
The MICT-timing algorithm is an extension of the MICT algorithm designed to address a key limitation of the latter: its assumption that position in the trajectory is irrelevant.
seqimpute(
  data,
  var = NULL,
  np = 1,
  nf = 1,
  m = 5,
  timing = FALSE,
  frame.radius = 0,
  covariates = NULL,
  time.covariates = NULL,
  regr = "multinom",
  npt = 1,
  nfi = 1,
  ParExec = FALSE,
  ncores = NULL,
  SetRNGSeed = FALSE,
  end.impute = TRUE,
  verbose = TRUE,
  available = TRUE,
  pastDistrib = FALSE,
  futureDistrib = FALSE,
  ...
)
data |
Either a data frame containing sequences of a categorical
variable, where missing data are coded as |
var |
A vector specifying the columns of the dataset
that contain the trajectories. Default is |
np |
Number of prior states to include in the imputation model for internal gaps. |
nf |
Number of subsequent states to include in the imputation model for internal gaps. |
m |
Number of multiple imputations to perform (default: |
timing |
Logical, specifies the imputation algorithm to use.
If |
frame.radius |
Integer, relevant only for the MICT-timing algorithm, specifying the radius of the timeframe. |
covariates |
List of the columns of the dataset containing covariates to be included in the imputation model. |
time.covariates |
List of the columns of the dataset with time-varying covariates to include in the imputation model. |
regr |
Character specifying the imputation method. Options include
|
npt |
Number of prior observations in the imputation model for terminal gaps (i.e., gaps at the end of sequences). |
nfi |
Number of future observations in the imputation model for initial gaps (i.e., gaps at the beginning of sequences). |
ParExec |
Logical, indicating whether to run multiple imputations
in parallel. Setting to |
ncores |
Integer, specifying the number of cores to use for parallel computation. If unset, defaults to the maximum number of CPU cores minus one. |
SetRNGSeed |
Integer, to set the random seed for reproducibility in
parallel computations. Note that setting |
end.impute |
Logical. If |
verbose |
Logical, if |
available |
Logical, specifies whether to consider already imputed
data in the predictive model. If |
pastDistrib |
Logical, if |
futureDistrib |
Logical, if |
... |
Named arguments that are passed down to the imputation functions. |
The imputation process is divided into several steps, depending on the type of gap of missing data. The gaps are imputed in the following order:
1. Internal gap: there are at least np observations before the gap and at least nf observations after it.
2. Initial gap: gaps situated at the very beginning of a trajectory.
3. Terminal gap: gaps situated at the very end of a trajectory.
4. Left-hand side specifically located gap (SLG): gaps that have at least nf observations after the gap, but fewer than np observations before it.
5. Right-hand side SLG: gaps that have at least np observations before the gap, but fewer than nf observations after it.
6. Both-hand side SLG: gaps that have fewer than np observations before the gap and fewer than nf observations after it.
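The gap typology above can be illustrated with a short base R sketch. It is not taken from seqimpute: classify_gaps() is a hypothetical helper, and it assumes that the "observations" before and after a gap are counted as time points, whether or not they are themselves observed.

```r
# classify_gaps() is a hypothetical helper, not a seqimpute function.
# It labels each run of NA in a single sequence with its gap type.
classify_gaps <- function(x, np = 1, nf = 1) {
  n <- length(x)
  runs <- rle(is.na(x))
  end <- cumsum(runs$lengths)
  start <- end - runs$lengths + 1
  gaps <- data.frame(start = start[runs$values], end = end[runs$values])

  before <- gaps$start - 1  # time points preceding the gap (assumption)
  after <- n - gaps$end     # time points following the gap (assumption)

  type <- rep("both SLG", nrow(gaps))
  type[before >= np & after < nf] <- "right SLG"
  type[before < np & after >= nf] <- "left SLG"
  type[before >= np & after >= nf] <- "internal"
  type[after == 0] <- "terminal"
  type[before == 0] <- "initial"
  gaps$type <- type
  gaps
}

x <- c(NA, "no", NA, "yes", "yes", NA, "no", NA)
classify_gaps(x, np = 3, nf = 3)
```

With np = 1 and nf = 1, every gap that is neither initial nor terminal is internal; SLGs only appear once np or nf exceeds 1.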
The primary difference between the MICT and MICT-timing algorithms lies in their approach to selecting patterns from other sequences for fitting the multinomial model. While the MICT algorithm considers all similar patterns regardless of their temporal placement, MICT-timing restricts pattern selection to those that are temporally closest to the missing value. This refinement ensures that the imputation process adequately accounts for temporal dynamics, resulting in more accurate imputed values.
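The contrast in pattern selection can be pictured with a small base R sketch for np = 1 and nf = 1. It is a simplified illustration under assumptions, not the package code: get_patterns() is a hypothetical helper, the toy sequences are complete, and the timing variant is assumed to keep the positions within frame.radius of the value to impute.

```r
# Complete toy sequences (rows = individuals, columns = time points)
seqs <- matrix(
  c("a", "a", "b", "b", "b",
    "a", "b", "b", "b", "a",
    "b", "b", "a", "a", "a"),
  nrow = 3, byrow = TRUE
)

# Collect (previous, current, next) patterns centred on the given positions.
# get_patterns() is a hypothetical helper for np = 1 and nf = 1.
get_patterns <- function(seqs, positions) {
  do.call(rbind, lapply(positions, function(t) {
    data.frame(prev = seqs[, t - 1], current = seqs[, t],
               nxt = seqs[, t + 1], position = t)
  }))
}

t_missing <- 3      # position of the value to impute
frame.radius <- 0   # radius of the time frame, only used with timing
inner <- 2:(ncol(seqs) - 1)  # positions with a predecessor and a successor

# MICT-like selection: patterns from every admissible position
train_mict <- get_patterns(seqs, inner)

# MICT-timing-like selection: patterns within frame.radius of t_missing
frame <- intersect(inner, (t_missing - frame.radius):(t_missing + frame.radius))
train_timing <- get_patterns(seqs, frame)
```

Here the timing variant trains on 3 patterns instead of 9. A larger frame.radius trades temporal specificity for more training patterns.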
An object of class seqimp, which is a list with the following elements:
data: A data.frame containing the original (incomplete) data.
imp: A list of m data.frame objects corresponding to the imputed datasets.
m: The number of imputations.
method: A character vector specifying whether MICT or MICT-timing was used.
np: Number of prior states included in the imputation model.
nf: Number of subsequent states included in the imputation model.
regr: A character vector specifying whether multinomial or random forest imputation models were applied.
call: The call that created the object.
Kevin Emery <[email protected]>, Andre Berchtold, Anthony Guinchard, and Kamyar Taher
Halpin, B. (2012). Multiple imputation for life-course sequence data. Working Paper WP2012-01, Department of Sociology, University of Limerick. http://hdl.handle.net/10344/3639.
Halpin, B. (2013). Imputing sequence data: Extensions to initial and terminal gaps, Stata's. Working Paper WP2013-01, Department of Sociology, University of Limerick. http://hdl.handle.net/10344/3620
Emery, K., Studer, M., & Berchtold, A. (2024). Comparison of imputation methods for univariate categorical longitudinal data. Quality & Quantity, 1-25. https://link.springer.com/article/10.1007/s11135-024-02028-z
# Default multiple imputation of the trajectories of game addiction with the
# MICT algorithm
## Not run:
set.seed(5)
imp1 <- seqimpute(data = gameadd, var = 1:4)

# Default multiple imputation with the MICT-timing algorithm
set.seed(3)
imp2 <- seqimpute(data = gameadd, var = 1:4, timing = TRUE)

# Inclusion in the MICT imputation process (timing is left at its default,
# FALSE) of the three background characteristics (Gender, Age and Track),
# and the time-varying covariate about gambling
set.seed(4)
imp3 <- seqimpute(
  data = gameadd, var = 1:4, covariates = 5:7,
  time.covariates = 8:11
)

# Parallel computation
imp4 <- seqimpute(
  data = gameadd, var = 1:4, covariates = 5:7,
  time.covariates = 8:11, ParExec = TRUE, ncores = 5, SetRNGSeed = 2
)
## End(Not run)
This function plots the most frequent patterns of missing data, based on the seqfplot function.
seqmissfplot(data, var = NULL, with.complete = TRUE, void.miss = TRUE, ...)
data |
Either a data frame containing sequences of a categorical
variable, where missing data are coded as |
var |
A vector specifying the columns of the dataset
that contain the trajectories. Default is |
with.complete |
Logical, if |
void.miss |
Logical, if |
... |
Additional parameters passed to the seqfplot function. |
This plot function is based on the seqfplot function, allowing users to visualize patterns of missing data within sequences. For details on additional customizable arguments, see the seqfplot documentation.
By default, this function plots the 10 most frequent patterns. The number of patterns to be plotted can be adjusted using the idxs argument in seqfplot.
Kevin Emery
# Plot the 10 most common patterns of missing data
seqmissfplot(gameadd, var = 1:4)

# Plot the 10 most common patterns of missing data discarding
# complete trajectories
seqmissfplot(gameadd, var = 1:4, with.complete = FALSE)

# Plot only the 5 most common patterns of missing data discarding
# complete trajectories
seqmissfplot(gameadd, var = 1:4, with.complete = FALSE, idxs = 1:5)
This function identifies and visualizes states that best characterize
sequences with missing data at each position (time point), comparing them to
sequences without missing data at each position (time point). It is based on
the seqimplic function. For more information on the
methodology, see the seqimplic
documentation.
seqmissimplic(data, var = NULL, void.miss = TRUE, ...)
data |
Either a data frame containing sequences of a categorical
variable, where missing data are coded as |
var |
A vector specifying the columns of the dataset
that contain the trajectories. Default is |
void.miss |
Logical, if |
... |
parameters to be passed to the seqimplic function |
Returns a seqimplic object that can be plotted and printed.
Kevin Emery
# For illustration purposes, we simulate missing data on the mvad dataset,
# available in the TraMineR package. The "joblessness" state has a
# higher probability of triggering a missing gap
## Not run:
data(mvad, package = "TraMineR")
mvad.miss <- seqaddNA(mvad, var = 17:86, states.high = "joblessness")

# The states that best characterize sequences with missing data
implic <- seqmissimplic(mvad.miss, var = 17:86)

# Visualization of the results
plot(implic)
## End(Not run)
This function plots all patterns of missing data within sequences, based on the seqIplot function.
seqmissIplot(data, var = NULL, with.complete = TRUE, void.miss = TRUE, ...)
data |
Either a data frame containing sequences of a categorical
variable, where missing data are coded as |
var |
A vector specifying the columns of the dataset
that contain the trajectories. Default is |
with.complete |
Logical, if |
void.miss |
Logical, if |
... |
Additional parameters passed to the seqIplot function. |
This function uses seqIplot to visualize all patterns of missing data within sequences. For further customization options, refer to the seqIplot documentation.
Kevin Emery
# Plot all the patterns of missing data
seqmissIplot(gameadd, var = 1:4)

# Plot all the patterns of missing data discarding
# complete trajectories
seqmissIplot(gameadd, var = 1:4, with.complete = FALSE)
The seqQuickLook() function provides an overview of the number and size of the different types of gaps found in the original dataset.
seqQuickLook(data, var = NULL, np = 1, nf = 1)
data |
a data.frame where missing data are coded as NA or a state sequence object built with seqdef function |
var |
the list of columns containing the trajectories. Default is NULL, i.e. all the columns. |
np |
number of previous observations in the imputation model of the internal gaps. |
nf |
number of future observations in the imputation model of the internal gaps. |
The distinction between internal and SLG gaps depends on the number of previous (np) and future (nf) observations that are set for the MICT and MICT-timing algorithms.
Returns a data.frame object that summarizes, for each type of gap (Internal Gaps, Initial Gaps, Terminal Gaps, LEFT-hand side SLG, RIGHT-hand side SLG, Both-hand side SLG), the minimum length, the maximum length, the total number of gaps and the total number of missing values they contain.
Andre Berchtold and Kevin Emery
data(gameadd)
seqQuickLook(data = gameadd, var = 1:4, np = 1, nf = 1)
The purpose of seqTrans is to spot impossible transitions in longitudinal categorical data.
seqTrans(data, var = NULL, trans)
data |
a data frame containing sequences of a multinomial
variable with missing data (coded as |
var |
the list of columns containing the trajectories. Default is NULL, i.e. all the columns. |
trans |
|
It returns a matrix where each row is the position of an impossible transition.
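The idea of locating a transition such as "yes->no" can be sketched in base R. This is an illustration, not the seqTrans implementation: find_transitions() is a hypothetical helper, the "from->to" notation is borrowed from the package example, and reporting the time point at which the transition starts is an assumption.

```r
# find_transitions() is a hypothetical helper, not the seqTrans implementation.
find_transitions <- function(data, trans) {
  m <- as.matrix(data)
  # Build "from->to" labels for every pair of consecutive time points
  pairs <- paste(m[, -ncol(m)], m[, -1], sep = "->")
  # One row per match: sequence index and time point where the transition starts
  which(matrix(pairs %in% trans, nrow = nrow(m)), arr.ind = TRUE)
}

toy <- data.frame(
  T1 = c("no", "yes", "no"),
  T2 = c("yes", "no", NA),
  T3 = c("no", "no", "no"),
  stringsAsFactors = FALSE
)

find_transitions(toy, trans = "yes->no")
```

Transitions involving a missing value never match a label in trans, so they are ignored in this sketch.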
Andre Berchtold and Kevin Emery
data(gameadd)
seqTransList <- seqTrans(data = gameadd, var = 1:4, trans = c("yes->no"))
Extract all the trajectories with at least one missing value
seqwithmiss(data, var = NULL)
data |
either a data frame containing sequences of a multinomial
variable with missing data (coded as |
var |
the list of columns containing the trajectories. Default is NULL, i.e. all the columns. |
Returns either a data frame or a state sequence object, depending on the type of data that was provided to the function.
Kevin Emery
# Game addiction dataset
data(gameadd)

# Extract the trajectories with at least one missing value
gameadd.withmiss <- seqwithmiss(gameadd, var = 1:4)
Summary of a seqimp object.
## S3 method for class 'seqimp'
summary(object, ...)
object |
Object of class |
... |
additional arguments passed to other functions |
Kevin Emery