| Title: | Imputation of Missing Data in Sequence Analysis |
|---|---|
| Description: | Multiple imputation of missing data in a dataset using MICT or MICT-timing methods. The core idea of the algorithms is to fill gaps of missing data, which is the typical form of missing data in a longitudinal setting, recursively from their edges. Prediction is based on either a multinomial or random forest regression model. Covariates and time-dependent covariates can be included in the model. |
| Authors: | Kevin Emery [aut, cre], Anthony Guinchard [aut], Andre Berchtold [aut], Kamyar Taher [aut] |
| Maintainer: | Kevin Emery <[email protected]> |
| License: | GPL-2 |
| Version: | 2.2.1 |
| Built: | 2026-05-20 08:40:27 UTC |
| Source: | https://github.com/emerykevin/seqimpute |
Function that adds the clustering result to a seqimp object
obtained with the seqimpute function
addcluster(impdata, clustering)
impdata |
An object of class |
clustering |
Clustering made on the multiply imputed datasets. Can be either a dataframe or a matrix, where each row corresponds to an observation and each column to an imputed dataset |
Returns a seqimp object containing the cluster to which each
sequence in each imputed dataset belongs. Specifically, a column named
cluster is added to the imputed datasets.
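The expected shape of the clustering argument can be illustrated with a minimal base R sketch. It assumes a toy list `imp` standing in for the imputed datasets of a seqimp object; the mapping of column j of the clustering to imputed dataset j is done by hand here and is not the code of addcluster itself.

```r
# Toy stand-in for the imputed datasets of a seqimp object: m = 2 completed
# datasets with n = 3 sequences each (hypothetical values).
imp <- list(
  data.frame(T1 = c("no", "yes", "no"), T2 = c("no", "yes", "yes")),
  data.frame(T1 = c("no", "yes", "no"), T2 = c("yes", "yes", "yes"))
)

# One row per sequence, one column per imputed dataset
# (matrix() fills column by column).
clustering <- matrix(c(1, 2, 1,
                       2, 2, 1), nrow = 3, ncol = 2)

# Attach column j of `clustering` to imputed dataset j as a `cluster` column.
imp_clustered <- Map(function(d, j) {
  d$cluster <- clustering[, j]
  d
}, imp, seq_along(imp))

imp_clustered[[2]]
```

A sequence may fall into different clusters across imputed datasets, which is why one column per imputation is needed.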
Transform a seqimp object into a dataframe or a mids object
The function converts a seqimp object into a specified format.
fromseqimp(data, format = "long", include = FALSE)
data |
An object of class seqimp as created by the function seqimpute |
format |
The format in which the seqimp object should be returned. It
could be: |
include |
Logical that indicates whether the original dataset with missing
values should be included. This parameter does not apply
if |
The argument format specifies the object that should be returned
by the function. It can take the following values:
"long": produces a data set in which imputed data sets are stacked vertically.
The following columns are added: 1) .imp referring to the
imputation number, and 2) .id the row names of the original dataset.
"stacked": the same as "long", but without the inclusion of
the two columns .imp and .id.
"mids": produces an object of class mids, which is the format
used by the mice package.
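The difference between the "long" and "stacked" layouts can be sketched in base R. This is only an illustration on a hypothetical list of two completed datasets; the exact column order and types produced by fromseqimp are not reproduced here.

```r
# Two hypothetical completed datasets sharing the row names of the original.
imp <- list(
  data.frame(T1 = c("no", "yes"), T2 = c("no", "yes"),
             row.names = c("a", "b")),
  data.frame(T1 = c("no", "yes"), T2 = c("yes", "yes"),
             row.names = c("a", "b"))
)

# "stacked"-like layout: completed datasets bound row-wise, nothing added.
stacked <- do.call(rbind, unname(imp))
rownames(stacked) <- NULL

# "long"-like layout: same rows, plus the imputation number (.imp) and the
# row name of the original dataset (.id).
long <- do.call(rbind, lapply(seq_along(imp), function(k) {
  data.frame(.imp = k, .id = rownames(imp[[k]]), imp[[k]],
             stringsAsFactors = FALSE)
}))
rownames(long) <- NULL

long
```

The .imp and .id columns make it possible to split the stacked rows back into individual completed datasets, which the "stacked" layout alone does not allow.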
Transform a seqimp object into the desired format.
Kevin Emery
    ## Not run:
    # Imputation with the MICT algorithm
    imp <- seqimpute(data = gameadd, var = 1:4)
    # The object imp is transformed to a dataframe, where completed datasets are
    # stacked vertically
    imp.stacked <- fromseqimp(
      data = imp,
      format = "stacked",
      include = FALSE
    )
    ## End(Not run)
Dataset containing variables on the gaming addiction of young people.
The data consists of gaming addiction, coded as either 'no' or 'yes',
measured over four consecutive years for 500 individuals, three covariates
and one time-dependent covariate. The yearly states
are recorded in columns 1 (T1_abuse) to 4 (T4_abuse).
The three covariates are
Gender (female or male),
Age (measured at time 1),
Track (school or apprenticeship).
The time-varying covariate consists of the individual's relationship to
gambling at each of the four time points, appearing in columns
T1_gambling, T2_gambling,
T3_gambling, and T4_gambling. The states are either
no, gambler or problematic gambler.
data(gameadd)
A data frame containing 500 rows, 4 state variables, 3 covariates and a time-dependent covariate.
Plot a seqimp object. The state distribution plot of the first
m completed datasets is shown, possibly alongside the original
dataset with missing data.
    ## S3 method for class 'seqimp'
    plot(x, m = 5, include = TRUE, ...)
x |
Object of class |
m |
Number of completed datasets to show |
include |
Logical that indicates whether the original dataset with missing values should be plotted |
... |
Arguments to be passed to the seqdplot function |
Kevin Emery
Print a seqimp object
    ## S3 method for class 'seqimp'
    print(x, ...)
x |
Object of class |
... |
additional arguments passed to other functions |
Kevin Emery
Generation of missing data in sequences based on a Markovian approach.
seqaddNA( data, var = NULL, states.high = NULL, propdata = 1, pstart.high = 0.1, pstart.low = 0.005, pcont = 0.66, maxgap = 3, maxprop = 0.75, only.traj = FALSE )
data |
A data frame containing sequences of a categorical (multinomial)
variable, where missing data are coded as |
var |
A vector specifying the columns of the dataset
that contain the trajectories. Default is |
states.high |
A list of states with a higher probability of initiating a subsequent missing data gap. |
propdata |
Proportion of trajectories for which missing data is simulated, as a decimal between 0 and 1. |
pstart.high |
Probability of starting a missing data gap for the
states specified in the |
pstart.low |
Probability of starting a missing data gap for all other states. |
pcont |
Probability of a missing data gap to continue. |
maxgap |
Maximum length of a missing data gap. |
maxprop |
Maximum proportion of missing data allowed in a sequence, as a decimal between 0 and 1. |
only.traj |
Logical, if |
The first time point of a trajectory has a pstart.low probability to
be missing. For the next time points, the probability to be missing depends
on the previous time point. There are four cases:
1. If the previous time point is missing and the maximum length of a
missing gap, which is specified by the argument maxgap, is reached,
the time point is set as observed.
2. If the previous time point is missing, but the maximum length of a gap is
not reached, there is a pcont probability that this time point is missing.
3. If the previous time point is observed and its state belongs
to the list of states specified by states.high, the probability to
be missing is pstart.high.
4. If the previous time point is observed but its state does not
belong to the list of states specified by states.high, the
probability to be missing is pstart.low.
If the proportion of missing data in a given trajectory exceeds the
proportion specified by maxprop, the missing data simulation is
repeated for the sequence.
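The rules above can be sketched for a single sequence in base R. The helper `simulate_gaps` is hypothetical and not part of the package; it follows the four cases and the maxprop redraw as described, but makes no claim to match the internals of seqaddNA (for instance, propdata is ignored).

```r
# Hypothetical helper, not part of seqimpute: simulate a missingness pattern
# for one sequence following the four cases described above.
simulate_gaps <- function(states, states.high = NULL, pstart.high = 0.1,
                          pstart.low = 0.005, pcont = 0.66, maxgap = 3,
                          maxprop = 0.75) {
  n <- length(states)
  repeat {
    miss <- logical(n)
    gap <- 0  # length of the gap currently being built
    for (t in seq_len(n)) {
      p <- if (t == 1) {
        pstart.low                      # first time point
      } else if (miss[t - 1] && gap >= maxgap) {
        0                               # case 1: gap reached maxgap
      } else if (miss[t - 1]) {
        pcont                           # case 2: gap may continue
      } else if (states[t - 1] %in% states.high) {
        pstart.high                     # case 3: previous state is high-risk
      } else {
        pstart.low                      # case 4: any other previous state
      }
      miss[t] <- runif(1) < p
      gap <- if (miss[t]) gap + 1 else 0
    }
    # Redraw the whole sequence if the proportion of missing data
    # exceeds maxprop.
    if (mean(miss) <= maxprop) break
  }
  states[miss] <- NA
  states
}

set.seed(1)
x <- rep(c("employment", "joblessness"), times = c(10, 10))
simulate_gaps(x, states.high = "joblessness", pstart.high = 0.5)
```

Because case 3 depends on the observed state, leaving states.high at NULL makes every gap start with probability pstart.low, so missingness no longer depends on the states.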
A data frame with simulated missing data.
Kevin Emery
    # Generate MCAR missing data on the mvad dataset
    # from the TraMineR package
    data(mvad, package = "TraMineR")
    mvad.miss <- seqaddNA(mvad, var = 17:86)
    # Generate missing data on mvad where joblessness is more likely to trigger
    # a missing data gap
    mvad.miss2 <- seqaddNA(mvad, var = 17:86, states.high = "joblessness")
Extract all the trajectories without missing values.
seqcomplete(data, var = NULL)
data |
either a data frame containing sequences of a multinomial
variable with missing data (coded as |
var |
the list of columns containing the trajectories. Default is NULL, i.e. all the columns. |
Returns either a data frame or a state sequence object, depending on the type of data that was provided to the function.
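The idea behind this filter can be illustrated with a minimal base R sketch on a hypothetical data frame whose rows are trajectories. It does not reproduce seqcomplete itself, which also accepts state sequence objects.

```r
# Toy data: rows are trajectories, columns are time points (hypothetical).
traj <- data.frame(
  T1 = c("no", "yes", NA),
  T2 = c("no", NA, "yes"),
  T3 = c("yes", "yes", "yes")
)
var <- 1:3  # columns holding the trajectories

# TRUE for trajectories with no NA in the selected columns.
is_complete <- stats::complete.cases(traj[, var, drop = FALSE])

traj[is_complete, ]   # trajectories without any missing value
traj[!is_complete, ]  # trajectories with at least one missing value
```

Restricting the check to the columns in `var` matters when covariates also contain missing values, since those should not cause a trajectory to be discarded.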
Kevin Emery
    # Game addiction dataset
    data(gameadd)
    # Extract the trajectories without any missing data
    gameadd.complete <- seqcomplete(gameadd, var = 1:4)
seqimpute.iter: Imputation of missing data in longitudinal categorical data
seqimpute( data, var = NULL, np = 1, nf = 1, m = 5, niter = 1, timing = FALSE, frame.radius = 0, covariates = NULL, time.covariates = NULL, regr = "multinom", npt = 1, nfi = 1, ParExec = FALSE, ncores = NULL, SetRNGSeed = FALSE, end.impute = TRUE, verbose = TRUE, available = TRUE, pastDistrib = FALSE, futureDistrib = FALSE, ... )
data |
Either a data frame containing sequences of a categorical
variable, where missing data are coded as |
var |
A vector specifying the columns of the dataset
that contain the trajectories. Default is |
np |
Number of prior states to include in the imputation model for internal gaps. |
nf |
Number of subsequent states to include in the imputation model for internal gaps. |
m |
Number of multiple imputations to perform (default: |
niter |
Number of iterations of the algorithm. |
timing |
Logical, specifies the imputation algorithm to use.
If |
frame.radius |
Integer, relevant only for the MICT-timing algorithm, specifying the radius of the timeframe. |
covariates |
List of the columns of the dataset containing covariates to be included in the imputation model. |
time.covariates |
List of the columns of the dataset with time-varying covariates to include in the imputation model. |
regr |
Character specifying the imputation method. Options include
|
npt |
Number of prior observations in the imputation model for terminal gaps (i.e., gaps at the end of sequences). |
nfi |
Number of future observations in the imputation model for initial gaps (i.e., gaps at the beginning of sequences). |
ParExec |
Logical, indicating whether to run multiple imputations
in parallel. Setting to |
ncores |
Integer, specifying the number of cores to use for parallel computation. If unset, defaults to the maximum number of CPU cores minus one. |
SetRNGSeed |
Integer, to set the random seed for reproducibility in
parallel computations. Note that setting |
end.impute |
Logical. If |
verbose |
Logical, if |
available |
Logical, specifies whether to consider already imputed
data in the predictive model. If |
pastDistrib |
Logical, if |
futureDistrib |
Logical, if |
... |
Named arguments that are passed down to the imputation functions. |
An object of class seqimp, which is a list with the following
elements:
data: A data.frame containing the original
(incomplete) data.
imp: A list of m data.frame corresponding to
the imputed datasets.
m: The number of imputations.
method: A character vector specifying whether MICT or MICT-timing was used.
np: Number of prior states included in the imputation model.
nf: Number of subsequent states included in the imputation model.
regr: A character vector specifying whether multinomial or random forest imputation models were applied.
call: The call that created the object.
Kevin Emery <[email protected]>, Andre Berchtold, Anthony Guinchard, and Kamyar Taher
Halpin, B. (2012). Multiple imputation for life-course sequence data. Working Paper WP2012-01, Department of Sociology, University of Limerick. http://hdl.handle.net/10344/3639.
Halpin, B. (2013). Imputing sequence data: Extensions to initial and terminal gaps, Stata's. Working Paper WP2013-01, Department of Sociology, University of Limerick. http://hdl.handle.net/10344/3620
Emery, K., Studer, M., & Berchtold, A. (2024). Comparison of imputation methods for univariate categorical longitudinal data. Quality & Quantity, 1-25. https://link.springer.com/article/10.1007/s11135-024-02028-z
This function plots the most frequent patterns of missing data, based on the seqfplot function.
seqmissfplot(data, var = NULL, with.complete = TRUE, void.miss = TRUE, ...)
data |
Either a data frame containing sequences of a categorical
variable, where missing data are coded as |
var |
A vector specifying the columns of the dataset
that contain the trajectories. Default is |
with.complete |
Logical, if |
void.miss |
Logical, if |
... |
Additional parameters passed to the seqfplot function. |
This plot function is based on the seqfplot function, allowing users to visualize patterns of missing data within sequences. For details on additional customizable arguments, see the seqfplot documentation.
By default, this function plots the 10 most frequent patterns. The number
of patterns to be plotted can be adjusted using the idxs argument
in seqfplot.
Kevin Emery
    # Plot the 10 most common patterns of missing data
    seqmissfplot(gameadd, var = 1:4)
    # Plot the 10 most common patterns of missing data discarding
    # complete trajectories
    seqmissfplot(gameadd, var = 1:4, with.complete = FALSE)
    # Plot only the 5 most common patterns of missing data discarding
    # complete trajectories
    seqmissfplot(gameadd, var = 1:4, with.complete = FALSE, idxs = 1:5)
This function identifies and visualizes states that best characterize
sequences with missing data at each position (time point), comparing them to
sequences without missing data at each position (time point). It is based on
the seqimplic function. For more information on the
methodology, see the seqimplic documentation.
seqmissimplic(data, var = NULL, void.miss = TRUE, ...)
data |
Either a data frame containing sequences of a categorical
variable, where missing data are coded as |
var |
A vector specifying the columns of the dataset
that contain the trajectories. Default is |
void.miss |
Logical, if |
... |
Parameters to be passed to the seqimplic function |
Returns a seqimplic object that can be plotted and printed.
Kevin Emery
    # For illustration purposes, we simulate missing data on the mvad dataset,
    # available in the TraMineR package. The "joblessness" state has a
    # higher probability of triggering a missing data gap
    ## Not run:
    data(mvad, package = "TraMineR")
    mvad.miss <- seqaddNA(mvad, var = 17:86, states.high = "joblessness")
    # The states that best characterize sequences with missing data
    implic <- seqmissimplic(mvad.miss, var = 17:86)
    # Visualization of the results
    plot(implic)
    ## End(Not run)
This function plots all patterns of missing data within sequences, based on the seqIplot function.
seqmissIplot(data, var = NULL, with.complete = TRUE, void.miss = TRUE, ...)
data |
Either a data frame containing sequences of a categorical
variable, where missing data are coded as |
var |
A vector specifying the columns of the dataset
that contain the trajectories. Default is |
with.complete |
Logical, if |
void.miss |
Logical, if |
... |
Additional parameters passed to the seqIplot function. |
This function uses seqIplot to visualize all patterns of missing data within sequences. For further customization options, refer to the seqIplot documentation.
Kevin Emery
    # Plot all the patterns of missing data
    seqmissIplot(gameadd, var = 1:4)
    # Plot all the patterns of missing data discarding
    # complete trajectories
    seqmissIplot(gameadd, var = 1:4, with.complete = FALSE)
The seqQuickLook() function provides an overview of the
number and size of the different types of gaps
found in the original dataset.
seqQuickLook(data, var = NULL, np = 1, nf = 1)
data |
a data.frame where missing data are coded as NA or a state sequence object built with seqdef function |
var |
the list of columns containing the trajectories. Default is NULL, i.e. all the columns. |
np |
number of previous observations in the imputation model of the internal gaps. |
nf |
number of future observations in the imputation model of the internal gaps. |
The distinction between internal and SLG gaps depends on the
number of previous (np) and future (nf) observations that are
set for the MICT and MICT-timing algorithms.
Returns a data.frame object that summarizes, for each
type of gap (Internal Gaps, Initial Gaps, Terminal Gaps,
LEFT-hand side SLG, RIGHT-hand side SLG, Both-hand side SLG),
the minimum length, the maximum length, the total number of gaps and
the total number of missing values they contain.
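How gaps can be located and labelled in a single sequence is sketched below in base R with run-length encoding. The helper `find_gaps` is hypothetical and only distinguishes initial, internal and terminal gaps; the SLG categories, which depend on np and nf, are deliberately left out.

```r
# Hypothetical helper, not part of seqimpute: list the gaps of one sequence
# and label each as initial, internal, or terminal.
find_gaps <- function(x) {
  r <- rle(is.na(x))               # runs of observed / missing positions
  end <- cumsum(r$lengths)         # last position of each run
  start <- end - r$lengths + 1     # first position of each run
  keep <- r$values                 # TRUE for runs of missing data
  type <- ifelse(start[keep] == 1, "initial",
                 ifelse(end[keep] == length(x), "terminal", "internal"))
  data.frame(start = start[keep], length = r$lengths[keep], type = type,
             stringsAsFactors = FALSE)
}

x <- c(NA, "no", NA, NA, "yes", "yes", NA)
find_gaps(x)
```

Applying such a helper to every row and then taking the minimum, maximum and sum of `length` per `type` gives a summary of the same flavour as the one returned by seqQuickLook. In this sketch, a sequence that is entirely missing is labelled as a single initial gap.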
Andre Berchtold and Kevin Emery
    data(gameadd)
    seqQuickLook(data = gameadd, var = 1:4, np = 1, nf = 1)
The purpose of seqTrans is to spot impossible transitions
in longitudinal categorical data.
seqTrans(data, var = NULL, trans)
data |
a data frame containing sequences of a multinomial
variable with missing data (coded as |
var |
the list of columns containing the trajectories. Default is NULL, i.e. all the columns. |
trans |
|
It returns a matrix where each row is the position of an impossible transition.
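The search for forbidden transitions can be sketched in base R as follows. The helper `find_transitions` is hypothetical; it assumes transitions are written as "from->to" and reports the column at which the transition lands, which is an assumption about how positions are reported rather than a description of seqTrans.

```r
# Hypothetical helper, not part of seqimpute: locate forbidden transitions
# written as "from->to".
find_transitions <- function(data, var, trans) {
  m <- as.matrix(data[, var, drop = FALSE])
  from <- m[, -ncol(m), drop = FALSE]   # states at time t
  to <- m[, -1, drop = FALSE]           # states at time t + 1
  # Label every observed transition, keeping the row/column layout.
  observed <- matrix(paste(from, to, sep = "->"), nrow = nrow(m))
  hit <- matrix(observed %in% trans, nrow = nrow(m))
  pos <- which(hit, arr.ind = TRUE)
  pos[, "col"] <- pos[, "col"] + 1      # report the arrival time point
  pos
}

toy <- data.frame(
  T1 = c("no", "yes", "yes"),
  T2 = c("yes", "no", "yes"),
  T3 = c("no", "no", "yes")
)
find_transitions(toy, var = 1:3, trans = "yes->no")
```

Transitions involving a missing value are labelled with "NA" on one side and therefore never match a pattern such as "yes->no".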
Andre Berchtold and Kevin Emery
    data(gameadd)
    seqTransList <- seqTrans(data = gameadd, var = 1:4, trans = c("yes->no"))
Extract all the trajectories with at least one missing value.
seqwithmiss(data, var = NULL)
data |
either a data frame containing sequences of a multinomial
variable with missing data (coded as |
var |
the list of columns containing the trajectories. Default is NULL, i.e. all the columns. |
Returns either a data frame or a state sequence object, depending on the type of data that was provided to the function.
Kevin Emery
    # Game addiction dataset
    data(gameadd)
    # Extract the trajectories with at least one missing value
    gameadd.withmiss <- seqwithmiss(gameadd, var = 1:4)
Summary of a seqimp object
    ## S3 method for class 'seqimp'
    summary(object, ...)
object |
of class |
... |
additional arguments passed to other functions |
Kevin Emery