Title: | Diagnostic Index Expectation Maximisation in R |
---|---|
Description: | Likelihood-based genome polarisation finds which alleles of genomic markers belong to which side of the barrier. Co-estimates which individuals belong to either side of the barrier and barrier strength. Uses expectation maximisation in likelihood framework. The method is described in Baird et al. (2023) <doi:10.1111/2041-210X.14010>. |
Authors: | Natalia Martinkova [aut, cre] , Stuart Baird [aut] |
Maintainer: | Natalia Martinkova <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.4.1 |
Built: | 2024-11-23 03:32:00 UTC |
Source: | https://github.com/cran/diemr |
A subset of single nucleotide polymorphisms in butterflies of the genus Brenthis.
vcf file with 13 individuals and 4 markers.
The data is used to test conversion of genotype data from vcf to diem format with
function vcf2diem
.
filename <- system.file("extdata", "brenthis.vcf", package = "diemr")
filename <- system.file("extdata", "brenthis.vcf", package = "diemr")
Checks format of files with genotype data.
CheckDiemFormat(files, ChosenInds, ploidy)
CheckDiemFormat(files, ChosenInds, ploidy)
files |
A character vector with paths to files with genotypes. |
ChosenInds |
A numeric vector of indices of individuals to be included in the analysis. |
ploidy |
A logical or a list of length equal to length of |
The input file must have genotypes of one marker for all individuals on one
line. The line must start with a letter "S" and contain only characters
"_" or "U" for unknown genotypes or a third/fourth allele, "0" for homozygots for
allele 1, "1" for heterozygots, and "2" for homozygots for allele 2. Check the
vignette with browseVignettes(package = "diemr")
for the example of the
input format.
Ploidies must be given as a list with each element corresponding to a genomic compartment (aka a file). For each compartment, the numeric vector specifying ploidies of all individuals chosen for the specific analysis must be given.
Returns invisible TRUE
if all files are executable by diem
. Exits
with informative error messages otherwise, specifying file names and lines with
potential problems. When too many lines contain problems, the first six are given.
# set up input genotypes file names, ploidies and selection of individual samples inputFile <- system.file("extdata", "data7x3.txt", package = "diemr") ploidies <- list(c(2, 1, 2, 2, 2, 1, 2)) inds <- 1:7 # check input data CheckDiemFormat(files = inputFile, ploidy = ploidies, ChosenInds = inds) # File check passed: TRUE # Ploidy check passed: TRUE
# set up input genotypes file names, ploidies and selection of individual samples inputFile <- system.file("extdata", "data7x3.txt", package = "diemr") ploidies <- list(c(2, 1, 2, 2, 2, 1, 2)) inds <- 1:7 # check input data CheckDiemFormat(files = inputFile, ploidy = ploidies, ChosenInds = inds) # File check passed: TRUE # Ploidy check passed: TRUE
Estimates how to assign alleles in a genome to maximise the distinction between two
unknown groups of individuals. Using expectation maximisation (EM) in likelihood
framework, diem
provides marker
polarities for importing data, their likelihood-based diagnostic index and its support
for all markers, and hybrid indices for all individuals.
diem( files, ploidy = FALSE, markerPolarity = FALSE, ChosenInds, ChosenSites = "all", epsilon = 0.99999, verbose = FALSE, nCores = parallel::detectCores() - 1, maxIterations = 50, ... )
diem( files, ploidy = FALSE, markerPolarity = FALSE, ChosenInds, ChosenSites = "all", epsilon = 0.99999, verbose = FALSE, nCores = parallel::detectCores() - 1, maxIterations = 50, ... )
files |
A character vector with paths to files with genotypes. |
ploidy |
A logical or a list of length equal to length of |
markerPolarity |
|
ChosenInds |
A numeric vector of indices of individuals to be included in the analysis. |
ChosenSites |
A logical vector indicating which sites are to be included in the analysis. |
epsilon |
A numeric, specifying how much the hypothetical diagnostic markers should
contribute to the likelihood calculations. Must be in |
verbose |
Logical or character with path to directory where run diagnostics will be saved. |
nCores |
A numeric number of cores to be used for parallelisation. Must be
|
maxIterations |
A numeric. |
... |
additional arguments. |
Given two alleles of a marker, one allele can belong to one side of a barrier
to geneflow and the other to the other side. Which allele belongs where is a non-trivial
matter. A marker state in an individual can be encoded as 0 if the individual is
homozygous for the first allele, and 2 if the individual is homozygous for the second
allele. Marker polarity determines how the marker will be imported. Marker polarity
equal to FALSE
means that the marker will be imported as-is. A marker with
polarity equal to TRUE
will be imported with states 0 mapped as 2 and states 2
mapped as 0, in effect switching which allele belongs to which side of a barrier to
geneflow.
When markerPolarity = FALSE
, diem
uses random null polarities to
initiate the EM algorithm. To fix the null polarities, markerPolarity
must be
a list of length equal to the length of the files
argument, where each element
in the list is a logical vector of length equal to the number of markers (rows) in
the specific file.
Ploidy needs to be given for each compartment and for each individual. For example,
for a dataset of three diploid mammal males consisting of an autosomal
compartment, an X chromosome
compartment and a Y chromosome compartment, the ploidy list would be
ploidy = list(rep(2, 3), rep(1, 3), rep(1, 3)
. If the dataset consisted of
one male and two females,
ploidy for the sex chromosomes should be vectors reflecting that females have two X
chromosomes, but males only one, and females have no Y chromosomes:
ploidy = list(rep(2, 3), c(1, 2, 2), c(1, 0, 0))
.
When verbose = TRUE
, diem
will output multiple files with information
on the iterations of the EM algorithm, including tracking marker polarities and the
respective likelihood-based diagnostics. See vignette vignette("Understanding-genome-polarisation-output-files",
package = "diemr")
for a detailed explanation of the individual output files.
A list including suggested marker polarities, diagnostic indices and support for all markers, four genomic state counts matrix for all individuals, and polarity changes for the EM iterations.
To ensure that the data input format of the genotype files, ploidies and individual
selection are readable for diem
, first use CheckDiemFormat.
Fix all errors, and run diem
only once the checks all passed.
The working directory or a folder optionally specified in the verbose
argument must have write permissions. diem
will store temporary files in the
location and output results files.
The grain for parallelisation is the compartment files
.
# set up input genotypes file names, ploidies and selection of individual samples inputFile <- system.file("extdata", "data7x3.txt", package = "diemr") ploidies <- list(c(2, 1, 2, 2, 2, 1, 2)) inds <- 1:6 # check input data CheckDiemFormat(files = inputFile, ploidy = ploidies, ChosenInds = inds) # File check passed: TRUE # Ploidy check passed: TRUE # run diem ## Not run: # diem will write temporal files during EM iterations # prior to running diem, set the working directory to a location with write permission fit <- diem(files = inputFile, ChosenInds = inds, ploidy = ploidies, nCores = 1) # run diem with fixed null polarities fit2 <- diem( files = inputFile, ChosenInds = inds, ploidy = ploidies, nCores = 1, markerPolarity = list(c(TRUE, FALSE, TRUE)) ) ## End(Not run)
# set up input genotypes file names, ploidies and selection of individual samples inputFile <- system.file("extdata", "data7x3.txt", package = "diemr") ploidies <- list(c(2, 1, 2, 2, 2, 1, 2)) inds <- 1:6 # check input data CheckDiemFormat(files = inputFile, ploidy = ploidies, ChosenInds = inds) # File check passed: TRUE # Ploidy check passed: TRUE # run diem ## Not run: # diem will write temporal files during EM iterations # prior to running diem, set the working directory to a location with write permission fit <- diem(files = inputFile, ChosenInds = inds, ploidy = ploidies, nCores = 1) # run diem with fixed null polarities fit2 <- diem( files = inputFile, ChosenInds = inds, ploidy = ploidies, nCores = 1, markerPolarity = list(c(TRUE, FALSE, TRUE)) ) ## End(Not run)
Changes encodings of genomic markers according to user specification.
emPolarise(origM, changePolarity = TRUE)
emPolarise(origM, changePolarity = TRUE)
origM |
A character vector of genotypes comprising of _012 encodings. |
changePolarity |
A logical scalar, indicating whether to leave the marker as is
( |
Returns a character vector with polarised markers.
Note that diem and importPolarized accept also a U
encoding for an unknown or third allele, but emPolarise
requires all U
to
be replaced with _
.
diem
for determining appropriate marker polarity with
respect to a barrier to geneflow.
emPolarise(c("0", "0", "1", "2", "2"), TRUE) # [1] "2" "2" "1" "0" "0" emPolarise(c("0", "_", "2", "2", "1"), FALSE) # [1] "0" "_" "2" "2" "1"
emPolarise(c("0", "0", "1", "2", "2"), TRUE) # [1] "2" "2" "1" "0" "0" emPolarise(c("0", "_", "2", "2", "1"), FALSE) # [1] "0" "_" "2" "2" "1"
Reads genotypes from a file and changes marker polarity.
importPolarized( files, changePolarity, ChosenInds, ChosenSites = "all", nCores = 1, verbose = FALSE, ... )
importPolarized( files, changePolarity, ChosenInds, ChosenSites = "all", nCores = 1, verbose = FALSE, ... )
files |
A character vector with paths to files with genotypes. |
changePolarity |
A logical vector with length equal to the number of markers. |
ChosenInds |
A numeric vector of indices of individuals to be included in the analysis. |
ChosenSites |
A logical vector indicating which sites are to be included in the analysis. |
nCores |
A numeric number of cores to be used for parallelisation. Must be
|
verbose |
Logical whether to show messages on import progress. |
... |
Optional numeric vector of |
For details on the input data format, check the file
with
CheckDiemFormat.
The changePolarity
argument influences how each marker is imported. Value
FALSE
means that the marker will be imported as it is saved in the file
. Value
TRUE
means that the genotypes encoded as 0
will be imported as 2
, and genotypes
encoded in the file
as 2
will be imported as 0
.
Returns a character matrix with rows containing individual genotypes and columns containing markers.
diem for determining appropriate marker polarity with respect to a barrier to gene flow.
dat <- importPolarized( files = system.file("extdata", "data7x3.txt", package = "diemr"), changePolarity = c(FALSE, TRUE, TRUE), ChosenInds = 1:6, ChosenSites = "all" ) dat # m1 m2 m3 # 1 "0" "1" "2" # 2 "0" "0" "0" # 3 "1" "1" "0" # 4 "1" "2" "0" # 5 "2" "2" "1" # 6 "2" "2" "_"
dat <- importPolarized( files = system.file("extdata", "data7x3.txt", package = "diemr"), changePolarity = c(FALSE, TRUE, TRUE), ChosenInds = 1:6, ChosenSites = "all" ) dat # m1 m2 m3 # 1 "0" "1" "2" # 2 "0" "0" "0" # 3 "1" "1" "0" # 4 "1" "2" "0" # 5 "2" "2" "1" # 6 "2" "2" "_"
Estimates a diagnostic marker for the state counts of all genomic markers for all individuals. Using the hypothetical, diagnostic marker, calculates individual state counts with respect to their weighted similarity to the diagnostic marker states.
ModelOfDiagnostic( I4, OriginalHI, epsilon = 0.99, verbose = FALSE, folder = "likelihood", ... )
ModelOfDiagnostic( I4, OriginalHI, epsilon = 0.99, verbose = FALSE, folder = "likelihood", ... )
I4 |
a matrix or data.frame with 4 numeric columns representing character state counts for missing data, homozygots for allele 1, heterozygots, and homozygots for allele 2. Individuals in rows. |
OriginalHI |
numeric vector of length equal to number of rows in |
epsilon |
A numeric, specifying how much the hypothetical diagnostic markers should
contribute to the likelihood calculations. Must be in |
verbose |
Logical or character with path to directory where run diagnostics will be saved. |
folder |
character specifying path to a folder for the verbose output. |
... |
additional arguments. |
The OriginalHI
can be calculated with pHetErrOnStateCount
.
Matrix with dimensions of I4.
diem
for utilising the model to determine appropriate marker
polarisation in estimating barriers to geneflow.
A subset of single nucleotide polymorphisms in Myotis myotis from Harazim et al. (2021). The genotypes were modified for testing purposes in such a way that markers 15 and 17 now include additional indel and substitution alleles. Eight markers used in the dataset are monomorphic.
vcf file with 14 individuals and 20 markers.
The data is used to test conversion of genotype data from vcf to diem format with
function vcf2diem
.
Harazim M., Pialek L., Pikula J., Seidlova V., Zukal J., Bachorec E., Bartonicka T., Kokurewicz T., Martinkova N. (2021) Associating physiological functions with genomic variability in hibernating bats. Evolutionary Ecology, 35, 291-308, doi: 10.1007/s10682-020-10096-4.
filename <- system.file("extdata", "myotis.vcf", package = "diemr")
filename <- system.file("extdata", "myotis.vcf", package = "diemr")
Using genotype allele counts, calculates the hybrid index, heterozygosity and error rate in a single individual.
pHetErrOnStateCount(sCount)
pHetErrOnStateCount(sCount)
sCount |
A numeric vector of length 4 with allele counts for missing data, homozygots for allele 1, heterozygots, and homozygots for allele 2. |
Allele counts are genomic state counts multiplied by ploidy. As different compartments might have different ploidies (e.g. autosomal markers, sex chromosomes, mitochondrial markers), allele counts should be calculated per compartment and then summarised to obtain the correct genomic allele counts. When all individuals in each compartmenst have the same ploidy, state counts do not need to be corrected.
Returns a named numeric vector with three values: p - hybrid index, Het - heterozygosity, Err - error rate.
pHetErrOnStateCount(sCount = c(2, 4, 2, 6)) # p Het Err # 0.5833333 0.1666667 0.1428571
pHetErrOnStateCount(sCount = c(2, 4, 2, 6)) # p Het Err # 0.5833333 0.1666667 0.1428571
This function calculates genotype frequencies from polarized genotypes, ideally
imported using the
importPolarized
function. It plots individuals onto a ternary De Finetti
diagram and includes a curve indicating Hardy-Weinberg equilibrium if specified.
plotDeFinetti( genotypes, cols, HWE = TRUE, tipLabels = c("Homozygous 0", "Heterozygous 1", "Homozygous 2"), ... )
plotDeFinetti( genotypes, cols, HWE = TRUE, tipLabels = c("Homozygous 0", "Heterozygous 1", "Homozygous 2"), ... )
genotypes |
A character matrix comprising of _012 encodings. |
cols |
A character vector of colors with a length equal to the number of
individuals (rows) in |
HWE |
Logical indicating whether to plot the curve for Hardy-Weinberg Equilibrium. |
tipLabels |
A character vector of length 3 with names for the ternary plot vertices. |
... |
additional graphical parameters (see plot.default). |
To import and polarize genotypes, use the importPolarized function.
Alternatively, the I4 matrix can be used as input for genotypes
.
No return value; the function is called for its side effects.
gen <- importPolarized( file = system.file("extdata", "data7x10.txt", package = "diemr"), changePolarity = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE), ChosenInds = 1:7 ) plotDeFinetti(gen, cols = palette.colors(nrow(gen), "Accent"), pch = 19)
gen <- importPolarized( file = system.file("extdata", "data7x10.txt", package = "diemr"), changePolarity = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE), ChosenInds = 1:7 ) plotDeFinetti(gen, cols = palette.colors(nrow(gen), "Accent"), pch = 19)
This function adds a marker axis with chromosome names to an existing plot of polarized genotypes. It requires that the plot is already created using plotPolarized.
plotMarkerAxis( includedSites, ChosenSites, tickDist = 1e+06, axisInfo = NULL, ... )
plotMarkerAxis( includedSites, ChosenSites, tickDist = 1e+06, axisInfo = NULL, ... )
includedSites |
A character path to a file with columns |
ChosenSites |
A logical vector indicating which sites are to be included in the analysis. |
tickDist |
A numeric indicating the spacing of physical tick marks along a chromosome. |
axisInfo |
A list with user-defined tick positions and labels for marker axis. See Details. |
... |
additional arguments. |
The includedSites
file should ideally be generated by
vcf2diem to ensure congruence between the plotted genotypes and
the respective metadata.
Tick mark distances within a chromosome are located at tickDist
and formated to
multiples of millions.
The optional axisInfo
argument must have five named elements with the following
information:
CHROMbreaks
: Numeric vector with positions defining ticks separating
chromosomes. The metric for all positions is in the number of markers.
CHROMnamesPos
: Numeric vector with positions to place the chromosome
labels.
CHROMnames
: Character vector with the names of the chromosomes. Must be the
same length as CHROMnamesPos
.
ticksPos
: Numeric vector with positions of ticks within chromosomes.
ticksNames
: Character vector with the names to be displayed at ticksPos
.
When axisInfo = NULL
, the function extracts the necessary information from the
includedSites
file.
Returns an invisible axisInfo
list with the tick positions and labels
for the marker axis.
## Not run: # Run this example in a working directory with write permissions myo <- system.file("extdata", "myotis.vcf", package = "diemr") vcf2diem(myo, "myo") inds <- 1:14 fit <- diem("myo-001.txt", ChosenInds = inds, ploidy = FALSE) gen <- importPolarized("myo-001.txt", fit$markerPolarity, inds) h <- apply(gen, 1, function(x) pHetErrOnStateCount(sStateCount(x)))[1, ] plotPolarized(gen, h, xlab = "") plotMarkerAxis("myo-includedSites.txt", rep(TRUE, 11), tickDist = 100) ## End(Not run)
## Not run: # Run this example in a working directory with write permissions myo <- system.file("extdata", "myotis.vcf", package = "diemr") vcf2diem(myo, "myo") inds <- 1:14 fit <- diem("myo-001.txt", ChosenInds = inds, ploidy = FALSE) gen <- importPolarized("myo-001.txt", fit$markerPolarity, inds) h <- apply(gen, 1, function(x) pHetErrOnStateCount(sStateCount(x)))[1, ] plotPolarized(gen, h, xlab = "") plotMarkerAxis("myo-includedSites.txt", rep(TRUE, 11), tickDist = 100) ## End(Not run)
Plots genotypes that can be optionally polarized.
plotPolarized( genotypes, HI, cols = c("#FFFFFF", "#800080", "#FFE500", "#008080"), ... )
plotPolarized( genotypes, HI, cols = c("#FFFFFF", "#800080", "#FFE500", "#008080"), ... )
genotypes |
A character matrix comprising of _012 encodings. |
HI |
A numeric vector of individual hybrid indices with length equal to number
of rows in |
cols |
A vector of four colors, representing missing data, homozygotes for genotype 0, heterozygotes and homozygotes for genotype 2. |
... |
To import and polarize genotypes, use the importPolarized function.
When using diem, hybrid indices,
HI
, can be found in the file 'HIwithOptimalPolarities.txt'. Alternatively,
calculate HI
from the polarized genotypes as shown in the examples.
By default, the function plots colored tick marks for individuals, changing the
color at the steepest change in sorted HI
. The second and fourth colors in
cols
are used for the tick marks.
To turn off this feature, use the argument tick = FALSE
.
To use custom tick mark colors, provide a vector of colors for all individuals
(equal to the number of rows in genotypes
). The vector of colors must be
ordered according to order(HI)
.
To include individual labels
(e.g., accession numbers), provide a character
vector with the names in the same order as they are in the genotypes
.
No return value, called for side effects. In the default plot, purple and green
represent sides of the barrier to gene flow encoded as 0
and 2
, respectively,
yellow shows heterozygotes and white represents missing or undetermined genotypes.
Individuals are ordered according to the sorted HI
.
plotMarkerAxis to add chromosome information to the x axis.
gen <- importPolarized( file = system.file("extdata", "data7x10.txt", package = "diemr"), changePolarity = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE), ChosenInds = 1:7 ) h <- apply(gen, 1, FUN = function(x) pHetErrOnStateCount(sStateCount(x)))[1, ] plotPolarized(genotypes = gen, HI = h) # Incorrect tick color order plotPolarized(gen, h, col.ticks = c(rep("purple", 5), "green", "purple"), lwd = 3) # Correct tick color order plotPolarized(gen, h, col.ticks = c(rep("purple", 5), "green", "purple")[order(h)], lwd = 3) # Correct individual label order plotPolarized(gen, h, labels = c(paste("purple", 1:5), "green 1", "purple 6"), ylab = "")
gen <- importPolarized( file = system.file("extdata", "data7x10.txt", package = "diemr"), changePolarity = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE), ChosenInds = 1:7 ) h <- apply(gen, 1, FUN = function(x) pHetErrOnStateCount(sStateCount(x)))[1, ] plotPolarized(genotypes = gen, HI = h) # Incorrect tick color order plotPolarized(gen, h, col.ticks = c(rep("purple", 5), "green", "purple"), lwd = 3) # Correct tick color order plotPolarized(gen, h, col.ticks = c(rep("purple", 5), "green", "purple")[order(h)], lwd = 3) # Correct individual label order plotPolarized(gen, h, labels = c(paste("purple", 1:5), "green 1", "purple 6"), ylab = "")
This function estimates positions of ordered single nucleotide polymorphisms (SNPs) that correspond
to a window spanning a user-defined distance in the SNP positions mapped to a reference.
Each window is centered at the SNP mapped position.
Conversion of a SNP rank position metric to a mapped position metric is useful for
kernel smoothing of the diem
output state along a genomic sequence.
rank2map(includedSites, ChosenSites = "all", windowSize = 1e+07, nCores = 1)
rank2map(includedSites, ChosenSites = "all", windowSize = 1e+07, nCores = 1)
includedSites |
A character path to a file with columns |
ChosenSites |
A logical vector indicating which sites are to be included in the analysis. |
windowSize |
A numeric window size for metric conversion in base-pairs. |
nCores |
A numeric number of cores to be used for parallelisation. Must be
|
Single nucleotide polymorphisms (SNPs) tend to be spread across a genome randomly.
To facilitate interpretation of the diem
output, the marker states should be
assessed on the metric of their position along chromosomes (contigs). The windows for
kernel smoothing might contain a variable number of markers. This function estimates
which markers should be assessed together given their proximity on a chromosome.
Values in includedSites
are in essence SNP positions in BED format with a header.
The includedSites
file should ideally be generated by
vcf2diem to ensure congruence across all analyses.
The function reads SNP positions from the specified BED-like file and divides the genome into segments based on chromosomes. Each segment is then processed to identify genomic windows encompassing each SNP, considering the specified window size. This process is parallelized to enhance performance, and each SNP is considered within its chromosomal context to ensure accurate window placement.
Minimum value of windowSize
is equal to 3, but in genomic data evaluations, window
size should be at least two orders of magnitude larger. A good approximation of a
useful minimum window size is $(genome size) / ((number of SNPSs) / 2)$.
A two-column matrix with the number of rows corresponding to the number of
ChosenSites
, indicating start and end indices of adjacent markers that are
within an interval of length windowSize
centered on the specific marker.
The unit of parallelization when using nCores > 1
is set per chromosome.
This may differ from the parallelization approach used in diem, where processing
of compartment files is parallelized. Note that while compartment files can correspond
to chromosomes, this is not necessarily the case.
Natalia Martinkova
Filip Jagos [email protected]
## Not run: # Run this example in a working directory with write permissions myo <- system.file("extdata", "myotis.vcf", package = "diemr") vcf2diem(myo, "myo") rank2map("myo-includedSites.txt", windowSize = 50) ## End(Not run)
## Not run: # Run this example in a working directory with write permissions myo <- system.file("extdata", "myotis.vcf", package = "diemr") vcf2diem(myo, "myo") rank2map("myo-includedSites.txt", windowSize = 50) ## End(Not run)
This function smooths polarized genotype states using a Laplace kernel density estimation. It calculates a smoothed version of the genotype states over specified windows.
smoothPolarizedGenotypes( genotypes, includedSites, ChosenSites = "all", windows = NULL, windowSize = NULL, ... )
smoothPolarizedGenotypes( genotypes, includedSites, ChosenSites = "all", windows = NULL, windowSize = NULL, ... )
genotypes |
A character matrix comprising of _012 encodings. |
includedSites |
A character path to a file with columns |
ChosenSites |
A logical vector indicating which sites are to be included in the analysis. |
windows |
A two-column numeric matrix with indices of start and end positions for
windows for all markers indicated by |
windowSize |
A numeric window size for metric conversion in base-pairs. |
... |
Additional parameters to be passed to rank2map if |
Ensure that ChosenSites
is the same as was used to import polarized genotypes.
The function uses a Laplace kernel to weight the genotype states within a window around each marker position, based on physical positions of the markers. The smoothing process accounts for chromosome-level scales.
The Laplace kernel density is calculated as:
where is the position,
is the center of the kernel, and
is the scale parameter.
The scale parameter is
, where
is the position of the last
marker on the chromosome.
## Not run: # Run this example in a working directory with write permissions myo <- system.file("extdata", "myotis.vcf", package = "diemr") vcf2diem(myo, "myo") fit <- diem("myo-001.txt", ChosenInds = 1:14) gen <- importPolarized("myo-001.txt", changePolarity = fit$markerPolarity, ChosenInds = 1:14) h <- apply(gen, 1, \(x) pHetErrOnStateCount(sStateCount(x)))[1, ] gen2 <- smoothPolarizedGenotypes(genotypes = gen, includedSites = "myo-includedSites.txt", windowSize = 50) plotPolarized(gen, h) plotPolarized(gen2, h) ## End(Not run)
## Not run: # Run this example in a working directory with write permissions myo <- system.file("extdata", "myotis.vcf", package = "diemr") vcf2diem(myo, "myo") fit <- diem("myo-001.txt", ChosenInds = 1:14) gen <- importPolarized("myo-001.txt", changePolarity = fit$markerPolarity, ChosenInds = 1:14) h <- apply(gen, 1, \(x) pHetErrOnStateCount(sStateCount(x)))[1, ] gen2 <- smoothPolarizedGenotypes(genotypes = gen, includedSites = "myo-includedSites.txt", windowSize = 50) plotPolarized(gen, h) plotPolarized(gen2, h) ## End(Not run)
Counts genomic states in one sample.
sStateCount(s)
sStateCount(s)
s |
A character vector with elements "_", "0", "1", "2" representing missing data, homozygots for allele 1, heterozygots, and homozygots for allele 2. The vector should represent a single individual. |
Summarizes the number of markers that are fixed for an allele in the genome of one individual. This is used to construct the I4 matrix in diem.
Numeric vector of length 4 with counts of "_", "0", "1", "2" respectively.
emPolarise for changing marker polarity.
genotype <- c("0", "0", "_", "2", "1", "0", "1") sStateCount(genotype) # [1] 1 3 2 1 # calculate state counts for a polarised genotype sStateCount(emPolarise(genotype, TRUE)) # [1] 1 1 2 3
genotype <- c("0", "0", "_", "2", "1", "0", "1") sStateCount(genotype) # [1] 1 3 2 1 # calculate state counts for a polarised genotype sStateCount(emPolarise(genotype, TRUE)) # [1] 1 1 2 3
A subset of single nucleotide polymorphisms in fish for testing purposes of multiallelic markers.
vcf file with 92 individuals and 6 markers.
The data is used to test conversion of genotype data from vcf to diem format with
the function vcf2diem
.
filename <- system.file("extdata", "testdata.vcf", package = "diemr")
filename <- system.file("extdata", "testdata.vcf", package = "diemr")
Reads vcf files and writes genotypes of the most frequent alleles based on chromosome positions to diem format.
vcf2diem(SNP, filename, chunk = 1L, requireHomozygous = TRUE)
vcf2diem(SNP, filename, chunk = 1L, requireHomozygous = TRUE)
SNP |
A character vector with a path to the '.vcf' or '.vcf.gz' file, or an |
filename |
A character vector with a path where to save the converted genotypes. |
chunk |
Numeric indicating by how many markers should the result be split into separate files. |
requireHomozygous |
Logical whether to require the marker to have at least one homozygous individual for each allele. |
Importing vcf files larger than 1GB, and those containing multiallelic
genotypes is not recommended. Instead, use the path to the
vcf file in SNP
. vcf2diem
then reads the file line by line, which is
a preferred solution for data conversion, especially for
very large and complex genomic datasets.
The number of files vcf2diem
creates depends on the chunk
argument
and class of the SNP
object.
Values of chunk < 100
are interpreted as the number of files into which to
split data in SNP
. For SNP
object of class vcfR
, the number
of markers per file is calculated from the dimensions of SNP
. When class
of SNP
is character
, the number of markers per file is approximated
from a model with a message. If this number of markers per file is inappropriate
for the expected
output, provide the intended number of markers per file in chunk
greater
than 100 (values greater than 10000 are recommended for genomic data).
vcf2diem
will scan the whole input specified in the SNP
file, creating
additional output files until the last line in SNP
is reached.
Values of chunk >= 100
mean that each output file
in diem format will contain chunk
number of lines with the data in SNP
.
When the vcf file contains markers not informative for genome polarisation,
those are removed and listed in a file ending with omittedSites.txt in the
directory specified in the SNP
argument or in the working directory.
The omitted loci are identified by their information in the CHROM and POS columns,
and include the QUAL column data. The last column is an integer specifying
the reason why the respective marker was omitted. The reasons why markers are
not informative for genome polarisation using diem
are:
Marker has fewer than 2 alleles representing substitutions.
Required homozygous individuals for the 2 most frequent alleles are not present
(optional, controlled
by the requireHomozygous
argument).
The second most frequent allele is found only in one heterozygous individual.
Dataset is invariant for the most frequent allele.
Dataset is invariant for the allele listed as the first ALT in the vcf input.
The CHROM, POS, and QUAL information for loci included in the converted files are listed in the file ending with includedSites.txt. Additional columns show which allele is encoded as 0 in its homozygous state and which is encoded as 2.
No value returned, called for side effects.
Natalia Martinkova
Filip Jagos [email protected]
Jachym Postulka [email protected]
## Not run: # vcf2diem will write files to a working directory or a specified folder # make sure the working directory or the folder are at a location with write permission myofile <- system.file("extdata", "myotis.vcf", package = "diemr") vcf2diem(SNP = myofile, filename = "test1") vcf2diem(SNP = myofile, filename = "test2", chunk = 3) ## End(Not run)
## Not run: # vcf2diem will write files to a working directory or a specified folder # make sure the working directory or the folder are at a location with write permission myofile <- system.file("extdata", "myotis.vcf", package = "diemr") vcf2diem(SNP = myofile, filename = "test1") vcf2diem(SNP = myofile, filename = "test2", chunk = 3) ## End(Not run)