Using real data

The workflow for analyzing real (non-simulated) data is typically:

  1. [NOT NEEDED FOR tsinfer] mrpast polarize the data. Only Relate and SINGER need the data filtered and polarized before running mrpast arginfer. tsinfer takes the ancestral state as an input argument, so you can pass in an ancestral sequence with --ancestral instead.

  2. mrpast arginfer on the polarized data to produce ARGs.

  3. mrpast process --solve to process the ARGs and solve the maximum likelihood problem.

  4. mrpast confidence to generate confidence intervals on the parameters.

There are some additional considerations:

  • It is often best to pass these options to mrpast process: --rate-maps ratemap.chr --rate-map-threshold 1e-9. This restricts ARG sampling to regions with a recombination rate below 1e-9. We have found that ARG inference tends to be more accurate in such regions.

  • If you are concerned about particular regions of the genome (for example, because of sequencing quality or because selection may influence the results), you can modify your rate maps to set the recombination rate in those regions well above the threshold. The recombination rate threshold described above will then prevent sampling from those regions.
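
The masking idea in the last bullet can be sketched in Python. This is only an illustration under stated assumptions: the rate map is held in memory as a list of (start, end, rate) tuples with half-open intervals, which is a hypothetical representation and not necessarily the file format mrpast reads. The function names mask_regions and eligible_regions and the default high_rate of 1.0 are also made up for this example. Reading and writing the actual rate map files is left out.

```python
from typing import List, Tuple

# Hypothetical in-memory rate map: (start_bp, end_bp, rate), half-open [start, end).
Interval = Tuple[int, int, float]
Region = Tuple[int, int]


def mask_regions(rate_map: List[Interval], masked: List[Region],
                 high_rate: float = 1.0) -> List[Interval]:
    """Return a new rate map whose rate is high_rate inside every masked region.

    high_rate is an arbitrary value chosen to sit far above any threshold.
    """
    # Split the chromosome at every breakpoint of the map and of the masks.
    cuts = sorted({p for s, e, _ in rate_map for p in (s, e)}
                  | {p for s, e in masked for p in (s, e)})
    result: List[Interval] = []
    for start, end in zip(cuts, cuts[1:]):
        # Look up the original rate for this piece; skip pieces the map does not cover.
        rate = next((r for s, e, r in rate_map if s <= start and end <= e), None)
        if rate is None:
            continue
        # Override the rate when the piece lies inside a masked region.
        if any(s <= start and end <= e for s, e in masked):
            rate = high_rate
        # Merge with the previous piece when it is adjacent and has the same rate.
        if result and result[-1][1] == start and result[-1][2] == rate:
            result[-1] = (result[-1][0], end, rate)
        else:
            result.append((start, end, rate))
    return result


def eligible_regions(rate_map: List[Interval], threshold: float = 1e-9) -> List[Region]:
    """Intervals whose rate is strictly below the threshold (mirrors the text above)."""
    return [(s, e) for s, e, r in rate_map if r < threshold]


if __name__ == "__main__":
    original = [(0, 1000, 5e-10), (1000, 2000, 2e-8), (2000, 5000, 8e-10)]
    suspect = [(500, 1200), (3000, 4000)]  # regions we do not trust
    masked_map = mask_regions(original, suspect)
    print(eligible_regions(masked_map))
```

After masking, only intervals that were already below the threshold and lie outside the suspect regions remain eligible. The sketch does not model how mrpast itself samples from the eligible regions.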