Estimating Unknown Cycles in Geophysical data

Examples of cyclic (periodic) behavior in geophysical data abound. In many cases the primary period is known, such as in daily measurements of rain, temperature, and sea level. However, many time series of measurements contain cycles of unknown or varying length. We consider the problem of estimating the unknown period in a time series.We review the basic methods, compare their performance through a simulation studyusing observed sea level data, apply them to an astronomical data set, and discuss generalizations of the methods.

This paper provides a review of some of the statistical methods that have been developed to estimate the unknown period in a time series. The problem is considered in the general case of unequally spaced observation times-the statistical problem becomes easier when data is at regular intervals. We review and briefly describe competing statistical methods for determining the unknown period in Section 3. In Section 4 we compare the accuracy of the competing methods using sea level data with a known primary period. In particular, we study the sensitivity of the estimation methods to the amount of data missing (spacing of time points) and to the degree of noise, by randomly removing observations and by adding varying degrees of random noise to the data. We will see that the Lomb-Scargle periodogram and cubic spline smoothing method are the most robust to missing data (unequal spacing) and to increased noise, in that they are most able to accurately estimate the period regardless of the spacing or noise level. In section 5 we demonstrate the methods using data on the light magnitude of a variable star. We end the paper with some discussion of future directions for statistical methods.

Literature Review of Nonparametric Period Finding Methods
There has been a rich history in spectrum analysis which includes frequency estimations in signal processing. Note the period of a cycle is the reciprocal of the frequency, so for monthly effects with period of 12, the frequency would be 1/12. Conventional spectral analysis techniques like the periodogram requires the input signal to be uniformly sampled which is hardly satisfied in practice. Deeming (1975) began the work on estimation approaches for the unevenly spaced observational data of variable stars. Then dozens of methods have been developed, both parametric and nonparametric. This paper only discusses the nonparametric approaches for period estimation.
Periodograms are generally based on discrete Fourier transform and power spectrum. The classical periodogram was developed more than a century ago by Schuster (1898). Deeming (1975) applied the discrete Fourier transform to unequally spaced data in astronomy. Walker (1971) and Hannan (1973) explored the asymptotic properties of the periodogram estimator. Later Lomb (1976)-Scargle (1982 periodogram becomes a well-known and widely used algorithm for periodicity detection in unevenly-spaced time series. A second class of approaches search the period by evaluating dispersion either in the sum of lengths between phase-sorted data or sum of dispersion in phased bins compared against trial periods. The String-length method was attributed to Lafler and Kinman (1965), Renson (1978) and Dworetsky (1983). Clarke (2002) presented a generalization "Rope-length Method" for multivariate time series data. The Phase Dispersion Minimization (PDM) was due to Stellingwerf (1978).
The third class of approaches estimate the period by fitting local regression or smoothing splines. Periodic smoothing splines were discussed in the context of spectral estimation by Cogburn and Davis (1974). The application of cubic spline in the period estimation was introduced by C. Akerlof (1994). Friedman (1984) invented a variable-span local linear smoother so called "SuperSmoother", and Reimann (1994) first adopted it in his dissertation.

Model and Methodology
Let g be a periodic function with the true period p 0 . We have N pairs of observations (t n , Y n ), 1 ≤ n ≤ N , in which t n represents the time when the observation was made. The observations are ordered by the value of t n , i.e., 0 < t 1 ≤ t 2 ≤ · · · ≤ t N . We write the model as where n 's are independent identically distributed (i.i.d.) errors with E( n |t n ) = 0 and V ar( n |t n ) < ∞. The typical goals are to estimate p 0 and g(·).
A simple idea is to construct a nonparametric estimatorĝ(·|p) of g under the assumption that the period of g is p. See P. Hall and Rice (2000). We then mapĝ to R by periodicity and define the Sum Squared Error (SSE) as We choose the estimatorp which minimizes SSE(p). For an appropriate estimatorg(·|p) under the assumption of period p, we takeĝ =g(·|p) to be the estimator of g. This paper will address searching period rather than estimating g(·) in the next few sections.

Discrete Fourier Transform, Power Spectrum and Classical Periodogram
Consider a continuous function g(t) which are uniformly sampled at the discrete time 0 < t 1 ≤ t 2 ≤ · · · ≤ t N , the Fourier transform F for the discrete sampling is where f is the frequency. We use the canonical notation of Fourier Transform in term of the frequency f to keep the consistency with spectrum analysis. Note f = 1 p .
-3-manuscript submitted to Earth and Space Science The squared amplitude, known as the power spectrum, is defined as With the power spectrum defined, Schuster (1898) first proposed the classical periodogram The estimate of the frequencyf on the interval is the one that maximizes the periodogram P S (f ).
However, the classical periodogram has several drawbacks. With unevenly spaced data, the Fourier power spectrum does not have well-defined statistical properties (Lomb, 1976), this is because the discrete Fourier transform relies on some strong assumptions: evenly spaced observations of infinite duration, Gaussian white noise, and stationary behavior (VanderPlas, 2018). As a consequence, the classical periodogram does not work well on unevenly spaced data, see Table 3 in Section 5.

Lomb-Scargle Periodogram
Lomb (1976) and Scargle (1982) proposed a Fourier-like power spectrum estimator to characterize the periodicity in the unevenly spaced data. Lomb-Scargle periodogram can be also seen as fitting the least square of sine waves to the unevenly spaced data, where the amplitude A f and phase φ f depend on the trial frequency f . With some calculus, the least square solution, so called Lomb-Scargle periodogram is established as .
VanderPlas (2018) has an in depth discussion on the connection between classic periodogram and Lomb-Scargle periodogram. If the data is evenly spaced and consists of Gaussian noise, the Lomb-Scargle periodogram reduces to the classical periodogram. The Lomb-Scargle periodgram is more computationally efficient than the classic periodogram. Another distinct benefit of Lomb-Scargle peridogram is that the unnormalized periodogram in Equation 8 follows a χ 2 distribution with two degrees of freedom when the error terms are Gaussian noise (Scargle, 1982).

Cubic Spline
General cubic spline methods are described in Wasserman (2006). Here we use the definitions and notation from Reimann (1994), which are specifically tailored for periodic data.
A function s(·) on interval [0, 1] is a periodic cubic spline with K knots at t k , where k = 1, 2, · · · , K. It should satisfy the following properties: • In each interval [t k−1 , t k ], k = 1, 2, · · · , K + 1 (Define t 0 = 0 and t K+1 = 1), s(·) is a polynomial of degree three. • s(·) and its first and second order derivatives are continuous everywhere in [0,1] and satisfy the periodicity constrains Given the set of knots, the spline model is determined by coefficients in each interval satisfying the above constraints. The model is fitted to data using least squares. To find the period using a cubic spline: • convert the raw data (t n , Y n ) into the phased data (ρ n , Y n ) by ρ n = tn p mod 1 for a trial period p (thus the phase space ρ ∈ [0, 1]), where n = 1, 2, · · · , N ; • fit a cubic spline for the phased data (ρ n , Y n ) for the fixed number of knots K for all trial periods, and compute the corresponding SSE(p).
The estimatep is the period that minimizes SSE(p).

Local Linear Regression and Supersoomther
Local linear regression uses weighted averaging to provide a linear approximation to a nonlinear function at a point (Wasserman, 2006). To estimate the period, fit a local linear regression on the phased data (ρ n , Y n ), n = 1, 2, · · · , N for some bandwidth. Let B i denote the i th bandwidth and J the number of observations in B i . We use fit a local linear regression in each band where j 's are i.i.d. error terms. The local linear estimator in each band can be computed by (weighted) least squareŝ whereα andβ are obtained from local fits to data points in each band. The estimatê p is the period that minimizes SSE(p).
Friedman's Supersmoother (Friedman, 1984) performs three linear smooths of the phased data (ρ n , Y n ), n = 1, 2, · · · , N with long, medimum and short bandwidths. Then it does a local cross-validation to determine which bandwidth gives the best fit at each phase value. The period estimate is obtained through minimizing the Sum of Absolute Residuals (SAR) where theŶ n (p) is the fitted value from Supersmoother at a trial period p, andσ n is the estimate of the standard deviation of the errors.

String Length Methods
Phase-folding methods compute the dispersion of the data in the phase space to search the period that minimizes the dispersion. For example, String-length computes the phase of the raw observations for each trial period p and sorts phase data in an ascending order of the phase. Let (ρ * n , Y * n ) be the ordered phased data, where n = 1, 2, · · · , N . The best period minimizes the String-length statistic Note Y * N +1 = Y * 1 and ρ * N +1 = ρ * 1 .
However, String-length depends on the differences in the phase as well as in the response, so a change in either could lead to a different estimate of the period. Lafler and Kinman (1965) recommended minimizing the following statistic Another modified string length method is due to Renson (1978) which estimates the period by minimizing the quantity: where b is chosen so that the difference (ρ * n+1 − ρ * n ) 2 + b 2 won't be too small.

Phase Dispersion Minimization
The variance is computed by If we divide the data into M distinct samples and each sample has n m observations, where m = 1, 2, · · · , M , then each sample has the variance s 2 m , and the overall variance of all samples is The Phase Dispersion Minimization method(PDM) is implemented in two steps: • convert the raw data into the phased data for a trial period p by ρ i = ti p mod 1, where i = 1, 2, · · · , N . ρ ∈ [0, 1]; • divide the full phase interval [0, 1] into M fixed bins with observations in each bin chosen so that these observations have similar phase.
Compute the PDM statistics for each trial period p by If p is not the correct period, then s 2 (p) ≈ σ 2 and PDM(p) ≈ 1; if p is the correct period, then PDM(p) will reach a local minimum compared with the neighboring periods, ideally near zero.

Statistical Inference on the Period
We briefly discuss confidence intervals for the period. The finite sample distribution of the estimated period is not generally quantifiable. However, under restrictive assumptions some exact distributional results are available. For example, as noted above under the assumption of normality the unnormalized Lomb-Scargle peridogram in Equation (8) follows a χ 2 distribution with two degrees of freedom. This can be used to make a confidence interval (Baluev, 2008). More generally applicable procedures could be based on asymptotic (large sample) normality. Since the Lomb-Scargle peridogram solves a least squares minimization, asymptotic normality can be established and used to create an approximate interval. Similarly, the smoothing methods (spline and local), result in estimators which have asymptotic normal properties (Wasserman, 2006). In practice, we recommend using bootstrapping to estimate standard errors and create confidence intervals. The interested reader is again referred to (Wasserman, 2006) for details. In the current paper, we seek to provide point estimates of the period, so we do not pursue this further here. Rather we provide the practitioner some guidance to select the most reliable method by conducting a simulation study in the next section. Graham et al. (2013) has conducted a comparison of several period finding techniques applied to observational data of variable stars from three projects: Catalina Realtime Transient Survey, ASAS Catalog of Variable Stars and MACHO. However, his comparison did not consider the heterogeneity in data. Moreover, the true period of a variable star is actually decided by computation instead of prior knowledge, so the comparison of accuracy measures is not persuasive. Therefore, we present a different comparison by simulating the sea level data at the La Jolla Station, California. The seal level data were collected by the project "Permanent Service for Mean Sea Level". The periodic variation of the sea level is known to be annual because of the "steric effect", which is caused by the annual variation in water temperature at shallow depths. The sea level data consist of 300 observations, which ranges from 1992 to 2017 and are evenly sampled. The sampling rate is 12, i.e., 12 observations per year, the time scale has been modified to be month/12 so that the natural period is 1, and the standard error of the white noise in the data is estimated to be σ = 90.53 millimeter.

A Brief Comparison of Period Finding Methods
Our goal is to examine how period estimation methods behave under specified nonuniformity of measurement times and differing variability of additive random noise. To simulate the non-uniformity, we mimic missing data by randomly selecting time points and remove the observations which are not selected. We vary the proportion of sampling observations from 20% to 70%. With a lower proportion of the data being randomly sampled, the sampled data become more unevenly spaced. For each selected proportion prop, we randomly select 300 × prop time points to sample, then estimate the period using each of the methods described in this paper. We replicate this 100 times, thus creating 100 samples for each proportion. We quantify the performance of the different methods in two ways. First, we consider the classical statistical mean squared error (MSE): wherep i is the estimated period from artificial data i, and p 0 is the known period. As a second measure of performance we consider the accuracy metric from Siyanbola et al. (2012) which has been also used by Graham et al. (2013) in his comparison: where p 0 is the true period, ∆τ is the duration of the time series, δφ max is the maximum allowed phase offset after period-folding some cycles. In sea level data p 0 = 1 year, ∆τ = 25 years, also considering that the minimum spacing between two trial periods is 0.005 year in the simulation, we simplify the above accuracy metric to So if the estimatep is within 0.01 year to the true period p 0 , we acceptp as an accurate estimate. The percentages of accurate estimates of each method in 100 runs are plotted as follows: The accuracy in Figure 1 and mean square error in Table 1 suggest that Lomb-Scargle periodogram and cubic spline have a robust performance in every unevenly spaced case (different proportions sampled). However, Lomb-Scargle periodogram has slightly smaller mean squared error than the cubic spline. Local regression has the smallest MSE when 40% or more of sea level data are sampled. The Lafler-Kinman(LK), Renson(REN) and Dispersion Minimization(PDM) methods perform poorly relative to the other three methods. As data gets noisier, period finding methods perform worse and eventually fail to detect the true period. Thus, the resistance of each method to increased variability of white noise is of great interest. To examine the impact of noisy data, we randomly sample 60% of the sea level data 100 times, in each case we add additional Gaussian random noise with different standard deviation at 0.5σ, 1.0σ, 1.5σ, 2.0σ, 2.5σ to the sampled data. Note σ is the estimated standard deviation of the original sea level data, so adding noise at 0.5σ increases the total noise variance by 1.25 relative to the original data variance, while adding noise with standard deviation 2σ increase total variance by a factor of 5. The accuracy plot in Figure 2 and MSE in Table 2 suggests that the performance of Lomb-Scargle periodogram and cubic spline deteriorate sharply when the variance of the white noise is above 2σ 2 . The misleading estimates of Lomb-Scargle periodogram in noisy unevenly spaced time series in the simulation coincide with the partial findings from Schimmel (2001

An Application: Periodicity in the Light Magnitude of Variable Stars
Determination of the periodicity is a fundamental issue in the study of variable stars, which includes classification of variable stars, calibration of the period-luminosity relation, determination of the pulsation modes, detection of stellar rotation and so on.
The data of the variable stars were collected through MACHO project, which is a collaboration of scientists at the Mt. Stromlo and Siding Spring Observatories, the Center for Particle Astrophysics at Santa Barbara, San Diego, the University of California at Berkeley, and the Lawrence Livermore National Laboratory. Data were collected daily over a 4-year period when weather permitted, on approximately 8 million stars in the Large Magellanic Cloud (LMC) and the bulge of the Milky Way.
A Cepheid variable is a type of star that pulsates varying in both diameter and temperature and producing changes in brightness with a well-defined stable period and amplitude. Data of Lcb1 Cepheid variable star from LMC is used as the example which consists of 327 observations made in 385 days. The maximum spacing(gap) between two observations is 32.83 days.

Methods
Period Estimate (in days)    Figure 3 visualizes how each method searches for the period. Note that Lafler-Kinman and Local Regression were not plotted since their search paths are identical to Renson and cubic spline.
The horizontal dash line in plot(a) Lomb-Scargle periodogram is the critical value. Periodogram peaks exceeding this line are considered significant and thus is the estimate of period. The significant level α = 0.01 is used. Renson method in plot(b) calculates the string-length statistic for all trial periods, the minimum statistic, usually the first valley, is taken as the period estimate. Cubic spline and Phase disperson minimization are similar to Renson string-length but minimize different statistics. There are multiple valleys in Renson string-length and Phase dispersion minimization plots other than the first. They correspond to the multiple of the period and may have close statistics to the first valley. String-length methods and Phase Dispersion minimization are less likely to distinguish these valleys and thus estimate multiple periods when data are noisier or more irregularly spaced.

Conclusions
Cyclic behavior is a prominent feature in many types of geophysical data. These periodic effects can be estimated without specifying a parametric model, and should be accounted for in statistical analysis of the data. In this paper we consider estimating unknown cycle lengths in non-uniform time series data. We have examined their performance by a simulation on sea level data. Then we have considered the case of estimating a primary period as in the case of the light magnitude of a periodic star. Several of the methods in this paper work quite well in this case, especially Lomb-Scargle periodpgram and periodic cubic spline.
As a final comment we note that many geophysical time series may have multiple cycles impacting the data. Modern statistical model selection methods such as LASSO can be used to simultaneously determine the important periods and estimate the effects at each important period (Kato & Uemura, 2012). Other unsolved research problems still require efforts, for example, unevenly spaced data that consist of periodic signals with non-sinusoidal shapes, or correlated noise.