Centennial variations in sunspot number, open solar flux, and streamer belt width: 1. Correction of the sunspot number record since 1874

We analyze the widely used international/Zürich sunspot number record, R, with a view to quantifying a suspected calibration discontinuity around 1945 (which has been termed the “Waldmeier discontinuity”). We compare R against the composite sunspot group data from the Royal Greenwich Observatory network and the Solar Optical Observing Network, using both the number of sunspot groups, NG, and the total area of the sunspots, AG. In addition, we compare R with the recently developed interdiurnal variability geomagnetic indices IDV and IDV(1d). In all four cases, linearity of the relationship with R is not assumed and care is taken to ensure that the relationship of each with R is the same before and after the putative calibration change. It is shown the probability that a correction is not needed is of order 10−8 and that R is indeed too low before 1945. The optimum correction to R for values before 1945 is found to be 11.6%, 11.7%, 10.3%, and 7.9% using AG, NG, IDV, and IDV(1d), respectively. The optimum value obtained by combining the sunspot group data is 11.6% with an uncertainty range 8.1–14.8% at the 2σ level. The geomagnetic indices provide an independent yet less stringent test but do give values that fall within the 2σ uncertainty band with optimum values are slightly lower than from the sunspot group data. The probability of the correction needed being as large as 20%, as advocated by Svalgaard (2011), is shown to be 1.6 × 10−5.


Introduction
This is the first of a series of three papers to model the long-term variation in open solar flux and the variation in the latitudinal width of the coronal streamer belt. The model used in paper 3 [Lockwood and Owens, 2014] is a development of that by Owens and Lockwood [2012], which is a variant of the application of the open flux continuity equation by Solanki et al. [2000]. The development of open solar flux reconstructions and their modeling has recently been reviewed by Lockwood [2013] and Owens and Forsyth [2013] and the modeling in paper 3 reproduces the reconstructions by Lockwood et al. [2014b]. As in the Solanki et al. paper, the open flux emergence rate is modeled using sunspot number, although the precise quantification is different, being based on average flux content and occurrence frequency of coronal mass ejections. Thus, it is important that the input sunspot record is as accurate as possible. In this first paper, we discuss in detail the calibration of international/Zürich sunspot number across a putative calibration error around 1945.
A survey of sunspot metrics and catalogues has recently been presented by Clette et al. [2007] and Lefevre and Clette [2014]. The international sunspot number, R, quantifies the sunspot activity present on the surface of the Sun according to the formula where N G is the number of sunspot groups, N S is the number of individual spots, and k is the observatory factor that varies with location and observing instrument. This has been compiled since 1981 from a number of observers (currently it uses 86 observers located in 29 different countries) and is designed to be a continuation of Brussels. Before this date, daily observations are too sparse but monthly values have been compiled. In the present paper we refer to the intercalibrated WDC-SILSO composite as R. The need to scale N G and N S from observations, sketches, or photographic plates leads to some degree of subjectivity, but this is limited in the modern data: the RMS dispersion between simultaneous daily observations by many separate observers, including the effect of different meteorological conditions, is of order 8%. Nevertheless, subjective decisions about what constitutes a group must be made and a drift in the k factor could lead to a calibration error. Clette et al. [2007] have reviewed the history of the compilation of R and discussed potential sources of error.
In this first paper we consider the accuracy of R after 1874, when photographic observations by the Royal Greenwich Observatory commenced. The WDC-SILSO composite in this interval uses international sunspot number for 1981 to the present, Zürich sunspot number for 1882-1980, and Wolf sunspot numbers for 1874-1882 (which were intercalibrated with Zürich numbers over the interval 1877-1892 by A. Wolfer (see Paper 2 [Lockwood et al., 2014c]). In Paper 2 we study sunspot numbers before 1874.
There are several complications with the sunspot number record. The early Zürich/Wolf sunspot number shows well-known long-term differences to the group sunspot number R G defined by Schatten [1994, 1998] as where N G is number of sunspot groups and k′ is a site/observer factor and the averaging is done over the N observers that are available for that day. Daily values are then averaged to give annual means. The factor 12.08 was derived to make R G and modern international sunspot number R the same on average. Because of the greater difficulty in quantifying the number of individual spots, N S , the group number is generally regarded as more robust than the early Wolf number (see discussion by, for example, Hathaway [2010] and Usoskin [2013]). However, because, in general, N S is not necessarily proportional to N G on all time scales, it should be noted that R G and R are not identical metrics of solar activity. Hence, differences can be real and so do not necessarily indicate errors in one or both data sequences. Leussu et al. [2013] have used Schwabe's original data to identify an inconsistency in the Wolf sunspot number around 1849, associated with the change of the primary observer from Schwabe to Wolf. This calls for a reduction in R before 1848 by 20% which accounts for much of the large difference between R G and R before this date.
Recently, Svalgaard [2011] has argued that the international sunspot number record in common usage, and as stored in many data centers, suffers from another calibration problem around 1945 with values before then being 20% too low. Svalgaard terms this the "Waldmeier discontinuity" [see also Aparicio et al. 2012 andCliver et al. 2013] and argues that it was caused by the introduction of a weighting scheme for sunspot counts according to their size and a change in the procedure used to define a group; both changes that may have been introduced by the then director of the Zürich observatory, Max Waldmeier, when he took over responsibility for the production of the Wolf sunspot number in 1945. Svalgaard argues that these corrections were not applied before this date, despite Waldmeier's claims to the contrary. By comparison with other long time series of solar and solar-terrestrial indices, Svalgaard makes a compelling case that this discontinuity is indeed present in the data. He argues that sunspot number values before 1945 need to be increased by 20% as a result.
One issue that needs careful consideration in quantifying the correction needed is the homogeneity of the data series that R is compared to, both before and after the putative discontinuity. Also, it is very important that the indices compared are constructed completely independent of R, if they are to act as a test and/or calibration of R. Because sunspot number has been used as a standard for many years, this is more of a problem than it first appears to be. For example, there is evidence that sunspot data were used to correct geomagnetic observations and such cross contamination is highly undesirable.
In this paper, we use four indices for which the provenance is well known. The time series of annual means of these data are shown in Figure 1. Two of these (specifically the number of sunspot groups, N G , and the total area of sunspots, A G ) are taken from the sunspot group data generated from 1874 onward by the Royal Greenwich Observatory (RGO) based on photographic observations at Greenwich and several support stations (Cape of Good Hope, Kodaikanal, and Mauritius), used to minimize data loss due to cloud cover. Differences between data sets can arise due to errors introduced by the personal bias of the observer in defining a group, limited seeing conditions at the observation site, different amounts of scattered light, the difference in the time when the observations were made, and the instrumentation and techniques deployed.  Baranyi et al. [2001] have shown this leads to systematically lower A G values. Foster [2004] and Balmaceda et al. [2009] used the Pulkovo Observatory SD data, along with some other data sources, to achieve intercalibration between the RGO and SOON sunspot areas. In section 2 of this paper, we also consider intercalibration of the RGO and SOON data. In section 3 we study the 1945 discontinuity using both the combined RGO/SOON data and the RGO data alone.
The other data used in the present paper are interdiurnal variation geomagnetic indices. Svalgaard and Cliver [2010] compiled the IDV index based on Bartels' u index [Bartels, 1932]. Lockwood et al. [2013] have generated a variant of this index, IDV(1d) designed to be as homogeneous in its construction as possible. In this paper, we make use of both indices after 1874, when they are extremely similar despite considerable differences in how they are compiled and the different data sets used. We chose these indices because both depend primarily on the strength of the near-Earth interplanetary magnetic field (IMF) without the strong dependence on solar wind speed found in range indices such as aa and Ap (see review by Lockwood [2013]): this means that they also have a simpler relationship to sunspot number. Lockwood et al. [2014a] show that modeled and observed IMF data series correlate with N G n with optimum n of 0.4 and 0.5 for observed and geomagnetically reconstructed IMF, respectively. In both cases, the simultaneous N G n value explains about 70% of the IMF variation in annual means, the remainder being mainly associated with the prior history of N G . In section 4 we employ the relationship between IDV and IDV(1d) and the simultaneous N G value to provide two further tests of R. Further aspects of the relationship between sunspot number and the interdiurnal geomagnetic indices are discussed in Paper 2 of this series [Lockwood et al., 2014c].

IDV (nT)
year e). The grey vertical line in each panel of Figure 1 marks the date of the putative Waldmeier discontinuity. There is no obvious major change in the behavior of the sunspot number, the sunspot group data, or the geomagnetic indices around this time; however, it lies on the steep initial rising phase of a solar cycle when a change in a scaling factor would be hardest to detect.

Intercalibration of the RGO and SOON Data Sets
Because the focus of the present paper is the sunspot number R, we use this to intercalibrate the RGO and SOON data, rather than a series of intercalibrations of overlapping data. This method was actually employed in some places by Balmaceda et al. [2009], in addition to their use of overlapping data series. Figure 2 demonstrates the two methods used to search for calibration changes that are employed in this paper. In this section, both methods are used to investigate the intercalibration of the RGO and SOON sunspot group data. We employ only data from after the putative Waldmeier discontinuity in R around 1945 to intercalibrate the RGO and SOON data. . The horizontal dashed lines show the correlation for the RGO data alone with the horizontal dash-dotted lines showing that for the SOON data alone. These correlograms both show clear peaks, but the question arises how well this defines the optimum values of C N and C A , because it is not immediately apparent at what value each correlation is significantly lower than its peak value. This question is addressed in Figures 2c and 2d which show the significance S of the difference between the peak correlation and that at general C, computed using the Fisher Z test and assessed by comparison with the AR-1 autoregressive red noise model using the procedure of Lockwood [2002]. It can be seen the peak correlation lies between those for the two data series individually (which is slightly higher for the RGO data in both cases). However, as C is varied away from the optimum values, r falls because of the discontinuity introduced between the data sets by the nonoptimum values of C. S is a measure of the statistical significance of that decline.

Analysis Using Fit Residuals
Figures 2e and 2f study the mean residuals of the best fit to R using the full data sequences of N G or A G m (for 1946-2012). These plots show the differences < δ > a À <δ > b where < δ > a is the mean residual for the SOON data (1977SOON data ( -2012 and < δ > b is the mean residual for the RGO data used (1946RGO data used ( -1976. Figure 2e is for N G , and Figure 2f is for A G m . To normalize these differences, they are shown as a percentage of the overall mean for C = 1. For the optimum C, the difference should fall to zero, i.e., there is no change in mean residual across the RGO/SOON data join. These values are marked by the vertical dashed lines, being at C = C N = 0.911 for N G and C = C A = 1.435 for A G : these values are very close indeed to the C values giving peak correlations in Figures 2a and 2b (Figures 2c and 2d show that the significance of any difference is very close indeed to zero). Figures 2g and 2h show the probability p value that <δ> b and <δ> a are the same. Because both the sample sizes and the variances are not the same for the two data subsets, we use Welch's t test to evaluate the p values of the difference between the mean fit residual before and after 1 January 1946; this two-sample t test is a parametric test that compares two independent data samples [Welch, 1947]. Because it is not assumed that the two data samples are from populations with equal variances, the test statistic under the null hypothesis has an approximate Student's t distribution with a number of degrees of freedom given by Satterthwaite's approximation [Satterthwaite, 1946]. The distributions of residuals are close to Gaussian distributions in this case and so application of nonparametric tests (specifically, the Mann-Whitney U (Wilcoxon) test of the medians and the Kolmogorov-Smirnov test of the overall distributions) gives very similar results (not shown).
Using overlap of the Pulkovo data, Balmaceda et al. [2009] obtained a value of C A = 1.489 ± 0.194 (for A G m , with m = 0.732) and the value of 1.435 derived here lies comfortably within their uncertainty band. The main difference in the optimum value arises because we have used R to intercalibrate whereas Balmaceda et al. used, in the main, overlapping data. In addition, we have only used data for after the putative Waldmeier discontinuity in 1945. The great consistency between the two methods demonstrates the great stability of R during the interval studied . Hathaway [2010] derives and uses a value of 1.48. Figure 3 shows scatterplots of annual means of R as a function of (a) N G and (b) A G for all data from after the putative Waldmeier. The solid points are the RGO data, and the open circles are the SOON data for correction factors C N = 0.911 for N G and C A = 1.435 for A G m . It can be seen that the behavior for both RGO and corrected SOON data is the same after the putative Waldmeier discontinuity, with an almost linear variation for N G (but slightly greater scatter) and a somewhat less linear variation for A G .

Sunspot Group Data
Figure 4 studies the relationships between R and N G and between R and A G before, after, and across the putative discontinuity in 1945. For both N G and A G , the optimum RGO/SOON calibration factors (C N and C A , respectively) derived in section 2 are applied.
In Figure 4, several parameters are shown as a function of an exponent m, which is varied between 0 and 1.5.  (Figure 4b), the behavior before and after 1 January 1946 is almost identical with peak correlation at m = 0.87 in both data subsets. However, the behavior of the total data set (1874-2012, black line) is different with the peak correlation at a slightly larger m. This is a clear indication that there is a discontinuity in either R or A G (or both) around 1945-1946. In the case of N G m (Figure 4a), the behavior of the before and after subsets is not as similar as in the case of A G , with the before data set giving slightly lower correlations that peak at a lower m. As expected from the scatter in Figure 3, peak correlations are slightly lower for N G than A G . Again, the correlation for the total data set is different from that of the two subsets individually, again showing a discontinuity in one or both data series around 1 January 1946. Figures 4c and 4d show the significances S of the difference between the three peak correlations and their values at general m, generated in the same way as Figures 2c and 2d, where S b is for the data from before the putative discontinuity and S a is for the data from after it. Figures 4e and 4f are used to derive the optimum values of m, by showing the probability that the data for before and after share the same m, (1 À S a )(1 À S b ) as a function of m. The peak of this probability is essentially unity for A G and gives the optimum m of 0.87. For N G the peak is very slightly lower at 0.96 and is at m = 1.00. Figure 5 repeats this process using the geomagnetic indices IDV and IDV(1d). In this case, the behavior is more clearly seen if we correlate these geomagnetic indices with R n because n < 1 for all best fits. We here add a third time period, 1845-1873, for which both R and geomagnetic data are available, results for which are shown in green. For IDV(1d) the behavior is the same in all three intervals with peak correlation at n = 0.54: correlations get progressively lower for the earlier intervals, which is to be expected as measurement noise should have decreased over time for both the sunspot and geomagnetic data series as observing techniques improved. For IDV, Figure 5b shows that the 1874-1945 and 1946-2012 intervals again give very similar behavior, but that for 1845-1873 is very different indeed (green line). In particular, this interval gives a much larger correlation coefficient and at an n value near unity. This demonstrates that IDV for this interval is much more similar to R than for later intervals. In fact, IDV from this period is based on a different type of data than for later periods, being 10u, where u is Bartel's u index which, for much of the interval in question, was compiled from the diurnal range of geomagnetic activity [see Svalgaard and Cliver, 2010]: hence, IDV is not a homogeneously constructed index, whereas Lockwood et al. [2013] devised IDV(1d) to be as homogeneous in its construction as it possibly could be. In this paper, we use only data after 1874 to avoid these problems and to match the interval used for the sunspot group data, but the implications of Figure 5 will be discussed again in Paper 2 [Lockwood et al., 2014c]. Like IDV(1d), IDV shows the same behavior with R in the 1874-1945 and 1946-2012 intervals and using the probability distribution that they have the same exponent n given in Figure 5f, we find the optimum n is 0.69 for IDV. The relationship between R and the geomagnetic indices is actually more complex than it appears here because of the role of past history of solar activity [Lockwood et al., 2014a].  [Lockwood and Owens, 2014], we derive a loss time constant that varies over the solar cycle between 0.7 years and 1.3 years giving a somewhat smaller average value. Lockwood et al. [2011] show that the resulting open solar flux predictability falls from 1 to 0.5 over 22 years, and Lockwood et al. [2014a] detect an influence on the near-Earth IMF of the mean sunspot number over the previous 11 years as well as the current value. We here use the current value only to evaluate average levels of sunspot number before and after the putative 1945 discontinuity: in the next section we show that the results are consistent with the findings from the sunspot group data, but the uncertainties are much greater because of the additional history factor.

Analysis of the Amplitude of the Putative Discontinuity in International Sunspot Number
The values of the exponents n and m derived in section 3 are here used to analyze the amplitude of the putative discontinuity around 1945 in the R record. The procedure used is this: the sunspot number is amended before 1945 by multiplying by a correction factor f R which is varied between 0.9 and 1.3 (i.e., a change between a 10% decrease and a 30% increase is implemented). For each f R this generates a sequence R C , where R C = R for 1946 and after and R C = f R × R for 1945 and before. R C is then linearly regressed with A G m , N G m , IDV 1/n , and IDV(1d) 1/n , using the relevant values of m and n derived in the last section (m = 0.871 for A G , m = 1.000 for N G , n = 0.540 for IDV(1d), and n = 0.690 for IDV).
The fit residuals δ are then evaluated. For example, for A G , the residuals are given by where s and c are the best fit linear regression coefficients. The mean value of δ for before the discontinuity , <δ> b , is then compared to that for after the discontinuity , <δ> a , and the difference between the two quantified and its significance evaluated using Welch's t test, as discussed above. This procedure is applied to A G , N G , IDV, and IDV(1d). An example of the results is presented in Figure 6, which is for f R = 1.100. Figure 6a shows the time variations of R C along with best fit regressions which are termed R AG , R NG , R IDV , and R IDV(1d) . Figure 6b shows the time series of the fit residuals δ for each case. As expected from the correlation coefficients, the residuals are always small for A G , slightly larger for N G and sometimes quite considerable for the geomagnetic indices, owing to the limitations in the correlation discussed in section 3.2.
For each of the four calibration sequences and every f R , the difference <δ> a À <δ> b is evaluated and their variations with the f R value are shown in Figure 7a. The difference decreases linearly with f R in every case, and the optimum value is where <δ> a À <δ> b is zero. This happens at the f R values given in Table 1 which are in the range 1.079 (a 7.9% correction) to 1.117 (a 11.7% correction).
Comparison of the residuals variations in Figure 6b shows that the agreement is much better for some parameters than others. To derive an overall correction factor that takes this variation into account, we here evaluate the probability p values that <δ> a and <δ> b are the same using Welch's t test, as above. Normalizing over the whole distribution with f R gives the probability density functions (pdfs) shown in Figure 7b. The pdfs are broader (and thus with a lower peak) for IDV and IDV(1d) because of the lower correlation (the peak r for R and A G m being 0.98, whereas the peak r for both IDV(1d) 1/n and IDV 1/n is 0.84). The broad pdf and the low peak for the geomagnetic data emphasize how much poorer a test of the R calibration is given by the geomagnetic data than is given by the sunspot data. Even for geomagnetic activity indices responding primarily to solar EUV effects on the thermosphere, geomagnetic activity is an indirect indicator of sunspot number, and we concur with Mursula et al. [2009] who pointed out that geomagnetic data are not a sufficiently precise measure of sunspot numbers to make them valid for calibration.
We here combine the probabilities by taking their product to generate the pdf in black. The optimum f R is then taken as the peak of this combined distribution and its uncertainty evaluated at the 2σ level. Table 1 gives The first column uses all four calibration data series (A G , R G , IDV(1d), and IDV); the second uses all four, but only the RGO sunspot group data are used. The third column uses only the sunspot group data. For each, the optimum, maximum, and minimum (at the 2σ level) values of f R are given along with the values from A G , R G , IDV(1d), and IDV in isolation. Also given are the probabilities that no correction is required, p(f R = 0.000), and that the correction required is 20%, p(f R = 1.200).   Figure 7. (a) The difference between the mean fit residuals after 1 January 1946 <δR> a and before that date <δR> b as a function of the f R value used to generate R C from R. The values of <δR> a À <δR> b are shown for the best fits from (green) A G , (mauve) N G , (blue) IDV, and (orange) IDV(1d). The optimum f R value in each case is where the line crosses the zero level (the horizontal black line) and is tabulated in Table 1. (b) The probability distribution functions from p values obtained by applying Welch's t test to the difference between the means <δR> a and <δR> a . The total probability from A G and N G is given by the black line. The vertical black line in both panels marks the optimum value of f R = 1.120 with the 2σ uncertainty band (1.070 ≤ f R ≤ 1.155) shown in grey. three different combinations of the data. In the first column, all four calibration data series are used. In the third column, we only used the sunspot group data (A G and N G ) and use the geomagnetic data as an independent check. Because we think it is important to maintain the full independence of the sunspot and geomagnetic data, this is our preferred option. In the second column we search for any influence of the calibration of the SOON and RGO data on the result by only using RGO data. As a check, we have also applied the correlation method (see section 2.1) to define the optimum f R values. The optimum f R obtained by correlating with both A G and N G was found to be slightly larger, but the uncertainty band around it is broader because fewer data were used.
The combined pdf for the preferred option is shown by the black line in Figure 7b, and the vertical black line gives the optimum f R (= 1.116, i.e., a 11.6% correction), and the shaded grey area gives the 2σ uncertainty band (i.e., 95% of the combined pdf lies in this band) which is between f R = 1.081 (an 8.1% correction) and f R = 1.148 (a 14.8% correction). Although the geomagnetic indices give slightly lower values, including them makes little difference because their probability distributions are so much broader. Using only the RGO data (i.e., we omit the SOON data for 1976-2012) increases the values, but only slightly, so the intercalibration of RGO and SOON is not strongly influencing the results.
Table 1 also quantifies the probabilities that f R is as small as unity and as large as 1.20. Both are exceedingly small. Therefore, we can discount the possibility that no correction is needed; however, we can equally discount the possibility that it is as large as 20% as was advocated by Svalgaard [2011].

Conclusions
We have studied the putative discontinuity in the international sunspot number record in around 1 January 1946. Our results confirm the conclusion of Svalgaard [2011] that the discontinuity is present in the widely used data set that is available from most data centers. However, Svalgaard's estimate that a 20% correction is here shown to be an overestimate. Using the RGO/SOON spot area, Svalgaard [2011] actually found an 18% correction (it is not clear from his paper how this was quantified) that he rounded up to 20%, but both values are far out on the far tail of the probability distribution found here. We suggest preferred value should be based on sunspot data alone, maintaining its independence from geomagnetic data: we have demonstrated problems with early IDV data caused by a failure to maintain that independence. Hence, we recommend an optimum correction of 11.6%, with the 2σ uncertainty range being 8-14%. The geomagnetic data provide broad confirmation of this with IDV giving and optimum correction of 10.3% and IDV(1d) giving 7.9%.