A WAVELET THRESHOLDING APPLICATION ON PARTICULATE MATERIAL CONCENTRATION TIME SERIES

Denoising is extremely important in data analysis to visualize patterns, estimate structural features present in the data and avoid misclassifications. In the environmental field, it is of interest to identify days with high air pollutant levels that can affect population health. Pollutant levels collected in a given region, as in any data collection procedure, are contaminated by random noise, which can lead to misclassification of the real pollution conditions in that region. In this sense, statistical denoising methods are welcome to reduce noise in the data. The present paper proposes the use of soft wavelet thresholding with the universal threshold policy for denoising particulate material (PM) time series and identifying days with high concentration levels of these pollutants. The technique is applied to PM10 and PM2.5 time series collected by the São Paulo Environmental Company (CETESB) at the Santos station during the period 2018-2020.


INTRODUCTION
Wavelet-based statistical methods have been extensively applied to analyze data in several areas of science, such as engineering, genetics, ecology and physics. In a nonparametric regression model, the focus of this work, it is possible to expand an unknown square integrable function as a linear combination of wavelet basis functions, in the same way as with other bases, such as splines, Fourier functions and polynomials. In contrast to these other bases, however, wavelet expansions are typically sparse, i.e., most of the coefficients of the representation are zero. In fact, because wavelets are well localized in both time and scale, the wavelet coefficients of the expansion are nonzero only at positions that contain important features of the function, such as peaks, discontinuities, oscillations and minimum or maximum points. This allows the underlying function to be analyzed through these few nonzero wavelet coefficients alone. See VIDAKOVIC (1999), NASON (2008) and MALLAT (1998) for excellent overviews of statistical methods based on wavelets.
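For reference, the wavelet expansion mentioned above is commonly written as below. The notation (the mother wavelet \psi, the coefficients d_{j,k} and the indices j for resolution level and k for location) is an assumption of this sketch, chosen to follow a widespread convention, and is not necessarily the notation adopted elsewhere in this paper.

```latex
f(x) = \sum_{j,k \in \mathbb{Z}} d_{j,k}\,\psi_{j,k}(x), \qquad
\psi_{j,k}(x) = 2^{j/2}\,\psi(2^{j}x - k), \qquad
d_{j,k} = \int_{\mathbb{R}} f(x)\,\psi_{j,k}(x)\,dx .
```

Sparsity means that, for a function that is smooth except at a few locations, most of the d_{j,k} are zero or negligible.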
Although the wavelet coefficients are essentially null in smooth parts of the function to be estimated, in practice one observes data that are contaminated by noise, which is intrinsic to any data collection process. The impact of noise in the data is noisy coefficients in the wavelet domain: the empirical wavelet coefficients, i.e., the coefficients obtained after applying a discrete wavelet transform (DWT) to the original data, are also noisy.
According to the São Paulo Environmental Company (CETESB, 2021a, free translation), "Particulate Material is a set of pollutants consisting of dust, smoke and all types of solid and liquid material that remain suspended in the atmosphere because of their small size. The main sources of particulate emissions to the atmosphere are: automotive vehicles, industrial processes, biomass burning, re-suspension of dust from the ground, among others. Particulate matter can also form in the atmosphere from gases such as sulfur dioxide (SO2), nitrogen oxides (NOx) and volatile organic compounds (VOCs), which are mainly emitted in combustion activities, turning into particles as a result of chemical reactions in the air."
There are several types of particulate material. In this work we consider inhalable particles (PM10), which are particulate materials with aerodynamic diameter less than 10 µm and "can become trapped in the upper part of the respiratory system or penetrate deeper, reaching the pulmonary alveoli…" (CETESB, 2021a, free translation), and fine inhalable particles (PM2.5), which are particulate materials with aerodynamic diameter less than 2.5 µm and "penetrate deeply into the respiratory system and can reach the pulmonary alveoli." (CETESB, 2021a, free translation). In the Brazilian state of São Paulo, CETESB has stations distributed around the state that measure, among several pollutants, daily PM10 and PM2.5 concentrations.
It is of great interest to identify days with high PM10 and PM2.5 concentrations in order to assess the degree of air pollution, since the size of the particles is directly associated with their potential to cause health problems: the smaller the particles, the greater the effects.
Particulates can also reduce visibility in the atmosphere.
In this sense, the goal of this work is to identify days with high concentration levels of the particulate materials PM10 and PM2.5.
Let us suppose one has n = 2^J (J an integer) observations of a time series coming from a nonparametric regression model with additive random noise.
Once the wavelet coefficients are estimated, we apply the inverse discrete wavelet transform (IDWT) to the vector of estimated coefficients to estimate the underlying function. The full denoising process is represented in Figure 2, adapted from SOUSA (2021).
Table 1 shows air quality classifications according to PM10 and PM2.5 levels. Note that the critical concentration level for the classification "Good" is 50 µg/m³ for PM10 and 25 µg/m³ for PM2.5.
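The denoising process described above (DWT, thresholding of the empirical coefficients, IDWT) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the Haar wavelet as a stand-in for the Daubechies basis so that only the Python standard library is needed, and it assumes the usual forms of the soft thresholding rule, the universal threshold and the median-absolute-deviation noise estimate. All function names are hypothetical.

```python
import math
import random
from statistics import median


def haar_dwt(signal):
    """Orthonormal Haar DWT of a list whose length is a power of two.

    Returns (approx, details); details[0] holds the finest resolution level.
    """
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details


def haar_idwt(approx, details):
    """Inverse of haar_dwt: rebuild the signal from coarse to fine levels."""
    approx = list(approx)
    for detail in reversed(details):  # coarsest level first
        rebuilt = []
        for s, d in zip(approx, detail):
            rebuilt.append((s + d) / math.sqrt(2))
            rebuilt.append((s - d) / math.sqrt(2))
        approx = rebuilt
    return approx


def soft_threshold(d, lam):
    """Soft rule (assumed form): sign(d) * max(|d| - lam, 0)."""
    return math.copysign(max(abs(d) - lam, 0.0), d)


def mad_sigma(finest_details):
    """Noise s.d. estimate (assumed form): median(|d|) / 0.6745."""
    return median(abs(d) for d in finest_details) / 0.6745


def universal_threshold(sigma, n):
    """Universal threshold (assumed form): sigma * sqrt(2 * log(n))."""
    return sigma * math.sqrt(2.0 * math.log(n))


def denoise(signal):
    """DWT -> soft thresholding of detail coefficients -> IDWT."""
    n = len(signal)
    if n < 2 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx, details = haar_dwt(signal)
    lam = universal_threshold(mad_sigma(details[0]), n)
    shrunk = [[soft_threshold(d, lam) for d in level] for level in details]
    return haar_idwt(approx, shrunk), lam


if __name__ == "__main__":
    # Synthetic example only: a smooth curve plus Gaussian noise.
    random.seed(0)
    noisy = [30 + 10 * math.sin(2 * math.pi * i / 64) + random.gauss(0, 3)
             for i in range(64)]
    smooth, lam = denoise(noisy)
    print(f"threshold = {lam:.2f}, first smoothed values = {smooth[:3]}")
```

In practice a wavelet library would be used to obtain the Daubechies transform; the thresholding step itself is independent of the basis.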

WAVELET ANALYSIS AND RESULTS
We applied a DWT to the PM10 and PM2.5 time series in order to denoise and analyze these datasets in the wavelet domain. To perform the DWT, we chose the Daubechies wavelet basis with 10 vanishing moments; see VIDAKOVIC (1999).

PM10 TIME SERIES ANALYSIS
After applying a DWT to the PM10 time series, we obtained the vector of 1024 empirical wavelet coefficients, which are shown in Figure 5. The noise standard deviation estimate (7) is required to perform universal thresholding (6), denoise the empirical coefficients and obtain the estimates of the wavelet coefficients. For the PM10 time series, the noise standard deviation was estimated according to (7), and the associated universal threshold given by (6) is 19.21. The soft thresholding rule (5) shrinks to zero the empirical coefficients with absolute value less than 19.21, which is the case for 976 of the 1024 PM10 empirical coefficients, or about 95% of them. Thus, after applying the soft thresholding rule, only 48 coefficients remain nonzero, which makes the main features of the data easier to identify, since these 48 nonzero coefficients concentrate all the important characteristics to be estimated from the data. Figure 6 shows the estimated wavelet coefficients by resolution level. Observe the significant noise reduction at the higher resolution levels, where practically all coefficients were shrunk to zero.
The smooth version of the PM10 time series is obtained by applying the IDWT to the vector of estimated coefficients and is presented in Figures 7 and 8. The first shows the smooth time series together with the original one for magnitude comparison. The second shows only the smooth time series with the critical value for "good" quality according to Table 1, i.e., a PM10 concentration equal to 50. After denoising, just 11 days remained above the critical value for "good" quality, most of them between May and July 2020. These detected days are listed in Table 2.
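The figures quoted above can be checked with a short calculation. The block below assumes that the universal threshold (6) has its usual form, with natural logarithm and n = 1024; this form is an assumption of the sketch, since it is not restated in this section.

```latex
\lambda = \hat{\sigma}\sqrt{2\log n}, \qquad
n = 1024 \;\Rightarrow\; \sqrt{2\log 1024} = \sqrt{20\log 2} \approx 3.723,
\\[4pt]
\frac{976}{1024} \approx 0.953, \qquad 1024 - 976 = 48 .
```

Under this assumption the threshold is about 3.7 times the estimated noise standard deviation, and the 48 surviving coefficients correspond to roughly 5% of the sample size.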

PM2.5 TIME SERIES ANALYSIS
A similar wavelet analysis was carried out on the PM2.5 time series. The empirical coefficients are shown in Figure 9, and the noise standard deviation and universal threshold were estimated according to (7) and (6).
The smooth PM2.5 time series is plotted together with the original one in Figure 11 and separately in Figure 12. In the first, it is possible to observe the degree of smoothing applied to the original time series, i.e., the effect of denoising by thresholding. After smoothing, only three days had a PM2.5 concentration above the critical value of 25 for the "good" air quality classification according to Table 1. These days are listed in Table 3.
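The last step of both analyses, listing the days on which the smoothed series exceeds the critical value for "good" air quality, can be sketched as below. This is a hypothetical helper, not the authors' code: it assumes a daily series with no missing days, a known start date, and the critical values 50 (PM10) and 25 (PM2.5) quoted from Table 1. The toy values and the start date are illustrative only and are not data from this study.

```python
from datetime import date, timedelta

# Critical values for the "Good" classification, as quoted from Table 1.
GOOD_LIMIT = {"PM10": 50.0, "PM2.5": 25.0}


def days_above(smoothed, start, limit):
    """Return (day, value) pairs where the smoothed daily series exceeds limit.

    Assumes smoothed[i] is the value for the day `start + i days`.
    """
    return [(start + timedelta(days=i), value)
            for i, value in enumerate(smoothed)
            if value > limit]


if __name__ == "__main__":
    # Illustrative values only.
    toy_pm10 = [40.0, 51.2, 50.0, 63.5]
    for day, value in days_above(toy_pm10, date(2020, 5, 1), GOOD_LIMIT["PM10"]):
        print(day.isoformat(), value)
```

The comparison is strict, so a day exactly at the critical value is still counted as "good"; whether the boundary belongs to the lower class is an assumption that should be checked against Table 1.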

CONCLUSIONS
We applied soft wavelet thresholding with the universal threshold policy to denoise wavelet coefficients and identify days with high PM10 and PM2.5 concentration levels at the Santos station during the period 2018 to 2020. The estimation procedure made it possible to concentrate the analysis on a small number (about 5% of the sample size) of nonzero wavelet coefficients and to smooth the time series, which facilitates the visualization and identification of the days of interest.
Without the denoising process, days with good concentrations of PM10 or PM2.5 can be misclassified as days with concentrations that are not good, due to the presence of noise in the collected data. In this sense, the thresholding estimator is a welcome tool for denoising data and analyzing them through a sparse set of coefficients.
The method can also be applied to other discriminant concentration levels.
Further, the behavior of the estimator when other wavelet bases and/or threshold policies are adopted should be investigated in future work.