Main Flood Page


PDF of Proposal

Matlab Flood Code

Recent Significant Floods in the US

Here are the tasks from the 2009 flood proposal I have been following, with notes on what has been done so far:

Task I: Exploratory National Scale Analysis of Climate-Flood-Spatial Scale Connections.

1. The USGS daily flow and annual maximum flood data sets will be processed to retain stations that have continuous records that extend to at least 1950, have a drainage area in excess of 1000 sq. miles, and whose flood records are free from the effects of diversion and flow regulation.

  • Switched to HCDN data as their flood records are free from diversion and regulation.

    Downloaded the HCDN data (1874-1988 water years) and processed annual data.

    Out of 1703 stations, 396 had basins greater than 1000 sq. miles.

    Of those, 201 had 50 years of data.

    Plotted the locations of the 201 stations and the number of years of record.

    Extracted annual maximum flow data. (A minimal Matlab sketch of this screening is below.)
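
    As a rough illustration (not the actual processing code), here is a minimal Matlab sketch of the screening, assuming per-station vectors darea (drainage area, sq. miles) and nyears (record length) rather than the real HCDN file layout:

        % Keep stations with big basins and long records (hypothetical inputs).
        keep = find(darea > 1000 & nyears >= 50);
        fprintf('%d of %d stations retained\n', numel(keep), numel(darea));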

2. At each station, the annual maximum flow record will be used to estimate the T-year flood magnitude, for varying T = 5, 10, 20, 50. Next, all daily flows whose magnitudes exceed this threshold (partial duration series) will be identified and their dates will also be recorded.

  • The United States standard for flood frequency analysis is the Log-Pearson III distribution.

    Found a really helpful Matlab code that estimates flood frequencies following the U.S. Water Resources Council publication Flood Flow Frequencies, Bulletin #17B (1976; revised 1981, 1982).

    The only thing it was missing (explained in the file description) is the generalized skew from Plate 1 of Bulletin 17B. I went through the Fortran PeakFQ program and extracted that (figure of it), although I probably spent too much time on it for a small gain in accuracy. If anyone needs it, though, let me know. I also emailed the data to the Matlab code's author, Jeff Burkey, and he said he would incorporate it.

    High-resolution jpg of the first station's flood analysis.

    Flood analysis of each of the 201 stations (large png files).

    Recorded the T = 5, 10, 20, and 50 year flood magnitudes for each station.

    Processed the daily data and saved the year, month, and day of exceedance at each of the four levels for the 201 stations. (A sketch of the quantile and exceedance calculations is below.)
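
    For reference, a minimal sketch of the core Log-Pearson III quantile calculation by the method of moments, using the Wilson-Hilferty frequency-factor approximation. This is only the heart of it - Bulletin 17B's full procedure (generalized skew weighting, outlier tests) is what the Burkey code implements. Q and daily are assumed variable names, and norminv needs the Statistics Toolbox:

        % Q: annual maximum flows (cfs); T: return periods of interest.
        T = [5 10 20 50];
        x = log10(Q);
        n = numel(x);
        m = mean(x);  s = std(x);
        g = n*sum((x - m).^3) / ((n-1)*(n-2)*s^3);       % station skew (assumed nonzero)
        z = norminv(1 - 1./T);                           % standard normal quantiles
        K = (2./g).*((1 + g.*z./6 - g.^2/36).^3 - 1);    % Wilson-Hilferty frequency factor
        QT = 10.^(m + K.*s);                             % T-year flood magnitudes
        % Partial duration series: dates of daily flows above each threshold.
        dates = cell(numel(T), 1);
        for i = 1:numel(T)
            hit = daily.flow > QT(i);                    % daily.flow/date are assumed fields
            [yy, mm, dd] = datevec(daily.date(hit));
            dates{i} = [yy mm dd];                       % year, month, day of exceedance
        end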

3. A cluster analysis will be performed on these dates to identify a set of stations with common seasonality. A method that accounts for the annual cycle (circular data) will be used.

  • I set up the circular annual cycle, following Yochanan's suggestion of representing each date in the year as a vector.

    Jan 1 is X=0, Y=1; Apr 1 is X=1, Y=0; Jul 1 is X=0, Y=-1; and Oct 1 is X=-1, Y=0. Each day of the year has its own direction (columns: month, day, X, Y).

    Converted the month and day of exceedance to X and Y (a sketch of the conversion is below).
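
    A minimal sketch of one consistent way to do the conversion (my reconstruction of the convention, with Jan 1 pointing "north" and the angle advancing clockwise so Apr 1 lands near (1, 0)):

        % Map (month, day) exceedance dates to unit vectors on the annual circle.
        % Uses a 365-day year; leap days shift later dates by at most one day.
        doy   = datenum(2001, mo, dy) - datenum(2001, 1, 1) + 1;  % day of year (non-leap)
        theta = 2*pi*(doy - 1)/365;   % Jan 1 -> 0
        X = sin(theta);               % Apr 1: X ~ 1, Y ~ 0
        Y = cos(theta);               % Jul 1: X ~ 0, Y ~ -1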

    Clustering the X Y dates via their moments, both to tie in with the previous hurricane work and because the lists of dates have differing lengths. Note that for T = 5 or 10 I have all 201 stations, but with T = 20 only 198 stations reached that threshold within the recorded time, and T = 50 brought the number down to 160. Also, in the T = 20 case, 21 of the 198 stations have just one exceedance date, so the variance is 0; for T = 50, 72 of the 160 have zero variance.

    5 yr flood clustering - first I ran 1000 k-means simulations to determine the optimal number of clusters. A choice of three clusters clearly stands out in the plot. I picked one simulation to look at; the silhouette numbers for it are here. Plotted each station's cluster number as a color. Happy to see it is not geographically random - it could have been, since I'm not using location as a clustering variable (only dates of exceedance).
    Looking at a boxplot of the centroids, the X Y date of cluster one is in the Apr to Jul quadrant (mean around mid-May), cluster two is in the Jan to Apr quadrant (mean late Feb), and cluster three is hard to pin down because the spread is large on both X and Y and the means are near 0 on both. In my coordinate system X falls as Y rises and vice versa, so a single date never has both coordinates at 0; near-zero means on both just indicate dates spread around the year. A boxplot of the variance, and plotted ellipses at station sites.
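
    The cluster-count selection looks roughly like the sketch below. feat (one row per station, e.g. [meanX meanY varX varY]) is an assumed feature matrix; kmeans and silhouette are from the Statistics Toolbox:

        % Pick the number of clusters by average silhouette width over many runs.
        ks = 2:9;  nsim = 1000;
        meansil = zeros(size(ks));
        for i = 1:numel(ks)
            s = zeros(nsim, 1);
            for j = 1:nsim
                % 'EmptyAction','singleton' avoids the zero-size-cluster error
                % mentioned under the 50 yr case below.
                idx  = kmeans(feat, ks(i), 'EmptyAction', 'singleton');
                s(j) = mean(silhouette(feat, idx));  % average silhouette width
            end
            meansil(i) = mean(s);
        end
        [~, best] = max(meansil);                    % the peak suggests the cluster count
        fprintf('best k = %d\n', ks(best));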

    10 yr flood clustering - 1000 k-means simulations gave an option of 4 or 7 clusters. Looked at this case, where it could be 4 or 7, and this one, where it is clearly 7 (I didn't find runs that were clearly 4 alone - maybe I should just go with 7).
    4 or 7 case, for 4: ellipse and station location, and X Y date centroid boxplot.
    4 or 7 case, for 7: ellipse and station location, and X Y date centroid boxplot.
    Just 7 case: ellipse and station location, and X Y date centroid boxplot. The just-7 case looks like a rearrangement of the 7-cluster version of the 4 or 7 case: cluster 1 here is cluster 6 there, 2 (5), 3 (4), 4 (7), 5 (1), 6 (2), and 7 (3). Good consistency. I like the just-7 case best, as the mean silhouette values were higher and the X Y date boxplots had shorter whiskers.

    20 yr flood clustering - 1000 k-means simulations gave 5 clusters. Individual case of 5 clusters: ellipse and station location, and X Y date centroid boxplot.

    50 yr flood clustering - for this one, anything beyond 6 clusters produced clusters of zero size (a Matlab kmeans error), so I reduced the candidate range from 2-9 to 2-6. 1000 k-means simulations gave 4 or 6 clusters.
    Individual case of 4 clusters: ellipse and station location, and X Y date centroid boxplot.
    Individual case of 6 clusters: ellipse and station location, and X Y date centroid boxplot.
    I think I would go with 4, as I don't see an improvement in using 6.

    Questions I have on this:

    Wondering if I really should be using moment-based clustering on the X Y dates, since X and Y are not independent. Also wondering if I should add something like lat/lon or climate to the clustering variables, because task 4 calls for stations in a similar region and climate.

4. For each cluster, a set of stations that allows nested spatial analysis, covers a range of drainage areas in an area with roughly the same climate, and has comparable record lengths will be selected. For each station, and for each flood event, an estimate of the atmospheric moisture flux/storm track will be made using the re-analysis data fields. The storm track will be identified as a connected set of grid boxes that is in the vicinity of the drainage basin of interest and has a vertically integrated moisture flux above some threshold.

  • Started by taking a look at the moisture flux for the first station (69W, 47N: the very northern tip of Maine - 01011000 ALLAGASH RIVER NEAR ALLAGASH, ME, 1229 sq miles). When I found the location of this first station I thought maybe I should choose something more in the middle of the country, but then I thought no, we want this to work for all of the US, so Maine is as good as any. An X marks the station location. The first flood in the record is Apr 24 1973 and it is a 5 yr flood (in the plots, a is a 5 yr flood, b 10 yr, c 20 yr, d 50 yr):

    station 1 flood 1, 24 Apr 1973

    I was distressed to see divergence of the moisture field at the station location, until I looked at the previous days:

    station 1 flood 1 - 1 day, 23 Apr 1973

    station 1 flood 1 - 2 days, 22 Apr 1973

    station 1 flood 1 - 3 days, 21 Apr 1973

    One can see that at minus 1 and 2 days the station is clearly in the moisture swath. Looking at CPC precipitation, the heavy rain over the station fell on Apr 22 and 23 (minus 1 and 2 days):

    heavy rain at minus 1 and 2 days, Apr 22 and 23 1973

    We had already selected only large basins in step 1, so it is not surprising that there is a lag as precipitation drains to the river from distant points.

    The next flood I looked at was again at station 1, but this time flood number 2, which is a 20 yr flood. The same one-day delay appears:

    station 1 flood 2, 29 Apr 1973

    station 1 flood 2 - 1 day, 28 Apr 1973

    station 1 flood 2 - 2 days, 27 Apr 1973

    station 1 flood 2 - 3 days, 26 Apr 1973

    Also, the day after flood 2, flood 3 occurs (station 1 flood 3, Apr 30 1973). It is a 10 yr flood, but it is clearly just a continuation of the previous day (no new moisture/rain):

    station 1 flood 3, 30 Apr 1973

    I can identify moisture swaths using Matlab's contour with a threshold (both in value and in size), as I have done with drought blobs before. I was thinking of looking at the day of the flood and up to 3 days prior for the largest vertically integrated moisture flux convergence at the station, and then using that day to outline a swath with a threshold of -0.1 (g/m^2/s) and a minimum size of 4 one-degree boxes (a sketch is below). One question, though: that does not include persistence as described below. Is there a better way to track these?
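
    A minimal sketch of the thresholding idea, under my assumptions: flux is a lat x lon grid of vertically integrated moisture flux divergence on a one-degree grid (so convergence is negative), and bwlabel is from the Image Processing Toolbox:

        % Flag candidate moisture swaths: strong convergence over a minimum area.
        thresh   = -0.1;                 % g/m^2/s threshold from the text
        minboxes = 4;                    % minimum swath size in one-degree boxes
        mask = flux <= thresh;           % convergence = negative divergence here
        lab  = bwlabel(mask, 8);         % label 8-connected regions
        for k = 1:max(lab(:))
            if nnz(lab == k) < minboxes
                mask(lab == k) = false;  % drop regions below the minimum size
            end
        end
        % mask now marks the grid boxes belonging to candidate swaths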

5. All storm tracks (including days preceding the major event) for all the flood events at a site will then be clustered. The clustering algorithm we propose to use will cluster based on the origin, length, and geometry of the track, as well as on the magnitude and persistence of the moisture flux. A recent algorithm that would be suitable for such a classification has been applied to hurricane tracks. The associated probabilities and intensities will be estimated to aid prediction, simulation, flood frequency analysis, and the correspondence with the larger scale atmospheric circulation. The goal of clustering is to permit a statistical analysis of the data set by having a sample size over which one can average, reducing the uncertainty associated with looking at individual events.

6. For each cluster of tracks, we can compute the average rank (and its spread) of the associated flood events in the historical record. The cluster attributes can then be compared against the rank (implicitly the return period) of the event. For instance, if the tracks associated with a cluster of predominantly high-rank floods are long and meridional and have the highest ranks of moisture flux, while a second cluster has relatively short tracks with no preferred orientation and moisture fluxes with low ranks, then cluster 1 would correspond to large scale forcing and cluster 2 to local or synoptic activity. If there is no discernible pattern to the flood ranks and track attributes across the clusters, then the hypothesis that the likelihood of a large scale mechanism grows with the return period of the event is rejected. Bootstrap techniques could be used to check the statistical significance of the cluster assignments and attribute differences (a sketch of one such check is below).
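
  • One hedged way to read the bootstrap check is a resampling test of whether mean flood rank differs across clusters more than chance allows. The vectors rk (flood rank per event) and clust (positive integer cluster labels, column vectors) are assumed names, and this permutation variant is my guess at the intent, not the proposal's prescribed method:

        % Is the spread of mean flood rank across clusters larger than
        % expected if cluster labels were unrelated to rank?
        nboot  = 1000;
        obs    = accumarray(clust, rk, [], @mean);      % observed mean rank per cluster
        spread = max(obs) - min(obs);
        null0  = zeros(nboot, 1);
        for b = 1:nboot
            perm     = clust(randperm(numel(clust)));   % shuffle labels (null case)
            mrank    = accumarray(perm, rk, [], @mean);
            null0(b) = max(mrank) - min(mrank);
        end
        p = mean(null0 >= spread);                      % small p: ranks differ by cluster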

7. If we do identify a cluster that corresponds to large scale fluxes, then composites of atmospheric circulation fields averaged over all events in that cluster could be developed to identify the large scale SST pattern, steering winds, convergence, outgoing longwave radiation and vorticity. These would then help develop an empirical understanding of the associated climate mechanism for that category of floods. If there are an adequate number of events in the cluster of interest, then further dividing the cluster on the basis of the atmospheric or SST patterns (e.g., the eigenvectors of these fields) would be useful.

8. Once tracks, clusters and composites for each station have been identified, they will be pooled to identify superclusters. The idea here is that if the “large scale” clusters of all stations with a drainage area greater than some threshold cluster into the same cluster, then we have identified an area threshold beyond which essentially there is a common large scale operative mechanism. The information as to the average return period at which this happens can then also be improved, since the cluster is much larger (even if the same events are being co-classified). Similarly, the reliability of the ocean/atmosphere composites associated with each cluster can be improved by superclustering.

9. Review the composition of the superclusters to see if high return period events for all stations are classified into the same clusters, or if the high events below some drainage area threshold are clustered with the “local” cluster for the larger basin. This analysis will then allow us to identify the threshold area A*, and the corresponding probability of exceedance p*, beyond which large scale fluxes dominate the flood process, if such a separation is feasible.