My favorites | Sign in
Project Logo
          
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
Cascading Change Log

0.8.1

Fixed bug where c.t.Lfs did not force local mode for current MapReduce step.

Fixed bug where writing to a c.t.TupleCollector would fail if using a c.s.SequenceFile in some cases.

Added a few minor improvements to reduce stray object creations, and speedup c.t.Tuple serialization.

0.8.0

Updated c.o.x.TagSoupParser to accept 'features', use these features to recover past behaviors.

Updated janino and tagsoup libraries to 2.5.15 and 1.2, respectively. Note that tagsoup, in theory, is not
backwards compatible by default. See their release notes: http://home.ccil.org/~cowan/XML/tagsoup/#1.2

Added some forward compatible changes for supporting Hadoop 0.18 at the API level. Currently there are other
issues preventing some tests from passing on Hadoop 0.18.

Changed c.f.FlowException to return the parent c.f.Flow name.

Changed behavior of c.f.MultiMapReducePlanner to use c.t.h.MultiInputFormat to allow single Mappers
to support many different Hadoop InputFormat types simultaneously. This deprecates the need to normalize
sources to a map and reduces the number of jobs in a c.f.Flow in some cases.

Changed behavior of Cascading to allow for multiple paths from the same c.t.Tap source to be co-grouped on
via c.p.CoGroup. This allows for a kind of self-join where each stream is processed by a different operation
path within the Mapper.

Added c.o.f.And, c.o.f.Or, c.o.f.Xor, and c.o.f.Not logic operator c.o.Filter implementations. They should be used
to compose more complex filters from existing implementations.

Changed the behavior of c.o.BaseOperation to properly initialize itself if it is a c.o.Filter instance. This
removes the requirement that Filter implementations must set declaredFields to Fields.ALL, as it makes no
sense for a Filter to declare fields.

Added c.f.PlannerException, a subclass of c.f.FlowException, and updated c.f.MultiMapReducePlanner to throw
it on failures. Functionality of writing DOT files has been moved from FlowException to PlannerException.

Added c.o.f.FilterNotNull and c.o.f.FilterNull filter classes.

Changed c.f.MultiMapReducePlanner to fail if it encounters an c.p.Each to c.p.Every chain. In these cases, a
c.p.Group type must be between them.

Deleted c.o.Cut class as it was effectively a duplicate of c.o.Identity.

Changed c.f.MultiMapReducePlanner to fail if a c.p.GroupAssertion is not accompanied by another c.o.Aggregator
operation. This is required so that the GroupAssertion does not change the passing tuple stream if it is planned out.

Changed c.f.MultiMapReducePlanner to no longer insert new c.p.Each( ..., new Identity(), ... ) as a place holder.

Renamed c.p.PipeAssembly to c.p.SubAssembly to better reflect its purpose, which is to encapuslate reusable
pipe assemblies in the same manner as a sub-process or sub-routine. A temporary c.p.PipeAssembly class has been
provided for backwards compatibility.

Fixed bug where c.t.TapCollector would throw an NPE if a custom Tap was not using paths.

Changed behavior of c.f.Flow where if a c.f.FlowListener throws an exception, the Flow instance receiving the
exception will stop (by calling Flow.stop()). Listeners will continue to fire as expected and Flow.complete()
will re-throw the thrown exception (as was the original behavior).

Added ability to set a Cascading specific temporary directory path for use by intermediate taps created
within c.f.Flow instances. Use c.t.Hfs.setTemporaryDirectory() to configure.

Fixed bug where the 'mapred.jar' property was begin stepped on if previously set by the calling application.

Changed c.t.Tap and c.f.Flow to return c.t.TupleIterator and c.t.TupleCollector instead of c.t.TapIterator and
c.t.TapCollector, respectively.

Added c.t.Tap.flowInit( c.f.Flow flow ) to allow a given tap to know what flows it is participating in. It is called
immediately after the Flow instance is initailized.

Fixed bug with nested c.p.PipeAssembly instances where some nested assemblies threw an internal error from
the planner.

Changed c.o.Debug to accept a prefix text string that will be prefixed to every message.

Fixed bug where c.f.MultiMapReducePlanner would fail when normalizing inputs to a group where the inputs
passed through one or more splits.

Fixed bug where c.g.CoGroup silently stepped on input pipes with the same input name.

0.7.1

Fixed bug in c.f.MultiMapReducePlanner where a source used on more than one c.p.Group would cause an internal
error during planning.

Changed c.f.MultiMapReducePlanner to normalize heterogeneous sinks.

Changed c.f.MultiMapReducePlanner to keep a splitting c.p.Each on the previous step, instead of being duplicated
on each branch. If the Each is preceeded by a source c.t.Tap, it will be duplicated across branches to reduce
the number of step in the Flow.

Fixed bug in c.f.MultiMapReducePlanner where too many temp tap instances were being inserted while normalizing
the flow sources.

Changed c.t.Fields to fail if given duplicate field names.

Changed behavior if Hadoop FileInputSplit is not used and property "map.input.file" is not set. If there is one
source, it will returned as the source for the mapper stack, otherwise an exception is thrown. Subsequently joins
and merges of non-file sources is not supported until a discriminator can be passed to the mapper.

Fixed bug in c.t.Tuple where NPE was thrown under certain compareTo operations.

Fixed bug that prevented CoGrouping or Merging on the same source even though it was one or more Groupings away.

0.7.0

Changes project structure, removed 'examples' sub-project.

Updated to support Hadoop 0.17.x. This version is not API compatible with any Hadoop version less than 0.17.0.

Added ability to stop all c.f.Flows executing within a c.c.Cascade instance via the stop() method.

Changed c.f.FlowConnector to only take a Map of properties. These properties are passed downstream to various
subsystems. This removes the Hadoop JobConf constructor, but it still can be passed as a property value. Also
properties will be pushed into a defaul JobConf, bypassing any direct JobConf coupling in applications.

Changed c.f.Flow to automatically register a shutdown hook killing remote jobs on vm exit.

Changed c.f.Flow.stop() to immediately stop all running jobs.

Changed c.o.Operation to an interface and introduced c.o.BaseOperation. This makes creating custom Operation types
more flexible and intuitive. c.o.Filter, c.o.Function, c.o.Aggregator, and c.o.Assertion now extend c.o.Operation.

Added c.p.c.OuterJoin, c.p.c.MixedJoin, c.p.c.LeftJoin, and c.p.c.RightJoin c.p.c.CoGrouper classes. They
compliment the default c.p.c.InnerJoin CoGrouper class.

Added support for passing an intermediateSchemeClass to the underlying planner to be used as the default c.s.Scheme
for intermediate c.t.Tap instances internal to a given c.f.Flow.

Fixed bug where c.p.Group is immediately followed by another c.p.Group (or their sub-classes) and fields could not
be resolved between them.

Added support for c.t.Tap instances implementing c.f.FlowListener. If implemented, they will automatically be
added to the Flow event listeners collection and will receive Flow events.

Fixed case where multiple source c.t.Tap instances return true for the containsFile method. Now verifies only one
Tap contains the file, and fails otherwise.

Changed c.s.TextLine to not set numSinkParts to 1 by default. Now uses the natural number of parts.

Changed MapReduce planner to force an intermediate file between branches with Hadoop incompatible source Taps
on joins/merges. If the taps are compatible (have same Scheme), all branches will be processed in same Mapper
before the c.p.Group.

Added merge capabilities in c.p.GroupBy. This allows multiple input branches to be grouped as if a single stream.

Fixed bug in c.t.TapCollector where writing to a Sequence file threw a NPE.

Added c.f.MapReduceFlow to support custom MapReduce jobs, allowing them to participate in a Cascade job.

0.6.1

Changed thrown c.f.FlowException instances to include cause message.

Fixed bug where empty sink or source map was not detected.

0.6.0

Changed default argument selector for c.p.Every to be Fields.ALL, to be consistent with the default value of c.p.Each.

Added support for assembly traps. If an exception is thrown from inside an c.o.Operation, the offending Tuple
can be saved to a file for later processing, allowing the job to complete.

Added support for stream assertions. STRICT and VALID assertions can be built into a pipe assembly, and optionally
planned out during runtime. Assertions will throw exceptions if they fail.

Changed c.o.a.First, Last, Min, and Max to optionally ignore specified values. Useful if you do not wish
for a 'default' value to be considered first, or last in a set.

Changed c.o.a.Sum to take a Class for coercion of the result value.

Changes c.o.Max and Min to use infinity as initial values so zero is bigger than a really small number
for Max, and zero is smaller than a really big number for Min.

Changed order of JobConf initialization. c.f.FlowStep now is added to the JobConf last in order to catch
all lazily configured values.

Changed compile to include debug info by default.

Fixed bug in c.t.MultiTap where super scheme was not returned if available.

0.5.0

Added skipIfSinkExists property to c.f.Flow. Set to true if the c.c.Cascade should skip the Flow instance even
if the sink is stale and not set to be deleted on initialization.

Fixed bug in c.t.h.HttpFileSystem that URL escaped the ? prefixing the query string.

Fixed bug where a join with duplicate taps was not recognized during job planning. Now an appropriate error
message is displayed, instead of jobs completing with only one instance of the resource stream.

Fixed c.t.h.HttpFileSystem to remember authority information in the url and prefix it when missing.

Changed c.s.TextLine to accept either on or two source fields. If one, only the 'line' value
is sourced from the value, discarding the 'offset' value.

Added c.o.r.RegexSplitGenerator to support splitting single tuple values into multiple tuples based on a regex
delimiter. Includes new tests.

Added c.s.CascadeStats and c.s.FlowStats to provide access to current state and statistics of particular
Cascade, Flow, or the child Flows of a Cascade.

Added ability to sort grouping values with sort argument on c.p.GroupBy. Sorts can be reversed.

Added c.o.e.ExpressionFilter, the c.o.Filter analog to c.o.e.ExpressionFunction.

0.4.1

Fixed path normalization regex in c.u.Util where it munged any path starting with file:///.

0.4.0

Changed c.p.GroupBy default grouping fields to c.t.Fields.ALL from Fields.FIRST. This change provides a simple
way to sort a tuple stream based on the order of the tuple fields.

Changed c.f.FlowConnector to create c.f.Flow instances that will bypass the reducer if no c.p.Group is participating
in the assembly. Previoiusly Group instances were inserted if missing. This allows a chain of c.p.Every instances
to be used to process/filter a tuple stream without the invoking the reducer needlessly (if a sort isn't required).
This change also supports bypassing the default Hadoop OutputCollector in the mapper via the sink c.t.Tap instance.

Changed c.f.FlowStep behavior to run in 'local' mode if either the sink or source tap is a c.t.Lfs instance. This
allows for c.f.Flow instances to run mixed if configured to execute on a particular cluster by default. This behavior
supports complex import/export processes against the HDFS or other supported remote filesystem.

Changed behavior of c.t.Dfs to force use of HDFS. Previously Dfs would default to the local FileSystem
if the job was run in 'local'mode. Now a Dfs instance will cause failures if it cannot connect to a HDFS cluster.
Using c.t.Hfs will provide previous Dfs behavior. Hfs will use the 'default' filesystem if a scheme is not present
in the 'stringPath' (i.e. hdfs://host:port/some/path).

Added c.stats package to allow for collecting statics of Cascades, Flows, and FlowSteps.

Updated c.f.Flow and c.c.Cascade log messages to be easier to follow when executing many flow instances
simultaneously.

Added compression flag to c.s.TextLine. Can now toggle compression (Hadoop style compression) per Tap instance.
This prevents clusters with compression enabled by default to export text files with a .deflate extension.

Added support for bypassing Hadoop OutputCollector via Tap.setUseTapCollector() method. Setting to true will force
Cascading to use the c.t.TapCollector instead. This bypasses bugs in Hadoop with custom FileSystem types. This will
always be true for http(s) and s3tp filesystems when using a c.t.Hfs Tap type (atleast until HADOOP-3021 is resolved).

Added c.t.TupleCollector, complementing c.t.TupleIterator, for directly writing Tuple instances out via a c.t.Tap
instance.

Added c.f.FlowListener so that c.f.Flow instances can fire events on starting, completed, and throwable.

Changed c.t.h.S3HttpFileSystem so it can now create files remotely.

Renamed cascading.spill.threshold to cascading.cogroup.spill.threshold, so there is less a chance of collision.

Made numerous optimizations to improve overall performance. Namely split and merge of key/value tuples to remove
redundancy in the stream between the mapper and reducer.

Changed c.p.Operators to push c.o.Operation results directly through to next operation without intermediate
collection. This should improve pipelining of large result streams and lower runtime memory footprint.

Changed c.c.Cascade so it now runs Flows in parallel if Hadoop is clustered, and there are no dependencies between the
Flows.

Moved c.Cascade and related classed to c.cascade package. Wanted to preempt any future ugliness.

Added support in c.t.h.S3HttpFileSystem for these properties: fs.s3tp.awsAccessKeyId and fs.s3tp.awsSecretAccessKey

0.3.0

Added ability to push Log4j logger properties to mapper/reducer via JobConf.
Use jobConf.set("log4j.logger","logger1=LEVEL,logger2=LEVEL")

Added missing equals() and hashCode() in c.t.MultiTap.

Added c.t.h.ZipInputFormat (and ZipSplit) to support zip files. c.s.TextLine supports transparent
reading of zip files if the filename ends with .zip, but cannot write to them. This code is
loosely based on HADOOP-1824. If the underlying filesystem is hdfs or file, splits will be created
for each ZipEntry. Otherwise ZipEntries are iterated over to be more stream friendly. Progress status is
supported.

Added http, https, and s3tp read-only file systems to Hadoop. Use these URLs, respectively:
http://, https://, and s3tp://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@bucket-name/key

Added c.o.t.DateFormatter supporting text formatting of time stamps created by c.o.t.DateParser.

Fixed bug where in complex assemblies, some Scopes were not resolved.

Fixed bug where tap instances were not being inserted before some CoGroup joins if there was a previous Group in the
assembly.

Upgraded JGraphT to 0.7.3

Changed c.t.SpillableTupleList allows for iteration across entries.

Changed c.f.FlowException to optionally allow for printing of underlying pipe graph for debugging.

Added c.o.t.FieldFormatter function to format Tuples into complex strings using j.u.Formatter formatting.

Added c.o.a.Last aggregator to find the last value encountered in a group.

Changed c.o.a.Max and c.o.a.Min to maintain original value type. Will return null if no values are encountered.

Changed c.o.a.First to use Fields.ARG by default. Removed Fields constructor.

Added c.t.Fields.join(Fields...) method to allow for joining multiple Fields instances into a new instance.

Can retrieve Tuple values by field name through the TupleEntry class via the get(String) method.

Added c.t.TupleCollector interface to simplify the operation interfaces.

Added a Debug filter that will print to either stderr or stdout. Useful for debugging stream transformations.

Added CascadingTestCase base test class

Added Insert Function that allows for literal values to be inserted into the Tuple stream.

0.2.0

CoGroup will now spill to disk on extremely large co-groupings. Configurable via "cascading.spill.threshold".
Defaults to 10k elements.

java.util.Properties instances can be used to set defauls for FlowConnectors.

Fix for InnerJoin, the default join for CoGroup.

Introduced MultiTap to support concatenation of files into a pipe assembly.

RegexParser now fails on a failed match. Prevents it being used or behaving as a filter.

Fixed bug with PipeAssembly instances not properly being assimiliated into the pipeGraph.

Fixed assertion error thrown by JGraphT.

Renamed Tap method deleteOnInit to deleteOnSinkInit.


0.1.0

First release.
Show details Hide details

Change log

r456 by cwensel on Sep 13, 2008   Diff
version 0.8.1
Go to: 

Older revisions

r455 by cwensel on Sep 13, 2008   Diff
version 0.8.1
r454 by cwensel on Sep 13, 2008   Diff
updated comments
r453 by cwensel on Sep 13, 2008   Diff
Fixes bug where c.t.Lfs did not force
local mode for current MapReduce step.
All revisions of this file

File info

Size: 17101 bytes, 338 lines
Hosted by Google Code