java - Dealing with an incompatible version change of a serialization framework
Problem description
We have a Hadoop cluster on which we store data serialized to bytes using Kryo (a serialization framework). The Kryo version we used for this has been forked from the official release 2.21 to apply our own patches for issues we have experienced using Kryo. The current Kryo version 2.22 also fixes these issues, but with different solutions. As a result, we cannot just change the Kryo version we use, because this would mean that we would no longer be able to read the data that is already stored on our Hadoop cluster. To address this problem, we want to run a Hadoop job which
- reads the stored data
- deserializes the data stored with the old version of Kryo
- serializes the restored objects with the new version of Kryo
- writes the new serialized representation back to our data store
The problem is that it is not trivial to use two different versions of the same class in one Java program (more precisely, in a Hadoop job's mapper class).
Question in a nutshell
How is it possible to deserialize and serialize an object with two different versions of the same serialization framework in one Hadoop job?
Relevant facts overview
- We have data stored on a Hadoop CDH4 cluster, serialized with Kryo version 2.21.2-ourpatchbranch
- We want to have the data serialized with Kryo version 2.22, which is incompatible with our version
- We build our Hadoop job JARs with Apache Maven
Possible (and impossible) approaches
(1) Renaming packages
The first approach that came to our minds was to rename the packages in our own Kryo branch using the relocation functionality of the Maven Shade plugin, release it with a different artifact ID, and depend on both artifacts in our conversion job project. We would then instantiate one Kryo object of both the old and the new version, and use the old one for deserialization and the new one for serializing the object again.
Problems
We don't use Kryo explicitly in our Hadoop jobs, but rather access it through multiple layers of our own libraries. For each of these libraries, it would be necessary to
- rename the involved packages and
- create a release with a different group or artifact ID
To make things even more messy, we also use Kryo serializers provided by other third-party libraries, for which we would have to do the same thing.
(2) Using multiple class loaders
The second approach we came up with was to not depend on Kryo at all in the Maven project that contains the conversion job, but to load the required classes from a JAR for each version, which is stored in Hadoop's distributed cache. Serializing an object would then look something like this:

    import java.io.ByteArrayOutputStream;
    import java.io.OutputStream;
    import java.lang.reflect.Method;

    // Serializes foo with whichever Kryo version the given class loader provides.
    public byte[] serialize(Object foo, JarClassLoader cl) throws Exception {
        final Class<?> kryoClass = cl.loadClass("com.esotericsoftware.kryo.Kryo");
        Object k = kryoClass.getConstructor().newInstance();
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        final Class<?> outputClass = cl.loadClass("com.esotericsoftware.kryo.io.Output");
        Object output = outputClass.getConstructor(OutputStream.class).newInstance(baos);
        Method writeObject = kryoClass.getMethod("writeObject", outputClass, Object.class);
        writeObject.invoke(k, output, foo);
        outputClass.getMethod("close").invoke(output); // flushes buffered bytes into baos
        baos.close();
        byte[] bytes = baos.toByteArray();
        return bytes;
    }

Problems
Though this approach might work to instantiate an unconfigured Kryo object and serialize / restore an object, we use a much more complex Kryo configuration. This includes several custom serializers, registered class IDs et cetera. For example, we were unable to figure out a way to set custom serializers for classes without getting a NoClassDefFoundError - the following code does not work:

    Class<?> kryoClass = this.loadClass("com.esotericsoftware.kryo.Kryo");
    Object kryo = kryoClass.getConstructor().newInstance();
    Method addDefaultSerializer = kryoClass.getMethod("addDefaultSerializer", Class.class, Class.class);
    addDefaultSerializer.invoke(kryo, URI.class, URISerializer.class); // throws NoClassDefFoundError

The last line throws a java.lang.NoClassDefFoundError: com/esotericsoftware/kryo/Serializer because the URISerializer class references Kryo's Serializer class and tries to load it using its own class loader (which is the system class loader), which does not know the Serializer class.
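A possible way around this specific error is to resolve the serializer class by name through the same class loader that loaded Kryo, instead of referencing URISerializer.class from the job's own code. The following is only a sketch under assumptions: `IsolatedKryoConfig` and `newConfiguredKryo` are hypothetical names, the serializer's JAR is assumed to be visible to `kryoLoader` but not to any of its parents, and it has not been verified against a real Kryo JAR.

```java
import java.lang.reflect.Method;

// Hypothetical helper: configures a Kryo instance entirely inside one class loader.
final class IsolatedKryoConfig {

    /**
     * Creates a Kryo instance from kryoLoader and registers a default serializer
     * whose class is looked up by name in that same loader, so that its reference
     * to Kryo's Serializer class resolves against the same Kryo version.
     */
    static Object newConfiguredKryo(ClassLoader kryoLoader, Class<?> type, String serializerClassName)
            throws ReflectiveOperationException {
        Class<?> kryoClass = kryoLoader.loadClass("com.esotericsoftware.kryo.Kryo");
        Object kryo = kryoClass.getConstructor().newInstance();

        // Load the serializer by name; a class literal would bind it to the caller's loader.
        Class<?> serializerClass = kryoLoader.loadClass(serializerClassName);

        Method addDefaultSerializer =
                kryoClass.getMethod("addDefaultSerializer", Class.class, Class.class);
        addDefaultSerializer.invoke(kryo, type, serializerClass);
        return kryo;
    }
}
```

If the serializer class is also reachable from the system class loader, parent-first delegation would still pick that copy and the original error would most likely return, so the serializer JARs would have to be kept off the job's regular classpath.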
(3) Using an intermediate serialization
Currently, the most promising approach seems to be using an independent intermediate serialization, e.g. JSON using Gson or alike, and then running two separate jobs:
- kryo:2.21.2-ourpatchbranch in our regular store -> JSON in a temporary store
- JSON in the temporary store -> kryo:2.22 in our regular store
Problems
The biggest problem with this solution is the fact that it roughly doubles the space consumption of the data processed. Moreover, we need another serialization method that works without problems on all of our data, which we would need to investigate first.
I would use the multiple classloaders approach.
(Package renaming will also work. It does seem ugly, but this is a one-off hack, so beauty and correctness can take a back seat. Intermediate serialization seems risky - there was a reason you are using Kryo, and that reason will be negated by using a different intermediate form.)
The overall design would be:

    child classloaders:      Old Kryo        New Kryo     <-- both with simple wrappers
                                 \              /
                                  \            /
                                   \          /
                                    \        /
                                        |
    default classloader:     domain model; controller for the re-serialization

- Load the domain object classes in the default classloader.
- Load a JAR with the modified Kryo version and the wrapper code. The wrapper has a static 'main' method with one argument: the name of the file to deserialize. Call the main method via reflection from the default classloader:

    Class<?> deserializer = deserializerClassLoader.loadClass("com.example.deserializer.Main");
    Method mainIn = deserializer.getMethod("main", String.class);
    Object graph = mainIn.invoke(null, "/path/to/input/file");

  This method:
  - deserializes the file into one object graph
  - places the object into a shared space. A ThreadLocal is a simple way, or returning it to the wrapper script (as the call above does).
- When the call returns, load a second JAR with the new serialization framework and a simple wrapper. This wrapper has a static 'main' method that takes the name of the file to serialize into; in the call below it also takes the object graph. Call the main method via reflection from the default classloader:

    Class<?> serializer = serializerClassLoader.loadClass("com.example.serializer.Main");
    Method mainOut = serializer.getMethod("main", Object.class, String.class);
    mainOut.invoke(null, graph, "/path/to/output/file");

  This method:
  - retrieves the object from the ThreadLocal (or takes it as an argument, as in the call above)
  - serializes the object and writes it to the file
Considerations
In the code fragments, one classloader is created for each object serialization and deserialization. You probably want to load the classloaders only once, discover the main methods, and loop over the files, something like:

    for (String file : files) {
        Object graph = mainIn.invoke(null, file + ".in");
        mainOut.invoke(null, graph, file + ".out");
    }

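The one-time setup around that loop could be sketched roughly as follows. This is an illustration under assumptions, not a tested implementation: `ConversionLoop`, `loaderFor`, `findMain` and `convertAll` are hypothetical names, the wrapper class names are the placeholder ones used above, and the two JAR paths are simply taken from the command line. Each child loader gets the default classloader as its parent so that both Kryo versions see the same domain classes.

```java
import java.io.File;
import java.lang.reflect.Method;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.Arrays;
import java.util.List;

// Hypothetical controller living in the default classloader.
final class ConversionLoop {

    /** Builds a child loader for one wrapper JAR; the parent supplies the shared domain classes. */
    static URLClassLoader loaderFor(String jarPath, ClassLoader parent) {
        try {
            URL jarUrl = new File(jarPath).toURI().toURL();
            return new URLClassLoader(new URL[] { jarUrl }, parent);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a usable JAR path: " + jarPath, e);
        }
    }

    /** Looks up a wrapper's static 'main' method once so it can be reused for every file. */
    static Method findMain(ClassLoader loader, String className, Class<?>... parameterTypes)
            throws ReflectiveOperationException {
        return loader.loadClass(className).getMethod("main", parameterTypes);
    }

    /** Runs the deserialize/serialize round trip for every file name. */
    static void convertAll(Method mainIn, Method mainOut, List<String> files)
            throws ReflectiveOperationException {
        for (String file : files) {
            Object graph = mainIn.invoke(null, file + ".in");
            mainOut.invoke(null, graph, file + ".out");
        }
    }

    // Usage: <old-kryo-wrapper.jar> <new-kryo-wrapper.jar> <file>...
    public static void main(String[] args) throws Exception {
        ClassLoader parent = ConversionLoop.class.getClassLoader();
        ClassLoader oldKryo = loaderFor(args[0], parent);
        ClassLoader newKryo = loaderFor(args[1], parent);

        Method mainIn = findMain(oldKryo, "com.example.deserializer.Main", String.class);
        Method mainOut = findMain(newKryo, "com.example.serializer.Main", Object.class, String.class);

        convertAll(mainIn, mainOut, Arrays.asList(args).subList(2, args.length));
    }
}
```

In a Hadoop job the same setup would presumably happen once per mapper rather than in a `main` method, with the JAR paths coming from the distributed cache.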
Do the domain objects have a reference to any Kryo class? If so, you have difficulties:
- If the reference is just a class reference, e.g. to call a method, then the first use of the class will load one of the two Kryo versions into the default classloader. This will probably cause problems, as part of the serialization or deserialization might be performed by the wrong version of Kryo.
- If the reference is used to instantiate any Kryo objects and store the reference in the domain model (class or instance members), then Kryo will actually be serializing part of itself in the model. This may be a deal-breaker for this approach.
In either case, your first approach should be to examine these references and eliminate them. One way to ensure that you have done this is to ensure the default classloader does not have access to any Kryo version. If the domain objects reference Kryo in any way, the reference will fail (with a NoClassDefFoundError if the class is referenced directly, or a ClassNotFoundException if reflection is used).
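A quick guard for this can be placed at the start of the conversion job. The sketch below is a minimal example with hypothetical names (`KryoVisibilityCheck`, `isVisible`, `requireKryoHidden`); it only probes for the class `com.esotericsoftware.kryo.Kryo` and assumes that finding this one class is a good enough indicator that a Kryo JAR leaked onto the default classpath.

```java
// Hypothetical guard: fails fast if the default classloader can see Kryo.
final class KryoVisibilityCheck {

    /** Returns true if the loader (or one of its parents) can load the named class. */
    static boolean isVisible(String className, ClassLoader loader) {
        try {
            // 'false' avoids running static initializers; we only care about visibility.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Throws if Kryo is reachable from the given (default) classloader. */
    static void requireKryoHidden(ClassLoader defaultLoader) {
        if (isVisible("com.esotericsoftware.kryo.Kryo", defaultLoader)) {
            throw new IllegalStateException(
                    "Kryo is visible to the default classloader; remove it from the job's classpath");
        }
    }

    public static void main(String[] args) {
        requireKryoHidden(KryoVisibilityCheck.class.getClassLoader());
        System.out.println("No Kryo version is visible to the default classloader.");
    }
}
```

This check only shows that the classpath is clean; references from domain classes to Kryo would still surface later as the errors described above, when those classes are first used.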