Côtehaus is a house in the coast. In the land of hopes, Côtehaus is to become a Java 7 subset syntax directed Dalvik bytecode translator. Also known as a compiler.
It is released dually under The MIT License and The GNU General Public License version 3, with the exception of file src/main/types.h
which includes copied code governed by a separate license as pointed out from that particular file. For the general case, see LICENSE file.
If you have a Debian system, a Makefile
is in Côtehaus to automate the install of tools.
So you run make debian_install_zlib
and
make debian_install_polarssl
.
If you are wandering the Cygwin ramblings, you have to install yourself the packages gcc-core
, libgcc1
, mingw64-x86_64-gcc-core
(that will provide the command mingw64-x86_64-gcc
), mingw64-i686-gcc-core
(that will provide the command mingw64-i686-gcc
), libiconv-devel
, bison
and make
. Check the installed packages by running cygcheck iconv
, cygcheck gcc
, gcc -v
and
x86_64-w64-mingw32-gcc.exe -v
.
If your choice is neither Debian nor Cygwin, theese instructions may turn out a bit lacking, you're unwound at luck and creativity.
Whichever way you go, get done the job of installing zlib
and polarssl
. Read the Makefile
if you need to know more about them.
Côtehaus may be seen as divided in phases: input, processing and output.
Being at input phase means to take an input file and while seeing it make records of things useful to recognize about that input file. The input file is a Java program, as in this example:
class A {
int add(int a, int b) {
return a+b;
}
}
The prototype Iadd(II)
present in the input file (derived from int add(int a, int b)
) is to Côtehaus important to record during the input phase.
After the input phase finishes, Côtehaus has a lot of records derived from things seen in the input files, like Iadd(II)
and many others.
Then the processing phase begins, and the goal of this phase is to set the records in a defined order. Let's take this example: add(II)
and Isubstract(II)
. They are 2 prototype records that might be produced but in an unexpected order, let's say Isubstract(II)
is produced first and Iadd(II)
is later. Côtehaus knows that the expected order is right the invert of the produced order. This time we let ourselves trust Côtehaus to determine the expected order, because from a point of view, such a job is what a compiler is to make for us. During the processing phase Iadd(II)
is set first and Isubstract(II)
later.
All the records are persisted to an output file at the output phase. Sending messages, respecting conventions, using bitpatterns, informing how much size a prototype's array occupies, are among things that describe the output phase and have the output file done.
The way how the code is written, there are naming customs used for each one of the mentioned phases. For example add_...
for the input phase, build_o...
for the processing phase and pack_...
for the output phase.
The name of the item belongs on the Dalvik's format specification, string_id_item
and proto_id_item
are examples of items defined in that specification. These names might be further shortened in Côtehaus, as in str_id
and proto_id
, that should be easy to guess which is which one.
Figure out which item is handled on which phase by which piece of source code is explained next. Take a name stub from the shown above, like add_...
for the input phase, then another name stub like proto_id
. Put them together, and if you see a function named add_proto_id
somewhere arround the source code, you'll know that function handle the input of a proto_id_item
, of course of the input phase.
Exercise
From the table below.
Phase | item |
---|---|
add_... |
str_id |
build_o... |
proto_id |
pack... |
class_def |
Tell what should be the functions that:
string_id_item
during the output phase.proto_id_item
during the processing phase.Answer
pack_str_id
handles a string_id_item
during the output phase.build_oproto_id
handles a proto_id_item
during the processing phase.To follow the example of proto_id
used above, a proto_id_item
may be seen as defined in theese terms: tok = elements of (ShortyDescriptor_tok, TypeDescriptor_tok, TypeDescriptor_tok, TypeDescriptor_tok)
. Said so, a proto_id
is defined in terms of all tokens (tok
). Each token is then recorded as a string with a call to add_str_id
.
The records placed by add_str_id
are chained in a list without any expected order. As many times add_str_id
is called, a countet increments to carry the quantity of records placed into that chained list. Exercise: look at the code of add_str_id
and try to tell the name of the list and the name of the counter.
At the input phase, the file java7.y
encodes information about what to do with a Java program input file. It leverages a subset of the Java language at its 7th version. You need a tool called Bison that puts java7.y
into terms that Côtehaus doesn't understand but computer machines you use do, a fact just fulfilling to Côtehaus. So, java7.y
encodes how to distinguish the several parts of a Java program. If speaking about a prototype as one of that many parts, java7.y
associates a prototype with calls to add_proto_id
, which in turn makes calls to add_str_id
, putting the records in place, into either chained lists, one for prototypes, the other for strings.
Let's see an add_proto_id
function depict:
struct proto_id_st *add_proto_id(/*things relative to a proto_id*/) {
struct proto_id_st *proto_id = /*mallocs to a proto_id*/;
/*set proto_id relative things, making malloc or using a function parameter directly:
proto_id->a_member = assign a function parameter as is;
proto_id->another_member = malloc based on another function parameter (apply some logic);
etc.
*/
/* Chain the new proto_id object i to a list with all of the rest of proto_id's.
"_list" is by custom part of the chain name. This is so to proto_id as to any item type.
"_head" is by custom part of the name of the chain head for each item type
*/
list_add(&proto_id->proto_id_list, &proto_id_head);
// count the items in the chain
proto_id_list_size++;
return proto_id;
}
At the processing phase the ordering (sort) of records happen, taking the records somewhat closer to how Dalvik's format expects, that is, the expected order.
There are upstream functions doing the job, like build_oproto_id
. Functions like that lean on downstream functions that make the bottom half of the job, as is the case of oproto_id_st_compar
. Arrays gain a place in processing phase, while the records they hold are those that chained lists used to in the input phase. Arrays are to work in compass with ..._compar
functions, together they accomplish the goal of sorting the records.
ostr_id_ary
is an ordered array of strings. o
is for ordered, _ary
is for array, just name customs. str_id
is the part of the name shortened from Dalvik's format spec. _idx
is an atribute where an ascending numerical value is assigned to each record, considering that if more than one str_id
has as its string target one with the same string value as another one, only one is taken to represent all of them (an equivalence class representative) and only such one is assigned an _idx
. Because this _idx
is assigned over the sorted array, then this _idx
conforms to what Dalvik's format expects about array indexes.
Let's go ahead from an example based on a same one used before.
class A {
int add(int a, int b) {
return a+b;
}
int substract(int a, int b) {
return a-b;
}
}
The break down of a prototype can be reasoned as a prototype has a shorty description, a return type, and an array of parameter types. For the case of int add(int a, int b)
-derived prototype, let's allow the conceeded wisdom that its shorty description is I(II)
. For int substract(int a, int b)
it is the same at all, I(II)
, because it has the same return type and the same parameter number and types.
The break down of a prototype to the deepest level is just the break down of each of its parts. So, a return type is indeed a type, an array of parameter types is an array of which each element is a type, and a type points at a string_id which in turn points at a string data to hold the string representing that type's descriptor. By the way, this way of reasoning can be refered to as recursive. In the chained list of string data, we have a boilerplate of I
's records produced at the input phase.
After records are ordered all I
are set contiguous. Then the boilerplate is eliminated by taking just one I
from the string data array as the representative for an equivalence class for all the I
's. This process happens at the processing phase and transits the whole break down. An equivalence class representative is choosen from the string ids, then from the types, and widening even beyond the extent, I(II)
is the same for add
and substract
so from the array of shorty descriptions, just one is choosen as representative to be emitted later at output phase. Instead of each record pointing at its own copy, all point at the position (or index in Dalvik's speak and in truth in common speak), of the only one copy that represents all of them.
In a spatial referential system as the .dex file is, items have an inherent address, a numerical value of the item's distance to a zero point or the beginin of the referential system (the whole message or the output file's beginin). If an array has a fixed length to all of its items, then each item's address can be encoded in terms of the item position or index, assigned to in an item's attribute which name contains _idx
in Côtehaus's source code. If the estipulation (expectation) about an array is to have variable length items, then Côtehaus tracks the address distance each item adds to the next one, called offset or relative_offset
in the source code. Probably in accordance with this reasoning, the Dalvik format requests the array of string ids are index based, whilst the array of list type arrays are offset based, just to take 2 random cases.
Items (records) in otype_array
(the array of list type arrays) have each of them a fixed length and an uniformly variable length (varies its length for it has more or less items but uniformly for each item is fixed sized). relative_offset
is used as for this case as by name custom for all offset based arrays, and each item is assigned a value for its relative_offset
at processing phase counting the offset and the overall length (fixed+variable) of the previous item over a sorted, equivalence-class-represented array, so that equivalent list type array items are neither counted nor emitted at output phase.
The program at the output phase carries with emitting an external representation in compliance to Dalvik's format. Right here, the the fitting of the datas in a spatial referential system overwhelms the data's meaning itself. after that fact, library calls are used that invert the relation format-meaning, turning the format into the meaning. This is just like you swap forefront and background on how you see a picture. To name one, a library call for such a purpose is on the function htole32
, standing for host representation to little endian of 32 bits. Library calls like that are made from the pack_...
functions.
Alignmemt is primarily performed at output phase, though distance between arrays to be aligned is based on numbers computed as records are being blown, during previous phases. up_bounds_ary
has something to do with that along processing and output phases.
Sending messages to the output, whilst an aligment unaware version would direct to emit the bits for the next array start, with alignment on board some bits are emitted without meaning but pushing the next array to a position that satisfies a condition about its distance from the start of the message (the beginin of the output file). That condition is requested by the Dalvik's format spec. Those meaningless bits are called padding.
The padding length has this formula:
padding_length=(next_item_alignment-last_up_bound_address%next_item_alignment)%next_item_alignment
last_up_bound_address%next_item_alignment
tells the upper items' highest address excess respecting as how much it was to fall exactly aligned to next item alignment.(next_item_alignment-last_up_bound_address%next_item_alignment)%next_item_alignment
, the substraction takes the excess' complement, it gives how much length to make the next padding, except if there was no excess, in which case last reminder operation (that outside the parenthesis) takes on fixing up that.When a set of equivalence-class items is equivalence-class-represented by any one of them, not only Côtehaus emitts just that one, but also all the references, either by index or by offset, are fixed at that one, because the choosen representant can be fully taken as the new copy to refer to, from everywhere. After that kind of factorization (boilerplate elimination) the break down of prototype id is: one copy of the many I(II)
(int add(int a, int b)
, int substract(int a, int b)
et. al.) exists and points at (refer to) a return type (indeed a type) by index. Also it points at an array of types by offset, each of which points at a type, the same copy that that one return type. The return type points at a copy of string id by index that in turn points at a copy of string data for the type descriptor I
, and no other copies exists in the output file as for string data as string id representing the I
type.
Vaguely speaking, pairs of source/bytecode that Côtehaus should be able to transform, from source to bytecode, are like the listed next. The bytecode listed next might be .apk wrapped, so the reader is meant to take into account only the .dex wrapped inside. Especially, ignoring the .apk PGP signature which is not in scope (don't get it wrong with the .dex header's MD5 hash, this one yes, is in scope).