A Robust Class of Data Languages and an Application to Learning

We introduce session automata, an automata model to process data words, i.e., words over an infinite alphabet. Session automata support the notion of fresh data values, which are well suited for modeling protocols in which sessions using fresh values are of major interest, like in security protocols or ad-hoc networks. Session automata have an expressiveness partly extending, partly reducing that of classical register automata. We show that, unlike register automata and their various extensions, session automata are robust: They (i) are closed under intersection, union, and (resource-sensitive) complementation, (ii) admit a symbolic regular representation, (iii) have a decidable inclusion problem (unlike register automata), and (iv) enjoy logical characterizations. Using these results, we establish a learning algorithm to infer session automata through membership and equivalence queries.


Introduction
The study of automata over data words, i.e., words over an infinite alphabet, has its origins in the seminal work by Kaminski and Francez [21]. Their finite-memory automata (more commonly called register automata) equip finite-state machines with registers in which data values (from the infinite alphabet) can be stored and reused later. Register automata preserve some of the good properties of finite automata: they have a decidable emptiness problem and are closed under union and intersection. On the other hand, register automata are neither determinizable nor closed under complementation, and they have an undecidable equivalence/inclusion problem. There are actually several variants of register automata, which all have the same expressive power but differ in the complexity of decision problems [14,5]. In the sequel, many more automata models have been introduced (not necessarily with registers), aiming at a good balance between expressivity, decidability, and closure properties [29,14,23,7,17,16]. Some of those models extend register automata, inheriting their drawbacks such as undecidability of the equivalence problem.
We will follow the work on register automata and study a model that supports the notion of freshness. When reading a data value, it may enforce that the data value is fresh, i.e., it has not occurred in the whole history of the run. This feature has been proposed in [33] to model computation with names in the context of programming-language semantics. Actually, fresh names are needed to model object creation in object-oriented languages, and they are important ingredients in modeling security protocols, which often make use of so-called fresh nonces to achieve their security assertions [24]. Fresh names are also crucial in the field of network protocols, and they are one of the key features of the π-calculus [28]. Like ordinary register automata, fresh-register automata preserve some of the good properties of finite automata. However, they are not closed under complement and also come with an undecidable equivalence problem.
In this paper, we propose session automata, a robust automata model over data words. Like register automata, session automata are a syntactical restriction of fresh-register automata, but in an orthogonal way. Register automata drop the feature of checking global freshness (referring to the whole history) while keeping a local variant (referring to the registers). Session automata, on the other hand, discard local freshness while keeping the global one. Session automata are well-suited whenever fresh values are important for a finite period, during which they are stored in one of the registers. They correspond to the model from [8] without stacks.
Not surprisingly, we will show that session automata and register automata describe incomparable classes of languages of data words, whereas both are strictly weaker than fresh-register automata. Contrary to the finite-state unification-based automata introduced in [22], session automata (like fresh-register automata) do not have the capability to reset the content of a register. However, they can test global freshness, which the model of [22] cannot. The variable automata from [16] do not employ registers, but rather use bound and free variables. Still, variable automata are close to our model: they use a finite set of bound variables to track the occurrences of some data values, and a single free variable for all other data values (which must be different from the data values tracked by bound variables). Contrary to our model, variable automata cannot test for global freshness; conversely, unlike variable automata, session automata cannot recognize the language of all data words.
In this paper, we show that session automata (i) are closed under intersection, union, and resource-sensitive complementation, (ii) have a unique canonical form (analogous to minimal deterministic finite automata), (iii) have a decidable equivalence/inclusion problem, and (iv) enjoy logical characterizations. Altogether, this provides a versatile framework for languages over infinite alphabets.
In a second part of the paper, we present an application of our automata model in the area of learning, where decidability of the equivalence problem is crucial. Learning automata deals with the inference of automata based on some partial information, for example samples, i.e., words that either belong to the accepted language or not. A popular framework is that of active learning as defined by Angluin [2], in which a learner may consult a teacher for so-called membership and equivalence queries to eventually infer the automaton in question. Learning automata has many applications in computer science. Notable examples are its use in model checking [15] and testing [3]. See [26] for an overview.
While active learning of regular languages is meanwhile well understood and is supported by freely available libraries such as LearnLib [27] and libalf [10], extensions beyond plain regular languages are still an area of active research. Recently, automata dealing with potentially infinite data as basic objects have been studied. Seminal works in this area are those of [1,20] and [19]. While the former use abstraction and refinement techniques to cope with infinite data, the latter learns a subclass of register automata. Note that session automata are incomparable with the model from [19]. Thanks to their closure and decidability properties, a conservative extension of Angluin's classical algorithm will do for their automatic inference.
Outline. The paper is structured as follows. In Section 2 we introduce session automata. Section 3 presents the main tool allowing us to establish the results of this paper, namely the use of data words in symbolic normal form and the construction of a canonical session automaton. The section also presents some closure properties of session automata and the decidability of the equivalence problem. Section 4 gives logical characterizations of our model. In Section 5, we present an active learning algorithm for session automata. This paper is an extended version of [9].

Data Words and Session Automata
We let N be the set of natural numbers and N>0 be the set of non-zero natural numbers. In the following, we fix a non-empty finite alphabet Σ of labels and an infinite set D of data values. In examples, we usually use D = N. A data word over Σ and D is a sequence w = (a_1, d_1)(a_2, d_2) · · · (a_n, d_n) of pairs (a_i, d_i) ∈ Σ × D; in other words, w is an element from (Σ × D)*. For d ∈ {d_1, . . ., d_n}, we let first_w(d) denote the position j ∈ {1, . . ., n} where d occurs for the first time, i.e., such that d_j = d and there is no k < j such that d_k = d. Accordingly, we define last_w(d) to be the last position where d occurs.
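Computing first_w and last_w amounts to a single scan over the word. The following sketch (function names are ours, not from the paper) represents a data word as a list of (label, value) pairs and uses 1-based positions, as in the text.

```python
# Sketch: first_w(d) and last_w(d) for a data word given as a list of
# (label, data value) pairs. Positions are 1-based, as in the text.

def first_pos(w, d):
    """First position where data value d occurs in w."""
    for j, (_, dj) in enumerate(w, start=1):
        if dj == d:
            return j
    raise ValueError("data value does not occur in w")

def last_pos(w, d):
    """Last position where data value d occurs in w."""
    last = None
    for j, (_, dj) in enumerate(w, start=1):
        if dj == d:
            last = j
    if last is None:
        raise ValueError("data value does not occur in w")
    return last
```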
This section recalls two existing automata models over data words, namely register automata, previously introduced in [21], and fresh-register automata, introduced in [33] as a generalization of register automata. Moreover, we introduce the new model of session automata, our main object of interest.
Register automata (initially called finite-memory automata) equip finite-state machines with registers in which data values can be stored and read out later. Fresh-register automata additionally come with an oracle that can determine if a data value is fresh, i.e., has not occurred in the history of a run. Both register and fresh-register automata are closed under union and intersection, and they have a decidable emptiness problem. However, they are not closed under complementation, and their equivalence problem is undecidable, which limits their application in areas such as model checking and automata learning. Session automata, on the other hand, are closed under (resource-sensitive) complementation, and they have a decidable inclusion/equivalence problem.
Given a set R, we let R⊛ = {r⊛ | r ∈ R}, R• = {r• | r ∈ R}, and R↑ = {r↑ | r ∈ R}. In the automata models that we are going to introduce, R will be the set of registers.
Transitions will be labeled with an element from R⊛ ∪ R• ∪ R↑, which determines a register and the operation that is performed on it. More precisely, r⊛ writes a globally fresh value into r, r• writes a locally fresh value into r, and r↑ uses the value that is currently stored in r.
Definition 2.1 (Fresh-Register Automaton, cf. [33]). A fresh-register automaton (over Σ and D) is a tuple A = (S, R, ι, F, ∆) where • S is the non-empty finite set of states, • R is the non-empty finite set of registers, • ι ∈ S is the initial state, • F ⊆ S is the set of final states, and • ∆ is a finite set of transitions: each transition is a tuple of the form (s, (a, π), s′) where s, s′ ∈ S are the source and target state, respectively, a ∈ Σ, and π ∈ R⊛ ∪ R• ∪ R↑. We call (a, π) the transition label.
For a transition (s, (a, π), s′) ∈ ∆, we also write s −(a,π)→ s′. When taking this transition, the automaton moves from state s to state s′ and reads a symbol (a, d) ∈ Σ × D. If π = r↑ ∈ R↑, then d is the data value that is currently stored in register r. If π = r⊛ ∈ R⊛, then d is some globally fresh data value, which has not been read in the whole history of the run; d is then written into register r. Finally, if π = r• ∈ R•, then d is some locally fresh data value, which is currently not stored in the registers; it will henceforth be stored in register r.
Let us formally define the semantics of A. A configuration is a triple γ = (s, τ, U) where s ∈ S is the current state, τ : R ⇀ D is a partial mapping encoding the current register assignment, and U ⊆ D is the set of data values that have been used so far. By dom(τ), we denote the set of registers r such that τ(r) is defined. Moreover, τ(R) denotes the set {τ(r) | r ∈ dom(τ)}. We say that γ is final if s ∈ F. As usual, we define a transition relation over configurations and let (s, τ, U) =(a,d)⇒ (s′, τ′, U ∪ {d}), where (a, d) ∈ Σ × D, if there is a transition s −(a,π)→ s′ such that the following conditions hold: (1) if π = r↑, then r ∈ dom(τ), d = τ(r), and τ′ = τ; (2) if π = r•, then d ∉ τ(R) and τ′ = τ[r ↦ d]; (3) if π = r⊛, then d ∉ U and τ′ = τ[r ↦ d]. A run of A on a data word (a_1, d_1) · · · (a_n, d_n) is a sequence γ_0 =(a_1,d_1)⇒ γ_1 =(a_2,d_2)⇒ · · · =(a_n,d_n)⇒ γ_n for suitable configurations γ_0, . . ., γ_n with γ_0 = (ι, ∅, ∅) (here the partial mapping ∅ represents the mapping with empty domain). The run is accepting if γ_n is a final configuration. The language L(A) ⊆ (Σ × D)* of A is then defined as the set of data words for which there is an accepting run.
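These conditions translate directly into an executable sketch of one semantic step. The encoding below is ours, not from the paper: an operation is a pair (kind, r) with kind ∈ {"global", "local", "read"} standing for r⊛, r• and r↑, and a configuration is a triple (state, tau, used).

```python
# Sketch of one semantic step of a fresh-register automaton.
# A transition is (source, label, (kind, register), target); a configuration
# is (state, tau, used) with tau a dict (partial map from registers to data
# values) and used the set of data values read so far.

def step(config, transition, symbol):
    """Return the successor configuration, or None if the symbol cannot be read."""
    state, tau, used = config
    src, lab, (kind, r), tgt = transition
    a, d = symbol
    if src != state or lab != a:
        return None
    if kind == "read":                       # r↑: d must be the value stored in r
        if tau.get(r) != d:
            return None
        new_tau = dict(tau)
    elif kind == "local":                    # r•: d must not be in any register
        if d in tau.values():
            return None
        new_tau = dict(tau); new_tau[r] = d
    else:                                    # r⊛: d must be globally fresh
        if d in used:
            return None
        new_tau = dict(tau); new_tau[r] = d
    return (tgt, new_tau, used | {d})
```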
Note that fresh-register automata cannot distinguish between data words that are equivalent up to permutation of data values. More precisely, given data words w = (a_1, d_1) · · · (a_n, d_n) and w′ = (a′_1, d′_1) · · · (a′_n, d′_n), we let w ≈ w′ if a_i = a′_i for all i ∈ {1, . . ., n} and, for all i, j ∈ {1, . . ., n}, we have d_i = d_j iff d′_i = d′_j. For instance, (a, 4)(b, 2)(b, 4) ≈ (a, 2)(b, 5)(b, 2). In the following, the equivalence class of a data word w wrt. ≈ is written [w]≈. We call L ⊆ (Σ × D)* a data language if, for all w, w′ ∈ (Σ × D)* such that w ≈ w′, we have w ∈ L if, and only if, w′ ∈ L. In particular, L(A) is a data language for every fresh-register automaton A.
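The relation ≈ can be decided by renaming each data value to the index of its first occurrence: two data words are ≈-equivalent iff their renamings coincide. A minimal sketch (helper names are ours):

```python
# Sketch: deciding w ≈ w' by canonical renaming of data values.

def canonical(w):
    """Rename data values by order of first appearance; labels are kept."""
    ids = {}
    out = []
    for a, d in w:
        if d not in ids:
            ids[d] = len(ids)
        out.append((a, ids[d]))
    return out

def equivalent(w1, w2):
    """w1 ≈ w2 iff their canonical renamings are equal."""
    return canonical(w1) == canonical(w2)
```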
We obtain natural subclasses of fresh-register automata when we restrict the transition labels (a, π) ∈ Σ × (R⊛ ∪ R• ∪ R↑) in the transitions. Definition 2.2 (Register Automaton, [21]). A register automaton is a fresh-register automaton where every transition label is from Σ × (R• ∪ R↑).
Like register automata, session automata are a syntactical restriction of fresh-register automata, but in an orthogonal way.Instead of local freshness, they include the feature of global freshness.

Definition 2.3 (Session Automaton).
A session automaton is a fresh-register automaton where every transition label is from Σ × (R⊛ ∪ R↑).
We first compare the three models of automata introduced above in terms of expressive power.
Example 2.4. Consider the set of labels Σ = {req, ack} and the set of data values D = N, representing an infinite supply of process identifiers (pids). We model a simple (sequential) system where processes can approach a server and make a request, indicated by req, and where the server can acknowledge these requests, indicated by ack. More precisely, (req, p) ∈ Σ × D means that the process with pid p performs a request, which is acknowledged when the system executes (ack, p).
Figure 1(a) depicts a register automaton that recognizes the language L_1 of data words verifying the following conditions: • there are at most two open requests at a time; • a process waits for an acknowledgment before making another request; • every acknowledgment is preceded by a request; • requests are acknowledged in the order they are received. In the figure, an edge label of the form (req, r_i• ∨ r_i↑) denotes that there are two transitions, one labeled with (req, r_i•) and one labeled with (req, r_i↑). Whereas a transition labeled with (req, r_i•) is taken when the current data value does not currently appear in the registers (but could have appeared before in the data word) and stores it in r_i, a transition labeled with (req, r_i↑) simply checks that the current data value is stored in register r_i. The automaton models a server that can store two requests at a time and will acknowledge them in the order they are received. For example, it accepts (req, 8)(req, 4)(ack, 8)(req, 3)(ack, 4)(req, 8)(ack, 3)(ack, 8).
When we want to guarantee that, in addition, every process makes at most one request, we need the global freshness operator. Figure 1(b) hence depicts a session automaton recognizing the language L_2 of all the data words of L_1 in which every process makes at most one request. Notice that the transition from s_0 to s_1 is now labeled with (req, r_1⊛), so that this transition can only be taken in case the current data value has never been seen before. We obtain A_2 from A_1 by replacing every occurrence of r_i• ∨ r_i↑ with r_i⊛. While (req, 8)(req, 4)(ack, 8)(req, 3)(ack, 4)(req, 8)(ack, 3)(ack, 8) is no longer contained in L_2, (req, 8)(req, 4)(ack, 8)(req, 3)(ack, 4)(ack, 3) is still accepted.
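Membership in L_2 can also be checked directly on a data word. The sketch below is ours, not the automaton of the figure; it assumes, as in the accepted examples above, that a word is accepted only once every request has been acknowledged, and a FIFO queue of capacity 2 plays the role of the two registers.

```python
from collections import deque

# Sketch: membership test for L_2 -- at most 2 open requests, every pid
# requests at most once, acks in FIFO order, no ack without request.
# Assumption (ours): the word must end with all requests acknowledged.

def in_L2(w, capacity=2):
    pending = deque()          # open requests, in arrival order
    seen = set()               # pids that have already made a request
    for action, pid in w:
        if action == "req":
            if pid in seen or len(pending) == capacity:
                return False   # pid not globally fresh, or no free register
            seen.add(pid)
            pending.append(pid)
        elif action == "ack":
            if not pending or pending[0] != pid:
                return False   # acks must follow the request order
            pending.popleft()
        else:
            return False
    return len(pending) == 0
```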
As a last example, consider the language L_3 of data words in which every process makes at most one request (without any other condition). A fresh-register automaton recognizing it is given in Figure 2.
Proposition 2.5. Register automata and session automata are incomparable in terms of expressive power. Moreover, fresh-register automata are strictly more expressive than both register automata and session automata.
Proof.We use the languages L 1 , L 2 , and L 3 defined in Example 2.4 to separate the different automata models.
First, the language L_1, recognizable by a register automaton, is not recognized by any session automaton. Indeed, denoting by w_d the data word (req, d)(ack, d), no session automaton using k registers can accept the data word w_1 w_2 · · · w_k w_{k+1} w_1 w_2 · · · w_k ∈ L_1. Intuitively, the session automaton must store all k + 1 data values of the requests in order to check the acknowledgments, and cannot discard any of the k first data values to store the (k + 1)-th, since all of them have to be reused afterwards (and at that time they are not globally fresh anymore). More precisely, after reading w_1 w_2 · · · w_k, the configuration must be of the form (s, τ, {1, 2, . . ., k}) with τ being a permutation of {1, . . ., k}. Reading w_{k+1}, with fresh data value k + 1, must then replace the content of a register with k + 1. Suppose it is register j. Then, when reading the second occurrence of w_j, data value j is not globally fresh anymore, yet it is not stored anymore in the registers, which does not allow us to accept this data word.
Then, the language L_2, recognizable by a session automaton, is not recognizable by a register automaton, for the same reasons as already developed in Proposition 5 of [21]. Intuitively, the automaton would need to register every data value encountered, since it has to ensure the freshness of every pid.
Finally, language L_3, recognized by a fresh-register automaton, is recognized neither by a register automaton (see again Proposition 5 of [21]) nor by any session automaton. Indeed, no session automaton with k registers accepts the data word (req, 1)(req, 2) · · · (req, k + 1)(ack, 1)(ack, 2) · · · (ack, k + 1) ∈ L_3, since when reading the letter (req, k + 1), all the k + 1 data values seen so far should be registered to accept the suffix afterwards. A formal proof can be done in the same spirit as for L_1.
Example 2.6.To conclude the section, we present a session automaton with 2 registers that models a P2P protocol.A user can join a host with address x, denoted by action (join, x).
The request is either forwarded by x to another host y, executing (forw_1, x)(forw_2, y), or acknowledged by (ack, x). In the latter case, a connection between the user and x is established so that they can communicate, indicated by action (com, x). Note that the sequence of actions (forw_1, x)(forw_2, y) should be considered as an encoding of a single action (forw, x, y) and is a way of dealing with actions that actually take two or more data values, as considered, e.g., in [19]. An example execution of our protocol is (join, 145)(forw, 145, 978)(forw, 978, 14)(ack, 14)(com, 14)(com, 14)(com, 14). In Figure 3, we show the session automaton for the P2P protocol: it uses 2 registers. Following [8], our automata can be easily extended to multi-dimensional data words. This also holds for the learning algorithm that will be presented in Section 5.

Symbolic Normal Form and Canonical Session Automata
Closure properties of session automata, decidability of inclusion/equivalence, and the learning algorithm will be established by means of a symbolic normal form of a data word, as well as a canonical session automaton recognizing those normal forms. The crucial observation is that data equality in a data word recognized by a session automaton only depends on the transition labels that generate it. In this section, we suppose that the set of registers of a session automaton is of the form R = {1, . . ., k}. In the following, we let Γ = (N>0)⊛ ∪ (N>0)↑ and, for k ≥ 1, Γ_k = {1, . . ., k}⊛ ∪ {1, . . ., k}↑.
A word u = (a_1, π_1) · · · (a_n, π_n) ∈ (Σ × Γ)* is called a symbolic word. For π = r⊛ or π = r↑, we let reg(π) = r denote the register involved and op(π) ∈ {⊛, ↑} the operation performed. A symbolic word "produces" a data word if, and only if, a register is initialized before it is used. Formally, we say that u is well-formed if, for all positions j ∈ {1, . . ., n} with op(π_j) = ↑, there is i < j such that π_i = reg(π_j)⊛. Let WF ⊆ (Σ × Γ)* be the set of all well-formed words.
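Well-formedness is a simple left-to-right check. A sketch, under our own encoding where a symbolic letter is (label, (register, op)) with op ∈ {"write", "read"} standing for r⊛ and r↑:

```python
# Sketch: well-formedness of a symbolic word over Σ × Γ.
# Every read of a register must be preceded by a write to that register.

def well_formed(u):
    initialized = set()
    for _, (r, op) in u:
        if op == "write":
            initialized.add(r)
        elif r not in initialized:   # read before the register was written
            return False
    return True
```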
With every symbolic word u = (a_1, π_1) · · · (a_n, π_n), we can associate an equivalence relation ∼_u over {1, . . ., n}, letting i ∼_u j if, and only if, • reg(π_i) = reg(π_j), and • i ≤ j and there is no position k ∈ {i + 1, . . ., j} such that π_k = reg(π_i)⊛, or j ≤ i and there is no position k ∈ {j + 1, . . ., i} such that π_k = reg(π_j)⊛. If u is well-formed, then the data values of every data word w = (a_1, d_1) · · · (a_n, d_n) that a session automaton "accepts via" u conform with the equivalence relation ∼_u, that is, we have d_i = d_j iff i ∼_u j. This motivates the following definition. Given a well-formed symbolic word u = (a_1, π_1) · · · (a_n, π_n), a concretization of u is a data word w = (a_1, d_1) · · · (a_n, d_n) such that, for all i, j ∈ {1, . . ., n}, we have d_i = d_j iff i ∼_u j. Let γ(u) denote the set of all concretizations of u. Observe that, if w is a data word from γ(u), then γ(u) = [w]≈. Concretization is extended to sets L ⊆ (Σ × Γ)* of well-formed words, and we let γ(L) = ⋃_{u ∈ L ∩ WF} γ(u). Note that, here, we first filter the well-formed words before applying the operator. Now, let A = (S, R, ι, F, ∆) be a session automaton. In the obvious way, we may consider A as a finite automaton over the finite alphabet Σ × (R⊛ ∪ R↑). We then obtain a regular language L_symb(A) ⊆ (Σ × (R⊛ ∪ R↑))*, and one can check that L(A) = γ(L_symb(A)). Though we have a symbolic representation of data languages recognized by session automata, it is in general difficult to compare their languages, since different symbolic words may give rise to the same concretizations. For example, we have γ((a, 1⊛)(a, 1⊛)(a, 1↑)) = γ((a, 1⊛)(a, 2⊛)(a, 2↑)). However, we can associate, with every data word, a symbolic normal form, producing the same set of concretizations. Intuitively, the normal form uses the first (according to the natural total order) register whose current data value will not be used anymore. In the above example, (a, 1⊛)(a, 1⊛)(a, 1↑) is in symbolic normal form: the data value stored at the first position in register 1 is not reused, so that, at the second position, register 1 must be overwritten. For the same reason, (a, 1⊛)(a, 2⊛)(a, 2↑) is not in symbolic normal form, in contrast to (a, 1⊛)(a, 2⊛)(a, 2↑)(a, 1↑), where register 1 is read at the end of the word. For a data word w, we let snf(w) denote the unique well-formed word in symbolic normal form such that w ∈ γ(snf(w)); snf is extended to data languages L by snf(L) = {snf(w) | w ∈ L}.
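The normal form just described can be computed by a greedy scan: a fresh data value goes into the smallest register whose current content is never needed again, and every later occurrence of a value becomes a read of the register holding it. A sketch under our (label, (register, op)) encoding of symbolic words, with registers numbered from 1:

```python
# Sketch: computing the symbolic normal form snf(w) of a data word.

def snf(w):
    last = {}                                  # last occurrence of each data value
    for i, (_, d) in enumerate(w):
        last[d] = i
    reg_of = {}                                # data value -> register holding it
    reg_content = {}                           # register -> data value it holds
    out = []
    for i, (a, d) in enumerate(w):
        if d in reg_of:                        # seen before: read its register
            out.append((a, (reg_of[d], "read")))
        else:                                  # fresh: smallest reusable register
            r = 1
            while r in reg_content and last[reg_content[r]] >= i:
                r += 1
            if r in reg_content:               # evict a value no longer needed
                del reg_of[reg_content[r]]
            reg_content[r] = d
            reg_of[d] = r
            out.append((a, (r, "write")))
    return out
```

On the examples from the text, the scan reproduces the stated normal forms: a value used only once frees its register immediately, while a value still needed keeps its register occupied.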
[Figure 4: (a) a data word and its sessions; (b) a session automaton recognizing all 2-bounded data words.] One easily verifies that L = γ(snf(L)) for all data languages L. Therefore, equality of data languages reduces to equality of their symbolic normal forms: Lemma 3.2. Let L and L′ be data languages. Then, L = L′ if, and only if, snf(L) = snf(L′).
Of course, symbolic normal forms may use any number of registers, so that the set of symbolic normal forms is a language over an infinite alphabet as well. However, given a session automaton A, the symbolic normal forms that represent the language L(A) use only a bounded (i.e., finite) number of registers. Indeed, an important notion in the context of session automata is the bound of a data word. Intuitively, the bound of w = (a_1, d_1) · · · (a_n, d_n) is the minimal number of registers that a session automaton needs in order to execute w. Or, in other words, the bound is the maximal number of overlapping sessions. A session is an interval delimiting the occurrence of one particular data value. Formally, a session of w is a set I ⊆ N>0 of the form {first_w(d), first_w(d) + 1, . . ., last_w(d)} with d ∈ D a data value appearing in w. Given k ∈ N>0, we say that w is k-bounded if every position i ∈ {1, . . ., n} is contained in at most k sessions. Let DW_k denote the set of k-bounded data words, and let SNF_k = snf(DW_k) denote the set of symbolic normal forms of all k-bounded data words.
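The bound of a data word is directly computable from its sessions. A sketch (helper names are ours):

```python
# Sketch: the bound of a data word = maximal number of overlapping sessions,
# where the session of d is the interval [first_w(d), last_w(d)].

def bound(w):
    first, last = {}, {}
    for i, (_, d) in enumerate(w, start=1):
        first.setdefault(d, i)
        last[d] = i
    best = 0
    for pos in range(1, len(w) + 1):
        overlap = sum(1 for d in first if first[d] <= pos <= last[d])
        best = max(best, overlap)
    return best

def is_k_bounded(w, k):
    return bound(w) <= k
```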
One can verify that a data word w is k-bounded if, and only if, snf(w) is a word over the alphabet Σ × Γ_k. A data language L is said to be k-bounded if L ⊆ DW_k. It is bounded if it is k-bounded for some k. Note that the set of all data words is not bounded.
Figure 4(a) illustrates a data word w with four different sessions. It is 2-bounded, as no position is shared by more than 2 sessions. We call a session automaton A = (S, R, ι, F, ∆) symbolically deterministic if, viewed as a finite automaton over Σ × (R⊛ ∪ R↑), it is deterministic, i.e., ∆ can be seen as a partial function with dom(∆) ⊆ S × (Σ × (R⊛ ∪ R↑)) and values in S. We call A data deterministic if it is symbolically deterministic and, for all s ∈ S, a ∈ Σ, and r_1, r_2 ∈ R with r_1 ≠ r_2, we have that (s, (a, r_1⊛)) ∈ dom(∆) implies (s, (a, r_2⊛)) ∉ dom(∆). Intuitively, given a data word as input, the automaton is data deterministic if, in each state, given a pair letter/data value, there is at most one fireable transition.
Notice that session automata, even when symbolically or data deterministic, are not necessarily "complete", in the sense that a run over a data word may fall into a deadlock situation: this is the case when the session automaton has forced a data value to be removed from the registers even though it will be needed in the future.
While "data deterministic" implies "symbolically deterministic" by definition, the converse is not true. E.g., the session automaton A_2 from Figure 1(b) and the one of Figure 4(b) are symbolically deterministic but not data deterministic. However, the session automaton obtained from A_2 by removing, e.g., the transition from s_0 to s_2 (coupled with the transition from s_0 to s_1, it causes non-determinism when reading a fresh data value at a request) is data deterministic (and is indeed equivalent to A_2, in the sense that it recognizes the same language L(A_2)).
Example 3.4. We explain how to construct a symbolically deterministic session automaton A, with k ≥ 1 registers, such that L_symb(A) = SNF_k. Its state space is S = {0, . . ., k} × 2^{1,...,k}, a state consisting of (i) the greatest register already initialized (indeed, we will only use a register r if every register r′ < r has already been used), and (ii) a subset P of registers whose values we promise to reuse before overwriting them. The initial state of A is (0, ∅), whereas the set of accepting states is {0, . . ., k} × {∅}. The set of transitions then ranges over every a ∈ Σ, i ∈ {0, . . ., k}, P ⊆ {1, . . ., k}, and r ∈ {1, . . ., k}. By determinizing a finite-state automaton recognizing the symbolic language, it is easy to show that every language recognized by a session automaton is also recognized by a symbolically deterministic session automaton; we shall study this question in more detail in the next section. The next theorem shows that this is not true for data deterministic session automata.
Theorem 3.5. Session automata are strictly more expressive than data deterministic session automata.
Proof. We show that the data language L = DW_2 cannot be recognized by a data deterministic session automaton. Indeed, suppose that such an automaton exists, with k registers. Then, consider the word w = (a, 1)(a, 2)(a, 3) · · · (a, k + 1) ∈ L, where every data value is fresh. By data determinism, there is a unique run accepting w. Along this run, let i < j be two positions such that their two fresh data values have been stored in the same register r (such a pair must exist since the automaton has only k registers). Without loss of generality, we can consider the greatest position j verifying this condition, and then the greatest position i associated with j. This means that register r is used for the last time when reading j, and has not been used in-between positions i and j. Now, the word (a, 1)(a, 2)(a, 3) · · · (a, k + 1)(a, i) ∈ L must be recognized by the automaton, but cannot be, since data value i appearing on the last position is not fresh anymore, and yet not stored in one of the registers (since register r was reused at j).

Canonical Session Automata.
We now present the main result of this section, showing that every session automaton A is equivalent to a canonical session automaton A_C, whose symbolic language L_symb(A_C) contains only symbolic normal forms. Theorem 3.6. Let A = (S, R, ι, F, ∆) be a session automaton with R = {1, . . ., k}. Then, L(A) is k-bounded. Moreover, snf(L(A)) is a regular language over the finite alphabet Σ × Γ_k. A corresponding automaton Ã can be effectively computed. Its number of states is at most exponential in k and linear in |S|.
Proof. First, if A is a session automaton using k registers, the language L(A) is k-bounded: along a run, every data value must be stored in some register from its first to its last occurrence, so every position is contained in at most k sessions. Example 3.4, constructing a symbolically deterministic session automaton for SNF_k = snf(γ((Σ × Γ_k)*)), shows that regularity of the symbolic language (Σ × Γ_k)* is preserved under the application of snf(γ(·)). We now prove that this is the case for every regular language over Σ × Γ_k. In particular, for the symbolic regular language L_symb(A), this will show that snf(L(A)), which is equal to snf(γ(L_symb(A))), is regular.
Let L be a regular language over Σ × Γ_k, and let L̃ = {u′ ∈ WF | γ(u′) = γ(u) for some u ∈ L ∩ WF}, i.e., the set of well-formed symbolic words having the same concretizations as some word from L. We show that snf(γ(L)) = SNF_k ∩ L̃. The left-to-right inclusion is immediate, since every word of snf(γ(L)) is the symbolic normal form of some k-bounded data word and has the same concretizations as some word from L ∩ WF. Conversely, let u ∈ SNF_k ∩ L̃, and let u′ ∈ L ∩ WF be such that γ(u′) = γ(u). Hence, starting from a word w in γ(u) (which is non-empty since u is well-formed), we have u = snf(w) (by uniqueness of the symbolic normal form) and w ∈ γ(u′) ⊆ γ(L), so that u ∈ snf(γ(L)). We know from Example 3.4 that SNF_k is regular. We now show that L̃ is regular: knowing that snf(γ(L)) = SNF_k ∩ L̃, this will permit us to conclude that snf(γ(L)) is regular. To do so, let A = (S, R, ι, F, ∆) be a session automaton with R = {1, . . ., k} such that L_symb(A) = L. We construct a session automaton Ã = (S × Inj(k), R, (ι, ∅), F × Inj(k), ∆′) recognizing the symbolic language L̃. Hereby, Inj(k) is the set of partial injective mappings from {1, . . ., k} to {1, . . ., k}, and ∅ ∈ Inj(k) denotes the mapping with empty domain. These partial mappings are used to remember the correspondence between old registers and new ones, so they may be understood as sets of constraints. For example, the mapping (2 → 1, 1 → 3) stands for "old register 2 henceforth refers to 1, and old register 1 henceforth refers to 3". Every subset of such constraints always forms a valid partial injective mapping; in the following, such a subset is called a sub-mapping. For example, σ = (1 → 3) is a sub-mapping of the previous one; it can then be extended with the new constraint 2 → 2, which we denote σ[2 → 2]. In the transition relation of Ã, we simulate r↑-transitions simply using the current mapping σ. For r⊛-transitions, we update σ, recording the new permutation of the registers: the new mapping extends a maximal sub-mapping of the current one, which is either the current mapping itself or the current mapping with exactly one constraint r_1 → r_3 removed to free r_1.
One can indeed show that L_symb(Ã) = L̃. The inclusion L_symb(Ã) ⊆ L̃ is easy to show, since an accepting run in Ã can be mapped to an accepting run in A using the partial injective mappings maintained in the states of Ã. For the other inclusion, it suffices to prove that, for every symbolic word u ∈ L and well-formed word u′ such that γ(u′) = γ(u), we have u′ ∈ L_symb(Ã). By definition of γ, we know that the projections of u and u′ over the finite alphabet Σ are the same, and that ∼_u = ∼_{u′}: the latter permits us to reconstruct by induction a unique sequence of partial injective mappings linking the registers used in u and in u′. An accepting run of A on u can therefore be mapped to an accepting run of Ã on u′. Building the product of the automaton recognizing SNF_k and the automaton Ã, we obtain a session automaton using k registers recognizing snf(γ(L)). Its number of states is bounded above by O(|S| × k! × (k + 1) × 2^k) (as the number of partial injective mappings in Inj(k) is bounded above by O(k!)).
From the automaton Ã built in the proof of the previous theorem, we can consider the (unique up to isomorphism) minimal deterministic finite-state automaton A_C (i.e., symbolically deterministic session automaton) equivalent to it: this automaton will be called the canonical session automaton. In case A is data deterministic, we can verify that Ã is symbolically deterministic, and hence the minimal automaton A_C has at most O(|S| × k! × (k + 1) × 2^k) states. Otherwise, a determinization phase has to be performed, resulting in a canonical session automaton with at most 2^{O(|S| × k! × (k+1) × 2^k)} states.
Example 3.7. Examples of A and Ã, as defined in the previous proof, are given in Figure 6. The figure also depicts the canonical automaton A_C associated with A, obtained by determinizing and minimizing the product of Ã and the symbolically deterministic automaton recognizing SNF_2 (as given in Figure 5). Note that A_C is symbolically deterministic and minimal.
3.4. Closure Properties. Using Theorem 3.6, we obtain some language-theoretical closure properties of session automata, which they inherit from classical regular languages. These results demonstrate a certain robustness as required in verification tasks such as compositional verification [11] and infinite-state regular model checking [18]. Theorem 3.8. We have the following closure properties: • Session automata are closed under union and intersection. • Session automata are closed under resource-sensitive complementation: given a session automaton A with k registers, there is a session automaton Ā with k registers such that L(Ā) = DW_k \ L(A). Proof. Let A be a session automaton using k registers, and B a session automaton using k′ registers. Using a classical product construction for A_C and B_C, we obtain a session automaton using min(k, k′) registers recognizing the data language L(A) ∩ L(B). The language L(A) ∪ L(B) is recognized by the session automaton, using max(k, k′) registers, that we obtain as the "disjoint union" of A and B, branching on the first transition into one of these two automata. Finally, let us consider a symbolically deterministic session automaton A using k registers. Without loss of generality, by adding a sink state, we can suppose that A is complete. Then, every well-formed symbolic word over Σ × Γ_k has exactly one run in A. The automaton Ā constructed from A by taking as accepting states the non-accepting states of A verifies L(Ā) = DW_k \ L(A). Notice that Ā is symbolically deterministic, but not necessarily data deterministic (even if A is), because of the completion step.
Theorem 3.9. The inclusion problem for session automata is decidable.

Proof. Considering two session automata A and B, we can decide the inclusion L(A) ⊆ L(B) by considering the canonical automata A^C and B^C: indeed, L(A) ⊆ L(B) if and only if L_symb(A^C) ⊆ L_symb(B^C).
In case B is data deterministic, B^C has a size polynomial in the number of states of B, but exponential in the number of registers. Testing the inclusion L_symb(A^C) ⊆ L_symb(B^C) may be done by first complementing L_symb(B^C) (which does not add states, since B^C is symbolically deterministic) and then testing the emptiness of its intersection with L_symb(A^C). Overall, this yields an inclusion check that is polynomial in the numbers of states of A and B, but exponential in the number of registers used by B. In case B is not data deterministic, a determinization phase may add a blow-up exponential in the size and the number of registers of B.
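At the symbolic level, this inclusion test amounts to a reachability check in the product of A^C with the complement of B^C: L(A) ⊆ L(B) fails exactly when some reachable product state is accepting for A but not for B. A minimal sketch on complete DFAs, with a toy dict encoding of our own choosing:

```python
def included(dA, sA, fA, dB, sB, fB, alphabet):
    """Decide L(A) ⊆ L(B) for complete DFAs A and B. Since B is
    deterministic, its complement is obtained by flipping finals, so we
    simply search the product for a pair (p, q) with p accepting in A
    and q non-accepting in B."""
    seen, todo = {(sA, sB)}, [(sA, sB)]
    while todo:
        p, q = todo.pop()
        if p in fA and q not in fB:   # witness of a word in L(A) \ L(B)
            return False
        for a in alphabet:
            nxt = (dA[(p, a)], dB[(q, a)])
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return True
```

For example, "words ending in a" is included in "words containing an a", but not conversely.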
As a corollary, we obtain that the emptiness problem and the universality problem under k-boundedness (i.e., deciding whether the language of a session automaton with k registers is the whole set of k-bounded data words) are decidable for session automata. This is not surprising for the emptiness problem, since it already holds for fresh-register automata. Notice that the problem is shown co-NP-complete for register automata in [31], and we can show that the emptiness problem is co-NP-complete for session automata, too. First, co-NP-hardness can be shown by reducing 3-SAT to the non-emptiness problem, in a way very similar to [31]. Then, the co-NP upper bound comes from the symbolic view. Indeed, for a session automaton A with k registers, L(A) = ∅ if and only if L_symb(A) ∩ WF_k = ∅ (where WF_k denotes the set of well-formed symbolic words over the alphabet Σ × Γ_k). We do not construct a finite automaton recognizing L_symb(A) ∩ WF_k (which has a size exponential in k), but instead non-deterministically search for a witness of non-emptiness of L_symb(A) ∩ WF_k, i.e., a well-formed word u such that u ∈ L_symb(A). Notice that the membership test of u in the finite-state automaton A can be performed in polynomial time; hence, to conclude, we must simply show the existence of a well-formed witness u of polynomial size. As for register automata in [31], this relies on the fact that, even though the total number of configurations of A is exponential in k (due to the set U of initialized registers), along a run of A only a polynomial (in the size of A and in k) number of configurations can be visited, since the set U takes at most k + 1 values during the computation (the initialization of registers is done in a certain order, and no register can be emptied at any point). Hence, by disallowing two visits of the same configuration, the existence of a well-formed witness implies the existence of a well-formed witness of polynomial size.
While fresh-register automata are not complementable with respect to the set of all data words, they are complementable relative to k-bounded data words, using the previous theorem. The reason is that, given a fresh-register automaton A, one can construct a session automaton B such that L(B) = L(A) ∩ DW_k.

4. Logical Characterizations
In this section, we provide logical characterizations of session automata.

4.1. MSO Logic over Data Words. We consider the standard data monadic second-order logic (dMSO), which is an extension of classical MSO logic by the binary predicate x ∼ y to compare data values.
We fix infinite supplies of first-order variables x, y, . . ., which are interpreted as word positions, and second-order variables X, Y, . . ., which are interpreted as sets of positions. We let dMSO be the set of formulae ϕ defined by the grammar

ϕ ::= a(x) | x ∼ y | x < y | x ∈ X | ¬ϕ | ϕ ∨ ϕ | ∃x ϕ | ∃X ϕ

with x, y first-order variables, X a second-order variable, and a ∈ Σ. The semantics of formulae in dMSO is given in Table 1: we define w, σ |= ϕ (to be read as "w satisfies ϕ when the free variables of ϕ are interpreted as prescribed by σ") by induction over ϕ, where w = (a_1, d_1) ⋯ (a_n, d_n) ∈ (Σ × D)* is a data word and σ is a valuation of (at least) the free variables of ϕ, i.e., such that σ(x) ∈ {1, . . ., n} for every free first-order variable x, and σ(X) ⊆ {1, . . ., n} for every free second-order variable X. For a first-order variable x and a position i ∈ {1, . . ., n}, we write σ[x ↦ i] for the valuation that maps x to i and coincides with σ on all other variables. In addition, we use abbreviations such as true, x ≤ y, ∀x ϕ, ϕ ∧ ψ, ϕ → ψ, etc. A sentence is a formula without free variables. For a dMSO sentence ϕ, we set L(ϕ) = {w ∈ (Σ × D)* | w |= ϕ}.

As usual, to deal with free variables, it is possible to extend the alphabet Σ as follows. If V is the set of variables that occur in ϕ, we consider data words over Σ′ = Σ × {0, 1}^V and D. Intuitively, these data words include the interpretation of the free variables. If a data word carries, at position i, the letter ((a, b), d) ∈ Σ′ × D with b[x] = 1 (where b[x] refers to the x-component of b), then x is interpreted as position i. If b[X] = 1, then X is interpreted as a set containing i. Whenever we refer to a word over the extended alphabet Σ′, we will silently assume that the interpretation of each first-order variable x is uniquely determined, i.e., there is exactly one position i where b[x] = 1. This is justified, since the set of those "well-shaped" words is (symbolically) regular. This way, we can transform any well-shaped word ŵ ∈ (Σ × {0, 1}^V × D)* into a pair (w, σ), where w is a data word in (Σ × D)* and σ is a valuation of the variables in V, and vice versa.
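To make the semantics concrete, here is a small, naive evaluator that transcribes the inductive definition directly. The tuple encoding, the constructor names, and the example formula are ours, not the paper's; second-order quantification enumerates all subsets of positions, so this is only practical for very short words.

```python
from itertools import chain, combinations

def sat(w, sigma, phi):
    """w: data word as a list of (label, data) pairs; sigma: valuation;
    phi: formula as a nested tuple. Positions are 1-based, as in the text."""
    op = phi[0]
    if op == 'label':                       # a(x): position x carries letter a
        return w[sigma[phi[2]] - 1][0] == phi[1]
    if op == 'sim':                         # x ~ y: equal data values
        return w[sigma[phi[1]] - 1][1] == w[sigma[phi[2]] - 1][1]
    if op == 'lt':                          # x < y on positions
        return sigma[phi[1]] < sigma[phi[2]]
    if op == 'in':                          # x ∈ X
        return sigma[phi[1]] in sigma[phi[2]]
    if op == 'not':
        return not sat(w, sigma, phi[1])
    if op == 'or':
        return sat(w, sigma, phi[1]) or sat(w, sigma, phi[2])
    if op == 'and':                         # abbreviation, kept for convenience
        return sat(w, sigma, phi[1]) and sat(w, sigma, phi[2])
    if op == 'ex1':                         # ∃x φ: try every position
        return any(sat(w, {**sigma, phi[1]: i}, phi[2])
                   for i in range(1, len(w) + 1))
    if op == 'ex2':                         # ∃X φ: try every subset of positions
        positions = range(1, len(w) + 1)
        subsets = chain.from_iterable(combinations(positions, r)
                                      for r in range(len(w) + 1))
        return any(sat(w, {**sigma, phi[1]: set(S)}, phi[2]) for S in subsets)
    raise ValueError(op)

# "Two positions share a data value": ∃x ∃y (x < y ∧ x ~ y)
phi = ('ex1', 'x', ('ex1', 'y', ('and', ('lt', 'x', 'y'), ('sim', 'x', 'y'))))
```

On the data word (a, 7)(b, 3)(a, 7) the formula holds (positions 1 and 3 share the value 7), while on (a, 1)(b, 2) it does not.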
Note that dMSO is a very expressive logic, which goes beyond virtually all automata models defined for data words [29,32,6,12]. However, when we restrict to bounded languages, dMSO is no more expressive than session automata.

Theorem 4.1. Let L be a bounded data language. Then, the following statements are equivalent:
• There is a session automaton A such that L(A) = L.
• There is a dMSO sentence ϕ such that L(ϕ) = L.
Proof. The construction of a dMSO formula of the form ∃X_1 ⋯ ∃X_m (α ∧ ∀x∀y (x ∼ y ↔ β)), with α and β classical MSO formulae (not containing the predicate ∼), from a session automaton A was implicitly given in [8] (with a different goal, though). The idea is that the existential second-order variables X_1, . . ., X_m are used to guess an assignment of transitions to positions. The formula α verifies that the assignment corresponds to a run of A, while β checks that data equality corresponds to the data flow enforced by the transition labels from Γ_k. The formula has size polynomial in the size of the automaton. In Section 4.2, formulae of this shape will be studied in more detail.
We now describe in detail our active learning algorithm for a given session automaton A, given in Table 1. It is based on a loop which repeatedly constructs a closed table using membership queries, builds the corresponding automaton, and then asks an equivalence query. This is repeated until A is learned. An important part of an active learning algorithm is the treatment of counterexamples provided by the teacher as answers to equivalence queries. Suppose that, for a given A_O constructed from a closed table O = (T, U, V), the teacher answers with a counterexample data word w. Let z = snf(w). If z uses more registers than available in the current alphabet, we extend the alphabet and then the table. If the obtained table is not closed, we restart from the beginning of the loop. Otherwise, and also if z does not use more registers, we use Rivest and Schapire's [30] technique to extend the table by adding a suitable v to V making it non-closed. The technique is based on the notion of breakpoint, which we now recall. As z is a counterexample, we have

z ∈ L_symb(A^C) if and only if z ∉ L_symb(A_O).   (1)

Let m be the length of z. For all i with 1 ≤ i ≤ m + 1, let z be decomposed as z = u_i v_i with u_i, v_i ∈ (Σ × Γ)*, where u_1 = v_{m+1} = ε, v_1 = u_{m+1} = z, and the length of u_i is i − 1 (so we also have z = u_i z_i v_{i+1} for all i with 1 ≤ i ≤ m, where z_i is the i-th letter of z). Let s_i ∈ U be the representative of the state visited just before reading the i-th letter, along the run of z in A_O. A breakpoint is an index i such that the words s_i z_i v_{i+1} and s_{i+1} v_{i+1} are classified differently by membership queries. Because of (1), such a breakpoint must exist, and it can be found with O(log(m)) membership queries by a binary search. The word v_{i+1} is called the distinguishing word. If V is extended by v_{i+1}, the table is not closed anymore (row(s_i z_i) and row(s_{i+1}) become different). Now the algorithm closes the table again, asks another equivalence query, and so forth until termination. At each iteration of the loop, the number of rows in the upper part of the table (each of which corresponds to a state of the automaton A^C) increases by at least one. Notice that the same counterexample might be given several times; its treatment only guarantees that the table will contain one more row in its upper part. We obtain the following:

Theorem 5.2. Let A be a session automaton over Σ and D, using k registers. Let A^C be the corresponding canonical session automaton, N its number of states, k its number of registers, and M the length of the longest counterexample returned by an equivalence query. Then, the learning algorithm for A terminates with at most O(k|Σ|N² + N log(M)) membership queries and O(N) equivalence queries.

Notice that L_symb(A_2) = L_symb(A^C) ∩ (Σ × Γ_1)*. This means that the equivalence query must give back a data word whose normal form uses at least 2 registers (here (a, 7)(a, 4)(b, 7) with normal form (a, 1)(a, 2)(b, 1↑)). As the word uses 2 registers, we extend the alphabet to Σ × Γ_2 and obtain table O_3. We close the table and get O_4. From there we obtain the hypothesis automaton A_4. After the equivalence query, we get (a, 1)(a, 2)(b, 1↑)(b, 2↑) as normal form of the data word counterexample (a, 9)(a, 3)(b, 9)(b, 3). After adding (b, 2↑) to V and closing the table by moving (a, 1)(a, 2)(b, 1↑) to the top, we finally get the table O_5, from which the canonical automaton A^C is obtained, and the equivalence query succeeds.
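The breakpoint search at the heart of the counterexample treatment can be sketched as follows. Here `member` stands for the (symbolic) membership oracle, `access` maps a prefix to its representative row in U, and symbolic letters are abstracted as single characters; all three encodings are assumptions of this illustration, not the paper's definitions.

```python
def distinguishing_suffix(z, member, access):
    """Rivest-Schapire style binary search for a breakpoint in the
    counterexample z. g(i) classifies the word obtained from z by replacing
    its length-i prefix with that prefix's representative row: g(0) is the
    true classification of z, g(len(z)) is the hypothesis' answer, and they
    differ because z is a counterexample. Returns the distinguishing
    suffix v_{i+1} to be added to V."""
    def g(i):
        return member(access(z[:i]) + z[i:])

    assert g(0) != g(len(z)), "z is not a counterexample"
    lo, hi = 0, len(z)          # invariant: g(lo) != g(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if g(lo) != g(mid):
            hi = mid
        else:
            lo = mid
    return z[hi:]               # the distinguishing suffix
```

For instance, against a one-state hypothesis (every prefix mapped to the empty row) for the target "contains aa", the counterexample "aa" yields the distinguishing suffix "a" after a single probe.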

Conclusion
In this paper, we developed a theory of session automata, which form a robust class of data languages. In particular, they are closed under union, intersection, and resource-sensitive complementation. Moreover, they enjoy logical characterizations in terms of (a fragment of) MSO logic with a predicate to compare data values for equality. Finally, unlike most other automata models for data words, session automata have a decidable inclusion problem. This makes them attractive for verification and learning. In fact, we provided a complete framework for algorithmic learning of session automata, making use of their canonical normal form. An interesting direction to follow would be to apply these methods to other automata models dealing with data values, such as data automata [6,5] or variable automata [16]. As a next step, we plan to employ our setting for various verification tasks.
In particular, the next step is to implement our framework, possibly using learning algorithms other than the one of Rivest and Schapire presented in this article, for instance via the LearnLib platform [27] or libalf [10].

Figure 3: Session automaton for the P2P protocol

Figure 7: The successive observation tables