Most Common Words – A cP Systems Solution

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10725)

Abstract

Finding the most common words in a text file is a famous “programming pearl”, originally posed by Jon Bentley (1984). Several interesting solutions have been proposed by Knuth (an exquisite model of literate programming, 1986), McIlroy (an engineering example of combining a timeless set of tools, 1986) and Hanson (an alternate efficient solution, 1987). Here we propose a concise and efficient solution based on the fast parallel and associative capabilities of cP systems. We also check their parallel sorting capabilities and propose a dynamic version of the classical pigeonhole algorithm.

References

  1. Bentley, J., Knuth, D., McIlroy, D.: Programming pearls: a literate program. Commun. ACM 29(6), 471–483 (1986). http://doi.acm.org/10.1145/5948.315654

  2. Knuth, D.E.: Literate programming. Comput. J. 27(2), 97–111 (1984). http://dx.doi.org/10.1093/comjnl/27.2.97

  3. Lynch, N.A.: Distributed Algorithms. Morgan Kaufmann Publishers Inc., San Francisco (1996)

  4. Nicolescu, R.: Parallel and distributed algorithms in P systems. In: Gheorghe, M., Păun, G., Rozenberg, G., Salomaa, A., Verlan, S. (eds.) CMC 2011. LNCS, vol. 7184, pp. 35–50. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28024-5_4

  5. Nicolescu, R.: Parallel thinning with complex objects and actors. In: Gheorghe, M., Rozenberg, G., Salomaa, A., Sosík, P., Zandron, C. (eds.) CMC 2014. LNCS, vol. 8961, pp. 330–354. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-14370-5_21

  6. Nicolescu, R.: Structured grid algorithms modelled with complex objects. In: Rozenberg, G., Salomaa, A., Sempere, J.M., Zandron, C. (eds.) CMC 2015. LNCS, vol. 9504, pp. 321–337. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-28475-0_22

  7. Nicolescu, R.: Revising the membrane computing model for byzantine agreement. In: Leporati, A., Rozenberg, G., Salomaa, A., Zandron, C. (eds.) CMC 2016. LNCS, vol. 10105, pp. 317–339. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54072-6_20

  8. Nicolescu, R., Ipate, F., Wu, H.: Programming P systems with complex objects. In: Alhazov, A., Cojocaru, S., Gheorghe, M., Rogozhin, Y., Rozenberg, G., Salomaa, A. (eds.) CMC 2013. LNCS, vol. 8340, pp. 280–300. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-642-54239-8_20

  9. Nicolescu, R., Wu, H.: Complex objects for complex applications. Rom. J. Inf. Sci. Technol. 17(1), 46–62 (2014)

  10. Păun, G., Rozenberg, G., Salomaa, A. (eds.): The Oxford Handbook of Membrane Computing. Oxford University Press Inc., New York (2010)

  11. Tel, G.: Introduction to Distributed Algorithms. Cambridge University Press, Cambridge (2000)

  12. Van Wyk, C.J.: Literate programming. Commun. ACM 30(7), 583–599 (1987). http://doi.acm.org/10.1145/28569.315738

Author information

Corresponding author

Correspondence to Radu Nicolescu.

A Appendix cP Systems: P Systems with Complex Symbols

We present the details of our cP framework, simplified from our earlier papers [5, 6].

A.1 Complex Symbols as Subcells

Complex symbols, or subcells, play the roles of cellular micro-compartments or substructures, such as organelles, vesicles or cytoophidium assemblies (“snakes”), which are embedded in cells or travel between cells, but without having the full processing power of a complete cell. In our proposal, subcells represent nested labelled data compartments which have no processing power of their own: they are acted upon by the rules of their enclosing cells.

Our basic vocabulary consists of atoms and variables, collectively known as simple symbols. Complex symbols are similar to Prolog-like first-order terms, recursively built from multisets of atoms and variables. Together, complex symbols and simple symbols (atoms, variables) are called symbols and can be defined by the following formal grammar:

[Figure g: formal grammar of symbols (atoms, variables and complex symbols)]

Atoms are typically denoted by lower case letters (or, occasionally, digits), such as a, b, c, \(\textit{1}\). Variables are typically denoted by uppercase letters, such as X, Y, Z. Functors are term (subcell) labels; here functors can only be atoms, not variables.

For improved readability, we also consider anonymous variables, which are denoted by underscores (“\(\_\)”). Each underscore occurrence represents a new unnamed variable and indicates that something, in which we are not interested, must fill that slot.

Symbols that do not contain variables are called ground, e.g.:

  • Ground symbols: a, \(a(\lambda )\), a(b), a(bc), \(a(b^2 c)\), a(b(c)), \(a(bc(\lambda ))\), a(b(c)d(e)), \(a(b(c)d(e(\lambda )))\), \(a(bc^2 d)\).

  • Symbols which are not ground: X, a(X), a(bX), a(b(X)), a(XY), \(a(X^2)\), a(XdY), a(Xc()), a(b(X)d(e)), a(b(c)d(Y)), \(a(b(X^2)d(e(Xf^2)))\); also, using anonymous variables: \(\_\), \(a(b\_)\), \(a(X\_)\), \(a(b(X)d(e(\_)))\).

  • This term-like construct which starts with a variable is not a symbol (this grammar defines first-order terms only): X(aY).
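As a rough illustration of the distinction between ground and non-ground symbols, the following Python sketch models symbols as a small data type. The class names `Atom`, `Var` and `Term` and the helper `is_ground` are hypothetical, introduced here for illustration only; multiplicities such as \(b^2\) are modelled simply by repeating an argument.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Atom:
    name: str  # e.g. "a", "b", "1"


@dataclass(frozen=True)
class Var:
    name: str  # e.g. "X", "Y"; "_" stands for an anonymous variable


@dataclass(frozen=True)
class Term:
    functor: Atom  # functors can only be atoms, never variables
    args: Tuple["Symbol", ...] = ()  # the empty tuple models lambda


Symbol = Union[Atom, Var, Term]


def is_ground(symbol: Symbol) -> bool:
    """Return True when the symbol contains no variables at any depth."""
    if isinstance(symbol, Var):
        return False
    if isinstance(symbol, Term):
        return all(is_ground(arg) for arg in symbol.args)
    return True  # a bare atom is always ground


a, b, c = Atom("a"), Atom("b"), Atom("c")
ground_example = Term(a, (Term(b, (c,)),))   # models a(b(c))
open_example = Term(a, (b, Var("X")))        # models a(bX)
```

Under this encoding, a construct such as X(aY) cannot even be built, because the `functor` field is declared as an `Atom`.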

Note that we may abbreviate the expression of complex symbols by removing inner \(\lambda \)’s as explicit references to the empty multiset, e.g. \(a(\lambda ) = a()\).

In concrete models, cells may contain ground symbols only (no variables). Rules may however contain any kind of symbols: atoms, variables and terms (whether ground or not).

Unification. All symbols which appear in rules (ground or not) can be (asymmetrically) matched against ground terms, using an ad-hoc version of pattern matching, more precisely, a one-way first-order syntactic unification (one-way, because cells may not contain variables). An atom can only match another copy of itself, but a variable can match any multiset of ground terms (including \(\lambda \)). This may create a combinatorial non-determinism, when a combination of two or more variables is matched against the same multiset, in which case an arbitrary matching is chosen. For example:

  • Matching \(a(b(X)fY) = a(b(cd(e))f^2g)\) deterministically creates a single set of unifiers: \(X, Y = cd(e), fg\).

  • Matching \(a(XY^2) = a(de^2f)\) deterministically creates a single set of unifiers: \(X, Y = df, e\).

  • Matching \(a(b(X)c(\textit{1}X)) = a(b(\textit{1}^2)c(\textit{1}^3))\) deterministically creates one single unifier: \(X = \textit{1}^2\).

  • Matching \(a(b(X)c(\textit{1}X)) = a(b(\textit{1}^2)c(\textit{1}^2))\) fails.

  • Matching \(a(XY) = a(df)\) non-deterministically creates one of the following four sets of unifiers: \(X, Y = \lambda , df\); \(X, Y = df, \lambda \); \(X, Y = d, f\); \(X, Y = f, d\).
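The last three examples can be checked with a small Python sketch. This is only an illustration under simplifying assumptions, not a general unifier: multisets of atoms are modelled with `collections.Counter`, and the function names `split_between_two_variables` and `match_linked` are hypothetical.

```python
from collections import Counter
from itertools import product


def split_between_two_variables(ground):
    """Yield every pair (X, Y) of multisets with X + Y == ground.

    Models matching a(XY) against a ground term a(...) whose
    content is a multiset of atoms.
    """
    atoms = sorted(ground)
    # For each distinct atom, choose how many copies go to X;
    # the remaining copies go to Y.
    choices = [range(ground[atom] + 1) for atom in atoms]
    for counts in product(*choices):
        x = Counter({atom: n for atom, n in zip(atoms, counts) if n})
        y = ground - x
        yield x, y


def match_linked(b_content, c_content):
    """Model matching a(b(X) c(1X)): X is forced by b(X),
    then c's content must equal one extra '1' plus X."""
    x = b_content
    if Counter({"1": 1}) + x == c_content:
        return x
    return None  # the match fails


splits = list(split_between_two_variables(Counter("df")))
```

Here `splits` holds the four candidate unifier sets for \(a(XY) = a(df)\), while `match_linked` succeeds for \(b(\textit{1}^2)c(\textit{1}^3)\) and fails for \(b(\textit{1}^2)c(\textit{1}^2)\).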

A.2 High-Level or Generic Rules

Typically, our rules use states and are applied top-down, in the so-called weak priority order.

Pattern matching. Rules are matched against cell contents using the pattern matching discussed above, which involves the rule’s left-hand side, promoters and inhibitors. Moreover, the matching is valid only if, after substituting variables by their values, the rule’s right-hand side contains ground terms only (so no free variables are injected into the cell or sent to its neighbours), as illustrated by the following sample scenario:

  • The cell’s current content includes the ground term:

    \(n(a \, \phi (b \, \phi (c) \, \psi (d)) \, \psi (e)).\)

  • The following (state-less) rewriting rule is considered:

    \(n(X \, \phi (Y \, \phi (Y_1) \, \psi (Y_2)) \, \psi (Z)) ~ \rightarrow ~ v(X) \, n(Y \, \phi (Y_2) \, \psi (Y_1)) \, v(Z).\)

  • Our pattern matching determines the following unifiers:

    \(X = a\), \(Y = b\), \(Y_1 = c\), \( Y_2 = d\), \(Z = e\).

  • This is a valid matching and, after substitutions, the rule’s right-hand side gives the new content:

    \(v(a) ~ n(b \, \phi (d) \, \psi (c)) ~ v(e).\)
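The scenario above can be replayed with a hand-coded Python sketch. It hard-wires this single rule instead of implementing general unification, and the nested-dictionary encoding (keys `"atoms"`, `"phi"`, `"psi"`) and the name `apply_swap_rule` are assumptions made for illustration.

```python
def apply_swap_rule(n_term):
    """Apply n(X phi(Y phi(Y1) psi(Y2)) psi(Z)) -> v(X) n(Y phi(Y2) psi(Y1)) v(Z).

    n_term encodes the content of one n(...) term as nested dicts.
    """
    # Destructuring plays the role of unification for this fixed pattern.
    x = n_term["atoms"]
    inner = n_term["phi"]
    y, y1, y2 = inner["atoms"], inner["phi"], inner["psi"]
    z = n_term["psi"]
    # Build the right-hand side; every variable is bound, so it is ground.
    return [
        ("v", x),
        ("n", {"atoms": y, "phi": y2, "psi": y1}),
        ("v", z),
    ]


# Encodes n(a phi(b phi(c) psi(d)) psi(e)).
cell_term = {
    "atoms": "a",
    "phi": {"atoms": "b", "phi": "c", "psi": "d"},
    "psi": "e",
}
new_content = apply_swap_rule(cell_term)
```

The resulting `new_content` encodes \(v(a) ~ n(b \, \phi (d) \, \psi (c)) ~ v(e)\), with the contents of the inner \(\phi\) and \(\psi\) swapped.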

Generic rules format. We consider rules of the following generic format (we call this format generic, because it actually defines templates involving variables):

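[Figure h: generic rule format, relating current-state, symbols, target-state, in-symbols, out-symbols with destinations \(\delta \), promoters and inhibitors, under application mode \(\alpha \)]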

Where:

  • current-state and target-state are atoms or terms;

  • symbols, in-symbols, promoters and inhibitors are symbols;

  • in-symbols become available after the end of the current step only, as in traditional P systems (we can imagine that these are sent via an ad-hoc fast loopback channel);

  • subscript \(\alpha \in \{\mathtt{min}, \mathtt{max}\}\) indicates the application mode, as further discussed in the example below;

  • out-symbols are sent, at the end of the step, to the cell’s structural neighbours. These symbols are enclosed in round parentheses which further indicate their destinations, above abbreviated as \(\delta \). The most usual scenarios include:

    • \((a)\downarrow _i\) indicates that a is sent over outgoing arc i (unicast);

    • \((a)\downarrow _{i,\,j}\) indicates that a is sent over outgoing arcs i and j (multicast);

    • \((a)\downarrow _\forall \) indicates that a is sent over all outgoing arcs (broadcast).

    All symbols sent via one generic rule to the same destination form one single message and they travel together as one single block (even if the generic rule is applied in mode \(\mathtt{max}\)).
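The three destination scenarios can be pictured with the following Python sketch. The function `deliver`, the use of the string `"all"` for \(\forall \), and the integer arc labels are all hypothetical choices made for illustration; the sketch only shows which arcs receive the message and that the symbols travel together as one block.

```python
def deliver(symbols, outgoing_arcs, destination):
    """Return a mapping arc -> message for one rule application.

    destination is either the string "all" (broadcast) or a
    collection of arc labels (unicast for one label, multicast
    for several).
    """
    if destination == "all":
        targets = list(outgoing_arcs)
    else:
        wanted = set(destination)
        targets = [arc for arc in outgoing_arcs if arc in wanted]
    # All symbols sent to the same destination form one single block.
    message = tuple(symbols)
    return {arc: message for arc in targets}


arcs = [1, 2, 3]
unicast = deliver(["a"], arcs, [2])
multicast = deliver(["a"], arcs, [1, 3])
broadcast = deliver(["a", "b"], arcs, "all")
```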

Example. To explain our rule application mode, let us consider a cell, \(\sigma \), containing three counter-like complex symbols, \(c(\textit{1}^2)\), \(c(\textit{1}^2)\), \(c(\textit{1}^3)\), and the two possible application modes of the following high-level “decrementing” rule:

[Figure i: the decrementing rule \(\rho _\alpha \), with left-hand side \(c(\textit{1}\, X)\)]

The left-hand side of rule \(\rho _\alpha \), \(c(\textit{1}\, X)\), can be unified in three different ways, to each one of the three c symbols extant in cell \(\sigma \). Conceptually, we instantiate this rule in three different ways, each one tied and applicable to a distinct symbol:

  1. If \(\alpha = \mathtt{min}\), rule \(\rho _\mathtt {min}\) non-deterministically selects and applies one of these virtual rules \(\rho _1\), \(\rho _2\), \(\rho _3\). Using \(\rho _1\) or \(\rho _2\), cell \(\sigma \) ends with counters \(c(\textit{1})\), \(c(\textit{1}^2)\), \(c(\textit{1}^3)\). Using \(\rho _3\), cell \(\sigma \) ends with counters \(c(\textit{1}^2)\), \(c(\textit{1}^2)\), \(c(\textit{1}^2)\).

  2. If \(\alpha = \mathtt{max}\), rule \(\rho _\mathtt {max}\) applies in parallel all these virtual rules \(\rho _1\), \(\rho _2\), \(\rho _3\). Cell \(\sigma \) ends with counters \(c(\textit{1})\), \(c(\textit{1})\), \(c(\textit{1}^2)\).
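The two application modes can be simulated with a short Python sketch. It assumes, as the outcomes above suggest, that the rule turns \(c(\textit{1}\, X)\) into \(c(X)\); each counter \(c(\textit{1}^k)\) is represented by the integer k, and the function names are hypothetical.

```python
import random


def decrement_min(counters, rng=random):
    """Mode min: decrement exactly one non-deterministically chosen counter."""
    # c(1 X) only matches counters holding at least one '1'.
    matching = [i for i, value in enumerate(counters) if value >= 1]
    if not matching:
        return list(counters)  # the rule is not applicable
    chosen = rng.choice(matching)
    return [v - 1 if i == chosen else v for i, v in enumerate(counters)]


def decrement_max(counters):
    """Mode max: decrement every matching counter in the same step."""
    return [v - 1 if v >= 1 else v for v in counters]


cell = [2, 2, 3]  # c(1^2), c(1^2), c(1^3)
after_min = decrement_min(cell)
after_max = decrement_max(cell)
```

Viewed as a multiset, `after_min` is either [1, 2, 3] or [2, 2, 2], depending on the random choice, whereas `after_max` is always [1, 1, 2].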

Special cases. Simple scenarios involving generic rules are sometimes semantically equivalent to loop-based sets of non-generic rules. For example, consider the rule

$$ S_1 ~ a(x(I) \; y(J)) ~ \rightarrow _\mathtt {max}~ S_2 ~ b(I) ~ c(J), $$

where the cell’s contents guarantee that I and J only match integers in ranges [1, n] and [1, m], respectively. Under these assumptions, this rule is equivalent to the following set of non-generic rules:

$$ S_1 ~ a_{i,j} ~ \rightarrow S_2 ~ b_i ~ c_j, ~ \forall i \in [1,n], j \in [1,m]. $$

However, unification is a much more powerful concept, which cannot be generally reduced to simple loops.
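To make the size difference concrete, the following Python sketch expands the single generic rule into its non-generic counterparts. The function name `expand_generic_rule` and the textual rendering of rules as pairs of strings are assumptions made for illustration.

```python
def expand_generic_rule(n, m):
    """List the non-generic rules S1 a_{i,j} -> S2 b_i c_j
    for all i in [1, n] and j in [1, m]."""
    return [
        (f"S1 a_{i},{j}", f"S2 b_{i} c_{j}")
        for i in range(1, n + 1)
        for j in range(1, m + 1)
    ]


rules = expand_generic_rule(2, 3)
```

One generic rule thus stands for \(n \cdot m\) non-generic rules (six in this call), and the expanded ruleset grows with the ranges while the generic rule stays fixed.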

Benefits. This type of generic rule allows (i) a reasonably fast parsing and processing of subcomponents, and (ii) algorithm descriptions with fixed-size alphabets and fixed-size rulesets, independent of the size of the problem and number of cells in the system (often impossible with only atomic symbols).

Synchronous vs asynchronous. In our models, we do not make any syntactic difference between the synchronous and asynchronous scenarios; this is strictly a runtime assumption [4]. Any model is able to run on both the synchronous and asynchronous runtime “engines”, although the results may differ. Our asynchronous model closely matches the standard definition of asynchronicity used in distributed algorithms [3, 11]; however, this is not needed in this paper, so we do not pursue this topic here.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Nicolescu, R. (2018). Most Common Words – A cP Systems Solution. In: Gheorghe, M., Rozenberg, G., Salomaa, A., Zandron, C. (eds) Membrane Computing. CMC 2017. Lecture Notes in Computer Science, vol 10725. Springer, Cham. https://doi.org/10.1007/978-3-319-73359-3_14

  • DOI: https://doi.org/10.1007/978-3-319-73359-3_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73358-6

  • Online ISBN: 978-3-319-73359-3

  • eBook Packages: Computer Science (R0)
