<h1>Vector Similarity Search with PostgreSQL and pgvector</h1>
<p><em>2023-08-13 · <a href="http://blog.geoffc.nz/pgvector">http://blog.geoffc.nz/pgvector</a></em></p>
<p><a href="https://github.com/pgvector/pgvector">pgvector</a> adds vector similarity search to <a href="https://www.postgresql.org/">PostgreSQL</a>.</p>
<p>In this post I cover a quick experiment: creating embeddings for text, then storing and searching them in
PostgreSQL. This makes it easy to query structured and unstructured data side by side in PostgreSQL. The code for the experiment
is available at <a href="https://github.com/gclitheroe/exp">https://github.com/gclitheroe/exp</a>.</p>
<p>Embeddings capture how related pieces of information are - in this case text. pgvector lets us store the embeddings in a vector type
in Postgres and query how similar they are. Here I’m using cosine distance: the smaller the angle between
a pair of vectors, the closer together they are and the more likely they are to have similar meaning.</p>
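<p>To make that concrete, here is a small standalone illustration (in Go, purely for illustration - this is not code from the experiment) of the cosine distance that pgvector’s <code class="language-plaintext highlighter-rouge"><=></code> operator computes:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"fmt"
	"math"
)

// cosineDistance returns 1 - cosine similarity, which is what pgvector's
// <=> operator computes. 0 means the vectors point in the same direction,
// 1 means they are orthogonal, 2 means they point in opposite directions.
func cosineDistance(a, b []float64) float64 {
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	return 1 - dot/(math.Sqrt(normA)*math.Sqrt(normB))
}

func main() {
	// Toy 3-dimensional vectors; real gte-small embeddings have 384 dimensions.
	fmt.Println(cosineDistance([]float64{1, 0, 0}, []float64{1, 0, 0})) // 0 - identical
	fmt.Println(cosineDistance([]float64{1, 0, 0}, []float64{0, 1, 0})) // 1 - unrelated
}
</code></pre></div></div>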
<p>I used a couple of data sets from <a href="https://www.kaggle.com/">https://www.kaggle.com/</a>:</p>
<ul>
<li><a href="https://www.kaggle.com/datasets/suraj520/customer-support-ticket-dataset">Customer Support Ticket Dataset</a></li>
<li><a href="https://www.kaggle.com/datasets/davidshinn/github-issues">GitHub Issues</a></li>
</ul>
<p>To create embeddings from the text I used Python and the <a href="https://huggingface.co/thenlper/gte-small">gte-small</a> model:</p>
<ul>
<li>Doing well in the <a href="https://huggingface.co/spaces/mteb/leaderboard">Massive Text Embedding Benchmark (MTEB) Leaderboard</a>.</li>
<li>Trained on English text only.</li>
<li>Input will be truncated to 512 tokens.</li>
<li>Embeddings have 384 dimensions.</li>
<li>Storing an embedding in a Postgres vector type uses <code class="language-plaintext highlighter-rouge">4 * dimensions + 8 bytes</code>. In this case 1544 bytes per embedding.</li>
<li>Other models are available that have been trained on multi-lingual input. They generate embeddings with more dimensions.</li>
</ul>
<p>With the source text and embeddings stored in the database it is then easy to query them using SQL and the additional
pgvector operators. I also stored embeddings for sample search queries to make testing easier. In an application these would be computed from user input.</p>
<p>The <code class="language-plaintext highlighter-rouge">items</code> table holds 8469 support tickets with associated embeddings. Queries perform well without an index.</p>
<p>The top results are about software problems even though the exact grammar and content of the phrases differ.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT description, embedding <=> (SELECT embedding FROM search WHERE term = 'software problem') AS cos
FROM items ORDER BY cos ASC;
-[ RECORD 1 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
description | I'm having an issue with the {product_purchased}. Please assist. +
| +
| 1) If you want new (not already installed) software, you may need to use: +
| +
| 1.) Windows 7 Professional. +
| +
| 2.) This problem started occurring after the recent software update. I haven't made any other changes to the device.
cos | 0.10983020430459844
-[ RECORD 2 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
description | I'm having an issue with the {product_purchased}. Please assist. +
| +
| +
| I have the product purchased as a full time job. I have used the software and it has worked so far and I am satisfied! A few months ago I need assistance as soon as possible because it's affecting my work and productivity.
cos | 0.11112324098668158
-[ RECORD 3 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
description | I'm having an issue with the {product_purchased}. Please assist. +
| +
| I want to give your company a free demo program. Please help me create this program. +
| +
| I want your support. Please add your name. I'm worried that the issue might be hardware-related and might require repair or replacement.
cos | 0.11770975873161138
-[ RECORD 4 ]----
...
</code></pre></div></div>
<p>Using a LIKE query on the raw text doesn’t yield any results, although I didn’t spend any time trying to turn this into a better
phrase query.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT description FROM items WHERE description LIKE '%software problem%';
(0 rows)
</code></pre></div></div>
<p>There is no specific mention of dogs laying in the sun but the search does find some possible matches including ‘pet’ in an email address.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT description, embedding <=> (SELECT embedding FROM search WHERE term = 'dog laying in the sun') AS cos
FROM items ORDER BY cos ASC;
-[ RECORD 1 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
description | I'm having an issue with the {product_purchased}. Please assist. +
| +
| I want a picture of your dog. Please come and visit me soon. +
| +
| I'll keep the pictures. Please come to me soon. I've checked for any available software updates for my {product_purchased}, but there are none.
cos | 0.20796826662189327
-[ RECORD 2 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
description | I'm having an issue with the {product_purchased}. Please assist. 1-800-859-7267 2 e-mail us at tips@pet-babe.us for questions or to try out this product if you I've tried different settings and configurations on my {product_purchased}, but the issue persists.
cos | 0.2115174712052278
-[ RECORD 3 ]--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
description | I'm having an issue with the {product_purchased}. Please assist. +
| +
| We have two customers: +
| +
| Carnivorous pet! +
| +
| Grenadine (Grizzly) +
| +
| Kelica ( I've tried different settings and configurations on my {product_purchased}, but the issue persists.
cos | 0.2263596774337866
-[ RECORD 4 ]
...
</code></pre></div></div>
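<p>The queries above run in psql, but any Postgres client can issue them. As a rough sketch of what application code could look like, here is a hedged Go example using the pgx driver; the connection string is a placeholder and the table and column names assume the experiment’s schema:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"context"
	"fmt"
	"log"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()

	// Placeholder connection string.
	conn, err := pgx.Connect(ctx, "postgres://postgres:password@localhost:5432/embeddings")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Rank tickets by cosine distance to a stored search-term embedding.
	rows, err := conn.Query(ctx,
		`SELECT description, embedding <=> (SELECT embedding FROM search WHERE term = $1) AS cos
		 FROM items ORDER BY cos ASC LIMIT 3`, "software problem")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var description string
		var cos float64
		if err := rows.Scan(&description, &cos); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%.4f %.60s\n", cos, description)
	}
}
</code></pre></div></div>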
<p>The <code class="language-plaintext highlighter-rouge">issues</code> table holds 1,000,000 GitHub issues with associated embeddings. Query performance can be greatly
improved by adding an index to group the embeddings into lists that are probed at query time. This increases performance
but can reduce recall.</p>
<p>With no index a full table scan is needed and cosine is calculated for all rows.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>EXPLAIN ANALYSE SELECT description, embedding <=> (SELECT embedding FROM search WHERE term = 'software problem') AS cos
FROM issues ORDER BY cos ASC LIMIT 100;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=217132.44..217144.11 rows=100 width=365) (actual time=735.190..743.605 rows=100 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on search (cost=0.00..20.12 rows=4 width=32) (actual time=0.031..0.032 rows=1 loops=1)
Filter: (term = 'software problem'::text)
Rows Removed by Filter: 3
-> Gather Merge (cost=217112.32..314241.30 rows=832476 width=365) (actual time=728.648..737.053 rows=100 loops=1)
Workers Planned: 2
Params Evaluated: $0
Workers Launched: 2
-> Sort (cost=216112.29..217152.89 rows=416238 width=365) (actual time=702.454..702.461 rows=77 loops=3)
Sort Key: ((issues.embedding <=> $0))
Sort Method: top-N heapsort Memory: 108kB
Worker 0: Sort Method: top-N heapsort Memory: 115kB
Worker 1: Sort Method: top-N heapsort Memory: 108kB
-> Parallel Seq Scan on issues (cost=0.00..200203.97 rows=416238 width=365) (actual time=3.581..642.596 rows=333333 loops=3)
Planning Time: 0.220 ms
JIT:
Functions: 19
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 4.064 ms, Inlining 0.000 ms, Optimization 1.492 ms, Emission 15.255 ms, Total 20.811 ms
Execution Time: 746.185 ms
</code></pre></div></div>
<p>Adding an index to issues speeds up queries by using approximate nearest neighbor search. This trades some recall for performance. The index should be created once there is already some data in the table, so that the lists reflect the distribution of the data.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SET maintenance_work_mem TO '512 MB';
CREATE INDEX ON issues USING ivfflat (embedding vector_cosine_ops) WITH (lists = 1000);
</code></pre></div></div>
<p>With the index in place queries are significantly faster. Set the number of search probes to roughly the square root of the number of lists (sqrt(1000) ≈ 32, hence 35 here). If it is set equal to the number of lists the index won’t be used.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SET ivfflat.probes = 35;
EXPLAIN ANALYSE SELECT description, embedding <=> (SELECT embedding FROM search WHERE term = 'software problem') AS cos
FROM issues ORDER BY cos ASC LIMIT 100;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=17835.12..17848.26 rows=100 width=365) (actual time=74.305..74.560 rows=100 loops=1)
InitPlan 1 (returns $0)
-> Seq Scan on search (cost=0.00..20.12 rows=4 width=32) (actual time=0.026..0.029 rows=1 loops=1)
Filter: (term = 'software problem'::text)
Rows Removed by Filter: 3
-> Index Scan using issues_embedding_idx on issues (cost=17815.00..149137.00 rows=1000000 width=365) (actual time=74.302..74.547 rows=100 loops=1)
Order By: (embedding <=> $0)
Planning Time: 0.260 ms
Execution Time: 74.620 ms
</code></pre></div></div>
<p>pgvector has made it easy to store and search embedding data in PostgreSQL. With the current explosion
of advances in AI a new world of possibilities is opening up.</p>
<hr />
<h1 id="useful-references">Useful References</h1>
<ul>
<li><a href="https://cloud.google.com/blog/topics/developers-practitioners/meet-ais-multitool-vector-embeddings">Meet AI’s multitool: Vector embeddings</a></li>
<li><a href="https://cloud.google.com/blog/topics/developers-practitioners/find-anything-blazingly-fast-googles-vector-search-technology">Find anything blazingly fast with Google’s vector search technology</a></li>
<li><a href="https://platform.openai.com/docs/guides/embeddings/what-are-embeddings">What are embeddings?</a></li>
<li><a href="https://medium.com/mlearning-ai/embedding-similarity-search-25c6911240af">Embedding similarity search</a></li>
<li><a href="https://supabase.com/blog/openai-embeddings-postgres-vector">Storing OpenAI embeddings in Postgres with pgvector</a></li>
<li><a href="https://supabase.com/blog/fewer-dimensions-are-better-pgvector">pgvector: Fewer dimensions are better</a></li>
</ul>
<h1>Kafka, ksqlDB, and Earthquakes</h1>
<p><em>2023-06-02 · <a href="http://blog.geoffc.nz/kafka-ksqldb-quakes">http://blog.geoffc.nz/kafka-ksqldb-quakes</a></em></p>
<p><a href="https://kafka.apache.org/">Kafka</a> is a distributed data streaming technology. <a href="https://kafka.apache.org/documentation/streams/">Kafka Streams</a>
and more recently <a href="https://ksqldb.io/">ksqlDB</a> make it easy to build applications that respond immediately to events.</p>
<p>Streams and tables build on top of topics in brokers. Topics live in the storage layer. Streams and tables live in the processing layer. See <a href="https://www.confluent.io/blog/kafka-streams-tables-part-1-event-streaming/">Streams and Tables in Apache Kafka: A Primer</a>.</p>
<p>In this blog I’m going to walk through my experiment with Kafka and ksqlDB using earthquake location information
for New Zealand from <a href="https://geonet.org.nz">GeoNet</a>. You can follow along with the code from <a href="https://github.com/gclitheroe/exp">https://github.com/gclitheroe/exp</a>.
Earthquake location information is a useful data set to experiment with. Earthquake locations evolve as more data arrives
so there are many updates. For example, for the earthquake <a href="https://www.geonet.org.nz/earthquake/2023p122368">2023p122368</a> there are 134 updates.</p>
<p><img src="/images/quakes-kafka.png" alt="" /></p>
<p>Seismic data is continuously processed by the earthquake location system and as more data arrives new earthquake locations are
made available. We will work with these locations (as files on disk), send some to a topic and then spend most of the time using streams and
tables in ksqlDB to work with the data.</p>
<p><img src="/images/kafka.png" alt="" /></p>
<h2 id="setup">Setup</h2>
<p>Install Go, Docker, and Docker Compose:</p>
<ul>
<li>Go <a href="https://go.dev/doc/install">https://go.dev/doc/install</a></li>
<li>Docker <a href="https://docs.docker.com/engine/install/">https://docs.docker.com/engine/install/</a></li>
<li>Docker Compose <a href="https://docs.docker.com/compose/install/">https://docs.docker.com/compose/install/</a></li>
</ul>
<p>Set up the <a href="https://pkg.go.dev/cmd/go#hdr-GOPATH_environment_variable">standard Go directories</a>
and clone the code:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mkdir -p $(go env GOPATH)/src/github.com/gclitheroe
cd $(go env GOPATH)/src/github.com/gclitheroe
git clone https://github.com/gclitheroe/exp
cd exp
</code></pre></div></div>
<p>Bring up the Confluent Kafka platform in Docker (see also <a href="https://ksqldb.io/quickstart-platform.html#quickstart-content">ksqlDB Quickstart</a>).
This provides Kafka along with a schema registry and ksqlDB. There is a Docker Compose file in the <code class="language-plaintext highlighter-rouge">kafka</code> directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker-compose up -d
</code></pre></div></div>
<p>Visit the control center at <a href="http://localhost:9021/">http://localhost:9021/</a> (it can take a moment to start) and navigate to the control center cluster.
Create a topic called <code class="language-plaintext highlighter-rouge">quake</code> with protobuf schemas for the key and value. Use <code class="language-plaintext highlighter-rouge">protobuf/quake/quake.proto</code> for the value
and <code class="language-plaintext highlighter-rouge">protobuf/quake/key.proto</code> for the key. These define the schemas for sending and querying quake information.</p>
<h2 id="send-quake-events">Send Quake Events</h2>
<p>There is demo location information for two earthquakes. These are binary files in protobuf format that match the schemas
used for the <code class="language-plaintext highlighter-rouge">quake</code> topic. The files were created from SeisComPML using <code class="language-plaintext highlighter-rouge">cmd/sc3ml2quake</code>. See also <a href="https://blog.geoffc.nz/protobufs-go/">Protobufs With Go</a>.</p>
<p>In the <code class="language-plaintext highlighter-rouge">cmd/quake-producer-kafka</code> dir:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>go build
./quake-producer-kafka -input-dir demo-data/2023p007281
./quake-producer-kafka -input-dir demo-data/2023p122368
</code></pre></div></div>
<p>In the <code class="language-plaintext highlighter-rouge">cmd/quake-consumer-kafka</code> dir:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>go build
./quake-consumer-kafka
</code></pre></div></div>
<p>This will echo send and receive information to the terminal. This producer and consumer pattern is at the heart of many event driven
microservices.</p>
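<p>As a sketch of what the producer side boils down to, here is a simplified stand-in (not the repo’s actual code - the real producer sends protobuf-encoded keys and values registered with the schema registry) using the confluent-kafka-go client:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"log"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
	p, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
	if err != nil {
		log.Fatal(err)
	}
	defer p.Close()

	topic := "quake"

	// In the real producer the key and value are protobuf-encoded Key and Quake messages.
	err = p.Produce(&kafka.Message{
		TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
		Key:            []byte("2023p122368"),
		Value:          []byte("placeholder payload"),
	}, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Wait for outstanding deliveries before shutting down.
	p.Flush(15 * 1000)
}
</code></pre></div></div>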
<p>Stop <code class="language-plaintext highlighter-rouge">quake-producer-kafka</code>, we don’t need it anymore.</p>
<h2 id="streams-and-tables-with-ksqldb">Streams and Tables with ksqlDB</h2>
<p>Start the ksqldb-cli container - we will use this to interact with ksqlDB via SQL. All the following commands are run
at the terminal prompt in this container.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker exec -it ksqldb-cli ksql http://ksqldb-server:8088
</code></pre></div></div>
<p>Begin by telling ksqlDB to start all queries from the earliest point in each topic:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SET 'auto.offset.reset'='earliest';
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">quake</code> topic is an immutable log of facts with all quake events stored in it. We can materialise this into a mutable
table that has the latest information (the last message sent to Kafka) for each quake using the
<a href="https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/create-table/">CREATE SOURCE TABLE</a> statement. This creates a
materialised view table using the full fact and the schema from the registry. See <a href="https://docs.ksqldb.io/en/latest/how-to-guides/convert-changelog-to-table/">How to convert a changelog to a table</a> for more information.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE SOURCE TABLE quake_latest (
quake_id STRING PRIMARY KEY
) WITH (
kafka_topic = 'quake',
format = 'protobuf',
value_schema_full_name = 'quake.Quake'
);
</code></pre></div></div>
<p>This table can be queried. See <a href="https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/operators/">Operators</a> for
dereferencing structs and <a href="https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/scalar-functions/">Scalar functions</a> for
working with values.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SELECT public_id,
FORMAT_TIMESTAMP(FROM_UNIXTIME(time->secs * 1000), 'yyyy-MM-dd''T''HH:mm:ss') AS time,
magnitude
FROM quake_latest;
+---------------------------+---------------------------+---------------------------+
|PUBLIC_ID |TIME |MAGNITUDE |
+---------------------------+---------------------------+---------------------------+
|2023p007281                |2023-01-03T16:39:21        |5.109284389                |
|2023p122368                |2023-02-15T06:38:10        |5.976956766                |
</code></pre></div></div>
<p>We can create a stream from the <code class="language-plaintext highlighter-rouge">quake</code> topic using <a href="https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/create-stream/">CREATE STREAM</a>. This creates
a stream of data backed by the <code class="language-plaintext highlighter-rouge">quake</code> topic. It can be queried and processed using SQL and also
used to materialise additional tables. New operations on the stream start from the beginning (the first event in the underlying topic)
because we earlier set <code class="language-plaintext highlighter-rouge">SET 'auto.offset.reset'='earliest';</code>. Any new quake events sent to <code class="language-plaintext highlighter-rouge">quake</code> will also appear on the
stream and in any downstream queries or tables.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE STREAM quake_stream (
quake_id STRING KEY
) WITH (
kafka_topic = 'quake',
format = 'protobuf',
value_schema_full_name = 'quake.Quake',
key_schema_full_name = 'quake.Key'
);
</code></pre></div></div>
<p>The stream can be queried directly, although this is query-on-read and can involve reading all messages in the stream.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>select public_id from quake_stream limit 3;
+-------------------------------------------------------------------------------------+
|PUBLIC_ID |
+-------------------------------------------------------------------------------------+
|2023p007281 |
|2023p007281 |
|2023p007281
</code></pre></div></div>
<p>We can create a table that is updated every time a new event appears on the stream (query on write). This is much faster
to query on read and in this case only contains the latest information for each quake. This is similar to the table we made
earlier with <code class="language-plaintext highlighter-rouge">CREATE SOURCE TABLE</code> although it is made from the stream, and we select a smaller set of information. There
has to be a projection over the stream and an aggregation on the value fields, in this case <code class="language-plaintext highlighter-rouge">GROUP BY</code> and <code class="language-plaintext highlighter-rouge">LATEST_BY_OFFSET</code>;
see <a href="https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/aggregate-functions/">Aggregation functions</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE TABLE quake_latest_from_stream AS
SELECT public_id,
LATEST_BY_OFFSET(time->secs) AS time,
LATEST_BY_OFFSET(magnitude) AS magnitude,
LATEST_BY_OFFSET(depth) AS depth
FROM quake_stream
GROUP BY public_id
EMIT CHANGES;
select * from quake_latest_from_stream;
+-------------------+-------------------+-------------------+-------------------+
|PUBLIC_ID |TIME |MAGNITUDE |DEPTH |
+-------------------+-------------------+-------------------+-------------------+
|2023p007281 |1672763961 |5.109284389 |6.732970715 |
|2023p122368 |1676443090 |5.976956766 |54.28686523 |
</code></pre></div></div>
<p>If we want a smaller set of information materialised in the table this is easy to do with a <code class="language-plaintext highlighter-rouge">WHERE</code> clause:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>drop table quake_latest_from_stream;
CREATE TABLE quake_latest_from_stream AS
SELECT public_id,
LATEST_BY_OFFSET(time->secs) AS time,
LATEST_BY_OFFSET(magnitude) AS magnitude,
LATEST_BY_OFFSET(depth) AS depth
FROM quake_stream
WHERE public_id = '2023p122368'
GROUP BY public_id
EMIT CHANGES;
select * from quake_latest_from_stream;
+-------------------+-------------------+-------------------+-------------------+
|PUBLIC_ID |TIME |MAGNITUDE |DEPTH |
+-------------------+-------------------+-------------------+-------------------+
|2023p122368 |1676443090 |5.976956766 |54.28686523 |
</code></pre></div></div>
<p>It is also easy to create a stream with a smaller set of fields in it. This can be queried and could be used to create other tables.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE STREAM quake_stream_filtered AS
SELECT
quake_id KEY,
time,
magnitude
FROM quake_stream
EMIT CHANGES;
select * from quake_stream_filtered limit 3;
+----------------------------------+----------------------------------+----------------------------------+
|KEY |TIME |MAGNITUDE |
+----------------------------------+----------------------------------+----------------------------------+
|2023p007281 |{SECS=1672763961, NANOS=234110000}|4.942331787 |
|2023p007281 |{SECS=1672763961, NANOS=642158000}|4.977927 |
|2023p007281 |{SECS=1672763961, NANOS=642158000}|5.123278328 |
</code></pre></div></div>
<p>If you would like to experiment with more data the <a href="https://github.com/gclitheroe/exp/releases/tag/quake-protobuf">quake-protobufs</a> release has a tar file
<code class="language-plaintext highlighter-rouge">quake-2020.tar.gz</code>. It contains 304510 update files for 22355 earthquakes in New Zealand during 2020.<br />
Download and extract this file and then run:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>quake-producer-kafka path.../quake-2020
</code></pre></div></div>
<p>We’re done. Exit the ksqldb-cli with <code class="language-plaintext highlighter-rouge">ctrl-d</code> and stop the cluster with <code class="language-plaintext highlighter-rouge">docker-compose down</code>.</p>
<p>Kafka and ksqlDB provide an easy and powerful way to build streaming applications, bringing together the power of streams and
database tables.</p>
<p><em>The New Zealand GeoNet programme and its sponsors EQC, GNS Science, LINZ, NEMA and MBIE are acknowledged for providing data used in this repo.</em>
<em>GeoNet doesn’t use Kafka although it does do a lot of data streaming.</em></p>
<h1>(All a) Flutter</h1>
<p><em>2019-12-28 · <a href="http://blog.geoffc.nz/flutter">http://blog.geoffc.nz/flutter</a></em></p>
<p>I have been intending to try <a href="https://flutter.dev/">Flutter</a> for quite a while and I recently got around to it. I liked it. A lot.</p>
<p>Flutter, from Google, allows you to build native compiled applications for mobile, web, and desktop from a single codebase. Sounds too good to be true? Here’s my test app, using earthquake data for New Zealand from the GeoNet API. The test app is running as a native application on Android and iOS as well as a Single Page Application in the browser. All from the same codebase. Couple this with the superb stateful <a href="https://flutter.dev/docs/development/tools/hot-reload">hot reloading</a> and ease of testing and I’m left feeling like an incredibly productive developer.</p>
<p><img src="/images/flutter.jpg" alt="" /></p>
<h3 id="try-it-out">Try it Out</h3>
<p>If you would like to try it out, the code for my test application is here <a href="https://github.com/gclitheroe/geonet_flutter">https://github.com/gclitheroe/geonet_flutter</a></p>
<h3 id="things-to-try">Things to Try</h3>
<p>Here are some things you could try adding to the application.</p>
<h4 id="volcano-alert-levels">Volcano Alert Levels</h4>
<p>Add a screen showing Volcano Alert Levels (VAL) for New Zealand.</p>
<ul>
<li>Use the <a href="https://api.geonet.org.nz/volcano/val">VAL data</a> from the GeoNet API.</li>
<li>The VAL are described <a href="https://www.geonet.org.nz/about/volcano/val">here</a>.</li>
<li>This will involve adding some <a href="https://flutter.dev/docs/development/ui/navigation">routing and navigation</a> to the application.</li>
</ul>
<h4 id="theme">Theme</h4>
<p>I’ve used the default theme and overridden it in a UI widget or two.</p>
<ul>
<li>Try creating a theme for the app in a more extendable way.</li>
<li>Try adding user preferences for a light or dark theme. BLoC might not be the best way to do this. Try a stateful widget as well.</li>
</ul>
<h4 id="desktop">Desktop</h4>
<p>I haven’t run the application as a desktop app. Give it a try.</p>
<h3 id="things-to-think-about">Things to Think About</h3>
<p>So, should you be using Flutter? Here are some things to consider.</p>
<ul>
<li>If you’re new to mobile development then jump in and give it a go. You can be productive and learn a lot quickly. Flutter’s <a href="https://flutter.dev/docs/get-started/install">getting started</a> docs are excellent. There is a lot of underlying platform complexity that is taken care of, and you can dip into this later if you need to.</li>
<li>Think about the third party APIs you need. If there is not already a Dart package you may need to write one or use <a href="https://flutter.dev/docs/development/platform-integration/platform-channels">platform integration</a>.</li>
<li>Flutter is suited to ‘brand first’ apps. There are widget packages for building apps that use platform specific (Android and iOS) components. Maintaining multiple UI views may start to negate the productivity gains.</li>
<li>If you currently program for mobile, you are already using between one and four programming languages (at least) and multiple UI design and testing tools. You might need to keep that complexity, or you might really enjoy reducing it.</li>
<li>What are you going to do when the next player enters the mobile landscape with a new language and paradigm?</li>
<li>Longevity is hard to predict. It’s worth considering, but the only way to ensure platform stability is probably to build your own mobile device.</li>
</ul>
<h3 id="the-full-story">The Full Story</h3>
<p>I started with a quick prototype and then spent a bit more time understanding what a production application might look like.</p>
<h4 id="architecture">Architecture</h4>
<p>I’ve used the Business Logic Component (BLoC) architecture with the help of <a href="https://pub.dev/packages/bloc">bloc</a> and <a href="https://pub.dev/packages/flutter_bloc">flutter_bloc</a>. The UI emits events and responds to streams of state from the <a href="https://github.com/gclitheroe/geonet_flutter/blob/master/lib/bloc/quakes_bloc.dart">BLoC component</a>. This approach makes for very cleanly separated architecture which is easy to test.</p>
<h4 id="testing">Testing</h4>
<p>There is very good support for <a href="https://flutter.dev/docs/testing">testing</a>. Particularly good is the ability to test UI widgets without needing to use virtual devices.</p>
<h4 id="internationalisation">Internationalisation</h4>
<p>There is built in support for <a href="https://flutter.dev/docs/development/accessibility-and-localization/internationalization">internationalisation</a>. It took some effort to set up, but was then easy to add to. I only tackled strings - I did not consider the harder problem of Metric versus Imperial units.</p>
<h4 id="observability">Observability</h4>
<p>The <a href="https://pub.dev/packages/bloc">BLoC</a> package provides a BlocSupervisor and BlocDelegate that make it easy to add logging, analytics, and error handling in a single place.</p>
<h4 id="fun">Fun</h4>
<p>I first programmed for mobile in July 2010 after being given a Nexus One at a developer workshop. It was exciting and also profoundly disappointing; Java, XML, slow testing, and clumsy tooling made for an often frustrating developer experience. Having to write applications twice to cover Android and iOS was a hard sell and tedious work. Flutter gets rid of that duplication. Working with Flutter was fun and I was more productive. Give it a go, see what you think.</p>
<h3 id="resources">Resources</h3>
<p>I found these resources particularly useful.</p>
<ul>
<li><a href="https://flutter.dev/docs">https://flutter.dev/docs</a></li>
<li><a href="https://medium.com/flutter-community/flutter-bloc-pattern-for-dummies-like-me-c22d40f05a56">https://medium.com/flutter-community/flutter-bloc-pattern-for-dummies-like-me-c22d40f05a56</a></li>
<li><a href="https://www.didierboelens.com/2018/08/reactive-programming---streams---bloc/">https://www.didierboelens.com/2018/08/reactive-programming---streams---bloc/</a></li>
</ul>
<h1>gRPC - Our Web Service Future?</h1>
<p><em>2016-09-05 · <a href="http://blog.geoffc.nz/grpc">http://blog.geoffc.nz/grpc</a></em></p>
<p>gRPC is a modern RPC framework and it looks like the future for our web services.</p>
<p>A couple of recent announcements are exciting times for developing web services:</p>
<ul>
<li><a href="http://www.grpc.io/blog/gablogpost">gRPC is version 1.0</a></li>
<li>AWS launched the <a href="https://aws.amazon.com/blogs/aws/new-aws-application-load-balancer/">AWS Application Load Balancer</a> with support for HTTP/2</li>
</ul>
<p>I’ve been writing HTTP/1.1 web services using Go for about eighteen months. Along the way we started using <a href="https://developers.google.com/protocol-buffers/">Protocol Buffers</a> and found them to be great for defining and documenting the message format. gRPC adds service definitions for Remote Procedure Calls over HTTP/2 along with plenty of other niceness for writing services. I wrote a <a href="https://github.com/gclitheroe/grpc-exp">test project</a> to see what gRPC might look like for our uses. Here’s what I really liked:</p>
<ul>
<li>A <a href="https://github.com/gclitheroe/grpc-exp/blob/master/mtr-client/main.go">complete client</a> for my test gRPC service is 42 lines of code. It’s got token based auth, a single connection to the server with retries and back off, as well as message encoding and decoding. It uses two services (end points).</li>
<li>Comments in the protobuf service and message <a href="https://github.com/gclitheroe/grpc-exp/blob/master/protobuf/field/field.proto">definition</a> get turned into documentation in the generated libraries. The task of documenting an api just got that much easier.</li>
<li>Implementing the server is not a lot more complicated than the client. The greatest thing for me is how easy it is to add telemetry (metrics and logging) - there are interceptor types that can be implemented and added to the server, e.g., to add method timing:</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// telemetry is a UnaryServerInterceptor.
func telemetry(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
t := mtrapp.Start()
i, err := handler(ctx, req)
t.Track(info.FullMethod)
return i, err
}
</code></pre></div></div>
<p>Which is added to the server config:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
s := grpc.NewServer(grpc.UnaryInterceptor(telemetry))
...
</code></pre></div></div>
<p>In the test project I also added the option to read TLS certificates from the file system (mounted into the container at run time) or generate a self-signed TLS certificate on the fly. The option to generate self-signed certificates on the fly also makes it easy to do integration testing. It should also give options for deployment with the AWS Application Load Balancer using either of:</p>
<ul>
<li>TLS termination with a valid certificate on the EC2 instances and <a href="http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/https-tcp-passthrough.html">TCP passthrough</a> on the balancer.</li>
<li>TLS termination with a valid certificate on the balancer and re-encryption to the EC2 instances with a self-signed certificate for <a href="http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/configuring-https-endtoend.html">End-to-End Encryption</a></li>
</ul>
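<p>Either way, loading certificates from the file system into a Go gRPC server is only a few lines. A minimal sketch, with placeholder paths and a hypothetical service registration:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

func main() {
	// Placeholder paths - in the test project these are mounted into the container.
	creds, err := credentials.NewServerTLSFromFile("/etc/ssl/server.crt", "/etc/ssl/server.key")
	if err != nil {
		log.Fatal(err)
	}

	s := grpc.NewServer(grpc.Creds(creds))
	// Register services here, e.g. pb.RegisterFieldServer(s, &server{}) (hypothetical).

	lis, err := net.Listen("tcp", ":8443")
	if err != nil {
		log.Fatal(err)
	}
	if err := s.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
</code></pre></div></div>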
<p>Next steps for us are to pick a small service and try a full implementation and deployment. Something I’m excited to see happen soon.</p>
<p>I found the following resources really helpful:</p>
<ul>
<li>the <a href="http://www.grpc.io/docs/">gRPC docs</a></li>
<li>Kelsey Hightower’s <a href="https://github.com/kelseyhightower/grpc-hello-service">grpc-hello-service</a> repo was really useful, especially for understanding credentials (well worth a look for how to use JWT as well).</li>
<li>this <a href="https://coreos.com/blog/gRPC-protobufs-swagger.html">CoreOS blog</a> about handling gRPC and HTTP/1.1 JSON in the same service.</li>
</ul>
<p><em>Thanks to the gRPC team and AWS for their awesome recent releases. Also thanks to <a href="http://www.linkedin.com/in/rjpguest">Richard Guest</a> for discussions about HTTP/2 and TLS. The future is faster, will have fewer lines of code, and will be here soon.</em></p>
<h1>Simpler Application Configuration</h1>
<p><em>2016-05-09 · <a href="http://blog.geoffc.nz/cfg-go">http://blog.geoffc.nz/cfg-go</a></em></p>
<p>Configuration - doing it wrong.</p>
<p>When we switched to Go I unfortunately brought a fair amount of Java baggage with me. The idea of code being flexible and the complexity that goes with that crept in. Specifically in this pkg, <a href="https://github.com/GeoNet/cfg">cfg</a> - a Golang library for application configuration. I wrote the cfg pkg. It’s kinda clever. I don’t recommend you use it. I’m leaving the repo up as a reminder to myself that complexity is bad.</p>
<p>cfg lets you define configuration in JSON and override that from another file and/or environment vars. These features seemed important at the time. It uses <a href="https://golang.org/pkg/reflect/#StructTag">struct tags</a> and reflection. Then as we used it in more applications the complexity of the <a href="https://github.com/GeoNet/cfg/blob/master/cfg.go#L21">Config struct</a> spiraled. With this and the reflection we pushed one of the only (unknown) bugs to production in recent memory. It was hard to find. It shouldn’t have happened.</p>
<p>Now we read application configuration from environment variables. Only from environment variables. We don’t use a constructor argument approach to set them from some main, as this seemed prone to start up order issues. By using environment variables only there is no implication that the application could internally change its configuration. All application configuration is external.</p>
<p>Here’s an example. We log to <a href="https://logentries.com/">Logentries</a> using a small Go library - <a href="https://github.com/GeoNet/log">log</a>. Any code that needs to log to Logentries can import the log package for side effects:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">import</span> <span class="p">(</span>
<span class="n">_</span> <span class="s">"github.com/GeoNet/log/logentries"</span>
<span class="p">)</span>
</code></pre></div></div>
<p>If there is an environment variable <code class="language-plaintext highlighter-rouge">LOGENTRIES_TOKEN</code> set then the pkg switches out the log writer and logs to Logentries. All we’ve needed for config from environment variables is <a href="https://golang.org/pkg/os/#Getenv">Getenv</a>, <a href="https://golang.org/pkg/os/#ExpandEnv">ExpandEnv</a> and the very occasional string to number conversion.</p>
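<p>A typical pattern is a few lines near start up. This is an illustrative sketch with made-up variable names, not code from one of our apps:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"log"
	"os"
	"strconv"
)

func main() {
	// Fail fast at start up if required configuration is missing.
	if os.Getenv("DB_HOST") == "" {
		log.Fatal("DB_HOST must be set")
	}

	// The very occasional string to number conversion.
	maxConns, err := strconv.Atoi(os.Getenv("DB_MAX_CONNS"))
	if err != nil {
		log.Fatal("DB_MAX_CONNS must be set to an integer")
	}

	// ExpandEnv substitutes ${var} references - handy for composite
	// values like connection strings.
	connStr := os.ExpandEnv("host=${DB_HOST} user=${DB_USER} dbname=app sslmode=disable")

	log.Printf("%d max connections to %s", maxConns, connStr)
}
</code></pre></div></div>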
<p>For each application we define the requisite environment variables in the Docker <a href="https://docs.docker.com/engine/reference/commandline/run/#set-environment-variables-e-env-env-file">environment variable file format</a>. This is always done in a file called <code class="language-plaintext highlighter-rouge">env.list</code>. Using this file with Docker is easy. Using it without Docker or with CI (e.g., Travis) is also straightforward (there are examples at the end).</p>
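<p>For example, a hypothetical <code class="language-plaintext highlighter-rouge">env.list</code> for the sketch above might look like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># comments and blank lines are ignored
DB_HOST=localhost
DB_USER=app_r
DB_MAX_CONNS=30
LOGENTRIES_TOKEN=
</code></pre></div></div>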
<p>This is all we do now. Simple. Easy. Not flexible. Enough. I think this has been the longer lesson for me with Go - simplicity is vital, strive for it.</p>
<h2 id="appendix">Appendix</h2>
<p>When not running code in Docker we read the env.list file and use it to set environment variables (for local development etc) using this bash function whipped up by <a href="https://nz.linkedin.com/in/chris-leblanc-24085523">Chris LeBlanc</a> - a GeoNet Development team member:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">function </span>sourceenv <span class="o">{</span>
<span class="k">if</span> <span class="o">[</span> <span class="nv">$# </span><span class="nt">-lt</span> 1 <span class="o">]</span><span class="p">;</span> <span class="k">then
</span><span class="nb">echo</span> <span class="s2">"please supply a file to source env variables from"</span>
<span class="k">return </span>1
<span class="k">fi
for </span>i <span class="k">in</span> <span class="si">$(</span><span class="nb">cat</span> <span class="nv">$1</span> | <span class="nb">cut</span> <span class="nt">-f1</span> <span class="nt">-d</span><span class="s2">"#"</span> | xargs<span class="si">)</span><span class="p">;</span> <span class="k">do
</span><span class="nb">export</span> <span class="nv">$i</span>
<span class="k">done</span>
<span class="o">}</span>
</code></pre></div></div>
<p>For using the env.list file to test in Travis we can do something similar:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
script:
- export $(cat geonet-rest/env.list | grep = | xargs) && go test ./geonet-rest -v
...
</code></pre></div></div>
<h1>Protobufs With Go</h1>
<p><em>2016-03-29 · <a href="http://blog.geoffc.nz/protobufs-go">http://blog.geoffc.nz/protobufs-go</a></em></p>
<p>Faster and smaller - two important words when dealing with data.</p>
<p>I have been meaning to try Google’s <a href="https://developers.google.com/protocol-buffers/">Protocol Buffers</a> (protobuf) with <a href="https://golang.org/">Go</a> for quite a while. Structured data that’s smaller when serialized and faster to load, as well as code generation - what’s not to like? I tried out protobufs on some quake data and I was surprised by how much smaller and faster a protobuf version was - 35 times smaller and 180 times faster to unmarshal than the original data.</p>
<p>The code for my test is available here <a href="https://github.com/gclitheroe/exp">https://github.com/gclitheroe/exp</a></p>
<p>The source XML file is a <a href="http://geofon.gfz-potsdam.de/schema/0.7/sc3ml_0.7.xsd">SeisComPML</a> (XML) event file. It’s data for <a href="http://www.geonet.org.nz/quakes/2015p768477">this quake</a>. The same data is available in <a href="https://quake.ethz.ch/quakeml">QuakeML</a> format <a href="http://quakeml.geonet.org.nz/quakeml/1.2/2015p768477">here</a>. The QuakeML format is created by transforming the SeisComPML on the fly, and I’m interested in speed, so for this experiment I’ve started from the SeisComPML.</p>
<p>SeisComPML represents data for the entire process of locating an earthquake. I’m interested in displaying only part of this information. I’ve modeled the information I want in the file <a href="https://github.com/gclitheroe/exp/blob/master/protobuf/quake/quake.proto">protobuf/quake/quake.proto</a>. I’ll call this format Quake to differentiate it from the SeisComPML. Deciding which information I wanted and how to structure it was by far the most time consuming task.</p>
<p>From the quake.proto file I can use the protobuf compiler with <a href="https://github.com/golang/protobuf">Go support</a> to generate Go code. Code for other languages including Java, Objective C, and Python can be generated from the same quake.proto file.</p>
<p>Compiling the Go code for the quake protobuf looks like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>protoc --proto_path=protobuf/quake/ --go_out=quake protobuf/quake/quake.proto
</code></pre></div></div>
<p>I can then add some funcs to unmarshal the SeisComPML, remap it to my Quake protobuf and save it to disk. I’ve also output XML and JSON versions of the Quake file for comparison. There are tests to generate the files:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>go test ./quake ./seiscompml07
ok github.com/gclitheroe/exp/quake 0.041s
ok github.com/gclitheroe/exp/seiscompml07 0.050s
</code></pre></div></div>
<table class="table">
<caption>File size for each format.</caption>
<thead>
<tr>
<th>Size (bytes)</th>
<th>File Name</th>
<th>Format</th>
</tr>
</thead>
<tr><td>495917</td><td>seiscompml07/etc/2015p768477.xml</td><td>SeisComPML (XML)</td></tr>
<tr><td>113830</td><td>quake/etc/2015p768477.xml</td><td>Quake (XML)</td></tr>
<tr><td>99615</td><td>quake/etc/2015p768477.json</td><td>Quake (JSON)</td></tr>
<tr><td>14181</td><td>quake/etc/2015p768477.pb</td><td>Quake (protobuf)</td></tr>
</table>
<p>There is a significant drop in file size going from the SeisComPML to my Quake format as XML. This is not surprising as I’ve omitted most of the entity mapping (publicIDs) and creation information as well as some amplitude information from the original SeisComPML. The protobuf Quake file is 35 times smaller than the corresponding SeisComPML file. This drop in size will lead to large improvements in disk i/o and network transfer times.</p>
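<p>Working with the generated code is also pleasant. Here is a rough sketch of the marshal/unmarshal round trip, assuming the generated <code class="language-plaintext highlighter-rouge">quake.Quake</code> type; the import path and field names are my guesses at what protoc generates for this repo, not verified:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"fmt"
	"io/ioutil"
	"log"

	"github.com/gclitheroe/exp/quake" // assumed import path for the generated code
	"github.com/golang/protobuf/proto"
)

func main() {
	b, err := ioutil.ReadFile("quake/etc/2015p768477.pb")
	if err != nil {
		log.Fatal(err)
	}

	// Unmarshal the compact binary wire format into the generated struct.
	var q quake.Quake
	if err = proto.Unmarshal(b, &q); err != nil {
		log.Fatal(err)
	}
	fmt.Println(q.PublicId, q.Magnitude) // field names assumed from the .proto

	// Marshal it back to bytes.
	out, err := proto.Marshal(&q)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(out), "bytes")
}
</code></pre></div></div>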
<p>There are benchmark tests that unmarshal SeisComPML and the Quake files. The benchmarks unmarshal data from byte slices to avoid any bias from i/o. Unmarshalling the Quake protobuf is over 180 times faster than unmarshalling the complete SeisComPML; 0.163 ms per operation versus 30.270 ms. The Quake protobuf is also faster to unmarshal than the corresponding XML or JSON files. There is a Go benchmark test:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>go test -bench=. ./quake ./seiscompml07
</code></pre></div></div>
<table>
<caption>Unmarshal time for each format.</caption>
<thead>
<tr>
<th>ns/op</th>
<th>File Name</th>
<th>Format</th>
</tr>
</thead>
<tr><td>30269773</td><td>seiscompml07/etc/2015p768477.xml</td><td>SeisComPML (XML)</td></tr>
<tr><td>8545983</td><td>quake/etc/2015p768477.xml</td><td>Quake (XML)</td></tr>
<tr><td>1800593</td><td>quake/etc/2015p768477.json</td><td>Quake (JSON)</td></tr>
<tr><td>163473</td><td>quake/etc/2015p768477.pb</td><td>Quake (protobuf)</td></tr>
</table>
<p>It’s not really surprising that binary data is smaller and faster to work with than XML. I was a little surprised how much faster the protobuf is. I’m also stoked with how little effort it takes to make this gain. Coupled with the easy code generation, protobufs look like an approach worth investigating further.</p>
<h1>Logging from Go</h1>
<p><em>2015-03-26 · <a href="http://blog.geoffc.nz/logging-go">http://blog.geoffc.nz/logging-go</a></em></p>
<p>Logging from Go - much easier than in that other language.</p>
<p>I recently wrote about some of the <a href="/going-go">features of Go</a> that I really like for getting code into production. After my initial excitement there were a couple of things I had to check before being certain Go would be a winner. The first was logging. Can I easily get log messages from my app to where I can see them? For us that means the excellent Logentries service.</p>
<p>I’ve written before about reconfiguring Java syslogging in Mule to send to Logentries. It was hard-won ground and not an experience I would like to repeat. Logging from Java is a mess. Throw in multi-line stack traces caused by exceptions “bubbling up the stack” and getting a readable message into a syslog server is suddenly an exercise in frustration. There is an excellent blog about this from ZeroTurnaround - The State of Logging in Java 2013. An unsurprising finding is that 87% of respondents to a questionnaire have no real-time way of seeing logs from their applications in production. In 2013 there was already a confusing array of Java logging frameworks and facades. Figuring out how to make them play nicely with your application, your dev, test, and prod environments, and your syslog server is a problem that quickly goes into the basket marked Too Hard. By 2015 things in Java land have not got simpler. If you’re in that 87% of developers I sympathize - there is little to make logging from your application as easy as it should be.</p>
<p>It turns out logging from Go is so easy I was left wondering how it ever got so hard in that other language. There are two core Go packages.</p>
<ul>
<li>log - a simple logging package.</li>
<li>log/syslog - an interface to the system log service.</li>
</ul>
<p>Check the Go example for using the log package. It writes to stderr; if you’re a Twelve-Factor App acolyte then you are nearly done (you will need to switch stderr to stdout). We like to go a step further and have an application in production send its log messages straight to Logentries.</p>
<p>This is easy to achieve using the log and log/syslog packages using TCP or UDP:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"log"</span>
<span class="s">"log/syslog"</span>
<span class="p">)</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="n">w</span><span class="p">,</span> <span class="n">err</span> <span class="o">:=</span> <span class="n">syslog</span><span class="o">.</span><span class="n">Dial</span><span class="p">(</span><span class="s">"tcp"</span><span class="p">,</span> <span class="s">"api.logentries.com:10000"</span><span class="p">,</span> <span class="n">syslog</span><span class="o">.</span><span class="n">LOG_NOTICE</span><span class="p">,</span> <span class="s">"LE_TOKEN"</span><span class="p">)</span>
<span class="k">if</span> <span class="n">err</span> <span class="o">!=</span> <span class="no">nil</span> <span class="p">{</span>
<span class="n">log</span><span class="o">.</span><span class="n">Fatal</span><span class="p">(</span><span class="n">err</span><span class="p">)</span>
<span class="p">}</span>
<span class="n">log</span><span class="o">.</span><span class="n">SetOutput</span><span class="p">(</span><span class="n">w</span><span class="p">)</span>
<span class="n">log</span><span class="o">.</span><span class="n">Println</span><span class="p">(</span><span class="s">"Hello Logentries."</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The Logentries docs suggest the use of TLS for untrusted networks. When we’re running code in The Cloud and sending log messages to somewhere else in The Cloud, that means all networks. By using the crypto packages (also in the Go core) this is easy to achieve. It’s made even easier by the Go code being open source - I can largely copy the experts by reading the syslog code.</p>
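<p>A minimal sketch of a TLS connection for logging looks like the following; the host and port here are placeholders, so check the Logentries docs for the real TLS endpoint:</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"crypto/tls"
	"log"
)

func main() {
	// Placeholder endpoint - use the TLS host:port from the Logentries docs.
	conn, err := tls.Dial("tcp", "api.logentries.com:20000", &tls.Config{})
	if err != nil {
		log.Fatal(err)
	}

	// Logentries expects the token at the start of each log line.
	log.SetPrefix("LE_TOKEN ")
	log.SetOutput(conn)
	log.Println("Hello Logentries, over TLS.")
}
</code></pre></div></div>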
<p>log/logentries is a Go package I wrote that makes it easy to reconfigure log to send messages to Logentries. Before you jump in it’s worth understanding my requirements:</p>
<ul>
<li>I’m usually most interested in log messages during app start up (when configuration errors tend to show up).</li>
<li>Once an app is up and running we use metrics (not log messages) to track application performance.</li>
<li>I don’t ever want an app to block on logging.</li>
<li>If Logentries is not available for some time I’m happy to log to stderr and then manually retrieve the logs later if I really need them. An alternative would be to store and forward.</li>
<li>A logger is for logging and a debugger is for, well, debugging. If you have to use logging for debugging then DELETE those calls before you commit.</li>
<li>I want to deploy applications without having to set up a syslog server as well. Syslog servers and their configuration are an arcane art best left to a specialist.</li>
</ul>
<p>If you need more features, the code is open source. I hope it makes a useful starting point for you.</p>
<p>Sending our log message to Logentries from Go is now a simple case of calling one method during init. The rest of the application code is unchanged. During development we set an empty Logentries token. This causes the Init method to no-op and the app continues to log only to stderr. Here’s an example app, using the package, that will log to Logentries every five seconds. Create your own Logentries token and try it out.</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">package</span> <span class="n">main</span>
<span class="k">import</span> <span class="p">(</span>
<span class="s">"github.com/GeoNet/log/logentries"</span>
<span class="s">"log"</span>
<span class="s">"time"</span>
<span class="p">)</span>
<span class="k">func</span> <span class="n">init</span><span class="p">()</span> <span class="p">{</span>
<span class="c">// If there is an env var LOGENTRIES_TOKEN then call to Init is not needed.</span>
<span class="n">logentries</span><span class="o">.</span><span class="n">Init</span><span class="p">(</span><span class="s">"LOGENTRIES_TOKEN"</span><span class="p">)</span>
<span class="p">}</span>
<span class="k">func</span> <span class="n">main</span><span class="p">()</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">{</span>
<span class="n">log</span><span class="o">.</span><span class="n">Print</span><span class="p">(</span><span class="s">"Hello Logentries."</span><span class="p">)</span>
<span class="n">time</span><span class="o">.</span><span class="n">Sleep</span><span class="p">(</span><span class="n">time</span><span class="o">.</span><span class="n">Duration</span><span class="p">(</span><span class="m">5</span><span class="p">)</span> <span class="o">*</span> <span class="n">time</span><span class="o">.</span><span class="n">Second</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This is how easy logging should be. It’s important, I need it, but it shouldn’t be a battle. Go makes it easy to log what I need and spend my energy focusing on the business problem.</p>
<h1>Going Go</h1>
<p><em>2015-03-08 · <a href="http://blog.geoffc.nz/going-go">http://blog.geoffc.nz/going-go</a></em></p>
<p>Go - easy to deploy, fun to program with.</p>
<p>Last year I was trying out some of the cool features in Postgres and Postgis for generating <a href="http://www.postgresql.org/docs/9.3/static/functions-json.html">JSON</a> and <a href="http://postgis.net/docs/ST_AsGeoJSON.html">GeoJSON</a> in the database. I wanted to try these features for making a web service and I decided to try a new programming language for the task.</p>
<p>Here’s a teaser of the conclusion - more open data from GeoNet.</p>
<p>The image shows a slow slip event or silent earthquake recorded at the Gisborne GPS site.</p>
<p><img src="/images/plot.svg" alt="The image shows a slow slip event or silent earthquake recorded at the Gisborne GPS site." /></p>
<p>I had already run through the <a href="https://tour.golang.org/welcome/1">Tour of Go</a>. A lot has been written about Go syntax and language features. I liked what I saw well enough to do some investigation where the hard work often is - putting code into production.</p>
<h3 id="the-web-server">The Web Server</h3>
<p>We’ve been embedding <a href="http://eclipse.org/jetty/">Jetty</a> in our Java apps and deploying them via RPMs for a long time. I wanted to write a web service in Go and assumed I would need some sort of additional server application to deploy a Go app, or at the very least something to run in front of it. Nope! First mistake. There is a <a href="http://golang.org/pkg/net/http/#ListenAndServe">fully functioning web server</a> in the core libs.</p>
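<p>A complete, deployable web service from the core libs is only a handful of lines. A minimal example (not FITS itself):</p>
<div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"hello":"world"}`))
	})

	// ListenAndServe blocks, serving HTTP on port 8080.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
</code></pre></div></div>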
<h3 id="runtime">Runtime</h3>
<p>So I’m bound to need some kind of runtime on production, right? Nope! Go apps compile to a single binary.</p>
<p>Deploying a Go web application looks like this: <strong><em>Drop binary on system. Start it up. Job Done.</em></strong></p>
<h3 id="dependencies-and-compilation">Dependencies and Compilation</h3>
<p>There are a lot of powerful features in the core <a href="http://golang.org/pkg/">Go packages</a> but external dependencies are inevitable at some point. I expected some binary package management tool. Dependency repos in the clouds. Broken meta data. Version skew in production and general gnashing of teeth. Nope, nope, and nope! The Go compiler is so fast you compile all your code and its dependencies from source in single digit seconds or less. It’s straightforward to <a href="https://golang.org/cmd/go/">get</a> or <a href="https://github.com/tools/godep">vendor drop</a> dependencies for your code.</p>
<p>Given the C-like syntax, surely I was going to have to write a Makefile? Noooope. If you <a href="https://golang.org/doc/code.html#Organization">organise your code</a> as suggested the go tool can build it for you with no further help.</p>
<h3 id="sold-on-go">Sold on Go</h3>
<p>By this point I was hooked. No <a href="http://en.wikipedia.org/wiki/Matryoshka_doll">Russian Doll</a> JVM+Container+App deployment. No dependency hell. No compile and package time taking so long that I could forget what I was doing while I waited. It is so easy to compile and deploy Go code. It looks like someone really cares about making our jobs easier.</p>
<h3 id="a-real-project">A Real Project</h3>
<p>I turned the web services experiment into a real project - FITS (Field Time Series) - storing low data rate field data collected for the GeoNet project. FITS code is available on <a href="https://github.com/GeoNet/fits">GitHub</a>.</p>
<p>The fledgling first version of the <a href="http://fits.geonet.org.nz/api-docs">FITS api</a> is available to use now.</p>
<p>Here are those JSON and GeoJSON <a href="https://github.com/GeoNet/fits/blob/6827990e3c3fefa4a47145c370f4513bfc871819/site.go#L67">functions in action</a>. They return <a href="http://fits.geonet.org.nz/site?siteID=HOLD&networkID=CG">site</a> information as GeoJSON.</p>
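<p>The pattern is simple enough to sketch. This isn’t the FITS code - the table, columns, and connection string below are invented for illustration - but it shows the idea: Postgis builds the GeoJSON in the database and the Go handler just streams it through:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
    "database/sql"
    "log"
    "net/http"

    _ "github.com/lib/pq" // Postgres driver
)

var db *sql.DB

// siteHandler returns GeoJSON that Postgres+Postgis built for us -
// no marshalling in the application at all.
func siteHandler(w http.ResponseWriter, r *http.Request) {
    var geoJSON string
    err := db.QueryRow(
        `SELECT ST_AsGeoJSON(location) FROM site WHERE site_id = $1`,
        r.URL.Query().Get("siteID")).Scan(&geoJSON)
    if err != nil {
        http.Error(w, "site not found", http.StatusNotFound)
        return
    }
    w.Header().Set("Content-Type", "application/vnd.geo+json")
    w.Write([]byte(geoJSON))
}

func main() {
    var err error
    db, err = sql.Open("postgres", "postgres://localhost/fits?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    http.HandleFunc("/site", siteHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}
</code></pre></div></div>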
<p>Visualizing the data easily from FITS is important. SVG browser support is now very good and an SVG image handles well in a responsive web page. The image at the top of this post is SVG straight from the FITS api.</p>
<h3 id="deployment---docker">Deployment - Docker</h3>
<p>Having gone trendy with the code, I went full trendy with the deployment. It’s deployed in a <a href="https://www.docker.com/">Docker</a> container. The Docker build, including compiling the Go code in the container, is done in <a href="http://quay.io/">Quay.io</a>. The build is triggered from Github commits. The FITS Quay repo is <a href="https://quay.io/repository/geonet/fits">available</a>. Getting code to production this way is embarrassingly easy. Go and containers work together like two things that go really well together… for example, a glass of whisky and another glass of whisky.</p>
<p>FITS runs in production in Amazon Web Services (AWS) in the Sydney region. FITS uses <a href="http://aws.amazon.com/elasticbeanstalk/">AWS Elastic Beanstalk</a> and <a href="http://aws.amazon.com/rds/">AWS Relational Database Service</a> running Postgres+Postgis for the database.</p>
<h3 id="going-forward">Going Forward</h3>
<p>I’ve done a few projects in Go now. I still really like it. Go makes building and deploying code as easy as it should be. Most importantly to me - I’m really enjoying programming like I javan’t done for quite a while.</p>Go - easy to deploy, fun to program with.Logging - What’s Your App Doing Right Now?2014-05-13T00:00:00+00:002014-05-13T00:00:00+00:00http://blog.geoffc.nz/logging<p>Knowing how your application is performing in production should be really important to you.</p>
<p>I’ve written before about <a href="/metrics">gathering application metrics</a>. This post is about gathering log messages. I’ve started writing this post twice before and it quickly turned into an unhelpful rant about the state of Java logging. Third time’s a charm for sticking to the useful stuff without throwing my toys.</p>
<p>There is a list of what I want out of logging at the end. This may help you come up with a list of your own and also understand some of the decisions I’ve made.</p>
<p>At the moment I’m focused on logging from Mule ESB applications. This means I’m using log4j 1.x. I’m guessing you are as well. I want per application log files and the short host name to appear in the logs. I don’t want to repackage the application as I move it through dev, test, and prod.</p>
<p>I’m using <a href="https://logentries.com/">Logentries.com</a> for my log collector as they provide many of the features on my list. Importantly, their <a href="https://github.com/logentries/le_java">Java collector</a> is really nice and open source. If nothing else, use that so you get multi-line logging - no more mangled stack traces in your logs. As an added bonus their pricing is really nice. There are other SaaS options out there, or you could roll your own infrastructure with something like <a href="http://logstash.net/">Logstash</a>. For me this is not worth the time or overhead, and Logentries provide a far nicer service than I could ever set up in the time I have available.</p>
<p>So now to logging from a Mule ESB application. There are a couple of ways to shave this yak. The main thing I’ve had to work around is that log4j 1.x can include system properties in the logging setup but can’t read environment variables. The approach I’ve taken is:</p>
<ol>
<li>Write Spring config to reconfigure logging inside a Mule application.</li>
<li>Provide a Logentries token and other properties to the application via properties files.</li>
<li>Use a Spring profile to enable the logging in production.</li>
</ol>
<h3 id="1-the-spring-config">1. The Spring Config</h3>
<p>This is reusable between Mule applications so I pulled it out into its own library. It’s available on GitHub and our public Maven repo. Looking at the <a href="https://github.com/GeoNet/mule-logging/blob/master/src/main/resources/mule-logentries-logging.xml">config</a> it does the following:</p>
<ul>
<li>Provides a bean to look up the host name.</li>
<li>Provides a log4j pattern.</li>
<li>Removes any existing log4j appenders.</li>
<li>Adds an appender to send to Logentries.</li>
<li>Adds a per-application rolling file appender to mimic the file system log that Mule usually provides.</li>
</ul>
<p>Before you load this config your application needs to have the following properties defined:</p>
<ul>
<li>le.token - a valid token for logentries.com</li>
<li>app.name - the name of the Mule application.</li>
<li>app.version - the version of the Mule application.</li>
</ul>
<p>A shout out to Harry Lime for his answer on <a href="http://stackoverflow.com/questions/4400583/initializing-log4j-with-spring">Stack Overflow</a> that got me started on reconfiguring log4j with Spring.</p>
<h3 id="2-providing-the-properties">2. Providing the Properties</h3>
<h4 id="letoken">le.token</h4>
<p>We use the same approach with all our applications. We bundle default and empty properties with the application and then load an optional set of overrides from the file system. The override file is provided by Puppet in production and contains, amongst other things, the correct Logentries token for the environment.</p>
<h4 id="appname-and-appversion">app.name and app.version</h4>
<p>We build our Mule applications using our own <a href="https://github.com/GeoNet/gradle-mule-plugin">Gradle plugin</a> which writes a build-version.properties file into the application zip for us. The application version comes from our <a href="https://github.com/GeoNet/gradle-build-version-plugin">Gradle build version plugin</a>. If you don’t want to take the same approach then you could write them into your mule-config as global properties.</p>
<p>Loading these properties then looks like this. The load order is important to ensure that the file from /etc/sysconfig overrides the one bundled in the Mule app. Also setting <code class="language-plaintext highlighter-rouge">ignore-unresolvable="true"</code> is important as we don’t have the same properties in all our files.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><spring:beans>
<context:property-placeholder order="0" ignore-unresolvable="true" location="mule-heart-beat-producer.properties"/>
<context:property-placeholder order="-1" ignore-resource-not-found="true" ignore-unresolvable="true" location="file:/etc/sysconfig/mule-heart-beat-producer.properties"/>
<context:property-placeholder order="1" ignore-unresolvable="true" location="build-version.properties"/>
</spring:beans>
</code></pre></div></div>
<p>See the <a href="http://blogs.mulesoft.org/mule-meets-zuul-centralized-properties-management-part-1/">MuleSoft Blog</a> for a different approach to providing application properties.</p>
<h3 id="3-spring-profile">3. Spring Profile</h3>
<p>Once the properties are set all that’s left is to import the logging configuration. We do this in a Spring profile which is enabled in production.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <spring:beans profile="production">
<spring:import resource="classpath:mule-logentries-logging.xml"/>
</spring:beans>
</code></pre></div></div>
<h2 id="conclusion">Conclusion</h2>
<p>Here’s what some log messages look like at Logentries. Awesome stuff for two dependencies and a few lines of XML. The log messages are from our testing version of the back end services for a new version of the mobile quake applications that we are working on. Multiple location options and faster notifications coming soon to a phone near you.</p>
<p><img src="/images/logentries.png" alt="" /></p>
<h3 id="appendix-1---enabling-the-spring-profile-and-logs-from-mule">Appendix 1 - Enabling the Spring Profile and Logs from Mule</h3>
<p>I also want to see the logs from the Mule server itself at Logentries as well as be able to turn on the Spring profile in production. I want multi-line logging so I’m going to use the logentries-appender again. Here’s one way to do this. It requires repackaging your Mule server and adding some files on disk in production. The goal is to pass some extra System properties to the Mule JVM.</p>
<p>In the Mule distribution:</p>
<ul>
<li>Add the <a href="http://mvnrepository.com/artifact/com.logentries/logentries-appender">logentries-appender.jar</a> to lib/boot</li>
<li>Add an appender to conf/log4j.properties. This refers to System properties that we will set in a moment:</li>
</ul>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Default log level
log4j.rootCategory=INFO, console, le
log4j.appender.le=com.logentries.log4j.LogentriesAppender
log4j.appender.le.layout=org.apache.log4j.PatternLayout
log4j.appender.le.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss ZZZ} [${host.name}] %-5p: %F:%L %m
log4j.appender.le.Token=${logentries.token}
log4j.appender.le.Debug=False
log4j.appender.le.Ssl=True
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%-5p %d [%t] %c: %m%n
</code></pre></div></div>
<p>In the startup script bin/mule (adjust for your distro or O/S) set some additional environment variables:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#! /bin/sh
# Source networking configuration.
. /etc/sysconfig/network
# Source mule specific config.
if [ -e /etc/sysconfig/mule ]; then
. /etc/sysconfig/mule
fi
export HOSTNAME
export SPRING_PROFILE
export LOGENTRIES_TOKEN
</code></pre></div></div>
<p>Where /etc/sysconfig/mule contains:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>SPRING_PROFILE=production
LOGENTRIES_TOKEN=blah123abc
</code></pre></div></div>
<p>Finally, edit conf/wrapper.conf so that the environment variables are passed to the Mule JVM as System properties. Remember to increment the properties appropriately for your config file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>...
wrapper.java.additional.4=-Dspring.profiles.active="%SPRING_PROFILE%"
wrapper.java.additional.4.stripquotes=TRUE
wrapper.java.additional.5=-Dhost.name="%HOSTNAME%"
wrapper.java.additional.5.stripquotes=TRUE
wrapper.java.additional.6=-Dlogentries.token="%LOGENTRIES_TOKEN%"
wrapper.java.additional.6.stripquotes=TRUE
...
</code></pre></div></div>
<p>You can now repackage the Mule server for however you deploy it.</p>
<p>To enable logging from the Mule server to Logentries I need to provide a file, /etc/sysconfig/logentries.token, the only contents of which are a valid Logentries token. If you don’t want the Mule server to log to Logentries then remove this file or leave it empty (in which case the logentries-appender will throw debug errors). Alternatively, provide all environments with a separate logging token.</p>
<p>To set the Spring profile add the profile(s) required to the file /etc/sysconfig/spring.profile (e.g. ‘production’) and restart the Mule server.</p>
<h3 id="appendix-2---what-i-want-from-logging">Appendix 2 - What I Want From Logging</h3>
<ul>
<li>log messages in one place.</li>
<li>search.</li>
<li>alerting.</li>
<li>developers involved in logging from their apps, not sysadmins having to bolt it on for production.</li>
<li>multi-line logging - once you’ve had this for Java stack traces it’s hard to go back to syslog-mangled error messages.</li>
<li>a log message is a line of text - don’t force pseudo formats on me.</li>
<li>log direct from an application (no faffing with syslog please).</li>
<li>don’t block the application while logging.</li>
<li>don’t lose messages if possible.</li>
<li>logging that plays well with running in the cloud or on SaaS.</li>
<li>log to the local file system as well in case the central collector goes away.</li>
<li>a log per application.</li>
<li>search by host occasionally.</li>
<li>easy reconfiguration between dev, test, and prod.</li>
<li>no requirement to repackage my app for each environment.</li>
<li>some nice graphs would be good.</li>
<li>log everything that may be useful.</li>
<li>the option to archive logs for longer periods of time.</li>
<li>security.</li>
<li>application deployment markers.</li>
<li>fast ingestion of log messages into the collector.</li>
<li>someone else to run the collector for me.</li>
<li>beatings for people that use logging when they should use a debugger (and vice versa).</li>
<li><a href="https://www.youtube.com/watch?v=ERDUbAv8Qz0">The Moon on a stick</a>.</li>
</ul>
<p>This list is pretty long already. Fortunately I don’t also have to deal with:</p>
<ul>
<li>regulatory requirements for logging.</li>
<li>private data in the log messages.</li>
</ul>
<h3 id="appendix-3---a-challenge">Appendix 3 - A Challenge</h3>
<p>If this is like any other blog written about logging from Java then people reading this will doubtless mention a different framework or facade that I can throw in here to make my life magically easier. Logging in Java is already confusing enough. Simply name-checking another approach does not improve the situation. I need to get a small amount of text onto the network and into a collector. I shouldn’t have to perform multilayer classpath and configuration surgery to achieve that. If you must suggest alternatives then please do it with concrete examples of how your suggestion actually improves logging from Java in my situation. I doubt you will fit convincing, tested, and proven examples into blog comments.</p>
<p>Better yet - help get log4j2 over the line to a production release and into use on some popular projects. It has some really nice-looking features that I think will actually make logging from Java simpler. We’ve tested it in dev and liked what we saw. I expect it will be used in our next non-Mule project.</p>Knowing how your application is performing in production should be really important to you.Metrics - What’s Your App Doing Right Now?2014-03-10T00:00:00+00:002014-03-10T00:00:00+00:00http://blog.geoffc.nz/metrics<p>How is your application performing in production right now?</p>
<p>It’s been a long time between updates, work, busy, mumble mumble mumble.</p>
<p>In a minute I’m going to talk about monitoring Java and that’s enough reason to stop reading right now. Fair enough. But before you go:</p>
<ul>
<li>How is your application performing in production right now?</li>
<li>Are you meeting demand?</li>
<li>If you add more load to a server will it cause application problems?</li>
<li>Do you know why things fail when they do?</li>
</ul>
<p>Developers - you are not excused, DevOps is the new now. Code that is not in production and performing well is worthless. If instrumenting your code to collect metrics about its performance in production is not already part of your regular work it soon will be. If Java’s not your thing but you get a bit mumble mumble on the topic of how your code performs in production then scan the pictures below and then head on over to <a href="https://metrics.librato.com/">Librato Metrics</a> or <a href="https://www.hostedgraphite.com/">Hosted Graphite</a> and get busy. They don’t care what metrics you’re collecting; they just make it so easy that you have no excuse not to get on with it right now.</p>
<p>Here’s a picture. It shows performance metrics for earthquake messages going through the Mule ESB from the SeisComP3 earthquake location system to the web. There is also a heartbeat message tracked so that we know everything is still working between earthquakes. There is a lot of information here - not least that it takes about 10 times longer to insert messages into the database (akapp01) than it does to read them from the file system (akeqp01). This is the sort of information that makes targeting performance improvement, troubleshooting, or capacity planning easy. Without it you’re left guessing about what to do when something is not right with your application. Guessing only rarely leads to success.</p>
<p><img src="/images/sc3-esb-messages.png" alt="" /></p>
<p>The picture above is a Librato Metrics dashboard. Here’s one of similar information sent to Hosted Graphite.</p>
<p><img src="/images/hosted-graphite.png" alt="" /></p>
<p>Librato Metrics and Hosted Graphite both have strengths and weaknesses. Try both and see which suits your needs the best. One of the biggest differences, if you need it, is the ability to <a href="http://blog.librato.com/posts/next-generation-alerting">alert</a> on your metrics.</p>
<p>So to the Java bit. Monitoring the JVM and processes running in it usually involves using JMX. Accessing information via JMX in a way that is both easy and secure sucks. Running in the cloud, with servers coming and going at a moment’s notice, makes this problem worse. The obvious answer is to turn the metric gathering problem around and have a JVM agent push metrics to you. There are services available that do this. There is the awesome <a href="https://newrelic.com/">New Relic</a> and others like it. However, they all come at a cost. Enough of a cost that I ended up rolling our own agent for collecting JVM metrics from Mule, Jetty, and Tomcat.</p>
<p>I didn’t have to do much. Librato Metrics is a great data store with fantastic visualisations for what I want. Getting data out of JMX is the only hard work and fortunately the <a href="http://www.jolokia.org/">Jolokia</a> project removes all the pain by providing an HTTP-JMX bridge. I started off with a Jolokia agent being queried with a Perl script. Once I was happy with the moving parts I wrote agents that wrap Jolokia and run as an application in Mule or a Servlet container, so no external script is needed. The applications themselves periodically gather and send metrics to Librato or Hosted Graphite. Getting metrics is as simple as adding some config properties to a server and dropping an app into Mule, Jetty, or Tomcat. Gathering metrics is deploying an application - no firewall changes, no adding servers to a remote collector process, very little pain at all.</p>
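<p>To show how low the pain is: any language that can make an HTTP GET and parse JSON can read JMX values through Jolokia. Here’s a sketch in Go (the agents themselves are Java; the host and port below are assumptions for the sketch, not part of our setup):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// jolokiaRead models just the part of a Jolokia read response we need.
type jolokiaRead struct {
    Value struct {
        Used int64 `json:"used"`
        Max  int64 `json:"max"`
    } `json:"value"`
}

func main() {
    // Read JVM heap usage over plain HTTP - no JMX remoting,
    // no firewall gymnastics.
    resp, err := http.Get("http://localhost:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var r jolokiaRead
    if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("heap used %d of %d bytes\n", r.Value.Used, r.Value.Max)
    // Shipping the numbers to Librato or Hosted Graphite from here is
    // one more HTTP request on a timer.
}
</code></pre></div></div>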
<p>Here are some metrics for Jetty and the JVM it’s running in, under test load. It looks to me like, with a little bit of tuning, I could handle a lot more requests with this server.</p>
<p><img src="/images/jetty-librato.png" alt="" /></p>
<p>These are the projects; they are open source on GitHub:</p>
<ul>
<li><a href="https://github.com/GeoNet/mule-metrics">Mule Metrics</a></li>
<li><a href="https://github.com/GeoNet/app-server-metrics">App Server Metrics</a></li>
</ul>
<p>If Librato or Graphite are not your thing then Mule Metrics or App Server Metrics should be pretty easy to extend (implement one method). There is a lot more that could be done with Jolokia and JMX beyond using it to extract metrics. As we find places where we need more detail about an application I think we will start to look seriously at using <a href="http://metrics.codahale.com/">Coda Hale’s Metrics</a>. We’re also getting far more sophisticated about how we use logging, but that’s a different topic. For now, if you haven’t got it, work on getting some insight into your applications in production and avoid having to mumble mumble mumble when problems arise.</p>How is your application performing in production right now?