
Metabase - A database framework and API for resource metadata


version 1.003

Metabase is a database framework and API for resource metadata. The framework describes how arbitrary data ('facts') are associated with particular resources or related to each other. The API describes how to store, retrieve, and search this information.

History and Motivation

Metabase was originally designed as a means of storing reports from the CPAN Testers project. When Metabase was initially developed, CPAN Testers reports were sent by individual testers to a single email server, which forwarded them to a USENET group that was considered the authoritative store. This presented problems: some testers couldn't send email, the system wasn't very searchable, it was hard to mirror, and the data inside it was entirely unstructured.

Metabase aimed to avoid all of those problems by being transport-neutral, searchable and easier to mirror by design, and geared toward storing structured data. Simplicity is another design goal: while it has several moving parts, they're all simple and designed to be replaceable and extensible, rather than to be a perfect design up front.

A Metabase has several parts:

  • Metabase::Librarian, a class which manages access to the Archive and Index

  • Metabase::Gateway, a role for managing submission of facts to the Librarian

  • Metabase::Archive, a role for storing and retrieving facts

  • Metabase::Index, a role for indexing and searching facts

  • Metabase::Query, a role for translating a common query data structure into a backend-specific query


Roles require implementations. These could use flat files, relational databases, object databases, cloud services, or anything else that can satisfy the role API.

Metabase comes with some simple, stupid backends for testing.

Facts stored in a Metabase are defined as subclasses of Metabase::Fact. Metabase::Report is a subclass that relates multiple facts.

Metabase::Web provides the web API for storing, searching and retrieving facts. Metabase::Client::Simple is the client library to submit facts to a Metabase::Web server. A future Metabase::Client class will provide submit and search capabilities.

Bugs / Feature Requests

Please report any bugs or feature requests through the issue tracker at https://github.com/rjbs/metabase/issues. You will be notified automatically of any progress on your issue.


Source Code

This is open source software. The code repository is available for public review and contribution under the terms of the license.

Authors and Contributors
  • David Golden <dagolden@cpan.org>

  • Ricardo Signes <rjbs@cpan.org>

  • Leon Brocard <acme@cpan.org>

  • Florian Ragwitz <rafl@debian.org>

  • Graham Knop <haarg@haarg.org>

  • Leon Brocard <acme@astray.com>


This software is Copyright (c) 2013 by David Golden.

This is free software, licensed under:

To install Metabase, copy and paste the appropriate command into your terminal.

For more information on module installation, please visit the detailed CPAN module installation guide.

When it comes to dashboard performance, there are essentially four ways to get your dashboards to load faster:

  • Ask for less data.
  • Cache answers to questions.
  • Organize data to anticipate common questions.
  • Ask efficient questions.

What follows is some general guidance for how to get your dashboards to load faster, and the bulk of this guidance will focus on that third bullet, or how you can organize data to anticipate the most common questions that data will be used to answer. We’ll cover how to ask more efficient questions (i.e., SQL best practices) in an upcoming article.

And, of course, the usual caveats around premature optimization apply. The advice here assumes that you have been exploring your data for some time, and are deriving material benefits from the insight that data yields. Meaning: your queries are already helping you make better decisions. Only then should you be asking: how do I get this dashboard to load faster?

Ask for less data

This point is so obvious that it often goes overlooked, but it should be the first place to start. Do you actually need the data you’re querying? And even if you do need all that data, how often do you need it?

You can save a lot of time simply by restricting the data you query, such as by adding a default filter to a dashboard. And be especially mindful of data spanning time and space. Do you really need to look at the last quarter’s worth of data every day? Or do you really need every transaction for every country?

And even if you do need to know that information, do you need it every day? Could you relocate that question to another dashboard that’s typically only reviewed weekly or monthly?

We should be open to all our data when we’re exploring our datasets, but once we settle on the kinds of decisions our organization needs to make — and the data we need to inform those decisions — we should be ruthless about excluding data that does not significantly improve our analysis.
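As a small illustration of asking for less data (the orders table and its columns here are hypothetical), a question scoped to the last 30 days asks far less of the database than one that scans a full quarter of transactions:

    -- Hypothetical example: limit a question to recent data rather than a full quarter
    SELECT created_at::date AS order_day,
           SUM(total)       AS revenue
    FROM orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY 1
    ORDER BY 1;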

Cache answers to questions

You don’t need to wait for data if it’s already loaded. Admins can set up Metabase to cache query results, which will store answers to questions. If you have a set of dashboards that everyone runs when they open up their computers first thing in the morning, run those dashboards ahead of time, and the questions in them will use the saved results on subsequent runs and load in seconds. People will have the option to refresh the data, but this is usually unnecessary, as most people only need to review data from the previous day and before.

You can configure caching in the Settings tab of the Admin panel. These settings let you set the minimum query duration to cache (so you only cache long-running queries), the Cache Time-To-Live (TTL) multiplier (to specify how long cached results should stick around), and the Max Cache Entry Size (to set an upper limit on the amount of data you cache).

You can use Metabase’s auditing tools to determine when people typically run various questions, then create a script using Metabase’s API to programmatically run these questions (thereby caching their results) ahead of time, so that when people log in and navigate to their dashboards, the results load in seconds. Even without taking that extra “pre-warming” step, when your first person loads that slow query, it’ll be cached for the rest of your folks.

Organize data to anticipate common questions

The next best thing you can do is organize your data in such a way that it anticipates the questions that will be asked, which will make it easier for your database to retrieve that data.

  • Index frequently queried columns.
  • Replicate your database.
  • Denormalize data.
  • Materialize views: create new tables to store query results.
  • Aggregate data ahead of time with summary tables.
  • Pull data out of JSON and slot its keys into columns.
  • Consider a database specific to analytics.


All but the last of the sections below assume you are using a traditional relational database like PostgreSQL or MySQL. The last section is about moving to a completely different type of database tuned specifically for analytics, which should be your last resort, especially for startups.

Index frequently queried columns

Adding indexes to your database can significantly improve query performance. But just as it wouldn’t make sense to index everything in a book, indexes do incur some overhead, so they should be used strategically.

How do you use indexes strategically? Find your most queried tables, and the most commonly queried columns in those tables. You can consult your database for this metadata. For example, PostgreSQL offers statistics on query counts and performance via its pg_stat_statements module.

Remember to do the simple work of asking your Metabase users which questions and dashboards are important to them, and if they’re experiencing any “slowness” as well. Fields that most often require indexing are either time-based or id-based – think timestamps on event data, or IDs on categorical data.

Alternatively, you can use Metabase’s auditing functionality in the Enterprise Edition, which makes it easy to see who is running which queries, how often, and how long those queries took to return records.

Once you’ve identified the tables and columns you’d like to index, consult your database’s documentation to learn how to set up indexes (for example, here’s indexing in PostgreSQL).


Indexes are easy to set up (and take down). The basic format for a CREATE INDEX statement looks roughly like this (the names are placeholders; exact options vary by database):
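    -- Generic form: index one column of one table
    CREATE INDEX index_name ON table_name (column_name);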

For example, assuming a hypothetical orders table that analysts frequently filter by customer:
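    -- Speed up questions that filter or join on orders.customer_id
    CREATE INDEX orders_customer_id_idx ON orders (customer_id);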

Experiment with indexing to see how you can improve query performance. If your users are commonly using multiple filters on a single table, investigate using compound indexes.
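For instance, a single compound index can serve questions that combine filters on two columns; the table and column names here are hypothetical:

    -- One index covering queries that filter on customer and date together
    CREATE INDEX orders_customer_created_idx ON orders (customer_id, created_at);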

Replicate your database

If you are using a database for handling both operations (e.g., app transactions like placing orders, updating profile information, etc.) as well as for analytics (e.g., for queries that power Metabase dashboards), consider creating a replica of that production database for use as an analytics-only database. Connect Metabase to that replica, update the replica each night, and let your analysts query away. Analysts’ long-running queries won’t interfere with the day-to-day operations of your production database, and vice versa.

Outside of making your dashboards faster, keeping a replica database for data analytics is a good practice to follow to avoid potentially long-running analytical queries impacting your production environment.

Denormalize data

In some cases, it might make sense to denormalize some of your tables (i.e., to combine multiple tables into a larger table with more columns). You’ll end up storing some redundant data (such as including user information each time the user places an order), but analysts won’t have to join multiple tables to get the data they need to answer their questions.
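A rough sketch of what that might look like, assuming hypothetical orders and users tables:

    -- Denormalized table: each order row carries the customer fields analysts need,
    -- so they can skip the join at query time.
    CREATE TABLE orders_wide AS
    SELECT o.id,
           o.created_at,
           o.total,
           u.id      AS user_id,
           u.email   AS user_email,
           u.country AS user_country
    FROM orders o
    JOIN users u ON u.id = o.user_id;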

Materialize views: create new tables to store query results

With materialized views, you’ll keep your raw, normalized data in its tables, and create new tables (typically during off hours) to store query results that combine data from multiple tables in a way that anticipates the questions analysts will ask.

For example, you might store order and product information in different tables. You could, once a night, create (or update) a materialized view that combines the most frequently queried columns from both of those tables, and connect that materialized view to your questions in Metabase. If you’re using a database for both production and analytics, in addition to eliminating the joining process needed to combine that data, your queries won’t have to compete with production reads and writes on those tables.
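Here is a PostgreSQL-flavored sketch; the orders and products tables and their columns are hypothetical:

    -- Refreshed nightly; stores the joined result so questions read a single table
    CREATE MATERIALIZED VIEW order_products AS
    SELECT o.id       AS order_id,
           o.created_at,
           o.quantity,
           o.total,
           p.name     AS product_name,
           p.category AS product_category
    FROM orders o
    JOIN products p ON p.id = o.product_id;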


The difference between a materialized view and a Common Table Expression (CTE) or an ordinary view is that the materialized view stores its results in the database (and can therefore be indexed). CTEs are essentially named subqueries that are computed each time a query runs; their results may be cached, but they are not stored in the database.

Materialized views will, however, consume resources in your database, and you will have to refresh the view manually (for example, REFRESH MATERIALIZED VIEW view_name in PostgreSQL).

Aggregate data ahead of time with summary tables

The idea here is to use materialized views — or even a separate set of tables — to create summary tables that minimize computation. Say you have tables with a million rows, and you want to aggregate data in multiple columns. You can create a materialized view based on aggregations of one or more tables, which will perform the initial (time-consuming) computation. Rather than have a dashboard query and compute that raw data several times throughout the day, you can instead create questions that query the summary table to get the data computed the night before.

For example, you could have an orders table that contains all of the individual orders, and an order summary table that updates nightly and stores rollups and other aggregated data, such as order totals per week, month, and so on. If a person wants to view the individual orders used to compute an aggregate, you can use Metabase Enterprise’s customized drill-through to link them to a question or dashboard that does query the raw data.
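A sketch of such a rollup, again assuming a hypothetical orders table:

    -- Nightly rollup of order totals per week; dashboards query this view
    -- instead of re-aggregating millions of raw rows on every load.
    CREATE MATERIALIZED VIEW order_totals_by_week AS
    SELECT date_trunc('week', created_at) AS week,
           COUNT(*)                       AS order_count,
           SUM(total)                     AS revenue
    FROM orders
    GROUP BY 1;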

Pull data out of JSON and slot its keys into columns

We often see organizations storing JSON objects in a single column of a relational database like MySQL or PostgreSQL. Typically, these organizations are storing JSON payloads from event analytics software like Segment or Amplitude.

Though some databases can index JSON (PostgreSQL can index JSON binaries, for example), you still have to grab the full JSON object each time, even if you’re only interested in a single key-value pair in the object. Instead, consider extracting each field from these JSON objects and mapping those keys to columns in a table.
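A PostgreSQL-flavored sketch, assuming a hypothetical events table with a JSON payload column:

    -- Promote commonly queried keys out of the JSON blob into real columns,
    -- which can then be indexed and queried directly.
    ALTER TABLE events
        ADD COLUMN user_id    text,
        ADD COLUMN event_name text;

    UPDATE events
    SET user_id    = payload ->> 'user_id',
        event_name = payload ->> 'event';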

Consider a database optimized for analytics

If you’ve done all of the above and your dashboard load times are still interfering with your ability to make decisions in a timely manner, you should consider using a database that is structured specifically for fielding analytical queries. These databases are known as Online Analytical Processing (OLAP) databases, sometimes called data warehouses.

Traditional relational databases like PostgreSQL and MySQL are designed for transaction processing, and are categorized as Online Transaction Processing databases (OLTP). These databases are better suited for use as operational databases, such as storing data for web or mobile applications. They are quite good at handling the following scenario: someone submits a thoughtful, germane, and not at all inflammatory comment to your website, your app fires a POST request to your backend, which routes the comment and metadata to your database for storage. OLTP databases can handle large volumes of concurrent transactions like comment posts, cart checkouts, profile bio updates, etc.

The main difference between OLAP and OLTP systems is that OLAP databases are optimized for analytical queries (sums, aggregates, and other operations on large amounts of data) as well as for bulk imports via ETL, whereas OLTP databases must balance large reads from the database with other transaction types: small inserts, updates, and deletes.

OLAPs typically use columnar storage. Whereas traditional (OLTP) relational databases store data by rows, databases that use columnar storage (unsurprisingly) store data by columns. This columnar storage strategy gives OLAP databases an advantage when reading data, as queries do not have to sift through irrelevant rows. There are other efficiencies that can be gained as well, such as the ability to compress columns with redundant data using bitmap indexing and run-length encoding.

Data in these databases is typically organized in fact and dimension tables, with (often massive) fact tables housing events. Each event contains a list of attributes and foreign key references to dimension tables, which contain information about those events: who was involved, what happened, product information, and so on.

Metabase supports several popular data warehouses: Google BigQuery, Amazon Redshift, Snowflake, and Apache Druid (which specializes in real-time analytics). Metabase also supports Presto, which is a query engine that can be paired with a variety of different datastores, including Amazon S3 (Simple Storage Service).

As you start out using Metabase, don’t worry too much about the underlying data store. But as your data grows, and adoption of Metabase grows, keep an eye out for indicators that you may want to investigate using a data warehouse. Redshift, for example, can query petabytes of data, and scale to querying historical data in Amazon S3. And Snowflake allows you to dynamically scale your compute resources as your organization grows.


Further reading

For more tips on improving performance, check out our articles on scaling Metabase and SQL query best practices.

If you’ve improved dashboard performance at your organization, you can share your tips on our forum.