Team WeAreMakers: Abdulrahman Elsharqawy, Mohamed Hassan AbdulRahman, Ahmed Yehia, Mohamed Moussa, Ahmed Hassan, Waleed Abdel Fattah

Published September 17, 2016 © GPL3+

Get Me There - Bus Intelligent Transportation System (BITS)

An innovative solution that uses IOT, Microsoft Azure and Data Analytics to save passengers' time.

Advanced · Full instructions provided · Over 1 day

Best Project - Transportation

Microsoft Azure: Building The Enterprise IoT Cloud!

Things used in this project

Hardware components

Android device

used to test the Android App

Raspberry Pi 3 Model B

Breadboard (generic)

Jumper wires (generic)

Push-Button

LED (generic)

Resistor 221 ohm

Software apps and online services

Android Studio

Microsoft Visual Studio 2015

Microsoft Windows 10 IoT Core

Microsoft Windows IoT Core Project Templates on VS

Microsoft Windows IOT Core Dashboard

Microsoft Azure

Microsoft Azure Storage Explorer (Preview) is a standalone app from Microsoft that allows you to easily work with Azure Storage data on Windows, macOS and Linux.

Microsoft Power BI

Power BI is a suite of business analytics tools to analyze data and share insights. Monitor your business and get answers quickly with rich dashboards available on every device.

Microsoft Device Explorer

Microsoft Azure IOT Hub

Google Maps Distance Matrix and Directions API

Fritzing

Apache Spark

Apache Hive

Apache Impala

Cloudera ODBC

Story

Background

Have you ever waited too long for a bus, only to find that it has no free seats and you can't board?

Our solution, Get Me There, combines IoT, data analytics, machine learning, a mobile app and cloud computing to give bus passengers accurate arrival-time estimates and the current number of available seats on each bus.

The Problem, The Need

For many people around the world, the bus is one of the most important means of transportation in cities. Everybody has a busy schedule and needs to reach their destination on time. A passenger needs to know how long they will wait for a bus, whether that bus will have a free seat, and whether to keep waiting or look for an alternative in order to arrive on time.

Bus service companies, on the other hand, need to collect all service activity and customer requests to build a robust, rich analytics platform that supports faster and better-informed decisions.

The Solution: Get Me There - BITS Solution

BITS Project's Architecture

The Solution consists of the following components:

The IoT device (the data collection endpoint): a bus equipped with a Raspberry Pi 3 board and sensors that count the passengers getting in and out of each bus.

Microsoft Azure IoT Hub: communicates with the IoT device (Raspberry Pi 3) and receives its telemetry.

Azure Stream Analytics: packs the received data into tables.

Azure Storage: stores those tables for use by the modules that follow.

Azure API App (Web Service): provides the mobile app with information such as estimated arrival time, estimated travel time (using Google Maps APIs) and number of free seats.

Mobile App: the passenger's interface to use the solution.

Hadoop Cluster / Hive for data analytics platform

MS Power BI for desktop reports

How It Works

In this video we demonstrate a complete transaction. It starts when a bus opens its door at stop "1111"; six passengers then get on and one gets off. All of this is simulated with push buttons on the breadboard. In a real deployment, two beam sensors installed at the bus doors could detect whether a passenger passing through them is getting in or out. Closing the door then triggers a procedure that sends the collected data to the IoT hub, which records the activity in cloud storage.

When a passenger submits a request in the mobile app for a bus from one stop to another, the API app finds a bus that passes both stops, checks its availability, and calculates the number of available seats from the records of all previous stops in the same journey. It then uses the Google APIs to estimate how long the bus will take to arrive and how long the trip to the requested destination will take, and sends the results back to the mobile app.

We used five tools in the demo:

API App portal (Web Service) to show number of requests received

IoT hub portal, to show number of messages received

Visual Studio to show where the message sent to the IoT hub

Android Studio emulator, to run the mobile app

MS Azure Storage Explorer, to monitor data records

BITS Demonstration

Now let's explore each component of our solution: why we use it and how to replicate it.

The IOT device

In the target version, a Raspberry Pi 3 board is installed in each bus. The Raspberry Pi uses beam sensors to count the passengers getting in and out at each bus stop, a door open/close sensor, and a GPS sensor to get the exact bus location. The GPS sensor is not used in this prototype due to time constraints, but the solution is ready to integrate one.

For the simulation we used the following circuit to generate data about passengers getting in and out of the bus and the open/closed status of the door. Every press of the "passenger in" button simulates a passenger getting on the bus, every press of the "passenger out" button simulates a passenger getting off, and every press of the "bus door open/close" button simulates the bus reaching or leaving a stop.

Each Raspberry Pi 3 runs Windows 10 IoT Core. Follow these steps to install the OS on the Raspberry Pi:

On a PC or laptop, run the "Windows IoT Core Dashboard" software, available from Microsoft.

Insert the Raspberry Pi 3 SD card into the PC's card reader, then choose the "Set up a new device" tab.

Fill in the form as shown in the screenshot below.

Installing Windows 10 IoT Core on the SD card

After installing Windows 10 IoT Core on the SD card, insert the card into the Raspberry Pi's card reader.

Power on the Raspberry Pi and connect it to the network with an Ethernet cable.

On your PC or laptop, check the Windows IoT Core Dashboard. The Raspberry Pi should appear on the "My Devices" tab as follows.

IOT Dashboard detects Raspberry Pi

Configure Raspberry Pi WiFi

Azure Configuration

In this part we configure the Azure components used throughout the project. The first step is to create an Azure account. Microsoft offers a free subscription for one month. Thanks to Microsoft :)

Microsoft verifies your email and credit card. Please note that you can't create another free account using the same email or credit card.

Configuration of all components is straightforward, and you can find detailed, clear steps on MSDN.

Notes:

When creating Azure resources, make sure to choose the same "Location" for all of them.

Resource Group

Follow the screens below to create a new resource group.

Step 1: Create New Resource Group

Microsoft IOT Hub

Follow the screens below to create a new IoT hub. For more information, see the link below.

https://github.com/Azure/azure-iot-sdks/blob/master/doc/setup_iothub.md

Step 1: Create New IoT hub

Microsoft Storage account

Follow the screens below to create a new Storage Account.

Step 1: Create New Storage Account

Stream Analytics job

Follow the screens below to create a new Stream Analytics job. This job reads telemetry from the IoT hub and packs the data into the "BusJourneyInfo" table.

Step 1: Create New Stream Analytics job

Bus Streaming Analytics query

SELECT 
* 
INTO 
BusInfo 
FROM 
BusInfoStream

Don't forget to start the job so that received telemetry is saved in the table.

There is no need to create the table yourself: the analytics job reads the telemetry from the IoT hub and, if the output table doesn't exist, creates it for you. Its fields will be the same as the fields sent to the IoT hub. The table is created in the storage account and can be accessed using "Microsoft Azure Storage Explorer".

Create Devices

Follow the steps in the link below to define the Raspberry Pi devices. Each bus should have one Raspberry Pi, and each Raspberry Pi should be defined using "Device Explorer".

https://github.com/Azure/azure-iot-sdks/blob/master/tools/DeviceExplorer/doc/how_to_use_device_explorer.md

Note: You can get the "IoT hub connection string" from the IoT hub's shared access policies, as in the screen below.

Hardware and Azure integration

- Open VS 2015. Select File menu -> New Project -> Visual C# -> Windows IoT Core -> Background Application (IoT)

- Right-click the project name in Solution Explorer -> "Manage NuGet Packages"

Install both the "Microsoft.Azure.Devices" and "Microsoft.Azure.Devices.Client" packages

- Right-click References in Solution Explorer -> Add Reference. Then select Universal Windows -> Extensions -> "Windows IoT Extensions for the UWP"

- The code can be found at the link below

https://github.com/mrahman4/BITSCode

- Compile the code and deploy it to the Raspberry Pi. From the Debug menu -> BITSDevice Properties -> on the Debug tab:

Set the target device to Remote Machine

Type the name of the Raspberry Pi in the Remote machine box and click Find, then choose Select in the remote connection dialog

Note: At this point the Raspberry Pi should be connected to the internet through WiFi and wired to the breadboard and sensors.

In this part of the project, we simulate the behavior of one bus during a day. We assume that the bus has a route consisting of 10 bus stops.

static int MAX_NOF_STOPS_1 = 9;   // (Number of stops - 1)  in this route 
static int[] mStations_Array  = { 1111, 2222, 3333, 4444, 5555, 6666, 7777, 8888, 9999, 1234}; 
int m_iCurrentStationID = -1; //Current station index in the route

The bus moves forward from the first stop to the last one, then returns backward to the first stop to complete one journey.

static int DIRECTION_FORWARD    = 1 ;    
static int DIRECTION_BACKWORD   = -1;      
int m_iDirection = DIRECTION_FORWARD;
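
The forward-then-backward traversal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's C# code: the `next_stop` and `one_journey` helpers are hypothetical names, and the sketch starts at index 0 rather than the -1 used by `m_iCurrentStationID`.

```python
# Stop IDs copied from the article's mStations_Array.
STATIONS = [1111, 2222, 3333, 4444, 5555, 6666, 7777, 8888, 9999, 1234]
DIRECTION_FORWARD = 1
DIRECTION_BACKWARD = -1


def next_stop(index, direction, last_index=len(STATIONS) - 1):
    """Return (new_index, new_direction) for the next stop on the route.

    Hypothetical helper: the bus reverses direction at either end of the route.
    """
    if index == last_index and direction == DIRECTION_FORWARD:
        direction = DIRECTION_BACKWARD
    elif index == 0 and direction == DIRECTION_BACKWARD:
        direction = DIRECTION_FORWARD
    return index + direction, direction


def one_journey():
    """List the stop IDs visited in one full journey (out and back)."""
    index, direction = 0, DIRECTION_FORWARD
    visited = [STATIONS[index]]
    while True:
        index, direction = next_stop(index, direction)
        visited.append(STATIONS[index])
        if index == 0:
            return visited
```

With 10 stops, one journey visits the first stop, runs out to the last one and comes back, so intermediate stops appear twice.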

The bus makes 5 journeys during the day. Each journey can have a different driver.

static int MAX_NOF_JOURNIES = 5;  
static int[,] m_JourniesArray = new int[5, 2] {{ 401, 1 }, { 402, 1}, { 403, 2}, { 404, 2}, { 405, 2}} ; 
int m_iJourneyIndex = -1;

Some information is fixed for a given bus but differs from one bus to another, such as the Line ID, which represents the bus route, and the Bus ID, since many buses can serve one route.

static int LINE_ID = 10;        
static int BUS_ID = 610;

In the prototype, when the bus arrives at a bus stop, the door push button is pressed to indicate that the door has opened.

To indicate that a passenger has boarded the bus, the "passenger up" push button is pressed.

To indicate that a passenger has left the bus, the "passenger down" push button is pressed.

When the bus is ready to leave the stop, the door push button is pressed again to indicate that the door has closed.

private GpioPin passupPin;     // One passenger came into the bus 
private GpioPin passdownPin;   // One passenger left the bus 
private GpioPin doorclosedPin; // Bus door closed or opened

3 LEDs are used to indicate the status of each push button

Raspberry Pi3 Bus Simulation Circuit

First prototype using Arduino

Two variables count the passengers getting on and off the bus. Both counters are reset at each bus stop (when the door opens).

int m_iNumInSensor = 0 ;   //Number of passengers who got on the bus 
int m_iNumOutSensor = 0 ;   //Number of passengers who got off the bus

When the door closes, all the needed data is prepared, packed and sent to Azure IoT Hub.

var message = new Microsoft.Azure.Devices.Client.Message(Encoding.ASCII.GetBytes(JsonConvert.SerializeObject(transaction))); 
try { await deviceClient.SendEventAsync(message); } 
catch (Exception e) { string str = e.Message; }
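
To make the door-open / count / door-close cycle concrete, here is a minimal Python sketch of the same bookkeeping. It is an illustration only: the `StopCounter` class and the JSON field names (`LineID`, `BusID`, `StopID`, `PassengersIn`, `PassengersOut`, `Timestamp`) are assumptions, since the fields of the project's `transaction` object are not listed in this article, and nothing is actually sent to an IoT hub.

```python
import json
from datetime import datetime, timezone


class StopCounter:
    """Counts passengers at one stop; class and field names are illustrative."""

    def __init__(self, line_id, bus_id):
        self.line_id = line_id
        self.bus_id = bus_id
        self.stop_id = None
        self.num_in = 0
        self.num_out = 0

    def door_opened(self, stop_id):
        # Both counters restart at every stop, as described above.
        self.stop_id = stop_id
        self.num_in = 0
        self.num_out = 0

    def passenger_in(self):
        self.num_in += 1

    def passenger_out(self):
        self.num_out += 1

    def door_closed(self, now=None):
        """Build the JSON payload that would be sent when the door closes."""
        now = now or datetime.now(timezone.utc)
        transaction = {
            "LineID": self.line_id,
            "BusID": self.bus_id,
            "StopID": self.stop_id,
            "PassengersIn": self.num_in,
            "PassengersOut": self.num_out,
            "Timestamp": now.isoformat(),
        }
        return json.dumps(transaction)
```

Replaying the demo scenario (door opens at stop 1111, six passengers in, one out, door closes) yields a payload with `PassengersIn` equal to 6 and `PassengersOut` equal to 1.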

From "Microsoft Azure Storage Explorer", you can check that a new record has been inserted in the table

You can also confirm that the device has sent something using "Device Explorer"

And you can check that Azure IoT Hub is receiving stream data

Notes:

Make sure that the "Bus streaming Analytics" job is running, as it is responsible for reading data from the IoT hub and saving it to the table in storage.

An Arduino sketch has two functions, setup and loop; in C# both are handled inside Run. The InitGPIO() function plays the role of Arduino's setup: inside it you register event handlers, such as the one for the door button being pushed. A thread containing an infinite loop then keeps the application running while it waits for events to fire.

public void Run(IBackgroundTaskInstance taskInstance)
{
    deferral = taskInstance.GetDeferral();
    InitGPIO();
    Task.Run(() =>
    {
        while (true)
        {
            // That's right... do nothing.
        }
    });
}

Inside the InitGPIO() function, the pins are configured and the event handlers are registered.

doorclosedPin = gpio.OpenPin(DOORCLOSED_PIN);
if (doorclosedPin.IsDriveModeSupported(GpioPinDriveMode.InputPullUp))
    doorclosedPin.SetDriveMode(GpioPinDriveMode.InputPullUp);
else
    doorclosedPin.SetDriveMode(GpioPinDriveMode.Input);
doorclosedPin.DebounceTimeout = TimeSpan.FromMilliseconds(50);
doorclosedPin.ValueChanged += DoorPin_ValueChanged;

- The device object used to send telemetry to the IoT hub is created when the bus arrives at each bus stop

m_deviceClient = DeviceClient.CreateFromConnectionString 
(connectionString, "BusDevice2", Microsoft.Azure.Devices.Client.TransportType.Http1);

connectionString is the IoT hub connection string and "BusDevice2" is the device ID.

Lines & Stops database

We needed two entities: one to hold bus line data (Line, Stop, Stop Order) and one to hold stop data (Stop, Lat, Long).

The entities can be created in Azure SQL Database, or in the same storage account used by the IoT hub, which is a NoSQL store. I recommend SQL Database for those familiar with SQL commands.

Lines Data

Stops Data

Azure API App (Web Service)

The purpose of the API is to provide the data needed to fulfill a passenger's request from the mobile app. Rather than deploying such a large amount of logic in the mobile app, we run it in the cloud, where greater computing power and speed give reasonable performance and a better customer experience.

To fulfill a request that ask for a bus from an "origin stop" to a "destination stop", the API does the following:

Search the lines/stops database for bus lines that pass by the requested "origin stop"; more than one bus line may serve the same stop.

Filter the lines found down to those that also serve the destination stop.

For each of those "potential" lines, check the real data coming from the bus sensors through the IoT hub to find which line is currently in service and whether a bus has not yet passed the requested origin stop.

If a bus is found, go through its whole journey so far, collecting the number of seats taken and vacated at each stop, to calculate the number of occupied seats and hence the available seats. We assume a capacity of 60, which covers standing spots as well as actual seats; this could be changed to separate the two kinds of capacity and report both to the mobile app.
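
The line filtering and seat calculation in the steps above can be sketched as follows. This is a simplified Python illustration rather than the project's web service code: the `LINES` table, the `candidate_lines` and `available_seats` helpers, and the shape of the per-stop records are all hypothetical.

```python
BUS_CAPACITY = 60  # capacity assumed in the article (seats plus standing spots)

# Hypothetical lines table: line ID -> ordered list of stop IDs.
LINES = {
    10: [1111, 2222, 3333, 4444, 5555],
    20: [9999, 3333, 2222],
    30: [1111, 7777],
}


def candidate_lines(lines, origin, destination):
    """Return the IDs of lines that serve both the origin and destination stops."""
    return [
        line_id
        for line_id, stops in lines.items()
        if origin in stops and destination in stops
    ]


def available_seats(stop_records, capacity=BUS_CAPACITY):
    """Compute free capacity from per-stop (passengers_in, passengers_out) pairs.

    stop_records covers every stop already visited in the current journey.
    """
    on_board = sum(p_in - p_out for p_in, p_out in stop_records)
    return max(capacity - on_board, 0)
```

For example, a bus that recorded (6 in, 1 out) and then (4 in, 2 out) has 7 passengers on board and 53 places left.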

To estimate how long a bus takes to get from its current location to the origin stop, we had to deal with two points:

1 - We first need to know the bus's current location. This could easily come from a GPS sensor on the Raspberry Pi sending regular latitude/longitude records, but in our prototype we use the "last stop the bus departed from" as a reference. The challenge with this option is that the bus could be near to or far from that stop, so we handled it this way:

If the bus left stop A at 6:00pm heading towards stop B, and we know the trip should take about T = 10 mins, and the request for that bus was made at about 6:04pm, which is t = 4 mins after the latest departure, then we estimate the time to reach stop B as T - t = 10 - 4 = 6 mins instead of 10 mins, which is closer to reality.

Of course, bus locations sent at regular intervals, say every minute or every 30 seconds, would make our calculations most accurate.
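
The T - t adjustment can be written as a small Python function. This is a sketch of the arithmetic only; the function name is hypothetical, and clamping the result at zero (for a bus that is running late) is an assumption not discussed in the article.

```python
from datetime import datetime, timedelta


def remaining_to_next_stop(departed_at, requested_at, segment_duration):
    """Estimate the time left to reach the next stop.

    departed_at: when the bus left the previous stop
    requested_at: when the passenger made the request
    segment_duration: expected travel time T between the two stops
    """
    elapsed = requested_at - departed_at      # t in the text
    remaining = segment_duration - elapsed    # T - t
    # Assumption: never report a negative remaining time.
    return max(remaining, timedelta(0))
```

With a departure at 6:00pm, a request at 6:04pm and T = 10 minutes, the function returns 6 minutes, matching the example above.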

2 - The second issue arises when we calculate the duration from one stop to another using the Google API: it returns results based on a shortest-route algorithm, which will mostly not match our bus route, so the estimated durations can be highly inaccurate.

To resolve this, we split the calculation along the route's designated path: we calculate the duration between each stop and the next one, and accumulate the results up to the desired stop. This assumes that, most of the time, the path between two consecutive stops is the shortest one or close to it. In addition, we can add as many points as we need to represent all key points on the bus route, whether they are stops or not, to maintain the highest accuracy.

The next step is to sum the travel times between the last stop departed from and the origin stop, then subtract the time elapsed between the last departure and the request time, as explained above. This gives the estimated bus arrival time.

The same is done for the duration from origin to destination, to obtain the estimated trip travel time.

In addition to arrival and travel times, the API calculates dwell times (bus waiting times at stops) for the relevant stops and adds them to the corresponding arrival or travel time.

Finally, all the information is collected for each potential line and sent to the mobile app.
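
Putting the pieces together, the arrival estimate is the sum of the per-segment travel times plus the dwell times, minus the time already elapsed since the last departure. The Python sketch below shows that arithmetic under stated assumptions: the function name and the list-based inputs are hypothetical, and the segment durations are plain numbers here, whereas the project obtains them from the Google APIs for each pair of consecutive stops.

```python
def estimate_arrival_minutes(segment_minutes, dwell_minutes, elapsed_minutes):
    """Estimate minutes until the bus reaches the origin stop.

    segment_minutes: travel time for each stop-to-stop segment between the
        last stop departed from and the origin stop
    dwell_minutes: predicted waiting time at each intermediate stop
    elapsed_minutes: time since the bus left its last stop
    """
    travel = sum(segment_minutes)
    dwell = sum(dwell_minutes)
    # Assumption: clamp at zero so a late bus never yields a negative estimate.
    return max(travel + dwell - elapsed_minutes, 0)
```

For three segments of 10, 7 and 5 minutes, dwell times of 1 and 1.5 minutes at the two intermediate stops, and 4 minutes already elapsed, the estimate is 20.5 minutes. The same function applied to the origin-to-destination segments, with zero elapsed time, gives the trip travel time.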

The following flow chart explains the API core activities with relevant dependencies

Web Service Flow Chart

Prerequisites to setup Azure API APP

Azure account - you can open an Azure account for free.

You need a Storage Account or a SQL Database in your Azure workspace. You can reuse the storage account used by the IoT Hub in the previous sections rather than creating another one.

In Visual Studio, click Help -> About Microsoft Visual Studio and ensure that you have "Azure App Service Tools v2.9.1" or higher installed.

Make sure you have .NET Framework 4.5.2 installed.

Install the Swagger framework by running the following commands in the Package Manager Console:

PM> Install-Package Swashbuckle -Pre 
PM> Install-Package Swashbuckle.Swagger -Pre

To show the Package Manager Console, select it from the menu as shown.

Step by Step

1- Create a new project and select ASP.NET Web Application

2- The wizard takes you through creating the App Service in your Azure account

Now you are ready to code.

Swagger

Swagger is a simple yet powerful representation of your RESTful API. It lets you discover and understand the capabilities of the service without access to the source code, which is especially helpful for testing.

3- First enable Swagger in the file "SwaggerConfig.cs", which is created for you according to the project type. Just uncomment the code highlighted below.

             }) 
         .EnableSwaggerUi(c => 
             {

4- When you debug your app, the browser opens a page like the one below, with error code 403. Don't panic: all you need to do is append "/swagger" to the URL.

5- Your app is now ready to be tested through Swagger.

Start with the model example, which shows the parameters passed to the API, and explore the available methods (we only needed "GET" in our API). To test, enter values for the parameters, press "Try it out" and wait for the results, which by default are in JSON format.

Swagger UI

Publish your API App

Since you already created your project with your Azure account, the remaining steps to publish your app are easy.

After the app is published successfully on Azure, the web service home page opens in your browser. You can repeat the same Swagger testing steps here by appending "/swagger" to the URL.

You are now done.

Users of your service need to know the URL of your API and the parameter names in their respective order.

Monitoring your API APP

The Azure portal provides a live chart of the requests and errors resulting from your API App usage.

Using Microsoft Azure Storage Explorer

A very helpful, lightweight tool to explore, import and export data in your storage account. We also used it for testing and in our demonstration.

Microsoft Azure Storage Explorer

Machine Learning

We needed to estimate dwell times (bus waiting times at stops). After looking at many existing models, we selected the KNN model, inspired by the research published by Jianxia Xin and Shuyan Chen, "Bus Dwell Time Prediction Based on KNN". We chose it for its clarity and simplicity. The model depends on clustering stops into groups based on periods of the day and week (for example, weekday peak hours) for a given bus line.

In our case we used about 71 records for bus line no. 10, with the dwell time at a stop calculated as the time between the bus door opening and closing. (Using GPS would be more accurate, as it would give the times of entering and leaving the stop lane.)

After training the model (we chose 70% of the data for training) and scoring it, we published the model as an API for the web service to use when adding dwell times to its time estimates.
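
For readers who want to see the idea in code, here is a minimal plain-Python sketch of k-nearest-neighbours regression with a 70/30 train/test split. It is a stand-in for illustration, not the published model: the helper names are hypothetical, the choice of features (for example hour of day and a weekday flag) is an assumption, and the prediction is simply the mean dwell time of the k closest training records.

```python
import math
import random


def split(records, train_fraction=0.7, seed=0):
    """Shuffle the records and split them into (train, test) lists."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def knn_predict(train, query, k=10):
    """Predict dwell time as the mean over the k nearest training records.

    train: list of (features, dwell_seconds), features being a numeric tuple
    query: feature tuple for the stop and time to predict
    """
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    return sum(dwell for _, dwell in nearest) / len(nearest)
```

With 71 records, a 70% split leaves 49 records for training and 22 for scoring. The article reports that k=10 gave the better results on its data; on other data the best k would need to be checked against held-out records.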

Step by Step

1- Create the model components. We used k=10 as it gave better results. In the following figure, we used a sample file uploaded as the data set

2- Running the model produced a "Predictive" model that can be published

3- Publish the predictive model as an API service

4- From the link labeled "Request/Response", you can browse the generated API documentation

5- The following figure shows how to configure the model to load its data from the BITS storage account table "busjourneyinfo"

Passenger and Mobile Application

- Create a new project in Android Studio. Click Next and accept the default values.

- The code can be found at the link below

https://github.com/mrahman4/BITSCode

- Compile the application and run it on the Android Emulator

The passenger has this application on their phone. They select the start stop and the end stop, then submit the request. The Android application communicates with the Azure web service and gets information about all buses that can take the passenger from the start stop to the end stop. The web service returns the bus number, when the bus will arrive at the start stop, the time needed to reach the end stop and how many seats are available on the bus.

The application needs permission to access the internet, so add the line below to AndroidManifest.xml

<uses-permission android:name="android.permission.INTERNET" />

To support JSON and HTTP, add the lines below to build.gradle (Module: app)

Add the line below before buildTypes

useLibrary  'org.apache.http.legacy'

Add the line below inside dependencies in the same file

compile 'com.google.code.gson:gson:2.4'

Note:

- To connect to the Azure web service, we used the RestClient class from the link below. Many thanks to them for the help :)

http://lukencode.com/2010/04/27/calling-web-services-in-android-using-httpclient/

- The getArrivalTime() function communicates with the web service, so it should be called from a new thread rather than the main GUI thread. This new thread can't update the GUI directly, so you need to use runOnUiThread() to update it. Note that by default Android Studio lets you debug only the main GUI thread; if you want to debug both threads you need to change the Android Studio settings.

new Thread(new Runnable() {
    public void run()
    {
        m_strAvailableBusesInfo = "";
        getArrivalTime();
        runOnUiThread(new Runnable()
        {
            @Override
            public void run() {
                arrivalTimeTxt.setText(m_strAvailableBusesInfo);
                arrivalTimeTxt.refreshDrawableState();
            }
        });
    }
}).start();

- The web service returns a JSON document. The response string is re-encoded as UTF-8 and then parsed:

JSONObject jsonRootObject = new JSONObject(new String(strJson.getBytes("UTF-8")));
// Get the JSONArray that contains the bus JSONObjects
JSONArray jsonArray = jsonRootObject.optJSONArray("Buses");
int iNofJsonArray = jsonArray.length();
// Iterate over the array and read the fields of each bus
for (int i = 0; i < iNofJsonArray; i++)
{
    JSONObject jsonObject = jsonArray.getJSONObject(i);
    String strBusID = jsonObject.optString("BusID");
    String strArrivalTime = jsonObject.optString("ArrivalTime");
    String strTravelTime = jsonObject.optString("TravelTime");
    String strAvailableSeats = jsonObject.optString("AvailableSeats");
}

Data Analytics:

1- Data Flow:

The following diagram describes how data flows between the tools in the data analytics cycle

2- Data transformation

We used a real data set of Dublin city buses, downloaded from the following link:

https://data.gov.ie/dataset/dublin-bus-gps-sample-data-from-dublin-city-council-insight-project (Dublin Bus GPS Data)

The data set contains 30 days of actual Dublin bus GPS data from across Dublin city, downloaded from Dublin City Council's traffic control, in CSV format.

Data transformation goes through the following three steps, which reduce the number of records for the 7 days from 9.5M to 700K:

a) We select only 7 days from the data set, from 1-1-2013 to 7-1-2013, as all other weeks have similar patterns.

b) We identified the following fields in the data set as relevant to our project; all other fields are filtered out:

[0] prodtime: Not needed, Unix time format; we will generate another column that matches the time of day

[1] Line ID: Keep it. Line ID for each travel

Direction: Not needed

Journey Pattern: Not needed

[2] Date: Keep it

[3] Journey ID: Keep it. Unique ID for each bus journey

Operator: Not needed

Congestion: Not needed

[4] Long: Keep it. GPS longitude

[5] Lat: Keep it. GPS latitude

Delay: Not needed

Block ID: Not needed

[6] Bus ID: Keep it. Bus ID

[7] Station ID: Keep it. Bus station ID

[8] At Stop: Keep it. 1 = bus stopped, 0 = bus is moving

· We are also interested only in rows that have “At Stop = 1”

· We used an Impala job to do the following:

a) Filter out all unneeded columns as described above

b) Select only rows that have “At Stop = 1”

c) Order the output by the “journey ID” field

Impala Code:

I. Create 7 tables, one for each day, and load the data into them:

create external table businfo_2013_01_01 (prodTime bigint, lineID int, direction int, journeyPatternId string, journeydate string, journeyID int, operator string, congestion int, gpsLong float, gpsLat float, delay int, blockID int, busID int, stationID int, atStop int ) row format delimited fields terminated by ',';
load data local inpath '/home/cloudera/DublinBuses010113-310113/DataSet/siri.20130101.csv' overwrite into table businfo_2013_01_01;

o We applied the same code to create the remaining 6 tables: businfo_2013_01_02, businfo_2013_01_03, businfo_2013_01_04, businfo_2013_01_05, businfo_2013_01_06, businfo_2013_01_07

II. Filter out all unneeded columns from the 7 tables, select “At Stop = 1” and order by the “journey ID” field:

with 
cte_station
as ( select *,  row_number() over (partition by journeydate, journeyid, stationid 
order by prodtime asc ) as rn_station
from businfo_2013_01_01 )
select prodtime, lineid, journeydate, journeyID, gpslong, gpslat, busid, stationid, atstop from cte_station 
where rn_station = 1 and atstop = 1 
order by journeyid;

o We applied the same code to filter the other 6 tables: businfo_2013_01_02, businfo_2013_01_03, businfo_2013_01_04, businfo_2013_01_05, businfo_2013_01_06, businfo_2013_01_07
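
The query's logic may be easier to follow as ordinary code. The Python sketch below mirrors it on in-memory rows: for each (journeydate, journeyid, stationid) group it keeps the record with the earliest prodtime, drops it unless atstop is 1, and sorts the result by journey ID. The function name and the dict-based row format are assumptions for illustration; the project itself runs the SQL shown above.

```python
def first_stop_records(rows):
    """Keep the earliest record per (date, journey, station) if the bus was at a stop.

    rows: iterable of dicts with the keys prodtime, journeydate, journeyid,
    stationid and atstop.
    """
    earliest = {}
    for row in rows:
        key = (row["journeydate"], row["journeyid"], row["stationid"])
        # Equivalent of row_number() ... order by prodtime asc, keeping rn = 1.
        if key not in earliest or row["prodtime"] < earliest[key]["prodtime"]:
            earliest[key] = row
    # Equivalent of "where rn_station = 1 and atstop = 1".
    kept = [row for row in earliest.values() if row["atstop"] == 1]
    return sorted(kept, key=lambda row: row["journeyid"])
```

Note that, as in the SQL, a group whose earliest record has atstop = 0 is dropped entirely, even if later records in the same group have atstop = 1.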

III. Save the 7 tables to CSV files

a) Using Python code we added four columns to each output file:

a. Pass_IN: random number of passengers getting on the bus at each station

b. Pass_Out: random number of passengers getting off the bus at each station

c. Total_Pass: total number of passengers on board the bus

d. timestamp: the prod time converted from Unix time to date-time

· Assumption: the maximum number of passengers on board is 60

· The Python code we used for the above tasks is uploaded on the site under the name busDataSetEdit.py

· We ran the same code for the remaining 6 output files, so in the end we have 7 files reflecting all the transformations mentioned: businfo_pass_2013_01_01.csv, businfo_pass_2013_01_02.csv, businfo_pass_2013_01_03.csv, businfo_pass_2013_01_04.csv, businfo_pass_2013_01_05.csv, businfo_pass_2013_01_06.csv, businfo_pass_2013_01_07.csv
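
As a rough illustration of how the four columns can be generated while respecting the 60-passenger limit, consider the Python sketch below. It is not the uploaded busDataSetEdit.py: the function name and the exact random rules (passengers leave first, then boarders fill at most the remaining capacity) are assumptions. The division by 1,000,000 follows the project code, which treats prodtime as microseconds.

```python
import random
from datetime import datetime, timezone

MAX_ON_BOARD = 60  # maximum passengers on board, as assumed in the article


def add_passenger_columns(prodtimes_us, rng):
    """Return (pass_in, pass_out, total_pass, day_time) for each stop of a journey.

    prodtimes_us: Unix timestamps in microseconds, one per stop
    rng: a random.Random instance, passed in so runs can be reproduced
    """
    on_board = 0
    result = []
    for prodtime_us in prodtimes_us:
        # Nobody can leave an empty bus, so the first stop always has 0 out.
        pass_out = rng.randint(0, on_board)
        on_board -= pass_out
        # Boarding passengers can fill at most the remaining capacity.
        pass_in = rng.randint(0, MAX_ON_BOARD - on_board)
        on_board += pass_in
        stamp = datetime.fromtimestamp(prodtime_us / 1_000_000, tz=timezone.utc)
        result.append((pass_in, pass_out, on_board, stamp.strftime("%H:%M:%S")))
    return result
```

Passing a seeded generator such as `random.Random(1)` makes the generated file reproducible between runs.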

3- Data Extraction

a. As our data set is large, we decided to upload only one file, representing one day of data, to the Azure Table storage service

i. We uploaded the file using the Azure Storage Explorer application, where we supplied the “storage account name” and “storage account key” (taken from the Azure table properties)

ii. To read the data back from Azure Table storage we used the following Python code:

#! /usr/bin/env python
# Author: Mohamed Moussa
from azure.storage.table import TableService, Entity

# Connect to the storage account (the key is removed for security)
table_service = TableService(account_name='bitsstorage1', account_key='<removed for security>')
tasks = table_service.query_entities('BusJournyData')
# Print every field of every entity in the table
for task in tasks:
    print(task.PartitionKey)
    print(task.RowKey)
    print(task.prodTime)
    print(task.lineID)
    print(task.journeyDate)
    print(task.journeyID)
    print(task.gpsLong)
    print(task.gpsLat)
    print(task.busID)
    print(task.stationID)
    print(task.atStop)
    print(task.passIn)
    print(task.passOut)
    print(task.passOnBoard)
    print(task.Dtime)

o We then redirected the output to a CSV file. However, to apply data analytics to a larger data set for more accurate results, we decided to continue using the local files as described in the steps below

b. Data extraction is done by the following Spark job, whose main objective is to extract the longest journeys by number of stations:

- The code is uploaded under the name busjournyStation.py

o The output of the job is 7 directories, each containing a file with 2 columns: the first holds the station count and the second the journey ID

c. Another Spark job generates the list of buses and the associated journey IDs for each day:

- The code is uploaded under the name buscount.py
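
The two aggregations can be illustrated in plain Python, without Spark. This is a sketch under stated assumptions rather than the uploaded busjournyStation.py or buscount.py: the function names and the (journey_id, bus_id, station_id) row format are hypothetical, and counting distinct stations per journey is a guess at what "number of stations" means in the jobs.

```python
from collections import defaultdict


def stations_per_journey(rows):
    """Return (station_count, journey_id) pairs, longest journeys first.

    rows: iterable of (journey_id, bus_id, station_id) tuples.
    """
    stations = defaultdict(set)
    for journey_id, _bus_id, station_id in rows:
        stations[journey_id].add(station_id)
    counts = [(len(ids), journey_id) for journey_id, ids in stations.items()]
    return sorted(counts, reverse=True)


def buses_per_journey(rows):
    """Return the distinct (bus_id, journey_id) pairs, sorted."""
    return sorted({(bus_id, journey_id) for journey_id, bus_id, _ in rows})
```

The column order of both results matches the two-column files described above: station count then journey ID, and bus ID then journey ID.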

- Loading the Data

a. The output of the last step was 7 files containing the station count for each journey. The objective of this step is to load the 7 files into 7 Hive tables. The queries are as follows:

create external table JournyStation_2013_01_01 (StationCount int, JourneyId int) row format delimited fields terminated by ',';
load data local inpath '/home/cloudera/DublinBuses010113-310113/DataSet/sparkJob/businfo_pass_2013_01_01.csv/part-00000' into table JournyStation_2013_01_01;

· We ran the same Hive queries for the remaining 6 files

create external table JournyBuses_2013_01_01 (BusID int, JourneyId int) row format delimited fields terminated by ',';
load data local inpath '/home/cloudera/DublinBuses010113-310113/DataSet/sparkJob/Buses/businfo_pass_2013_01_01.csv/part-00000' into table JournyBuses_2013_01_01;

· We ran the same Hive queries for the remaining 6 files

4- Reporting

a. Scope is to report on the following aspects:

i. Total number of Journeys per day

ii. Longest Journeys (has maximum number of stations)

iii. Total number of buses per day

b. Reporting is done using Microsoft Power BI plus the ODBC driver for the Hive tables.

c. Steps to set up Power BI and ODBC to connect to the Hive tables

i. Download the Cloudera ODBC driver from http://www.cloudera.com/documentation/other/connectors/hive-odbc/2-5-12.html

ii. In the Windows ODBC manager, select the Cloudera ODBC driver and type in the IP address and the username ‘cloudera’, then test the connectivity

iii. In Power BI, select the ODBC option / Cloudera Hive ODBC

d. After importing all the data files into Power BI (through the ODBC connection), the following changes need to be made:

i. For all 7 files related to journeys, the column headers need to be changed to “Stations Count” and “Journey ID”

ii. For all 7 files related to buses, the column headers need to be changed to “Buses ID” and “Journey ID”

e. The screenshot below shows the report inside Microsoft Power BI, which presents the following:

i. Total number of Journeys per day

ii. Longest journeys (those with the maximum number of stations)

iii. Total number of buses per day

· Further enhancements to the report will include the total number of passengers per day and the journeys with the highest passenger counts.
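The three report figures can also be cross-checked outside Power BI. The sketch below computes them for a single day from rows shaped like the two Hive tables. The sample rows and the `daily_report` helper are hypothetical and are not part of the project code.

```python
# Hypothetical rows as they might come back from the Hive tables for one day.
journey_rows = [(42, 101), (17, 102), (42, 103)]       # (StationCount, JourneyId)
bus_rows = [(33001, 101), (33001, 102), (33002, 103)]  # (BusID, JourneyId)


def daily_report(journey_rows, bus_rows):
    """Compute the three report figures for a single day."""
    max_stations = max(count for count, _ in journey_rows)
    return {
        # i. total number of journeys
        "total_journeys": len(set(j for _, j in journey_rows)),
        # ii. journeys that reach the maximum station count
        "longest_journeys": sorted(j for c, j in journey_rows if c == max_stations),
        # iii. total number of distinct buses
        "total_buses": len(set(b for b, _ in bus_rows)),
    }


report = daily_report(journey_rows, bus_rows)
print(report)
```

Running this once per day's tables gives the per-day values that the Power BI report charts.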

Opportunities and Future Scaling

- Using GPS for accurate bus location and for improved route performance monitoring

- Improving the machine learning model to predict dwell times with a lower error percentage

- Developing the mobile app for Windows Phone and iOS platforms

Code

#! /usr/bin/env python
# Author: Mohamed Moussa

import csv
import random
import datetime

# define the input file and output file
businfoInput = open('/home/cloudera/DublinBuses010113-310113/DataSet/output/businfo_2013_01_07.csv', 'r')
businfoOutput = open('/home/cloudera/DublinBuses010113-310113/DataSet/output/businfo_pass_2013_01_07.csv', 'w')
businfoReader = csv.reader(businfoInput)
businfoWriter = csv.writer(businfoOutput, lineterminator='\n')

# define maximum number of passengers on board
maxTotal = 60
journyId = 0
# skip the first row of the input file
next(businfoReader)

# loop over the file to add four new columns: Passengers In, Passengers Out, Number of onboard passengers, Daytime
for row in businfoReader:
    # PartitionKey & RowKey columns are required in order to upload the file to a Microsoft Azure table
    PartitionKey = row[3]
    RowKey = row[7]
    # convert the Unix timestamp (divided down to seconds) into a time of day
    Utime = int(row[0]) // 1000000
    Dtime = datetime.datetime.utcfromtimestamp(Utime).strftime('%H:%M:%S')
    if row[3] != journyId:
        # first station of a new journey: passengers only board
        journyId = row[3]
        passIn = random.randint(0, maxTotal)
        passOut = 0
        onBoard = passIn
    else:
        # later stations of the same journey: some passengers leave, then
        # new ones board without exceeding the maximum capacity
        passOut = random.randint(0, onBoard)
        passIn = random.randint(0, maxTotal - (onBoard - passOut))
        onBoard = (onBoard + passIn) - passOut
    # write the enriched row to the output file
    businfoWriter.writerow([PartitionKey, RowKey] + row + [passIn, passOut, onBoard, Dtime])

# close the opened files
businfoInput.close()
businfoOutput.close()

# Script to extract list of longest & shortest journeys by number of stations
# Author: Mohamed Moussa

from pyspark import SparkContext, SparkConf
import csv
import glob
import os

conf = SparkConf().setAppName('BusinfoJournyByStation')
sc = SparkContext(conf=conf)


# Function to parse a line into a tuple of journey ID & bus station
def makeTrackFile(line):
    l = line.split(",")
    #lDate=l[4]
    lJournyId = l[5]
    lStation = l[9]
    return (lJournyId, lStation)


for filename in glob.iglob('/home/cloudera/DublinBuses010113-310113/DataSet/output/businfo_pass_2013_01*'):  # reads files from local FS
    # the file name is reused as the name of the output directory
    fullname = os.path.basename(filename)
    # to read files from local file system instead of Hadoop
    localFileName = "file://" + filename

    trackFile = sc.textFile(localFileName)

    # parse every line into a (journey ID, station) tuple; no records are filtered out here
    fileLines = trackFile.map(makeTrackFile)
    # select distinct (journey ID, station) pairs, reduceByKey to get the total number of
    # stations per journey, then sort by station count (descending) and format without brackets
    totalByStation = (fileLines.distinct()
                      .map(lambda x: (x[0], 1))
                      .reduceByKey(lambda x, y: x + y)
                      .map(lambda x: (x[1], x[0]))
                      .sortByKey(ascending=False)
                      .map(lambda kv: "{0}, {1}".format(kv[0], kv[1])))

    # Save output to file - local filesystem
    totalByStation.saveAsTextFile('file:/home/cloudera/DublinBuses010113-310113/DataSet/sparkJob/' + fullname)

# Script to extract list of buses and journeys
# Author: Mohamed Moussa

from pyspark import SparkContext, SparkConf
import csv
import glob
import os

conf = SparkConf().setAppName('BusCount')
sc = SparkContext(conf=conf)


# Function to parse a line into a tuple of bus ID & journey ID
def makeTrackFile(line):
    l = line.split(",")
    lJourneyID = l[5]
    lbusId = l[8]
    return (lbusId, lJourneyID)


for filename in glob.iglob('/home/cloudera/DublinBuses010113-310113/DataSet/output/businfo_pass_2013_01_*'):  # reads files from local FS
    # the file name is reused as the name of the output directory
    fullname = os.path.basename(filename)
    # to read files from local file system instead of Hadoop
    localFileName = "file://" + filename

    trackFile = sc.textFile(localFileName)

    fileLines = trackFile.map(makeTrackFile)
    # select distinct (bus ID, journey ID) pairs and format without brackets
    totalByBuses = fileLines.distinct().map(lambda kv: "{0}, {1}".format(kv[0], kv[1]))

    # Save output to file - local filesystem
    totalByBuses.saveAsTextFile('file:/home/cloudera/DublinBuses010113-310113/DataSet/sparkJob/Buses/' + fullname)
    #os.system("for i in `ls /home/cloudera/DublinBuses010113-310113/DataSet/sparkJob/*.csv/p*`; do cat $i >> /home/cloudera/DublinBuses010113-310113/Output/JournyStationFull.csv ; done")